5.10 Standard error estimation
Recall that the log likelihood of the item response matrix \(\mathbf{Y}\) for all \(N\) students can be written as \[\begin{equation} \ell(\mathbf{Y}) = \log \prod\limits_{i = 1}^N L ({\mathbf{Y}_i}) = \sum\limits_{i = 1}^N \log L({\mathbf{Y}_i}) \end{equation}\]
Our goal is to find the \(D\times 1\) parameter vector \(\tau=(\mathbf{g},\mathbf{s},\mathbf{p})\) that maximizes \(\ell(\mathbf{Y})\), which requires \[ \frac{\partial \ell(\mathbf{Y})}{\partial \tau}=\mathbf{0} \] Equivalently, writing \(\ell(\mathbf{Y}_i)=\log L(\mathbf{Y}_i)\), for each element \(\tau_d\), \(d=1,\ldots,D\), \[ \sum\limits_{i = 1}^N\frac{\partial \ell(\mathbf{Y}_i)}{\partial \tau_d}=0 \]
5.10.1 Score function
The gradient of the log-likelihood function with respect to the parameter vector, or \[\mathbf{u}(\tau)=\frac{\partial \ell(\mathbf{y})}{\partial \tau}\] is called the score function. Here, we use \(\mathbf{y}\) to denote a vector of item responses.
Some takeaways:
The score function \(\mathbf{u}(\tau)\) is a vector of length \(D\).
The score function is a function of the parameters, given a particular response vector.
Under certain regularity conditions, the expected value of the score, evaluated at the true parameter value, is zero.
\[ E\bigg[\mathbf{u}(\tau)\bigg]=\mathbf{0} \] Because the expectation in the equation above is taken over the set \(\Omega\) of all possible response vectors \(\mathbf{y}\), it can be written as \[ E\bigg[\mathbf{u}(\tau)\bigg]=\sum_{\mathbf{y}\in \Omega}\frac{\partial \ell(\mathbf{y})}{\partial \tau}p(\mathbf{y}|\tau) \]
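The zero-mean property can be checked numerically on a small example. The sketch below assumes a hypothetical one-attribute, two-class model with three dichotomous items, in which masters answer item \(j\) correctly with probability \(1-s_j\), non-masters with probability \(g_j\), and \(p\) is the proportion of masters. The model, the parameter values, and the function names `prob` and `score` are illustrative assumptions, and the score is approximated by central differences rather than derived analytically.

```python
import itertools
import numpy as np

n_items = 3  # hypothetical toy test length

def prob(y, tau):
    """P(y | tau) for the assumed two-class model, tau = (g, s, p)."""
    g, s, p = tau[:n_items], tau[n_items:2 * n_items], tau[2 * n_items]
    y = np.asarray(y)
    master = np.prod((1 - s) ** y * s ** (1 - y))     # P(y | master)
    nonmaster = np.prod(g ** y * (1 - g) ** (1 - y))  # P(y | non-master)
    return p * master + (1 - p) * nonmaster

def score(y, tau, h=1e-6):
    """Central-difference approximation to d log P(y | tau) / d tau."""
    u = np.zeros(len(tau))
    for d in range(len(tau)):
        step = np.zeros(len(tau))
        step[d] = h
        u[d] = (np.log(prob(y, tau + step)) - np.log(prob(y, tau - step))) / (2 * h)
    return u

# Illustrative parameter values: g_1..g_3, s_1..s_3, p
tau = np.array([0.2, 0.1, 0.3, 0.1, 0.2, 0.15, 0.6])

# Omega: all 2^3 = 8 possible response vectors
omega = list(itertools.product([0, 1], repeat=n_items))

# E[u(tau)] = sum over Omega of u(tau) * p(y | tau)
expected_score = sum(score(y, tau) * prob(y, tau) for y in omega)
print(np.round(expected_score, 6))
```

Every element of `expected_score` should be zero up to finite-difference error, whereas the score of any single response vector is generally nonzero.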
5.10.2 Expected Fisher information
The expected Fisher information of a response vector \(\mathbf{y}\) about the parameters \(\tau\), denoted by \(I_\mathbf{y}(\tau)\), is the variance-covariance matrix of the score function.
\(I_\mathbf{y}(\tau)\) is a \(D\times D\) matrix, and can be calculated as \[ I_\mathbf{y}(\tau)=E_{\mathbf{y}}\Bigg[\mathbf{u}(\tau)\bigg(\mathbf{u}(\tau)\bigg)^T\Bigg] \] Again, because the expectation in the equation is taken over all possible response vectors \(\mathbf{y}\), the Fisher information can be written as
\[ I_\mathbf{y}(\tau)=\sum_{\mathbf{y}\in \Omega}\Bigg[\mathbf{u}(\tau)\bigg(\mathbf{u}(\tau)\bigg)^T\Bigg]p(\mathbf{y}|\tau) \] where \(\Omega\) is the set of all possible response vectors.
Under mild regularity conditions, Fisher information can also be defined using second derivatives. In particular, the \((a,b)\)th element of the Fisher information can be written as
\[ I_{\mathbf{y}}(\tau)_{a,b}=-E_{\mathbf{y}}\Bigg[\frac{\partial^2 \ell(\mathbf{y})}{\partial \tau_a\partial \tau_b}\Bigg] \]
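Expected Fisher information is an expectation over all possible response vectors, so for a given \(\tau\) it depends only on the model and not on the observed responses.
Calculation of the expected Fisher information involves finding the score function for all response vectors, which may not be feasible if the test length is large. For example, a test of \(J\) dichotomous items has \(2^J\) possible response vectors.
The following term is also the \((a,b)\)th element of the Hessian matrix of the log likelihood: \[ \frac{\partial^2 \ell(\mathbf{y})}{\partial \tau_a\partial \tau_b} \]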
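The equivalence of the two definitions can be illustrated by brute-force enumeration when the test is short. The sketch below assumes a hypothetical one-attribute, two-class model with three dichotomous items and parameters \(\tau=(\mathbf{g},\mathbf{s},p)\); the model, the parameter values, and the helper names `log_prob`, `score`, and `hessian` are illustrative assumptions. Both derivatives are approximated by central differences, so the two matrices agree only up to numerical error.

```python
import itertools
import numpy as np

n_items = 3  # hypothetical toy test length

def log_prob(y, tau):
    """log P(y | tau) for the assumed two-class model, tau = (g, s, p)."""
    g, s, p = tau[:n_items], tau[n_items:2 * n_items], tau[2 * n_items]
    y = np.asarray(y)
    master = np.prod((1 - s) ** y * s ** (1 - y))
    nonmaster = np.prod(g ** y * (1 - g) ** (1 - y))
    return np.log(p * master + (1 - p) * nonmaster)

def score(y, tau, h=1e-6):
    """Central-difference gradient of log P(y | tau)."""
    e = np.eye(len(tau))
    return np.array([
        (log_prob(y, tau + h * e[d]) - log_prob(y, tau - h * e[d])) / (2 * h)
        for d in range(len(tau))
    ])

def hessian(y, tau, h=1e-4):
    """Central-difference Hessian of log P(y | tau)."""
    D = len(tau)
    e = np.eye(D)
    H = np.zeros((D, D))
    for a in range(D):
        for b in range(D):
            H[a, b] = (
                log_prob(y, tau + h * e[a] + h * e[b])
                - log_prob(y, tau + h * e[a] - h * e[b])
                - log_prob(y, tau - h * e[a] + h * e[b])
                + log_prob(y, tau - h * e[a] - h * e[b])
            ) / (4 * h * h)
    return H

tau = np.array([0.2, 0.25, 0.3, 0.2, 0.25, 0.3, 0.6])  # illustrative values
omega = list(itertools.product([0, 1], repeat=n_items))  # all 8 response vectors

# Score-based form: sum over Omega of u u^T p(y | tau)
I_outer = sum(np.exp(log_prob(y, tau)) * np.outer(score(y, tau), score(y, tau))
              for y in omega)

# Second-derivative form: minus the expected Hessian
I_hess = -sum(np.exp(log_prob(y, tau)) * hessian(y, tau) for y in omega)

print(np.max(np.abs(I_outer - I_hess)))  # small, limited by finite differences
```

The enumeration over `omega` is what becomes infeasible for long tests, since the number of response vectors doubles with each added item.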
5.10.3 Observed Fisher information
The observed Fisher information is defined using second derivatives evaluated on the observed response vectors. In particular, the \((a,b)\)th element of the observed Fisher information can be written as
\[ J_{\mathbf{y}}(\tau)_{a,b}=-\frac{1}{N}\sum_{i=1}^N\Bigg[\frac{\partial^2 \ell(\mathbf{y}_i)}{\partial \tau_a\partial \tau_b}\Bigg] \] The observed Fisher information can also be approximated using the so-called outer product of gradients method, which mirrors the score-based definition of the expected information with the expectation replaced by the sample average: \[ J_{\mathbf{y}}(\tau)_{a,b}=\frac{1}{N}\sum_{i=1}^N\Bigg[\frac{\partial \ell(\mathbf{y}_i)}{\partial \tau_a}\frac{\partial \ell(\mathbf{y}_i)}{\partial \tau_b}\Bigg] \]
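Both versions of the observed information can be sketched on simulated data. The example below assumes a hypothetical one-attribute, two-class model with three dichotomous items, simulates \(N\) response vectors from it, and averages the Hessian and the outer product of the scores over the sample. The model, the parameter values, the sample size, and the helper names are illustrative assumptions, and for simplicity the derivatives are taken by central differences at the generating parameter values rather than at an ML estimate.

```python
import numpy as np

n_items = 3  # hypothetical toy test length
rng = np.random.default_rng(0)

def log_prob(y, tau):
    """log P(y | tau) for the assumed two-class model, tau = (g, s, p)."""
    g, s, p = tau[:n_items], tau[n_items:2 * n_items], tau[2 * n_items]
    master = np.prod((1 - s) ** y * s ** (1 - y))
    nonmaster = np.prod(g ** y * (1 - g) ** (1 - y))
    return np.log(p * master + (1 - p) * nonmaster)

def score(y, tau, h=1e-6):
    """Central-difference gradient of log P(y | tau)."""
    e = np.eye(len(tau))
    return np.array([
        (log_prob(y, tau + h * e[d]) - log_prob(y, tau - h * e[d])) / (2 * h)
        for d in range(len(tau))
    ])

def hessian(y, tau, h=1e-4):
    """Central-difference Hessian of log P(y | tau)."""
    D = len(tau)
    e = np.eye(D)
    H = np.zeros((D, D))
    for a in range(D):
        for b in range(D):
            H[a, b] = (
                log_prob(y, tau + h * e[a] + h * e[b])
                - log_prob(y, tau + h * e[a] - h * e[b])
                - log_prob(y, tau - h * e[a] + h * e[b])
                + log_prob(y, tau - h * e[a] - h * e[b])
            ) / (4 * h * h)
    return H

# Simulate N response vectors from the assumed model
tau = np.array([0.2, 0.25, 0.3, 0.2, 0.25, 0.3, 0.6])  # illustrative values
g, s, p = tau[:n_items], tau[n_items:2 * n_items], tau[2 * n_items]
N = 2000
is_master = rng.random(N) < p
p_correct = np.where(is_master[:, None], 1 - s, g)
Y = (rng.random((N, n_items)) < p_correct).astype(int)

# Students with the same response pattern contribute identical terms,
# so sum over the unique patterns weighted by their counts.
patterns, counts = np.unique(Y, axis=0, return_counts=True)

J_hess = -sum(n * hessian(y, tau) for y, n in zip(patterns, counts)) / N
J_opg = sum(n * np.outer(score(y, tau), score(y, tau))
            for y, n in zip(patterns, counts)) / N
```

Unlike the expected information, both matrices here depend on which response patterns were actually observed, and only the observed patterns need to be evaluated.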
5.10.4 Variance-covariance matrix of model parameters
One of the most important properties of maximum likelihood estimation is that, under regularity conditions, the ML estimator \(\hat{\tau}\) of \(\tau\) satisfies, for large \(N\), \[ \hat{\tau}\overset{a}{\sim} \mathcal{N}\bigg(\tau,{FI(\tau)}^{-1}\bigg) \] That is, in large samples the estimator is approximately normally distributed around the true parameter, a property called asymptotic normality.
About \(FI(\tau)\):
\(FI(\tau)\) should be evaluated at the true parameter \(\tau\). However, we do not know \(\tau\) in practice, so we evaluate \(FI(\tau)\) at the ML estimate \(\hat{\tau}\) instead.
\(FI(\hat{\tau})\) can be estimated using either the expected or the observed information discussed above.
If we have \(N\) response vectors in the sample, we can use either \[ FI(\hat{\tau})=N\times I_{\mathbf{y}}(\hat{\tau}) \] or \[ FI(\hat{\tau})=N\times J_{\mathbf{y}}(\hat{\tau}) \]
\(I_{\mathbf{y}}(\hat{\tau})\) or \(J_{\mathbf{y}}(\hat{\tau})\) is called the unit Fisher information, because the expectation or average involved means that each measures the average information in a single response vector. With more response vectors in the sample, we typically have more information.
Standard errors of model parameters can be calculated as the square roots of the diagonal elements of the inverse of \(FI(\hat{\tau})\).
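The last step can be sketched in a few lines. The example below assumes a hypothetical \(2\times 2\) unit information matrix with made-up values, standing in for \(I_{\mathbf{y}}(\hat{\tau})\) or \(J_{\mathbf{y}}(\hat{\tau})\); the function name `standard_errors` is likewise an illustrative choice.

```python
import numpy as np

def standard_errors(unit_info, n):
    """Standard errors from a unit information matrix and sample size n."""
    total_info = n * np.asarray(unit_info, dtype=float)  # FI = N x unit information
    cov = np.linalg.inv(total_info)                      # variance-covariance matrix
    return np.sqrt(np.diag(cov))                         # square roots of the diagonal

# Hypothetical unit information for two parameters (values are made up)
unit_info = np.array([[4.0, 1.0],
                      [1.0, 2.0]])

se_100 = standard_errors(unit_info, 100)
se_400 = standard_errors(unit_info, 400)
print(se_100, se_400)
```

Because the total information scales with \(N\), quadrupling the sample size halves every standard error, which is why `se_400` equals `se_100 / 2` here.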