In standard
linear regression models, one observes data \{y_i,x_{ij}\}_{i=1, \dots, n,j=2, \dots, k} on
n statistical units with
k − 1 predictor values and one response value each. The response values are placed in a vector,\mathbf{y} \equiv \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, and the predictor values are placed in the
design matrix,\mathbf{X} \equiv \begin{pmatrix} 1 & x_{12} & x_{13} & \cdots & x_{1k} \\ 1 & x_{22} & x_{23} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n2} & x_{n3} & \cdots & x_{nk} \end{pmatrix} , where each row is a vector of the k predictor variables (including a constant) for the ith data point. The model assumes that the
conditional mean of \mathbf{y} given \mathbf{X} is a linear function of \mathbf{X} and that the conditional
variance of the error term given \mathbf{X} is a known
non-singular covariance matrix, \mathbf{\Omega}. That is, \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \operatorname{E}[\boldsymbol\varepsilon\mid\mathbf{X}]=0, \quad \operatorname{Cov}[\boldsymbol\varepsilon\mid\mathbf{X}]= \boldsymbol{\Omega}, where \boldsymbol\beta \in \mathbb{R}^k is a vector of unknown constants, called "regression coefficients", which are estimated from the data. If \mathbf{b} is a candidate estimate for \boldsymbol{\beta}, then the
residual vector for \mathbf{b} is \mathbf{y}- \mathbf{X} \mathbf{b}. The generalized least squares method estimates \boldsymbol{\beta} by minimizing the squared
Mahalanobis length of this residual vector: \begin{align} {\hat{\boldsymbol{\beta}}} & = \underset{\mathbf{b}}\operatorname{arg min}\,(\mathbf{y}- \mathbf{X} \mathbf{b})^{ \mathrm{T}}\mathbf{\Omega}^{-1}(\mathbf{y}- \mathbf{X} \mathbf{b}) \\ & = \underset{ \mathbf{b}}\operatorname{arg min}\,\mathbf{y}^{\mathrm{T}}\,\mathbf{\Omega}^{-1}\mathbf{y} + (\mathbf{X} \mathbf{b})^{\mathrm{T}} \mathbf{\Omega}^{-1} \mathbf{X} \mathbf{b} - \mathbf{y}^{\mathrm{T}}\mathbf{\Omega}^{-1}\mathbf{X} \mathbf{b}-(\mathbf{X} \mathbf{b})^{\mathrm{T}}\mathbf{\Omega}^{-1}\mathbf{y}\, , \end{align} which is equivalent to {\hat{\boldsymbol\beta}} = \underset{ \mathbf{b}}\operatorname{arg min}\,\mathbf{y}^{\mathrm{T}}\,\mathbf{\Omega}^{-1}\mathbf{y} + \mathbf{b}^{\mathrm{T}} \mathbf{X}^{\mathrm{T}} \mathbf{\Omega}^{-1} \mathbf{X} \mathbf{b} -2 \mathbf{b}^{\mathrm{T}} \mathbf{X} ^{\mathrm{T}}\mathbf{\Omega}^{-1}\mathbf{y}, which is a
quadratic programming problem. The stationary point of the objective function occurs when 2 \mathbf{X}^{\mathrm{T}} \mathbf{\Omega}^{-1} \mathbf{X} { \mathbf{b}} -2 \mathbf{X} ^{\mathrm{T}}\mathbf{\Omega}^{-1}\mathbf{y} = 0 , so the estimator is {\hat{\boldsymbol \beta}} = \left( \mathbf{X}^{\mathrm{T}} \mathbf{\Omega}^{-1} \mathbf{X} \right)^{-1} \mathbf{X}^{ \mathrm{T}}\mathbf{\Omega}^{-1}\mathbf{y}. The quantity \mathbf{\Omega}^{-1} is known as the
precision matrix, a generalization of the diagonal weight matrix used in weighted least squares.
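For illustration, the following minimal sketch (assuming NumPy; the simulated design, coefficients, and AR(1)-style \mathbf{\Omega} are hypothetical choices, not taken from the text above) evaluates the closed-form estimator \left( \mathbf{X}^{\mathrm{T}} \mathbf{\Omega}^{-1} \mathbf{X} \right)^{-1} \mathbf{X}^{\mathrm{T}}\mathbf{\Omega}^{-1}\mathbf{y} on simulated data:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: n = 100 observations, an intercept plus two predictors (k = 3).
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])   # illustrative coefficients

# A known non-singular error covariance matrix Omega; here an AR(1)-style
# structure with Omega[i, j] = 0.6 ** |i - j|, chosen purely for illustration.
idx = np.arange(n)
Omega = 0.6 ** np.abs(np.subtract.outer(idx, idx))

# Correlated errors and the response y = X beta + eps.
eps = rng.multivariate_normal(np.zeros(n), Omega)
y = X @ beta_true + eps

# GLS estimator: (X' Omega^{-1} X)^{-1} X' Omega^{-1} y,
# computed via linear solves with Omega instead of forming Omega^{-1} explicitly.
Omega_inv_X = np.linalg.solve(Omega, X)
Omega_inv_y = np.linalg.solve(Omega, y)
beta_gls = np.linalg.solve(X.T @ Omega_inv_X, X.T @ Omega_inv_y)
print(beta_gls)   # should be close to beta_true
</syntaxhighlight>

Solving linear systems with \mathbf{\Omega}, rather than inverting it explicitly, is a common numerical choice and yields the same estimator.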
== Properties ==

The GLS estimator is unbiased, consistent, efficient, and asymptotically normal, with \operatorname{E}[\hat{\boldsymbol\beta}\mid\mathbf{X}] = \boldsymbol\beta, \quad\text{and}\quad \operatorname{Cov}[\hat{\boldsymbol\beta}\mid\mathbf{X}] = (\mathbf{X}^{\mathrm{T}}\boldsymbol\Omega^{-1}\mathbf{X})^{-1}.
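These properties can be checked informally by simulation: holding \mathbf{X} and \mathbf{\Omega} fixed and redrawing the errors many times, the empirical mean and covariance of \hat{\boldsymbol\beta} should approximate \boldsymbol\beta and (\mathbf{X}^{\mathrm{T}}\boldsymbol\Omega^{-1}\mathbf{X})^{-1}. The sketch below (assuming NumPy, with hypothetical values throughout) is one such sanity check, not a proof:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fixed design, coefficients, and error covariance (kept small for speed).
n = 30
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
beta = np.array([1.0, 2.0, -0.5])
idx = np.arange(n)
Omega = 0.6 ** np.abs(np.subtract.outer(idx, idx))

def gls(X, y, Omega):
    # Closed-form GLS estimate (X' Omega^{-1} X)^{-1} X' Omega^{-1} y.
    OiX = np.linalg.solve(Omega, X)
    Oiy = np.linalg.solve(Omega, y)
    return np.linalg.solve(X.T @ OiX, X.T @ Oiy)

# Monte Carlo over fresh error draws, with X and Omega held fixed.
estimates = np.array([
    gls(X, X @ beta + rng.multivariate_normal(np.zeros(n), Omega), Omega)
    for _ in range(5000)
])

print(estimates.mean(axis=0))                          # approximately beta (unbiasedness)
print(np.cov(estimates, rowvar=False))                 # approximately the matrix below
print(np.linalg.inv(X.T @ np.linalg.solve(Omega, X)))  # (X' Omega^{-1} X)^{-1}
</syntaxhighlight>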
GLS is equivalent to applying ordinary least squares (OLS) to a linearly transformed version of the data. This can be seen by factoring \mathbf{\Omega} = \mathbf{C} \mathbf{C}^{\mathrm{T}} using a method such as Cholesky decomposition. Left-multiplying both sides of \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} by \mathbf{C}^{-1} yields an equivalent linear model: \mathbf{y}^{*} = \mathbf{X}^{*} \boldsymbol{\beta} + \boldsymbol{\varepsilon}^{*}, \quad \text{where} \quad \mathbf{y}^{*} = \mathbf{C}^{-1} \mathbf{y}, \quad \mathbf{X}^{*} = \mathbf{C}^{-1} \mathbf{X}, \quad \boldsymbol{\varepsilon}^{*} = \mathbf{C}^{-1} \boldsymbol{\varepsilon}. In this model, \operatorname{Var}[{\boldsymbol{\varepsilon}}^{*}\mid\mathbf{X}]= \mathbf{C}^{-1} \mathbf{\Omega} \left(\mathbf{C}^{-1} \right)^{\mathrm{T}} = \mathbf{I}, where \mathbf{I} is the identity matrix.
Then, \boldsymbol{\beta} can be efficiently estimated by applying OLS to the transformed data, which requires minimizing the objective \left(\mathbf{y}^{*} - \mathbf{X}^{*} \mathbf{b} \right)^{\mathrm{T}} (\mathbf{y}^{*} - \mathbf{X}^{*} \mathbf{b}) = (\mathbf{y}- \mathbf{X} \mathbf{b})^{\mathrm{T}}\,\mathbf{\Omega}^{-1}(\mathbf{y}- \mathbf{X} \mathbf{b}). This transformation standardizes the scale of the errors and de-correlates them, as the sketch below illustrates.
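A minimal numerical sketch of this equivalence (assuming NumPy; the data are hypothetical) whitens the data with a Cholesky factor \mathbf{C} of \mathbf{\Omega} and checks that OLS on the transformed data reproduces the direct GLS estimate:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data with correlated errors, as in the model above.
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
idx = np.arange(n)
Omega = 0.6 ** np.abs(np.subtract.outer(idx, idx))
y = X @ np.array([1.0, 2.0, -0.5]) + rng.multivariate_normal(np.zeros(n), Omega)

# Factor Omega = C C^T with a Cholesky decomposition and whiten the data:
# y* = C^{-1} y,  X* = C^{-1} X.
C = np.linalg.cholesky(Omega)        # lower-triangular C with Omega = C @ C.T
y_star = np.linalg.solve(C, y)
X_star = np.linalg.solve(C, X)

# OLS on the transformed (whitened) data ...
beta_ols_star, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)

# ... coincides with the direct GLS formula (X' Omega^{-1} X)^{-1} X' Omega^{-1} y.
beta_gls = np.linalg.solve(X.T @ np.linalg.solve(Omega, X),
                           X.T @ np.linalg.solve(Omega, y))
print(np.allclose(beta_ols_star, beta_gls))  # expected: True
</syntaxhighlight>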
Because OLS is now applied to data with homoscedastic, uncorrelated errors, the Gauss–Markov theorem applies, so the GLS estimate is the best linear unbiased estimator for \boldsymbol{\beta}.

== Weighted least squares ==