==Parameter estimation in linear models==
In high dimensions, a data set consists of a response vector Y \in \mathbb{R}^n and a design matrix X \in \mathbb{R}^{n \times p} with p \gg n. Our goal is to estimate the unknown vector \beta = (\beta_1, \dots, \beta_p) \in \mathbb{R}^p of regression coefficients, where \beta is often assumed to be sparse, in the sense that the cardinality of the set S := \{j : \beta_j \neq 0\} is small by comparison with p.

The most basic statistical model for the relationship between a covariate vector x \in \mathbb{R}^p and a response variable y \in \mathbb{R} is the linear model
: y = x^\top \beta + \epsilon,
where \beta \in \mathbb{R}^p is an unknown parameter vector, and \epsilon is random noise with mean zero and variance \sigma^2. Given independent responses Y_1, \ldots, Y_n, with corresponding covariates x_1, \ldots, x_n, from this model, we can form the response vector Y = (Y_1, \ldots, Y_n)^\top and design matrix X = (x_1, \ldots, x_n)^\top \in \mathbb{R}^{n \times p}.
When n \geq p and the design matrix has full column rank (i.e. its columns are linearly independent), the ordinary least squares estimator of \beta is
: \hat{\beta} := (X^\top X)^{-1} X^\top Y.
When the errors \epsilon_1, \ldots, \epsilon_n are independent N(0, \sigma^2) random variables, it is known that \hat{\beta} \sim N_p\bigl(\beta, \sigma^2 (X^\top X)^{-1}\bigr). Thus, \hat{\beta} is an unbiased estimator of \beta, and the Gauss–Markov theorem tells us that it is the Best Linear Unbiased Estimator.
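As a concrete illustration, the following minimal numpy sketch (the dimensions n and p, the coefficient vector and the noise level \sigma are arbitrary illustrative choices, not from any source) simulates data from the linear model and computes the ordinary least squares estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 5, 0.5                      # illustrative choices
beta = np.array([1.0, -2.0, 0.0, 3.0, 0.5])    # true coefficient vector

X = rng.standard_normal((n, p))                # design; full column rank a.s.
Y = X @ beta + sigma * rng.standard_normal(n)  # Y_i = x_i^T beta + eps_i

# Ordinary least squares: beta_hat = (X^T X)^{-1} X^T Y.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.round(beta_hat, 2))                   # close to beta since n >> p
```

Using a least-squares solver rather than forming (X^\top X)^{-1} explicitly is numerically preferable, but it computes the same estimator whenever X has full column rank.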
However, overfitting is a concern when p is of comparable magnitude to n: the matrix X^\top X in the definition of \hat{\beta} may become ill-conditioned, with a small minimum eigenvalue. In such circumstances \mathbb{E}(\|\hat{\beta} - \beta\|^2) = \sigma^2 \mathrm{tr}\bigl((X^\top X)^{-1}\bigr) will be large, since the trace of (X^\top X)^{-1} is the sum of its eigenvalues, i.e. the sum of the reciprocals of the eigenvalues of X^\top X. Even worse, when p > n, the matrix X^\top X is singular, so the ordinary least squares estimator is not even well defined.
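The growth of this risk as p approaches n can be seen numerically; the sketch below (assuming an illustrative standard Gaussian design with n = 100 and \sigma = 1) evaluates \sigma^2 \mathrm{tr}\bigl((X^\top X)^{-1}\bigr) for increasing p:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 100, 1.0

# E||beta_hat - beta||^2 = sigma^2 * tr((X^T X)^{-1}); small eigenvalues of
# X^T X translate into large reciprocals inside the trace.
for p in (10, 50, 90, 99):
    X = rng.standard_normal((n, p))
    risk = sigma**2 * np.trace(np.linalg.inv(X.T @ X))
    print(f"p = {p:2d}: expected squared error = {risk:8.2f}")
```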
It is important to note that the deterioration in estimation performance in high dimensions observed in the previous paragraph is not limited to the ordinary least squares estimator. In fact, statistical inference in high dimensions is intrinsically hard, a phenomenon known as the curse of dimensionality, and it can be shown that no estimator can do better in a worst-case sense without additional information (see Example 15.10). Nevertheless, the situation in high-dimensional statistics may not be hopeless when the data possess some low-dimensional structure. One common assumption for high-dimensional linear regression is that the vector of regression coefficients is sparse, in the sense that most coordinates of \beta are zero. Many statistical procedures, including the Lasso, have been proposed to fit high-dimensional linear models under such sparsity assumptions.
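As a sketch of how such a procedure is used in practice (here via scikit-learn's Lasso implementation; the dimensions, sparsity level and penalty parameter alpha are illustrative choices, and alpha would normally be chosen by cross-validation):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p, s = 50, 200, 5                  # p >> n; only s coefficients are nonzero
beta = np.zeros(p)
beta[:s] = 3.0                        # sparse truth: S = {0, ..., 4}

X = rng.standard_normal((n, p))
Y = X @ beta + 0.5 * rng.standard_normal(n)

# The L1 penalty sets most estimated coefficients exactly to zero,
# which is what makes estimation feasible despite p > n.
model = Lasso(alpha=0.3, max_iter=10_000).fit(X, Y)
print("estimated support:", np.flatnonzero(model.coef_))
```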
==Covariance matrix estimation==
Another example of a high-dimensional statistical phenomenon can be found in the problem of covariance matrix estimation. Suppose that we observe X_1, \ldots, X_n \in \mathbb{R}^p, which are i.i.d. draws from some zero-mean distribution with an unknown covariance matrix \Sigma \in \mathbb{R}^{p \times p}. A natural unbiased estimator of \Sigma is the sample covariance matrix
: \widehat{\Sigma} := \frac{1}{n} \sum_{i=1}^n X_i X_i^\top.
In the low-dimensional setting where n increases and p is held fixed, \widehat{\Sigma} is a consistent estimator of \Sigma in any matrix norm.
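In code, the estimator is straightforward to form; the following numpy sketch (with illustrative n and p) draws zero-mean Gaussian data and computes \widehat{\Sigma}:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 1000, 3
X = rng.standard_normal((n, p))   # rows are X_i ~ N_p(0, I)

# Sample covariance for zero-mean data: (1/n) * sum_i X_i X_i^T.
Sigma_hat = (X.T @ X) / n
print(np.round(Sigma_hat, 2))     # close to the identity when n >> p
```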
When p grows with n, on the other hand, this consistency result may fail to hold. As an illustration, suppose that each X_i \sim N_p(0, I) and p/n \rightarrow \alpha \in (0,1). If \widehat{\Sigma} were to consistently estimate \Sigma = I, then the eigenvalues of \widehat{\Sigma} should approach one as n increases. It turns out that this is not the case in this high-dimensional setting: the largest and smallest eigenvalues of \widehat{\Sigma} concentrate around (1 + \sqrt{\alpha})^2 and (1 - \sqrt{\alpha})^2, respectively (with the fluctuations of the extreme eigenvalues around these limits described by the Tracy–Widom distribution), and these clearly deviate from the unit eigenvalues of \Sigma. Further information on the asymptotic behaviour of the eigenvalues of \widehat{\Sigma} can be obtained from the Marchenko–Pastur law. From a non-asymptotic point of view, the maximum eigenvalue \lambda_{\mathrm{max}}(\widehat{\Sigma}) of \widehat{\Sigma} satisfies
: \mathbb{P}\left(\lambda_{\mathrm{max}}(\widehat{\Sigma}) \geq (1 + \sqrt{p/n} + \delta)^2\right) \leq e^{-n\delta^2/2},
for every \delta \geq 0 and all n and p.
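This eigenvalue spreading is easy to reproduce numerically. The sketch below (with the illustrative choice n = 4000 and p = 1000, so that \alpha = 1/4) compares the extreme sample eigenvalues with the limits (1 \pm \sqrt{\alpha})^2:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 4000, 1000                     # alpha = p/n = 0.25
X = rng.standard_normal((n, p))       # rows X_i ~ N_p(0, I)
Sigma_hat = (X.T @ X) / n

eigs = np.linalg.eigvalsh(Sigma_hat)  # ascending; Sigma_hat is symmetric
alpha = p / n
print("lambda_min:", eigs[0],  "vs (1 - sqrt(alpha))^2 =", (1 - alpha**0.5) ** 2)
print("lambda_max:", eigs[-1], "vs (1 + sqrt(alpha))^2 =", (1 + alpha**0.5) ** 2)
```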
Again, additional low-dimensional structure is needed for successful covariance matrix estimation in high dimensions. Examples of such structures include sparsity, low-rankness and bandedness. Similar remarks apply when estimating an inverse covariance matrix (precision matrix).

==History==