A random vector X ∈ R^p (a p×1 "column vector") has a multivariate normal distribution with a nonsingular covariance matrix Σ precisely if Σ ∈ R^{p×p} is a positive-definite matrix and the probability density function of X is
:f(x)=(2\pi)^{-\frac{p}{2}}\, \det(\Sigma)^{-\frac{1}{2}} \exp\left(-{1 \over 2} (x-\mu)^\mathrm{T} \Sigma^{-1} (x-\mu)\right)
where μ ∈ R^{p×1} is the expected value of X. The covariance matrix Σ is the multidimensional analog of what in one dimension would be the variance, and
:(2\pi)^{-\frac{p}{2}}\det(\Sigma)^{-\frac{1}{2}}
normalizes the density f(x) so that it integrates to 1.

Suppose now that X1, ..., Xn are independent and identically distributed samples from the distribution above. Based on the observed values x1, ..., xn of this sample, we wish to estimate Σ.
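To make the setup concrete, here is a minimal Python/NumPy sketch (not part of the original text) that draws i.i.d. samples from a multivariate normal and evaluates the density above; the dimension p, the sample size n, and the particular values of μ and Σ are arbitrary assumed choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 500                      # arbitrary dimension and sample size

mu = np.array([1.0, -2.0, 0.5])    # assumed mean vector
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)    # assumed positive-definite covariance

# i.i.d. samples x_1, ..., x_n (rows of X)
X = rng.multivariate_normal(mu, Sigma, size=n)

def mvn_density(x, mu, Sigma):
    """Multivariate normal density f(x) as displayed above."""
    d = x - mu
    quad = d @ np.linalg.solve(Sigma, d)
    norm = (2 * np.pi) ** (-p / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * quad)

print(mvn_density(X[0], mu, Sigma))
```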
===First steps===
The likelihood function is
: \mathcal{L}(\mu,\Sigma)=(2\pi)^{-\frac{np}{2}}\, \prod_{i=1}^n \det(\Sigma)^{-\frac{1}{2}} \exp\left(-\frac{1}{2} (x_i-\mu)^\mathrm{T} \Sigma^{-1} (x_i-\mu)\right).
It is fairly readily shown that the maximum-likelihood estimate of the mean vector μ is the "sample mean" vector
:\overline{x}=\frac{x_1+\cdots+x_n}{n}.
See the section on estimation in the article on the normal distribution for details; the process here is similar. Since the estimate \bar{x} does not depend on Σ, we can just substitute it for μ in the likelihood function, getting
: \mathcal{L}(\overline{x},\Sigma) \propto \det(\Sigma)^{-\frac{n}{2}} \exp\left(-{1 \over 2} \sum_{i=1}^n (x_i-\overline{x})^\mathrm{T} \Sigma^{-1} (x_i-\overline{x})\right),
and then seek the value of Σ that maximizes the likelihood of the data (in practice it is easier to work with \log \mathcal{L}).
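A small sketch of this step, assuming the variables X, n and p from the previous snippet: it evaluates the log-likelihood and illustrates numerically that, for any fixed Σ, the sample mean is the maximizing choice of μ.

```python
# Profile log-likelihood sketch (continues the snippet above; X is assumed).
import numpy as np

def log_likelihood(mu, Sigma, X):
    """log L(mu, Sigma) for the i.i.d. multivariate normal sample X (n x p)."""
    n, p = X.shape
    d = X - mu                                   # rows are x_i - mu
    quad = np.einsum('ij,jk,ik->', d, np.linalg.inv(Sigma), d)
    return (-0.5 * n * p * np.log(2 * np.pi)
            - 0.5 * n * np.log(np.linalg.det(Sigma))
            - 0.5 * quad)

x_bar = X.mean(axis=0)                           # sample mean
Sigma0 = np.cov(X, rowvar=False)                 # any positive-definite trial value
# For fixed Sigma, moving mu away from x_bar can only lower the log-likelihood:
assert log_likelihood(x_bar, Sigma0, X) >= log_likelihood(x_bar + 0.1, Sigma0, X)
```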
===The trace of a 1 × 1 matrix===
Now we come to the first surprising step: regard the scalar (x_i-\overline{x})^\mathrm{T} \Sigma^{-1} (x_i-\overline{x}) as the trace of a 1×1 matrix. This makes it possible to use the identity tr(AB) = tr(BA) whenever A and B are matrices so shaped that both products exist. We get
:\begin{align} \mathcal{L}(\overline{x},\Sigma) &\propto \det(\Sigma)^{-\frac{n}{2}} \exp\left(-{1 \over 2} \sum_{i=1}^n \left ( \left (x_i-\overline{x} \right )^\mathrm{T} \Sigma^{-1} \left (x_i-\overline{x} \right ) \right ) \right) \\ &=\det(\Sigma)^{-\frac{n}{2}} \exp\left(-{1 \over 2} \sum_{i=1}^n \operatorname{tr} \left ( \left (x_i-\overline{x} \right ) \left (x_i-\overline{x} \right )^\mathrm{T} \Sigma^{-1} \right ) \right) \\ &=\det(\Sigma)^{-\frac{n}{2}} \exp\left(-{1 \over 2} \operatorname{tr} \left( \sum_{i=1}^n \left (x_i-\overline{x} \right ) \left (x_i-\overline{x} \right )^\mathrm{T} \Sigma^{-1} \right) \right)\\ &=\det(\Sigma)^{-\frac{n}{2}} \exp\left(-{1 \over 2} \operatorname{tr} \left( S \Sigma^{-1} \right) \right) \end{align}
where
:S=\sum_{i=1}^n (x_i-\overline{x}) (x_i-\overline{x})^\mathrm{T} \in \mathbf{R}^{p\times p}.
S is sometimes called the scatter matrix, and it is positive definite if there exists a subset of the data consisting of p affinely independent observations (which we will assume).
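The trace identity and the scatter matrix are easy to check numerically; the following sketch assumes the sample X from the earlier snippets.

```python
# Numerical check of the trace trick (continues the snippets above; X is assumed).
import numpy as np

x_bar = X.mean(axis=0)
D = X - x_bar                                  # centered data, rows x_i - x_bar
S = D.T @ D                                    # scatter matrix: sum of outer products

Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))   # any p x p positive-definite inverse
lhs = sum(d @ Sigma_inv @ d for d in D)              # sum of scalar quadratic forms
rhs = np.trace(S @ Sigma_inv)                        # trace form used in the derivation
assert np.isclose(lhs, rhs)
```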
===Using the spectral theorem===
It follows from the spectral theorem of linear algebra that a positive-definite symmetric matrix S has a unique positive-definite symmetric square root S^{1/2}. We can again use the "cyclic property" of the trace to write
:\det(\Sigma)^{-\frac{n}{2}} \exp\left(-{1 \over 2} \operatorname{tr} \left( S^{\frac{1}{2}} \Sigma^{-1} S^{\frac{1}{2}} \right) \right).
Let B = S^{1/2} \Sigma^{-1} S^{1/2}. Then the expression above becomes
:\det(S)^{-\frac{n}{2}} \det(B)^{\frac{n}{2}} \exp\left(-{1 \over 2} \operatorname{tr} (B) \right).
The positive-definite matrix B can be diagonalized, so the problem of maximizing the likelihood reduces to finding the value of B that maximizes
:\det(B)^{\frac{n}{2}} \exp\left(-{1 \over 2} \operatorname{tr} (B) \right).
Since the trace of a square matrix equals the sum of its eigenvalues ("trace and eigenvalues"), and the determinant equals their product, this reduces to the problem of finding the eigenvalues λ1, ..., λp of B that maximize
:\prod_{i=1}^p \lambda_i^{\frac{n}{2}} \exp \left (-\frac{\lambda_i}{2} \right ).
This is just a calculus problem: setting the derivative of \frac{n}{2}\ln\lambda_i - \frac{\lambda_i}{2} to zero gives \frac{n}{2\lambda_i} - \frac{1}{2} = 0, so λi = n for all i. Thus, letting Q be the matrix of eigenvectors of B,
:B = Q (n I_p) Q^{-1} = n I_p,
i.e., n times the p×p identity matrix.
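A sketch of this step, assuming S, n and p from the earlier snippets: it forms the symmetric square root S^{1/2} from the eigendecomposition of S and confirms that B equals n I_p when Σ is taken to be S/n, the value found in the concluding steps below.

```python
# Spectral-theorem step (continues the snippets above; S, n, p are assumed).
import numpy as np

# Symmetric positive-definite square root of S via its eigendecomposition
eigvals, Q = np.linalg.eigh(S)
S_half = Q @ np.diag(np.sqrt(eigvals)) @ Q.T

# At the maximizing value Sigma = S / n, B = S^{1/2} Sigma^{-1} S^{1/2} equals n * I_p
Sigma_hat = S / n
B = S_half @ np.linalg.inv(Sigma_hat) @ S_half
assert np.allclose(B, n * np.eye(p))
```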
===Concluding steps===
Finally we get
:\Sigma=S^{\frac{1}{2}} B^{-1} S^{\frac{1}{2}}=S^{\frac{1}{2}}\left(\frac 1 n I_p\right)S^{\frac{1}{2}} = \frac S n,
i.e., the p×p "sample covariance matrix"
:{S \over n} = {1 \over n}\sum_{i=1}^n (X_i-\overline{X})(X_i-\overline{X})^\mathrm{T}
is the maximum-likelihood estimator of the "population covariance matrix" Σ. At this point we are using a capital X rather than a lower-case x because we are thinking of it "as an estimator rather than as an estimate", i.e., as something random whose probability distribution we could profit by knowing. The random matrix S can be shown to have a Wishart distribution with n − 1 degrees of freedom. That is:
:\sum_{i=1}^n (X_i-\overline{X})(X_i-\overline{X})^\mathrm{T} \sim W_p(\Sigma,n-1).
===Alternative derivation===
An alternative derivation of the maximum likelihood estimator can be performed via matrix calculus formulae (see also differential of a determinant and differential of the inverse matrix). It also verifies the aforementioned fact about the maximum likelihood estimate of the mean. Rewrite the likelihood in log form using the trace trick:
:\ln \mathcal{L}(\mu,\Sigma) = \operatorname{constant} -{n \over 2} \ln \det(\Sigma) -{1 \over 2} \operatorname{tr} \left[ \Sigma^{-1} \sum_{i=1}^n (x_i-\mu) (x_i-\mu)^\mathrm{T} \right].
The differential of this log-likelihood is
:d \ln \mathcal{L}(\mu,\Sigma) = -\frac{n}{2} \operatorname{tr} \left[ \Sigma^{-1} \left\{ d \Sigma \right\} \right] -{1 \over 2} \operatorname{tr} \left[ - \Sigma^{-1} \{ d \Sigma \} \Sigma^{-1} \sum_{i=1}^n (x_i-\mu)(x_i-\mu)^\mathrm{T} - 2 \Sigma^{-1} \sum_{i=1}^n (x_i - \mu) \{ d \mu \}^\mathrm{T} \right].
It naturally breaks down into a part related to the estimation of the mean and a part related to the estimation of the variance.
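As a sanity check on the displayed differential (not part of the original derivation), the following sketch compares it with a finite difference of the log-likelihood at an arbitrary non-optimal point; it assumes X, n, p and the log_likelihood helper from the earlier snippets.

```python
# Finite-difference check of the differential formula (continues the snippets above;
# X, n, p and log_likelihood are assumed from earlier).
import numpy as np

rng = np.random.default_rng(1)
mu0 = X.mean(axis=0) + 0.3                      # an arbitrary (non-optimal) point
Sigma0 = np.cov(X, rowvar=False) + np.eye(p)    # an arbitrary positive-definite point

eps = 1e-6
d_mu = eps * rng.standard_normal(p)
d_Sigma = eps * rng.standard_normal((p, p))
d_Sigma = (d_Sigma + d_Sigma.T) / 2             # keep the perturbation symmetric

Sigma0_inv = np.linalg.inv(Sigma0)
D = X - mu0
M = D.T @ D                                     # sum_i (x_i - mu0)(x_i - mu0)^T

# Differential formula from the text
d_lnL = (-0.5 * n * np.trace(Sigma0_inv @ d_Sigma)
         - 0.5 * np.trace(-Sigma0_inv @ d_Sigma @ Sigma0_inv @ M
                          - 2 * Sigma0_inv @ D.sum(axis=0)[:, None] @ d_mu[None, :]))

# Finite difference of the log-likelihood for the same small perturbation
fd = log_likelihood(mu0 + d_mu, Sigma0 + d_Sigma, X) - log_likelihood(mu0, Sigma0, X)
assert np.isclose(d_lnL, fd, rtol=1e-3)
```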
The first-order condition for a maximum, d \ln \mathcal{L}(\mu,\Sigma)=0, is satisfied when the terms multiplying d \mu and d \Sigma are identically zero. Assuming (the maximum likelihood estimate of) \Sigma is non-singular, the first-order condition for the estimate of the mean vector is
: \sum_{i=1}^n (x_i - \mu) = 0,
which leads to the maximum likelihood estimator
:\widehat \mu = \bar X = {1 \over n} \sum_{i=1}^n X_i.
This lets us simplify
:\sum_{i=1}^n (x_i-\mu)(x_i-\mu)^\mathrm{T} = \sum_{i=1}^n (x_i-\bar x)(x_i-\bar x)^\mathrm{T} = S,
as defined above. The terms involving d \Sigma in d \ln \mathcal{L} can then be combined as
: -{1 \over 2} \operatorname{tr} \left( \Sigma^{-1} \left\{ d \Sigma \right\} \left[ nI_p - \Sigma^{-1} S \right] \right).
The first-order condition d \ln \mathcal{L}(\mu,\Sigma)=0 holds when the term in the square brackets is (matrix-valued) zero. Pre-multiplying the latter by \Sigma and dividing by n gives
:\widehat \Sigma = {1 \over n} S,
which of course coincides with the canonical derivation given earlier. Dwyer points out that a decomposition into two terms such as appears above is "unnecessary" and derives the estimator in two lines of working. Note that it may not be trivial to show that the estimator derived in this way is the unique global maximizer of the likelihood function.
==Intrinsic covariance matrix estimation==