A maximum likelihood estimator is an
extremum estimator obtained by maximizing, as a function of
θ, the
objective function \widehat{\ell\,}(\theta\,;x). If the data are
independent and identically distributed, then we have \widehat{\ell\,}(\theta\,;x)= \sum_{i=1}^n \ln f(x_i\mid\theta), this being the sample analogue of the expected log-likelihood \ell(\theta) = \operatorname{\mathbb E}[\, \ln f(x_i\mid\theta) \,], where this expectation is taken with respect to the true density. Maximum-likelihood estimators have no optimum properties for finite samples, in the sense that (when evaluated on finite samples) other estimators may have greater concentration around the true parameter-value. However, like other estimation methods, maximum likelihood estimation possesses a number of attractive
limiting properties: as the sample size increases to infinity, sequences of maximum likelihood estimators have the following properties:
* Consistency: the sequence of MLEs converges in probability to the value being estimated.
* Equivariance: if \hat{\theta} is the maximum likelihood estimator for \theta , and if g(\theta) is a bijective transform of \theta , then the maximum likelihood estimator for \alpha = g(\theta) is \hat{\alpha} = g(\hat{\theta} ) . The equivariance property can be generalized to non-bijective transforms, although in that case it applies to the maximum of an induced likelihood function, which is not the true likelihood in general.
* Efficiency, i.e. it achieves the Cramér–Rao lower bound when the sample size tends to infinity. This means that no consistent estimator has lower asymptotic mean squared error than the MLE (or other estimators attaining this bound), which also means that the MLE has asymptotic normality.
* Second-order efficiency after correction for bias.
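As an illustration of the definition above, the following sketch numerically maximizes the sample log-likelihood \widehat{\ell\,}(\theta\,;x)= \sum_{i=1}^n \ln f(x_i\mid\theta) for i.i.d. data. The exponential model and the SciPy optimizer are illustrative assumptions, not part of the definition; in this particular model the numerical optimum can be checked against the closed-form MLE, the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500)  # simulated i.i.d. data; true mean theta = 2

# Sample log-likelihood  l-hat(theta; x) = sum_i ln f(x_i | theta)
# for an exponential density f(x | theta) = (1/theta) exp(-x/theta).
def log_likelihood(theta, x):
    return np.sum(-np.log(theta) - x / theta)

# The MLE maximizes the log-likelihood, i.e. minimizes its negative.
result = minimize_scalar(lambda t: -log_likelihood(t, x), bounds=(1e-6, 100.0), method="bounded")

print(result.x)    # numerical maximizer of the log-likelihood
print(x.mean())    # closed-form MLE for this model: the sample mean
```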
== Consistency ==
Under the conditions outlined below, the maximum likelihood estimator is
consistent. This means that if the data were generated by f(\cdot\,;\theta_0) and we have a sufficiently large number of observations
n, then it is possible to find the value of
θ0 with arbitrary precision. In mathematical terms this means that as
n goes to infinity the estimator \widehat{\theta\,}
converges in probability to its true value: \widehat{\theta\,}_\mathrm{mle}\ \xrightarrow{\text{p}}\ \theta_0. Under slightly stronger conditions, the estimator converges
almost surely (or
strongly): \widehat{\theta\,}_\mathrm{mle}\ \xrightarrow{\text{a.s.}}\ \theta_0. In practical applications, data are never generated by f(\cdot\,;\theta_0). Rather, f(\cdot\,;\theta_0) is a model, often in idealized form, of the process that generated the data. It is a common aphorism in statistics that
all models are wrong. Thus, true consistency does not occur in practical applications. Nevertheless, consistency is often considered to be a desirable property for an estimator to have.

To establish consistency, the following conditions are sufficient.

1. Identification of the model: \theta \neq \theta_0 \quad \Leftrightarrow \quad f(\cdot\mid\theta)\neq f(\cdot\mid\theta_0). In other words, different parameter values θ correspond to different distributions within the model. If this condition did not hold, there would be some value θ1 such that θ0 and θ1 generate an identical distribution of the observable data. Then we would not be able to distinguish between these two parameters even with an infinite amount of data; these parameters would have been observationally equivalent. The identification condition is absolutely necessary for the ML estimator to be consistent. When this condition holds, the limiting likelihood function \ell(\theta\mid\cdot) has a unique global maximum at θ0.

2. Compactness: the parameter space Θ of the model is compact. The identification condition establishes that the log-likelihood has a unique global maximum; compactness implies that the likelihood cannot approach that maximum value arbitrarily closely at some other point. Compactness is only a sufficient condition and not a necessary condition; it can be replaced by certain other conditions, such as concavity of the log-likelihood function.

3. Continuity: the function \ln f(x\mid\theta) is continuous in θ for almost all values of x: \operatorname{\mathbb P} \Bigl[\; \ln f(x\mid\theta) \;\in\; C^0(\Theta) \;\Bigr] = 1. The continuity here can be replaced with a slightly weaker condition of upper semi-continuity.

4. Dominance: there exists an integrable function D(x) such that \Bigl|\ln f(x\mid\theta)\Bigr| < D(x) for all \theta\in\Theta. By the uniform law of large numbers, the dominance condition together with continuity establish the uniform convergence in probability of the log-likelihood: \sup_{\theta\in\Theta} \left|\widehat{\ell\,}(\theta\mid x) - \ell(\theta)\,\right|\ \xrightarrow{\text{p}}\ 0.

The dominance condition can be employed in the case of
i.i.d. observations. In the non-i.i.d. case, the uniform convergence in probability can be checked by showing that the sequence \widehat{\ell\,}(\theta\mid x) is
stochastically equicontinuous. If one wants to demonstrate that the ML estimator \widehat{\theta\,} converges to
θ0
almost surely, then a stronger condition of uniform convergence almost surely has to be imposed: \sup_{\theta\in\Theta} \left\|\;\widehat{\ell\,}(\theta\mid x) - \ell(\theta)\;\right\| \ \xrightarrow{\text{a.s.}}\ 0. Additionally, if (as assumed above) the data were generated by f(\cdot\,;\theta_0), then under certain conditions, it can also be shown that the maximum likelihood estimator
converges in distribution to a normal distribution. Specifically, \sqrt{n} \left(\widehat{\theta\,}_\mathrm{mle} - \theta_0\right)\ \xrightarrow{d}\ \mathcal{N}\left(0,\, I^{-1}\right) where I is the
Fisher information matrix.
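A small simulation can make both statements concrete. The model below (an exponential distribution with mean θ0, for which the MLE is the sample mean and the per-observation Fisher information is 1/θ0²) is a hypothetical example chosen because everything is available in closed form; any regular model behaves the same way asymptotically.

```python
import numpy as np

rng = np.random.default_rng(1)
theta0 = 2.0          # true parameter: mean of an exponential model
reps, n = 2000, 400   # Monte Carlo replications and sample size

# For this model the MLE of the mean is the sample mean, and the Fisher
# information per observation is I(theta) = 1 / theta**2.
theta_hat = rng.exponential(scale=theta0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (theta_hat - theta0)

print(theta_hat.mean())  # close to theta0: consistency
print(z.var())           # close to I^{-1} = theta0**2 = 4: asymptotic normality with variance I^{-1}
```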
== Functional invariance ==
The maximum likelihood estimator selects the parameter value which gives the observed data the largest possible probability (or probability density, in the continuous case). If the parameter consists of a number of components, then we define their separate maximum likelihood estimators as the corresponding components of the MLE of the complete parameter. Consistent with this, if \widehat{\theta\,} is the MLE for \theta, and if g(\theta) is any transformation of \theta, then the MLE for \alpha=g(\theta) is by definition \widehat{\alpha} = g(\,\widehat{\theta\,}\,). \, It maximizes the so-called
profile likelihood: \bar{L}(\alpha) = \sup_{\theta: \alpha = g(\theta)} L(\theta). \, The MLE is also equivariant with respect to certain transformations of the data. If y=g(x) where g is one to one and does not depend on the parameters to be estimated, then the density functions satisfy f_Y(y) = f_X(g^{-1}(y)) \, |(g^{-1}(y))^{\prime}| and hence the likelihood functions for X and Y differ only by a factor that does not depend on the model parameters. For example, the MLE parameters of the log-normal distribution are the same as those of the normal distribution fitted to the logarithm of the data. In fact, in the log-normal case if X\sim\mathcal{N}(0, 1), then Y=g(X)=e^{X} follows a
log-normal distribution. The density of Y then follows from the formula above, with f_X the standard normal density, g^{-1}(y) = \log(y) , and |(g^{-1}(y))^{\prime}| = \frac{1}{y} for y > 0.
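The log-normal example can be checked numerically. The sketch below is only an illustration: it fits a normal distribution to the logarithm of simulated data and a log-normal distribution to the data itself (in SciPy's parameterization the shape s corresponds to σ and the scale to e^μ, with the location fixed at zero), and the two sets of maximum likelihood estimates agree up to optimization error.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.lognormal(mean=1.0, sigma=0.5, size=1000)   # log-normally distributed sample

# MLE of the normal parameters fitted to the logarithm of the data.
mu_hat, sigma_hat = stats.norm.fit(np.log(y))

# MLE of the log-normal parameters fitted to the data itself; SciPy's lognorm
# uses shape s = sigma and scale = exp(mu), with the location fixed at zero.
s_hat, loc_hat, scale_hat = stats.lognorm.fit(y, floc=0)

print(mu_hat, np.log(scale_hat))   # same mu, up to optimization error
print(sigma_hat, s_hat)            # same sigma, up to optimization error
```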
== Efficiency ==
As assumed above, if the data were generated by ~f(\cdot\,;\theta_0)~, then under certain conditions, it can also be shown that the maximum likelihood estimator converges in distribution to a normal distribution. It is \sqrt{n\,}-consistent and asymptotically efficient, meaning that it reaches the
Cramér–Rao bound. However, the maximum likelihood estimator has a bias of order \frac{1}{\,n\,}; componentwise, this bias is b_h \; \equiv \; \operatorname{\mathbb E} \biggl[ \; \left( \widehat\theta_\mathrm{mle} - \theta_0 \right)_h \; \biggr] \; = \; \frac{1}{\,n\,} \, \sum_{i, j, k = 1}^m \; \mathcal{I}^{h i} \; \mathcal{I}^{j k} \left( \frac{1}{\,2\,} \, K_{i j k} \; + \; J_{j,i k} \right) where \mathcal{I}^{j k} (with superscripts) denotes the (
j,k)-th component of the
inverse Fisher information matrix \mathcal{I}^{-1}, and \frac{1}{\,2\,} \, K_{i j k} \; + \; J_{j,i k} \; = \; \operatorname{\mathbb E}\,\biggl[\; \frac12 \frac{\partial^3 \ln f_{\theta_0}(X_t)}{\partial\theta_i\;\partial\theta_j\;\partial\theta_k} + \frac{\;\partial\ln f_{\theta_0}(X_t)\;}{\partial\theta_j}\,\frac{\;\partial^2\ln f_{\theta_0}(X_t)\;}{\partial\theta_i \, \partial\theta_k} \; \biggr] ~ . Using these formulae it is possible to estimate the second-order bias of the maximum likelihood estimator, and
correct for that bias by subtracting it: \widehat{\theta\,}^*_\text{mle} = \widehat{\theta\,}_\text{mle} - \widehat{b\,} ~ . This estimator is unbiased up to the terms of order \frac{1}{\,n\,}, and is called the bias-corrected maximum likelihood estimator. This bias-corrected estimator is second-order efficient (at least within the curved exponential family), meaning that it has minimal mean squared error among all second-order bias-corrected estimators, up to the terms of the order \frac{1}{\,n^2\,}. It is possible to continue this process, that is to derive the third-order bias-correction term, and so on. However, the maximum likelihood estimator is
not third-order efficient.
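The general bias formula above is rarely evaluated by hand; in simple models the leading 1/n bias is known in closed form and can be subtracted directly. The sketch below uses a hypothetical example, the exponential rate parameter, whose MLE \hat\lambda = 1/\bar{x} has leading bias \lambda/n; plugging in \hat\lambda and subtracting gives the bias-corrected estimator \hat\lambda(1 - 1/n).

```python
import numpy as np

rng = np.random.default_rng(3)
lam0, n, reps = 3.0, 20, 100_000   # true rate, (small) sample size, Monte Carlo replications

# MLE of the exponential rate: lambda_hat = 1 / sample mean.
x = rng.exponential(scale=1.0 / lam0, size=(reps, n))
lam_hat = 1.0 / x.mean(axis=1)

# The leading (order 1/n) bias of lambda_hat is lambda/n; plugging in lambda_hat
# and subtracting gives the bias-corrected estimator lambda_hat * (1 - 1/n).
lam_corrected = lam_hat - lam_hat / n

print(lam_hat.mean() - lam0)        # bias roughly lam0 / n
print(lam_corrected.mean() - lam0)  # much closer to zero
```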
== Relation to Bayesian inference ==
A maximum likelihood estimator coincides with the
most probable Bayesian estimator given a
uniform prior distribution on the
parameters. Indeed, the
maximum a posteriori estimate is the parameter that maximizes the probability of \theta given the data, given by Bayes' theorem: \operatorname{\mathbb P}(\theta\mid x_1,x_2,\ldots,x_n) = \frac{f(x_1,x_2,\ldots,x_n\mid\theta)\operatorname{\mathbb P}(\theta)}{\operatorname{\mathbb P}(x_1,x_2,\ldots,x_n)} where \operatorname{\mathbb P}(\theta) is the prior distribution for the parameter \theta and where \operatorname{\mathbb P}(x_1,x_2,\ldots,x_n) is the probability of the data averaged over all parameters. Since the denominator is independent of \theta, the Bayesian estimator is obtained by maximizing f(x_1,x_2,\ldots,x_n\mid\theta)\operatorname{\mathbb P}(\theta) with respect to \theta. If we further assume that the prior \operatorname{\mathbb P}(\theta) is a uniform distribution, the Bayesian estimator is obtained by maximizing the likelihood function f(x_1,x_2,\ldots,x_n\mid\theta). Thus the Bayesian estimator coincides with the maximum likelihood estimator for a uniform prior distribution \operatorname{\mathbb P}(\theta).
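The coincidence of the MAP estimate and the MLE under a flat prior is easy to see numerically. The sketch below is an illustration under assumed details (a normal model with known unit variance and a uniform prior over a finite grid of θ values): adding the constant log-prior shifts the log-posterior but cannot move its maximizer.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(loc=1.5, scale=1.0, size=50)   # data from a normal model with known sigma = 1

theta_grid = np.linspace(-5.0, 5.0, 10_001)   # candidate values of the mean theta

# Log-likelihood of the whole sample for each candidate theta.
log_lik = stats.norm.logpdf(x[:, None], loc=theta_grid, scale=1.0).sum(axis=0)

# A uniform prior over the grid adds the same constant to every log-posterior value,
# so the maximizer cannot change.
log_post = log_lik + np.log(1.0 / (theta_grid[-1] - theta_grid[0]))

print(theta_grid[np.argmax(log_lik)])    # MLE
print(theta_grid[np.argmax(log_post)])   # MAP under the uniform prior: identical
print(x.mean())                          # closed-form MLE for comparison
```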
== Application of maximum-likelihood estimation in Bayes decision theory ==
In many practical applications in
machine learning, maximum-likelihood estimation is used as the model for parameter estimation. Bayesian decision theory is about designing a classifier that minimizes total expected risk; in particular, when the costs (the loss function) associated with different decisions are equal, the classifier minimizes the error over the whole distribution. Thus, the Bayes decision rule is stated as "decide \;w_1\; if ~\operatorname{\mathbb P}(w_1|x) \; > \; \operatorname{\mathbb P}(w_2|x)~;~ otherwise decide \;w_2\;", where \;w_1\,, w_2\; are predictions of different classes. From a perspective of minimizing error, it can also be stated as w = \underset{ w }{\operatorname{arg\;min}} \; \int_{-\infty}^\infty \operatorname{\mathbb P}(\text{ error}\mid x)\operatorname{\mathbb P}(x)\,\operatorname{d}x~ where \operatorname{\mathbb P}(\text{ error}\mid x) = \operatorname{\mathbb P}(w_1\mid x)~ if we decide \;w_2\; and \;\operatorname{\mathbb P}(\text{ error}\mid x) = \operatorname{\mathbb P}(w_2\mid x)\; if we decide \;w_1\;. By applying
Bayes' theorem \operatorname{\mathbb P}(w_i \mid x) = \frac{\operatorname{\mathbb P}(x \mid w_i) \operatorname{\mathbb P}(w_i)}{\operatorname{\mathbb P}(x)}, and if we further assume the zero-or-one loss function, which assigns the same loss to all errors, the Bayes decision rule can be reformulated as: h_\text{Bayes} = \underset{ w }{\operatorname{arg\;max}} \, \bigl[\, \operatorname{\mathbb P}(x\mid w)\,\operatorname{\mathbb P}(w) \,\bigr]\;, where h_\text{Bayes} is the prediction and \;\operatorname{\mathbb P}(w)\; is the
prior probability.
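A minimal sketch of this pipeline, under assumed details (two classes with Gaussian class-conditional densities, parameters fitted by maximum likelihood and class priors taken as the training proportions), is shown below; the decision rule implements h_\text{Bayes} = \arg\max_w \operatorname{\mathbb P}(x\mid w)\operatorname{\mathbb P}(w).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Training data for two classes, each modeled with a Gaussian class-conditional density.
train_1 = rng.normal(loc=0.0, scale=1.0, size=200)
train_2 = rng.normal(loc=2.0, scale=1.5, size=600)

mu1, sd1 = stats.norm.fit(train_1)   # maximum likelihood estimates for class w1
mu2, sd2 = stats.norm.fit(train_2)   # maximum likelihood estimates for class w2
prior1 = len(train_1) / (len(train_1) + len(train_2))
prior2 = 1.0 - prior1

def decide(x):
    # Bayes decision rule under zero-or-one loss: pick the class maximizing P(x | w) P(w).
    score1 = stats.norm.pdf(x, mu1, sd1) * prior1
    score2 = stats.norm.pdf(x, mu2, sd2) * prior2
    return np.where(score1 > score2, "w1", "w2")

print(decide(np.array([-0.5, 1.0, 3.0])))
```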
== Relation to minimizing Kullback–Leibler divergence and cross entropy ==
Finding \hat \theta that maximizes the likelihood is asymptotically equivalent to finding the \hat \theta that defines a probability distribution (Q_{\hat \theta}) that has a minimal distance, in terms of
Kullback–Leibler divergence, to the real probability distribution from which our data were generated (i.e., generated by P_{\theta_0}). In an ideal world, P and Q are the same (and the only thing unknown is the \theta that defines P), but even if they are not and the model we use is misspecified, the MLE will still give us the "closest" distribution (within the restriction of a model Q that depends on \hat \theta) to the real distribution P_{\theta_0}.
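The equivalence can be seen directly from the sample quantities: the average log-likelihood estimates -H(P, Q_\theta), the cross entropy, and the Kullback–Leibler divergence D_\text{KL}(P \parallel Q_\theta) differs from the cross entropy only by a term that does not depend on \theta. The sketch below uses an assumed, deliberately misspecified setup (data from a Student-t distribution, model a normal with unknown mean and fixed unit standard deviation) to illustrate this.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
# "True" distribution P: a shifted Student-t.  Fitted (misspecified) model Q_theta:
# a normal distribution with unknown mean theta and standard deviation fixed at 1.
x = rng.standard_t(df=5, size=2000) + 1.0

theta_grid = np.linspace(0.0, 2.0, 1001)
mean_log_q = stats.norm.logpdf(x[:, None], loc=theta_grid, scale=1.0).mean(axis=0)

# -mean_log_q estimates the cross entropy H(P, Q_theta); the KL divergence
# D_KL(P || Q_theta) = H(P, Q_theta) - H(P) differs from it only by a constant
# in theta, so maximizing the likelihood minimizes both.
cross_entropy = -mean_log_q

print(theta_grid[np.argmax(mean_log_q)])     # MLE of theta
print(theta_grid[np.argmin(cross_entropy)])  # same value: the Q_theta closest to P in KL
```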
== Prediction bias ==
Maximum likelihood estimates of parameters can be substituted into expressions for the
probability density function,
cumulative distribution function, or
quantile function, to generate predictions of probabilities or quantiles of out-of-sample events. This method for predicting probabilities is recommended in statistics textbooks and actuarial textbooks, and is widely used in the scientific literature. However, maximum likelihood prediction fails to propagate the uncertainty around the maximum likelihood parameter estimates into the prediction. As a result, the predicted probabilities are not well
calibrated, and should not be expected to correspond to the frequencies of out-of-sample events. In particular, tail exceedance probabilities and tail exceedance quantiles are typically underestimated, sometimes dramatically. The underestimation is largest when there is little training data, when many parameters are being estimated, and in the far tail. For cases where this prediction bias is a problem, Bayesian predictions can provide a solution if the prior is chosen so as to reduce or eliminate the bias.

== Examples ==