A maximum likelihood estimator is an
extremum estimator obtained by maximizing, as a function of
θ, the
objective function \widehat{\ell\,}(\theta\,;x). If the data are
independent and identically distributed, then we have \widehat{\ell\,}(\theta\,;x)= \sum_{i=1}^n \ln f(x_i\mid\theta), this being the sample analogue of the expected log-likelihood \ell(\theta) = \operatorname{\mathbb E}[\, \ln f(x_i\mid\theta) \,], where this expectation is taken with respect to the true density. Maximum-likelihood estimators have no optimum properties for finite samples, in the sense that (when evaluated on finite samples) other estimators may have greater concentration around the true parameter-value. However, like other estimation methods, maximum likelihood estimation possesses a number of attractive
limiting properties: as the sample size increases to infinity, sequences of maximum likelihood estimators have the following properties:
* Consistency: the sequence of MLEs converges in probability to the value being estimated.
* Equivariance: if \hat{\theta} is the maximum likelihood estimator for \theta , and if g(\theta) is a bijective transform of \theta , then the maximum likelihood estimator for \alpha = g(\theta) is \hat{\alpha} = g(\hat{\theta} ) . The equivariance property can be generalized to non-bijective transforms, although in that case it applies to the maximum of an induced likelihood function, which is not the true likelihood in general.
* Efficiency, i.e. it achieves the Cramér–Rao lower bound when the sample size tends to infinity. This means that no consistent estimator has lower asymptotic mean squared error than the MLE (or other estimators attaining this bound), which also means that the MLE has asymptotic normality.
* Second-order efficiency after correction for bias.
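As an illustration of the definition above, the following sketch numerically maximizes the sample log-likelihood \widehat{\ell\,}(\theta\,;x)= \sum_{i=1}^n \ln f(x_i\mid\theta) for i.i.d. data. The exponential model and the SciPy optimizer are illustrative assumptions, not part of the definition; in this particular model the numerical optimum can be checked against the closed-form MLE, the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500)  # simulated i.i.d. data; true mean theta = 2

# Sample log-likelihood  l-hat(theta; x) = sum_i ln f(x_i | theta)
# for an exponential density f(x | theta) = (1/theta) exp(-x/theta).
def log_likelihood(theta, x):
    return np.sum(-np.log(theta) - x / theta)

# The MLE maximizes the log-likelihood, i.e. minimizes its negative.
result = minimize_scalar(lambda t: -log_likelihood(t, x), bounds=(1e-6, 100.0), method="bounded")

print(result.x)    # numerical maximizer of the log-likelihood
print(x.mean())    # closed-form MLE for this model: the sample mean
```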
== Consistency ==
Under the conditions outlined below, the maximum likelihood estimator is
consistent. This means that if the data were generated by f(\cdot\,;\theta_0) and we have a sufficiently large number of observations
n, then it is possible to find the value of
θ0 with arbitrary precision. In mathematical terms this means that as
n goes to infinity the estimator \widehat{\theta\,}
converges in probability to its true value: \widehat{\theta\,}_\mathrm{mle}\ \xrightarrow{\text{p}}\ \theta_0. Under slightly stronger conditions, the estimator converges
almost surely (or
strongly): \widehat{\theta\,}_\mathrm{mle}\ \xrightarrow{\text{a.s.}}\ \theta_0. In practical applications, data are never generated by f(\cdot\,;\theta_0). Rather, f(\cdot\,;\theta_0) is a model, often in idealized form, of the process that generated the data. It is a common aphorism in statistics that
all models are wrong. Thus, true consistency does not occur in practical applications. Nevertheless, consistency is often considered to be a desirable property for an estimator to have.

To establish consistency, the following conditions are sufficient.

1. Identification of the model: \theta \neq \theta_0 \quad \Leftrightarrow \quad f(\cdot\mid\theta)\neq f(\cdot\mid\theta_0). In other words, different parameter values θ correspond to different distributions within the model. If this condition did not hold, there would be some value θ1 such that θ0 and θ1 generate an identical distribution of the observable data. Then we would not be able to distinguish between these two parameters even with an infinite amount of data; these parameters would have been observationally equivalent. The identification condition is absolutely necessary for the ML estimator to be consistent. When this condition holds, the limiting likelihood function \ell(\theta\mid\cdot) has a unique global maximum at θ0.

2. Compactness: the parameter space Θ of the model is compact. The identification condition establishes that the log-likelihood has a unique global maximum; compactness implies that the likelihood cannot approach that maximum value arbitrarily closely at some other point. Compactness is only a sufficient condition and not a necessary condition; it can be replaced by certain other conditions, such as concavity of the log-likelihood function.

3. Continuity: the function \ln f(x\mid\theta) is continuous in θ for almost all values of x: \operatorname{\mathbb P} \Bigl[\; \ln f(x\mid\theta) \;\in\; C^0(\Theta) \;\Bigr] = 1. The continuity here can be replaced with a slightly weaker condition of upper semi-continuity.

4. Dominance: there exists an integrable function D(x) such that \Bigl|\ln f(x\mid\theta)\Bigr| < D(x) for all \theta\in\Theta. By the uniform law of large numbers, the dominance condition together with continuity establish the uniform convergence in probability of the log-likelihood: \sup_{\theta\in\Theta} \left|\widehat{\ell\,}(\theta\mid x) - \ell(\theta)\,\right|\ \xrightarrow{\text{p}}\ 0.

The dominance condition can be employed in the case of
i.i.d. observations. In the non-i.i.d. case, the uniform convergence in probability can be checked by showing that the sequence \widehat{\ell\,}(\theta\mid x) is
stochastically equicontinuous. If one wants to demonstrate that the ML estimator \widehat{\theta\,} converges to
θ0
almost surely, then a stronger condition of uniform convergence almost surely has to be imposed: \sup_{\theta\in\Theta} \left\|\;\widehat{\ell\,}(\theta\mid x) - \ell(\theta)\;\right\| \ \xrightarrow{\text{a.s.}}\ 0. Additionally, if (as assumed above) the data were generated by f(\cdot\,;\theta_0), then under certain conditions, it can also be shown that the maximum likelihood estimator
converges in distribution to a normal distribution. Specifically, \sqrt{n} \left(\widehat{\theta\,}_\mathrm{mle} - \theta_0\right)\ \xrightarrow{d}\ \mathcal{N}\left(0,\, I^{-1}\right) where I is the
Fisher information matrix.
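A small simulation can make both statements concrete. The model below (an exponential distribution with mean θ0, for which the MLE is the sample mean and the per-observation Fisher information is 1/θ0²) is a hypothetical example chosen because everything is available in closed form; any regular model behaves the same way asymptotically.

```python
import numpy as np

rng = np.random.default_rng(1)
theta0 = 2.0          # true parameter: mean of an exponential model
reps, n = 2000, 400   # Monte Carlo replications and sample size

# For this model the MLE of the mean is the sample mean, and the Fisher
# information per observation is I(theta) = 1 / theta**2.
theta_hat = rng.exponential(scale=theta0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (theta_hat - theta0)

print(theta_hat.mean())  # close to theta0: consistency
print(z.var())           # close to I^{-1} = theta0**2 = 4: asymptotic normality with variance I^{-1}
```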
== Functional invariance ==
The maximum likelihood estimator selects the parameter value which gives the observed data the largest possible probability (or probability density, in the continuous case). If the parameter consists of a number of components, then we define their separate maximum likelihood estimators as the corresponding components of the MLE of the complete parameter. Consistent with this, if \widehat{\theta\,} is the MLE for \theta, and if g(\theta) is any transformation of \theta, then the MLE for \alpha=g(\theta) is by definition \widehat{\alpha} = g(\,\widehat{\theta\,}\,). \, It maximizes the so-called
profile likelihood: \bar{L}(\alpha) = \sup_{\theta: \alpha = g(\theta)} L(\theta). \, The MLE is also equivariant with respect to certain transformations of the data. If y=g(x) where g is one to one and does not depend on the parameters to be estimated, then the density functions satisfy f_Y(y) = f_X(g^{-1}(y)) \, |(g^{-1}(y))^{\prime}| and hence the likelihood functions for X and Y differ only by a factor that does not depend on the model parameters. For example, the MLE parameters of the log-normal distribution are the same as those of the normal distribution fitted to the logarithm of the data. In fact, in the log-normal case if X\sim\mathcal{N}(0, 1), then Y=g(X)=e^{X} follows a
log-normal distribution. The density of Y then follows from the formula above, with f_X the standard normal density, g^{-1}(y) = \log(y) , and |(g^{-1}(y))^{\prime}| = \frac{1}{y} for y > 0.
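The log-normal example can be checked numerically. The sketch below is only an illustration: it fits a normal distribution to the logarithm of simulated data and a log-normal distribution to the data itself (in SciPy's parameterization the shape s corresponds to σ and the scale to e^μ, with the location fixed at zero), and the two sets of maximum likelihood estimates agree up to optimization error.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.lognormal(mean=1.0, sigma=0.5, size=1000)   # log-normally distributed sample

# MLE of the normal parameters fitted to the logarithm of the data.
mu_hat, sigma_hat = stats.norm.fit(np.log(y))

# MLE of the log-normal parameters fitted to the data itself; SciPy's lognorm
# uses shape s = sigma and scale = exp(mu), with the location fixed at zero.
s_hat, loc_hat, scale_hat = stats.lognorm.fit(y, floc=0)

print(mu_hat, np.log(scale_hat))   # same mu, up to optimization error
print(sigma_hat, s_hat)            # same sigma, up to optimization error
```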
== Efficiency ==
As assumed above, if the data were generated by ~f(\cdot\,;\theta_0)~, then under certain conditions, it can also be shown that the maximum likelihood estimator converges in distribution to a normal distribution. It is \sqrt{n\,}-consistent and asymptotically efficient, meaning that it reaches the
Cramér–Rao bound. However, the maximum likelihood estimator has a bias of order \frac{1}{\,n\,}; componentwise, this bias is b_h \; \equiv \; \operatorname{\mathbb E} \biggl[ \; \left( \widehat\theta_\mathrm{mle} - \theta_0 \right)_h \; \biggr] \; = \; \frac{1}{\,n\,} \, \sum_{i, j, k = 1}^m \; \mathcal{I}^{h i} \; \mathcal{I}^{j k} \left( \frac{1}{\,2\,} \, K_{i j k} \; + \; J_{j,i k} \right) where \mathcal{I}^{j k} (with superscripts) denotes the (
j,k)-th component of the
inverse Fisher information matrix \mathcal{I}^{-1}, and \frac{1}{\,2\,} \, K_{i j k} \; + \; J_{j,i k} \; = \; \operatorname{\mathbb E}\,\biggl[\; \frac12 \frac{\partial^3 \ln f_{\theta_0}(X_t)}{\partial\theta_i\;\partial\theta_j\;\partial\theta_k} + \frac{\;\partial\ln f_{\theta_0}(X_t)\;}{\partial\theta_j}\,\frac{\;\partial^2\ln f_{\theta_0}(X_t)\;}{\partial\theta_i \, \partial\theta_k} \; \biggr] ~ . Using these formulae it is possible to estimate the second-order bias of the maximum likelihood estimator, and
correct for that bias by subtracting it: \widehat{\theta\,}^*_\text{mle} = \widehat{\theta\,}_\text{mle} - \widehat{b\,} ~ . This estimator is unbiased up to the terms of order \frac{1}{\,n\,}, and is called the bias-corrected maximum likelihood estimator. This bias-corrected estimator is second-order efficient (at least within the curved exponential family), meaning that it has minimal mean squared error among all second-order bias-corrected estimators, up to the terms of the order \frac{1}{\,n^2\,}. It is possible to continue this process, that is to derive the third-order bias-correction term, and so on. However, the maximum likelihood estimator is
not third-order efficient.
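The general bias formula above is rarely evaluated by hand; in simple models the leading 1/n bias is known in closed form and can be subtracted directly. The sketch below uses a hypothetical example, the exponential rate parameter, whose MLE \hat\lambda = 1/\bar{x} has leading bias \lambda/n; plugging in \hat\lambda and subtracting gives the bias-corrected estimator \hat\lambda(1 - 1/n).

```python
import numpy as np

rng = np.random.default_rng(3)
lam0, n, reps = 3.0, 20, 100_000   # true rate, (small) sample size, Monte Carlo replications

# MLE of the exponential rate: lambda_hat = 1 / sample mean.
x = rng.exponential(scale=1.0 / lam0, size=(reps, n))
lam_hat = 1.0 / x.mean(axis=1)

# The leading (order 1/n) bias of lambda_hat is lambda/n; plugging in lambda_hat
# and subtracting gives the bias-corrected estimator lambda_hat * (1 - 1/n).
lam_corrected = lam_hat - lam_hat / n

print(lam_hat.mean() - lam0)        # bias roughly lam0 / n
print(lam_corrected.mean() - lam0)  # much closer to zero
```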
== Relation to Bayesian inference ==
A maximum likelihood estimator coincides with the
most probable Bayesian estimator given a
uniform prior distribution on the
parameters. Indeed, the
maximum a posteriori estimate is the parameter that maximizes the probability of \theta given the data, given by Bayes' theorem: \operatorname{\mathbb P}(\theta\mid x_1,x_2,\ldots,x_n) = \frac{f(x_1,x_2,\ldots,x_n\mid\theta)\operatorname{\mathbb P}(\theta)}{\operatorname{\mathbb P}(x_1,x_2,\ldots,x_n)} where \operatorname{\mathbb P}(\theta) is the prior distribution for the parameter \theta and where \operatorname{\mathbb P}(x_1,x_2,\ldots,x_n) is the probability of the data averaged over all parameters. Since the denominator is independent of \theta, the Bayesian estimator is obtained by maximizing f(x_1,x_2,\ldots,x_n\mid\theta)\operatorname{\mathbb P}(\theta) with respect to \theta. If we further assume that the prior \operatorname{\mathbb P}(\theta) is a uniform distribution, the Bayesian estimator is obtained by maximizing the likelihood function f(x_1,x_2,\ldots,x_n\mid\theta). Thus the Bayesian estimator coincides with the maximum likelihood estimator for a uniform prior distribution \operatorname{\mathbb P}(\theta).
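The coincidence of the MAP estimate and the MLE under a flat prior is easy to see numerically. The sketch below is an illustration under assumed details (a normal model with known unit variance and a uniform prior over a finite grid of θ values): adding the constant log-prior shifts the log-posterior but cannot move its maximizer.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(loc=1.5, scale=1.0, size=50)   # data from a normal model with known sigma = 1

theta_grid = np.linspace(-5.0, 5.0, 10_001)   # candidate values of the mean theta

# Log-likelihood of the whole sample for each candidate theta.
log_lik = stats.norm.logpdf(x[:, None], loc=theta_grid, scale=1.0).sum(axis=0)

# A uniform prior over the grid adds the same constant to every log-posterior value,
# so the maximizer cannot change.
log_post = log_lik + np.log(1.0 / (theta_grid[-1] - theta_grid[0]))

print(theta_grid[np.argmax(log_lik)])    # MLE
print(theta_grid[np.argmax(log_post)])   # MAP under the uniform prior: identical
print(x.mean())                          # closed-form MLE for comparison
```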
== Application of maximum-likelihood estimation in Bayes decision theory ==
In many practical applications in
machine learning, maximum-likelihood estimation is used as the model for parameter estimation. Bayesian decision theory is about designing a classifier that minimizes total expected risk; in particular, when the costs (the loss function) associated with different decisions are equal, the classifier minimizes the error over the whole distribution. Thus, the Bayes decision rule is stated as "decide \;w_1\; if ~\operatorname{\mathbb P}(w_1|x) \; > \; \operatorname{\mathbb P}(w_2|x)~;~ otherwise decide \;w_2\;", where \;w_1\,, w_2\; are predictions of different classes. From a perspective of minimizing error, it can also be stated as w = \underset{ w }{\operatorname{arg\;min}} \; \int_{-\infty}^\infty \operatorname{\mathbb P}(\text{ error}\mid x)\operatorname{\mathbb P}(x)\,\operatorname{d}x~ where \operatorname{\mathbb P}(\text{ error}\mid x) = \operatorname{\mathbb P}(w_1\mid x)~ if we decide \;w_2\; and \;\operatorname{\mathbb P}(\text{ error}\mid x) = \operatorname{\mathbb P}(w_2\mid x)\; if we decide \;w_1\;. By applying
Bayes' theorem \operatorname{\mathbb P}(w_i \mid x) = \frac{\operatorname{\mathbb P}(x \mid w_i) \operatorname{\mathbb P}(w_i)}{\operatorname{\mathbb P}(x)}, and if we further assume the zero-or-one loss function, which assigns the same loss to all errors, the Bayes decision rule can be reformulated as: h_\text{Bayes} = \underset{ w }{\operatorname{arg\;max}} \, \bigl[\, \operatorname{\mathbb P}(x\mid w)\,\operatorname{\mathbb P}(w) \,\bigr]\;, where h_\text{Bayes} is the prediction and \;\operatorname{\mathbb P}(w)\; is the
prior probability.
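A minimal sketch of this pipeline, under assumed details (two classes with Gaussian class-conditional densities, parameters fitted by maximum likelihood and class priors taken as the training proportions), is shown below; the decision rule implements h_\text{Bayes} = \arg\max_w \operatorname{\mathbb P}(x\mid w)\operatorname{\mathbb P}(w).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Training data for two classes, each modeled with a Gaussian class-conditional density.
train_1 = rng.normal(loc=0.0, scale=1.0, size=200)
train_2 = rng.normal(loc=2.0, scale=1.5, size=600)

mu1, sd1 = stats.norm.fit(train_1)   # maximum likelihood estimates for class w1
mu2, sd2 = stats.norm.fit(train_2)   # maximum likelihood estimates for class w2
prior1 = len(train_1) / (len(train_1) + len(train_2))
prior2 = 1.0 - prior1

def decide(x):
    # Bayes decision rule under zero-or-one loss: pick the class maximizing P(x | w) P(w).
    score1 = stats.norm.pdf(x, mu1, sd1) * prior1
    score2 = stats.norm.pdf(x, mu2, sd2) * prior2
    return np.where(score1 > score2, "w1", "w2")

print(decide(np.array([-0.5, 1.0, 3.0])))
```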
== Relation to minimizing Kullback–Leibler divergence and cross entropy ==
Finding \hat \theta that maximizes the likelihood is asymptotically equivalent to finding the \hat \theta that defines a probability distribution (Q_{\hat \theta}) that has a minimal distance, in terms of
Kullback–Leibler divergence, to the real probability distribution from which our data were generated (i.e., generated by P_{\theta_0}). In an ideal world, P and Q are the same (and the only thing unknown is the \theta that defines P), but even if they are not and the model we use is misspecified, the MLE will still give us the "closest" distribution (within the restriction of a model Q that depends on \hat \theta) to the real distribution P_{\theta_0}.
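The equivalence can be seen directly from the sample quantities: the average log-likelihood estimates -H(P, Q_\theta), the cross entropy, and the Kullback–Leibler divergence D_\text{KL}(P \parallel Q_\theta) differs from the cross entropy only by a term that does not depend on \theta. The sketch below uses an assumed, deliberately misspecified setup (data from a Student-t distribution, model a normal with unknown mean and fixed unit standard deviation) to illustrate this.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
# "True" distribution P: a shifted Student-t.  Fitted (misspecified) model Q_theta:
# a normal distribution with unknown mean theta and standard deviation fixed at 1.
x = rng.standard_t(df=5, size=2000) + 1.0

theta_grid = np.linspace(0.0, 2.0, 1001)
mean_log_q = stats.norm.logpdf(x[:, None], loc=theta_grid, scale=1.0).mean(axis=0)

# -mean_log_q estimates the cross entropy H(P, Q_theta); the KL divergence
# D_KL(P || Q_theta) = H(P, Q_theta) - H(P) differs from it only by a constant
# in theta, so maximizing the likelihood minimizes both.
cross_entropy = -mean_log_q

print(theta_grid[np.argmax(mean_log_q)])     # MLE of theta
print(theta_grid[np.argmin(cross_entropy)])  # same value: the Q_theta closest to P in KL
```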
== Prediction bias ==
Maximum likelihood estimates of parameters can be substituted into expressions for the
probability density function,
cumulative distribution function, or
quantile function, to generate predictions of probabilities or quantiles of out-of-sample events. This method for predicting probabilities is recommended in statistics textbooks and actuarial textbooks, and is widely used in the scientific literature. However, maximum likelihood prediction fails to propagate the uncertainty around the maximum likelihood parameter estimates into the prediction. As a result, the predicted probabilities are not well
calibrated, and should not be expected to correspond to the frequencies of out-of-sample events. In particular, tail exceedance probabilities and tail exceedance quantiles are typically underestimated, sometimes dramatically. The underestimation is largest when there is little training data, when many parameters are being estimated, and in the far tail. For cases where this prediction bias is a problem, Bayesian predictions can provide a solution if the prior is chosen so as to reduce or eliminate the bias.

== Examples ==