The likelihood function, parameterized by a (possibly multivariate) parameter \theta, is usually defined differently for
discrete and continuous probability distributions (a more general definition is discussed below). Given a probability density or mass function x\mapsto f(x \mid \theta), where x is a realization of the random variable X, the likelihood function is \theta\mapsto f(x \mid \theta), often written \mathcal{L}(\theta \mid x). In other words, when f(x\mid\theta) is viewed as a function of x with \theta fixed, it is a probability density function, and when viewed as a function of \theta with x fixed, it is a likelihood function. In the
frequentist paradigm, the notation f(x\mid\theta) is often avoided and instead f(x;\theta) or f(x,\theta) are used to indicate that \theta is regarded as a fixed unknown quantity rather than as a
random variable being conditioned on. The likelihood function does
not specify the probability that \theta is the truth, given the observed sample X = x. Such an interpretation is a common error, with potentially disastrous consequences (see
prosecutor's fallacy).
===Discrete probability distribution===
Let X be a discrete
random variable with
probability mass function p depending on a parameter \theta. Then the function \mathcal{L}(\theta \mid x) = p_\theta (x) = P_\theta (X=x) = \text{Pr}\{ X=x \mid \Theta=\theta \} , considered as a function of \theta, a possible value of the deterministic but unknown parameter \Theta, is the
likelihood function, given the
outcome x of the random variable X. Sometimes the probability of "the value x of X for the parameter value \theta" is written as P(X = x \mid \theta) or P(X = x; \theta). The likelihood is the probability that a particular outcome x is observed when the true value of the parameter is \theta, equivalent to the probability mass on x; it is
not a probability density over the parameter \theta. The likelihood, \mathcal{L}(\theta \mid x) , should not be confused with P(\theta \mid x), which is the posterior probability of \theta given the data x.
====Example====
Consider a simple statistical model of a coin flip: a single parameter p_\text{H} that expresses the "fairness" of the coin. The parameter is the probability that a coin lands heads up ("H") when tossed. p_\text{H} can take on any value within the range 0.0 to 1.0. For a perfectly
fair coin, p_\text{H} = 0.5. Imagine flipping a fair coin twice, and observing two heads in two tosses ("HH"). Assuming that each successive coin flip is
i.i.d., the probability of observing HH is P(\text{HH} \mid p_\text{H}=0.5) = 0.5^2 = 0.25. Equivalently, the likelihood of observing "HH" assuming p_\text{H} = 0.5 is \mathcal{L}(p_\text{H}=0.5 \mid \text{HH}) = 0.25. This is not the same as saying that P(p_\text{H} = 0.5 \mid \text{HH}) = 0.25, a conclusion which could only be reached via
Bayes' theorem given knowledge about the marginal probabilities P(p_\text{H} = 0.5) and P(\text{HH}). Now suppose that the coin is not a fair coin, but instead that p_\text{H} = 0.3. Then the probability of two heads on two flips is P(\text{HH} \mid p_\text{H}=0.3) = 0.3^2 = 0.09. Hence \mathcal{L}(p_\text{H}=0.3 \mid \text{HH}) = 0.09. More generally, for each value of p_\text{H}, we can calculate the corresponding likelihood. The result of such calculations is displayed in Figure 1. The integral of \mathcal{L} over [0, 1] is 1/3; likelihoods need not integrate or sum to one over the parameter space.
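The likelihood curve described above can also be reproduced numerically. The following sketch is an illustrative computation (the grid size and the use of NumPy are incidental choices made here): it evaluates \mathcal{L}(p_\text{H} \mid \text{HH}) = p_\text{H}^2 over [0, 1] and recovers the values quoted above and the integral of 1/3.

<syntaxhighlight lang="python">
import numpy as np

def likelihood_hh(p_h):
    # Likelihood of observing "HH" in two independent tosses: L(p_H | HH) = p_H**2.
    return p_h ** 2

print(likelihood_hh(0.5))   # 0.25
print(likelihood_hh(0.3))   # approximately 0.09 (floating-point rounding)

# Evaluate the likelihood over the whole parameter space [0, 1].
p_grid = np.linspace(0.0, 1.0, 1001)
curve = likelihood_hh(p_grid)

print(p_grid[np.argmax(curve)])   # 1.0 -- the value of p_H that maximizes L(p_H | HH)

# The mean over a uniform grid on the unit interval approximates the integral of the curve.
print(curve.mean())               # ~1/3 -- the likelihood does not integrate to 1 over [0, 1]
</syntaxhighlight>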
===Continuous probability distribution===
Let X be a
random variable following an
absolutely continuous probability distribution with
density function f (a function of x) which depends on a parameter \theta. Then the function \mathcal{L}(\theta \mid x) = f_\theta (x), considered as a function of \theta, is the
likelihood function (of \theta, given the
outcome X=x). Again, \mathcal{L} is not a probability density or mass function over \theta, despite being a function of \theta given the observation X = x.
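As an illustrative sketch (the normal model, the unit variance, the observed value, and the use of scipy.stats are choices made here, not part of the text), the same density function can be read in both directions: as a density in x for a fixed mean, and as a likelihood in the mean \mu for a fixed observation.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

x_obs = 2.5  # a fixed observation of X (illustrative value)

# As a function of x with the mean fixed, norm.pdf(x, loc=mu, scale=1.0) is a density.
# As a function of the mean mu with x_obs fixed, it is the likelihood L(mu | x_obs).
mu_grid = np.linspace(-2.0, 7.0, 901)
likelihood = norm.pdf(x_obs, loc=mu_grid, scale=1.0)

# The likelihood of the mean is maximized at mu = x_obs.
print(mu_grid[np.argmax(likelihood)])   # 2.5
</syntaxhighlight>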
====Relationship between the likelihood and probability density functions====
The use of the
probability density in specifying the likelihood function above is justified as follows. Given an observation x_j, the likelihood for the interval [x_j, x_j + h], where h > 0 is a constant, is given by \mathcal{L}(\theta \mid x \in [x_j, x_j{+}h]). Observe that \operatorname{arg\,max}_\theta \mathcal{L}(\theta\mid x \in [x_j, x_j{+}h]) = \operatorname{arg\,max}_\theta \frac{1}{h} \mathcal{L}(\theta\mid x \in [x_j, x_j{+}h]) \,, since h is positive and constant. Because \begin{align} \operatorname{arg\,max}_\theta \frac 1 h \mathcal{L}(\theta\mid x \in [x_j, x_j{+}h]) &= \operatorname{arg\,max}_\theta \frac 1 h \Pr(x_j \leq x \leq x_j{+}h \mid \theta) \\ &= \operatorname{arg\,max}_\theta \frac 1 h \int_{x_j}^{x_j+h} f(x\mid \theta) \,dx, \end{align} where f(x\mid \theta) is the probability density function, it follows that \operatorname{arg\,max}_\theta \mathcal{L}(\theta\mid x \in [x_j, x_j + h]) = \operatorname{arg\,max}_\theta \frac{1}{h} \int_{x_j}^{x_j+h} f(x\mid\theta) \,dx. The first
fundamental theorem of calculus provides that \lim_{h \to 0^{+}} \frac 1 h \int_{x_j}^{x_j+h} f(x\mid\theta) \,dx = f(x_j \mid \theta). Then \begin{align} \operatorname{arg\,max}_\theta \mathcal{L}(\theta\mid x_j) &= \operatorname{arg\,max}_\theta \left[ \lim_{h\to 0^+} \mathcal{L}(\theta\mid x \in [x_j,\, x_j{+}h]) \right] \\[4pt] &= \operatorname{arg\,max}_\theta \left[ \lim_{h\to 0^{+}} \frac{1}{h} \int_{x_j}^{x_j+h} f(x\mid\theta) \,dx \right] \\[4pt] &= \operatorname{arg\,max}_\theta f(x_j \mid \theta). \end{align} Therefore, \operatorname{arg\,max}_\theta \mathcal{L}(\theta\mid x_j) = \operatorname{arg\,max}_\theta f(x_j \mid \theta), and so maximizing the probability density at x_j amounts to maximizing the likelihood of the specific observation x_j.
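The limiting step can be checked numerically. In the sketch below (illustrative only; it takes f(x \mid \theta) to be a unit-variance normal density with mean \theta and picks arbitrary values for x_j and \theta), the quantity \tfrac{1}{h}\Pr(x_j \le x \le x_j{+}h \mid \theta) is computed from the cumulative distribution function and approaches the density f(x_j \mid \theta) as h \to 0^{+}.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

x_j = 1.2      # the fixed observation (illustrative value)
theta = 0.7    # a candidate parameter value (the mean of a unit-variance normal)

# (1/h) * Pr(x_j <= X <= x_j + h | theta), computed from the CDF, for shrinking h.
for h in [1.0, 0.1, 0.01, 0.001]:
    interval_prob = norm.cdf(x_j + h, loc=theta) - norm.cdf(x_j, loc=theta)
    print(h, interval_prob / h)

# The values above converge to the density f(x_j | theta) as h -> 0+.
print(norm.pdf(x_j, loc=theta))
</syntaxhighlight>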
===In general===
In
measure-theoretic probability theory, the
density function is defined as the
Radon–Nikodym derivative of the probability distribution relative to a common dominating measure. The likelihood function is this density interpreted as a function of the parameter, rather than the random variable. Thus, we can construct a likelihood function for any distribution, whether discrete, continuous, a mixture, or otherwise. (Likelihoods are comparable, e.g. for parameter estimation, only if they are Radon–Nikodym derivatives with respect to the same dominating measure.) The above discussion of the likelihood for discrete random variables uses the
counting measure, under which the probability density at any outcome equals the probability of that outcome.
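In symbols, if P_\theta denotes the distribution of X and \mu is a dominating measure, the likelihood is the Radon–Nikodym derivative evaluated at the observation and read as a function of the parameter, \mathcal{L}(\theta \mid x) = \frac{\mathrm{d}P_\theta}{\mathrm{d}\mu}(x) \,; taking \mu to be the counting measure recovers the probability mass function, and taking \mu to be the Lebesgue measure recovers the probability density function.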
===Likelihoods for mixed continuous–discrete distributions===
The above can be extended in a simple way to allow consideration of distributions which contain both discrete and continuous components. Suppose that the distribution consists of a number of discrete probability masses p_k(\theta) and a density f(x\mid\theta), where the sum of all the p's added to the integral of f is always one. Assuming that it is possible to distinguish an observation corresponding to one of the discrete probability masses from one which corresponds to the density component, the likelihood function for an observation from the continuous component can be dealt with in the manner shown above. The likelihood function for an observation from the discrete component is simply \mathcal{L}(\theta \mid x) = p_k(\theta), where k is the index of the discrete probability mass corresponding to the observation x, because maximizing the probability mass (or probability) at x amounts to maximizing the likelihood of the specific observation. The fact that the likelihood function can be defined in a way that includes contributions that are not commensurate (the density and the probability mass) arises from the way in which the likelihood function is defined up to a constant of proportionality, where this "constant" can change with the observation x, but not with the parameter \theta.
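For a concrete sketch (the zero-inflated exponential model below is chosen here for illustration and is not part of the text), consider a distribution that places probability mass \pi at zero and spreads the remaining mass over the positive reals with an exponential density of rate \lambda. The log-likelihood of a sample then adds \log \pi for each observation equal to zero and \log\big((1-\pi)\,\lambda e^{-\lambda x}\big) for each positive observation.

<syntaxhighlight lang="python">
import numpy as np

def log_likelihood(pi, lam, sample):
    # Zero-inflated exponential model (illustrative):
    #   P(X = 0) = pi, and for x > 0 the density is (1 - pi) * lam * exp(-lam * x).
    # Observations from the discrete component contribute a probability mass,
    # observations from the continuous component contribute a density value.
    sample = np.asarray(sample, dtype=float)
    is_zero = sample == 0.0
    ll = np.sum(is_zero) * np.log(pi)
    ll += np.sum(np.log((1.0 - pi) * lam) - lam * sample[~is_zero])
    return ll

sample = [0.0, 0.0, 0.3, 1.7, 0.0, 2.4]   # made-up data: three zeros and three positive values
print(log_likelihood(pi=0.5, lam=1.0, sample=sample))
print(log_likelihood(pi=0.5, lam=0.5, sample=sample))   # compare the likelihood at two parameter values
</syntaxhighlight>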
===Regularity conditions===
In the context of parameter estimation, the likelihood function is usually assumed to obey certain conditions, known as regularity conditions. These conditions are assumed in various proofs involving likelihood functions, and need to be verified in each particular application. For maximum likelihood estimation, the existence of a global maximum of the likelihood function is of the utmost importance. By the
extreme value theorem, it suffices that the likelihood function is
continuous on a
compact parameter space for the maximum likelihood estimator to exist. While the continuity assumption is usually met, the compactness assumption about the parameter space is often not, as the bounds of the true parameter values might be unknown. In that case,
concavity of the likelihood function plays a key role. More specifically, if the likelihood function is twice continuously differentiable on the k-dimensional parameter space \Theta assumed to be an
open connected subset of \mathbb{R}^{k} \,, there exists a unique maximum \hat{\theta} \in \Theta if the
matrix of second partials \mathbf{H}(\theta) \equiv \left[\, \frac{ \partial^2 L }{ \partial \theta_i \, \partial \theta_j } \,\right]_{i,j=1}^{k,k} \; is
negative definite for every \theta \in \Theta at which the gradient \nabla L \equiv \left[ \frac{ \partial L }{\partial \theta_i} \right]_{i=1}^{k} vanishes, and if the likelihood function vanishes on the
boundary of the parameter space, \partial \Theta, i.e., \lim_{\theta \to \partial \Theta} L(\theta) = 0 \;, which may include the points at infinity if \Theta is unbounded.
Morse theory while informally appealing to a mountain pass property. Mascarenhas restates their proof using the
mountain pass theorem.
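The uniqueness condition can be checked directly in simple models. The sketch below is an illustrative numerical check (the normal model, the simulated data, and the finite-difference step are choices made here): it evaluates the log-likelihood of an i.i.d. normal sample in the parameters (\mu, \sigma), approximates the matrix of second partials at the maximum-likelihood point, and verifies that it is negative definite there. The log-likelihood is used for numerical convenience; at a point where the gradient vanishes, its Hessian is negative definite exactly when that of L is.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=200)   # simulated sample (illustrative)

def log_lik(theta):
    # Log-likelihood of an i.i.d. normal sample in theta = (mu, sigma).
    mu, sigma = theta
    return np.sum(-0.5 * np.log(2.0 * np.pi) - np.log(sigma)
                  - 0.5 * ((data - mu) / sigma) ** 2)

# Closed-form maximum-likelihood estimates for the normal model.
theta_hat = np.array([data.mean(), data.std()])

def numerical_hessian(f, x, eps=1e-3):
    # Central finite-difference approximation of the matrix of second partials.
    n = len(x)
    hess = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            hess[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                          - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4.0 * eps ** 2)
    return hess

hessian = numerical_hessian(log_lik, theta_hat)
print(np.linalg.eigvalsh(hessian))   # both eigenvalues are negative: the Hessian is negative definite
</syntaxhighlight>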
In the proofs of consistency and asymptotic normality of the maximum likelihood estimator, additional assumptions are made about the probability densities that form the basis of a particular likelihood function. These conditions were first established by Chanda. In particular, for
almost all x, and for all \, \theta \in \Theta \,, \frac{\partial \log f}{\partial \theta_r} \,, \quad \frac{\partial^2 \log f}{\partial \theta_r \partial \theta_s} \,, \quad \frac{\partial^3 \log f}{\partial \theta_r \, \partial \theta_s \, \partial \theta_t} \, exist for all \, r, s, t = 1, 2, \ldots, k \, in order to ensure the existence of a
Taylor expansion. Second, for almost all x and for every \, \theta \in \Theta \, it must be that \left| \frac{\partial f}{\partial \theta_r} \right| < F_r(x) \,, \quad \left| \frac{\partial^2 f}{\partial \theta_r \, \partial \theta_s} \right| < F_{rs}(x) \,, \quad \left| \frac{\partial^3 f}{\partial \theta_r \, \partial \theta_s \, \partial \theta_t} \right| < H_{rst}(x) \,, where H_{rst} is such that \, \int_{-\infty}^{\infty} H_{rst}(z) \, dz \leq M < \infty \,. This boundedness of the derivatives is needed to allow for
differentiation under the integral sign. And lastly, it is assumed that the
information matrix, \mathbf{I}(\theta) = \int_{-\infty}^{\infty} \frac{\partial \log f}{\partial \theta_r}\ \frac{\partial \log f}{\partial \theta_s}\ f\, dz is
positive definite and \, \left| \mathbf{I}(\theta) \right| \, is finite. This ensures that the score has a finite variance.
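As an illustrative check (the Bernoulli model and the specific parameter value are chosen here), for a single Bernoulli observation with success probability \theta the information matrix reduces to the scalar I(\theta) = 1/\big(\theta(1-\theta)\big), which is also the variance of the score:

<syntaxhighlight lang="python">
import numpy as np

theta = 0.3   # illustrative parameter value

def score(x, theta):
    # Derivative of log f(x | theta) with respect to theta for a Bernoulli observation x in {0, 1}.
    return x / theta - (1 - x) / (1 - theta)

# Exact Fisher information for one Bernoulli trial.
print(1.0 / (theta * (1.0 - theta)))          # ~4.76

# Monte Carlo estimate of the variance of the score under f(x | theta):
# the score has mean zero and variance equal to the Fisher information.
rng = np.random.default_rng(1)
x = rng.binomial(1, theta, size=1_000_000)
print(np.var(score(x, theta)))                # close to the exact value
</syntaxhighlight>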
The above conditions are sufficient, but not necessary. That is, a model that does not meet these regularity conditions may or may not have a maximum likelihood estimator with the properties mentioned above. Further, in the case of non-independently or non-identically distributed observations, additional properties may need to be assumed. In Bayesian statistics, almost identical regularity conditions are imposed on the likelihood function in order to prove asymptotic normality of the
posterior probability, and therefore to justify a
Laplace approximation of the posterior in large samples.

==Likelihood ratio and relative likelihood==