===Discrete case===
We have some testable information I about a quantity x taking values in {x_1, x_2, ..., x_n}. We assume this information has the form of m constraints on the expectations of the functions f_k; that is, we require our probability distribution to satisfy the moment inequality/equality constraints

\sum_{i=1}^n \Pr(x_i) f_k(x_i) \geq F_k, \qquad k = 1, \ldots, m,

where the F_k are observables. We also require the probabilities to sum to one, which may be viewed as a primitive constraint on the identity function with an observable equal to 1, giving the constraint

\sum_{i=1}^n \Pr(x_i) = 1.

The probability distribution with maximum information entropy subject to these inequality/equality constraints is of the form

\Pr(x_i) = \frac{1}{Z(\lambda_1, \ldots, \lambda_m)} \exp\left[\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)\right]

for some \lambda_1, \ldots, \lambda_m. It is sometimes called the Gibbs distribution. The normalization constant is determined by

Z(\lambda_1, \ldots, \lambda_m) = \sum_{i=1}^n \exp\left[\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)\right],

and is conventionally called the partition function. (The Pitman–Koopman theorem states that the necessary and sufficient condition for a sampling distribution to admit sufficient statistics of bounded dimension is that it have the general form of a maximum entropy distribution.)

The \lambda_k parameters are Lagrange multipliers. In the case of equality constraints their values are determined from the solution of the nonlinear equations

F_k = \frac{\partial}{\partial \lambda_k} \log Z(\lambda_1, \ldots, \lambda_m).

In the case of inequality constraints, the Lagrange multipliers are determined from the solution of a convex optimization program with linear constraints. In both cases, there is no closed-form solution, and the computation of the Lagrange multipliers usually requires numerical methods.
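For equality constraints, these nonlinear equations are the stationarity conditions of the convex dual objective \log Z(\lambda) - \sum_k \lambda_k F_k, so a generic optimizer suffices. The following is a minimal numerical sketch under assumed, illustrative data (an ordinary die with a single mean constraint F_1 = 4.5); the variable names and setup are hypothetical, not taken from the text above.

```python
# Minimal sketch (illustrative, not from the article): find the Lagrange
# multipliers of the discrete case by minimizing the convex dual
#   log Z(lambda) - sum_k lambda_k F_k,
# whose gradient vanishes exactly when F_k = d(log Z)/d(lambda_k).
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

x = np.arange(1, 7)               # support {x_1, ..., x_n}: an ordinary die
f = np.vstack([x.astype(float)])  # rows are f_k(x_i); here m = 1 with f_1(x) = x
F = np.array([4.5])               # observed expectations F_k (illustrative value)

def dual(lam):
    """Convex dual objective: log Z(lambda) - sum_k lambda_k F_k."""
    return logsumexp(lam @ f) - lam @ F

res = minimize(dual, x0=np.zeros(len(F)), method="BFGS")
lam = res.x

# Gibbs distribution Pr(x_i) = exp(sum_k lambda_k f_k(x_i)) / Z(lambda)
p = np.exp(lam @ f - logsumexp(lam @ f))
print(p)        # maximum-entropy probabilities
print(p @ x)    # their mean, which matches F_1 = 4.5
```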
===Continuous case===
For continuous distributions, the Shannon entropy cannot be used, as it is only defined for discrete probability spaces. Instead, Edwin Jaynes (1963, 1968, 2003) gave the following formula, which is closely related to the relative entropy (see also differential entropy):

H_c = -\int p(x) \log\frac{p(x)}{q(x)}\,dx,

where q(x), which Jaynes called the "invariant measure", is proportional to the limiting density of discrete points. For now, we shall assume that q is known; we will discuss it further after the solution equations are given.

A closely related quantity, the relative entropy, is usually defined as the Kullback–Leibler divergence of p from q (although it is sometimes, confusingly, defined as the negative of this). The inference principle of minimizing this, due to Kullback, is known as the Principle of Minimum Discrimination Information.
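As a small illustration (with made-up numbers, not from the text): when q is itself a normalized density, H_c is the negative of the Kullback–Leibler divergence of p from q, so it is at most zero and is maximized (at zero) exactly when p = q. The same relation holds for discrete distributions, which makes it easy to check numerically:

```python
# Illustrative sketch: H_c = -KL(p || q) when q is normalized, so H_c <= 0
# with equality only when p coincides with the invariant measure q.
# The distributions below are made up purely for illustration.
import numpy as np
from scipy.stats import entropy   # entropy(p, q) computes KL(p || q)

q = np.array([0.25, 0.25, 0.25, 0.25])   # reference (invariant) measure q
p = np.array([0.40, 0.30, 0.20, 0.10])   # candidate distribution p

print(-entropy(p, q))   # negative, since p != q
print(-entropy(q, q))   # 0.0: H_c is maximized when p equals q
```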
We have some testable information I about a quantity x which takes values in some interval of the real numbers (all integrals below are over this interval). We assume this information has the form of m constraints on the expectations of the functions f_k, i.e. we require our probability density function to satisfy the inequality (or purely equality) moment constraints

\int p(x) f_k(x)\,dx \geq F_k, \qquad k = 1, \dotsc, m,

where the F_k are observables. We also require the probability density to integrate to one, which may be viewed as a primitive constraint on the identity function with an observable equal to 1, giving the constraint

\int p(x)\,dx = 1.

The probability density function with maximum H_c subject to these constraints is

p(x) = \frac{q(x) \exp\left[\lambda_1 f_1(x) + \dotsb + \lambda_m f_m(x)\right]}{Z(\lambda_1, \dotsc, \lambda_m)},

with the partition function determined by

Z(\lambda_1, \dotsc, \lambda_m) = \int q(x) \exp\left[\lambda_1 f_1(x) + \dotsb + \lambda_m f_m(x)\right]\,dx.

As in the discrete case, when all moment constraints are equalities, the values of the \lambda_k parameters are determined by the system of nonlinear equations

F_k = \frac{\partial}{\partial \lambda_k} \log Z(\lambda_1, \dotsc, \lambda_m).

In the case of inequality moment constraints, the Lagrange multipliers are determined from the solution of a convex optimization program.
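As a simple illustration of these formulas (a worked special case, not part of the general development above): take q(x) = 1 on (0, \infty) and a single equality constraint fixing the mean, \int p(x)\,x\,dx = \mu. Then Z(\lambda) = \int_0^\infty e^{\lambda x}\,dx = -1/\lambda for \lambda < 0, and the condition \mu = \partial \log Z / \partial \lambda = -1/\lambda gives \lambda = -1/\mu, so p(x) = \mu^{-1} e^{-x/\mu}: the exponential distribution is the maximum entropy density on the positive reals when only the mean is specified.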
The invariant measure function q(x) can best be understood by supposing that x is known to take values only in the bounded interval (a, b), and that no other information is given. Then the maximum entropy probability density function is

p(x) = A \cdot q(x), \qquad a < x < b,

where A is a normalization constant. The invariant measure function is actually the prior density function encoding 'lack of relevant information'. It cannot be determined by the principle of maximum entropy, and must be determined by some other logical method, such as the principle of transformation groups or marginalization theory.
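For example, if q is constant on (a, b), this reduces to the uniform density p(x) = 1/(b - a); a non-constant q would instead weight the maximum entropy solution toward the regions of the interval that it favors.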
===Examples===
For several examples of maximum entropy distributions, see the article on maximum entropy probability distributions.

==Justifications for the principle of maximum entropy==