Most of the commonly used distributions form an exponential family or subset of an exponential family, listed in the subsection below. The subsections following it are a sequence of increasingly more general mathematical definitions of an exponential family. A casual reader may wish to restrict attention to the first and simplest definition, which corresponds to a single-parameter family of discrete or continuous probability distributions.
== Examples of exponential family distributions ==

Exponential families include many of the most common distributions. Among many others, exponential families include the following:
• normal
• exponential
• gamma
• chi-squared
• beta
• Dirichlet
• Bernoulli
• categorical
• Poisson
• Wishart
• inverse Wishart
• geometric

A number of common distributions are exponential families, but only when certain parameters are fixed and known. For example:
• binomial (with fixed number of trials)
• multinomial (with fixed number of trials)
• negative binomial (with fixed number of failures)

Note that in each case, the parameters which must be fixed are those that set a limit on the range of values that can possibly be observed. Examples of common distributions that are not exponential families are Student's t, most mixture distributions, and even the family of uniform distributions when the bounds are not fixed. See the section below on examples for more discussion.
== Scalar parameter ==

The value of \theta is called the parameter of the family. A single-parameter exponential family is a set of probability distributions whose probability density function (or probability mass function, for the case of a discrete distribution) can be expressed in the form
f_X{\left( x\, \big|\, \theta \right)} = h(x)\, \exp \left[ \eta(\theta) \cdot T(x) - A(\theta) \right]
where h(x), \eta(\theta), T(x), and A(\theta) are known functions. The function h(x) must be non-negative. An alternative, equivalent form often given is
f_X{\left( x\ \big|\ \theta \right)} = h(x) \, g(\theta) \, \exp \left[\eta(\theta) \cdot T(x)\right]
or equivalently
f_X{\left( x\ \big|\ \theta \right)} = \exp\left[ \eta(\theta) \cdot T(x) - A(\theta) + B(x) \right].
In terms of log probability,
\log(f_X{\left( x\ \big|\ \theta \right)}) = \eta(\theta) \cdot T(x) - A(\theta) + B(x).
Note that g(\theta) = e^{-A(\theta)} and h(x) = e^{B(x)}.
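As a concrete instance of this form, the Poisson distribution (listed above) with rate \lambda has pmf \lambda^x e^{-\lambda}/x! = (1/x!)\exp[x \log\lambda - \lambda], so h(x) = 1/x!, \eta(\lambda) = \log\lambda, T(x) = x, and A(\lambda) = \lambda. The sketch below (plain standard-library Python; the function names are illustrative) checks this numerically against the textbook pmf:

```python
import math

def poisson_pmf_exp_family(x, lam):
    """Poisson pmf written as h(x) * exp(eta(theta) * T(x) - A(theta))."""
    h = 1.0 / math.factorial(x)   # h(x) = 1/x!
    eta = math.log(lam)           # eta(lambda) = log(lambda)
    T = x                         # sufficient statistic T(x) = x
    A = lam                       # log-partition A(lambda) = lambda
    return h * math.exp(eta * T - A)

def poisson_pmf_direct(x, lam):
    """The textbook pmf, lambda^x e^(-lambda) / x!, for comparison."""
    return lam**x * math.exp(-lam) / math.factorial(x)

for x in range(6):
    assert abs(poisson_pmf_exp_family(x, 2.5) - poisson_pmf_direct(x, 2.5)) < 1e-12
```

Note that A(\lambda) is exactly the quantity that makes the pmf sum to one, as discussed below under normalization.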
== Support must be independent of \theta ==

Importantly, the support of f_X{\left( x \big| \theta \right)} (all the possible x values for which f_X\!\left( x \big| \theta \right) is greater than 0) is required to not depend on \theta\,. This requirement can be used to exclude a parametric family distribution from being an exponential family. For example: the Pareto distribution has a pdf which is defined for x \geq x_{\mathsf m} (the minimum value, x_{\mathsf m}\,, being the scale parameter) and its support, therefore, has a lower limit of x_{\mathsf m}\,. Since the support of f_{\alpha, x_m}\!(x) is dependent on the value of the parameter, the family of Pareto distributions does not form an exponential family of distributions (at least when x_m is unknown). Another example:
Bernoulli-type distributions – binomial, negative binomial, geometric distribution, and similar – can only be included in the exponential class if the number of Bernoulli trials, n, is treated as a fixed constant – excluded from the free parameter(s) \theta – since the allowed number of trials sets the limits for the number of "successes" or "failures" that can be observed in a set of trials.
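Both situations can be checked numerically. The sketch below (standard-library Python; function names are illustrative) first shows that the Pareto support moves with the scale parameter x_m, and then that the binomial pmf with fixed n factors into the form h(x)\exp[\eta\, T(x) - A] precisely because n is treated as a constant: n enters h(x) = \binom{n}{x} and A(p) = -n \log(1-p), and it also bounds the observable values 0, \ldots, n.

```python
import math

# Pareto density: the support {x : f > 0} starts at x_m, so it moves with the parameter.
def pareto_pdf(x, alpha, x_m):
    return 0.0 if x < x_m else alpha * x_m**alpha / x**(alpha + 1)

assert pareto_pdf(1.5, 2.0, 1.0) > 0     # x = 1.5 is inside the support when x_m = 1
assert pareto_pdf(1.5, 2.0, 2.0) == 0.0  # ...but outside it when x_m = 2

# Binomial with fixed n, in exponential-family form:
#   h(x) = C(n, x), eta(p) = log(p / (1 - p)), T(x) = x, A(p) = -n log(1 - p).
# The fixed n appears in both h(x) and A(p); if n were a free parameter, the
# support {0, ..., n} would depend on it and the factorization would fail.
def binom_pmf_exp_family(x, n, p):
    eta = math.log(p / (1.0 - p))
    A = -n * math.log(1.0 - p)
    return math.comb(n, x) * math.exp(eta * x - A)

def binom_pmf_direct(x, n, p):
    return math.comb(n, x) * p**x * (1.0 - p)**(n - x)

for x in range(11):
    assert abs(binom_pmf_exp_family(x, 10, 0.3) - binom_pmf_direct(x, 10, 0.3)) < 1e-12
```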
== Vector valued x and T(x) ==

Often x is a vector of measurements, in which case T(x) may be a function from the space of possible values of x to the real numbers. More generally, \eta(\theta) and T(x) can each be vector-valued such that \eta(\theta) \cdot T(x) is real-valued. However, see the discussion below on vector parameters, regarding the exponential family.
== Canonical formulation ==

If \eta(\theta) = \theta\,, then the exponential family is said to be in canonical form. By defining a transformed parameter \eta = \eta(\theta)\,, it is always possible to convert an exponential family to canonical form. The canonical form is non-unique, since \eta(\theta) can be multiplied by any nonzero constant, provided that T(x) is multiplied by that constant's reciprocal, or a constant c can be added to \eta(\theta) and h(x) multiplied by \exp\left[{-c} \cdot T(x)\,\right] to offset it. In the special case that \eta(\theta) = \theta and T(x) = x\,, the family is called a natural exponential family.

Even when x is a scalar, and there is only a single parameter, the functions \eta(\theta) and T(x) can still be vectors, as described below.

The function A(\theta)\,, or equivalently g(\theta)\,, is automatically determined once the other functions have been chosen, since it must assume a form that causes the distribution to be normalized (sum or integrate to one over the entire domain). Furthermore, both of these functions can always be written as functions of \eta\,, even when \eta(\theta) is not a one-to-one function, i.e. two or more different values of \theta map to the same value of \eta(\theta)\,, and hence \eta(\theta) cannot be inverted. In such a case, all values of \theta mapping to the same \eta(\theta) will also have the same value for A(\theta) and g(\theta)\,.
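The non-uniqueness of the canonical form can be seen concretely for the exponential distribution f(x) = \lambda e^{-\lambda x}, which is canonical with \eta = -\lambda, T(x) = x, A = -\log\lambda, and h(x) = 1: multiplying \eta by a nonzero constant c while dividing T(x) by the same c leaves \eta \cdot T(x), and hence the density, unchanged. A minimal sketch (standard-library Python; the variable names are illustrative):

```python
import math

lam, x = 1.7, 0.9

# Exponential distribution in canonical form:
#   eta = -lambda, T(x) = x, A = -log(lambda), h(x) = 1.
eta1, T1 = -lam, x
A = -math.log(lam)
f1 = math.exp(eta1 * T1 - A)

# Rescale: eta -> c * eta and T(x) -> T(x) / c for any nonzero c.
# The product eta * T(x) is unchanged, so the density is identical.
c = 2.0
eta2, T2 = c * eta1, T1 / c
f2 = math.exp(eta2 * T2 - A)

assert abs(f1 - lam * math.exp(-lam * x)) < 1e-12  # matches lambda * exp(-lambda * x)
assert abs(f1 - f2) < 1e-12                        # both parameterizations agree
```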
== Factorization of the variables involved ==

What is important to note, and what characterizes all exponential family variants, is that the parameter(s) and the observation variable(s) must factorize (can be separated into products each of which involves only one type of variable), either directly or within either part (the base or exponent) of an exponentiation operation. Generally, this means that all of the factors constituting the density or mass function must be of one of the following forms:
\begin{align} f(x) , && c^{f(x)} , && {[f(x)]}^c , && {[f(x)]}^{g(\theta)} , && {[f(x)]}^{h(x)g(\theta)} , \\ g(\theta) , && c^{g(\theta)} , && {[g(\theta)]}^c , && {[g(\theta)]}^{f(x)} , && ~~\mathsf{ or }~~ {[g(\theta)]}^{h(x)j(\theta)} , \end{align}
where f(x) and h(x) are arbitrary functions of x, the observed statistical variable; g(\theta) and j(\theta) are arbitrary functions of \theta, the fixed parameters defining the shape of the distribution; and c is any arbitrary constant expression (i.e. a number or an expression that does not change with either x or \theta).

There are further restrictions on how many such factors can occur. For example, the two expressions:
{[f(x) g(\theta)]}^{h(x)j(\theta)}, \qquad {[f(x)]}^{h(x)j(\theta)} {[g(\theta)]}^{h(x)j(\theta)},
are the same, i.e. a product of two "allowed" factors. However, when rewritten into the factorized form,
\begin{align} {\left[f(x) g(\theta)\right]}^{h(x) j(\theta)} &= {\left[f(x)\right]}^{h(x) j(\theta)} {\left[g(\theta)\right]}^{h(x) j(\theta)} \\[4pt] &= \exp\left\{{[h(x) \log f(x)] j(\theta) + h(x) [j(\theta) \log g(\theta)]}\right\}, \end{align}
it can be seen that it cannot be expressed in the required form. (However, a form of this sort is a member of a curved exponential family, which allows multiple factorized terms in the exponent.)

To see why an expression of the form {[f(x)]}^{g(\theta)} qualifies, note that {[f(x)]}^{g(\theta)} = e^{g(\theta) \log f(x)} and hence it factorizes inside of the exponent. Similarly, {[f(x)]}^{h(x) g(\theta)} = e^{h(x) g(\theta) \log f(x)} = e^{[h(x) \log f(x)] g(\theta)} and again factorizes inside of the exponent.

A factor consisting of a sum where both types of variables are involved (e.g. a factor of the form 1 + f(x) g(\theta)) cannot be factorized in this fashion (except in some cases where it occurs directly in an exponent); this is why, for example, the Cauchy distribution and Student's t distribution are not exponential families.
== Vector parameter ==

The definition in terms of one real-number parameter can be extended to one real-vector parameter
\boldsymbol \theta \equiv \begin{bmatrix} \theta_1 & \theta_2 & \cdots & \theta_s \end{bmatrix}^\mathsf{T}.
A family of distributions is said to belong to a vector exponential family if the probability density function (or probability mass function, for discrete distributions) can be written as
f_X(x \mid \boldsymbol{\theta}) = h(x)\,\exp\left(\sum_{i=1}^s \eta_i({\boldsymbol \theta}) T_i(x) - A({\boldsymbol \theta}) \right)~,
or in a more compact form,
f_X(x\mid\boldsymbol \theta) = h(x) \,\exp\left[\boldsymbol\eta(\boldsymbol{\theta}) \cdot \mathbf{T}(x) - A({\boldsymbol \theta}) \right]
This form writes the sum as a dot product of vector-valued functions \boldsymbol\eta({\boldsymbol \theta}) and \mathbf{T}(x)\,. An alternative, equivalent form often seen is
f_X(x\mid\boldsymbol \theta) = h(x) \, g(\boldsymbol \theta) \, \exp\left[\boldsymbol\eta({\boldsymbol \theta}) \cdot \mathbf{T}(x)\right]
As in the scalar-valued case, the exponential family is said to be in canonical form if
\eta_i({\boldsymbol \theta}) = \theta_i ~,\quad \forall i\,.
A vector exponential family is said to be curved if the dimension of
\boldsymbol \theta \equiv \begin{bmatrix} \theta_1 & \theta_2 & \cdots & \theta_d \end{bmatrix}^\mathsf T
is less than the dimension of the vector
\boldsymbol{\eta}(\boldsymbol \theta) \equiv \begin{bmatrix} \eta_1{\!(\boldsymbol \theta)} & \eta_2{\!(\boldsymbol \theta)} & \cdots & \eta_s{\!(\boldsymbol \theta)} \end{bmatrix}^\mathsf T~.
That is, if the dimension, d, of the parameter vector is less than the number of functions, s, of the parameter vector in the above representation of the probability density function. Most common distributions in the exponential family are not curved, and many algorithms designed to work with any exponential family implicitly or explicitly assume that the distribution is not curved.

Just as in the case of a scalar-valued parameter, the function A(\boldsymbol \theta)\,, or equivalently g(\boldsymbol \theta)\,, is automatically determined by the normalization constraint, once the other functions have been chosen. Even if \boldsymbol\eta(\boldsymbol\theta) is not one-to-one, functions A(\boldsymbol \eta) and g(\boldsymbol \eta) can be defined by requiring that the distribution is normalized for each value of the natural parameter \boldsymbol\eta\,. This yields the canonical form
f_X(x\mid\boldsymbol \eta) = h(x) \exp\left[\boldsymbol\eta \cdot \mathbf{T}(x) - A({\boldsymbol \eta})\right],
or equivalently
f_X(x\mid\boldsymbol \eta) = h(x) g(\boldsymbol \eta) \exp\left[\boldsymbol\eta \cdot \mathbf{T}(x)\right].
The above forms may sometimes be seen with \boldsymbol\eta^\mathsf T \mathbf{T}(x) in place of \boldsymbol\eta \cdot \mathbf{T}(x)\,. These are exactly equivalent formulations, merely using different notation for the dot product.
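A standard vector-parameter instance is the normal distribution with both \mu and \sigma unknown, where s = 2: \boldsymbol\eta(\boldsymbol\theta) = (\mu/\sigma^2,\, -1/(2\sigma^2)), \mathbf{T}(x) = (x, x^2), A(\boldsymbol\theta) = \mu^2/(2\sigma^2) + \log\sigma, and h(x) = 1/\sqrt{2\pi}. The sketch below (standard-library Python; function names are illustrative) checks this against the usual density formula:

```python
import math

def normal_pdf_exp_family(x, mu, sigma):
    """N(mu, sigma^2) density via h(x) * exp(eta . T(x) - A(theta))."""
    eta = (mu / sigma**2, -1.0 / (2 * sigma**2))   # natural parameters, s = 2
    T = (x, x**2)                                  # sufficient statistics
    A = mu**2 / (2 * sigma**2) + math.log(sigma)   # log-partition A(theta)
    h = 1.0 / math.sqrt(2 * math.pi)               # base measure h(x)
    dot = eta[0] * T[0] + eta[1] * T[1]            # eta . T(x)
    return h * math.exp(dot - A)

def normal_pdf_direct(x, mu, sigma):
    """The familiar bell-curve formula, for comparison."""
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

for x in (-1.0, 0.0, 0.5, 2.3):
    assert abs(normal_pdf_exp_family(x, 1.2, 0.8) - normal_pdf_direct(x, 1.2, 0.8)) < 1e-12
```

Here d = s = 2, so the family is not curved; fixing a relation such as \sigma = \mu would leave d = 1 < s = 2 and give a curved family.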
== Vector parameter, vector variable ==

The vector-parameter form over a single scalar-valued random variable can be trivially expanded to cover a joint distribution over a vector of random variables. The resulting distribution is simply the same as the above distribution for a scalar-valued random variable, with each occurrence of the scalar x replaced by the vector
\mathbf{x} = \begin{bmatrix} x_1 & x_2 & \cdots & x_k \end{bmatrix}^{\mathsf T}.
The dimension k of the random variable need not match the dimension d of the parameter vector, nor (in the case of a curved exponential function) the dimension s of the natural parameter \boldsymbol\eta and sufficient statistic \mathbf{T}(\mathbf{x})\,. The distribution in this case is written as
f_X{\left(\mathbf{x}\mid\boldsymbol \theta\right)} = h(\mathbf{x}) \, \exp\!\left[\sum_{i=1}^s \eta_i({\boldsymbol \theta}) T_i(\mathbf{x}) - A({\boldsymbol \theta})\right]
or more compactly as
f_X{\left(\mathbf{x}\mid\boldsymbol \theta\right)} = h(\mathbf{x}) \, \exp\left[\boldsymbol\eta({\boldsymbol \theta}) \cdot \mathbf{T}(\mathbf{x}) - A({\boldsymbol \theta})\right]
or alternatively as
f_X{\left(\mathbf{x} \mid \boldsymbol \theta\right)} = g(\boldsymbol \theta) \, h(\mathbf{x}) \, \exp\left[ \boldsymbol\eta(\boldsymbol{\theta}) \cdot \mathbf{T}(\mathbf{x})\right]
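One common way such a vector-variable family arises is from n i.i.d. draws of a scalar family: the joint density keeps the same natural parameter, while \mathbf{T}(\mathbf{x}) = \sum_i T(x_i), h(\mathbf{x}) = \prod_i h(x_i), and the log-partition term becomes n A(\boldsymbol\theta). A sketch for i.i.d. Poisson draws (standard-library Python; function names are illustrative):

```python
import math

def joint_poisson_pmf_exp_family(xs, lam):
    """Joint pmf of i.i.d. Poisson draws as an exponential family in the
    vector x: T(x) = sum(x_i), h(x) = prod(1/x_i!), and A scaled by n."""
    n = len(xs)
    h = math.prod(1.0 / math.factorial(x) for x in xs)
    eta = math.log(lam)            # same natural parameter as the scalar family
    T = sum(xs)                    # sufficient statistic of the whole sample
    return h * math.exp(eta * T - n * lam)

def joint_poisson_pmf_direct(xs, lam):
    """Product of the individual pmfs, for comparison."""
    p = 1.0
    for x in xs:
        p *= lam**x * math.exp(-lam) / math.factorial(x)
    return p

xs = [0, 2, 3, 1]
assert abs(joint_poisson_pmf_exp_family(xs, 1.4) - joint_poisson_pmf_direct(xs, 1.4)) < 1e-12
```

Note that the sample size k = n here is unrelated to the dimension of the parameter, illustrating the remark above.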
== Measure-theoretic formulation ==

We use cumulative distribution functions (CDF) in order to encompass both discrete and continuous distributions. Suppose H is a non-decreasing function of a real variable. Then Lebesgue–Stieltjes integrals with respect to dH(\mathbf{x}) are integrals with respect to the reference measure of the exponential family generated by H\,.

Any member of that exponential family has cumulative distribution function
dF{\left(\mathbf{x} \mid \boldsymbol\theta\right)} = \exp\left[\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf{T}(\mathbf{x}) - A(\boldsymbol\theta)\right] ~ dH(\mathbf{x}) \,.
H(\mathbf{x}) is a Lebesgue–Stieltjes integrator for the reference measure. When the reference measure is finite, it can be normalized and H is actually the cumulative distribution function of a probability distribution. If F is absolutely continuous with a density f(x) with respect to a reference measure dx (typically Lebesgue measure), one can write dF(x) = f(x) \, dx\,. In this case, H is also absolutely continuous and can be written dH(x) = h(x) \, dx\,, so the formulas reduce to those of the previous paragraphs. If F is discrete, then H is a step function (with steps on the support of F).

Alternatively, we can write the probability measure directly as
P\left(d\mathbf{x} \mid \boldsymbol\theta\right) = \exp\left[ \boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf{T}(\mathbf{x}) - A(\boldsymbol\theta) \right] ~ \mu(d\mathbf{x})\,.
for some reference measure \mu\,.

== Interpretation ==