Most, but not all, common families of distributions are
exponential families. Exponential families have a large number of useful properties. One of these is that all members have
conjugate prior distributions, whereas very few other distributions have conjugate priors.
===Prior predictive distribution in exponential families===
Another useful property is that the
probability density function of the
compound distribution corresponding to the prior predictive distribution of an
exponential family distribution
marginalized over its
conjugate prior distribution can be determined analytically. Assume that F(x|\boldsymbol{\theta}) is a member of the exponential family with parameter \boldsymbol{\theta} that is parametrized according to the natural parameter \boldsymbol{\eta} = \boldsymbol{\eta}(\boldsymbol{\theta}), and is distributed as

:p_F(x|\boldsymbol{\eta}) = h(x)g(\boldsymbol{\eta})e^{\boldsymbol{\eta}^{\rm T}\mathbf{T}(x)}

while G(\boldsymbol{\eta}|\boldsymbol{\chi},\nu) is the appropriate conjugate prior, distributed as

:p_G(\boldsymbol{\eta}|\boldsymbol{\chi},\nu) = f(\boldsymbol{\chi},\nu)g(\boldsymbol{\eta})^\nu e^{\boldsymbol{\eta}^{\rm T}\boldsymbol{\chi}}

Then the prior predictive distribution H (the result of compounding F with G) is

:\begin{align}
p_H(x|\boldsymbol{\chi},\nu) &= \int\limits_{\boldsymbol{\eta}} p_F(x|\boldsymbol{\eta})\, p_G(\boldsymbol{\eta}|\boldsymbol{\chi},\nu) \,\operatorname{d}\boldsymbol{\eta} \\
&= \int\limits_{\boldsymbol{\eta}} h(x)g(\boldsymbol{\eta})e^{\boldsymbol{\eta}^{\rm T}\mathbf{T}(x)} f(\boldsymbol{\chi},\nu)g(\boldsymbol{\eta})^\nu e^{\boldsymbol{\eta}^{\rm T}\boldsymbol{\chi}} \,\operatorname{d}\boldsymbol{\eta} \\
&= h(x) f(\boldsymbol{\chi},\nu) \int\limits_{\boldsymbol{\eta}} g(\boldsymbol{\eta})^{\nu+1} e^{\boldsymbol{\eta}^{\rm T}(\boldsymbol{\chi} + \mathbf{T}(x))} \,\operatorname{d}\boldsymbol{\eta} \\
&= h(x) \dfrac{f(\boldsymbol{\chi},\nu)}{f(\boldsymbol{\chi} + \mathbf{T}(x), \nu+1)}
\end{align}

The last line follows from the previous one by recognizing that the function inside the integral is the density function of a random variable distributed as G(\boldsymbol{\eta}|\boldsymbol{\chi} + \mathbf{T}(x), \nu+1), excluding the normalizing function f(\dots). Hence the result of the integration is the reciprocal of the normalizing function.

The above result is independent of the choice of parametrization of \boldsymbol{\theta}, as none of \boldsymbol{\theta}, \boldsymbol{\eta} and g(\dots) appears. (g(\dots) is a function of the parameter and hence will assume different forms depending on the choice of parametrization.) For standard choices of F and G, it is often easier to work directly with the usual parameters rather than rewrite in terms of the
natural parameters. The reason the integral is tractable is that it involves computing the
normalization constant of a density defined by the product of a
prior distribution and a
likelihood. When the two are
conjugate, the product is a
posterior distribution, and by assumption, the normalization constant of this distribution is known. As shown above, the
density function of the compound distribution follows a particular form, consisting of the product of the function h(x) that forms part of the density function for F, with the quotient of two forms of the normalization "constant" for G, one derived from a prior distribution and the other from a posterior distribution. The
beta-binomial distribution is a good example of how this process works. Despite the analytical tractability of such distributions, they are in themselves usually not members of the
exponential family. For example, the three-parameter
Student's t distribution,
beta-binomial distribution and
Dirichlet-multinomial distribution are all predictive distributions of exponential-family distributions (the
normal distribution,
binomial distribution and
multinomial distribution, respectively), but none are members of the exponential family. This can be seen above from the functional dependence on \boldsymbol{\chi} + \mathbf{T}(x). In an exponential-family distribution, it must be possible to separate the entire density function into multiplicative factors of three types: (1) factors containing only variables, (2) factors containing only parameters, and (3) factors whose logarithm factorizes between variables and parameters. The presence of \boldsymbol{\chi} + \mathbf{T}(x) makes this impossible unless the "normalizing" function f(\dots) either ignores the corresponding argument entirely or uses it only in the exponent of an expression.
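The following minimal sketch (Python with NumPy/SciPy; the function name and test values are illustrative assumptions, not part of the original derivation) checks this identity numerically for the beta-binomial case. Written in the usual (a, b) parametrization of the Beta prior, h(x) is the binomial coefficient \tbinom{n}{x} and the quotient of normalizing constants reduces to \mathrm{B}(a+x,\, b+n-x)/\mathrm{B}(a,b):

<syntaxhighlight lang="python">
# Numerical check of the compounding identity for the beta-binomial case:
# a Binomial(n, p) likelihood compounded with a conjugate Beta(a, b) prior.
import numpy as np
from scipy import integrate, stats
from scipy.special import betaln, gammaln

def beta_binomial_pmf(x, n, a, b):
    """Prior predictive pmf h(x) * f(chi, nu) / f(chi + T(x), nu + 1),
    written in the usual (a, b) parametrization of the Beta prior."""
    log_h = gammaln(n + 1) - gammaln(x + 1) - gammaln(n - x + 1)  # log C(n, x)
    return np.exp(log_h + betaln(a + x, b + n - x) - betaln(a, b))

n, a, b = 10, 2.0, 5.0
for x in range(n + 1):
    # Direct numerical marginalization of the likelihood over the prior.
    direct, _ = integrate.quad(
        lambda p: stats.binom.pmf(x, n, p) * stats.beta.pdf(p, a, b), 0.0, 1.0)
    assert np.isclose(beta_binomial_pmf(x, n, a, b), direct)
</syntaxhighlight>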
===Posterior predictive distribution in exponential families===
When a conjugate prior is being used, the posterior predictive distribution belongs to the same family as the prior predictive distribution, and is determined simply by plugging the updated hyperparameters for the posterior distribution of the parameter(s) into the formula for the prior predictive distribution. Using the general form of the posterior update equations for exponential-family distributions (see the appropriate section in the exponential family article), we can write out an explicit formula for the posterior predictive distribution:

:p(\tilde{x}|\mathbf{X},\boldsymbol{\chi},\nu) = p_H\left(\tilde{x}\,|\,\boldsymbol{\chi} + \mathbf{T}(\mathbf{X}), \nu+N\right)

where

:\mathbf{T}(\mathbf{X}) = \sum_{i=1}^N \mathbf{T}(x_i)

This shows that the posterior predictive distribution of a series of observations, in the case where the observations follow an
exponential family with the appropriate
conjugate prior, has the same probability density as the compound distribution, with parameters as specified above. The observations themselves enter only in the form \mathbf{T}(\mathbf{X}) = \sum_{i=1}^N \mathbf{T}(x_i) . This is termed the
sufficient statistic of the observations, because it tells us everything we need to know about the observations in order to compute a posterior or posterior predictive distribution based on them (or, for that matter, anything else based on the
likelihood of the observations, such as the
marginal likelihood).
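As a concrete sketch of this plug-in rule, consider Bernoulli observations with a conjugate Beta(a, b) prior, for which \mathbf{T}(x) = x and the sufficient statistic is simply the number of successes. The Python below is illustrative (the helper names and data are assumptions, not from the original text):

<syntaxhighlight lang="python">
# Sketch of the posterior predictive update for Bernoulli data with a Beta prior:
# the posterior predictive is the prior predictive with updated hyperparameters.
import numpy as np

def prior_predictive_bernoulli(x_new, a, b):
    """p_H(x | a, b): prior predictive of one Bernoulli draw under Beta(a, b)."""
    p_one = a / (a + b)
    return p_one if x_new == 1 else 1.0 - p_one

def posterior_predictive_bernoulli(x_new, X, a, b):
    """Plug the updated hyperparameters chi + T(X), nu + N into p_H."""
    N, s = len(X), int(np.sum(X))  # N observations; sufficient statistic s
    return prior_predictive_bernoulli(x_new, a + s, b + N - s)

X = [1, 0, 1, 1, 0, 1]  # hypothetical observed data
print(posterior_predictive_bernoulli(1, X, a=1.0, b=1.0))  # (1+4)/(2+6) = 0.625
</syntaxhighlight>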
===Joint predictive distribution, marginal likelihood===
It is also possible to consider the result of compounding a joint distribution over a fixed number of
independent identically distributed samples with a prior distribution over a shared parameter. In a Bayesian setting, this comes up in various contexts: computing the prior or posterior predictive distribution of multiple new observations, and computing the
marginal likelihood of observed data (the denominator in
Bayes' law). When the distribution of the samples is from the exponential family and the prior distribution is conjugate, the resulting compound distribution is tractable and follows a form similar to the expression above. It is easy to show, in fact, that the joint compound distribution of a set \mathbf{X} = \{x_1, \dots, x_N\} of N observations is

:p_H(\mathbf{X}|\boldsymbol{\chi},\nu) = \left( \prod_{i=1}^N h(x_i) \right) \dfrac{f(\boldsymbol{\chi},\nu)}{f\left(\boldsymbol{\chi} + \mathbf{T}(\mathbf{X}), \nu+N \right)}

This result and the above result for a single compound distribution extend trivially to the case of a distribution over a vector-valued observation, such as a multivariate Gaussian distribution.
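For instance, for Bernoulli observations under a Beta(a, b) prior, h(x_i) = 1 and the joint formula collapses to a ratio of beta functions, giving the marginal likelihood in closed form. A minimal sketch (illustrative Python, assuming SciPy is available) compares it against direct numerical integration:

<syntaxhighlight lang="python">
# Marginal likelihood of N Bernoulli observations under a Beta(a, b) prior,
# via the joint compounding formula: here T(X) = sum(X) and h(x) = 1, so
# p_H(X | a, b) reduces to B(a + s, b + N - s) / B(a, b).
import numpy as np
from scipy import integrate, stats
from scipy.special import betaln

def marginal_likelihood(X, a, b):
    """p_H(X | a, b) = f(chi, nu) / f(chi + T(X), nu + N) for Bernoulli data."""
    N, s = len(X), int(np.sum(X))
    return np.exp(betaln(a + s, b + N - s) - betaln(a, b))

X = np.array([1, 0, 1, 1, 0, 1])  # hypothetical observed data
closed_form = marginal_likelihood(X, a=2.0, b=2.0)
# Direct numerical integration of the joint likelihood against the prior.
direct, _ = integrate.quad(
    lambda p: np.prod(p**X * (1 - p)**(1 - X)) * stats.beta.pdf(p, 2.0, 2.0),
    0.0, 1.0)
assert np.isclose(closed_form, direct)
</syntaxhighlight>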
==Relation to Gibbs sampling==