===Moments===
Let X = (X_1, \ldots, X_K)\sim\operatorname{Dir}(\boldsymbol\alpha) and let \alpha_0 = \sum_{i=1}^K \alpha_i. Then
\operatorname{E}[X_i] = \frac{\alpha_i}{\alpha_0}, \qquad \operatorname{Var}[X_i] = \frac{\alpha_i (\alpha_0-\alpha_i)}{\alpha_0^2 (\alpha_0+1)}.
Furthermore, if i\neq j, then
\operatorname{Cov}[X_i,X_j] = \frac{- \alpha_i \alpha_j}{\alpha_0^2 (\alpha_0+1)}.
The covariance matrix is singular.

More generally, moments of Dirichlet-distributed random variables can be expressed in the following way. For \boldsymbol{t}=(t_1,\dotsc,t_K) \in \mathbb{R}^K, denote by \boldsymbol{t}^{\circ i} = (t_1^i,\dotsc,t_K^i) its i-th Hadamard power. Then
\operatorname{E}\left[ (\boldsymbol{t} \cdot \boldsymbol{X})^n \right] = \frac{n! \, \Gamma ( \alpha_0 )}{\Gamma (\alpha_0+n)} \sum \frac{{t_1}^{k_1} \cdots {t_K}^{k_K}}{k_1! \cdots k_K!} \prod_{i=1}^K \frac{\Gamma(\alpha_i + k_i)}{\Gamma(\alpha_i)} = \frac{n! \, \Gamma ( \alpha_0 )}{\Gamma (\alpha_0+n)} Z_n(\boldsymbol{t}^{\circ 1} \cdot \boldsymbol{\alpha}, \ldots, \boldsymbol{t}^{\circ n} \cdot \boldsymbol{\alpha}),
where the sum is over non-negative integers k_1,\ldots,k_K with n=k_1+\cdots+k_K, and Z_n is the cycle index polynomial of the symmetric group of degree n.

We have the special case (n = 1)
\operatorname{E}\left[ \boldsymbol{t} \cdot \boldsymbol{X} \right] = \frac{\boldsymbol{t} \cdot \boldsymbol{\alpha}}{\alpha_0}.
The multivariate analogue \operatorname{E}\left[ (\boldsymbol{t}_1 \cdot \boldsymbol{X})^{n_1} \cdots (\boldsymbol{t}_q \cdot \boldsymbol{X})^{n_q} \right] for vectors \boldsymbol{t}_1, \dotsc, \boldsymbol{t}_q \in \mathbb{R}^K can be expressed in terms of a color pattern of the exponents n_1, \dotsc, n_q in the sense of the Pólya enumeration theorem.

Particular cases include the simple computation
\operatorname{E}\left[\prod_{i=1}^K X_i^{\beta_i}\right] = \frac{B\left(\boldsymbol{\alpha} + \boldsymbol{\beta}\right)}{B\left(\boldsymbol{\alpha}\right)} = \frac{\Gamma\left(\sum\limits_{i=1}^K \alpha_{i}\right)}{\Gamma\left(\sum\limits_{i=1}^K (\alpha_i+\beta_i)\right)}\times\prod_{i=1}^K \frac{\Gamma(\alpha_i+\beta_i)}{\Gamma(\alpha_i)}.
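The first- and second-order moment formulas above are easy to sanity-check by simulation. A minimal sketch (not part of the original derivation; assumes numpy, with an arbitrary example \boldsymbol\alpha) compares Monte Carlo estimates against the closed forms:

<syntaxhighlight lang="python">
import numpy as np

alpha = np.array([1.0, 2.0, 3.0])   # arbitrary example parameters
a0 = alpha.sum()

rng = np.random.default_rng(0)
x = rng.dirichlet(alpha, size=200_000)   # Monte Carlo draws on the simplex

print(x.mean(axis=0), alpha / a0)                                 # E[X_i]
print(x.var(axis=0), alpha * (a0 - alpha) / (a0**2 * (a0 + 1)))   # Var[X_i]
print(np.cov(x[:, 0], x[:, 1])[0, 1],
      -alpha[0] * alpha[1] / (a0**2 * (a0 + 1)))                  # Cov[X_1, X_2]
</syntaxhighlight>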
===Mode===
The mode of the distribution is the vector (x_1, \ldots, x_K) with
x_i = \frac{\alpha_i - 1}{\alpha_0 - K}, \qquad \alpha_i > 1.
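As a quick numerical check of this formula (a sketch with an arbitrary \boldsymbol\alpha, all \alpha_i > 1, using a hand-rolled log-density rather than any particular library routine), the density at the claimed mode should dominate the density at random points of the simplex:

<syntaxhighlight lang="python">
import numpy as np
from scipy.special import gammaln

def dirichlet_logpdf(x, alpha):
    """Log-density of Dir(alpha) at a point x in the interior of the simplex."""
    return gammaln(alpha.sum()) - gammaln(alpha).sum() + ((alpha - 1) * np.log(x)).sum()

alpha = np.array([3.0, 4.0, 5.0])                 # all alpha_i > 1: interior mode
mode = (alpha - 1) / (alpha.sum() - len(alpha))   # (alpha_i - 1) / (alpha_0 - K)

rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, size=10_000)
# The mode maximizes the density, so no sampled point should beat it.
assert all(dirichlet_logpdf(mode, alpha) >= dirichlet_logpdf(p, alpha) for p in samples)
</syntaxhighlight>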
===Marginal distributions===
The marginal distributions are beta distributions:
X_i \sim \operatorname{Beta} (\alpha_i, \alpha_0 - \alpha_i).
Also see below.
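This marginal can be checked empirically; a minimal sketch (assumes scipy, arbitrary example parameters) compares sampled first coordinates against the stated beta distribution with a Kolmogorov–Smirnov test:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

alpha = np.array([2.0, 3.0, 7.0])
rng = np.random.default_rng(0)
x0 = rng.dirichlet(alpha, size=50_000)[:, 0]   # first coordinate of Dirichlet draws

# Compare against Beta(alpha_1, alpha_0 - alpha_1).
res = stats.kstest(x0, stats.beta(alpha[0], alpha.sum() - alpha[0]).cdf)
print(res.pvalue)   # large p-value: consistent with the stated marginal
</syntaxhighlight>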
===Conjugate to categorical or multinomial===
The Dirichlet distribution is the conjugate prior distribution of the categorical distribution (a generic discrete probability distribution with a given number of possible outcomes) and of the multinomial distribution (the distribution over observed counts of each possible category in a set of categorically distributed observations). This means that if a data point has either a categorical or multinomial distribution, and the prior distribution of the distribution's parameter (the vector of probabilities that generates the data point) is distributed as a Dirichlet, then the posterior distribution of the parameter is also a Dirichlet. Intuitively, in such a case, starting from what we know about the parameter prior to observing the data point, we can then update our knowledge based on the data point and end up with a new distribution of the same form as the old one. This means that we can successively update our knowledge of a parameter by incorporating new observations one at a time, without running into mathematical difficulties.

Formally, this can be expressed as follows. Given a model
\begin{array}{rcccl} \boldsymbol\alpha &=& \left(\alpha_1, \ldots, \alpha_K \right) &=& \text{concentration hyperparameter} \\ \mathbf{p}\mid\boldsymbol\alpha &=& \left(p_1, \ldots, p_K \right ) &\sim& \operatorname{Dir}(K, \boldsymbol\alpha) \\ \mathbb{X}\mid\mathbf{p} &=& \left(\mathbf{x}_1, \ldots, \mathbf{x}_N \right ) &\sim& \operatorname{Cat}(K,\mathbf{p}) \end{array}
then the following holds:
\begin{array}{rcccl} \mathbf{c} &=& \left(c_1, \ldots, c_K \right ) &=& \text{number of occurrences of category }i \\ \mathbf{p} \mid \mathbb{X},\boldsymbol\alpha &\sim& \operatorname{Dir}(K,\mathbf{c}+\boldsymbol\alpha) &=& \operatorname{Dir} \left (K,c_1+\alpha_1,\ldots,c_K+\alpha_K \right) \end{array}
This relationship is used in
Bayesian statistics to estimate the underlying parameter \mathbf{p} of a categorical distribution given a collection of samples. Intuitively, we can view the hyperprior vector \boldsymbol\alpha as pseudocounts, i.e. as representing the number of observations in each category that we have already seen. Then we simply add in the counts for all the new observations (the vector \mathbf{c}) in order to derive the posterior distribution.
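The pseudocount view makes the update a one-liner. A minimal sketch of the conjugate update above (assumes numpy; the "true" probabilities and sample size are arbitrary illustrative choices):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([1.0, 1.0, 1.0])        # prior pseudocounts (uniform prior)
p_true = np.array([0.2, 0.3, 0.5])       # hypothetical "true" category probabilities

data = rng.choice(3, size=1000, p=p_true)     # categorical observations
counts = np.bincount(data, minlength=3)       # c = occurrences of each category

alpha_post = alpha + counts                   # posterior is Dir(c + alpha)
print(alpha_post / alpha_post.sum())          # posterior mean, close to p_true
</syntaxhighlight>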
In Bayesian mixture models and other hierarchical Bayesian models with mixture components, Dirichlet distributions are commonly used as the prior distributions for the categorical variables appearing in the models. See the section on applications below for more information.
===Relation to Dirichlet-multinomial distribution===
In a model where a Dirichlet prior distribution is placed over a set of categorical-valued observations, the marginal joint distribution of the observations (i.e. the joint distribution of the observations, with the prior parameter marginalized out) is a Dirichlet-multinomial distribution. This distribution plays an important role in hierarchical Bayesian models, because when doing inference over such models using methods such as Gibbs sampling or variational Bayes, Dirichlet prior distributions are often marginalized out. See the article on this distribution for more details.
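The marginalization can be illustrated numerically. The following sketch (not from the source article; assumes scipy, with arbitrary example parameters) evaluates the Dirichlet-multinomial pmf at one count vector and compares it with the Monte Carlo frequency obtained by first drawing \mathbf{p} \sim \operatorname{Dir}(\boldsymbol\alpha) and then counts from a multinomial:

<syntaxhighlight lang="python">
import numpy as np
from scipy.special import gammaln

alpha = np.array([1.0, 2.0, 3.0])
n = 5
c = np.array([1, 2, 2])                  # a particular count vector, c.sum() == n

# Dirichlet-multinomial log-pmf: multinomial coefficient times Gamma-function ratios.
logpmf = (gammaln(n + 1) - gammaln(c + 1).sum()
          + gammaln(alpha.sum()) - gammaln(n + alpha.sum())
          + (gammaln(c + alpha) - gammaln(alpha)).sum())

# Monte Carlo: draw p ~ Dir(alpha), then counts ~ Multinomial(n, p), count hits.
rng = np.random.default_rng(0)
ps = rng.dirichlet(alpha, size=100_000)
draws = np.array([rng.multinomial(n, p) for p in ps])
print(np.exp(logpmf), (draws == c).all(axis=1).mean())   # the two values agree
</syntaxhighlight>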
===Entropy===
If \boldsymbol X is a \operatorname{Dir}(\boldsymbol\alpha) random variable, the differential entropy of \boldsymbol X (in nat units) is
h(\boldsymbol X) = \operatorname{E}[- \ln f(\boldsymbol X)] = \ln \operatorname{B}(\boldsymbol\alpha) + (\alpha_0-K)\psi(\alpha_0) - \sum_{j=1}^K (\alpha_j-1)\psi(\alpha_j)
where \psi is the digamma function.

The following formula for \operatorname{E}[\ln(X_i)] can be used to derive the differential entropy above. Since the functions \ln(X_i) are the sufficient statistics of the Dirichlet distribution, the exponential family differential identities can be used to get an analytic expression for the expectation of \ln(X_i) (see, e.g., equation (2.62) in the cited reference) and its associated covariance matrix:
\operatorname{E}[\ln(X_i)] = \psi(\alpha_i)-\psi(\alpha_0)
and
\operatorname{Cov}[\ln(X_i),\ln(X_j)] = \psi'(\alpha_i) \delta_{ij} - \psi'(\alpha_0)
where \psi is the digamma function, \psi' is the trigamma function, and \delta_{ij} is the Kronecker delta.
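The identity for \operatorname{E}[\ln(X_i)] is easy to verify by simulation; a minimal sketch (assumes scipy, arbitrary example \boldsymbol\alpha):

<syntaxhighlight lang="python">
import numpy as np
from scipy.special import digamma

alpha = np.array([1.5, 2.0, 4.5])
rng = np.random.default_rng(0)
logx = np.log(rng.dirichlet(alpha, size=200_000))

analytic = digamma(alpha) - digamma(alpha.sum())   # E[ln X_i] = psi(alpha_i) - psi(alpha_0)
print(np.c_[logx.mean(axis=0), analytic])          # the two columns agree closely
</syntaxhighlight>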
The spectrum of Rényi information for values other than \lambda = 1 is given by
F_R(\lambda) = (1-\lambda)^{-1} \left( - \lambda \log \mathrm{B}(\boldsymbol\alpha) + \sum_{i=1}^K \log \Gamma(\lambda(\alpha_i - 1) + 1) - \log \Gamma(\lambda (\alpha_0 - K) + K ) \right)
and the information entropy is the limit as \lambda goes to 1.

Another related interesting measure is the entropy of a discrete categorical (one-of-K binary) vector \boldsymbol Z with probability-mass distribution \boldsymbol X, i.e., P(Z_i=1, Z_{j\ne i} = 0 \mid \boldsymbol X) = X_i. The conditional information entropy of \boldsymbol Z, given \boldsymbol X, is
S(\boldsymbol X) = H(\boldsymbol Z \mid \boldsymbol X) = \operatorname{E}_{\boldsymbol Z}[- \log P(\boldsymbol Z \mid \boldsymbol X ) ] = \sum_{i=1}^K - X_i \log X_i
This function of \boldsymbol X is a scalar random variable. If \boldsymbol X has a symmetric Dirichlet distribution with all \alpha_i = \alpha, the expected value of the entropy (in nat units) is
\operatorname{E}[S(\boldsymbol X)] = \sum_{i=1}^K \operatorname{E}[- X_i \ln X_i] = \psi(K\alpha + 1) - \psi(\alpha + 1)
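A minimal Monte Carlo check of this last identity (assumes scipy; K and \alpha are arbitrary example choices):

<syntaxhighlight lang="python">
import numpy as np
from scipy.special import digamma

K, a = 4, 0.7                       # symmetric Dirichlet: all alpha_i = a
rng = np.random.default_rng(0)
x = rng.dirichlet(np.full(K, a), size=200_000)

mc = -(x * np.log(x)).sum(axis=1).mean()        # Monte Carlo E[S(X)]
print(mc, digamma(K * a + 1) - digamma(a + 1))  # matches psi(K*alpha+1) - psi(alpha+1)
</syntaxhighlight>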
===Kullback–Leibler divergence===
The Kullback–Leibler (KL) divergence between two Dirichlet distributions \operatorname{Dir}(\boldsymbol\alpha) and \operatorname{Dir}(\boldsymbol\beta) over the same simplex is
D_{\mathrm{KL}}\big(\operatorname{Dir}(\boldsymbol{\alpha}) \,\|\, \operatorname{Dir}(\boldsymbol{\beta})\big) = \log \frac{\Gamma\left(\sum_{i=1}^K \alpha_i\right)}{\Gamma\left(\sum_{i=1}^K \beta_i\right)} + \sum_{i=1}^K \left[ \log \frac{\Gamma(\beta_i)}{\Gamma(\alpha_i)} + (\alpha_i - \beta_i) \left( \psi(\alpha_i) - \psi\left(\sum_{j=1}^K \alpha_j\right) \right) \right]
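The formula translates directly into code; a minimal sketch (assumes scipy, arbitrary example parameters):

<syntaxhighlight lang="python">
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_kl(a, b):
    """KL( Dir(a) || Dir(b) ), straight from the formula above."""
    return (gammaln(a.sum()) - gammaln(b.sum())
            + (gammaln(b) - gammaln(a)).sum()
            + ((a - b) * (digamma(a) - digamma(a.sum()))).sum())

a = np.array([2.0, 3.0, 4.0])
b = np.array([1.0, 1.0, 1.0])
print(dirichlet_kl(a, b))        # nonnegative, and zero iff a == b
</syntaxhighlight>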
===Aggregation===
If X = (X_1, \ldots, X_K)\sim\operatorname{Dir}(\alpha_1,\ldots,\alpha_K), then, if the random variables with subscripts i and j are dropped from the vector and replaced by their sum,
X' = (X_1, \ldots, X_i + X_j, \ldots, X_K)\sim\operatorname{Dir} (\alpha_1, \ldots, \alpha_i + \alpha_j, \ldots, \alpha_K).
This aggregation property may be used to derive the marginal distribution of X_i mentioned above.
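Combined with the marginal property, aggregation implies, for example, that X_1 + X_2 \sim \operatorname{Beta}(\alpha_1 + \alpha_2, \alpha_3) when K = 3. A minimal empirical check (assumes scipy, arbitrary example parameters):

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

alpha = np.array([2.0, 5.0, 3.0])
rng = np.random.default_rng(0)
x = rng.dirichlet(alpha, size=50_000)

# Aggregate the first two coordinates; the sum should be
# Beta(alpha_1 + alpha_2, alpha_3) by aggregation + marginalization.
agg = x[:, 0] + x[:, 1]
print(stats.kstest(agg, stats.beta(alpha[0] + alpha[1], alpha[2]).cdf).pvalue)
</syntaxhighlight>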
===Neutrality===
If X = (X_1, \ldots, X_K)\sim\operatorname{Dir}(\boldsymbol\alpha), then the vector X is said to be neutral in the sense that X_K is independent of
X^{(-K)} = \left(\frac{X_1}{1-X_K},\frac{X_2}{1-X_K},\ldots,\frac{X_{K-1}}{1-X_K}\right).
Combining this with the property of aggregation, it follows that X_j + \cdots + X_K is independent of \left(\frac{X_1}{X_1+\cdots +X_{j-1}},\frac{X_2}{X_1+\cdots +X_{j-1}},\ldots,\frac{X_{j-1}}{X_1+\cdots +X_{j-1}} \right). In fact it is true, further, for the Dirichlet distribution, that for 3\le j\le K-1, the pair \left(X_1+\cdots +X_{j-1}, X_j+\cdots +X_K\right), and the two vectors \left(\frac{X_1}{X_1+\cdots +X_{j-1}},\frac{X_2}{X_1+\cdots +X_{j-1}},\ldots,\frac{X_{j-1}}{X_1+\cdots +X_{j-1}} \right) and \left(\frac{X_j}{X_j+\cdots +X_K},\frac{X_{j+1}}{X_j+\cdots +X_K},\ldots,\frac{X_K}{X_j+\cdots +X_K} \right), viewed as a triple of normalised random vectors, are mutually independent. The analogous result is true for partition of the indices into any other pair of non-singleton subsets.
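A weak but simple empirical consequence of neutrality is that X_K is uncorrelated with every coordinate of X^{(-K)} (zero correlation is necessary for independence, though not sufficient). A minimal sketch (assumes numpy, arbitrary example parameters):

<syntaxhighlight lang="python">
import numpy as np

alpha = np.array([2.0, 3.0, 4.0])
rng = np.random.default_rng(0)
x = rng.dirichlet(alpha, size=200_000)

xk = x[:, -1]                            # X_K
rest = x[:, :-1] / (1 - xk)[:, None]     # X^{(-K)}: remaining coords, renormalised

# Independence implies zero correlation with each coordinate of X^{(-K)}.
print([np.corrcoef(xk, rest[:, i])[0, 1] for i in range(rest.shape[1])])  # ~ 0
</syntaxhighlight>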
===Characteristic function===
The characteristic function of the Dirichlet distribution is a confluent form of the Lauricella hypergeometric series. It is given by Phillips as
CF\left(s_1,\ldots,s_{K-1}\right) = \operatorname{E}\left(e^{i\left(s_1X_1+\cdots+s_{K-1}X_{K-1} \right)} \right)= \Psi^{\left[K-1\right]} (\alpha_1,\ldots,\alpha_{K-1};\alpha_0;is_1,\ldots, is_{K-1})
where
\Psi^{[m]} (a_1,\ldots,a_m;c;z_1,\ldots,z_m) = \sum\frac{(a_1)_{k_1} \cdots (a_m)_{k_m} \, z_1^{k_1} \cdots z_m^{k_m}}{(c)_k\,k_1!\cdots k_m!}.
The sum is over non-negative integers k_1,\ldots,k_m with k=k_1+\cdots+k_m. Phillips goes on to state that this form is "inconvenient for numerical calculation" and gives an alternative in terms of a complex path integral:
\Psi^{[m]} = \frac{\Gamma(c)}{2\pi i}\int_L e^t\,t^{a_1+\cdots+a_m-c}\,\prod_{j=1}^m (t-z_j)^{-a_j} \, dt
where L denotes any path in the complex plane originating at -\infty, encircling in the positive direction all the singularities of the integrand and returning to -\infty.
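For K = 2 the series reduces to a single sum, \Psi^{[1]}(\alpha_1; \alpha_0; is) = \sum_k \frac{(\alpha_1)_k (is)^k}{(\alpha_0)_k \, k!}, i.e. a confluent hypergeometric function. A quick numerical sanity check (a sketch, not Phillips's method; assumes scipy, arbitrary example parameters) truncates this sum and compares it with a Monte Carlo estimate of \operatorname{E}[e^{isX_1}]:

<syntaxhighlight lang="python">
import numpy as np
from scipy.special import poch, factorial   # poch = Pochhammer symbol (a)_k

a1, a2, s = 2.0, 3.0, 1.5        # K = 2, so m = K - 1 = 1 and alpha_0 = a1 + a2
a0 = a1 + a2

# Truncated confluent series Psi^[1](a1; a0; i*s); the series is entire in z,
# so a modest number of terms suffices for moderate |s|.
k = np.arange(80)
series = np.sum(poch(a1, k) * (1j * s) ** k / (poch(a0, k) * factorial(k)))

# Monte Carlo estimate of E[exp(i*s*X_1)] with X ~ Dir(a1, a2).
rng = np.random.default_rng(0)
x1 = rng.dirichlet([a1, a2], size=200_000)[:, 0]
print(series, np.exp(1j * s * x1).mean())   # the two values agree closely
</syntaxhighlight>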
===Inequality===
The probability density function f \left(x_1,\ldots, x_{K-1}; \alpha_1,\ldots, \alpha_K \right) plays a key role in a multifunctional inequality which implies various bounds for the Dirichlet distribution. Another inequality relates the moment-generating function of the Dirichlet distribution to the convex conjugate of the scaled reversed Kullback–Leibler divergence:
\log \operatorname{E}\left(\exp\sum_{i=1}^K s_i X_i \right) \leq \sup_{\mathbf{p}} \sum_{i=1}^K \left(p_i s_i - \alpha_i\log\left(\frac{\alpha_i}{\alpha_0 p_i} \right)\right),
where the supremum is taken over probability vectors \mathbf{p} = (p_1,\ldots,p_K) spanning the simplex.

==Related distributions==