==Step-by-step recipe==

The above example shows the method by which the variational-Bayesian approximation to a posterior probability density in a given Bayesian network is derived:

• Describe the network with a graphical model, identifying the observed variables (data) \mathbf{X} and unobserved variables (parameters \boldsymbol\Theta and latent variables \mathbf{Z}) and their conditional probability distributions. Variational Bayes will then construct an approximation to the posterior probability p(\mathbf{Z},\boldsymbol\Theta\mid\mathbf{X}). The approximation has the basic property that it is a factorized distribution, i.e. a product of two or more independent distributions over disjoint subsets of the unobserved variables.
• Partition the unobserved variables into two or more subsets, over which the independent factors will be derived. There is no universal procedure for doing this; creating too many subsets yields a poor approximation, while creating too few makes the entire variational Bayes procedure intractable. Typically, the first split is to separate the parameters and latent variables; often, this is enough by itself to produce a tractable result. Assume that the partitions are called \mathbf{Z}_1,\ldots,\mathbf{Z}_M.
• For a given partition \mathbf{Z}_j, write down the formula for the best approximating distribution q_j^{*}(\mathbf{Z}_j\mid \mathbf{X}) using the basic equation \ln q_j^{*}(\mathbf{Z}_j\mid \mathbf{X}) = \operatorname{E}_{i \neq j} [\ln p(\mathbf{Z}, \mathbf{X})] + \text{constant}, where the expectation is taken with respect to the factors q_i over all partitions other than \mathbf{Z}_j.
• Fill in the formula for the joint probability distribution using the graphical model. Any component conditional distributions that don't involve any of the variables in \mathbf{Z}_j can be ignored; they will be folded into the constant term.
• Simplify the formula and apply the expectation operator, following the above example. Ideally, this should simplify into expectations of basic functions of variables not in \mathbf{Z}_j (e.g. first or second raw moments, expectation of a logarithm, etc.). In order for the variational Bayes procedure to work well, these expectations should generally be expressible analytically as functions of the parameters and/or hyperparameters of the distributions of these variables. In all cases, these expectation terms are constants with respect to the variables in the current partition.
• The functional form of the formula with respect to the variables in the current partition indicates the type of distribution. In particular, exponentiating the formula generates the probability density function (PDF) of the distribution (or at least, something proportional to it, with an unknown normalization constant). In order for the overall method to be tractable, it should be possible to recognize the functional form as belonging to a known distribution. Significant mathematical manipulation may be required to convert the formula into a form that matches the PDF of a known distribution. When this can be done, the normalization constant can be reinstated by definition, and equations for the parameters of the known distribution can be derived by extracting the appropriate parts of the formula.
• When all expectations can be replaced analytically with functions of variables not in the current partition, and the PDF has been put into a form that allows identification with a known distribution, the result is a set of equations expressing the values of the optimum parameters as functions of the parameters of variables in other partitions.
• When this procedure can be applied to all partitions, the result is a set of mutually linked equations specifying the optimum values of all parameters.
• An expectation–maximization (EM)-type procedure is then applied, picking an initial value for each parameter and then iterating through a series of steps, where at each step we cycle through the equations, updating each parameter in turn. This is guaranteed to converge (see the sketch following this list).
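A minimal sketch of this final iteration for the basic Gaussian example above, i.e. data x_i \sim \mathcal{N}(\mu, \tau^{-1}) with priors \mu \sim \mathcal{N}(\mu_0, (\lambda_0\tau)^{-1}) and \tau \sim \operatorname{Gamma}(a_0, b_0), and the factorization q(\mu,\tau) = q(\mu)\,q(\tau); the function and variable names are illustrative, not a fixed API:

```python
import numpy as np

def vb_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=100, tol=1e-10):
    """Mean-field variational Bayes for x_i ~ N(mu, 1/tau).

    The recipe identifies q(mu) as Gaussian N(mu_n, 1/lam_n) and
    q(tau) as Gamma(a_n, b_n); this routine iterates their mutually
    linked parameter equations until they stop changing."""
    N = len(x)
    xbar, xsq = x.mean(), np.sum(x ** 2)

    # Two of the parameter equations involve no expectations over the
    # other factor, so they can be solved once up front:
    mu_n = (lam0 * mu0 + N * xbar) / (lam0 + N)
    a_n = a0 + (N + 1) / 2.0

    # Pick an initial value, then cycle through the remaining equations,
    # updating each parameter in turn (the EM-like iteration):
    b_n = b0
    for _ in range(n_iter):
        e_tau = a_n / b_n                            # E[tau] under q(tau)
        lam_n = (lam0 + N) * e_tau                   # precision of q(mu)
        e_mu = mu_n                                  # E[mu] under q(mu)
        e_mu2 = mu_n ** 2 + 1.0 / lam_n              # E[mu^2] under q(mu)

        # Rate of q(tau): b0 + (1/2) E_mu[ sum_i (x_i - mu)^2
        #                                  + lam0 (mu - mu0)^2 ]
        b_new = b0 + 0.5 * (xsq - 2.0 * N * xbar * e_mu + N * e_mu2
                            + lam0 * (e_mu2 - 2.0 * mu0 * e_mu + mu0 ** 2))
        if abs(b_new - b_n) < tol:                   # converged
            b_n = b_new
            break
        b_n = b_new
    return mu_n, lam_n, a_n, b_n

# Example run on synthetic data:
rng = np.random.default_rng(0)
x = rng.normal(2.0, 0.5, size=500)                   # true mu = 2, tau = 4
mu_n, lam_n, a_n, b_n = vb_gaussian(x)
print(mu_n, a_n / b_n)                               # approx E[mu], E[tau]
```

Note the alternation: each pass updates q(\mu) using the current moments of q(\tau) and vice versa, which is exactly the set of mutually linked equations the recipe produces.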
==Most important points==

Due to all of the mathematical manipulations involved, it is easy to lose track of the big picture. The important things are:

• The idea of variational Bayes is to construct an analytical approximation to the posterior probability of the set of unobserved variables (parameters and latent variables), given the data. This means that the form of the solution is similar to that of other Bayesian inference methods, such as Gibbs sampling: a distribution that seeks to describe everything that is known about the variables. As in other Bayesian methods, but unlike e.g. in expectation–maximization (EM) or other maximum likelihood methods, both types of unobserved variables (i.e. parameters and latent variables) are treated the same, i.e. as random variables. Estimates for the variables can then be derived in the standard Bayesian ways, e.g. calculating the mean of the distribution to get a single point estimate or deriving a credible interval, highest density region, etc. (see the sketch after this list).
• "Analytical approximation" means that a formula can be written down for the posterior distribution. The formula generally consists of a product of well-known probability distributions, each of which factorizes over a set of unobserved variables (i.e. it is conditionally independent of the other variables, given the observed data). This formula is not the true posterior distribution, but an approximation to it; in particular, it will generally agree fairly closely with the true posterior in the lowest moments of the unobserved variables, e.g. the mean and variance.
• The result of all of the mathematical manipulations is (1) the identity of the probability distributions making up the factors, and (2) mutually dependent formulas for the parameters of these distributions. The actual values of these parameters are computed numerically, through an alternating iterative procedure much like EM.
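As a concrete illustration of the first point, point estimates and credible intervals can be read directly off the fitted factors from the sketch in the previous section. The numbers below are hypothetical fitted values, and SciPy is assumed to be available:

```python
from scipy import stats

# Hypothetical fitted factors from a run of the earlier sketch:
# q(mu) = N(mu_n, 1/lam_n), q(tau) = Gamma(a_n, b_n).
mu_n, lam_n, a_n, b_n = 2.01, 2030.0, 251.5, 62.0

mu_hat = mu_n                      # point estimate: mean of q(mu)
tau_hat = a_n / b_n                # point estimate: mean of q(tau)

# 95% credible intervals under the approximating distributions:
ci_mu = stats.norm.interval(0.95, loc=mu_n, scale=lam_n ** -0.5)
ci_tau = stats.gamma.interval(0.95, a_n, scale=1.0 / b_n)
```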
==Compared with expectation–maximization (EM)==

Variational Bayes (VB) is often compared with expectation–maximization (EM). The actual numerical procedures are quite similar, in that both are alternating iterative procedures that successively converge on optimum parameter values. The initial steps to derive the respective procedures are also vaguely similar, both starting out with formulas for probability densities and both involving significant amounts of mathematical manipulation. However, there are a number of differences. Most important is what is being computed:

• EM computes point estimates of the posterior distribution of those random variables that can be categorized as "parameters"; it estimates the actual posterior distributions only of the latent variables (at least in "soft EM", and often only when the latent variables are discrete). The point estimates computed are the modes of these parameters; no other information is available.
• VB, on the other hand, computes estimates of the actual posterior distribution of all variables, both parameters and latent variables. When point estimates need to be derived, generally the mean is used rather than the mode, as is normal in Bayesian inference.

Concomitant with this, the parameters computed in VB do not have the same significance as those in EM. EM computes optimum values of the parameters of the Bayes network itself. VB computes optimum values of the parameters of the distributions used to approximate the parameters and latent variables of the Bayes network. For example, a typical Gaussian mixture model will have parameters for the mean and variance of each of the mixture components. EM would directly estimate optimum values for these parameters. VB, however, would first fit a distribution to these parameters, typically in the form of a prior distribution, e.g. a normal-scaled inverse gamma distribution, and would then compute values for the parameters of this prior distribution, i.e. essentially hyperparameters. In this case, VB would compute optimum estimates of the four parameters of the normal-scaled inverse gamma distribution that describes the joint distribution of the mean and variance of the component.
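The difference in outputs can be made concrete with a small sketch. The numbers are hypothetical; the point is only the shape of the result: two point values from EM versus four hyperparameters from VB, from which point estimates are then derived:

```python
# Hypothetical EM output for one mixture component: point estimates
# (posterior modes) of the component's own parameters.
em_mu, em_sigma2 = 1.8, 0.25

# Hypothetical VB output for the same component: the four hyperparameters
# of the approximating factor
# q(mu, sigma2) = Normal-Inv-Gamma(mu, sigma2 | m, lam, a, b).
m, lam, a, b = 1.8, 25.0, 12.0, 3.0

# Bayesian point estimates are then read off this distribution:
mu_hat = m                          # E[mu]
sigma2_hat = b / (a - 1)            # E[sigma2] (inverse-gamma mean, a > 1)
var_mu = b / (lam * (a - 1))        # Var[mu] = E[sigma2] / lam
```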
==A more complex example==