==Step-by-step recipe==

The above example shows the method by which the variational-Bayesian approximation to a posterior probability density in a given Bayesian network is derived:

• Describe the network with a graphical model, identifying the observed variables (data) \mathbf{X} and unobserved variables (parameters \boldsymbol\Theta and latent variables \mathbf{Z}) and their conditional probability distributions. Variational Bayes will then construct an approximation to the posterior probability p(\mathbf{Z},\boldsymbol\Theta\mid\mathbf{X}). The approximation has the basic property that it is a factorized distribution, i.e. a product of two or more independent distributions over disjoint subsets of the unobserved variables.
• Partition the unobserved variables into two or more subsets, over which the independent factors will be derived. There is no universal procedure for doing this; creating too many subsets yields a poor approximation, while creating too few makes the entire variational Bayes procedure intractable. Typically, the first split is to separate the parameters and latent variables; often, this is enough by itself to produce a tractable result. Assume that the partitions are called \mathbf{Z}_1,\ldots,\mathbf{Z}_M.
• For a given partition \mathbf{Z}_j, write down the formula for the best approximating distribution q_j^{*}(\mathbf{Z}_j\mid \mathbf{X}) using the basic equation \ln q_j^{*}(\mathbf{Z}_j\mid \mathbf{X}) = \operatorname{E}_{i \neq j} [\ln p(\mathbf{Z}, \mathbf{X})] + \text{constant}, where the expectation is taken with respect to the factors q_i over all partitions other than \mathbf{Z}_j.
• Fill in the formula for the joint probability distribution using the graphical model. Any component conditional distributions that don't involve any of the variables in \mathbf{Z}_j can be ignored; they will be folded into the constant term.
• Simplify the formula and apply the expectation operator, following the above example. Ideally, this should simplify into expectations of basic functions of variables not in \mathbf{Z}_j (e.g. first or second raw moments, expectation of a logarithm, etc.). In order for the variational Bayes procedure to work well, these expectations should generally be expressible analytically as functions of the parameters and/or hyperparameters of the distributions of these variables. In all cases, these expectation terms are constants with respect to the variables in the current partition.
• The functional form of the formula with respect to the variables in the current partition indicates the type of distribution. In particular, exponentiating the formula generates the probability density function (PDF) of the distribution (or at least, something proportional to it, with an unknown normalization constant). In order for the overall method to be tractable, it should be possible to recognize the functional form as belonging to a known distribution. Significant mathematical manipulation may be required to convert the formula into a form that matches the PDF of a known distribution. When this can be done, the normalization constant can be reinstated by definition, and equations for the parameters of the known distribution can be derived by extracting the appropriate parts of the formula.
• When all expectations can be replaced analytically with functions of variables not in the current partition, and the PDF has been put into a form that allows identification with a known distribution, the result is a set of equations expressing the values of the optimum parameters as functions of the parameters of variables in other partitions.
• When this procedure can be applied to all partitions, the result is a set of mutually linked equations specifying the optimum values of all parameters.
• An expectation–maximization (EM)-type procedure is then applied, picking an initial value for each parameter and then iterating through a series of steps, where at each step we cycle through the equations, updating each parameter in turn. This is guaranteed to converge (see the sketch following this list).
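A minimal sketch of this final iteration for the basic Gaussian example above, i.e. data x_i \sim \mathcal{N}(\mu, \tau^{-1}) with priors \mu \sim \mathcal{N}(\mu_0, (\lambda_0\tau)^{-1}) and \tau \sim \operatorname{Gamma}(a_0, b_0), and the factorization q(\mu,\tau) = q(\mu)\,q(\tau); the function and variable names are illustrative, not a fixed API:

```python
import numpy as np

def vb_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=100, tol=1e-10):
    """Mean-field variational Bayes for x_i ~ N(mu, 1/tau).

    The recipe identifies q(mu) as Gaussian N(mu_n, 1/lam_n) and
    q(tau) as Gamma(a_n, b_n); this routine iterates their mutually
    linked parameter equations until they stop changing."""
    N = len(x)
    xbar, xsq = x.mean(), np.sum(x ** 2)

    # Two of the parameter equations involve no expectations over the
    # other factor, so they can be solved once up front:
    mu_n = (lam0 * mu0 + N * xbar) / (lam0 + N)
    a_n = a0 + (N + 1) / 2.0

    # Pick an initial value, then cycle through the remaining equations,
    # updating each parameter in turn (the EM-like iteration):
    b_n = b0
    for _ in range(n_iter):
        e_tau = a_n / b_n                            # E[tau] under q(tau)
        lam_n = (lam0 + N) * e_tau                   # precision of q(mu)
        e_mu = mu_n                                  # E[mu] under q(mu)
        e_mu2 = mu_n ** 2 + 1.0 / lam_n              # E[mu^2] under q(mu)

        # Rate of q(tau): b0 + (1/2) E_mu[ sum_i (x_i - mu)^2
        #                                  + lam0 (mu - mu0)^2 ]
        b_new = b0 + 0.5 * (xsq - 2.0 * N * xbar * e_mu + N * e_mu2
                            + lam0 * (e_mu2 - 2.0 * mu0 * e_mu + mu0 ** 2))
        if abs(b_new - b_n) < tol:                   # converged
            b_n = b_new
            break
        b_n = b_new
    return mu_n, lam_n, a_n, b_n

# Example run on synthetic data:
rng = np.random.default_rng(0)
x = rng.normal(2.0, 0.5, size=500)                   # true mu = 2, tau = 4
mu_n, lam_n, a_n, b_n = vb_gaussian(x)
print(mu_n, a_n / b_n)                               # approx E[mu], E[tau]
```

Note the alternation: each pass updates q(\mu) using the current moments of q(\tau) and vice versa, which is exactly the set of mutually linked equations the recipe produces.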
==Most important points==

Due to all of the mathematical manipulations involved, it is easy to lose track of the big picture. The important things are:

• The idea of variational Bayes is to construct an analytical approximation to the posterior probability of the set of unobserved variables (parameters and latent variables), given the data. This means that the form of the solution is similar to that of other Bayesian inference methods, such as Gibbs sampling: a distribution that seeks to describe everything that is known about the variables. As in other Bayesian methods, but unlike e.g. in expectation–maximization (EM) or other maximum likelihood methods, both types of unobserved variables (i.e. parameters and latent variables) are treated the same, i.e. as random variables. Estimates for the variables can then be derived in the standard Bayesian ways, e.g. calculating the mean of the distribution to get a single point estimate or deriving a credible interval, highest density region, etc. (see the sketch after this list).
• "Analytical approximation" means that a formula can be written down for the posterior distribution. The formula generally consists of a product of well-known probability distributions, each of which factorizes over a set of unobserved variables (i.e. it is conditionally independent of the other variables, given the observed data). This formula is not the true posterior distribution, but an approximation to it; in particular, it will generally agree fairly closely with the true posterior in the lowest moments of the unobserved variables, e.g. the mean and variance.
• The result of all of the mathematical manipulations is (1) the identity of the probability distributions making up the factors, and (2) mutually dependent formulas for the parameters of these distributions. The actual values of these parameters are computed numerically, through an alternating iterative procedure much like EM.
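As a concrete illustration of the first point, point estimates and credible intervals can be read directly off the fitted factors from the sketch in the previous section. The numbers below are hypothetical fitted values, and SciPy is assumed to be available:

```python
from scipy import stats

# Hypothetical fitted factors from a run of the earlier sketch:
# q(mu) = N(mu_n, 1/lam_n), q(tau) = Gamma(a_n, b_n).
mu_n, lam_n, a_n, b_n = 2.01, 2030.0, 251.5, 62.0

mu_hat = mu_n                      # point estimate: mean of q(mu)
tau_hat = a_n / b_n                # point estimate: mean of q(tau)

# 95% credible intervals under the approximating distributions:
ci_mu = stats.norm.interval(0.95, loc=mu_n, scale=lam_n ** -0.5)
ci_tau = stats.gamma.interval(0.95, a_n, scale=1.0 / b_n)
```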
==Compared with expectation–maximization (EM)==

Variational Bayes (VB) is often compared with expectation–maximization (EM). The actual numerical procedures are quite similar, in that both are alternating iterative procedures that successively converge on optimum parameter values. The initial steps to derive the respective procedures are also vaguely similar, both starting out with formulas for probability densities and both involving significant amounts of mathematical manipulation. However, there are a number of differences. Most important is what is being computed:

• EM computes point estimates of the posterior distribution of those random variables that can be categorized as "parameters"; it estimates the actual posterior distributions only of the latent variables (at least in "soft EM", and often only when the latent variables are discrete). The point estimates computed are the modes of these parameters; no other information is available.
• VB, on the other hand, computes estimates of the actual posterior distribution of all variables, both parameters and latent variables. When point estimates need to be derived, generally the mean is used rather than the mode, as is normal in Bayesian inference.

Concomitant with this, the parameters computed in VB do not have the same significance as those in EM. EM computes optimum values of the parameters of the Bayes network itself. VB computes optimum values of the parameters of the distributions used to approximate the parameters and latent variables of the Bayes network. For example, a typical Gaussian mixture model will have parameters for the mean and variance of each of the mixture components. EM would directly estimate optimum values for these parameters. VB, however, would first fit a distribution to these parameters, typically in the form of a prior distribution, e.g. a normal-scaled inverse gamma distribution, and would then compute values for the parameters of this prior distribution, i.e. essentially hyperparameters. In this case, VB would compute optimum estimates of the four parameters of the normal-scaled inverse gamma distribution that describes the joint distribution of the mean and variance of the component.
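The difference in outputs can be made concrete with a small sketch. The numbers are hypothetical; the point is only the shape of the result: two point values from EM versus four hyperparameters from VB, from which point estimates are then derived:

```python
# Hypothetical EM output for one mixture component: point estimates
# (posterior modes) of the component's own parameters.
em_mu, em_sigma2 = 1.8, 0.25

# Hypothetical VB output for the same component: the four hyperparameters
# of the approximating factor
# q(mu, sigma2) = Normal-Inv-Gamma(mu, sigma2 | m, lam, a, b).
m, lam, a, b = 1.8, 25.0, 12.0, 3.0

# Bayesian point estimates are then read off this distribution:
mu_hat = m                          # E[mu]
sigma2_hat = b / (a - 1)            # E[sigma2] (inverse-gamma mean, a > 1)
var_mu = b / (lam * (a - 1))        # Var[mu] = E[sigma2] / lam
```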
==A more complex example==