== Variational Bayesian inference ==

Suppose we have an observable random variable X, and we want to find its true distribution p^*. This would allow us to generate data by sampling and to estimate probabilities of future events. In general, it is impossible to find p^* exactly, forcing us to search for a good approximation. That is, we define a sufficiently large parametric family \{p_\theta\}_{\theta\in\Theta} of distributions, then solve for \min_\theta L(p_\theta, p^*) for some loss function L. One possible way to solve this is by considering a small variation from p_\theta to p_{\theta + \delta \theta}, and solving for L(p_\theta, p^*) - L(p_{\theta+\delta \theta}, p^*) = 0.
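For a smooth, finite-dimensional parametrization, this stationarity requirement reduces to the familiar first-order condition; as a sketch, expanding the loss to first order in \delta\theta gives

L(p_{\theta+\delta\theta}, p^*) - L(p_\theta, p^*) = \langle \nabla_\theta L(p_\theta, p^*), \delta\theta \rangle + O(\|\delta\theta\|^2)

so demanding that the difference vanish for every small variation \delta\theta forces \nabla_\theta L(p_\theta, p^*) = 0.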
This is a problem in the calculus of variations, thus it is called the variational method.

Since there are not many explicitly parametrized distribution families (all the classical distribution families, such as the normal distribution, the Gumbel distribution, etc., are far too simplistic to model the true distribution), we consider
implicitly parametrized probability distributions:

• First, define a simple distribution p(z) over a latent random variable Z. Usually a normal distribution or a uniform distribution suffices.
• Next, define a family of complicated functions f_\theta (such as a deep neural network) parametrized by \theta.
• Finally, define a way to convert any f_\theta(z) into a distribution over the observable random variable X (in general a simple distribution too, but unrelated to p(z)). For example, let f_\theta(z) = (f_1(z), f_2(z)) have two outputs; then we can define the corresponding distribution over X to be the normal distribution \mathcal N(f_1(z), e^{f_2(z)}).

This defines a family of joint distributions p_\theta over (X, Z). It is very easy to sample (x, z) \sim p_\theta: simply sample z \sim p, then compute f_\theta(z), and finally sample x \sim p_\theta(\cdot | z) using f_\theta(z), as in the sketch below. In other words, we have a generative model for both the observable and the latent.
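For concreteness, here is a minimal Python sketch of this ancestral sampling procedure, assuming a toy one-hidden-layer network for f_\theta and a one-dimensional X; all shapes, names, and parameter values are hypothetical, purely for illustration:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters theta of a small one-hidden-layer network f_theta,
# mapping a latent z to two outputs: a mean f1(z) and a log-variance f2(z).
d_z, d_h = 4, 16
theta = {
    "W1": rng.normal(size=(d_h, d_z)), "b1": np.zeros(d_h),
    "W2": rng.normal(size=(2, d_h)),   "b2": np.zeros(2),
}

def f_theta(z):
    h = np.tanh(theta["W1"] @ z + theta["b1"])
    f1, f2 = theta["W2"] @ h + theta["b2"]
    return f1, f2

def sample_joint():
    # Ancestral sampling: z ~ p(z), then x ~ p_theta(.|z) = N(f1(z), exp(f2(z))).
    z = rng.standard_normal(d_z)                   # prior p(z): standard normal
    f1, f2 = f_theta(z)
    x = rng.normal(loc=f1, scale=np.exp(f2 / 2))   # std dev = sqrt(exp(f2))
    return x, z

x, z = sample_joint()

Note that sampling is cheap even though the marginal density p_\theta(x) of the samples has no closed form.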
Now, we consider a distribution p_\theta to be good if it is a close approximation of p^*:

p_\theta(X) \approx p^*(X)

Since the distribution on the right side is over X only, the distribution on the left side must marginalize the latent variable Z away. In general, it is impossible to perform the integral p_\theta(x) = \int p_\theta(x|z)p(z)\,dz, forcing us to perform another approximation. Since p_\theta(x) = \frac{p_\theta(x|z)p(z)}{p_\theta(z|x)} (Bayes' rule), it suffices to find a good approximation of p_\theta(z|x). So we define another distribution family q_\phi(z|x) and use it to approximate p_\theta(z|x). This is a
discriminative model for the latent. In
Bayesian language, X is the observed evidence and Z is the latent/unobserved variable. The distribution p over Z is the prior distribution over Z, p_\theta(x|z) is the likelihood function, and p_\theta(z|x) is the posterior distribution over Z.

Given an observation x, we can infer what z likely gave rise to x by computing p_\theta(z|x). The usual Bayesian method is to estimate the integral p_\theta(x) = \int p_\theta(x|z)p(z)\,dz, then compute by Bayes' rule p_\theta(z|x) = \frac{p_\theta(x|z)p(z)}{p_\theta(x)}. This is expensive to perform in general, but if we can simply find a good approximation q_\phi(z|x) \approx p_\theta(z|x) for most x, z, then we can infer z from x cheaply, as in the sketch below. Since the cost of fitting q_\phi is paid once and then reused across observations, the search for a good q_\phi is also called amortized inference. All in all, we have arrived at a problem of variational Bayesian inference.
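To make amortization concrete, here is a minimal Python sketch in which q_\phi(\cdot|x) is a diagonal Gaussian whose mean and log-variance are produced by a small network applied to x; the network shape and all names are hypothetical, for illustration only:

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameters phi of a one-hidden-layer network mapping an
# observation x to the parameters (mean, log-variance) of q_phi(z|x).
d_x, d_z, d_h = 1, 4, 16
phi = {
    "W1": rng.normal(size=(d_h, d_x)),     "b1": np.zeros(d_h),
    "W2": rng.normal(size=(2 * d_z, d_h)), "b2": np.zeros(2 * d_z),
}

def q_params(x):
    # One cheap forward pass per observation: this reuse of a single fitted
    # network across all x is what makes the inference "amortized".
    h = np.tanh(phi["W1"] @ np.atleast_1d(x) + phi["b1"])
    out = phi["W2"] @ h + phi["b2"]
    return out[:d_z], out[d_z:]            # mean, log-variance

def infer_z(x):
    # Draw z ~ q_phi(.|x) = N(mean, diag(exp(log_var))).
    mean, log_var = q_params(x)
    return mean + np.exp(log_var / 2) * rng.standard_normal(d_z)

z = infer_z(0.7)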
== Deriving the ELBO ==

A basic result in variational inference is that minimizing the Kullback–Leibler divergence (KL-divergence) is equivalent to maximizing the log-likelihood:

\mathbb{E}_{x\sim p^*(x)}[\ln p_\theta (x)] = -H(p^*) - D_{\mathit{KL}}(p^*(x) \| p_\theta(x))

where H(p^*) = -\mathbb E_{x\sim p^*}[\ln p^*(x)] is the entropy of the true distribution. So if we can maximize \mathbb{E}_{x\sim p^*(x)}[\ln p_\theta (x)], we can minimize D_{\mathit{KL}}(p^*(x) \| p_\theta(x)), and consequently find an accurate approximation p_\theta \approx p^*.

To maximize \mathbb{E}_{x\sim p^*(x)}[\ln p_\theta (x)], we simply sample many x_i \sim p^*(x), i.e. use importance sampling:

N \max_\theta \mathbb{E}_{x\sim p^*(x)}[\ln p_\theta (x)] \approx \max_\theta \sum_i \ln p_\theta (x_i)

where N is the number of samples drawn from the true distribution. This approximation can be seen as overfitting. (In fact, by Jensen's inequality,

\mathbb E_{x\sim p^{*}(x)}\left[\max_{\theta}\sum_{i}\ln p_{\theta}(x_{i})\right] \geq \max_{\theta}\mathbb E_{x\sim p^{*}(x)}\left[\sum_{i}\ln p_{\theta}(x_{i})\right] = N\max_{\theta}\mathbb{E}_{x\sim p^{*}(x)}[\ln p_{\theta}(x)]

so the estimator is biased upwards. This can be seen as overfitting: for a finite set of sampled data x_i, there is usually some \theta that fits them better than the entire p^* distribution.)

In order to maximize \sum_i \ln p_\theta (x_i), it is necessary to compute \ln p_\theta(x):

\ln p_\theta(x) = \ln \int p_\theta(x|z) p(z)\,dz

This usually has no closed form and must be estimated.
The usual way to estimate integrals is Monte Carlo integration with importance sampling:

\int p_\theta(x|z) p(z)\,dz = \mathbb E_{z\sim q_\phi(\cdot|x)}\left[\frac{p_\theta (x, z)}{q_\phi(z|x)}\right]

where q_\phi(z|x) is a sampling distribution over z that we use to perform the Monte Carlo integration. So we see that if we sample z \sim q_\phi(\cdot|x), then \frac{p_\theta (x, z)}{q_\phi(z|x)} is an unbiased estimator of p_\theta(x). Unfortunately, this does not give us an unbiased estimator of \ln p_\theta(x), because \ln is nonlinear. Indeed, by Jensen's inequality,

\ln p_\theta(x) = \ln \mathbb E_{z\sim q_\phi(\cdot|x)}\left[\frac{p_\theta (x, z)}{q_\phi(z|x)}\right] \geq \mathbb E_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{p_\theta (x, z)}{q_\phi(z|x)}\right]

In fact, all the obvious estimators of \ln p_\theta(x) are biased downwards, because no matter how many samples z_i \sim q_\phi(\cdot | x) we take, we have by Jensen's inequality:

\mathbb E_{z_i \sim q_\phi(\cdot|x)}\left[ \ln \left(\frac 1N \sum_i \frac{p_\theta (x, z_i)}{q_\phi(z_i|x)}\right) \right] \leq \ln \mathbb E_{z_i \sim q_\phi(\cdot|x)}\left[ \frac 1N \sum_i \frac{p_\theta (x, z_i)}{q_\phi(z_i|x)} \right] = \ln p_\theta(x)

Subtracting the right side, we see that the problem comes down to a biased estimator of zero:

\mathbb E_{z_i \sim q_\phi(\cdot|x)}\left[ \ln \left(\frac 1N \sum_i \frac{p_\theta (z_i|x)}{q_\phi(z_i|x)}\right) \right] \leq 0

At this point, we could branch off towards the development of an importance-weighted autoencoder. (By the delta method, we have

\mathbb E_{z_i \sim q_\phi(\cdot|x)}\left[ \ln \left(\frac 1N \sum_i \frac{p_\theta (z_i|x)}{q_\phi(z_i|x)}\right) \right] \approx -\frac{1}{2N} \mathbb V_{z \sim q_\phi(\cdot|x)}\left[\frac{p_\theta (z|x)}{q_\phi(z|x)}\right] = O(N^{-1})

and continuing along this line yields the importance-weighted autoencoder.) Instead, we continue with the simplest case, N = 1:

\ln p_\theta(x) = \ln \mathbb E_{z\sim q_\phi(\cdot|x)}\left[\frac{p_\theta (x, z)}{q_\phi(z|x)}\right] \geq \mathbb E_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{p_\theta (x, z)}{q_\phi(z|x)}\right]

The tightness of the inequality has a closed form:

\ln p_\theta(x) - \mathbb E_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{p_\theta (x, z)}{q_\phi(z|x)}\right] = D_{\mathit{KL}}(q_\phi(\cdot | x)\| p_\theta(\cdot | x)) \geq 0

We have thus obtained the ELBO function:

L(\phi, \theta; x) := \ln p_\theta(x) - D_{\mathit{KL}}(q_\phi(\cdot | x)\| p_\theta(\cdot | x))
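As a sketch of how this N = 1 estimator is evaluated in practice, the following Python fragment computes \ln\frac{p_\theta(x,z)}{q_\phi(z|x)} for a single z \sim q_\phi(\cdot|x), assuming a standard normal prior and diagonal-Gaussian decoder and encoder such as those in the earlier sketches (decode and encode are hypothetical callables returning a mean and a log-variance):

import numpy as np

rng = np.random.default_rng(2)

def log_normal(v, mean, log_var):
    # Log-density of a diagonal Gaussian N(mean, diag(exp(log_var))).
    v, mean, log_var = map(np.atleast_1d, (v, mean, log_var))
    return -0.5 * np.sum(log_var + np.log(2 * np.pi)
                         + (v - mean) ** 2 / np.exp(log_var))

def elbo_single_sample(x, decode, encode, d_z):
    # decode(z) -> (mean, log_var) of p_theta(x|z);
    # encode(x) -> (mean, log_var) of q_phi(z|x).
    q_mean, q_log_var = encode(x)
    z = q_mean + np.exp(q_log_var / 2) * rng.standard_normal(d_z)  # z ~ q_phi(.|x)
    x_mean, x_log_var = decode(z)
    log_p_xz = (log_normal(z, np.zeros(d_z), np.zeros(d_z))   # ln p(z), standard normal prior
                + log_normal(x, x_mean, x_log_var))           # ln p_theta(x|z)
    log_q = log_normal(z, q_mean, q_log_var)                  # ln q_phi(z|x)
    return log_p_xz - log_q   # unbiased estimate of L(phi, theta; x)

Averaging this log-ratio over several independent z lowers the variance of the estimate of the ELBO itself, whereas taking the logarithm of the averaged probability ratio would instead yield the tighter importance-weighted bound discussed above.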
== Maximizing the ELBO ==

For fixed x, the optimization \max_{\theta, \phi} L(\phi, \theta; x) simultaneously attempts to maximize \ln p_\theta(x) and minimize D_{\mathit{KL}}(q_\phi(\cdot | x)\| p_\theta(\cdot | x)). If the parametrizations of p_\theta and q_\phi are flexible enough, we would obtain some \hat\phi, \hat\theta such that, simultaneously,

\ln p_{\hat \theta}(x) \approx \max_\theta \ln p_\theta(x); \quad q_{\hat\phi}(\cdot | x) \approx p_{\hat\theta}(\cdot | x)

Since

\mathbb{E}_{x\sim p^*(x)}[\ln p_\theta (x)] = -H(p^*) - D_{\mathit{KL}}(p^*(x) \| p_\theta(x))

we have

\ln p_{\hat \theta}(x) \approx \max_\theta \left(-H(p^*) - D_{\mathit{KL}}(p^*(x) \| p_\theta(x))\right)

and so

\hat\theta \approx \arg\min_\theta D_{\mathit{KL}}(p^*(x) \| p_\theta(x))

In other words, maximizing the ELBO simultaneously allows us to obtain an accurate generative model p_{\hat\theta} \approx p^* and an accurate discriminative model q_{\hat\phi}(\cdot | x) \approx p_{\hat\theta}(\cdot | x).

== Main forms ==