Numerous variations of the basic Gibbs sampler exist. The goal of these variations is to reduce the
autocorrelation between samples sufficiently to overcome any added computational costs.
Blocked Gibbs sampler
• A blocked Gibbs sampler groups two or more variables together and samples from their joint distribution conditioned on all other variables, rather than sampling from each one individually. For example, in a hidden Markov model, a blocked Gibbs sampler might sample from all the latent variables making up the Markov chain in one go, using the forward-backward algorithm; a sketch of such an update appears below.
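For concreteness, the following is a minimal sketch of one such blocked update for a small discrete HMM, using forward filtering followed by backward sampling so that the whole latent state sequence is redrawn jointly. The transition matrix, emission matrix, and observations are illustrative placeholders; in a full sampler the model parameters would be resampled in their own Gibbs steps.

```python
import numpy as np

def sample_states(obs, A, E, pi, rng):
    """Draw the full latent state sequence of an HMM in one block
    (forward filtering, backward sampling)."""
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * E[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                       # forward filtering
        alpha[t] = E[:, obs[t]] * (alpha[t - 1] @ A)
        alpha[t] /= alpha[t].sum()
    z = np.empty(T, dtype=int)
    z[-1] = rng.choice(K, p=alpha[-1])          # backward sampling
    for t in range(T - 2, -1, -1):
        w = alpha[t] * A[:, z[t + 1]]
        z[t] = rng.choice(K, p=w / w.sum())
    return z

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.2, 0.8]])          # transition probabilities (assumed known here)
E = np.array([[0.8, 0.2], [0.3, 0.7]])          # emission probabilities (assumed known here)
pi = np.array([0.5, 0.5])                       # initial state distribution
obs = np.array([0, 0, 1, 1, 0])                 # toy observation sequence
z = sample_states(obs, A, E, pi, rng)
```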
Collapsed Gibbs sampler
• A collapsed Gibbs sampler integrates out (marginalizes over) one or more variables when sampling for some other variable. For example, imagine that a model consists of three variables A, B, and C. A simple Gibbs sampler would sample from p(A | B, C), then p(B | A, C), then p(C | A, B). A collapsed Gibbs sampler might replace the sampling step for A with a sample taken from the marginal distribution p(A | C), with variable B integrated out in this case. Alternatively, variable B could be collapsed out entirely, alternately sampling from p(A | C) and p(C | A) and not sampling over B at all. The distribution over a variable A that arises when collapsing a parent variable B is called a compound distribution; sampling from this distribution is generally tractable when B is the conjugate prior for A, particularly when A and B are members of the exponential family. For more information, see the article on compound distributions or Liu (1994).
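As a minimal numerical illustration of this compound-distribution point, the sketch below draws a binomial variable A both by first sampling its Beta-distributed parent B and by sampling directly from the beta-binomial marginal that results when B is integrated out; the parameter values are arbitrary.

```python
import numpy as np
from scipy.stats import betabinom

rng = np.random.default_rng(0)
a, b, n = 2.0, 5.0, 20                   # Beta prior parameters and binomial sample size (illustrative)

# Uncollapsed: draw the parent B, then A given B.
B = rng.beta(a, b, size=100_000)
A_uncollapsed = rng.binomial(n, B)

# Collapsed: B integrated out analytically, so A follows the
# beta-binomial compound distribution and can be drawn directly.
A_collapsed = betabinom.rvs(n, a, b, size=100_000, random_state=1)

# The two marginals of A agree up to Monte Carlo noise (mean = n*a/(a+b), about 5.7 here).
print(A_uncollapsed.mean(), A_collapsed.mean())
```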
Implementing a collapsed Gibbs sampler
Collapsing Dirichlet distributions
In hierarchical Bayesian models with categorical variables, such as latent Dirichlet allocation and various other models used in natural language processing, it is quite common to collapse out the Dirichlet distributions that are typically used as prior distributions over the categorical variables. This collapsing introduces dependencies among all the categorical variables dependent on a given Dirichlet prior, and the joint distribution of these variables after collapsing is a Dirichlet-multinomial distribution. The conditional distribution of a given categorical variable in this distribution, conditioned on the others, assumes an extremely simple form that makes Gibbs sampling even easier than if the collapsing had not been done. The rules are as follows:
• Collapsing out a Dirichlet prior node affects only the parent and children nodes of the prior. Since the parent is often a constant, it is typically only the children that we need to worry about.
• Collapsing out a Dirichlet prior introduces dependencies among all the categorical children dependent on that prior, but no extra dependencies among any other categorical children. (This is important to keep in mind, for example, when there are multiple Dirichlet priors related by the same hyperprior. Each Dirichlet prior can be independently collapsed and affects only its direct children.)
• After collapsing, the conditional distribution of one dependent child given the others assumes a very simple form: the probability of seeing a given value is proportional to the sum of the corresponding hyperprior parameter for this value and the count of all of the other dependent nodes assuming the same value. Nodes not dependent on the same prior must not be counted. The same rule applies in other iterative inference methods, such as variational Bayes or expectation maximization; however, if the method involves keeping partial counts, then the partial counts for the value in question must be summed across all the other dependent nodes. Sometimes this summed-up partial count is termed the expected count or similar. The probability is proportional to the resulting value; the actual probability must be determined by normalizing across all the possible values that the categorical variable can take (i.e. adding up the computed result for each possible value of the categorical variable, and dividing all the computed results by this sum).
• If a given categorical node has dependent children (e.g. when it is a latent variable in a mixture model), the value computed in the previous step (expected count plus prior, or whatever is computed) must be multiplied by the actual conditional probabilities (not a computed value that is proportional to the probability!) of all children given their parents; a sketch combining this rule with the previous one appears after this list. See the article on the Dirichlet-multinomial distribution for a detailed discussion.
• In the case where the group membership of the nodes dependent on a given Dirichlet prior may change dynamically depending on some other variable (e.g. a categorical variable indexed by another latent categorical variable, as in a topic model), the same expected counts are still computed, but this needs to be done carefully so that the correct set of variables is included. See the article on the Dirichlet-multinomial distribution for more discussion, including in the context of a topic model.
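The sketch below applies these rules to a toy finite mixture of Gaussians in which the Dirichlet prior over the mixing weights has been collapsed out: each assignment is resampled with probability proportional to (hyperprior parameter plus the count among the other assignments), multiplied by the likelihood of the child observation. The component parameters are held fixed purely to keep the example short; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

K, alpha = 3, 1.0                              # number of components, symmetric Dirichlet hyperparameter
mu = np.array([-4.0, 0.0, 4.0])                # component means (held fixed in this sketch)
sigma = 1.0                                    # shared standard deviation (held fixed)
x = np.concatenate([rng.normal(m, sigma, 50) for m in mu])   # toy data
z = rng.integers(K, size=x.size)               # initial assignments
counts = np.bincount(z, minlength=K)

def likelihood(xi):
    """Conditional probability of the child observation under each component."""
    return np.exp(-0.5 * ((xi - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for sweep in range(100):
    for i in range(x.size):
        counts[z[i]] -= 1                       # exclude node i: only the *other* assignments are counted
        weights = (alpha + counts) * likelihood(x[i])   # (hyperprior + counts) times child likelihood
        weights /= weights.sum()                # normalize over the possible values
        z[i] = rng.choice(K, p=weights)
        counts[z[i]] += 1
```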
Collapsing other conjugate priors
In general, any conjugate prior can be collapsed out, if its only children have distributions conjugate to it. The relevant math is discussed in the article on compound distributions. If there is only one child node, the result will often assume a known distribution. For example, collapsing an inverse-gamma-distributed variance out of a network with a single Gaussian child will yield a Student's t-distribution. (For that matter, collapsing both the mean and variance of a single Gaussian child will still yield a Student's t-distribution, provided both are conjugate, i.e. Gaussian mean, inverse-gamma variance.)
If there are multiple child nodes, they will all become dependent, as in the Dirichlet-categorical case. The resulting joint distribution will have a closed form that resembles the compound distribution in some ways, although it will contain a product of factors, one for each child node. In addition, and most importantly, the resulting conditional distribution of one of the child nodes given the others (and also given the parents of the collapsed node(s), but not given the children of the child nodes) will have the same density as the posterior predictive distribution based on all the remaining child nodes. Furthermore, the posterior predictive distribution has the same density as the basic compound distribution of a single node, although with different parameters. The general formula is given in the article on compound distributions.
For example, given a Bayes network with a set of conditionally independent identically distributed Gaussian-distributed nodes with conjugate prior distributions placed on the mean and variance, the conditional distribution of one node given the others after compounding out both the mean and variance will be a Student's t-distribution. Similarly, the result of compounding out the gamma prior of a number of Poisson-distributed nodes causes the conditional distribution of one node given the others to assume a negative binomial distribution.
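To make the gamma-Poisson case concrete, the following sketch computes the conditional of one Poisson node given the others once the gamma prior (shape a, rate b) on the Poisson rate has been collapsed: the result is a negative binomial distribution whose parameters are updated by the other nodes' counts. Parameter names and values here are illustrative.

```python
import numpy as np
from scipy.stats import nbinom

def conditional_of_node(x, i, a, b):
    """Conditional of Poisson node i given the others, with the
    Gamma(shape=a, rate=b) prior on the Poisson rate collapsed out."""
    others = np.delete(x, i)
    a_post = a + others.sum()               # shape updated by the other nodes' counts
    b_post = b + others.size                # rate updated by the number of other nodes
    # negative binomial in scipy's (r, p) parameterization
    return nbinom(a_post, b_post / (b_post + 1.0))

rng = np.random.default_rng(0)
x = rng.poisson(3.0, size=10)               # toy data
dist = conditional_of_node(x, 0, a=2.0, b=1.0)
x[0] = dist.rvs()                           # one collapsed Gibbs update for node 0
```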
In these cases where compounding produces a well-known distribution, efficient sampling procedures often exist, and using them will often (although not necessarily) be more efficient than not collapsing and instead sampling both prior and child nodes separately. However, in the case where the compound distribution is not well known, it may not be easy to sample from, since it generally will not belong to the exponential family and typically will not be log-concave (which would make it easy to sample using adaptive rejection sampling, since a closed form always exists).
In the case where the child nodes of the collapsed nodes themselves have children, the conditional distribution of one of these child nodes given all other nodes in the graph will have to take into account the distribution of these second-level children. In particular, the resulting conditional distribution will be proportional to a product of the compound distribution as defined above and the conditional distributions of all of the child nodes given their parents (but not given their own children). This follows from the fact that the full conditional distribution is proportional to the joint distribution. If the child nodes of the collapsed nodes are continuous, this distribution will generally not be of a known form, and may well be difficult to sample from despite the fact that a closed form can be written, for the same reasons as described above for non-well-known compound distributions. However, in the particular case that the child nodes are discrete, sampling is feasible regardless of whether the children of these child nodes are continuous or discrete. In fact, the principle involved here is described in fair detail in the article on the Dirichlet-multinomial distribution.
Gibbs sampler with ordered overrelaxation
• A Gibbs sampler with ordered overrelaxation samples a given odd number of candidate values for x_j^{(i)} at any given step and sorts them, along with the single value for x_j^{(i-1)}, according to some well-defined ordering. If x_j^{(i-1)} is the s-th smallest value in the sorted list, then x_j^{(i)} is selected as the s-th largest value in the sorted list. For more information, see Neal (1995).
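A minimal sketch of one such update is given below; sample_conditional stands in for a user-supplied routine that draws independent values from the current full conditional of the coordinate being updated, and the number of candidates K is an arbitrary odd value.

```python
import numpy as np

def ordered_overrelaxation_step(x_old, sample_conditional, K=11, rng=None):
    """One ordered-overrelaxation update of a single coordinate.

    sample_conditional(size, rng) must return `size` independent draws from
    the full conditional of this coordinate given all other variables; K is
    the (odd) number of candidate values.
    """
    rng = np.random.default_rng() if rng is None else rng
    pool = np.append(sample_conditional(K, rng), x_old)
    order = np.argsort(pool)                      # indices that sort the K+1 values
    r = int(np.nonzero(order == K)[0][0])         # rank of x_old (stored at index K) in the sorted pool
    return pool[order[K - r]]                     # mirror the rank: r-th smallest -> r-th largest

# Example: overrelaxed update when the full conditional is N(2, 1)
new_x = ordered_overrelaxation_step(0.3, lambda size, rng: rng.normal(2.0, 1.0, size))
```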
Other extensions
It is also possible to extend Gibbs sampling in various ways. For example, in the case of variables whose conditional distribution is not easy to sample from, a single iteration of slice sampling or the Metropolis–Hastings algorithm can be used to sample from the variables in question; a sketch of such a Metropolis-within-Gibbs update appears below. It is also possible to incorporate variables that are not random variables, but whose value is deterministically computed from other variables.
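The following is a minimal sketch of the first of these extensions: a single random-walk Metropolis–Hastings step used in place of an exact draw from one coordinate's full conditional. The function log_cond and the step size are hypothetical placeholders supplied by the user.

```python
import numpy as np

def metropolis_within_gibbs_step(x, j, log_cond, step=0.5, rng=None):
    """Update coordinate j of x with one random-walk Metropolis-Hastings step.

    log_cond(value, x, j) must return the log of the full conditional density
    of coordinate j at `value` (up to an additive constant), given the rest of x.
    """
    rng = np.random.default_rng() if rng is None else rng
    proposal = x[j] + step * rng.standard_normal()
    log_ratio = log_cond(proposal, x, j) - log_cond(x[j], x, j)
    if np.log(rng.uniform()) < log_ratio:    # accept with probability min(1, ratio)
        x[j] = proposal
    return x
```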
Generalized linear models, e.g. logistic regression (aka "maximum entropy models"), can be incorporated using such deterministically computed nodes. (BUGS, for example, allows this type of mixing of models.)

== Failure modes ==