The Metropolis–Hastings algorithm can draw samples from any
probability distribution with
probability density P(x), provided that we know a function f(x) proportional to the
density P and the values of f(x) can be calculated. The requirement that f(x) must only be proportional to the density, rather than exactly equal to it, makes the Metropolis–Hastings algorithm particularly useful, because it removes the need to calculate the density's normalization factor, which is often extremely difficult in practice. The Metropolis–Hastings algorithm generates a sequence of sample values in such a way that, as more and more sample values are produced, the distribution of values more closely approximates the desired distribution. These sample values are produced iteratively in such a way, that the distribution of the next sample depends only on the current sample value, which makes the sequence of samples a
Markov chain. Specifically, at each iteration, the algorithm proposes a candidate for the next sample value based on the current sample value. Then, with some probability, the candidate is either accepted, in which case the candidate value is used in the next iteration, or it is rejected in which case the candidate value is discarded, and the current value is reused in the next iteration. The probability of acceptance is determined by comparing the values of the function f(x) of the current and candidate sample values with respect to the desired distribution. The method used to propose new candidates is characterized by the probability distribution g(x\mid y) (sometimes written Q(x\mid y)) of a new proposed sample x given the previous sample y. This is called the
proposal density,
proposal function, or
jumping distribution. A common choice for g(x\mid y) is a
Gaussian distribution centered at y, so that points closer to y are more likely to be visited next, making the sequence of samples into a
Gaussian random walk. In the original paper by Metropolis et al. (1953), g(x\mid y) was suggested to be a uniform distribution limited to some maximum distance from y. More complicated proposal functions are also possible, such as those of
Hamiltonian Monte Carlo,
Langevin Monte Carlo, or
preconditioned Crank–Nicolson. For the purpose of illustration, the Metropolis algorithm, a special case of the Metropolis–Hastings algorithm where the proposal function is symmetric, is described below. ;Metropolis algorithm (symmetric proposal distribution): Let f(x) be a function that is proportional to the desired probability density function P(x) (a.k.a. a target distribution). • Initialization: Choose an arbitrary point x_t to be the first observation in the sample and choose a proposal function g(x\mid y). In this section, g is assumed to be symmetric; in other words, it must satisfy g(x\mid y) = g(y\mid x). • For each iteration
t: •
Propose a candidate x' for the next sample by picking from the distribution g(x'\mid x_t). •
Calculate the
acceptance ratio \alpha = f(x')/f(x_t), which will be used to decide whether to accept or reject the candidate. Because
f is proportional to the density of
P, we have that \alpha = f(x')/f(x_t) = P(x')/P(x_t). •
Accept or reject: • Generate a uniform random number u \in [0, 1]. • If u \le \alpha, then
accept the candidate by setting x_{t+1} = x', • If u > \alpha, then
reject the candidate and set x_{t+1} = x_t instead. This algorithm proceeds by randomly attempting to move about the
sample space, sometimes accepting the moves and sometimes remaining in place. P(x) at specific point x is proportional to the iterations spent on the point by the algorithm. Note that the acceptance ratio \alpha indicates how probable the new proposed sample is with respect to the current sample, according to the distribution whose density is P(x). If we attempt to move to a point that is more probable than the existing point (i.e. a point in a higher-density region of P(x) corresponding to an \alpha > 1 \ge u), we will always accept the move. However, if we attempt to move to a less probable point, we will sometimes reject the move, and the larger the relative drop in probability, the more likely we are to reject the new point. Thus, we will tend to stay in (and return large numbers of samples from) high-density regions of P(x), while only occasionally visiting low-density regions. Intuitively, this is why this algorithm works and returns samples that follow the desired distribution with density P(x). Compared with an algorithm like
adaptive rejection sampling that directly generates independent samples from a distribution, Metropolis–Hastings and other MCMC algorithms have a number of disadvantages: • The samples are
autocorrelated. Even though over the long term they do correctly follow P(x), a set of nearby samples will be correlated with each other and not correctly reflect the distribution. This means that effective sample sizes can be significantly lower than the number of samples actually taken, leading to large errors. • Although the Markov chain eventually converges to the desired distribution, the initial samples may follow a very different distribution, especially if the starting point is in a region of low density. As a result, a
burn-in period is typically necessary, where an initial number of samples are thrown away. On the other hand, most simple
rejection sampling methods suffer from the "
curse of dimensionality", where the probability of rejection increases exponentially as a function of the number of dimensions. Metropolis–Hastings, along with other MCMC methods, do not have this problem to such a degree, and thus are often the only solutions available when the number of dimensions of the distribution to be sampled is high. As a result, MCMC methods are often the methods of choice for producing samples from
hierarchical Bayesian models and other high-dimensional statistical models used nowadays in many disciplines. In
multivariate distributions, the classic Metropolis–Hastings algorithm as described above involves choosing a new multi-dimensional sample point. When the number of dimensions is high, finding the suitable jumping distribution to use can be difficult, as the different individual dimensions behave in very different ways, and the jumping width (see above) must be "just right" for all dimensions at once to avoid excessively slow mixing. An alternative approach that often works better in such situations, known as
Gibbs sampling, involves choosing a new sample for each dimension separately from the others, rather than choosing a sample for all dimensions at once. That way, the problem of sampling from potentially high-dimensional space will be reduced to a collection of problems to sample from small dimensionality. This is especially applicable when the multivariate distribution is composed of a set of individual
random variables in which each variable is conditioned on only a small number of other variables, as is the case in most typical
hierarchical models. The individual variables are then sampled one at a time, with each variable conditioned on the most recent values of all the others. Various algorithms can be used to choose these individual samples, depending on the exact form of the multivariate distribution: some possibilities are the
adaptive rejection sampling methods, a simple one-dimensional Metropolis–Hastings step, or
slice sampling. ==Formal derivation==