Deep belief network

In machine learning, a deep belief network (DBN) is a generative graphical model, or alternatively a class of deep neural network, composed of multiple layers of latent variables, with connections between the layers but not between units within each layer.
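The layered structure described above can be sketched as a stack of weight matrices, one per pair of adjacent layers, with no connections inside a layer. The layer sizes and the upward pass below are illustrative assumptions, not part of any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical layer sizes: 6 visible units, then hidden layers of 4 and 3.
layer_sizes = [6, 4, 3]

# One weight matrix and bias vector per pair of adjacent layers;
# there are no connections between units within a layer.
weights = [rng.normal(scale=0.1, size=(m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def up_pass(v):
    """Propagate activation probabilities from the visible layer upward."""
    h = v
    for W, b in zip(weights, biases):
        h = sigmoid(h @ W + b)
    return h

v = rng.integers(0, 2, size=6).astype(float)  # a binary visible vector
top = up_pass(v)
```

Here each hidden layer's activation probabilities serve as the "data" for the layer above it, which is what makes greedy layer-wise RBM training of a DBN possible.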

Training
The training method for RBMs proposed by Geoffrey Hinton for use with training "Product of Experts" models is called contrastive divergence (CD). CD provides an approximation to the maximum likelihood method that would ideally be applied for learning the weights. In training a single RBM, weight updates are performed with gradient ascent via the following equation: w_{ij}(t+1) = w_{ij}(t) + \eta\frac{\partial \log(p(v))}{\partial w_{ij}}, where p(v) is the probability of a visible vector, given by p(v) = \frac{1}{Z}\sum_h e^{-E(v,h)}. Here Z is the partition function (used for normalizing) and E(v,h) is the energy function assigned to the state of the network. A lower energy indicates the network is in a more "desirable" configuration. The gradient \frac{\partial \log(p(v))}{\partial w_{ij}} has the simple form \langle v_ih_j\rangle_\text{data} - \langle v_ih_j\rangle_\text{model}, where \langle\cdots\rangle_p denotes an average with respect to distribution p. The difficulty lies in sampling \langle v_ih_j\rangle_\text{model}, because this requires extended alternating Gibbs sampling. CD replaces this step by running alternating Gibbs sampling for n steps (values of n = 1 perform well in practice). After n steps, the data are sampled and that sample is used in place of \langle v_ih_j\rangle_\text{model}. The CD procedure works as follows:

1. Initialize the visible units to a training vector.
2. Update the hidden units in parallel given the visible units: p(h_j = 1 \mid V) = \sigma(b_j + \sum_i v_iw_{ij}), where \sigma is the sigmoid function and b_j is the bias of h_j.
3. Update the visible units in parallel given the hidden units: p(v_i = 1 \mid H) = \sigma(a_i + \sum_j h_jw_{ij}), where a_i is the bias of v_i. This is called the "reconstruction" step.
4. Re-update the hidden units in parallel given the reconstructed visible units, using the same equation as in step 2.
5. Perform the weight update: \Delta w_{ij} \propto \langle v_ih_j\rangle_\text{data} - \langle v_ih_j\rangle_\text{reconstruction}.

Although the approximation of CD to maximum likelihood is crude (it does not follow the gradient of any function), it is empirically effective.

[Figure: a restricted Boltzmann machine (RBM) with fully connected visible and hidden units. Note there are no hidden-hidden or visible-visible connections.]
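The CD procedure above can be sketched in NumPy for a single binary RBM. This is a minimal CD-1 sketch under assumed layer sizes and learning rate, not a reference implementation; the variable names (W, a, b) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed sizes and learning rate for illustration.
n_visible, n_hidden, eta = 6, 4, 0.1
W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
a = np.zeros(n_visible)  # visible biases
b = np.zeros(n_hidden)   # hidden biases

def cd1_update(v0):
    """One CD-1 weight update for a single binary training vector v0."""
    # Positive phase: sample hidden units given the data (step 2).
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(n_hidden) < ph0).astype(float)
    # One step of alternating Gibbs sampling: reconstruct the visible
    # units (step 3), then re-update the hidden units (step 4).
    pv1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(n_visible) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # <v_i h_j>_data - <v_i h_j>_reconstruction approximates the
    # log-likelihood gradient (step 5).
    return eta * (np.outer(v0, ph0) - np.outer(v1, ph1))

# Train on a small synthetic binary dataset.
data = rng.integers(0, 2, size=(20, n_visible)).astype(float)
for epoch in range(10):
    for v0 in data:
        W += cd1_update(v0)
```

Using the activation probabilities (ph0, ph1) rather than sampled binary states in the outer products is a common variance-reduction choice; sampling them instead is also valid.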