CRFs are a type of discriminative undirected probabilistic graphical model, introduced by Lafferty, McCallum and Pereira.

==Inference==
For general graphs, the problem of exact inference in CRFs is intractable. However, there exist special cases for which exact inference is feasible:
• If the graph is a chain or a tree, message passing algorithms yield exact solutions. The algorithms used in these cases are analogous to the forward-backward and Viterbi algorithms for the case of HMMs (see the sketch after this list).
• If the CRF only contains pair-wise potentials and the energy is submodular, combinatorial min cut/max flow algorithms yield exact solutions.

If exact inference is impossible, several algorithms can be used to obtain approximate solutions. These include:
• Loopy belief propagation
• Alpha expansion
• Mean field inference
• Linear programming relaxations
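
As a concrete illustration of the chain case, the following is a minimal sketch of Viterbi-style MAP inference for a linear-chain CRF. It assumes the log-potentials (`unary`, `pairwise`) have already been computed from the feature functions and their weights; the names and shapes are illustrative assumptions, not a fixed API.

```python
import numpy as np

def viterbi(unary, pairwise):
    """Exact MAP inference for a linear-chain CRF.

    unary:    (T, K) log-potentials per position and label,
              assumed precomputed from the weighted feature functions.
    pairwise: (K, K) log-potentials per label transition.
    Returns the highest-scoring label sequence as a list of length T.
    """
    T, K = unary.shape
    score = np.empty((T, K))
    backptr = np.zeros((T, K), dtype=int)
    score[0] = unary[0]
    for t in range(1, T):
        # cand[i, j] = score of ending in label i at t-1 and moving to j
        cand = score[t - 1][:, None] + pairwise + unary[t]
        backptr[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0)
    # trace back-pointers from the best final label
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# toy usage: 4 positions, 3 labels, random potentials
rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(4, 3)), rng.normal(size=(3, 3))))
```

The dynamic program visits each position once and considers every label pair, so decoding a length-T chain with K labels costs O(TK²), which is why chains are among the tractable special cases.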
==Parameter learning==
Learning the parameters \theta is usually done by maximum likelihood learning for p(Y_i|X_i; \theta). If all nodes have exponential family distributions and all nodes are observed during training, this optimization is convex. It can be solved for example using gradient descent algorithms, or Quasi-Newton methods such as the L-BFGS algorithm. On the other hand, if some variables are unobserved, the inference problem has to be solved for these variables. Exact inference is intractable in general graphs, so approximations have to be used.
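
For a linear-chain CRF this recipe fits in a few lines. The sketch below is a toy under stated assumptions, not a reference implementation: it invents a simple log-linear parameterization (emission weights `W`, transition matrix `A`), computes the log partition function with the forward algorithm, and hands the objective to SciPy's L-BFGS-B optimizer, letting it approximate gradients by finite differences (practical implementations compute exact gradients with forward-backward).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

# Toy fully observed training sequence (shapes and data invented here).
rng = np.random.default_rng(0)
D, K, T = 3, 2, 8                      # feature dim, labels, sequence length
X = rng.normal(size=(T, D))            # observed inputs
Y = rng.integers(0, K, size=T)         # observed labels

def neg_log_likelihood(theta):
    W = theta[:D * K].reshape(D, K)    # per-label emission weights (assumed form)
    A = theta[D * K:].reshape(K, K)    # transition weights (assumed form)
    u = X @ W                          # (T, K) unary log-potentials
    # unnormalized log-score of the observed label sequence
    score = u[np.arange(T), Y].sum() + A[Y[:-1], Y[1:]].sum()
    # log partition function via the forward algorithm
    alpha = u[0]
    for t in range(1, T):
        alpha = logsumexp(alpha[:, None] + A, axis=0) + u[t]
    return logsumexp(alpha) - score

theta = minimize(neg_log_likelihood, np.zeros(D * K + K * K),
                 method="L-BFGS-B").x  # finite-difference gradients here;
print(neg_log_likelihood(theta))       # real code uses forward-backward
```

Since all labels are observed and the model is log-linear (exponential family), `neg_log_likelihood` is convex, so the optimizer reaches the global optimum from any starting point.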
==Examples==
In sequence modeling, the graph of interest is usually a chain graph. An input sequence of observed variables X represents a sequence of observations and Y represents a hidden (or unknown) state variable that needs to be inferred given the observations. The Y_{i} are structured to form a chain, with an edge between each Y_{i-1} and Y_{i}. As well as having a simple interpretation of the Y_{i} as "labels" for each element in the input sequence, this layout admits efficient algorithms for:
• model training, learning the conditional distributions between the Y_{i} and feature functions from some corpus of training data,
• decoding, determining the probability of a given label sequence Y given X,
• inference, determining the most likely label sequence Y given X.

The conditional dependency of each Y_{i} on X is defined through a fixed set of feature functions of the form f(i, Y_{i-1}, Y_{i}, X), which can be thought of as measurements on the input sequence that partially determine the likelihood of each possible value for Y_{i}. The model assigns each feature a numerical weight and combines them to determine the probability of a certain value for Y_{i}.
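
To make that combination concrete, here is a minimal sketch with two invented feature functions and weights; the brute-force normalizer stands in for the forward algorithm a real implementation would use:

```python
import numpy as np
from itertools import product

# Two made-up feature functions of the form f(i, y_prev, y, X):
def f_emit(i, y_prev, y, X):
    return float(X[i] > 0 and y == 1)   # fires when a positive input gets label 1

def f_trans(i, y_prev, y, X):
    return float(y_prev == y)           # fires when the label is repeated

features = [f_emit, f_trans]
weights = np.array([1.5, 0.8])          # the learned numerical weights

def score(Y, X):
    """Weighted combination of all feature functions along the chain."""
    return sum(w * f(i, Y[i - 1] if i > 0 else None, Y[i], X)
               for i in range(len(X)) for f, w in zip(features, weights))

def prob(Y, X, K=2):
    """P(Y|X): exponentiated score, normalized over all K**len(X) label
    sequences. Brute force is fine for a toy; real code uses the
    forward algorithm for the normalizer."""
    Z = sum(np.exp(score(list(y), X)) for y in product(range(K), repeat=len(X)))
    return np.exp(score(Y, X)) / Z

print(prob([1, 0, 1], X=[0.3, -1.2, 0.7]))
```

Note that `f_emit` inspects the input X directly while `f_trans` looks only at adjacent labels, and nothing constrains either to a probabilistic interpretation, which is the flexibility the comparison with HMMs below turns on.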
Linear-chain CRFs have many of the same applications as conceptually simpler hidden Markov models (HMMs), but relax certain assumptions about the input and output sequence distributions. An HMM can loosely be understood as a CRF with very specific feature functions that use constant probabilities to model state transitions and emissions. Conversely, a CRF can loosely be understood as a generalization of an HMM that makes the constant transition probabilities into arbitrary functions that vary across the positions in the sequence of hidden states, depending on the input sequence.

Notably, in contrast to HMMs, CRFs can contain any number of feature functions, the feature functions can inspect the entire input sequence X at any point during inference, and the range of the feature functions need not have a probabilistic interpretation.

==Variants==