In 1980, the first significant statistical language model was proposed, and during the decade IBM performed "Shannon-style" experiments, in which potential sources for language-modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.
== Models based on word n-grams ==

== Exponential ==

Maximum entropy language models encode the relationship between a word and the
n-gram history using feature functions. The equation is P(w_m \mid w_1,\ldots,w_{m-1}) = \frac{1}{Z(w_1,\ldots,w_{m-1})} \exp (a^T f(w_1,\ldots,w_m)) where Z(w_1,\ldots,w_{m-1}) is the
partition function, a is the parameter vector, and f(w_1,\ldots,w_m) is the feature function. In the simplest case, the feature function is just an indicator of the presence of a certain
n-gram. It is helpful to use a prior on a or some form of
regularization. The log-bilinear model is another example of an exponential language model.
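The exponential form above can be illustrated with a small sketch. The vocabulary, feature set, and parameter values below are invented for the example (a real model would learn the parameter vector from data); the sketch uses indicator features for n-gram presence, as in the simplest case described above, and computes the partition function Z by summing over the vocabulary.

```python
import math

# Toy vocabulary; purely illustrative.
vocab = ["the", "cat", "sat", "mat"]

def features(history, word):
    """f(w_1, ..., w_m): indicator features for n-gram presence.

    Here, one unigram feature per word and one bigram feature for
    (last history word, current word)."""
    feats = {("unigram", word): 1.0}
    if history:
        feats[("bigram", history[-1], word)] = 1.0
    return feats

# Hypothetical parameter vector a, stored sparsely; unlisted features
# have weight 0. In practice these weights are learned, typically with
# a prior or other regularization as noted above.
a = {
    ("bigram", "the", "cat"): 2.0,
    ("bigram", "cat", "sat"): 1.5,
    ("unigram", "the"): 0.5,
}

def score(history, word):
    # a^T f(w_1, ..., w_m)
    return sum(a.get(k, 0.0) * v for k, v in features(history, word).items())

def prob(history, word):
    # P(w_m | w_1, ..., w_{m-1}) = exp(a^T f) / Z(w_1, ..., w_{m-1})
    z = sum(math.exp(score(history, w)) for w in vocab)  # partition function
    return math.exp(score(history, word)) / z

# Distribution over the next word given the history ["the"].
dist = {w: prob(["the"], w) for w in vocab}
```

With these made-up weights, the bigram feature ("the", "cat") dominates, so "cat" receives the highest probability after "the", and the probabilities sum to 1 by construction of Z.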
== Skip-gram model ==

== Neural models ==