To choose a value for
n in an
n-gram model, it is necessary to find the right trade-off between the stability of the estimate and its appropriateness. This means that a trigram model (i.e., one over triplets of words) is a common choice with large training corpora (millions of words), whereas a bigram model is often used with smaller ones.
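For concreteness, the following is a minimal sketch of estimating n-gram probabilities directly from frequency counts; the toy corpus and helper names are invented for illustration and are not from the article. Even on this tiny corpus, most trigram counts are already ones or zeros, which is the instability that motivates lower-order models on small data and the smoothing techniques discussed next.

<syntaxhighlight lang="python">
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (tuple of n consecutive tokens)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_prob(ngram, counts_n, counts_context):
    """Maximum-likelihood estimate P(w_n | w_1..w_{n-1}) = C(w_1..w_n) / C(w_1..w_{n-1})."""
    context = ngram[:-1]
    if counts_context[context] == 0:
        return 0.0  # context never seen: no estimate without smoothing
    return counts_n[ngram] / counts_context[context]

tokens = "the cat sat on the mat the dog sat on the log".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

print(mle_prob(("the", "cat"), bigrams, unigrams))        # 0.25 (1 of 4 occurrences of "the")
print(mle_prob(("on", "the", "mat"), trigrams, bigrams))  # 0.5 (trigram seen once, context twice)
print(mle_prob(("the", "fish"), bigrams, unigrams))       # 0.0: the zero-frequency problem
</syntaxhighlight>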
== Smoothing techniques ==

There is a problem of balancing the weight given to
infrequent grams (for example, if a proper name appeared in the training data) and
frequent grams. Also, items not seen in the training data will be given a
probability of 0.0 without
smoothing. For unseen but plausible data from a sample, one can introduce
pseudocounts. Pseudocounts are generally motivated on Bayesian grounds. In practice it was necessary to
smooth the probability distributions by also assigning non-zero probabilities to unseen words or
n-grams. The reason is that models derived directly from the
n-gram frequency counts have severe problems when confronted with any
n-grams that have not explicitly been seen before –
the zero-frequency problem. Various smoothing methods were used, from simple "add-one" (Laplace) smoothing (assign a count of 1 to unseen
n-grams; see
Rule of succession) to more sophisticated models, such as
Good–Turing discounting or
back-off models. Some of these methods are equivalent to assigning a
prior distribution to the probabilities of the
n-grams and using
Bayesian inference to compute the resulting
posterior n-gram probabilities. However, the more sophisticated smoothing models were typically not derived in this fashion, but instead through independent considerations. Commonly used smoothing methods include:

• Linear interpolation (e.g., taking the weighted mean of the unigram, bigram, and trigram probabilities)
• Good–Turing discounting
• Witten–Bell discounting
• Lidstone's smoothing
• Katz's back-off model (trigram)
• Kneser–Ney smoothing
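As an illustration of the simplest of these ideas, the following is a minimal sketch of add-one (Laplace) smoothing over bigrams, together with a simplified two-way linear interpolation; the toy corpus, helper names, and interpolation weights are invented for illustration and are not taken from any particular library.

<syntaxhighlight lang="python">
from collections import Counter

def laplace_bigram_prob(w1, w2, bigrams, unigrams, vocab_size):
    """Add-one smoothed estimate P(w2 | w1) = (C(w1 w2) + 1) / (C(w1) + V).

    This amounts to giving every possible bigram a pseudocount of 1 (the posterior
    mean under a uniform Dirichlet prior), so unseen bigrams no longer get 0.0."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[(w1,)] + vocab_size)

def interpolated_prob(w1, w2, bigrams, unigrams, total, lambdas=(0.7, 0.3)):
    """Two-way linear interpolation of bigram and unigram maximum-likelihood
    estimates (the weights are arbitrary placeholders; in practice they are tuned)."""
    p_bigram = bigrams[(w1, w2)] / unigrams[(w1,)] if unigrams[(w1,)] else 0.0
    p_unigram = unigrams[(w2,)] / total
    return lambdas[0] * p_bigram + lambdas[1] * p_unigram

tokens = "the cat sat on the mat the dog sat on the log".split()
unigrams = Counter((w,) for w in tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)  # vocabulary size

print(laplace_bigram_prob("the", "cat", bigrams, unigrams, V))   # seen bigram: 2/11
print(laplace_bigram_prob("the", "fish", bigrams, unigrams, V))  # unseen bigram: 1/11, not 0.0
print(interpolated_prob("the", "dog", bigrams, unigrams, len(tokens)))  # 0.7*0.25 + 0.3*(1/12)
</syntaxhighlight>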
== Skip-gram language model ==

A skip-gram language model is an attempt at overcoming the data sparsity problem that the preceding model (i.e., the word n-gram language model) faced. Words represented in an embedding vector were no longer necessarily consecutive, but could leave gaps that are skipped over (thus the name "skip-gram"). Formally, a k-skip-n-gram is a length-n subsequence whose components occur at distance at most k from each other.
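A minimal sketch of extracting such skip-grams is given below; the function name and the reading of "distance" as the number of skipped intervening words are assumptions made for illustration. Applied to the example sentence discussed next, it yields every ordinary bigram plus the seven additional pairs listed there.

<syntaxhighlight lang="python">
from itertools import combinations

def skip_grams(tokens, n, k):
    """Enumerate k-skip-n-grams: length-n subsequences of the token list in which
    consecutive chosen words are separated by at most k skipped words."""
    results = []
    for idx in combinations(range(len(tokens)), n):
        if all(idx[j + 1] - idx[j] <= k + 1 for j in range(n - 1)):
            results.append(tuple(tokens[i] for i in idx))
    return results

sentence = "the rain in Spain falls mainly on the plain".split()

# 1-skip-2-grams: all bigrams plus every pair with exactly one word skipped,
# e.g. ('the', 'in'), ('rain', 'Spain'), ..., ('on', 'plain').
for gram in skip_grams(sentence, n=2, k=1):
    print(" ".join(gram))
</syntaxhighlight>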
For example, in the input text:

: the rain in Spain falls mainly on the plain

the set of 1-skip-2-grams includes all the bigrams (2-grams) and, in addition, the subsequences
the in,
rain Spain,
in falls,
Spain mainly,
falls on,
mainly the, and
on plain.

In the skip-gram model, semantic relations between words are represented by
linear combinations, capturing a form of
compositionality. For example, in some such models, if v is the function that maps a word w to its n-dimensional vector representation, then v(\mathrm{king}) - v(\mathrm{male}) + v(\mathrm{female}) \approx v(\mathrm{queen}), where ≈ is made precise by stipulating that its right-hand side must be the
nearest neighbor of the value of the left-hand side.
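This nearest-neighbor reading can be made concrete with a small sketch; the two-dimensional vectors below are invented purely for illustration (real skip-gram embeddings are learned and much higher-dimensional).

<syntaxhighlight lang="python">
import numpy as np

# Invented toy embeddings, chosen only so that the analogy works out exactly.
vectors = {
    "king":   np.array([0.9, 0.8]),
    "queen":  np.array([0.9, 0.2]),
    "male":   np.array([0.1, 0.7]),
    "female": np.array([0.1, 0.1]),
    "plain":  np.array([0.5, 0.5]),
}

def nearest(target, vectors, exclude=()):
    """Return the vocabulary word whose vector is closest (Euclidean distance) to target."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))

# v(king) - v(male) + v(female) should land nearest to v(queen);
# the query words themselves are conventionally excluded from the candidates.
analogy = vectors["king"] - vectors["male"] + vectors["female"]
print(nearest(analogy, vectors, exclude={"king", "male", "female"}))  # queen
</syntaxhighlight>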
== Syntactic n-grams ==