===The Brown Corpus===
Research on part-of-speech tagging has been closely tied to corpus linguistics. The first major corpus of English for computer analysis was the Brown Corpus, developed at Brown University by Henry Kučera and W. Nelson Francis in the mid-1960s. It consists of about 1,000,000 words of running English prose text, made up of 500 samples from randomly chosen publications. Each sample is 2,000 or more words (ending at the first sentence-end after 2,000 words, so that the corpus contains only complete sentences).

The Brown Corpus was painstakingly "tagged" with part-of-speech markers over many years. A first approximation was done with a program by Greene and Rubin, which consisted of a huge handmade list of which categories could co-occur at all. For example, article then noun can occur, but article then verb (arguably) cannot. The program got about 70% of the tags correct. Its results were repeatedly reviewed and corrected by hand, and later users sent in errata, so that by the late 1970s the tagging was nearly perfect (allowing for some cases on which even human speakers might not agree). This corpus has been used for innumerable studies of word frequency and of part of speech, and inspired the development of similar "tagged" corpora in many other languages. Statistics derived by analyzing it formed the basis for most later part-of-speech tagging systems, such as CLAWS and VOLSUNGA. However, by the mid-2000s it had been superseded by larger corpora such as the 100-million-word British National Corpus, even though larger corpora are rarely so thoroughly curated.

For some time, part-of-speech tagging was considered an inseparable part of natural language processing, because there are certain cases where the correct part of speech cannot be decided without understanding the semantics or even the pragmatics of the context. This is extremely expensive, especially because analyzing the higher levels is much harder when multiple part-of-speech possibilities must be considered for each word.
===Use of hidden Markov models===
In the mid-1980s, researchers in Europe began to use hidden Markov models (HMMs) to disambiguate parts of speech, when working to tag the Lancaster-Oslo-Bergen Corpus of British English. HMMs involve counting cases (such as from the Brown Corpus) and making a table of the probabilities of certain sequences. For example, once you've seen an article such as 'the', perhaps the next word is a noun 40% of the time, an adjective 40%, and a number 20%. Knowing this, a program can decide that "can" in "the can" is far more likely to be a noun than a verb or a modal. The same method can, of course, be used to benefit from knowledge about the following words. More advanced ("higher-order") HMMs learn the probabilities not only of pairs but of triples or even larger sequences. So, for example, if you've just seen a noun followed by a verb, the next item may very likely be a preposition, article, or noun, but is much less likely to be another verb. When several ambiguous words occur together, the possibilities multiply. However, it is easy to enumerate every combination and to assign a relative probability to each one, by multiplying together the probabilities of each choice in turn. The combination with the highest probability is then chosen. The European group developed CLAWS, a tagging program that did exactly this and achieved accuracy in the 93–95% range.
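As a concrete illustration, the following Python sketch enumerates every tag combination for a short phrase and multiplies the probability of each choice in turn, as described above. All probabilities here are made-up toy numbers, not figures from the Brown Corpus or from CLAWS; a real tagger would estimate them by counting tag sequences in a tagged corpus.

<syntaxhighlight lang="python">
from itertools import product

# Toy lexicon: candidate tags for each word with made-up P(word | tag).
lexicon = {
    "the":   {"DET": 1.0},
    "can":   {"NOUN": 0.4, "VERB": 0.3, "MODAL": 0.3},
    "rusts": {"VERB": 0.8, "NOUN": 0.2},
}

# Toy first-order transition table P(tag | previous tag); "<s>" marks sentence start.
transitions = {
    ("<s>", "DET"): 0.5, ("<s>", "NOUN"): 0.3, ("<s>", "MODAL"): 0.1,
    ("DET", "NOUN"): 0.6, ("DET", "VERB"): 0.01, ("DET", "MODAL"): 0.01,
    ("NOUN", "VERB"): 0.5, ("NOUN", "NOUN"): 0.2,
    ("MODAL", "VERB"): 0.7, ("VERB", "NOUN"): 0.3,
}

def brute_force_tag(words):
    """Score every possible tag combination and return the most probable one."""
    candidates = [list(lexicon[w]) for w in words]
    best_seq, best_p = None, 0.0
    for seq in product(*candidates):               # every tag combination
        p, prev = 1.0, "<s>"
        for word, tag in zip(words, seq):
            # multiply the probability of each choice in turn
            p *= transitions.get((prev, tag), 1e-6) * lexicon[word][tag]
            prev = tag
        if p > best_p:
            best_seq, best_p = seq, p
    return best_seq, best_p

print(brute_force_tag(["the", "can", "rusts"]))
# -> (('DET', 'NOUN', 'VERB'), ~0.048) with these toy numbers
</syntaxhighlight>

Because the number of combinations grows multiplicatively with each ambiguous word, this brute-force enumeration quickly becomes expensive, which is the problem the dynamic programming methods described below address.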
Eugene Charniak points out in Statistical techniques for natural language parsing (1997) that merely assigning the most common tag to each known word and the tag "proper noun" to all unknowns will approach 90% accuracy, because many words are unambiguous, and many others only rarely represent their less common parts of speech. CLAWS pioneered the field of HMM-based part-of-speech tagging but was quite expensive, since it enumerated all possibilities. It sometimes had to resort to backup methods when there were simply too many options (the Brown Corpus contains a case with 17 ambiguous words in a row, and there are words such as "still" that can represent as many as 7 distinct parts of speech). HMMs underlie the functioning of stochastic taggers and are used in various algorithms, one of the most widely used being the bi-directional inference algorithm.
===Dynamic programming methods===
In 1987, Steven DeRose and Kenneth W. Church independently developed dynamic programming algorithms to solve the same problem in vastly less time. Their methods were similar to the Viterbi algorithm, known for some time in other fields. DeRose used a table of pairs, while Church used a table of triples and a method of estimating the values for triples that were rare or nonexistent in the Brown Corpus (an actual measurement of triple probabilities would require a much larger corpus). Both methods achieved an accuracy of over 95%. DeRose's 1990 dissertation at Brown University included analyses of the specific error types, probabilities, and other related data, and replicated his work for Greek, where it proved similarly effective.

These findings were surprisingly disruptive to the field of natural language processing. The accuracy reported was higher than the typical accuracy of very sophisticated algorithms that integrated part-of-speech choice with many higher levels of linguistic analysis: syntax, morphology, semantics, and so on. CLAWS, DeRose's, and Church's methods did fail for some of the known cases where semantics is required, but those proved negligibly rare. This convinced many in the field that part-of-speech tagging could usefully be separated from the other levels of processing; this, in turn, simplified the theory and practice of computerized language analysis and encouraged researchers to find ways to separate other pieces as well. Markov models became the standard method for part-of-speech assignment.
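A minimal sketch of the dynamic-programming idea (essentially the Viterbi algorithm, not DeRose's or Church's exact implementations) is shown below. Instead of scoring every combination, it keeps only the best-scoring path ending in each tag at each position. The probability tables are again toy values chosen for illustration.

<syntaxhighlight lang="python">
def viterbi(words, start_p, trans_p, emit_p):
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {t: (start_p.get(t, 1e-6) * p_w, [t]) for t, p_w in emit_p[words[0]].items()}
    for word in words[1:]:
        new_best = {}
        for t, p_w in emit_p[word].items():        # candidate tags for this word
            new_best[t] = max(
                ((p * trans_p.get((prev, t), 1e-6) * p_w, path + [t])
                 for prev, (p, path) in best.items()),
                key=lambda x: x[0])
        best = new_best
    return max(best.values(), key=lambda x: x[0])

# Toy model: P(word | tag) per candidate tag, P(tag | previous tag), start probabilities.
emit_p = {"the": {"DET": 1.0},
          "can": {"NOUN": 0.4, "VERB": 0.3, "MODAL": 0.3},
          "rusts": {"VERB": 0.8, "NOUN": 0.2}}
trans_p = {("DET", "NOUN"): 0.6, ("NOUN", "VERB"): 0.5, ("NOUN", "NOUN"): 0.2,
           ("MODAL", "VERB"): 0.7, ("VERB", "NOUN"): 0.3}
start_p = {"DET": 0.5, "NOUN": 0.3, "MODAL": 0.1, "VERB": 0.1}

print(viterbi(["the", "can", "rusts"], start_p, trans_p, emit_p))
# -> (~0.048, ['DET', 'NOUN', 'VERB']) with these toy numbers
</syntaxhighlight>

For a sentence of n words with at most k candidate tags per word, this needs on the order of n·k² products instead of the kⁿ needed to score every combination, which is why these methods ran in vastly less time.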
===Unsupervised taggers===
The methods already discussed involve working from a pre-existing corpus to learn tag probabilities. It is, however, also possible to bootstrap using "unsupervised" tagging. Unsupervised tagging techniques use an untagged corpus for their training data and produce the tagset by induction. That is, they observe patterns in word use and derive part-of-speech categories themselves. For example, statistics readily reveal that "the", "a", and "an" occur in similar contexts, while "eat" occurs in very different ones. With sufficient iteration, similarity classes of words emerge that are remarkably similar to those human linguists would expect, and the differences themselves sometimes suggest valuable new insights. Both supervised and unsupervised approaches can be further subdivided into rule-based, stochastic, and neural methods.
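One simple way to see the kind of evidence such methods exploit is to compare the contexts in which words occur. The sketch below builds context-count profiles from a tiny untagged corpus and compares them with cosine similarity; the corpus and the window size are made up for illustration, and real systems iterate over far more data and then cluster the resulting profiles into induced categories.

<syntaxhighlight lang="python">
from collections import Counter
from math import sqrt

def context_profiles(sentences, window=1):
    """Count the words that appear immediately around each word (its 'context')."""
    profiles = {}
    for sent in sentences:
        for i, w in enumerate(sent):
            ctx = profiles.setdefault(w, Counter())
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    ctx[sent[j]] += 1
    return profiles

def cosine(a, b):
    """Cosine similarity between two context-count profiles."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy untagged corpus: no tags anywhere, only raw word sequences.
corpus = [["the", "dog", "ate", "the", "bone"],
          ["a", "dog", "ate", "a", "bone"],
          ["dogs", "eat", "bones"]]
profiles = context_profiles(corpus)
print(cosine(profiles["the"], profiles["a"]))    # -> 1.0 here: identical contexts in this toy corpus
print(cosine(profiles["the"], profiles["eat"]))  # -> 0.0 here: no shared context words
</syntaxhighlight>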
===Other taggers and methods===
Some current major algorithms for part-of-speech tagging include the Viterbi algorithm, Brill tagger, Constraint Grammar, and the Baum-Welch algorithm (also known as the forward-backward algorithm). Hidden Markov model and visible Markov model taggers can both be implemented using the Viterbi algorithm. The rule-based Brill tagger is unusual in that it learns a set of rule patterns and then applies those patterns rather than optimizing a statistical quantity.
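The following sketch shows what applying Brill-style rule patterns looks like. The two rules and the small lexicon of most-frequent tags are hand-written toy examples; the actual Brill tagger induces its rules automatically by repeatedly choosing the transformation that corrects the most errors against a training corpus.

<syntaxhighlight lang="python">
# Each rule reads: change old_tag to new_tag when the previous word's tag is prev_tag.
# These rules are toy examples; the real Brill tagger learns such patterns from data.
RULES = [
    ("MD", "NN", "DET"),   # modal -> noun after a determiner ("the can")
    ("NN", "VB", "MD"),    # noun -> verb after a modal ("can fish")
]

def initial_tags(words, most_common_tag, default="NN"):
    """Start from each word's most frequent tag (the usual Brill initial state)."""
    return [most_common_tag.get(w, default) for w in words]

def apply_rules(words, tags, rules):
    """Apply each transformation rule, in order, across the whole sequence."""
    tags = list(tags)
    for old, new, prev in rules:
        for i in range(1, len(tags)):
            if tags[i] == old and tags[i - 1] == prev:
                tags[i] = new
    return tags

# Toy lexicon of most frequent tags, made up for illustration.
most_common = {"the": "DET", "can": "MD", "fish": "NN", "he": "PRP"}

words = ["the", "can", "rusts"]
print(apply_rules(words, initial_tags(words, most_common), RULES))
# -> ['DET', 'NN', 'NN']: the first rule re-tags "can" from MD to NN after "the"

words = ["he", "can", "fish"]
print(apply_rules(words, initial_tags(words, most_common), RULES))
# -> ['PRP', 'MD', 'VB']: here "can" stays a modal and "fish" becomes a verb
</syntaxhighlight>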
Many machine learning methods have also been applied to the problem of POS tagging. Methods such as SVM, maximum entropy classifier, perceptron, and nearest-neighbor have all been tried, and most can achieve accuracy above 95%. A direct comparison of several methods is reported (with references) at the ACL Wiki. This comparison uses the Penn tag set on some of the Penn Treebank data, so the results are directly comparable. However, many significant taggers are not included (perhaps because of the labor involved in reconfiguring them for this particular dataset). Thus, it should not be assumed that the results reported here are the best that can be achieved with a given approach, nor even the best that have been achieved with a given approach.
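As one illustration of the classifier-based approaches mentioned above, the sketch below trains a bare-bones perceptron tagger on a three-sentence toy corpus. The feature set, the training data, and the tag names are all invented for illustration; real systems use much richer features and train on a full treebank.

<syntaxhighlight lang="python">
from collections import defaultdict

def features(words, i, prev_tag):
    """A tiny feature set: the word itself, its last two letters, and the previous tag."""
    w = words[i]
    return [f"word={w.lower()}", f"suffix={w[-2:].lower()}", f"prev_tag={prev_tag}"]

class PerceptronTagger:
    def __init__(self):
        self.weights = defaultdict(lambda: defaultdict(float))  # feature -> tag -> weight
        self.tags = set()

    def predict(self, feats):
        scores = defaultdict(float)
        for f in feats:
            for tag, w in self.weights[f].items():
                scores[tag] += w
        return max(self.tags, key=lambda t: scores[t])

    def train(self, tagged_sentences, epochs=10):
        for sent in tagged_sentences:
            for _, tag in sent:
                self.tags.add(tag)
        for _ in range(epochs):
            for sent in tagged_sentences:
                words = [w for w, _ in sent]
                prev = "<s>"
                for i, (_, gold) in enumerate(sent):
                    feats = features(words, i, prev)
                    guess = self.predict(feats)
                    if guess != gold:              # classic perceptron update
                        for f in feats:
                            self.weights[f][gold] += 1.0
                            self.weights[f][guess] -= 1.0
                    prev = gold                    # train with the gold previous tag

    def tag(self, words):
        out, prev = [], "<s>"
        for i in range(len(words)):
            t = self.predict(features(words, i, prev))
            out.append((words[i], t))
            prev = t
        return out

# Toy training data, made up for illustration.
train = [[("the", "DET"), ("can", "NN"), ("rusts", "VBZ")],
         [("he", "PRP"), ("can", "MD"), ("fish", "VB")],
         [("the", "DET"), ("fish", "NN"), ("swims", "VBZ")]]
tagger = PerceptronTagger()
tagger.train(train)
print(tagger.tag(["he", "can", "fish"]))
# Expected [('he', 'PRP'), ('can', 'MD'), ('fish', 'VB')] once training converges:
# the prev_tag feature lets the tagger treat "can" differently after "he" than after "the".
</syntaxhighlight>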
In 2014, a paper reported using the structure regularization method for part-of-speech tagging, achieving 97.36% on a standard benchmark dataset.

==See also==