Comparisons of methods can be divided into two groups according to the number of words tested. The groups differ in the amount of analysis and processing required:
• The all-words task requires disambiguating every word in a text.
• The lexical sample task requires disambiguating only a set of previously chosen target words.

The all-words task is assumed to be the more realistic form of evaluation, although checking its results is very laborious. Initially only the lexical sample task was used in evaluation; the all-words task was added later.

Lexical sample organizers had to choose the samples on which the systems were to be tested. A criticism of earlier forays into lexical-sample WSD evaluation is that the lexical sample had been chosen according to the whim of the experimenter (or to coincide with earlier experimenters' selections). For English Senseval, a sampling frame was devised in which words were classified according to their frequency (in the BNC) and their polysemy level (in WordNet). Whether to include the POS-tagging problem was also a matter of discussion, and it was decided that the samples should consist of words with a known part of speech plus some indeterminates (for example, 15 noun tasks, 13 verb tasks, 8 adjectives, and 5 indeterminates).

For comparison purposes, known yet simple algorithms called baselines are used. These include different variants of
the Lesk algorithm and the most frequent sense algorithm.

== Evaluation measures ==
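
Because system results are compared against baselines, it may help to see how simple such a baseline can be. The following is a minimal sketch of a most frequent sense baseline, not the procedure used in Senseval. It assumes that sense frequencies are counted from a small sense-tagged sample; the function names `train_mfs` and `predict_mfs`, the sense labels, and the toy training pairs are all hypothetical.

```python
from collections import Counter, defaultdict


def train_mfs(labeled):
    """Count sense labels per target word and keep the most frequent one.

    `labeled` is an iterable of (word, sense) pairs from a sense-tagged sample.
    """
    counts = defaultdict(Counter)
    for word, sense in labeled:
        counts[word][sense] += 1
    # For each word, keep the sense label seen most often in the sample.
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def predict_mfs(model, word, default=None):
    """Return the stored most frequent sense, ignoring the context entirely."""
    return model.get(word, default)


# Hypothetical toy data: (target word, sense label) pairs.
TRAIN = [
    ("bank", "bank%finance"),
    ("bank", "bank%finance"),
    ("bank", "bank%river"),
    ("bass", "bass%fish"),
    ("bass", "bass%music"),
    ("bass", "bass%music"),
]

MODEL = train_mfs(TRAIN)
print(predict_mfs(MODEL, "bank"))
print(predict_mfs(MODEL, "bass"))
```

The baseline assigns the same sense to every occurrence of a word regardless of context, which is why it serves as a floor that a context-sensitive system is expected to exceed.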