Vocabulary mismatch between user-created queries and the relevant documents in a corpus causes the term mismatch problem in information retrieval. Zhao and Callan (2010) were perhaps the first to quantitatively study the vocabulary mismatch problem in a retrieval setting. Their results show that an average query term fails to appear in 30-40% of the documents that are relevant to the user query. They also showed that this probability of mismatch is a central probability in one of the fundamental probabilistic retrieval models, the Binary Independence Model. They developed novel term weight prediction methods that can potentially lead to 50-80% accuracy gains in retrieval over strong keyword retrieval models. Further research along this line shows that expert users can use Boolean conjunctive normal form expansion to improve retrieval performance by 50-300% over unexpanded keyword queries.
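As an illustration only, the term mismatch probability described above can be estimated for a query term t as the fraction of relevant documents that do not contain t, that is, 1 - P(t | R). The following minimal Python sketch computes this on a small invented set of relevant documents, and also shows how a Boolean conjunctive normal form query (an AND of OR-clauses) can match relevant documents that a plain keyword conjunction misses. The documents, the whitespace tokenizer, the expansion terms, and the function names are all assumptions made for this example; they are not taken from Zhao and Callan's work.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return set(text.lower().split())


def term_mismatch(term, relevant_docs):
    """Fraction of relevant documents that do NOT contain `term`: 1 - P(t | R)."""
    missing = sum(1 for doc in relevant_docs if term not in tokenize(doc))
    return missing / len(relevant_docs)


def matches_cnf(cnf_query, doc):
    """True if every clause (AND) has at least one term (OR) present in `doc`."""
    tokens = tokenize(doc)
    return all(any(term in tokens for term in clause) for clause in cnf_query)


# Hypothetical documents, all assumed relevant to the query "car insurance".
relevant_docs = [
    "car insurance rates rise",
    "automobile insurance premiums climb",
    "vehicle coverage costs for car owners",
    "auto insurance price comparison",
]

# Per-term mismatch probability over the relevant set.
mismatch = {t: term_mismatch(t, relevant_docs) for t in ["car", "insurance"]}

# Unexpanded keyword query: each clause holds a single term.
keyword_query = [["car"], ["insurance"]]

# Manually expanded CNF query: each clause lists alternatives for one concept.
expanded_query = [
    ["car", "automobile", "auto", "vehicle"],
    ["insurance", "coverage"],
]

keyword_hits = sum(matches_cnf(keyword_query, d) for d in relevant_docs)
expanded_hits = sum(matches_cnf(expanded_query, d) for d in relevant_docs)

print(mismatch)
print(keyword_hits, expanded_hits)
```

In this toy collection the keyword conjunction retrieves only one of the four relevant documents, while the expanded query retrieves all four. The sketch measures matching on relevant documents only; it says nothing about precision, which a real evaluation would also need to consider.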
== Mitigation techniques ==
• Full-text indexing instead of only indexing keywords or abstracts
• Use of controlled vocabularies in both indexing and retrieval, such as taxonomies or ontologies
• Indexing text on inbound links from other documents (or other social tagging)
• Query expansion. Query expansion may be interactive, meaning the user can choose related words, or automatic, meaning the retrieval system adds extra words to the query without user input. A 2012 study by Zhao and Callan using expert-created manual conjunctive normal form queries showed that searchonym expansion in Boolean conjunctive normal form is much more effective than traditional bag-of-words expansion, e.g. Rocchio expansion.
• Translation-based models
== Other contexts ==