Okapi BM25

In information retrieval, Okapi BM25 is a ranking function used by search engines to estimate the relevance of documents to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others.

The ranking function
BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document. It is a family of scoring functions with slightly different components and parameters. One of the most prominent instantiations of the function is as follows.

Given a query Q, containing keywords q_1, ..., q_n, the BM25 score of a document D is:

: \text{score}(D,Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}

where f(q_i, D) is the number of times that the keyword q_i occurs in the document D, |D| is the length of the document D in words, and \text{avgdl} is the average document length in the text collection from which documents are drawn. k_1 and b are free parameters, usually chosen, in the absence of an advanced optimization, as k_1 \in [1.2, 2.0] and b = 0.75. \text{IDF}(q_i) is the IDF (inverse document frequency) weight of the query term q_i. It is usually computed as:

: \text{IDF}(q_i) = \ln \left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)

where N is the total number of documents in the collection, and n(q_i) is the number of documents containing q_i.

There are several interpretations for IDF and slight variations on its formula. In the original BM25 derivation, the IDF component is derived from the Binary Independence Model.
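The scoring and IDF formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `idf` and `bm25_score`, the toy `corpus`, and the assumption that documents arrive as pre-tokenized lists of lowercase words are all hypothetical. The defaults k_1 = 1.2 and b = 0.75 are taken from the usual ranges quoted above.

```python
import math
from collections import Counter


def idf(term, docs):
    """IDF weight: ln((N - n(q) + 0.5) / (n(q) + 0.5) + 1)."""
    N = len(docs)                                  # total number of documents
    n_q = sum(1 for d in docs if term in d)        # documents containing the term
    return math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)


def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """BM25 score of one tokenized document for a tokenized query."""
    avgdl = sum(len(d) for d in docs) / len(docs)  # average document length
    tf = Counter(doc)                              # f(q_i, D) for every term in D
    score = 0.0
    for q in query:
        f = tf[q]
        # Length normalization: 1 - b + b * |D| / avgdl
        norm = 1 - b + b * len(doc) / avgdl
        score += idf(q, docs) * f * (k1 + 1) / (f + k1 * norm)
    return score


# Hypothetical toy collection, already split into words.
corpus = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog"],
    ["quick", "quick", "fox", "jumps", "over", "the", "dog"],
]
query = ["quick", "fox"]

scores = [bm25_score(query, doc, corpus) for doc in corpus]
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
```

The second document contains neither query term, so every summand is zero and its score is zero; the other two receive positive scores whose order depends on k_1 and b. Recomputing `avgdl` and the document frequencies on each call keeps the sketch short; a real index would precompute them.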
IDF information theoretic interpretation
Here is an interpretation from information theory. Suppose a query term q appears in n(q) documents. Then a randomly picked document D will contain the term with probability \frac{n(q)}{N} (where N is again the cardinality of the set of documents in the collection). Therefore, the information content of the message "D contains q" is:

: -\log \frac{n(q)}{N} = \log \frac{N}{n(q)}.

Now suppose we have two query terms q_1 and q_2. If the two terms occur in documents entirely independently of each other, then the probability of seeing both q_1 and q_2 in a randomly picked document D is:

: \frac{n(q_1)}{N} \cdot \frac{n(q_2)}{N},

and the information content of such an event is:

: \sum_{i=1}^{2} \log \frac{N}{n(q_i)}.

With a small variation, this is exactly what is expressed by the IDF component of BM25.
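A small numeric check may make the "small variation" concrete. The sketch below uses hypothetical counts (N = 1000, n(q_1) = 10, n(q_2) = 100) and hypothetical function names. It relies on the identity \frac{N - n + 0.5}{n + 0.5} + 1 = \frac{N + 1}{n + 0.5}, so the usual BM25 IDF is \ln \frac{N + 1}{n + 0.5}, which is the information content \ln \frac{N}{n} with smoothed counts.

```python
import math


def information_content(n_q, N):
    """Self-information (in nats) of the event 'D contains q': log(N / n(q))."""
    return math.log(N / n_q)


def bm25_idf(n_q, N):
    """Usual BM25 IDF; algebraically equal to log((N + 1) / (n(q) + 0.5))."""
    return math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)


# Hypothetical collection statistics, chosen only for illustration.
N = 1000
n_q1, n_q2 = 10, 100

# Under independence, the probability of seeing both terms is the product
# of the individual probabilities, so the information contents add up.
joint_probability = (n_q1 / N) * (n_q2 / N)
joint_information = -math.log(joint_probability)
summed_information = information_content(n_q1, N) + information_content(n_q2, N)

# Gap between the information-theoretic weight and the BM25 IDF for q_1.
gap = information_content(n_q1, N) - bm25_idf(n_q1, N)
```

Here `joint_information` and `summed_information` agree up to floating-point rounding, and `gap` is small compared with either weight. The gap grows for terms that appear in very few documents, where the 0.5 smoothing matters most.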
Modifications
• At the extreme values of the coefficient b, BM25 turns into the ranking functions known as BM11 (for b = 1) and BM15 (for b = 0).
• BM25F (or the BM25 model with Extension to Multiple Weighted Fields) is a modification of BM25 in which the document is considered to be composed of several fields (such as headlines, main text, anchor text) with possibly different degrees of importance, term relevance saturation and length normalization. BM25F defines each type of field as a stream, applying a per-stream weighting to scale each stream against the calculated score.
• BM25+ is an extension of BM25. BM25+ was developed to address one deficiency of the standard BM25 in which the component of term frequency normalization by document length is not properly lower-bounded; as a result of this deficiency, long documents which do match the query term can often be scored unfairly by BM25 as having a similar relevancy to shorter documents that do not contain the query term at all. Compared with BM25, the scoring formula of BM25+ has only one additional free parameter \delta (the default value is 1.0 in the absence of training data):
: \text{score}(D,Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \left[ \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)} + \delta \right]
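The effect of b and \delta on a single term's contribution can be illustrated with a short sketch. The function name `term_component` and all numbers are hypothetical, and \delta = 1.0 is used as the commonly cited default. One assumption is worth stating loudly: \delta is added only for query terms that actually occur in the document. The stated motivation (separating matching long documents from non-matching ones) requires this, because a constant added to every document would not change the ranking.

```python
def term_component(f, doc_len, avgdl, k1=1.2, b=0.75, delta=0.0):
    """Term-frequency part of the score for one query term (IDF left out).

    delta=0.0 gives plain BM25; delta>0 gives a BM25+-style lower bound.
    Assumption: delta applies only when the term occurs in the document.
    """
    if f == 0:
        return 0.0
    norm = 1 - b + b * doc_len / avgdl   # b=0 -> BM15, b=1 -> BM11
    return f * (k1 + 1) / (f + k1 * norm) + delta


avgdl = 100  # hypothetical average document length, in words

# A very long document that mentions the term once.
plain = term_component(1, doc_len=100_000, avgdl=avgdl)             # BM25
plus = term_component(1, doc_len=100_000, avgdl=avgdl, delta=1.0)   # BM25+

# BM15 (b = 0): document length has no influence on the component.
bm15_short = term_component(3, doc_len=50, avgdl=avgdl, b=0.0)
bm15_long = term_component(3, doc_len=5_000, avgdl=avgdl, b=0.0)

# BM11 (b = 1): length is normalized fully, by |D| / avgdl.
bm11 = term_component(2, doc_len=100, avgdl=avgdl, b=1.0)
```

With these numbers `plain` is close to zero, which is nearly the value a non-matching document gets, while `plus` stays above \delta. The two BM15 values are identical because the length term vanishes when b = 0.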