Okapi BM25

BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document. It is a family of scoring functions with slightly different components and parameters. One of the most prominent instantiations of the function is as follows.

Given a query Q, containing keywords q_1, ..., q_n, the BM25 score of a document D is:

: \text{score}(D,Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}

where f(q_i, D) is the number of times that the keyword q_i occurs in the document D, |D| is the length of the document D in words, and \text{avgdl} is the average document length in the text collection from which documents are drawn. k_1 and b are free parameters, usually chosen, in the absence of an advanced optimization, as k_1 \in [1.2, 2.0] and b = 0.75. \text{IDF}(q_i) is the IDF (inverse document frequency) weight of the query term q_i. It is usually computed as:

: \text{IDF}(q_i) = \ln \left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)

where N is the total number of documents in the collection, and n(q_i) is the number of documents containing q_i.

There are several interpretations for IDF and slight variations on its formula. In the original BM25 derivation, the IDF component is derived from the Binary Independence Model.

== IDF information theoretic interpretation ==
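As a concrete reference point, the IDF weight and the BM25 score defined above can be sketched in Python. This is a minimal illustration rather than a reference implementation: the function names `idf` and `bm25_scores`, the whitespace tokenization, the toy corpus, and the default k_1 = 1.5 (one value from the usual range) are all assumptions made for the example, and the corpus is assumed to be non-empty.

```python
import math
from collections import Counter


def idf(n_docs, doc_freq):
    """IDF(q) = ln((N - n(q) + 0.5) / (n(q) + 0.5) + 1)."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    `docs` is a non-empty list of token lists; `query` is a list of tokens.
    """
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # n(q): number of documents that contain each term at least once.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)  # f(q, D): term frequencies within this document
        length_norm = 1.0 - b + b * len(d) / avgdl
        score = 0.0
        for q in query:
            f = tf[q]
            score += idf(n_docs, doc_freq[q]) * f * (k1 + 1.0) / (f + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus, tokenized by whitespace (an assumption for illustration).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "birds sing at dawn".split(),
]
scores = bm25_scores(["cat", "mat"], docs)
```

In this toy corpus the first document contains both query terms and ranks highest, the second contains only "cat" and ranks below it, and the third contains neither term and scores zero. Because "mat" occurs in fewer documents than "cat", it receives the larger IDF weight.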