The cosine of two non-zero vectors can be derived by using the Euclidean dot product formula:
:\mathbf{A}\cdot\mathbf{B} = \left\|\mathbf{A}\right\| \left\|\mathbf{B}\right\| \cos\theta
Given two n-dimensional vectors of attributes, A and B, the cosine similarity, \cos(\theta), is represented using a dot product and magnitude as
:\text{cosine similarity} = S_C(A,B) := \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = \frac{ \sum\limits_{i=1}^{n}{A_i B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}} \cdot \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} },
where A_i and B_i are the ith components of vectors \mathbf{A} and \mathbf{B}, respectively.

The resulting similarity ranges from −1, meaning exactly opposite, to +1, meaning exactly the same, with 0 indicating orthogonality (no correlation), while in-between values indicate intermediate similarity or dissimilarity.
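As a minimal sketch of the formula above (assuming NumPy is available; the helper name cosine_similarity is chosen here only for illustration), the similarity can be computed directly from the dot product and the vector magnitudes:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # S_C(A, B) = (A . B) / (||A|| ||B||), defined for non-zero vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

A = np.array([3.0, 4.0])
B = np.array([6.0, 8.0])     # parallel to A, so the similarity is +1
C = np.array([-3.0, -4.0])   # opposite to A, so the similarity is -1

print(cosine_similarity(A, B))   # 1.0
print(cosine_similarity(A, C))   # -1.0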
For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. Cosine similarity can be seen as a method of normalizing document length during comparison. In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies cannot be negative. This remains true when using TF-IDF weights. The angle between two term frequency vectors cannot be greater than 90°.

If the attribute vectors are normalized by subtracting the vector means (e.g., A - \bar{A}), the measure is called the centered cosine similarity and is equivalent to the Pearson correlation coefficient. For an example of centering,
: \text{if}\, A = [A_1, A_2]^T, \text{ then } \bar{A} = \left[\frac{(A_1+A_2)}{2},\frac{(A_1+A_2)}{2}\right]^T,
: \text{ so } A-\bar{A}= \left[\frac{(A_1-A_2)}{2},\frac{(-A_1+A_2)}{2}\right]^T.
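The following Python sketch (sample data chosen only for illustration, and assuming NumPy) checks this equivalence numerically: the cosine similarity of the mean-centered vectors agrees with the Pearson correlation coefficient.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

A = np.array([2.0, 5.0, 7.0, 10.0])
B = np.array([1.0, 4.0, 4.0, 9.0])

# Centering: subtract each vector's mean from its components.
centered = cosine_similarity(A - A.mean(), B - B.mean())
pearson = np.corrcoef(A, B)[0, 1]

print(centered, pearson)  # the two values agree (up to floating-point rounding)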
=== Cosine distance ===
When the distance between two unit-length vectors is defined to be the length of their vector difference, then
:\operatorname{dist}(\mathbf A, \mathbf B) = \sqrt{(\mathbf A - \mathbf B) \cdot (\mathbf A - \mathbf B)} = \sqrt{\mathbf A \cdot \mathbf A - 2(\mathbf A \cdot \mathbf B) + \mathbf B \cdot \mathbf B} = \sqrt{2(1-S_C(\mathbf A, \mathbf B))}\,.
Nonetheless, the cosine distance is often defined without the square root or factor of 2:
:\text{cosine distance} = D_C(A,B) := 1 - S_C(A,B)\,.
By virtue of being proportional to the squared Euclidean distance between unit vectors, the cosine distance is not a true distance metric: it does not exhibit the triangle inequality property (or, more formally, the Schwarz inequality), and it violates the coincidence axiom. To repair the triangle inequality property while maintaining the same ordering, one can convert to the Euclidean distance \sqrt{2(1- S_C(A,B))} or to the angular distance described below. Alternatively, the triangle inequality that does work for angular distances can be expressed directly in terms of the cosines; see below.
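As a small sketch (the vectors are chosen here purely for illustration, under the same NumPy assumption as above), the snippet below exhibits a triple for which the cosine distance D_C violates the triangle inequality, while the repaired Euclidean form \sqrt{2(1-S_C)} satisfies it:

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)

def repaired_distance(a, b):
    # Euclidean distance between the unit-length versions of a and b.
    return float(np.sqrt(2.0 * (1.0 - cosine_similarity(a, b))))

A = np.array([1.0, 0.0])
B = np.array([1.0, 1.0])   # 45 degrees from both A and C
C = np.array([0.0, 1.0])   # 90 degrees from A

# Cosine distance: D_C(A, C) = 1 > D_C(A, B) + D_C(B, C) ~= 0.586,
# so the triangle inequality fails.
print(cosine_distance(A, C), cosine_distance(A, B) + cosine_distance(B, C))

# Repaired form: sqrt(2) ~= 1.414 <= 0.765 + 0.765,
# so the triangle inequality holds for this triple.
print(repaired_distance(A, C), repaired_distance(A, B) + repaired_distance(B, C))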
=== Angular distance and similarity ===
The normalized angle, referred to as angular distance, between any two vectors A and B is a formal distance metric and can be calculated from the cosine similarity. The complement of the angular distance metric can then be used to define an angular similarity function bounded between 0 and 1, inclusive.

When the vector elements may be positive or negative:
:\text{angular distance} = D_{\theta} := \frac{ \arccos( \text{cosine similarity} ) }{ \pi } = \frac{\theta}{\pi}
:\text{angular similarity} = S_{\theta} := 1 - \text{angular distance} = 1 - \frac{\theta}{\pi}
Or, if the vector elements are always positive:
:\text{angular distance} = D_{\theta} := \frac{ 2 \cdot \arccos( \text{cosine similarity} ) }{ \pi } = \frac{2\theta}{\pi}
:\text{angular similarity} = S_{\theta} := 1 - \text{angular distance} = 1 - \frac{2\theta}{\pi}
Unfortunately, computing the inverse cosine (\arccos) function is slow, making the use of the angular distance more computationally expensive than using the more common (but not metric) cosine distance above.
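A rough Python sketch of the signed-element case (reusing the illustrative vectors from the previous snippet; not a reference implementation) computes the angular distance via \arccos and checks the triangle inequality for the same triple that defeated the plain cosine distance:

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def angular_distance(a, b):
    # Clipping guards against tiny floating-point excursions outside [-1, 1].
    return float(np.arccos(np.clip(cosine_similarity(a, b), -1.0, 1.0)) / np.pi)

A = np.array([1.0, 0.0])
B = np.array([1.0, 1.0])
C = np.array([0.0, 1.0])

# D_theta(A, C) = 0.5 and D_theta(A, B) + D_theta(B, C) = 0.25 + 0.25,
# so the triangle inequality holds (here with equality, up to rounding).
print(angular_distance(A, C), angular_distance(A, B) + angular_distance(B, C))

# Angular similarity is the complement of the angular distance.
print(1.0 - angular_distance(A, B))  # 0.75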
=== L2-normalized Euclidean distance ===
Another effective proxy for cosine distance can be obtained by L_2 normalisation of the vectors, followed by the application of normal Euclidean distance. Using this technique, each term in each vector is first divided by the magnitude of the vector, yielding a vector of unit length. Then the Euclidean distance over the end-points of any two vectors is a proper metric which gives the same ordering as the cosine distance for any comparison of vectors (the cosine distance is a monotonic transformation of this Euclidean distance; see below), and which furthermore avoids the potentially expensive trigonometric operations required to yield a proper metric. Once the normalisation has occurred, the vector space can be used with the full range of techniques available to any Euclidean space, notably standard dimensionality reduction techniques. This normalised form of the distance is often used within many deep learning algorithms.
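A small sketch under the same assumptions as the earlier snippets (NumPy, vectors chosen only for illustration): the Euclidean distance between the L_2-normalised vectors coincides with \sqrt{2(1-S_C(A,B))}, the metric form given in the cosine distance section.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2_normalized_distance(a, b):
    # Divide each vector by its magnitude, then take the ordinary Euclidean distance.
    a_hat = a / np.linalg.norm(a)
    b_hat = b / np.linalg.norm(b)
    return float(np.linalg.norm(a_hat - b_hat))

A = np.array([1.0, 2.0, 3.0])
B = np.array([3.0, 1.0, 0.0])

# Both lines print the same value.
print(l2_normalized_distance(A, B))
print(np.sqrt(2.0 * (1.0 - cosine_similarity(A, B))))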
=== Otsuka–Ochiai coefficient ===
In biology, there is a similar concept known as the Otsuka–Ochiai coefficient, named after Yanosuke Otsuka (also spelled as Ōtsuka, Ootsuka or Otuka) and Akira Ochiai, also known as the Ochiai–Barkman or Ochiai coefficient, which can be represented as:
:K = \frac{|A \cap B|}{\sqrt{|A| \times |B|}}
Here, A and B are sets, and |A| is the number of elements in A. If sets are represented as bit vectors, the Otsuka–Ochiai coefficient can be seen to be the same as the cosine similarity.
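As a brief sketch (the example sets and the NumPy usage are illustrative assumptions, not from the source), the set form of the coefficient and the cosine similarity of the corresponding bit vectors give the same value:

import numpy as np

def otsuka_ochiai(a: set, b: set) -> float:
    return float(len(a & b) / np.sqrt(len(a) * len(b)))

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

A = {"cat", "dog", "fish"}
B = {"dog", "fish", "bird", "ant"}

# Bit-vector representation over the combined vocabulary.
vocab = sorted(A | B)
A_bits = np.array([1.0 if w in A else 0.0 for w in vocab])
B_bits = np.array([1.0 if w in B else 0.0 for w in vocab])

print(otsuka_ochiai(A, B))                # 2 / sqrt(3 * 4) ~= 0.577
print(cosine_similarity(A_bits, B_bits))  # same value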
The coefficient is identical to the score introduced by Godfrey Thomson. In a recent book, the coefficient is tentatively misattributed to another Japanese researcher with the family name Otsuka. The confusion arises because in 1957 Akira Ochiai attributed the coefficient only to Otsuka (no first name mentioned), who in turn cited the original 1936 article by Yanosuke Otsuka.

== Properties ==