Clustering or cluster analysis is a data mining technique used to discover patterns in data by grouping similar objects together. It partitions a set of data points into groups, or clusters, based on their similarities. A fundamental aspect of clustering is how similarity between data points is measured: similarity measures are central to many clustering techniques, since they determine how closely related two data points are and whether they should be placed in the same cluster. A similarity measure can take many forms depending on the type of data being clustered and the specific problem being solved. One of the most commonly used similarity measures is the
Euclidean distance, which is used in many clustering techniques including K-means clustering and hierarchical clustering. The Euclidean distance is the straight-line distance between two points in a (possibly high-dimensional) space. It is calculated as the square root of the sum of the squared differences between the corresponding coordinates of the two points. For example, for two data points (x_1, y_1) and (x_2, y_2) in the plane, the Euclidean distance between them is d = \sqrt{(x_2-x_1)^2 + (y_2-y_1)^2}. [[File:Nuclear Profile Similarity heatmap.png|thumb|Heatmap of the HIST1 region, located on mouse chromosome 13 at coordinates 21.7–24.1 Mb.]]
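As a minimal sketch, the same formula extends to any number of coordinates (the helper name below is illustrative, not from any particular library):

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# 2-D example: d = sqrt((4-1)^2 + (6-2)^2) = sqrt(9 + 16) = 5.0
print(euclidean_distance((1, 2), (4, 6)))
```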
Another commonly used similarity measure is the Jaccard index, or Jaccard similarity, which is used in clustering techniques that work with binary data such as presence/absence or Boolean data. It is calculated as the size of the intersection of two sets divided by the size of their union: J(A,B) = \frac{|A \cap B|}{|A \cup B|}. The Jaccard similarity ranges from 0 to 1, with 0 indicating no shared elements between the two sets and 1 indicating identical sets. It is particularly useful for clustering text data, where it can identify clusters of similar documents based on their shared features or keywords. Similarities among 162 relevant nuclear profiles were tested using the Jaccard similarity measure, with the aim of clustering the most similar profiles (see the heatmap figure).
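On Python sets, the Jaccard computation is a one-liner; a minimal sketch (the function name and the convention of treating two empty sets as identical are our own choices):

```python
def jaccard_similarity(a, b):
    """|A intersect B| / |A union B| for two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets count as identical
    return len(a & b) / len(a | b)

# Two documents as keyword sets: intersection has 2 elements, union has 4
print(jaccard_similarity({"nuclear", "profile", "hist1"},
                         {"nuclear", "profile", "mouse"}))  # 2/4 = 0.5
```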
Manhattan distance, also known as Taxicab geometry, is a commonly used distance measure in clustering techniques that work with continuous data. It measures the distance between two data points as the sum of the absolute differences between their corresponding coordinates: for two points (x_1, y_1) and (x_2, y_2), d = \left\vert x_1 - x_2 \right\vert + \left\vert y_1 - y_2 \right\vert.
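The Manhattan distance can be sketched analogously (again a hypothetical helper, not a library function):

```python
def manhattan_distance(p, q):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))

# |1 - 4| + |2 - 6| = 3 + 4 = 7
print(manhattan_distance((1, 2), (4, 6)))
```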
When dealing with mixed-type data, including nominal, ordinal, and numerical attributes per object, Gower's distance (or similarity) is a common choice, as it can handle different types of variables implicitly. It first computes a similarity between the two objects for each of their variables, and then combines those similarities into a single weighted average per object pair. For two objects i and j described by p variables, the similarity S is defined as: S_{ij} = \frac{\sum_{k=1}^p w_{ijk} s_{ijk}}{\sum_{k=1}^p w_{ijk}}, where the w_{ijk} are non-negative weights and s_{ijk} is the similarity between the two objects regarding their k-th variable.
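A sketch of this weighted average under common conventions: numeric variables are scored as 1 - |x_k - y_k| / range_k and nominal variables as exact match. The function name, and passing ranges explicitly with None marking nominal variables, are illustrative choices, not part of any standard API:

```python
def gower_similarity(x, y, ranges, weights=None):
    """S_ij = sum_k(w_k * s_k) / sum_k(w_k) over the p variables of two objects.

    ranges[k] is the observed range of numeric variable k, or None if
    variable k is nominal (per-variable similarity 1 on a match, else 0).
    """
    weights = weights or [1.0] * len(x)
    num = den = 0.0
    for xk, yk, rk, wk in zip(x, y, ranges, weights):
        if rk is None:
            s = 1.0 if xk == yk else 0.0   # nominal variable: exact match
        else:
            s = 1.0 - abs(xk - yk) / rk    # numeric variable: range-normalised
        num += wk * s
        den += wk
    return num / den

# Mixed record (age, colour), age range assumed 50: ((1 - 10/50) + 0) / 2 = 0.4
print(gower_similarity((30, "red"), (40, "blue"), ranges=(50, None)))
```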
In spectral clustering, a similarity, or affinity, measure is used to transform the data in order to overcome difficulties related to lack of convexity in the shape of the data distribution. The measure gives rise to an (n, n)-sized '''similarity matrix''' for a set of n points, where the entry (i, j) in the matrix can simply be the (reciprocal of the) Euclidean distance between i and j, or a more complex measure of distance such as the Gaussian e^{-\|s_1 - s_2\|^2/2\sigma^2}. The choice of similarity measure depends on the type of data being clustered and the specific problem being solved. For example, when working with continuous data such as gene expression data, the Euclidean distance or cosine similarity may be appropriate. When working with binary data, such as the presence of a genomic locus in a nuclear profile, the Jaccard index may be more appropriate. Lastly, when working with data arranged in a grid or lattice structure, such as image or signal processing data, the Manhattan distance is particularly useful for clustering.

==Use in recommender systems==