Cluster analysis

Cluster analysis, or clustering, is a data analysis technique aimed at partitioning a set of objects into groups such that objects within the same group exhibit greater similarity to one another than to those in other groups (clusters). It is a main task of exploratory data analysis, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.

Definition

The notion of a "cluster" cannot be precisely defined, which is one of the reasons why there are so many clustering algorithms. There is a common denominator: a group of data objects. However, different researchers employ different cluster models, and for each of these cluster models again different algorithms can be given. The notion of a cluster, as found by different algorithms, varies significantly in its properties. Understanding these "cluster models" is key to understanding the differences between the various algorithms. Typical cluster models include: • s: for example, hierarchical clustering builds models based on distance connectivity. • s: for example, the k-means algorithm represents each cluster by a single mean vector. • s: clusters are modeled using statistical distributions, such as multivariate normal distributions used by the expectation-maximization algorithm. • s: for example, DBSCAN, OPTICS and HDBSCAN defines clusters as connected dense regions in the data space. • s: in biclustering (also known as co-clustering or two-mode-clustering), clusters are modeled with both cluster members and relevant attributes. • s: some algorithms do not provide a refined model for their results and just provide the grouping information. • s: a clique, that is, a subset of nodes in a graph such that every two nodes in the subset are connected by an edge can be considered as a prototypical form of cluster. Relaxations of the complete connectivity requirement (a fraction of the edges can be missing) are known as quasi-cliques, as in the HCS clustering algorithm. • Signed graph models: Every path in a signed graph has a sign from the product of the signs on the edges. Under the assumptions of balance theory, edges may change sign and result in a bifurcated graph. The weaker "clusterability axiom" (no cycle has exactly one negative edge) yields results with more than two clusters, or subgraphs with only positive edges. • s: the most well-known unsupervised neural network is the self-organizing map and these models can usually be characterized as similar to one or more of the above models, and including subspace models when neural networks implement a form of Principal Component Analysis or Independent Component Analysis. A "clustering" is essentially a set of such clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example, a hierarchy of clusters embedded in each other. Clusterings can be roughly distinguished as: • '''': each object belongs to a cluster or not • ' (also: '): each object belongs to each cluster to a certain degree (for example, a likelihood of belonging to the cluster) There are also finer distinctions possible, for example: • '''': each object belongs to exactly one cluster • '''': objects can also belong to no cluster; in which case they are considered outliers • '' (also: alternative clustering, multi-view clustering''): objects may belong to more than one cluster; usually involving hard clusters • '''': objects that belong to a child cluster also belong to the parent cluster • '''': while an overlapping clustering, within a uniquely defined subspace, clusters are not expected to overlap == Algorithms ==

Algorithms

As listed above, clustering algorithms can be categorized based on their cluster model. The following overview will only list the most prominent examples of clustering algorithms, as there are possibly over 100 published clustering algorithms. Not all provide models for their clusters and can thus not easily be categorized. An overview of algorithms explained in Wikipedia can be found in the list of statistics algorithms. There is no objectively "correct" clustering algorithm, but as it was noted, "clustering is in the eye of the beholder." The most appropriate clustering algorithm for a particular problem often needs to be chosen experimentally, unless there is a mathematical reason to prefer one cluster model over another. An algorithm that is designed for one kind of model will generally fail on a data set that contains a radically different kind of model. Connectivity-based clustering (hierarchical clustering) Connectivity-based clustering, also known as hierarchical clustering, is based on the idea that objects are more related to nearby objects than to those farther away. These algorithms form clusters by connecting objects based on their distance. A cluster can be understood in terms of the maximum distance required to connect its elements. At different distance thresholds, different cluster groupings appear. These groupings can be visualized using a dendrogram, a tree-like diagram that shows how clusters merge as the distance increases. This explains the term "hierarchical clustering": instead of producing a single partition of the data set, the algorithm builds a hierarchy of clusters that merge at different distances. In a dendrogram, the y-axis shows the distance at which clusters merge, while the x-axis arranges objects so that clusters appear as continuous branches. Connectivity-based clustering is a family of methods that differ in how distances between clusters are computed. In addition to choosing a distance function, the user must also select a linkage criterion, which determines how the distance between clusters is calculated. Common linkage criteria include single-linkage clustering (minimum distance between points), complete linkage clustering (maximum distance), and UPGMA or WPGMA (average linkage based on mean distances). Hierarchical clustering can be either agglomerative (starting with individual elements and merging them) or divisive (starting with the full data set and splitting it). In agglomerative hierarchical clustering, the algorithm typically proceeds as follows: • Start with each data point as its own cluster. • Identify the two closest clusters based on a chosen distance measure. • Merge them into a single cluster. • Recalculate distances between the new cluster and the remaining clusters using the selected linkage criterion. • Repeat until all data points are merged into a single cluster. This process produces a full hierarchy of possible clusterings rather than a single final result. A specific clustering can be obtained by selecting a cut level in the dendrogram, which determines how many clusters are formed. These methods do not produce a unique partitioning of the data set, but rather a hierarchy from which the user must choose appropriate clusters. They are also sensitive to outliers, which may appear as separate clusters or cause other clusters to merge. This effect, especially in single-linkage clustering, is known as the "chaining phenomenon". In the general case, the complexity is \mathcal{O}(n^3) for agglomerative clustering and \mathcal{O}(2^{n-1}) for divisive clustering, which makes them computationally expensive for large data sets. For some special cases, more efficient methods (with complexity \mathcal{O}(n^2)) are known, such as SLINK for single-linkage and CLINK for complete-linkage clustering. File:SLINK-Gaussian-data.svg|Single-linkage on Gaussian data. At 35 clusters, the biggest cluster starts fragmenting into smaller parts, while before it was still connected to the second largest due to the single-link effect. File:SLINK-density-data.svg|Single-linkage on density-based clusters. 20 clusters extracted, most of which contain single elements, since linkage clustering does not have a notion of "noise". Centroid-based clustering In centroid-based clustering, each cluster is represented by a central vector, which is not necessarily a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster are minimized. The optimization problem itself is known to be NP-hard, and thus the common approach is to search only for approximate solutions. A particularly well-known approximate method is Lloyd's algorithm, often just referred to as "k-means algorithm" (although another algorithm introduced this name). It does however only find a local optimum, and is commonly run multiple times with different random initializations. Variations of k-means often include such optimizations as choosing the best of multiple runs, but also restricting the centroids to members of the data set (k-medoids), choosing medians (k-medians clustering), choosing the initial centers less randomly (k-means++) or allowing a fuzzy cluster assignment (fuzzy c-means). Most k-means-type algorithms require the number of clusters – k – to be specified in advance, which is considered to be one of the biggest drawbacks of these algorithms. Furthermore, the algorithms prefer clusters of approximately similar size, as they will always assign an object to the nearest centroid; often yielding improperly cut borders of clusters. This happens primarily because the algorithm optimizes cluster centers, not cluster borders. Steps involved in the centroid-based clustering algorithm are: • Choose, k distinct clusters at random. These are the initial centroids to be improved upon. • Suppose a set of observations, . Assign each observation to the centroid to which it has the smallest squared Euclidean distance. This results in k distinct groups, each containing unique observations. • Recalculate centroids (see k-means clustering). • Exit iff the new centroids are equivalent to the previous iteration's centroids. Else, repeat the algorithm, the centroids have yet to converge. K-means has a number of interesting theoretical properties. First, it partitions the data space into a structure known as a Voronoi diagram. Second, it is conceptually close to nearest neighbor classification, and as such is popular in machine learning. Third, it can be seen as a variation of model-based clustering, and Lloyd's algorithm as a variation of the Expectation-maximization algorithm for this model discussed below. File:KMeans-Gaussian-data.svg|k-means separates data into Voronoi cells, which assumes equal-sized clusters (not adequate here). File:KMeans-density-data.svg|k-means cannot represent density-based clusters. The following pseudocode describes the standard iterative refinement form of k-means. The algorithm alternates between an assignment step, which labels each point by its nearest centroid, and an update step, which recomputes each centroid as the mean of its assigned points. Convergence is guaranteed in a finite number of iterations, though the result may be a local optimum. input: dataset \mathbf{x}_1,...,\mathbf{x}_P, initializations for centroids \mathbf{c}_1,...,\mathbf{c}_K, maximum number of iterations J for \,\,j = 1,\ldots,J # Cluster assignments for \,\,p = 1,\ldots,P a_p =\underset{k=1,\ldots,K}{\mbox{argmin}}\,\,\left\Vert \mathbf{c}_{k}-\mathbf{x}_{p}\right\Vert _{2} # Update centroid locations for \,\,k = 1,\ldots,K \text{denote } S_k \text{ the index set of points } X_p\text{ currently assigned to the }k_{th}\text{ cluster} \text{update }c_k\text{ via }c_k=\frac{1}{\left|S_{k}\right|}\underset{p\in S_{k}}{\sum}\mathbf{x}_{p} # Update cluster assignments using final for \,\,p = 1,\ldots,P a_p =\underset{k=1,\ldots,K}{\mbox{argmin}}\,\,\left\Vert \mathbf{c}_{k}-\mathbf{x}_{p}\right\Vert _{2} output: optimal centroids and assignments density-based clustering method is DBSCAN. In contrast to many newer methods, it features a well-defined cluster model called "density-reachability". Similar to linkage-based clustering, it is based on connecting points within certain distance thresholds. However, it only connects points that satisfy a density criterion, in the original variant defined as a minimum number of other objects within this radius. A cluster consists of all density-connected objects (which can form a cluster of an arbitrary shape, in contrast to many other methods) plus all objects that are within these objects' range. Another interesting property of DBSCAN is that its complexity is fairly low – it requires a linear number of range queries on the database – and that it will discover essentially the same results (it is deterministic for core and noise points, but not for border points) in each run, therefore there is no need to run it multiple times. OPTICS is a generalization of DBSCAN that removes the need to choose an appropriate value for the range parameter \varepsilon, and produces a hierarchical result related to that of linkage clustering. DeLi-Clu, Density-Link-Clustering combines ideas from single-linkage clustering and OPTICS, eliminating the \varepsilon parameter entirely and offering performance improvements over OPTICS by using an R-tree index. HDBSCAN extends DBSCAN by converting it into a hierarchical clustering algorithm, and then using a technique to extract a flat clustering based in the stability of clusters. The key drawback of DBSCAN and OPTICS is that they expect some kind of density drop to detect cluster borders. On data sets with, for example, overlapping Gaussian distributions – a common use case in artificial data – the cluster borders produced by these algorithms will often look arbitrary, because the cluster density decreases continuously. On a data set consisting of mixtures of Gaussians, these algorithms are nearly always outperformed by methods such as EM clustering that are able to precisely model this kind of data. Mean-shift is a clustering approach where each object is moved to the densest area in its vicinity, based on kernel density estimation. Eventually, objects converge to local maxima of density. Similar to k-means clustering, these "density attractors" can serve as representatives for the data set, but mean-shift can detect arbitrary-shaped clusters similar to DBSCAN. Due to the expensive iterative procedure and density estimation, mean-shift is usually slower than DBSCAN or k-Means. Besides that, the applicability of the mean-shift algorithm to multidimensional data is hindered by the unsmooth behaviour of the kernel density estimate, which results in over-fragmentation of cluster tails. In this technique, we create a grid structure, and the comparison is performed on grids (also known as cells). The grid-based technique is fast and has low computational complexity. There are two types of grid-based clustering methods: STING and CLIQUE. Steps involved in the grid-based clustering algorithm are: • Divide data space into a finite number of cells. • Randomly select a cell ‘c’, where c should not be traversed beforehand. • Calculate the density of ‘c’ • If the density of ‘c’ greater than threshold density • Mark cell ‘c’ as a new cluster • Calculate the density of all the neighbors of ‘c’ • If the density of a neighboring cell is greater than threshold density then, add the cell in the cluster and repeat steps 4.2 and 4.3 till there is no neighbor with a density greater than threshold density. • Repeat steps 2, 3, and 4 till all the cells are traversed. • Stop. Big data With the increasing need to process Big data, the willingness to trade semantic meaning of the generated clusters for performance becomes more relevant. Therefore, efforts have been put into improving the performance of existing algorithms. Among them are CLARANS, and BIRCH. This led to the development of pre-clustering methods such as canopy clustering, which can process huge data sets efficiently, but the resulting "clusters" are merely a rough pre-partitioning of the data set to then analyze the partitions with existing slower methods such as k-means clustering. Subspace clustering For high-dimensional data, many methods fail due to the curse of dimensionality, which renders particular distance functions problematic in high-dimensional spaces. This led to clustering algorithms for high-dimensional data that focus on subspace clustering (where only some attributes are used, and cluster models include the relevant attributes for the cluster) and correlation clustering that also looks for arbitrary rotated ("correlated") subspace clusters that can be modeled by giving a correlation of their attributes. Examples for such clustering algorithms are CLIQUE and SUBCLU. Ideas from density-based clustering methods (in particular the DBSCAN/OPTICS family of algorithms) have been adapted to subspace clustering (HiSC, hierarchical subspace clustering and DiSH) and correlation clustering (HiCO, hierarchical correlation clustering, 4C using "correlation connectivity" and ERiC exploring hierarchical density-based correlation clusters). Several different clustering systems based on mutual information have been proposed. One is Marina Meilă's variation of information metric; another provides hierarchical clustering. Using genetic algorithms, a wide range of different fit-functions can be optimized, including mutual information. Also belief propagation, a recent development in computer science and statistical physics, has led to the creation of new types of clustering algorithms. == Evaluation and assessment ==

Evaluation and assessment

Evaluation (or "validation") of clustering results is as difficult as the clustering itself. Popular approaches involve "internal" evaluation, where the clustering is summarized to a single quality score, "external" evaluation, where the clustering is compared to an existing "ground truth" classification, "manual" evaluation by a human expert, and "indirect" evaluation by evaluating the utility of the clustering in its intended application. which is highly subjective. Nevertheless, such statistics can be quite informative in identifying bad clusterings, but one should not dismiss subjective human evaluation. For example, the following methods can be used to assess the quality of clustering algorithms based on internal criterion: ==== Davies–Bouldin index ==== The Davies–Bouldin index can be calculated by the following formula: DB = \frac {1} {n} \sum_{i=1}^{n} \max_{j\neq i}\left(\frac{\sigma_i + \sigma_j} {d(c_i,c_j)}\right) where n is the number of clusters, c_i is the centroid of cluster i, \sigma_i is the average distance of all elements in cluster i to centroid c_i, and d(c_i,c_j) is the distance between centroids c_i and c_j. Since algorithms that produce clusters with low intra-cluster distances (high intra-cluster similarity) and high inter-cluster distances (low inter-cluster similarity) will have a low Davies–Bouldin index, the clustering algorithm that produces a collection of clusters with the smallest Davies–Bouldin index is considered the best algorithm based on this criterion. ==== Dunn index ==== The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio between the minimal inter-cluster distance to maximal intra-cluster distance. For each cluster partition, the Dunn index can be calculated by the following formula: : D = \frac{\min_{1 \leq i where d(i,j) represents the distance between clusters i and j, and d '(k) measures the intra-cluster distance of cluster k. The inter-cluster distance d(i,j) between two clusters may be any number of distance measures, such as the distance between the centroids of the clusters. Similarly, the intra-cluster distance d '(k) may be measured in a variety of ways, such as the maximal distance between any pair of elements in cluster k. Since internal criterion seek clusters with high intra-cluster similarity and low inter-cluster similarity, algorithms that produce clusters with high Dunn index are more desirable. ==== Silhouette coefficient ==== The silhouette coefficient contrasts the average distance to elements in the same cluster with the average distance to elements in other clusters. Objects with a high silhouette value are considered well clustered, objects with a low value may be outliers. This index works well with k-means clustering, and is also used to determine the optimal number of clusters. Area Under the Curve for Clustering (AUCC) This matrix consider pairs of objects: the distance between the pair as a scoring function and the particion pairs defined true positive, true negatives, false negatives and true negatives by considering if pairs are in the same clusters of not. This index borrow the same characteristics as the AUC in the supervised scenario including an expected value of 0.5 and vizualisation of results. External evaluation In external evaluation, clustering results are evaluated based on data that was not used for clustering, such as known class labels and external benchmarks. Such benchmarks consist of a set of pre-classified items, and these sets are often created by (expert) humans. Thus, the benchmark sets can be thought of as a gold standard for evaluation. Additionally, from a knowledge discovery point of view, the reproduction of known knowledge may not necessarily be the intended result. A number of measures are adapted from variants used to evaluate classification tasks. In place of counting the number of times a class was correctly assigned to a single data point (known as true positives), such pair counting metrics assess whether each pair of data points that is truly in the same cluster is predicted to be in the same cluster. computes how similar the clusters (returned by the clustering algorithm) are to the benchmark classifications. It can be computed using the following formula: : RI = \frac {TP + TN} {TP + FP + FN + TN} where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. The instances being counted here are the number of correct pairwise assignments. That is, TP is the number of pairs of points that are clustered together in the predicted partition and in the ground truth partition, FP is the number of pairs of points that are clustered together in the predicted partition but not in the ground truth partition etc. If the dataset is of size N, then TP + TN + FP + FN = \binom{N}{2}. One issue with the Rand index is that false positives and false negatives are equally weighted. This may be an undesirable characteristic for some clustering applications. The F-measure addresses this concern, as does the chance-corrected adjusted Rand index. ==== F-measure ==== The F-measure can be used to balance the contribution of false negatives by weighting recall through a parameter \beta \geq 0. Let precision and recall (both external evaluation measures in themselves) be defined as follows: P = \frac {TP } {TP + FP } R = \frac {TP } {TP + FN} where P is the precision rate and R is the recall rate. We can calculate the F-measure by using the following formula: computes the similarity between the clusters returned by the clustering algorithm and the benchmark classifications. The higher the value of the Fowlkes–Mallows index the more similar the clusters and the benchmark classifications are. It can be computed using the following formula: FM = \sqrt{ \frac {TP}{TP+FP} \cdot \frac{TP}{TP+FN} } where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. The FM index is the geometric mean of the precision and recall P and R, and is thus also known as the G-measure, while the F-measure is their harmonic mean. Moreover, precision and recall are also known as Wallace's indices B^I and B^{II}. Chance normalized versions of recall, precision and G-measure correspond to Informedness, Markedness and Matthews Correlation and relate strongly to Kappa. Chi index The Chi index is an external validation index that measure the clustering results by applying the chi-squared statistic. This index scores positively the fact that the labels are as sparse as possible across the clusters, i.e., that each cluster has as few different labels as possible. The higher the value of the Chi Index the greater the relationship between the resulting clusters and the label used. ====Mutual information==== The mutual information is an information theoretic measure of how much information is shared between a clustering and a ground-truth classification that can detect a non-linear similarity between two clustering. Normalized mutual information is a family of corrected-for-chance variants of this that has a reduced bias for varying cluster numbers. Cluster tendency To measure cluster tendency is to measure to what degree clusters exist in the data to be clustered, and may be performed as an initial test, before attempting clustering. One way to do this is to compare the data against random data. On average, random data should not have clusters . • Hopkins statistic :There are multiple formulations of the Hopkins statistic. A typical one is as follows. Let X be the set of n data points in d dimensional space. Consider a random sample (without replacement) of m \ll n data points with members x_i. Also generate a set Y of m uniformly randomly distributed data points. Now define two distance measures, u_i to be the distance of y_i \in Y from its nearest neighbor in X and w_i to be the distance of x_i \in X from its nearest neighbor in X. We then define the Hopkins statistic as: :: H=\frac{\sum_{i=1}^m{u_i^d}}{\sum_{i=1}^m{u_i^d}+\sum_{i=1}^m{w_i^d}} \,, :With this definition, uniform random data should tend to have values near to 0.5, and clustered data should tend to have values nearer to 1. :However, data containing just a single Gaussian will also score close to 1, as this statistic measures deviation from a uniform distribution, not multimodality, making this statistic largely useless in application (as real data never is remotely uniform). == Ethics and Fairness ==

Ethics and Fairness

As clustering algorithms are increasingly deployed by corporations and governmental organizations to categorize populations and automate decisions with real world data, concerns regarding algorithmic bias have become more prevalent. Because clustering is a form of unsupervised learning, it identifies patterns within existing data. Consequently, these models can inadvertently reinforce historical inequalities already existent in the training datasets. Fairness Definitions in Clustering Achieving fairness in unsupervised learning based on real world data is impossible, as there is no ground truth way to label what is “correct”.While clustering algorithms are mathematical in nature, they are susceptible to systemic biases given that the underlying data reflects both historical and societal prejudices. In response, researchers have developed "Fair Clustering" frameworks, such as the Fairlet approach, which ensures that each cluster maintains a balanced representation of protected groups relative to the overall population. Disparate Impact and Proxies Under the legal doctrine of disparate impact, a process is considered discriminatory if it results in disproportionately adverse outcomes for a protected class, even if the algorithm is facially neutral (i.e., it does not explicitly use attributes like race or gender). Unfairness often occurs through proxy variables. For instance, even if a dataset is stripped of race, a clustering algorithm might use ZIP codes or educational background as features. Because these variables are often highly related to economic status and ethnicity, the resulting clusters will effectively segregate individuals by these characteristics. Real-World Impacts The application of clustering in predictive policing has demonstrated how historical bias in data can create feedback loops. • Case Study (Chicago) • The Chicago Police Department's strategic subject list used clustering to identify individuals likely to be involved in future crime. However, a 2016 study found that the model primarily targeted individuals based on previous police contact rather than actual criminal activity, disproportionately affecting minority communities without reducing crime rates. This was due to the model being based on police data, and thus only truly able to classify based on police activity instead of actual crime. • Demographic Disparities • Research into facial recognition clustering has shown that error rates are significantly higher for women and people of color. For example, the Gender Shades project found that certain clustering-based classification systems had error rates of up to 34.7% for dark-skinned women, compared to 0.8% for white men. == Applications ==

Applications

Cluster analysis is used for data analysis across a wide range of fields. Natural Sciences In the natural sciences, techniques such as hierarchical clustering, k-means,dimensionality reduction, principal component analysis (PCA), and t-SNE are frequently used to make sense of the dense data. ==== Biology, Computational Biology, and Bioinformatics ==== • Plant and Animal Ecology • Cluster analysis is used to describe and compare communities of organisms across heterogeneous environments. It is also widely used in plant systematics to generate artificial phylogenies clusters of organisms that share similar attributes. • Transcriptomics • Clustering is used to build groups of genes with related expression patterns, such as the HCS clustering algorithm. Such clusters often contain related proteins, such as enzymes for a specific metabolic pathway, or co-regulated genes. This helps inform scientists about previously unknown similarities between genes. • Sequence analysis • Sequence clustering is used to group sequences into gene families. This is a foundational concept in bioinformatics and evolutionary biology. • Human genetic clustering • In population genetics and genomic epidemiology, the similarities between genetic datasets are used to inform population structures. Medicine and Medical Data approaches. • Medical imaging • On PET scans, cluster analysis can be used to differentiate between different types of tissue in three-dimensional images for a wide range of diagnostic purposes. It is equally important in MRI analysis and histopathology, where automated segmentation of these regions reduces the burden of manual annotation. ==== Climate and Earth Sciences ==== • Climate • Clustering is used to find weather regimes and atmospheric patterns, helping meteorologists categorize large scale weather circulation as well as study the effect that climate variability has on regional weather. • Geochemistry & Petroleum Geology • The spatial clustering of chemical properties across different sampling locations helps geochemists identify mineralization zones, pollution plumes, and geological boundaries. • Mathematical chemistry • Clustering assists in structural similarity among chemical compounds. topological indices. Data Science and Technology In computing and technology applications, clustering is the driving force behind machine learning's unsupervised learning, and is embedded in systems ranging from search engines to recommendation platforms. Common algorithms in this field include k-means, DBSCAN, mean shift, and spectral clustering, often applied after feature extraction or embedding, which are steps that map raw data, such as text, images, sensor logs, into a vector space amenable to distance based grouping. pipelines. Machine Learning and Pattern Recognition • Image segmentation • Image segmentation is the process of partitioning a digital image into multiple meaningful regions so that pixels with similar attributes — color, intensity, or texture — are grouped together. This simplifies analysis and supports downstream tasks such as object detection and scene understanding, with applications in medical imaging, computer vision, satellite imaging, and everyday tools like face detection and photo editing. • Anomaly detection • Anomalies and outliers are typically defined with respect to the clustering structure of a dataset — points that fall far from any established cluster are flagged as unusual. This makes clustering central to intrusion detection, fraud detection, and industrial fault monitoring. • Markov chain Monte Carlo methods • Clustering is utilized to locate and characterize extrema in the target distribution, supporting more efficient sampling in high-dimensional probabilistic inference. • Field robotics • Clustering algorithms support robotic situational awareness, enabling robots to track objects and detect outliers in real-time sensor data streams. Document and Text Clustering visualisation of word embeddings from 19th-century literature. Proximity in the 2D projection reflects semantic similarity — a common precursor step to document clustering. • Natural language processing • Clustering can resolve lexical ambiguity and automatically organise large corpora of unstructured text into topically coherent groups — a process known as document clustering. Applications include news aggregation, customer feedback analysis, and automated literature review. • DevOps • Clustering has been used to analyze the effectiveness of DevOps teams and to group deployment pipelines by performance characteristics. The World Wide Web and Networks clustering reveals distinct conversational sub-groups within the broader network. • Social Network Analysis • In the study of social networks, clustering is used to identify communities within large groups of people — revealing echo chambers, influence hubs, and organic interest groups that would be opaque in a simple follower graph. • Search Result Grouping • Clustering can create a more relevant set of search results than traditional ranked lists, particularly when a search term is ambiguous. This reduces visual clutter and improves rendering performance. • Recommender Systems • Recommender systems use clustering to predict a user's unknown preferences by analyzing the tastes and activities of similar users within the same cluster. === Business, Economics, and Social Sciences === In business and the social sciences, clustering is most often applied to tabular survey and transaction data, where the aim is to segment populations into consumer groups, create detailed market analysis insights, or illustrate voter preferences. Business and Economics • Finance • Cluster analysis has been used to group stocks into sectors based on return co-movement, supporting portfolio construction and risk management. It is also applied in credit risk modeling to group borrowers with similar risk profiles and in algorithmic trading to identify regime shifts in market behavior. • Market research • Cluster analysis is widely used in market research when working with multivariate data from surveys and test panels. Market researchers use it to partition the general population of consumers into market segments, illuminating relationships between different consumer groups for use in product positioning, new product development, and test market selection. • Grouping of shopping items • Clustering can group the vast array of shopping items on the web into sets of unique products. For example, all items on eBay can be clustered into unique products sections, such as home goods or clothing. Social Sciences • Sequence analysis in social sciences • Cluster analysis is used to identify patterns in family life trajectories, professional careers, and daily or weekly time use, yielding life-course typologies that shed light on social stratification and inequality. • Crime analysis • Clustering can identify areas with elevated incidences of particular types of crime. By pinpointing "hot spots" where similar crimes have occurred over time, law enforcement agencies can deploy resources more strategically. • Educational data mining • Cluster analysis is used to identify groups of schools or students with similar academic profiles, enabling targeted interventions and more equitable allocation of resources. • Typologies and opinion research • Projects such as those undertaken by the Pew Research Center use cluster analysis to discern typologies of opinions, habits, and demographics from poll data, informing both political analysis and strategic communications. == See also ==

Source: Wikipedia ↗

tickerdossier.com tickerdossier.substack.com