Standard algorithm (naive k-means) The most common algorithm uses an iterative refinement technique. Due to its ubiquity, it is often called "the
k-means algorithm"; it is also referred to as
Lloyd's algorithm, particularly in the computer science community. It is sometimes also referred to as "naïve
k-means", because there exist much faster alternatives. Given an initial set of means (see below), the algorithm proceeds by alternating between two steps: •
Assignment step: Assign each observation to the cluster with the nearest mean (centroid): that with the least squared
Euclidean distance. (Mathematically, this means partitioning the observations according to the
Voronoi diagram generated by the means.) S_i^{(t)} = \left \{ x_p : \left \| x_p - m^{(t)}_i \right \|^2 \le \left \| x_p - m^{(t)}_j \right \|^2 \ \forall j, 1 \le j \le k \right\}, where each x_p is assigned to exactly one S^{(t)}, even if it could be assigned to two or more of them. •
Update step: Recalculate means (
centroids) for observations assigned to each cluster. This is also called refitting. m^{(t+1)}_i = \frac{1}{\left|S^{(t)}_i\right|} \sum_{x_j \in S^{(t)}_i} x_j The objective function in
k-means is the WCSS (within-cluster sum of squares). Each of the two steps can only decrease the WCSS or leave it unchanged, so the iterations produce a nonnegative, monotonically non-increasing sequence of objective values. This guarantees that k-means always converges, but not necessarily to the global optimum. The algorithm has converged when the assignments no longer change or, equivalently, when the WCSS has become stable. The algorithm is often presented as assigning objects to the nearest cluster by distance. Using a distance function other than (squared) Euclidean distance may prevent the algorithm from converging. Various modifications of
k-means such as spherical
k-means and
k-medoids have been proposed to allow using other distance measures. Pseudocode The pseudocode below outlines the implementation of the standard
k-means clustering algorithm. Initialization of centroids, distance metric between points and centroids, and the calculation of new centroids are design choices and will vary with different implementations. In this example pseudocode, distance() returns the distance between the specified points.
function kmeans(k, points) is
    // Initialize centroids
    centroids ← list of k starting centroids
    converged ← false
    while converged == false do
        // Create empty clusters
        clusters ← list of k empty lists
        // Assign each point to the nearest centroid
        for i ← 0 to length(points) - 1 do
            point ← points[i]
            closestIndex ← 0
            minDistance ← distance(point, centroids[0])
            for j ← 1 to k - 1 do
                d ← distance(point, centroids[j])
                if d < minDistance then
                    minDistance ← d
                    closestIndex ← j
            add point to clusters[closestIndex]
        // Recompute the centroid of each cluster and check for convergence
        newCentroids ← list of k centroids, where newCentroids[i] is the mean of clusters[i]
        if newCentroids == centroids then
            converged ← true
        else
            centroids ← newCentroids
    return clusters

Demonstration of the standard algorithm (originally illustrated with a sequence of four figures):
1. k initial "means" (in this case k=3) are randomly generated within the data domain (shown in color).
2. k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.
3. The centroid of each of the k clusters becomes the new mean.
4. Steps 2 and 3 are repeated until convergence has been reached.

The algorithm does not guarantee convergence to the global optimum. The result may depend on the initial clusters. As the algorithm is usually fast, it is common to run it multiple times with different starting conditions. However, worst-case performance can be slow: in particular, certain point sets, even in two dimensions, take exponential time to converge. These point sets do not seem to arise in practice: this is corroborated by the fact that the
smoothed running time of
k-means is polynomial. The "assignment" step is referred to as the "expectation step", while the "update step" is a maximization step, making this algorithm a variant of the
generalized expectation–maximization algorithm.
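To make the two alternating steps concrete, the following is a minimal sketch of Lloyd's algorithm in Python with NumPy; the function name lloyd_kmeans and the initialization by random sampling are illustrative choices, not part of any standard library.

    import numpy as np

    def lloyd_kmeans(points, k, max_iters=100, seed=0):
        # points: (n, d) array of observations; returns labels and centroids.
        rng = np.random.default_rng(seed)
        # Initialize centroids by sampling k distinct observations.
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(max_iters):
            # Assignment step: squared Euclidean distance to every centroid.
            sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = sq_dists.argmin(axis=1)
            # Update step: each centroid becomes the mean of its assigned points;
            # an empty cluster keeps its previous centroid.
            new_centroids = np.array([
                points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            if np.allclose(new_centroids, centroids):
                break  # assignments, and hence the WCSS, have stabilized
            centroids = new_centroids
        return labels, centroids

Because the result depends on the initial centroids, such a routine is typically rerun from several random initializations and the solution with the lowest WCSS is kept.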
Complexity Finding the optimal solution to the
k-means clustering problem for n observations in
d dimensions is: •
NP-hard in general
Euclidean space (of
d dimensions) even for two clusters, •
NP-hard for a general number of clusters
k even in the plane, • if
k and
d (the dimension) are fixed, the problem can be exactly solved in time O(n^{dk+1}), where
n is the number of entities to be clustered. Thus, a variety of
heuristic algorithms such as Lloyd's algorithm given above are generally used. The running time of Lloyd's algorithm (and most variants) is O(n k d i), where: •
n is the number of
d-dimensional vectors (to be clustered) •
k the number of clusters •
i the number of iterations needed until convergence. On data that does have a clustering structure, the number of iterations until convergence is often small, and results only improve slightly after the first dozen iterations. Lloyd's algorithm is therefore often considered to be of "linear" complexity in practice, although it is in the
worst case superpolynomial when performed until convergence. • In the worst-case, Lloyd's algorithm needs i = 2^{\Omega(\sqrt{n})} iterations, so that the worst-case complexity of Lloyd's algorithm is
superpolynomial. Lloyd's algorithm is the standard approach for this problem. However, it spends a lot of processing time computing the distances between each of the
k cluster centers and the
n data points. Since points usually stay in the same clusters after a few iterations, much of this work is unnecessary, making the naïve implementation very inefficient. Some implementations use caching and the triangle inequality in order to create bounds and accelerate Lloyd's algorithm.
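As an illustration of how such bounds are used, the sketch below (with illustrative names; not a full Elkan-style implementation) skips a candidate center c_b whenever the triangle inequality already guarantees that it cannot beat the currently assigned center c_a: if d(c_a, c_b) >= 2 d(x, c_a), then d(x, c_b) >= d(c_a, c_b) - d(x, c_a) >= d(x, c_a), so d(x, c_b) need not be computed.

    import numpy as np

    def assign_with_pruning(points, centroids, labels):
        # One assignment pass that uses the triangle inequality to skip
        # distance computations against centers that provably cannot be closer.
        center_dists = np.linalg.norm(
            centroids[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = labels.copy()
        for i, x in enumerate(points):
            a = new_labels[i]
            d_a = np.linalg.norm(x - centroids[a])
            for b in range(len(centroids)):
                if b == a:
                    continue
                # If the centers are far apart relative to d(x, c_a),
                # center b cannot be nearer to x; skip computing d(x, c_b).
                if center_dists[a, b] >= 2.0 * d_a:
                    continue
                d_b = np.linalg.norm(x - centroids[b])
                if d_b < d_a:
                    d_a, a = d_b, b
            new_labels[i] = a
        return new_labels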
Optimal number of clusters Finding the optimal number of clusters (
k) for
k-means clustering is a crucial step to ensure that the clustering results are meaningful and useful. Several techniques are available to determine a suitable number of clusters (a brief code sketch follows the list). Here are some commonly used methods: •
Elbow method (clustering): This method involves plotting the explained variation as a function of the number of clusters, and picking the elbow of the curve as the number of clusters to use. However, the notion of an "elbow" is not well-defined and the method is known to be unreliable. •
Silhouette (clustering): Silhouette analysis measures the quality of clustering and provides insight into the separation distance between the resulting clusters. A higher silhouette score indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. •
Gap statistic: The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The optimal k is the value that yields the largest gap statistic. •
Davies–Bouldin index: The Davies–Bouldin index is a measure of how much separation there is between clusters. Lower values of the Davies–Bouldin index indicate a model with better separation. •
Calinski–Harabasz index: This index evaluates clusters based on their compactness and separation. It is calculated as the ratio of between-cluster variance to within-cluster variance, with higher values indicating better-defined clusters. •
Rand index: It measures the agreement between two clusterings by counting the pairs of elements that are assigned to the same cluster, or to different clusters, in both. Higher values indicate greater similarity and better clustering quality. To correct for the agreement expected by chance, the Adjusted Rand Index (ARI), introduced by Hubert and Arabie in 1985, adjusts the Rand index for the expected similarity of all pairings under random labeling.
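The sketch below shows how the elbow (WCSS) and silhouette criteria are commonly evaluated over a range of k values, assuming scikit-learn is available; X stands for an arbitrary data matrix and is generated randomly here only for illustration.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = np.random.rand(500, 4)  # placeholder data; replace with the real dataset

    wcss = {}         # within-cluster sum of squares, for the elbow plot
    silhouettes = {}  # average silhouette score for each candidate k

    for k in range(2, 11):
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        wcss[k] = model.inertia_
        silhouettes[k] = silhouette_score(X, model.labels_)

    # The elbow is read off the wcss curve; the silhouette criterion simply
    # picks the k with the highest average score.
    best_k_silhouette = max(silhouettes, key=silhouettes.get)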
Variations •
Jenks natural breaks optimization:
k-means applied to univariate data •
k-medians clustering uses the median in each dimension instead of the mean, and this way minimizes the L_1 norm (
Taxicab geometry). •
k-medoids (also: Partitioning Around Medoids, PAM) uses the medoid instead of the mean, and this way minimizes the sum of distances for
arbitrary distance functions. •
Fuzzy C-Means Clustering is a soft version of
k-means, where each data point has a fuzzy degree of belonging to each cluster. •
Gaussian mixture models trained with
expectation–maximization algorithm (EM algorithm) maintains probabilistic assignments to clusters, instead of deterministic assignments, and multivariate Gaussian distributions instead of means. •
k-means++ chooses initial centers in a way that gives a provable upper bound on the WCSS objective. • The filtering algorithm uses
k-d trees to speed up each
k-means step. • Some methods attempt to speed up each
k-means step using the
triangle inequality. • Hierarchical variants such as Bisecting
k-means,
X-means clustering and G-means clustering
repeatedly split clusters to build a hierarchy, and can also try to automatically determine the optimal number of clusters in a dataset. •
Internal cluster evaluation measures such as
cluster silhouette can be helpful at
determining the number of clusters. • Minkowski weighted
k-means automatically calculates cluster-specific feature weights, supporting the intuitive idea that a feature may have different degrees of relevance in different clusters. These weights can also be used to re-scale a given data set, increasing the likelihood that a cluster validity index will be optimized at the expected number of clusters. • Mini-batch
k-means:
k-means variation using "mini batch" samples for data sets that do not fit into memory. •
Otsu's method
Hartigan–Wong method
Hartigan and Wong's method is a variation of k-means that, rather than recomputing all assignments at once, repeatedly relocates individual points between clusters whenever doing so decreases the sum-of-squares objective. The reduction in the objective obtained by moving a point x from cluster S_n to cluster S_m is \Delta(x,n,m) = \frac{ \left| S_n \right| }{ \left| S_n \right| - 1} \cdot \lVert \mu_n - x \rVert^2 - \frac{ \left| S_m \right| }{ \left| S_m \right| + 1} \cdot \lVert \mu_m - x \rVert^2, where \mu_n, \mu_m are the cluster means and \left| S_n \right|, \left| S_m \right| the cluster sizes; a move is made when \Delta(x,n,m) > 0, and the method stops when no improving move remains.
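A minimal sketch of how this relocation gain can be evaluated (the helper name relocation_gain is illustrative):

    import numpy as np

    def relocation_gain(x, mean_n, size_n, mean_m, size_m):
        # Decrease in the WCSS obtained by moving point x from cluster n
        # (size size_n, mean mean_n) to cluster m; positive values mean
        # the relocation improves the objective.
        if size_n <= 1:
            return -np.inf  # emptying a singleton cluster is not considered
        removed = size_n / (size_n - 1) * np.sum((mean_n - x) ** 2)
        added = size_m / (size_m + 1) * np.sum((mean_m - x) ** 2)
        return removed - added

A point is relocated, and the two cluster means and sizes are updated, whenever this gain is positive for some other cluster.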
Global optimization and meta-heuristics The classical
k-means algorithm and its variations are known to only converge to local minima of the minimum-sum-of-squares clustering problem defined as \operatorname{arg\,min}_{\mathbf{S}} \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_i} \left\| \mathbf{x} - \boldsymbol\mu_i \right\|^2. Many studies have attempted to improve the convergence behavior of the algorithm and maximize the chances of attaining the global optimum (or at least, local minima of better quality). Initialization and restart techniques discussed in the previous sections are one alternative for finding better solutions. More recently, global optimization algorithms based on
branch-and-bound and
semidefinite programming have produced provably optimal solutions for datasets with up to 4,177 entities and 20,531 features. As expected, due to the NP-hardness of the underlying optimization problem, the computational time of optimal algorithms for k-means quickly increases beyond this size. Optimal solutions for small- and medium-scale instances nevertheless remain valuable as a benchmark tool for evaluating the quality of other heuristics. To find high-quality local minima within a controlled computational time but without optimality guarantees, other works have explored
metaheuristics and other
global optimization techniques, e.g., based on incremental approaches and convex optimization, random swaps (i.e.,
iterated local search),
variable neighborhood search and
genetic algorithms. It is indeed known that finding better local minima of the minimum sum-of-squares clustering problem can make the difference between failure and success in recovering cluster structures in high-dimensional feature spaces.
Discussion