Rate distortion theory has been applied to choosing k in what is called the "jump" method, which determines the number of clusters that maximizes efficiency while minimizing error by
information-theoretic standards.{{Cite journal |author1=Catherine A. Sugar |author1-link=Catherine Sugar |author2=Gareth M. James | title = Finding the number of clusters in a data set: An information-theoretic approach | journal = Journal of the American Statistical Association | volume = 98 | issue = January | pages = 750–763 | year = 2003 | doi = 10.1198/016214503000000666 }} The distortion of a clustering of some input data is formally defined as follows: Let the data set be modeled as a
p-dimensional
random variable,
X, consisting of a
mixture distribution of
G components with common
covariance \Gamma. If we let c_1 \ldots c_K be a set of
K cluster centers, with c_X the closest center to a given sample of
X, then the minimum average distortion per dimension when fitting the
K centers to the data is:

: d_K = \frac{1}{p} \min_{c_1 \ldots c_K} E[(X - c_X)^T \Gamma^{-1} (X - c_X)]

This is also the average
Mahalanobis distance per dimension between
X and the closest cluster center c_X. Because the minimization over all possible sets of cluster centers is prohibitively complex, the distortion is computed in practice by generating a set of cluster centers using a standard clustering algorithm and computing the distortion using the result. The pseudo-code for the jump method with an input set of p-dimensional data points X is (where n is the maximum number of clusters considered):

 JumpMethod(X):
     Let Y = p/2
     Init a list D of size n+1
     Let D[0] = 0
     For k = 1 ... n:
         Cluster X with k clusters (e.g., with k-means)
         Let d = Distortion of the resulting clustering
         D[k] = d^(-Y)
     Define J(i) = D[i] - D[i-1]
     Return the k between 1 and n that maximizes J(k)

The choice of the transform power Y = p/2 is motivated by
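As a concrete illustration, the pseudo-code above can be sketched in Python. This is a minimal, illustrative implementation, not the authors' code: the tiny `kmeans` helper and the function names are assumptions, and the common covariance \Gamma is taken to be the identity, so the distortion reduces to the mean squared Euclidean distance per dimension.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Tiny Lloyd-style k-means standing in for 'a standard clustering
    algorithm'; returns the k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def distortion(X, centers):
    """Average distance per dimension from each point to its closest
    center (Gamma = identity, so squared Euclidean distance / p)."""
    p = X.shape[1]
    d2 = ((X[:, None] - centers[None]) ** 2).sum(-1).min(axis=1)
    return d2.mean() / p

def jump_method(X, max_k):
    """Return the k in 1..max_k with the largest jump in transformed
    distortion d^(-p/2)."""
    p = X.shape[1]
    Y = p / 2.0
    D = [0.0]  # D[0] = 0 by convention
    for k in range(1, max_k + 1):
        centers = kmeans(X, k)
        D.append(distortion(X, centers) ** (-Y))
    jumps = [D[i] - D[i - 1] for i in range(1, max_k + 1)]
    return 1 + int(np.argmax(jumps))

# Two well-separated 2-D Gaussian blobs: the jump should occur at k = 2.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(5, 0.3, (100, 2))])
print(jump_method(X, max_k=6))
```

On well-separated data the transformed distortion D[k] rises sharply at the true number of clusters and only gradually afterwards, so the largest jump J(k) identifies it.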
asymptotic reasoning using results from rate distortion theory. Let the data
X have a single, arbitrary
p-dimensional
Gaussian distribution, and let K = \lfloor\alpha^p\rfloor for some fixed \alpha > 0. Then the distortion of a clustering of
K clusters in the
limit as
p goes to infinity is \alpha^{-2}. Asymptotically, then, the transformed distortion d_K^{-p/2} = (\alpha^{-2})^{-p/2} = \alpha^p, which by definition is approximately the number of clusters
K. In other words, for a single Gaussian distribution, increasing
K beyond the true number of clusters, which should be one, causes linear growth in the transformed distortion. This behavior is important in the general case of a mixture of multiple distribution components. Let
X be a mixture of
G p-dimensional Gaussian distributions with common covariance. Then for any fixed
K less than
G, the distortion of a clustering as
p goes to infinity is infinite. Intuitively, this means that a clustering of less than the correct number of clusters is unable to describe asymptotically high-dimensional data, causing the distortion to increase without limit. If, as described above,
K is made an increasing function of
p, namely, K = \lfloor\alpha^p\rfloor, the same result as above is achieved, with the value of the distortion in the limit as
p goes to infinity being equal to \alpha^{-2}. Correspondingly, there is the same proportional relationship between the transformed distortion and the number of clusters,
K. Putting the results above together, it can be seen that for sufficiently high values of
p, the transformed distortion d_K^{-p/2} is approximately zero for
K <
G, then jumps suddenly and begins increasing linearly for
K ≥
G. The jump algorithm for choosing
K makes use of these behaviors to identify the most likely value for the true number of clusters. Although the mathematical support for the method is given in terms of asymptotic results, the algorithm has been
empirically verified to work well in a variety of data sets with reasonable dimensionality. In addition to the localized jump method described above, there exists a second algorithm for choosing
K using the same transformed distortion values, known as the broken line method. The broken line method identifies the jump point in the graph of the transformed distortion by fitting two line segments with a simple
least squares error line fit, which in theory will fall along the
x-axis for
K <
G, and along the linearly increasing phase of the transformed distortion plot for
K ≥
G. The broken line method is more robust than the jump method in that its decision is global rather than local, but it also relies on the assumption of Gaussian mixture components, whereas the jump method is fully
non-parametric and has been shown to be viable for general mixture distributions.

== Silhouette method ==