In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate
distance d, such as the Euclidean distance, between
single observations of the data set, and a linkage criterion, which specifies the dissimilarity of
sets as a function of the pairwise distances of observations in the sets. The choice of metric and of linkage criterion can each have a major impact on the result of the clustering: the metric determines which individual objects are most similar, whereas the linkage criterion influences the shape of the clusters. For example, complete linkage tends to produce more spherical clusters than single linkage. Some commonly used linkage criteria between two sets of observations
A and
B and a distance
d are:

• Maximum or complete-linkage clustering: \max_{a \in A,\, b \in B} d(a,b)
• Minimum or single-linkage clustering: \min_{a \in A,\, b \in B} d(a,b)
• Unweighted average linkage clustering (UPGMA): \frac{1}{|A| \cdot |B|} \sum_{a \in A}\sum_{b \in B} d(a,b)
• Ward linkage, minimum increase of sum of squares (MISSQ): \sum_{x\in A\cup B} \lVert x - \mu_{A\cup B} \rVert^2 - \sum_{x\in A} \lVert x - \mu_{A} \rVert^2 - \sum_{x\in B} \lVert x - \mu_{B} \rVert^2
• Minimum increase in variance (MIVAR): \frac{1}{|A\cup B|}\sum_{x\in A\cup B} \lVert x - \mu_{A\cup B} \rVert^2 - \frac{1}{|A|}\sum_{x\in A} \lVert x - \mu_{A} \rVert^2 - \frac{1}{|B|}\sum_{x\in B} \lVert x - \mu_{B} \rVert^2 = \text{Var}(A\cup B) - \text{Var}(A) - \text{Var}(B)
• Minimum sum increase medoid linkage: \min_{m\in A\cup B} \sum_{y\in A\cup B} d(m,y) - \min_{m\in A} \sum_{y\in A} d(m,y) - \min_{m\in B} \sum_{y\in B} d(m,y)

Some of these can only be recomputed recursively (WPGMA, WPGMC); for many, a recursive computation with the Lance–Williams equations is more efficient, while for others (Hausdorff, medoid) the distances have to be computed with the slower full formula. Other linkage criteria include: • The probability that candidate clusters spawn from the same distribution function (V-linkage). • The product of in-degree and out-degree on a k-nearest-neighbour graph (graph degree linkage). • The increment of some cluster descriptor (i.e., a quantity defined for measuring the quality of a cluster) after merging two clusters.
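As a minimal sketch (plain Python, assuming the Euclidean distance as the underlying metric), the single-, complete-, and average-linkage dissimilarities between two clusters can be computed directly from the pairwise distances:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def pairwise_distances(A, B, d=euclidean):
    """All pairwise distances d(a, b) for a in A, b in B."""
    return [d(a, b) for a in A for b in B]

def single_linkage(A, B):
    return min(pairwise_distances(A, B))   # distance of the nearest pair

def complete_linkage(A, B):
    return max(pairwise_distances(A, B))   # distance of the farthest pair

def average_linkage(A, B):
    dists = pairwise_distances(A, B)       # UPGMA: 1/(|A||B|) * sum of all pairs
    return sum(dists) / len(dists)

A = [(0.0, 0.0), (0.0, 1.0)]
B = [(3.0, 0.0), (4.0, 0.0)]
print(single_linkage(A, B))    # 3.0, the pair (0,0)-(3,0)
print(complete_linkage(A, B))  # ~4.1231, sqrt(17), the pair (0,1)-(4,0)
```

Note that a single pair of points determines the single- and complete-linkage values, which is why single linkage is so sensitive to one noisy point bridging two clusters.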
== Comparison of linkage methods ==

The choice of linkage criterion significantly affects the shape and quality of the resulting clusters. Each method has distinct strengths and weaknesses depending on the structure of the data.
Single linkage (also called the
nearest-neighbor method) defines the distance between two clusters as the minimum distance between any pair of points, one from each cluster. This method can detect and handle clusters of arbitrary shape, including elongated and
non-globular structures. However, it is highly sensitive to
noise and
outliers, because a single pair of close points is sufficient to merge two clusters.
Complete linkage defines the distance between two clusters as the maximum distance between any pair of points across the two clusters. It is less susceptible to noise and outliers than single linkage. However, complete linkage is biased toward producing globular clusters of similar size and may incorrectly break apart large or irregularly shaped clusters.
Average linkage (specifically UPGMA, the unweighted pair group method with arithmetic mean) defines the distance between two clusters as the average of all pairwise distances between points in the two clusters. It represents a compromise between the extremes of single and complete linkage. Average linkage is more robust to noise than single linkage but, like complete linkage, it tends to be biased toward detecting globular clusters.
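These trade-offs can be seen empirically with SciPy's hierarchical clustering routines; the following sketch (with illustrative, synthetic data) builds two elongated, well-separated point groups and cuts the dendrogram produced by each linkage method into two clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two elongated horizontal groups, separated vertically; synthetic example data.
rng = np.random.default_rng(0)
group1 = np.column_stack([np.linspace(0, 5, 20), rng.normal(0.0, 0.1, 20)])
group2 = np.column_stack([np.linspace(0, 5, 20), rng.normal(3.0, 0.1, 20)])
X = np.vstack([group1, group2])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                     # (n-1) x 4 merge history
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut tree into 2 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes
```

Because the gap between the groups is much larger than the spacing of neighbouring points within each group, single linkage recovers the two elongated groups exactly (20 points each), consistent with its ability to handle non-globular structures.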
== Computational complexity ==

The naive implementation of agglomerative hierarchical clustering has a time complexity of O(
n³) and a space complexity of O(
n²), where
n is the number of data points, because at each of the
n − 1 merge steps the algorithm must search and update an
n ×
n proximity matrix. For single linkage, optimized algorithms based on minimum spanning trees reduce the time complexity to O(
n² log
n). For other linkage criteria, the use of
priority queues and nearest-neighbor chains can also achieve O(
n² log
n) time in practice.

== Hausdorff linkage ==