The numerator of the CH index is the between-cluster separation (BCSS) divided by its degrees of freedom. The number of degrees of freedom of BCSS is k - 1, since fixing the centroids of k - 1 clusters also determines the k-th centroid: its value must make the size-weighted average of all centroids match the overall data centroid.

The denominator of the CH index is the within-cluster dispersion (WCSS) divided by its degrees of freedom. The number of degrees of freedom of WCSS is
n -
k, since fixing the centroid of each cluster reduces the degrees of freedom by one. This is because, given the centroid c_i of cluster C_i, fixing n_i - 1 of the points in that cluster also determines the n_i-th point, since the mean of the points assigned to the cluster must equal c_i.

Dividing both the BCSS and WCSS by their degrees of freedom normalizes the values, making them comparable across different numbers of clusters. Without this normalization, the CH index could be artificially inflated for higher values of
k, making it hard to determine whether an increase in the index value is due to genuinely better clustering or just due to the increased number of clusters.

A higher value of CH indicates a better clustering, because it means that the data points are more spread out between clusters than they are within clusters.

Although there is no satisfactory probabilistic foundation to support the use of the CH index, the criterion has some desirable mathematical properties. The effectiveness of the CH index for cluster evaluation relative to other internal clustering evaluation metrics has also been discussed in the literature. Maulik and Bandyopadhyay evaluate the performance of three clustering algorithms using four cluster validity indices, including
the Davies–Bouldin index, the Dunn index, the Calinski–Harabasz index and a newly developed index. Wang et al. have suggested an improved index for clustering validation based on the Silhouette index and the Calinski–Harabasz index.

== Finding the optimal number of clusters ==
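Comparing candidate clusterings by their CH value requires computing the index from its definition: BCSS divided by k - 1, over WCSS divided by n - k. The following is a minimal pure-Python sketch of that computation, not a reference implementation. The function name `calinski_harabasz` and the toy data are illustrative assumptions, and the sketch assumes WCSS is greater than zero.

```python
from collections import defaultdict


def calinski_harabasz(points, labels):
    """Return the CH index for points (tuples of floats) and their cluster labels."""
    n = len(points)
    dim = len(points[0])

    # Group the points by cluster label.
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    k = len(clusters)
    if not 1 < k < n:
        raise ValueError("the CH index is only defined for 1 < k < n")

    def centroid(pts):
        return [sum(p[d] for p in pts) / len(pts) for d in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    bcss = 0.0  # between-cluster sum of squares
    wcss = 0.0  # within-cluster sum of squares
    for pts in clusters.values():
        c = centroid(pts)
        bcss += len(pts) * sq_dist(c, overall)
        wcss += sum(sq_dist(p, c) for p in pts)

    # Normalize each sum of squares by its degrees of freedom.
    # Assumes wcss > 0; a clustering with zero dispersion would divide by zero.
    return (bcss / (k - 1)) / (wcss / (n - k))


# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = calinski_harabasz(points, [0, 0, 0, 1, 1, 1])  # approximately 450
bad = calinski_harabasz(points, [0, 1, 0, 1, 0, 1])   # approximately 0.495
print(good, bad)
```

The labeling that matches the two visible groups scores far higher than the interleaved one, in line with the interpretation that a higher CH value indicates a better clustering.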