Elbow Method Alternatives to Find the Number of Clusters
The elbow method is one of the most popular ways to find K, the number of clusters, in K-Means.
However, the elbow method is a heuristic: the elbow is picked by eye from a plot of within-cluster variation against K, so the chosen K can be subjective, and on some data there is no clear elbow at all.
That is why there are a few other methods for selecting the number of clusters; they are:
Silhouette coefficient
The method measures how similar each data point is to its assigned cluster compared with the nearest other cluster. The coefficient ranges from -1 to 1, with higher values indicating better clustering. A value near 0 means the point lies close to the boundary between two clusters, and a negative value suggests the point may have been assigned to the wrong cluster.
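As a rough illustration of the silhouette approach, here is a minimal sketch that assumes scikit-learn is available. The synthetic four-blob dataset, the range of K values, and the variable names are illustrative assumptions, not part of the original tip.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated groups (an assumption for illustration).
centres = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=400, centers=centres, cluster_std=0.8, random_state=42)

silhouette_by_k = {}
for k in range(2, 9):  # the silhouette coefficient needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Mean silhouette coefficient over all points for this K.
    silhouette_by_k[k] = silhouette_score(X, labels)

# Pick the K with the highest mean silhouette coefficient.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(best_k, round(silhouette_by_k[best_k], 3))
```

On real data the scores are usually much closer together, so it is worth looking at the whole curve and not only the maximum.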
Gap statistic method
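The method compares the total within-cluster variation for each K with its expected value under a null reference distribution of the data, that is, data with no obvious clustering. A common choice for the best K is the one that maximizes the gap statistic (the original formulation picks the smallest K whose gap is within one standard error of the gap at K + 1). It can also be an effective method for higher-dimensional data.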
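scikit-learn does not ship a gap statistic, so the sketch below is a small hand-rolled version under stated assumptions: K-Means inertia stands in for the within-cluster variation, the null reference is uniform noise drawn inside the bounding box of the data, and the best K is simply the one with the largest gap. The helper names `log_inertia` and `gap_statistic` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def log_inertia(data, k, seed):
    """Log of the total within-cluster sum of squares for a K-Means fit."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
    return np.log(model.inertia_)


def gap_statistic(X, k_values, n_refs=10, seed=0):
    """Return {k: (gap, s_k)} using uniform reference data in X's bounding box."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    results = {}
    for k in k_values:
        # Within-cluster variation on n_refs reference datasets with no structure.
        ref_logs = np.array([
            log_inertia(rng.uniform(lo, hi, size=X.shape), k, seed)
            for _ in range(n_refs)
        ])
        gap = ref_logs.mean() - log_inertia(X, k, seed)
        s_k = ref_logs.std() * np.sqrt(1 + 1 / n_refs)  # simulation error term
        results[k] = (gap, s_k)
    return results


# Synthetic data with 4 well-separated groups (an assumption for illustration).
centres = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=300, centers=centres, cluster_std=0.8, random_state=42)

gaps = gap_statistic(X, range(1, 8))
best_k = max(gaps, key=lambda k: gaps[k][0])  # simplified rule: largest gap
print(best_k)
```

Unlike the other scores in this list, the gap statistic is defined for K = 1, so it can also suggest that the data has no cluster structure at all.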
Calinski-Harabasz index
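It calculates the ratio of between-cluster dispersion (the sum of squared distances between the cluster centres and the global mean, weighted by cluster size) to within-cluster dispersion (the sum of squared distances between the data points and their assigned cluster centres), with each term scaled by its degrees of freedom. The K value with the highest Calinski-Harabasz index is taken as the optimal value.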
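To make the ratio concrete, the sketch below computes the index by hand with NumPy and compares it against scikit-learn's `calinski_harabasz_score`. The helper name `ch_index` and the synthetic dataset are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score


def ch_index(X, labels):
    """Between-cluster over within-cluster dispersion, each scaled by its degrees of freedom."""
    n, clusters = len(X), np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    between = within = 0.0
    for c in clusters:
        members = X[labels == c]
        centre = members.mean(axis=0)
        # Squared distance from the cluster centre to the global mean, weighted by size.
        between += len(members) * np.sum((centre - overall_mean) ** 2)
        # Squared distances from the members to their own cluster centre.
        within += np.sum((members - centre) ** 2)
    return (between / (k - 1)) / (within / (n - k))


# Synthetic data with 4 well-separated groups (an assumption for illustration).
centres = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=400, centers=centres, cluster_std=0.8, random_state=42)

ch_by_k = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    ch_by_k[k] = ch_index(X, labels)
    # The hand-written version should agree with the library implementation.
    assert np.isclose(ch_by_k[k], calinski_harabasz_score(X, labels))

best_k = max(ch_by_k, key=ch_by_k.get)  # higher is better
print(best_k)
```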
Davies-Bouldin Index
It measures the average similarity between each cluster and its most similar cluster, where similarity is the ratio of the within-cluster scatter of the two clusters to the distance between their centres. Lower values indicate better clustering performance, with 0 as the lowest possible score. The Davies-Bouldin index is sometimes considered to be more robust to noise and outliers.
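The definition is easier to follow in code. The sketch below writes the index out with NumPy and compares it against scikit-learn's `davies_bouldin_score`. The helper name `db_index` and the synthetic dataset are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score


def db_index(X, labels):
    """Mean, over clusters, of the worst (scatter_i + scatter_j) / centre-distance ratio."""
    clusters = np.unique(labels)
    centres = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # Scatter: mean distance from each member to its own cluster centre.
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centres[i], axis=1).mean()
        for i, c in enumerate(clusters)
    ])
    worst_ratios = []
    for i in range(len(clusters)):
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centres[i] - centres[j])
            for j in range(len(clusters)) if j != i
        ]
        worst_ratios.append(max(ratios))  # the most similar other cluster
    return float(np.mean(worst_ratios))


# Synthetic data with 4 well-separated groups (an assumption for illustration).
blob_centres = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=400, centers=blob_centres, cluster_std=0.8, random_state=42)

db_by_k = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    db_by_k[k] = db_index(X, labels)
    # The hand-written version should agree with the library implementation.
    assert np.isclose(db_by_k[k], davies_bouldin_score(X, labels))

best_k = min(db_by_k, key=db_by_k.get)  # lower is better
print(best_k)
```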
Bayesian Information Criterion (BIC)
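The BIC is based on Bayesian probability theory and is designed to balance the goodness of fit of a model to the data against the complexity of the model. It requires a likelihood, so it is usually computed for a probabilistic clustering model such as a Gaussian mixture rather than for plain K-Means. The K value with the lowest BIC score is taken as the optimal value.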
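A minimal sketch of BIC-based selection, assuming scikit-learn's `GaussianMixture` is used as the probabilistic stand-in for K-Means. The synthetic dataset and the range of K values are illustrative assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data with 4 well-separated groups (an assumption for illustration).
centres = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=400, centers=centres, cluster_std=0.8, random_state=42)

bic_by_k = {}
for k in range(1, 9):
    gmm = GaussianMixture(n_components=k, n_init=5, random_state=42).fit(X)
    # bic() combines the log-likelihood with a penalty on the number of parameters.
    bic_by_k[k] = gmm.bic(X)

best_k = min(bic_by_k, key=bic_by_k.get)  # lower is better
print(best_k)
```

Because the penalty grows with the number of parameters, adding components only lowers the BIC when they improve the fit enough to pay for themselves.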
That is all for today! Leave a comment if you want to see more tips on machine learning.