Why K-Means Fails at Non-Convex Shape Data - NBD Lite #8
The popular clustering method might not be a silver bullet for everything.
K-Means is a popular clustering algorithm that separates datasets into K number of non-overlapping clusters.
It’s widely used in data analysis and mining, and it’s very intuitive.
However, K-Means has a few weaknesses. One of them is that it fails on non-convex shape data.
Let’s take a look at the following clustering visualization.
When the dataset is in a convex or roughly spherical shape, K-Means clusters it easily.
However, let’s look at the K-Means process when the dataset is in the crescent moon shape.
We can see that K-Means can't separate the crescents properly; instead, it mixes up data points from the different curves.
Humans can easily see how the moon-shaped data should be separated. However, the machine only has the information its algorithm provides.
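If you want to reproduce this behavior yourself, here is a minimal sketch using scikit-learn (the make_blobs/make_moons datasets and the plotting choices are my assumptions for illustration, not necessarily what the original visualization used):

```python
# Minimal sketch: K-Means on convex blobs vs. crescent-moon data.
# Datasets and parameters are assumptions for illustration only.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons

# Convex, roughly spherical clusters: K-Means separates these easily.
X_blobs, _ = make_blobs(n_samples=300, centers=2, random_state=42)
blob_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_blobs)

# Crescent-moon clusters: K-Means cuts straight across both curves.
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
moon_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X_blobs[:, 0], X_blobs[:, 1], c=blob_labels)
ax1.set_title("Convex blobs: clean separation")
ax2.scatter(X_moons[:, 0], X_moons[:, 1], c=moon_labels)
ax2.set_title("Crescent moons: curves get mixed")
plt.show()
```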
So, let’s understand a little bit about how K-Means works. Here is a simple visualization of how the algorithm performs its clustering process.
K-Means works by setting the number of clusters K we want and initializing as many centroids as K.
We then assign each data point to its closest centroid based on Euclidean distance.
By iterating this process of repositioning the centroids until they converge, we get the clustering result.
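If you prefer to see those steps in code, here is a bare-bones NumPy sketch of the loop (random initialization, Euclidean assignment, centroid update); it only illustrates the idea, and a library implementation such as scikit-learn's KMeans adds smarter initialization and convergence handling:

```python
# Bare-bones K-Means loop: initialize K centroids, assign points to the
# nearest centroid by Euclidean distance, then reposition the centroids,
# repeating until they stop moving. Empty clusters are not handled here.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize K centroids by picking K random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its closest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: reposition each centroid at the mean of its assigned points.
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids
```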
As we can see from the method, K-Means inherently assumes that the clusters are convex in shape.
The assumption comes from the algorithm's dependence on Euclidean distance.
With Euclidean distance, each cluster is basically a convex set: any two points within a cluster are connected by a straight line that lies entirely within that cluster.
The convex property arises because data points are assigned to the nearest centroid, and the region where a given centroid is the closest (its Voronoi cell) always forms a convex shape.
Because no line segment between two data points exits the cluster, K-Means inherently produces convex clusters.
As a result of this convex property, the algorithm has a hard time capturing clusters where the line between two points would pass outside the cluster boundary, which is exactly what happens with the crescent moons.
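One way to convince yourself of this: take any two points that K-Means assigned to the same cluster and walk along the straight line between them; every point on that line is still closest to the same centroid. Here is a small numerical check (reusing the moons data from the sketch above, which is an assumption for illustration):

```python
# Numerical check that each K-Means region is convex: points sampled along
# the straight line between two same-cluster points never leave that cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

same_cluster = X[km.labels_ == 0]
a, b = same_cluster[0], same_cluster[-1]
# Sample points along the segment from a to b and predict their cluster.
line = np.array([a + t * (b - a) for t in np.linspace(0, 1, 50)])
print(np.all(km.predict(line) == 0))  # True: the segment never leaves region 0
```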
If your dataset has a non-convex shape, you might want to try other clustering alternatives, such as density-based methods (e.g., DBSCAN), graph-based methods (e.g., Spectral Clustering), or hierarchical clustering with single linkage.
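As a quick illustration of one alternative, here is a hedged sketch with scikit-learn's SpectralClustering on the same crescent-moon data; the nearest-neighbors affinity follows the shape of each curve, so the two moons get separated (the parameter choices are assumptions for this synthetic data, not universal defaults):

```python
# Sketch: Spectral Clustering separates the crescent moons that K-Means mixed up.
# Parameters are assumptions tuned for this synthetic dataset.
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
labels = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", random_state=42
).fit_predict(X)
print(labels[:10])  # each crescent now gets its own label
```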
That’s all for today! I hope this helps you understand why K-Means might not be the best for non-convex shape data.
Are there any more things you would love to discuss? Let’s talk about it together!
👇👇👇
FREE Learning Material for you❤️
👉Unsupervised Learning by Zoubin Ghahramani