There are many different clustering algorithms, so the question is: when do you use which? Today I learned what to consider when choosing a clustering algorithm. To reiterate, there are four families of clustering algorithms:

  1. Distribution-based, e.g. Gaussian mixture models fitted with EM (EM-GMM)

  2. Centroid-based, e.g. K-Means

  3. Connectivity-based, e.g. hierarchical clustering

  4. Density-based, e.g. Mean-shift and DBSCAN

Hierarchical Clustering

If you don’t know the number of clusters in your dataset ahead of time, or you want to find groups within groups within groups, then hierarchical clustering is a strong choice! As mentioned in the previous post, agglomerative hierarchical clustering works bottom-up: each data point starts as its own cluster, and the closest clusters are merged step by step according to a distance metric. You can then cut the resulting hierarchy at whatever level gives you the grouping you need.
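To make the bottom-up merging concrete, here is a minimal from-scratch sketch. It is not how a library would implement it: the function name `agglomerative`, the `max_distance` stopping rule, the choice of single linkage, and the toy points are all my own assumptions for illustration. In practice you would probably reach for a library such as SciPy or scikit-learn.

```python
from math import dist


def agglomerative(points, max_distance):
    """Single-linkage agglomerative clustering (illustrative sketch).

    Starts with one cluster per point and keeps merging the two closest
    clusters until the closest pair is farther apart than max_distance.
    Returns clusters as sorted lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > max_distance:
            break  # nothing left that is close enough to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Made-up toy data: two nearby pairs plus one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0), (20.0, 20.0)]

fine = agglomerative(points, max_distance=1.5)    # small groups
coarse = agglomerative(points, max_distance=5.0)  # groups of groups
print(fine)    # [[0, 1], [2, 3], [4]]
print(coarse)  # [[0, 1, 2, 3], [4]]
```

Notice that I never told it how many clusters to find. Loosening the distance threshold simply merges the two small groups into a bigger one, which is the "groups within groups" idea. Also notice that the far-away point still ends up as its own cluster rather than being flagged as an outlier.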

Density-Based Clustering

The downside of hierarchical clustering is that it has no notion of noise: every data point, outliers included, ends up in some cluster. To solve this, you can look into density-based clustering, which groups data points that are tightly packed together and treats the remaining points as noise. An example of density-based clustering is DBSCAN, which was introduced in the previous post. DBSCAN is great because you don’t need to specify the number of clusters and it can capture clusters of different shapes, though you do need to choose a neighbourhood radius and a minimum number of points per neighbourhood.
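Here is a rough from-scratch sketch of the DBSCAN idea, just to show how dense regions grow into clusters and how leftover points become noise. The function name `dbscan`, the parameter names `eps` and `min_samples`, and the toy points are assumptions I picked for the example. It is a simplified illustration, not a replacement for a tested implementation such as the one in scikit-learn.

```python
from math import dist

NOISE = -1  # label used here for points that belong to no cluster


def dbscan(points, eps, min_samples):
    """Simplified DBSCAN (illustrative sketch).

    A point is a "core" point if at least min_samples points (itself
    included) lie within distance eps. Clusters grow outward from core
    points; anything never reached is left labelled NOISE.
    """
    n = len(points)
    neighbours = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        labels[i] = cluster_id
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: reachable, not core
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep expanding
        cluster_id += 1
    return labels


# Made-up toy data: a stretched-out line, a compact blob, and one outlier.
line = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
blob = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
outlier = [(50.0, 50.0)]

labels = dbscan(line + blob + outlier, eps=1.5, min_samples=3)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The line and the blob have very different shapes, yet each comes out as one cluster, and the outlier gets the noise label instead of being forced into a group. The number of clusters was never specified.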

Is there a place for K-Means Clustering?

K-means assigns every data point in the dataset to its nearest centroid, moves each centroid to the mean of its assigned points, and repeats, so the clustering improves with each iteration! K-means is one of the simplest clustering algorithms, but you do need to specify the number of clusters ahead of time. K-means would be a good starting point!
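The "improve over time" part is easiest to see in code. Below is a minimal sketch of the standard K-means loop (often called Lloyd's algorithm). The function name `kmeans`, the way initial centroids are passed in by hand, and the toy points are assumptions made for this example. Real implementations usually pick starting centroids more carefully and run several restarts.

```python
from math import dist


def kmeans(points, centroids, max_iter=100):
    """Plain K-means (illustrative sketch).

    The number of clusters is fixed up front by how many starting
    centroids are passed in. Returns (labels, final centroids).
    """
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: every point goes to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: dist(p, centroids[k]))
                  for p in points]
        # Update step: each centroid moves to the mean of its points.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # empty cluster: stay put
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has stabilised
        centroids = new_centroids
    return labels, centroids


# Made-up toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

# Assumption: start from one point in each group to keep the run deterministic.
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Unlike the other two approaches, the number of clusters (two here) had to be decided before running anything, which is exactly the trade-off mentioned above.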



