There are many different clustering algorithms, so the question is: when should you use which? Today I learned what to consider when choosing a clustering algorithm. To reiterate, there are four families of clustering:

Distribution-based. EM/GMM

Centroid-based. K-Means

Connectivity-based. Hierarchical

Density-based. Mean-shift and DBSCAN
Hierarchical Clustering
If you don’t know the exact number of clusters in your dataset, or you want to find groups within groups within groups, then hierarchical clustering is your best bet! As mentioned in a previous post, hierarchical (agglomerative) clustering works bottom-up: each data point starts as its own cluster, and the closest clusters are repeatedly merged according to a distance metric until only one remains.
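As a quick sketch of the bottom-up merging described above, here is what it might look like with scikit-learn's `AgglomerativeClustering` (the library choice and the toy data are my assumptions, not from the post):

```python
# Sketch: agglomerative (bottom-up) clustering on toy 2-D data,
# assuming scikit-learn is available.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated groups of points.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# Each point starts as its own cluster; the closest clusters are
# merged until n_clusters remain. You could instead cut the merge
# tree by distance_threshold if the cluster count is unknown.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(labels)  # the first three points share one label, the last three the other
```

Note that `linkage` controls the distance metric between clusters (Ward, average, complete, single), which can change the resulting tree quite a bit.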
Density-Based Clustering
A downside of hierarchical clustering is that it has no notion of noise: every point, outlier or not, ends up merged into some cluster. To address this, you can look into density-based clustering, which forms clusters from data points that are tightly packed together and treats the remaining points as noise. An example of density-based clustering is DBSCAN, which was introduced in the previous post. DBSCAN is great because you don’t need to specify the number of clusters and it can capture clusters of arbitrary shape.
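To see the noise-handling in action, here is a minimal sketch with scikit-learn's `DBSCAN` (the `eps` and `min_samples` values are illustrative assumptions you would tune for real data):

```python
# Sketch: DBSCAN labels tightly packed points as clusters and
# isolated points as noise (-1). Assumes scikit-learn is available.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [5.0, 5.0], [5.1, 5.1], [5.0, 5.2],
              [20.0, 20.0]])  # a lone outlier far from everything

# eps: neighborhood radius; min_samples: points needed (including
# the point itself) for a dense region. No cluster count required.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # the outlier gets label -1, i.e. noise
```

Notice there is no `n_clusters` parameter anywhere: the number of clusters falls out of the density structure of the data.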
Is there a place for K-Means Clustering?
K-means considers every data point in the dataset and uses that information to improve the clustering iteratively: assign each point to its nearest centroid, recompute the centroids, and repeat until the assignments stop changing. K-means is the simplest clustering algorithm, but you do need to specify the number of clusters ahead of time. K-means would be a good starting point!
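The assign-then-recompute loop can be sketched with scikit-learn's `KMeans` (again assuming scikit-learn; the data and `random_state` are illustrative):

```python
# Sketch: k-means on toy data, assuming scikit-learn is available.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.2],
              [8.0, 8.0], [8.1, 8.2], [8.2, 8.1]])

# n_clusters must be chosen up front; KMeans then alternates between
# assigning points to the nearest centroid and recomputing centroids.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # one label per data point
print(km.cluster_centers_)  # the final centroids, one per cluster
```

If you are unsure what number to pass as `n_clusters`, a common approach is to run k-means for several values of k and compare inertia (the elbow method) or silhouette scores.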