In many data science and machine learning problems, one of the key questions is: do we have data? More specifically, do we have labelled data? In most cases we either have no labelled data, have limited labelled data, or are required to train on labelled data that differs from the inference data (zero-shot learning). It's therefore important to get familiar with unsupervised techniques, and clustering seems like a good place to start 🙂
What is clustering in ML?
Clustering is an unsupervised technique that aims to group input data into a number of categories based on different underlying generative features.
Why is clustering so important and what are the different clustering methods?
Clustering allows us to group data together without the need for labelled data. There are 4 main clustering methods:

Density-based. Dense regions of points form the clusters. Examples include DBSCAN and OPTICS

Hierarchy-based. New clusters are formed from previously formed ones, resulting in a tree-like structure. There are two categories: agglomerative (bottom-up) and divisive (top-down)

Partitioning. Partitions the objects into k groups, where each partition forms one cluster. Examples include K-means

Grid-based. The data space is converted into a finite number of grid-like cells. Examples include STING and CLIQUE
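To make the first three families concrete, here is a small sketch that runs a partitioning, a density-based, and a hierarchical method on the same toy dataset. It assumes scikit-learn is available; the dataset and parameter values (`eps=0.3`, `min_samples=5`) are my own illustrative choices, not from the original post.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

# Two interleaving half-moons: a shape that partitioning methods
# struggle with, but density-based methods handle well.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Partitioning: K-means with k=2
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Density-based: DBSCAN grows clusters out of dense regions
# (points labelled -1 are treated as noise)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Hierarchical: agglomerative (bottom-up) merging of clusters
ag_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

print("DBSCAN found", len(set(db_labels) - {-1}), "clusters")
```

On this dataset DBSCAN recovers the two moons from density alone, without being told the number of clusters in advance.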
What’s the simplest clustering algorithm to start with, and how does it work?
K-means clustering. It’s a very fast algorithm, with per-iteration cost linear in the number of data points. It works as follows:

Select the number of clusters, k

Randomly initialise the k center points

Assign each data point to its nearest group center

Based on those assignments, recompute each group center as the average of all the vectors within the group

Repeat steps 3 and 4 until the group centers no longer change significantly
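The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function name, toy data, and convergence tolerance are my own assumptions.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means sketch following the steps above."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centers barely move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centers = kmeans(X, k=2)
```

With well-separated blobs like these, the first 50 points end up in one cluster and the last 50 in the other.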
The disadvantages are as follows:

The number of clusters must be decided in advance

Results aren’t reproducible by default, because the cluster centers are randomly initialised
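The second drawback can be worked around in practice by fixing the random seed, so the "random" initialisation is the same on every run. A minimal sketch, assuming scikit-learn (the dataset and `random_state` value are my own illustrative choices):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

# Fixing random_state makes the otherwise random initialisation
# deterministic, so repeated runs give identical clusterings.
run1 = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)
run2 = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)

print(bool((run1 == run2).all()))
```

Note this only makes a given run repeatable; a different seed can still converge to a different local optimum, which is why scikit-learn's `n_init` restarts the algorithm several times and keeps the best result.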