In many data science and machine learning problems, one of the key questions is: do we have data? More specifically, do we have labelled data? In most cases we either have no labelled data, have only a limited amount, or must train on labelled data that differs from what the model sees at inference time (as in zero-shot learning). I therefore believe it’s important to get familiar with unsupervised techniques, and clustering seems like a good place to start 🙂

What is clustering in ML?

Clustering is an unsupervised technique that groups input data into a number of clusters, so that points in the same cluster are more similar to one another than to points in other clusters.

Why is clustering so important and what are the different clustering methods?

Clustering allows us to group data without needing labels. There are four main families of clustering methods:

  1. Density-based. Dense regions of the data space are treated as clusters. Examples include DBSCAN and OPTICS

  2. Hierarchical. New clusters are formed from previously formed ones, resulting in a tree-like structure. There are two categories: agglomerative (bottom-up) and divisive (top-down)

  3. Partitioning. The objects are split into k partitions, and each partition forms one cluster. Examples include KMeans

  4. Grid-based. The data space is divided into a finite number of cells that form a grid-like structure, and clustering is done on the cells. Examples include STING and CLIQUE
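
To make the bottom-up (agglomerative) idea from the list above concrete, here is a minimal sketch in plain Python. It assumes 1-D points and single linkage (the distance between two clusters is the smallest gap between their members). Both assumptions are mine, chosen only to keep the example short, and `agglomerative_1d` is a hypothetical helper name rather than a library function.

```python
def agglomerative_1d(points, n_clusters):
    """Single-linkage agglomerative clustering for 1-D points (illustrative only)."""
    # Start bottom-up: every point is its own cluster.
    clusters = [[p] for p in sorted(points)]
    while len(clusters) > n_clusters:
        # With sorted 1-D data, the closest pair of clusters is always adjacent,
        # so it is enough to look at the gaps between neighbouring clusters.
        gaps = [clusters[i + 1][0] - clusters[i][-1] for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        # Merge the two closest clusters into one.
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


result = agglomerative_1d([5.0, 1.0, 9.0, 1.2, 5.1], n_clusters=3)
```

Each pass of the loop merges the two nearest clusters, which is what produces the tree-like structure: recording the merges in order would give the full hierarchy rather than a single cut of it.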

What’s the simplest clustering algorithm to begin with, and how does it work?

K-means clustering. It’s a fast algorithm: each iteration costs O(n·k·d) for n points, k clusters and d dimensions, so the running time grows linearly with n. It works as follows:

  1. Select the number of clusters, k

  2. Randomly initialise the k center points

  3. Assign each data point to the group whose center is nearest to it

  4. Recompute each group center as the average of all the vectors assigned to that group

  5. Repeat steps 3 and 4 until the group centers barely change between iterations
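
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation. It makes several assumptions of my own: points are tuples of floats, distance is Euclidean, the initial centers are k randomly chosen data points, and an empty group keeps its previous center. The names `kmeans`, `max_iter`, `tol` and `seed` are also mine.

```python
import math
import random


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Plain k-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the initial centers (an assumption).
    centers = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 3: assign each point to the index of its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Step 4: move each center to the mean of the points assigned to it.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                # An empty group keeps its previous center.
                new_centers.append(centers[j])
        # Step 5: stop once no center moved by more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, labels


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(data, k=2)
```

On this toy data the two well-separated blobs end up in different groups, with each center sitting at the mean of its blob.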

The disadvantages are as follows:

  1. The number of clusters has to be decided in advance

  2. Results can differ from run to run because the cluster centers are initialised randomly, unless the random seed is fixed
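
The second point is usually handled by fixing the random seed. The sketch below shows only the initialisation step, assuming the centers are drawn from the data points. `init_centers` is a hypothetical helper, not part of any library.

```python
import random


def init_centers(points, k, seed=None):
    """Pick k distinct data points as starting centers (hypothetical helper)."""
    # seed=None lets Python seed the generator itself, so runs may differ.
    rng = random.Random(seed)
    return rng.sample(points, k)


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# The same seed gives the same starting centers, and therefore the same run.
first = init_centers(points, 2, seed=42)
second = init_centers(points, 2, seed=42)
same_start = first == second
```

Library implementations tend to expose the same idea as a parameter. For example, scikit-learn’s `KMeans` accepts `random_state`.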



