My goal for today is to be able to understanding and describe the 4 clustering algorithms below. The source medium post in linked below!
Describe MeanShift clustering.
MeanShift clustering uses sliding windows to find dense areas of data points and cluster them together. It works as follows:

Plot data points into a twodimensional space

Randomly start at a given point with a circular sliding window of radius r

Continue shifting the kernel to a higher density region till convergence. This is done so by shifting the center point to the mean of the data points within the sliding window. The density is measured by how many points are inside the window

Continue to shift the kernel until it’s not possible to accommodate for more points within the kernel
This process is repeated with many different data points and sliding windows until all the points lie within a window. The window with multiple sliding windows overlapping and with the most data points is preserved and data points are then clustered together based on the sliding window. The benefit is that you don’t need to select the number of clusters! The downside is you need to select the appropriate radius for your sliding window.
Describe DensityBased Spatial clustering of Applications with Noise (DBSCAN).
DBSCAN is a densitybased clustering algorithm that uses two main hyperparameters: epsilon and minimum data points. It works as follows:

Randomly start at a data point that hasn’t been visited

All the data points within epsilon distance are neighbourhood points

If the number of neighbouring points exceed the minimum predetermined data points, we would start the clustering process, otherwise it would be labelled as noise

The data points within the epsilon distance of the first data point of the new cluster would become part of the same cluster. This process is repeated for all the new data points until convergence

Once we completed the current cluster, we would randomly jump to a new unvisited point to potentially form new clusters. This process is repeated until you have visited all the data points
DBSCAN doesn’t require you to predetermined the number of clusters and it identifies outliers as noises! It can also form different size and shape clusters. The drawback is that DBSCAN doesn’t work well when clusters have different density as this is difficult to capture with a single epsilon and minimum data points thresholds.
Describe ExpectationMaximisation (EM) clustering with Gaussian Mixture Models (GMM).
Using the GMM allows us to use the mean and standard deviation to describe the shape of the clusters. We assume that data points are Gaussian distributed. Each Gaussian distribution is assigned to a single cluster. We use expectationmaximisation to find the optimal mean and standard deviation for each gaussian distribution. It works as follows:

Select the number of clusters and randomly initialised the Gaussian parameters for each cluster

Given the Gaussian distribution of each cluster, compute the probability that each data point belongs to a cluster. The closer the point is to the Gaussian’s center, the more likely it belongs to that cluster

Based on the probabilities, we compute a new set of Gaussian parameters to maximise the probabilities of data points within the clusters. Repeat step 2 and 3 until convergence
GMMs are very flexible in terms of cluster covariance and it covers different shape of clusters. KMeans is a special case of GMM. In addition, GMMs use probabilities means that each data point can have multiple clusters with different probabilities.
Describe Agglomerative Hierarchical clustering.
Agglomerative clustering is a bottomup approach that works as follows:

Treat each data point as a single cluster

Use a distance metric to measure the distance between two clusters

Each iteration, we would combine two clusters with the shortest distance into one cluster

Repeat step 3 until we reach the root of the tree (one cluster which contains all the data points)
Hierarchical clustering doesn’t require us to specify number of clusters. We can specify the number of clusters (that looks best) once we have computed the whole dendrogram! The algorithm is not sensitive to the choice of distance metric.