Intro
K-means minimises the within-cluster sum of squared distances, also known as inertia
$$
\text{Objective (inertia)} = \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^2
$$
where $C_i$ is the set of points assigned to cluster $i$ and $\mu_i$ is that cluster's centroid.
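As a worked example of the formula, the short sketch below computes the inertia for a tiny two-cluster dataset with NumPy; the data points, labels and centroid values are invented purely for illustration.

```python
import numpy as np

# Toy data: four 2-D points in two clusters (illustrative values, not from the notes)
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.0]])
labels = np.array([0, 0, 1, 1])                  # cluster index of each point
centroids = np.array([[1.25, 1.5], [8.5, 8.5]])  # mean of each cluster

# Inertia: sum of squared Euclidean distances from each point to its assigned centroid
inertia = np.sum((X - centroids[labels]) ** 2)
print(inertia)  # within-cluster sum of squared distances
```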
Limitations:
- The number of clusters (K) has to be specified in advance
- Sensitive to the initial placement of centroids; may converge to a local optimum
- Assumes clusters are spherical and of similar size, which does not always hold
Steps
- Initialisation: Start by randomly initialising K centroids
- Assignment: Each data point in the dataset is assigned to the nearest centroid based on some distance metric, typically the Euclidean distance
- Update:
    - Recalculate the centroids of the newly formed clusters
    - Each centroid is updated to be the mean of all data points assigned to its cluster
- Iteration: The assignment and update steps are repeated until a convergence criterion is met (see the sketch after this list):
    - the centroids no longer change significantly between iterations
    - or a maximum number of iterations is reached
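A minimal sketch of this assignment/update loop in NumPy, assuming the initial centroids are supplied (initialisation schemes are covered in the next section); the function name `kmeans`, the tolerance and the iteration cap are illustrative choices rather than anything fixed by the notes.

```python
import numpy as np

def kmeans(X, centroids, max_iter=100, tol=1e-4):
    """Alternate assignment and update steps until the centroids stop moving."""
    for _ in range(max_iter):
        # Assignment: distance of every point to every centroid, pick the nearest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)

        # Update: each centroid becomes the mean of the points assigned to it
        # (note: this sketch does not handle the empty-cluster edge case)
        new_centroids = np.array([X[labels == k].mean(axis=0)
                                  for k in range(len(centroids))])

        # Convergence: stop when the centroids barely change between iterations
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids
```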
Initialisation
- Random Initialisation: Select K data points, chosen uniformly at random from the dataset, as the initial centroids (a short sketch follows)
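A brief sketch of this scheme with NumPy; the helper name `random_init` and the seed are illustrative.

```python
import numpy as np

def random_init(X, k, seed=0):
    # Pick K distinct rows of X uniformly at random as the initial centroids
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx]
```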
- K-means++ Initialisation: An improvement over random initialisation that chooses initial centroids that are well spread out in the feature space, which typically leads to better convergence (a sketch of the seeding follows this list)
    - Starts by selecting one data point uniformly at random as the first centroid
    - Each subsequent centroid is chosen with probability proportional to its squared distance from the nearest already-chosen centroid
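A sketch of the seeding procedure described above; the helper name `kmeans_pp_init` and the seed are illustrative.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: spread the initial centroids out in feature space."""
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest already-chosen centroid
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Next centroid: sampled with probability proportional to that squared distance
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```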
Importance of centroid initialisation
- Convergence Speed: Poor initialisation can lead to slower convergence