Intro
K-means minimises the within-cluster sum of squared distances, also known as inertia
$$
\text{Objective (inertia)} = \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^2
$$
where $C_i$ is the set of points assigned to cluster $i$ and $\mu_i$ is that cluster's centroid.
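As a worked example of the formula, the short sketch below computes the inertia for a tiny two-cluster dataset with NumPy; the data points, labels and centroid values are invented purely for illustration.

```python
import numpy as np

# Toy data: four 2-D points in two clusters (illustrative values, not from the notes)
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.0]])
labels = np.array([0, 0, 1, 1])                  # cluster index of each point
centroids = np.array([[1.25, 1.5], [8.5, 8.5]])  # mean of each cluster

# Inertia: sum of squared Euclidean distances from each point to its assigned centroid
inertia = np.sum((X - centroids[labels]) ** 2)
print(inertia)  # within-cluster sum of squared distances
```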
Limitations:
- The number of clusters (K) has to be specified in advance
- Sensitive to the initial placement of centroids; may converge to a local optimum
- Assumes clusters are spherical and of similar size, which does not always hold
Steps
- Initialisation: Start by randomly initialising K centroids
- Assignment: Each data point in the dataset is assigned to the nearest centroid based on some distance metric, typically the Euclidean distance
- Update:
    - Recalculate the centroids of the newly formed clusters
    - Each centroid is updated to be the mean of all data points assigned to its cluster
- Iteration: The assignment and update steps are repeated until a convergence criterion is met (see the sketch after this list):
    - the centroids no longer change significantly between iterations
    - or a maximum number of iterations is reached
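A minimal sketch of this assignment/update loop in NumPy, assuming the initial centroids are supplied (initialisation schemes are covered in the next section); the function name `kmeans`, the tolerance and the iteration cap are illustrative choices rather than anything fixed by the notes.

```python
import numpy as np

def kmeans(X, centroids, max_iter=100, tol=1e-4):
    """Alternate assignment and update steps until the centroids stop moving."""
    for _ in range(max_iter):
        # Assignment: distance of every point to every centroid, pick the nearest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)

        # Update: each centroid becomes the mean of the points assigned to it
        # (note: this sketch does not handle the empty-cluster edge case)
        new_centroids = np.array([X[labels == k].mean(axis=0)
                                  for k in range(len(centroids))])

        # Convergence: stop when the centroids barely change between iterations
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids
```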
Initialisation
- Random Initialisation: Select K data points, chosen uniformly at random from the dataset, as the initial centroids (a short sketch follows)
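A brief sketch of this scheme with NumPy; the helper name `random_init` and the seed are illustrative.

```python
import numpy as np

def random_init(X, k, seed=0):
    # Pick K distinct rows of X uniformly at random as the initial centroids
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx]
```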
- K-means++ Initialisation: An improvement over random initialisation that chooses initial centroids that are well spread out in the feature space, which typically leads to better convergence (a sketch of the seeding follows this list)
    - Starts by selecting one data point uniformly at random as the first centroid
    - Each subsequent centroid is chosen with probability proportional to its squared distance from the nearest already-chosen centroid
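A sketch of the seeding procedure described above; the helper name `kmeans_pp_init` and the seed are illustrative.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: spread the initial centroids out in feature space."""
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest already-chosen centroid
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Next centroid: sampled with probability proportional to that squared distance
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```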
Importance of centroid initialisation
- Convergence Speed: Poor initialisation can lead to slower convergence