Intro

K-means partitions the data into k clusters by minimising the within-cluster sum of squared distances (WCSS), also known as inertia

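$$ \text{Objective function} = \sum_{i=1}^{k} \sum_{j} \lVert x_{ij}-\mu_{i} \rVert^2 $$

Here $x_{ij}$ is the j-th data point assigned to cluster i and $\mu_{i}$ is the centroid of cluster i, so each term is a squared Euclidean distance.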
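
As a minimal sketch of the objective, the snippet below sums the squared Euclidean distance from each point to its assigned centroid. The function name `inertia` and the toy data are illustrative assumptions, not part of these notes.

```python
def inertia(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total


# Hypothetical toy data: two clusters of two 2-D points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]                  # cluster index of each point
centroids = [(0.0, 1.0), (10.0, 1.0)]  # mean of each cluster

print(inertia(points, labels, centroids))  # 4.0: each point is 1 unit from its centroid
```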

Limitations:

Steps

  1. Initialisation:

    Start by randomly initialising K centroids

  2. Assignment:

    Each data point in the dataset is assigned to the nearest centroid based on some distance metric, typically the Euclidean distance.

  3. Update:

    Each centroid is recomputed as the mean of the data points assigned to it.

  4. Iteration:

    Steps 2 and 3 are repeated until a convergence criterion is met, for example when the assignments no longer change or a maximum number of iterations is reached.
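
The four steps can be sketched in plain Python as follows. This is one possible implementation under stated assumptions, not a reference one: the names `kmeans` and `squared_distance`, the `max_iter` and `seed` parameters, the "assignments unchanged" stopping rule, and the choice to keep an empty cluster's old centroid are all illustrative. It also assumes the data contains at least k distinct points.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1 (initialisation): pick k data points at random as starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2 (assignment): each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Step 4 (iteration): stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3 (update): move each centroid to the mean of its assigned points.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, labels = kmeans(points, k=2)
print(sorted(centroids), labels)
```

The order of the returned centroids depends on the random start, so the example sorts them before printing.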

Initialisation

  1. Random Initialisation:

  2. K-means++ Initialisation:

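    An improvement over random initialisation: the first centroid is drawn uniformly at random from the data, and each subsequent centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen. The initial centroids therefore tend to be well spread out in the feature space, which typically leads to faster convergence and lower final inertia.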
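
A minimal sketch of K-means++ seeding, using only the standard library. The function names, the `seed` parameter, and the toy data are illustrative assumptions, and the sketch assumes the data contains at least k distinct points (otherwise all sampling weights would be zero).

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_plus_plus(points, k, seed=0):
    rng = random.Random(seed)
    # First centroid: chosen uniformly at random from the data.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest already-chosen centroid.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        # Draw the next centroid with probability proportional to that weight,
        # so points far from the existing centroids are more likely to be picked.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


# Hypothetical toy data: three well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0), (20.0, 1.0)]
initial_centroids = kmeans_plus_plus(points, k=3)
print(initial_centroids)
```

An already-chosen point has weight zero, so the same data point is not selected twice. The returned centroids would then be used as the starting point for the assignment and update steps.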

Importance of centroid initialisation

  1. Convergence Speed: Poor initialisation can lead to slower convergence
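
To illustrate the effect on convergence speed, the sketch below runs one-dimensional K-means from two different starting points and counts the update passes needed before the assignments stop changing. The function name `lloyd_passes`, the toy data, and both sets of starting centroids are hypothetical choices made for this example.

```python
def lloyd_passes(values, start):
    """Run 1-D K-means from `start`; return (final centroids, number of update passes)."""
    centroids = list(start)
    labels = None
    passes = 0
    while True:
        # Assignment: nearest centroid for each value (ties go to the lower index).
        new_labels = [
            min(range(len(centroids)), key=lambda i: (v - centroids[i]) ** 2)
            for v in values
        ]
        if new_labels == labels:  # converged: assignments unchanged
            return centroids, passes
        labels = new_labels
        # Update: each centroid becomes the mean of its assigned values.
        for i in range(len(centroids)):
            members = [v for v, label in zip(values, labels) if label == i]
            if members:
                centroids[i] = sum(members) / len(members)
        passes += 1


values = [0.0, 1.0, 10.0, 11.0]  # two obvious groups

# Well-spread start: one centroid in each group.
good_centroids, good_passes = lloyd_passes(values, [0.0, 10.0])
# Poor start: both centroids inside the same group.
bad_centroids, bad_passes = lloyd_passes(values, [0.0, 1.0])

print(good_passes, bad_passes)  # 1 2
```

Both runs end at the same centroids here (0.5 and 10.5), but the poor start needs an extra pass to get there.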