- Organises the data into a tree-like structure, known as a dendrogram, where clusters at the bottom level represent individual data points, and clusters at higher levels represent combinations of lower-level clusters
- Doesn't require the user to specify the number of clusters beforehand, unlike partitioning algorithms like K-means
- Flexibility in Cluster Shapes and Sizes: Can handle clusters of arbitrary shapes and sizes, making it more flexible than some other algorithms that assume spherical or convex clusters
- Interpretability: Provides a hierarchical structure that can be visually interpreted through dendrograms, making it easier to understand the relationships between clusters and subclusters.
Agglomerative vs divisive
Agglomerative clustering starts with individual data points as separate clusters and merges them iteratively, while divisive clustering starts with all data points in a single cluster and splits them recursively.
Agglomerative is more commonly used.
Approach
- Initialisation: Begin by treating each data point as a separate cluster
- Compute Pairwise Distances: Calculate the distance between each pair of clusters (initially, individual data points are considered as clusters). Common distance metrics include Euclidean distance, Manhattan distance, and others.
- Merge Closest Clusters:
- Identify the two closest clusters based on a chosen linkage criterion. The linkage criterion determines how the distance between clusters is calculated. (Details in separate section)
- Merge the two closest clusters into a single cluster
- Update Distance Matrix: Update the distance matrix to reflect the distances between the newly formed cluster and the remaining clusters using the chosen linkage criterion
- Repeat Until Termination: Repeat steps 3 and 4 until only a single cluster containing all data points remains
- Construct Dendrogram:
- Select Number of Clusters: Determine the number of clusters by cutting the dendrogram at an appropriate level. The choice of the number of clusters depends on the specific problem and the desired granularity of the clustering.
Linkage criteria
Used to measure the distance between clusters during the clustering process.
- Single Linkage (Minimum Linkage):
- Single linkage defines the distance between two clusters as the shortest distance between any two points in the two clusters
- It tends to merge clusters based on the closest pair of points, resulting in elongated clusters.
- Single linkage is sensitive to chaining effects and can be affected by outliers and noise.