Hierarchical Clustering
Builds a tree-based hierarchy of clusters, which is often visualized using a diagram called a dendrogram.
Types of Hierarchical Clustering
- Agglomerative Clustering (Bottom-Up): The algorithm starts by treating each individual data sample as its own cluster. It then iteratively merges the two "closest" or most similar clusters, continuing this process until all samples are contained within a single, all-encompassing cluster.
- Divisive Clustering (Top-Down): It begins with one large cluster containing all data samples. In each step, it recursively splits the most appropriate cluster into two smaller ones, continuing until each sample forms its own cluster or another stopping criterion is met.
The Agglomerative Clustering Algorithm
The bottom-up agglomerative process can be summarized with the following steps:
1. Begin by assigning each sample $x_i$ to its own cluster $C_i = \{x_i\}$.
2. Compute the proximity (distance) between all pairs of clusters.
3. Find the pair of clusters with the highest similarity (i.e., the lowest distance).
4. Merge this pair of clusters into a single new cluster.
5. Update the proximity matrix to reflect the distances between this new cluster and all other existing clusters.
6. Repeat steps 3-5 until only one cluster remains.
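These steps map almost directly to code. Below is a minimal from-scratch sketch (plain NumPy, a naive cubic-time loop; the function and variable names are illustrative, not from any particular library) that records the merge history under single or complete linkage:

```python
import numpy as np

def agglomerative(X, linkage="single"):
    """Naive agglomerative clustering sketch; returns the merge history."""
    # Step 1: each sample starts in its own cluster.
    clusters = {i: [i] for i in range(len(X))}
    # Step 2: pairwise Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    merges, next_id = [], len(X)
    while len(clusters) > 1:
        # Step 3: find the closest pair of clusters under the linkage.
        best = None
        ids = sorted(clusters)
        for ai, a in enumerate(ids):
            for b in ids[ai + 1:]:
                pair = dist[np.ix_(clusters[a], clusters[b])]
                d = pair.min() if linkage == "single" else pair.max()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Step 4: merge the winning pair into a new cluster.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        # Step 5 is implicit here: linkage distances are recomputed from the
        # stored member indices on the next pass, rather than maintaining an
        # explicit cluster-level proximity matrix.
        next_id += 1
    # Step 6: the loop ends once a single cluster remains.
    return merges

# Toy usage: two well-separated blobs merge internally before joining.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (4, 2)), rng.normal(3, 0.3, (4, 2))])
for a, b, d in agglomerative(X):
    print(f"merged clusters {a} and {b} at distance {d:.3f}")
```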
The Dendrogram
The output of a hierarchical clustering algorithm is typically visualized as a dendrogram. This tree diagram provides a rich illustration of how clusters were progressively merged.
[Figure: example dendrogram. Image taken from Pattern Classification.]
The vertical axis of the dendrogram represents the cluster distance or dissimilarity. This is the value of the linkage criterion at which each merge occurred. By drawing a horizontal line across the dendrogram, one can "cut" the tree to obtain a specific number of clusters. A cut at a lower distance threshold results in a larger number of smaller clusters, reflecting a finer granularity. A cut at a higher distance results in fewer, larger clusters.
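In practice, SciPy provides this whole pipeline. The following sketch builds the merge tree and cuts it at two thresholds (the synthetic data and the threshold values are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Two synthetic blobs, so the tree has an obvious two-cluster structure.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.4, (10, 2)), rng.normal(4, 0.4, (10, 2))])

# Z encodes the full merge history (here using average linkage).
Z = linkage(X, method="average")

# "Cutting" the tree: a low distance threshold yields many small clusters,
# a high threshold yields few large ones.
fine = fcluster(Z, t=1.0, criterion="distance")
coarse = fcluster(Z, t=5.0, criterion="distance")
print(len(set(fine)), "clusters at t=1.0")    # finer granularity
print(len(set(coarse)), "clusters at t=5.0")  # coarser granularity

# dendrogram(Z) draws the tree itself when matplotlib is available.
```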
Measuring Distance Between Clusters (Linkage Criteria)
- Single Linkage: The distance between two clusters is the minimum distance between any single pair of points, one from each cluster. This method can identify clusters of non-elliptical shapes but is sensitive to noise and can result in long, drawn-out clusters, a phenomenon known as the "chaining effect".
- Complete Linkage: The distance between two clusters is the maximum distance between any single pair of points, one from each cluster. This approach is less susceptible to the chaining effect and tends to produce more compact, roughly spherical clusters.
- Average Linkage: The distance is calculated as the average of all pairwise distances between the points in the two clusters. This serves as a compromise between the single and complete linkage methods and is generally less sensitive to outliers.
- Ward's Method: This method takes a different approach based on variance. It merges the pair of clusters that leads to the minimum increase in the total within-cluster variance, measured as the sum of squared differences between the points and the centroid of a cluster. The increase in the sum of squared errors (SSE) when merging two clusters $C_i$ and $C_j$ is:

$$\Delta SSE(C_i, C_j) = \sum_{x \in C_i \cup C_j} \lVert x - \mu_{i \cup j} \rVert^2 - \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 - \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$$

where $\mu$ denotes the centroid of a cluster. This can be simplified to:

$$\Delta SSE(C_i, C_j) = \frac{|C_i|\,|C_j|}{|C_i| + |C_j|} \lVert \mu_i - \mu_j \rVert^2$$

Ward's method tends to produce clusters that are compact and of similar size.
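As a sanity check on the simplification above, here is a small numerical sketch (random synthetic clusters; the helper `sse` is a name introduced for illustration) verifying that the closed form matches the direct definition:

```python
import numpy as np

rng = np.random.default_rng(1)
Ci = rng.normal(0, 1, (6, 2))  # synthetic cluster C_i
Cj = rng.normal(2, 1, (4, 2))  # synthetic cluster C_j

def sse(C):
    # Sum of squared distances from each point to the cluster centroid.
    return ((C - C.mean(axis=0)) ** 2).sum()

# Direct definition: SSE of the merged cluster minus the SSE of each part.
delta_direct = sse(np.vstack([Ci, Cj])) - sse(Ci) - sse(Cj)

# Simplified closed form: only the cluster sizes and centroids are needed.
ni, nj = len(Ci), len(Cj)
delta_closed = ni * nj / (ni + nj) * ((Ci.mean(axis=0) - Cj.mean(axis=0)) ** 2).sum()

print(np.isclose(delta_direct, delta_closed))  # True
```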
"AI AI CAPTAIN!"