Machine Learning

Unsupervised Learning / K-Means Clustering

2. K-Means Clustering

K-Means is the most widely used partitioning clustering algorithm. It partitions n data points into k clusters by minimizing the total within-cluster sum of squared distances (inertia).

Algorithm Steps

INITIALIZATION: Randomly select k data points as initial centroids: μ₁, μ₂, ..., μₖ
ASSIGNMENT STEP: Assign each data point xᵢ to the nearest centroid.
Cⱼ = { xᵢ : ||xᵢ - μⱼ||² ≤ ||xᵢ - μₗ||² ∀ l ≠ j }
UPDATE STEP: Recompute each centroid as the mean of all points in the cluster.
μⱼ = (1 / |Cⱼ|) × Σ(xᵢ) for xᵢ ∈ Cⱼ
CONVERGENCE CHECK: If centroids do not change (or change is below a threshold), STOP. Otherwise, repeat from Step 2.

Worked Example (k=2)

Points: A(1,1), B(1.5,2), C(3,4), D(5,7), E(3.5,5), F(4.5,5)

Initialize: Pick C1=(1,1) and C2=(5,7).
Assignment (Iteration 1): Compute distance from points to C1 and C2.
- Cluster 1 (closest to C1): {A, B, C}
- Cluster 2 (closest to C2): {D, E, F}
Update Centroids:
- New C1 = ((1+1.5+3)/3, (1+2+4)/3) = (1.83, 2.33)
- New C2 = ((5+3.5+4.5)/3, (7+5+5)/3) = (4.33, 5.67)
Repeat Assignment (Iteration 2): Using new centroids.
- Cluster 1: {A, B}
- Cluster 2: {C, D, E, F}
Final Centroids: New C1=(1.25, 1.5), New C2=(4.0, 5.25). Next assignment yields no changes. CONVERGED.

The Elbow Method

Used to determine the optimal k. Run K-Means for various k values and plot the Inertia (within-cluster sum of squares). The "elbow" point where inertia starts decreasing linearly is the optimal k.

K-Means++

Improves initialization. Instead of purely random centers, it picks the first center randomly, and selects subsequent centers with a probability proportional to their squared distance from existing centers.

Intro to Unsupervised Learning K-Medoids Clustering