Machine Learning
Unsupervised Learning / K-Means Clustering
2. K-Means Clustering
K-Means is the most widely used partitioning clustering algorithm. It partitions n data points into k clusters by minimizing the total within-cluster sum of squared distances (inertia).
Algorithm Steps
- INITIALIZATION: Randomly select k data points as initial centroids: μ₁, μ₂, ..., μₖ
- ASSIGNMENT STEP: Assign each data point xᵢ to the nearest centroid.Cⱼ = { xᵢ : ||xᵢ - μⱼ||² ≤ ||xᵢ - μₗ||² ∀ l ≠ j }
- UPDATE STEP: Recompute each centroid as the mean of all points in the cluster.μⱼ = (1 / |Cⱼ|) × Σ(xᵢ) for xᵢ ∈ Cⱼ
- CONVERGENCE CHECK: If centroids do not change (or change is below a threshold), STOP. Otherwise, repeat from Step 2.
Worked Example (k=2)
Points: A(1,1), B(1.5,2), C(3,4), D(5,7), E(3.5,5), F(4.5,5)
- Initialize: Pick C1=(1,1) and C2=(5,7).
- Assignment (Iteration 1): Compute distance from points to C1 and C2.
- Cluster 1 (closest to C1): {A, B, C}
- Cluster 2 (closest to C2): {D, E, F}
- Update Centroids:
- New C1 = ((1+1.5+3)/3, (1+2+4)/3) = (1.83, 2.33)
- New C2 = ((5+3.5+4.5)/3, (7+5+5)/3) = (4.33, 5.67)
- Repeat Assignment (Iteration 2): Using new centroids.
- Cluster 1: {A, B}
- Cluster 2: {C, D, E, F}
- Final Centroids: New C1=(1.25, 1.5), New C2=(4.0, 5.25). Next assignment yields no changes. CONVERGED.
The Elbow Method
Used to determine the optimal k. Run K-Means for various k values and plot the Inertia (within-cluster sum of squares). The "elbow" point where inertia starts decreasing linearly is the optimal k.
K-Means++
Improves initialization. Instead of purely random centers, it picks the first center randomly, and selects subsequent centers with a probability proportional to their squared distance from existing centers.