Machine Learning

Unsupervised Learning / K-Means Clustering


2. K-Means Clustering

K-Means is the most widely used partitioning clustering algorithm. It partitions n data points into k clusters by minimizing the total within-cluster sum of squared distances (inertia).

Algorithm Steps

  1. INITIALIZATION: Randomly select k data points as initial centroids: μ₁, μ₂, ..., μₖ
  2. ASSIGNMENT STEP: Assign each data point xᵢ to the nearest centroid.
    Cⱼ = { xᵢ : ||xᵢ - μⱼ||² ≤ ||xᵢ - μₗ||² ∀ l ≠ j }
  3. UPDATE STEP: Recompute each centroid as the mean of all points in the cluster.
    μⱼ = (1 / |Cⱼ|) × Σ(xᵢ) for xᵢ ∈ Cⱼ
  4. CONVERGENCE CHECK: If centroids do not change (or change is below a threshold), STOP. Otherwise, repeat from Step 2.

Worked Example (k=2)

Points: A(1,1), B(1.5,2), C(3,4), D(5,7), E(3.5,5), F(4.5,5)

  1. Initialize: Pick C1=(1,1) and C2=(5,7).
  2. Assignment (Iteration 1): Compute distance from points to C1 and C2.
    • Cluster 1 (closest to C1): {A, B, C}
    • Cluster 2 (closest to C2): {D, E, F}
  3. Update Centroids:
    • New C1 = ((1+1.5+3)/3, (1+2+4)/3) = (1.83, 2.33)
    • New C2 = ((5+3.5+4.5)/3, (7+5+5)/3) = (4.33, 5.67)
  4. Repeat Assignment (Iteration 2): Using new centroids.
    • Cluster 1: {A, B}
    • Cluster 2: {C, D, E, F}
  5. Final Centroids: New C1=(1.25, 1.5), New C2=(4.0, 5.25). Next assignment yields no changes. CONVERGED.
The Elbow Method

Used to determine the optimal k. Run K-Means for various k values and plot the Inertia (within-cluster sum of squares). The "elbow" point where inertia starts decreasing linearly is the optimal k.

K-Means++

Improves initialization. Instead of purely random centers, it picks the first center randomly, and selects subsequent centers with a probability proportional to their squared distance from existing centers.