ML Algorithms
Unsupervised Learning

K-Means & K-Means++ Clustering

Partition data into K groups by minimizing within-cluster variance — with smart initialization, validation, and the failure modes you must know.

1. The Objective: What K-Means Actually Optimizes

K-Means minimizes the Within-Cluster Sum of Squares (WCSS), also called inertia. Given data points x₁ … xₙ and K centroids μ₁ … μ_K, it solves:

J = Σₖ Σ_(x ∈ Cₖ) ‖x − μₖ‖²

This is NP-hard in general, so we use Lloyd's algorithm — a fast heuristic that converges to a local minimum. Different initializations give different local minima, which is exactly why K-Means++ matters.
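As a sanity check, the objective is easy to compute directly: sum the squared distances from every point to its assigned centroid. A minimal NumPy sketch (array values here are illustrative, chosen to match the worked example in §7):

import numpy as np

def wcss(X, labels, centroids):
    # Within-cluster sum of squares (inertia) for a given assignment
    return sum(np.sum((X[labels == k] - mu) ** 2) for k, mu in enumerate(centroids))

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
centroids = np.array([[2.0], [11.0]])
print(wcss(X, labels, centroids))  # 4.0, the same value scikit-learn exposes as inertia_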

2. Lloyd's Algorithm Step-by-Step

1
Initialize K centroids
Pick K starting centers — randomly, or smarter via K-Means++ (see §4).
2
Assignment step (E-step)
Assign every point to its nearest centroid by squared Euclidean distance:
Cₖ = { x : ‖x − μₖ‖² ≤ ‖x − μⱼ‖² ∀ j }
3
Update step (M-step)
Recompute each centroid as the mean of points assigned to it:
μₖ ← (1 / |Cₖ|) · Σ_(x ∈ Cₖ) x
4
Repeat until convergence
Stop when assignments don't change, centroids move less than a tolerance, or after a max number of iterations. WCSS is guaranteed to decrease monotonically.
Why squared Euclidean?
The arithmetic mean is the point that minimizes the sum of squared Euclidean distances to a set of points. That's exactly why the M-step uses the mean — and why K-Means is brittle on non-Euclidean data (use K-Medoids or spectral clustering instead).
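Both steps map directly onto a few lines of NumPy. The sketch below is a bare-bones Lloyd's loop under simplifying assumptions (no empty-cluster handling, fixed iteration cap); the function and variable names are illustrative, not a library API.

import numpy as np

def lloyd(X, init_centroids, max_iter=100, tol=1e-4):
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])
        # Stop when centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids

Seeding init_centroids with K distinct rows of X (or with the K-Means++ sketch in §5) reproduces the behavior described above.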

3. Visualizing Three Well-Separated Clusters

Synthetic data with 3 natural blobs (true K = 3)
[Scatter plot: feature 1 (x-axis) vs feature 2 (y-axis), showing three well-separated blobs.]

On well-separated, roughly spherical clusters K-Means works beautifully. The challenge is figuring out K, dealing with bad initial centroids, and recognizing when the assumptions break down.
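A plot like the one above can be regenerated in a few lines; the blob centers, spread, and seed below are assumptions, not the original figure's parameters.

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)
plt.scatter(X[:, 0], X[:, 1], c=y, s=15)
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.title("Three well-separated blobs (true K = 3)")
plt.show()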

4. How Clusters Are Categorized in K-Means

Each data point is placed into exactly one cluster (a "hard" assignment) based on which centroid is closest by squared Euclidean distance. The geometry that results — the set of regions that map to each centroid — is called a Voronoi tessellation. The whole feature space is sliced into K convex polygonal regions, one per centroid, with straight-line boundaries.

4.1 The Categorization Rule

label(x) = argminₖ ‖x − μₖ‖²

That single rule is the entire "category" decision: compute the distance from a point to every centroid, pick the smallest. Ties (which are rare in practice on continuous data) are broken arbitrarily.
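In code the rule is a single argmin over squared distances, and it is exactly what a fitted estimator's predict method applies. A quick equivalence check with scikit-learn (dataset and parameters are illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Manual categorization: squared distance from every point to every centroid, take the argmin
d2 = ((X[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
manual_labels = d2.argmin(axis=1)

print(np.array_equal(manual_labels, km.predict(X)))  # True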

4.2 Voronoi Regions — the Shape of a K-Means Cluster

Because the boundary between any two clusters is the perpendicular bisector of the line joining their centroids, K-Means clusters are always convex and have straight (linear) decision boundaries. This is why K-Means fails on crescent-shaped, ring-shaped, or otherwise non-convex clusters.

4.3 Categories of Clusters You Will Encounter

Real datasets produce clusters of very different quality. Knowing the category helps you decide whether K-Means is the right tool or whether you should switch algorithm.

Cluster category | What it looks like | K-Means behavior
Well-separated | Compact blobs, large gaps between groups | Excellent — converges to global optimum easily
Touching / overlapping | Blobs whose tails interpenetrate | Boundary points get mis-assigned; silhouette drops
Density-varying | One dense cluster, one sparse cluster of equal count | Centroid of sparse cluster drifts toward dense one
Unequal size (mass) | 1000 vs 50 points | Large cluster 'steals' the boundary; small one shrinks
Elongated / anisotropic | Cigar-shaped along one axis | Splits the cigar into halves perpendicular to long axis
Non-convex (rings, moons) | Curved manifolds | Fails — use DBSCAN or spectral clustering instead (see the sketch after this table)
Nested / hierarchical | Cluster within a cluster | Cannot represent; needs hierarchical or GMM
Noise / outliers present | A few extreme points | Outliers pull centroids; consider K-Medoids
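The non-convex row is easy to demonstrate: on two interleaved half-moons, K-Means with K = 2 draws a straight boundary that cuts across both arcs. A minimal sketch (the exact score varies with the noise level, but agreement with the true labels is typically poor):

from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(adjusted_rand_score(y_true, labels))  # well below 1.0: the linear boundary splits both moons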

4.4 Hard vs Soft Categorization

K-Means uses hard assignment: every point belongs to one cluster with probability 1. Contrast this with soft / fuzzy methods (Fuzzy C-Means, Gaussian Mixtures) where each point has a membership weight for every cluster.
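The contrast shows up directly in the two APIs: KMeans.predict returns one integer per point, while GaussianMixture.predict_proba returns a membership weight for every cluster. A small sketch (the example outputs in the comments are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

hard = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).predict(X[:1])
soft = GaussianMixture(n_components=3, random_state=0).fit(X).predict_proba(X[:1])

print(hard)  # e.g. [2]: a single cluster index
print(soft)  # e.g. [[0.01 0.98 0.01]]: one weight per cluster, summing to 1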

4.5 Anatomy of a Single Cluster Cₖ

Every cluster K-Means produces is fully described by four quantities:

  • Centroid μₖ — arithmetic mean of all member points; the cluster's "prototype".
  • Members Cₖ — the set of points whose nearest centroid is μₖ.
  • Within-cluster scatter — Σ ‖x − μₖ‖² over x ∈ Cₖ; how tight the cluster is.
  • Radius / spread — max or mean distance from μₖ to its members; useful for outlier flags.
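Given a fitted model, all four quantities fall out of the labels and centers in a few lines. The helper below is an illustrative sketch, not a library function:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

def describe_cluster(X, km, k):
    members = X[km.labels_ == k]
    centroid = km.cluster_centers_[k]
    dists = np.linalg.norm(members - centroid, axis=1)
    return {
        "centroid": centroid,                  # the cluster's prototype
        "size": len(members),                  # |C_k|
        "scatter": float((dists ** 2).sum()),  # within-cluster sum of squares
        "radius": float(dists.max()),          # max distance from centroid to a member
    }

print(describe_cluster(X, km, 0))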

4.6 End-to-End Categorization Pipeline
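In practice the categorization pipeline is: scale the features, fit K-Means, then categorize new points with predict, which applies the argmin rule from §4.1. A minimal sketch (dataset and parameters are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

pipe = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=0))
pipe.fit(X)

new_points = X[:5]               # pretend these arrived later
print(pipe.predict(new_points))  # the cluster index ("category") for each new point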

Key intuition
A K-Means "category" is just "which centroid am I closest to?". Everything else — Voronoi cells, convex boundaries, the limitations on ring or moon shapes — follows mechanically from that one rule.

5. K-Means++: Smarter Initialization

Random initialization can produce terrible local minima — two centroids landing in the same true cluster, leaving another cluster split in half. K-Means++ (Arthur & Vassilvitskii, 2007) fixes this by spreading initial centroids out probabilistically.

The K-Means++ Procedure

1
Pick first centroid uniformly at random
Choose μ₁ uniformly from the data points.
2
Compute D(x) for every point
For each remaining point, let D(x) be the distance to its nearest already-chosen centroid.
3
Sample next centroid weighted by D(x)²
Pick the next centroid with probability proportional to D(x)². Far-away points are far more likely to be selected — pushing centroids apart.
P(x) = D(x)² / Σⱼ D(xⱼ)²
4
Repeat until you have K centroids
Then run standard Lloyd's iterations from this seeded start.
Theoretical guarantee
K-Means++ is provably O(log K)-competitive in expectation with the optimal clustering. Pure random init has no such guarantee — it can be arbitrarily bad.
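The seeding step is short enough to write out directly. A minimal NumPy sketch (the function name is illustrative); the seeds it returns would then start a standard Lloyd's run:

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: first centroid uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: D(x)^2 = squared distance to the nearest already-chosen centroid
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Step 3: sample the next centroid with probability proportional to D(x)^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

In scikit-learn the returned array can be used directly via KMeans(n_clusters=k, init=seeds, n_init=1).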

Random vs K-Means++ on the Same Data

Final WCSS across 50 runs (lower is better) on the 3-blob dataset above
Init strategy | Best WCSS | Mean WCSS | Worst WCSS | Runs converging to optimum
Random | 62.1 | 118.4 | 284.7 | 31 / 50
K-Means++ | 62.1 | 64.2 | 71.3 | 48 / 50

K-Means++ doesn't just have a better mean — it dramatically reduces variance, so you don't need as many restarts. In scikit-learn it is the default (init="k-means++"), and n_init=10 runs Lloyd's 10 times and keeps the best.
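The table's methodology is easy to rerun on your own data (the numbers above depend on the dataset and seeds): fit many single-init models with init="random" and with init="k-means++", then compare the spread of final inertia.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.8, random_state=42)

for strategy in ("random", "k-means++"):
    wcss = [
        KMeans(n_clusters=3, init=strategy, n_init=1, random_state=seed).fit(X).inertia_
        for seed in range(50)
    ]
    print(f"{strategy:>10}: best={min(wcss):.1f}  mean={np.mean(wcss):.1f}  worst={max(wcss):.1f}")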

6. Choosing K — Three Complementary Methods

6.1 The Elbow Method (WCSS vs K)

Plot WCSS for K = 1, 2, 3, …. WCSS always decreases as K grows (more centroids fit data better), but the marginal gain shrinks past the true number of clusters. The "elbow" is where the curve flattens.

Elbow plot — clear bend at K = 3
[Line plot: WCSS (inertia) on the y-axis vs number of clusters K on the x-axis; the curve flattens after K = 3.]

The elbow is sometimes ambiguous (smooth curves, no clear bend). When that happens, fall back to the silhouette score.

6.2 Silhouette Score

For each point i, define a(i) = mean distance to the other points in its own cluster, and b(i) = mean distance to the points in the nearest other cluster (the best alternative assignment). The silhouette is:

s(i) = (b(i) − a(i)) / max(a(i), b(i)) ∈ [−1, 1]
  • s ≈ 1: point is well inside its cluster, far from others (good)
  • s ≈ 0: point sits on the boundary between two clusters
  • s < 0: point is probably mis-assigned

The mean silhouette across all points is a single quality score. Pick K that maximizes it.
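Beyond the mean score, the per-point values s(i) help flag boundary and mis-assigned points. A short sketch using scikit-learn's silhouette_samples (dataset and parameters are illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.8, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)                      # one s(i) per point
print("mean silhouette:", silhouette_score(X, labels))
print("points with s(i) < 0 (likely mis-assigned):", int(np.sum(s < 0)))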

Silhouette score peaks at K = 3
[Line plot: mean silhouette on the y-axis vs K on the x-axis, peaking at K = 3.]

6.3 Gap Statistic & Davies–Bouldin

  • Gap statistic (Tibshirani et al.): compares log(WCSS) on your data to its expectation under a uniform null reference distribution (sketched below). Pick the smallest K such that Gap(K) ≥ Gap(K+1) − s(K+1), i.e., the gap at K is within one standard error of the gap at K+1.
  • Davies–Bouldin index: average ratio of within-cluster scatter to between-cluster separation. Lower is better.
  • Calinski–Harabasz index (variance ratio criterion): higher is better. Fast and well-correlated with silhouette.
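The gap statistic is not built into scikit-learn, but it can be sketched in a few lines: compare log(WCSS) on your data with its average over reference datasets drawn uniformly inside the data's bounding box. A minimal, unoptimized sketch under those assumptions (n_refs reference draws; the function name is illustrative):

import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, ks, n_refs=10, random_state=0):
    rng = np.random.default_rng(random_state)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    gaps, errs = [], []
    for k in ks:
        log_w = np.log(KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X).inertia_)
        ref_log_w = np.array([
            np.log(KMeans(n_clusters=k, n_init=10, random_state=random_state)
                   .fit(rng.uniform(mins, maxs, size=X.shape)).inertia_)
            for _ in range(n_refs)
        ])
        gaps.append(ref_log_w.mean() - log_w)                         # Gap(K)
        errs.append(ref_log_w.std(ddof=1) * np.sqrt(1 + 1 / n_refs))  # s(K)
    return np.array(gaps), np.array(errs)

# Selection rule: smallest K with gaps[K] >= gaps[K+1] - errs[K+1]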

7. Worked Numerical Example (K = 2, 1D Data)

Points: {1, 2, 3, 10, 11, 12}. Initial centroids picked badly: μ₁ = 2, μ₂ = 3.

Lloyd's iterations — even from a poor init, K-Means converges quickly here
Iter | μ₁ | μ₂ | Cluster 1 | Cluster 2 | WCSS (after centroid update)
0 | 2.0 | 3.0 | {1, 2} | {3, 10, 11, 12} | 50.5
1 | 1.5 | 9.0 | {1, 2, 3} | {10, 11, 12} | 4.0
2 | 2.0 | 11.0 | {1, 2, 3} | {10, 11, 12} | 4.0
3 | 2.0 | 11.0 | {1, 2, 3} | {10, 11, 12} | 4.0 ✓ converged

Notice WCSS dropped from 50.5 to 4.0 after a single further iteration. On this tiny 1D example almost any initialization recovers the right split, but on realistic data a bad init (say, two centroids seeded inside the same true cluster) routinely leaves Lloyd's stuck in a much worse local minimum. That's the local-minimum trap K-Means++ helps avoid.
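You can replay the example with scikit-learn by passing the poor starting centroids explicitly (init accepts an array of centers; n_init=1 keeps that single start):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
km = KMeans(n_clusters=2, init=np.array([[2.0], [3.0]]), n_init=1).fit(X)

print(km.cluster_centers_.ravel())  # [ 2. 11.]
print(km.labels_)                   # [0 0 0 1 1 1]
print(km.inertia_)                  # 4.0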

8. Limitations & Failure Modes

  • K must be specified in advance. Use elbow / silhouette / gap, or switch to DBSCAN / HDBSCAN / Mean Shift which discover K automatically.
  • Assumes spherical, equal-variance clusters. Elongated, curved, or density-varying clusters break it. Try Gaussian Mixture Models for ellipsoidal clusters or spectral clustering for manifold structure.
  • Sensitive to feature scaling. A feature with range [0, 1000] will dominate one with range [0, 1]. Always apply StandardScaler first (see the sketch after this list).
  • Sensitive to outliers. One extreme point pulls a centroid badly because of squared distance. Use K-Medoids (PAM) with L1 distance for robustness.
  • Curse of dimensionality. Euclidean distance becomes uninformative in very high dimensions — reduce with PCA or use cosine-based clustering on text/embeddings.
  • Empty clusters can occur if no points are assigned to a centroid. sklearn handles this by reseeding the empty centroid to the farthest point.
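The scaling point is worth seeing once: add a noisy feature with a huge range and K-Means follows it instead of the real structure. A small illustration with synthetic data (on data like this the unscaled run typically scores near 0 agreement with the true split, the scaled run near 1):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
y_true = np.repeat([0, 1], 200)
informative = np.where(y_true == 0, 0.0, 10.0) + rng.normal(0, 0.1, 400)  # range ~[0, 10]
noise = rng.uniform(0, 1000, 400)                                         # range [0, 1000]
X = np.column_stack([informative, noise])

for name, data in [("raw", X), ("scaled", StandardScaler().fit_transform(X))]:
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    print(name, adjusted_rand_score(y_true, labels))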

9. Variants Worth Knowing

Variant | What changes | When to use
MiniBatch K-Means | Updates centroids on small random batches | Massive datasets (millions of points), streaming
K-Medoids (PAM) | Centers must be actual data points; uses L1 | Robust to outliers; works with arbitrary distances
Bisecting K-Means | Recursive 2-means splits | Hierarchical structure; often better local optima
Fuzzy C-Means | Soft assignments (membership ∈ [0, 1]) | Overlapping clusters, gene expression
Gaussian Mixture (EM) | Probabilistic, ellipsoidal clusters | Non-spherical, soft cluster boundaries
K-Means|| (parallel) | Scalable K-Means++ for distributed data | Spark / very large data

10. Production-Ready Python

import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, y_true = make_blobs(n_samples=1500, centers=4, cluster_std=0.8, random_state=42)

# 1. ALWAYS scale first
X = StandardScaler().fit_transform(X)

# 2. Search for K with multiple criteria
results = []
for k in range(2, 11):
    km = KMeans(
        n_clusters=k,
        init="k-means++",   # smart init (default)
        n_init=10,          # 10 random restarts, keep best
        max_iter=300,
        tol=1e-4,
        random_state=42,
    )
    labels = km.fit_predict(X)
    results.append({
        "k": k,
        "wcss": km.inertia_,
        "silhouette": silhouette_score(X, labels),
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
    })

# 3. Pick K maximizing silhouette
best_k = max(results, key=lambda r: r["silhouette"])["k"]
print(f"Best K = {best_k}")

# 4. Final fit
final = KMeans(n_clusters=best_k, init="k-means++", n_init=20, random_state=42).fit(X)

# 5. For huge data: MiniBatchKMeans is 10–100x faster
mbk = MiniBatchKMeans(n_clusters=best_k, batch_size=1024, n_init=10, random_state=42)
mbk.fit(X)

11. Decision Guide

1
Is your data roughly spherical & similar-sized?
Yes → K-Means++ is ideal. No → consider GMM, DBSCAN, or HDBSCAN.
2
Do you know K?
No → run K = 2…10, plot elbow + silhouette + Davies–Bouldin together; agreement across metrics is a strong signal.
3
How big is your data?
< 100k points → standard K-Means with n_init=10. > 1M → MiniBatchKMeans or K-Means|| on Spark.
4
Many outliers or non-Euclidean distances?
Switch to K-Medoids, or robustify with outlier removal (e.g., IsolationForest) before clustering.
The golden rule
K-Means is a tool, not a verdict. Always validate clusters with domain knowledge — cluster labels are meaningless until a human (or downstream task) gives them meaning.