K-Means & K-Means++ Clustering
Partition data into K groups by minimizing within-cluster variance — with smart initialization, validation, and the failure modes you must know.
1. The Objective: What K-Means Actually Optimizes
K-Means minimizes the Within-Cluster Sum of Squares (WCSS), also called inertia. Given data points x₁ … xₙ and K centroids μ₁ … μ_K, it solves:

WCSS = Σₖ₌₁ᴷ Σ_{x ∈ Cₖ} ‖x − μₖ‖²

where Cₖ is the set of points assigned to centroid μₖ.
This is NP-hard in general, so we use Lloyd's algorithm — a fast heuristic that converges to a local minimum. Different initializations give different local minima, which is exactly why K-Means++ matters.
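In code, the objective is just a double sum; a quick sanity check (dataset and seed are arbitrary) against scikit-learn's inertia_ attribute:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# WCSS: for each cluster, sum of squared distances of members to their centroid
wcss = sum(np.sum((X[km.labels_ == k] - mu) ** 2)
           for k, mu in enumerate(km.cluster_centers_))
print(wcss, km.inertia_)  # the two numbers agree (up to float rounding)
```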
2. Lloyd's Algorithm Step-by-Step
Lloyd's algorithm alternates two steps until nothing changes:
1. Initialize K centroids (randomly, or with K-Means++ — see Section 5).
2. Assignment step: assign every point to its nearest centroid by squared Euclidean distance.
3. Update step: move each centroid to the arithmetic mean of its assigned points.
4. Repeat steps 2–3 until the assignments stop changing (or a max-iteration / tolerance cap is hit).
Each iteration never increases WCSS, so the algorithm always converges, but only to a local minimum.
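A from-scratch NumPy sketch of the two alternating steps (illustrative only; it skips the empty-cluster handling a production implementation needs):

```python
import numpy as np

def lloyd(X, centroids, max_iter=300):
    """Plain Lloyd's algorithm. `centroids` is the (K, d) initialization."""
    for _ in range(max_iter):
        # Assignment step: index of nearest centroid by squared Euclidean distance
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid moves to the mean of its members
        # (note: a cluster that loses all its points would break this sketch)
        new = np.array([X[labels == k].mean(axis=0)
                        for k in range(len(centroids))])
        if np.allclose(new, centroids):  # converged: centroids stopped moving
            return new, labels
        centroids = new
    return centroids, labels
```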
3. Visualizing Three Well-Separated Clusters
On well-separated, roughly spherical clusters K-Means works beautifully. The challenge is figuring out K, dealing with bad initial centroids, and recognizing when the assumptions break down.
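A quick way to see this for yourself: synthetic blobs via make_blobs, fit, and plot (all parameters are arbitrary):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three compact, well-separated blobs: the ideal K-Means setting
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=0.7, random_state=7)
km = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=km.labels_, s=10)
plt.scatter(*km.cluster_centers_.T, marker="x", s=200, c="red")
plt.show()
```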
4. How Clusters Are Categorized in K-Means
Each data point is placed into exactly one cluster (a "hard" assignment) based on which centroid is closest by squared Euclidean distance. The geometry that results — the set of regions that map to each centroid — is called a Voronoi tessellation. The whole feature space is sliced into K convex polygonal regions, one per centroid, with straight-line boundaries.
4.1 The Categorization Rule

label(x) = argminₖ ‖x − μₖ‖²

That single rule is the entire "category" decision: compute the distance from a point to every centroid, pick the smallest. Ties (which are rare in practice on continuous data) are broken arbitrarily.
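The rule is exactly what scikit-learn's predict implements; a minimal check (dataset and seed are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

x = X[:1]                                          # one point, shape (1, d)
d2 = ((x - km.cluster_centers_) ** 2).sum(axis=1)  # distance to every centroid
assert d2.argmin() == km.predict(x)[0]             # pick the smallest: same label
```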
4.2 Voronoi Regions — the Shape of a K-Means Cluster
Because the boundary between any two clusters is the perpendicular bisector of the line joining their centroids, K-Means clusters are always convex and have straight (linear) decision boundaries. This is why K-Means fails on crescent-shaped, ring-shaped, or otherwise non-convex clusters.
4.3 Categories of Clusters You Will Encounter
Real datasets produce clusters of very different quality. Knowing the category helps you decide whether K-Means is the right tool or whether you should switch algorithms.
| Cluster category | What it looks like | K-Means behavior |
|---|---|---|
| Well-separated | Compact blobs, large gaps between groups | Excellent — converges to global optimum easily |
| Touching / overlapping | Blobs whose tails interpenetrate | Boundary points get mis-assigned; silhouette drops |
| Density-varying | One dense cluster, one sparse cluster of equal count | Centroid of sparse cluster drifts toward dense one |
| Unequal size (mass) | 1000 vs 50 points | Large cluster 'steals' the boundary; small one shrinks |
| Elongated / anisotropic | Cigar-shaped along one axis | Splits the cigar into halves perpendicular to long axis |
| Non-convex (rings, moons) | Curved manifolds | Fails — use DBSCAN or spectral clustering instead |
| Nested / hierarchical | Cluster within a cluster | Cannot represent; needs hierarchical or GMM |
| Noise / outliers present | A few extreme points | Outliers pull centroids; consider K-Medoids |
4.4 Hard vs Soft Categorization
K-Means uses hard assignment: every point belongs to one cluster with probability 1. Contrast this with soft / fuzzy methods (Fuzzy C-Means, Gaussian Mixtures) where each point has a membership weight for every cluster.
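The contrast in code: KMeans returns one label per point, while GaussianMixture returns a probability per cluster (the numbers in comments are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=2, cluster_std=2.0, random_state=3)

hard = KMeans(n_clusters=2, n_init=10, random_state=3).fit_predict(X)
soft = GaussianMixture(n_components=2, random_state=3).fit(X).predict_proba(X)
print(hard[0])   # e.g. 1:           all-or-nothing membership
print(soft[0])   # e.g. [0.83 0.17]: graded membership in every cluster
```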
4.5 Anatomy of a Single Cluster Cₖ
Every cluster K-Means produces is fully described by four quantities (all computed in the sketch after this list):
- Centroid μₖ — arithmetic mean of all member points; the cluster's "prototype".
- Members Cₖ — the set of points whose nearest centroid is μₖ.
- Within-cluster scatter — Σ ‖x − μₖ‖² over x ∈ Cₖ; how tight the cluster is.
- Radius / spread — max or mean distance from μₖ to its members; useful for outlier flags.
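All four fall out of a fitted model in a few lines (cluster index and seed are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=3, random_state=5)
km = KMeans(n_clusters=3, n_init=10, random_state=5).fit(X)

k = 0
members = X[km.labels_ == k]                    # C_k: the member set
centroid = km.cluster_centers_[k]               # mu_k (== members.mean(axis=0))
scatter = np.sum((members - centroid) ** 2)     # within-cluster scatter
dists = np.linalg.norm(members - centroid, axis=1)
radius_max, radius_mean = dists.max(), dists.mean()  # spread / outlier flagging
```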
4.6 End-to-End Categorization Pipeline
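A sketch of one plausible flow, consistent with the rest of this article: scale, fit with K-Means++, hard-assign, validate, then categorize new points with the frozen rule (names and parameters are illustrative):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

# 1. raw data  2. scale  3. fit  4. hard-assign  5. validate
X_raw, _ = make_blobs(n_samples=500, centers=4, random_state=0)
X = StandardScaler().fit_transform(X_raw)
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
labels = km.labels_                              # the category of every point
print("silhouette:", silhouette_score(X, labels))
print(km.predict(X[:3]))                         # new points reuse the frozen rule
```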
5. K-Means++: Smarter Initialization
Random initialization can produce terrible local minima: two centroids landing in the same true cluster split it in half, while two other true clusters are forced to share a single centroid. K-Means++ (Arthur & Vassilvitskii, 2007) fixes this by spreading initial centroids out probabilistically.
The K-Means++ Procedure
1. Pick the first centroid μ₁ uniformly at random from the data points.
2. For each point x, let D(x) be the distance to its nearest already-chosen centroid.
3. Pick the next centroid with probability proportional to D(x)². Far-away points are far more likely to be selected — pushing centroids apart.
4. Repeat steps 2–3 until all K centroids are chosen, then run Lloyd's algorithm as usual.

Random vs K-Means++ on the Same Data
| Init strategy | Best WCSS | Mean WCSS | Worst WCSS | Runs converging to optimum |
|---|---|---|---|---|
| Random | 62.1 | 118.4 | 284.7 | 31 / 50 |
| K-Means++ | 62.1 | 64.2 | 71.3 | 48 / 50 |
K-Means++ doesn't just have a better mean — it dramatically reduces variance, so you don't need as many restarts. In scikit-learn it is the default (init="k-means++"), and n_init=10 runs Lloyd's 10 times and keeps the best.
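For intuition, the seeding loop itself is only a few lines; a minimal sketch (scikit-learn's implementation adds refinements such as greedy re-sampling of candidates):

```python
import numpy as np

def kmeanspp_init(X, k, rng=None):
    """Minimal K-Means++ seeding (illustrative sketch)."""
    rng = np.random.default_rng(0) if rng is None else rng
    centroids = [X[rng.integers(len(X))]]        # step 1: uniform first pick
    while len(centroids) < k:
        # D(x)^2: squared distance to the nearest already-chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # step 3: sample proportional to D(x)^2 (far points favored)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```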
6. Choosing K — Three Complementary Methods
6.1 The Elbow Method (WCSS vs K)
Plot WCSS for K = 1, 2, 3, …. WCSS always decreases as K grows (more centroids fit data better), but the marginal gain shrinks past the true number of clusters. The "elbow" is where the curve flattens.
The elbow is sometimes ambiguous (smooth curves, no clear bend). When that happens, fall back to the silhouette score.
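Producing the elbow plot takes one loop (synthetic data with a known K = 4 so the bend is visible):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=2)
ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=2).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, "o-")
plt.xlabel("K")
plt.ylabel("WCSS (inertia)")
plt.show()   # the bend should sit at the true K = 4
```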
6.2 Silhouette Score
For each point i, define a(i) = mean distance to other points in its cluster, and b(i) = mean distance to points in the next-nearest cluster. The silhouette is:

s(i) = (b(i) − a(i)) / max(a(i), b(i))
- s ≈ 1: point is well inside its cluster, far from others (good)
- s ≈ 0: point sits on the boundary between two clusters
- s < 0: point is probably mis-assigned
The mean silhouette across all points is a single quality score. Pick K that maximizes it.
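silhouette_samples gives the per-point s(i), which also flags likely mis-assignments (s < 0); dataset and seed are arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.5, random_state=4)
labels = KMeans(n_clusters=3, n_init=10, random_state=4).fit_predict(X)

print("mean silhouette:", silhouette_score(X, labels))
s = silhouette_samples(X, labels)                # per-point s(i)
print("points with s < 0:", int(np.sum(s < 0)))  # likely mis-assignments
```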
6.3 Gap Statistic & Davies–Bouldin
- Gap statistic (Tibshirani et al.): compares your log-WCSS to that expected under a uniform null reference distribution. Pick the smallest K such that Gap(K) ≥ Gap(K+1) − s(K+1), where s is the standard error of the reference runs.
- Davies–Bouldin index: average ratio of within-cluster scatter to between-cluster separation. Lower is better.
- Calinski–Harabasz index (variance ratio criterion): higher is better. Fast and well-correlated with silhouette.
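The gap statistic is not built into scikit-learn; the sketch below (the gap_statistic helper is ours, simplified to a uniform bounding-box null) returns Gap(K) and its standard error:

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, seed=0):
    """Gap(k) = E[log W_ref] - log W_data with a uniform bounding-box null."""
    rng = np.random.default_rng(seed)
    log_w = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_log_w = np.array([
        np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
               .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
        for _ in range(n_refs)
    ])
    gap = ref_log_w.mean() - log_w
    se = ref_log_w.std() * np.sqrt(1 + 1 / n_refs)  # standard-error term from the paper
    return gap, se

# Decision rule: pick the smallest K with gap(K) >= gap(K+1) - se(K+1)
```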
7. Worked Numerical Example (K = 2, 1D Data)
Points: {1, 2, 3, 10, 11, 12}. Initial centroids picked badly: μ₁ = 2, μ₂ = 3.
| Iter | μ₁ | μ₂ | Cluster 1 | Cluster 2 | WCSS |
|---|---|---|---|---|---|
| 0 | 2.0 | 3.0 | {1, 2} | {3, 10, 11, 12} | 50.5 |
| 1 | 1.5 | 9.0 | {1, 2, 3} | {10, 11, 12} | 4.0 |
| 2 | 2.0 | 11.0 | {1, 2, 3} | {10, 11, 12} | 4.0 |
| 3 | 2.0 | 11.0 | {1, 2, 3} | {10, 11, 12} | 4.0 ✓ converged |
Notice WCSS dropped from 50.5 → 4.0 within a single update, and the assignments never changed again. This tiny 1D example is forgiving: even a bad seed escapes to the optimum. But in higher dimensions with more clusters, a poor initialization (say, two centroids seeded inside the same true cluster) routinely converges to a much worse local minimum. That's the local-minimum trap K-Means++ helps avoid.
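You can reproduce the table by feeding the bad seed directly to scikit-learn (init accepts an explicit array; n_init=1 keeps just that single run):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1], [2], [3], [10], [11], [12]], dtype=float)
bad_seed = np.array([[2.0], [3.0]])              # the table's initial centroids

km = KMeans(n_clusters=2, init=bad_seed, n_init=1).fit(X)
print(km.cluster_centers_.ravel())               # [ 2. 11.]
print(km.inertia_)                               # 4.0
```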
8. Limitations & Failure Modes
- K must be specified in advance. Use elbow / silhouette / gap, or switch to DBSCAN / HDBSCAN / Mean Shift which discover K automatically.
- Assumes spherical, equal-variance clusters. Elongated, curved, or density-varying clusters break it. Try Gaussian Mixture Models for ellipsoidal clusters or spectral clustering for manifold structure.
- Sensitive to feature scaling. A feature with range [0, 1000] will dominate one with range [0, 1]. Always apply StandardScaler first.
- Sensitive to outliers. One extreme point pulls a centroid badly because of squared distance. Use K-Medoids (PAM) with L1 distance for robustness.
- Curse of dimensionality. Euclidean distance becomes uninformative in very high dimensions — reduce with PCA or use cosine-based clustering on text/embeddings.
- Empty clusters can occur if no points are assigned to a centroid. sklearn handles this by reseeding the empty centroid to the farthest point.
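A quick demonstration of the scaling point from the list above: blow one feature up by 1000x and compare agreement with the true labels, where scaling typically recovers the structure (data and seeds are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=400, centers=3, random_state=6)
X_bad = X * np.array([1000.0, 1.0])              # feature 0 now dominates distances

raw = KMeans(n_clusters=3, n_init=10, random_state=6).fit_predict(X_bad)
scl = KMeans(n_clusters=3, n_init=10, random_state=6).fit_predict(
    StandardScaler().fit_transform(X_bad))
# Agreement with the true labels; scaling usually scores higher here
print(adjusted_rand_score(y, raw), adjusted_rand_score(y, scl))
```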
9. Variants Worth Knowing
| Variant | What changes | When to use |
|---|---|---|
| MiniBatch K-Means | Updates centroids on small random batches | Massive datasets (millions of points), streaming |
| K-Medoids (PAM) | Centers must be actual data points; uses L1 | Robust to outliers; works with arbitrary distances |
| Bisecting K-Means | Recursive 2-means splits | Hierarchical structure; often better local optima |
| Fuzzy C-Means | Soft assignments (membership ∈ [0,1]) | Overlapping clusters, gene expression |
| Gaussian Mixture (EM) | Probabilistic, ellipsoidal clusters | Non-spherical, soft cluster boundaries |
| K-Means|| (parallel) | Scalable K-Means++ for distributed data | Spark / very large data |
10. Production-Ready Python
```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
X, y_true = make_blobs(n_samples=1500, centers=4, cluster_std=0.8, random_state=42)
# 1. ALWAYS scale first
X = StandardScaler().fit_transform(X)
# 2. Search for K with multiple criteria
results = []
for k in range(2, 11):
km = KMeans(
n_clusters=k,
init="k-means++", # smart init (default)
n_init=10, # 10 random restarts, keep best
max_iter=300,
tol=1e-4,
random_state=42,
)
labels = km.fit_predict(X)
results.append({
"k": k,
"wcss": km.inertia_,
"silhouette": silhouette_score(X, labels),
"davies_bouldin": davies_bouldin_score(X, labels), # lower is better
})
# 3. Pick K maximizing silhouette
best_k = max(results, key=lambda r: r["silhouette"])["k"]
print(f"Best K = {best_k}")
# 4. Final fit
final = KMeans(n_clusters=best_k, init="k-means++", n_init=20, random_state=42).fit(X)
# 5. For huge data: MiniBatchKMeans is 10–100x faster
mbk = MiniBatchKMeans(n_clusters=best_k, batch_size=1024, n_init=10, random_state=42)
mbk.fit(X)
```

11. Decision Guide
- Up to ~1M points → KMeans with init="k-means++" and n_init=10.
- > 1M → MiniBatchKMeans or K-Means|| on Spark.