K-Means & K-Means++ Clustering
Partition data into K groups by minimizing within-cluster variance — with smart initialization, validation, and the failure modes you must know.
1. The Objective: What K-Means Actually Optimizes
K-Means minimizes the Within-Cluster Sum of Squares (WCSS), also called inertia. Given data points x₁ … xₙ and K centroids μ₁ … μ_K, it solves:

WCSS = Σₖ₌₁ᴷ Σ_{x ∈ Cₖ} ‖x − μₖ‖²

where Cₖ is the set of points assigned to centroid μₖ.
This is NP-hard in general, so we use Lloyd's algorithm — a fast heuristic that converges to a local minimum. Different initializations give different local minima, which is exactly why K-Means++ matters.
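In code, the objective is just a double sum; a quick sanity check (dataset and seed are arbitrary) against scikit-learn's inertia_ attribute:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# WCSS: for each cluster, sum of squared distances of members to their centroid
wcss = sum(np.sum((X[km.labels_ == k] - mu) ** 2)
           for k, mu in enumerate(km.cluster_centers_))
print(wcss, km.inertia_)  # the two numbers agree (up to float rounding)
```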
2. Lloyd's Algorithm Step-by-Step
Lloyd's algorithm alternates two steps until nothing changes:
1. Initialize K centroids (randomly, or with K-Means++ — see Section 5).
2. Assignment step: assign every point to its nearest centroid by squared Euclidean distance.
3. Update step: move each centroid to the arithmetic mean of its assigned points.
4. Repeat steps 2–3 until the assignments stop changing (or a max-iteration / tolerance cap is hit).
Each iteration never increases WCSS, so the algorithm always converges, but only to a local minimum.
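A from-scratch NumPy sketch of the two alternating steps (illustrative only; it skips the empty-cluster handling a production implementation needs):

```python
import numpy as np

def lloyd(X, centroids, max_iter=300):
    """Plain Lloyd's algorithm. `centroids` is the (K, d) initialization."""
    for _ in range(max_iter):
        # Assignment step: index of nearest centroid by squared Euclidean distance
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid moves to the mean of its members
        # (note: a cluster that loses all its points would break this sketch)
        new = np.array([X[labels == k].mean(axis=0)
                        for k in range(len(centroids))])
        if np.allclose(new, centroids):  # converged: centroids stopped moving
            return new, labels
        centroids = new
    return centroids, labels
```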
3. Visualizing Three Well-Separated Clusters
On well-separated, roughly spherical clusters K-Means works beautifully. The challenge is figuring out K, dealing with bad initial centroids, and recognizing when the assumptions break down.
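A quick way to see this for yourself: synthetic blobs via make_blobs, fit, and plot (all parameters are arbitrary):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three compact, well-separated blobs: the ideal K-Means setting
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=0.7, random_state=7)
km = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=km.labels_, s=10)
plt.scatter(*km.cluster_centers_.T, marker="x", s=200, c="red")
plt.show()
```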
4. How Clusters Are Categorized in K-Means
Each data point is placed into exactly one cluster (a "hard" assignment) based on which centroid is closest by squared Euclidean distance. The geometry that results — the set of regions that map to each centroid — is called a Voronoi tessellation. The whole feature space is sliced into K convex polygonal regions, one per centroid, with straight-line boundaries.
4.1 The Categorization Rule

label(x) = argminₖ ‖x − μₖ‖²

That single rule is the entire "category" decision: compute the distance from a point to every centroid, pick the smallest. Ties (which are rare in practice on continuous data) are broken arbitrarily.
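The rule is exactly what scikit-learn's predict implements; a minimal check (dataset and seed are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

x = X[:1]                                          # one point, shape (1, d)
d2 = ((x - km.cluster_centers_) ** 2).sum(axis=1)  # distance to every centroid
assert d2.argmin() == km.predict(x)[0]             # pick the smallest: same label
```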
4.2 Voronoi Regions — the Shape of a K-Means Cluster
Because the boundary between any two clusters is the perpendicular bisector of the line joining their centroids, K-Means clusters are always convex and have straight (linear) decision boundaries. This is why K-Means fails on crescent-shaped, ring-shaped, or otherwise non-convex clusters.
4.3 Categories of Clusters You Will Encounter
Real datasets produce clusters of very different quality. Knowing the category helps you decide whether K-Means is the right tool or whether you should switch algorithms.
| Cluster category | What it looks like | K-Means behavior |
|---|---|---|
| Well-separated | Compact blobs, large gaps between groups | Excellent — converges to global optimum easily |
| Touching / overlapping | Blobs whose tails interpenetrate | Boundary points get mis-assigned; silhouette drops |
| Density-varying | One dense cluster, one sparse cluster of equal count | Centroid of sparse cluster drifts toward dense one |
| Unequal size (mass) | 1000 vs 50 points | Large cluster 'steals' the boundary; small one shrinks |
| Elongated / anisotropic | Cigar-shaped along one axis | Splits the cigar into halves perpendicular to long axis |
| Non-convex (rings, moons) | Curved manifolds | Fails — use DBSCAN or spectral clustering instead |
| Nested / hierarchical | Cluster within a cluster | Cannot represent; needs hierarchical or GMM |
| Noise / outliers present | A few extreme points | Outliers pull centroids; consider K-Medoids |
4.4 Hard vs Soft Categorization
K-Means uses hard assignment: every point belongs to one cluster with probability 1. Contrast this with soft / fuzzy methods (Fuzzy C-Means, Gaussian Mixtures) where each point has a membership weight for every cluster.
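The contrast in code: KMeans returns one label per point, while GaussianMixture returns a probability per cluster (the numbers in comments are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=2, cluster_std=2.0, random_state=3)

hard = KMeans(n_clusters=2, n_init=10, random_state=3).fit_predict(X)
soft = GaussianMixture(n_components=2, random_state=3).fit(X).predict_proba(X)
print(hard[0])   # e.g. 1:           all-or-nothing membership
print(soft[0])   # e.g. [0.83 0.17]: graded membership in every cluster
```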
4.5 Anatomy of a Single Cluster Cₖ
Every cluster K-Means produces is fully described by four quantities (all computed in the sketch after this list):
- Centroid μₖ — arithmetic mean of all member points; the cluster's "prototype".
- Members Cₖ — the set of points whose nearest centroid is μₖ.
- Within-cluster scatter — Σ ‖x − μₖ‖² over x ∈ Cₖ; how tight the cluster is.
- Radius / spread — max or mean distance from μₖ to its members; useful for outlier flags.
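All four fall out of a fitted model in a few lines (cluster index and seed are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=3, random_state=5)
km = KMeans(n_clusters=3, n_init=10, random_state=5).fit(X)

k = 0
members = X[km.labels_ == k]                    # C_k: the member set
centroid = km.cluster_centers_[k]               # mu_k (== members.mean(axis=0))
scatter = np.sum((members - centroid) ** 2)     # within-cluster scatter
dists = np.linalg.norm(members - centroid, axis=1)
radius_max, radius_mean = dists.max(), dists.mean()  # spread / outlier flagging
```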
4.6 End-to-End Categorization Pipeline
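A sketch of one plausible flow, consistent with the rest of this article: scale, fit with K-Means++, hard-assign, validate, then categorize new points with the frozen rule (names and parameters are illustrative):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

# 1. raw data  2. scale  3. fit  4. hard-assign  5. validate
X_raw, _ = make_blobs(n_samples=500, centers=4, random_state=0)
X = StandardScaler().fit_transform(X_raw)
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
labels = km.labels_                              # the category of every point
print("silhouette:", silhouette_score(X, labels))
print(km.predict(X[:3]))                         # new points reuse the frozen rule
```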
5. K-Means++: Smarter Initialization
Random initialization can produce terrible local minima: two centroids landing in the same true cluster split it in half, while two other true clusters are forced to share a single centroid. K-Means++ (Arthur & Vassilvitskii, 2007) fixes this by spreading initial centroids out probabilistically.
The K-Means++ Procedure
1. Pick the first centroid μ₁ uniformly at random from the data points.
2. For each point x, let D(x) be the distance to its nearest already-chosen centroid.
3. Pick the next centroid with probability proportional to D(x)². Far-away points are far more likely to be selected — pushing centroids apart.
4. Repeat steps 2–3 until all K centroids are chosen, then run Lloyd's algorithm as usual.

Random vs K-Means++ on the Same Data
| Init strategy | Best WCSS | Mean WCSS | Worst WCSS | Runs converging to optimum |
|---|---|---|---|---|
| Random | 62.1 | 118.4 | 284.7 | 31 / 50 |
| K-Means++ | 62.1 | 64.2 | 71.3 | 48 / 50 |
K-Means++ doesn't just have a better mean — it dramatically reduces variance, so you don't need as many restarts. In scikit-learn it is the default (init="k-means++"), and n_init=10 runs Lloyd's 10 times and keeps the best.
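For intuition, the seeding loop itself is only a few lines; a minimal sketch (scikit-learn's implementation adds refinements such as greedy re-sampling of candidates):

```python
import numpy as np

def kmeanspp_init(X, k, rng=None):
    """Minimal K-Means++ seeding (illustrative sketch)."""
    rng = np.random.default_rng(0) if rng is None else rng
    centroids = [X[rng.integers(len(X))]]        # step 1: uniform first pick
    while len(centroids) < k:
        # D(x)^2: squared distance to the nearest already-chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # step 3: sample proportional to D(x)^2 (far points favored)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```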
6. Choosing K — Three Complementary Methods
6.1 The Elbow Method (WCSS vs K)
Plot WCSS for K = 1, 2, 3, …. WCSS always decreases as K grows (more centroids fit data better), but the marginal gain shrinks past the true number of clusters. The "elbow" is where the curve flattens.
The elbow is sometimes ambiguous (smooth curves, no clear bend). When that happens, fall back to the silhouette score.
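Producing the elbow plot takes one loop (synthetic data with a known K = 4 so the bend is visible):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=2)
ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=2).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, "o-")
plt.xlabel("K")
plt.ylabel("WCSS (inertia)")
plt.show()   # the bend should sit at the true K = 4
```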
6.2 Silhouette Score
For each point i, define a(i) = mean distance to other points in its cluster, and b(i) = mean distance to points in the next-nearest cluster. The silhouette is:

s(i) = (b(i) − a(i)) / max(a(i), b(i))
- s ≈ 1: point is well inside its cluster, far from others (good)
- s ≈ 0: point sits on the boundary between two clusters
- s < 0: point is probably mis-assigned
The mean silhouette across all points is a single quality score. Pick K that maximizes it.
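silhouette_samples gives the per-point s(i), which also flags likely mis-assignments (s < 0); dataset and seed are arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.5, random_state=4)
labels = KMeans(n_clusters=3, n_init=10, random_state=4).fit_predict(X)

print("mean silhouette:", silhouette_score(X, labels))
s = silhouette_samples(X, labels)                # per-point s(i)
print("points with s < 0:", int(np.sum(s < 0)))  # likely mis-assignments
```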
6.3 Gap Statistic & Davies–Bouldin
- Gap statistic (Tibshirani et al.): compares your log-WCSS to that expected under a uniform null reference distribution. Pick the smallest K such that Gap(K) ≥ Gap(K+1) − s(K+1), where s is the standard error of the reference runs.
- Davies–Bouldin index: average ratio of within-cluster scatter to between-cluster separation. Lower is better.
- Calinski–Harabasz index (variance ratio criterion): higher is better. Fast and well-correlated with silhouette.
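The gap statistic is not built into scikit-learn; the sketch below (the gap_statistic helper is ours, simplified to a uniform bounding-box null) returns Gap(K) and its standard error:

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, seed=0):
    """Gap(k) = E[log W_ref] - log W_data with a uniform bounding-box null."""
    rng = np.random.default_rng(seed)
    log_w = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_log_w = np.array([
        np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
               .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
        for _ in range(n_refs)
    ])
    gap = ref_log_w.mean() - log_w
    se = ref_log_w.std() * np.sqrt(1 + 1 / n_refs)  # standard-error term from the paper
    return gap, se

# Decision rule: pick the smallest K with gap(K) >= gap(K+1) - se(K+1)
```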
7. Worked Numerical Example (K = 2, 1D Data)
Points: {1, 2, 3, 10, 11, 12}. Initial centroids picked badly: μ₁ = 2, μ₂ = 3.
| Iter | μ₁ | μ₂ | Cluster 1 | Cluster 2 | WCSS |
|---|---|---|---|---|---|
| 0 | 2.0 | 3.0 | {1, 2} | {3, 10, 11, 12} | 50.5 |
| 1 | 1.5 | 9.0 | {1, 2, 3} | {10, 11, 12} | 4.0 |
| 2 | 2.0 | 11.0 | {1, 2, 3} | {10, 11, 12} | 4.0 |
| 3 | 2.0 | 11.0 | {1, 2, 3} | {10, 11, 12} | 4.0 ✓ converged |
Notice WCSS dropped from 50.5 → 4.0 within a single update, and the assignments never changed again. This tiny 1D example is forgiving: even a bad seed escapes to the optimum. But in higher dimensions with more clusters, a poor initialization (say, two centroids seeded inside the same true cluster) routinely converges to a much worse local minimum. That's the local-minimum trap K-Means++ helps avoid.
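You can reproduce the table by feeding the bad seed directly to scikit-learn (init accepts an explicit array; n_init=1 keeps just that single run):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1], [2], [3], [10], [11], [12]], dtype=float)
bad_seed = np.array([[2.0], [3.0]])              # the table's initial centroids

km = KMeans(n_clusters=2, init=bad_seed, n_init=1).fit(X)
print(km.cluster_centers_.ravel())               # [ 2. 11.]
print(km.inertia_)                               # 4.0
```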
8. Limitations & Failure Modes
- K must be specified in advance. Use elbow / silhouette / gap, or switch to DBSCAN / HDBSCAN / Mean Shift which discover K automatically.
- Assumes spherical, equal-variance clusters. Elongated, curved, or density-varying clusters break it. Try Gaussian Mixture Models for ellipsoidal clusters or spectral clustering for manifold structure.
- Sensitive to feature scaling. A feature with range [0, 1000] will dominate one with range [0, 1]. Always apply StandardScaler first.
- Sensitive to outliers. One extreme point pulls a centroid badly because of squared distance. Use K-Medoids (PAM) with L1 distance for robustness.
- Curse of dimensionality. Euclidean distance becomes uninformative in very high dimensions — reduce with PCA or use cosine-based clustering on text/embeddings.
- Empty clusters can occur if no points are assigned to a centroid. sklearn handles this by reseeding the empty centroid to the farthest point.
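A quick demonstration of the scaling point from the list above: blow one feature up by 1000x and compare agreement with the true labels, where scaling typically recovers the structure (data and seeds are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=400, centers=3, random_state=6)
X_bad = X * np.array([1000.0, 1.0])              # feature 0 now dominates distances

raw = KMeans(n_clusters=3, n_init=10, random_state=6).fit_predict(X_bad)
scl = KMeans(n_clusters=3, n_init=10, random_state=6).fit_predict(
    StandardScaler().fit_transform(X_bad))
# Agreement with the true labels; scaling usually scores higher here
print(adjusted_rand_score(y, raw), adjusted_rand_score(y, scl))
```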
9. Variants Worth Knowing
| Variant | What changes | When to use |
|---|---|---|
| MiniBatch K-Means | Updates centroids on small random batches | Massive datasets (millions of points), streaming |
| K-Medoids (PAM) | Centers must be actual data points; uses L1 | Robust to outliers; works with arbitrary distances |
| Bisecting K-Means | Recursive 2-means splits | Hierarchical structure; often better local optima |
| Fuzzy C-Means | Soft assignments (membership ∈ [0,1]) | Overlapping clusters, gene expression |
| Gaussian Mixture (EM) | Probabilistic, ellipsoidal clusters | Non-spherical, soft cluster boundaries |
| K-Means|| (parallel) | Scalable K-Means++ for distributed data | Spark / very large data |
10. Production-Ready Python
```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
X, y_true = make_blobs(n_samples=1500, centers=4, cluster_std=0.8, random_state=42)
# 1. ALWAYS scale first
X = StandardScaler().fit_transform(X)
# 2. Search for K with multiple criteria
results = []
for k in range(2, 11):
km = KMeans(
n_clusters=k,
init="k-means++", # smart init (default)
n_init=10, # 10 random restarts, keep best
max_iter=300,
tol=1e-4,
random_state=42,
)
labels = km.fit_predict(X)
results.append({
"k": k,
"wcss": km.inertia_,
"silhouette": silhouette_score(X, labels),
"davies_bouldin": davies_bouldin_score(X, labels), # lower is better
})
# 3. Pick K maximizing silhouette
best_k = max(results, key=lambda r: r["silhouette"])["k"]
print(f"Best K = {best_k}")
# 4. Final fit
final = KMeans(n_clusters=best_k, init="k-means++", n_init=20, random_state=42).fit(X)
# 5. For huge data: MiniBatchKMeans is 10–100x faster
mbk = MiniBatchKMeans(n_clusters=best_k, batch_size=1024, n_init=10, random_state=42)
mbk.fit(X)
```

11. Decision Guide
- Up to ~1M points → KMeans with init="k-means++" and n_init=10.
- > 1M → MiniBatchKMeans or K-Means|| on Spark.