Support Vector Machines & The Kernel Trick
Maximum-margin classifiers that scale into infinite-dimensional feature spaces — without ever computing them.
1. Intuition: Why Maximum Margin?
Many lines can separate two classes — but which is best? SVM picks the line (hyperplane) that sits as far as possible from the nearest points of each class. That gap is called the margin, and the points touching its edge are the support vectors.
Bigger margin → more "breathing room" → better generalization to unseen data. A boundary that barely squeezes between training points is fragile; a wide-margin boundary is robust.
- Decision boundary: w·x + b = 0
- Positive margin: w·x + b = +1 (touches class +1 support vectors)
- Negative margin: w·x + b = −1 (touches class −1 support vectors)
2. The Hard-Margin Optimization
Encode classes as yᵢ ∈ {−1, +1}. We want every point on the correct side of the margin:
yᵢ(w·xᵢ + b) ≥ 1 for every training point i.
The geometric margin equals 2 / ‖w‖. Maximizing it is the same as minimizing ½‖w‖², giving the canonical primal problem:
minimize ½‖w‖² over w, b   subject to   yᵢ(w·xᵢ + b) ≥ 1 for all i.
Worked example
Take two classes with support vectors at (2, 2) for class −1 and (4, 4) for class +1. By symmetry the optimal boundary is x₁ + x₂ = 6, i.e. w = (1, 1), b = −6. Scale so margins hit ±1:
| Point | Class y | w·x + b | y(w·x+b) | Role |
|---|---|---|---|---|
| (2, 2) | −1 | −2 | +2 | Support vector → scale to 1 |
| (4, 4) | +1 | +2 | +2 | Support vector → scale to 1 |
| (1.5, 2.5) | −1 | −2 | +2 | Also sits on the −1 margin |
| (5, 4.8) | +1 | +3.8 | +3.8 | Far from boundary |
After dividing w and b by 2, the support vectors sit exactly on w·x + b = ±1 and the geometric margin is 2/‖w‖ = 2/√(0.5) ≈ 2.83.
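A quick numpy check of the rescaled solution (w = (0.5, 0.5), b = −3 after dividing by 2) reproduces the table and the margin width; this is just a verification sketch, not part of any library API:
import numpy as np

# Rescaled solution from the worked example: w = (0.5, 0.5), b = -3
w = np.array([0.5, 0.5])
b = -3.0

points = np.array([[2, 2], [4, 4], [1.5, 2.5], [5, 4.8]])
labels = np.array([-1, +1, -1, +1])

scores = points @ w + b        # w·x + b for each point
margins = labels * scores      # y(w·x + b); support vectors give exactly 1
width = 2 / np.linalg.norm(w)  # geometric margin

print(scores)   # [-1.   1.  -1.   1.9]
print(margins)  # [ 1.   1.   1.   1.9]
print(width)    # 2.828...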
3. Soft Margin: When Classes Overlap
Real data is messy. Hard-margin SVM has no solution if the classes aren't perfectly separable. Introduce slack variables ξᵢ ≥ 0 that allow some points to violate the margin, penalized by a constant C:
minimize ½‖w‖² + C Σᵢ ξᵢ   subject to   yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for all i.
The role of C
| C value | Behavior | Margin width | Bias / Variance | When to use |
|---|---|---|---|---|
| C = 0.01 | Almost ignores violations | Very wide | High bias | Very noisy data |
| C = 1 (default) | Balanced | Moderate | Balanced | Most problems |
| C = 100 | Punishes every error | Narrow | High variance | Clean data, you trust labels |
| C → ∞ | Becomes hard-margin | Smallest possible | Overfits noise | Only if perfectly separable |
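To see the trade-off in practice, fit the same overlapping data at several C values and count support vectors. A minimal sketch on a synthetic blob dataset; the exact counts will vary with the data:
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs: a hard margin would have no solution here
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in [0.01, 1, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Higher C narrows the margin, so fewer points end up as support vectors
    print(f"C={C:<6} support vectors: {clf.n_support_.sum()}")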
4. The Dual Form (Why It Matters for Kernels)
Using Lagrange multipliers αᵢ ≥ 0, the problem becomes its dual:
maximize Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ (xᵢ · xⱼ)   subject to   0 ≤ αᵢ ≤ C and Σᵢ αᵢyᵢ = 0.
Two crucial facts fall out of the dual:
- The data only appears as inner products xᵢ · xⱼ.
- Most αᵢ = 0. Only support vectors have αᵢ > 0 — they alone determine the boundary.
That first fact is the key that unlocks the kernel trick.
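Both facts are easy to confirm on a fitted model: scikit-learn exposes the support vectors and the non-zero dual coefficients directly. A minimal sketch on synthetic data:
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_)          # indices of the support vectors -- usually a small subset
print(clf.support_vectors_)  # the support vectors themselves
print(clf.dual_coef_)        # y_i * alpha_i for the support vectors; every other alpha is 0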
5. The Kernel Trick
Suppose your data isn't linearly separable in 2D. You could manually engineer features — square terms, products, etc. — to lift it into a higher-dimensional space where a hyperplane does work. But computing those features explicitly is expensive (sometimes infinite!).
The trick: since the dual only needs inner products, replace xᵢ · xⱼ with a kernel function K(xᵢ, xⱼ) = φ(xᵢ) · φ(xⱼ) that equals the inner product in some higher-dimensional space φ — without ever building φ.
Tiny derivation: polynomial kernel of degree 2
Let x = (x₁, x₂), z = (z₁, z₂). Then:
K(x, z) = (x · z)² = (x₁z₁ + x₂z₂)² = x₁²z₁² + 2·x₁z₁x₂z₂ + x₂²z₂²
That equals φ(x) · φ(z) where φ(x) = (x₁², √2·x₁x₂, x₂²). We get a 3D feature space for the cost of one dot product and one squaring — never building φ.
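The identity is easy to verify numerically; a small sketch with two arbitrary 2D points:
import numpy as np

def phi(v):
    # Explicit degree-2 feature map: (v1^2, sqrt(2)*v1*v2, v2^2)
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.0, 3.0])
z = np.array([2.0, -1.0])

kernel_value = (x @ z) ** 2       # K(x, z) = (x · z)^2, no feature map built
explicit_value = phi(x) @ phi(z)  # the same number via the explicit 3D map

print(kernel_value, explicit_value)  # both print 1.0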
6. The Common Kernels
| Kernel | Formula K(x, z) | Implicit space | Use when |
|---|---|---|---|
| Linear | x · z | Original features | High-dim, sparse (text) |
| Polynomial | (γ x·z + r)ᵈ | All monomials up to degree d | Interactions matter |
| RBF / Gaussian | exp(−γ ‖x − z‖²) | Infinite-dimensional | Default for unknown structure |
| Sigmoid | tanh(γ x·z + r) | Neural-net like | Rare; legacy |
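All four are also available as standalone functions in sklearn.metrics.pairwise, which makes it easy to check the formulas on a couple of points; the γ, r, and d values below are arbitrary illustrations:
import numpy as np
from sklearn.metrics.pairwise import (linear_kernel, polynomial_kernel,
                                      rbf_kernel, sigmoid_kernel)

X = np.array([[1.0, 2.0], [3.0, 0.5]])

print(linear_kernel(X))                                  # x · z
print(polynomial_kernel(X, degree=2, gamma=1, coef0=1))  # (x·z + 1)²
print(rbf_kernel(X, gamma=0.5))                          # exp(−0.5·‖x − z‖²)
print(sigmoid_kernel(X, gamma=0.1, coef0=0))             # tanh(0.1·x·z)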
RBF: a similarity that decays with distance
The RBF kernel returns 1 when two points coincide and falls off smoothly toward 0 as they move apart. The hyper-parameter γ controls how fast that falloff happens.
With γ tuned correctly, RBF wraps each cluster in its own influence zone and carves a curved boundary that no linear model could find.
How γ behaves
| γ | Influence radius | Boundary shape | Risk |
|---|---|---|---|
| Small (0.01) | Wide | Smooth, almost linear | Underfit |
| Medium (1) | Local | Curvy, sensible | Usually best |
| Large (100) | Tiny | Memorizes each point | Severe overfit |
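The support-vector count also makes γ's effect visible: with C fixed, a larger γ shrinks each point's influence zone and forces the model to lean on more support vectors. A minimal sketch on synthetic two-moons data (counts are illustrative):
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

for gamma in [0.01, 1, 100]:
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)
    # Large gamma -> tiny influence radius -> many points become support vectors
    print(f"gamma={gamma:<6} support vectors: {clf.n_support_.sum()}")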
7. Tuning C and γ Together
For the RBF kernel you must tune both. They interact:
- High C + high γ → islands around every point. Memorization.
- Low C + low γ → nearly a straight line. Underfit.
- Sweet spot: grid search with cross-validation.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ("scaler", StandardScaler()),  # SVMs REQUIRE scaling
    ("svc", SVC(kernel="rbf")),
])
grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(pipe, grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
Always scale features first: StandardScaler or MinMaxScaler — every time.
8. Expert Problems and Fixes
- Imbalanced classes: set class_weight="balanced", which scales C inversely to class frequency, effectively widening the margin around the rare class.
- Training is too slow on large datasets: switch to LinearSVC (uses a different solver, scales linearly), or approximate the RBF kernel with Nystroem / RBFSampler + linear SVM (see the sketch below).
- No probability estimates: set probability=True (Platt scaling) or use CalibratedClassifierCV if you need calibrated probabilities for downstream decisions.
- Multiclass problems: sklearn uses one-vs-one by default (n(n−1)/2 classifiers). For many classes, one-vs-rest with LinearSVC is faster.
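For the large-dataset case, here is a minimal sketch of the kernel-approximation route; n_components, gamma, and C are placeholders you would tune on real data:
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)

# Approximate the RBF feature space with 300 landmark components,
# then fit a linear SVM in that space -- scales far better than kernel SVC
clf = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", gamma=0.1, n_components=300, random_state=0),
    LinearSVC(C=1.0, dual="auto"),
)
clf.fit(X, y)
print(clf.score(X, y))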
9. Strengths and Weaknesses
| Strengths | Weaknesses |
|---|---|
| Excellent in high-dimensional spaces (text, genomics) | Slow to train on > 100k samples |
| Robust to overfitting via margin maximization | Sensitive to C, γ, and kernel choice |
| Kernel trick handles non-linearity elegantly | No native probability output |
| Only support vectors define the model — compact | Requires feature scaling |
10. Quick Reference
from sklearn.svm import SVC, LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# Linear, fast, high-dim
clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, dual="auto"))
# Non-linear, default workhorse
clf = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced"))
# Probabilistic predictions
clf = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", probability=True))