ML Algorithms
Supervised Learning
08 / 13

Support Vector Machines & The Kernel Trick

Maximum-margin classifiers that scale into infinite-dimensional feature spaces — without ever computing them.

1. Intuition: Why Maximum Margin?

Many lines can separate two classes — but which is best? SVM picks the line (hyperplane) that sits as far as possible from the nearest points of each class. That gap is called the margin, and the points touching its edge are the support vectors.

Bigger margin → more "breathing room" → better generalization to unseen data. A boundary that barely squeezes between training points is fragile; a wide-margin boundary is robust.

[Figure: hard-margin SVM on linearly separable data. Class 0 and Class 1 point clouds with the decision boundary w·x + b = 0 running between them.]
Three lines you should picture
  • Decision boundary: w·x + b = 0
  • Positive margin: w·x + b = +1 (touches class +1 support vectors)
  • Negative margin: w·x + b = −1 (touches class −1 support vectors)
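To make the three lines concrete, here is a minimal sketch (the toy points are invented purely for illustration) that fits a linear SVC, then reads off w, b, and the support vectors:

import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable points (invented for illustration)
X = np.array([[2, 2], [1, 3], [1.5, 2.5], [4, 4], [5, 4.8], [4.5, 5.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w:", w, "b:", b)             # decision boundary: w·x + b = 0
print("support vectors:\n", clf.support_vectors_)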

2. The Hard-Margin Optimization

Encode classes as yᵢ ∈ {−1, +1}. We want every point on the correct side of the margin:

yᵢ (w · xᵢ + b) ≥ 1, for all i

The width of the margin, i.e. the gap between the two margin hyperplanes, equals 2 / ‖w‖. Maximizing it is the same as minimizing ½‖w‖², giving the canonical primal problem:

minimize ½‖w‖² subject to yᵢ(w·xᵢ + b) ≥ 1

Worked example

Take two classes with support vectors at (2, 2) for class −1 and (4, 4) for class +1. By symmetry the optimal boundary is x₁ + x₂ = 6, i.e. w = (1, 1), b = −6. Scale so margins hit ±1:

Point | Class y | w·x + b | y(w·x + b) | Role
(2, 2) | −1 | −2 | +2 | Support vector → scale to 1
(4, 4) | +1 | +2 | +2 | Support vector → scale to 1
(1.5, 2.5) | −1 | −2 | +2 | Also sits on the −1 margin line
(5, 4.8) | +1 | +3.8 | +3.8 | Far from boundary

After dividing w and b by 2, the support vectors sit exactly on w·x + b = ±1 and the margin width is 2/‖w‖ = 2/√0.5 ≈ 2.83.
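You can verify the arithmetic in a few lines; the numbers below are exactly the ones from the example:

import numpy as np

w, b = np.array([1.0, 1.0]), -6.0            # boundary x1 + x2 = 6
w, b = w / 2, b / 2                          # rescale so SVs land on ±1

for x, y in [((2, 2), -1), ((4, 4), +1), ((5, 4.8), +1)]:
    print(x, "functional margin:", y * (w @ np.array(x) + b))

print("margin width:", 2 / np.linalg.norm(w))   # ≈ 2.83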

3. Soft Margin: When Classes Overlap

Real data is messy. Hard-margin SVM has no solution if the classes aren't perfectly separable. Introduce slack variables ξᵢ ≥ 0 that allow some points to violate the margin, penalized by a constant C:

minimize ½‖w‖² + C · Σ ξᵢ subject to yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ
[Figure: soft-margin SVM on overlapping classes. Two points violate the margin; the decision boundary w·x + b = 0 still separates Class 0 from Class 1.]

The role of C

C value | Behavior | Margin width | Bias / variance | When to use
C = 0.01 | Almost ignores violations | Very wide | High bias | Very noisy data
C = 1 (default) | Balanced | Moderate | Balanced | Most problems
C = 100 | Punishes every error | Narrow | High variance | Clean data, trusted labels
C → ∞ | Becomes hard-margin | Smallest possible | Overfits noise | Only if perfectly separable
Mental model for C
C is the price you pay for each unit of margin violation. Low C = "I'd rather have a fat margin than fight every outlier." High C = "Every misclassification is unacceptable."
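One way to see this in practice is a quick sweep over C (a sketch on synthetic overlapping blobs; the exact counts will vary with the data): low C buys a wide margin and many support vectors, high C buys a narrow one.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters (synthetic data, for illustration only)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in [0.01, 1, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:<6} support vectors: {clf.n_support_.sum():3d}  "
          f"margin width: {2 / np.linalg.norm(clf.coef_):.3f}")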

4. The Dual Form (Why It Matters for Kernels)

Using Lagrange multipliers αᵢ ≥ 0, the problem becomes:

maximize Σ αᵢ − ½ ΣΣ αᵢ αⱼ yᵢ yⱼ (xᵢ · xⱼ) subject to Σ αᵢ yᵢ = 0 and 0 ≤ αᵢ ≤ C (drop the upper bound for the hard-margin case)

Two crucial facts fall out of the dual:

  • The data only appears as inner products xᵢ · xⱼ.
  • Most αᵢ = 0. Only support vectors have αᵢ > 0 — they alone determine the boundary.

That first fact is the key that unlocks the kernel trick.
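sklearn exposes exactly this sparsity: after fitting, dual_coef_ stores yᵢαᵢ for the support vectors only; every other training point has αᵢ = 0 and simply does not appear. A quick look, on synthetic blobs for illustration:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("training points :", len(X))
print("support vectors :", len(clf.support_))     # indices with alpha > 0
print("y_i * alpha_i   :", clf.dual_coef_[0])     # everyone else is 0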

5. The Kernel Trick

Suppose your data isn't linearly separable in 2D. You could manually engineer features — square terms, products, etc. — to lift it into a higher-dimensional space where a hyperplane does work. But computing those features explicitly is expensive (sometimes infinite!).

The trick: since the dual only needs inner products, replace xᵢ · xⱼ with a kernel function K(xᵢ, xⱼ) = φ(xᵢ) · φ(xⱼ) that equals the inner product in some higher-dimensional space φ — without ever building φ.

The whole idea in one line
Compute a similarity score between two points that secretly lives in a richer space.

Tiny derivation: polynomial kernel of degree 2

Let x = (x₁, x₂), z = (z₁, z₂). Then:

(x · z)² = (x₁z₁ + x₂z₂)² = x₁²z₁² + 2x₁x₂z₁z₂ + x₂²z₂²

That equals φ(x) · φ(z) where φ(x) = (x₁², √2·x₁x₂, x₂²). We get a 3D feature space for the cost of one dot product and one squaring, never building φ.
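A short numeric check of the identity (arbitrary test vectors; any pair works):

import numpy as np

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])

def phi(v):                     # the explicit 3-D lift from the derivation
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

print((x @ z) ** 2)             # kernel: one dot product, one square -> 16.0
print(phi(x) @ phi(z))          # same value via the explicit map     -> 16.0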

6. The Common Kernels

Kernel | Formula K(x, z) | Implicit space | Use when
Linear | x · z | Original features | High-dim, sparse (text)
Polynomial | (γ x·z + r)ᵈ | All monomials up to degree d | Interactions matter
RBF / Gaussian | exp(−γ ‖x − z‖²) | Infinite-dimensional | Default for unknown structure
Sigmoid | tanh(γ x·z + r) | Neural-net-like | Rare; legacy

RBF: a similarity that decays with distance

The RBF kernel returns 1 when two points coincide and falls off smoothly toward 0 as they move apart. The hyper-parameter γ controls how fast that falloff happens.
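The formula is short enough to check directly (a sketch; the γ values match the figure below):

import numpy as np

def rbf(x, z, gamma):
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.zeros(2)
for d in [0.0, 0.5, 1.0, 2.0]:
    z = np.array([d, 0.0])
    print(d, [round(rbf(x, z, g), 4) for g in (0.25, 1, 4)])
# distance 0 -> kernel value 1 for every gamma; larger gamma decays faster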

[Figure: RBF kernel value K(x, z) vs. distance ‖x − z‖ for γ = 0.25 (broad), γ = 1 (medium), and γ = 4 (narrow).]
[Figure: RBF SVM on XOR-style data (Class 0 and Class 1 in opposite corners). A linear separator is impossible; the RBF boundary separates the clusters easily.]

With γ tuned correctly, RBF wraps each cluster in its own influence zone and carves a curved boundary that no linear model could find.
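A sketch of that claim on synthetic XOR-style labels (sign of x₁·x₂); the training scores are indicative, not exact:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)          # XOR-style labels

print("linear:", SVC(kernel="linear").fit(X, y).score(X, y))        # near chance
print("rbf:   ", SVC(kernel="rbf", gamma=1.0).fit(X, y).score(X, y))  # near 1.0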

How γ behaves

γ | Influence radius | Boundary shape | Risk
Small (0.01) | Wide | Smooth, almost linear | Underfit
Medium (1) | Local | Curvy, sensible | Usually best
Large (100) | Tiny | Memorizes each point | Severe overfit

7. Tuning C and γ Together

For the RBF kernel you must tune both. They interact:

  • High C + high γ → islands around every point. Memorization.
  • Low C + low γ → nearly a straight line. Underfit.
  • Sweet spot: grid search with cross-validation.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("scaler", StandardScaler()),     # SVMs REQUIRE scaling
    ("svc",    SVC(kernel="rbf")),
])

grid = {
    "svc__C":     [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

search = GridSearchCV(pipe, grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
Always scale your features
SVMs measure distances. A feature in the range [0, 1,000,000] will dominate one in [0, 1] and the kernel becomes meaningless. Use StandardScaler or MinMaxScaler — every time.

8. Expert Problems and Fixes

1
Class imbalance
Default SVM treats every error equally, which is bad when the minority class is rare. Fix with class_weight="balanced", which scales C inversely to class frequency so errors on the rare class cost more.
2
Huge datasets (n > 100k)
Kernel SVM training is O(n²)–O(n³). Switch to LinearSVC (a different solver that scales roughly linearly), or approximate the RBF kernel with Nystroem / RBFSampler plus a linear SVM (see the sketches after this list).
3
Probability outputs
SVMs output signed distances, not probabilities. Wrap with probability=True (Platt scaling) or use CalibratedClassifierCV if you need calibrated probabilities for downstream decisions (see the sketches after this list).
4
Multi-class
SVM is binary by nature. sklearn uses one-vs-one by default (n(n−1)/2 classifiers). For many classes, one-vs-rest with LinearSVC is faster.
5
Choosing a kernel
Start linear. If accuracy is poor and you have moderate data (< 50k rows), try RBF. Use polynomial when you have prior knowledge that interactions of a specific degree matter (e.g. physics features). Sigmoid is rarely competitive — skip it.
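Sketches for fixes 2 and 3 above (the pipeline shapes are illustrative; tune n_components, gamma, and cv for your data):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Fix 2: approximate the RBF kernel with 300 landmark points, then go linear
big_data_clf = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", gamma=0.1, n_components=300, random_state=0),
    LinearSVC(C=1.0, dual="auto"),
)

# Fix 3: Platt-style sigmoid calibration on top of a fast linear SVM
calibrated_clf = make_pipeline(
    StandardScaler(),
    CalibratedClassifierCV(LinearSVC(dual="auto"), method="sigmoid", cv=5),
)
# calibrated_clf.fit(X_train, y_train); calibrated_clf.predict_proba(X_test)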

9. Strengths and Weaknesses

Strengths | Weaknesses
Excellent in high-dimensional spaces (text, genomics) | Slow to train on > 100k samples
Robust to overfitting via margin maximization | Sensitive to C, γ, and kernel choice
Kernel trick handles non-linearity elegantly | No native probability output
Only support vectors define the model (compact) | Requires feature scaling

10. Quick Reference

from sklearn.svm import SVC, LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Linear, fast, high-dim
clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, dual="auto"))

# Non-linear, default workhorse
clf = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced"))

# Probabilistic predictions
clf = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", probability=True))
One-line summary
SVM = widest possible margin + kernel trick = a linear classifier that secretly works in a far richer space, defined entirely by a handful of support vectors.