Support Vector Machines & The Kernel Trick
Maximum-margin classifiers that scale into infinite-dimensional feature spaces — without ever computing them.
1. Intuition: Why Maximum Margin?
Many lines can separate two classes — but which is best? SVM picks the line (hyperplane) that sits as far as possible from the nearest points of each class. That gap is called the margin, and the points touching its edge are the support vectors.
Bigger margin → more "breathing room" → better generalization to unseen data. A boundary that barely squeezes between training points is fragile; a wide-margin boundary is robust.
- Decision boundary: w·x + b = 0
- Positive margin: w·x + b = +1 (touches class +1 support vectors)
- Negative margin: w·x + b = −1 (touches class −1 support vectors)
2. The Hard-Margin Optimization
Encode classes as yᵢ ∈ {−1, +1}. We want every point on the correct side of the margin:
yᵢ(w·xᵢ + b) ≥ 1 for every training point i.
The geometric margin equals 2 / ‖w‖. Maximizing it is the same as minimizing ½‖w‖², giving the canonical primal problem:
minimize ½‖w‖² over w, b   subject to   yᵢ(w·xᵢ + b) ≥ 1 for all i.
Worked example
Take two classes with support vectors at (2, 2) for class −1 and (4, 4) for class +1. By symmetry the optimal boundary is x₁ + x₂ = 6, i.e. w = (1, 1), b = −6. Scale so margins hit ±1:
| Point | Class y | w·x + b | y(w·x+b) | Role |
|---|---|---|---|---|
| (2, 2) | −1 | −2 | +2 | Support vector → scale to 1 |
| (4, 4) | +1 | +2 | +2 | Support vector → scale to 1 |
| (1.5, 2.5) | −1 | −2 | +2 | Also sits on the −1 margin |
| (5, 4.8) | +1 | +3.8 | +3.8 | Far from boundary |
After dividing w and b by 2, the support vectors sit exactly on w·x + b = ±1 and the geometric margin is 2/‖w‖ = 2/√(0.5) ≈ 2.83.
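A quick numpy check of the rescaled solution (w = (0.5, 0.5), b = −3 after dividing by 2) reproduces the table and the margin width; this is just a verification sketch, not part of any library API:
import numpy as np

# Rescaled solution from the worked example: w = (0.5, 0.5), b = -3
w = np.array([0.5, 0.5])
b = -3.0

points = np.array([[2, 2], [4, 4], [1.5, 2.5], [5, 4.8]])
labels = np.array([-1, +1, -1, +1])

scores = points @ w + b        # w·x + b for each point
margins = labels * scores      # y(w·x + b); support vectors give exactly 1
width = 2 / np.linalg.norm(w)  # geometric margin

print(scores)   # [-1.   1.  -1.   1.9]
print(margins)  # [ 1.   1.   1.   1.9]
print(width)    # 2.828...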
3. Soft Margin: When Classes Overlap
Real data is messy. Hard-margin SVM has no solution if the classes aren't perfectly separable. Introduce slack variables ξᵢ ≥ 0 that allow some points to violate the margin, penalized by a constant C:
minimize ½‖w‖² + C Σᵢ ξᵢ   subject to   yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for all i.
The role of C
| C value | Behavior | Margin width | Bias / Variance | When to use |
|---|---|---|---|---|
| C = 0.01 | Almost ignores violations | Very wide | High bias | Very noisy data |
| C = 1 (default) | Balanced | Moderate | Balanced | Most problems |
| C = 100 | Punishes every error | Narrow | High variance | Clean data, you trust labels |
| C → ∞ | Becomes hard-margin | Smallest possible | Overfits noise | Only if perfectly separable |
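To see the trade-off in practice, fit the same overlapping data at several C values and count support vectors. A minimal sketch on a synthetic blob dataset; the exact counts will vary with the data:
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs: a hard margin would have no solution here
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in [0.01, 1, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Higher C narrows the margin, so fewer points end up as support vectors
    print(f"C={C:<6} support vectors: {clf.n_support_.sum()}")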
4. The Dual Form (Why It Matters for Kernels)
Using Lagrange multipliers αᵢ ≥ 0, the problem becomes its dual:
maximize Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ (xᵢ · xⱼ)   subject to   0 ≤ αᵢ ≤ C and Σᵢ αᵢyᵢ = 0.
Two crucial facts fall out of the dual:
- The data only appears as inner products xᵢ · xⱼ.
- Most αᵢ = 0. Only support vectors have αᵢ > 0 — they alone determine the boundary.
That first fact is the key that unlocks the kernel trick.
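Both facts are easy to confirm on a fitted model: scikit-learn exposes the support vectors and the non-zero dual coefficients directly. A minimal sketch on synthetic data:
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_)          # indices of the support vectors -- usually a small subset
print(clf.support_vectors_)  # the support vectors themselves
print(clf.dual_coef_)        # y_i * alpha_i for the support vectors; every other alpha is 0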
5. The Kernel Trick
Suppose your data isn't linearly separable in 2D. You could manually engineer features — square terms, products, etc. — to lift it into a higher-dimensional space where a hyperplane does work. But computing those features explicitly is expensive (sometimes infinite!).
The trick: since the dual only needs inner products, replace xᵢ · xⱼ with a kernel function K(xᵢ, xⱼ) = φ(xᵢ) · φ(xⱼ) that equals the inner product in some higher-dimensional space φ — without ever building φ.
Tiny derivation: polynomial kernel of degree 2
Let x = (x₁, x₂), z = (z₁, z₂). Then:
K(x, z) = (x · z)² = (x₁z₁ + x₂z₂)² = x₁²z₁² + 2·x₁z₁x₂z₂ + x₂²z₂²
That equals φ(x) · φ(z) where φ(x) = (x₁², √2·x₁x₂, x₂²). We get a 3D feature space for the cost of one dot product and one squaring — never building φ.
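The identity is easy to verify numerically; a small sketch with two arbitrary 2D points:
import numpy as np

def phi(v):
    # Explicit degree-2 feature map: (v1^2, sqrt(2)*v1*v2, v2^2)
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.0, 3.0])
z = np.array([2.0, -1.0])

kernel_value = (x @ z) ** 2       # K(x, z) = (x · z)^2, no feature map built
explicit_value = phi(x) @ phi(z)  # the same number via the explicit 3D map

print(kernel_value, explicit_value)  # both print 1.0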
6. The Common Kernels
| Kernel | Formula K(x, z) | Implicit space | Use when |
|---|---|---|---|
| Linear | x · z | Original features | High-dim, sparse (text) |
| Polynomial | (γ x·z + r)ᵈ | All monomials up to degree d | Interactions matter |
| RBF / Gaussian | exp(−γ ‖x − z‖²) | Infinite-dimensional | Default for unknown structure |
| Sigmoid | tanh(γ x·z + r) | Neural-net like | Rare; legacy |
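All four are also available as standalone functions in sklearn.metrics.pairwise, which makes it easy to check the formulas on a couple of points; the γ, r, and d values below are arbitrary illustrations:
import numpy as np
from sklearn.metrics.pairwise import (linear_kernel, polynomial_kernel,
                                      rbf_kernel, sigmoid_kernel)

X = np.array([[1.0, 2.0], [3.0, 0.5]])

print(linear_kernel(X))                                  # x · z
print(polynomial_kernel(X, degree=2, gamma=1, coef0=1))  # (x·z + 1)²
print(rbf_kernel(X, gamma=0.5))                          # exp(−0.5·‖x − z‖²)
print(sigmoid_kernel(X, gamma=0.1, coef0=0))             # tanh(0.1·x·z)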
RBF: a similarity that decays with distance
The RBF kernel returns 1 when two points coincide and falls off smoothly toward 0 as they move apart. The hyper-parameter γ controls how fast that falloff happens.
With γ tuned correctly, RBF wraps each cluster in its own influence zone and carves a curved boundary that no linear model could find.
How γ behaves
| γ | Influence radius | Boundary shape | Risk |
|---|---|---|---|
| Small (0.01) | Wide | Smooth, almost linear | Underfit |
| Medium (1) | Local | Curvy, sensible | Usually best |
| Large (100) | Tiny | Memorizes each point | Severe overfit |
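The support-vector count also makes γ's effect visible: with C fixed, a larger γ shrinks each point's influence zone and forces the model to lean on more support vectors. A minimal sketch on synthetic two-moons data (counts are illustrative):
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

for gamma in [0.01, 1, 100]:
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)
    # Large gamma -> tiny influence radius -> many points become support vectors
    print(f"gamma={gamma:<6} support vectors: {clf.n_support_.sum()}")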
7. Tuning C and γ Together
For the RBF kernel you must tune both. They interact:
- High C + high γ → islands around every point. Memorization.
- Low C + low γ → nearly a straight line. Underfit.
- Sweet spot: grid search with cross-validation.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ("scaler", StandardScaler()),  # SVMs REQUIRE scaling
    ("svc", SVC(kernel="rbf")),
])
grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(pipe, grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
Always scale features first: StandardScaler or MinMaxScaler — every time.
8. Expert Problems and Fixes
- Imbalanced classes: set class_weight="balanced", which scales C inversely to class frequency, effectively widening the margin around the rare class.
- Training is too slow on large datasets: switch to LinearSVC (uses a different solver, scales linearly), or approximate the RBF kernel with Nystroem / RBFSampler + linear SVM (see the sketch below).
- No probability estimates: set probability=True (Platt scaling) or use CalibratedClassifierCV if you need calibrated probabilities for downstream decisions.
- Multiclass problems: sklearn uses one-vs-one by default (n(n−1)/2 classifiers). For many classes, one-vs-rest with LinearSVC is faster.
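For the large-dataset case, here is a minimal sketch of the kernel-approximation route; n_components, gamma, and C are placeholders you would tune on real data:
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)

# Approximate the RBF feature space with 300 landmark components,
# then fit a linear SVM in that space -- scales far better than kernel SVC
clf = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", gamma=0.1, n_components=300, random_state=0),
    LinearSVC(C=1.0, dual="auto"),
)
clf.fit(X, y)
print(clf.score(X, y))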
9. Strengths and Weaknesses
| Strengths | Weaknesses |
|---|---|
| Excellent in high-dimensional spaces (text, genomics) | Slow to train on > 100k samples |
| Robust to overfitting via margin maximization | Sensitive to C, γ, and kernel choice |
| Kernel trick handles non-linearity elegantly | No native probability output |
| Only support vectors define the model — compact | Requires feature scaling |
10. Quick Reference
from sklearn.svm import SVC, LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# Linear, fast, high-dim
clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, dual="auto"))
# Non-linear, default workhorse
clf = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced"))
# Probabilistic predictions
clf = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", probability=True))