
Linear Regression

The simplest and most fundamental supervised learning algorithm for predicting continuous values.

What is Linear Regression?

Linear Regression models the relationship between input features and a continuous target by fitting a straight line (or hyperplane) through the data. It assumes a linear relationship between inputs X and output y.

The Hypothesis

For a single feature:

ŷ = w·x + b

For multiple features:

ŷ = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
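
In code, the multi-feature hypothesis is just a dot product plus a bias. A tiny sketch (the weights and the sample below are made up purely for illustration):

import numpy as np

# Hypothetical weights and bias for a 3-feature model
w = np.array([1.5, -0.5, 0.25])
b = 4.0

# One sample with features x1, x2, x3
x = np.array([2.0, 1.0, 4.0])

y_hat = np.dot(w, x) + b   # w1·x1 + w2·x2 + w3·x3 + b
print(y_hat)               # 7.5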

Cost Function (MSE) — In Depth

The Mean Squared Error tells us how wrong our line is. For every training point we compute the residual ŷᵢ − yᵢ, square it (so positive and negative errors don't cancel and large errors are penalized more), and average across the dataset.

J(w, b) = (1 / 2m) · Σ (w·xᵢ + b − yᵢ)²

The 1/2 is a convenience: it cancels with the 2 from the derivative when we compute gradients later. It does not change where the minimum is.

Why squared, not absolute?

  • Differentiable everywhere — needed for gradient descent
  • Strongly convex — exactly one global minimum, no local traps
  • Punishes outliers — an error of 4 contributes 16, not 4
  • MAE (absolute error) is more robust to outliers but harder to optimize

Our running example

Suppose we measured 5 students — hours studied (x) vs exam score (y). The true relationship is y = 2x.

Training data: hours studied vs exam score

  i | x (hours) | y (score)
  1 |     1     |     2
  2 |     2     |     4
  3 |     3     |     6
  4 |     4     |     8
  5 |     5     |    10

Computing MSE for different values of w (with b = 0)

Let's evaluate three candidate lines and compute the cost step by step.

Case 1 — w = 1 (line too shallow)
Predictions: 1, 2, 3, 4, 5. Residuals: −1, −2, −3, −4, −5. Squared: 1, 4, 9, 16, 25.
Sum = 55. J = 55 / (2·5) = 5.5
Case 2 — w = 2 (perfect fit)
Predictions: 2, 4, 6, 8, 10. Residuals all zero. Squared sum = 0. J = 0.0
Case 3 — w = 3 (line too steep)
Predictions: 3, 6, 9, 12, 15. Residuals: 1, 2, 3, 4, 5. Squared: 1, 4, 9, 16, 25.
Sum = 55. J = 55 / (2·5) = 5.5
MSE across many values of w (b fixed at 0)

  w   | Predictions ŷ          | Sum of squared errors | J(w, 0)
  0   | 0, 0, 0, 0, 0          | 220                   | 22
  0.5 | 0.5, 1, 1.5, 2, 2.5    | 123.75                | 12.375
  1   | 1, 2, 3, 4, 5          | 55                    | 5.5
  1.5 | 1.5, 3, 4.5, 6, 7.5    | 13.75                 | 1.375
  2   | 2, 4, 6, 8, 10         | 0                     | 0
  2.5 | 2.5, 5, 7.5, 10, 12.5  | 13.75                 | 1.375
  3   | 3, 6, 9, 12, 15        | 55                    | 5.5
  4   | 4, 8, 12, 16, 20       | 220                   | 22

Notice the symmetry — equal errors above and below produce equal cost. The minimum sits exactly at w = 2, which is the true slope.
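
A few lines of NumPy reproduce the table above — a handy sanity check if you want to try other slopes:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

for w in [0, 0.5, 1, 1.5, 2, 2.5, 3, 4]:
    residuals = w * x - y              # ŷᵢ − yᵢ with b = 0
    sse = np.sum(residuals ** 2)       # sum of squared errors
    J = sse / (2 * len(x))             # the 1/(2m) cost
    print(f"w = {w:<3}  SSE = {sse:7.2f}  J = {J:6.3f}")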

Visual: three candidate fits on the same data

[Figure: the three candidate lines plotted over the data (hours studied vs exam score). The w = 2 line (J = 0.0) passes through every point; w = 1 (J = 5.5) underfits and w = 3 (J = 5.5) overshoots.]

Visual: the residuals we are squaring

Below, dashed red lines show residuals for the bad fit w = 1. MSE is the average of the squared lengths of these dashed lines (times ½).

[Figure: residuals (vertical errors) for the underfit line w = 1, b = 0, plotted over hours studied vs exam score.]

The cost as a function of w — the "bowl"

Plotting J(w) across many values of w (keeping b = 0) traces out a parabola. This convex bowl shape is what makes gradient descent guaranteed to find the global minimum.

[Figure: J(w, 0) plotted against w (slope) is a parabola with its minimum at w = 2.]

The full surface: J(w, b) is a 3D bowl

When both w and b are free, the cost becomes a paraboloid surface in 3D. Contour lines (level sets) form concentric ellipses around the unique minimum:

        b
        ^
        |    .-""""-.        each ellipse = points with equal cost
        |   /  .---. \       inner ellipse = lower cost
        |  |  | * | |        * = global minimum (w*, b*)
        |   \  '---' /       gradient descent walks downhill
        |    '-....-'        toward the center
        +---------------> w
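
If you'd rather see the real contours than ASCII art, here is a short matplotlib sketch (assuming matplotlib is installed) that evaluates J on a grid of (w, b) pairs:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# Evaluate J(w, b) = (1/2m) · Σ (w·xᵢ + b − yᵢ)² on a grid
w_grid, b_grid = np.meshgrid(np.linspace(0, 4, 200), np.linspace(-5, 5, 200))
J = sum((w_grid * xi + b_grid - yi) ** 2 for xi, yi in zip(x, y)) / (2 * len(x))

plt.contour(w_grid, b_grid, J, levels=25)
plt.plot(2, 0, "r*", markersize=12)   # the global minimum (w*, b*) = (2, 0)
plt.xlabel("w"); plt.ylabel("b"); plt.title("Level sets of J(w, b)")
plt.show()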

Expert Problems & Edge Cases

Problem 1 — The outlier disaster

Add a single mislabeled point (5, 100) to our dataset. Because MSE squares errors, that one point dominates the entire cost. The fitted line tilts dramatically toward the outlier even though four out of five points clearly follow y = 2x.

[Figure: one outlier in the top-right drags the OLS fit far from the true line. Two lines shown: the true relationship y = 2x and the OLS fit pulled toward the outlier.]

Remedies (the first two are sketched in code after this list):

  • Huber loss — quadratic for small errors, linear beyond a threshold
  • RANSAC — repeatedly fit on random subsets, keep the consensus model
  • Robust regression (e.g. HuberRegressor in sklearn)
  • Outlier detection as a preprocessing step (IQR, isolation forest)
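
Here's a hedged sketch of the Huber and RANSAC remedies in scikit-learn (default settings, not tuned) applied to our data plus the mislabeled point:

import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor, RANSACRegressor

# Our 5 points plus the mislabeled (5, 100)
X = np.array([1, 2, 3, 4, 5, 5]).reshape(-1, 1)
y = np.array([2, 4, 6, 8, 10, 100])

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)                  # quadratic near zero, linear in the tails
ransac = RANSACRegressor(random_state=0).fit(X, y)  # fit random subsets, keep the consensus model

print("OLS slope:   ", ols.coef_[0])                # dragged toward the outlier
print("Huber slope: ", huber.coef_[0])              # much closer to the true slope of 2
print("RANSAC slope:", ransac.estimator_.coef_[0])  # typically ignores the outlier entirely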

Problem 2 — Heteroscedasticity

OLS assumes residual variance is constant across x. If variance grows with x (e.g. predicting income from age — older people have wildly different incomes), the standard errors are wrong even if the line is right.

Diagnose: plot residuals vs predictions — if you see a "fan" or "cone" shape, you have heteroscedasticity. Fix: log-transform the target, weighted least squares, or use heteroscedasticity-robust standard errors.
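
A minimal diagnostic sketch — synthetic data where the noise grows with x, the eyeball test, and the Breusch–Pagan test from statsmodels (the data here is invented just to show the workflow):

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Invented data whose noise grows with x → heteroscedastic by construction
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
y = 3 * x + rng.normal(0, 0.5 * x)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# 1. Eyeball it: residuals vs fitted values should not fan out
plt.scatter(fit.fittedvalues, fit.resid, s=10)
plt.axhline(0, color="red")
plt.xlabel("predicted"); plt.ylabel("residual"); plt.show()

# 2. Formal check: a small p-value rejects constant variance
_, bp_pvalue, _, _ = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan p-value: {bp_pvalue:.4f}")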

Problem 3 — Multicollinearity

If two features are highly correlated (e.g. height_cm and height_inches), the cost surface stops being a clean bowl and becomes a long narrow valley. Many different (w₁, w₂) combinations give nearly identical cost, so coefficient estimates become unstable and uninterpretable.

Diagnose: compute the Variance Inflation Factor (VIF > 10 is suspicious). Fix: drop one of the correlated features, combine them via PCA, or use Ridge regression, which adds an L2 penalty λΣwⱼ² that stabilizes the solution.
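
A sketch of the VIF check with statsmodels — the three features below (including the redundant height_inches) are invented to make the problem obvious:

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, 200)
df = pd.DataFrame({
    "height_cm": height_cm,
    "height_inches": height_cm / 2.54 + rng.normal(0, 0.1, 200),  # almost perfectly correlated
    "age": rng.uniform(18, 65, 200),
})

X = add_constant(df)
for i, col in enumerate(X.columns):
    if col == "const":
        continue                      # the intercept column's VIF isn't meaningful
    print(f"{col:15s} VIF = {variance_inflation_factor(X.values, i):10.1f}")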

Problem 4 — Underfitting non-linear data

Linear regression on data shaped like y = x² will produce a flat useless fit and a high MSE that gradient descent can never reduce — because the model capacity is the bottleneck, not the optimizer.

Fix: add polynomial features (x, x², x³) and the problem becomes linear in the parameters again.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, y)

Problem 5 — When should we NOT minimize MSE?

  • Heavy-tailed targets → use MAE or Huber loss
  • Asymmetric costs (overestimating is worse than underestimating) → use quantile regression
  • Bounded targets (probabilities, counts) → use logistic / Poisson regression
  • Multiplicative errors → minimize MSE on log(y) instead
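
Each bullet maps to an estimator you can swap in. A rough sketch, nothing tuned (QuantileRegressor and PoissonRegressor require a reasonably recent scikit-learn):

import numpy as np
from sklearn.linear_model import (HuberRegressor, QuantileRegressor,
                                  PoissonRegressor, LinearRegression)
from sklearn.compose import TransformedTargetRegressor

robust  = HuberRegressor()                            # heavy-tailed targets
upper   = QuantileRegressor(quantile=0.9, alpha=0.0)  # asymmetric cost: predict the 90th percentile
counts  = PoissonRegressor()                          # non-negative count targets
log_mse = TransformedTargetRegressor(                 # multiplicative errors: fit MSE on log(y)
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp)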

Gradient Descent: Reaching the Minimum

The partial derivatives of MSE are:

∂J/∂w = (1/m) · Σ (w·xᵢ + b − yᵢ) · xᵢ
∂J/∂b = (1/m) · Σ (w·xᵢ + b − yᵢ)

  1. Initialize — start with w = 0, b = 0 (or random small values).
  2. Compute gradients — evaluate ∂J/∂w and ∂J/∂b at the current parameters.
  3. Update — w := w − α · ∂J/∂w, b := b − α · ∂J/∂b.
  4. Repeat — loop until the change in J is negligible or a maximum iteration count is reached.

Choosing the learning rate α

  • Too small → painfully slow convergence, may stop early
  • Just right → steady downhill descent
  • Too large → oscillates across the bowl, may diverge
Practical tip
Plot J after each iteration. A healthy run shows J decreasing smoothly. If it bounces or increases, cut α by 10×. If it barely moves, increase α by 3×.
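
A sketch of that diagnostic on the running example (assuming matplotlib): run the same loop with three different α values and compare the cost curves.

import numpy as np
import matplotlib.pyplot as plt

X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

for lr in [0.005, 0.05, 0.2]:             # too small / about right / too large
    w, b, costs = 0.0, 0.0, []
    for _ in range(200):
        err = w * X + b - y
        w -= lr * (err * X).mean()
        b -= lr * err.mean()
        costs.append((err ** 2).mean() / 2)
    plt.plot(costs, label=f"α = {lr}")

plt.yscale("log")                         # the diverging run would dwarf the others otherwise
plt.xlabel("iteration"); plt.ylabel("J")
plt.legend(); plt.show()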

Python: From Scratch and With sklearn

import numpy as np

# From scratch: gradient descent on MSE
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
w, b, lr = 0.0, 0.0, 0.01

for epoch in range(1000):
    yhat = w * X + b
    err = yhat - y
    dw = (err * X).mean()
    db = err.mean()
    w -= lr * dw
    b -= lr * db
    if epoch % 100 == 0:
        cost = (err ** 2).mean() / 2
        print(f"epoch {epoch:4d}  w={w:.3f}  b={b:.3f}  J={cost:.4f}")

print("Final:", w, b)  # ~ 2.0, ~ 0.0
# With sklearn — closed-form solution (Normal Equation)
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X.reshape(-1, 1), y)
print(model.coef_, model.intercept_)  # [2.] 0.0

Regularization: Ridge, Lasso, and Elastic Net

Plain (OLS) linear regression minimizes only the prediction error. With many features, noisy data, or multicollinearity, this leads to large, unstable coefficients that fit the training set but generalize poorly. Regularization adds a penalty on the size of the weights, shrinking them toward zero.

J(w) = MSE(w)  +  λ · Penalty(w)

The hyper-parameter λ (often called alpha in scikit-learn) controls how much we care about small weights vs. small error. Three penalties define three classic models:

  Method      | Penalty                 | Effect on weights                | Sparsity? | Best when
  OLS         | 0                       | Unconstrained                    | No        | n ≫ p, low noise
  Ridge (L2)  | λ Σ wⱼ²                 | Shrinks all toward 0 (smoothly)  | No        | Many correlated features
  Lasso (L1)  | λ Σ |wⱼ|                | Drives some weights exactly to 0 | Yes       | Feature selection needed
  Elastic Net | λ(α‖w‖₁ + (1−α)‖w‖₂²)   | Mixes shrinkage and sparsity     | Partial   | Many correlated features, selection wanted

Ridge Regression (L2)

J(w) = (1/n) Σ (yᵢ − ŷᵢ)² + λ Σⱼ wⱼ²

Ridge has a beautiful closed-form solution:

w* = (XᵀX + λI)⁻¹ Xᵀy

Adding λI guarantees the matrix is invertible — even when XᵀX is singular due to multicollinearity. This is the original motivation behind Ridge (Hoerl & Kennard, 1970).
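
Translated directly into NumPy on our running example (no intercept, so the numbers line up with the worked table below):

import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float).reshape(-1, 1)
y = np.array([2, 4, 6, 8, 10], dtype=float)

def ridge_closed_form(X, y, lam):
    n_features = X.shape[1]
    # w* = (XᵀX + λI)⁻¹ Xᵀy — use solve() rather than forming the inverse explicitly
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

for lam in [0, 1, 10, 55, 1000]:
    print(lam, ridge_closed_form(X, y, lam))   # [2.0], [1.964], [1.692], [1.0], [0.104]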

Worked example: Ridge on our 5-point dataset

Using the same data (x = 1…5, y = 2x), the OLS slope is exactly w = 2. Watch what Ridge does as we increase λ (with bias = 0):

w_ridge = (Σxᵢyᵢ) / (Σxᵢ² + λ) = 110 / (55 + λ)

(This is the minimizer of the unnormalized objective Σ(yᵢ − w·xᵢ)² + λw².)

  λ    | w_ridge | Predictions at x = 1, 5 | Training MSE | Comment
  0    | 2.000   | 2.00, 10.00             | 0.000        | OLS — perfect fit
  1    | 1.964   | 1.96, 9.82              | 0.014        | Tiny shrinkage
  10   | 1.692   | 1.69, 8.46              | 1.041        | Visible bias
  55   | 1.000   | 1.00, 5.00              | 11.00        | Halved the slope
  1000 | 0.104   | 0.10, 0.52              | 39.53        | Severe under-fit
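
scikit-learn's Ridge uses exactly this unnormalized convention (its alpha is our λ), so the table is easy to reproduce:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

X = np.array([1, 2, 3, 4, 5], dtype=float).reshape(-1, 1)
y = np.array([2, 4, 6, 8, 10], dtype=float)

for lam in [0, 1, 10, 55, 1000]:
    model = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
    w = model.coef_[0]
    print(f"λ = {lam:<5}  w = {w:.3f}  train MSE = {mean_squared_error(y, model.predict(X)):.3f}")
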
[Figure: the fitted lines for λ = 0 (OLS), 10, 55, and 1000 over the data — increasing λ pulls the line toward y = 0.]

The regularization path

Plotting weight magnitude vs. λ shows the classic shrinkage curve: Ridge weights decay smoothly and asymptotically toward zero — never exactly zero.

[Figure: the Ridge shrinkage path w(λ) = 110 / (55 + λ) — weight value vs regularization strength λ.]
Geometric intuition
Ridge constrains the weight vector to lie inside a circle of radius t. The optimal solution is where the elliptical MSE contours first touch this circle — typically a point with all coordinates non-zero but small.

Lasso Regression (L1)

J(w) = (1/n) Σ (yᵢ − ŷᵢ)² + λ Σⱼ |wⱼ|

The L1 penalty has corners at the axes. When MSE contours touch the diamond-shaped constraint region, they often hit a corner — meaning some coefficients become exactly zero. Lasso therefore performs automatic feature selection.

Soft-thresholding: the Lasso update rule

For a single feature with standardized data, the Lasso solution is:

w_lasso = sign(w_ols) · max(|w_ols| − λ, 0)

That "max(... − λ, 0)" is the soft-threshold operator. Any OLS weight smaller in magnitude than λ is killed outright.

[Figure: Lasso weight as a function of the OLS weight for λ = 1 — the soft-threshold curve plotted against the identity (OLS) line. Any weight with |w_ols| < 1 maps to exactly zero.]
Ridge vs. Lasso in one sentence
Ridge shrinks every coefficient a little; Lasso shrinks small ones to zero and leaves the rest almost intact.

  Property                   | Ridge (L2)                 | Lasso (L1)
  Closed form?               | Yes — (XᵀX + λI)⁻¹Xᵀy      | No — coordinate descent or LARS
  Produces zeros?            | No                         | Yes (sparse model)
  With correlated features   | Splits weight across them  | Picks one, drops the others
  Stability under resampling | High                       | Lower (which feature 'wins' can flip)
  Differentiable everywhere? | Yes                        | No — kink at 0

Elastic Net — Best of Both

J(w) = MSE + λ [ α Σ|wⱼ| + (1−α) Σwⱼ² ]

Elastic Net adds both penalties. It keeps Lasso's sparsity but borrows Ridge's stability when features are highly correlated (Lasso alone tends to pick one and discard the rest arbitrarily). The mixing parameter α ∈ [0, 1] controls the balance:

  • α = 1 → pure Lasso
  • α = 0 → pure Ridge
  • α = 0.5 → equal mix (a common default)

Choosing λ — the bias–variance trade-off

Increasing λ raises bias (the model gets simpler, fits training data worse) but lowers variance (it generalizes more reliably). The sweet spot minimizes the validation error, not the training error.

[Figure: conceptual bias–variance curves as λ grows — bias² rises, variance falls, and the total error is lowest somewhere in between.]

In practice, use k-fold cross-validation:

from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import numpy as np

alphas = np.logspace(-3, 3, 50)

# Ridge — automatic CV over alphas
ridge = make_pipeline(StandardScaler(),
                      RidgeCV(alphas=alphas, cv=5))
ridge.fit(X_train, y_train)
print("Best alpha:", ridge[-1].alpha_)

# Lasso — coordinate-descent CV
lasso = make_pipeline(StandardScaler(),
                      LassoCV(alphas=alphas, cv=5, max_iter=10_000))
lasso.fit(X_train, y_train)
print("Non-zero coefs:", np.sum(lasso[-1].coef_ != 0))

# Elastic Net — also tunes the L1 ratio
enet = make_pipeline(StandardScaler(),
                     ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0],
                                  alphas=alphas, cv=5))
enet.fit(X_train, y_train)
Always standardize before regularizing
Penalties act on raw coefficient magnitudes. A feature in millimetres will be unfairly punished compared to one in kilometres. Use StandardScaler so every weight is on the same scale — non-negotiable for Ridge, Lasso, and Elastic Net.

Polynomial & Basis-Function Regression

Linear regression is "linear in the parameters", not in the features. By transforming x into [x, x², x³, …] we can fit curves while keeping the closed-form OLS machinery.

ŷ = w₀ + w₁ x + w₂ x² + … + w_d x^d

  Degree d | Behavior                    | Risk
  1        | Straight line               | Underfit if data curves
  2–3      | Smooth curve                | Usually a sweet spot
  10+      | Wiggles through every point | Severe overfit, huge weights
The classic pairing
High-degree polynomial features plus Ridge regression is one of the most reliable non-linear baselines you can build with linear-regression machinery alone.
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    PolynomialFeatures(degree=5, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),       # tame the high-degree wiggles
)
model.fit(X_train, y_train)

Comparison: OLS vs Ridge vs Lasso — Residuals & Metrics

Theory is one thing — let's actually fit all three models on the same noisy dataset and look at how the residuals and error metrics differ. We use 12 training points with one deliberate outlier at x = 4.5 and 6 held-out test points generated from the same underlying line y = 1 + 1.5x + ε.

Fitted coefficients

  Model           | Intercept b | Slope w | Comment
  OLS             | 0.661       | 1.689   | Slope tilted up by the outlier
  Ridge (λ = 5)   | 1.335       | 1.481   | Shrinks slope, raises intercept
  Lasso (λ = 0.8) | 2.168       | 1.225   | Shrinks slope hardest

Residual plot — training set

Each dashed segment shows how far each prediction missed its true target. A healthy model has residuals scattered randomly around the red zero line. The outlier at predicted ŷ ≈ 8 is the giant positive residual every model struggles with — but watch how its influence on the rest of the points changes:

[Figure: training residuals (y − ŷ) plotted against predicted ŷ for OLS, Ridge (λ=5), and Lasso (λ=0.8) on identical data.]
What the residual plot tells us
  • OLS (green): residuals look balanced — but only because the line bent toward the outlier, masking the bias.
  • Ridge (purple): mostly mild negative residuals on the lower half — the line is slightly above the bulk of points because it refused to chase the outlier as hard.
  • Lasso (cyan): a clear negative bias for the first half — heavy shrinkage under-predicts when the slope is too small. Lasso gave up some fit to gain sparsity.

Residual plot — held-out test set

The training plot can mislead — the real question is how the residuals look on data the model has never seen. This is where regularization usually shines.

[Figure: test residuals (y − ŷ) plotted against predicted ŷ for OLS, Ridge (λ=5), and Lasso (λ=0.8) on the held-out points.]

The metric scoreboard

MSE punishes large errors quadratically, so a single big miss dominates the score. MAE treats every dollar of error the same and is therefore far more outlier-robust. Looking at both together gives a much fuller picture than either alone:

  Model         | Train MSE | Train MAE | Test MSE | Test MAE | Verdict
  OLS           | 1.103     | 0.693     | 0.445    | 0.511    | Best train fit, OK on test
  Ridge (λ=5)   | 1.231     | 0.676     | 0.303    | 0.465    | 🏆 Best generalization
  Lasso (λ=0.8) | 1.743     | 0.974     | 0.452    | 0.556    | Over-shrunk for this dataset
Reading the scoreboard
  • OLS has the lowest training MSE — by design. It minimizes exactly that quantity.
  • Ridge wins on test MSE (0.303 vs 0.445) — about 32% lower error on unseen data. That gap is the bias–variance trade-off paying off.
  • OLS train MAE (0.693) > Ridge train MAE (0.676) even though OLS won on MSE. That happens because OLS spent its "budget" reducing the squared outlier residual; Ridge distributed its errors more evenly.
  • Lasso lost on every metric here — λ = 0.8 was too aggressive for one feature. With dozens of features (most of them noise) Lasso's story would flip.

Reproduce it yourself

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error

# same data used above
x_tr = np.array([0.5,1,1.5,2,2.5,3,3.5,4,4.5,5,5.5,6]).reshape(-1, 1)
y_tr = np.array([1.751, 2.679, 3.086, 3.466, 4.477, 4.905,
                 6.286, 7.804, 11.455, 8.128, 9.544, 10.214])  # outlier at index 8

x_te = np.array([0.8, 1.7, 2.6, 3.4, 4.7, 5.6]).reshape(-1, 1)
y_te = np.array([2.253, 3.085, 4.885, 6.448, 7.378, 9.171])

models = {
    "OLS":   LinearRegression(),
    "Ridge": Ridge(alpha=5.0),
    "Lasso": Lasso(alpha=0.8),
}

for name, m in models.items():
    m.fit(x_tr, y_tr)
    yhat_tr, yhat_te = m.predict(x_tr), m.predict(x_te)
    print(f"{name:6s} | "
          f"train MSE={mean_squared_error(y_tr, yhat_tr):.3f} "
          f"MAE={mean_absolute_error(y_tr, yhat_tr):.3f} | "
          f"test MSE={mean_squared_error(y_te, yhat_te):.3f} "
          f"MAE={mean_absolute_error(y_te, yhat_te):.3f}")
The big lesson
Always report both MSE and MAE on a held-out set, not just training MSE. Regularization can look worse on training error and dramatically better in production — that's exactly when you should ship it.

Quick Decision Guide

  1. Start with OLS — it's your interpretable baseline. Inspect residual plots and coefficient magnitudes.
  2. Multicollinearity? Reach for Ridge — stable coefficients, closed form, no feature selection.
  3. Hundreds of features, suspect many are noise? Use Lasso — get a sparse, interpretable model "for free".
  4. Both correlated AND noisy? Elastic Net — tune α ∈ {0.1, 0.5, 0.9} via cross-validation.
  5. Curved relationship? Polynomial features + Ridge — lift x into [x, x², x³, …] and let regularization handle the variance.

Assumptions Recap

  • Linear relationship between features and target
  • Independent observations
  • Homoscedastic residuals (constant variance)
  • Approximately normal residuals (for valid inference)
  • No severe multicollinearity
When to use Linear Regression
Use it as your interpretable baseline for any continuous prediction task — house prices, sales forecasting, dose-response curves. If the residual plots look pathological, escalate to Ridge, Lasso, polynomial features, or a non-linear model.