
Linear Regression

The simplest and most fundamental supervised learning algorithm for predicting continuous values.

What is Linear Regression?

Linear Regression models the relationship between input features and a continuous target by fitting a straight line (or hyperplane) through the data. It assumes a linear relationship between inputs X and output y.

The Hypothesis

For a single feature:

ŷ = w·x + b

For multiple features:

ŷ = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
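
In code, the multi-feature hypothesis is just a dot product plus a bias. A tiny sketch (the weights and the sample below are made up purely for illustration):

import numpy as np

# Hypothetical weights and bias for a 3-feature model
w = np.array([1.5, -0.5, 0.25])
b = 4.0

# One sample with features x1, x2, x3
x = np.array([2.0, 1.0, 4.0])

y_hat = np.dot(w, x) + b   # w1·x1 + w2·x2 + w3·x3 + b
print(y_hat)               # 7.5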

Cost Function (MSE) — In Depth

The Mean Squared Error tells us how wrong our line is. For every training point we compute the residual ŷᵢ − yᵢ, square it (so positive and negative errors don't cancel and large errors are penalized more), and average across the dataset.

J(w, b) = (1 / 2m) · Σ (w·xᵢ + b − yᵢ)²

The 1/2 is a convenience: it cancels with the 2 from the derivative when we compute gradients later. It does not change where the minimum is.

Why squared, not absolute?

  • Differentiable everywhere — needed for gradient descent
  • Strongly convex — exactly one global minimum, no local traps
  • Punishes outliers — an error of 4 contributes 16, not 4
  • MAE (absolute error) is more robust to outliers but harder to optimize

Our running example

Suppose we measured 5 students — hours studied (x) vs exam score (y). The true relationship is y = 2x.

Training data: hours studied vs exam score

  i | x (hours) | y (score)
  1 |     1     |     2
  2 |     2     |     4
  3 |     3     |     6
  4 |     4     |     8
  5 |     5     |    10

Computing MSE for different values of w (with b = 0)

Let's evaluate three candidate lines and compute the cost step by step.

Case 1 — w = 1 (line too shallow)
Predictions: 1, 2, 3, 4, 5. Residuals: −1, −2, −3, −4, −5. Squared: 1, 4, 9, 16, 25.
Sum = 55. J = 55 / (2·5) = 5.5
Case 2 — w = 2 (perfect fit)
Predictions: 2, 4, 6, 8, 10. Residuals all zero. Squared sum = 0. J = 0.0
Case 3 — w = 3 (line too steep)
Predictions: 3, 6, 9, 12, 15. Residuals: 1, 2, 3, 4, 5. Squared: 1, 4, 9, 16, 25.
Sum = 55. J = 55 / (2·5) = 5.5
MSE across many values of w (b fixed at 0)

  w   | Predictions ŷ          | Sum of squared errors | J(w, 0)
  0   | 0, 0, 0, 0, 0          | 220                   | 22
  0.5 | 0.5, 1, 1.5, 2, 2.5    | 123.75                | 12.375
  1   | 1, 2, 3, 4, 5          | 55                    | 5.5
  1.5 | 1.5, 3, 4.5, 6, 7.5    | 13.75                 | 1.375
  2   | 2, 4, 6, 8, 10         | 0                     | 0
  2.5 | 2.5, 5, 7.5, 10, 12.5  | 13.75                 | 1.375
  3   | 3, 6, 9, 12, 15        | 55                    | 5.5
  4   | 4, 8, 12, 16, 20       | 220                   | 22

Notice the symmetry — equal errors above and below produce equal cost. The minimum sits exactly at w = 2, which is the true slope.
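
A few lines of NumPy reproduce the table above — a handy sanity check if you want to try other slopes:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

for w in [0, 0.5, 1, 1.5, 2, 2.5, 3, 4]:
    residuals = w * x - y              # ŷᵢ − yᵢ with b = 0
    sse = np.sum(residuals ** 2)       # sum of squared errors
    J = sse / (2 * len(x))             # the 1/(2m) cost
    print(f"w = {w:<3}  SSE = {sse:7.2f}  J = {J:6.3f}")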

Visual: three candidate fits on the same data

[Figure: the three candidate lines plotted over the data (hours studied vs exam score). The w = 2 line (J = 0.0) passes through every point; w = 1 (J = 5.5) underfits and w = 3 (J = 5.5) overshoots.]

Visual: the residuals we are squaring

Below, dashed red lines show residuals for the bad fit w = 1. MSE is the average of the squared lengths of these dashed lines (times ½).

[Figure: residuals (vertical errors) for the underfit line w = 1, b = 0, plotted over hours studied vs exam score.]

The cost as a function of w — the "bowl"

Plotting J(w) across many values of w (keeping b = 0) traces out a parabola. This convex bowl shape is what makes gradient descent guaranteed to find the global minimum.

[Figure: J(w, 0) plotted against w (slope) is a parabola with its minimum at w = 2.]

The full surface: J(w, b) is a 3D bowl

When both w and b are free, the cost becomes a paraboloid surface in 3D. Contour lines (level sets) form concentric ellipses around the unique minimum:

        b
        ^
        |    .-""""-.        each ellipse = points with equal cost
        |   /  .---. \       inner ellipse = lower cost
        |  |  | * | |        * = global minimum (w*, b*)
        |   \  '---' /       gradient descent walks downhill
        |    '-....-'        toward the center
        +---------------> w
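
If you'd rather see the real contours than ASCII art, here is a short matplotlib sketch (assuming matplotlib is installed) that evaluates J on a grid of (w, b) pairs:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# Evaluate J(w, b) = (1/2m) · Σ (w·xᵢ + b − yᵢ)² on a grid
w_grid, b_grid = np.meshgrid(np.linspace(0, 4, 200), np.linspace(-5, 5, 200))
J = sum((w_grid * xi + b_grid - yi) ** 2 for xi, yi in zip(x, y)) / (2 * len(x))

plt.contour(w_grid, b_grid, J, levels=25)
plt.plot(2, 0, "r*", markersize=12)   # the global minimum (w*, b*) = (2, 0)
plt.xlabel("w"); plt.ylabel("b"); plt.title("Level sets of J(w, b)")
plt.show()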

Expert Problems & Edge Cases

Problem 1 — The outlier disaster

Add a single mislabeled point (5, 100) to our dataset. Because MSE squares errors, that one point dominates the entire cost. The fitted line tilts dramatically toward the outlier even though four out of five points clearly follow y = 2x.

[Figure: one outlier in the top-right drags the OLS fit far from the true line. Two lines shown: the true relationship y = 2x and the OLS fit pulled toward the outlier.]

Remedies (the first two are sketched in code after this list):

  • Huber loss — quadratic for small errors, linear beyond a threshold
  • RANSAC — repeatedly fit on random subsets, keep the consensus model
  • Robust regression (e.g. HuberRegressor in sklearn)
  • Outlier detection as a preprocessing step (IQR, isolation forest)
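
Here's a hedged sketch of the Huber and RANSAC remedies in scikit-learn (default settings, not tuned) applied to our data plus the mislabeled point:

import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor, RANSACRegressor

# Our 5 points plus the mislabeled (5, 100)
X = np.array([1, 2, 3, 4, 5, 5]).reshape(-1, 1)
y = np.array([2, 4, 6, 8, 10, 100])

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)                  # quadratic near zero, linear in the tails
ransac = RANSACRegressor(random_state=0).fit(X, y)  # fit random subsets, keep the consensus model

print("OLS slope:   ", ols.coef_[0])                # dragged toward the outlier
print("Huber slope: ", huber.coef_[0])              # much closer to the true slope of 2
print("RANSAC slope:", ransac.estimator_.coef_[0])  # typically ignores the outlier entirely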

Problem 2 — Heteroscedasticity

OLS assumes residual variance is constant across x. If variance grows with x (e.g. predicting income from age — older people have wildly different incomes), the standard errors are wrong even if the line is right.

Diagnose: plot residuals vs predictions — if you see a "fan" or "cone" shape, you have heteroscedasticity. Fix: log-transform the target, weighted least squares, or use heteroscedasticity-robust standard errors.
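
A minimal diagnostic sketch — synthetic data where the noise grows with x, the eyeball test, and the Breusch–Pagan test from statsmodels (the data here is invented just to show the workflow):

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Invented data whose noise grows with x → heteroscedastic by construction
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
y = 3 * x + rng.normal(0, 0.5 * x)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# 1. Eyeball it: residuals vs fitted values should not fan out
plt.scatter(fit.fittedvalues, fit.resid, s=10)
plt.axhline(0, color="red")
plt.xlabel("predicted"); plt.ylabel("residual"); plt.show()

# 2. Formal check: a small p-value rejects constant variance
_, bp_pvalue, _, _ = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan p-value: {bp_pvalue:.4f}")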

Problem 3 — Multicollinearity

If two features are highly correlated (e.g. height_cm and height_inches), the cost surface stops being a clean bowl and becomes a long narrow valley. Many different (w₁, w₂) combinations give nearly identical cost, so coefficient estimates become unstable and uninterpretable.

Diagnose: compute the Variance Inflation Factor (VIF > 10 is suspicious). Fix: drop one of the correlated features, combine them via PCA, or use Ridge regression, which adds an L2 penalty λΣwⱼ² that stabilizes the solution.
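
A sketch of the VIF check with statsmodels — the three features below (including the redundant height_inches) are invented to make the problem obvious:

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, 200)
df = pd.DataFrame({
    "height_cm": height_cm,
    "height_inches": height_cm / 2.54 + rng.normal(0, 0.1, 200),  # almost perfectly correlated
    "age": rng.uniform(18, 65, 200),
})

X = add_constant(df)
for i, col in enumerate(X.columns):
    if col == "const":
        continue                      # the intercept column's VIF isn't meaningful
    print(f"{col:15s} VIF = {variance_inflation_factor(X.values, i):10.1f}")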

Problem 4 — Underfitting non-linear data

Linear regression on data shaped like y = x² will produce a flat useless fit and a high MSE that gradient descent can never reduce — because the model capacity is the bottleneck, not the optimizer.

Fix: add polynomial features (x, x², x³) and the problem becomes linear in the parameters again.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, y)

Problem 5 — When should we NOT minimize MSE?

  • Heavy-tailed targets → use MAE or Huber loss
  • Asymmetric costs (overestimating is worse than underestimating) → use quantile regression
  • Bounded targets (probabilities, counts) → use logistic / Poisson regression
  • Multiplicative errors → minimize MSE on log(y) instead
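
Each bullet maps to an estimator you can swap in. A rough sketch, nothing tuned (QuantileRegressor and PoissonRegressor require a reasonably recent scikit-learn):

import numpy as np
from sklearn.linear_model import (HuberRegressor, QuantileRegressor,
                                  PoissonRegressor, LinearRegression)
from sklearn.compose import TransformedTargetRegressor

robust  = HuberRegressor()                            # heavy-tailed targets
upper   = QuantileRegressor(quantile=0.9, alpha=0.0)  # asymmetric cost: predict the 90th percentile
counts  = PoissonRegressor()                          # non-negative count targets
log_mse = TransformedTargetRegressor(                 # multiplicative errors: fit MSE on log(y)
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp)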

Gradient Descent: Reaching the Minimum

The partial derivatives of MSE are:

∂J/∂w = (1/m) · Σ (w·xᵢ + b − yᵢ) · xᵢ
∂J/∂b = (1/m) · Σ (w·xᵢ + b − yᵢ)

  1. Initialize — start with w = 0, b = 0 (or random small values).
  2. Compute gradients — evaluate ∂J/∂w and ∂J/∂b at the current parameters.
  3. Update — w := w − α · ∂J/∂w, b := b − α · ∂J/∂b.
  4. Repeat — loop until the change in J is negligible or a maximum iteration count is reached.

Choosing the learning rate α

  • Too small → painfully slow convergence, may stop early
  • Just right → steady downhill descent
  • Too large → oscillates across the bowl, may diverge
Practical tip
Plot J after each iteration. A healthy run shows J decreasing smoothly. If it bounces or increases, cut α by 10×. If it barely moves, increase α by 3×.
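
A sketch of that diagnostic on the running example (assuming matplotlib): run the same loop with three different α values and compare the cost curves.

import numpy as np
import matplotlib.pyplot as plt

X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

for lr in [0.005, 0.05, 0.2]:             # too small / about right / too large
    w, b, costs = 0.0, 0.0, []
    for _ in range(200):
        err = w * X + b - y
        w -= lr * (err * X).mean()
        b -= lr * err.mean()
        costs.append((err ** 2).mean() / 2)
    plt.plot(costs, label=f"α = {lr}")

plt.yscale("log")                         # the diverging run would dwarf the others otherwise
plt.xlabel("iteration"); plt.ylabel("J")
plt.legend(); plt.show()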

Python: From Scratch and With sklearn

import numpy as np

# From scratch: gradient descent on MSE
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
w, b, lr = 0.0, 0.0, 0.01

for epoch in range(1000):
    yhat = w * X + b
    err = yhat - y
    dw = (err * X).mean()
    db = err.mean()
    w -= lr * dw
    b -= lr * db
    if epoch % 100 == 0:
        cost = (err ** 2).mean() / 2
        print(f"epoch {epoch:4d}  w={w:.3f}  b={b:.3f}  J={cost:.4f}")

print("Final:", w, b)  # ~ 2.0, ~ 0.0
# With sklearn — closed-form solution (Normal Equation)
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X.reshape(-1, 1), y)
print(model.coef_, model.intercept_)  # [2.] 0.0

Regularization: Ridge, Lasso, and Elastic Net

Plain (OLS) linear regression minimizes only the prediction error. With many features, noisy data, or multicollinearity, this leads to large, unstable coefficients that fit the training set but generalize poorly. Regularization adds a penalty on the size of the weights, shrinking them toward zero.

J(w) = MSE(w)  +  λ · Penalty(w)

The hyper-parameter λ (often called alpha in scikit-learn) controls how much we care about small weights vs. small error. Three penalties define three classic models:

  Method      | Penalty                 | Effect on weights                | Sparsity? | Best when
  OLS         | 0                       | Unconstrained                    | No        | n ≫ p, low noise
  Ridge (L2)  | λ Σ wⱼ²                 | Shrinks all toward 0 (smoothly)  | No        | Many correlated features
  Lasso (L1)  | λ Σ |wⱼ|                | Drives some weights exactly to 0 | Yes       | Feature selection needed
  Elastic Net | λ(α‖w‖₁ + (1−α)‖w‖₂²)   | Mixes shrinkage and sparsity     | Partial   | Many correlated features, selection wanted

Ridge Regression (L2)

J(w) = (1/n) Σ (yᵢ − ŷᵢ)² + λ Σⱼ wⱼ²

Ridge has a beautiful closed-form solution:

w* = (XᵀX + λI)⁻¹ Xᵀy

Adding λI guarantees the matrix is invertible — even when XᵀX is singular due to multicollinearity. This is the original motivation behind Ridge (Hoerl & Kennard, 1970).
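
Translated directly into NumPy on our running example (no intercept, so the numbers line up with the worked table below):

import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float).reshape(-1, 1)
y = np.array([2, 4, 6, 8, 10], dtype=float)

def ridge_closed_form(X, y, lam):
    n_features = X.shape[1]
    # w* = (XᵀX + λI)⁻¹ Xᵀy — use solve() rather than forming the inverse explicitly
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

for lam in [0, 1, 10, 55, 1000]:
    print(lam, ridge_closed_form(X, y, lam))   # [2.0], [1.964], [1.692], [1.0], [0.104]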

Worked example: Ridge on our 5-point dataset

Using the same data (x = 1…5, y = 2x), the OLS slope is exactly w = 2. Watch what Ridge does as we increase λ (with bias = 0):

w_ridge = (Σxᵢyᵢ) / (Σxᵢ² + λ) = 110 / (55 + λ)

(This is the minimizer of the unnormalized objective Σ(yᵢ − w·xᵢ)² + λw².)

  λ    | w_ridge | Predictions at x = 1, 5 | Training MSE | Comment
  0    | 2.000   | 2.00, 10.00             | 0.000        | OLS — perfect fit
  1    | 1.964   | 1.96, 9.82              | 0.014        | Tiny shrinkage
  10   | 1.692   | 1.69, 8.46              | 1.041        | Visible bias
  55   | 1.000   | 1.00, 5.00              | 11.00        | Halved the slope
  1000 | 0.104   | 0.10, 0.52              | 39.53        | Severe under-fit
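
scikit-learn's Ridge uses exactly this unnormalized convention (its alpha is our λ), so the table is easy to reproduce:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

X = np.array([1, 2, 3, 4, 5], dtype=float).reshape(-1, 1)
y = np.array([2, 4, 6, 8, 10], dtype=float)

for lam in [0, 1, 10, 55, 1000]:
    model = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
    w = model.coef_[0]
    print(f"λ = {lam:<5}  w = {w:.3f}  train MSE = {mean_squared_error(y, model.predict(X)):.3f}")
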
[Figure: the fitted lines for λ = 0 (OLS), 10, 55, and 1000 over the data — increasing λ pulls the line toward y = 0.]

The regularization path

Plotting weight magnitude vs. λ shows the classic shrinkage curve: Ridge weights decay smoothly and asymptotically toward zero — never exactly zero.

[Figure: the Ridge shrinkage path w(λ) = 110 / (55 + λ) — weight value vs regularization strength λ.]
Geometric intuition
Ridge constrains the weight vector to lie inside a circle of radius t. The optimal solution is where the elliptical MSE contours first touch this circle — typically a point with all coordinates non-zero but small.

Lasso Regression (L1)

J(w) = (1/n) Σ (yᵢ − ŷᵢ)² + λ Σⱼ |wⱼ|

The L1 penalty has corners at the axes. When MSE contours touch the diamond-shaped constraint region, they often hit a corner — meaning some coefficients become exactly zero. Lasso therefore performs automatic feature selection.

Soft-thresholding: the Lasso update rule

For a single feature with standardized data, the Lasso solution is:

w_lasso = sign(w_ols) · max(|w_ols| − λ, 0)

That "max(... − λ, 0)" is the soft-threshold operator. Any OLS weight smaller in magnitude than λ is killed outright.

[Figure: Lasso weight as a function of the OLS weight for λ = 1 — the soft-threshold curve plotted against the identity (OLS) line. Any weight with |w_ols| < 1 maps to exactly zero.]
Ridge vs. Lasso in one sentence
Ridge shrinks every coefficient a little; Lasso shrinks small ones to zero and leaves the rest almost intact.

  Property                   | Ridge (L2)                 | Lasso (L1)
  Closed form?               | Yes — (XᵀX + λI)⁻¹Xᵀy      | No — coordinate descent or LARS
  Produces zeros?            | No                         | Yes (sparse model)
  With correlated features   | Splits weight across them  | Picks one, drops the others
  Stability under resampling | High                       | Lower (which feature 'wins' can flip)
  Differentiable everywhere? | Yes                        | No — kink at 0

Elastic Net — Best of Both

J(w) = MSE + λ [ α Σ|wⱼ| + (1−α) Σwⱼ² ]

Elastic Net adds both penalties. It keeps Lasso's sparsity but borrows Ridge's stability when features are highly correlated (Lasso alone tends to pick one and discard the rest arbitrarily). The mixing parameter α ∈ [0, 1] controls the balance:

  • α = 1 → pure Lasso
  • α = 0 → pure Ridge
  • α = 0.5 → equal mix (a common default)

Choosing λ — the bias–variance trade-off

Increasing λ raises bias (the model gets simpler, fits training data worse) but lowers variance (it generalizes more reliably). The sweet spot minimizes the validation error, not the training error.

[Figure: conceptual bias–variance curves as λ grows — bias² rises, variance falls, and the total error is lowest somewhere in between.]

In practice, use k-fold cross-validation:

from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import numpy as np

alphas = np.logspace(-3, 3, 50)

# Ridge — automatic CV over alphas
ridge = make_pipeline(StandardScaler(),
                      RidgeCV(alphas=alphas, cv=5))
ridge.fit(X_train, y_train)
print("Best alpha:", ridge[-1].alpha_)

# Lasso — coordinate-descent CV
lasso = make_pipeline(StandardScaler(),
                      LassoCV(alphas=alphas, cv=5, max_iter=10_000))
lasso.fit(X_train, y_train)
print("Non-zero coefs:", np.sum(lasso[-1].coef_ != 0))

# Elastic Net — also tunes the L1 ratio
enet = make_pipeline(StandardScaler(),
                     ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0],
                                  alphas=alphas, cv=5))
enet.fit(X_train, y_train)
Always standardize before regularizing
Penalties act on raw coefficient magnitudes. A feature in millimetres will be unfairly punished compared to one in kilometres. Use StandardScaler so every weight is on the same scale — non-negotiable for Ridge, Lasso, and Elastic Net.

Polynomial & Basis-Function Regression

Linear regression is "linear in the parameters", not in the features. By transforming x into [x, x², x³, …] we can fit curves while keeping the closed-form OLS machinery.

ŷ = w₀ + w₁ x + w₂ x² + … + w_d x^d

  Degree d | Behavior                    | Risk
  1        | Straight line               | Underfit if data curves
  2–3      | Smooth curve                | Usually a sweet spot
  10+      | Wiggles through every point | Severe overfit, huge weights
The classic pairing
High-degree polynomial features plus Ridge regression is one of the most reliable non-linear baselines you can build with linear-regression machinery alone.
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    PolynomialFeatures(degree=5, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),       # tame the high-degree wiggles
)
model.fit(X_train, y_train)

Comparison: OLS vs Ridge vs Lasso — Residuals & Metrics

Theory is one thing — let's actually fit all three models on the same noisy dataset and look at how the residuals and error metrics differ. We use 12 training points with one deliberate outlier at x = 4.5 and 6 held-out test points generated from the same underlying line y = 1 + 1.5x + ε.

Fitted coefficients

  Model           | Intercept b | Slope w | Comment
  OLS             | 0.661       | 1.689   | Slope tilted up by the outlier
  Ridge (λ = 5)   | 1.335       | 1.481   | Shrinks slope, raises intercept
  Lasso (λ = 0.8) | 2.168       | 1.225   | Shrinks slope hardest

Residual plot — training set

Each dashed segment shows how far each prediction missed its true target. A healthy model has residuals scattered randomly around the red zero line. The outlier at predicted ŷ ≈ 8 is the giant positive residual every model struggles with — but watch how its influence on the rest of the points changes:

[Figure: training residuals (y − ŷ) plotted against predicted ŷ for OLS, Ridge (λ=5), and Lasso (λ=0.8) on identical data.]
What the residual plot tells us
  • OLS (green): residuals look balanced — but only because the line bent toward the outlier, masking the bias.
  • Ridge (purple): mostly mild negative residuals on the lower half — the line is slightly above the bulk of points because it refused to chase the outlier as hard.
  • Lasso (cyan): a clear negative bias for the first half — heavy shrinkage under-predicts when the slope is too small. Lasso gave up some fit to gain sparsity.

Residual plot — held-out test set

The training plot can mislead — the real question is how the residuals look on data the model has never seen. This is where regularization usually shines.

[Figure: test residuals (y − ŷ) plotted against predicted ŷ for OLS, Ridge (λ=5), and Lasso (λ=0.8) on the held-out points.]

The metric scoreboard

MSE punishes large errors quadratically, so a single big miss dominates the score. MAE treats every dollar of error the same and is therefore far more outlier-robust. Looking at both together gives a much fuller picture than either alone:

  Model         | Train MSE | Train MAE | Test MSE | Test MAE | Verdict
  OLS           | 1.103     | 0.693     | 0.445    | 0.511    | Best train fit, OK on test
  Ridge (λ=5)   | 1.231     | 0.676     | 0.303    | 0.465    | 🏆 Best generalization
  Lasso (λ=0.8) | 1.743     | 0.974     | 0.452    | 0.556    | Over-shrunk for this dataset
Reading the scoreboard
  • OLS has the lowest training MSE — by design. It minimizes exactly that quantity.
  • Ridge wins on test MSE (0.303 vs 0.445) — about 32% lower error on unseen data. That gap is the bias–variance trade-off paying off.
  • OLS train MAE (0.693) > Ridge train MAE (0.676) even though OLS won on MSE. That happens because OLS spent its "budget" reducing the squared outlier residual; Ridge distributed its errors more evenly.
  • Lasso lost on every metric here — λ = 0.8 was too aggressive for one feature. With dozens of features (most of them noise) Lasso's story would flip.

Reproduce it yourself

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error

# same data used above
x_tr = np.array([0.5,1,1.5,2,2.5,3,3.5,4,4.5,5,5.5,6]).reshape(-1, 1)
y_tr = np.array([1.751, 2.679, 3.086, 3.466, 4.477, 4.905,
                 6.286, 7.804, 11.455, 8.128, 9.544, 10.214])  # outlier at index 8

x_te = np.array([0.8, 1.7, 2.6, 3.4, 4.7, 5.6]).reshape(-1, 1)
y_te = np.array([2.253, 3.085, 4.885, 6.448, 7.378, 9.171])

models = {
    "OLS":   LinearRegression(),
    "Ridge": Ridge(alpha=5.0),
    "Lasso": Lasso(alpha=0.8),
}

for name, m in models.items():
    m.fit(x_tr, y_tr)
    yhat_tr, yhat_te = m.predict(x_tr), m.predict(x_te)
    print(f"{name:6s} | "
          f"train MSE={mean_squared_error(y_tr, yhat_tr):.3f} "
          f"MAE={mean_absolute_error(y_tr, yhat_tr):.3f} | "
          f"test MSE={mean_squared_error(y_te, yhat_te):.3f} "
          f"MAE={mean_absolute_error(y_te, yhat_te):.3f}")
The big lesson
Always report both MSE and MAE on a held-out set, not just training MSE. Regularization can look worse on training error and dramatically better in production — that's exactly when you should ship it.

Quick Decision Guide

  1. Start with OLS — it's your interpretable baseline. Inspect residual plots and coefficient magnitudes.
  2. Multicollinearity? Reach for Ridge — stable coefficients, closed form, no feature selection.
  3. Hundreds of features, suspect many are noise? Use Lasso — get a sparse, interpretable model "for free".
  4. Both correlated AND noisy? Elastic Net — tune α ∈ {0.1, 0.5, 0.9} via cross-validation.
  5. Curved relationship? Polynomial features + Ridge — lift x into [x, x², x³, …] and let regularization handle the variance.

Assumptions Recap

  • Linear relationship between features and target
  • Independent observations
  • Homoscedastic residuals (constant variance)
  • Approximately normal residuals (for valid inference)
  • No severe multicollinearity
When to use Linear Regression
Use it as your interpretable baseline for any continuous prediction task — house prices, sales forecasting, dose-response curves. If the residual plots look pathological, escalate to Ridge, Lasso, polynomial features, or a non-linear model.