Linear Regression
The simplest and most fundamental supervised learning algorithm for predicting continuous values.
What is Linear Regression?
Linear Regression models the relationship between input features and a continuous target by fitting a straight line (or hyperplane) through the data. It assumes a linear relationship between inputs X and output y.
The Hypothesis
For a single feature:

ŷ = wx + b

where w is the slope (weight) and b is the intercept (bias).

For multiple features:

ŷ = w₁x₁ + w₂x₂ + … + wₙxₙ + b
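In vectorized form the hypothesis is a single line of numpy. A minimal sketch with illustrative weights:

```python
import numpy as np

w = np.array([1.5, -0.7])        # one weight per feature (illustrative values)
b = 0.3                          # bias
X = np.array([[2.0, 1.0],
              [0.5, 3.0]])       # two samples, two features
yhat = X @ w + b                 # ŷ = Xw + b for every sample at once
print(yhat)                      # [ 2.6  -1.05]
```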
Cost Function (MSE) — In Depth
The Mean Squared Error tells us how wrong our line is. For every training point we compute the residual ŷᵢ − yᵢ, square it (so positive and negative errors don't cancel and large errors are penalized more), and average across the dataset:

J(w, b) = (1 / 2m) · Σᵢ₌₁ᵐ (ŷᵢ − yᵢ)²

where m is the number of training examples. The 1/2 is a convenience: it cancels with the 2 from the derivative when we compute gradients later. It does not change where the minimum is.
Why squared, not absolute?
- Differentiable everywhere — needed for gradient descent
- Strongly convex — exactly one global minimum, no local traps
- Punishes outliers — an error of 4 contributes 16, not 4
- MAE (absolute error) is more robust to outliers but harder to optimize
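The contrast is easy to see numerically. A minimal sketch with illustrative residuals, one of them a large outlier:

```python
import numpy as np

residuals = np.array([1.0, -1.0, 0.5, -0.5])   # four small errors
contaminated = np.append(residuals, 10.0)      # plus one large error

for name, r in [("clean", residuals), ("with outlier", contaminated)]:
    print(f"{name:13s} MSE={np.mean(r**2):6.2f}  MAE={np.mean(np.abs(r)):5.2f}")

# clean         MSE=  0.62  MAE= 0.75
# with outlier  MSE= 20.50  MAE= 2.60   (MSE jumps ~33x, MAE only ~3.5x)
```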
Our running example
Suppose we measured 5 students — hours studied (x) vs exam score (y). The true relationship is y = 2x.
| i | x (hours) | y (score) |
|---|---|---|
| 1 | 1 | 2 |
| 2 | 2 | 4 |
| 3 | 3 | 6 |
| 4 | 4 | 8 |
| 5 | 5 | 10 |
Computing MSE for different values of w (with b = 0)
Let's evaluate three candidate lines and compute the cost step by step.

w = 1: predictions ŷ = x give residuals −1, −2, −3, −4, −5, so the squared errors are 1 + 4 + 9 + 16 + 25. Sum = 55. J = 55 / (2·5) = 5.5

w = 2: predictions ŷ = 2x match every target exactly. Sum = 0. J = 0

w = 3: predictions ŷ = 3x give residuals +1, +2, +3, +4, +5, the same magnitudes as before. Sum = 55. J = 55 / (2·5) = 5.5
| w | Predictions ŷ | Sum of squared errors | J(w, 0) |
|---|---|---|---|
| 0 | 0, 0, 0, 0, 0 | 220 | 22 |
| 0.5 | 0.5, 1, 1.5, 2, 2.5 | 123.75 | 12.375 |
| 1 | 1, 2, 3, 4, 5 | 55 | 5.5 |
| 1.5 | 1.5, 3, 4.5, 6, 7.5 | 13.75 | 1.375 |
| 2 | 2, 4, 6, 8, 10 | 0 | 0 |
| 2.5 | 2.5, 5, 7.5, 10, 12.5 | 13.75 | 1.375 |
| 3 | 3, 6, 9, 12, 15 | 55 | 5.5 |
| 4 | 4, 8, 12, 16, 20 | 220 | 22 |
Notice the symmetry — equal errors above and below produce equal cost. The minimum sits exactly at w = 2, which is the true slope.
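The whole table can be reproduced in a few lines of numpy; a quick sketch:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

for w in [0, 0.5, 1, 1.5, 2, 2.5, 3, 4]:
    sse = np.sum((w * x - y) ** 2)   # sum of squared errors
    J = sse / (2 * len(x))           # the 1/(2m) convention from above
    print(f"w={w:3}  SSE={sse:7.2f}  J={J:6.3f}")
```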
Visual: three candidate fits on the same data
Visual: the residuals we are squaring
Below, dashed red lines show residuals for the bad fit w = 1. MSE is the average of the squared lengths of these dashed lines (times ½).
The cost as a function of w — the "bowl"
Plotting J(w) across many values of w (keeping b = 0) traces out a parabola. This convex bowl shape is what makes gradient descent guaranteed to find the global minimum.
The full surface: J(w, b) is a 3D bowl
When both w and b are free, the cost becomes a paraboloid surface in 3D. Contour lines (level sets) form concentric ellipses around the unique minimum:
b
^
| .-""""-. each ellipse = points with equal cost
| / .---. \ inner ellipse = lower cost
| | | * | | * = global minimum (w*, b*)
| \ '---' / gradient descent walks downhill
| '-....-' toward the center
+---------------> w

Expert Problems & Edge Cases
Problem 1 — The outlier disaster
Add a single mislabeled point (5, 100) to our dataset. Because MSE squares errors, that one point dominates the entire cost. The fitted line tilts dramatically toward the outlier even though four out of five points clearly follow y = 2x.
Remedies:
- Huber loss — quadratic for small errors, linear beyond a threshold
- RANSAC — repeatedly fit on random subsets, keep the consensus model
- Robust regression (e.g. HuberRegressor in sklearn)
- Outlier detection as a preprocessing step (IQR, isolation forest)
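To see the first remedies in action, here is a minimal sketch comparing plain OLS with sklearn's HuberRegressor (quadratic below its epsilon threshold, 1.35 by default) on the contaminated dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

# the clean 5-point dataset plus the mislabeled point (5, 100)
X = np.array([1, 2, 3, 4, 5, 5], dtype=float).reshape(-1, 1)
y = np.array([2, 4, 6, 8, 10, 100], dtype=float)

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

print("OLS   slope:", ols.coef_[0])    # ≈ 13.25, dragged up by the one bad point
print("Huber slope:", huber.coef_[0])  # stays close to the true slope 2
```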
Problem 2 — Heteroscedasticity
OLS assumes residual variance is constant across x. If variance grows with x (e.g. predicting income from age — older people have wildly different incomes), the standard errors are wrong even if the line is right.
Diagnose: plot residuals vs predictions — if you see a "fan" or "cone" shape, you have heteroscedasticity. Fix: log-transform the target, weighted least squares, or use heteroscedasticity-robust standard errors.
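A sketch of the weighted-least-squares fix on synthetic data whose noise grows with x; the weights 1/x² encode the assumption that the variance grows like x²:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
y = 2 * x + rng.normal(scale=0.5 * x)     # noise std grows with x: fan shape

X = x.reshape(-1, 1)
ols = LinearRegression().fit(X, y)
wls = LinearRegression().fit(X, y, sample_weight=1 / x**2)  # weight ∝ 1/variance

print("OLS slope:", ols.coef_[0])   # both land near 2, but WLS gives the
print("WLS slope:", wls.coef_[0])   # noisy high-x points far less influence
```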
Problem 3 — Multicollinearity
If two features are highly correlated (e.g. height_cm and height_inches), the cost surface stops being a clean bowl and becomes a long narrow valley. Many different (w₁, w₂) combinations give nearly identical cost, so coefficient estimates become unstable and uninterpretable.
Diagnose: compute the Variance Inflation Factor (VIF > 10 is suspicious). Fix: drop one of the correlated features, combine them via PCA, or use Ridge regression, which adds an L2 penalty λΣwⱼ² to the cost and stabilizes the solution.
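VIF needs no special library: regress each feature on the others and take 1/(1 − R²). A sketch with two nearly duplicate height features (the vif helper is illustrative, not a library function):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
height_cm = rng.normal(170, 10, 200)
height_in = height_cm / 2.54 + rng.normal(0, 0.1, 200)  # near-duplicate feature
X = np.column_stack([height_cm, height_in])

def vif(X, j):
    """VIF of column j: 1 / (1 - R²) from regressing X[:, j] on the rest."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

print([round(vif(X, j)) for j in range(X.shape[1])])  # both far above 10
```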
Problem 4 — Underfitting non-linear data
Linear regression on data shaped like y = x² will produce a flat useless fit and a high MSE that gradient descent can never reduce — because the model capacity is the bottleneck, not the optimizer.
Fix: add polynomial features (x, x², x³) and the problem becomes linear in the parameters again.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, y)

Problem 5 — When should we NOT minimize MSE?
- Heavy-tailed targets → use MAE or Huber loss
- Asymmetric costs (overestimating is worse than underestimating) → use quantile regression
- Bounded targets (probabilities, counts) → use logistic / Poisson regression
- Multiplicative errors → minimize MSE on log(y) instead
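For the multiplicative-error case, a sketch using sklearn's TransformedTargetRegressor, which fits the linear model on log(y) and automatically applies exp at prediction time (the synthetic data is illustrative):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = np.linspace(1, 10, 100).reshape(-1, 1)
y = np.exp(0.3 * X.ravel()) * rng.lognormal(0, 0.2, 100)  # multiplicative noise

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log, inverse_func=np.exp,   # train on log(y), predict on y's scale
)
model.fit(X, y)
print(model.predict(X[:3]))
```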
Gradient Descent: Reaching the Minimum
The partial derivatives of MSE are:

∂J/∂w = (1/m) Σᵢ (ŷᵢ − yᵢ) xᵢ
∂J/∂b = (1/m) Σᵢ (ŷᵢ − yᵢ)
The algorithm:

1. Initialize w = 0, b = 0 (or random small values).
2. Compute ∂J/∂w and ∂J/∂b on the current parameters.
3. Update w := w − α · ∂J/∂w, b := b − α · ∂J/∂b.
4. Repeat steps 2–3 until the cost stops decreasing.

Choosing the learning rate α
- Too small → painfully slow convergence, may stop early
- Just right → steady downhill descent
- Too large → oscillates across the bowl, may diverge
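A quick sketch of all three regimes on the running example, using the same update rule as the full implementation in the next section (only w is learned; b is held at 0 for brevity):

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

for lr in [0.001, 0.05, 0.2]:                # too small / about right / too large
    w = 0.0
    for _ in range(100):
        w -= lr * ((w * X - y) * X).mean()   # w := w - α · ∂J/∂w
    print(f"alpha={lr:5}: w after 100 steps = {w:.4g}")

# alpha=0.001: w ≈ 1.34      (crawling toward 2)
# alpha= 0.05: w ≈ 2.00      (converged)
# alpha=  0.2: w ≈ -1.7e+08  (diverged)
```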
Python: From Scratch and With sklearn
import numpy as np
# From scratch: gradient descent on MSE
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
w, b, lr = 0.0, 0.0, 0.01
for epoch in range(1000):
yhat = w * X + b
err = yhat - y
dw = (err * X).mean()
db = err.mean()
w -= lr * dw
b -= lr * db
if epoch % 100 == 0:
cost = (err ** 2).mean() / 2
print(f"epoch {epoch:4d} w={w:.3f} b={b:.3f} J={cost:.4f}")
print("Final:", w, b) # ~ 2.0, ~ 0.0# With sklearn — closed-form solution (Normal Equation)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X.reshape(-1, 1), y)
print(model.coef_, model.intercept_)  # [2.] 0.0

Regularization: Ridge, Lasso, and Elastic Net
Plain (OLS) linear regression minimizes only the prediction error. With many features, noisy data, or multicollinearity, this leads to large, unstable coefficients that fit the training set but generalize poorly. Regularization adds a penalty on the size of the weights, shrinking them toward zero.
The hyper-parameter λ (often called alpha in scikit-learn) controls how much we care about small weights vs. small error. Three penalties define three classic models:
| Method | Penalty | Effect on weights | Sparsity? | Best when |
|---|---|---|---|---|
| OLS | 0 | Unconstrained | No | n ≫ p, low noise |
| Ridge (L2) | λ Σ wⱼ² | Shrinks all toward 0 (smoothly) | No | Many correlated features |
| Lasso (L1) | λ Σ \|wⱼ\| | Drives some weights exactly to 0 | Yes | Feature selection needed |
| Elastic Net | λ(α‖w‖₁ + (1−α)‖w‖₂²) | Mixes shrinkage and sparsity | Partial | Many correlated, want selection |
Ridge Regression (L2)
Ridge has a beautiful closed-form solution:

w = (XᵀX + λI)⁻¹ Xᵀy

Adding λI guarantees the matrix is invertible — even when XᵀX is singular due to multicollinearity. This is the original motivation behind Ridge (Hoerl & Kennard, 1970).
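The closed form is a few lines of numpy; a sketch (np.linalg.solve is preferred over an explicit matrix inverse for numerical stability):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (XᵀX + λI) w = Xᵀy."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

X = np.array([[1.], [2.], [3.], [4.], [5.]])
y = np.array([2., 4., 6., 8., 10.])
print(ridge_closed_form(X, y, lam=0.0))   # [2.]    plain OLS
print(ridge_closed_form(X, y, lam=10.0))  # [1.692] shrunk toward zero
```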
Worked example: Ridge on our 5-point dataset
Using the same data (x = 1…5, y = 2x), the OLS slope is exactly w = 2. Watch what Ridge does as we increase λ (with bias = 0):
| λ | w_ridge | Predictions at x = 1, 5 | Training MSE | Comment |
|---|---|---|---|---|
| 0 | 2.000 | 2.00, 10.00 | 0.000 | OLS — perfect fit |
| 1 | 1.964 | 1.96, 9.82 | 0.014 | Tiny shrinkage |
| 10 | 1.692 | 1.69, 8.46 | 1.041 | Visible bias |
| 55 | 1.000 | 1.00, 5.00 | 11.00 | Halved the slope |
| 1000 | 0.104 | 0.10, 0.52 | 39.532 | Severe under-fit |
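With the bias fixed at 0 the closed form collapses to w_ridge = Σxᵢyᵢ / (Σxᵢ² + λ), so the table is easy to reproduce; a sketch:

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = 2 * x

for lam in [0, 1, 10, 55, 1000]:
    w = (x @ y) / (x @ x + lam)       # single-feature ridge, no intercept
    mse = np.mean((w * x - y) ** 2)   # plain training MSE
    print(f"lam={lam:5}  w={w:.3f}  MSE={mse:.3f}")
```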
The regularization path
Plotting weight magnitude vs. λ shows the classic shrinkage curve: Ridge weights decay smoothly and asymptotically toward zero — never exactly zero.
Lasso Regression (L1)
The L1 penalty has corners at the axes. When MSE contours touch the diamond-shaped constraint region, they often hit a corner — meaning some coefficients become exactly zero. Lasso therefore performs automatic feature selection.
Soft-thresholding: the Lasso update rule
For a single feature with standardized data, the Lasso solution is:

w_lasso = sign(w_OLS) · max(|w_OLS| − λ, 0)

That "max(… − λ, 0)" is the soft-threshold operator. Any OLS weight smaller in magnitude than λ is killed outright.
| Property | Ridge (L2) | Lasso (L1) |
|---|---|---|
| Closed form? | Yes — (XᵀX + λI)⁻¹Xᵀy | No — coordinate descent or LARS |
| Produces zeros? | No | Yes (sparse model) |
| With correlated features | Splits weight across them | Picks one, drops the others |
| Stability under resampling | High | Lower (which feature 'wins' can flip) |
| Differentiable everywhere? | Yes | No — kink at 0 |
Elastic Net — Best of Both
Elastic Net adds both penalties. It keeps Lasso's sparsity but borrows Ridge's stability when features are highly correlated (Lasso alone tends to pick one and discard the rest arbitrarily). The mixing parameter α ∈ [0, 1] controls the balance:
- α = 1 → pure Lasso
- α = 0 → pure Ridge
- α = 0.5 → equal mix (a common default)
Choosing λ — the bias–variance trade-off
Increasing λ raises bias (the model gets simpler, fits training data worse) but lowers variance (it generalizes more reliably). The sweet spot minimizes the validation error, not the training error.
In practice, use k-fold cross-validation:
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import numpy as np
alphas = np.logspace(-3, 3, 50)
# Ridge — automatic CV over alphas
ridge = make_pipeline(StandardScaler(),
RidgeCV(alphas=alphas, cv=5))
ridge.fit(X_train, y_train)
print("Best alpha:", ridge[-1].alpha_)
# Lasso — coordinate-descent CV
lasso = make_pipeline(StandardScaler(),
LassoCV(alphas=alphas, cv=5, max_iter=10_000))
lasso.fit(X_train, y_train)
print("Non-zero coefs:", np.sum(lasso[-1].coef_ != 0))
# Elastic Net — also tunes the L1 ratio
enet = make_pipeline(StandardScaler(),
ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0],
alphas=alphas, cv=5))
enet.fit(X_train, y_train)

All three pipelines start with StandardScaler so every weight is on the same scale — non-negotiable for Ridge, Lasso, and Elastic Net.

Polynomial & Basis-Function Regression
Linear regression is "linear in the parameters", not in the features. By transforming x into [x, x², x³, …] we can fit curves while keeping the closed-form OLS machinery.
| Degree d | Behavior | Risk |
|---|---|---|
| 1 | Straight line | Underfit if data curves |
| 2–3 | Smooth curve | Usually a sweet spot |
| 10+ | Wiggles through every point | Severe overfit, huge weights |
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
model = make_pipeline(
PolynomialFeatures(degree=5, include_bias=False),
StandardScaler(),
Ridge(alpha=1.0), # tame the high-degree wiggles
)
model.fit(X_train, y_train)

Comparison: OLS vs Ridge vs Lasso — Residuals & Metrics
Theory is one thing — let's actually fit all three models on the same noisy dataset and look at how the residuals and error metrics differ. We use 12 training points with one deliberate outlier at x = 4.5 and 6 held-out test points generated from the same underlying line y = 1 + 1.5x + ε.
Fitted coefficients
| Model | Intercept b | Slope w | Comment |
|---|---|---|---|
| OLS | 0.661 | 1.689 | Slope tilted up by the outlier |
| Ridge (λ = 5) | 1.335 | 1.481 | Shrinks slope, raises intercept |
| Lasso (λ = 0.8) | 2.168 | 1.225 | Shrinks slope hardest |
Residual plot — training set
Each dashed segment shows how far each prediction missed its true target. A healthy model has residuals scattered randomly around the red zero line. The outlier at predicted ŷ ≈ 8 is the giant positive residual every model struggles with — but watch how its influence on the rest of the points changes:
- OLS (green): residuals look balanced — but only because the line bent toward the outlier, masking the bias.
- Ridge (purple): mostly mild negative residuals on the lower half — the line is slightly above the bulk of points because it refused to chase the outlier as hard.
- Lasso (cyan): a clear negative bias for the first half — heavy shrinkage under-predicts when the slope is too small. Lasso gave up some fit to gain sparsity.
Residual plot — held-out test set
The training plot can mislead — the real question is how the residuals look on data the model has never seen. This is where regularization usually shines.
The metric scoreboard
MSE punishes large errors quadratically, so a single big miss dominates the score. MAE treats every dollar of error the same and is therefore far more outlier-robust. Looking at both together gives a much fuller picture than either alone:
| Model | Train MSE | Train MAE | Test MSE | Test MAE | Verdict |
|---|---|---|---|---|---|
| OLS | 1.103 | 0.693 | 0.445 | 0.511 | Best train fit, OK on test |
| Ridge (λ=5) | 1.231 | 0.676 | 0.303 | 0.465 | 🏆 Best generalization |
| Lasso (λ=0.8) | 1.743 | 0.974 | 0.452 | 0.556 | Over-shrunk for this dataset |
- OLS has the lowest training MSE — by design. It minimizes exactly that quantity.
- Ridge wins on test MSE (0.303 vs 0.445) — about 32% lower error on unseen data. That gap is the bias–variance trade-off paying off.
- OLS train MAE (0.693) > Ridge train MAE (0.676) even though OLS won on MSE. That happens because OLS spent its "budget" reducing the squared outlier residual; Ridge distributed its errors more evenly.
- Lasso lost on every metric here — λ = 0.8 was too aggressive for one feature. With dozens of features (most of them noise) Lasso's story would flip.
Reproduce it yourself
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error
# same data used above
x_tr = np.array([0.5,1,1.5,2,2.5,3,3.5,4,4.5,5,5.5,6]).reshape(-1, 1)
y_tr = np.array([1.751, 2.679, 3.086, 3.466, 4.477, 4.905,
6.286, 7.804, 11.455, 8.128, 9.544, 10.214]) # outlier at index 8
x_te = np.array([0.8, 1.7, 2.6, 3.4, 4.7, 5.6]).reshape(-1, 1)
y_te = np.array([2.253, 3.085, 4.885, 6.448, 7.378, 9.171])
models = {
"OLS": LinearRegression(),
"Ridge": Ridge(alpha=5.0),
"Lasso": Lasso(alpha=0.8),
}
for name, m in models.items():
m.fit(x_tr, y_tr)
yhat_tr, yhat_te = m.predict(x_tr), m.predict(x_te)
print(f"{name:6s} | "
f"train MSE={mean_squared_error(y_tr, yhat_tr):.3f} "
f"MAE={mean_absolute_error(y_tr, yhat_tr):.3f} | "
f"test MSE={mean_squared_error(y_te, yhat_te):.3f} "
f"MAE={mean_absolute_error(y_te, yhat_te):.3f}")Quick Decision Guide
Assumptions Recap
- Linear relationship between features and target
- Independent observations
- Homoscedastic residuals (constant variance)
- Approximately normal residuals (for valid inference)
- No severe multicollinearity
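A minimal diagnostics sketch for two of these assumptions, on illustrative data (scipy is assumed available for the normality test):

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

X = np.array([1, 2, 3, 4, 5], dtype=float).reshape(-1, 1)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.0])

model = LinearRegression().fit(X, y)
yhat = model.predict(X)
resid = y - yhat

# homoscedasticity: |residuals| should not trend with the predictions
print("corr(|resid|, yhat):", np.corrcoef(np.abs(resid), yhat)[0, 1])

# approximately normal residuals (needed for valid inference)
print("Shapiro-Wilk p-value:", stats.shapiro(resid).pvalue)
```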