Logistic Regression
Despite the name, this is a classification algorithm. It models the probability that a sample belongs to a class — and gives you a clean, interpretable linear decision boundary.
From Linear Regression to Probabilities
A linear model z = w·x + b outputs any real number from −∞ to +∞. For classification we need a probability between 0 and 1. We get one by squashing z through the sigmoid (logistic) function: σ(z) = 1 / (1 + e⁻ᶻ).
The shape of sigmoid
- z → +∞ ⇒ σ(z) → 1 (very confident class 1)
- z = 0 ⇒ σ(z) = 0.5 (decision boundary)
- z → −∞ ⇒ σ(z) → 0 (very confident class 0)
Sigmoid output for sample inputs
| z | e⁻ᶻ | σ(z) | Predicted class (threshold 0.5) |
|---|---|---|---|
| -4 | 54.60 | 0.018 | 0 |
| -2 | 7.39 | 0.119 | 0 |
| -1 | 2.72 | 0.269 | 0 |
| 0 | 1.00 | 0.500 | boundary |
| 1 | 0.37 | 0.731 | 1 |
| 2 | 0.14 | 0.881 | 1 |
| 4 | 0.018 | 0.982 | 1 |
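The table can be reproduced with a few lines of NumPy (a quick sketch, not part of any library):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Reproduce the table rows above
for z in [-4, -2, -1, 0, 1, 2, 4]:
    p = sigmoid(z)
    label = "boundary" if z == 0 else int(p >= 0.5)
    print(f"z={z:+d}  e^-z={np.exp(-z):7.3f}  sigma={p:.3f}  class={label}")
```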
Why Not MSE? The Need for Log Loss
Plugging sigmoid into the MSE formula (σ(z) − y)² creates a non-convex cost surface with many local minima — gradient descent gets stuck. We need a cost that:
- Is convex in the parameters (one global minimum)
- Heavily penalizes confident wrong predictions
- Has clean derivatives that pair with sigmoid
Binary Cross-Entropy (Log Loss)
For a single example with prediction p = σ(z) and true label y ∈ {0, 1}:

L(y, p) = −[y·log(p) + (1 − y)·log(1 − p)]

Read it as two cases stitched together:
- If y = 1: loss = −log(p), which penalizes small p
- If y = 0: loss = −log(1 − p), which penalizes large p
Worked examples — the same prediction, different labels
| True y | Predicted p | Loss | Interpretation |
|---|---|---|---|
| 1 | 0.99 | 0.010 | great — high confidence, correct |
| 1 | 0.7 | 0.357 | good |
| 1 | 0.5 | 0.693 | uncertain |
| 1 | 0.2 | 1.609 | wrong, fairly confident — painful |
| 1 | 0.01 | 4.605 | very confident, totally wrong — disaster |
| 0 | 0.01 | 0.010 | great — high confidence, correct |
| 0 | 0.5 | 0.693 | uncertain |
| 0 | 0.99 | 4.605 | very confident, totally wrong — disaster |
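The loss values in the table follow directly from the two-case formula (natural log throughout), which a short sketch can verify:

```python
import numpy as np

def bce(y, p):
    """Binary cross-entropy for a single example, y in {0, 1}."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(round(bce(1, 0.99), 3))  # 0.01  — confident and correct
print(round(bce(1, 0.5), 3))   # 0.693 — maximally uncertain
print(round(bce(1, 0.01), 3))  # 4.605 — confident and wrong
```

Note the symmetry: predicting 0.01 for a true 1 costs exactly as much as predicting 0.99 for a true 0.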
Total cost across the dataset
J(w, b) = −(1/m) Σᵢ [yᵢ·log(pᵢ) + (1 − yᵢ)·log(1 − pᵢ)]

Its gradient is ∂J/∂w = (1/m) Σᵢ (pᵢ − yᵢ) xᵢ. The error term (pᵢ − yᵢ) drives the update, exactly like MSE on linear regression.

The Decision Boundary
We classify as 1 when p ≥ 0.5, which happens exactly when z ≥ 0. So the boundary is the set of points where w·x + b = 0.
It's linear — a straight line in 2D, a plane in 3D, a hyperplane in higher dimensions. The model can only carve space into two half-spaces with a flat cut.
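A tiny sketch with made-up weights (w and b here are illustrative, not fitted) shows that every point on the line w·x + b = 0 sits exactly at p = 0.5:

```python
import numpy as np

# Hypothetical parameters for a 2-feature model
w = np.array([1.5, -2.0])
b = 0.5

# Solve w[0]*x1 + w[1]*x2 + b = 0 for x2: the boundary line
x1 = np.linspace(-3, 3, 5)
x2 = -(w[0] * x1 + b) / w[1]

# Every point on this line has z = 0, hence p = sigmoid(0) = 0.5
z = w[0] * x1 + w[1] * x2 + b
print(np.allclose(z, 0))  # True
```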
Case A — Well-separated classes
When classes are far apart, the boundary fits cleanly between them. The probability ramps sharply from blue (class 0) to purple (class 1) across a thin transition zone.
Case B — Overlapping classes
When the classes mix in the middle, the same algorithm still finds the best linear cut, but accepts that some points will be misclassified. The probability transition is gentler — many points sit near p ≈ 0.5.
Case C — Non-linearly separable data
Logistic regression cannot draw a curved boundary. For data shaped like concentric rings or an XOR pattern, no straight line works. The fix is to engineer non-linear features:
# Add polynomial features so the linear model can carve curves
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
model = make_pipeline(
PolynomialFeatures(degree=2, include_bias=False),
LogisticRegression(C=1.0)
)
model.fit(X, y)

With degree-2 features (x₁², x₂², x₁x₂), the boundary becomes an ellipse, parabola, or hyperbola in the original space, yet the model is still linear in its parameters.
The 0.5 Threshold Is Not Sacred
The default threshold of 0.5 assumes a false positive costs the same as a false negative. In real applications that assumption almost never holds.
| Threshold | Precision | Recall | Use case |
|---|---|---|---|
| 0.1 | low | very high | Cancer screening — never miss a positive |
| 0.3 | moderate | high | Fraud detection — review more cases |
| 0.5 | balanced | balanced | General-purpose default |
| 0.7 | high | moderate | Spam filter — avoid blocking real mail |
| 0.9 | very high | low | High-stakes account suspension |
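Moving the threshold is a one-line change on top of `predict_proba` (the probabilities below are made up for illustration):

```python
import numpy as np

# Hypothetical outputs of clf.predict_proba(X)[:, 1]
probs = np.array([0.05, 0.2, 0.45, 0.6, 0.95])

default = (probs >= 0.5).astype(int)    # general-purpose default
screening = (probs >= 0.1).astype(int)  # recall-oriented (cancer screening)
strict = (probs >= 0.9).astype(int)     # precision-oriented (account suspension)

print(default)    # [0 0 0 1 1]
print(screening)  # [0 1 1 1 1]
print(strict)     # [0 0 0 0 1]
```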
Expert Problems & Edge Cases
Problem 1 — Class imbalance
With 99% negatives and 1% positives, the model can hit 99% accuracy by predicting "negative" for everything. Log loss happily allows this — it minimizes total loss, and most loss comes from negatives.
Fixes:
- class_weight="balanced" in sklearn — weights inversely proportional to class frequencies
- Resampling — undersample majority or oversample minority (SMOTE)
- Lower threshold — predict positive at p > 0.05 instead of 0.5
- Use better metrics — F1, AUPRC, balanced accuracy, not raw accuracy
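The first fix is a single keyword argument. A minimal sketch on synthetic 99:1 data (the dataset here is generated just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: ~99% negatives, ~1% positives
X, y = make_classification(n_samples=2000, weights=[0.99, 0.01], random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# The balanced model flags far more positives: higher recall, lower precision
print("plain positives:   ", plain.predict(X).sum())
print("balanced positives:", balanced.predict(X).sum())
```

`class_weight="balanced"` reweights each example by the inverse frequency of its class, so the rare positives contribute as much total loss as the abundant negatives.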
Problem 2 — Perfect separation
If a single feature perfectly separates the classes (e.g. age < 18 → class 0, age ≥ 18 → class 1), unregularized logistic regression has no finite optimum: making the weight bigger always reduces log loss, so the optimizer pushes w → ∞.
Diagnose: coefficients keep growing; convergence warnings; standard errors explode. Fix: always include L2 regularization (sklearn's default C=1.0 already applies it).
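A small sketch of the effect (the age data is made up): with default regularization the weight stays modest, while a very large C lets it blow up toward the unregularized, unbounded optimum.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One feature that perfectly separates the classes
age = np.array([[5.0], [10.0], [15.0], [20.0], [25.0], [30.0]])
y = (age.ravel() >= 18).astype(int)

# Default L2 regularization (C=1.0) keeps the weight finite and modest
clf = LogisticRegression(C=1.0).fit(age, y)
print("C=1.0 coefficient:", clf.coef_[0][0])

# Weak regularization (large C) lets the weight grow much larger
loose = LogisticRegression(C=1e6, max_iter=10000).fit(age, y)
print("C=1e6 coefficient:", loose.coef_[0][0])
```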
Problem 3 — Multicollinearity & unstable coefficients
Just like linear regression, correlated predictors give unstable, hard-to-interpret weights. Even worse here: a positive coefficient may flip negative when you add or remove a feature.
Fixes: drop redundant features, use Ridge (L2) penalty for stability, or Lasso (L1) for automatic feature selection.
Problem 4 — Probability calibration
Logistic regression is usually well-calibrated out of the box: when it says 70%, roughly 70% of those cases really are positive. But after heavy regularization, class weighting, or resampling, calibration breaks. Predicted "0.9" might really mean 60%.
Diagnose: reliability diagram (predicted vs actual frequency in bins). Fix: Platt scaling or isotonic regression on a held-out set (CalibratedClassifierCV).
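A minimal sketch of the fix with `CalibratedClassifierCV`, using synthetic data for illustration; `method="sigmoid"` is Platt scaling, `method="isotonic"` is isotonic regression:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)

# Cross-validated calibration: fits the base model and the calibrator
# on different folds, so the calibrator never sees its own training data
calibrated = CalibratedClassifierCV(
    LogisticRegression(max_iter=1000), method="sigmoid", cv=5
)
calibrated.fit(X, y)
print(calibrated.predict_proba(X[:3]))
```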
Problem 5 — Multi-class extension
For K > 2 classes, replace sigmoid with softmax:

softmax(z)ₖ = exp(zₖ) / Σⱼ exp(zⱼ)
Each class gets its own weight vector. The cost generalizes to categorical cross-entropy. Equivalently, sklearn offers one-vs-rest — train K binary classifiers and pick the one with the highest score.
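A short sketch of softmax itself; with K = 2 it reduces to the sigmoid, which is why it's the natural generalization:

```python
import numpy as np

def softmax(z):
    """Exponentiate each class score, then normalize to a distribution."""
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # one z_k per class
p = softmax(scores)
print(p)          # three probabilities, highest for the highest score
print(p.sum())    # 1.0
```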
Training: Gradient Descent
1. Initialize w = 0, bias b = 0.
2. Forward pass: z = Xw + b, then p = σ(z) for every example.
3. Gradients: ∂J/∂w = (1/m) Xᵀ(p − y), ∂J/∂b = mean(p − y).
4. Update: w := w − α · ∂J/∂w (and likewise for b); repeat until log loss stops decreasing.

Python Implementation
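The gradient-descent loop can be sketched from scratch in NumPy before reaching for sklearn (an educational sketch, not production code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Batch gradient descent on binary cross-entropy."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)      # forward pass
        grad_w = X.T @ (p - y) / m  # dJ/dw
        grad_b = (p - y).mean()     # dJ/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny sanity check: 1-D data separable around x = 0
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
w, b = fit_logistic(X, y)
print(sigmoid(X @ w + b).round(2))  # low probs for class 0, high for class 1
```

sklearn's `LogisticRegression` performs the same minimization with faster second-order solvers and regularization built in.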
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
# ALWAYS scale features for logistic regression
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000)
clf.fit(scaler.transform(X_tr), y_tr)
print("Accuracy:", clf.score(scaler.transform(X_te), y_te))
print("Probabilities:", clf.predict_proba(scaler.transform(X_te[:3])))
print("Coefficients:", clf.coef_[0][:5])

Use Cases
- Spam vs not-spam email classification
- Disease diagnosis — and you need calibrated probabilities, not just labels
- Customer churn prediction with interpretable feature effects
- Credit risk scoring (where regulators demand interpretable models)
- Click-through rate prediction at massive scale (works on billions of rows)