ML Algorithms
Supervised Learning

Logistic Regression

Despite the name, this is a classification algorithm. It models the probability that a sample belongs to a class — and gives you a clean, interpretable linear decision boundary.

From Linear Regression to Probabilities

A linear model z = w·x + b outputs any real number from −∞ to +∞. For classification we need a probability between 0 and 1. We achieve this by squashing z through the sigmoid (logistic) function.

σ(z) = 1 / (1 + e⁻ᶻ)
P(y = 1 | x) = σ(w·x + b)

The shape of the sigmoid

The sigmoid function: monotonic, smooth, bounded between 0 and 1
(Plot: σ(z) = P(y = 1) against z = w·x + b; the midpoint z = 0 maps to p = 0.5.)
  • z → +∞: σ(z) → 1 (very confident class 1)
  • z = 0: σ(z) = 0.5 (decision boundary)
  • z → −∞: σ(z) → 0 (very confident class 0)

Sigmoid output for sample inputs

A few values to build intuition
z      e⁻ᶻ      σ(z)     Predicted class (threshold 0.5)
−4     54.60    0.018    0
−2     7.39     0.119    0
−1     2.72     0.269    0
 0     1.00     0.500    boundary
 1     0.37     0.731    1
 2     0.14     0.881    1
 4     0.018    0.982    1
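
To reproduce that table, a minimal NumPy sketch (the sigmoid helper below is ours, not a library function):

import numpy as np

def sigmoid(z):
    # σ(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0])
print(np.round(np.exp(-z), 3))          # the e^(-z) column: 54.598, 7.389, 2.718, 1.0, 0.368, 0.135, 0.018
print(np.round(sigmoid(z), 3))          # the σ(z) column: 0.018, 0.119, 0.269, 0.5, 0.731, 0.881, 0.982
print((sigmoid(z) >= 0.5).astype(int))  # predicted class at the 0.5 threshold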

Why Not MSE? The Need for Log Loss

Plugging sigmoid into the MSE formula (σ(z) − y)² creates a non-convex cost surface with many local minima — gradient descent gets stuck. We need a cost that:

  • Is convex in the parameters (one global minimum)
  • Heavily penalizes confident wrong predictions
  • Has clean derivatives that pair with sigmoid

Binary Cross-Entropy (Log Loss)

For a single example with prediction p = σ(z) and true label y ∈ {0, 1}:

L(p, y) = −[ y · log(p) + (1 − y) · log(1 − p) ]

Read it as two cases stitched together:

  • If y = 1: loss = −log(p) — penalizes small p
  • If y = 0: loss = −log(1 − p) — penalizes large p

Log loss as a function of the predicted probability p, for both true labels — confident wrong predictions explode toward infinity:
  • −log(p) when y = 1
  • −log(1 − p) when y = 0

Worked examples — the same prediction, different labels

Why log loss is brutal on confident wrong predictions
True y   Predicted p   Loss    Interpretation
1        0.99          0.010   great — high confidence, correct
1        0.7           0.357   good
1        0.5           0.693   uncertain
1        0.2           1.609   wrong, fairly confident — painful
1        0.01          4.605   very confident, totally wrong — disaster
0        0.01          0.010   great — high confidence, correct
0        0.5           0.693   uncertain
0        0.99          4.605   very confident, totally wrong — disaster
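
These numbers are easy to verify; a short sketch (log_loss_single is just an illustrative helper):

import numpy as np

def log_loss_single(p, y):
    # L(p, y) = -[ y·log(p) + (1 - y)·log(1 - p) ]
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(round(log_loss_single(0.99, 1), 3))  # 0.01  - confident and correct
print(round(log_loss_single(0.20, 1), 3))  # 1.609 - wrong, fairly confident
print(round(log_loss_single(0.01, 1), 3))  # 4.605 - confident and wrong
print(round(log_loss_single(0.99, 0), 3))  # 4.605 - same prediction, opposite label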

Total cost across the dataset

J(w, b) = −(1/m) · Σᵢ [ yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ) ]
Why this works
Combined with sigmoid, the gradient simplifies to the same elegant form as linear regression: ∂J/∂w = (1/m) Σ (pᵢ − yᵢ) xᵢ. The error (pᵢ − yᵢ) drives the update — exactly like MSE on linear regression.

The Decision Boundary

We classify as 1 when p ≥ 0.5, which happens exactly when z ≥ 0. So the boundary is the line:

w₁x₁ + w₂x₂ + b = 0

It's linear — a straight line in 2D, a plane in 3D, a hyperplane in higher dimensions. The model can only carve space into two half-spaces with a flat cut.
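
A sketch of recovering that line from a fitted two-feature model (toy data of our own making; assumes w₂ ≠ 0):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 2-feature data that overlaps around the line x2 = x1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 1] - X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
(w1, w2), b = clf.coef_[0], clf.intercept_[0]

# The p = 0.5 boundary is w1·x1 + w2·x2 + b = 0, i.e. x2 = -(w1·x1 + b) / w2
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x2_on_boundary = -(w1 * x1 + b) / w2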

Case A — Well-separated classes

When classes are far apart, the boundary fits cleanly between them. The probability ramps sharply from blue (class 0) to purple (class 1) across a thin transition zone.

Well-separated data — confident, clean boundary
(Scatter plot: class 0 and class 1 points in the x₁–x₂ plane, with the linear decision boundary p = 0.5 running cleanly between them.)

Case B — Overlapping classes

When the classes mix in the middle, the same algorithm still finds the best linear cut, but accepts that some points will be misclassified. The probability transition is gentler — many points sit near p ≈ 0.5.

Overlapping data — boundary is a compromise; some errors are unavoidable
(Scatter plot: overlapping class 0 and class 1 points in the x₁–x₂ plane, with the best linear decision boundary p = 0.5 passing through the mixed region.)

Case C — Non-linearly separable data

Logistic regression cannot draw a curved boundary. For data shaped like concentric rings or an XOR pattern, no straight line works. The fix is to engineer non-linear features:

# Add polynomial features so the linear model can carve curves
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(C=1.0)
)
model.fit(X, y)

With degree-2 features (x₁², x₂², x₁x₂), the boundary becomes an ellipse, parabola, or hyperbola in the original space — yet the model is still linear in its parameters.

The 0.5 Threshold Is Not Sacred

The default threshold of 0.5 assumes false positives and false negatives cost the same. In real applications they almost never do.

How the threshold reshapes the confusion matrix on the same model
Threshold   Precision   Recall      Use case
0.1         low         very high   Cancer screening — never miss a positive
0.3         moderate    high        Fraud detection — review more cases
0.5         balanced    balanced    General-purpose default
0.7         high        moderate    Spam filter — avoid blocking real mail
0.9         very high   low         High-stakes account suspension
How to choose
Plot the precision-recall curve (or ROC curve) on a held-out validation set. Pick the threshold that maximizes your domain-specific metric — F-beta, expected utility, or a fixed recall target.
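
A sketch of that sweep (clf is a fitted model and X_val, y_val a held-out validation set, both assumed to exist):

import numpy as np
from sklearn.metrics import precision_recall_curve

p_val = clf.predict_proba(X_val)[:, 1]                      # predicted P(y = 1)
precision, recall, thresholds = precision_recall_curve(y_val, p_val)

# Example policy: highest precision among thresholds that keep recall >= 0.90
# (thresholds has one fewer entry than precision/recall, hence the [:-1])
ok = recall[:-1] >= 0.90
best = np.argmax(np.where(ok, precision[:-1], -np.inf))
chosen = thresholds[best]

y_pred = (p_val >= chosen).astype(int)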

Expert Problems & Edge Cases

Problem 1 — Class imbalance

With 99% negatives and 1% positives, the model can hit 99% accuracy by predicting "negative" for everything. Log loss happily allows this — it minimizes total loss, and most loss comes from negatives.

Fixes:

  • class_weight="balanced" in sklearn — weights inversely proportional to class frequencies
  • Resampling — undersample majority or oversample minority (SMOTE)
  • Lower threshold — predict positive at p > 0.05 instead of 0.5
  • Use better metrics — F1, AUPRC, balanced accuracy, not raw accuracy
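
A sketch combining the first, third, and fourth fixes (X_tr, y_tr, X_te, y_te assumed to be an already-split imbalanced dataset):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

p = clf.predict_proba(X_te)[:, 1]
print("AUPRC:", average_precision_score(y_te, p))                  # threshold-free metric
print("F1 at p > 0.05:", f1_score(y_te, (p > 0.05).astype(int)))   # lowered threshold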

Problem 2 — Perfect separation

If a single feature perfectly separates the classes (e.g. age < 18 → class 0, age ≥ 18 → class 1), unregularized logistic regression has no finite optimum: making the weight bigger always reduces log loss, so the optimizer pushes w → ∞.

Diagnose: coefficients keep growing; convergence warnings; standard errors explode. Fix: always include L2 regularization (sklearn's default C = 1.0 already does this).
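
A tiny demonstration, using an age-style feature that separates perfectly at 18 (penalty=None needs a recent scikit-learn; older versions spell it penalty='none'):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[14.0], [16.0], [17.0], [19.0], [20.0], [25.0]])   # perfectly separable at 18
y = np.array([0, 0, 0, 1, 1, 1])

unreg = LogisticRegression(penalty=None, max_iter=10_000).fit(X, y)
reg = LogisticRegression(C=1.0).fit(X, y)                         # sklearn default: L2 penalty

print(unreg.coef_)   # much larger, and keeps growing if you raise max_iter
print(reg.coef_)     # finite and stable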

Problem 3 — Multicollinearity & unstable coefficients

Just like linear regression, correlated predictors give unstable, hard-to-interpret weights. Even worse here: a positive coefficient may flip negative when you add or remove a feature.

Fixes: drop redundant features, use Ridge (L2) penalty for stability, or Lasso (L1) for automatic feature selection.

Problem 4 — Probability calibration

Logistic regression is usually well-calibrated out of the box: when it says 70%, roughly 70% of those cases really are positive. But after heavy regularization, class weighting, or resampling, calibration breaks. Predicted "0.9" might really mean 60%.

Diagnose: reliability diagram (predicted vs actual frequency in bins). Fix: Platt scaling or isotonic regression on a held-out set (CalibratedClassifierCV).
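
A sketch of the sklearn route (X_tr, y_tr, X_te assumed; the isotonic method needs enough data per fold):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

base = LogisticRegression(class_weight="balanced", max_iter=1000)

# Fit the model plus an isotonic calibration map via internal cross-validation
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X_tr, y_tr)

p = calibrated.predict_proba(X_te)[:, 1]   # probabilities closer to observed frequencies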

Problem 5 — Multi-class extension

For K > 2 classes, replace sigmoid with softmax:

P(y = k | x) = e^(zₖ) / Σⱼ e^(zⱼ)

Each class gets its own weight vector. The cost generalizes to categorical cross-entropy. Alternatively, sklearn offers one-vs-rest — train K binary classifiers and pick the one with the highest score.
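
A NumPy sketch of the softmax (with the usual max-subtraction trick for numerical stability):

import numpy as np

def softmax(z):
    # z holds the K class scores z_k = w_k·x + b_k; subtracting the max avoids overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # K = 3 classes
print(softmax(scores))               # ≈ [0.659, 0.242, 0.099], sums to 1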

Training: Gradient Descent

1. Initialize: start with weights w = 0, bias b = 0.
2. Forward pass: compute z = Xw + b, then p = σ(z) for every example.
3. Compute gradient: ∂J/∂w = (1/m) Xᵀ(p − y), ∂J/∂b = mean(p − y).
4. Update: w := w − α · ∂J/∂w, b := b − α · ∂J/∂b; repeat until the log loss stops decreasing.
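
A minimal from-scratch version of that loop (NumPy only; the learning rate and iteration count are arbitrary choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    m, n = X.shape
    w, b = np.zeros(n), 0.0                 # step 1: initialize
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)              # step 2: forward pass
        dw = X.T @ (p - y) / m              # step 3: gradient wrt w
        db = np.mean(p - y)                 #         gradient wrt b
        w -= lr * dw                        # step 4: update
        b -= lr * db
    return w, b

# Usage (features should be scaled first):
# w, b = fit_logistic(X_scaled, y)
# preds = (sigmoid(X_scaled @ w + b) >= 0.5).astype(int)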

Python Implementation

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# ALWAYS scale features for logistic regression
scaler = StandardScaler().fit(X_tr)

clf = LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000)
clf.fit(scaler.transform(X_tr), y_tr)

print("Accuracy:", clf.score(scaler.transform(X_te), y_te))
print("Probabilities:", clf.predict_proba(scaler.transform(X_te[:3])))
print("Coefficients:", clf.coef_[0][:5])

Use Cases

  • Spam vs not-spam email classification
  • Disease diagnosis — where you need calibrated probabilities, not just labels
  • Customer churn prediction with interpretable feature effects
  • Credit risk scoring (where regulators demand interpretable models)
  • Click-through rate prediction at massive scale (works on billions of rows)
Why it remains a top choice
Logistic Regression is fast, interpretable (each weight = log-odds change per unit of x), produces calibrated probabilities, scales to huge datasets, and serves as the strong baseline that more complex models must beat.