Logistic Regression
Despite the name, this is a classification algorithm. It models the probability that a sample belongs to a class — and gives you a clean, interpretable linear decision boundary.
From Linear Regression to Probabilities
A linear model z = w·x + b outputs any real number from −∞ to +∞. For classification we need a probability between 0 and 1. We get one by squashing z through the sigmoid (logistic) function: σ(z) = 1 / (1 + e⁻ᶻ).
The shape of sigmoid
- z → +∞ ⇒ σ(z) → 1 (very confident class 1)
- z = 0 ⇒ σ(z) = 0.5 (decision boundary)
- z → −∞ ⇒ σ(z) → 0 (very confident class 0)
Sigmoid output for sample inputs
| z | e⁻ᶻ | σ(z) | Predicted class (threshold 0.5) |
|---|---|---|---|
| -4 | 54.60 | 0.018 | 0 |
| -2 | 7.39 | 0.119 | 0 |
| -1 | 2.72 | 0.269 | 0 |
| 0 | 1.00 | 0.500 | boundary |
| 1 | 0.37 | 0.731 | 1 |
| 2 | 0.14 | 0.881 | 1 |
| 4 | 0.018 | 0.982 | 1 |
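The table can be reproduced with a few lines of NumPy (a quick sketch, not part of any library):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Reproduce the table rows above
for z in [-4, -2, -1, 0, 1, 2, 4]:
    p = sigmoid(z)
    label = "boundary" if z == 0 else int(p >= 0.5)
    print(f"z={z:+d}  e^-z={np.exp(-z):7.3f}  sigma={p:.3f}  class={label}")
```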
Why Not MSE? The Need for Log Loss
Plugging sigmoid into the MSE formula (σ(z) − y)² creates a non-convex cost surface with many local minima — gradient descent gets stuck. We need a cost that:
- Is convex in the parameters (one global minimum)
- Heavily penalizes confident wrong predictions
- Has clean derivatives that pair with sigmoid
Binary Cross-Entropy (Log Loss)
For a single example with prediction p = σ(z) and true label y ∈ {0, 1}:

L(y, p) = −[y·log(p) + (1 − y)·log(1 − p)]

Read it as two cases stitched together:
- If y = 1: loss = −log(p), which penalizes small p
- If y = 0: loss = −log(1 − p), which penalizes large p
Worked examples — the same prediction, different labels
| True y | Predicted p | Loss | Interpretation |
|---|---|---|---|
| 1 | 0.99 | 0.010 | great — high confidence, correct |
| 1 | 0.7 | 0.357 | good |
| 1 | 0.5 | 0.693 | uncertain |
| 1 | 0.2 | 1.609 | wrong, fairly confident — painful |
| 1 | 0.01 | 4.605 | very confident, totally wrong — disaster |
| 0 | 0.01 | 0.010 | great — high confidence, correct |
| 0 | 0.5 | 0.693 | uncertain |
| 0 | 0.99 | 4.605 | very confident, totally wrong — disaster |
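The loss values in the table follow directly from the two-case formula (natural log throughout), which a short sketch can verify:

```python
import numpy as np

def bce(y, p):
    """Binary cross-entropy for a single example, y in {0, 1}."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(round(bce(1, 0.99), 3))  # 0.01  — confident and correct
print(round(bce(1, 0.5), 3))   # 0.693 — maximally uncertain
print(round(bce(1, 0.01), 3))  # 4.605 — confident and wrong
```

Note the symmetry: predicting 0.01 for a true 1 costs exactly as much as predicting 0.99 for a true 0.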
Total cost across the dataset
J(w, b) = −(1/m) Σᵢ [yᵢ·log(pᵢ) + (1 − yᵢ)·log(1 − pᵢ)]

Its gradient is ∂J/∂w = (1/m) Σᵢ (pᵢ − yᵢ) xᵢ. The error term (pᵢ − yᵢ) drives the update, exactly like MSE on linear regression.

The Decision Boundary
We classify as 1 when p ≥ 0.5, which happens exactly when z ≥ 0. So the boundary is the set of points where w·x + b = 0.
It's linear — a straight line in 2D, a plane in 3D, a hyperplane in higher dimensions. The model can only carve space into two half-spaces with a flat cut.
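A tiny sketch with made-up weights (w and b here are illustrative, not fitted) shows that every point on the line w·x + b = 0 sits exactly at p = 0.5:

```python
import numpy as np

# Hypothetical parameters for a 2-feature model
w = np.array([1.5, -2.0])
b = 0.5

# Solve w[0]*x1 + w[1]*x2 + b = 0 for x2: the boundary line
x1 = np.linspace(-3, 3, 5)
x2 = -(w[0] * x1 + b) / w[1]

# Every point on this line has z = 0, hence p = sigmoid(0) = 0.5
z = w[0] * x1 + w[1] * x2 + b
print(np.allclose(z, 0))  # True
```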
Case A — Well-separated classes
When classes are far apart, the boundary fits cleanly between them. The probability ramps sharply from blue (class 0) to purple (class 1) across a thin transition zone.
Case B — Overlapping classes
When the classes mix in the middle, the same algorithm still finds the best linear cut, but accepts that some points will be misclassified. The probability transition is gentler — many points sit near p ≈ 0.5.
Case C — Non-linearly separable data
Logistic regression cannot draw a curved boundary. For data shaped like concentric rings or an XOR pattern, no straight line works. The fix is to engineer non-linear features:
# Add polynomial features so the linear model can carve curves
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
model = make_pipeline(
PolynomialFeatures(degree=2, include_bias=False),
LogisticRegression(C=1.0)
)
model.fit(X, y)

With degree-2 features (x₁², x₂², x₁x₂), the boundary becomes an ellipse, parabola, or hyperbola in the original space, yet the model is still linear in its parameters.
The 0.5 Threshold Is Not Sacred
The default threshold of 0.5 assumes a false positive costs the same as a false negative. In real applications that assumption almost never holds.
| Threshold | Precision | Recall | Use case |
|---|---|---|---|
| 0.1 | low | very high | Cancer screening — never miss a positive |
| 0.3 | moderate | high | Fraud detection — review more cases |
| 0.5 | balanced | balanced | General-purpose default |
| 0.7 | high | moderate | Spam filter — avoid blocking real mail |
| 0.9 | very high | low | High-stakes account suspension |
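Moving the threshold is a one-line change on top of `predict_proba` (the probabilities below are made up for illustration):

```python
import numpy as np

# Hypothetical outputs of clf.predict_proba(X)[:, 1]
probs = np.array([0.05, 0.2, 0.45, 0.6, 0.95])

default = (probs >= 0.5).astype(int)    # general-purpose default
screening = (probs >= 0.1).astype(int)  # recall-oriented (cancer screening)
strict = (probs >= 0.9).astype(int)     # precision-oriented (account suspension)

print(default)    # [0 0 0 1 1]
print(screening)  # [0 1 1 1 1]
print(strict)     # [0 0 0 0 1]
```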
Expert Problems & Edge Cases
Problem 1 — Class imbalance
With 99% negatives and 1% positives, the model can hit 99% accuracy by predicting "negative" for everything. Log loss happily allows this — it minimizes total loss, and most loss comes from negatives.
Fixes:
- class_weight="balanced" in sklearn — weights inversely proportional to class frequencies
- Resampling — undersample majority or oversample minority (SMOTE)
- Lower threshold — predict positive at p > 0.05 instead of 0.5
- Use better metrics — F1, AUPRC, balanced accuracy, not raw accuracy
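The first fix is a single keyword argument. A minimal sketch on synthetic 99:1 data (the dataset here is generated just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: ~99% negatives, ~1% positives
X, y = make_classification(n_samples=2000, weights=[0.99, 0.01], random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# The balanced model flags far more positives: higher recall, lower precision
print("plain positives:   ", plain.predict(X).sum())
print("balanced positives:", balanced.predict(X).sum())
```

`class_weight="balanced"` reweights each example by the inverse frequency of its class, so the rare positives contribute as much total loss as the abundant negatives.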
Problem 2 — Perfect separation
If a single feature perfectly separates the classes (e.g. age < 18 → class 0, age ≥ 18 → class 1), unregularized logistic regression has no finite optimum: making the weight bigger always reduces log loss, so the optimizer pushes w → ∞.
Diagnose: coefficients keep growing; convergence warnings; standard errors explode. Fix: always include L2 regularization (sklearn's default C=1.0 already applies it).
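A small sketch of the effect (the age data is made up): with default regularization the weight stays modest, while a very large C lets it blow up toward the unregularized, unbounded optimum.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One feature that perfectly separates the classes
age = np.array([[5.0], [10.0], [15.0], [20.0], [25.0], [30.0]])
y = (age.ravel() >= 18).astype(int)

# Default L2 regularization (C=1.0) keeps the weight finite and modest
clf = LogisticRegression(C=1.0).fit(age, y)
print("C=1.0 coefficient:", clf.coef_[0][0])

# Weak regularization (large C) lets the weight grow much larger
loose = LogisticRegression(C=1e6, max_iter=10000).fit(age, y)
print("C=1e6 coefficient:", loose.coef_[0][0])
```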
Problem 3 — Multicollinearity & unstable coefficients
Just like linear regression, correlated predictors give unstable, hard-to-interpret weights. Even worse here: a positive coefficient may flip negative when you add or remove a feature.
Fixes: drop redundant features, use Ridge (L2) penalty for stability, or Lasso (L1) for automatic feature selection.
Problem 4 — Probability calibration
Logistic regression is usually well-calibrated out of the box: when it says 70%, roughly 70% of those cases really are positive. But after heavy regularization, class weighting, or resampling, calibration breaks. Predicted "0.9" might really mean 60%.
Diagnose: reliability diagram (predicted vs actual frequency in bins). Fix: Platt scaling or isotonic regression on a held-out set (CalibratedClassifierCV).
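A minimal sketch of the fix with `CalibratedClassifierCV`, using synthetic data for illustration; `method="sigmoid"` is Platt scaling, `method="isotonic"` is isotonic regression:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)

# Cross-validated calibration: fits the base model and the calibrator
# on different folds, so the calibrator never sees its own training data
calibrated = CalibratedClassifierCV(
    LogisticRegression(max_iter=1000), method="sigmoid", cv=5
)
calibrated.fit(X, y)
print(calibrated.predict_proba(X[:3]))
```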
Problem 5 — Multi-class extension
For K > 2 classes, replace sigmoid with softmax:

softmax(z)ₖ = exp(zₖ) / Σⱼ exp(zⱼ)
Each class gets its own weight vector. The cost generalizes to categorical cross-entropy. Equivalently, sklearn offers one-vs-rest — train K binary classifiers and pick the one with the highest score.
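A short sketch of softmax itself; with K = 2 it reduces to the sigmoid, which is why it's the natural generalization:

```python
import numpy as np

def softmax(z):
    """Exponentiate each class score, then normalize to a distribution."""
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # one z_k per class
p = softmax(scores)
print(p)          # three probabilities, highest for the highest score
print(p.sum())    # 1.0
```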
Training: Gradient Descent
1. Initialize w = 0, bias b = 0.
2. Forward pass: z = Xw + b, then p = σ(z) for every example.
3. Gradients: ∂J/∂w = (1/m) Xᵀ(p − y), ∂J/∂b = mean(p − y).
4. Update: w := w − α · ∂J/∂w (and likewise for b); repeat until log loss stops decreasing.

Python Implementation
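The gradient-descent loop can be sketched from scratch in NumPy before reaching for sklearn (an educational sketch, not production code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Batch gradient descent on binary cross-entropy."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)      # forward pass
        grad_w = X.T @ (p - y) / m  # dJ/dw
        grad_b = (p - y).mean()     # dJ/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny sanity check: 1-D data separable around x = 0
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
w, b = fit_logistic(X, y)
print(sigmoid(X @ w + b).round(2))  # low probs for class 0, high for class 1
```

sklearn's `LogisticRegression` performs the same minimization with faster second-order solvers and regularization built in.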
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
# ALWAYS scale features for logistic regression
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000)
clf.fit(scaler.transform(X_tr), y_tr)
print("Accuracy:", clf.score(scaler.transform(X_te), y_te))
print("Probabilities:", clf.predict_proba(scaler.transform(X_te[:3])))
print("Coefficients:", clf.coef_[0][:5])

Use Cases
- Spam vs not-spam email classification
- Disease diagnosis — and you need calibrated probabilities, not just labels
- Customer churn prediction with interpretable feature effects
- Credit risk scoring (where regulators demand interpretable models)
- Click-through rate prediction at massive scale (works on billions of rows)