Supervised Learning
Random Forest
An ensemble of decision trees that vote together — more accurate and robust than a single tree.
The Idea: Bagging
Random Forest builds many decision trees on bootstrapped samples of the training data (sampling with replacement) and combines their predictions:
- Classification → majority vote
- Regression → average
Each tree also considers only a random subset of features at each split, which decorrelates the trees and reduces variance.
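To make the recipe concrete, here is a minimal from-scratch sketch of the bagging-plus-feature-subsetting loop. The helper bagged_predict is hypothetical (not a library function) and assumes NumPy arrays with integer class labels:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, n_trees=25, seed=0):
    # Hypothetical helper: train n_trees trees on bootstrap samples
    # and return the majority-vote class for each test point.
    rng = np.random.default_rng(seed)
    n = len(X_train)
    all_votes = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # bootstrap: sample WITH replacement
        tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subset per split
        tree.fit(X_train[idx], y_train[idx])
        all_votes.append(tree.predict(X_test))
    votes = np.stack(all_votes)  # shape: (n_trees, n_test)
    # Majority vote down each column (assumes labels are 0, 1, 2, ...)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)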
Why It Works
Individual trees overfit and have high variance. Averaging many decorrelated, high-variance models produces a stable, low-variance ensemble.
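A quick way to see this empirically (exact numbers will vary with dataset and seed) is to cross-validate one unpruned tree against a forest of 100 such trees and compare the spread of the fold scores:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A single unpruned tree: flexible but high-variance
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
# An ensemble of 100 such trees: same flexibility, lower variance
forest_scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5)

print(f"tree:   {tree_scores.mean():.3f} ± {tree_scores.std():.3f}")
print(f"forest: {forest_scores.mean():.3f} ± {forest_scores.std():.3f}")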
Python Implementation
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# Hold out a test set; scoring on the training data would overstate accuracy
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)

print("Test accuracy:", rf.score(X_test, y_test))
print("First 5 feature importances:", rf.feature_importances_[:5])

Key Hyperparameters
- n_estimators — number of trees (more is better, with diminishing returns)
- max_depth — depth of each tree
- max_features — features considered at each split (a common default for classification is the square root of the total number of features)
- min_samples_leaf — controls leaf size
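One way to tune these jointly is a small cross-validated grid search; the grid below is illustrative, not a recommendation:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Small illustrative grid over the hyperparameters above;
# a real search would use wider ranges
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
    "max_features": ["sqrt", 0.5],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print("Best params:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))

Since more trees only help (at the cost of compute), n_estimators is often fixed at a comfortable budget rather than searched, leaving the grid for the remaining parameters.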
A reliable workhorse
Random Forest is often a great first model: it needs little tuning, works without feature scaling, is robust to outliers and noisy features, and provides feature importance scores out of the box.
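For instance, the importance scores can be ranked and paired with feature names (a sketch on the same breast-cancer dataset; impurity-based importances can favor high-cardinality features, so read them as a rough guide):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(data.data, data.target)

# Rank features by impurity-based importance, highest first
order = np.argsort(rf.feature_importances_)[::-1]
for i in order[:5]:
    print(f"{data.feature_names[i]:<25} {rf.feature_importances_[i]:.3f}")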