Supervised Learning
Random Forest
An ensemble of decision trees that vote together — more accurate and robust than a single tree.
The Idea: Bagging
Random Forest builds many decision trees on bootstrapped samples of the training data (sampling with replacement) and combines their predictions:
- Classification → majority vote
- Regression → average
Each tree also considers only a random subset of features at each split, which decorrelates the trees and reduces variance.
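To make the recipe concrete, here is a minimal from-scratch sketch of the bagging-plus-feature-subsetting loop. The helper bagged_predict is hypothetical (not a library function) and assumes NumPy arrays with integer class labels:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, n_trees=25, seed=0):
    # Hypothetical helper: train n_trees trees on bootstrap samples
    # and return the majority-vote class for each test point.
    rng = np.random.default_rng(seed)
    n = len(X_train)
    all_votes = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # bootstrap: sample WITH replacement
        tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subset per split
        tree.fit(X_train[idx], y_train[idx])
        all_votes.append(tree.predict(X_test))
    votes = np.stack(all_votes)  # shape: (n_trees, n_test)
    # Majority vote down each column (assumes labels are 0, 1, 2, ...)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)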
Why It Works
Individual trees overfit and have high variance. Averaging many decorrelated, high-variance models produces a stable, low-variance ensemble.
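A quick way to see this empirically (exact numbers will vary with dataset and seed) is to cross-validate one unpruned tree against a forest of 100 such trees and compare the spread of the fold scores:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A single unpruned tree: flexible but high-variance
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
# An ensemble of 100 such trees: same flexibility, lower variance
forest_scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5)

print(f"tree:   {tree_scores.mean():.3f} ± {tree_scores.std():.3f}")
print(f"forest: {forest_scores.mean():.3f} ± {forest_scores.std():.3f}")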
Python Implementation
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# Hold out a test set; scoring on the training data would overstate accuracy
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)

print("Test accuracy:", rf.score(X_test, y_test))
print("First 5 feature importances:", rf.feature_importances_[:5])

Key Hyperparameters
- n_estimators — number of trees (more is better, with diminishing returns)
- max_depth — depth of each tree
- max_features — features considered at each split (a common default for classification is the square root of the total number of features)
- min_samples_leaf — controls leaf size
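One way to tune these jointly is a small cross-validated grid search; the grid below is illustrative, not a recommendation:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Small illustrative grid over the hyperparameters above;
# a real search would use wider ranges
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
    "max_features": ["sqrt", 0.5],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print("Best params:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))

Since more trees only help (at the cost of compute), n_estimators is often fixed at a comfortable budget rather than searched, leaving the grid for the remaining parameters.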
A reliable workhorse
Random Forest is often a great first model: it needs little tuning, works without feature scaling, is robust to outliers and noisy features, and provides feature importance scores out of the box.
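For instance, the importance scores can be ranked and paired with feature names (a sketch on the same breast-cancer dataset; impurity-based importances can favor high-cardinality features, so read them as a rough guide):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(data.data, data.target)

# Rank features by impurity-based importance, highest first
order = np.argsort(rf.feature_importances_)[::-1]
for i in order[:5]:
    print(f"{data.feature_names[i]:<25} {rf.feature_importances_[i]:.3f}")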