Unsupervised Learning
Principal Component Analysis (PCA)
A dimensionality reduction technique that finds the directions of maximum variance in your data.
The Goal
PCA transforms correlated features into a smaller set of uncorrelated components that capture as much of the original variance as possible. Useful for visualization, noise reduction, and speeding up downstream models.
How It Works
- Standardize the data (zero mean, unit variance)
- Compute the covariance matrix
- Find eigenvectors and eigenvalues — the principal components
- Project data onto the top K components
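The four steps above can be sketched directly with NumPy's eigendecomposition (a minimal illustration on a small synthetic matrix; in practice you would use scikit-learn's PCA, shown below):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # hypothetical data: 100 samples, 4 features

# 1. Standardize: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data
cov = np.cov(X_std, rowvar=False)

# 3. Eigendecomposition; eigh is the right choice for symmetric matrices
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort components by descending eigenvalue (explained variance)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the top K components
K = 2
X_proj = X_std @ eigvecs[:, :K]
print(X_proj.shape)
```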
Python Implementation
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

# Load and standardize so each feature has zero mean and unit variance
X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Project onto the top 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Fraction of variance captured by each component
print("Explained variance:", pca.explained_variance_ratio_)
print("Total:", sum(pca.explained_variance_ratio_))
Choosing the Number of Components
Plot cumulative explained variance and choose the smallest K that preserves, say, 95% of the variance.
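A quick sketch of this on the iris data: compute the cumulative explained variance, and note that scikit-learn's PCA also accepts a float n_components in (0, 1) to keep just enough components for that variance fraction.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit with all components, then inspect cumulative explained variance
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print("Cumulative variance:", cumulative)

# Shortcut: a float n_components keeps enough components to
# preserve that fraction of variance
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_scaled)
print("Components kept:", pca_95.n_components_)
```

In a notebook you would plot `cumulative` against the component index and look for the point where the curve flattens.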
Important Caveat
PCA components are linear combinations of original features and lose direct interpretability. For non-linear structure, consider t-SNE or UMAP.
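To see the linear-combination point concretely, you can inspect `pca.components_`, whose rows hold the weights (loadings) mixing the original features into each component (a small illustration on the iris data):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_iris()
X_scaled = StandardScaler().fit_transform(data.data)
pca = PCA(n_components=2).fit(X_scaled)

# Each row of components_ is one principal component expressed as
# weights on the original (standardized) features
for i, weights in enumerate(pca.components_):
    terms = " + ".join(f"{w:+.2f}*{name}"
                       for w, name in zip(weights, data.feature_names))
    print(f"PC{i + 1} = {terms}")
```

Each component is a weighted blend of all four measurements, which is why no single component maps back to one original feature.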