Unsupervised Learning
Principal Component Analysis (PCA)
A dimensionality reduction technique that finds the directions of maximum variance in your data.
The Goal
PCA transforms correlated features into a smaller set of uncorrelated components that capture as much of the original variance as possible. Useful for visualization, noise reduction, and speeding up downstream models.
How It Works
- Standardize the data (zero mean, unit variance)
- Compute the covariance matrix
- Find eigenvectors and eigenvalues — the principal components
- Project data onto the top K components
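The four steps above can be sketched directly with NumPy's eigendecomposition (a minimal illustration on a small synthetic matrix; in practice you would use scikit-learn's PCA, shown below):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # hypothetical data: 100 samples, 4 features

# 1. Standardize: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data
cov = np.cov(X_std, rowvar=False)

# 3. Eigendecomposition; eigh is the right choice for symmetric matrices
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort components by descending eigenvalue (explained variance)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the top K components
K = 2
X_proj = X_std @ eigvecs[:, :K]
print(X_proj.shape)
```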
Python Implementation
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

# Load and standardize so each feature has zero mean and unit variance
X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Project onto the top 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Fraction of variance captured by each component
print("Explained variance:", pca.explained_variance_ratio_)
print("Total:", sum(pca.explained_variance_ratio_))
Choosing the Number of Components
Plot cumulative explained variance and choose the smallest K that preserves, say, 95% of the variance.
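A quick sketch of this on the iris data: compute the cumulative explained variance, and note that scikit-learn's PCA also accepts a float n_components in (0, 1) to keep just enough components for that variance fraction.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit with all components, then inspect cumulative explained variance
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print("Cumulative variance:", cumulative)

# Shortcut: a float n_components keeps enough components to
# preserve that fraction of variance
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_scaled)
print("Components kept:", pca_95.n_components_)
```

In a notebook you would plot `cumulative` against the component index and look for the point where the curve flattens.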
Important Caveat
PCA components are linear combinations of original features and lose direct interpretability. For non-linear structure, consider t-SNE or UMAP.
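To see the linear-combination point concretely, you can inspect `pca.components_`, whose rows hold the weights (loadings) mixing the original features into each component (a small illustration on the iris data):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_iris()
X_scaled = StandardScaler().fit_transform(data.data)
pca = PCA(n_components=2).fit(X_scaled)

# Each row of components_ is one principal component expressed as
# weights on the original (standardized) features
for i, weights in enumerate(pca.components_):
    terms = " + ".join(f"{w:+.2f}*{name}"
                       for w, name in zip(weights, data.feature_names))
    print(f"PC{i + 1} = {terms}")
```

Each component is a weighted blend of all four measurements, which is why no single component maps back to one original feature.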