PCA, visualized

Principal Component Analysis finds the directions along which your data varies the most — then lets you project onto a smaller subset of them, keeping signal and discarding redundancy. Watch the PC arrows appear on a 3D cloud on the left, and the projection (with the discarded dimensions collapsed to zero) on the right. Compare across six datasets to see where PCA shines and where it fails.

[Interactive demo. Left panel: original data, PCs in red/green/blue. Right panel: projected data in PC coordinates. Scree plot: explained variance per principal component, dashed = cumulative. Controls: Dataset, Target dimensions (2), Standardize. Default dataset: points along a diagonal plus noise, where PC1 captures most of the variance (the canonical PCA case). Tip: drag to rotate either 3D panel and compare the variance directions to the data shape.]

The algorithm

pca.py
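
The editor listing itself is not reproduced in this text. As a stand-in, here is a minimal sketch that uses the names the exercises further down refer to (standardize, X_std, C, eig_vecs, dims, projected). Everything else is an assumption: the function layout, the synthetic "diagonal + noise" data, the use of ddof=1 in the standard deviation, and the choice of np.linalg.eigh (the exercises mention np.linalg.eig, but eigh is the symmetric-matrix routine and returns real, ordered eigenvalues).

```python
import numpy as np


def standardize(X):
    """Center each column and scale it to unit standard deviation (z-score)."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def pca(X, dims=2):
    """Return projected data, eigenvalues, and eigenvectors (as columns)."""
    X_std = standardize(X)
    n_samples = X_std.shape[0]
    C = X_std.T @ X_std / (n_samples - 1)    # covariance of the standardized data
    eig_vals, eig_vecs = np.linalg.eigh(C)   # symmetric input, ascending eigenvalues
    order = np.argsort(eig_vals)[::-1]       # re-sort to descending eigenvalue
    eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
    projected = X_std @ eig_vecs[:, :dims]   # coordinates in the top `dims` PCs
    return projected, eig_vals, eig_vecs


# Hypothetical stand-in for the demo's "diagonal + noise" dataset.
rng = np.random.default_rng(0)
t = rng.normal(size=(300, 1))
X = t @ np.ones((1, 3)) + 0.1 * rng.normal(size=(300, 3))

projected, eig_vals, eig_vecs = pca(X, dims=2)
explained = eig_vals / eig_vals.sum()        # explained variance ratio per PC
print(projected.shape, explained.round(3))
```

On this synthetic data nearly all of the variance should land on PC1, which is the behavior the default dataset is meant to show.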

The math, derived

1. The goal — max-variance directions.

Given centered data $X \in \mathbb{R}^{N \times d}$ (each column zero-mean — this is why preprocessing always subtracts the mean), find a unit vector $u \in \mathbb{R}^d$ that maximizes the variance of the projections $X u$:

$$ \mathrm{Var}(Xu) \;=\; \frac{1}{N - 1}\, u^{\top} X^{\top} X\, u \;=\; u^{\top} S\, u $$

where $S = \frac{1}{N-1} X^{\top} X$ is the covariance matrix. (If you also divide by std before this step, $S$ becomes the correlation matrix — what we compute when Standardize is on.)
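
The identity above can be checked numerically. The sketch below assumes synthetic Gaussian data and an arbitrary random unit vector; the variable names are illustrative only. It compares the sample variance of the projections with the quadratic form, and confirms that dividing by the standard deviation first turns $S$ into the correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.2])

X = X_raw - X_raw.mean(axis=0)        # center: every column now has zero mean
N = X.shape[0]
S = X.T @ X / (N - 1)                 # covariance matrix

u = rng.normal(size=3)
u /= np.linalg.norm(u)                # arbitrary unit direction

var_direct = np.var(X @ u, ddof=1)    # sample variance of the projections
var_quadratic = u @ S @ u             # u^T S u

Z = X / X.std(axis=0, ddof=1)         # z-scored data
R = Z.T @ Z / (N - 1)                 # covariance of z-scores = correlation matrix

print(var_direct, var_quadratic)
```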

2. The constraint — unit length.

Without a length constraint, you can make $u^{\top} S u$ arbitrarily large by scaling. We only care about direction, so we constrain $u^{\top} u = 1$:

$$ \max_{u} \; u^{\top} S\, u \quad \text{subject to} \quad u^{\top} u = 1 $$

3. Lagrangian.

Use a Lagrange multiplier $\lambda$ to fold the constraint into the objective:

$$ \mathcal{L}(u, \lambda) \;=\; u^{\top} S\, u \,-\, \lambda \,(\, u^{\top} u - 1 \,) $$

4. Take the gradient, set to zero.

Differentiate w.r.t. $u$ and equate to zero:

$$ \nabla_{u} \mathcal{L} \;=\; 2 S u \,-\, 2 \lambda u \;=\; 0 $$

$$ S\, u \;=\; \lambda\, u $$

That is exactly the eigenvalue equation for $S$. Every critical point of the constrained problem is an eigenvector of the covariance matrix.
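
A quick numerical check of the stationarity condition, assuming synthetic correlated data (the mixing matrix is made up for illustration). It verifies that each eigenpair returned by np.linalg.eigh zeroes the gradient $2Su - 2\lambda u$, and that random unit directions never produce a variance outside the range spanned by the smallest and largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)
mix = np.array([[2.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.4]])      # hypothetical mixing matrix
X = rng.normal(size=(200, 3)) @ mix
X = X - X.mean(axis=0)
S = X.T @ X / (X.shape[0] - 1)

# eigh returns ascending eigenvalues and unit eigenvectors as columns.
eig_vals, eig_vecs = np.linalg.eigh(S)

# Gradient 2 S u - 2 lambda u, evaluated for all eigenpairs at once.
gradients = 2 * S @ eig_vecs - 2 * eig_vecs * eig_vals
max_gradient = np.abs(gradients).max()

# u^T S u for 1000 random unit vectors (one per row of U).
U = rng.normal(size=(1000, 3))
U /= np.linalg.norm(U, axis=1, keepdims=True)
variances = np.einsum("ij,jk,ik->i", U, S, U)

print(max_gradient, variances.min(), variances.max(), eig_vals)
```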

5. Which eigenvector?

Substitute $Su = \lambda u$ back into the objective: $u^{\top} S u = \lambda \, u^{\top} u = \lambda$. So the variance along eigenvector $u$ equals its eigenvalue $\lambda$. The principal components are the eigenvectors of $S$ sorted by descending eigenvalue — the eigenvalue tells you how much variance that direction carries.
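
A small worked example may help here. The matrix is chosen for illustration and is not taken from the demo. Suppose

$$ S \;=\; \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} $$

Its unit eigenvectors and eigenvalues are

$$ u_1 = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \;\; S u_1 = 3\, u_1, \qquad u_2 = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \;\; S u_2 = 1\, u_2 $$

so $\lambda_1 = 3$ and $\lambda_2 = 1$. PC1 is the diagonal direction and carries $\lambda_1 / (\lambda_1 + \lambda_2) = 3/4$ of the total variance; PC2 carries the remaining $1/4$.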

Try this

The 95% threshold

On the plane dataset, look at the scree plot. PC1 + PC2 should hit ~95% explained variance — the third dimension is mostly noise. This is the visual intuition behind n_components=0.95.
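
To make the threshold rule concrete, here is a sketch on a hypothetical plane-like dataset (two strong in-plane directions plus weaker off-plane noise; the demo's actual generator is not shown in this text, so the numbers are assumptions). It computes the explained variance ratios, their cumulative sum, and the smallest number of components that reaches 95%.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# Hypothetical "plane" data: 2D coordinates embedded in 3D, plus isotropic noise.
coords = rng.normal(size=(n, 2)) * np.array([3.0, 2.0])
basis = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])
X = coords @ basis + 0.8 * rng.normal(size=(n, 3))
X = X - X.mean(axis=0)

S = X.T @ X / (n - 1)
eig_vals = np.linalg.eigvalsh(S)[::-1]     # eigenvalues, descending
ratio = eig_vals / eig_vals.sum()          # explained variance ratio per PC
cumulative = np.cumsum(ratio)              # the dashed line on a scree plot

# Smallest k whose cumulative explained variance reaches the threshold.
k = int(np.searchsorted(cumulative, 0.95)) + 1

print(ratio.round(3), cumulative.round(3), k)
```

With these settings the first component alone falls short of the threshold and the first two together clear it, so the rule selects two components.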

When PCA gives up

Switch to the sphere dataset. All three eigenvalues should be roughly equal, because there is no preferred direction. PCA still returns components, but their directions are arbitrary and determined by sampling noise, so they carry no meaning. This is what isotropic variance looks like.
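
The isotropic case can be reproduced with a few lines. This sketch assumes points drawn uniformly on the unit sphere (by normalizing Gaussian samples); the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Normalizing Gaussian samples gives points uniformly distributed on the sphere.
X = rng.normal(size=(2000, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
X -= X.mean(axis=0)

S = X.T @ X / (X.shape[0] - 1)
eig_vals = np.linalg.eigvalsh(S)[::-1]     # descending
ratio = eig_vals / eig_vals.sum()

print(ratio.round(3))                      # each ratio should sit near 1/3
```

Because no component stands out, dropping any one of them discards about a third of the variance.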

The nonlinearity wall

The swiss roll is a 2D manifold embedded in 3D. Linear PCA flattens it — the projection loses the curved structure. This motivates kernel PCA, t-SNE, or UMAP for nonlinear data.

SVD instead of eig

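Replace np.linalg.eig(C) with np.linalg.svd(X_std, full_matrices=False). The rows of the returned Vt are the principal directions, and the eigenvalues are recovered from the singular values as s**2 / (N - 1). SVD works on X_std directly and never forms the covariance matrix, which makes it more numerically stable when that matrix is nearly singular. Most of the solvers in sklearn.decomposition.PCA are also SVD-based.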
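
A sketch of the swap on synthetic data (the mixing matrix is hypothetical). It checks that the two routes agree: squared singular values divided by N - 1 match the covariance eigenvalues, the rows of Vt match the eigenvectors up to sign, and U * s gives the projected coordinates directly. np.linalg.eigh is used on the covariance side because the matrix is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
mix = np.array([[2.0, 1.0, 0.0],
                [0.0, 1.0, 1.0],
                [0.0, 0.0, 0.5]])          # hypothetical mixing matrix
X = rng.normal(size=(300, 3)) @ mix
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
N = X_std.shape[0]

# Route 1: eigendecomposition of the covariance matrix, sorted descending.
C = X_std.T @ X_std / (N - 1)
eig_vals, eig_vecs = np.linalg.eigh(C)
eig_vals, eig_vecs = eig_vals[::-1], eig_vecs[:, ::-1]

# Route 2: SVD of the data matrix; singular values already come descending.
U, s, Vt = np.linalg.svd(X_std, full_matrices=False)
svd_vals = s**2 / (N - 1)

# Absolute cosine between matching directions (sign is arbitrary in both routes).
alignment = np.abs(np.sum(Vt * eig_vecs.T, axis=1))

projected_svd = U * s                      # same as X_std @ Vt.T

print(eig_vals.round(4), svd_vals.round(4), alignment.round(6))
```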

Scale matters

In standardize(), drop the division by X.std(axis=0) so that the function only centers the data (subtracts the mean). On the anisotropic dataset, watch how the PCs change: without scaling, PCA is at the mercy of feature units.
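
The effect of units can be isolated with two features that carry the same signal but differ in scale by a factor of 1000. The data and the helper first_pc are made up for illustration. With centering only, PC1 points almost entirely along the large-unit feature; after z-scoring, it weights both features equally.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)

# Two noisy copies of the same signal, one expressed in much larger units.
small = signal + 0.5 * rng.normal(size=n)
large = 1000.0 * (signal + 0.5 * rng.normal(size=n))
X = np.column_stack([small, large])


def first_pc(data):
    """Unit eigenvector of the covariance matrix with the largest eigenvalue."""
    vals, vecs = np.linalg.eigh(np.cov(data, rowvar=False))
    return vecs[:, -1]                     # eigh sorts ascending, so take the last


centered = X - X.mean(axis=0)
zscored = centered / centered.std(axis=0, ddof=1)

pc_centered = first_pc(centered)           # dominated by the large-unit column
pc_zscored = first_pc(zscored)             # equal weight on both columns

print(np.abs(pc_centered).round(4), np.abs(pc_zscored).round(4))
```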

Reconstruction error

Add reconstructed = projected @ eig_vecs[:, :dims].T in your code and print np.linalg.norm(X_std - reconstructed). That’s the information you discarded. Compare at dims = 1, 2, 3 across datasets.
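
A sketch of this exercise on synthetic data, reusing the names X_std, eig_vecs, dims, projected, and reconstructed from the text (the data and mixing matrix are hypothetical). It also checks a useful identity: the squared reconstruction error equals N - 1 times the sum of the discarded eigenvalues, so the error can be read off the scree plot.

```python
import numpy as np

rng = np.random.default_rng(0)
mix = np.array([[2.0, 1.0, 0.0],
                [0.0, 1.0, 1.0],
                [0.0, 0.0, 0.5]])          # hypothetical mixing matrix
X = rng.normal(size=(300, 3)) @ mix
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
N = X_std.shape[0]

C = X_std.T @ X_std / (N - 1)
eig_vals, eig_vecs = np.linalg.eigh(C)
eig_vals, eig_vecs = eig_vals[::-1], eig_vecs[:, ::-1]   # descending

errors, predicted = [], []
for dims in (1, 2, 3):
    projected = X_std @ eig_vecs[:, :dims]
    reconstructed = projected @ eig_vecs[:, :dims].T
    errors.append(np.linalg.norm(X_std - reconstructed))
    # Error implied by the eigenvalues of the components that were dropped.
    predicted.append(np.sqrt((N - 1) * eig_vals[dims:].sum()))

print(np.round(errors, 4), np.round(predicted, 4))
```

At dims = 3 nothing is discarded, so the error drops to numerical zero.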

In one glance

⚠️ Watch out: Skip standardization · Isotropic variance · Nonlinear structure · Use eigvecs for non-PCA model · No scree plot → arbitrary k
🔧 In practice: sklearn.decomposition.PCA · StandardScaler · np.linalg.svd · explained_variance_ratio_ · n_components=0.95

Frequently asked

What is PCA used for?
Three things: (1) dimensionality reduction: compress $d$-feature data to $k$ features that retain most of the variance; (2) visualization: project high-dimensional data to 2D or 3D for plotting; (3) decorrelation: the principal components are orthogonal, so the projected features are uncorrelated, which is useful for downstream models that assume independent features (uncorrelated is weaker than independent, but it often helps).

When does PCA fail?
When variance is isotropic (a sphere has no preferred direction) or when the structure is nonlinear (a swiss roll's underlying 2D manifold gets flattened by linear projection). Try the sphere and swissRoll datasets: the scree plot is flat on the first, and the projection loses the structure on the second.

Should I standardize before PCA?
Almost always yes. PCA is sensitive to feature scale: a column ranging 0–10000 will dominate PC1 over a column ranging 0–1 even if both carry similar information. Subtract the mean and divide by the std (z-score) before fitting.

How does PCA compare to t-SNE and UMAP?
PCA is linear, fast, interpretable, and good for global structure and decorrelation. t-SNE and UMAP are nonlinear, slower, and better at preserving local neighborhoods, but they distort global distances and shouldn't be used for downstream modeling. Use PCA when you want a meaningful feature space; use t-SNE/UMAP only for visualization.

How many components should I keep?
Two heuristics: (1) plot the cumulative explained variance and pick the smallest $k$ that gets you above your threshold (e.g., 95%); (2) look for an elbow in the scree plot, meaning a sharp drop in eigenvalues followed by a plateau. In sklearn, PCA(n_components=0.95) picks the smallest $k$ retaining 95% variance for you.
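
The decorrelation use mentioned in this FAQ can be checked numerically. The sketch below assumes synthetic correlated data (the mixing matrix is hypothetical): after projecting onto all eigenvectors, the covariance matrix of the new features is diagonal, with the eigenvalues on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
mix = np.array([[2.0, 1.0, 0.0],
                [0.0, 1.0, 1.0],
                [0.0, 0.0, 0.5]])          # hypothetical mixing matrix
X = rng.normal(size=(400, 3)) @ mix
X = X - X.mean(axis=0)

S = np.cov(X, rowvar=False)                # original features are correlated
eig_vals, eig_vecs = np.linalg.eigh(S)
eig_vals, eig_vecs = eig_vals[::-1], eig_vecs[:, ::-1]   # descending

projected = X @ eig_vecs                   # data in PC coordinates
cov_projected = np.cov(projected, rowvar=False)

# Largest off-diagonal entry before and after the change of basis.
off_before = np.abs(S - np.diag(np.diag(S))).max()
off_after = np.abs(cov_projected - np.diag(np.diag(cov_projected))).max()

print(off_before, off_after)
```

Zero covariance means the new features are uncorrelated, which does not by itself make them statistically independent.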