PCA, visualized
Principal Component Analysis finds the directions along which your data varies the most — then lets you project onto a smaller subset of them, keeping signal and discarding redundancy. Watch the PC arrows appear on a 3D cloud on the left, and the projection (with the discarded dimensions collapsed to zero) on the right. Compare across six datasets to see where PCA shines and where it fails.
The algorithm (edit and re-run)
The math, derived
1. The goal — max-variance directions.
Given centered data $X \in \mathbb{R}^{N \times d}$ (each column zero-mean — this is why preprocessing always subtracts the mean), find a unit vector $u \in \mathbb{R}^d$ that maximizes the variance of the projections $X u$:
$$ \mathrm{Var}(Xu) \;=\; \frac{1}{N - 1}\, u^{\top} X^{\top} X\, u \;=\; u^{\top} S\, u $$
where $S = \frac{1}{N-1} X^{\top} X$ is the covariance matrix. (If you also divide by std before this step, $S$ becomes the correlation matrix, which is what we compute when Standardize is on.)
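As a quick numerical check of the identity above, here is a minimal NumPy sketch. The toy data, the seed, and the variable names are assumptions made for illustration; they are not the page's datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 500 points in 3D with unequal spread per axis.
X_raw = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.3])
X = X_raw - X_raw.mean(axis=0)        # center each column

N = X.shape[0]
S = X.T @ X / (N - 1)                 # covariance matrix of the centered data

u = rng.normal(size=3)
u /= np.linalg.norm(u)                # an arbitrary unit direction

var_direct = np.var(X @ u, ddof=1)    # sample variance of the projections
var_quadratic = u @ S @ u             # the quadratic form u^T S u

print(var_direct, var_quadratic)
```

The two printed numbers should agree to floating-point precision, and `S` should match `np.cov(X, rowvar=False)`.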
2. The constraint — unit length.
Without a length constraint, you can make $u^{\top} S u$ arbitrarily large by scaling. We only care about direction, so we constrain $u^{\top} u = 1$:
$$ \max_{u} \; u^{\top} S\, u \quad \text{subject to} \quad u^{\top} u = 1 $$
3. Lagrangian.
Use a Lagrange multiplier $\lambda$ to fold the constraint into the objective:
$$ \mathcal{L}(u, \lambda) \;=\; u^{\top} S\, u \,-\, \lambda \,(\, u^{\top} u - 1 \,) $$
4. Take the gradient, set to zero.
Differentiate w.r.t. $u$ and equate to zero:
$$ \nabla_{u} \mathcal{L} \;=\; 2 S u \,-\, 2 \lambda u \;=\; 0 $$
$$ S\, u \;=\; \lambda\, u $$
That is exactly the eigenvalue equation for $S$. Every critical point of the constrained problem is an eigenvector of the covariance matrix.
5. Which eigenvector?
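Substitute $Su = \lambda u$ back into the objective: $u^{\top} S u = \lambda \, u^{\top} u = \lambda$. So the variance along eigenvector $u$ equals its eigenvalue $\lambda$, and the maximum is attained at the eigenvector with the largest eigenvalue. Repeating the maximization over directions orthogonal to those already found yields the remaining eigenvectors in turn. The principal components are therefore the eigenvectors of $S$ sorted by descending eigenvalue; each eigenvalue tells you how much variance that direction carries.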
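The result of the derivation can be checked numerically. The sketch below assumes hypothetical correlated toy data and uses `np.linalg.eigh` (the symmetric-matrix routine) as an illustrative choice. It verifies that the variance along each eigenvector equals its eigenvalue and that no random unit direction beats the top eigenvector.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical correlated toy data (not one of the page's datasets).
mix = np.array([[2.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.2]])
X_raw = rng.normal(size=(1000, 3)) @ mix
X = X_raw - X_raw.mean(axis=0)
S = X.T @ X / (X.shape[0] - 1)

# eigh returns eigenvalues in ascending order, so sort them descending.
eig_vals, eig_vecs = np.linalg.eigh(S)
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Variance of the data projected onto each eigenvector.
var_along = np.array([np.var(X @ eig_vecs[:, i], ddof=1) for i in range(3)])

# Variance u^T S u along 2000 random unit directions.
random_dirs = rng.normal(size=(2000, 3))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
random_vars = np.einsum("ij,jk,ik->i", random_dirs, S, random_dirs)

print("eigenvalues:       ", eig_vals)
print("variance along PCs:", var_along)
print("best random dir:   ", random_vars.max())
```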
Try this
The 95% threshold
On the plane dataset, look at the scree plot. PC1 + PC2 should hit ~95% explained variance — the third dimension is mostly noise. This is the visual intuition behind n_components=0.95.
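To make the threshold rule concrete, here is a sketch that computes the cumulative explained variance and picks the smallest number of components reaching 95%. The noisy-plane data is a hypothetical stand-in for the page's plane dataset, so the exact percentages will differ from the scree plot. The `searchsorted` selection only illustrates the idea and is not a copy of any library's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 800

# Hypothetical stand-in: two wide in-plane axes plus small noise in the third.
X_raw = np.column_stack([
    rng.normal(scale=3.0, size=n),
    rng.normal(scale=2.0, size=n),
    rng.normal(scale=0.5, size=n),
])
X = X_raw - X_raw.mean(axis=0)

# Eigenvalues of the covariance matrix, largest first.
eig_vals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

explained = eig_vals / eig_vals.sum()   # per-component share of the variance
cumulative = np.cumsum(explained)       # what the scree plot accumulates

# Smallest k whose cumulative share reaches the 0.95 threshold.
k = int(np.searchsorted(cumulative, 0.95) + 1)

print(np.round(cumulative, 3), "-> k =", k)
```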
When PCA gives up
Switch to the sphere dataset. All three eigenvalues should be roughly equal, so there's no preferred direction. PCA still returns three orthogonal axes, but their orientation is arbitrary: any rotation of them explains the variance equally well. This is what isotropic variance looks like.
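A small sketch of the isotropic case, assuming points drawn uniformly on a unit sphere as a stand-in for the page's sphere dataset (how the page actually generates it is not shown here):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stand-in: normalizing Gaussian samples gives points
# uniformly distributed on the unit sphere.
P = rng.normal(size=(2000, 3))
P /= np.linalg.norm(P, axis=1, keepdims=True)
X = P - P.mean(axis=0)

eig_vals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
explained = eig_vals / eig_vals.sum()

# Each component should carry roughly one third of the variance.
print(np.round(explained, 3))
```

With every share close to 1/3, dropping any component loses about a third of the variance, whichever axes PCA happens to return.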
The nonlinearity wall
The swiss roll is a 2D manifold embedded in 3D. Linear PCA flattens it — the projection loses the curved structure. This motivates kernel PCA, t-SNE, or UMAP for nonlinear data.
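One way to see the wall in numbers: on a swiss roll, all three linear directions carry a comparable share of the variance, even though the sheet itself is two-dimensional. The parametrization below is an assumption (a common swiss-roll construction), and the page's dataset may be generated differently.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1500

# Assumed parametrization: a spiral in the x-z plane extruded along y.
t = 1.5 * np.pi * (1 + 2 * rng.uniform(size=n))   # position along the spiral
height = 21 * rng.uniform(size=n)                 # position across the sheet
X_raw = np.column_stack([t * np.cos(t), height, t * np.sin(t)])
X = X_raw - X_raw.mean(axis=0)

eig_vals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
explained = eig_vals / eig_vals.sum()

# No component is negligible, so no 2D linear projection captures the sheet.
print(np.round(explained, 3))
```

The two intrinsic coordinates here are `t` and `height`, but no pair of linear axes recovers them, which is why a nonlinear method is needed to unroll the sheet.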
SVD instead of eig
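Replace np.linalg.eig(C) with np.linalg.svd(X_std, full_matrices=False). The right singular vectors are the principal directions, and the squared singular values divided by N - 1 are the eigenvalues of the covariance matrix. SVD works on X_std directly and never forms the covariance matrix, which makes it more numerically stable when that matrix is nearly singular; it’s what sklearn.decomposition.PCA uses under the hood.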
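The equivalence of the two routes can be sketched as follows. The names `X_std` and `C` follow the text, and the toy data is hypothetical. `np.linalg.eigh` stands in for `np.linalg.eig` because `C` is symmetric, which is an illustrative choice rather than the page's code.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical correlated toy data, standardized column by column.
mix = np.array([[2.0, 1.0, 0.5],
                [0.0, 1.0, 0.5],
                [0.0, 0.0, 0.3]])
X_raw = rng.normal(size=(300, 3)) @ mix
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
N = X_std.shape[0]

# Route 1: eigendecomposition of the covariance matrix.
C = X_std.T @ X_std / (N - 1)
eig_vals, eig_vecs = np.linalg.eigh(C)
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Route 2: SVD of the data matrix itself (singular values come sorted descending).
U, s, Vt = np.linalg.svd(X_std, full_matrices=False)
svd_vals = s**2 / (N - 1)

# Rows of Vt match the eigenvectors up to sign, so compare |dot products|.
alignment = np.abs(np.sum(Vt.T * eig_vecs, axis=0))

print(eig_vals, svd_vals, alignment)
```

Signs of individual components may differ between the two routes; only the directions are determined.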
Scale matters
In standardize(), remove the / X.std(axis=0) division so the function only centers the data (subtracts the mean). On the anisotropic dataset, watch how the PCs change: without scaling, PCA is at the mercy of feature units.
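A minimal sketch of the effect, assuming two hypothetical features that carry the same signal but are recorded in units 1000 times apart. The helper `top_pc` is introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000

# Hypothetical features: same underlying signal, very different units.
signal = rng.normal(size=n)
f1 = signal + 0.5 * rng.normal(size=n)
f2 = 1000 * (signal + 0.5 * rng.normal(size=n))
X_raw = np.column_stack([f1, f2])

def top_pc(X):
    """Return the eigenvector of the covariance matrix with the largest eigenvalue."""
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    return vecs[:, -1]              # eigh sorts ascending, so the last is largest

centered = X_raw - X_raw.mean(axis=0)
standardized = centered / X_raw.std(axis=0)

pc_centered = top_pc(centered)          # dominated by the large-unit feature
pc_standardized = top_pc(standardized)  # weights both features equally

print("centered only:", np.round(pc_centered, 4))
print("standardized: ", np.round(pc_standardized, 4))
```

With centering alone, PC1 points almost entirely along the large-unit feature; after standardizing, both features get equal weight.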
Reconstruction error
Add reconstructed = projected @ eig_vecs[:, :dims].T in your code and print np.linalg.norm(X_std - reconstructed). That value is the Frobenius norm of the residual: the information you discarded. Compare at dims = 1, 2, 3 across datasets.
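The experiment can be sketched end to end as below. The names `X_std`, `eig_vecs`, `projected`, `reconstructed`, and `dims` follow the text, while the toy data and the `errors` dictionary are assumptions for illustration. The sketch also shows that the squared error equals (N - 1) times the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical correlated toy data, standardized column by column.
mix = np.array([[2.0, 1.0, 0.5],
                [0.0, 1.0, 0.5],
                [0.0, 0.0, 0.3]])
X_raw = rng.normal(size=(400, 3)) @ mix
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
N = X_std.shape[0]

eig_vals, eig_vecs = np.linalg.eigh(np.cov(X_std, rowvar=False))
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

errors = {}
for dims in (1, 2, 3):
    projected = X_std @ eig_vecs[:, :dims]             # coordinates in PC space
    reconstructed = projected @ eig_vecs[:, :dims].T   # back to the original axes
    errors[dims] = np.linalg.norm(X_std - reconstructed)

for dims, err in errors.items():
    discarded = (N - 1) * eig_vals[dims:].sum()
    print(dims, err, np.sqrt(discarded))
```

At dims = 3 nothing is discarded, so the error should be zero up to rounding.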
In one glance
Frequently asked
sphere and swissRoll datasets: PCA flatlines on the first and loses the structure on the second.
PCA(n_components=0.95) picks the smallest $k$ retaining 95% variance for you.