K-Means, visualized
K-Means partitions data into k clusters by alternating two steps: assign each point to its nearest centroid, then move each centroid to the mean of its members. Watch the points snap to clusters and the centroids drift on the left, while the inertia (sum of squared distances) ticks down on the right — one iteration at a time. Swap datasets to see where K-means shines and where it visibly fails.
The algorithm edit and re-run
The math, derived
1. The objective.
Pick $k$ centroids $\mu_1, \ldots, \mu_k$ that minimize the total within-cluster squared distance — the inertia:
$$ J(\mu_1, \ldots, \mu_k) \;=\; \sum_{c=1}^{k} \;\sum_{x \in C_c} \|x - \mu_c\|^2 $$$C_c$ is the set of points assigned to cluster $c$. Lower $J$ ⇒ tighter clusters.
2. The combinatorial trap.
Optimizing $J$ exactly is NP-hard — there are $\binom{N}{k}$ ways to assign $N$ points to $k$ groups. We need a heuristic, and the natural one is coordinate descent: alternate between the two unknowns (the assignments and the centroids), holding one fixed at a time.
3. E-step — assign points to nearest centroid.
With $\mu_c$ fixed, the assignment that minimizes $J$ for each point is the obvious one: nearest centroid by squared distance.
$$ C_c \;=\; \{\, x_i \;:\; c \,=\, \arg\min_j \|x_i - \mu_j\|^2 \,\} $$4. M-step — centroid = mean of cluster.
With $C_c$ fixed, the centroid that minimizes the within-cluster term is the mean — from calculus, $\nabla_{\mu_c} J = -2 \sum_{x \in C_c} (x - \mu_c) = 0$.
$$ \mu_c \;=\; \frac{1}{|C_c|} \sum_{x \in C_c} x $$5. Why it converges (and doesn’t reach the optimum).
Both steps strictly decrease $J$ unless $J$ is already at a local minimum — that’s why the inertia curve on the right is monotonically falling. But local, not global: a bad initial centroid placement can trap K-means in a worse partition than the true optimum. That’s what k-means++ fixes — it spreads the initial centroids far apart so we usually start in the right basin.
Try this
The over-segmentation trap
On the three-blobs dataset, crank k from 3 to 6. The algorithm has to split the existing blobs — the inertia keeps falling but the result is meaningless. This is why elbow plots matter.
Initialization matters
Switch to moons and toggle init between random and kmeans++, varying the seed. Count iterations to convergence and compare final inertia. k-means++ usually wins by 30–60%.
The non-convex wall
On rings with k=2, K-means can’t separate them — both rings have the same mean. The result is a bad horizontal split. This is the canonical motivation for spectral clustering or DBSCAN.
The small-cluster problem
On unequalSize, K-means absorbs the 15-point cluster into a neighbor (mean pull is too weak). Add min_size logic in the code: split the biggest cluster when a smaller one drops below threshold.
Manhattan instead of Euclidean
Replace ((X-c)**2).sum(-1) with np.abs(X-c).sum(-1) in assign(). That’s now K-medians (sort of). How does the partition change on anisotropic blobs?
Multiple restarts
Wrap run() in a loop that tries 10 random seeds and keeps the lowest-inertia result. This is what sklearn.KMeans does by default (n_init=10) and is one of the few times "just do it 10 times" is the right answer.