
K-Means, visualized

Clustering

K-Means partitions data into k clusters by alternating two steps: assign each point to its nearest centroid, then move each centroid to the mean of its members. Watch the points snap to clusters and the centroids drift on the left, while the inertia (sum of squared distances) ticks down on the right — one iteration at a time. Swap datasets to see where K-means shines and where it visibly fails.

[Interactive demo. Left panel: clusters & centroids. Right panel: inertia (WCSS) per iteration. Controls: dataset, k (clusters, shown at 3), initialization, max iterations, random seed, animation speed (2/s).]

The baseline dataset: K-means converges in 2–3 iterations and recovers the true clusters.

Tip: change the seed to see how much the final clusters depend on initialization.

The algorithm (edit and re-run)

k_means.py

import numpy as np

# Cluster a 2D dataset DATA_X (shape: N×2) into k groups by Lloyd's
# algorithm.  Each iteration: (1) assign each point to its nearest
# centroid (E-step), (2) move each centroid to the mean of its
# assigned points (M-step).  Inertia = sum of squared distances of
# each point to its assigned centroid.

# ── E-step: nearest-centroid assignment ─────────────────
def assign(X, centroids):
    # Pairwise squared distance via broadcasting.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return np.argmin(dists, axis=1)

# ── M-step: centroid update ─────────────────────────────
def update_centroids(X, labels, k):
    new_c = np.zeros((k, X.shape[1]))
    for c in range(k):
        mask = (labels == c)
        if mask.any():
            new_c[c] = X[mask].mean(axis=0)
        else:
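            # Empty cluster: reseed to a random data point so k stays fixed.
            # Note: this draws from NumPy's global RNG, so it is not governed
            # by the `seed` argument of run().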
            new_c[c] = X[np.random.randint(len(X))]
    return new_c

def inertia(X, labels, centroids):
    return float(sum(((X[labels == c] - centroids[c]) ** 2).sum()
                     for c in range(len(centroids))))

# ── k-means++ initialization ────────────────────────────
def init_plus_plus(X, k, rng):
    n = len(X)
    centroids = [X[rng.integers(n)]]
    for _ in range(1, k):
        d2 = np.min(((X[:, None, :] -
                      np.array(centroids)[None, :, :]) ** 2).sum(-1), axis=1)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)

# ── Lloyd's training loop ───────────────────────────────
def run(k=3, max_iter=20, init="kmeans++", seed=0):
    """Cluster DATA_X into k groups.  Returns iteration history."""
    rng = np.random.default_rng(seed)
    if init == "kmeans++":
        centroids = init_plus_plus(DATA_X, k, rng)
    else:
        idx = rng.choice(len(DATA_X), size=k, replace=False)
        centroids = DATA_X[idx].copy()

    labels = assign(DATA_X, centroids)
    history = {
        "centroids": [centroids.tolist()],
        "labels":    [labels.tolist()],
        "inertia":   [inertia(DATA_X, labels, centroids)],
    }

    for _ in range(int(max_iter)):
        new_c = update_centroids(DATA_X, labels, k)
        if np.allclose(new_c, centroids, atol=1e-6):
            break
        centroids = new_c
        labels = assign(DATA_X, centroids)
        history["centroids"].append(centroids.tolist())
        history["labels"].append(labels.tolist())
        history["inertia"].append(inertia(DATA_X, labels, centroids))

    return history

# Try this:
#   · Switch init="random" on the moons dataset — see how sensitive convergence is to init
#   · Replace Euclidean distance with Manhattan (np.abs(...).sum(-1)) — different splits
#   · Add an early-stop tolerance: break when |inertia_new - inertia_old| / inertia_old < 1e-4
#   · Add a "miniBatch" mode that samples 64 points per iteration (cf. sklearn.cluster.MiniBatchKMeans)
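The early-stop suggestion in the list above could look roughly like the sketch below. It is self-contained rather than a patch to k_means.py: the function name `lloyd_with_tol`, the `tol` parameter, and the synthetic three-blob data are all assumptions made for illustration.

```python
import numpy as np

def lloyd_with_tol(X, k, max_iter=100, tol=1e-4, seed=0):
    """Lloyd's algorithm that stops once the relative inertia drop is below tol.

    Hypothetical helper for illustration; not part of k_means.py.
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    history = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        history.append(float(d2[np.arange(len(X)), labels].sum()))
        # Early stop: relative improvement over the previous iteration.
        if len(history) > 1:
            prev, cur = history[-2], history[-1]
            if prev == 0 or (prev - cur) / prev < tol:
                break
        # Update step: move each non-empty cluster's centroid to its mean.
        for c in range(k):
            members = X[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return labels, centroids, history

# Made-up data: three Gaussian blobs of 50 points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2))
               for loc in ([0, 0], [5, 0], [0, 5])])
labels, centroids, history = lloyd_with_tol(X, k=3)
print(len(history), history[-1])
```

A relative tolerance is scale-free, so the same `tol` behaves similarly whether the coordinates are in millimetres or kilometres; an absolute centroid tolerance like `atol=1e-6` is not.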

The math, derived

1. The objective.

Pick $k$ centroids $\mu_1, \ldots, \mu_k$ that minimize the total within-cluster squared distance — the inertia:

$$ J(\mu_1, \ldots, \mu_k) \;=\; \sum_{c=1}^{k} \;\sum_{x \in C_c} \|x - \mu_c\|^2 $$

$C_c$ is the set of points assigned to cluster $c$. Lower $J$ ⇒ tighter clusters.
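As a small worked instance of the objective, the sketch below evaluates $J$ for two different partitions of four made-up points. The data and variable names are illustrative assumptions.

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

def J(X, labels):
    """Inertia with each centroid placed at the mean of its cluster."""
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)

tight = J(X, np.array([0, 0, 1, 1]))   # left pair vs right pair
loose = J(X, np.array([0, 1, 0, 1]))   # bottom pair vs top pair
print(tight, loose)                    # 4.0 100.0
```

In the tight partition every point sits at distance 1 from its centroid, so $J = 4 \cdot 1^2 = 4$; in the loose one every point sits at distance 5, so $J = 4 \cdot 5^2 = 100$.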

2. The combinatorial trap.

Optimizing $J$ exactly is NP-hard in general, and brute force is hopeless: the number of ways to partition $N$ points into $k$ non-empty groups is the Stirling number of the second kind, which grows roughly like $k^N / k!$. We need a heuristic, and the natural one is coordinate descent: alternate between the two unknowns (the assignments and the centroids), holding one fixed at a time.
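To get a feel for why exhaustive search does not scale, the sketch below brute-forces every one of the $k^N$ labelings of a tiny made-up dataset. The helper `inertia_of` and the six points are assumptions for illustration; at $N = 6$, $k = 2$ the search is trivial, but each extra point multiplies the work by $k$.

```python
import itertools
import numpy as np

def inertia_of(X, labels, k):
    """J for a labeling, with each centroid set to its cluster mean."""
    total = 0.0
    for c in range(k):
        members = X[labels == c]
        if len(members):
            total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
k = 2

best_J, best_labels, n_tried = np.inf, None, 0
for combo in itertools.product(range(k), repeat=len(X)):
    labels = np.array(combo)
    n_tried += 1
    cost = inertia_of(X, labels, k)
    if cost < best_J:
        best_J, best_labels = cost, labels

print(n_tried)               # 2**6 = 64 labelings
print(best_labels, best_J)
```

The loop counts relabelings of the same partition separately, which is why it visits $k^N$ candidates rather than the smaller number of distinct partitions.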

3. E-step — assign points to nearest centroid.

With $\mu_c$ fixed, the assignment that minimizes $J$ for each point is the obvious one: nearest centroid by squared distance.

$$ C_c \;=\; \{\, x_i \;:\; c \,=\, \arg\min_j \|x_i - \mu_j\|^2 \,\} $$

4. M-step — centroid = mean of cluster.

With $C_c$ fixed, the centroid that minimizes the within-cluster term is the mean — from calculus, $\nabla_{\mu_c} J = -2 \sum_{x \in C_c} (x - \mu_c) = 0$.

$$ \mu_c \;=\; \frac{1}{|C_c|} \sum_{x \in C_c} x $$
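The claim that the mean minimizes the within-cluster term can be checked numerically. The sketch below uses a made-up cluster of 30 random points; it verifies that the gradient vanishes at the mean and that shifting the centroid by any vector $d$ raises the cost by exactly $|C_c| \, \|d\|^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
cluster = rng.normal(size=(30, 2))       # made-up members of one cluster

def cluster_cost(mu):
    """Sum of squared distances from the cluster's points to mu."""
    return float(((cluster - mu) ** 2).sum())

mu_star = cluster.mean(axis=0)

# Gradient -2 * sum(x - mu) evaluated at the mean: zero up to rounding.
grad = -2 * (cluster - mu_star).sum(axis=0)

# Shift the centroid away from the mean and measure the extra cost.
d = np.array([0.3, -0.2])
extra = cluster_cost(mu_star + d) - cluster_cost(mu_star)
print(grad, extra, len(cluster) * (d @ d))
```

The identity follows from expanding $\sum \|x - \mu_c - d\|^2$: the cross term is $-2\, d \cdot \sum (x - \mu_c)$, which is zero at the mean, leaving only $|C_c| \, \|d\|^2$.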

5. Why it converges (and doesn’t reach the optimum).

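Neither step can increase $J$: reassignment moves each point to a centroid that is at least as close, and the mean minimizes each cluster’s squared distance. That’s why the inertia curve on the right never rises, and because there are only finitely many partitions the loop must stop after finitely many iterations. But the stopping point is a local optimum, not necessarily the global one: a bad initial centroid placement can trap K-means in a worse partition than the true optimum. That’s what k-means++ mitigates: it tends to spread the initial centroids far apart, so we usually start in the right basin.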
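One way to check the behavior of the inertia curve is to log $J$ after each half-step (centroid update, then reassignment) and confirm the sequence never goes up. The sketch below does that on made-up blobs; the data, seed, and variable names are assumptions.

```python
import numpy as np

def J(X, labels, centroids):
    return float(((X - centroids[labels]) ** 2).sum())

def nearest(X, centroids):
    return ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(axis=1)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 1.0, size=(40, 2))
               for loc in ([0, 0], [6, 0], [3, 5])])
k = 3
centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
labels = nearest(X, centroids)

trace = [J(X, labels, centroids)]
for _ in range(20):
    # Centroid update with labels fixed: J cannot go up.
    for c in range(k):
        if (labels == c).any():
            centroids[c] = X[labels == c].mean(axis=0)
    trace.append(J(X, labels, centroids))
    # Reassignment with centroids fixed: J cannot go up either.
    new_labels = nearest(X, centroids)
    trace.append(J(X, new_labels, centroids))
    if np.array_equal(new_labels, labels):
        break                      # no point changed cluster: converged
    labels = new_labels

print(np.round(trace, 2))
```

Flat stretches in the trace are normal: a half-step that changes nothing leaves $J$ exactly where it was.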

Try this

The over-segmentation trap

On the three-blobs dataset, crank k from 3 to 6. The algorithm has to split the existing blobs — the inertia keeps falling but the result is meaningless. This is why elbow plots matter.
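An elbow curve for this experiment can be sketched as below: run K-means for several values of k and record the best final inertia. The helper `kmeans_inertia`, the synthetic blobs, and the choice of 5 restarts per k are assumptions, not the demo's code.

```python
import numpy as np

def kmeans_inertia(X, k, seed, max_iter=100):
    """Final inertia of one Lloyd run from a random-point initialization."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        new_c = centroids.copy()
        for c in range(k):
            if (labels == c).any():
                new_c[c] = X[labels == c].mean(axis=0)
        if np.allclose(new_c, centroids):
            break
        centroids = new_c
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return float(d2.min(axis=1).sum())

# Made-up data: three well-separated blobs of 60 points each.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc, 1.0, size=(60, 2))
               for loc in ([0, 0], [20, 0], [10, 18])])

# Best of 5 restarts for each k from 1 to 6.
inertias = {k: min(kmeans_inertia(X, k, seed) for seed in range(5))
            for k in range(1, 7)}
for k, value in inertias.items():
    print(k, round(value, 1))
```

On data like this the drop typically flattens once k passes the true number of blobs: further splits still lower inertia, but only by carving up clusters that were already tight.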

Initialization matters

Switch to moons and toggle init between random and kmeans++, varying the seed. Count iterations to convergence and compare final inertia. k-means++ usually wins, often by roughly 30–60%, though the margin varies with the dataset and seed.
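Outside the demo, the same comparison could be scripted along these lines. Everything here is an assumption for illustration: `make_moons` is a hand-rolled stand-in for the demo's moons dataset, and the helper names do not come from k_means.py. The script only reports what it measures; it makes no promise about which method wins on a given seed.

```python
import numpy as np

def make_moons(n=200, noise=0.08, seed=0):
    """Two interleaved half-circles (hypothetical stand-in for the demo data)."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, np.pi, n // 2)
    upper = np.column_stack([np.cos(t), np.sin(t)])
    lower = np.column_stack([1.0 - np.cos(t), 0.5 - np.sin(t)])
    return np.vstack([upper, lower]) + rng.normal(0.0, noise, size=(n, 2))

def sq_dists(X, C):
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)

def init_centroids(X, k, rng, method):
    if method == "random":
        return X[rng.choice(len(X), size=k, replace=False)].copy()
    # k-means++: sample each new centroid with probability proportional to D^2.
    C = [X[rng.integers(len(X))]]
    while len(C) < k:
        d2 = sq_dists(X, np.array(C)).min(axis=1)
        C.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(C)

def lloyd(X, C, max_iter=100):
    """Returns (iterations used, final inertia)."""
    for it in range(1, max_iter + 1):
        labels = sq_dists(X, C).argmin(axis=1)
        new_C = np.array([X[labels == c].mean(axis=0) if (labels == c).any()
                          else C[c] for c in range(len(C))])
        if np.allclose(new_C, C, atol=1e-6):
            break
        C = new_C
    return it, float(sq_dists(X, C).min(axis=1).sum())

X = make_moons()
results = {}
for method in ("random", "kmeans++"):
    results[method] = [
        lloyd(X, init_centroids(X, 2, np.random.default_rng(s), method))
        for s in range(10)
    ]
    iters, inertias = zip(*results[method])
    print(method, "mean iterations:", np.mean(iters),
          "mean inertia:", round(float(np.mean(inertias)), 3))
```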

The non-convex wall

On rings with k=2, K-means can’t separate them — both rings have the same mean. The result is a bad horizontal split. This is the canonical motivation for spectral clustering or DBSCAN.
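The shared-mean problem is easy to reproduce. The sketch below builds two noiseless concentric rings (radii 1 and 3, chosen arbitrarily), confirms their means coincide, and then runs a few Lloyd iterations from two hand-picked starting centroids.

```python
import numpy as np

theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
unit = np.column_stack([np.cos(theta), np.sin(theta)])
inner, outer = 1.0 * unit, 3.0 * unit
X = np.vstack([inner, outer])
ring = np.repeat([0, 1], len(theta))     # ground-truth ring of each point

# Both rings have (numerically) the same mean: the origin.
print(inner.mean(axis=0), outer.mean(axis=0))

# Lloyd iterations from two arbitrary starting centroids.
centroids = np.array([[-1.0, 0.0], [1.0, 0.0]])
for _ in range(20):
    labels = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(axis=1)
    centroids = np.array([X[labels == c].mean(axis=0) for c in range(2)])

# How many points of each ring landed in each K-means cluster.
mix = [np.bincount(ring[labels == c], minlength=2) for c in range(2)]
print(mix)
```

With two centroids the decision boundary is a straight line (the perpendicular bisector between them), so each cluster is a half-plane that slices through both rings.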

The small-cluster problem

On unequalSize, K-means absorbs the 15-point cluster into a neighbor (mean pull is too weak). Add min_size logic in the code: split the biggest cluster when a smaller one drops below threshold.

Manhattan instead of Euclidean

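In assign(), replace the squared-difference expression ((...) ** 2).sum(-1) with np.abs(...).sum(-1). That’s now K-medians (sort of): true K-medians also swaps the mean in update_centroids() for the per-coordinate median, which is the minimizer of Manhattan distance. How does the partition change on anisotropic blobs?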
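A fuller K-medians variant changes both steps, as in the sketch below. The function names `assign_l1` and `update_medians` and the tiny dataset (which includes one deliberate outlier) are assumptions for illustration.

```python
import numpy as np

def assign_l1(X, centers):
    """Nearest center under Manhattan (L1) distance."""
    d1 = np.abs(X[:, None, :] - centers[None, :, :]).sum(-1)
    return d1.argmin(axis=1)

def update_medians(X, labels, centers):
    """Per-coordinate median of each cluster; empty clusters keep their center."""
    new_c = centers.astype(float).copy()
    for c in range(len(centers)):
        if (labels == c).any():
            new_c[c] = np.median(X[labels == c], axis=0)
    return new_c

# Two rows of points; the first row has an outlier at x = 100.
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [100.0, 0.0],
              [0.0, 50.0], [1.0, 50.0], [2.0, 50.0]])
centers = np.array([[0.0, 0.0], [0.0, 50.0]])
for _ in range(10):
    labels = assign_l1(X, centers)
    centers = update_medians(X, labels, centers)

print(labels)    # first four points in cluster 0, last three in cluster 1
print(centers)   # cluster 0 center is (1.5, 0), not the mean (25.75, 0)
```

The outlier drags the mean of the first cluster to x = 25.75 but leaves the median at 1.5, which is the main practical reason to prefer medians.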

Multiple restarts

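Wrap run() in a loop that tries 10 random seeds and keeps the lowest-inertia result. sklearn.cluster.KMeans has this built in through its n_init parameter (the default was long 10; recent releases default to n_init="auto", which still runs 10 restarts for init="random" but only 1 for k-means++). It is one of the few times "just do it 10 times" is the right answer.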
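A restart wrapper might look like the following sketch. It uses its own compact `kmeans_once` helper on made-up blobs instead of the page's run(), so it can be executed on its own; both the helper and the data are assumptions.

```python
import numpy as np

def kmeans_once(X, k, seed, max_iter=100):
    """One Lloyd run from random data points. Returns (labels, centroids, inertia)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        new_c = centroids.copy()
        for c in range(k):
            if (labels == c).any():
                new_c[c] = X[labels == c].mean(axis=0)
        if np.allclose(new_c, centroids):
            break
        centroids = new_c
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    labels = d2.argmin(axis=1)
    return labels, centroids, float(d2.min(axis=1).sum())

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc, 1.0, size=(50, 2))
               for loc in ([0, 0], [8, 0], [4, 7])])

# Ten restarts with different seeds; keep the run with the lowest inertia.
runs = [kmeans_once(X, 3, seed) for seed in range(10)]
best_labels, best_centroids, best_inertia = min(runs, key=lambda r: r[2])
print([round(r[2], 1) for r in runs])
print(round(best_inertia, 1))
```

Printing every run's inertia alongside the winner shows how much (or how little) the seed mattered on this particular dataset.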

Frequently asked

What is K-means?

An unsupervised algorithm that partitions data into $k$ groups. It alternates two steps: assign each point to the nearest centroid, then move each centroid to the mean of its assigned points. It repeats until the centroids stop moving. The total within-cluster squared distance (inertia) never increases from one iteration to the next, which is what guarantees that Lloyd’s algorithm terminates.

Why does K-means fail on moons and rings?

K-means assumes clusters are convex blobs around a center. Half-moons and concentric rings are non-convex: the center of the outer ring is the same point as the center of the inner ring, so K-means can’t see them as different groups. Density-based methods (DBSCAN) or spectral clustering can.

What is k-means++?

A smarter way to pick initial centroids. Instead of choosing them all uniformly at random, k-means++ picks the first one uniformly at random, then samples each subsequent one with probability proportional to its squared distance from the nearest centroid already chosen, which pushes them apart. Result: typically fewer iterations to converge and better final inertia, especially on hard datasets.
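A tiny worked example of the squared-distance weighting, with three made-up points on a line and the first centroid already placed at the origin:

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
chosen = np.array([[0.0, 0.0]])          # first centroid, already picked

# Squared distance from each point to its nearest already-chosen centroid.
d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(-1).min(axis=1)
probs = d2 / d2.sum()
print(probs)   # probabilities 0, 0.1 and 0.9

# Draw the second centroid with those probabilities.
rng = np.random.default_rng(0)
second = X[rng.choice(len(X), p=probs)]
```

The squared distances are 0, 1, and 9, so the far point is nine times as likely to be chosen as the near one, and the point that is already a centroid can never be picked again.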

How do I choose k?

There’s no perfect rule, but two heuristics help: the elbow method (plot inertia vs $k$ and pick the value where the curve bends) and the silhouette score (which compares how close each point is to its own cluster versus the nearest other cluster). For most real datasets, several values of $k$ are defensible; the choice depends on what you’re optimizing.
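For reference, the silhouette score can be computed directly from pairwise distances. The sketch below is a plain NumPy version assuming Euclidean distance, at least two clusters, and no duplicate points; `silhouette_mean` is a hypothetical helper name.

```python
import numpy as np

def silhouette_mean(X, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue                     # singleton clusters score 0 by convention
        # a: mean distance to the other members of the point's own cluster.
        a = D[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_mean(X, np.array([0, 0, 1, 1]))   # left pair vs right pair
bad = silhouette_mean(X, np.array([0, 1, 0, 1]))    # bottom pair vs top pair
print(round(good, 3), round(bad, 3))
```

Scores near 1 mean points sit much closer to their own cluster than to any other; scores below 0 mean many points would fit better in a different cluster.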
