
Gradient Descent, visualized

Optimization

Linear regression learns its slope and intercept by gradient descent. Watch the regression line tilt and shift to fit the data on the left, while the (w, b) marker descends the MSE surface on the right — one epoch at a time. Pick a different dataset to see what gradient descent learns (and where it struggles): clean linear, heavy noise, outliers, a saturating curve, growing variance, or two mixed populations.

Left panel: regression line (spending → sales). Right panel: MSE surface over (w, b); drag to rotate.
Controls: dataset (clean linear is the baseline, where GD converges nicely), learning rate η (5e-5), epochs (200), initial w (slope), initial b (intercept), animation speed (30/s).
Tip: drag to rotate the surface, scroll to zoom, click any point to seed (w, b) and re-run.

The algorithm (edit and re-run)

gradient_descent.py
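
The editable source itself is not reproduced in this text. As a rough stand-in, here is a minimal sketch of what a file like this could contain. The names mse_loss_dw, step, DATA_X and lr appear elsewhere on this page; mse_loss, mse_loss_db, DATA_Y, run and the synthetic data are assumptions made for illustration, not the page's actual code.

```python
import numpy as np

# Synthetic stand-in data (an assumption): x spans 0..50, as on this page.
rng = np.random.default_rng(0)
DATA_X = np.linspace(0.0, 50.0, 40)
DATA_Y = 0.5 * DATA_X + 10.0 + rng.normal(0.0, 1.0, DATA_X.size)

def mse_loss(w, b, x, y):
    """Mean squared error of the line y_hat = w*x + b."""
    return np.mean((y - (w * x + b)) ** 2)

def mse_loss_dw(w, b, x, y):
    """Partial derivative of the MSE with respect to the slope w."""
    return -2.0 * np.mean(x * (y - (w * x + b)))

def mse_loss_db(w, b, x, y):
    """Partial derivative of the MSE with respect to the intercept b."""
    return -2.0 * np.mean(y - (w * x + b))

def step(w, b, x, y, lr):
    """One epoch: move (w, b) a small step against the gradient."""
    dw = mse_loss_dw(w, b, x, y)
    db = mse_loss_db(w, b, x, y)
    return w - lr * dw, b - lr * db

def run(w=0.0, b=0.0, lr=5e-5, epochs=200):
    """Run gradient descent; return the final (w, b) and the loss history."""
    history = []
    for _ in range(epochs):
        w, b = step(w, b, DATA_X, DATA_Y, lr)
        history.append(mse_loss(w, b, DATA_X, DATA_Y))
    return w, b, history

w, b, history = run()
print(f"w={w:.3f}  b={b:.3f}  final MSE={history[-1]:.3f}")
```

Both gradients are computed from the same (w, b) before either parameter changes, so each epoch is a single simultaneous step.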

The math, derived

1. The model.

A linear regression predicts $\hat{y}$ from $x$ using two parameters — a slope $w$ and an intercept $b$:

$$ \hat{y} \;=\; w \, x \,+\, b $$

Our job is to find the $w$ and $b$ that make $\hat{y}$ closest to the true $y$ across all $N$ data points.

2. The loss: mean squared error.

For each example $i$, the residual is $y_i - \hat{y}_i$. We square it (so positive and negative errors don’t cancel) and average:

$$ L(w, b) \;=\; \frac{1}{N} \sum_{i=1}^{N} \bigl(\,y_i - (w\,x_i + b)\,\bigr)^2 $$

3. Differentiate w.r.t. w.

Treat $b$ as a constant. The inner residual depends on $w$ only through the $-w\,x_i$ term, so by the chain rule:

$$ \begin{aligned} \frac{\partial L}{\partial w} \;&=\; \frac{1}{N} \sum_{i=1}^{N} 2\bigl(y_i - (w\,x_i + b)\bigr) \cdot \bigl(-x_i\bigr) \\ \;&=\; -\frac{2}{N} \sum_{i=1}^{N} x_i\,\bigl(y_i - (w\,x_i + b)\bigr) \end{aligned} $$

That’s exactly mse_loss_dw in the code above — np.mean handles the $\frac{1}{N}\sum$.

4. Differentiate w.r.t. b.

Same idea, but the inner term contributes $-1$ for $b$ instead of $-x_i$:

$$ \frac{\partial L}{\partial b} \;=\; -\frac{2}{N} \sum_{i=1}^{N} \bigl(y_i - (w\,x_i + b)\bigr) $$
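
One way to sanity-check the two derivatives above is to compare them against central finite differences. The sketch below does that on a small made-up dataset; the data values, the helper names loss, dL_dw and dL_db, and the step size eps are all assumptions for illustration.

```python
import numpy as np

# Made-up data for the check (an assumption, not the page's dataset).
x = np.array([1.0, 4.0, 10.0, 25.0, 50.0])
y = np.array([11.0, 12.5, 14.0, 23.0, 36.0])

def loss(w, b):
    """MSE from step 2."""
    return np.mean((y - (w * x + b)) ** 2)

def dL_dw(w, b):
    """Analytic partial derivative from step 3."""
    return -2.0 * np.mean(x * (y - (w * x + b)))

def dL_db(w, b):
    """Analytic partial derivative from step 4."""
    return -2.0 * np.mean(y - (w * x + b))

w, b, eps = 0.3, 5.0, 1e-6

# Central differences approximate each partial derivative numerically.
num_dw = (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps)
num_db = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)

print(f"dL/dw  analytic={dL_dw(w, b):.4f}  numeric={num_dw:.4f}")
print(f"dL/db  analytic={dL_db(w, b):.4f}  numeric={num_db:.4f}")
```

If the analytic and numeric values disagree by more than rounding error, the usual culprit is a dropped sign or a missing factor of 2.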

5. The update rule.

Take a small step opposite each partial derivative, scaled by the learning rate $\eta$:

$$ w \;\leftarrow\; w \,-\, \eta\,\frac{\partial L}{\partial w}, \qquad b \;\leftarrow\; b \,-\, \eta\,\frac{\partial L}{\partial b} $$

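Repeat for many epochs. The MSE bowl is convex with a single minimum (as long as the $x_i$ are not all identical), so the parameters converge to it whenever $\eta$ is small enough: specifically, below $2/\lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of the Hessian of $L$. Push $\eta$ past that divergence threshold and watch the marker fly off the canvas.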
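
To make the update concrete, here is one epoch worked by hand. The two data points, the starting values and the learning rate are made up for illustration and are not the demo's settings: take $(x_1, y_1) = (1, 2)$ and $(x_2, y_2) = (3, 4)$, start at $w = 0$, $b = 0$, and use $\eta = 0.1$. The residuals are $2$ and $4$, so the loss is $(2^2 + 4^2)/2 = 10$ and the partial derivatives are:

$$ \frac{\partial L}{\partial w} \;=\; -\frac{2}{2}\,\bigl(1 \cdot 2 + 3 \cdot 4\bigr) \;=\; -14, \qquad \frac{\partial L}{\partial b} \;=\; -\frac{2}{2}\,\bigl(2 + 4\bigr) \;=\; -6 $$

Applying the update rule:

$$ w \;\leftarrow\; 0 - 0.1 \cdot (-14) \;=\; 1.4, \qquad b \;\leftarrow\; 0 - 0.1 \cdot (-6) \;=\; 0.6 $$

The new predictions are $\hat{y}_1 = 2.0$ and $\hat{y}_2 = 4.8$, so the residuals shrink to $0$ and $-0.8$, and the loss drops from $10$ to $(0^2 + 0.8^2)/2 = 0.32$ in a single step.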

Try this

Find the divergence threshold

The default lr = 5e-5 is conservative. Slowly increase it. Around 2e-4 the regression line starts swinging; past 3e-4 it diverges entirely. Why is the threshold so low? Because x reaches 50 — the gradient gets multiplied by big numbers.
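
The threshold can also be estimated without trial and error. For MSE the Hessian with respect to (w, b) is constant, and plain gradient descent is stable only while lr stays below 2 divided by its largest eigenvalue. The sketch below computes that bound for synthetic x values spread evenly over 0..50, which is an assumption; the number depends on the actual data, so it need not match what you observe in the demo.

```python
import numpy as np

# Synthetic x values over 0..50 (an assumption; the page's data may differ).
x = np.linspace(0.0, 50.0, 40)

# Hessian of L(w, b) = mean((y - (w*x + b))**2); it does not depend on y.
hessian = 2.0 * np.array([
    [np.mean(x ** 2), np.mean(x)],
    [np.mean(x),      1.0],
])

eigenvalues = np.linalg.eigvalsh(hessian)  # returned in ascending order
lam_max = eigenvalues[-1]
threshold = 2.0 / lam_max

print(f"largest eigenvalue: {lam_max:.1f}")
print(f"divergence threshold for lr: {threshold:.2e}")
```

The largest eigenvalue is dominated by the mean of x squared, which is why large inputs force a small learning rate.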

Normalize, then go fast

Add DATA_X = (DATA_X - DATA_X.mean()) / DATA_X.std() at the top of the algorithm. Now the gradient stays bounded — crank lr to 0.01 and convergence drops to ~50 epochs.
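
To see why standardizing helps, compare the curvature of the MSE bowl before and after. The sketch below uses synthetic inputs over 0..50 and a hypothetical helper mse_hessian; both are assumptions for illustration rather than part of the page's code.

```python
import numpy as np

def mse_hessian(x):
    """Hessian of the MSE with respect to (w, b) for inputs x."""
    return 2.0 * np.array([
        [np.mean(x ** 2), np.mean(x)],
        [np.mean(x),      1.0],
    ])

# Synthetic inputs over 0..50 (an assumption, not the page's dataset).
raw_x = np.linspace(0.0, 50.0, 40)
std_x = (raw_x - raw_x.mean()) / raw_x.std()

for name, x in [("raw", raw_x), ("standardized", std_x)]:
    eig = np.linalg.eigvalsh(mse_hessian(x))  # ascending order
    print(f"{name:>12}: eigenvalues={eig.round(3)}  "
          f"max stable lr={2.0 / eig[-1]:.2e}  "
          f"condition number={eig[-1] / eig[0]:.1f}")
```

After standardizing, x has mean 0 and variance 1, so the bowl has the same curvature in every direction and one learning rate suits both parameters. Keep in mind that the fitted w and b are then expressed in standardized units of x.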

Add momentum

Replace step() with momentum GD: v_w = 0.9*v_w - lr*dw; w = w + v_w (same for b, with its own velocity v_b; start both velocities at zero). The marker takes a more direct path down the MSE bowl.
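
A minimal sketch of that momentum update follows. The names v_w, lr and dw come from the formula above; the synthetic data, the helper names gradients and momentum_step, and the beta parameter are assumptions for illustration.

```python
import numpy as np

# Synthetic data (an assumption): a noiseless line y = 0.5*x + 10.
x = np.linspace(0.0, 50.0, 40)
y = 0.5 * x + 10.0

def gradients(w, b):
    """MSE partial derivatives with respect to w and b."""
    residual = y - (w * x + b)
    return -2.0 * np.mean(x * residual), -2.0 * np.mean(residual)

def momentum_step(w, b, v_w, v_b, lr, beta=0.9):
    """One epoch of momentum GD: velocities accumulate past gradients."""
    dw, db = gradients(w, b)
    v_w = beta * v_w - lr * dw
    v_b = beta * v_b - lr * db
    return w + v_w, b + v_b, v_w, v_b

w, b, v_w, v_b = 0.0, 0.0, 0.0, 0.0
start_loss = np.mean((y - (w * x + b)) ** 2)
for _ in range(200):
    w, b, v_w, v_b = momentum_step(w, b, v_w, v_b, lr=5e-5)
end_loss = np.mean((y - (w * x + b)) ** 2)

print(f"MSE before={start_loss:.2f}  after={end_loss:.2f}  w={w:.3f}  b={b:.3f}")
```

The velocities have to persist between epochs, so they are passed in and returned alongside w and b rather than reset inside the step.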

Switch MSE to MAE

Mean absolute error is more robust to outliers. Replace squared error with np.mean(np.abs(...)); the gradient w.r.t. w becomes -np.mean(x * np.sign(y - (w*x+b))), and the gradient w.r.t. b becomes -np.mean(np.sign(y - (w*x+b))). How does the fit change? (Hint: MAE fits the median of y at each x rather than the mean, so a few extreme points pull the line less.)
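
Here is a minimal sketch of the MAE variant on synthetic data with a single large outlier. The data, the function names mae_loss, mae_loss_dw and mae_loss_db, and the learning rate are assumptions for illustration.

```python
import numpy as np

# Synthetic data (an assumption): a clean line plus one large outlier.
x = np.linspace(0.0, 50.0, 40)
y = 0.5 * x + 10.0
y[-1] += 200.0  # a single extreme point at x = 50

def mae_loss(w, b):
    """Mean absolute error of the line y_hat = w*x + b."""
    return np.mean(np.abs(y - (w * x + b)))

def mae_loss_dw(w, b):
    """Subgradient of the MAE with respect to w."""
    return -np.mean(x * np.sign(y - (w * x + b)))

def mae_loss_db(w, b):
    """Subgradient of the MAE with respect to b."""
    return -np.mean(np.sign(y - (w * x + b)))

w, b, lr = 0.0, 0.0, 1e-4
start = mae_loss(w, b)
for _ in range(200):
    # Both subgradients are evaluated at the old (w, b) before updating.
    w, b = w - lr * mae_loss_dw(w, b), b - lr * mae_loss_db(w, b)

print(f"MAE before={start:.2f}  after={mae_loss(w, b):.2f}  w={w:.3f}  b={b:.3f}")
```

Because np.sign discards the size of each residual, every point contributes a step of the same magnitude no matter how far off it is. That is what limits the outlier's pull, and it also means a learning rate tuned for MSE will not necessarily suit MAE.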

Add L2 regularization

Append + lam*w to mse_loss_dw with lam = 0.001. That is the gradient of a penalty (lam/2)*w**2 added to the loss, which makes this ridge regression: it pulls $w$ toward zero. Find a value of lam large enough to visibly shrink the slope.
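
The sketch below shows the regularized gradient in context and compares a few values of lam. The synthetic data, the names ridge_loss_dw and fit, and the specific lam values other than 0.001 are assumptions for illustration.

```python
import numpy as np

# Synthetic data (an assumption): a noiseless line y = 0.5*x + 10.
x = np.linspace(0.0, 50.0, 40)
y = 0.5 * x + 10.0

def ridge_loss_dw(w, b, lam):
    """MSE gradient w.r.t. w plus lam*w, the gradient of (lam/2)*w**2."""
    return -2.0 * np.mean(x * (y - (w * x + b))) + lam * w

def mse_loss_db(w, b):
    """The intercept is left unpenalized, so its gradient is unchanged."""
    return -2.0 * np.mean(y - (w * x + b))

def fit(lam, lr=5e-5, epochs=200):
    """Run plain gradient descent from (0, 0) and return the final (w, b)."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        dw, db = ridge_loss_dw(w, b, lam), mse_loss_db(w, b)
        w, b = w - lr * dw, b - lr * db
    return w, b

for lam in (0.0, 0.001, 1000.0):
    w, b = fit(lam)
    print(f"lam={lam:<8} w={w:.4f}  b={b:.4f}")
```

With unnormalized x the data term in the gradient is large, so lam has to be comparably large before the slope moves noticeably.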

Bad initialization

Click the top-right corner of the MSE surface to set w ≈ 0.65, b ≈ 14. The line starts wildly steep. Does default lr still converge? How many epochs?

Frequently asked

Gradient descent is how the regression line learns its slope $w$ and intercept $b$. We define an error metric (mean squared error) and step both parameters in the direction that reduces it: $w \leftarrow w - \eta \cdot \partial L / \partial w$ and $b \leftarrow b - \eta \cdot \partial L / \partial b$. After enough epochs, $(w, b)$ settles at the values that fit the data best.
If a run diverges, the cause is almost always a learning rate that is too high. The step overshoots the bottom of the MSE bowl and lands further uphill than where it started, so $w$ and $b$ oscillate and then fly off. The dataset’s $x$ values reach 50, so even a modest learning rate gets multiplied by big gradients. Halve lr and try again.
The default learning rate is so small because the input feature is unnormalized: $x \in [0, 50]$ means the gradient w.r.t. $w$ can be in the hundreds. A larger lr would diverge. In real-world pipelines you’d normalize $x$ (subtract the mean, divide by the std) first, after which a much larger lr like 0.01 becomes safe. Try it: prepend DATA_X = (DATA_X - DATA_X.mean()) / DATA_X.std() to the code and crank lr up.
You can change the algorithm: the source on this page is editable. Try adding momentum (keep a running average of past gradients), switching MSE to MAE (absolute error), or adding an L2 regularizer. Click Run to see your version fit the data.