Gradient Descent, visualized
Linear regression learns its slope and intercept by gradient descent. Watch the regression line tilt and shift to fit the data on the left, while the (w, b) marker descends the MSE surface on the right — one epoch at a time. Pick a different dataset to see what gradient descent learns (and where it struggles): clean linear, heavy noise, outliers, a saturating curve, growing variance, or two mixed populations.
The algorithm (edit and re-run)
The math, derived
1. The model.
A linear regression predicts $\hat{y}$ from $x$ using two parameters — a slope $w$ and an intercept $b$:
$$ \hat{y} \;=\; w \, x \,+\, b $$
Our job is to find the $w$ and $b$ that make $\hat{y}$ closest to the true $y$ across all $N$ data points.
2. The loss: mean squared error.
For each example $i$, the residual is $y_i - \hat{y}_i$. We square it (so positive and negative errors don’t cancel) and average:
$$ L(w, b) \;=\; \frac{1}{N} \sum_{i=1}^{N} \bigl(\,y_i - (w\,x_i + b)\,\bigr)^2 $$
3. Differentiate w.r.t. w.
Treat $b$ as a constant. The inner residual depends on $w$ only through the $-w\,x_i$ term, so by the chain rule:
$$ \frac{\partial L}{\partial w} \;=\; \frac{1}{N} \sum_{i=1}^{N} 2\bigl(y_i - (w\,x_i + b)\bigr) \cdot \bigl(-x_i\bigr) $$
$$ \;=\; -\frac{2}{N} \sum_{i=1}^{N} x_i\,\bigl(y_i - (w\,x_i + b)\bigr) $$
That’s exactly mse_loss_dw in the code above; np.mean handles the $\frac{1}{N}\sum$.
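The editor code is not reproduced in this text, so here is a minimal sketch of what a function like mse_loss_dw could look like, assuming x and y are NumPy arrays. The data, the mse_loss helper, and the finite-difference check are illustrative assumptions, not the page's own code.

```python
import numpy as np

def mse_loss(w, b, x, y):
    """Mean squared error of the line w*x + b."""
    return np.mean((y - (w * x + b)) ** 2)

def mse_loss_dw(w, b, x, y):
    """Analytic dL/dw = -(2/N) * sum(x_i * residual_i)."""
    return -2.0 * np.mean(x * (y - (w * x + b)))

# Hypothetical data, for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 50.0, size=100)
y = 0.3 * x + 5.0 + rng.normal(0.0, 1.0, size=100)

# Compare the analytic derivative with a central finite difference
w, b, eps = 0.1, 1.0, 1e-4
numeric_dw = (mse_loss(w + eps, b, x, y) - mse_loss(w - eps, b, x, y)) / (2 * eps)
analytic_dw = mse_loss_dw(w, b, x, y)
print(analytic_dw, numeric_dw)
```

Because the loss is quadratic in $w$, the central difference should agree with the analytic value up to floating-point rounding.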
4. Differentiate w.r.t. b.
Same idea, but the inner term contributes $-1$ for $b$ instead of $-x_i$:
$$ \frac{\partial L}{\partial b} \;=\; -\frac{2}{N} \sum_{i=1}^{N} \bigl(y_i - (w\,x_i + b)\bigr) $$
5. The update rule.
Take a small step opposite each partial derivative, scaled by the learning rate $\eta$:
$$ w \;\leftarrow\; w \,-\, \eta\,\frac{\partial L}{\partial w}, \qquad b \;\leftarrow\; b \,-\, \eta\,\frac{\partial L}{\partial b} $$
Repeat for many epochs. The MSE bowl is convex, so the parameters converge to the unique optimum as long as $\eta$ is small enough. Push $\eta$ past the divergence threshold and watch the marker fly off the canvas.
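Steps 1 to 5 can be put together in a short training loop. This is a sketch under assumptions: the names gradients and fit, the synthetic data, and the starting point w = b = 0 are hypothetical and not taken from the page's editor.

```python
import numpy as np

def gradients(w, b, x, y):
    """Return (dL/dw, dL/db) for the MSE loss."""
    residual = y - (w * x + b)
    dw = -2.0 * np.mean(x * residual)
    db = -2.0 * np.mean(residual)
    return dw, db

def fit(x, y, lr, epochs, w=0.0, b=0.0):
    """Full-batch gradient descent; records the loss after every epoch."""
    history = []
    for _ in range(epochs):
        dw, db = gradients(w, b, x, y)
        w -= lr * dw  # step opposite the gradient
        b -= lr * db
        history.append(np.mean((y - (w * x + b)) ** 2))
    return w, b, history

# Hypothetical data, for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 50.0, size=100)
y = 0.3 * x + 5.0 + rng.normal(0.0, 1.0, size=100)

w, b, history = fit(x, y, lr=5e-5, epochs=200)
print(f"w={w:.3f}, b={b:.3f}, loss {history[0]:.2f} -> {history[-1]:.2f}")
```

With a learning rate this small the loss should fall at every epoch. On unnormalized data like this, the intercept tends to move much more slowly than the slope.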
Try this
Find the divergence threshold
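The default lr = 5e-5 is conservative. Slowly increase it. Around 2e-4 the regression line starts swinging; past 3e-4 it diverges entirely. Why is the threshold so low? Because x reaches 50: $\partial L / \partial w$ multiplies every residual by $x_i$, so the curvature of the loss along $w$ scales with the mean of $x_i^2$, and the step must shrink to match.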
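For a quadratic loss, the threshold can be computed instead of searched for: gradient descent is stable only while lr stays below 2 divided by the largest eigenvalue of the Hessian. The sketch below uses made-up data, so its threshold will not match the 2e-4 to 3e-4 range seen on the page's datasets; the helper final_loss is hypothetical.

```python
import numpy as np

# Hypothetical data, for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 50.0, size=100)
y = 0.3 * x + 5.0 + rng.normal(0.0, 1.0, size=100)

# Hessian of the MSE with respect to (w, b); constant because the loss is quadratic
hessian = 2.0 * np.array([[np.mean(x ** 2), np.mean(x)],
                          [np.mean(x), 1.0]])
lam_max = np.linalg.eigvalsh(hessian)[-1]  # eigvalsh returns ascending eigenvalues
lr_limit = 2.0 / lam_max

def final_loss(lr, epochs=200):
    """Run plain gradient descent from w = b = 0 and return the final MSE."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        residual = y - (w * x + b)
        w += lr * 2.0 * np.mean(x * residual)
        b += lr * 2.0 * np.mean(residual)
    return np.mean((y - (w * x + b)) ** 2)

start_loss = np.mean(y ** 2)  # loss at w = b = 0
below = final_loss(0.9 * lr_limit)
above = final_loss(1.1 * lr_limit)
print(f"lr limit ~ {lr_limit:.2e}, below: {below:.3g}, above: {above:.3g}")
```

Just below the limit the loss shrinks; just above it, the error along the steepest direction grows every epoch and the loss explodes.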
Normalize, then go fast
Add DATA_X = (DATA_X - DATA_X.mean()) / DATA_X.std() at the top of the algorithm. Now $x$ has zero mean and unit variance, so the gradients for $w$ and $b$ are on comparable scales. Crank lr to 0.01 and convergence drops to ~50 epochs.
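To see why standardizing helps, compare the condition number of the loss Hessian before and after. This sketch assumes synthetic data and hypothetical helper names; it also maps the fitted parameters back to the original units of $x$, which the page does not cover.

```python
import numpy as np

# Hypothetical data, for illustration only
rng = np.random.default_rng(0)
x_raw = rng.uniform(0.0, 50.0, size=100)
y = 0.3 * x_raw + 5.0 + rng.normal(0.0, 1.0, size=100)

mean, std = x_raw.mean(), x_raw.std()
x = (x_raw - mean) / std  # zero mean, unit variance

def hessian(x):
    """Hessian of the MSE with respect to (w, b)."""
    return 2.0 * np.array([[np.mean(x ** 2), np.mean(x)],
                           [np.mean(x), 1.0]])

cond_raw = np.linalg.cond(hessian(x_raw))
cond_std = np.linalg.cond(hessian(x))
print(f"condition number: raw {cond_raw:.0f}, standardized {cond_std:.2f}")

# Gradient descent on the standardized input with a much larger step
w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    residual = y - (w * x + b)
    w += lr * 2.0 * np.mean(x * residual)
    b += lr * 2.0 * np.mean(residual)

# Convert back to the original units of x
w_raw = w / std
b_raw = b - w * mean / std
print(f"w={w_raw:.3f}, b={b_raw:.3f}")
```

After standardizing, the Hessian is twice the identity, so the bowl is round and one learning rate suits both parameters.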
Add momentum
Replace step() with momentum GD: v_w = 0.9*v_w − lr*dw; w = w + v_w (same for b). The marker takes a more direct path down the MSE bowl.
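A sketch of the momentum update described above, next to plain gradient descent for comparison. The page's step() function is not shown in this text, so the function name fit, its signature, and the data are assumptions.

```python
import numpy as np

def fit(x, y, lr, epochs, beta=0.0):
    """Gradient descent with momentum; beta=0.0 reduces to plain GD."""
    w = b = 0.0
    v_w = v_b = 0.0
    for _ in range(epochs):
        residual = y - (w * x + b)
        dw = -2.0 * np.mean(x * residual)
        db = -2.0 * np.mean(residual)
        v_w = beta * v_w - lr * dw  # velocity accumulates past gradients
        v_b = beta * v_b - lr * db
        w += v_w
        b += v_b
    return w, b, np.mean((y - (w * x + b)) ** 2)

# Hypothetical data, for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 50.0, size=100)
y = 0.3 * x + 5.0 + rng.normal(0.0, 1.0, size=100)

_, _, loss_plain = fit(x, y, lr=5e-5, epochs=500)
_, _, loss_momentum = fit(x, y, lr=5e-5, epochs=500, beta=0.9)
print(f"plain: {loss_plain:.3f}, momentum: {loss_momentum:.3f}")
```

With the same lr and epoch budget, the velocity term lets the slow-moving intercept travel further, so the momentum run should end at a lower loss on data like this.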
Switch MSE to MAE
Mean absolute error is more robust to outliers. Replace squared error with np.mean(np.abs(...)); the gradient becomes -np.mean(x * np.sign(y - (w*x+b))). How does the fit change? (Hint: it pulls the line toward the median residual instead of the mean.)
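One way to see the robustness claim is to measure how much a single outlier moves each gradient. The sketch below uses made-up data and hypothetical function names (mse_dw, mae_dw); it evaluates both gradients at the true line, with and without one point shifted by 1000.

```python
import numpy as np

def mse_dw(w, b, x, y):
    """dL/dw for mean squared error."""
    return -2.0 * np.mean(x * (y - (w * x + b)))

def mae_dw(w, b, x, y):
    """(Sub)gradient dL/dw for mean absolute error."""
    return -np.mean(x * np.sign(y - (w * x + b)))

# Hypothetical data, for illustration only
rng = np.random.default_rng(0)
x = np.linspace(1.0, 50.0, 50)
y = 0.3 * x + 5.0 + rng.normal(0.0, 1.0, size=x.size)

# Same data with the last point pushed far above the line
y_outlier = y.copy()
y_outlier[-1] += 1000.0

w, b = 0.3, 5.0
mse_shift = abs(mse_dw(w, b, x, y_outlier) - mse_dw(w, b, x, y))
mae_shift = abs(mae_dw(w, b, x, y_outlier) - mae_dw(w, b, x, y))
print(f"MSE gradient shift: {mse_shift:.1f}, MAE gradient shift: {mae_shift:.2f}")
```

The MSE gradient moves in proportion to the outlier's distance from the line, while the MAE gradient only sees the sign of the residual, so its change is bounded no matter how far the outlier sits.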
Add L2 regularization
Append + lam*w to mse_loss_dw with lam = 0.001. That term is the gradient of a penalty $\frac{\lambda}{2} w^2$ added to the loss, which makes this ridge regression. It pulls $w$ toward zero. Find a value of lam large enough to visibly shrink the slope.
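To get a feel for how large lam has to be, one option is to solve for the ridge optimum directly by setting both gradients to zero. This is a sketch on made-up data; ridge_solution is a hypothetical helper, and it penalizes only $w$, matching the + lam*w term above.

```python
import numpy as np

def ridge_solution(x, y, lam):
    """Solve dL/dw + lam*w = 0 and dL/db = 0 for (w, b)."""
    A = np.array([[2.0 * np.mean(x ** 2) + lam, 2.0 * np.mean(x)],
                  [2.0 * np.mean(x), 2.0]])
    rhs = np.array([2.0 * np.mean(x * y), 2.0 * np.mean(y)])
    w, b = np.linalg.solve(A, rhs)
    return w, b

# Hypothetical data, for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 50.0, size=100)
y = 0.3 * x + 5.0 + rng.normal(0.0, 1.0, size=100)

lams = [0.0, 0.001, 1.0, 100.0, 1000.0]
slopes = [ridge_solution(x, y, lam)[0] for lam in lams]
for lam, slope in zip(lams, slopes):
    print(f"lam={lam:<8} w={slope:.4f}")
```

Eliminating $b$ gives $w = \mathrm{cov}(x, y) / (\mathrm{var}(x) + \lambda/2)$, so lam only matters once it is comparable to the variance of $x$. On unnormalized data with a wide range of $x$, that means values far larger than 0.001.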
Bad initialization
Click the top-right corner of the MSE surface to set w ≈ 0.65, b ≈ 14. The line starts wildly steep. Does the default lr still converge? How many epochs does it take?
Frequently asked
[…] lr and try again.
[…] lr would diverge. In real-world pipelines you’d normalize $x$ (subtract the mean, divide by the std) first; then a much larger lr like 0.01 becomes safe. Try it: prepend DATA_X = (DATA_X - DATA_X.mean()) / DATA_X.std() to the code and crank lr up.