Perceptron, visualized
The perceptron (Rosenblatt, 1958) is one of the earliest trainable neural networks: a single neuron that learns to separate two classes with a hyperplane. The plane below has normal vector w; train it and watch w rotate as each misclassified point pushes the boundary back into place. For probabilities and gradient descent, see Logistic Regression →.
Tune the model
Linearly separable — perceptron converges in tens of epochs.
The math, derived
1. The model.
The perceptron checks which side of a hyperplane the input lands on:
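$$ \hat{y} \;=\; \mathrm{sign}(w \cdot x + b) $$
Here $w \in \mathbb{R}^d$ is the hyperplane’s normal vector, $b$ is the bias that shifts the hyperplane off the origin (its signed distance from the origin along $w$ is $-b/\|w\|$), and $\hat{y} \in \{-1, +1\}$ is the predicted class.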
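As a minimal sketch of this decision rule in plain Python: the function name `predict` and the example weights are illustrative assumptions, not taken from the page's widget, and a score of exactly zero is sent to +1 here, which is only one possible convention.

```python
def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Assumption: a score of exactly 0 maps to +1 (sign(0) is a convention).
    return 1 if score >= 0 else -1

w, b = [1.0, -1.0], 0.5              # hypothetical weights and bias
print(predict(w, b, [2.0, 0.0]))     # score = 2.5, positive side
print(predict(w, b, [0.0, 2.0]))     # score = -1.5, negative side
```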
2. The loss: the perceptron criterion.
Zero on correctly classified points; positive on misclassified ones, growing linearly with how far inside the wrong region the point sits:
$$ L_i \;=\; \max\big(0,\, -y_i\,(w \cdot x_i + b)\big) $$
A point is correctly classified when $y_i (w \cdot x_i + b) > 0$. This is a hinge with zero margin: unlike the SVM hinge loss $\max(0,\, 1 - y_i\,(w \cdot x_i + b))$, it only "cares" about points the current hyperplane gets wrong.
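The loss above can be sketched as a small function. The name `perceptron_loss` and the sample numbers are assumptions chosen for illustration.

```python
def perceptron_loss(w, b, x, y):
    """Loss max(0, -y * (w.x + b)) for one example with label y in {-1, +1}."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, -y * score)

w, b = [1.0, -1.0], 0.5                        # hypothetical weights and bias
print(perceptron_loss(w, b, [2.0, 0.0], +1))   # score 2.5, right side: 0.0
print(perceptron_loss(w, b, [0.0, 2.0], +1))   # score -1.5, wrong side: 1.5
```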
3. The gradient.
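The subgradient is piecewise, non-zero only on the misclassified points:
$$ \frac{\partial L_i}{\partial w} \;=\; \begin{cases} -\,y_i\,x_i & \text{if } y_i\,(w \cdot x_i + b) \le 0 \\ 0 & \text{otherwise} \end{cases} \qquad \frac{\partial L_i}{\partial b} \;=\; \begin{cases} -\,y_i & \text{if } y_i\,(w \cdot x_i + b) \le 0 \\ 0 & \text{otherwise} \end{cases} $$
Correct points contribute nothing: the perceptron has no error signal on them. Points exactly on the boundary are treated as mistakes, so they still trigger an update. That’s the entire algorithm in one line.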
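A sketch of the piecewise subgradient in code. The function name is an assumption, and the bias component $-y_i$ is included on the assumption that it is derived the same way as the weight component.

```python
def perceptron_subgradient(w, b, x, y):
    """Return (dL/dw, dL/db) for one example with label y in {-1, +1}."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    if y * score <= 0:
        # Misclassified or exactly on the boundary: subgradient is (-y*x, -y).
        return [-y * xi for xi in x], -y
    # Correctly classified: no error signal at all.
    return [0.0] * len(x), 0.0

w, b = [1.0, -1.0], 0.5                               # hypothetical values
print(perceptron_subgradient(w, b, [0.0, 2.0], +1))   # wrong side: non-zero
print(perceptron_subgradient(w, b, [2.0, 0.0], +1))   # right side: all zeros
```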
4. The update — the perceptron rule.
Step opposite the gradient. After substitution:
$$ w \;\leftarrow\; w \,+\, \eta\, y_i\, x_i \qquad b \;\leftarrow\; b \,+\, \eta\, y_i $$
(Only when $x_i$ is misclassified.) "Nudge $w$ in the direction of misclassified positives and away from misclassified negatives." That’s it.
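Putting the rule in a loop gives a complete trainer. The sketch below fills in details the text leaves open, all of them assumptions: zero initial weights, a fixed visiting order, a point exactly on the boundary counts as misclassified, and training stops after the first epoch with no updates. The four data points are made up.

```python
def train_perceptron(points, labels, eta=1.0, max_epochs=100):
    """Return (w, b, converged_epoch or None, total_updates)."""
    w, b = [0.0] * len(points[0]), 0.0
    updates = 0
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                      # misclassified
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
                mistakes += 1
        updates += mistakes
        if mistakes == 0:                           # a clean pass: done
            return w, b, epoch, updates
    return w, b, None, updates

points = [(2, 2), (3, 1), (-2, -1), (-1, -3)]       # hypothetical toy data
labels = [1, 1, -1, -1]
w, b, converged_epoch, updates = train_perceptron(points, labels)
print(w, b, converged_epoch, updates)
```

Counting updates separately from epochs matters later: the convergence guarantee is stated in updates, not epochs.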
5. Novikoff’s convergence theorem.
If the data is linearly separable with margin $\gamma > 0$ (some unit-norm $w^*$, with the bias folded in, satisfies $y_i\,(w^* \cdot x_i) \ge \gamma$ for all $i$) and $\|x_i\| \le R$ for all $i$, then the perceptron, started from zero weights, converges after at most
$$ \left(\frac{R}{\gamma}\right)^2 \text{ updates} $$
regardless of dimension. This guarantee made the perceptron historically important: it was among the first learning algorithms to come with a convergence proof (Rosenblatt 1958; Novikoff 1962). On the XOR and shells presets, no such $\gamma$ exists, so the algorithm runs forever without converging.
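To make the bound concrete, the sketch below evaluates $(R/\gamma)^2$ for a toy dataset. Several things here are assumptions for illustration: the data, the helper name `novikoff_bound`, the hand-picked separator `u`, and the choice to fold the bias in by appending a constant 1 to every point. Any separating `u` gives a valid bound; the best one gives the tightest.

```python
import math

def novikoff_bound(points, labels, u):
    """Return (R / gamma)^2 for a separating direction u (bias folded in)."""
    # Fold the bias into the weights by appending a constant 1 to each point.
    aug = [list(x) + [1.0] for x in points]
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    unit = [ui / norm_u for ui in u]
    # gamma: the smallest signed margin under the unit-norm separator.
    gamma = min(y * sum(ui * xi for ui, xi in zip(unit, x))
                for x, y in zip(aug, labels))
    if gamma <= 0:
        raise ValueError("u does not separate the data")
    # R: the largest norm among the augmented points.
    R = max(math.sqrt(sum(xi * xi for xi in x)) for x in aug)
    return (R / gamma) ** 2

points = [(2, 2), (3, 1), (-2, -1), (-1, -3)]   # hypothetical toy data
labels = [1, 1, -1, -1]
u = [1.0, 1.0, 0.0]                             # assumed separator x0 + x1 = 0
print(novikoff_bound(points, labels, u))        # about 2.44
```

Here $R^2 = 11$ and $\gamma^2 = 9/2$, so the bound is $22/9 \approx 2.44$: a perceptron started from zero weights can make at most 2 updates on this data.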
Try this
Watch Novikoff at work
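On two blobs, click Train 100 epochs. Convergence usually takes 10–40 epochs. Novikoff bounds the total number of updates by $(R/\gamma)^2$, so the epoch count in the green banner is bounded too: every epoch before convergence contains at least one update. A smaller margin, or points farther from the origin (larger $R$), allows more updates.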
The XOR wall
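Switch to XOR and Auto-train. Loss never reaches zero; the plane keeps moving forever, because no single hyperplane separates XOR. Minsky & Papert (1969) highlighted limitations of this kind in single-layer perceptrons, and their book is widely credited with stalling perceptron research for roughly a decade.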
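The same behaviour can be reproduced outside the widget. The sketch below runs the perceptron rule on the four XOR corners and records the mistakes made in each epoch; the 0/1 coordinates, η = 0.1, zero initial weights, and the 200-epoch cap are all assumptions.

```python
def mistakes_per_epoch(points, labels, eta=0.1, epochs=200):
    """Run the perceptron rule and return the mistake count of every epoch."""
    w, b = [0.0] * len(points[0]), 0.0
    counts = []
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                      # misclassified
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
                mistakes += 1
        counts.append(mistakes)
    return counts

xor_points = [(0, 0), (0, 1), (1, 0), (1, 1)]       # assumed XOR layout
xor_labels = [-1, 1, 1, -1]
counts = mistakes_per_epoch(xor_points, xor_labels)
print(min(counts))   # never 0: every epoch contains at least one mistake
```

An epoch with zero mistakes would mean the current plane separates all four points, which is impossible for XOR, so raising the epoch cap does not help.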
Overlap = endless oscillation
On overlap, the perceptron oscillates forever. Even though the data is almost separable, the few points on the wrong side of any boundary keep pushing the plane back and forth.
Hand-set the plane
Without training, slide w₀, w₁, w₂, b until the plane visually separates the two blobs. Compare your manual fit to what training finds.
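Big η → wild jumps
Crank η to 0.3 and train. The plane flips wildly each update; loss spikes. On separable data the perceptron does still converge: Novikoff’s bound has no η in it when the weights start at zero, and from any other starting point the number of updates is still finite. The trajectory is just ugly.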
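One way to see why η drops out of the guarantee: starting from zero weights, η only rescales $w$ and $b$, so the sign of every score, and therefore the sequence of updates, is the same for any η > 0. The sketch below checks this with exact rational arithmetic (`fractions.Fraction`) to avoid floating-point ties. The data, the helper name `run`, and the zero initialization are assumptions; a trainer that starts from non-zero weights will behave differently.

```python
from fractions import Fraction

def run(points, labels, eta, max_epochs=100):
    """Train from zero weights; return (w, b, indices of updated points)."""
    w, b = [Fraction(0)] * len(points[0]), Fraction(0)
    history = []
    for _ in range(max_epochs):
        clean = True
        for i, (x, y) in enumerate(zip(points, labels)):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                      # misclassified
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b = b + eta * y
                history.append(i)
                clean = False
        if clean:                                   # converged
            break
    return w, b, history

points = [(2, 0), (0, 0), (0, 2), (1, -1), (3, 3), (-1, 1)]   # toy data
labels = [1, -1, 1, -1, 1, -1]
w_small, b_small, h_small = run(points, labels, Fraction(1, 100))
w_big, b_big, h_big = run(points, labels, Fraction(3, 10))
print(h_small == h_big)                          # same update sequence
print([wi * 30 for wi in w_small] == w_big)      # weights differ only by scale
```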
What about LogReg?
Logistic regression on the same data gives smooth, differentiable updates from probabilities. Same boundary class, gentler optimization. Both fail XOR.
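The difference shows up in how strongly each point pulls on $w$. With labels in $\{-1, +1\}$ and the log-loss $\log(1 + e^{-y s})$ on the score $s = w \cdot x + b$, a gradient step is $w \leftarrow w + \eta\,\sigma(-y s)\,y\,x$, so the perceptron's all-or-nothing factor is replaced by a smooth one. A small sketch of the two factors follows; the function names and sample scores are assumptions.

```python
import math

def perceptron_weight(score, y):
    """Perceptron update factor: 1 on mistakes, 0 otherwise."""
    return 1.0 if y * score <= 0 else 0.0

def logistic_weight(score, y):
    """Logistic update factor sigma(-y * score), strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(y * score))

# Hypothetical scores w.x + b for a point whose true label is +1.
for score in (2.5, 0.2, -1.5):
    print(score, perceptron_weight(score, 1), round(logistic_weight(score, 1), 3))
```

Correctly classified points still get a small, non-zero logistic factor, which is why logistic regression keeps adjusting the boundary after the perceptron would have stopped.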