Perceptron, visualized


The perceptron (Rosenblatt, 1958) is the original neural network: one neuron that learns to separate two classes with a hyperplane. The plane below has normal vector w; train it and watch w rotate as each misclassified point pushes the boundary back into place. For probabilities and gradient descent, see Logistic Regression.

[Interactive figure: decision hyperplane w·x + b = 0 with its normal vector w. Legend: class +1, class −1, misclassified (currently wrong) points.]
[Plot: hinge loss per epoch. Readouts: epoch, hinge loss, misclassified out of N points.]

Tune the model

Parameters: the hyperplane is defined by these. Training learns them; you can also drag them by hand.
w₀: x-component of the normal
w₁: y-component of the normal
w₂: z-component of the normal
b: bias, the offset of the plane from the origin
Hyperparameter: you set this. It controls how big each update step is.
η: learning rate, the size of each weight update
Preset: linearly separable. The perceptron converges in tens of epochs.

The math, derived

1. The model.

The perceptron checks which side of a hyperplane the input lands on:

$$ \hat{y} \;=\; \mathrm{sign}(w \cdot x + b) $$

$w \in \mathbb{R}^d$ is the hyperplane’s normal vector, $b$ is the offset from origin, $\hat{y} \in \{-1, +1\}$ is the predicted class.
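As a minimal sketch of the decision rule above, the snippet below evaluates $\mathrm{sign}(w \cdot x + b)$ for a 3D point. The function name `predict` and the example plane are illustrative, not taken from the page; a score of exactly zero is mapped to −1 here, which is one possible convention.

```python
def predict(w, b, x):
    """Return +1 or -1 depending on which side of the plane w.x + b = 0 the point x lies."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    # Assumption: a point exactly on the plane (score == 0) is assigned to class -1.
    return 1 if score > 0 else -1


w = [1.0, 0.0, 0.0]   # normal vector pointing along the x-axis
b = -0.5              # with this bias the plane sits at x = 0.5

print(predict(w, b, [2.0, 1.0, -3.0]))   # x-coordinate 2.0 lies on the positive side
print(predict(w, b, [0.0, 4.0, 1.0]))    # x-coordinate 0.0 lies on the negative side
```

Only the component of x along w matters; the y and z coordinates in this example have no effect on the prediction because the normal has no y or z component.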

2. The loss — hinge.

The loss is zero on correctly classified points and positive on misclassified ones, growing with how far into the wrong region the point sits:

$$ L_i \;=\; \max\big(0,\, -y_i\,(w \cdot x_i + b)\big) $$

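A point is correctly classified when $y_i (w \cdot x_i + b) > 0$. This is a hinge loss with zero margin, often called the perceptron criterion; the SVM hinge loss $\max(0,\, 1 - y_i (w \cdot x_i + b))$ additionally penalizes correct points that sit inside the margin. The perceptron version only "cares" about points the current hyperplane gets wrong.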
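To make the formula concrete, here is a small sketch that evaluates $L_i$ for one point on the correct side and one on the wrong side. The name `perceptron_loss` and the example plane are illustrative assumptions.

```python
def perceptron_loss(w, b, x, y):
    """Loss L_i = max(0, -y * (w.x + b)) for one labeled point (y is +1 or -1)."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return max(0.0, -y * score)


w, b = [1.0, 0.0, 0.0], -0.5

# Correct side: score = 1.5 with y = +1, so the loss is zero.
loss_correct = perceptron_loss(w, b, [2.0, 0.0, 0.0], +1)
# Wrong side: score = -2.5 with y = +1, so the loss equals |score| = 2.5.
loss_wrong = perceptron_loss(w, b, [-2.0, 0.0, 0.0], +1)
print(loss_correct, loss_wrong)
```

Because the loss equals $|w \cdot x_i + b|$ on a misclassified point, it scales with the norm of w as well as with the point's distance from the plane.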

3. The gradient.

The loss has a kink where $y_i\,(w \cdot x_i + b) = 0$, so we work with a subgradient. It is piecewise, non-zero only on the misclassified points:

$$ \frac{\partial L_i}{\partial w} \;=\; \begin{cases} -\,y_i\,x_i & \text{if } y_i\,(w \cdot x_i + b) \le 0 \\ 0 & \text{otherwise} \end{cases} $$

Correct points contribute nothing — the perceptron has no error signal on them. That’s the entire algorithm in one line.
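The same case split applies to the bias. Treating $b$ as one more parameter of $L_i$ and differentiating the active branch $-y_i\,(w \cdot x_i + b)$ gives:

$$ \frac{\partial L_i}{\partial b} \;=\; \begin{cases} -\,y_i & \text{if } y_i\,(w \cdot x_i + b) \le 0 \\ 0 & \text{otherwise} \end{cases} $$

So a misclassified point moves the bias by its label alone, independent of where the point sits.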

4. The update — the perceptron rule.

Step opposite the gradient. After substitution:

$$ w \;\leftarrow\; w \,+\, \eta\, y_i\, x_i \qquad b \;\leftarrow\; b \,+\, \eta\, y_i $$

Apply this only when $x_i$ is misclassified. In words: nudge $w$ toward misclassified positives and away from misclassified negatives. That’s it.
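The update rule can be sketched as a full training loop. This is a minimal illustration, not the page's implementation: the function name `train_perceptron`, the zero initialization, the `max_epochs` cap, and the toy data are all assumptions.

```python
def train_perceptron(points, labels, eta=1.0, max_epochs=100):
    """Apply the perceptron rule until one full pass makes no mistakes.

    Returns (w, b, epochs_used, total_updates). Starting from zero weights
    and capping at max_epochs are assumptions of this sketch.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    total_updates = 0
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wj * xj for wj, xj in zip(w, x)) + b
            if y * score <= 0:  # misclassified, or exactly on the plane
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
                mistakes += 1
        total_updates += mistakes
        if mistakes == 0:  # a clean pass means every point is on its correct side
            return w, b, epoch, total_updates
    return w, b, max_epochs, total_updates


# Hypothetical toy data: the class is decided by the sign of the first coordinate.
points = [[2.0, 1.0], [1.0, -1.0], [-1.0, 0.5], [-2.0, -1.0]]
labels = [+1, +1, -1, -1]
w, b, epochs, updates = train_perceptron(points, labels)
print(w, b, epochs, updates)
```

The loop stops on the first epoch with zero mistakes, so the reported epoch count always includes one final pass that only verifies the result.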

5. Novikoff’s convergence theorem.

If the data is linearly separable with margin $\gamma > 0$ (some unit-norm separator keeps every point at distance at least $\gamma$ on its correct side) and $\|x_i\| \le R$ for all $i$, then the perceptron converges in at most

$$ \left(\frac{R}{\gamma}\right)^2 \text{ updates} $$

regardless of dimension. This guarantee made the perceptron historically important: it was one of the first learning algorithms with a formal convergence proof (Rosenblatt 1958; Novikoff 1962). On the XOR and shells presets, no such $\gamma$ exists, so the algorithm runs forever without converging.
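The bound can be checked numerically on a small example. The sketch below uses the classic setting of the theorem as an assumption: no bias term (a separator through the origin), zero initial weights, and η = 1. The toy data and the separator `u` are hypothetical; any unit-norm separator yields a valid, possibly loose, bound.

```python
import math


def count_updates(points, labels, max_epochs=1000):
    """Perceptron without bias, from w = 0 with eta = 1; returns (w, total updates)."""
    w = [0.0] * len(points[0])
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if y * sum(wj * xj for wj, xj in zip(w, x)) <= 0:
                w = [wj + y * xj for wj, xj in zip(w, x)]
                mistakes += 1
        updates += mistakes
        if mistakes == 0:
            break
    return w, updates


# Hypothetical toy data, separable by a line through the origin.
points = [[1.0, 3.0], [2.0, -2.0], [3.0, 1.0], [-1.0, -2.0], [-3.0, 1.0], [-2.0, 2.5]]
labels = [+1, +1, +1, -1, -1, -1]

u = [1.0, 0.0]  # a known unit-norm separator for this data
gamma = min(y * sum(uj * xj for uj, xj in zip(u, x)) for x, y in zip(points, labels))
R = max(math.sqrt(sum(xj * xj for xj in x)) for x in points)
bound = (R / gamma) ** 2

w, updates = count_updates(points, labels)
print(updates, "updates; Novikoff bound:", bound)
```

The observed update count is typically far below the bound, which is a worst-case guarantee rather than a prediction.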

Try this

Watch Novikoff at work

On two blobs click Train 100 epochs. Convergence usually hits in 10–40 epochs. The number of weight updates made before the green banner appears is bounded by $(R/\gamma)^2$: a smaller margin or data with larger norms means more updates may be needed.

The XOR wall

Switch to XOR and Auto-train. Loss never reaches zero; the plane keeps rotating forever. This is the limitation Minsky & Papert (1969) highlighted, and it is widely credited with stalling perceptron research for about a decade.
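The same behavior shows up in a few lines of code. This sketch uses the four-corner 2D version of XOR as an assumption (the preset on the page may be laid out differently) and records how many mistakes each epoch makes; the helper name `mistakes_per_epoch` is illustrative.

```python
def mistakes_per_epoch(points, labels, eta=1.0, epochs=50):
    """Run the perceptron rule for a fixed number of epochs; return mistakes per epoch."""
    w = [0.0] * len(points[0])
    b = 0.0
    history = []
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wj * xj for wj, xj in zip(w, x)) + b
            if y * score <= 0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
                mistakes += 1
        history.append(mistakes)
    return history


# 2D XOR: opposite corners share a label, so no single line separates the classes.
xor_points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
xor_labels = [-1, +1, +1, -1]

history = mistakes_per_epoch(xor_points, xor_labels)
print(min(history))  # never 0: every epoch contains at least one mistake
```

A mistake-free epoch would mean the current weights separate the data, which is impossible for XOR, so the minimum stays at one or more no matter how long training runs.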

Overlap = endless oscillation

On overlap, the perceptron oscillates forever: even though the data is almost separable, the few boundary points keep pushing the plane back and forth.

Hand-set the plane

Without training, slide w₀, w₁, w₂, b until the plane visually separates the two blobs. Compare your manual fit to what training finds.

Big η → erratic jumps

Crank η to 0.3 and train. The plane flips wildly each update; loss spikes. On separable data the perceptron does still converge (starting from zero weights, Novikoff’s bound doesn’t depend on η), but the trajectory is ugly.
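Why η drops out of the bound can be seen directly: starting from zero weights, changing η only rescales w and b, so the sequence of mistakes is identical. The sketch below checks this with exact rational arithmetic (to avoid floating-point ties); the helper `run` and the toy data are hypothetical.

```python
from fractions import Fraction


def run(points, labels, eta, max_epochs=100):
    """Perceptron from zero weights; returns (w, b, number of updates)."""
    w = [Fraction(0)] * len(points[0])
    b = Fraction(0)
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if y * (sum(wj * xj for wj, xj in zip(w, x)) + b) <= 0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
                mistakes += 1
        updates += mistakes
        if mistakes == 0:
            break
    return w, b, updates


# Hypothetical separable toy data with integer coordinates.
points = [[1, 3], [2, -2], [3, 1], [-1, -2], [-3, 1], [-2, 2]]
labels = [+1, +1, +1, -1, -1, -1]

eta = Fraction(3, 10)  # eta = 0.3, represented exactly
w1, b1, n1 = run(points, labels, Fraction(1))
w2, b2, n2 = run(points, labels, eta)
print(n1, n2)  # same number of updates for both learning rates
```

If training starts from non-zero weights, for example after setting the plane by hand, this exact scaling no longer holds, which is one way η can change the trajectory.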

What about LogReg?

Logistic regression on the same data gives smooth, differentiable updates from probabilities. Same boundary class, gentler optimization. Both fail XOR.
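The difference in update behavior can be sketched on a single point that is already classified correctly. The two functions below are illustrative, not the page's code; the logistic step assumes labels in {−1, +1} and the loss $\log(1 + e^{-y\,(w \cdot x + b)})$.

```python
import math


def perceptron_step(w, b, x, y, eta):
    """Perceptron rule: change (w, b) only if the point is misclassified."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    if y * score <= 0:
        return [wj + eta * y * xj for wj, xj in zip(w, x)], b + eta * y
    return list(w), b


def logistic_step(w, b, x, y, eta):
    """One gradient step on log(1 + exp(-y * score)) with y in {-1, +1}."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    scale = 1.0 / (1.0 + math.exp(y * score))  # sigma(-y * score), always in (0, 1)
    return [wj + eta * scale * y * xj for wj, xj in zip(w, x)], b + eta * scale * y


w, b = [1.0, 0.0], 0.0
x, y = [2.0, 1.0], +1  # score = 2.0, so the point is already on the correct side

p_w, p_b = perceptron_step(w, b, x, y, eta=0.1)
l_w, l_b = logistic_step(w, b, x, y, eta=0.1)
print(p_w, p_b)  # unchanged: no error signal on a correct point
print(l_w, l_b)  # slightly moved: the logistic gradient is never exactly zero
```

The logistic step has the same direction as the perceptron update but is weighted by how unsure the model is, which is what makes its trajectory smoother.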

In one glance

⚠️ Watch out: hard predictions only (no probabilities); linear-only (XOR fails); no calibration; oscillates on noise; no regularization.
🔧 In practice: sklearn.linear_model.Perceptron; averaged perceptron; kernel perceptron; passive-aggressive variants; precursor to SVM.

Frequently asked

What is a perceptron?

The original neural network (Rosenblatt, 1958): a single neuron that takes a weighted sum of inputs plus a bias and outputs +1 if the result is positive, −1 otherwise. Training adjusts the weights using the perceptron rule: $w \leftarrow w + \eta \cdot y \cdot x$ for each misclassified point. It is the simplest learnable model and the ancestor of modern neural networks.

How does it differ from logistic regression?

Same model class (a linear separator), but different output and loss. The perceptron predicts a hard $\pm 1$ class via $\mathrm{sign}(w \cdot x + b)$; logistic regression predicts a probability via $\sigma(w \cdot x + b)$. The perceptron uses a zero-margin hinge loss (zero on correct, positive on wrong); LogReg uses cross-entropy (smooth, differentiable everywhere). LogReg therefore gives probabilities and trains with standard gradient methods; the perceptron update is cheaper but yields only hard labels.

What does Novikoff’s convergence theorem say?

If the training data is linearly separable with margin $\gamma$ (the distance from the closest point to the optimal hyperplane), and all points satisfy $\|x\| \le R$, then the perceptron algorithm converges in at most $(R/\gamma)^2$ updates, regardless of dimension. This guarantee made the perceptron historically important: it was one of the first machine-learning algorithms with a formal convergence proof.

When does the perceptron fail?

Whenever the data is not linearly separable. Classic examples: XOR (no plane separates the four corners in 2D, or the eight in 3D), concentric shells (both classes are centered on the origin), and most noisy real-world classification problems. Minsky and Papert analyzed this limitation in 1969, and it is widely credited with stalling perceptron research for about a decade until multi-layer networks became trainable. This is exactly the failure mode the XOR preset demonstrates here.