Activation Functions, visualized
An activation function is the nonlinearity that turns stacked linear layers into a real neural network. Stack any number of linear layers without one and you still get a single linear layer. The activation is also what backprop multiplies through — so the shape of its derivative is the gradient that learning depends on. Toggle activations below to compare shapes and slopes side by side.
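To make the "still a single linear layer" claim concrete, here is a minimal sketch with two tiny layers written as plain 2x2 matrices. The weights `W1`, `W2` and the input `x` are made-up values for illustration, and the helpers `matvec`, `matmul`, and `relu_vec` are hypothetical names, not part of any library.

```python
def matvec(W, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu_vec(v):
    """Elementwise ReLU."""
    return [max(0.0, t) for t in v]

# Illustrative weights and input (assumed values, not from a real model).
W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[0.5, 0.0], [3.0, 1.0]]
x = [2.0, -3.0]

stacked = matvec(W2, matvec(W1, x))              # layer 1, then layer 2, no activation
collapsed = matvec(matmul(W2, W1), x)            # one layer whose weights are W2 @ W1
with_relu = matvec(W2, relu_vec(matvec(W1, x)))  # same layers with ReLU in between

print(stacked, collapsed, with_relu)
```

Without the activation, the stacked and collapsed versions give the same output; with ReLU in between, the result can no longer be reproduced by the single collapsed matrix.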
Formula reference: function, derivative, range, when to use
Each card below is one activation. The function is what the forward pass computes; the derivative is what gets multiplied through during backpropagation. Where derivatives are zero or saturated, gradients vanish.
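Written out for a single unit, the relationship looks like this. The symbols are assumptions introduced here for illustration: $z$ is the pre-activation, $f$ the activation, $a = f(z)$ the output, and $L$ the loss.

$$
a = f(z), \qquad \frac{\partial L}{\partial z} = \frac{\partial L}{\partial a}\, f'(z)
$$

Whenever $f'(z) \approx 0$, the factor on the right wipes out whatever gradient arrived from the layers above.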
Try this
See the vanishing gradient
Turn on Sigmoid only. Look at the derivative panel: the peak is 0.25 at x = 0, and it falls toward zero as |x| grows. Stack 5 sigmoids and the gradient gets multiplied by at most $0.25^5 \approx 0.001$, even in the best case: gone before the first layer sees it.
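The arithmetic is easy to check by hand. A minimal sketch using only the standard library; the function names are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

peak = sigmoid_grad(0.0)            # the largest value the derivative ever takes
best_case_5_layers = peak ** 5      # five sigmoids, each at its peak slope
far_from_zero = sigmoid_grad(6.0)   # a saturated input, for comparison

print(peak, best_case_5_layers, far_from_zero)
```

Real networks rarely sit exactly at the peak in every layer, so the actual shrinkage is usually worse than this best case.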
Why Tanh beats Sigmoid in hidden layers
Now switch to Tanh. Peak derivative is 1.0, four times sigmoid’s. And outputs are symmetric around zero, so gradients don’t all share the same sign. This is why Tanh outlived Sigmoid for a decade in the hidden layers of MLPs.
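Both claims, the 4x peak and the zero-centered outputs, can be checked numerically. A small sketch with hand-picked sample inputs (the values in `xs` are arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

# Ratio of the two peak slopes, both taken at x = 0.
peak_ratio = tanh_grad(0.0) / sigmoid_grad(0.0)

# Sigmoid outputs are always positive; tanh outputs follow the sign of x.
xs = [-2.0, -0.5, 0.5, 2.0]
sigmoid_outputs = [sigmoid(x) for x in xs]
tanh_outputs = [math.tanh(x) for x in xs]

print(peak_ratio, sigmoid_outputs, tanh_outputs)
```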
The dying ReLU corridor
ReLU on, set x min = -10. Look at the derivative for x < 0: it’s flat zero. If a neuron lands here and its input distribution stays negative, every gradient is zero — the weight never updates again. Now turn on Leaky ReLU — the derivative is α, not 0. The neuron can come back.
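Here is a toy version of that corridor: one weight, one input, one gradient step. Everything in it is an assumption for illustration, including the slope `alpha=0.01` (a common choice, not something this page specifies), the learning rate, and the hypothetical helper `sgd_step`.

```python
def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, alpha=0.01):
    # alpha is the slope on the negative side (assumed value)
    return 1.0 if z > 0 else alpha

def sgd_step(w, x, upstream, act_grad, lr=0.1):
    """One gradient step for a single neuron a = f(w * x)."""
    z = w * x
    grad_w = upstream * act_grad(z) * x   # chain rule through the activation
    return w - lr * grad_w

w0, x = -2.0, 1.5   # pre-activation z = -3.0, deep in the negative region
w_relu = sgd_step(w0, x, upstream=-1.0, act_grad=relu_grad)
w_leaky = sgd_step(w0, x, upstream=-1.0, act_grad=leaky_relu_grad)

print(w_relu, w_leaky)
```

With ReLU the weight does not move at all; with Leaky ReLU it moves a little, which is what lets the neuron eventually leave the negative region.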
GELU exact vs tanh approximation
Enable GELU, then flip the dropdown between Exact (erf) and Tanh approx. Watch the curves: they overlap almost everywhere. The approximation is ~30% faster on GPUs without hardware erf, with quality loss too small to measure in training.
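If you want a number for how closely they overlap, a short sketch can measure it directly. It uses the standard erf-based definition of GELU and the usual tanh form with the constant 0.044715; the sampling grid over [-5, 5] is an arbitrary choice.

```python
import math

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF written via erf
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh-based approximation of the same curve
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Sample [-5, 5] in steps of 0.01 and record the largest absolute gap.
xs = [i / 100 for i in range(-500, 501)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)

print(f"largest gap on [-5, 5]: {max_gap:.2e}")
```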
Self-gating: Swish vs ReLU
Side by side, enable ReLU and Swish. Near x = 0 Swish is smooth and dips slightly negative, where ReLU has a hard corner. Google adopted Swish in EfficientNet; the usual explanation is that the smoother curve gives the optimizer better gradients without changing anything else in the architecture.
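To see the difference in numbers rather than curves, here is a small sketch that evaluates both at a few points. It assumes the common form of Swish with β = 1, that is x · sigmoid(x); the sample points are arbitrary.

```python
import math

def relu(x):
    return max(0.0, x)

def swish(x):
    # x * sigmoid(x): the input gates itself
    return x / (1.0 + math.exp(-x))

for x in (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  swish={swish(x):+.4f}")
```

For negative inputs ReLU is exactly zero while Swish is small and negative; for large positive inputs the two nearly coincide.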
Why output range matters
Look at Sigmoid vs Identity. Sigmoid clamps to (0, 1) — perfect for “is this a cat? probability”. Identity is unbounded — perfect for “how much will this house sell for?”. Same activation idea, opposite use cases. Output layer = pick the range that matches your target.
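A minimal sketch of the two output heads. The raw scores below are made-up numbers standing in for whatever the last linear layer produces.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def identity(z):
    return z

# Hypothetical raw outputs of a final linear layer.
cat_score = 2.0          # classifier head
price_score = 350000.0   # regression head

cat_probability = sigmoid(cat_score)       # squashed into (0, 1)
predicted_price = identity(price_score)    # passed through unchanged

print(cat_probability, predicted_price)
```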
Frequently asked
max(0, x); (2) doesn’t saturate on the positive side, so gradients flow freely; (3) produces sparse activations (about half the neurons output 0 at any time), which acts as implicit regularization. Sigmoid and Tanh saturate at both ends, killing gradients in deep networks: that’s the vanishing gradient problem.

erf function). The tanh form $\tfrac{1}{2} x \bigl(1 + \tanh[\sqrt{2/\pi}\,(x + 0.044715\,x^3)]\bigr)$ approximates it ~30% faster on GPUs that don’t have hardware erf. The two curves differ by an amount on the order of $10^{-4}$ across the typical input range, so they are interchangeable for training.