Activation Functions, visualized

An activation function is the nonlinearity that turns stacked linear layers into a real neural network. Stack any number of linear layers without one and you still get a single linear layer. The activation is also what backprop multiplies through — so the shape of its derivative is the gradient that learning depends on. Toggle activations below to compare shapes and slopes side by side.
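The collapse of stacked linear layers can be checked directly. The sketch below is a minimal illustration in plain Python; the weights `W1`, `b1`, `W2`, `b2` and the helper names are made up for the example, not taken from the page. It composes two linear layers, builds the single equivalent layer, and shows that putting a ReLU between them breaks the equivalence.

```python
# Two stacked linear layers with no activation collapse into one linear layer.
# All weights below are hypothetical values chosen only for illustration.

def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

W1, b1 = [[1.0, 2.0], [0.5, -1.0]], [0.1, -0.2]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [0.3, 0.4]

def two_layers(x):
    # Layer 2 applied to layer 1, no activation in between.
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The equivalent single layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_layer(x):
    return add(matvec(W, x), b)

def relu_between(x):
    # Same two layers, but with a ReLU on the hidden values.
    hidden = [max(0.0, h) for h in add(matvec(W1, x), b1)]
    return add(matvec(W2, hidden), b2)

x = [1.0, 2.0]
print(two_layers(x), one_layer(x), relu_between(x))
```

The first two printed vectors agree up to floating-point rounding; the third differs because the ReLU zeroes a negative hidden value that the purely linear stack passes through.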

Interactive plot: f(x) in the left panel; f′(x), the gradient backprop multiplies, in the derivative panel. Axis controls: x min, x max, y min, y max.
Tip: the derivative panel uses dashed lines in the same color as the activation in the left panel.

Formula reference: function, derivative, range, when to use

Each card below is one activation. The function is what the forward pass computes; the derivative is what gets multiplied through during backpropagation. Where derivatives are zero or saturated, gradients vanish.

Identity range: (−∞, ∞)
$$ f(x) = x $$
$$ f'(x) = 1 $$
Output layer for regression. Never as a hidden activation — defeats the purpose.
Sigmoid range: (0, 1)
$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
$$ \sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr) $$
Output for binary classification (probability). Avoid as a hidden activation: it saturates for large |x|, where the derivative approaches zero and kills gradients.
Tanh range: (−1, 1)
$$ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $$
$$ \tanh'(x) = 1 - \tanh^2(x) $$
Zero-centered alternative to sigmoid. Still saturates, so it is mostly historical now; the candidate and output squashing inside LSTM cells is the holdout.
ReLU range: [0, ∞)
$$ f(x) = \max(0,\,x) $$
$$ f'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \le 0 \end{cases} $$
The default hidden activation since AlexNet (2012). Cheap, sparse, gradient-friendly for positives. A neuron whose pre-activation is negative for every input gets zero gradient and stops learning: the "dying ReLU" problem.
Leaky ReLU range: (−∞, ∞)
$$ f(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0 \end{cases} $$
$$ f'(x) = \begin{cases} 1 & x > 0 \\ \alpha & x \le 0 \end{cases} $$
Fixes dying ReLU by leaking a small negative slope (typically $\alpha = 0.01$). Cheap and usually a safe ReLU upgrade.
ELU range: (−α, ∞)
$$ f(x) = \begin{cases} x & x \ge 0 \\ \alpha(e^x - 1) & x < 0 \end{cases} $$
$$ f'(x) = \begin{cases} 1 & x \ge 0 \\ \alpha e^x & x < 0 \end{cases} $$
Smooth negative tail; mean activations sit closer to zero, which helps deep nets train more stably. Slightly more expensive than Leaky ReLU.
GELU range: (−0.17, ∞)
$$ \mathrm{GELU}(x) = x \cdot \Phi(x) = \tfrac{1}{2} x \,\bigl(1 + \mathrm{erf}(x/\sqrt{2})\bigr) $$
$$ \text{tanh approx: } \tfrac{1}{2} x \bigl(1 + \tanh[\sqrt{2/\pi}\,(x + 0.044715\,x^3)]\bigr) $$
Smooth, probabilistic gating. BERT, GPT, ViT — modern transformers default to GELU. The tanh approximation is ~30% faster on hardware without efficient erf.
Swish / SiLU range: (−0.28, ∞)
$$ f(x) = x \cdot \sigma(x) $$
$$ f'(x) = \sigma(x) + x\,\sigma(x)\,\bigl(1 - \sigma(x)\bigr) $$
Google's EfficientNet uses Swish (same family as GELU). Cheaper than GELU at near-identical quality on vision tasks.
Softplus range: (0, ∞)
$$ f(x) = \log(1 + e^x) $$
$$ f'(x) = \sigma(x) $$
Smooth ReLU substitute. Cute fact: its derivative is the sigmoid. Mostly historical — the smoothness rarely pays for the extra compute.
Mish range: (−0.31, ∞)
$$ f(x) = x \cdot \tanh\bigl(\log(1 + e^x)\bigr) $$
$$ f'(x) = \tanh(\zeta) + x\,\sigma(x)\,\bigl(1 - \tanh^2(\zeta)\bigr),\;\; \zeta = \log(1+e^x) $$
YOLOv4 and successors. Self-regularizing — small gains over Swish on some vision benchmarks, slightly more compute.
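One way to sanity-check the derivative formulas on the cards is to compare each against a central finite difference. The sketch below does that for the smooth activations, using only the standard library. The dictionary layout, the sample points, and the choice of `ALPHA = 1.0` for ELU are assumptions made for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    return math.log1p(math.exp(x))

ALPHA = 1.0  # ELU alpha; an illustrative choice, not prescribed by the cards

# name -> (function, closed-form derivative as written on the cards)
ACTIVATIONS = {
    "sigmoid": (sigmoid, lambda x: sigmoid(x) * (1.0 - sigmoid(x))),
    "tanh": (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
    "elu": (lambda x: x if x >= 0 else ALPHA * (math.exp(x) - 1.0),
            lambda x: 1.0 if x >= 0 else ALPHA * math.exp(x)),
    "swish": (lambda x: x * sigmoid(x),
              lambda x: sigmoid(x) + x * sigmoid(x) * (1.0 - sigmoid(x))),
    "softplus": (softplus, sigmoid),
    "mish": (lambda x: x * math.tanh(softplus(x)),
             lambda x: math.tanh(softplus(x))
             + x * sigmoid(x) * (1.0 - math.tanh(softplus(x)) ** 2)),
}

def numeric_derivative(f, x, h=1e-5):
    """Central finite difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def max_formula_error(name, points=(-3.0, -1.0, -0.5, 0.5, 1.0, 3.0)):
    """Largest gap between the closed form and the numeric slope at the sample points."""
    f, df = ACTIVATIONS[name]
    return max(abs(df(x) - numeric_derivative(f, x)) for x in points)

for name in ACTIVATIONS:
    print(f"{name:9s} max |formula - numeric| = {max_formula_error(name):.2e}")
```

The sample points skip x = 0 so the check stays away from the kink in piecewise functions; ReLU and Leaky ReLU are left out because their derivatives are step functions that can be read off directly.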

Try this

See the vanishing gradient

Turn on Sigmoid only. Look at the derivative panel: the peak is 0.25 at x = 0, and it falls toward zero as |x| grows. Stack 5 sigmoids and the gradient gets multiplied by at most $0.25^5 \approx 0.001$, even in the best case: gone before the first layer sees it.
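The arithmetic can be reproduced in a few lines. This sketch multiplies only the sigmoid derivatives along one path through a five-layer stack and ignores the weight matrices, which is a simplifying assumption. The function names and the pre-activation values 0.0 and 3.0 are chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def activation_gradient_product(pre_activations):
    """Product of sigmoid derivatives along one path through the stack."""
    product = 1.0
    for z in pre_activations:
        product *= sigmoid_grad(z)
    return product

# Best case: every layer sits at x = 0, where the slope peaks at 0.25.
best_case = activation_gradient_product([0.0] * 5)
# Saturated case: every layer is pushed out into the flat tail.
saturated = activation_gradient_product([3.0] * 5)

print(f"best case: {best_case:.6f}")
print(f"saturated: {saturated:.3e}")
```

The saturated product is smaller than the best case by several orders of magnitude, which is why saturation matters more than the 0.25 ceiling alone.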

Why Tanh beats Sigmoid in hidden layers

Now switch to Tanh. Peak derivative is 1.0, four times sigmoid’s. And outputs are symmetric around zero, so the weight gradients in the next layer don’t all share the same sign. This is why Tanh outlived Sigmoid for a decade in the hidden layers of MLPs.
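The factor of four follows from the identity tanh(x) = 2σ(2x) − 1: Tanh is a sigmoid rescaled and recentered, so its slope at the origin is 4 · 0.25 = 1. A small sketch that checks both the identity and the peak ratio numerically; the grid of test points is an arbitrary choice.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

# Check tanh(x) = 2*sigmoid(2x) - 1 on a grid over [-5, 5].
xs = [i / 10.0 for i in range(-50, 51)]
identity_gap = max(abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) for x in xs)

# Ratio of the peak slopes, both attained at x = 0.
peak_ratio = tanh_grad(0.0) / sigmoid_grad(0.0)

print(f"largest identity gap: {identity_gap:.2e}")
print(f"peak slope ratio tanh/sigmoid: {peak_ratio}")
```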

The dying ReLU corridor

Turn on ReLU and set x min = -10. Look at the derivative for x < 0: it’s flat zero. If a neuron lands here and its input distribution stays negative, every gradient is zero and its weights never update again. Now turn on Leaky ReLU: the derivative is α, not 0. The neuron can come back.
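A single neuron is enough to see the effect. In this sketch the inputs, the weight `w`, and the bias `b` are hypothetical values picked so that every pre-activation is negative, and the upstream gradient is assumed to be 1 for every example.

```python
ALPHA = 0.01  # Leaky ReLU negative slope, the typical value quoted on the card

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z):
    return 1.0 if z > 0 else ALPHA

def weight_gradient(act_grad, w, b, inputs, upstream=1.0):
    """Sum over the batch of dL/dw for one neuron y = act(w * x + b).

    Assumes the same upstream gradient dL/dy for every example.
    """
    return sum(upstream * act_grad(w * x + b) * x for x in inputs)

inputs = [0.5, 1.0, 1.5, 2.0]  # hypothetical inputs, all positive
w, b = -1.0, -0.5              # hypothetical parameters: w * x + b < 0 for every input

dead = weight_gradient(relu_grad, w, b, inputs)
alive = weight_gradient(leaky_relu_grad, w, b, inputs)

print(f"ReLU weight gradient:       {dead}")
print(f"Leaky ReLU weight gradient: {alive}")
```

With ReLU the gradient is exactly zero, so no optimizer step can move `w`; with Leaky ReLU a small gradient survives and the neuron can drift back into the active region.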

GELU exact vs tanh approximation

Enable GELU, then flip the dropdown between Exact (erf) and Tanh approx. Watch the curves: they overlap almost everywhere. The approximation is ~30% faster on GPUs without hardware erf, with quality loss too small to measure in training.
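The overlap can also be measured rather than eyeballed. The sketch below evaluates both forms from the GELU card on a grid and reports the largest absolute gap. The grid range [-6, 6] and the step of 0.01 are arbitrary choices for this example.

```python
import math

def gelu_exact(x):
    """GELU via the Gaussian CDF, written with erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU, as written on the card."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

xs = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
worst_x = max(xs, key=lambda x: abs(gelu_exact(x) - gelu_tanh(x)))

print(f"max |exact - tanh approx| on [-6, 6]: {max_gap:.2e} at x = {worst_x:+.2f}")
```

This measures only the difference in shape; it says nothing about speed, which depends on the hardware and the kernel implementation.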

Self-gating: Swish vs ReLU

Enable ReLU and Swish side by side. Near x = 0 Swish is smooth and dips slightly below zero for negative inputs, which Google found made EfficientNet train faster; the smoother curve can ease optimization without changing anything else.
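To put numbers on that small negative dip, the following sketch locates the minimum of Swish with a simple grid search and prints a few values next to ReLU. The grid spacing of 0.001 and the probe points are assumptions for illustration.

```python
import math

def swish(x):
    # x * sigmoid(x), written as a single division
    return x / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

# Grid search for the lowest point of Swish on [-6, 6].
xs = [i / 1000.0 for i in range(-6000, 6001)]
x_min = min(xs, key=swish)

print(f"Swish minimum: f({x_min:.3f}) = {swish(x_min):.4f}")
for x in (-1.0, -0.1, 0.0, 0.1, 1.0):
    print(f"x = {x:+.1f}   relu = {relu(x):.4f}   swish = {swish(x):.4f}")
```

The minimum lands near the lower bound listed in the Swish / SiLU range, and the table shows Swish passing a small negative value where ReLU outputs exactly zero.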

Why output range matters

Look at Sigmoid vs Identity. Sigmoid clamps to (0, 1) — perfect for “is this a cat? probability”. Identity is unbounded — perfect for “how much will this house sell for?”. Same activation idea, opposite use cases. Output layer = pick the range that matches your target.

Frequently asked

What is an activation function?

A nonlinear transformation applied to each neuron’s output. Without it, stacking layers is mathematically equivalent to a single linear layer: no matter how deep your network is, it could only learn linear relationships. The activation is what lets neural networks model curves, decision boundaries, and everything else nonlinear.

Why is ReLU the default hidden activation?

Three reasons: (1) cheap, just max(0, x); (2) doesn’t saturate on the positive side, so gradients flow freely; (3) produces sparse activations (roughly half the neurons output 0 when pre-activations are centered near zero), which acts as implicit regularization. Sigmoid and Tanh saturate at both ends, killing gradients in deep networks. That’s the vanishing gradient problem.

What is the dying ReLU problem?

If a neuron’s input distribution shifts firmly negative (e.g., from a bad initialization or a large gradient step), ReLU outputs 0 for all inputs. Its derivative is also 0, so the neuron’s weights never update and it’s effectively dead. Leaky ReLU, ELU, GELU, and Swish all fix this by allowing some signal through on the negative side.

Why do transformers use GELU?

GELU is smooth (so optimization is gentler), self-gating (the value of $x$ decides how much of $x$ passes through, weighted by the probability $\Phi(x)$), and slightly better-behaved empirically on language tasks. The original Vaswani et al. transformer used ReLU, but BERT switched to GELU and many later models, including the GPT family, followed.

When should I use Sigmoid?

As an output layer for binary classification, where its (0, 1) range maps cleanly to probability. For hidden layers in modern networks, almost never. (LSTM gates are an exception, and even those are getting replaced.)

What is the difference between exact GELU and the tanh approximation?

The exact GELU uses the Gaussian CDF, computed with the erf function. The tanh form $\tfrac{1}{2} x \bigl(1 + \tanh[\sqrt{2/\pi}\,(x + 0.044715\,x^3)]\bigr)$ approximates it ~30% faster on GPUs that don’t have hardware erf. The shapes differ by less than $10^{-3}$ across the typical input range, so the two are interchangeable for training.
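The sparsity figure quoted for ReLU depends on where the pre-activations sit. A rough sketch, assuming standard-normal pre-activations drawn with a fixed seed (the distribution, the sample size, and the shift of 1.0 are assumptions made for this example, not measurements from a trained network):

```python
import random

def relu(x):
    return max(0.0, x)

def zero_fraction(pre_activations):
    """Fraction of units whose ReLU output is exactly zero."""
    outputs = [relu(z) for z in pre_activations]
    return sum(1 for y in outputs if y == 0.0) / len(outputs)

rng = random.Random(0)  # fixed seed so the run is repeatable
centered = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
shifted = [z + 1.0 for z in centered]  # a positive offset leaves fewer units silent

print(f"zero fraction, centered pre-activations: {zero_fraction(centered):.3f}")
print(f"zero fraction, shifted by +1.0:          {zero_fraction(shifted):.3f}")
```

With pre-activations centered at zero, about half the outputs are silent; shift the distribution and the fraction moves with it, so the one-half figure is a rule of thumb rather than a guarantee.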