Transformers & Attention Visualizer

Type a sentence and explore how attention focuses on context. Toggle heads/layers, hover tokens to see attention weights, and switch between self- and cross-attention.

Visualization | Controls | Guide

Attention Heatmap

Query tokens (rows) | Key tokens (columns)

Input (self-attention)

Hover tokens to highlight rows/columns. Each cell shows attention from a query token (row) to a key token (column).

Mode

Heads: 4

Layers: 2

Temperature: 1.0

Head

Layer

Auto-play heads/layers

Tokenization

Whitespace-based tokens (punctuation split). Limit ~30 tokens per sequence for clarity.
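A minimal sketch of what whitespace tokenization with punctuation splitting could look like. The regular expression, the `tokenize` name, and the hard cap of 30 are assumptions for illustration, not the visualizer's actual code.

```python
import re

MAX_TOKENS = 30  # assumed cap, mirroring the "~30 tokens" guideline


def tokenize(text, max_tokens=MAX_TOKENS):
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches a run of word characters; [^\w\s] matches one
    # punctuation character, so "mat." becomes "mat" and ".".
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return tokens[:max_tokens]


print(tokenize("The cat sat on the mat."))
# ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
```

Note that this simple rule also splits contractions ("don't" becomes three tokens), which is one reason production models use subword tokenizers instead.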

How to read attention

Self vs Cross attention

Self‑attention: Each token attends to all tokens in the same sequence.
Cross‑attention: Decoder tokens attend to encoder tokens (target queries source).
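The difference between the two modes shows up in the shape of the attention matrix. The sketch below uses random toy vectors in place of learned query/key projections (an assumption made purely for illustration; `toy_vectors` and `attention` are hypothetical helpers).

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys):
    """Return a len(queries) x len(keys) matrix of attention weights."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def toy_vectors(tokens, dim=4, seed=0):
    """Random stand-ins for learned projections (illustration only)."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in tokens]


source = ["the", "cat", "sat"]  # encoder side
target = ["le", "chat"]         # decoder side
src_vecs = toy_vectors(source, seed=1)
tgt_vecs = toy_vectors(target, seed=2)

self_attn = attention(src_vecs, src_vecs)   # queries and keys from one sequence
cross_attn = attention(tgt_vecs, src_vecs)  # target queries, source keys

print(len(self_attn), len(self_attn[0]))    # 3 3
print(len(cross_attn), len(cross_attn[0]))  # 2 3
```

Self-attention always yields a square matrix; cross-attention has one row per target token and one column per source token.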

Heads & Layers

Heads: Multiple views of context; different heads can focus on different relations (syntax, position).
Layers: Deeper layers refine context; earlier layers tend to capture local patterns, while deeper layers tend to capture long‑range relations.
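In standard multi-head attention, each head works on its own slice of the projected vector, which is one reason different heads can pick up different relations. A small sketch of that split, where `split_heads` is a hypothetical helper and the 8-dimensional vector is made up:

```python
def split_heads(vector, num_heads):
    """Split one projected vector into equal per-head slices."""
    if len(vector) % num_heads != 0:
        raise ValueError("vector length must be divisible by num_heads")
    head_dim = len(vector) // num_heads
    return [vector[i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]


# An 8-dimensional query vector shared across 4 heads (2 dims per head).
query = [0, 1, 2, 3, 4, 5, 6, 7]
print(split_heads(query, 4))
# [[0, 1], [2, 3], [4, 5], [6, 7]]
```

Each head then computes its own attention matrix from its slice, so a model with 4 heads and 2 layers produces 8 separate heatmaps.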

Pro tips

Hover a token to highlight where it sends attention (rows) or receives attention (columns).
Use Temperature: lower → sharper attention, higher → flatter weights.
Auto‑play to cycle through heads/layers and notice complementary patterns.
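The effect of temperature can be sketched under the usual convention, in which scores are divided by the temperature before softmax. Whether the visualizer's slider follows exactly this convention is an assumption; the function name and the scores are made up for illustration.

```python
import math


def softmax_with_temperature(scores, temperature=1.0):
    """Softmax over scores / temperature (standard convention)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.0]  # hypothetical raw similarity scores
for t in (0.5, 1.0, 2.0):
    weights = softmax_with_temperature(scores, t)
    print(t, [round(w, 3) for w in weights])
```

Under this convention the largest weight shrinks as the temperature rises: the distribution is sharpest at 0.5 and flattest at 2.0.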


What’s happening under the hood — and why it matters

What’s happening

Query/Key similarity: Each head projects tokens into query and key vectors and measures similarity. The heatmap cell is the softmax‑normalized similarity.
Softmax weighting: Each row sums to 1. Temperature divides the scores before softmax, so low values sharpen the weights and high values smooth them.
Heads focus differently: One head may track word identity (“the→the”), another syntax (“cat→sat”), another positions (“on→the”).
Layers compose: Earlier layers learn local links; later layers aggregate longer‑range dependencies.
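For one head, the heatmap cell can be written out as follows. This is the standard scaled dot-product form with a temperature added; the symbols (x_i for the token embedding, W_Q and W_K for the head's projection matrices, d_k for the key dimension, \tau for the temperature) are notation introduced here, and the \sqrt{d_k} scaling is assumed rather than confirmed for this tool.

```latex
q_i = W_Q x_i, \qquad k_j = W_K x_j

A_{ij} = \frac{\exp\!\left( \dfrac{q_i \cdot k_j}{\tau \sqrt{d_k}} \right)}
              {\sum_{j'} \exp\!\left( \dfrac{q_i \cdot k_{j'}}{\tau \sqrt{d_k}} \right)},
\qquad \sum_{j} A_{ij} = 1
```

Here A_{ij} is the weight in row i (query token) and column j (key token), and the normalization over j is what makes each row sum to 1.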

Interacting with the view

Switch Self vs Cross to see intra‑sentence vs encoder→decoder patterns.
Move across heads/layers to observe complementary attention behaviors.
Hover tokens to highlight the row/column and read off weight distribution.
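Reading a row versus a column amounts to two different slices of the same matrix. A small sketch with a hand-written weight matrix (the numbers and helper names are invented for illustration):

```python
tokens = ["the", "cat", "sat"]
# weights[i][j]: attention from query token i (row) to key token j (column)
weights = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
]


def sent_by(token_index, weights):
    """Row: how one query token distributes its attention over all keys."""
    return weights[token_index]


def received_by(token_index, weights):
    """Column: how much each query token attends to this key token."""
    return [row[token_index] for row in weights]


i = tokens.index("cat")
print(dict(zip(tokens, sent_by(i, weights))))
# {'the': 0.2, 'cat': 0.5, 'sat': 0.3}
print(dict(zip(tokens, received_by(i, weights))))
# {'the': 0.3, 'cat': 0.5, 'sat': 0.6}
```

A row is a probability distribution and sums to 1; a column is not normalized, so a token that many others attend to can have a column total well above 1.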

Why this is important

Interpretability: Attention offers an intuitive window into what a model “looks at” when forming context.
Debugging: Mismatches (e.g., attention stuck on punctuation) can reveal tokenization or context issues.
Quality signals: Healthy heads often show consistent, meaningful patterns (e.g., determiners attending to nouns, verbs attending to subjects/objects).
Education: Seeing heads and layers evolve turns the math of attention into actionable intuition.
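Two simple diagnostics can put numbers on these observations: the share of attention landing on punctuation, and the entropy of a row (low entropy means a sharply focused token, high entropy a diffuse one). This is a rough sketch; the helper names and the example matrix are hypothetical, and what counts as "too much" punctuation attention is a judgment call.

```python
import math
import string


def punctuation_mass(tokens, weights):
    """Average share of each row's attention that lands on punctuation tokens."""
    punct_cols = [
        j for j, tok in enumerate(tokens)
        if tok and all(ch in string.punctuation for ch in tok)
    ]
    per_row = [sum(row[j] for j in punct_cols) for row in weights]
    return sum(per_row) / len(per_row)


def row_entropy(row):
    """Shannon entropy (in nats) of one row of attention weights."""
    return -sum(p * math.log(p) for p in row if p > 0)


tokens = ["the", "cat", "sat", "."]
weights = [  # a made-up head that mostly attends to the final period
    [0.10, 0.10, 0.10, 0.70],
    [0.10, 0.20, 0.10, 0.60],
    [0.05, 0.10, 0.05, 0.80],
    [0.10, 0.10, 0.10, 0.70],
]

print(round(punctuation_mass(tokens, weights), 2))  # 0.7
print([round(row_entropy(r), 2) for r in weights])
```

For comparison, a perfectly uniform row over four tokens has entropy log(4), about 1.39 nats, and a row that puts all its weight on one token has entropy 0.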

Where you’ll see this

Language models (GPT/BERT family) attending to entities, coreferences, and syntax.
Translation: decoder cross‑attention aligning target words with source phrases.
Vision and audio transformers: attention over patches or time frames.


