Transformers & Attention Visualizer

Type a sentence and explore how attention focuses on context. Toggle heads/layers, hover tokens to see attention weights, and switch between self- and cross-attention.


Attention Heatmap
Axes: query tokens (rows), key tokens (columns)
Hover tokens to highlight rows/columns. Each cell shows attention from a query token (row) to a key token (column).
Mode
Heads: 4
Layers: 2
Temperature: 1.0
Tokenization
Tokens are split on whitespace, with punctuation separated into its own tokens. Keep each sequence to about 30 tokens for clarity.
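A minimal sketch of this kind of tokenizer, assuming a simple regex split. The function name `tokenize`, the constant `MAX_TOKENS`, and the exact regex are illustrative assumptions, not the tool's actual code:

```python
import re

MAX_TOKENS = 30  # assumed cap, taken from the "~30 tokens" guidance above


def tokenize(text, max_tokens=MAX_TOKENS):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return tokens[:max_tokens]


print(tokenize("The cat sat on the mat."))
```

With this regex, contractions such as "don't" are split into three tokens ("don", "'", "t"), which is one reason a whitespace-based view can differ from the subword tokenizers that real models use.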

How to read attention
Self vs Cross attention
  • Self‑attention: Each token attends to all tokens in the same sequence.
  • Cross‑attention: Decoder tokens attend to encoder tokens (target queries source).
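The difference between the two modes shows up in the shape of the weight matrix. The sketch below uses made-up 2-dimensional vectors in place of learned representations and a hypothetical helper `attention_weights`; it is an illustration under those assumptions, not the visualizer's implementation:

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Return one softmax-normalized row per query, one column per key."""
    d = len(keys[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
        for q in queries
    ]


# Toy vectors standing in for token representations (values are made up).
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 encoder (source) tokens
target = [[0.5, 0.5], [1.0, -1.0]]             # 2 decoder (target) tokens

self_attn = attention_weights(source, source)   # 3 x 3: sequence attends to itself
cross_attn = attention_weights(target, source)  # 2 x 3: target queries source

print(len(self_attn), "x", len(self_attn[0]))
print(len(cross_attn), "x", len(cross_attn[0]))
```

Self-attention always yields a square heatmap, while cross-attention has one row per target token and one column per source token.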
Heads & Layers
  • Heads: Multiple views of context; different heads can focus on different relations (syntax, position).
  • Layers: Deeper layers refine context; earlier layers tend to capture local patterns, while deeper layers tend to capture long‑range relations.
Pro tips
  • Hover a token to highlight where it sends attention (rows) or receives attention (columns).
  • Use Temperature: with the standard softmax(scores / T) scaling, lower values sharpen attention and higher values flatten the weights.
  • Use Auto‑play to cycle through heads/layers and notice complementary patterns.
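The effect of temperature can be checked numerically. This sketch assumes the common convention of dividing scores by T before the softmax; a tool may instead multiply scores by its slider value, which reverses the direction of the effect. The function name and the example scores are illustrative:

```python
import math


def softmax_with_temperature(scores, temperature=1.0):
    """softmax(scores / T): an assumed convention, not necessarily the tool's."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.0]  # made-up raw similarity scores for one query
for t in (0.5, 1.0, 2.0):
    weights = softmax_with_temperature(scores, t)
    print(t, [round(w, 3) for w in weights])
```

Under this convention, the largest weight grows as T drops below 1 and shrinks toward a uniform distribution as T rises above 1.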
What’s happening under the hood — and why it matters
What’s happening
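  • Query/Key similarity: Each head projects tokens into query and key vectors and scores every query against every key (typically with a scaled dot product). Each heatmap cell is that score after softmax normalization across its row.
  • Softmax weighting: Each row sums to 1. With the standard softmax(scores / T) scaling, a low temperature sharpens these weights and a high temperature smooths them.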
  • Heads focus differently: One head may track word identity (“the→the”), another syntax (“cat→sat”), another positions (“on→the”).
  • Layers compose: Earlier layers tend to learn local links; later layers tend to aggregate longer‑range dependencies.
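The per-head computation described above can be sketched with random projections. Everything here is an assumption made for illustration: the dimensions, the random "embeddings", and the helper names. A trained model learns its projection matrices, so these heatmaps only demonstrate the mechanics, not meaningful patterns:

```python
import math
import random


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def head_heatmap(embeddings, w_q, w_k):
    """Project tokens to queries and keys, then softmax the scaled dot products."""
    queries = [matvec(w_q, e) for e in embeddings]
    keys = [matvec(w_k, e) for e in embeddings]
    d_head = len(queries[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_head) for k in keys])
        for q in queries
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
d_model, d_head, num_heads = 8, 4, 4
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Random vectors stand in for learned embeddings; no positional information is added.
vocab = {tok: [rng.gauss(0.0, 1.0) for _ in range(d_model)] for tok in sorted(set(tokens))}
embeddings = [vocab[tok] for tok in tokens]

# Each head gets its own query and key projection, hence its own heatmap.
heatmaps = [
    head_heatmap(
        embeddings,
        random_matrix(d_head, d_model, rng),
        random_matrix(d_head, d_model, rng),
    )
    for _ in range(num_heads)
]

for h, heatmap in enumerate(heatmaps):
    print("head", h, [round(w, 2) for w in heatmap[1]])  # row for "cat"
```

Because this sketch omits positional information, the two occurrences of "the" produce identical rows within a head. Real transformers add or inject position so that repeated words can attend differently.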
Interacting with the view
  • Switch Self vs Cross to compare intra‑sentence patterns with decoder→encoder patterns (target tokens querying source tokens).
  • Move across heads/layers to observe complementary attention behaviors.
  • Hover tokens to highlight the row/column and read off weight distribution.
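Reading a row versus a column can be expressed in a few lines. The heatmap values below are hand-written for illustration, and the helper names are hypothetical:

```python
tokens = ["the", "cat", "sat"]
# Hand-written weights; each row (query) sums to 1.
heatmap = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
]


def sent_by(token_index, heatmap):
    """Row: how this query token distributes its attention over the keys."""
    return list(heatmap[token_index])


def received_by(token_index, heatmap):
    """Column: how much attention each query token gives to this key."""
    return [row[token_index] for row in heatmap]


def top_key(token_index, tokens, heatmap):
    """The key token that receives the most attention from this query."""
    row = heatmap[token_index]
    best = max(range(len(row)), key=row.__getitem__)
    return tokens[best], row[best]


print(sent_by(2, heatmap))          # where "sat" sends attention
print(received_by(1, heatmap))      # who attends to "cat"
print(top_key(2, tokens, heatmap))  # strongest key for "sat"
```

Rows are probability distributions, but columns are not: a key that many queries attend to can have a column total well above 1.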
Why this is important
  • Interpretability: Attention offers an intuitive window into what a model “looks at” when forming context.
  • Debugging: Mismatches (e.g., attention stuck on punctuation) can reveal tokenization or context issues.
  • Quality signals: Healthy heads often show consistent, meaningful patterns (e.g., determiners attending to nouns, verbs attending to subjects/objects).
  • Education: Seeing heads and layers evolve turns the math of attention into actionable intuition.
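The debugging idea above (attention stuck on punctuation) can be turned into a rough check. This is a sketch with hand-written weights and hypothetical helper names; what counts as "too much" attention on punctuation is a judgment call, not an established threshold:

```python
import math
import string


def punctuation_mass(tokens, heatmap):
    """Average share of each query's attention that lands on punctuation keys."""
    punct_cols = [
        j for j, tok in enumerate(tokens)
        if all(ch in string.punctuation for ch in tok)
    ]
    return sum(sum(row[j] for j in punct_cols) for row in heatmap) / len(heatmap)


def row_entropy(row):
    """Shannon entropy in bits: 0 means one key gets everything, log2(n) means uniform."""
    return -sum(w * math.log2(w) for w in row if w > 0)


tokens = ["the", "cat", "sat", "."]
# Hand-written weights for illustration only; each row sums to 1.
heatmap = [
    [0.10, 0.10, 0.10, 0.70],
    [0.05, 0.05, 0.10, 0.80],
    [0.10, 0.20, 0.10, 0.60],
    [0.25, 0.25, 0.25, 0.25],
]

print(round(punctuation_mass(tokens, heatmap), 4))
print([round(row_entropy(r), 2) for r in heatmap])
```

A high punctuation share combined with low row entropy suggests a head is parking most of its weight on one punctuation token, which is worth a closer look before drawing conclusions from that head.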
Where you’ll see this
  • Language models (GPT/BERT family) attending to entities, coreferences, and syntax.
  • Translation: decoder cross‑attention aligning target words with source phrases.
  • Vision and audio transformers: attention over patches or time frames.
