Feature Hashing Collision Explorer

Explore the “hashing trick” for large, sparse categorical spaces (e.g., text tokens or ad logs). Adjust the number of buckets and the category distribution to see how collisions change and how signed hashing stabilizes downstream models.


Bucket Distribution & Collisions

Legend: count per bucket; collision highlight.

Metrics shown: collision rate, average load per bucket, memory, accuracy.

Hover a bucket to list colliding categories.

Downstream Model Impact

Synthetic linear classifier; signed hashing should center features (mean ≈ 0)

Drag “buckets” down to see collisions explode and accuracy drop; toggle “signed hashing” to reduce bias from collisions. Memory bars compare one-hot vs hashed vectors.

Data scale

# categories: 10,000. Set as an exponent, 10^x, from 100 to 1,000,000. Uses sampling to keep it fast.

Samples drawn: 20,000

Hashing

Hash buckets (vector size): 1024. Use powers of two for a faster modulo; fewer buckets → more collisions.
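The power-of-two hint can be sketched as follows. This is a minimal illustration, not the tool's own code: `bucket_index` is a hypothetical helper, and the hash value is assumed to be a non-negative integer. When the bucket count is a power of two, `h % n_buckets` equals `h & (n_buckets - 1)`, which replaces a division with a single bit mask.

```python
def bucket_index(h: int, n_buckets: int) -> int:
    """Map a non-negative hash value to a bucket (hypothetical helper)."""
    if n_buckets <= 0:
        raise ValueError("n_buckets must be positive")
    # A power of two has exactly one bit set, so n & (n - 1) is zero.
    if (n_buckets & (n_buckets - 1)) == 0:
        return h & (n_buckets - 1)   # bit mask, same result as h % n_buckets
    return h % n_buckets             # general case


# The mask and the modulo agree for every hash value when n_buckets = 1024.
agree = all(bucket_index(h, 1024) == h % 1024 for h in range(100_000))
print(agree)
```

In Python the speed difference is negligible; the mask matters mainly in lower-level implementations where integer division is comparatively expensive.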

Signed hashing (±1)

Distribution

Collision highlighting

How to interpret

What: Feature hashing maps a potentially huge set of categories (tokens, ids) into a fixed-size vector by hashing each category to a bucket. Optional signed hashing also assigns each category a hash-derived sign of ±1, so colliding categories tend to cancel and the expected contribution of collisions to a bucket is zero.
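A minimal sketch of this mapping, using only the standard library. The function names, the salts, and the choice of BLAKE2b are assumptions made for illustration; the tool's actual hash function is not stated here. A stable hash is used because Python's built-in `hash()` for strings is randomized per process.

```python
import hashlib


def _hash64(token: str, salt: str) -> int:
    """Deterministic 64-bit hash of a token (illustrative choice of hash)."""
    data = f"{salt}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


def hash_features(tokens, n_buckets=1024, signed=True):
    """Accumulate category strings into a fixed-size vector of floats."""
    vec = [0.0] * n_buckets
    for tok in tokens:
        idx = _hash64(tok, "bucket") % n_buckets
        # A second, independently salted hash supplies the sign bit.
        sign = -1.0 if signed and (_hash64(tok, "sign") & 1) else 1.0
        vec[idx] += sign
    return vec


tokens = ["user=42", "ad=summer_sale", "query=running shoes", "ad=summer_sale"]
vec = hash_features(tokens, n_buckets=16)
print(len(vec))  # vector length is fixed by n_buckets, not by the vocabulary
```

The output length depends only on `n_buckets`, so unseen categories never require resizing the vector or maintaining a vocabulary.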

Why: It provides a memory- and speed-efficient alternative to one-hot encoding when the vocabulary is large or evolving (e.g., ad logs, search queries, URLs).

How to read the plots: The bucket chart shows counts per bucket; red marks indicate buckets receiving multiple categories (collisions). The accuracy panel trains a tiny synthetic linear model; as the number of buckets shrinks, interference from collisions grows and accuracy degrades. Signed hashing reduces systematic bias by centering collisions: colliding contributions carry random signs and tend to cancel instead of piling up.
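The centering effect can be checked with a small simulation. This is a sketch under stated assumptions: a seeded random generator stands in for an ideal hash, and every category contributes a value of 1. It does not reproduce the tool's synthetic classifier.

```python
import random


def bucket_means(n_categories=10_000, n_buckets=1024, seed=0):
    """Return (mean bucket value without signs, mean bucket value with signs)."""
    rng = random.Random(seed)
    unsigned = [0.0] * n_buckets
    signed = [0.0] * n_buckets
    for _ in range(n_categories):
        idx = rng.randrange(n_buckets)   # stand-in for hash(category) % n_buckets
        sign = rng.choice((-1.0, 1.0))   # stand-in for a hash-derived sign
        unsigned[idx] += 1.0
        signed[idx] += sign
    return sum(unsigned) / n_buckets, sum(signed) / n_buckets


unsigned_mean, signed_mean = bucket_means()
print(f"unsigned mean per bucket: {unsigned_mean:.3f}")
print(f"signed mean per bucket:   {signed_mean:.3f}")
```

Without signs, every collision adds to the bucket, so the mean bucket value is exactly #categories / #buckets. With signs, the mean stays close to zero, which is the sense in which collisions are centered.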

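Collision rate rises with the ratio #categories / #buckets: under uniform hashing of n categories into m buckets, the chance that a given category shares its bucket is about 1 − e^(−n/m). Skewed (Zipf) traffic can additionally overload a few buckets.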
Average load per bucket ≈ samples / buckets; memory footprint scales with vector size, not with vocabulary.
Compare one-hot vs hashing memory in the lower chart; hashing keeps constant memory even as categories grow.
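The rules of thumb above can be sketched numerically. The helper names are hypothetical, the hash is modeled as uniform random assignment, and the memory figure assumes a dense vector of 4-byte values; the tool's own definitions of collision rate and memory may differ.

```python
import random
from collections import Counter


def expected_collision_fraction(n_categories: int, n_buckets: int) -> float:
    """Chance that a given category shares its bucket, under uniform hashing."""
    return 1.0 - (1.0 - 1.0 / n_buckets) ** (n_categories - 1)


def simulated_collision_fraction(n_categories: int, n_buckets: int, seed: int = 0) -> float:
    """Fraction of categories that land in a bucket holding more than one category."""
    rng = random.Random(seed)
    assignment = [rng.randrange(n_buckets) for _ in range(n_categories)]
    occupancy = Counter(assignment)
    colliding = sum(1 for b in assignment if occupancy[b] > 1)
    return colliding / n_categories


def dense_bytes(dim: int, bytes_per_value: int = 4) -> int:
    """Size of one dense vector, assuming 4-byte values by default."""
    return dim * bytes_per_value


n = 10_000
for m in (1024, 4096, 16384, 65536):
    exp = expected_collision_fraction(n, m)
    sim = simulated_collision_fraction(n, m)
    print(f"buckets={m:6d}  expected={exp:.3f}  simulated={sim:.3f}  "
          f"hashed={dense_bytes(m)} B  one-hot={dense_bytes(n)} B")
```

The hashed size depends only on the bucket count, while the one-hot size grows with the vocabulary. Note that a hashed vector is smaller than one-hot only when the bucket count is below the number of categories.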

