Feature Hashing Collision Explorer

Explore the “hashing trick” for large, sparse categorical spaces (e.g., text tokens or ad logs). Adjust the number of buckets and the category distribution to see how collisions change and how signed hashing stabilizes downstream models.


Bucket Distribution & Collisions
Legend: count per bucket, with collisions highlighted.
Readouts: collision rate, average load per bucket, memory, accuracy.
Hover a bucket to list colliding categories.
Downstream Model Impact
Synthetic linear classifier; signed hashing should center features (mean ≈ 0)
Drag “buckets” down to see collisions explode and accuracy drop; toggle “signed hashing” to reduce bias from collisions. Memory bars compare one-hot vs hashed vectors.
Data scale
Set the exponent x in 10^x (from 100 to 1,000,000). The tool samples the data to keep it fast.
Hashing
Use a power-of-two bucket count so the modulo reduces to a bit mask; fewer buckets mean more collisions.
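
The power-of-two tip can be sketched as follows. This is a minimal illustration, not the tool's own code: the function name bucket_index and the choice of zlib.crc32 as the hash are assumptions made for the example. It shows that for a power-of-two bucket count, masking with (num_buckets - 1) gives the same index as the modulo.

```python
import zlib


def bucket_index(token: str, num_buckets: int) -> int:
    """Map a token to a bucket index; num_buckets must be a power of two.

    Hypothetical helper for illustration; crc32 is an assumed hash choice.
    """
    # A power of two has exactly one bit set, so n & (n - 1) == 0.
    if num_buckets <= 0 or num_buckets & (num_buckets - 1):
        raise ValueError("num_buckets must be a positive power of two")
    h = zlib.crc32(token.encode("utf-8"))  # deterministic unsigned 32-bit hash
    # Keeping the low bits is equivalent to h % num_buckets here.
    return h & (num_buckets - 1)
```

In CPython the speed difference between the mask and the modulo is negligible; the saving matters mainly in lower-level implementations.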

How to interpret

What: Feature hashing maps a potentially huge set of categories (tokens, ids) into a fixed-size vector by hashing each category to a bucket. Optional signed hashing also assigns each category a sign of ±1, so that contributions from colliding categories tend to cancel and their expected sum is near zero.
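
A minimal sketch of the mapping described above, assuming a BLAKE2 digest as the hash and a second, independently prefixed hash for the sign. The names hash_features and _hash and the "index:" / "sign:" prefixes are hypothetical, chosen only for this example.

```python
import hashlib


def _hash(token: str, prefix: str) -> int:
    """Deterministic 64-bit hash of prefix + token (assumed hash choice)."""
    data = (prefix + token).encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little")


def hash_features(tokens, num_buckets=16, signed=True):
    """Fold a list of categorical tokens into a fixed-size vector."""
    vec = [0.0] * num_buckets
    for tok in tokens:
        idx = _hash(tok, "index:") % num_buckets  # which bucket
        sign = 1.0
        # A separate hash bit picks the sign, independent of the index.
        if signed and _hash(tok, "sign:") & 1:
            sign = -1.0
        vec[idx] += sign
    return vec
```

For example, hash_features(["red", "green", "blue"], num_buckets=8) always returns a vector of length 8, however many distinct tokens exist. With signed=False every token adds +1, so colliding tokens can only pile up in the same direction.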

Why: It provides a memory- and speed-efficient alternative to one-hot encoding when the vocabulary is large or evolving (e.g., ad logs, search queries, URLs).

How to read the plots: The bucket chart shows counts per bucket; red marks indicate buckets receiving multiple categories (collisions). The accuracy panel trains a tiny synthetic linear model; as the number of buckets shrinks, interference from collisions grows and accuracy degrades. Signed hashing reduces systematic bias because colliding features with opposite signs tend to cancel.

  • Collision rate rises with the load factor (#categories / #buckets), and skewed (Zipf) traffic can overload a few buckets.
  • Average load per bucket ≈ samples / buckets; memory footprint scales with vector size, not with vocabulary.
  • Compare one-hot vs hashing memory in the lower chart; hashing keeps constant memory even as categories grow.
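
The first bullet can be made concrete with a small sketch. It assumes uniform hashing and defines the collision rate as the fraction of categories that share a bucket with at least one other category; the tool may define it differently. Under those assumptions the expected rate is 1 - (1 - 1/m)^(n-1) for n categories and m buckets, which is close to n/m when n is much smaller than m. The function names are hypothetical.

```python
import random
from collections import Counter


def expected_collision_rate(num_categories: int, num_buckets: int) -> float:
    """Expected fraction of categories sharing a bucket (uniform hashing)."""
    return 1.0 - (1.0 - 1.0 / num_buckets) ** (num_categories - 1)


def simulated_collision_rate(num_categories, num_buckets, trials=20, seed=0):
    """Estimate the same fraction by assigning buckets at random."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    total = 0.0
    for _ in range(trials):
        buckets = [rng.randrange(num_buckets) for _ in range(num_categories)]
        load = Counter(buckets)
        # A category collides if its bucket holds more than one category.
        colliding = sum(1 for b in buckets if load[b] > 1)
        total += colliding / num_categories
    return total / trials
```

Comparing expected_collision_rate(1000, 1024) with expected_collision_rate(1000, 128) shows the rate climbing toward 1 as the bucket count drops. A skewed (Zipf) distribution would not change which categories collide, but it would concentrate more samples in the affected buckets.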
