Explore the “hashing trick” for large, sparse categorical spaces (e.g., text tokens or ad logs). Adjust the number of buckets and the category distribution to see how collisions change and how signed hashing stabilizes downstream models.
What: Feature hashing maps a potentially huge set of categories (tokens, IDs) into a fixed-size vector by hashing each category to a bucket. Optional signed hashing assigns each category a sign of ±1 via a second hash, so colliding categories tend to cancel and the expected contribution of a collision is near zero.
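As a rough illustration of that mapping, here is a minimal Python sketch; `hash_token`, `hashed_features`, and `n_buckets` are hypothetical names for this example, not the demo's actual implementation:

```python
# Minimal sketch of feature hashing with optional signed hashing.
import hashlib

def hash_token(token: str, salt: str = "") -> int:
    """Deterministic hash of a token (md5 keeps results stable across runs)."""
    return int(hashlib.md5((salt + token).encode()).hexdigest(), 16)

def hashed_features(tokens, n_buckets=16, signed=True):
    """Map a bag of tokens to a fixed-size vector of length n_buckets."""
    vec = [0.0] * n_buckets
    for tok in tokens:
        idx = hash_token(tok) % n_buckets          # bucket index
        # Second hash picks the sign, so colliding tokens cancel in expectation.
        sign = 1.0
        if signed and hash_token(tok, "sign") % 2 == 1:
            sign = -1.0
        vec[idx] += sign
    return vec

print(hashed_features(["cat", "dog", "cat", "fish"], n_buckets=8))
```

Note that the vector length is fixed up front, so unseen categories at inference time still land in some bucket without growing the model.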
Why: It provides a memory- and speed-efficient alternative to one-hot encoding when the vocabulary is large or evolving (e.g., ad logs, search queries, URLs).
How to read the plots: The bucket chart shows counts per bucket; red marks indicate buckets that receive multiple categories (collisions). The accuracy panel trains a tiny linear model on synthetic data; as the number of buckets shrinks, interference from collisions grows and accuracy degrades. Signed hashing reduces systematic bias by centering the contribution of collisions around zero.
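To get a feel for the collision counts behind the bucket chart, a small simulation like the following can be run; the vocabulary size and bucket counts below are made-up examples, not the page's actual data:

```python
# Count how many buckets receive more than one category at various bucket counts.
import hashlib
from collections import Counter

def bucket_of(category: str, n_buckets: int) -> int:
    return int(hashlib.md5(category.encode()).hexdigest(), 16) % n_buckets

vocab = [f"token_{i}" for i in range(1000)]      # synthetic vocabulary
for n_buckets in (2048, 512, 128, 32):
    counts = Counter(bucket_of(c, n_buckets) for c in vocab)
    collided = sum(1 for c in counts.values() if c > 1)   # buckets holding >1 category
    print(f"{n_buckets:5d} buckets -> {collided:4d} buckets with collisions")
```

As the bucket count drops toward the vocabulary size and below, nearly every bucket holds multiple categories, which is exactly the regime where the accuracy panel shows degradation.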