Categorical Encoding Lab

Compare classic encoders for categorical features. Adjust cardinality, class imbalance, and cross-validation to see how encoders affect memory, leakage, and model performance.


Encoders
One-Hot | Ordinal | Target (CV) | Mean | WOE | Binary
One-Hot
Schema: Category id → sparse vector of length K with a single 1 at index id.
x_onehot[i] = 1[i = cat_id]
Explodes the feature space at high cardinality; linear models handle it well, while tree models often don't need it.
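A minimal dense sketch in NumPy (the function name is illustrative; real pipelines usually prefer a sparse encoder such as scikit-learn's OneHotEncoder):

    import numpy as np

    def one_hot(cat_ids, K):
        """Dense one-hot: row i gets a single 1 at column cat_ids[i]."""
        out = np.zeros((len(cat_ids), K))
        out[np.arange(len(cat_ids)), cat_ids] = 1.0
        return out

    one_hot(np.array([0, 2, 1]), K=3)
    # array([[1., 0., 0.],
    #        [0., 0., 1.],
    #        [0., 1., 0.]])
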
Ordinal
Schema: Map category to integer index.
x_ordinal = index(cat)
Imposes artificial order and distance; can mislead linear models.
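A tiny sketch of the idea; mapping categories in sorted order is an arbitrary choice here, not a convention:

    def ordinal_encode(cats):
        """Assign each distinct category an integer index (the order is arbitrary)."""
        index = {c: i for i, c in enumerate(sorted(set(cats)))}
        return [index[c] for c in cats]

    ordinal_encode(["red", "blue", "red", "green"])  # [2, 0, 2, 1]

A linear model now treats red (2) as "twice" green (1), which is exactly the artificial order and distance warned about above.
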
Target (CV)
Schema: Replace category with the mean target computed out-of-fold.
x_target = E[y | cat] (K-fold, OOF)
Powerful but must avoid leakage with K-fold CV or time-based splits; add smoothing for rare categories.
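A sketch of out-of-fold target encoding with m-estimate smoothing; the fold count and m are illustrative knobs, not fixed conventions:

    import numpy as np
    from sklearn.model_selection import KFold

    def target_encode_oof(cats, y, n_splits=5, m=10.0, seed=0):
        """Out-of-fold target encoding with m-estimate smoothing:
        enc = (sum(y in cat) + m * prior) / (count(cat) + m), fit on other folds."""
        cats, y = np.asarray(cats), np.asarray(y, dtype=float)
        enc = np.empty(len(y))
        for tr, va in KFold(n_splits, shuffle=True, random_state=seed).split(y):
            prior = y[tr].mean()                       # fold-train base rate
            stats = {}
            for c in np.unique(cats[tr]):
                mask = cats[tr] == c
                stats[c] = (y[tr][mask].sum() + m * prior) / (mask.sum() + m)
            enc[va] = [stats.get(c, prior) for c in cats[va]]   # unseen -> prior
        return enc

Rows in the validation fold never contribute to their own statistics, and m pulls rare categories toward the prior instead of memorizing their labels.
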
Mean
Schema: Replace category with the global mean target per category (no CV).
x_mean = mean(y | cat)
Convenient but leaky; use only for illustration or with strong regularization and careful validation.
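For contrast with the CV version above, a minimal sketch of the leaky variant, where every row's own label contributes to its feature:

    import numpy as np

    def mean_encode_leaky(cats, y):
        """Per-category mean of y over the FULL dataset: each row's own label
        leaks into its feature; a singleton category encodes its label exactly."""
        cats, y = np.asarray(cats), np.asarray(y, dtype=float)
        means = {c: y[cats == c].mean() for c in np.unique(cats)}
        return np.array([means[c] for c in cats])
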
WOE
Schema: Weight of Evidence (binary targets): log odds by category, with smoothing.
woe(cat) = ln((P(y=1|cat)+ε)/(P(y=0|cat)+ε))
Often yields well-calibrated scores for logistic models; common in risk scoring.
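A direct transcription of the formula above; ε = 1e-3 is an arbitrary smoothing choice (smoothing the counts instead is also common):

    import numpy as np

    def woe_encode(cats, y, eps=1e-3):
        """Per-category smoothed log odds: ln((P(y=1|cat)+eps) / (P(y=0|cat)+eps))."""
        cats, y = np.asarray(cats), np.asarray(y, dtype=float)
        table = {}
        for c in np.unique(cats):
            p1 = y[cats == c].mean()                   # P(y=1 | cat)
            table[c] = np.log((p1 + eps) / (1.0 - p1 + eps))
        return np.array([table[c] for c in cats])
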
Binary
Schema: Encode category index in base 2 using ⌈log2 K⌉ bits.
x_bits = bin(cat_id)
Very compact; unrelated categories share bit columns, which can blur distinctions, but dimensionality stays small.
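A sketch using integer bit extraction (LSB-first column order is an arbitrary choice):

    import numpy as np

    def binary_encode(cat_ids, K):
        """Spread each integer id over ceil(log2 K) bit columns (LSB-first)."""
        n_bits = max(1, int(np.ceil(np.log2(K))))
        ids = np.asarray(cat_ids)
        return np.stack([(ids >> b) & 1 for b in range(n_bits)], axis=1)

    binary_encode([0, 5, 7], K=8)
    # array([[0, 0, 0],
    #        [1, 0, 1],
    #        [1, 1, 1]])
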
Bar shows feature count per encoder for current cardinality.
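The bar heights follow directly from the schemas above; a sketch of the per-encoder column counts:

    import math

    def feature_counts(K):
        """Columns produced per encoder at cardinality K (what the bar chart plots)."""
        return {"one-hot": K, "ordinal": 1, "target (CV)": 1, "mean": 1,
                "WOE": 1, "binary": max(1, math.ceil(math.log2(K)))}

    feature_counts(1000)
    # {'one-hot': 1000, 'ordinal': 1, 'target (CV)': 1, 'mean': 1, 'WOE': 1, 'binary': 10}
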
Model Performance
Logistic model on synthetic labels (train/test split)
Metrics reported: Accuracy | ROC-AUC | PR-AUC | Time (sim) | Memory
Toggle calibration to see predicted probability vs observed rate. Target/WOE often calibrate better than ordinal.
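One way to compute such a reliability curve is scikit-learn's calibration_curve; the synthetic scores below are perfectly calibrated by construction:

    import numpy as np
    from sklearn.calibration import calibration_curve

    rng = np.random.default_rng(0)
    scores = rng.uniform(size=2000)                          # stand-in predicted P(y=1)
    labels = (rng.uniform(size=2000) < scores).astype(int)   # drawn at exactly that rate

    # Bin predictions, then compare observed positive rate to mean predicted probability.
    frac_pos, mean_pred = calibration_curve(labels, scores, n_bins=10)
    # Well calibrated: frac_pos tracks mean_pred across bins (here by construction).
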
Dataset generator | Validation & leakage | Training | Encoder selection
Use the tabs or the encoder dropdown to switch which encoder the performance plots use.

How to interpret

What: Categorical encoders transform labels into numeric features. Some expand to many columns (one-hot); others compress to one or a few numbers (target, WOE, binary).

Why: Different models react differently to encodings; the right encoder can improve accuracy, calibration, and speed while controlling memory.

How to use: Increase cardinality to see the one-hot blow-up; disable CV to observe the optimistic metrics that leakage gives target encoding (sketched after the checklist below); enable calibration to compare probabilistic quality.

  • Target encoding should be out-of-fold (K-fold CV) to avoid leakage.
  • WOE uses smoothed log-odds; often well-behaved for logistic models.
  • Binary encoding uses ⌈log2 K⌉ bits; very compact but mixes categories.
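
A sketch of that leakage effect: the categories below carry no real signal, yet full-train mean encoding inflates the train metric; the sizes and 0.3 base rate are arbitrary:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n, K = 4000, 500                                   # many rare categories -> strong leak
    cats = rng.integers(0, K, size=n)
    y = (rng.uniform(size=n) < 0.3).astype(int)        # labels independent of category

    c_tr, c_te, y_tr, y_te = train_test_split(cats, y, test_size=0.5, random_state=0)

    # Leaky mean encoding: per-category means over the full training half.
    prior = y_tr.mean()
    means = {c: y_tr[c_tr == c].mean() for c in np.unique(c_tr)}
    x_tr = np.array([means[c] for c in c_tr]).reshape(-1, 1)
    x_te = np.array([means.get(c, prior) for c in c_te]).reshape(-1, 1)

    clf = LogisticRegression().fit(x_tr, y_tr)
    print("train AUC:", roc_auc_score(y_tr, clf.predict_proba(x_tr)[:, 1]))
    print("test  AUC:", roc_auc_score(y_te, clf.predict_proba(x_te)[:, 1]))
    # No real signal exists, so test AUC stays near 0.5 while train AUC looks inflated.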
