Categorical Encoding Lab

Compare classic encoders for categorical features. Adjust cardinality, class imbalance, and cross-validation to see how encoders affect memory, leakage, and model performance.


Encoders
One-Hot | Ordinal | Target (CV) | Mean | WOE | Binary
One-Hot
Schema: Category id → sparse vector of length K with a single 1 at index id.
x_onehot[i] = 1[i = cat_id]
Explodes the feature space at high cardinality; linear models handle it well, while tree models often don't need it.
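A minimal dense sketch in NumPy (the function name is illustrative; real pipelines usually prefer a sparse encoder such as scikit-learn's OneHotEncoder):

    import numpy as np

    def one_hot(cat_ids, K):
        """Dense one-hot: row i gets a single 1 at column cat_ids[i]."""
        out = np.zeros((len(cat_ids), K))
        out[np.arange(len(cat_ids)), cat_ids] = 1.0
        return out

    one_hot(np.array([0, 2, 1]), K=3)
    # array([[1., 0., 0.],
    #        [0., 0., 1.],
    #        [0., 1., 0.]])
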
Ordinal
Schema: Map category to integer index.
x_ordinal = index(cat)
Imposes artificial order and distance; can mislead linear models.
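A tiny sketch of the idea; mapping categories in sorted order is an arbitrary choice here, not a convention:

    def ordinal_encode(cats):
        """Assign each distinct category an integer index (the order is arbitrary)."""
        index = {c: i for i, c in enumerate(sorted(set(cats)))}
        return [index[c] for c in cats]

    ordinal_encode(["red", "blue", "red", "green"])  # [2, 0, 2, 1]

A linear model now treats red (2) as "twice" green (1), which is exactly the artificial order and distance warned about above.
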
Target (CV)
Schema: Replace category with the mean target computed out-of-fold.
x_target = E[y | cat] (K-fold, OOF)
Powerful but must avoid leakage with K-fold CV or time-based splits; add smoothing for rare categories.
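A sketch of out-of-fold target encoding with m-estimate smoothing; the fold count and m are illustrative knobs, not fixed conventions:

    import numpy as np
    from sklearn.model_selection import KFold

    def target_encode_oof(cats, y, n_splits=5, m=10.0, seed=0):
        """Out-of-fold target encoding with m-estimate smoothing:
        enc = (sum(y in cat) + m * prior) / (count(cat) + m), fit on other folds."""
        cats, y = np.asarray(cats), np.asarray(y, dtype=float)
        enc = np.empty(len(y))
        for tr, va in KFold(n_splits, shuffle=True, random_state=seed).split(y):
            prior = y[tr].mean()                       # fold-train base rate
            stats = {}
            for c in np.unique(cats[tr]):
                mask = cats[tr] == c
                stats[c] = (y[tr][mask].sum() + m * prior) / (mask.sum() + m)
            enc[va] = [stats.get(c, prior) for c in cats[va]]   # unseen -> prior
        return enc

Rows in the validation fold never contribute to their own statistics, and m pulls rare categories toward the prior instead of memorizing their labels.
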
Mean
Schema: Replace category with the global mean target per category (no CV).
x_mean = mean(y | cat)
Convenient but leaky; use only for illustration or with strong regularization and careful validation.
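For contrast with the CV version above, a minimal sketch of the leaky variant, where every row's own label contributes to its feature:

    import numpy as np

    def mean_encode_leaky(cats, y):
        """Per-category mean of y over the FULL dataset: each row's own label
        leaks into its feature; a singleton category encodes its label exactly."""
        cats, y = np.asarray(cats), np.asarray(y, dtype=float)
        means = {c: y[cats == c].mean() for c in np.unique(cats)}
        return np.array([means[c] for c in cats])
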
WOE
Schema: Weight of Evidence (binary targets): log odds by category, with smoothing.
woe(cat) = ln((P(y=1|cat)+ε)/(P(y=0|cat)+ε))
Often yields well-calibrated scores for logistic models; common in risk scoring.
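A direct transcription of the formula above; ε = 1e-3 is an arbitrary smoothing choice (smoothing the counts instead is also common):

    import numpy as np

    def woe_encode(cats, y, eps=1e-3):
        """Per-category smoothed log odds: ln((P(y=1|cat)+eps) / (P(y=0|cat)+eps))."""
        cats, y = np.asarray(cats), np.asarray(y, dtype=float)
        table = {}
        for c in np.unique(cats):
            p1 = y[cats == c].mean()                   # P(y=1 | cat)
            table[c] = np.log((p1 + eps) / (1.0 - p1 + eps))
        return np.array([table[c] for c in cats])
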
Binary
Schema: Encode category index in base 2 using ⌈log2 K⌉ bits.
x_bits = bin(cat_id)
Very compact; unrelated categories share bit columns, which can blur distinctions, but dimensionality stays small.
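A sketch using integer bit extraction (LSB-first column order is an arbitrary choice):

    import numpy as np

    def binary_encode(cat_ids, K):
        """Spread each integer id over ceil(log2 K) bit columns (LSB-first)."""
        n_bits = max(1, int(np.ceil(np.log2(K))))
        ids = np.asarray(cat_ids)
        return np.stack([(ids >> b) & 1 for b in range(n_bits)], axis=1)

    binary_encode([0, 5, 7], K=8)
    # array([[0, 0, 0],
    #        [1, 0, 1],
    #        [1, 1, 1]])
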
Bar shows feature count per encoder for current cardinality.
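The bar heights follow directly from the schemas above; a sketch of the per-encoder column counts:

    import math

    def feature_counts(K):
        """Columns produced per encoder at cardinality K (what the bar chart plots)."""
        return {"one-hot": K, "ordinal": 1, "target (CV)": 1, "mean": 1,
                "WOE": 1, "binary": max(1, math.ceil(math.log2(K)))}

    feature_counts(1000)
    # {'one-hot': 1000, 'ordinal': 1, 'target (CV)': 1, 'mean': 1, 'WOE': 1, 'binary': 10}
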
Model Performance
Logistic model on synthetic labels (train/test split)
Metrics reported: Accuracy | ROC-AUC | PR-AUC | Time (sim) | Memory
Toggle calibration to see predicted probability vs observed rate. Target/WOE often calibrate better than ordinal.
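One way to compute such a reliability curve is scikit-learn's calibration_curve; the synthetic scores below are perfectly calibrated by construction:

    import numpy as np
    from sklearn.calibration import calibration_curve

    rng = np.random.default_rng(0)
    scores = rng.uniform(size=2000)                          # stand-in predicted P(y=1)
    labels = (rng.uniform(size=2000) < scores).astype(int)   # drawn at exactly that rate

    # Bin predictions, then compare observed positive rate to mean predicted probability.
    frac_pos, mean_pred = calibration_curve(labels, scores, n_bins=10)
    # Well calibrated: frac_pos tracks mean_pred across bins (here by construction).
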
Dataset generator | Validation & leakage | Training | Encoder selection
Use the tabs or the encoder dropdown to switch which encoder the performance plots use.

How to interpret

What: Categorical encoders transform labels into numeric features. Some expand to many columns (one-hot); others compress to one or a few numbers (target, WOE, binary).

Why: Different models react differently to encodings; the right encoder can improve accuracy, calibration, and speed while controlling memory.

How to use: Increase cardinality to see the one-hot blow-up; disable CV to observe the optimistic metrics that leakage gives target encoding (sketched after the checklist below); enable calibration to compare probabilistic quality.

  • Target encoding should be out-of-fold (K-fold CV) to avoid leakage.
  • WOE uses smoothed log-odds; often well-behaved for logistic models.
  • Binary encoding uses ⌈log2 K⌉ bits; very compact but mixes categories.
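
A sketch of that leakage effect: the categories below carry no real signal, yet full-train mean encoding inflates the train metric; the sizes and 0.3 base rate are arbitrary:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n, K = 4000, 500                                   # many rare categories -> strong leak
    cats = rng.integers(0, K, size=n)
    y = (rng.uniform(size=n) < 0.3).astype(int)        # labels independent of category

    c_tr, c_te, y_tr, y_te = train_test_split(cats, y, test_size=0.5, random_state=0)

    # Leaky mean encoding: per-category means over the full training half.
    prior = y_tr.mean()
    means = {c: y_tr[c_tr == c].mean() for c in np.unique(c_tr)}
    x_tr = np.array([means[c] for c in c_tr]).reshape(-1, 1)
    x_te = np.array([means.get(c, prior) for c in c_te]).reshape(-1, 1)

    clf = LogisticRegression().fit(x_tr, y_tr)
    print("train AUC:", roc_auc_score(y_tr, clf.predict_proba(x_tr)[:, 1]))
    print("test  AUC:", roc_auc_score(y_te, clf.predict_proba(x_te)[:, 1]))
    # No real signal exists, so test AUC stays near 0.5 while train AUC looks inflated.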
