Decision Trees, picked properly

Model Selection

A decision tree can fit anything, including the noise. This lab teaches you to choose one: run k-fold cross-validation, sweep validation curves to spot under- and overfitting, prune with ccp_alpha, scan a grid-search heatmap, and finally evaluate once on a hold-out test set. The decision regions update live so you can see what each hyperparameter choice does.

Panels: Decision regions (click to add a point; Class 0 and Class 1) · Validation curve · Grid search (max_depth × min_samples_leaf)
Dataset: pick a preset, or click the canvas to add points
Tree hyperparameters (refits live): criterion · max_depth = 5 · min_samples_split = 4 · min_samples_leaf = 2 · max_features = 2 · ccp_alpha (pruning) = 0.000
Validation (how CV splits the data): K-folds = 5 · strategy · scoring metric · cost FP = 1.0 · cost FN = 5.0
Readouts: CV Mean · CV Std · Recommended
Final test (hold-out, never tune on it): Final Test Score
Theory & exercises · CV math, bias-variance, ideas to try

The math, in four moves

1. Impurity at a node.

Gini measures the chance of misclassifying a random sample at node $t$ if it were labeled at random according to the node's class proportions $p_k$:

$$ \mathrm{Gini}(t) = 1 - \sum_{k} p_k^2 $$

Entropy uses information instead: $H(t) = -\sum_k p_k \log_2 p_k$. Both penalize mixed nodes; a pure node scores 0.
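As a minimal sketch of the two impurity formulas above, the snippet below computes both from a list of class labels at a node. The helper names (`class_proportions`, `gini`, `entropy`) and the toy label lists are illustrative, not part of the lab.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Return the class proportions p_k for the labels at a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Entropy in bits: sum_k p_k * log2(1 / p_k), i.e. -sum_k p_k log2 p_k."""
    return sum(p * math.log2(1.0 / p) for p in class_proportions(labels))

pure = [1, 1, 1, 1]    # hypothetical pure node
mixed = [0, 0, 1, 1]   # hypothetical 50/50 binary node

print(gini(pure), entropy(pure))     # 0.0 0.0
print(gini(mixed), entropy(mixed))   # 0.5 1.0
```

A 50/50 binary node is the worst case for both measures: Gini peaks at 0.5 and entropy at 1 bit.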

2. Best split = max impurity reduction.

$$ \Delta i = i(t) - \tfrac{n_L}{n}\,i(t_L) - \tfrac{n_R}{n}\,i(t_R) $$

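Here $n_L$ and $n_R$ are the sample counts sent to the left and right children $t_L$ and $t_R$, and $n = n_L + n_R$. Try every candidate threshold on every feature; pick the one with the largest $\Delta i$. Recurse on the children.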
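The split search can be sketched as a brute-force loop. This is an illustration under simple assumptions (Gini impurity, candidate thresholds at midpoints between sorted unique feature values), not how any particular library implements it; `best_split` and the four-point dataset are made up for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(X, y):
    """Exhaustive search over every feature and every midpoint threshold.

    Returns (feature_index, threshold, impurity_reduction).
    """
    n = len(y)
    parent_impurity = gini(y)
    best = (None, None, 0.0)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive unique values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [y[i] for i in range(n) if X[i][j] <= threshold]
            right = [y[i] for i in range(n) if X[i][j] > threshold]
            delta = (parent_impurity
                     - len(left) / n * gini(left)
                     - len(right) / n * gini(right))
            if delta > best[2]:
                best = (j, threshold, delta)
    return best

# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
y = [0, 0, 1, 1]

feature, threshold, delta = best_split(X, y)
print(feature, threshold, delta)   # 0 2.5 0.5
```

A full tree builder would call `best_split` on each child's subset until a stopping rule such as max_depth or min_samples_leaf applies.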

3. K-fold cross-validation.

Split the data into K folds. For each fold $D_f$, train a model $\hat{m}_{-f}$ on the other K−1 folds and score it on $D_f$:

$$ \mathrm{CV} = \frac{1}{K}\sum_{f=1}^{K} \mathrm{score}\!\left(\hat{m}_{-f},\; D_f\right) $$

CV Mean = honest generalization estimate; CV Std = how sensitive that estimate is to the random split.
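One way to reproduce the CV Mean and CV Std readouts with scikit-learn is sketched below. The two-moons dataset, the noise level, and the random seeds are assumptions made for the example; the hyperparameter values mirror the lab's default sliders.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumed stand-in for one of the lab's dataset presets.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

tree = DecisionTreeClassifier(max_depth=5, min_samples_leaf=2, random_state=0)

# K = 5 folds, shuffled once so the folds are random rather than ordered.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy score per fold: train on the other 4 folds, score on the held-out one.
scores = cross_val_score(tree, X, y, cv=cv, scoring="accuracy")

cv_mean, cv_std = scores.mean(), scores.std()
print(f"fold scores: {scores.round(3)}")
print(f"CV Mean = {cv_mean:.3f}, CV Std = {cv_std:.3f}")
```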

4. Cost-complexity pruning.

$$ R_\alpha(T) = R(T) + \alpha\,|\tilde{T}| $$

$R(T)$ is the training error of tree $T$ and $|\tilde{T}|$ is its leaf count. For each $\alpha$, pruning selects the subtree that minimizes $R_\alpha(T)$, taking the smallest one on ties. Larger $\alpha$ → smaller tree → less variance, more bias.
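The effect of $\alpha$ on tree size can be inspected with scikit-learn's `cost_complexity_pruning_path`, sketched here on an assumed two-moons dataset. Note that scikit-learn measures $R(T)$ as the total sample-weighted impurity of the leaves rather than a raw error rate; the clipping step is a defensive assumption against tiny negative values from floating-point round-off.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Assumed stand-in dataset.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

# Effective alphas at which the fully grown tree loses its next weakest subtree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.clip(path.ccp_alphas, 0.0, None)   # guard against round-off below zero
impurities = path.impurities                   # total leaf impurity at each alpha

# Refit one tree per alpha and record how many leaves survive pruning.
leaf_counts = [
    DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y).get_n_leaves()
    for alpha in alphas
]

for alpha, impurity, leaves in zip(alphas, impurities, leaf_counts):
    print(f"alpha={alpha:.4f}  leaf impurity={impurity:.4f}  leaves={leaves}")
```

Reading the table top to bottom shows the trade-off in the formula: the leaf count falls while the training impurity rises.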

Try this

Watch overfitting on XOR

Pick XOR. Push max_depth to 12 and min_samples_leaf to 1. Hit Compute Curve with sweep=max_depth. The train score climbs to 1.0; the CV score stops improving around depth 4. That gap is overfitting.
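A rough offline equivalent of this exercise, assuming a synthetic XOR dataset with 10% flipped labels (the lab's actual XOR preset may differ), uses `validation_curve` to sweep max_depth:

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Hypothetical XOR data: class = 1 when exactly one coordinate is positive.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(400, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)
flip = rng.random(400) < 0.1          # assumed 10% label noise
y[flip] = 1 - y[flip]

depths = np.arange(1, 13)             # sweep max_depth from 1 to 12
train_scores, cv_scores = validation_curve(
    DecisionTreeClassifier(min_samples_leaf=1, random_state=0),
    X, y,
    param_name="max_depth",
    param_range=depths,
    cv=5,
)

# Average over the 5 folds to get one train and one CV value per depth.
train_mean = train_scores.mean(axis=1)
cv_mean = cv_scores.mean(axis=1)

for depth, train, cv in zip(depths, train_mean, cv_mean):
    print(f"max_depth={depth:2d}  train={train:.3f}  cv={cv:.3f}  gap={train - cv:.3f}")
```

The `gap` column is the quantity to watch: it widens once extra depth only fits the flipped labels.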

Prune your way out

Set sweep=ccp_alpha. Watch CV rise, plateau, then fall as $\alpha$ over-prunes. The elbow is your model.
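One possible way to turn "the elbow" into a rule is to take the largest $\alpha$ whose CV score stays within a small tolerance of the best one. The sketch below does that; the alpha grid, the 0.01 tolerance, and the two-moons dataset are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Assumed stand-in dataset.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

alphas = np.linspace(0.0, 0.05, 11)   # assumed sweep range for ccp_alpha
train_scores, cv_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X, y,
    param_name="ccp_alpha",
    param_range=alphas,
    cv=5,
)
cv_mean = cv_scores.mean(axis=1)

# Prefer the simplest tree that is nearly as good as the best one.
tolerance = 0.01                      # arbitrary choice for this sketch
good_enough = cv_mean >= cv_mean.max() - tolerance
chosen_alpha = alphas[good_enough].max()

for alpha, score in zip(alphas, cv_mean):
    marker = "  <- chosen" if alpha == chosen_alpha else ""
    print(f"ccp_alpha={alpha:.3f}  cv={score:.3f}{marker}")
```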

Stable plateaus beat spiky maxima

Run the grid heatmap. A single bright cell next to dark neighbours is suspicious — probably CV noise. Pick a depth/leaf in a wide green plateau instead.
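The heatmap and the plateau idea can be sketched with `GridSearchCV`. Averaging each cell with its neighbours is one assumed way to favour plateaus over isolated peaks; it is not necessarily what the lab's Recommended readout does. The grid values and dataset are also assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumed stand-in dataset and grid.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
depths = [1, 2, 3, 4, 5, 6, 8, 10]
leaves = [1, 2, 4, 8, 16]

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": depths, "min_samples_leaf": leaves},
    cv=5,
)
search.fit(X, y)

# Build the heatmap: rows are max_depth values, columns are min_samples_leaf values.
heat = np.full((len(depths), len(leaves)), np.nan)
results = search.cv_results_
for params, score in zip(results["params"], results["mean_test_score"]):
    row = depths.index(params["max_depth"])
    col = leaves.index(params["min_samples_leaf"])
    heat[row, col] = score

# Smooth with a 3x3 mean (edges padded) so a lone bright cell counts for less.
padded = np.pad(heat, 1, mode="edge")
smooth = sum(
    padded[i:i + heat.shape[0], j:j + heat.shape[1]]
    for i in range(3) for j in range(3)
) / 9.0

i, j = np.unravel_index(np.argmax(smooth), smooth.shape)
print("raw best cell:    ", search.best_params_)
print("plateau best cell:", {"max_depth": depths[i], "min_samples_leaf": leaves[j]})
```

When the two printed cells disagree, the raw maximum is likely a spike; the smoothed choice sits in a neighbourhood of similarly good settings.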

Cost-sensitive screening

Set metric=Cost-based, push cost FN to 10. The recommended tree shifts to a higher-recall configuration. This is how you tune for screening problems (missed-fraud cost ≫ false alarm cost).
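A cost-based metric of this kind can be approximated in scikit-learn with `make_scorer`. The sketch assumes the metric is the average misclassification cost per sample (the lab's exact formula is not given here), and the imbalanced dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

COST_FP, COST_FN = 1.0, 10.0   # a missed positive costs 10x a false alarm

def average_cost(y_true, y_pred):
    """Assumed cost metric: (cost_FP * FP + cost_FN * FN) / n_samples."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return (COST_FP * fp + COST_FN * fn) / len(y_true)

# Lower cost is better, so the scorer negates it for model selection.
cost_scorer = make_scorer(average_cost, greater_is_better=False)

# Hypothetical screening data: about 15% positives.
X, y = make_classification(
    n_samples=600, n_features=6, weights=[0.85, 0.15], random_state=0
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 5, 20]},
    scoring=cost_scorer,
    cv=5,
)
search.fit(X, y)

print("lowest-cost configuration:", search.best_params_)
print("mean CV cost per sample:  ", -search.best_score_)
```

Because `greater_is_better=False` flips the sign, `best_score_` is the negated cost; negate it again to report it.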

Forward-chaining for time series

Switch strategy to Forward-Chaining. If your data drifts over time, random K-fold gives optimistic scores because each model trains on samples from the future of its validation fold. Forward splits respect time order, which is closer to production.
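In scikit-learn, forward-chaining corresponds to `TimeSeriesSplit`. The sketch below compares it with shuffled K-fold on a hypothetical dataset whose rows are in time order and whose decision boundary drifts; the drift model is invented for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical drifting data: rows are ordered by time t in [0, 1].
rng = np.random.default_rng(0)
n = 600
t = np.linspace(0.0, 1.0, n)
X = rng.normal(size=(n, 2))
y = (X[:, 0] + X[:, 1] > 2.0 * t - 1.0).astype(int)   # boundary shifts with t

tree = DecisionTreeClassifier(max_depth=5, min_samples_leaf=2, random_state=0)

random_cv = KFold(n_splits=5, shuffle=True, random_state=0)
forward = TimeSeriesSplit(n_splits=5)

# In every forward split, all training rows come before all validation rows.
for train_idx, test_idx in forward.split(X):
    print(f"train rows 0..{train_idx.max()}  ->  validate rows "
          f"{test_idx.min()}..{test_idx.max()}")

random_scores = cross_val_score(tree, X, y, cv=random_cv)
forward_scores = cross_val_score(tree, X, y, cv=forward)
print(f"random K-fold:    {random_scores.mean():.3f}")
print(f"forward-chaining: {forward_scores.mean():.3f}")
```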

The final-test rule

Set aside a 20% hold-out before you start tuning and keep it untouched while iterating. After picking hyperparameters, evaluate on it once. If the score drops a lot vs CV mean, you tuned too aggressively.
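The order of operations matters here, so a short sketch may help: split first, tune only on the development portion, then score the hold-out a single time. The dataset and the grid are assumptions for the example.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed stand-in dataset.
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

# 1. Carve off the 20% hold-out before any tuning happens.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2. Tune with cross-validation on the development set only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 4, 5, 6], "min_samples_leaf": [1, 2, 4, 8]},
    cv=5,
)
search.fit(X_dev, y_dev)
cv_mean = search.best_score_

# 3. Evaluate the refit best model on the hold-out exactly once.
test_score = search.score(X_test, y_test)

print("chosen hyperparameters:", search.best_params_)
print(f"CV Mean = {cv_mean:.3f}, Final Test Score = {test_score:.3f}")
```

`GridSearchCV` refits the best configuration on the whole development set by default, so `search.score` uses that refit model.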

In one glance

⚠️ Watch out: Tuning on the test set · High train, low CV · Single-cell heatmap peak · Random K-fold on temporal data · Tiny leaves on noisy data
🔧 In practice: DecisionTreeClassifier · GridSearchCV · cross_val_score · cost_complexity_pruning_path · TimeSeriesSplit · make_scorer

Frequently asked

K-fold cross-validation splits the dataset into K roughly equal folds. The model trains on K−1 of them and validates on the held-out fold; this rotates so every fold serves as the validation set once. CV Mean and CV Std summarize how well your hyperparameter choice generalizes; a high mean with a low std is the sweet spot. Random K-fold assumes i.i.d. data; forward-chaining respects temporal order.
ccp_alpha penalizes tree complexity. After the full tree is grown, subtrees are collapsed weakest link first: any subtree whose impurity reduction per extra leaf (its effective alpha) is below ccp_alpha gets pruned. Higher alpha → smaller tree → less variance, more bias. Sweep it on the validation curve to find the elbow where CV stops improving.
Gini and entropy almost always agree on the chosen splits in practice. Gini is slightly cheaper to compute (no log) and is the sklearn default. Entropy can prefer slightly more balanced splits. Pick one and don’t worry; depth and leaf settings matter far more.
A high train score with a low CV score is classic overfitting. The tree memorized the training data (every leaf is pure), but the regions don’t generalize. Reduce max_depth, raise min_samples_leaf, or raise ccp_alpha. The validation curve shows the gap: when train keeps climbing but CV plateaus or drops, you’ve crossed into overfit territory.