Decision Trees, picked properly
A decision tree can fit anything — including the noise. This lab teaches you to choose one. Run k-fold cross-validation, sweep validation curves to spot under/overfitting, prune with ccp_alpha, scan a grid-search heatmap, and finally evaluate once on a hold-out test. The decision regions update live so you can see what your hyperparameter choice does.
Theory & exercises · CV math, bias-variance, ideas to try
The math, in four moves
1. Impurity at a node.
Gini is the probability of misclassifying a random sample at node $t$ if it were labeled at random according to the node's class proportions $p_k$:
$$ \mathrm{Gini}(t) = 1 - \sum_{k} p_k^2 $$
Entropy uses information instead: $H(t) = -\sum_k p_k \log_2 p_k$. Both penalize mixed nodes; pure nodes score 0.
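To make the two impurity formulas concrete, here is a minimal sketch that computes both from per-class sample counts at a node. The function names and the example counts are illustrative assumptions, not part of the lab.

```python
import math

def class_proportions(counts):
    """Turn per-class sample counts at a node into proportions p_k (empty classes dropped)."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]

def gini(counts):
    """Gini impurity: 1 - sum_k p_k^2."""
    return 1.0 - sum(p * p for p in class_proportions(counts))

def entropy(counts):
    """Entropy in bits: sum_k -p_k * log2(p_k)."""
    return sum(-p * math.log2(p) for p in class_proportions(counts))

# Illustrative nodes: a balanced two-class node and a pure node.
print("balanced:", gini([5, 5]), entropy([5, 5]))
print("pure:    ", gini([10, 0]), entropy([10, 0]))
```

For two classes, Gini peaks at 0.5 and entropy at 1 bit, both at a 50/50 mix.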
2. Best split = max impurity reduction.
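$$ \Delta i = i(t) - \tfrac{n_L}{n}\,i(t_L) - \tfrac{n_R}{n}\,i(t_R) $$
Here $i(\cdot)$ is the node impurity (Gini or entropy), and $n_L$ and $n_R$ are the numbers of the node's $n$ samples sent to the left and right children. Try every candidate threshold on every feature; pick the one with the largest $\Delta i$. Recurse on the children.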
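The split search can be sketched for a single numeric feature as follows. This is a simplified illustration, assuming Gini impurity and midpoints between consecutive distinct values as candidate thresholds; real implementations are considerably more optimized, and the toy data is made up.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Return (threshold, impurity_reduction) for one numeric feature."""
    n = len(ys)
    parent = gini(ys)
    pairs = sorted(zip(xs, ys))
    best_threshold, best_gain = None, 0.0
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold separates equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Delta i = i(t) - n_L/n * i(t_L) - n_R/n * i(t_R)
        gain = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
        if gain > best_gain:
            best_threshold, best_gain = (lo + hi) / 2, gain
    return best_threshold, best_gain

# Made-up, perfectly separable feature.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [0, 0, 0, 1, 1, 1]
threshold, gain = best_split(xs, ys)
print(threshold, gain)
```

A full tree builder would run this over every feature, keep the best, and recurse on the two children.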
3. K-fold cross-validation.
Split the data into K folds. For each fold $f$, train on the other K-1 and score on $f$:
$$ \mathrm{CV} = \frac{1}{K}\sum_{f=1}^{K} \mathrm{score}\!\left(\hat{m}_{-f},\; D_f\right) $$
Here $\hat{m}_{-f}$ is the model trained without fold $f$ and $D_f$ is the held-out fold. CV Mean = honest generalization estimate; CV Std = how sensitive that estimate is to the random split.
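The K-fold loop can be sketched in plain Python. The `fit_stump` model here is a hypothetical stand-in (a threshold halfway between the two class means) so that the example stays self-contained; any `fit`/`score` pair could be plugged in, and the data is made up.

```python
import random
from statistics import mean, pstdev

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, fit, score, k=5, seed=0):
    """Train on K-1 folds, score on the held-out fold; return (mean, std) of the K scores."""
    folds = kfold_indices(len(y), k, seed)
    scores = []
    for f, test in enumerate(folds):
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(score(model, [X[i] for i in test], [y[i] for i in test]))
    return mean(scores), pstdev(scores)  # population std over the folds

# Hypothetical stand-in model: one threshold between the class means.
def fit_stump(X, y):
    m0 = mean(x for x, label in zip(X, y) if label == 0)
    m1 = mean(x for x, label in zip(X, y) if label == 1)
    return (m0 + m1) / 2

def accuracy(threshold, X, y):
    return mean(float((x > threshold) == (label == 1)) for x, label in zip(X, y))

X = [float(i) for i in range(20)]
y = [0] * 10 + [1] * 10
cv_mean, cv_std = cross_validate(X, y, fit_stump, accuracy, k=5)
print(f"CV mean={cv_mean:.3f}  CV std={cv_std:.3f}")
```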
4. Cost-complexity pruning.
$$ R_\alpha(T) = R(T) + \alpha\,|\tilde{T}| $$
$R(T)$ is the training error and $|\tilde{T}|$ is the leaf count. For each $\alpha$, pruning keeps the smallest subtree that minimizes $R_\alpha$. Larger $\alpha$ → smaller tree → less variance, more bias.
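A small numeric sketch shows how the penalty changes which subtree wins. The subtree sequence below (leaf counts and training errors) is invented for illustration; in practice it would come from pruning a fitted tree.

```python
# Hypothetical nested subtrees of one tree: (leaf count, training error R(T)).
subtrees = [(1, 0.40), (2, 0.25), (4, 0.15), (8, 0.10), (16, 0.08)]

def select_subtree(subtrees, alpha):
    """Minimize R(T) + alpha * leaves; ties go to the smaller tree."""
    return min(subtrees, key=lambda t: (t[1] + alpha * t[0], t[0]))

for alpha in (0.0, 0.01, 0.03, 0.1, 0.2):
    leaves, err = select_subtree(subtrees, alpha)
    print(f"alpha={alpha:<5} leaves={leaves:<3} R_alpha={err + alpha * leaves:.3f}")
```

As $\alpha$ grows, each extra leaf has to buy more training-error reduction to stay in the tree, so the selected subtree shrinks toward the root.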
Try this
Watch overfitting on XOR
Pick XOR. Push max_depth to 12 and min_samples_leaf to 1. Hit Compute Curve with sweep=max_depth. Train climbs to 1.0; CV stops improving around depth 4. That gap is overfitting.
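Outside the lab, a similar curve can be reproduced with scikit-learn's `validation_curve`. This is a sketch under assumptions: the synthetic XOR data, the 15% label noise, and the sample size are made up, so the depth at which CV flattens will not necessarily match the lab's.

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic XOR with label noise (assumed stand-in for the lab's dataset).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)
flip = rng.random(400) < 0.15
y[flip] = 1 - y[flip]

depths = np.arange(1, 13)
train_scores, cv_scores = validation_curve(
    DecisionTreeClassifier(min_samples_leaf=1, random_state=0),
    X, y, param_name="max_depth", param_range=depths, cv=5,
)
train_mean = train_scores.mean(axis=1)  # one value per depth
cv_mean = cv_scores.mean(axis=1)

for d, tr, cv in zip(depths, train_mean, cv_mean):
    print(f"depth={d:2d}  train={tr:.3f}  cv={cv:.3f}  gap={tr - cv:.3f}")
```

The printed gap column is the quantity to watch: it widens once extra depth only memorizes noise.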
Prune your way out
Set sweep=ccp_alpha. Watch CV rise, plateau, then fall as $\alpha$ over-prunes. The elbow is your model.
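The same sweep can be sketched with scikit-learn: `cost_complexity_pruning_path` lists the candidate alphas for a tree, and each one is then scored by cross-validation. The data is the same assumed synthetic XOR, and taking the alpha with the highest CV mean is a simple stand-in for reading the elbow off the curve by eye.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic XOR with label noise (assumed stand-in for the lab's dataset).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)
flip = rng.random(400) < 0.15
y[flip] = 1 - y[flip]

# Candidate alphas at which the fully grown tree loses a subtree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))  # guard against tiny negatives

cv_means = np.array([
    cross_val_score(
        DecisionTreeClassifier(ccp_alpha=float(a), random_state=0), X, y, cv=5
    ).mean()
    for a in alphas
])
best_alpha = float(alphas[cv_means.argmax()])

full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(f"best alpha={best_alpha:.5f}  leaves: {full.get_n_leaves()} -> {pruned.get_n_leaves()}")
```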
Stable plateaus beat spiky maxima
Run the grid heatmap. A single bright cell next to dark neighbours is suspicious — probably CV noise. Pick a depth/leaf in a wide green plateau instead.
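One way to make "prefer the plateau" mechanical is to rank each cell by the average score of its neighbourhood rather than by its own score. The grid values below are invented, and 3x3 neighbourhood averaging is an assumed heuristic, not necessarily what the lab's recommendation uses.

```python
# Hypothetical CV-mean grid: rows = max_depth values, columns = min_samples_leaf values.
depths = [1, 3, 5, 7, 9]
min_leaf = [1, 2, 4, 8, 16]
grid = [
    [0.70, 0.72, 0.73, 0.72, 0.93],  # 0.93 is an isolated spike
    [0.72, 0.86, 0.87, 0.86, 0.74],
    [0.73, 0.87, 0.88, 0.87, 0.74],
    [0.72, 0.86, 0.87, 0.85, 0.73],
    [0.70, 0.72, 0.73, 0.72, 0.70],
]

def neighbourhood_mean(grid, r, c):
    """Average of the cell and its in-grid neighbours (up to 3x3)."""
    cells = [grid[i][j]
             for i in range(max(0, r - 1), min(len(grid), r + 2))
             for j in range(max(0, c - 1), min(len(grid[0]), c + 2))]
    return sum(cells) / len(cells)

def argmax_cell(grid, score):
    """Return the (row, col) that maximizes score(row, col)."""
    cells = [(r, c) for r in range(len(grid)) for c in range(len(grid[0]))]
    return max(cells, key=lambda rc: score(*rc))

raw_best = argmax_cell(grid, lambda r, c: grid[r][c])
smooth_best = argmax_cell(grid, lambda r, c: neighbourhood_mean(grid, r, c))

for name, (r, c) in (("raw", raw_best), ("smoothed", smooth_best)):
    print(f"{name}: max_depth={depths[r]} min_samples_leaf={min_leaf[c]} cv={grid[r][c]}")
```

The raw maximum lands on the spike, while the smoothed maximum lands in the middle of the plateau, where a small change in either hyperparameter barely moves the score.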
Cost-sensitive screening
Set metric=Cost-based, push cost FN to 10. The recommended tree shifts to a higher-recall configuration. This is how you tune for screening problems (missed-fraud cost ≫ false alarm cost).
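A cost-based metric can be sketched as an average misclassification cost. The function name, the default costs, and the two prediction vectors are illustrative assumptions; the lab's exact formula may differ.

```python
def expected_cost(y_true, y_pred, cost_fn=10.0, cost_fp=1.0):
    """Average cost per sample; label 1 is the positive class we want to catch."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return (cost_fn * fn + cost_fp * fp) / len(y_true)

y_true     = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
cautious   = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # 2 missed positives, 0 false alarms
aggressive = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # 0 missed positives, 3 false alarms

for name, pred in (("cautious", cautious), ("aggressive", aggressive)):
    print(name,
          "equal costs:", expected_cost(y_true, pred, cost_fn=1.0),
          "FN=10:", expected_cost(y_true, pred, cost_fn=10.0))
```

With equal costs the cautious model looks better; once a miss costs ten times a false alarm, the higher-recall model wins, which is the shift the exercise describes.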
Forward-chaining for time series
Switch strategy to Forward-Chaining. If your data drifts, random K-fold gives optimistically inflated scores. Forward splits respect time order — closer to production.
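Forward chaining can be sketched as an expanding training window over time-sorted samples. The block sizing below is one assumed convention; scikit-learn's `TimeSeriesSplit` implements a similar scheme with its own handling of leftover samples.

```python
def forward_chaining_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs with an expanding training window.

    Samples are assumed to be sorted by time, so every test block comes
    strictly after all of its training samples.
    """
    block = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = i * block
        # The last test block absorbs any leftover samples.
        test_end = n_samples if i == n_splits else train_end + block
        yield list(range(train_end)), list(range(train_end, test_end))

splits = list(forward_chaining_splits(n_samples=12, n_splits=3))
for train, test in splits:
    print(f"train={train[0]}..{train[-1]}  test={test[0]}..{test[-1]}")
```

Unlike shuffled K-fold, no split ever trains on samples that come after its test block.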
The final-test rule
Before tuning, set aside a 20% hold-out test set and leave it untouched while you iterate. After picking hyperparameters, evaluate on it once. If the score drops well below the CV mean, you tuned too aggressively.
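The final-test rule can be sketched with scikit-learn. The synthetic XOR data, the grid values, and the 10% label noise are assumptions for illustration. The key point is the order of operations: the test split is carved off first and scored exactly once at the end.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic XOR with label noise (assumed stand-in for the lab's dataset).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)
flip = rng.random(500) < 0.10
y[flip] = 1 - y[flip]

# 1. Set the hold-out aside before any tuning.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2. Tune with cross-validation on the development split only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 4, 6, 8], "min_samples_leaf": [1, 5, 10, 20]},
    cv=5,
)
search.fit(X_dev, y_dev)

# 3. Evaluate once on the untouched hold-out.
cv_mean = search.best_score_
test_score = search.score(X_test, y_test)
print(search.best_params_, f"cv={cv_mean:.3f}", f"test={test_score:.3f}")
```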
In one glance
Frequently asked
ccp_alpha controls how much of the tree gets pruned. Higher alpha → smaller tree → less variance, more bias. Sweep it on the validation curve to find the elbow where CV stops improving.
To reduce overfitting, lower max_depth, raise min_samples_leaf, or raise ccp_alpha. The validation curve shows the gap: when train keeps climbing but CV plateaus or drops, you’ve crossed into overfitting territory.