Classification Trees
- Gini impurity of a leaf node: $G = 1 - p_1^2 - p_0^2$, where $p_1$ and $p_0$ are the proportions of each class in the leaf
- Weighted Gini of a split: $G_{split} = w_L G_L + w_R G_R$, where $w_L$ and $w_R$ are the fractions of samples going to the left and right children
- Choose the split (feature and threshold) with the lowest weighted Gini
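The two formulas above can be sketched in plain Python (a minimal two-class version; function names are my own, not from the notes):

```python
def gini(labels):
    """Gini impurity of one node: 1 - p1^2 - p0^2 for binary 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n  # proportion of class 1
    p0 = 1 - p1           # proportion of class 0
    return 1 - p1**2 - p0**2

def split_gini(left, right):
    """Weighted Gini of a split: w_L * G_L + w_R * G_R."""
    n = len(left) + len(right)
    w_l, w_r = len(left) / n, len(right) / n
    return w_l * gini(left) + w_r * gini(right)

print(split_gini([1, 1], [0, 0]))  # perfectly pure split -> 0.0
print(split_gini([1, 0], [1, 0]))  # maximally impure split -> 0.5
```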
Decision trees can overfit; limiting depth or pruning helps control this
How to split nodes?
- Yes/No: Simply split at yes/no
- Numeric data: try the midpoints between consecutive sorted values, OR a set of percentiles
- Ranked (ordinal) data: try thresholds ≤1, ≤2, ≤3, …
- Multiple choices (categorical): try all subsets of the categories
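A quick sketch of generating the candidate splits listed above for the numeric and categorical cases (helper names are my own):

```python
from itertools import combinations

def numeric_candidates(values):
    """Candidate thresholds: midpoints between consecutive distinct sorted values."""
    xs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(xs, xs[1:])]

def categorical_candidates(choices):
    """Candidate splits: every nontrivial subset of categories (one side of the split)."""
    cs = sorted(choices)
    subsets = []
    for r in range(1, len(cs)):  # exclude empty set and full set
        subsets.extend(combinations(cs, r))
    return subsets

print(numeric_candidates([3, 1, 2, 2]))              # [1.5, 2.5]
print(categorical_candidates({"red", "green", "blue"}))  # 6 subsets
```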
Feature selection
- Only split if the impurity reduction is above a threshold; otherwise make the node a leaf
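This stopping rule can be sketched as follows (the threshold value here is hypothetical; in scikit-learn the analogous parameter is `min_impurity_decrease`):

```python
def gini(labels):
    """Gini impurity of one node for binary 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1 - p1**2 - (1 - p1)**2

def impurity_reduction(left, right):
    """Parent Gini minus weighted child Gini for a candidate split."""
    parent = left + right
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

MIN_GAIN = 0.05  # hypothetical threshold, tune per problem

def should_split(left, right):
    return impurity_reduction(left, right) >= MIN_GAIN

print(should_split([1, 1, 1], [0, 0, 1]))  # informative split -> True
print(should_split([1, 0, 1], [1, 0, 1]))  # uninformative split -> False
```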
Missing values in a feature
- Impute the most common value for that class (categorical), or the class mean/median (numeric)
- Find another column highly correlated with the feature and use a regression on that column to predict the missing value
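Both imputation strategies can be sketched with `None` marking missing entries (the least-squares fit below is a plain $y = ax + b$; all names are my own):

```python
import statistics

def impute_median(values):
    """Fill missing (None) numeric entries with the median of observed ones."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]

def impute_by_regression(target, correlated):
    """Fill missing target entries from a correlated column via y = a*x + b,
    fitted by least squares on the rows where the target is observed."""
    pairs = [(x, y) for x, y in zip(correlated, target) if y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    a = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x, _ in pairs)
    b = my - a * mx
    return [a * x + b if y is None else y for x, y in zip(correlated, target)]

print(impute_median([1.0, None, 3.0]))                          # [1.0, 2.0, 3.0]
print(impute_by_regression([2.0, None, 6.0], [1.0, 2.0, 3.0]))  # [2.0, 4.0, 6.0]
```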