Classification Trees
How to split nodes?
Two approaches: Gini impurity reduction, or entropy reduction (information gain)
Gini
- Gini impurity of a leaf node = $1-\sum_i p_i^2 = 1-(\text{prob of } 1)^2-(\text{prob of } 0)^2$ for a binary target
- Weighted Gini of a split = $w_L G_L + w_R G_R$, where $w_L$ and $w_R$ are the fractions of samples going to the left and right child
- Choose the feature (and threshold) whose split gives the lowest weighted Gini, as sketched below
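A minimal sketch of this computation (NumPy assumed; `gini` and `weighted_gini` are illustrative names, not library functions):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class probabilities."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(left_labels, right_labels):
    """Weighted Gini of a split: w_L * G_L + w_R * G_R."""
    n_left, n_right = len(left_labels), len(right_labels)
    n = n_left + n_right
    return (n_left / n) * gini(left_labels) + (n_right / n) * gini(right_labels)

# A candidate split of binary labels; a lower weighted Gini means a better split
print(weighted_gini(np.array([0, 0, 1]), np.array([1, 1, 1, 0])))
```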
Entropy reduction/Information Gain
$$
Entropy = -\sum_{i=1}^{n} p_i \log_2 p_i
$$
For a binary target, entropy ranges from 0 to 1 (in general, from 0 to $\log_2 n$ for $n$ classes), where:
- If all instances belong to the same class (perfectly pure), the entropy is 0
- If instances are split evenly between the two classes (maximally impure), the entropy is 1
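A quick check of these two boundary cases (NumPy assumed; `entropy` is an illustrative helper, not a library call):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)) over class probabilities."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([1, 1, 1, 1]))  # 0 bits: perfectly pure node
print(entropy([0, 0, 1, 1]))  # 1 bit: evenly split binary node
```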
Information Gain: the entropy before splitting minus the weighted average entropy of the child nodes after splitting
$$
\text{Information Gain}(X) = \text{Entropy}(Y) - \sum_{i=1}^{k} \frac{N_i}{N} \times \text{Entropy}(X_i)
$$
- Entropy(Y) is the entropy of the target variable before splitting
- k is the number of unique values or categories in feature X
- N is the total number of instances in the dataset
- $N_i$ is the number of instances in the dataset with value i for feature X
- Entropy($X_i$) is the entropy of the target within the subset of instances where feature X takes value i
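Putting the formula together, a sketch (NumPy assumed; function names are illustrative):

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, x):
    """Entropy(Y) minus the N_i/N-weighted entropy of y within each category of x."""
    gain = entropy(y)
    for value in np.unique(x):
        subset = y[x == value]
        gain -= (len(subset) / len(y)) * entropy(subset)
    return gain

y = np.array([1, 1, 1, 0, 0, 0])
x = np.array(["a", "a", "a", "b", "b", "b"])
print(information_gain(y, x))  # 1.0: this feature separates the classes perfectly
```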
<aside>
⚡
For a classification decision, a leaf node is assigned the majority class, i.e., the class that occurs most often among the training observations that reach that leaf
</aside>
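In code, this is just the mode of the labels reaching the leaf (NumPy assumed):

```python
import numpy as np

leaf_labels = np.array([1, 0, 1, 1, 0])  # training labels arriving at this leaf
values, counts = np.unique(leaf_labels, return_counts=True)
print(values[np.argmax(counts)])  # 1: the majority class becomes the prediction
```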
How to split features?
- Yes/No: Simply split at yes/no
- Numeric data: try the midpoints between consecutive sorted values, or a set of candidate percentiles (see the sketch after this list)
- Ranked (ordinal) data: threshold on the rank, e.g. ≤1, ≤2, ≤3
- Multiple (categorical) choices: try every way of grouping the categories into two subsets
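A sketch of the numeric case: scan the midpoints between consecutive sorted values and keep the threshold with the lowest weighted Gini (NumPy assumed; `best_numeric_split` is an illustrative name):

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_numeric_split(x, y):
    """Return (threshold, weighted Gini) for the best midpoint split of x."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_threshold, best_score = None, np.inf
    for i in range(len(x) - 1):
        if x[i] == x[i + 1]:
            continue  # identical values admit no threshold between them
        t = (x[i] + x[i + 1]) / 2  # midpoint between consecutive values
        left, right = y[x <= t], y[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_threshold, best_score = t, score
    return best_threshold, best_score

x = np.array([2.0, 3.0, 10.0, 11.0])
y = np.array([0, 0, 1, 1])
print(best_numeric_split(x, y))  # best threshold 6.5 with weighted Gini 0.0
```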
Feature selection
Only split if the impurity reduction (or information gain) exceeds a threshold; otherwise stop and make the node a leaf (pre-pruning)
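In scikit-learn this idea is exposed as the `min_impurity_decrease` parameter of `DecisionTreeClassifier` (the 0.01 below is an arbitrary illustrative threshold):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1.0], [2.0], [10.0], [11.0]])
y = np.array([0, 0, 1, 1])

# A node is split only if the split reduces weighted impurity by at least 0.01
clf = DecisionTreeClassifier(criterion="gini", min_impurity_decrease=0.01).fit(X, y)
print(clf.tree_.node_count)  # 3: one root split, two pure leaves
```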
Missing values in a feature
- Impute with the most common value among instances of the same class (mode for categorical features, mean/median for numeric ones)
- Find another column that is highly correlated with the feature and regress the feature on it to predict the missing value
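A sketch of both options (pandas and scikit-learn assumed; the column names are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "target": [0, 0, 0, 1, 1, 1],
    "feature": [1.0, 2.0, None, 10.0, None, 12.0],
    "correlated": [1.1, 2.2, 3.0, 10.5, 11.0, 12.3],
})

# Option 1: impute with the per-class median (use the mode for categoricals)
df["feature_v1"] = df["feature"].fillna(
    df.groupby("target")["feature"].transform("median")
)

# Option 2: regress the feature on a highly correlated column and
# predict the missing entries from that column
known = df[df["feature"].notna()]
model = LinearRegression().fit(known[["correlated"]], known["feature"])
missing = df["feature"].isna()
df.loc[missing, "feature"] = model.predict(df.loc[missing, ["correlated"]])
```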