Classification Trees

How to split nodes?

Two common criteria: Gini impurity or entropy reduction (information gain)

  1. Gini impurity

    $$ Gini = 1 - \sum_{i=1}^{n} p_i^2 $$

  2. Entropy reduction/Information Gain

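    $$ Entropy = -\sum_{i=1}^{n} p_i \log_2 p_i $$

    where $p_i$ is the proportion of observations in the node that belong to class $i$, and $n$ is the number of classes.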
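
A minimal sketch of both criteria, assuming the usual definition of Gini impurity (one minus the sum of squared class proportions) and treating information gain as the parent's impurity minus the size-weighted impurity of its children. The function names and the toy labels are illustrative, not taken from any library.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Proportion p_i of each class among the labels in a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: -sum(p_i * log2(p_i))."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def impurity_reduction(parent, children, impurity):
    """Parent impurity minus the size-weighted impurity of the child nodes.

    With impurity=entropy this is the information gain of the split.
    """
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


# Toy node: 5 "yes" and 5 "no", split into two mostly pure children.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"]
right = ["yes"] + ["no"] * 4

gini_drop = impurity_reduction(parent, [left, right], gini)
info_gain = impurity_reduction(parent, [left, right], entropy)
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, so a good split is one that makes either quantity drop sharply.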

<aside> ⚡

For a classification decision, a leaf node is assigned the majority class, i.e., the most frequent class among the training observations that reach that leaf.

</aside>
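
The majority-class rule can be sketched in a few lines. `leaf_prediction` is a hypothetical helper name, and how ties are broken is an assumption of this sketch (here, the class seen first wins).

```python
from collections import Counter


def leaf_prediction(labels_at_leaf):
    """Return the most frequent class among training labels reaching a leaf.

    Counter.most_common orders equal counts by first occurrence, so ties
    go to the class that appears first in the input (an arbitrary choice).
    """
    return Counter(labels_at_leaf).most_common(1)[0][0]


# Three training observations reach this leaf; "spam" is the majority.
prediction = leaf_prediction(["spam", "ham", "spam"])
```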

How to split features?

Feature selection

Split on a feature only if the resulting impurity reduction (or information gain) is above a threshold.
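
One way to picture the threshold rule: try every candidate split, keep the one with the largest impurity reduction, and reject it if the reduction does not clear the threshold. The sketch below assumes numeric features, binary splits of the form `value <= threshold`, and Gini impurity; `best_split` and `min_gain` are hypothetical names chosen for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels, min_gain=0.01):
    """Return (feature_index, threshold, gain) for the best split,
    or None when no split reduces impurity by more than min_gain."""
    n = len(labels)
    parent_impurity = gini(labels)
    best = None
    for j in range(len(rows[0])):
        # Every observed value of feature j is a candidate threshold.
        for t in sorted({row[j] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            if not left or not right:
                continue  # a split must send data to both children
            weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
            gain = parent_impurity - weighted
            if gain > min_gain and (best is None or gain > best[2]):
                best = (j, t, gain)
    return best


rows = [[1, 7], [2, 3], [3, 8], [4, 2]]
labels = ["a", "a", "b", "b"]
split = best_split(rows, labels)          # feature 0 separates the classes
no_split = best_split(rows, ["a"] * 4)    # pure node: nothing clears min_gain
```

When the function returns `None`, the node would simply stay a leaf.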

Missing values in a feature

  1. Fill in the most common value for that class (the mode), or the mean/median for numeric features
  2. Find another column that is highly correlated with the feature, and use a regression on that column to predict the missing value
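
A rough sketch of both strategies using only the standard library. It assumes missing entries are stored as `None`, uses a one-variable least-squares line for the regression, and skips the per-class grouping for brevity (to fill "for that class", the first helper could be applied to each class's rows separately). All function names are hypothetical.

```python
from statistics import mean, median, mode


def impute_simple(values, numeric=True):
    """Fill None with the median (numeric) or the mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]


def impute_by_regression(target, helper):
    """Fill None in `target` using a least-squares line fitted on `helper`.

    `helper` is assumed to be a fully observed, highly correlated column.
    """
    pairs = [(x, y) for x, y in zip(helper, target) if y is not None]
    x_bar = mean(x for x, _ in pairs)
    y_bar = mean(y for _, y in pairs)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in pairs)
             / sum((x - x_bar) ** 2 for x, _ in pairs))
    intercept = y_bar - slope * x_bar
    return [intercept + slope * x if y is None else y
            for x, y in zip(helper, target)]


filled_numeric = impute_simple([1, None, 3, 10])
filled_colour = impute_simple(["red", None, "red", "blue"], numeric=False)
filled_by_fit = impute_by_regression([2, 4, None, 8], [1, 2, 3, 4])
```

The simple fill is cheap but ignores relationships between columns; the regression fill uses them, at the cost of needing a helper column that is itself complete.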