General Concepts

Regression

  1. Initial prediction: avg. of all target values

  2. Build trees iteratively to predict the residuals (actual - predicted) of the current model's predictions

    1. The residual is the negative gradient of the loss function w.r.t. the prediction. This is what puts the "gradient" in gradient boosting: the algorithm performs gradient descent in function space

    2. Remember the gradient descent update for regression; we are doing something similar here:

      New Prediction = Old Prediction - Alpha x Gradient w.r.t. prediction

  3. Combine the outputs of the trees, scaling each tree's contribution by the learning rate (see the sketch after this list)
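A minimal sketch of this loop, assuming scikit-learn's DecisionTreeRegressor as the base learner; names like n_trees and learning_rate are illustrative, not any library's API:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_regression(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    # Step 1: initial prediction is the average of the target values
    initial = np.mean(y)
    pred = np.full(len(y), initial)
    trees = []

    for _ in range(n_trees):
        # Step 2: residuals = negative gradient of the squared loss
        # w.r.t. the current prediction
        residuals = y - pred
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # Step 3: combine outputs, scaled by the learning rate
        pred += learning_rate * tree.predict(X)
        trees.append(tree)

    return initial, trees

def predict(X, initial, trees, learning_rate=0.1):
    pred = np.full(len(X), initial)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```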

$$ Loss = \sum \frac{1}{2}(Actual - Predicted)^2 $$

The loss function is minimized when the initial value is set to the average of all observed values (see the derivation below)

This is also why the value in a leaf node is the average of the residuals that fall into that leaf
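A one-line check using the squared-error loss above: differentiate w.r.t. a constant prediction $p$ and set the derivative to zero.

$$ \frac{\partial}{\partial p} \sum \frac{1}{2}(Actual - p)^2 = -\sum (Actual - p) = 0 \implies p = \frac{\sum Actual}{n} $$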

Classification

$$ Odds = \frac{Count \ of \ Ones}{Count \ of \ Zeros} $$

$$ Prob = \frac{e^{\log(odds)}}{1+e^{\log(odds)}} $$
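For example, with 6 ones and 4 zeros: odds = 6/4 = 1.5, log(odds) ≈ 0.405, and prob = 1.5/2.5 = 0.6 (the formula simplifies to odds/(1 + odds)).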

  1. Initial prediction: log(odds) of the positive class
  2. Build trees to predict the residuals (actual - predicted probability). We fit the trees to the negative gradient of the log-loss, which works out to exactly this residual (see the sketch below)
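A simplified sketch under the same assumptions as the regression example (scikit-learn base learner, illustrative names). Note the simplification flagged in the comments: full implementations recompute each leaf value as sum(residuals) / sum(p * (1 - p)) over the samples in that leaf, rather than using the tree's raw output.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(log_odds):
    return 1.0 / (1.0 + np.exp(-log_odds))

def gradient_boost_classification(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    # Step 1: initial prediction is the log(odds) of the positive class
    p = np.mean(y)
    initial = np.log(p / (1 - p))
    log_odds = np.full(len(y), initial)
    trees = []

    for _ in range(n_trees):
        # Step 2: pseudo-residuals = negative gradient of log-loss
        # w.r.t. the log-odds = actual - predicted probability
        residuals = y - sigmoid(log_odds)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # Simplification: add the tree's raw output in log-odds space;
        # real implementations replace each leaf value with
        # sum(residuals) / sum(p * (1 - p)) over the samples in that leaf
        log_odds += learning_rate * tree.predict(X)
        trees.append(tree)

    return initial, trees
```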