Steps before starting any model training

  1. Understand data
    1. Look for missing values
    2. Look for outliers and check the range of values in each column
  2. Prepare features:
    1. Feature scaling
    2. One-hot encoding
    3. Feature transformations, including interaction terms (products of features)
  3. Check for feature correlation
  4. Fix imbalanced data
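
The checks above can be sketched on a toy table. This is a minimal illustration using only the Python standard library; the dataset, column names, and helper functions (`missing_counts`, `value_range`, `standardize`, `one_hot`, `pearson`) are hypothetical and chosen for the example, not taken from any particular library.

```python
import math
import statistics

# Toy dataset (hypothetical values); None marks a missing entry.
rows = [
    {"age": 25, "income": 40000.0, "city": "Paris"},
    {"age": 32, "income": None, "city": "Berlin"},
    {"age": 47, "income": 85000.0, "city": "Paris"},
    {"age": 51, "income": 990000.0, "city": "Madrid"},
]


def missing_counts(rows):
    """Count missing (None) values per column."""
    return {col: sum(r[col] is None for r in rows) for col in rows[0]}


def value_range(rows, col):
    """Return (min, max) of a numeric column, ignoring missing values."""
    values = [r[col] for r in rows if r[col] is not None]
    return min(values), max(values)


def standardize(values):
    """Feature scaling: subtract the mean, divide by the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def one_hot(values):
    """One-hot encode a categorical column (categories sorted for stable order)."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / norm


ages = [r["age"] for r in rows]
scaled_ages = standardize(ages)
categories, encoded = one_hot([r["city"] for r in rows])

# Interaction feature: product of two columns (rows with missing income skipped).
complete = [r for r in rows if r["income"] is not None]
age_x_income = [r["age"] * r["income"] for r in complete]

# Correlation check between two features on the complete rows.
corr = pearson([r["age"] for r in complete], [r["income"] for r in complete])
```

Here `value_range(rows, "income")` exposes the suspiciously large 990000.0 entry, which is the kind of value worth inspecting as a possible outlier before scaling.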

Q: Does class imbalance impact decision trees and other tree-based models?

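Yes. Split criteria such as Gini impurity and entropy are computed over all samples, so the majority class dominates the choice of splits. This biases the model toward the majority class and leads to poor classification of the minority class.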

Q: What are common ways to address class imbalance?

  1. Adjust Class Weights: Many ML algorithms support class weighting, which penalizes errors on the minority class more heavily to balance the impact of the classes
  2. Resampling:
    1. Oversampling Minority Class (e.g., SMOTE): Generates synthetic minority samples
    2. Undersampling Majority Class: Reduces the size of the majority class (randomly or stratified)
  3. Change Evaluation Metrics: Instead of accuracy, use precision, recall, F1, or ROC-AUC
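
The options above can be illustrated on a 95/5 label split. This is a minimal sketch using only the standard library; the helper names are hypothetical. The weight formula `n_samples / (n_classes * count)` is the common "balanced" heuristic (the same one scikit-learn documents for `class_weight="balanced"`), and the undersampling shown is the plain random variant.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c): rarer classes weigh more."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def random_undersample(labels, seed=0):
    """Return indices keeping an equal-sized random subset of every class."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        kept.extend(rng.sample(idx, n_min))
    return sorted(kept)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# 95 majority (0) and 5 minority (1) labels.
y_true = [0] * 95 + [1] * 5
weights = balanced_class_weights(y_true)  # minority class gets the larger weight

# A model that always predicts the majority class looks good on accuracy only.
y_pred = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

# Undersampling keeps all 5 minority samples and 5 random majority samples.
kept = random_undersample(y_true)
balanced_counts = Counter(y_true[i] for i in kept)
```

The always-majority predictor reaches high accuracy while its minority-class recall is zero, which is why accuracy alone is a poor guide on imbalanced data.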

Q: How does SMOTE work?

💡 The new synthetic point is not a duplicate but a linear interpolation between a real minority sample and one of its k nearest minority-class neighbors

🚫 Avoid SMOTE with k-NN and clustering algorithms, since the synthetic points become neighbors themselves and can distort distance-based calculations.
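
The interpolation step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference SMOTE implementation: the function name `smote_sample`, its parameters, and the toy points are hypothetical, and neighbors are found by brute-force Euclidean distance.

```python
import math
import random


def smote_sample(minority, k=3, n_new=5, seed=0):
    """Generate n_new synthetic points from a list of minority-class points.

    For each new point: pick a random minority sample, pick one of its k
    nearest minority neighbours, then interpolate linearly between the two.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical 2-D minority-class points.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (2.5, 2.2), (3.0, 1.8)]
new_points = smote_sample(minority, k=3, n_new=5, seed=0)
```

Because each synthetic point lies on the segment between two real minority samples, it always falls inside the region already spanned by the minority class.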

Q: How do you perform missing value treatment?