Steps before starting any model training
- Understand the data
  - Look for missing values
  - Look for outliers and the range of values in each column
- Prepare features:
  - Feature scaling
  - One-hot encoding
  - Feature transformations, including products of features (interaction terms)
- Check for feature correlation
- Fix imbalanced data
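The checks and preparation steps above can be sketched in pandas; this is a minimal illustration on a made-up DataFrame (the column names `age` and `city` are assumptions, not from the notes):

```python
import pandas as pd

# Hypothetical toy data with a numeric and a categorical column
df = pd.DataFrame({
    "age": [22, 35, None, 58, 41],
    "city": ["NY", "SF", "NY", None, "LA"],
})

# Understand the data: missing values and value ranges per column
print(df.isna().sum())        # missing-value counts per column
print(df["age"].describe())   # min/max expose outliers and range

# Prepare features: impute, scale numerics, one-hot encode categoricals
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna("missing")
# Standard scaling done by hand: subtract mean, divide by std
df["age_scaled"] = (df["age"] - df["age"].mean()) / df["age"].std()
df = pd.get_dummies(df, columns=["city"])  # one-hot encoding

# Check feature correlation among numeric columns
corr = df.corr(numeric_only=True)
```

In practice the scaling and encoding would live in a scikit-learn `Pipeline` so the same transforms apply to train and test data.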
Q: Does class imbalance impact decision trees and other tree-based models?
Yes, class imbalance can cause bias toward the majority class, leading to poor classification of the minority class.
Q: What are common ways to address class imbalance?
- Adjust Class Weights: Many ML algorithms support class weighting to balance the influence of each class
- Resampling:
- Oversampling Minority Class (e.g., SMOTE): Generates synthetic minority samples
- Undersampling Majority Class: Reduces the size of the majority class (randomly or stratified)
- Change Evaluation Metrics: Instead of accuracy, use precision, recall, F1, or ROC-AUC
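Two of the remedies above, class weighting and imbalance-aware metrics, can be sketched with scikit-learn on a toy 9:1 imbalanced dataset (the data and all parameter values here are illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

# Synthetic imbalanced data: 10% minority class, shifted so it is learnable
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.zeros(200, dtype=int)
y[:20] = 1
X[:20] += 2.0

# class_weight="balanced" reweights samples inversely to class frequency
clf = DecisionTreeClassifier(class_weight="balanced", max_depth=3,
                             random_state=0).fit(X, y)
pred = clf.predict(X)

# Report imbalance-aware metrics instead of plain accuracy
print("precision:", precision_score(y, pred))
print("recall:   ", recall_score(y, pred))
print("f1:       ", f1_score(y, pred))
```

Note that plain accuracy would look good here even for a model that always predicts the majority class (90%), which is exactly why precision/recall/F1 are preferred.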
Q: How does SMOTE work?
- SMOTE (Synthetic Minority Over-sampling Technique)
- Traditional oversampling methods duplicate minority class samples, which can cause overfitting
- SMOTE instead generates synthetic points using K-nearest neighbors (KNN):
🔹 Step 1: Select a random minority class sample
🔹 Step 2: Find its K-nearest neighbors (other minority samples)
🔹 Step 3: Randomly pick one of these neighbors
🔹 Step 4: Create a new synthetic sample along the line connecting them
💡 The new synthetic point is not a duplicate but a linear interpolation between two real minority samples
🚫 Use SMOTE cautiously with k-NN and clustering algorithms, since the synthetic neighbors it creates can distort distance-based calculations.
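The four steps above amount to a simple interpolation; here is a from-scratch NumPy sketch of SMOTE's core idea (real projects would use `imblearn.over_sampling.SMOTE` rather than this illustrative function):

```python
import numpy as np

def smote_sample(minority, k=3, rng=None):
    """Generate one synthetic point from a minority-class array of shape (n, d)."""
    if rng is None:
        rng = np.random.default_rng()
    i = rng.integers(len(minority))              # Step 1: random minority sample
    x = minority[i]
    dists = np.linalg.norm(minority - x, axis=1) # Step 2: distances to the rest
    neighbors = np.argsort(dists)[1:k + 1]       #   k nearest (skip x itself)
    j = rng.choice(neighbors)                    # Step 3: pick one neighbor
    lam = rng.random()                           # Step 4: interpolate along the
    return x + lam * (minority[j] - x)           #   segment between the two points

# Toy minority class: four points on the unit square
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_point = smote_sample(minority, k=2, rng=np.random.default_rng(42))
```

Because the synthetic point is a convex combination of two real minority samples, it always lies on the line segment between them, never outside the minority region defined by those two points.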
Q: How do you perform missing value treatment?