-
Back-propagation and Chain Rule
- Partial gradient computations from one layer are reused when computing the gradient of the previous layer, which is what makes back-propagation efficient
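
As a concrete illustration of this gradient reuse, here is a minimal NumPy sketch (not part of the original notes) of a two-layer network: the upstream gradient `d_yhat` computed for the output layer is reused to obtain the hidden layer's gradient. Layer sizes, the MSE loss, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # batch of 4 inputs, 3 features
y = rng.normal(size=(4, 1))          # targets
W1 = rng.normal(size=(3, 5)) * 0.1   # layer 1 weights
W2 = rng.normal(size=(5, 1)) * 0.1   # layer 2 weights

# Forward pass
h_pre = x @ W1                       # hidden pre-activation
h = np.maximum(h_pre, 0)             # ReLU
y_hat = h @ W2                       # linear output
loss = np.mean((y_hat - y) ** 2)     # MSE

# Backward pass: chain rule, reusing the upstream gradient
d_yhat = 2 * (y_hat - y) / y.shape[0]   # dLoss/dy_hat
dW2 = h.T @ d_yhat                      # gradient for layer 2
d_h = d_yhat @ W2.T                     # d_yhat is reused for the previous layer
d_hpre = d_h * (h_pre > 0)              # back through ReLU
dW1 = x.T @ d_hpre                      # gradient for layer 1
```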
-
Activation Functions
- Sigmoid: $\frac{1}{1+e^{-x}}$
- tanh: $\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$ (a scaled and shifted sigmoid; outputs are zero-centred in $(-1, 1)$)
- ReLU: $\max(x, 0)$
- Leaky ReLU: $\max(x, 0.01x)$
- ELU (Exponential Linear Unit): $x$ if $x > 0$, else $\alpha(e^{x} - 1)$
- PReLU (Parametric ReLU): $\max(x, \alpha x)$, where $\alpha$ is learned during training
- Swish (Self-Gated): $x \cdot \mathrm{sigmoid}(x)$
- Softplus: $\ln(1+e^{x})$
- Softmax: $\frac{e^{x_i}}{\sum_j e^{x_j}}$
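
A small NumPy sketch of the activations listed above; the function names and the default `alpha` values are illustrative choices, not taken from the notes, and `softmax` assumes a 1-D input vector.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(x, 0)

def leaky_relu(x, alpha=0.01):
    return np.maximum(x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def prelu(x, alpha):            # alpha is a learned parameter, passed in here
    return np.maximum(x, alpha * x)

def swish(x):
    return x * sigmoid(x)

def softplus(x):
    return np.log1p(np.exp(x))

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()
```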
-
Optimisers
- Gradient Descent
- Stochastic Gradient Descent
- Mini-batch SGD
- SGD with momentum
- Adagrad (Adaptive Gradient)
    - Scales each parameter's learning rate using the accumulated sum of its squared past gradients
- Adadelta and RMSProp
    - Replace Adagrad's growing sum with an exponentially decaying average of squared gradients, so the effective learning rate does not shrink towards zero
- Adam
    - Smooths gradients with momentum (first-moment estimate)
    - Adapts per-parameter learning rates as in RMSProp (second-moment estimate)
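
A NumPy sketch of two update rules from the list, SGD with momentum and Adam; the hyperparameter defaults shown (learning rate, betas, epsilon) are common choices, not prescribed by the notes.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    velocity = beta * velocity - lr * grad       # decaying accumulation of past gradients
    return w + velocity, velocity

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # first moment: momentum-style smoothing
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment: RMSProp-style scaling
    m_hat = m / (1 - beta1 ** t)                 # bias correction (t is the step count, from 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return w, m, v
```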
-
Loss Functions
- Regression
    - Mean Squared Error
    - Mean Absolute Error
    - Huber loss
- Classification
    - Binary Cross Entropy
    - Categorical (multi-class) Cross Entropy
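
A NumPy sketch of the listed losses; the Huber `delta` and the clipping `eps` used to avoid `log(0)` are illustrative values not taken from the notes.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    err = y_true - y_pred
    quad = 0.5 * err ** 2                        # quadratic near zero
    lin = delta * (np.abs(err) - 0.5 * delta)    # linear for large errors
    return np.mean(np.where(np.abs(err) <= delta, quad, lin))

def binary_cross_entropy(y_true, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)                 # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_true_onehot, p, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(p), axis=1))
```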
-
Weight Initialization Techniques
- Random Initialization
- He Initialization (works well with ReLU)
- Xavier Initialization (works well with sigmoid and tanh activations)
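
A NumPy sketch of the three schemes; the layer shape, the random generator, and the choice of the normal-distribution variants of He and Xavier are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128

# Random initialization: small values drawn from a fixed-scale normal
w_random = rng.normal(0, 0.01, size=(fan_in, fan_out))

# He initialization: variance 2 / fan_in, suited to ReLU-family activations
w_he = rng.normal(0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

# Xavier (Glorot) initialization: variance 2 / (fan_in + fan_out),
# suited to sigmoid and tanh activations
w_xavier = rng.normal(0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))
```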
-
Regularization Techniques
- L1 Regularization (Lasso)
- L2 Regularization (Ridge)
- Dropout
- Early Stopping
- Batch Normalization
- Data Augmentation
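
A NumPy sketch of two techniques from the list, an L2 weight penalty added to the data loss and inverted dropout applied only during training; `lam` and `keep_prob` are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_penalty(weights, lam=1e-4):
    # Add lam * sum of squared weights to the data loss
    return lam * sum(np.sum(w ** 2) for w in weights)

def dropout(h, keep_prob=0.8, training=True):
    if not training:
        return h                      # no dropout at inference time
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob       # inverted dropout: rescale so the expected value is unchanged
```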