- Bootstrapped dataset: Randomly select rows with replacement
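- Build each tree on the bootstrapped dataset, considering only a random subset of columns at each node
- Number of variables to consider per node: a common starting point for classification is the square root of the total number of variables
- Track out-of-bag error: evaluate each tree on the rows left out of its bootstrap sample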
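The sampling steps above can be sketched with the standard library alone. This is a minimal illustration, not a full random forest: the helper names (`bootstrap_indices`, `out_of_bag_indices`, `candidate_columns`) and the sizes (100 rows, 16 columns) are assumptions made for the example. It shows which rows a tree would train on, which rows stay out of bag, and how many columns would be considered at one split.

```python
import math
import random


def bootstrap_indices(n_rows, rng):
    """Sample n_rows row indices with replacement (the bootstrapped dataset)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def out_of_bag_indices(n_rows, in_bag):
    """Rows that were never drawn into the bootstrap sample."""
    drawn = set(in_bag)
    return [i for i in range(n_rows) if i not in drawn]


def candidate_columns(n_cols, rng):
    """Pick about sqrt(n_cols) distinct columns to consider at one node."""
    k = max(1, round(math.sqrt(n_cols)))
    return sorted(rng.sample(range(n_cols), k))


rng = random.Random(0)      # fixed seed so the sketch is repeatable
n_rows, n_cols = 100, 16    # hypothetical dataset shape

in_bag = bootstrap_indices(n_rows, rng)
oob = out_of_bag_indices(n_rows, in_bag)
cols = candidate_columns(n_cols, rng)

print(len(in_bag), len(oob), cols)
```

To get an out-of-bag error from this, each tree would predict only its own out-of-bag rows, and the per-row predictions would be aggregated across trees. In scikit-learn, the same ideas are exposed on `RandomForestClassifier` through the `bootstrap`, `max_features="sqrt"`, and `oob_score=True` parameters.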
Proximity matrix
- To compute the proximity matrix, each sample in the dataset is passed down every decision tree in the forest.
- For every pair of samples, the proximity between them is calculated from where the two samples end up in each tree.
- If two samples end up in the same leaf node of a decision tree, their proximity is increased by a certain amount (e.g., 1).
- The process is repeated for all decision trees in the forest, accumulating proximity values for each pair of samples. The totals are commonly divided by the number of trees so that every value falls between 0 and 1.
In the proximity matrix, higher values indicate greater similarity between pairs of samples. Samples that frequently land in the same leaf nodes across different decision trees have higher proximity values.
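The counting procedure described above can be sketched as follows. The input `leaves` is a hypothetical table in which `leaves[i][t]` is the leaf that sample `i` reaches in tree `t`; the values here are made up for illustration. Dividing by the number of trees is a common convention and an assumption of this sketch, not something the notes above require.

```python
def proximity_matrix(leaves):
    """Build a proximity matrix from per-tree leaf assignments.

    leaves[i][t] is the leaf id reached by sample i in tree t.
    Each shared leaf adds 1; totals are divided by the number of trees.
    """
    n_samples = len(leaves)
    n_trees = len(leaves[0])
    prox = [[0.0] * n_samples for _ in range(n_samples)]
    for t in range(n_trees):
        for i in range(n_samples):
            for j in range(n_samples):
                if leaves[i][t] == leaves[j][t]:
                    prox[i][j] += 1.0
    # Normalize so each entry is the fraction of trees where the pair shares a leaf.
    return [[value / n_trees for value in row] for row in prox]


# Hypothetical leaf ids for 4 samples (rows) across 3 trees (columns).
leaves = [
    [1, 4, 2],
    [1, 4, 3],
    [2, 4, 3],
    [2, 5, 7],
]

prox = proximity_matrix(leaves)
for row in prox:
    print([round(value, 2) for value in row])
```

With a fitted scikit-learn forest, a table of this shape can be obtained from `forest.apply(X)`, which returns the leaf index of every sample in every tree.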