Random Forests Explained in Detail
June 27, 2020
In my previous post, I talked about decision trees and explained in detail how they work. In this post, we will talk about the random forest, which is a collection of decision trees.
A random forest almost always performs better than a single decision tree, which is why it is one of the most popular machine learning algorithms. Random forests use a technique known as bagging, which is an ensemble method.
Before diving into random forests, let's spend some time understanding ensembles.
Ensembles
An ensemble is a group of things viewed as a whole rather than individually. In an ensemble, a collection of models is used to make predictions, rather than a single model.
One of the most popular algorithms in the family of ensemble models is the random forest, which is an ensemble built by combining a large number of decision trees.
In principle, ensembles can be made by combining all types of models. An ensemble can contain a logistic regression, a neural network, and a few decision trees.
There are a few points that one should keep in mind before creating an ensemble. These are:
- Choose models whose combination works better than any individual model.
- Choose individual models that are each reasonably good on their own.
To satisfy these two points, our ensemble should be diverse and acceptable. Let's understand these criteria in detail.
Diversity
Diversity ensures that the models serve complementary purposes, which means that the individual models make predictions independently of each other.
Diversity also ensures that even if some trees overfit, the other trees in the ensemble neutralize the effect. This independence among the trees gives the ensemble a lower variance than any single tree.
In other words, diversity means independence: the models are not correlated with (and are not influenced by) one another, so the predictions given by any two models are independent of each other.
As an analogy, consider a cricket team: it has bowlers, batsmen, a wicketkeeper and so on, each with a different skill set, and this variety is what gives the team its diversity.
Acceptability
Acceptability implies that each model is at least better than a random model. This is a fairly lenient criterion for a model to be accepted into the ensemble: it just has to be better than a random guesser.
In other words, acceptability means that a model is not making random guesses, whose probability of success is 0.50 in the case of binary prediction. Thus, we want models whose probability of success is greater than 50%.
Returning to the cricket analogy, every player in the team is a good player and is acceptable from a skill point of view.
We know how to create an ensemble, but how can we guarantee that the ensemble works better than the individual model?
- The ensemble makes decisions by taking the majority vote.
- Since the individual models in the ensemble are acceptable (i.e. each model has a probability of success greater than 50% in the case of binary prediction), the probability of the ensemble being wrong (i.e. the majority vote going wrong) is far lower than that of any individual model.
- The ensemble also compensates for the weaknesses of individual models. Consider a random forest in which some decision trees overfit: this is tolerable, because the chances that more than 50% of the models in the ensemble overfit in the same way are very small.
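To make this concrete, here is a small calculation (my own illustration, not from the original material) of the probability that the majority vote is wrong when each model is independent and correct with probability p > 0.5; the value p = 0.6 and the ensemble sizes are assumptions chosen for illustration.

```python
# Probability that the majority vote of s independent models is wrong,
# when each individual model is correct with probability p (> 0.5).
from math import comb

def majority_error(p: float, s: int) -> float:
    """P(more than half of s independent models are wrong)."""
    q = 1 - p  # probability that a single model is wrong
    return sum(comb(s, k) * q**k * (1 - q)**(s - k) for k in range(s // 2 + 1, s + 1))

p = 0.6  # assumed accuracy of each individual (acceptable) model
for s in (1, 5, 25, 101):
    print(f"{s:>3} models -> majority-vote error: {majority_error(p, s):.3f}")
```

With 101 such models, the majority-vote error is already far below the 0.4 error of any single model, which is exactly the effect the bullet points above describe.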
Bagging
Random forests use bagging, an ensemble method, to create an ensemble of decision trees. Bagging stands for Bootstrap Aggregation.
Bagging is a technique for choosing random samples of observations from a dataset. Each of these samples is then used to train one tree in the forest. Bagging is a general sampling technique and is not specific to random forests.
The random samples are chosen by bootstrapping, i.e. sampling with replacement. A bootstrap sample typically contains about 30-70% of the observations in the data set. Two sources of randomness ensure diversity in a random forest:
- the random choice of bootstrapped observations for each tree, and
- the random choice of attributes considered at each split of a tree.
NOTE: After the bootstrap sample is selected, tree-building starts, and a random subset of features is selected at each node in order to split it. This is what makes random forests even better than a simple bagging algorithm.
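As a rough illustration of these two sources of randomness, here is a minimal NumPy sketch (my own example; the dataset size and feature count are assumed) that draws one bootstrap sample and picks a random √M-sized feature subset for a single split.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 1000, 16   # assumed sizes for illustration

# 1) Bootstrap sample: draw n_samples row indices *with replacement*.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
unique_fraction = len(np.unique(bootstrap_idx)) / n_samples
print(f"Unique observations in the bootstrap sample: {unique_fraction:.0%}")  # roughly 63%

# 2) Random feature subset at a split: consider only about sqrt(M) features.
max_features = int(np.sqrt(n_features))
split_features = rng.choice(n_features, size=max_features, replace=False)
print(f"Features considered at this split: {split_features}")
```

The observations left out of the bootstrap sample are exactly the "out-of-bag" observations discussed next.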
Out-of-Bag (OOB) Error
In a random forest, there is no strict need to set aside separate training and test data.
Consider a random forest of three decision trees, where every tree is built on its own bootstrap sample. Since each tree sees only the observations in its bootstrap sample (roughly 30-70% of the data), the remaining observations are unused by that tree. Similarly, every other tree also leaves out a portion of the data. These left-out, or out-of-bag, observations act as test data: for each observation, we collect predictions only from the trees that did not see it during training and use them to measure accuracy.
The above steps are done for each observation in the training set. Once all the predictions for each observation are collected, the OOB error is calculated as the number of wrongly predicted observations as a proportion of the total number of observations.
It has been shown that using the OOB estimate is about as accurate as using a test data set of a size equal to the training set. Thus, the OOB error removes the need for set-aside test data; it is almost as good as the cross-validation error.
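If you use scikit-learn, the OOB estimate can be requested while training; the sketch below is illustrative only (the dataset and hyperparameter values are my assumptions, not from the post).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,      # compute the out-of-bag estimate while training
    bootstrap=True,      # default; OOB only makes sense with bootstrap samples
    random_state=0,
)
rf.fit(X, y)

# oob_score_ is the accuracy measured only on observations each tree never saw,
# so no separate test set was held out here.
print(f"OOB accuracy estimate: {rf.oob_score_:.3f}")
```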
Time Taken to Build an Ensemble
Building an ensemble requires the following steps:
- Take a random sample of observations, say j = 40% of the total.
- Build S trees by finding all the splits within each tree from a subsetted feature space. The time required is thus proportional to the number of trees S and the time required to build each tree.
The time required to build each tree is proportional to the average number of levels in a tree and the average time required to find the split at each level. To find the right split, only a subset of f features (roughly √M, where M is the total number of features) is considered.
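As a very rough back-of-the-envelope version of this proportionality (illustrative numbers only, not a benchmark of any real implementation):

```python
import math

# time ∝ (number of trees S) x (average tree depth) x (features tried per split ~ sqrt(M))
S = 100                   # number of trees (assumed)
depth = 10                # average number of levels per tree (assumed)
M = 64                    # total number of features (assumed)
f = round(math.sqrt(M))   # features considered at each split

print(f"Features tried per split: {f}")                           # 8
print(f"Relative build cost (arbitrary units): {S * depth * f}")   # 8000
```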
Hyperparameters
- max_features: There is an optimal value of max_features. At very low values, the component trees are too simple to learn anything useful, while at very high values the component trees become similar to each other (violating the 'diversity' criterion).
- n_estimators: If you plot n_estimators against the training and test accuracies, you will see that as you increase the value of n_estimators, both the training and test accuracies gradually increase. More importantly, the model does not overfit even as its complexity increases. This is an important benefit of random forests: you can increase the number of trees as much as you like without worrying about overfitting (if your computational resources allow it).
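A simple way to explore these two hyperparameters together is a cross-validated grid search. The sketch below uses scikit-learn; the dataset and grid values are assumptions chosen purely for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "n_estimators": [10, 50, 100, 300],
    "max_features": [2, 4, 8, "sqrt"],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    n_jobs=-1,   # evaluate the grid in parallel
)
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print(f"Best cross-validated accuracy: {grid.best_score_:.3f}")
```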
Advantages of Random Forest
- Diversity: It arises because you create each tree with a subset of the attributes/features/variables, i.e. you don’t consider all the attributes while making each tree. The choice of the attributes considered for each tree is random. This ensures that the trees are independent of each other.
- Stability: It arises because the answers given by a large number of trees average out. The random forest has a lower model variance than an ordinary individual tree.
- Immunity to the curse of Dimensionality: Since each tree does not consider all the features, the feature space (the number of features a model has to consider) reduces. This makes the algorithm immune to the curse of dimensionality. A large feature space causes computational and complexity issues.
- Parallelizability: You need a number of trees to make a forest. Since any two trees are built independently, on different data and different attributes, they can be built separately. This means you can make full use of your multi-core CPU to build a random forest: with 4 cores and 100 trees to build, each core can build 25 trees (see the sketch after this list).
- OOB (out-of-bag) error: There is no need to set aside separate test data; the out-of-bag observations provide a built-in estimate of the test error.
- Interpretability: Random forest models remain reasonably easy to interpret, for example through the variable importance scores they provide.
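The sketch below illustrates the parallelizability and interpretability points with scikit-learn; the dataset and settings are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
feature_names = load_breast_cancer().feature_names

# n_jobs=-1 builds the trees on all available CPU cores, since each tree
# is grown independently of the others.
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
rf.fit(X, y)

# feature_importances_ averages each feature's contribution across all trees.
top = sorted(zip(feature_names, rf.feature_importances_), key=lambda t: -t[1])[:5]
for name, importance in top:
    print(f"{name:>25}: {importance:.3f}")
```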
Summary
To generate a random forest of n decision trees, the following steps are followed:
- Create n bootstrap samples from the data.
- Train each tree on a different bootstrap sample.
- Each tree makes a prediction, and the final prediction is the majority vote over all these predictions.
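To tie the summary together, here is a minimal hand-rolled sketch of these three steps built on scikit-learn's DecisionTreeClassifier (an illustration of the idea only; in practice you would simply use RandomForestClassifier).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    n = len(X)
    for _ in range(n_trees):
        # Steps 1 & 2: draw a bootstrap sample and train one tree on it,
        # with a random subset of features considered at each split.
        idx = rng.integers(0, n, size=n)
        tree = DecisionTreeClassifier(
            max_features="sqrt",
            random_state=int(rng.integers(1_000_000)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def predict_forest(trees, X):
    # Step 3: each tree votes; the majority class is the forest's prediction.
    votes = np.stack([tree.predict(X) for tree in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)  # binary 0/1 labels assumed


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = fit_forest(X_train, y_train)
accuracy = (predict_forest(forest, X_test) == y_test).mean()
print(f"Hand-rolled forest accuracy: {accuracy:.3f}")
```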