This Month
Published: 29 September 2017

Points of Significance

Ensemble methods: bagging and random forests

Naomi Altman¹ &
Martin Krzywinski²

Nature Methods volume 14, pages 933–934 (2017)

Many heads are better than one.

Just as we might consult multiple experts about a problem and then combine their advice to come to a consensus decision, repeated statistical analyses on the same data can be combined to form a single result called an ensemble or consensus estimator. This is particularly useful when the outcome of the original analysis is sensitive to small changes in the sample. This month, we will discuss how bootstrap samples¹ can be used to create an ensemble to improve regression (prediction of a quantitative outcome) and classification (prediction of a categorical outcome)². This differs from our previous use of the bootstrap to assess the variability of an analysis.

Bagging is a common ensemble method that uses bootstrap sampling³. Random forest is an enhancement of bagging that can improve variable selection. We will start by explaining bagging and then discuss the enhancement leading to random forest.

We'll illustrate bagging by improving a regression tree fit⁴ of noisy data sampled from a parabola (Fig. 1). Because our sample is relatively small (n = 30), our regression tree prediction based on the entire sample is coarse (Fig. 1a). We begin bagging by generating bootstrap samples of size n by sampling n observations with replacement from our sample and then calculating a regression tree prediction for each bootstrap sample (Fig. 1b). Finally, we combine the individual bootstrap predictions into a consensus estimate, which can be done for regression by averaging the fitted values (Fig. 1c).
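The three steps just described (resample, fit, average) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the tiny tree learner `fit_tree`, its `depth` and `min_leaf` settings, the parabola (2x − 1)², the noise level and the seed are all assumptions made for the example, not the settings behind Figure 1.

```python
import random

def avg(values):
    values = list(values)
    return sum(values) / len(values)

def fit_tree(points, depth=2, min_leaf=3):
    """Fit a small regression tree to (x, y) pairs and return a predict function."""
    leaf_value = avg(y for _, y in points)
    if depth == 0 or len(points) < 2 * min_leaf:
        return lambda x: leaf_value
    pts = sorted(points)
    best = None  # (sum of squared errors, threshold, left points, right points)
    for i in range(min_leaf, len(pts) - min_leaf + 1):
        if pts[i - 1][0] == pts[i][0]:
            continue  # no threshold separates identical x values
        left, right = pts[:i], pts[i:]
        ml, mr = avg(y for _, y in left), avg(y for _, y in right)
        sse = sum((y - ml) ** 2 for _, y in left) + sum((y - mr) ** 2 for _, y in right)
        if best is None or sse < best[0]:
            best = (sse, (pts[i - 1][0] + pts[i][0]) / 2, left, right)
    if best is None:
        return lambda x: leaf_value
    _, threshold, left, right = best
    low = fit_tree(left, depth - 1, min_leaf)
    high = fit_tree(right, depth - 1, min_leaf)
    return lambda x: low(x) if x < threshold else high(x)

def bagged_fit(points, n_boot, rng):
    """Average the predictions of trees fit to n_boot bootstrap samples of size n."""
    n = len(points)
    trees = [fit_tree([rng.choice(points) for _ in range(n)]) for _ in range(n_boot)]
    return lambda x: avg(tree(x) for tree in trees)

rng = random.Random(1)  # fixed seed so the sketch is reproducible
xs = [i / 29 for i in range(30)]
# Assumed data: noisy parabola with n = 30 (noise level chosen arbitrarily)
sample = [(x, (2 * x - 1) ** 2 + rng.gauss(0, 0.2)) for x in xs]

single = fit_tree(sample)                        # one tree on the full sample
bagged = bagged_fit(sample, n_boot=25, rng=rng)  # consensus of 25 bootstrap trees
for x in (0.0, 0.5, 1.0):
    print(f"x={x:.1f}  single tree={single(x):.3f}  bagged={bagged(x):.3f}")
```

The single tree can take at most four distinct values at depth 2, whereas the bagged fit averages many step functions with different thresholds, which is what produces the smoother curve.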

**Figure 1: Bagging applied to regression using a regression tree.**

Our consensus regression fit in Figure 1c is smoother than the single fit based on the entire sample and reflects the shape of the parabola more closely. This suggests that if we increase the number of bootstraps we could obtain an even better fit—but how many should we use? Using more samples reduces the variance of the fit, but because many bootstrap samples are similar, at some point more bootstraps will merely increase computation time without improving the estimates.

In general, the optimal number of bootstrap samples depends on the problem. Let's look at how we can monitor the quality of our fit to choose the number of bootstraps. Let ŷ be our original predictor based on the entire sample and ŷ_B be the consensus bagged predictor. We can use these to calculate the mean square prediction error, MSE = Σ_i(y_i − ŷ_i)²/n, for both fits, which we'll call ε and ε_B.

It turns out that there is another useful error that we can calculate, the 'out-of-bag' (OOB) error. Because we sample with replacement, in any given bootstrap sample some observations are not selected (hollow points, Fig. 1b) while others are represented more than once. The points that are not selected form the OOB sample, which can be used as the validation sample for the fit⁵ to assess the regression accuracy for new observations not included in the training data. The OOB error, ε_OOB, is calculated analogously to ε_B, except that instead of ŷ_i we use ŷ_OOB,i, the prediction for observation i averaged over only those bootstrap samples in which that observation is OOB.
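One possible way to compute ε_B and ε_OOB side by side is sketched below. To keep it short, the base learner is a depth-1 regression tree (`fit_stump`), which is a stand-in and not the tree used in the figures; the data, seed and function names are likewise assumptions for illustration.

```python
import random

def fit_stump(points):
    """Depth-1 regression tree: one threshold, mean of y on each side."""
    pts = sorted(points)
    overall = sum(y for _, y in pts) / len(pts)
    best = (float("inf"), None, overall, overall)
    for i in range(1, len(pts)):
        if pts[i - 1][0] == pts[i][0]:
            continue  # cannot split between identical x values
        ly = [y for _, y in pts[:i]]
        ry = [y for _, y in pts[i:]]
        ml, mr = sum(ly) / len(ly), sum(ry) / len(ry)
        sse = sum((y - ml) ** 2 for y in ly) + sum((y - mr) ** 2 for y in ry)
        if sse < best[0]:
            best = (sse, (pts[i - 1][0] + pts[i][0]) / 2, ml, mr)
    _, threshold, ml, mr = best
    if threshold is None:
        return lambda x: overall
    return lambda x: ml if x < threshold else mr

def bag_with_oob(points, n_boot, rng):
    """Return (bagged MSE, OOB MSE) after n_boot bootstrap fits."""
    n = len(points)
    all_preds = [[] for _ in range(n)]  # predictions from every bootstrap fit
    oob_preds = [[] for _ in range(n)]  # only from fits that did not see point i
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n indices with replacement
        model = fit_stump([points[i] for i in idx])
        in_bag = set(idx)
        for i, (x, _) in enumerate(points):
            pred = model(x)
            all_preds[i].append(pred)
            if i not in in_bag:
                oob_preds[i].append(pred)

    def mse(preds):
        # Points with no predictions yet (never OOB so far) are skipped
        used = [(y, sum(p) / len(p)) for (_, y), p in zip(points, preds) if p]
        return sum((y - yhat) ** 2 for y, yhat in used) / len(used)

    return mse(all_preds), mse(oob_preds)

rng = random.Random(7)
# Assumed data: noisy parabola with n = 30
sample = [(i / 29, (2 * i / 29 - 1) ** 2 + rng.gauss(0, 0.2)) for i in range(30)]
eps_b, eps_oob = bag_with_oob(sample, n_boot=50, rng=rng)
print(f"bagged MSE = {eps_b:.3f}, OOB MSE = {eps_oob:.3f}")
```

Each observation contributes to ε_OOB only through fits that never saw it, which is why this error behaves like a validation error even though no data are held out.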

To assess the bagging process, we periodically compute ε_OOB and continue to create new bootstrap samples until the error stabilizes. Let's perform more bootstraps on the sample in Figure 1 and see how the errors decrease. The regression tree fit based on the full sample gives an MSE of ε = 0.067. If we run ten bootstraps (Fig. 2a), the error drops to ε_B = 0.048 with an OOB error of ε_OOB = 0.077. In Figure 2b we compare the ensemble regressions for single runs of 10, 25 and 50 bootstraps. There does not appear to be much difference between the fits that use 25 and 50 bootstraps, which we can verify by looking at the profile of ε_B and ε_OOB as a function of the number of bootstraps (Fig. 2c). We can see that after about 25 bootstraps, both errors remain relatively constant. Using ε_OOB gives us a better indication of when to stop, since ε_B appears to stabilize too early (at about 15 bootstraps).
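The text does not fix a formal rule for deciding that the error has stabilized, so the check below is only one hypothetical choice: stop when the OOB error has moved by less than a tolerance over the last few evaluations. The function name, `window` and `tol` are assumptions and would need tuning for a real problem.

```python
def has_stabilized(errors, window=5, tol=0.002):
    """True if the last window + 1 recorded errors span less than tol.

    errors: OOB errors recorded after successive batches of bootstraps.
    """
    if len(errors) < window + 1:
        return False  # too few evaluations to judge
    recent = errors[-(window + 1):]
    return max(recent) - min(recent) < tol
```

In a bagging loop, one would append ε_OOB to a list every few bootstraps and stop adding samples once this check returns True.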

**Figure 2: Ensemble regressions improve in quality, up to a point, as the number of bootstraps is increased.**

Because bagged regressions are averages, they usually have smaller variance than ŷ. But because ŷ_OOB is based on only about 37% of the bootstrap samples (the expected fraction of the sample that is OOB), it is more variable than ŷ_B. This is reflected in the smaller value of ε_B compared with ε_OOB (Fig. 2c). However, as an estimator of the true prediction error for new samples, ε_B is too small because it is based on the overfitted training sample. While ε_OOB tends to be a bit larger than the true prediction error based on new samples, this conservative bias is usually small. Using ε_OOB to assess the fit allows us to use all the data to develop our regression, rather than requiring a hold-out test sample, and hence provides a better fit in general.
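The 37% figure follows from the sampling scheme: a given observation is missed by one draw with probability 1 − 1/n, so it is missed by all n draws with probability (1 − 1/n)ⁿ, which approaches 1/e ≈ 0.368 as n grows. A quick numerical check (the function name is ours, chosen for illustration):

```python
import math

def expected_oob_fraction(n):
    """Probability that a given observation is absent from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n

for n in (10, 30, 100, 1000):
    print(n, round(expected_oob_fraction(n), 3))
print("limit 1/e:", round(math.exp(-1), 3))
```

For the n = 30 sample used in the regression example, the expected OOB fraction is already close to the limiting value.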

Simulations have shown that bagging performs best for algorithms that are highly sensitive to small changes in the data⁶. This sensitivity means that the fitted values ŷ will be highly variable from sample to sample without aggregation. When the algorithm is very stable—for example, in linear regression with no influential points—the ŷ_B may actually be more variable than ŷ.

Bagging can easily be applied to classification problems. Instead of using the average regression as a consensus, now a consensus classification is formed by 'voting', where the observation is classified into the class most frequently chosen.
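The voting step can be sketched as a simple majority count over the class labels predicted by the individual bootstrap classifiers. The labels and the tie-breaking behavior (ties go to the label encountered first) are assumptions of this sketch; the text does not specify how ties are resolved.

```python
from collections import Counter

def vote(predictions):
    """Return the most frequently predicted class label.

    Ties are broken in favor of the label that appears first in predictions.
    """
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions for one observation from five bootstrap classifiers
print(vote(["green", "blue", "green", "red", "green"]))
```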

In Figure 3 we show bagging applied to the two-dimensional classification example we discussed in our previous column⁴. This example uses two predictors (x and y position) and a categorical outcome with four levels. The classification outcome is sensitive to outliers—green outliers in the top left quadrant cause the green class boundary to extend across the full width of the square (Fig. 3a). This issue is mitigated when creating bootstrap samples, since the outliers may be left out, which causes the green class boundaries to be more confined to the upper right (Fig. 3b). As we increase the number of bootstrap iterations, the boundaries become smoother and less likely to overfit (Fig. 3c). As before, we can monitor the bagging and OOB error (Fig. 3d) to guide us about the number of bootstrap iterations to perform. The original predictor misclassification rate was 29%, which dropped to 26% at 50 bootstrap iterations with an OOB error of about 40%.

**Figure 3: Application of bagging to classification using a decision tree applied to n = 100 two-dimensional data points assigned to one of four color categories.**

Regression or classification fits generated from different bootstrap samples are correlated because of the observations that have been selected in both samples. The higher the correlation, the more similar the fit from each bootstrap and the smaller the mitigating effect of the consensus in reducing variance. For variable selection problems, strongly predictive variables that are selected in most bootstrap samples induce a strong correlation among the fits, reducing the utility of bagging.
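The effect of correlation can be made explicit with a standard variance identity. As a simplifying assumption (the symbols are not from the text above), suppose the B bootstrap fits at a given point each have variance σ² and every pair has correlation ρ. Then the variance of their average is:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat{y}^{(b)}\right)
  = \frac{1}{B^{2}}\left[B\,\sigma^{2} + B(B-1)\,\rho\,\sigma^{2}\right]
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

The second term shrinks as B grows, but the first does not, so under these assumptions the correlation between fits sets a floor on how much variance averaging can remove.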

To limit the impact of such variables, a simple but clever modification of CART bagging is used: a random forest⁷. In this approach, at each node of the tree, a subset of m of the p variables in the data is selected at random, and only these m variables are considered for the partition at the node. This random selection of variables reduces the similarity of trees grown from different bootstrap samples; even two trees grown from the same bootstrap sample will likely differ. Once a sufficiently large forest of trees has been grown, the results are bagged in the usual way.

There will be a value of m that optimizes the variance reduction relative to the computational cost. This can be estimated using the OOB error as a function of m. Random forests are quite robust with respect to m, and rules of thumb such as using m = p/3 for regression and m = √p for classification are sometimes used⁷.
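The per-node variable selection and the rules of thumb for m can be sketched as follows. The function names are hypothetical, and rounding down (with a minimum of 1) is an assumption of this sketch, since the rules of thumb do not say how to round.

```python
import math
import random

def default_m(p, task):
    """Rule-of-thumb number of candidate variables per node (rounded down, at least 1)."""
    if task == "regression":
        return max(1, p // 3)
    if task == "classification":
        return max(1, math.isqrt(p))
    raise ValueError("task must be 'regression' or 'classification'")

def candidate_variables(p, m, rng):
    """Indices of the m variables (out of p) considered for the split at one node."""
    return rng.sample(range(p), m)

rng = random.Random(0)
p = 12
m = default_m(p, "regression")
# A fresh random subset is drawn at every node, so successive calls usually differ
print(m, candidate_variables(p, m, rng), candidate_variables(p, m, rng))
```

A tree-growing routine would call `candidate_variables` once per node and search for the best split among those variables only.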

Ensemble methods like bagging and random forest are practical for mitigating both underfitting and overfitting, as we've seen with our regression and classification examples. The use of the OOB sample with each bootstrap is conceptually equivalent to using a test set for out-of-sample assessment but provides a means to use the entire sample to both estimate and assess the fit.

References

1. Kulesa, A., Krzywinski, M., Blainey, P. & Altman, N. Nat. Methods 12, 477–478 (2015).
2. Lever, J., Krzywinski, M. & Altman, N. Nat. Methods 13, 541–542 (2016).
3. Breiman, L. Mach. Learn. 24, 123–140 (1996).
4. Krzywinski, M. & Altman, N. Nat. Methods 14, 757–758 (2017).
5. Lever, J., Krzywinski, M. & Altman, N. Nat. Methods 13, 703–704 (2016).
6. Liang, G., Zhu, X. & Zhang, C. in Proc. 25th AAAI Conference on Artificial Intelligence (eds. Wang, D. & Reynolds, M.) 1802–1803 (Springer, 2011).
7. Breiman, L. Mach. Learn. 45, 5–32 (2001).

Author information

Authors and Affiliations

Naomi Altman is a Professor of Statistics at The Pennsylvania State University.
Martin Krzywinski is a staff scientist at Canada's Michael Smith Genome Sciences Centre.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

About this article

Cite this article

Altman, N., Krzywinski, M. Ensemble methods: bagging and random forests. Nat Methods 14, 933–934 (2017). https://doi.org/10.1038/nmeth.4438

Published: 29 September 2017
Issue Date: 01 October 2017
DOI: https://doi.org/10.1038/nmeth.4438
