
Conservation machine learning: a case study of random forests


Conservation machine learning conserves models across runs, users, and experiments—and puts them to good use. We have previously shown the merit of this idea through a small-scale preliminary experiment, involving a single dataset source, 10 datasets, and a single so-called cultivation method—used to produce the final ensemble. In this paper, focusing on classification tasks, we perform extensive experimentation with conservation random forests, involving 5 cultivation methods (including a novel one introduced herein—lexigarden), 6 dataset sources, and 31 datasets. We show that significant improvement can be attained by making use of models we are already in possession of anyway, and envisage the possibility of repositories of models (not merely datasets, solutions, or code), which could be made available to everyone, thus having conservation live up to its name, furthering the cause of data and computational science.


We recently presented the idea of conservation machine learning, wherein machine learning (ML) models are saved across multiple runs, users, and experiments1. Conservation ML is essentially an “add-on” meta-algorithm, which can be applied to any collection of models (or even sub-models), however they were obtained: via ensemble or non-ensemble methods, collected over multiple runs, gathered from different modelers, a priori intended to be used in conjunction with others—or simply plucked a posteriori, and so forth.

A random forest (RF) is an oft-used ensemble technique that employs a forest of decision-tree classifiers on various sub-samples of the dataset, with random subsets of the features for node splits. It uses majority voting (for classification problems) or averaging (for regression problems) to improve predictive accuracy and control over-fitting2.
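As a concrete baseline, a minimal scikit-learn sketch of the RF classifier used throughout (the dataset and parameter choices here are illustrative assumptions, not the paper's exact setup); note that scikit-learn's implementation averages the trees' predicted class probabilities rather than taking hard majority votes:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (parameters are illustrative).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A forest of 100 decision trees, each fit on a bootstrap sample, with a
# random subset of features considered at every node split.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
accuracy = rf.score(X_test, y_test)  # held-out accuracy
```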

Reference3 presented a method for constructing ensembles from libraries of thousands of models. They used a simple hill-climbing procedure to build the final ensemble, and successfully tested their method on 7 problems. They further examined three alternatives to their selection procedure to reduce overfitting. Pooling algorithms, such as stacked generalization4, and super learner5, have also proven successful. There also exists a body of knowledge regarding ensemble pruning6.

We believe the novelty of conservation machine learning, herein applied to random forests, is two-fold. First and foremost, we envisage the possibility of vast repositories of models (not merely datasets, solutions, or code). Consider the common case wherein several research groups have been tackling an extremely hard problem (e.g.,7), each group running variegated ML algorithms over several months (maybe years). It would not be hard to imagine that the number of models produced over time would run into the millions (quite easily more). Most of these models would be discarded unflinchingly, with only a minute handful retained, and possibly reported upon in the literature. We advocate making all models available to everyone, thus having conservation live up to its name, furthering the cause of data and computational science. Our second contribution in this paper is the introduction of a new ensemble cultivation method—lexigarden.

In1 we offered a discussion and a preliminary proof-of-concept of conservation ML, involving a single dataset source, 10 datasets, and a single so-called cultivation method. Herein, focusing on classification tasks, we perform extensive experimentation involving 5 cultivation methods, including the newly introduced lexigarden (“Ensemble cultivation”), 6 dataset sources, and 31 datasets (“Datasets”). Upon describing the setup (“Experimental setup”), we show promising results (“Results”), followed by a discussion (“Discussion”) and concluding remarks (“Concluding remarks”).

Ensemble cultivation

Conservation ML begins with amassing a collection of models—through whatever means. Herein, we will collect models saved over multiple runs of RF training.

Once in possession of a collection of fitted models it is time to produce a final ensemble. We examine five ways of doing so:

  1. Jungle. Use all collected fitted models to form class predictions through majority voting, where each model votes for a single class.

  2. Super-ensemble. Use an ensemble of ensembles, namely an ensemble of RFs, with prediction done through majority voting, where each RF votes for a single class.

     To clarify, assume we perform 100 runs of RFs of size 100. We are then in possession of a jungle of 10,000 decision trees and a super-ensemble of 100 RFs. Both perform predictions through majority voting.

  3. Order-based (or ranking-based) pruning. We implemented two methods of ensemble pruning6. The first, ranking-based, sorts the jungle from best model (over the training set) to worst, and then selects the top n models for the ensemble, with n being a specified parameter.

  4. Clustering-based pruning. The second ensemble-pruning method performs k-means clustering over all model output vectors, with a given number of clusters, k, and then produces an ensemble by collecting the top-scoring (over the training set) model of each cluster.

  5. Lexigarden. A new method, introduced herein and described below, based on lexicase selection, a performant selection technique for evolutionary algorithms, with selection being one of the primary steps in such algorithms8,9. This step mimics the role of natural selection in nature, by probabilistically selecting fitter individuals from the evolving population, which will then undergo pseudo-genetic modification operators.
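The first four methods above might be sketched as follows (the function names, and the choice of training accuracy as the ranking score, are our own illustrative assumptions, not the authors' code):

```python
import numpy as np
from sklearn.cluster import KMeans

def jungle_predict(models, X):
    """Hard majority vote over all models, each voting for a single class."""
    votes = np.array([m.predict(X) for m in models]).astype(int)  # (n_models, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])

def super_ensemble_predict(forests, X):
    """Majority vote where each whole forest casts a single vote."""
    votes = np.array([rf.predict(X) for rf in forests]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])

def ranking_prune(models, X_train, y_train, n):
    """Order-based pruning: keep the n models with highest training accuracy."""
    order = np.argsort([m.score(X_train, y_train) for m in models])[::-1]
    return [models[i] for i in order[:n]]

def clustering_prune(models, X_train, y_train, k):
    """Clustering-based pruning: k-means over model output vectors, keeping
    the top-scoring (on the training set) model of each cluster."""
    outputs = np.array([m.predict(X_train) for m in models])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(outputs)
    chosen = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:  # sklearn's k-means avoids empty clusters, but be safe
            chosen.append(max(members, key=lambda i: models[i].score(X_train, y_train)))
    return [models[i] for i in chosen]
```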

Lexicase selection selects individuals by filtering a pool of individuals which, before filtering, typically contains the entire population. The filtering is accomplished in steps, each of which filters according to performance on a single test case. Lexicase selection has been used productively within the field of evolutionary algorithms10,11. Herein, we co-opt it to cultivate a “garden” of select trees from the jungle, introducing the lexigarden algorithm. Lexigarden generates a garden of a specified size, whose models were selected through lexicase selection.

Algorithm 1 provides the pseudocode. Lexigarden receives a jungle of models, a dataset along with target values, and the number of models the generated garden is to contain. The lexicase function begins by randomly shuffling the dataset, after which it successively iterates through it, retaining only the models that provide a correct answer for each instance. In the end we are left with either a single model or a small number of models, which are precisely those that have correctly classified the subset of the dataset engendered through the looping process. The lexicase function is called as many times as needed to fill the garden with models. Lexigarden ends up generating a highly diverse subset of all the models by picking ones that each excels on a particular random subset of instances.
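The algorithm might be rendered roughly as follows in Python; this is our own sketch of the description above, not the authors' implementation:

```python
import random

def lexicase_select(models, X, y, rng):
    """One lexicase pass: shuffle the instances, then repeatedly keep only
    the models that classify the current instance correctly."""
    preds = [m.predict(X) for m in models]        # cache each model's outputs
    keep = list(range(len(models)))
    for i in rng.sample(range(len(y)), len(y)):   # random case order
        survivors = [j for j in keep if preds[j][i] == y[i]]
        if not survivors:                         # no model gets this case right
            break
        keep = survivors
        if len(keep) == 1:                        # filtered down to one model
            break
    return models[rng.choice(keep)]

def lexigarden(jungle, X, y, garden_size, seed=0):
    """Fill a garden of the given size by repeated lexicase selection."""
    rng = random.Random(seed)
    return [lexicase_select(jungle, X, y, rng) for _ in range(garden_size)]
```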



Datasets

To compose a variegated collection of classification datasets for the experiments described in the next section, we turned to 6 different sources:

  1. Easy. Scikit-learn’s12 “easy” classification datasets, where near-perfect performance is expected as par for the course: iris, wine, cancer, digits.

  2. Clf. Datasets produced through make_classification, a Scikit-learn function that “initially creates clusters of points normally distributed (std=1) about vertices of an n_informative-dimensional hypercube with sides of length 2*class_sep and assigns an equal number of clusters to each class.”13

  3. HIBACHI. A method and software for simulating complex biological and biomedical data14.

  4. GAMETES. Software that generates complex biallelic single nucleotide polymorphism (SNP) disease models for simulation studies. We generated a 2-way and a 3-way biallelic pure, strict, epistatic model with a heritability of 1, referred to in related work as the Xor model15.

  5. OpenML. A repository of over 21,000 datasets, of which we selected problems designated as “Highest Impact”, with a mix of number of samples, features, and classes16.

  6. PMLB. The Penn Machine Learning Benchmark repository is an accessible, curated, and developing public benchmark resource to facilitate identification of the strengths and weaknesses of different machine learning methodologies17.

Note that in addition to there being 6 dataset sources, there is also a mix of dataset repositories (Easy, OpenML, PMLB) and dataset generators (Clf, HIBACHI, GAMETES). Figure 1 shows a “bird’s-eye view” of the total of 31 datasets.

Figure 1

A “bird’s-eye view” of the 31 datasets used in this study: number of instances (left), number of features (center), and number of classes (right).

Experimental setup

The experiments were based on Scikit-learn’s RandomForestClassifier, which we used with its default values (our aim here was not to improve RFs per se but to show that conservation can significantly improve a base ML algorithm)12,13.

The experimental setup is shown in Algorithm 2. For each replicate experiment we created 5 folds, for 5-fold cross-validation. For each fold, the dataset was split into a training set of 4 folds and the left-out test fold. 100 runs were conducted per fold, each consisting of fitting a 100-tree RF to the training set and testing the fitted RF on the test set. In addition, all trees were saved into a jungle and all RFs were saved into a super-ensemble; these could then be tested on the test set.
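One replicate of this setup might be sketched as follows (the dataset, the reduced run and tree counts, and all variable names are our own assumptions for brevity):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

X, y = load_wine(return_X_y=True)
n_runs, n_trees = 5, 20  # the paper uses 100 runs of 100-tree RFs per fold

for train_idx, test_idx in StratifiedKFold(
        n_splits=5, shuffle=True, random_state=0).split(X, y):
    jungle, super_ensemble, rf_scores = [], [], []
    for run in range(n_runs):
        rf = RandomForestClassifier(n_estimators=n_trees, random_state=run)
        rf.fit(X[train_idx], y[train_idx])
        rf_scores.append(rf.score(X[test_idx], y[test_idx]))
        jungle.extend(rf.estimators_)    # conserve every tree
        super_ensemble.append(rf)        # conserve every forest
    # the jungle and super-ensemble can now be evaluated on the test fold,
    # and serve as input to the cultivation methods
```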


The jungle also served as fodder for the cultivation methods of “Ensemble cultivation”. For each fold, we generated gardens of given sizes (see Table 1) using the three cultivation methods: order-based pruning, clustering-based pruning, and lexigarden. These gardens could then also be tested on the test set.


Results

Table 1 shows the results of our experiments. Each line in the table presents the results for a single dataset, involving 30 replicate experiments, as delineated in “Experimental setup”. We show the mean performance of RFs alongside the improvement of the 5 conservation methods discussed in “Ensemble cultivation”.

Table 1 Experimental results.

In addition to reporting improvement values we also report their statistical significance, as assessed by 10,000-round permutation tests. The permutation tests focused on the mean test scores across all replicates and folds, comparing each ensemble (mean) to the original RFs. We report on two p-value significance levels: \(<0.001\) and \(<0.05\).
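A generic two-sided permutation test on the difference of means, of the kind described above, could look like this (our own sketch; the paper's exact test statistic and pairing may differ):

```python
import numpy as np

def permutation_test(scores_a, scores_b, n_rounds=10_000, seed=0):
    """p-value: how often a random relabeling of the pooled scores yields a
    mean difference at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_rounds):
        rng.shuffle(pooled)  # random relabeling of group membership
        diff = abs(pooled[:len(a)].mean() - pooled[len(a):].mean())
        if diff >= observed:
            hits += 1
    return hits / n_rounds
```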


Discussion

Focusing on random forests for classification, we performed a study of the newly introduced idea of conservation machine learning. It is interesting to note that—case in point—our experiments herein alone produced almost fifty million models (31 datasets \(\times \) 30 replicates \(\times \) 5 folds \(\times \) 100 runs \(\times \) 100 decision trees = 46,500,000).

As can be seen in Table 1, we attained statistically significant improvement for all datasets except the four “easy” problems, where a standard RF’s accuracy was close to 1 to begin with and little improvement could be eked out. Every method but the super-ensemble ranked highest on some datasets: the jungle method attained the best improvement for 8 datasets, order-based pruning for 10, clustering-based pruning for 2, and lexigarden for 8 (“easy” datasets are excluded from this count).

In summary, our results show that conservation random forests are able to improve performance through ensemble cultivation by making use of models we are already in possession of anyway.

There is a cost attached to conserving models, involving memory and computation time. The former is probably less significant, since saving millions and even billions of models requires storage space well within our reach. Computation time of a jungle or super-ensemble is obviously more costly than an ensemble of a lesser size (or a single model), but if the performance benefits are deemed worthwhile then time should not pose an issue. Computing a garden’s output involves only a minor increase in computation time, and a one-time computation of the garden’s members. As pointed out recently by18, improvements in software, algorithms, and hardware architecture can bring a much-needed boost. For example, they showed that a sample Python program, when coded in Java, produced a 10.8\(\times \) speedup, and coding it in C produced a 47\(\times \) speedup; moreover, tailoring the code to exploit specific features of the hardware gained an additional 1300\(\times \) speedup.

Concluding remarks

There are many possible avenues for future exploration:

  • We focused on classification tasks, which leaves open the exploration of other types of tasks, such as regression and clustering.

  • We examined random forests, with other forms of ensemble techniques—such as sequential, boosting algorithms—yet to be explored.

  • Non-ensemble techniques deserve study as well.

  • It is also possible to amass models that were obtained through different ML methods (or even non-ML methods).

  • We used simple majority voting to compute the output of jungles, super-ensembles, and gardens. More sophisticated methods could be explored.

  • We offered lexigarden as a novel supplement to the cultivation-method toolkit. Other cultivation techniques could be devised.

  • As noted in1, current cloud repositories usually store code, datasets, and leaderboards. One might consider a new kind of model archive, storing a plethora of models, which could provide bountiful grist for the ML mill.

While we focused on random forests herein, we note again that conservation ML is essentially an “add-on” meta-algorithm that can be applied to any collection of models, however they were obtained.

We hope to see this idea receiving attention in the future.

Code availability

The code is available at


References

  1. Sipper, M. & Moore, J. H. Conservation machine learning. BioData Min. 13, 9 (2020).

  2. Ho, T. K. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, 278–282 (IEEE, 1995).

  3. Caruana, R., Niculescu-Mizil, A., Crew, G. & Ksikes, A. Ensemble selection from libraries of models. In Proceedings of the 21st International Conference on Machine Learning, 137–144 (ACM Press, 2004).

  4. Wolpert, D. H. Stacked generalization. Neural Netw. 5(2), 241–259 (1992).

  5. Van der Laan, M. J., Polley, E. C. & Hubbard, A. E. Super learner. Stat. Appl. Genet. Mol. Biol. 6(1) (2007).

  6. Tsoumakas, G., Partalas, I. & Vlahavas, I. An ensemble pruning primer. In Applications of Supervised and Unsupervised Ensemble Methods (eds Okun, O. & Valentini, G.) 1–13 (Springer, 2009).

  7. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562(7726), 203–209 (2018).

  8. Metevier, B., Saini, A. K. & Spector, L. Lexicase selection beyond genetic programming. In Genetic Programming Theory and Practice XVI (eds Banzhaf, W., Spector, L. & Sheneman, L.) 123–136 (Springer, 2019).

  9. Spector, L. Assessment of problem modality by differential performance of lexicase selection in genetic programming: A preliminary report. In Proceedings of the 14th Annual Conference Companion on Genetic and Evolutionary Computation, 401–408 (ACM, 2012).

  10. Helmuth, T., Spector, L. & Matheson, J. Solving uncompromising problems with lexicase selection. IEEE Trans. Evol. Comput. 19(5), 630–643 (2014).

  11. Helmuth, T., McPhee, N. F. & Spector, L. Lexicase selection for program synthesis: A diversity analysis. In Genetic Programming Theory and Practice XIII (eds Riolo, R., Worzel, W. P., Kotanchek, M. & Kordon, A.) 151–167 (Springer, 2016).

  12. Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  13. Scikit-learn: Machine learning in Python. Accessed 9 June 2020.

  14. Moore, J. H., Shestov, M., Schmitt, P. & Olson, R. S. A heuristic method for simulating open-data of arbitrary complexity that can be used to compare and evaluate machine learning methods. In Pacific Symposium on Biocomputing, vol. 23, 259–267 (World Scientific, 2018).

  15. Urbanowicz, R. J. et al. GAMETES: A fast, direct algorithm for generating pure, strict, epistatic models with random architectures. BioData Min. 5(1), 16 (2012).

  16. Vanschoren, J., van Rijn, J. N., Bischl, B. & Torgo, L. OpenML: Networked science in machine learning. SIGKDD Explor. 15(2), 49–60 (2013).

  17. Olson, R. S., La Cava, W., Orzechowski, P., Urbanowicz, R. J. & Moore, J. H. PMLB: A large benchmark suite for machine learning evaluation and comparison. BioData Min. 10(1), 36 (2017).

  18. Leiserson, C. E. et al. There’s plenty of room at the top: What will drive computer performance after Moore’s law? Science 368(6495) (2020).



Acknowledgements

We thank Patryk Orzechowski for his help with the HIBACHI datasets, and Ryan Urbanowicz for helping out with the GAMETES datasets. This work was supported by National Institutes of Health (USA) grants LM010098, LM012601, AI116794.

Author information




M.S. conceived the basic idea, ran the experiments, and wrote the paper draft. J.H.M. provided statistical analysis and paper improvements. All authors reviewed the manuscript.

Corresponding author

Correspondence to Moshe Sipper.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit


About this article


Cite this article

Sipper, M., Moore, J.H. Conservation machine learning: a case study of random forests. Sci Rep 11, 3629 (2021).
