Conservation machine learning: a case study of random forests

Sipper, Moshe; Moore, Jason H.

doi:10.1038/s41598-021-83247-4

Download PDF

Article
Open access
Published: 11 February 2021

Conservation machine learning: a case study of random forests

Moshe Sipper^1,2 &
Jason H. Moore¹

Scientific Reports volume 11, Article number: 3629 (2021) Cite this article

5591 Accesses
25 Citations
17 Altmetric
Metrics details

Subjects

Abstract

Conservation machine learning conserves models across runs, users, and experiments—and puts them to good use. We have previously shown the merit of this idea through a small-scale preliminary experiment, involving a single dataset source, 10 datasets, and a single so-called cultivation method—used to produce the final ensemble. In this paper, focusing on classification tasks, we perform extensive experimentation with conservation random forests, involving 5 cultivation methods (including a novel one introduced herein—lexigarden), 6 dataset sources, and 31 datasets. We show that significant improvement can be attained by making use of models we are already in possession of anyway, and envisage the possibility of repositories of models (not merely datasets, solutions, or code), which could be made available to everyone, thus having conservation live up to its name, furthering the cause of data and computational science.

Mushroom data creation, curation, and simulation to support classification tasks

Article Open access 14 April 2021

Alternative stopping rules to limit tree expansion for random forest models

Article Open access 06 September 2022

Dataset meta-level and statistical features affect machine learning performance

Article Open access 19 January 2024

Introduction

We recently presented the idea of conservation machine learning, wherein machine learning (ML) models are saved across multiple runs, users, and experiments¹. Conservation ML is essentially an “add-on” meta-algorithm, which can be applied to any collection of models (or even sub-models), however they were obtained: via ensemble or non-ensemble methods, collected over multiple runs, gathered from different modelers, a priori intended to be used in conjunction with others—or simply plucked a posteriori, and so forth.

A random forest (RF) is an oft-used ensemble technique that employs a forest of decision-tree classifiers on various sub-samples of the dataset, with random subsets of the features for node splits. It uses majority voting (for classification problems) or averaging (for regression problems) to improve predictive accuracy and control over-fitting².

Reference³ presented a method for constructing ensembles from libraries of thousands of models. They used a simple hill-climbing procedure to build the final ensemble, and successfully tested their method on 7 problems. They further examined three alternatives to their selection procedure to reduce overfitting. Pooling algorithms, such as stacked generalization⁴, and super learner⁵, have also proven successful. There also exists a body of knowledge regarding ensemble pruning⁶.

We believe the novelty of conservation machine learning, herein applied to random forests, is two-fold. First and foremost, we envisage the possibility of vast repositories of models (not merely datasets, solutions, or code). Consider the common case wherein several research groups have been tackling an extremely hard problem (e.g.,⁷), each group running variegated ML algorithms over several months (maybe years). It would not be hard to imagine that the number of models produced over time would run into the millions (quite easily more). Most of these models would be discarded unflinchingly, with only a minute handful retained, and possibly reported upon in the literature. We advocate making all models available to everyone, thus having conservation live up to its name, furthering the cause of data and computational science. Our second contribution in this paper is the introduction of a new ensemble cultivation method—lexigarden.

In¹ we offered a discussion and a preliminary proof-of-concept of conservation ML, involving a single dataset source, 10 datasets, and a single so-called cultivation method. Herein, focusing on classification tasks, we perform extensive experimentation involving 5 cultivation methods, including the newly introduced lexigarden (“Ensemble cultivation”), 6 dataset sources, and 31 datasets (“Datasets”). Upon describing the setup (“Experimental setup”), we show promising results (“Results”), followed by a discussion (“Discussion”) and concluding remarks (“Concluding remarks”).

Ensemble cultivation

Conservation ML begins with amassing a collection of models—through whatever means. Herein, we will collect models saved over multiple runs of RF training.

Once in possession of a collection of fitted models it is time to produce a final ensemble. We examine five ways of doing so:

1.
Jungle. Use all collected fitted models to form class predictions through majority voting, where each model votes for a single class.
2.
Super-ensemble. Use an ensemble of ensembles, namely an ensemble of RFs, with prediction done through majority voting, where each RF votes for a single class.

To clarify, assume we perform 100 runs of RFs of size 100. We are then in possession of a jungle of size 10,000 decision trees, and a super-ensemble of size 100 RFs. Both perform predictions through majority voting.
3.
Order-based (or ranking-based) pruning. We implemented two methods of ensemble pruning⁶. The first, ranking-based, sorts the jungle from best model (over training set) to worst, and then selects the top n models for the ensemble, with n being a specified parameter
4.
Clustering-based pruning. The second ensemble-pruning method performs k-means clustering over all model output vectors, with a given number of clusters, k, and then produces an ensemble by collecting the top-scoring (over training set) model of each cluster.
5.
Lexigarden. A new method, introduce herein and described below, based on lexicase selection, a performant selection technique for evolutionary algorithms, with selection being one of the primary steps in such algorithms^8,9. This step mimics the role of natural selection in nature, by probabilistically selecting fitter individuals from the evolving population, which will then undergo pseudo-genetic modification operators.

Lexicase selection selects individuals by filtering a pool of individuals which, before filtering, typically contains the entire population. The filtering is accomplished in steps, each of which filters according to performance on a single test case. Lexicase selection has been used productively within the field of evolutionary algorithms^10,11. Herein, we co-opt it to cultivate a “garden” of select trees from the jungle, introducing the lexigarden algorithm. Lexigarden generates a garden of a specified size, whose models were selected through lexicase selection.

Algorithm 1 provides the pseudocode. Lexigarden receives a jungle of models, a dataset along with target values, and the number of models the generated garden is to contain. The lexicase function begins by randomly shuffling the dataset, after which it successively iterates through it, retaining only the models that provide a correct answer for each instance. In the end we are left with either a single model or a small number of models, which are precisely those that have correctly classified the subset of the dataset engendered through the looping process. The lexicase function is called as many times as needed to fill the garden with models. Lexigarden ends up generating a highly diverse subset of all the models by picking ones that each excels on a particular random subset of instances.

Datasets

To compose a variegated collection of classification datasets for the experiments described in the next section we turned to 6 different sources:

1.
Easy. Scikit-learn’s¹² “easy” classification datasets, where near-perfect performance is expected as par for the course: iris, wine, cancer, digits.
2.
Clf. Datasets produced through make_classification, a Scikit-learn function that, “initially creates clusters of points normally distributed (std=1) about vertices of an n_informative-dimensional hypercube with sides of length 2*class_sep and assigns an equal number of clusters to each class.”¹³
3.
HIBACHI. A method and software for simulating complex biological and biomedical data¹⁴.
4.
GAMETES. Software that generates complex biallelic single nucleotide polymorphism (SNP) disease models for simulation studies. We generated a 2-way and a 3-way biallelic pure, strict, epistatic model with a heritability of 1, referred to in related work as the Xor model¹⁵.
5.
OpenML. A repository of over 21,000 datasets, of which we selected problems designated as “Highest Impact”, with a mix of number of samples, features, and classes¹⁶.
6.
PMLB. The Penn Machine Learning Benchmark repository is an accessible, curated, and developing public benchmark resource to facilitate identification of the strengths and weaknesses of different machine learning methodologies¹⁷.

Note that in addition to there being 6 dataset sources, there is also a mix of dataset repositories (Easy, OpenML, PMLB) and dataset generators (Clf, HIBACHI, GAMETES). Figure 1 shows a “bird’s-eye view” of the total of 31 datasets.

Experimental setup

The experiments were based on Scikit-learn’s RandomForestClassifier function, which we used with its default values (our aim here was not to improve RFs per se but to show that conservation can significantly improve a base ML algorithm)^12,13.

The experimental setup is shown in Algorithm 2. For each replicate experiment we created 5 folds, for 5-fold cross validation. For each fold the dataset was split into a training set of 4 folds and the left-out test fold. 100 runs were conducted per fold. Each run consisted of fitting a 100-tree RF to the traning set and testing the fitted RF on the test set. In addition, all trees were saved into a jungle and all RFs were saved into a super-ensemble; these could then be tested on the test set.

The jungle also served as fodder for the cultivation methods of “Ensemble cultivation”. For every training epoch, we generated gardens of given sizes (see Table 1) using the three cultivation methods: order-based pruning, clustering-based pruning, and lexigarden. These gardens could then also be tested on the test set.

Results

Table 1 shows the results of our experiments. Each line in the table presents the results for a single dataset, involving 30 replicate experiments, as delineated in “Experimental setup”. We show the mean performance of RFs alongside the improvement of the 5 conservation methods discussed in “Ensemble cultivation”.

Table 1 Experimental results.

Full size table

In addition to reporting improvement values we also report their statistical significance, as assessed by 10,000-round permutation tests. The permutation tests focused on the mean test scores across all replicates and folds, comparing each ensemble (mean) to the original RFs. We report on two p-value significance levels: \(<0.001\) and \(<0.05\).

Discussion

Focusing on random forests for classification we performed a study of the newly introduced idea of conservation machine learning. It is interesting to note that—case in point—our experiments herein alone produced almost fifty million models (31 datasets \(\times \) 30 replicates \(\times \) 5 folds \(\times \) 100 runs \(\times \) 100 decision trees = 46,500,000).

As can be seen in Table 1, we attained statistically significant improvement for all datasets, except for the four “easy” problems, where a standard RF’s accuracy was close to 1 to begin with, and little improvment could be eked out. All but the super-ensemble method ranked highest on several datasets: the jungle method attained the best improvement for 8 datasets, order-based pruning attained the best improvement for 10 datasets, clustering-based pruning attained the best improvement for 2 datasets, and lexigarden attained the best improvement for 8 datasets (we excluded “easy” datasets from this count).

In summary, our results show that conservation random forests are able to improve performance through ensemble cultivation by making use of models we are already in possession of anyway.

There is a cost attached to conserving models, involving memory and computation time. The former is probably less significant, since saving millions and even billions of models requires storage space well within our reach. Computation time of a jungle or super-ensemble is obviously more costly than an ensemble of a lesser size (or a single model), but if the performance benefits are deemed worthwhile then time should not pose an issue. Computing a garden’s output involves only a minor increase in computation time, and a one-time computation of the garden’s members. As pointed out recently by¹⁸, improvements in software, algorithms, and hardware architecture can bring a much-needed boost. For example, they showed that a sample Python program, when coded in Java, produced a 10.8\(\times \) speedup, and coding it in C produced a 47\(\times \) speedup; moreover, tailoring the code to exploit specific features of the hardware gained an additional 1300\(\times \) speedup.

Concluding remarks

There are many possible avenues for future exploration:

We focused on classification tasks, which leaves open the exploration of other types of tasks, such as regression and clustering.
We examined random forests, with other forms of ensemble techniques—such as sequential, boosting algorithms—yet to be explored.
Non-ensemble techniques deserve study as well.
It is also possible to amass models that were obtained through different ML methods (or even non-ML methods).
We used simple majority voting to compute the output of jungles, super-ensembles, and gardens. More sophisticated methods could be explored.
We offered lexigarden as a novel supplement to the cultivation-method toolkit. Other cultivation techniques could be devised.
As noted in¹, current cloud repositories usually store code, datasets, and leaderboards. One might consider a new kind of model archive, storing a plethora of models, which could provide bountiful grist for the ML mill.

While we focused on random forests herein, we note again that conservation ML is essentially an “add-on” meta-algorithm that can be applied to any collection of models, however they were obtained.

We hope to see this idea receiving attention in the future.

Code availability

The code is available at https://github.com/EpistasisLab.

References

Sipper, M., & Moore, J.H. Conservation machine learning. BioData Min.13(9) (2020).
Ho, T.K. Random decision forests. In Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282. IEEE (1995).
Caruana, R., Niculescu-Mizil, A., Crew, G., & Ksikes, A. Ensemble selection from libraries of models. In In Proceedings of the 21st International Conference on Machine Learning, pp. 137–144. ACM Press (2004).
David, H. Stacked generalization. Neural Netw. 5(2), 241–259 (1992).
Article Google Scholar
Van der Laan, M. J., Polley, E. C. & Hubbard, A. E. Super learner. Stat. Appl. Genet. Mol. Biol.6(1) (2007).
Tsoumakas, G., Partalas, I., & Vlahavas, I. An ensemble pruning primer. In Applications of Supervised and Unsupervised Ensemble Methods (eds Okun, O. & Valentini, G.) 1–13 (Springer, Berlin, Heidelberg, 2009).
Bycroft, C. et al. The UK biobank resource with deep phenotyping and genomic data. Nature 562(7726), 203–209 (2018).
Article ADS CAS Google Scholar
Metevier, B., Saini, A. K., & Spector, L. Lexicase selection beyond genetic programming. In Banzhaf, W., Spector, L., & Sheneman, L., editors, Genetic Programming Theory and Practice XVI, pp. 123–136. Springer (2019).
Spector L. Assessment of problem modality by differential performance of lexicase selection in genetic programming: A preliminary report. In Proceedings of the 14th Annual Conference Companion on Genetic and Evolutionary Computation, pp. 401–408. ACM (2012).
Helmuth, T., Spector, L. & Matheson, J. Solving uncompromising problems with lexicase selection. IEEE Trans. Evol. Comput. 19(5), 630–643 (2014).
Article Google Scholar
Helmuth, T., McPhee, N. F., & Spector, L. Lexicase selection for program synthesis: A diversity analysis. In Riolo, R., Worzel, W.P., Kotanchek, M., & Kordon, A., editors, Genetic Programming Theory and Practice XIII, pp. 151–167, Cham. Springer International Publishing (2016).
Pedregosa, F. et al. Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
MathSciNet MATH Google Scholar
Scikit-learn: Machine learning in Python. https://scikit-learn.org/. Accessed: 2020-06-09 (2020).
Moore, J. H., Shestov, M., Schmitt, P., & Olson, R. S. A heuristic method for simulating open-data of arbitrary complexity that can be used to compare and evaluate machine learning methods. In Pacific Symposium on Biocomputing, volume 23, pp. 259–267. World Scientific (2018).
Urbanowicz, R. J. et al. GAMETES: a fast, direct algorithm for generating pure, strict, epistatic models with random architectures. BioData Min. 5(1), 16 (2012).
Article Google Scholar
Vanschoren, J., van Rijn, J. N., Bischl, B. & Torgo, L. OpenML: Networked science in machine learning. SIGKDD Explor. 15(2), 49–60 (2013).
Article Google Scholar
Olson, R. S., La Cava, W., Orzechowski, P., Urbanowicz, R. J. & Moore, J. H. PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Min. 10(1), 36 (2017).
Article Google Scholar
Leiserson, C.E. et al. There’s plenty of room at the top: What will drive computer performance after moore’s law?. Science 368(6495) (2020).

Download references

Acknowledgements

We thank Patryk Orzechowski for his help with the HIBACHI datasets, and Ryan Urbanowicz for helping out with the GAMETES datasets. This work was supported by National Institutes of Health (USA) grants LM010098, LM012601, AI116794.

Author information

Authors and Affiliations

Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, 19104-6021, USA
Moshe Sipper & Jason H. Moore
Department of Computer Science, Ben-Gurion University, Beer Sheva, 84105, Israel
Moshe Sipper

Authors

Moshe Sipper
View author publications
You can also search for this author in PubMed Google Scholar
Jason H. Moore
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

M.S. wrote conceived the basic idea, ran the experiments, and wrote up the paper draft J.H.M. provided statistical analysis and paper improvements All authors reviewed the manuscript.

Corresponding author

Correspondence to Moshe Sipper.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Sipper, M., Moore, J.H. Conservation machine learning: a case study of random forests. Sci Rep 11, 3629 (2021). https://doi.org/10.1038/s41598-021-83247-4

Download citation

Received: 23 August 2020
Accepted: 01 February 2021
Published: 11 February 2021
DOI: https://doi.org/10.1038/s41598-021-83247-4

This article is cited by

Predicting the risk category of thymoma with machine learning-based computed tomography radiomics signatures and their between-imaging phase differences
- Zhu Liang
- Jiamin Li
- Biao Deng
Scientific Reports (2024)
Identification of colorectal cancer progression-associated intestinal microbiome and predictive signature construction
- Jungang Liu
- Xiaoliang Huang
- Weizhong Tang
Journal of Translational Medicine (2023)
Enhancing manufacturing process by predicting component failures using machine learning
- Raihanus Saadat
- Sharifah Mashita Syed-Mohamad
- Pantea Keikhosrokiani
Neural Computing and Applications (2022)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.