We recently presented the idea of conservation machine learning, wherein machine learning (ML) models are saved across multiple runs, users, and experiments1. Conservation ML is essentially an “add-on” meta-algorithm, which can be applied to any collection of models (or even sub-models), however they were obtained: via ensemble or non-ensemble methods, collected over multiple runs, gathered from different modelers, a priori intended to be used in conjunction with others—or simply plucked a posteriori, and so forth.

A random forest (RF) is an oft-used ensemble technique that employs a forest of decision-tree classifiers on various sub-samples of the dataset, with random subsets of the features for node splits. It uses majority voting (for classification problems) or averaging (for regression problems) to improve predictive accuracy and control over-fitting2.

Reference3 presented a method for constructing ensembles from libraries of thousands of models. They used a simple hill-climbing procedure to build the final ensemble, and successfully tested their method on 7 problems. They further examined three alternatives to their selection procedure to reduce overfitting. Pooling algorithms, such as stacked generalization4, and super learner5, have also proven successful. There also exists a body of knowledge regarding ensemble pruning6.

We believe the novelty of conservation machine learning, herein applied to random forests, is two-fold. First and foremost, we envisage the possibility of vast repositories of models (not merely datasets, solutions, or code). Consider the common case wherein several research groups have been tackling an extremely hard problem (e.g.,7), each group running variegated ML algorithms over several months (maybe years). It would not be hard to imagine that the number of models produced over time would run into the millions (quite easily more). Most of these models would be discarded unflinchingly, with only a minute handful retained, and possibly reported upon in the literature. We advocate making all models available to everyone, thus having conservation live up to its name, furthering the cause of data and computational science. Our second contribution in this paper is the introduction of a new ensemble cultivation method—lexigarden.

In1 we offered a discussion and a preliminary proof-of-concept of conservation ML, involving a single dataset source, 10 datasets, and a single so-called cultivation method. Herein, focusing on classification tasks, we perform extensive experimentation involving 5 cultivation methods, including the newly introduced lexigarden (“Ensemble cultivation”), 6 dataset sources, and 31 datasets (“Datasets”). Upon describing the setup (“Experimental setup”), we show promising results (“Results”), followed by a discussion (“Discussion”) and concluding remarks (“Concluding remarks”).

Ensemble cultivation

Conservation ML begins with amassing a collection of models—through whatever means. Herein, we will collect models saved over multiple runs of RF training.

Once in possession of a collection of fitted models, it is time to produce a final ensemble. We examine five ways of doing so:

  1. Jungle. Use all collected fitted models to form class predictions through majority voting, where each model votes for a single class.

  2. Super-ensemble. Use an ensemble of ensembles, namely an ensemble of RFs, with prediction done through majority voting, where each RF votes for a single class.

     To clarify, assume we perform 100 runs of RFs of size 100. We are then in possession of a jungle of 10,000 decision trees and a super-ensemble of 100 RFs. Both perform predictions through majority voting.

  3. Order-based (or ranking-based) pruning. We implemented two methods of ensemble pruning6. The first, ranking-based, sorts the jungle from best model (over training set) to worst, and then selects the top n models for the ensemble, with n being a specified parameter.

  4. Clustering-based pruning. The second ensemble-pruning method performs k-means clustering over all model output vectors, with a given number of clusters, k, and then produces an ensemble by collecting the top-scoring (over training set) model of each cluster.

  5. Lexigarden. A new method, introduced herein and described below, based on lexicase selection, a performant selection technique for evolutionary algorithms, with selection being one of the primary steps in such algorithms8,9. This step mimics the role of natural selection in nature, by probabilistically selecting fitter individuals from the evolving population, which then undergo pseudo-genetic modification operators.
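To make the first three methods concrete, here is a minimal pure-Python sketch (an illustration only, not the implementation used in the experiments; the stub `predict` interface, returning one class label per instance, is an assumption made for brevity):

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label among the cast votes."""
    return Counter(labels).most_common(1)[0][0]

def jungle_predict(trees, x):
    """Jungle: every tree in the pooled collection votes once."""
    return majority_vote([t.predict(x) for t in trees])

def super_ensemble_predict(forests, x):
    """Super-ensemble: each forest casts a single, internally voted vote."""
    return majority_vote([jungle_predict(forest, x) for forest in forests])

def ranking_based_prune(trees, X_train, y_train, n):
    """Order-based pruning: keep the n trees with the best training accuracy."""
    def accuracy(t):
        return sum(t.predict(x) == y for x, y in zip(X_train, y_train)) / len(y_train)
    return sorted(trees, key=accuracy, reverse=True)[:n]
```

In practice the models would be scikit-learn decision trees, whose `predict` operates on arrays of instances rather than one instance at a time.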

Lexicase selection chooses individuals by filtering a pool that, before filtering, typically contains the entire population. The filtering is accomplished in steps, each of which filters according to performance on a single test case. Lexicase selection has been used productively within the field of evolutionary algorithms10,11. Herein, we co-opt it to cultivate a “garden” of select trees from the jungle, introducing the lexigarden algorithm. Lexigarden generates a garden of a specified size, whose models were selected through lexicase selection.

Algorithm 1 provides the pseudocode. Lexigarden receives a jungle of models, a dataset along with target values, and the number of models the generated garden is to contain. The lexicase function begins by randomly shuffling the dataset, after which it iterates through the instances, retaining only the models that provide a correct answer for each one. In the end we are left with a single model or a small number of models, which are precisely those that have correctly classified the subset of the dataset engendered through the looping process. The lexicase function is called as many times as needed to fill the garden with models. Lexigarden thus generates a highly diverse subset of all the models by picking models that each excel on a particular random subset of instances.

[Algorithm 1: lexigarden pseudocode]
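The procedure described above can be sketched in Python as follows (a simplified rendering of Algorithm 1, not the authors' exact implementation; the `predict_one` method, returning a single class label for one instance, is an illustrative assumption):

```python
import random

def lexicase_select(jungle, X, y, rng):
    """One lexicase pass: walk the instances in random order, keeping only
    the models that classify each instance correctly; if no model in the
    current pool gets an instance right, that instance is skipped."""
    order = list(range(len(X)))
    rng.shuffle(order)
    pool = list(jungle)
    for i in order:
        survivors = [m for m in pool if m.predict_one(X[i]) == y[i]]
        if survivors:
            pool = survivors
        if len(pool) == 1:
            break
    return rng.choice(pool)

def lexigarden(jungle, X, y, garden_size, seed=0):
    """Call lexicase selection as many times as needed to fill the garden."""
    rng = random.Random(seed)
    return [lexicase_select(jungle, X, y, rng) for _ in range(garden_size)]
```

Because each call shuffles the instances anew, different calls favor models that excel on different random subsets of the data, which is the source of the garden's diversity.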


Datasets

To compose a variegated collection of classification datasets for the experiments described in the next section, we turned to 6 different sources:

  1. Easy. Scikit-learn’s12 “easy” classification datasets, where near-perfect performance is expected as par for the course: iris, wine, cancer, digits.

  2. Clf. Datasets produced through make_classification, a Scikit-learn function that, “initially creates clusters of points normally distributed (std=1) about vertices of an n_informative-dimensional hypercube with sides of length 2*class_sep and assigns an equal number of clusters to each class.”13

  3. HIBACHI. A method and software for simulating complex biological and biomedical data14.

  4. GAMETES. Software that generates complex biallelic single nucleotide polymorphism (SNP) disease models for simulation studies. We generated a 2-way and a 3-way biallelic pure, strict, epistatic model with a heritability of 1, referred to in related work as the Xor model15.

  5. OpenML. A repository of over 21,000 datasets, of which we selected problems designated as “Highest Impact”, with a mix of numbers of samples, features, and classes16.

  6. PMLB. The Penn Machine Learning Benchmark repository is an accessible, curated, and developing public benchmark resource to facilitate identification of the strengths and weaknesses of different machine learning methodologies17.

Note that in addition to there being 6 dataset sources, there is also a mix of dataset repositories (Easy, OpenML, PMLB) and dataset generators (Clf, HIBACHI, GAMETES). Figure 1 shows a “bird’s-eye view” of the total of 31 datasets.

Figure 1. A “bird’s-eye view” of the 31 datasets used in this study: number of instances (left), number of features (center), and number of classes (right).

Experimental setup

The experiments were based on Scikit-learn’s RandomForestClassifier function, which we used with its default values (our aim here was not to improve RFs per se but to show that conservation can significantly improve a base ML algorithm)12,13.

The experimental setup is shown in Algorithm 2. For each replicate experiment we created 5 folds, for 5-fold cross validation. For each fold the dataset was split into a training set of 4 folds and the left-out test fold. 100 runs were conducted per fold. Each run consisted of fitting a 100-tree RF to the training set and testing the fitted RF on the test set. In addition, all trees were saved into a jungle and all RFs were saved into a super-ensemble; these could then be tested on the test set.

[Algorithm 2: experimental setup pseudocode]
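Schematically, this setup can be rendered as follows (a simplified skeleton of the loop structure; `fit_forest`, which fits and returns a list of decision trees for a given training index set, e.g. the `estimators_` of a fitted 100-tree RandomForestClassifier, is an injected placeholder):

```python
def conservation_experiment(n_samples, n_folds, n_runs, fit_forest):
    """Collect every tree (jungle) and every forest (super-ensemble)
    trained across n_runs runs per fold of n_folds-fold cross-validation."""
    indices = list(range(n_samples))
    fold_size = n_samples // n_folds
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]
    results = []
    for test_idx in folds:
        train_idx = [i for i in indices if i not in test_idx]
        jungle, super_ensemble = [], []
        for _ in range(n_runs):
            forest = fit_forest(train_idx)   # one run: fit a 100-tree RF
            super_ensemble.append(forest)    # conserve the whole forest
            jungle.extend(forest)            # conserve its individual trees
        # test_idx is the held-out fold on which all ensembles are evaluated
        results.append((test_idx, jungle, super_ensemble))
    return results
```

With the paper's settings (5 folds, 100 runs, 100 trees), each fold yields a jungle of 10,000 trees and a super-ensemble of 100 forests.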

The jungle also served as fodder for the cultivation methods of “Ensemble cultivation”. For each training fold, we generated gardens of given sizes (see Table 1) using the three jungle-based cultivation methods: order-based pruning, clustering-based pruning, and lexigarden. These gardens could then also be tested on the test set.


Results

Table 1 shows the results of our experiments. Each line in the table presents the results for a single dataset, involving 30 replicate experiments, as delineated in “Experimental setup”. We show the mean performance of RFs alongside the improvement of the 5 conservation methods discussed in “Ensemble cultivation”.

Table 1 Experimental results.

In addition to reporting improvement values we also report their statistical significance, as assessed by 10,000-round permutation tests. The permutation tests focused on the mean test scores across all replicates and folds, comparing each ensemble (mean) to the original RFs. We report on two p-value significance levels: \(<0.001\) and \(<0.05\).
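A test of this kind can be sketched generically as follows (a standard two-sample permutation test on the difference of means; the exact pairing of replicates and folds used in our experiments may differ):

```python
import random

def permutation_test(scores_a, scores_b, n_rounds=10_000, seed=0):
    """Two-sided permutation test for a difference in means.
    Returns the fraction of random relabelings whose mean difference
    is at least as extreme as the observed one (the empirical p-value)."""
    rng = random.Random(seed)
    mean = lambda xs: sum(xs) / len(xs)
    observed = abs(mean(scores_a) - mean(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    hits = 0
    for _ in range(n_rounds):
        rng.shuffle(pooled)               # randomly reassign group labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return hits / n_rounds
```

Here `scores_a` would hold the ensemble's mean test scores and `scores_b` the original RFs' mean test scores across replicates and folds.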


Discussion

Focusing on random forests for classification, we performed a study of the newly introduced idea of conservation machine learning. It is interesting to note that—case in point—our experiments herein alone produced almost fifty million models (31 datasets \(\times \) 30 replicates \(\times \) 5 folds \(\times \) 100 runs \(\times \) 100 decision trees = 46,500,000).

As can be seen in Table 1, we attained statistically significant improvement for all datasets, except for the four “easy” problems, where a standard RF’s accuracy was close to 1 to begin with, and little improvement could be eked out. Every method except the super-ensemble ranked highest on several datasets: the jungle method attained the best improvement for 8 datasets, order-based pruning for 10 datasets, clustering-based pruning for 2 datasets, and lexigarden for 8 datasets (“easy” datasets excluded from this count).

In summary, our results show that conservation random forests are able to improve performance through ensemble cultivation by making use of models already in our possession.

There is a cost attached to conserving models, involving memory and computation time. The former is probably less significant, since saving millions and even billions of models requires storage space well within our reach. Computation time of a jungle or super-ensemble is obviously more costly than that of a smaller ensemble (or a single model), but if the performance benefits are deemed worthwhile then time should not pose an issue. Computing a garden’s output involves only a minor increase in computation time, plus a one-time computation of the garden’s members. As pointed out recently in18, improvements in software, algorithms, and hardware architecture can bring a much-needed boost. For example, they showed that a sample Python program, when coded in Java, produced a 10.8\(\times \) speedup, and coding it in C produced a 47\(\times \) speedup; moreover, tailoring the code to exploit specific features of the hardware gained an additional 1300\(\times \) speedup.

Concluding remarks

There are many possible avenues for future exploration:

  • We focused on classification tasks, which leaves open the exploration of other types of tasks, such as regression and clustering.

  • We examined random forests, with other forms of ensemble techniques—such as sequential, boosting algorithms—yet to be explored.

  • Non-ensemble techniques deserve study as well.

  • It is also possible to amass models that were obtained through different ML methods (or even non-ML methods).

  • We used simple majority voting to compute the output of jungles, super-ensembles, and gardens. More sophisticated methods could be explored.

  • We offered lexigarden as a novel supplement to the cultivation-method toolkit. Other cultivation techniques could be devised.

  • As noted in1, current cloud repositories usually store code, datasets, and leaderboards. One might consider a new kind of model archive, storing a plethora of models, which could provide bountiful grist for the ML mill.

While we focused on random forests herein, we note again that conservation ML is essentially an “add-on” meta-algorithm that can be applied to any collection of models, however they were obtained.

We hope to see this idea receiving attention in the future.