Conservation machine learning: a case study of random forests

Conservation machine learning conserves models across runs, users, and experiments—and puts them to good use. We have previously shown the merit of this idea through a small-scale preliminary experiment, involving a single dataset source, 10 datasets, and a single so-called cultivation method—used to produce the final ensemble. In this paper, focusing on classification tasks, we perform extensive experimentation with conservation random forests, involving 5 cultivation methods (including a novel one introduced herein—lexigarden), 6 dataset sources, and 31 datasets. We show that significant improvement can be attained by making use of models we are already in possession of anyway, and envisage the possibility of repositories of models (not merely datasets, solutions, or code), which could be made available to everyone, thus having conservation live up to its name, furthering the cause of data and computational science.


1. Jungle. Use all collected fitted models to form class predictions through majority voting, where each model votes for a single class.
2. Super-ensemble. Use an ensemble of ensembles, namely an ensemble of RFs, with prediction done through majority voting, where each RF votes for a single class.
   To clarify, assume we perform 100 runs of RFs of size 100. We are then in possession of a jungle of 10,000 decision trees and a super-ensemble of 100 RFs. Both perform predictions through majority voting.
3. Order-based (or ranking-based) pruning. We implemented two methods of ensemble pruning 6 . The first, ranking-based, sorts the jungle from best model (over training set) to worst, and then selects the top n models for the ensemble, with n being a specified parameter.
4. Clustering-based pruning. The second ensemble-pruning method performs k-means clustering over all model output vectors, with a given number of clusters, k, and then produces an ensemble by collecting the top-scoring (over training set) model of each cluster.
5. Lexigarden. A new method, introduced herein and described below, based on lexicase selection, a performant selection technique for evolutionary algorithms, with selection being one of the primary steps in such algorithms 8,9 . This step mimics the role of natural selection in nature, by probabilistically selecting fitter individuals from the evolving population, which will then undergo pseudo-genetic modification operators.
Lexicase selection selects individuals by filtering a pool of individuals which, before filtering, typically contains the entire population. The filtering is accomplished in steps, each of which filters according to performance on a single test case. Lexicase selection has been used productively within the field of evolutionary algorithms 10,11 . Herein, we co-opt it to cultivate a "garden" of select trees from the jungle, introducing the lexigarden algorithm.
Lexigarden generates a garden of a specified size, whose models were selected through lexicase selection. Algorithm 1 provides the pseudocode. Lexigarden receives a jungle of models, a dataset along with target values, and the number of models the generated garden is to contain. The lexicase function begins by randomly shuffling the dataset, after which it successively iterates through it, retaining only the models that provide a correct answer for each instance. In the end we are left with either a single model or a small number of models, which are precisely those that have correctly classified the subset of the dataset engendered through the looping process. The lexicase function is called as many times as needed to fill the garden with models. Lexigarden ends up generating a highly diverse subset of all the models by picking ones that each excel on a particular random subset of instances.

Input:
    models ← collection (jungle) of models
    cases ← dataset instances
    targets ← target values
    n_models ← number of models in output garden
Output:
    Garden of n_models models, selected from the input collection

    ...
    while True do
        candidates ← all models in candidates with correct prediction on first case in cases
        if only one candidate remains in candidates then
            Return candidate
        Delete first case from cases
        if cases is empty then
            ...
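The lexicase filtering described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `lexicase_select` and `lexigarden` are hypothetical, model outputs are assumed to be precomputed as a matrix of predictions, a case that no remaining candidate classifies correctly is assumed to leave the pool unchanged, and the garden is assumed to allow repeated models.

```python
import numpy as np


def lexicase_select(correct, rng):
    """Pick one model index through lexicase selection.

    correct: boolean array of shape (n_models, n_cases); correct[m, c] is True
    when model m predicts case c correctly.
    """
    n_models, n_cases = correct.shape
    candidates = np.arange(n_models)
    for case in rng.permutation(n_cases):        # random order of cases
        passed = candidates[correct[candidates, case]]
        if passed.size > 0:                      # assumption: ignore cases nobody solves
            candidates = passed
        if candidates.size == 1:                 # a single survivor ends the filtering
            break
    return int(rng.choice(candidates))           # several survivors: pick one at random


def lexigarden(predictions, targets, n_models, seed=0):
    """Return indices of n_models models chosen by repeated lexicase selection."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(predictions) == np.asarray(targets)[None, :]
    return [lexicase_select(correct, rng) for _ in range(n_models)]


# Toy usage: 3 models, 4 cases (hypothetical data).
preds = np.array([[0, 1, 1, 0],
                  [0, 1, 0, 1],
                  [1, 0, 1, 0]])
y = np.array([0, 1, 1, 1])
garden = lexigarden(preds, y, n_models=5)
```

In this toy example the third model is correct only on a case that the first model also solves, so it can never be the sole survivor of the filtering and the garden contains only the first two models.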

Datasets
To compose a variegated collection of classification datasets for the experiments described in the next section we turned to 6 different sources:
1. Easy. Scikit-learn's 12 "easy" classification datasets, where near-perfect performance is expected as par for the course: iris, wine, cancer, digits.
2. Clf. Datasets produced through make_classification, a Scikit-learn function that "initially creates clusters of points normally distributed (std=1) about vertices of an n_informative-dimensional hypercube with sides of length 2*class_sep and assigns an equal number of clusters to each class." 13
3. HIBACHI. A method and software for simulating complex biological and biomedical data 14 .
4. GAMETES. Software that generates complex biallelic single nucleotide polymorphism (SNP) disease models for simulation studies. We generated a 2-way and a 3-way biallelic pure, strict, epistatic model with a heritability of 1, referred to in related work as the Xor model 15 .
5. OpenML. A repository of over 21,000 datasets, of which we selected problems designated as "Highest Impact", with a mix of number of samples, features, and classes 16 .
6. PMLB. The Penn Machine Learning Benchmark repository is an accessible, curated, and developing public benchmark resource to facilitate identification of the strengths and weaknesses of different machine learning methodologies 17 .
Note that in addition to there being 6 dataset sources, there is also a mix of dataset repositories (Easy, OpenML, PMLB) and dataset generators (Clf, HIBACHI, GAMETES). Figure 1 shows a "bird's-eye view" of the total of 31 datasets.
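As an illustration of the Clf source, a synthetic dataset can be generated with make_classification roughly as follows. The parameter values below are illustrative assumptions only; the settings actually used for the Clf datasets are not given in this section.

```python
from sklearn.datasets import make_classification

# Illustrative values only (assumed, not the paper's settings).
X, y = make_classification(
    n_samples=500,           # number of instances
    n_features=20,           # total number of features
    n_informative=5,         # dimension of the hypercube holding the clusters
    n_redundant=2,           # linear combinations of informative features
    n_classes=3,
    n_clusters_per_class=1,
    class_sep=1.0,           # hypercube side length is 2 * class_sep
    random_state=0,
)
```

Larger values of class_sep spread the clusters farther apart and tend to make the task easier.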

Experimental setup
The experiments were based on Scikit-learn's RandomForestClassifier, which we used with its default values (our aim here was not to improve RFs per se but to show that conservation can significantly improve a base ML algorithm) 12,13 .
The experimental setup is shown in Algorithm 2. For each replicate experiment we created 5 folds, for 5-fold cross validation. For each fold the dataset was split into a training set of 4 folds and the left-out test fold. 100 runs were conducted per fold. Each run consisted of fitting a 100-tree RF to the training set and testing the fitted RF on the test set. In addition, all trees were saved into a jungle and all RFs were saved into a super-ensemble; these could then be tested on the test set.
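The collection of a jungle and a super-ensemble over repeated runs can be sketched as below, for a single train/test split. This is a simplified illustration, not the authors' code: the helper `majority_vote` is hypothetical, the wine dataset and the reduced run and tree counts are chosen only to keep the example fast, and the full replicate and fold loops are omitted. Hard voting is computed explicitly because RandomForestClassifier.predict averages class probabilities rather than counting one vote per tree.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split


def majority_vote(votes):
    """votes: int array (n_voters, n_samples) of class labels 0..k-1.

    Returns the most frequent label per sample (ties go to the lowest label).
    """
    n_classes = votes.max() + 1
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0)


X, y = load_wine(return_X_y=True)  # labels are already 0, 1, 2
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

n_runs, n_trees = 10, 20  # the paper uses 100 runs of 100-tree RFs
jungle, super_ensemble = [], []
for run in range(n_runs):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=run)
    rf.fit(X_tr, y_tr)
    super_ensemble.append(rf)      # conserve the whole forest
    jungle.extend(rf.estimators_)  # conserve every decision tree

# Sub-trees predict encoded class indices, which equal the labels here.
tree_votes = np.array([t.predict(X_te) for t in jungle]).astype(int)
rf_votes = np.array([f.predict(X_te) for f in super_ensemble]).astype(int)

jungle_pred = majority_vote(tree_votes)
super_pred = majority_vote(rf_votes)
jungle_score = balanced_accuracy_score(y_te, jungle_pred)
super_score = balanced_accuracy_score(y_te, super_pred)
```

For datasets whose labels are not 0..k-1, the tree outputs would first need to be mapped back through the forest's classes_ attribute.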

Input:
    dataset ← dataset to be used
Output:
    Performance measures (over test sets)

    ...
    for run ← 1 to n_runs do
        Train RF with n_trees trees on training set
        Test resultant RF on test set
        Add all models (decision trees) to jungle
        Add RF to super-ensemble
    Test jungle on test set
    Test super-ensemble on test set
    Generate gardens of given sizes using order-based pruning, clustering-based pruning, and lexigarden

The jungle also served as fodder for the cultivation methods of "Ensemble cultivation". For every training epoch, we generated gardens of given sizes (see Table 1) using the three cultivation methods: order-based pruning, clustering-based pruning, and lexigarden. These gardens could then also be tested on the test set.
Table 1 shows the results of our experiments. Each row in the table presents the results for a single dataset, involving 30 replicate experiments, as delineated in "Experimental setup". We show the mean performance of RFs alongside the improvement of the 5 conservation methods discussed in "Ensemble cultivation".
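The two pruning-based cultivation methods can be sketched as follows, operating on a matrix of training-set predictions (one row per model in the jungle). This is an illustrative sketch under assumptions: the function names are hypothetical, plain accuracy stands in for the training-set score, and the model outputs are simulated rather than taken from fitted trees.

```python
import numpy as np
from sklearn.cluster import KMeans


def order_based_pruning(train_preds, y_train, n):
    """Indices of the n models with the best training-set score."""
    scores = (train_preds == y_train).mean(axis=1)
    return np.argsort(-scores, kind="stable")[:n]


def clustering_based_pruning(train_preds, y_train, k, seed=0):
    """Cluster model output vectors with k-means; keep the best model per cluster."""
    scores = (train_preds == y_train).mean(axis=1)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = km.fit_predict(train_preds.astype(float))
    garden = []
    for c in range(k):
        members = np.flatnonzero(labels == c)
        if members.size:  # guard against an empty cluster
            garden.append(members[np.argmax(scores[members])])
    return np.array(garden)


# Simulated jungle: 50 models with different error rates on 200 binary cases.
rng = np.random.default_rng(0)
y_train = rng.integers(0, 2, size=200)
error_rates = rng.uniform(0.05, 0.45, size=50)
flips = rng.random((50, 200)) < error_rates[:, None]
train_preds = np.where(flips, 1 - y_train, y_train)

top10 = order_based_pruning(train_preds, y_train, n=10)
clustered = clustering_based_pruning(train_preds, y_train, k=5)
```

The selected indices can then be used to pick the corresponding models from the jungle and combine them by majority voting on the test set.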

Results
In addition to reporting improvement values we also report their statistical significance, as assessed by 10,000-round permutation tests. The permutation tests focused on the mean test scores across all replicates and folds, comparing each ensemble (mean) to the original RFs. We report on two p-value significance levels: < 0.001 and < 0.05.
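A permutation test on the difference of mean scores can be sketched as below. The exact test statistic and procedure used in the experiments are not specified in this section, so this is one common two-sided variant, shown under that assumption; the function name `permutation_test` is hypothetical and the scores are synthetic, not results from the paper.

```python
import numpy as np


def permutation_test(a, b, n_rounds=10_000, seed=0):
    """Two-sided permutation test on the difference of means of a and b."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_rounds):
        rng.shuffle(pooled)  # random reassignment of scores to the two groups
        diff = abs(pooled[:a.size].mean() - pooled[a.size:].mean())
        hits += diff >= observed
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_rounds + 1)


# Synthetic test scores (assumed values, for illustration only).
rng = np.random.default_rng(1)
rf_scores = rng.normal(0.80, 0.02, size=150)
ensemble_scores = rng.normal(0.83, 0.02, size=150)
p_value = permutation_test(ensemble_scores, rf_scores)
```

A small p-value indicates that a difference in means at least as large as the observed one rarely arises when the group labels are shuffled at random.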

Discussion
Focusing on random forests for classification we performed a study of the newly introduced idea of conservation machine learning. It is interesting to note that, as a case in point, our experiments herein alone produced almost fifty million models (31 datasets × 30 replicates × 5 folds × 100 runs × 100 decision trees = 46,500,000).
As can be seen in Table 1, we attained statistically significant improvement for all datasets, except for the four "easy" problems, where a standard RF's accuracy was close to 1 to begin with, and little improvement could be eked out. All but the super-ensemble method ranked highest on several datasets: the jungle method attained the best improvement for 8 datasets, order-based pruning attained the best improvement for 10 datasets, clustering-based pruning attained the best improvement for 2 datasets, and lexigarden attained the best improvement for 8 datasets (we excluded "easy" datasets from this count).
In summary, our results show that conservation random forests are able to improve performance through ensemble cultivation by making use of models we are already in possession of anyway.
There is a cost attached to conserving models, involving memory and computation time. The former is probably less significant, since saving millions and even billions of models requires storage space well within our reach. Computation time of a jungle or super-ensemble is obviously more costly than an ensemble of a lesser size (or a single model), but if the performance benefits are deemed worthwhile then time should not pose an issue. Computing a garden's output involves only a minor increase in computation time, and a one-time computation of the garden's members. As pointed out recently by 18 , improvements in software, algorithms, and hardware architecture can bring a much-needed boost. For example, they showed that a sample Python program, when coded in Java, produced a 10.8× speedup, and coding it in C produced a 47× speedup; moreover, tailoring the code to exploit specific features of the hardware gained an additional 1300× speedup.

Concluding remarks
There are many possible avenues for future exploration:
• We focused on classification tasks, which leaves open the exploration of other types of tasks, such as regression and clustering.
• We examined random forests, with other forms of ensemble techniques (such as sequential, boosting algorithms) yet to be explored.
• Non-ensemble techniques deserve study as well.
• It is also possible to amass models that were obtained through different ML methods (or even non-ML methods).
• We used simple majority voting to compute the output of jungles, super-ensembles, and gardens. More sophisticated methods could be explored.
• We offered lexigarden as a novel supplement to the cultivation-method toolkit. Other cultivation techniques could be devised.
• As noted in 1 , current cloud repositories usually store code, datasets, and leaderboards. One might consider a new kind of model archive, storing a plethora of models, which could provide bountiful grist for the ML mill.
While we focused on random forests herein, we note again that conservation ML is essentially an "add-on" meta-algorithm that can be applied to any collection of models, however they were obtained. We hope to see this idea receiving attention in the future.

Table 1 legend. Cls: number of target classes. RF: mean performance of random forests on test set across all replicates (with standard deviation in parentheses). Performance is measured as the balanced accuracy score between target and predicted values. Jung: results for jungle of size 10,000 decision trees, comprising percent improvement over RFs. For this and the subsequent improvement results we provide an indication of the p-value of a 10,000-round permutation test of the improvement: a '!!' in parentheses indicates a p-value < 0.001, a '!' indicates a p-value < 0.05, and no parenthetic value indicates a p-value >= 0.05. Sup: results for super-ensemble of size 100 RFs. Ord300: results for order-based pruning, producing an ensemble of top 300 decision trees. Ord1000: results for order-based pruning, producing an ensemble of top 1000 decision trees. Clus20: results for clustering-based pruning, with 20 clusters. Clus50: results for clustering-based pruning, with 50 clusters. Lex300: results for garden of size 300 decision trees, generated by lexigarden. Lex1000: results for garden of size 1000 decision trees, generated by lexigarden.