Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm

We present a benchmark test suite and an automated machine learning procedure for evaluating supervised machine learning (ML) models for predicting properties of inorganic bulk materials. The test suite, Matbench, is a set of 13 ML tasks that range in size from 312 to 132k samples and contain data from 10 density functional theory-derived and experimental sources. Tasks include predicting optical, thermal, electronic, thermodynamic, tensile, and elastic properties given a material's composition and/or crystal structure. The reference algorithm, Automatminer, is a highly-extensible, fully-automated ML pipeline for predicting materials properties from materials primitives (such as composition and crystal structure) without user intervention or hyperparameter tuning. We test Automatminer on the Matbench test suite and compare its predictive power with state-of-the-art crystal graph neural networks and a traditional descriptor-based Random Forest model. We find that Automatminer achieves the best performance on 8 of 13 tasks in the benchmark. We also show our test suite is capable of exposing the predictive advantages of each algorithm - namely, that crystal graph methods appear to outperform traditional machine learning methods given ~10^4 or more data points. The pre-processed, ready-to-use Matbench tasks and the Automatminer source code are open source and available online (http://hackingmaterials.lbl.gov/automatminer/). We encourage evaluating new materials ML algorithms on the Matbench benchmark and comparing them against the latest version of Automatminer.


Introduction
New functional materials are vital for making fundamental advances across scientific domains, including computing and energy conversion. However, most materials are brought to commercialization primarily by direct experimental investigation, an approach typically limited by design processes lasting 20+ years, constraints on the number of chemical systems that can be investigated, and the limits of a particular researcher's intuition. By utilizing materials "big data" and leveraging advances in machine learning (ML), the emerging field of materials informatics has demonstrated massive potential as a catalyst for materials development, alongside ab initio techniques such as high-throughput density functional theory 1,2 (DFT). For example, by using support vector machines to search a space of more than 118k candidate crystal structures, Tehrani et al. 3 identified, synthesized, and experimentally validated two novel superhard carbides. In another study, Cooper et al. 4 applied natural language processing (NLP) techniques to assemble 9k photovoltaic candidates from the scientific literature; equipped with algorithmic structure-property encodings and a design-to-device data mining workflow, they identified and experimentally realized a new high-performing panchromatic absorption dye. These examples are but two of many. The sheer investigative volume and potential research impact of materials data mining has helped brand it as "materials 4.0" 5 or "the 4th paradigm" 6 of materials research.

However, the growing role of ML in materials design exposes weaknesses in the materials data mining pipeline. First, there is no systematic method for comparing and selecting materials ML models. Comparing newly published models to existing techniques is crucial for rational ML model design and advancement of the field. Other fields of applied ML have seen rapid advancement in recent years in large part due to the creation and use of standardized community benchmarks; in contrast, although many materials datasets are publicly available 9, it is uncommon for two materials ML algorithms to be tested against the same dataset and with the same data cleaning procedures. Methods for estimating generalization error (e.g., the train/test split) also vary significantly. Typically, either the predictive error is averaged over a set of cross-validation folds (CV score) 10 or a hold-out test set is used, with the specifics of the split procedure varying between studies. Furthermore, if a model's hyperparameters are tuned to directly optimize one of these metrics, equivalent to trying many models and only reporting the best one, they may significantly misrepresent the true generalization error 10,11 (model selection bias). Arbitrary choice of hold-out set can also bias a comparison in favor of one model over another (sample selection bias) [12][13][14]. Thus, the materials informatics community lacks a standard benchmarking method for critically evaluating new models. If models cannot be accurately compared, ML studies are difficult to reproduce and innovation suffers.
Moreover, the breadth of materials ML tasks is so large that many models must still be designed and tuned by hand. While encouraging for the field, the recent explosion 15 of novel descriptors and models has given practitioners a paradox of choice, as selecting the optimal descriptors and model for a given task is nontrivial. The consequence of this paradox of choice is that researchers may select suboptimal models or spend substantial time re-tuning models for new applications. Thus, an automatic algorithm - one which requires no expert domain knowledge to operate yet utilizes knowledge from the published literature - could be of great use in prototyping, validating, and analyzing novel high-fidelity models.
Given the above considerations, a benchmark consisting of the following two parts is needed: (1) a robust test suite of materials ML tasks and (2) an automatic "reference" model. The test suite must mitigate arbitrarily favoring one model over another. Furthermore, it should contain a variety of datasets such that domain-specific algorithms can be compared on specific datasets and general-purpose algorithms can be compared across multiple relevant tasks. The second part, the reference algorithm, may serve multiple purposes. First, it might provide a community standard - or "lower bar" - which future innovation in materials ML should aim to surpass. Second, it can act as an entry point into materials informatics for non-domain specialists since it only requires a dataset as input.
Finally, it can help determine which descriptors in the literature are most applicable to a given task or set of tasks.
In this paper, we introduce both these developments - a benchmark test set and a reference algorithm - for application to inorganic, solid state materials property prediction tasks. Matbench, the test suite, is a collection of 13 materials science-specific data mining tasks curated to reflect the diversity of modern materials data. Containing both traditional "small" materials datasets of only a few hundred samples and large datasets of >10^5 samples from simulation-derived databases, Matbench provides a consistent nested cross validation 16 (NCV) method for estimating regression and classification errors on a range of mechanical, electronic, and thermodynamic material properties. Automatminer, the reference algorithm, is a general-purpose and fully-automated machine learning pipeline. In contrast to other published models that are trained to predict a specific property, Automatminer is capable of predicting any materials property given materials primitives (e.g., chemical composition) as input when provided with a suitable training dataset. It does this by performing a procedure similar to a human researcher: generating descriptors using Matminer's library 17 of published materials-specific featurizations, performing feature reduction and data preprocessing, and determining the best machine learning model by internally testing various possibilities on validation data. We test Automatminer on the test suite in order to establish baseline performance, and we present a comparison of Automatminer with published ML methods. Finally, we demonstrate that our benchmark is capable of distinguishing predictive strengths and weaknesses among ML techniques. We expect both Matbench and Automatminer to evolve over time, although the current versions of these tools are ready for immediate use. As evidence of its usefulness, Kabiraj et al. 18 have recently used Automatminer in their research on 2D ferromagnets.

Matbench test suite v0.1
The Matbench test suite v0.1 contains 13 supervised ML tasks drawn from 10 datasets. Matbench's data is sourced from various sub-disciplines of materials science, such as experimental mechanical properties (alloy strength), computed elastic properties, computed and experimental electronic properties, optical and phonon properties, and thermodynamic stabilities for crystals, 2D materials, and disordered metals. The number of samples in each task ranges from 312 to 132,752, representing both relatively scarce experimental materials properties and comparatively abundant properties such as DFT-GGA 19 formation energies. Each task is a self-contained dataset containing a single material primitive as input (either composition or composition plus crystal structure) and a target property as output for each sample. To help enforce homogeneity, datasets are pre-cleaned to remove unphysical computed data and task-irrelevant experimental data (see Methods for more details); thus, as opposed to many raw datasets or structured online repositories, Matbench's tasks have already had their data cleaned for input into ML pipelines. We recommend the datasets be used as-is for consistent comparisons between models. To mitigate model and sample selection biases, each task uses a consistent nested cross-validation 16 procedure for error estimation (see Methods). The distribution of datasets with respect to application type, sample count, type of input data, and type of output data is illustrated in Figure 1; detailed notes on each task can be found in Table 1.

Figure 1:
"Application" describes the ML target property of the task as it relates to materials, "Num. samples" describes the number of samples in each task, "Input Type" describes the materials primitives that serve as input for each task, and "Task Type" designates the supervised ML task type. Numbers in the bars represent the number of tasks fitting the descriptor above them (e.g., there are 10 regression tasks).
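For illustration, each Matbench task can be loaded as a ready-to-use dataframe through Matminer's dataset retrieval tools. The minimal sketch below assumes the dataset and column names documented for Matminer's `matbench_jdft2d` dataset; these should be verified against the installed release.

```python
from matminer.datasets import load_dataset

# Each Matbench task is a single dataframe: one materials-primitive input
# column (composition or structure) plus one target column.
df = load_dataset("matbench_jdft2d")  # 2D-material exfoliation energies
print(df.columns)  # expected: ["structure", "exfoliation_en"]
print(len(df))     # expected: 636 samples
```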

Automatminer Reference Algorithm
At a high level, an Automatminer pipeline can be considered a black box that performs many of the steps typically performed by trained researchers (feature extraction, feature reduction, model selection, hyperparameter tuning). Given only a training dataset, and without further researcher intervention or hyperparameter tuning, Automatminer produces a machine learning model that accepts materials compositions and/or crystal structures and returns predictions. Automatminer can create persistent end-to-end pipelines containing all internal training data, configuration, and the best-found model -allowing the final models to be further inspected, shared, and reproduced.
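As a concrete illustration of this black-box usage, the sketch below follows Automatminer's documented MatPipe interface on one of the Matbench tasks. The dataset name, target column, and method signatures reflect the published documentation but may differ between versions; the split shown is an arbitrary example, not the benchmark procedure.

```python
from automatminer import MatPipe
from matminer.datasets import load_dataset

# Load a Matbench task (structure column + target column) and split it.
df = load_dataset("matbench_log_kvrh")
train_df = df.iloc[:9000]
test_df = df.iloc[9000:].drop(columns=["log10(K_VRH)"])

pipe = MatPipe.from_preset("express")  # preset configuration used in this work
pipe.fit(train_df, "log10(K_VRH)")     # autofeaturize, clean, reduce, AutoML search
predictions = pipe.predict(test_df)    # returns test_df with a predicted column

pipe.save("kvrh_pipe.p")               # persist the fitted end-to-end pipeline
```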
As shown in Figure 2, the Automatminer pipeline is composed of four stages.
In the first stage, autofeaturization, Automatminer generates potentially relevant features using Matminer's featurizer library 17. Several preset configurations are available based on memory, CPU, and time constraints, and no user customization is required to train or predict using materials data when using these presets. In this work, we report results generated using the "Express" preset, which is designed to run with a maximum AutoML training time of 24 hours.
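To make the autofeaturization stage concrete, the standalone sketch below applies one of Matminer's published composition featurizers directly; Automatminer applies many such featurizers automatically. The example assumes only Matminer's documented featurizer API, with illustrative input formulas.

```python
import pandas as pd
from matminer.featurizers.composition import ElementProperty
from matminer.featurizers.conversions import StrToComposition

df = pd.DataFrame({"formula": ["Fe2O3", "SiC", "GaAs"]})

# Convert formula strings into pymatgen Composition objects.
df = StrToComposition().featurize_dataframe(df, "formula")

# Apply a published descriptor set: Magpie elemental-property statistics.
ep = ElementProperty.from_preset("magpie")
df = ep.featurize_dataframe(df, "composition")
print(df.shape)  # 3 rows, ~130 descriptor columns
```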

Figure 2:
The AutoML + Matminer (Automatminer) pipeline, which can be applied to composition-only datasets, structure datasets, and datasets containing electronic bandstructure information. Once fit, the pipeline accepts one or more materials primitives and returns a prediction of a materials property. During autofeaturization, the input dataset is populated with potentially relevant features using the Matminer library. Next, data cleaning and feature reduction stages prepare the feature matrices for input to an AutoML search algorithm. During training, the final stage searches ML pipelines for optimal configurations; during prediction, the best ML pipeline (according to validation score) is used to make predictions.
We evaluate Automatminer on the Matbench test suite and provide comparisons with alternative algorithms in Figure 3. The evaluation is performed using a five-fold Nested Cross Validation (NCV) procedure. In contrast to relying on a single train-test split, in the five-fold NCV procedure, five different train-test sets are created. For each of the five train-test sets, a machine learning model is fit using only the training data and evaluated on the test data. Note that this implies that even for a single type of model (e.g., Automatminer or CGCNN 33 ), a slightly different model will be trained for each of the five splits since the training data differs between splits. The errors from the five different overall runs are averaged to give the overall score. Note that within each of the five runs of this outer loop, the training data portion is generally split using an inner cross-validation that is used for model selection within the training data, hence the name "Nested Cross Validation" (in our procedure, an algorithm can make use of the training data however it chooses). One advantage of 5-fold nested CV over a traditional train-test split is that each sample in the overall dataset is present as training in four of the splits and as test in one of the splits.
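The sketch below illustrates this NCV logic with scikit-learn on synthetic data. The model, hyperparameter grid, and random seed are illustrative rather than the benchmark's actual settings; the essential point is that the inner search never sees the outer test fold.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, KFold

X, y = np.random.rand(200, 10), np.random.rand(200)  # illustrative data

outer = KFold(n_splits=5, shuffle=True, random_state=0)
fold_maes = []
for train_idx, test_idx in outer.split(X):
    # Inner CV selects hyperparameters using training data only.
    inner = GridSearchCV(
        RandomForestRegressor(random_state=0),
        param_grid={"n_estimators": [50, 100]},
        cv=5,
    )
    inner.fit(X[train_idx], y[train_idx])
    # Evaluate the selected model on the held-out outer test fold.
    fold_maes.append(mean_absolute_error(y[test_idx], inner.predict(X[test_idx])))

# Final score: mean error over the five outer test folds.
print(np.mean(fold_maes))
```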
For all tasks, the Automatminer "Express" preset configuration is used in this work. For some Matbench tasks, we were able to find published scores of researcher-optimized machine learning models, which we label as the "Best Literature" score. However, it should be noted that although these studies report the same error metric (MAE) using similar datasets, the scores do not use identical datasets (e.g., using different data filtering algorithms to remove erroneous or unreliable data points) or the same error estimation procedure (e.g., they do not use nested cross validation and may use different proportions of train and test).
Therefore, these scores cannot be directly compared to the algorithms listed above. All models outperform Dummy on all tasks: the Dummy model's errors are between 68% and 299% higher than those of the best model for each task. We next examine which algorithms perform best, with "best" taken to include scores within 1% of the best NCV score (we find the standard deviation between folds for the same model is typically between 0.5% and 5%). Although Automatminer outperforms the graph networks on most small datasets, MEGNet decisively outperforms Automatminer for the PhDOS task. The predictive advantage may lie in MEGNet's specific architecture and implementation rather than an inherent advantage of crystal graph neural networks, given that CGCNN has higher error than both Automatminer and MEGNet for the PhDOS task.

Discussion
The reference algorithm and test suite presented above encompass a benchmark that can be used to accelerate development of supervised learning tasks in materials science. Automatminer provides an extensible and universal platform for automated model selection, while Matbench defines a consistent test procedure for unbiased model comparison. Together, Automatminer + Matbench define a performance baseline for machine learning models aiming to predict materials properties from composition or crystal structure. In this section, we address limitations and extensions of both the reference algorithm and the test suite.

Reference algorithm analysis
Although the "Express" preset was used to demonstrate Automatminer's performance, the Automatminer pipeline is fully configurable at each stage. the pre-defined model space or feature set construction, thoughtfully-engineered models such as graph networks or other concepts will likely be able to exceed the baseline AutoML model's performance. An AutoML algorithm is best suited for the rapid prototyping of more complex human-tuned models rather than the replacement of architectures designed with human expertise.

Test suite limitations and extensions
In the Matbench benchmark, we use NCV as a one-size-fits-all tool for evaluation, but it is also conceivable that domain-specific methods better estimate the generalization error than NCV. Ren et al. 38 use "grouped" CV to estimate the error of their models for classifying bulk metallic glasses outside of the chemical systems contained in the training set. The rationale behind grouped CV is that the testing procedure should mimic the real-world application. In the case of the bulk metallic glass study, the intended goal of the algorithm was to make predictions in chemical systems where no data points were yet present. However, a randomized train/test split would likely result in selecting some data points from all chemical systems for the training and testing data. Instead, grouped CV first separates data points by chemical space and then assigns each entire chemical space to either the test or training set. This ensures that testing is conducted on chemical spaces for which there is no training data.
Yet, using grouped CV requires a well-defined manner for grouping the data.
In the case of bulk metallic glasses, chemical systems are easily identified as natural groups since the goal is to predict data for entirely unexplored chemical systems. For other materials ML tasks, features for grouping may be hidden in subtle structural motifs or nuances of electronic configuration. Leave-one-cluster-out CV (LOCO-CV) 41 addresses this by determining groups automatically with unsupervised clustering; however, the resulting groups depend on the features used for clustering. If, for example, the features are based on composition but the most natural grouping is by a structural feature such as crystal type, then the resulting groups will have less value. Thus, for now it is largely up to researchers to determine the need for using grouped CV and to determine the best grouping strategy (a minimal grouped-CV sketch is shown below). Other strategies 41,42 to predict outlier data in the test set may also prove useful.
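For concreteness, scikit-learn's GroupKFold implements this grouping logic; the chemical-system labels and data below are illustrative only.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.rand(8, 3)  # feature matrix (illustrative)
y = np.random.rand(8)     # targets (illustrative)
# One group label per sample, e.g. the chemical system it belongs to.
groups = ["Cu-Zr", "Cu-Zr", "Fe-B", "Fe-B", "Ni-P", "Ni-P", "Al-La", "Al-La"]

gkf = GroupKFold(n_splits=4)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # Every chemical system falls entirely in train or entirely in test,
    # so test folds mimic prediction on unexplored chemical systems.
    print(sorted({groups[i] for i in test_idx}))
```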
An improved benchmark could use a specific, distinct error estimation procedure for every task; such a procedure can be determined by domain experts to most accurately represent the real-world use of the algorithm. The ideal benchmark would therefore be a consensus of community tasks, each with an error estimation procedure customized to most accurately reflect the algorithm's true error rate in that particular subfield. We chose NCV as a standard error estimator because there are few such well-agreed-upon procedures for existing materials datasets. Future versions of the benchmark may include error estimation procedures other than NCV.
Matbench is not intended to be a final benchmark but a versioned resource that will grow with the field. The ever-increasing volume of data generated from advances in high-throughput experimentation and computation may enable future ML algorithms to predict classes of materials properties that are presently sparse.

Conclusion
We presented Matbench v0.1, a set of ML tasks aimed at standardizing comparisons of materials property prediction algorithms. We also introduced Automatminer, a fully-automated pipeline for predicting materials properties, which we used to set a baseline across the task set. Using Matbench, we compare Automatminer with crystal graph neural network models, a traditional Random Forest model, and a Dummy control model. We find Automatminer's auto-generated models outperform or equal the RF model across all but one task and are more accurate than crystal graph networks on most tasks with ~10^4 points or fewer.
However, crystal graph networks appear to learn better on tasks with larger datasets. Automatminer can be used outside of benchmarking to make predictions automatically and seed research for more specialized, hand-tuned models. We encourage evaluating new ML algorithms on the Matbench benchmark and comparing with the latest version of Automatminer.

Methods
Raw data for Matbench v0.1 were obtained by downloading from the original sources. Tabular versions of some datasets are available online through Matminer's dataset retrieval tools. These datasets contain metadata and auxiliary data. In contrast, the final Matbench datasets are curated tasks containing only the materials input objects and target variables, with all extraneous data removed.
Unphysical samples (e.g., negative DFT elastic moduli) and highly uncommon or unrepresentative samples (e.g., solid-state noble gases) were removed according to a specific per-task procedure. Table 2 describes the resources and steps needed to recreate each dataset from the original source or Matminer version.

While NCV mitigates model and sample selection biases, it has known drawbacks, chiefly increased computational cost and variability in its error estimates 46. Several alternative schemes have been proposed which preserve NCV's advantages while attempting to mitigate issues from increased variability and computational cost. One potential improvement is repeated NCV, but even this approach demonstrates large variation of loss estimates across nested CV runs and is even more computationally expensive than NCV 47. A promising alternative proposes a smooth analytical substitute for NCV which would reduce NCV's computational intensity 46. This analytical alternative also reduces the variability introduced by learning set choice using weights determined after the outer CV loop has been fixed. Yet, the analytical alternative relies on critical assumptions which do not hold for particular models, such as support vector machines with noisy observations. Therefore, at this time, NCV is an adequate method for evaluating and comparing models using the Matbench benchmark. We note that all the code for running the specific tests in this paper is also present in a subpackage of this repository: https://github.com/hackingmaterials/automatminer/tree/master/automatminer_dev

Automatminer Configuration
Beyond what is listed in Methods, the Automatminer configuration is determined by the specifics of two primary operations - specifying a set of Matminer 4 featurizers (featurization) and specifying a predetermined model space. During fitting and prediction, Automatminer robustly applies this set of featurizers and TPOT 5 searches the model space for the optimal model. The Automatminer "Express" preset used to generate the results in the main text includes preset settings for a set of Automatminer features (Table S2.1), a data cleaning and feature reduction procedure (described in Methods), and settings for TPOT, including the model space (Tables S2.2-3). Alternate presets available in Automatminer can be explored in the source code.
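As an illustration of this configurability, the sketch below assembles a MatPipe from explicitly chosen components. The class names follow Automatminer's documented module layout, though keyword arguments and defaults may differ between versions; the search budget shown is an assumption matching the 24-hour "Express" limit described in the main text.

```python
from automatminer import MatPipe
from automatminer.featurization import AutoFeaturizer
from automatminer.preprocessing import DataCleaner, FeatureReducer
from automatminer.automl.adaptors import TPOTAdaptor

# Assemble a pipeline equivalent in spirit to the "Express" preset,
# but with an explicitly chosen featurizer set and AutoML budget.
pipe = MatPipe(
    autofeaturizer=AutoFeaturizer(preset="express"),  # Matminer featurizer set
    cleaner=DataCleaner(),                            # imputation, encoding, etc.
    reducer=FeatureReducer(reducers=("corr",)),       # correlation-based reduction
    learner=TPOTAdaptor(max_time_mins=1440),          # 24 h TPOT model search
)
```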

Comparison of Automatminer Presets
For tasks where the AutoML algorithm can fit and iterate over models rapidly (i.e., small datasets of <10^4 samples), Automatminer can require increasingly large computational effort for only marginal improvements in predictive performance. Therefore, for small datasets, the bulk of Automatminer's performance can be retained using inexpensive presets (~1-5 minutes of training) rather than more expensive presets (24+ hours of training).
Here we show the performance of three presets on the Matbench tasks. The Debug preset retains a much higher number of features because it undergoes only minimal, correlation-based feature reduction; see the sketch below for how presets are selected.
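A minimal sketch of trading accuracy for training time via presets; the preset names here ("debug", "express", "production") follow Automatminer's documentation and should be checked against the installed version.

```python
from automatminer import MatPipe

# Cheaper presets retain most of the predictive performance on small
# (<10^4 sample) tasks while cutting AutoML search time dramatically.
fast_pipe = MatPipe.from_preset("debug")           # ~minutes of searching
default_pipe = MatPipe.from_preset("express")      # preset used in the main text
thorough_pipe = MatPipe.from_preset("production")  # ~24 h+ search budget
```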