Introduction

Machine learning (ML) has become increasingly popular in the materials science community1,2,3,4,5,6,7,8,9,10,11. Central to the training of machine learning models is the need for findable, accessible, interoperable, and reusable (F.A.I.R.)12 materials science datasets. High-throughput density functional theory (DFT) calculations have proven to be an efficient and reliable way to generate materials property data, screen target materials spaces, and accelerate materials discovery13,14,15,16,17. Concentrated community efforts have led to the curation of large DFT databases for various materials properties, e.g., the Materials Project18, Automatic FLOW for Materials Discovery19, the Open Quantum Materials Database14, and JARVIS-DFT20. The availability of large materials databases has fueled the development and application of machine learning methods that take a chemical formula or an atomic structure as input, including traditional ML models with preselected feature sets21,22,23,24,25,26,27,28,29 and neural networks with automatic feature extraction30,31,32,33,34,35,36,37,38,39,40,41.

The continuously improving performance of ML models on DFT database benchmarks shows the great potential of using these models as surrogates for computationally expensive DFT calculations to explore unknown materials37. However, there are reasons to remain cautious, particularly regarding the generalization performance of the trained ML models42. First, the current DFT databases still cover only a very limited region of the potential materials space43,44. Some databases may be the result of mission-driven calculations and therefore be more focused on certain types of materials or structural archetypes, leading to biased distributions45,46,47,48. In addition, data distributions may shift even between different versions of an actively expanding database, as its focus changes with time. While the common practice is to train and validate ML models on the latest databases, we are unaware of any systematic study examining whether these models can predict reasonably (or at least qualitatively) well the properties of new materials added in future database versions. Such an examination is critical for assessing the maturity of a database (namely, whether it is sufficiently representative of the materials space) and the robustness of the resulting ML predictions, both of which are essential for building trust in the use of these ML models.

Since the current databases may not yet offer an unbiased and sufficiently rich representation of the potential materials space, the performance scores of an ML model evaluated on a random train-validation-test split may be an optimistic estimate of the true generalization performance49,50,51. While the latter may be estimated more properly from grouped cross-validation (CV)52,53,54, finding a well-defined method for grouping data is not always trivial and depends on the preselected input features, which may not yield the most physically relevant grouping37. On the other hand, one may consider it safer to limit the use of an ML model to its applicability domain, or the interpolation region48. However, in the high-dimensional compositional-structural feature space encountered in materials science, it is challenging to properly define an interpolation region and to determine when the model is extrapolating.

In this work, we highlight the limitations of current ML methods in materials science for predicting out-of-distribution samples, by showing that ML models pretrained on the Materials Project18 2018 database suffer unexpectedly severe performance degradation on the latest database. Such performance degradation can occur in the deployment stage of any ML model and erodes community trust in its validity. Therefore, we also provide solutions for diagnosing, foreseeing, and addressing the issue, and discuss ways to improve prediction robustness and generalizability.

The paper is organized as follows. First, we examine the performance of a state-of-the-art neural network, with a comparison to traditional ML models. Next, we analyze the observed performance degradation in terms of the dataset’s feature space. We then discuss different methods based on the dataset’s representation and model predictions to foresee the generalization issue. Finally, we propose ways to improve prediction robustness for materials exploration.

Results

Failure to generalize in new regions of materials space

Formation energy (Ef) is a fundamental property that dictates the phase stability of a material. Formation energy prediction is a basic task for ML models used in materials science, including traditional descriptor-based models26,27,28,55 and neural networks31,32,33,34,36. Among them, graph neural network (GNN) models with atomistic structures as inputs are currently considered to have state-of-the-art performance7. Here we consider the Atomistic LIne Graph Neural Network (ALIGNN), an architecture that performs message passing on both the interatomic bond graph and its line graph corresponding to bond angles34. The ALIGNN model shows the best performance in predicting the Materials Project18 formation energy according to the Matbench37 leaderboard; we therefore choose it as the representative GNN model for the subsequent performance evaluation.

We use the ALIGNN model pretrained on the Materials Project 2018.06.01 version (denoted as MP18), which contains 69239 materials and has been used for benchmarking GNN models in recent papers32,33,34. In the original ALIGNN paper, a 60000-5000-4239 train-validation-test split of the MP18 dataset was used, achieving a mean absolute error (MAE) of 0.022 eV/atom for the test set34.

We use the MP18-pretrained ALIGNN model (ALIGNN-MP18) to predict the formation energies of the new structures in the latest (2021.11.10 version) Materials Project database (denoted as MP21). Instead of testing on the whole MP21 dataset, we consider the scenario where we want to apply ML models to explore a particular material subspace of interest. In this work, we define the alloys of interest (AoI) as the materials space spanned by the first 34 metallic elements (from Li to Ba), i.e., these elements and the alloys formed exclusively by them. This AoI materials space is defined to include the most common components of high-entropy alloys, a class of alloys that has recently drawn much attention thanks to its superior performance compared to traditional alloys56. In the MP21 dataset, there are 7800 AoI, 2261 (or 29%) of which already appear in the MP18 dataset, while the rest are not contained in MP18. Therefore, we consider those 2261 alloys as the AoI in the training set, and those that appear only in MP21 as the AoI in the test set.
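The AoI membership test amounts to a simple composition filter. The sketch below is a minimal illustration using pymatgen; the explicit element list is our assumed enumeration of the 34 metallic elements from Li to Ba described above, and the function name is hypothetical.

```python
from pymatgen.core import Composition

# Assumed enumeration of the 34 metallic elements from Li to Ba (illustrative,
# not necessarily the exact list used in this work).
AOI_ELEMENTS = {
    "Li", "Be", "Na", "Mg", "Al", "K", "Ca", "Sc", "Ti", "V", "Cr", "Mn",
    "Fe", "Co", "Ni", "Cu", "Zn", "Ga", "Rb", "Sr", "Y", "Zr", "Nb", "Mo",
    "Tc", "Ru", "Rh", "Pd", "Ag", "Cd", "In", "Sn", "Cs", "Ba",
}

def is_aoi(formula: str) -> bool:
    """Return True if every element in the formula belongs to the AoI element set."""
    elements = {el.symbol for el in Composition(formula).elements}
    return elements <= AOI_ELEMENTS

print(is_aoi("Fe3Al"))  # True: Fe and Al are both AoI elements
print(is_aoi("Fe2O3"))  # False: O is not an AoI element
```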

A list of important acronyms used in this work is given in Table 1. A description of the MP18 dataset and the AoI data is given in Table 2. We note that the mean absolute deviation (MAD) and the standard deviation (STD) of the data correspond to the mean absolute error (MAE) and the root mean square error (RMSE) of a baseline model whose prediction for every structure is equal to the mean of the training data.

Table 1 List of important acronyms used in this work.
Table 2 Description of the MP18 data and the AoI data in the MP18 and MP21 datasets.

Figure 1 shows the ALIGNN-MP18 performance on the formation energy predictions of the AoI. For the AoI in the training set, the ALIGNN-MP18 predictions agree well with the DFT values, with an MAE of 0.013 eV/atom. For the AoI test data, while there is still a reasonable agreement for the structures with \(E_f^{\mathrm{DFT}}\) below 0.5 eV/atom, the ALIGNN-MP18 model strongly underestimates the formation energies for a significant portion of the structures whose \(E_f^{\mathrm{DFT}}\) are above 0.5 eV/atom. In the latter case, the prediction errors range from 0.5 eV/atom to up to 3.5 eV/atom, which is 23 to 160 times larger than the MP18 test MAE of 0.022 eV/atom. Indeed, the prediction errors are nearly as large as \(E_f^{\mathrm{DFT}}\) for those alloys, indicating that the ALIGNN-MP18 predictions fail to even qualitatively match the DFT formation energies. For reference, the MAE and the coefficient of determination (R2 score) for the AoI test set are 0.297 eV/atom and 0.194, respectively (Table 3).

Fig. 1: Performance of the ALIGNN-MP18 model.
figure 1

a Parity plot and b prediction errors of the ALIGNN-MP18 model.

Table 3 Comparison of MAE (in eV/atom), RMSE (in eV/atom), and coefficient of determination (R2) between different ML models.

It can be seen from Fig. 1 that the ALIGNN-MP18 predictions are largely restricted to the value range below 1 eV/atom. Indeed, despite the large formation energy range (from −4.3 to 4.4 eV/atom) of the whole MP18 dataset, most of the formation energies of the AoI in the MP18 lie between −1 and 1 eV/atom. Therefore, it is not surprising that the ALIGNN-MP18 predictions are limited by the range of the formation energies of the AoI training set. However, it is unexpected that the strong underestimation by the ALIGNN-MP18 model already occurs in the formation energy range of 0.5 to 1 eV/atom. For alloys with formation energies above 1 eV/atom, the ALIGNN-MP18 model predicts values that are well below the upper bound of formation energies in the training set, some of which are even negative. Consequently, the test set performance issue of the ALIGNN-MP18 model cannot be explained by the bounded energy range of the AoI in the training set. The origin of the issue will be discussed in the next section.

To verify whether the performance issue is common to other ML models, we perform the same training and test procedures with traditional descriptor-based ML models. To do so, we first use Matminer26,27,28 to extract 273 features based on compositions and structures for the whole MP18 dataset and the alloys in MP21. Then, we down-select features by sequentially dropping highly correlated features using a Pearson's r of 0.7 as the threshold, reducing the final number of features to 90. These 90 features are used for subsequent traditional ML model training and other analysis throughout this work.
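As an illustration of this down-selection step, the following sketch greedily keeps a feature only if its absolute Pearson correlation with every already kept feature is below 0.7; the function name is ours and the exact iteration order of the original workflow may differ.

```python
import pandas as pd

def drop_correlated_features(features: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Sequentially drop features whose |Pearson r| with an already kept feature exceeds the threshold."""
    corr = features.corr().abs()
    kept = []
    for col in features.columns:
        if all(corr.loc[col, k] < threshold for k in kept):
            kept.append(col)
    return features[kept]

# Hypothetical usage: reduce the 273 Matminer features to the down-selected set
# features_reduced = drop_correlated_features(matminer_features, threshold=0.7)
```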

Here we consider three traditional regression models: gradient-boosted trees as implemented in XGBoost (XGB)57, random forests (RF) as implemented in scikit-learn58, and linear forests (LF) as implemented in linear-forest (https://github.com/cerlymarco/linear-tree)59. XGB sequentially builds a number of decision trees such that each subsequent tree tries to reduce the residuals of the previous ones. RF is an ensemble learning technique that combines multiple independently built decision trees to improve accuracy and minimize variance. LF combines the strengths of the linear and RF models by first fitting a linear model (in this work, a Ridge model) and then building an RF on the residuals of the linear model.

The motivation for using the traditional models to understand the ALIGNN-MP18 performance issue is three-fold. First, we want to know whether traditional ML models also fail to generalize, or whether this failure is unique to neural networks. Second, traditional models provide more interpretability than neural networks and can be used as surrogate models for the ALIGNN in the subsequent analysis. Finally, traditional models are computationally much cheaper to train than large neural networks, allowing us to perform more detailed statistical examinations. In fact, the reference implementation of ALIGNN-MP1834 required a total compute cost of 28 GPU hours plus 224 CPU hours for training on the MP18 dataset, whereas the same training with the traditional ML models takes 0.02 CPU hours (four orders of magnitude less compute than the ALIGNN). The XGB, RF, and LF models are first hypertuned, trained, and tested with the same train-validation-test split of the MP18 as for the ALIGNN model. Then, the models are trained on the MP18 and tested on the new AoI in the MP21. Comparisons of their performance metrics and predictions are shown in Table 3 and Fig. 2, respectively.

Fig. 2: Comparison of MP18-pretrained model performance on the AoI test set.
figure 2

The ALIGNN model performance is compared to that of the a XGB, b RF, and c LF models.

Table 3 shows that the MP18 test set MAEs of the traditional models are three to five times larger than that of the ALIGNN-MP18. This is consistent with literature findings that neural networks usually outperform traditional models in various benchmarks of large materials databases7,37. However, evaluating model performance from a random train-validation-test split rests on the assumption that the data distributions of the training and test sets are identical, which may not hold when exploring new materials. Therefore, such performance scores are not good estimates of a model's true generalizability37,53. Indeed, when the models are applied to the new AoI in MP21, the large performance difference between the traditional and ALIGNN models disappears. More strikingly, XGB outperforms ALIGNN in terms of the MAE, RMSE, and R2 scores, whereas LF outperforms ALIGNN in terms of the RMSE and R2 scores. An equal-footing comparison of the extrapolation performance should also take into account the complexity and capacity of the models. A more consistent comparison may be to compute the ratio of the performance metrics obtained on the training and test sets, which is shown in the last two columns of Table 3. The performance degradation of the traditional models is less severe than that of the ALIGNN model.

Figure 2 gives a more detailed comparison of the prediction performance on the new MP21 alloys. Compared to the ALIGNN model, the XGB model leads to larger errors in the \(E_f^{\mathrm{DFT}}\) range below 0.5 eV/atom, but performs considerably better for high-energy alloys, with fewer structures misclassified as having negative formation energies. On the other hand, the RF model performs similarly to the ALIGNN model in the \(E_f^{\mathrm{DFT}}\) range below 0.5 eV/atom but worse than the latter for high-energy structures. Interestingly, the LF model, in which the linear model is fitted before training the RF model, improves the predictions for high-energy structures to an extent similar to the XGB model. The better RMSE scores for the XGB and LF models are attributed to the less degraded predictions for those high-energy structures.

The above discussion of Table 3 and Fig. 2 shows that the performance degradation observed for the ALIGNN model also occurs in traditional descriptor-based models, but its severity can differ considerably, with the XGB and LF models degrading less. In the following sections, we will reveal the origin of the performance degradation and the reasons behind the better generalizability of the XGB and LF models.

Diagnosing generalization performance degradation

In the previous section, we have shown that the performance issue on the AoI test set is common to different ML models, indicating that it is likely related to the distribution shift between the training and the test sets. For instance, the test set may cover compositions or structures that lie far away from the training set. Here we show how to diagnose this issue in a holistic and detailed manner, and discuss some important insights resulting from this analysis.

We start by comparing the distributions of some basic compositional and structural features between the MP18 and MP21 datasets. In Fig. 3, we count, for each element X among the 34 metallic elements, the number of X-containing AoI in the training and the test set (see Supplementary Fig. 1 for the distribution of all materials). We also plot the MAE of ALIGNN-MP18 for the corresponding X-containing AoI in the test set to investigate potential correlations between large MAE and elements that are underrepresented in the AoI training set. We find that although there are few AoI containing elements such as K, Rb, and Cs in the training set, the corresponding test MAE are actually rather small. Indeed, we find a Spearman's rank correlation coefficient (rS) of 0.06, i.e., negligible correlation, between the test MAE of X-containing AoI and the number of X-containing AoI in the training set. Meanwhile, we find a weak anti-correlation (rS equal to −0.42) between the test MAE of X-containing AoI and the number of all X-containing structures (i.e., AoI and non-AoI) in the training set, although such a correlation vanishes above a threshold of 1000 X-containing structures (Supplementary Fig. 2). This suggests that chemically less relevant data can still inform ML models and may reduce generalization errors in a target subspace, though to a limited extent.
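The per-element correlation analysis can be reproduced along the following lines; the data structures and function name are hypothetical, but the statistics follow the definitions above.

```python
import numpy as np
from scipy.stats import spearmanr

def element_count_vs_mae(test_elements, test_abs_errors, train_elements, element_list):
    """Spearman correlation between per-element test MAE and per-element training counts.

    test_elements / train_elements: one set of element symbols per structure.
    test_abs_errors: absolute prediction errors of the test structures (eV/atom).
    """
    test_abs_errors = np.asarray(test_abs_errors)
    maes, counts = [], []
    for el in element_list:
        mask = np.array([el in s for s in test_elements])
        if mask.any():
            maes.append(test_abs_errors[mask].mean())            # MAE of X-containing test AoI
            counts.append(sum(el in s for s in train_elements))  # number of X-containing training structures
    return spearmanr(maes, counts)

# Hypothetical usage:
# rho, pval = element_count_vs_mae(test_sets, abs_errors, train_sets, sorted(AOI_ELEMENTS))
```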

Fig. 3: Number of AoI containing a given element.
figure 3

The line plot (right Y axis) indicates the MAE of the ALIGNN-MP18 model on the AoI test set.

Another basic composition-related feature is the number of elements contained in a structure. We find that the majority of the AoI in the training and test sets are binary and ternary systems. The poorly predicted structures are the ternary alloys and some binary ones with an \(E_f^{\mathrm{DFT}}\) larger than 0.5 eV/atom (Supplementary Fig. 2).

To study the data distribution in the structural space, we consider the crystallographic space group (SG), which describes the symmetry of a crystal. There are, in total, 230 SG for three-dimensional crystals, and the numbers of AoI belonging to these SG are shown in Fig. 4 (see also Supplementary Fig. 3 for the error distribution). It can be seen that there are few training data but far more test data for the SG-38, SG-71, and SG-187 structures. The parity plots for these structures are shown in Fig. 5. The formation energies for the 538 SG-187 structures in the test set are well predicted, although there are only 15 training AoI with this SG. For the SG-38 AoI, the 1045 test samples that lie well beyond the small formation energy range of the 4 training data are also reasonably well predicted. By contrast, while the formation energies of the test SG-71 AoI in the \(E_f^{\mathrm{DFT}}\) range covered by the training data are well predicted, those with \(E_f^{\mathrm{DFT}}\) higher than 0.5 eV/atom are considerably underestimated by the ALIGNN-MP18 model. The different generalization behavior among these three SG suggests that failure to generalize is not strictly explained by the underrepresentation of a given SG in the training data, nor by the range of target values.

Fig. 4: Number of structures as a function of space group number.
figure 4

For reference, the crystal system for a given interval of SG numbers is as follows: [1,2] triclinic, [3,15] monoclinic, [16,74] orthorhombic, [75,142] tetragonal, [143,167] trigonal, [168,194] hexagonal, and [195,230] cubic.

Fig. 5: Parity plot for the AoI data in different space groups (SG).
figure 5

a SG-187, b SG-38, c SG-71.

While the poorly predicted data are found to be primarily associated with ternary SG-71 structures, it is unclear why these structures in particular are hard for the ALIGNN model to predict. It would be difficult to interrogate the ALIGNN model for a physical understanding of the problem. On the other hand, we find a relatively strong correlation in the test set predictions between the ALIGNN and traditional ML models (Pearson's r for ALIGNN versus RF: 0.83, ALIGNN versus XGB: 0.77, ALIGNN versus LF: 0.68); we can therefore use these models as surrogates for the ALIGNN to study the feature space in place of the neural network's representation.

As mentioned in the previous section, there are 90 features after dropping the highly correlated ones from the initial set of 273 Matminer-extracted features. A typical way to understand high-dimensional data is to project them on a two-dimensional plane by applying dimension reduction. Here we use Uniform Manifold Approximation and Projection (UMAP), a stochastic and non-linear dimensionality reduction algorithm that preserves the data's local and global structure60. One of the key hyperparameters in UMAP is n_neighbors, which constrains the size of the local neighborhood used for learning the data's manifold structure. Lower values of n_neighbors force UMAP to concentrate on the local structure of the data, whereas higher values push UMAP to provide a broader picture by neglecting finer details60. By varying this hyperparameter, one can therefore obtain an idea of the data's structure at different scales. In Fig. 6, we show a UMAP visualization of the feature space of the AoI training and test data. The test samples with low prediction errors are those in clusters covered by the training data, whereas the majority of the poorly predicted alloys (which are largely SG-71 structures) form an isolated cluster away from the rest of the data. Supplementary Figure 4 provides additional UMAP visualizations with smaller n_neighbors, where the clusters are smaller and more dispersed.
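A minimal UMAP projection of the 90-dimensional feature matrix can be obtained as sketched below with the umap-learn package. The random matrices stand in for the actual Matminer features, the n_neighbors and min_dist values are illustrative rather than those used for Fig. 6, and the prior standardization is our assumption.

```python
import numpy as np
import umap
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 90))  # stand-in for the training AoI feature matrix
X_test = rng.normal(size=(100, 90))   # stand-in for the test AoI feature matrix

# Embed training and test data in the same 2D space to compare their coverage
X_all = StandardScaler().fit(X_train).transform(np.vstack([X_train, X_test]))
embedding = umap.UMAP(n_neighbors=50, min_dist=0.1, random_state=0).fit_transform(X_all)

emb_train, emb_test = embedding[: len(X_train)], embedding[len(X_train):]
```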

Fig. 6: UMAP projection of the 90-dimensional feature space for the AoI training and test data.
figure 6

The X and Y axes are not shown because the two UMAP dimensions have no particular meaning. For the UMAP projection with only the test data, the reader is referred to Supplementary Fig. 4.

It is worth noting that we have also attempted the commonly used principal component analysis (PCA) but found no clear clustering trend (Supplementary Fig. 5). This can be related to the fact that PCA is a linear algorithm and is not good at capturing the potentially non-linear relationships among features. Another reason may be that PCA looks for new dimensions that maximize the data's variance but does not preserve the local topology of the data as UMAP does in Fig. 6.

Figure 6 is a clear demonstration in the feature space that the poorly predicted test samples lie in an area well beyond that of the training AoI data. A complementary and more detailed understanding can be obtained by comparing the feature value ranges between the AoI training and test data. Figure 7 shows the features whose value ranges over all AoI data exceed those over the training AoI data by more than 5%. Substantial changes in the value ranges can be noted for some features. In particular, the AoI training data cover only the lower third of the mean neighbor distance variation values and the upper four-fifths of the mean CN_VoronoiNN values. The feature mean CN_VoronoiNN corresponds to the average number of nearest neighbors, while the feature mean neighbor distance variation is the mean of the nearest neighbor distance variation, which measures the extent of atom displacement from high-symmetry sites and the extent of lattice distortion relative to high-symmetry structures27,61. The right panel in Fig. 7 clearly reveals that the test data with large prediction errors have high mean neighbor distance variation and low mean CN_VoronoiNN values, namely the poorly predicted structures are those with strong lattice distortion and a small number of nearest neighbors.

Fig. 7: Distribution of the AoI training and test data.
figure 7

a Feature value range of the training AoI rescaled with respect to that of all the AoI data. b Scatter plot of the AoI training and test data.

It should be noted that the generalization performance degradation discussed in this work is likely to be a widespread issue across materials datasets. Indeed, human bias is known to be present in materials data and to have adverse effects on ML performance45. Its presence has recently been documented for other computational databases such as OQMD and JARVIS-DFT62. This could be related to the fact that these databases are the results of mission-driven calculations focused on specific materials domains and applications rather than on a general, diversified, and unbiased representation of materials. Therefore, as the funding and mission of the database builders change with time, so do the materials representation and the distribution bias in the datasets, leading to degraded performance on out-of-distribution data.

Foreseeing performance issue

Our analysis in the previous section shows that ML models fail to generalize for compounds with large DFT formation energies relative to the range of formation energies in the training data. However, in a materials discovery setting, we must foresee this generalization risk without prior knowledge. In other words, it is important to identify the applicability domain and know whether ML models may be extrapolating and unreliable when used to explore unknown materials.

The natural idea is to define an applicability domain based on the training data density and coverage in the feature space, or equivalently to estimate the similarity and distance between the training and test sets. However, this is not trivial in practice. While estimating data density based on basic compositional and structural features, as shown in Figs. 3 and 4, could provide some indication of a potential distribution shift, our discussion of Fig. 5 also shows that having fewer data for some SGs does not necessarily lead to poor predictions. A more robust and comprehensive picture of the data can perhaps be obtained by extracting meaningful and predictive features and visualizing them with the aid of dimension-reduction techniques such as UMAP. The distribution and clustering of the training and test data, as shown in Fig. 6, can clearly help identify the test samples for which the ML predictions would be problematic. In addition, comparing the ranges of feature values in the training and test data (Fig. 7) is a simple yet effective way to find out whether ML models are extrapolating when used to explore new regions of materials space. Various techniques, including the above-mentioned ones, should be used to inspect the training and the target space during the deployment of ML models, in order to reduce the risk of extrapolation in materials exploration.
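The feature-range check of Fig. 7 amounts to a per-column comparison of minima and maxima. The sketch below expresses the 5% criterion as the fraction of the full range not covered by the training data, which is one possible reading of the criterion; the DataFrame names are hypothetical.

```python
import pandas as pd

def range_expansion(train: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
    """Per-feature fraction of the combined value range that lies outside the training range."""
    combined = pd.concat([train, test])
    train_range = train.max() - train.min()
    full_range = combined.max() - combined.min()
    # Features with zero overall range would need separate handling.
    return (full_range - train_range) / full_range

# Hypothetical usage: flag features whose range grows by more than 5% outside the training data
# expansion = range_expansion(X_train_df, X_test_df)
# extrapolating_features = expansion[expansion > 0.05].sort_values(ascending=False)
```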

Apart from carefully examining the feature space of datasets, one can also train multiple ML models and be more skeptical of predictions for test data on which the models significantly disagree. For instance, our results in Fig. 2 indicate that different ML models disagree considerably on those out-of-distribution samples. Therefore, the degree of disagreement between the ML models can also be used to identify out-of-distribution samples. To better illustrate this point, we compute the prediction difference between the ALIGNN and the other models, namely \(|E_f^{\mathrm{ALIGNN}}-E_f^{\mathrm{XGB}}|\), \(|E_f^{\mathrm{ALIGNN}}-E_f^{\mathrm{RF}}|\), and \(|E_f^{\mathrm{ALIGNN}}-E_f^{\mathrm{LF}}|\), for each of the test data. We then use UMAP to project the test data represented by the model disagreement in Fig. 8, where the data are separated into two clusters. The cluster located on the left is associated with test data having, on average, a much larger disagreement than the cluster on the right. Specifically, the mean value of \(|E_f^{\mathrm{ALIGNN}}-E_f^{\mathrm{XGB}}|\) is 0.69 eV/atom (0.07 eV/atom) for the cluster located on the left (right).
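The disagreement representation used for Fig. 8 can be assembled as sketched below and then projected with UMAP in the same way as the feature space; the prediction arrays are stand-ins and the screening threshold is illustrative.

```python
import numpy as np
import umap

rng = np.random.default_rng(0)
# Stand-ins for per-structure formation-energy predictions (eV/atom) of the four models
e_alignn, e_xgb, e_rf, e_lf = rng.normal(size=(4, 500))

disagreement = np.column_stack([
    np.abs(e_alignn - e_xgb),
    np.abs(e_alignn - e_rf),
    np.abs(e_alignn - e_lf),
])

# 2D embedding of the test data represented by their model disagreement
embedding = umap.UMAP(n_neighbors=30, random_state=0).fit_transform(disagreement)

# A simple screening rule: treat predictions with strong committee disagreement with skepticism
suspect = disagreement.max(axis=1) > 0.25  # threshold in eV/atom, illustrative only
```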

Fig. 8: UMAP projection of the AoI test data represented by the ML model disagreement.
figure 8

The disagreements between the ALIGNN and other models are used to represent the data points.

Another commonly employed method to identify out-of-distribution test samples is uncertainty quantification. However, quantifying the uncertainty associated with neural network predictions is challenging7,63 and is beyond the scope of this work. Instead, we consider the uncertainty associated with the RF model, based on quantile regression forests64. The prediction uncertainty of the RF model is computed as the width of the 95% confidence interval, namely the difference between the 2.5 and 97.5 percentiles of the trees' predictions. As shown in Fig. 9, the RF uncertainty is only moderately correlated with the true prediction error for the test data. Based on the uncertainty distribution of the AoI in the training set, one may consider an uncertainty threshold between 1.5 and 2.0 eV/atom for identifying samples that cannot be reliably predicted. However, using these thresholds not only includes many structures that actually have low prediction errors, but also excludes the poorly predicted structures whose prediction uncertainties lie between 1.0 and 1.5 eV/atom. Therefore, the RF uncertainty quantification does not effectively discern the out-of-distribution from the in-distribution samples.
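The RF uncertainty described here can be approximated directly from the individual trees of a fitted scikit-learn forest; the sketch below uses the spread of per-tree predictions as a simplified stand-in for full quantile regression forests64, with synthetic data for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=90, noise=10.0, random_state=0)
rf = RandomForestRegressor(n_estimators=100, max_features=0.3, random_state=0).fit(X, y)

# Per-tree predictions: shape (n_trees, n_samples)
tree_preds = np.stack([tree.predict(X) for tree in rf.estimators_])

# Uncertainty as the width of the 95% interval of the trees' predictions
lower, upper = np.percentile(tree_preds, [2.5, 97.5], axis=0)
uncertainty = upper - lower
```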

Fig. 9: Prediction uncertainty versus prediction error on the training and test AoI data for the MP18-pretrained RF model.
figure 9

The Pearson (rp), Spearman (rs), and Kendall (rk) correlation coefficients for the prediction uncertainties and errors of the test AoI data are shown for reference.

Improving prediction robustness for materials exploration

Once we have identified the gap between the training and test data, the next step is to improve prediction robustness by acquiring new data, ideally at a minimal additional cost. In the following discussion of different acquisition policies, we consider the RF model as a proxy for the ALIGNN model, because it is much faster to update than the ALIGNN model, and its predictions correlate best with the ALIGNN predictions among the three traditional models. Active learning with the ALIGNN model is beyond the scope of this work.

Our discussion in the previous sections can provide insights for establishing the acquisition policy. For instance, one can prioritize the regions of UMAP space poorly covered by the training data. In Fig. 10, we demonstrate the effectiveness of this simple idea. We add a given number of samples randomly taken from the isolated cluster in the UMAP plot (Fig. 6) to the original MP18 training set to train the RF model. We find a significant decrease in the test MAE compared with the baseline acquisition policy of randomly taking data from the whole test set. With only 50 samples (out of 5539 test samples) added, the UMAP-guided random sampling leads to a test MAE of 0.13 eV/atom, only half of the test MAE of 0.27 eV/atom resulting from the baseline policy (random sampling) with the same number of added samples. The latter needs five times as many samples to arrive at the same MAE.

Fig. 10: Test MAE as a function of the number of selected test data added to the training set.
figure 10

The inset shows the enlarged region at the early stage of the active learning process. All the results are reported for the RF model. The random sampling and the UMAP-plus-random results are the averages of 10 runs with different random seeds, with the error bars indicating the standard deviation.

As discussed for Fig. 8, the level of disagreement between the ML models is also useful for finding the poorly predicted samples. We therefore consider the query-by-committee (QBC) acquisition strategy, where we select the test data with the strongest disagreement among the three committee members (RF, LF, and XGB). As shown in Fig. 10, the QBC strategy performs slightly better than the UMAP-guided random sampling. Hoping to find even better performance in the early acquisition stage, we further consider combining the QBC with the UMAP-guided sampling, but find that the resulting performance is similar to that of the QBC strategy alone. To estimate whether this is because we are approaching the optimal strategy, we compute another acquisition curve, where we select the samples that have the largest RF-DFT disagreement. As this selection assumes the DFT labels are known, it is not a true active learning acquisition, but only serves to estimate the optimal performance that active learning can reach. It is clear from Fig. 10 that the QBC curve is quite close to the estimated optimal curve, so it is not surprising that combining it with UMAP does not bring further improvement.
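A minimal sketch of the QBC acquisition loop with the RF proxy model is given below; the committee composition follows the text (RF, LF, and XGB would be passed as `models`), while the batch size, number of rounds, and variable names are illustrative assumptions rather than the exact settings behind Fig. 10.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def qbc_acquisition(models, X_train, y_train, X_pool, y_pool, X_test, y_test,
                    batch_size=50, n_rounds=5):
    """Iteratively acquire the pool samples on which the committee disagrees most."""
    X_tr, y_tr = X_train.copy(), y_train.copy()
    pool_idx = np.arange(len(X_pool))
    maes = []
    for _ in range(n_rounds):
        # Committee disagreement: spread of the members' predictions on the remaining pool
        preds = np.stack([m.fit(X_tr, y_tr).predict(X_pool[pool_idx]) for m in models])
        disagreement = preds.std(axis=0)
        pick = pool_idx[np.argsort(disagreement)[-batch_size:]]
        # Acquire the DFT labels of the selected samples and retrain the proxy model
        X_tr = np.vstack([X_tr, X_pool[pick]])
        y_tr = np.concatenate([y_tr, y_pool[pick]])
        pool_idx = np.setdiff1d(pool_idx, pick)
        proxy = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
        maes.append(mean_absolute_error(y_test, proxy.predict(X_test)))
    return maes
```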

It is worth noting that with the UMAP-guided sampling or the QBC policy, adding only 1% of the data already results in a reasonable test MAE. These two strategies are therefore very effective in identifying the most diverse and informative samples. Though Fig. 10 shows that adding even more samples can further improve the model performance in the AoI subspace, such an improvement is rather incremental. The compute is better spent exploring the regions of materials space that could bring potentially drastic gains in prediction robustness and accuracy.

We note that the active learning strategies proposed here focus on finding out-of-distribution data points rather than eliminating dataset bias, which is a plausible source of the observed generalization performance degradation. Indeed, human bias in datasets is known to have negative impacts on ML models. While simple random sampling might mitigate these impacts and result in better models than human selection when building a database from scratch45, it is not necessarily the optimal strategy for expanding an existing database (see Fig. 10). In practice, mission-driven databases may already have biases and continue to expand with biases into the new material space defined by new funding projects. Therefore, focusing on the scenario where an existing database and a pool of candidate materials to explore are given, our active learning strategies enable identifying the gap between the existing and proposed datasets and acquiring only the data points that best improve the model performance.

In case acquiring new data is not possible, prediction robustness for those out-of-distribution samples can still be improved by using more extrapolative models. For instance, tree-based models are usually considered to be interpolative, as is also found here for the RF model (Fig. 2). By simply adding a linear component to the RF model, however, the LF model gives a more robust estimate of the stability of those out-of-distribution samples. The better extrapolation performance is enabled by the features whose ranges in the test set far exceed those in the training set, as removing these features from the LF model reduces the extrapolation performance to the same level as the RF model. On the other hand, the better extrapolation performance of the XGB and LF models also comes from the training data outside the space of interest (namely non-AoI), since training only on the MP18 AoI data leads to performance similar to that of the RF model. This indicates that ML models can learn from less relevant data outside the target space for better generalization performance.
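The linear-plus-forest idea can be reproduced in a few lines without the linear-forest package: fit a Ridge model first, then fit an RF on its residuals, and sum the two predictions at inference time. The sketch below is a minimal re-implementation of the concept on synthetic data, not the exact LF implementation used in this work.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=500, n_features=90, noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)  # extrapolative linear component
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y - ridge.predict(X))

def predict_linear_forest(X_new: np.ndarray) -> np.ndarray:
    """Linear trend from Ridge plus the forest correction learned on the Ridge residuals."""
    return ridge.predict(X_new) + forest.predict(X_new)
```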

Discussion

This work is focused on the prediction robustness of ML models, by examining the formation energy predictions of the MP18-pretrained models for the new alloys in the latest MP21 dataset. We considered the ALIGNN model, a graph neural network with state-of-the-art performance in the Matbench formation energy prediction task, as well as three traditional descriptor-based ML models (XGB, RF, and LF). Despite the excellent test performance on the MP18, the MP18-pretrained ALIGNN model strongly underestimated the DFT formation energies of some test data in the MP21. While this performance issue was also found in the traditional ML models, the XGB and LF models provided more robust phase stability estimations for the test data. We analyzed and discussed the origins of the performance degradation from multiple perspectives. In particular, we used UMAP to perform dimension reduction on the high-dimensional Matminer-extracted features, revealing that the poorly predicted data lie far beyond the feature space occupied by the training set. With these insights, we then discussed possible methods, including UMAP-aided clustering and the querying of multiple ML models, to identify out-of-distribution data and foresee performance degradation. Finally, we provided suggestions to improve prediction robustness for materials exploration. We showed that the accuracy can be greatly improved by adding just a very small amount of new data identified by UMAP clustering and by querying different ML models. We believe that UMAP-guided active learning shows promising potential for future dataset expansion. In cases where data acquisition is not possible, we also propose to include extrapolative components such as linear models for more robust predictions on out-of-distribution samples. We hope this work can raise awareness of the limitations of current ML approaches in the materials science community and provide insights for building databases and ML models with better prediction robustness and generalizability. As a perspective, a similar but more extended and systematic analysis of ML generalization performance on other materials properties, including spectral ones such as the density of states, and across multiple databases, will be interesting and important future work.

Methods

The 2018.06.01 snapshot of the Materials Project is retrieved using JARVIS-tools20, while the latest 2021.11.10 version is retrieved using the Materials Project API18. For each material, the Materials Project uses the material_id field as its identifier and the task_ids field to store its past and current identifiers. The MP21 structures whose task_ids field contains an MP18 identifier are considered as materials already existing in the MP18, whereas the rest of the MP21 are considered new materials unseen in the MP18.
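The old-versus-new identification reduces to a set-membership test on the task_ids field; a minimal sketch with hypothetical variable and function names is shown below.

```python
def split_old_new(mp18_ids: set, mp21_entries: list):
    """Split MP21 entries into those already present in MP18 and genuinely new ones.

    mp18_ids: set of MP18 material identifiers.
    mp21_entries: list of dicts with "material_id" and "task_ids" fields from the MP21 query.
    """
    old, new = [], []
    for entry in mp21_entries:
        if mp18_ids & set(entry["task_ids"]):
            old.append(entry)  # at least one past identifier matches an MP18 material
        else:
            new.append(entry)
    return old, new
```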

We use the ALIGNN-MP18 model that was published with the original paper34. We use Matminer28 to extract 273 compositional and structural features27, and obtain 90 features after sequentially dropping highly correlated features (with a Pearson's r of 0.7 as the threshold). We use three traditional ML models: gradient-boosted trees as implemented in XGBoost (XGB)57, random forests (RF) as implemented in scikit-learn58, and linear forests (LF) as implemented in linear-forest (https://github.com/cerlymarco/linear-tree)59. For the XGB model, we use 2000 estimators, a learning rate of 0.1, an L1 (L2) regularization strength of 0.1 (0.0), and the histogram tree grow method. For the RF model, we use 100 estimators and consider 30% of the features when searching for the best split. For the LF model, we combine the same RF settings with a Ridge model with a regularization strength of 1. We use the packages' default settings for the other hyperparameters not mentioned here.
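For reference, the stated hyperparameters translate to model constructors roughly as follows; the LinearForestRegressor import path and argument names reflect our reading of the linear-tree package and should be treated as assumptions.

```python
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

xgb = XGBRegressor(
    n_estimators=2000,
    learning_rate=0.1,
    reg_alpha=0.1,       # L1 regularization strength
    reg_lambda=0.0,      # L2 regularization strength
    tree_method="hist",  # histogram tree grow method
)

rf = RandomForestRegressor(n_estimators=100, max_features=0.3)

# Assumed API of the linear-forest package (https://github.com/cerlymarco/linear-tree):
# from sklearn.linear_model import Ridge
# from lineartree import LinearForestRegressor
# lf = LinearForestRegressor(base_estimator=Ridge(alpha=1.0),
#                            n_estimators=100, max_features=0.3)
```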