Introduction

Machine learning (ML) is emerging as a novel tool for rapid prediction of material properties1,2,3,4,5,6. In general, these predictions are made by fitting statistical models on a large number of data points. Because of the scarcity of well-curated experimental data in materials science, these input data are often obtained from Density Functional Theory (DFT) calculations housed in one of the many open materials databases7,8,9,10,11,12. In principle, once these models are trained on this immense set of quantum chemical data, the determination of properties for new materials can be made in orders of magnitude less time using the trained models compared with computationally expensive DFT calculations.

Of particular interest is the use of ML to discover new materials. The combinatorics of materials discovery make for an immensely challenging problem—if we consider the possible combinations of just four elements (A, B, C, and D), from any of the ~80 elements that are technologically relevant, there are already ~1.6 million quaternary chemical spaces to consider. This is before we consider such factors as stoichiometry (ABCD2, AB2C3D4, etc.) or crystal structure, each of which adds substantially to the combinatorial complexity. The Inorganic Crystal Structure Database (ICSD) of known solid-state materials contains ~10^5 entries13, several orders of magnitude fewer than the 10^10 quaternary compositions identified as plausible using electronegativity- and charge-based rules14. This suggests that (1) there is ample opportunity for new materials discovery and (2) the problem of finding stable materials may resemble the needle-in-a-haystack problem, with many unstable compositions for each stable one. The immensity of this problem is a natural fit for high-throughput ML techniques.
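As a quick back-of-the-envelope check of this combinatorial scale (a minimal sketch; the ~80-element count is taken from the text above):

```python
from math import comb

# Number of distinct 4-element chemical spaces drawn from ~80 technologically
# relevant elements, before accounting for stoichiometry or crystal structure.
print(comb(80, 4))  # 1581580, i.e., ~1.6 million quaternary spaces
```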

In this work, we closely examine whether recently published ML models for formation energy are capable of distinguishing the relative stability of chemically similar materials and provide a roadmap for doing the same for future models. We show that although the formation energy of compounds from elements can be learned with high accuracy using a variety of ML approaches, these learned formation energies do not reproduce DFT-calculated relative stabilities. While the accuracy of these models for formation energy approaches the DFT error (relative to experiment), DFT predictions benefit from a systematic cancellation of error15,16 when making stability predictions, while ML models do not. Of particular concern for most ML models is the high rate of materials predicted to be stable that are not stable by DFT, impeding the use of these models to efficiently discover new materials. As a result, we propose more critical evaluation methods for ML of thermodynamic quantities.

Results

The relationship between formation energy and stability

A necessary condition for a material to be used for any application is stability (under some conditions). The thermodynamic stability of a material is defined by its Gibbs energy of decomposition, ΔGd, which is the Gibbs formation energy, ΔGf, of the specified material relative to all other compounds in the relevant chemical space. Temperature-dependent thermodynamics are not yet tractable with high-throughput DFT and have only sparsely been addressed with ML17, so material stability is primarily assessed using the decomposition enthalpy, ΔHd, which is approximated as the total energy difference between a given compound and competing compounds in a given chemical space15,16,18,19. For the purpose of this study, we directly compare ML predictions and DFT calculations of ΔHd; hence, the lack of entropy contributions is not an issue.

The quantity ΔHd is obtained by a convex hull construction in formation enthalpy (ΔHf)-composition space. Figure 1a shows an example for a binary AB space, having three known compounds, A4B, A2B, and AB3. The convex hull is the lower convex enthalpy envelope which lies below all points in the composition space (blue line). Stable compositions lie on the convex hull, and unstable compositions lie above the hull. A4B is unstable (above the hull), so ΔHd > 0 and is calculated as the distance in ΔHf between A4B and the convex hull of stable points. AB3 is stable (on the hull), so ΔHd < 0 and is calculated as the distance in ΔHf between AB3 and a hypothetical convex hull constructed without AB3 (dashed line). |ΔHd| is therefore the minimum amount that ΔHf must decrease for an unstable compound to become thermodynamically stable or the maximum amount that ΔHf can increase for a stable compound to remain stable. We used ΔHd in this work instead of the more common "energy above the hull" because the former provides a distribution of values for stable compounds, whereas the latter is equal to 0 for all stable compounds. The convex hull construction, described here for a binary system, generalizes to chemical spaces comprising any number of elements.
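This construction maps directly onto standard phase-diagram tooling. Below is a minimal sketch using pymatgen's PhaseDiagram for a binary space analogous to Fig. 1a; the formulas and energies are illustrative stand-ins (Li/O replace A/B because the parser requires real element symbols), not MP data:

```python
from pymatgen.core import Composition
from pymatgen.analysis.phase_diagram import PDEntry, PhaseDiagram

# Illustrative per-atom formation enthalpies (eV/atom) for a binary space
compounds = {"Li4O": -0.10, "Li2O": -0.30, "LiO3": -0.25}

entries = [PDEntry(Composition(f), Composition(f).num_atoms * e)
           for f, e in compounds.items()]           # PDEntry takes total energy
entries += [PDEntry(Composition("Li"), 0.0),
            PDEntry(Composition("O"), 0.0)]         # elemental references at 0

diagram = PhaseDiagram(entries)
for entry in entries[:3]:
    e_hull = diagram.get_e_above_hull(entry)        # >= 0 by construction
    # dHd > 0 for unstable compounds (distance above the hull); for stable
    # compounds, dHd < 0 is the distance to the hull rebuilt without them
    dHd = e_hull if e_hull > 0 else diagram.get_equilibrium_reaction_energy(entry)
    print(entry.composition.reduced_formula, round(dHd, 3))
```

In this toy space, Li4O (the A4B analog) sits ~0.08 eV/atom above the hull, while Li2O and LiO3 are stable with negative ΔHd.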

Fig. 1: Distinction between formation and decomposition energies.

a Illustration of the convex hull construction to obtain the decomposition enthalpy, ΔHd, from the formation enthalpy, ΔHf. b The decomposition enthalpy, ΔHd, shown against the formation enthalpy, ΔHf, for 85,014 ground-state entries in Materials Project, indicating effectively no correlation between the two quantities. The strong linear correlation that is present at ΔHd = ΔHf arises for chemical spaces that contain only one compound (~3% of compounds). These compounds were excluded from the correlation coefficient, R2, determination. A normalized histogram of ΔHf (ΔHd) is shown above (along the right side of) b. Both histograms are binned every 10 meV/atom.

Hence, while ΔHf quantifies to what extent a compound may form from its elements, the thermodynamic quantity that controls phase stability is ΔHd and arises from the competition between ΔHf for all compounds within a chemical space. As we show later, while formation enthalpies can be of the order of several eV, the value of ΔHd is typically 1–2 orders of magnitude smaller. In addition, thermodynamic stability is highly nonlinear in ΔHd around zero, as negative values indicate stable compounds, whereas positive values indicate unstable or metastable compounds.

Although ΔHd determines stability, the standard thermodynamic property that is predicted by ML models is the absolute ΔHf20,21,22,23,24,25,26,27,28,29. This is in large part because ΔHf is intrinsic to a given compound, whereas ΔHd inherently depends upon a compound’s stability relative to neighboring compositions, making ΔHd less robust to learn directly.

Using data available in the Materials Project (MP)16, we applied the convex hull construction to obtain ΔHd for 85,014 inorganic crystalline solids (the majority of which are in the ICSD) and compare ΔHd with ΔHf in Fig. 1b. It is clear that effectively no linear correlation exists between ΔHd and ΔHf, except for the trivial case where only a single compound exists in a chemical space (ΔHd = ΔHf), which is true for only ~3% of materials in MP. While ΔHf somewhat uniformly spans a wide range of energies (mean ± average absolute deviation = −1.42 ± 0.95 eV/atom), ΔHd spans much smaller energies (0.06 ± 0.12 eV/atom), suggesting ΔHd is the more sensitive or subtle quantity to predict (histograms of ΔHf and ΔHd are provided in Fig. 1b). Still, while no linear correlation exists between ΔHd and ΔHf, and ΔHd occurs over a much smaller energy range, it would be possible for ΔHf models to predict ΔHd as long as the relative differences in ΔHf within a given chemical space are predicted with accuracy comparable with the range of variation in ΔHd, or if they benefit from substantial error cancellation when comparing the energies of compounds with similar chemistry.

Learning formation energy from chemical composition

Applying ML to material properties requires that an arbitrary material be “represented” by a set of attributes (features). This representation can be as simple as a vector corresponding to the fractional amount of each element in the compound (e.g., Li2O = [0, 0, 2/3, 0, 0, 0, 0, 1/3, 0, 0, …], where the length of the vector is the number of elements in the periodic table), or a vector that includes substantial physical or chemical information about the material. In the search for new materials, the structure is rarely known a priori, and instead a list of compositions with unknown structure is screened for stability, i.e., the possibility that a thermodynamically stable structure exists at that composition. In this case, the material representation is constructed only from the chemical formula without including properties such as the geometric (i.e., lattice and basis) or electronic structure. These models, which take as input the chemical formula and output thermodynamic predictions, are henceforth referred to as compositional models.
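For concreteness, a minimal sketch of this bare element-fraction encoding (in the spirit of the ElFrac baseline introduced below; the vector indexing by atomic number is an arbitrary illustrative choice):

```python
import numpy as np
from pymatgen.core import Composition

N_ELEMENTS = 103  # one slot per element, indexed by atomic number

def element_fraction_vector(formula: str) -> np.ndarray:
    """Encode a chemical formula as a fixed-length vector of atomic fractions."""
    vec = np.zeros(N_ELEMENTS)
    for el, frac in Composition(formula).fractional_composition.items():
        vec[el.Z - 1] = frac
    return vec

v = element_fraction_vector("Li2O")
print(v[2], v[7])  # Li (Z=3) -> 0.667..., O (Z=8) -> 0.333...
```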

In this work, we assessed the potential for five recently introduced compositional representations—Meredig20, Magpie21, AutoMat22, ElemNet23, and Roost24—to predict the stability of compounds in MP. Meredig, Magpie, and AutoMat include chemical information for each element in their material representations from quantities such as atomic electronegativities, radii, and elemental group. Each of these compositional representations was trained using gradient-boosted regression trees (XGBoost30). ElemNet and Roost differ in that no a priori information other than the stoichiometry is used as input. For ElemNet, a deep learning architecture maps the stoichiometry input into formation energy predictions. For Roost, the stoichiometric representation and fit are simultaneously learned using a graph neural network. In addition, we included a baseline representation for comparison, ElFrac, where the representation is simply the stoichiometric fraction of each element in the formula, trained using XGBoost30. Because compositional models necessarily make the same prediction for all structures having the same formula, all analysis in this work was performed using the lowest energy (ground-state) structure in MP for each available composition. Additional details on the training of each model and the MP dataset are available in the “Methods” section.
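As an illustration of one such pipeline, the sketch below featurizes formulas with the Magpie preset in matminer and fits gradient-boosted trees with the hyperparameters given in the Methods (200 trees, maximum depth 5); the toy DataFrame is a placeholder for the MP data, and the ΔHf values shown are illustrative:

```python
import pandas as pd
from matminer.featurizers.conversions import StrToComposition
from matminer.featurizers.composition import ElementProperty
from xgboost import XGBRegressor

# Placeholder rows; in practice this is the 85,014-formula MP dataset
df = pd.DataFrame({"formula": ["Li2O", "Fe2O3", "NaCl", "TiO2"],
                   "dHf": [-2.0, -1.7, -2.1, -3.3]})  # illustrative eV/atom

df = StrToComposition().featurize_dataframe(df, "formula")
magpie = ElementProperty.from_preset("magpie")       # Magpie elemental features
df = magpie.featurize_dataframe(df, "composition")

X = df[magpie.feature_labels()].values
y = df["dHf"].values
model = XGBRegressor(n_estimators=200, max_depth=5).fit(X, y)
```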

Parity plots comparing ΔHf in MP (ΔHf,MP) to machine-learned ΔHf (ΔHf,pred) for each model are shown in Fig. 2. It is clear that each published representation substantially improves upon the baseline ElFrac model, decreasing the mean absolute error (MAE) by 27–74%. This increased accuracy is attributed to the increased complexity of the representation. For most models, the MAE between MP and these ML models is comparable with the expected numerical disagreement between MP and experimentally obtained ΔHf8,16,31,32,33, implying a substantial amount of the information required to determine ΔHf is contained in the composition (and not the structure). The success of ML models for predicting ΔHf is not surprising considering the historical context of simple heuristics that perform relatively well at predicting the driving force for the formation of compounds from elements—e.g., the Miedema model34.

Fig. 2: Performance of machine-learned models for formation energy.

Parity plot for formation enthalpy predictions using six different machine learning models that take as input the chemical formula and output ΔHf. ElFrac refers to a baseline representation that parametrizes each formula only by the stoichiometric coefficient of each element. Meredig, Magpie, AutoMat, ElemNet, and Roost refer to the representations published in refs. 20,21,22,23,24, respectively. ΔHf,pred corresponds with ML predictions for aggregated hold-out sets during five-fold cross-validation of the Materials Project dataset (see “Methods” for details). ΔHf,MP refers to the formation energy per atom in the MP database. The absolute error on ΔHf is shown as the colorbar and the mean absolute error (MAE) is shown within each panel.

Implicit stability predictions from learned formation enthalpies

While the MAE of the ML-predicted ΔHf approaches the MAE between DFT and experiment for this quantity (~0.1–0.2 eV/atom)8,16,31,32,35, the use of ΔHf for stability predictions requires that the relative ΔHf between chemically similar compounds be predicted more accurately. To assess the accuracy of the relative ΔHf, we reconstructed, for each ML model, the convex hulls for all chemical spaces using ΔHf,pred. Parity plots for ΔHd are shown in Fig. 3. Even though the quantity ΔHd is on average much smaller than ΔHf, the MAE in predicting it is almost identical to the error in predicting ΔHf (Fig. 2), indicating very little error cancellation for the ML models when energy differences are taken within a chemical space, in contrast to the beneficial error cancellation for stability predictions with DFT15,16. In contrast to ΔHf, where all representations substantially improve the predictive accuracy from the baseline ElFrac model, for ΔHd, four of the five models (all except Roost) only negligibly improve upon the baseline model, with MAE of ~0.10–0.14 eV/atom. Importantly, for the purposes of predicting stability, a difference of ~0.1 eV/atom can be the difference between a compound that is readily synthesizable and one that is unlikely to ever be realized36,37.

Fig. 3: Application of predicted formation energies to decomposition energies.

Parity plot for decomposition enthalpy predictions. ΔHd,MP results from convex hulls constructed with ΔHf,MP (Fig. 2). ΔHd,pred is obtained from convex hulls constructed with ΔHf,pred (Fig. 2). The annotations are the same as in Fig. 2.

DFT calculations benefit from a systematic cancellation of errors that leads to much smaller errors for ΔHd than for ΔHf, with MAE for ΔHd as low as ~0.04 eV/atom for a substantial fraction of decomposition reactions16. Unfortunately, ML models do not similarly benefit from this cancellation of errors and instead appear to learn clusters in material space that have similar ΔHf, but they are generally unable to distinguish between stable and unstable compounds within a chemical space. It is notable that Roost substantially improves upon the other models. However, there are still strong signatures of inaccurate stability predictions in its parity plot (Fig. 3), most notably in the nearly vertical line at ΔHd,pred = 0 and the nearly horizontal line at ΔHd,MP = 0. These two lines indicate substantial disagreement between the actual and predicted stabilities for many compounds, despite the relatively low MAE.

The inability of compositional models to properly distinguish relative stability is further demonstrated by assessing how well the models classify compounds as stable (on the convex hull) or unstable (above the hull), as shown in Fig. 4. Overall, 60% of the compounds in the MP dataset are not on the hull, so the classification accuracy of a naive model that states that everything is unstable would be 60%. Five of the six models (all except Roost) only marginally improve upon this extremely naive model accuracy (58–65%). Strikingly, Roost considerably outperforms the other compositional models (76% accuracy), despite using stoichiometry alone as input. Plausibly, this superior performance is due to the use of weighted soft attention mechanisms during training of the representation38. Although only the nominal chemical composition (element fraction) is used as input, the model learns a more meaningful representation of this input composition on a case-by-case basis during training. This is in contrast to the other compositional models, which have fixed stoichiometric representations and either include hand-picked elemental attributes such as electronegativity (Meredig and Magpie) or use deep learning (ElemNet). Notably, AutoMat uses a two-step process: first it rationally selects the most relevant elemental attributes from a large list using a decision tree model and then fits a regression model with the reduced feature space. Considering the modest classification accuracy by AutoMat (65%), despite the wide range of elemental attributes considered in its optimization, we speculate that further improvements in the clever selection of these attributes are unlikely to lead to transformative improvements in predicted stabilities. Instead, major improvements to compositional formation energy models will likely result from qualitative changes in model architecture, as in Roost, and not from optimizing the selection of elemental attributes.

Fig. 4: Performance of predicted formation energies on stability classification.
figure 4

Classification of materials as stable (ΔHd ≤ 0) or unstable (ΔHd > 0) using each of the six compositional models. “Correct” predictions are those for which the ML models and MP both predict a given material to be either stable or unstable. The histograms are binned every 5 meV/atom with respect to ΔHd,MP to indicate how the correct and incorrect predictions and the number of compounds in our dataset vary as a function of the magnitude above or below the convex hull. Acc is the classification accuracy. F1 is the harmonic mean of precision and recall. FPR is the false positive rate. The moving average of the accuracy (computed within 20 meV/atom intervals) as a function of ΔHd,MP is shown as a blue line (right axis). As expected, the accuracy is lowest near the chosen stability threshold of ΔHd,MP = 0.
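The classification metrics reported in Fig. 4 follow from thresholding ΔHd at zero on both axes; a minimal sketch (array inputs assumed):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

def stability_metrics(dHd_mp, dHd_pred, threshold=0.0):
    """Acc, F1, and FPR for stable (dHd <= 0) vs unstable (dHd > 0)."""
    y_true = np.asarray(dHd_mp) <= threshold
    y_pred = np.asarray(dHd_pred) <= threshold
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {"Acc": accuracy_score(y_true, y_pred),
            "F1": f1_score(y_true, y_pred),
            "FPR": fp / (fp + tn)}  # predicted stable, but unstable in MP
```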

While Roost improves considerably upon other compositional models, the accuracy, F1 score, and false positive rate taken together do not inspire much confidence that any of these models can accurately predict the stability of solid-state materials (Fig. 4). Of particular concern is the high false positive rates of 25–38%. This metric provides the likelihood that a compound predicted to be stable will not actually be stable. Further aggravating this situation is that the false positive rate reported here for the models is greatly underestimated compared with the false positive rate that is expected for new materials discovery. The MP database is largely populated with known materials extracted from the ICSD, and this results in ~40% of the entries in MP being on the hull. The fraction of all conceivable hypothetical materials (from which new materials will be discovered) that are stable is likely several orders of magnitude lower than 40%. This necessitates that searches for new materials cover a huge number of possible compounds, and false positive rates in excess of 25% would lead to an enormous number of predicted materials that are not stable, limiting the ability of these ML models to efficiently accelerate new materials discovery.

A key consideration when discussing the accuracy in classifying compounds as stable or unstable is the choice of threshold for stability, which we have chosen to be ΔHd = 0. In materials discovery or screening applications, compounds are often considered potentially synthesizable even for ΔHd > 0 to consider potential inaccuracies in the predicted stabilities and account for a range of accessible metastability36,37. To probe the effects of moving this threshold for stability to higher or lower values of ΔHd, we show the receiver operating characteristic curves for each model in Supplementary Fig. 1. As the threshold for stability moves to larger positive values, increases in model accuracy are concomitant with an increase in the rate of false positives and a decrease in the confidence that the compounds predicted to be stable are actually accessible. Conversely, as the threshold decreases below zero, the accuracy and false positive rate decrease together as fewer and fewer compounds meet this stricter threshold for stability. Ultimately, the conclusions we draw from setting a stability threshold of ΔHd = 0 are not affected by alternative stability thresholds.
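This threshold sweep corresponds to a standard ROC analysis; a sketch (treating −ΔHd,pred as the classifier score, with array inputs assumed):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def stability_roc(dHd_mp, dHd_pred):
    """ROC for stability classification as the decision threshold on the
    predicted decomposition enthalpy is swept from strict to permissive."""
    y_true = np.asarray(dHd_mp) <= 0    # ground-truth stability from MP
    score = -np.asarray(dHd_pred)       # lower dHd_pred -> more likely stable
    fpr, tpr, thresholds = roc_curve(y_true, score)
    return fpr, tpr, auc(fpr, tpr)
```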

Predicting stability in sparse chemical spaces

While quantifying the accuracy of ML approaches on the entire MP dataset is instructive, it does not resemble the materials discovery problem because it assesses only the limited space of compositions that have been previously explored and therefore have many stable compounds. In order to simulate a realistic materials discovery problem, we identified a set of chemical spaces within the MP dataset that are sparse in terms of stable compounds. Lithium transition metal (TM) oxides are used as the cathode material for rechargeable Li-ion batteries and have attracted substantial attention for materials discovery in recent years. In particular, Li–Mn oxides have been considered as an alternative to LiCoO2 utilizing less or no cobalt: e.g., spinel LiMn2O439, layered LiMnO240, nickel–manganese–cobalt cathodes41, and disordered rock salt cathodes42. For this work, the quaternary space, Li–Mn–TM–O with TM ∈ {Ti, V, Cr, Fe, Co, Ni, Cu}, is an attractive space to test the efficacy of these models, as it contains only 9 stable compounds and 258 more that are unstable in MP. We tested the potential for ML models to discover these stable compounds by excluding all 267 quaternary Li–Mn–TM–O compounds from the MP dataset and repeating the training of each model on ΔHf with the remaining 84,747 compounds. We then applied each trained model to predict ΔHf for the excluded Li–Mn–TM–O compounds and assessed their stability. Importantly, we are again concerned with DFT-calculated stability at 0 K, so we are not considering the potential for compounds in this quaternary space to be stabilized due to entropic effects (e.g., configurational disorder).
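Operationally, the hold-out set can be carved out by exact element-set matching; a minimal sketch (assuming `formulas` holds the 85,014 MP formula strings):

```python
from pymatgen.core import Composition

TMS = ["Ti", "V", "Cr", "Fe", "Co", "Ni", "Cu"]
HOLDOUT_SPACES = {frozenset(["Li", "Mn", tm, "O"]) for tm in TMS}

def in_li_mn_tm_o(formula: str) -> bool:
    """True if the formula's element set is exactly one Li-Mn-TM-O space."""
    elems = frozenset(el.symbol for el in Composition(formula).elements)
    return elems in HOLDOUT_SPACES

holdout = [f for f in formulas if in_li_mn_tm_o(f)]       # 267 compounds in MP
train = [f for f in formulas if not in_li_mn_tm_o(f)]     # remaining 84,747
```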

The ΔHf parity plot for these 267 Li–Mn–TM–O compounds is shown in Supplementary Fig. 2 and reveals that all models have a higher accuracy predicting ΔHf for this subset of materials than for the entire dataset (Fig. 2). The improved prediction of ΔHf is likely because the compounds in this subset have strongly negative ΔHf and are well represented by the thousands of TM and lithium-containing oxides that comprise the MP dataset. Despite this improved accuracy on ΔHf, the models all have alarmingly poor performance in predicting ΔHd. In Fig. 5, we show that none of the models are able to correctly detect more than three of the nine stable compounds, and even for the most successful model by this metric (AutoMat), the 3 true positives come with 24 false positives. It is noteworthy that in this experiment, the models are given a large head-start towards making these predictions because the composition space under investigation is restricted to those compounds that have DFT calculations tabulated in MP, which is biased towards stability compared with the space of all possible hypothetical compounds.

Fig. 5: Applying predicted formation energies to stability predictions in a sparse chemical space.

Re-training each model on all of MP minus 267 quaternary compounds in the Li–Mn–TM–O chemical space (TM ∈ {Ti, V, Cr, Fe, Co, Ni, Cu}) and obtaining ΔHd using the predicted ΔHf for each of the excluded compounds (ΔHd,pred) and comparing with stabilities available in MP, ΔHd,MP. FP = false positive, TP = true positive, TN = true negative, FN = false negative.

To account for the MP stability bias and more closely simulate a realistic materials discovery problem, we assessed the potential for these models to identify the nine stable MP compounds when considering a much larger composition space. Using the approach defined in ref. 14, we produced 13,392 additional quaternary compounds in these 7 Li–Mn–TM–O chemical spaces that obey simple electronegativity- and charge-based rules. For this expanded space of quaternary compounds, we used each compositional model (trained on all of MP minus the 267 Li–Mn–TM–O compounds) to predict ΔHf and assessed their stability (Table 1). The compositional models each predict ~4–5% of these compounds to be stable, and all of the models fail to accurately predict the stability of more than two of the nine compounds that are actually stable in MP. A remarkable 139 compounds are predicted to be stable by all six models and 1372 unique compounds are predicted to be stable by at least one model. While it is likely that the space of stable quaternary compounds in the Li–Mn–TM–O space has not yet been fully explored in MP (or by extension, the ICSD), our intuition suggests it is highly unlikely that the number of new stable materials in this well-studied space is orders of magnitude larger than the number of known stable materials. The false positive rates obtained on the entire MP dataset shown in Fig. 4 suggest ~25–38% of these predicted stable Li–Mn–TM–O compounds are not actually stable, and these rates are likely underestimated, as discussed previously. The sheer number of compounds predicted to be stable by the ML models, together with their false positive rates, implies that these models will inevitably identify a large number of unstable materials as candidates for further analysis (either with DFT calculations or experimental synthesis). This substantially impedes the capability of these formation energy models to accelerate the discovery of novel compounds that can be synthesized.
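For intuition, below is a hedged sketch of a charge-balance filter in the spirit of ref. 14; it is a simplification of that approach (which also applies electronegativity-based checks), and the oxidation-state lists are common textbook values rather than the reference's exact set:

```python
from itertools import product

# Common oxidation states (illustrative, not exhaustive)
OX = {"Li": [1], "O": [-2], "Mn": [2, 3, 4, 7],
      "Ti": [2, 3, 4], "V": [2, 3, 4, 5], "Cr": [2, 3, 6],
      "Fe": [2, 3], "Co": [2, 3], "Ni": [2, 3], "Cu": [1, 2]}

def charge_neutral(stoich: dict) -> bool:
    """True if some assignment of oxidation states sums to zero charge."""
    elems, amts = zip(*stoich.items())
    return any(sum(q * n for q, n in zip(assign, amts)) == 0
               for assign in product(*(OX[el] for el in elems)))

print(charge_neutral({"Li": 1, "Mn": 3, "Ni": 1, "O": 8}))  # True (Mn4+, Ni3+)
```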

Table 1 Stability predictions in the expanded Li–Mn–TM–O (TM ∈ {Ti, V, Cr, Fe, Co, Ni, Cu}) composition space.

Direct training on decomposition energy

An alternative approach to consider is to train directly on ΔHd instead of using ML-predicted ΔHf to obtain ΔHd through the convex hull construction. Note that direct training on ΔHd is complicated by the fact that ΔHd for a given compound is dependent upon ΔHf for other compounds within a given chemical space. This is unlike ΔHf, which is intrinsic to a single compound. To assess the capability of each representation to directly predict stability, we repeated the analysis shown in Figs. 3–5 and Table 1 but training on ΔHd. The performance of each model on the MP Li–Mn–TM–O dataset is shown in Fig. 6, the performance on the expanded Li–Mn–TM–O space in Supplementary Table 1, and results for ΔHd on the entire MP dataset are shown in Supplementary Figs. 3 and 4. While the prediction accuracy (MAE and stability classification) on the entire MP dataset is typically comparable with or slightly better when training on ΔHd (Supplementary Figs. 3 and 4) instead of ΔHf (Figs. 3 and 4), the capability of the trained model to predict stability in sparse chemical spaces is even worse than when training on ΔHf (Fig. 6 and Supplementary Table 1).

Fig. 6: Applying predicted decomposition energies to stability predictions in a sparse chemical space.

Re-training each model directly on ΔHd on all of MP minus 267 quaternary compounds in the Li–Mn–TM–O chemical space (TM ∈ {Ti, V, Cr, Fe, Co, Ni, Cu}). Stability is determined by these direct predictions of ΔHd,pred and compared with stabilities available in MP, ΔHd,MP. FP = false positive, TP = true positive, TN = true negative, FN = false negative.

None of the models are able to identify even one of the nine MP-stable quaternary compounds from the set of 267 Li–Mn–TM–O compounds in MP, and every model predicts all 267 Li–Mn–TM–O compounds to be unstable (Fig. 6). It is especially notable that for all models except Roost and ElemNet, the predictions for all 267 quaternary compounds fall in a very small window (0.040 eV/atom < ΔHd,pred < 0.082 eV/atom), suggesting the models only learn that all compounds in this space should be within the vicinity of the convex hull and do nothing to distinguish between chemically similar compounds. When the space of potential compounds is expanded to 13,659 compounds, only Roost and ElemNet predict any compound to be stable, but again, none of the nine MP-stable compounds are predicted to be stable by any model (Supplementary Table 1).

As an additional demonstration, all representations (except Roost—see “Methods” for details) were also trained as classifiers (instead of regressors), tasked with predicting whether a given compound is stable (ΔHd ≤ 0) or unstable (ΔHd > 0). The accuracies, F1 scores, and false positive rates are tabulated in Supplementary Table 2 and found to be only slightly better (accuracies < 80%, F1 scores < 0.75, false positive rates > 0.15) than those obtained by training on ΔHf (Fig. 4) or ΔHd (Supplementary Fig. 4).

Beyond the poor performance associated with these models, the direct prediction of ΔHd (or classification of stable/unstable) is difficult to physically motivate because, unlike ΔHf, ΔHd is not an intrinsic property of a material but depends on the energies of other compositions with which it may be in competition. This nonlocality also makes ΔHd sensitive to the completeness of a given phase diagram: as new materials are discovered in a chemical space, ΔHd is subject to change for any compound in that space, even if that compound’s energy itself does not change, complicating the application of ML models trained on ΔHd.

Revisiting stability predictions with a structural representation

In addition to compositional models, representations that rely on the crystal structure for predicting formation energy have also received substantial attention in recent years25,27,28,29,43,44,45. These models perform a different task than compositional models because they evaluate the property of a material given both the composition and the structure. Nevertheless, it is interesting to assess whether these structural models can predict stability with improved accuracy relative to models that are given only composition.

Here we take the crystal graph convolutional neural network (CGCNN)25 as a representative example of existing structural models. CGCNN is a flexible framework that uses message passing over the atoms and bonds of a crystal (see “Methods” for training details). In Fig. 7, we show the performance of CGCNN on the same set of analyses as were shown for the compositional models in Figs. 2–5: learning ΔHf (Fig. 7a), constructing convex hulls with those predicted ΔHf to generate ΔHd (Fig. 7b), assessing the capability of these ΔHd,pred values to classify materials as stable or unstable (Fig. 7c), and probing the ability for this model to predict stability in the sparse Li–Mn–TM–O space (Fig. 7d). It is clear that CGCNN improves substantially upon the direct prediction of ΔHf (Fig. 7a) and the implicit prediction of ΔHd (Fig. 7b), reducing the MAE by ~50% compared with the best performing compositional model (Roost). The extremely inaccurate predictions of ΔHd near ΔHd,pred = 0 or ΔHd,MP = 0 that are observed in Fig. 3 for most compositional models are also no longer present with CGCNN (Fig. 7b). CGCNN displays an improved classification accuracy (80%) and a narrow distribution of incorrect stability predictions, only disagreeing with MP regarding the stability of compounds within the vicinity of ΔHd,MP = 0 (Fig. 7c). Most impressively, CGCNN is relatively successful at finding the needles in the excluded Li–Mn–TM–O haystack, recovering five of the nine stable compounds with only six false positives (Fig. 7d). In addition to the improved predictive accuracy, the parity plot for this excluded set looks fundamentally different than for the compositional models. In the compositional models (Fig. 5), the parity plot is scattered, and there is effectively no linear correlation between the actual and predicted ΔHd, whereas for CGCNN, there is a strong linear correlation (Fig. 7d).

Fig. 7: Performance of formation energies predicted using a structural representation.

a Repeating the analysis shown in Fig. 2 using CGCNN. Annotations are as in Fig. 2. b Repeating the analysis shown in Fig. 3 using CGCNN. Annotations are as in Fig. 3. c Repeating the analysis shown in Fig. 4 using CGCNN. Annotations are as in Fig. 4. d Repeating the analysis shown in Fig. 5 using CGCNN. Annotations are as in Fig. 5.

The nonincremental improvement in stability predictions that arises from including structure in the representation is a strong endorsement for structural models and also sheds insight into the structural origins of material stability. While the thermodynamic driving force for forming a compound from its elements (formation energy) can be learned with high accuracy from only the composition, the structure dictates the subtle differences in thermodynamic driving force between chemically similar compounds and enables accurate ML predictions of material stability (decomposition energy). However, the obvious limitation of this approach is that it requires the structure as input, and the structure of new materials that are yet to be discovered is not known a priori. For example, because we do not know the ground-state structure for an arbitrary composition, we cannot repeat the test where we assess the ability of the ML model to find the stable Li–Mn–TM–O compounds among a large set of candidate compositions. Although CGCNN shows substantially improved performance in predicting material stability, these results are obtained using the DFT-optimized ground-state crystal structures as input.

Quantifying error cancellation in ML models

While it is well known that DFT predictions of stability are enhanced because of systematic error cancellation15,16, it is not yet known if the errors made by ML formation energy models are completely random or if they too exhibit some beneficial cancellation of errors. In Fig. 8a, we plot a hypothetical convex hull phase diagram in blue, labeled “ground truth.” Next, we represent the effect of a systematic error in ΔHf by shifting all points up in energy by the same amount (“systematic error,” green). The relative stabilities of these systematically shifted points remain comparable with the ground truth despite the error in each ΔHf, illustrating beneficial cancellation of errors. Finally, we show the effect of a random ΔHf error by shifting the ground truth ΔHf by the same average magnitude as in the “systematic” case, but here in random amounts for each point (purple, “random error”). In this case, stabilities (ΔHd) for all points deviate substantially from the ground truth because there is little beneficial cancellation of ΔHf errors.
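This schematic can be made quantitative with a toy calculation. Below is a sketch (illustrative formulas and energies, with Li–O standing in for a generic binary) in which a uniform shift leaves the hull distance of the metastable compound unchanged, because its competing hull facet connects two compounds that shift together, while random errors of the same scale do not cancel:

```python
import numpy as np
from pymatgen.core import Composition
from pymatgen.analysis.phase_diagram import PDEntry, PhaseDiagram

FORMULAS = ["Li2O", "Li3O2", "LiO"]          # toy binary space
DHF = np.array([-0.40, -0.42, -0.50])        # "ground truth" dHf (eV/atom)

def e_above_hull(dhf_per_atom):
    comps = [Composition(f) for f in FORMULAS]
    entries = [PDEntry(c, c.num_atoms * e) for c, e in zip(comps, dhf_per_atom)]
    entries += [PDEntry(Composition("Li"), 0.0), PDEntry(Composition("O"), 0.0)]
    diagram = PhaseDiagram(entries)
    return np.round([diagram.get_e_above_hull(e) for e in entries[:3]], 3)

rng = np.random.default_rng(0)
print(e_above_hull(DHF))                           # Li3O2 sits 0.02 above hull
print(e_above_hull(DHF + 0.10))                    # systematic: distances preserved
print(e_above_hull(DHF + rng.normal(0, 0.10, 3)))  # random: stability can flip
```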

Fig. 8: Error cancellation in formation energy predictions.

a Schematic illustration contrasting how random and systematic errors on ΔHf of the same average magnitude manifest as larger and smaller errors on predicted ground-state lines (ΔHd). b Comparing the performance on stability predictions using ML-predicted ΔHf (ΔHf,pred, filled bars) and ΔHf with perturbations drawn randomly from the distribution of ΔHf,pred errors (ΔHf,rand = ΔHf,MP + P[ΔHf,MP − ΔHf,pred], hatched bars). The mean absolute error (MAE) on ΔHd is shown by the black bars (left axis). The F1 score for classifying compounds as stable (ΔHd ≤ 0) or unstable (ΔHd > 0) is shown by the brown bars (right axis). The results for the randomly perturbed case are averaged over three random samples with the standard deviation shown as an error bar. The standard deviation is too small to see in most cases.

The similar MAE on ΔHf (Figs. 2 and 7a) and ΔHd (Figs. 3 and 7b) for all models, despite the much smaller range of energies spanned by ΔHd (Fig. 1b), makes clear that the benefit of error cancellation is not fully realized for the ML models. Further, the set of tests on stability predictions discussed in this work (Figs. 4–6 and Table 1) shows that the magnitude of error cancellation made by the compositional ML models remains insufficient to enable accurate stability predictions, especially in sparse chemical spaces. It is not clear, however, whether the improved prediction of stability by CGCNN arises from beneficial error cancellation within each chemical space or from decreasing the overall MAE from ~0.06 eV/atom (for the best performing compositional model—Roost) to ~0.03 eV/atom (for CGCNN).

To quantify the magnitude of error cancellation for the ML models, it is essential to establish a “random error” baseline for comparison. The random error baseline developed in this work utilizes random perturbations of the ground truth ΔHf (ΔHf,MP), where the perturbations were drawn from the discrete distribution of ML errors, P[ΔHf,MP − ΔHf,pred], for each model. It follows that ΔHf,rand = ΔHf,MP + P[ΔHf,MP − ΔHf,pred] (ΔHf,pred is shown for each compositional model in Fig. 2 and for CGCNN in Fig. 7a). With these randomly perturbed ΔHf,rand, we repeated the convex hull construction for all compounds in MP for comparison with the analysis of ΔHd presented in Fig. 3 and stability classification presented in Fig. 4 (both of which rely on ΔHf,pred). In Fig. 8b, the MAEs on ΔHd and F1 scores for classifying compounds as stable (ΔHd ≤ 0) or unstable (ΔHd > 0) are compared for predictions based on ΔHf,pred and ΔHf,rand for all seven models. All models show higher MAE on ΔHd and lower F1 scores using ΔHf,rand instead of ΔHf,pred as input, demonstrating that the ML models do generally exhibit some degree of error cancellation.
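A sketch of this baseline construction (array inputs assumed; errors are resampled with replacement from the model's own error distribution):

```python
import numpy as np

def randomized_dHf(dHf_mp, dHf_pred, seed=0):
    """Perturb ground-truth dHf by draws from P[dHf_MP - dHf_pred], giving
    each compound an error of realistic magnitude but assigned at random,
    destroying any composition-error correlation the model had learned."""
    dHf_mp = np.asarray(dHf_mp)
    errors = dHf_mp - np.asarray(dHf_pred)
    rng = np.random.default_rng(seed)
    return dHf_mp + rng.choice(errors, size=len(errors))
```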

The extent of error cancellation is model-dependent, and the worst-performing model in this work, ElFrac, exhibits the most substantial relative error cancellation, with 148% higher MAE on ΔHd and 14% lower F1 when using ΔHf,rand instead of ΔHf,pred. For ElFrac (and to a lesser extent the other modestly performing compositional models), the high error cancellation likely arises because of the wide distribution of ΔHf errors (P[ΔHf,MP − ΔHf,pred]), which drives up the error of the random baseline dramatically. Remarkably, Roost shows considerable error cancellation (80% higher MAE on ΔHd using ΔHf,rand) despite the MAE on ΔHd using ΔHf,rand already being competitive with the actual predictions (ΔHd using ΔHf,pred) made by the other compositional models.

However, we emphasize that even benefiting from this substantial error cancellation, Roost is not able to detect the stability of compounds in sparse chemical spaces (as shown in Fig. 5). The only model that performs suitably at this task is the structural representation, CGCNN (Fig. 7d), and this model exhibits a much smaller degree of error cancellation (MAE on ΔHd increased by 26% and F1 decreased by 3% using ΔHf,rand). Because ΔHf,pred for CGCNN is sufficiently accurate (MAE on ΔHf = 34 meV/atom), the lack of error cancellation does not have a deleterious effect on stability predictions.

Discussion

There have been a number of recent successes in the application of ML for materials design problems. These models have given the impression that ML can predict formation energies with near-DFT accuracy20,23,46. However, the critical question of whether this implies that compound stability can be predicted by ML has not been rigorously assessed. In this work, we show that while indeed existing ML models can predict ΔHf with relatively high accuracy from the chemical formula, they are insufficient to accurately distinguish stable from unstable compounds within an arbitrary chemical space. The error in predicting DFT-calculated ΔHf by ML models is often compared favorably with the error DFT makes in predicting ΔHf relative to experimentally obtained values. This comparison neglects the fact that the errors in DFT-calculated ΔHf are beneficially systematic, whereas the errors made by the ML models are not. For DFT calculations, this leads to substantially lower errors for stability predictions (ΔHd) than for ΔHf. A similarly beneficial cancellation of errors does not occur for ML models and the errors in ΔHd are comparable with ΔHf, inhibiting accurate predictions of material stability. Hence, while the claim that ML-predicted formation energies have similar errors as DFT compared with experiment is technically correct, it does not imply that in their current state ML models are as useful as DFT, or that ML can replace DFT for the computationally guided discovery of novel compounds. As new ML models for formation energy are developed, it is imperative to assess their viability as inputs for stability predictions and, most critically, for problems that resemble how the models would be implemented to address emerging materials design problems. In this work, we present a set of tests that facilitate this assessment and allow for direct comparison with existing ML models. All data and code required to repeat this set of stability analyses for the models shown in this work or any new model are available at https://github.com/CJBartel/TestStabilityML.

Methods

Materials project data

All entries in the MP9 database with ΔHf < 1 eV/atom were queried on July 26, 2019 using the MP API47. This produced 85,014 unique nonelemental chemical formulas. For each chemical formula, we obtained the formation energy per atom, ΔHf, for all structures having that formula, and used the most negative (ground-state) ΔHf for training the models and obtaining ΔHd by the convex hull construction. MP applies a correction scheme to improve the agreement between DFT-calculated thermodynamic properties (ΔHf and ΔHd) and experiment15,48,49. Additional details on the MP calculation procedure can be found at https://materialsproject.org/docs/calculations.
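For reference, a sketch of such a query through the legacy MP API exposed by pymatgen's MPRester (the interface current in 2019; the modern mp-api client differs, and the criteria shown are an assumed reconstruction of the filter described above):

```python
from pymatgen.ext.matproj import MPRester

with MPRester("YOUR_API_KEY") as mpr:
    docs = mpr.query(
        criteria={"formation_energy_per_atom": {"$lt": 1.0},
                  "nelements": {"$gt": 1}},  # nonelemental formulas only
        properties=["pretty_formula", "formation_energy_per_atom"],
    )
```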

Although the MP database contains a wide range of inorganic crystalline solids, it is an evolving resource that periodically includes more and more compounds as they are discovered or calculated by the community. As such, the calculated ΔHd that were used for training and testing each model are subject to change over time as new stable materials are added to the database. This fact is not unique to MP and is inherent in all open materials databases that would be considered for training and evaluating ML models on large datasets of DFT calculations. The ΔHd and ΔHf used for all compounds in this work are available within https://github.com/CJBartel/TestStabilityML.

General training approach

Fivefold cross-validation was used to produce the model-predicted ΔHf,pred shown in Fig. 2. Each predicted value corresponds with the prediction made on that compound when it was in the validation set (i.e., not used for training). ΔHf,pred was then used in the convex hull analysis to generate ΔHd,pred shown in Fig. 3, from which stability classifications were made as shown in Fig. 4. For the Li–Mn–TM–O examples (Fig. 5 and Table 1), each model was trained on all MP entries except those 267 quaternary compounds belonging to the Li–Mn–TM–O chemical spaces. An analogous approach was used when training on ΔHd instead of ΔHf to generate the results shown in Fig. 6, Supplementary Figs. 3 and 4, Supplementary Table 1, and Supplementary Table 2.
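A sketch of the aggregated hold-out predictions (X and y as in the featurization sketch above; each compound's prediction comes from the fold in which it was held out):

```python
from sklearn.model_selection import KFold, cross_val_predict
from xgboost import XGBRegressor

cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = XGBRegressor(n_estimators=200, max_depth=5)
dHf_pred = cross_val_predict(model, X, y, cv=cv)  # out-of-fold predictions
```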

Compositional model training

Three of the compositional representations—ElFrac, Meredig20, and Magpie21—were implemented using matminer50 and trained using gradient boosting as implemented in XGBoost30 with 200 trees and a maximum depth of 5. Preliminary tests showed that XGBoost with these hyperparameters achieved the highest accuracy among the algorithms tested. AutoMat22 was used as implemented in ref. 22. Roost24 was trained for 500 epochs using an Adam optimizer with an initial learning rate of 5 × 10−4 and an L1 loss function. ElemNet was implemented as described in ref. 23 using the Keras ML framework51. ElemNet was trained using an initial learning rate of 10−4 with an Adam optimizer for 200 epochs. Overall, 10% of the input data was set aside for validation, and the model weights from the epoch with the best loss on the validation set were used for predictions.
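A minimal sketch of the best-validation-epoch bookkeeping described for ElemNet, using generic Keras callbacks; the shallow network and random data here are placeholders, not the actual ElemNet architecture or inputs:

```python
import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 86)  # placeholder element-fraction inputs (86 elements)
y = np.random.rand(1000)      # placeholder formation enthalpies

model = keras.Sequential([
    keras.layers.Dense(1024, activation="relu"),  # placeholder; ElemNet is deeper
    keras.layers.Dense(1),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4), loss="mae")

# Keep only the weights from the epoch with the best validation loss
ckpt = keras.callbacks.ModelCheckpoint("best_model.h5", monitor="val_loss",
                                       save_best_only=True)
model.fit(X, y, epochs=200, validation_split=0.1, callbacks=[ckpt])
```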

Regarding training each representation as a classifier, for ElFrac, Meredig, Magpie, and AutoMat, we used an XGBoost classifier with the same parameters as used for regression. For ElemNet, we added a sigmoid activation function to the output and used cross entropy loss for training. All other aspects of these models were identical to those trained for regression. Roost was excluded from the classification analysis as modifying this representation to perform classification required more extensive changes than for the other representations.

Learning curves for all compositional models trained on ΔHf are provided in Supplementary Fig. 5 along with training and inference times in Supplementary Table 3. The code used to train and evaluate all models is available at https://github.com/CJBartel/TestStabilityML.

CGCNN training

We used a nested fivefold cross-validation to train the CGCNN25 model for the MP ΔHf dataset. As a general procedure for cross-validation, the dataset was split into five groups and each group was iteratively taken as a hold-out test set. For each fold, we split the training set into 75% training and 25% validation, thus the overall ratio of training, validation, and test was 60%, 20%, and 20%, respectively. The CGCNN model was iteratively updated by minimizing the loss (mean squared error) on the training set, and the validation score (MAE) was monitored after each epoch. After 1000 epochs, the model with the best validation score was selected and then evaluated on the hold-out test set. Results of the fivefold hold-out test sets were accumulated as the final predictions of the dataset.
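The split ratios translate to a simple nested loop; a sketch (with `data` as a placeholder index array standing in for the MP dataset):

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

data = np.arange(85014)  # placeholder indices into the MP dataset
outer = KFold(n_splits=5, shuffle=True, random_state=0)
for train_val_idx, test_idx in outer.split(data):
    # 75/25 inner split -> 60/20/20 train/validation/test overall
    train_idx, val_idx = train_test_split(train_val_idx, test_size=0.25,
                                          random_state=0)
    # train CGCNN on train_idx, select the epoch with best validation MAE,
    # then predict on test_idx; accumulate predictions across the five folds
```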

For the Li–Mn–TM–O case in which the test set is defined, we split the remaining compounds into five groups and iteratively took each group as the validation set (20%) and the remaining as the training set (80%). The best CGCNN model of each fold was selected as the one with the best validation score (MAE). We then applied the five CGCNN models to the 267 Li–Mn–TM–O test compounds and averaged their predicted ΔHf for each compound.