A critical examination of compound stability predictions from machine-learned formation energies

Machine learning has emerged as a novel tool for the efficient prediction of materials properties, and claims have been made that machine-learned models for the formation energy of compounds can approach the accuracy of Density Functional Theory (DFT). The models tested in this work include five recently published compositional models, a baseline model using stoichiometry alone, and a structural model. By testing seven machine learning models for formation energy on stability predictions using the Materials Project database of DFT calculations for 85,014 unique chemical compositions, we show that while formation energies can indeed be predicted well, all compositional models perform poorly on predicting the stability of compounds, making them considerably less useful than DFT for the discovery and design of new solids. Most critically, in sparse chemical spaces where few stoichiometries have stable compounds, only the structural model is capable of efficiently detecting which materials are stable. The non-incremental improvement of structural models compared with compositional models is noteworthy and encourages the use of structural models for materials discovery. This work demonstrates that accurate predictions of formation energy do not imply accurate predictions of stability, emphasizing the importance of assessing model performance on stability predictions, for which we provide a set of publicly available tests.


Introduction
Machine learning (ML) is emerging as a novel tool for rapid prediction of material properties. [1][2][3][4][5][6] In general, these predictions are made by fitting statistical models on a large number of data points. Because of the scarcity of well-curated experimental data in materials science, this input data is often obtained from Density Functional Theory (DFT) calculations housed in one of the many open materials databases. [7][8][9][10][11][12] In principle, once these models are trained on this immense set of quantum chemical data, the determination of properties for new materials can be made in orders-of-magnitude less time using the trained models compared to computationally expensive DFT calculations.
Of particular interest is the use of machine learning to discover new materials. The combinatorics of materials discovery make for an immensely challenging problem -if we consider the possible combinations of just four elements (A, B, C, D), from any of the ~80 elements that are technologically relevant, there are already ~1.6 million quaternary chemical spaces to consider. This is before we consider such factors as stoichiometry (ABCD2, AB2C3D4, etc.) or crystal structure, each of which add substantially to the combinatorial complexity. The Inorganic Crystal Structure Database (ICSD) of known solid-state materials contains ~10 5 entries, 13 several orders of magnitude less than the 10 10 quaternary compositions identified as plausible using electronegativity-and charge-based rules. 14 This suggests that 1) there is ample opportunity for new materials discovery and 2) the problem of finding stable materials may resemble the needlein-a-haystack problem, with many unstable compositions for each stable one. The immensity of this problem is a natural fit for high-throughput machine learning techniques.
In this work, we closely examine whether recently published machine learning models for formation energy are capable of distinguishing the relative stability of chemically similar materials and provide a roadmap for doing the same for future models. We show that although the formation energy of compounds from elements can be learned with high accuracy using a variety of machine learning approaches, these learned formation energies do not reproduce DFT-calculated relative stabilities. While the accuracy of these models for formation energy approaches the DFT error (relative to experiment), DFT predictions benefit from a systematic cancellation of error 15,16 when making stability predictions, while ML models do not. Of particular concern for most ML models is the high rate of materials predicted to be stable that are not stable by DFT, impeding the use of these models to efficiently discover new materials. As a result, we suggest that more critical evaluation methods for machine learning methods are developed.

Results and Discussion
The relationship between formation energy and stability A necessary condition for a material to be used for any application is stability (under some conditions). The thermodynamic stability of a material is defined by its Gibbs energy of decomposition, ΔGd, which is the Gibbs formation energy, ΔGf, of the specified material relative to all other compounds in the relevant chemical space. Temperature-dependent thermodynamics are not yet tractable with high-throughput DFT and have only sparsely been addressed with ML, 17 so material stability is primarily assessed using the decomposition enthalpy, ΔHd, which is approximated as the total energy difference between a given compound and competing compounds in a given chemical space. 15,16,18,19 For the purpose of this study we will directly compare ML predictions and DFT calculations of ΔHd, hence the lack of entropy contributions is not an issue.
The quantity ΔHd is obtained by a convex hull construction in formation enthalpy (ΔHf)composition space. Figure 1a shows an example for a binary A-B space, having three known compounds, A4B, A2B, and AB3. The convex hull is the lower convex enthalpy envelope which lies below all points in the composition space (blue line). Stable compositions lie on the convex hull, and unstable compositions lie above the hull. A4B is unstable (above the hull), so ΔHd > 0 and is calculated as the distance in ΔHf between A4B and the convex hull of stable points. AB3 is stable (on the hull), so ΔHd < 0 and is calculated as the distance in ΔHf between AB3 and a hypothetical convex hull constructed without AB3 (dashed line). |ΔHd| is therefore the minimum amount that ΔHf must decrease for an unstable compound to become thermodynamically stable or the maximum amount that ΔHf can increase for a stable compound to remain stable. We used ΔHd in this work instead of the more common, "energy above the hull", because the former provides a distribution of values for stable compounds whereas the latter is equal to 0 for all stable compounds. The convex hull construction, described here for a binary system, generalizes for chemical spaces comprised of any number of elements.
Hence, while ΔHf quantifies to what extent a compound may form from its elements, the thermodynamic quantity that controls phase stability is ΔHd and arises from the competition between ΔHf for all compounds within a chemical space. As we show later, while formation enthalpies can be of the order of several eV the value of ΔHd is typically 1-2 orders of magnitude smaller. In addition, thermodynamic stability is highly nonlinear in ΔHd around zero, as negative values indicate stable compounds, whereas positive values are unstable or metastable compounds.
Using data available in the Materials Project (MP), 16 we applied the convex hull construction to obtain ΔHd for 85,014 inorganic crystalline solids (the majority of which are in the ICSD) and compare ΔHd to ΔHf in Figure 1b. It is clear that effectively no linear correlation exists between ΔHd and ΔHf, except for the trivial case where only a single compound exists in a chemical space (ΔHd = ΔHf), which is true for only ~3% of materials in MP. While ΔHf somewhat uniformly spans a wide range of energies (mean ± average absolute deviation = -1.42 ± 0.95 eV/atom), ΔHd spans much smaller energies (0.06 ± 0.12 eV/atom), suggesting ΔHd is the more sensitive or subtle quantity to predict (a histogram of ΔHf and ΔHd is provided in Figure S1). Still, while no linear correlation exists between ΔHd and ΔHf, and ΔHd occurs over a much smaller energy range, it would be possible for ΔHf models to predict ΔHd as long as the relative differences in ΔHf within a given chemical space are predicted with accuracy comparable to the range of variation in ΔHd or if they would benefit from substantial error cancellation when comparing the energies of compounds with similar chemistry. The decomposition enthalpy, ΔHd, shown against the formation enthalpy, ΔHf, for 85,014 ground-state entries in Materials Project, indicating effectively no correlation between the two quantities. The strong linear correlation that is present at ΔHd = ΔHf arises for chemical spaces that contain only one compound (~3% of compounds). These compounds were excluded from the correlation coefficient, R 2 , determination.

Learning formation energy from chemical composition
Machine learning material properties requires that an arbitrary material is "represented" by a set of attributes (features). This representation can be as simple as a vector corresponding to the fractional amount of each element in the compound (e.g., Li2O = [0, 0, 2/3, 0, 0, 0, 0, 1/3, 0, 0, …], where the length of the vector is the number of elements in the periodic table), or a vector that includes substantial physical or chemical information about the material. In the search for new materials, the structure is rarely known a priori, and instead a list of compositions with unknown structure is screened for stability, i.e., the possibility that a thermodynamically stable structure exists at that composition. In this case, the material representation is constructed only from the chemical formula without including properties such as the geometric (i.e., lattice and basis) or electronic structure. These models, which take as input the chemical formula and output thermodynamic predictions, are henceforth referred to as compositional models here.
In this work, we assessed the potential for five recently introduced compositional models -Meredig, 20 Magpie, 21 AutoMat, 22 ElemNet, 23 and Roost 24 -to predict the stability of compounds in MP. Meredig, Magpie, and AutoMat include chemical information for each element in their material representations from quantities such as atomic electronegativities, radii, and elemental group. Each of these compositional models were trained using gradient-boosted regression trees (XGBoost 30 ). ElemNet and Roost differ in that no a priori information other than the stoichiometry is used as input. For ElemNet, a deep learning architecture maps the stoichiometry input into formation energy predictions. For Roost, the stoichiometric representation and fit are simultaneously learned using a graph neural network. In addition, we included a baseline model for comparison, ElFrac, where the representation is simply the stoichiometric fraction of each element in the formula, trained using XGBoost 30 . Because compositional models necessarily make the same prediction for all structures having the same formula, all analysis in this work was performed using the lowest energy (ground-state) structure in MP for each available composition.
Additional details on the training of each model and the MP dataset is available in the Methods

section.
Parity plots comparing ΔHf in MP (ΔHf,MP) to machine-learned ΔHf (ΔHf,pred) for each model are shown in Figure 2. It is clear that each published representation substantially improves upon the baseline ElFrac model, decreasing the mean absolute error (MAE) by 27-74%. This increased accuracy is attributed to the increased complexity of the representation. For most models, the MAE between MP and these ML models is comparable to the expected numerical disagreement between MP and experimentally obtained ΔHf, 8,16,[31][32][33] implying a substantial amount of the information required to determine ΔHf is contained in the composition (and not the structure). The success of ML models for predicting ΔHf is not surprising considering the historical context of simple heuristics that perform relatively well at predicting the driving force for the formation of compounds from elements -e.g., the Miedema model. 34 Figure 2. Parity plot for formation enthalpy predictions using six different machine learning models that take as input the chemical formula and output ΔHf. ElFrac refers to a baseline model that parametrizes each formula only by the stoichiometric coefficient of each element. Meredig, Magpie, AutoMat, ElemNet, and Roost refer to Refs. 20-24 , respectively. ΔHf,pred corresponds with ML predictions for aggregated hold-out sets during five-fold cross-validation of the Materials Project dataset (see Methods for details). ΔHf,MP refers to the formation energy per atom in the MP database. The absolute error on ΔHf is shown as the colorbar and the mean absolute error (MAE) is shown within each panel.

Implicit stability predictions from learned formation enthalpies
While the mean absolute error of the ML-predicted ΔHf approaches the mean absolute error between DFT and experiment for this quantity (~0.1-0.2 eV/atom) 8,16,31,32,35 , the use of ΔHf for stability predictions requires that the relative ΔHf between chemically similar compounds is predicted more accurately. To assess the accuracy of the relative ΔHf, we reconstructed, for each ML model, the convex hulls for all chemical spaces using ΔHf,pred. Parity plots for ΔHd are shown in Figure 3. Even though the quantity ΔHd is on average much smaller than ΔHf, the MAE in predicting it is almost identical to the error in predicting ΔHf (Figure 2), indicating very little error cancellation for the ML models when energy differences are taken in a chemical space, which is in contrast to the beneficial error cancellation for stability predictions with DFT. 15,16 In contrast to ΔHf, where all representations substantially improve the predictive accuracy from the baseline ElFrac model, for ΔHd, four of the five models (all except Roost) only negligibly improve upon the baseline model with MAE of ~0.11-0.14 eV/atom. Importantly, for the purposes of predicting stability, a difference of ~0.1 eV/atom can be the difference between a compound that is readily synthesizable and one that is unlikely to ever be realized. 36,37 DFT calculations benefit from a systematic cancellation of errors that leads to much smaller errors for ΔHd than for ΔHf, with MAE for ΔHd as low as ~0.04 eV/atom for a substantial fraction of decomposition reactions. 16 Unfortunately, ML models do not similarly benefit from this cancellation of errors and instead appear to learn clusters in material space that have similar ΔHf, but they are generally unable to distinguish between stable and unstable compounds within a chemical space. It is notable that Roost substantially improves upon the other models. However, there are still strong signatures of inaccurate stability predictions in its parity plot (Figure 3), most notably in the ~vertical line at ΔHd,pred = 0 and ~horizontal line at ΔHd,MP = 0. These two lines indicate substantial disagreement between the actual and predicted stabilities for many compounds, despite the relatively low MAE.  (Fig. 2). ΔHd,pred is obtained from convex hulls constructed with ΔHf,pred (Fig. 2). The annotations are the same as in Fig. 2.
The inability for compositional models to properly distinguish relative stability is further demonstrated by assessing how well the models classify compounds as stable (on the convex hull) or unstable (above the hull), as shown in Figure 4. 60% of the compounds in the MP dataset are not on the hull, so the classification accuracy of a naïve model that states that everything is unstable would be 60%. Five of the six models (all except Roost) only marginally improve upon this extremely naïve model accuracy (58-65%). Strikingly, Roost considerably outperforms the other compositional models (76% accuracy), despite using stoichiometry alone as input. Plausibly, this superior performance is due to the use of weighted soft attention mechanisms during training of the representation. 38 Although only the nominal chemical composition (element fraction) is used as input, the model learns a more meaningful representation of this input composition on a caseby-case basis during training. This is in contrast to the other compositional models, which have fixed stoichiometric representations and either include hand-picked elemental attributes such as electronegativity (Meredig and Magpie) or use deep learning (ElemNet). Notably, AutoMat uses a two-step process: first it rationally selects the most relevant elemental attributes from a large list using a decision tree model and then fits a regression model with the reduced feature space.
Considering the modest classification accuracy by AutoMat (65%), despite the wide range of elemental attributes considered in its optimization, we speculate that further improvements in the clever selection of these attributes is unlikely to lead to transformative improvements in predicted stabilities. Instead, major improvements to compositional formation energy models will likely result from qualitative changes in model architecture, as in Roost, and not from optimizing the selection of elemental attributes.
While Roost improves considerably upon other compositional models, the accuracy, F1 score, and false positive rate taken together do not inspire much confidence that any of these models can accurately predict the stability of solid-state materials (Figure 4). Of particular concern is the high false positive rates of 25-38%. This metric provides the likelihood that a compound predicted to be stable will not actually be. Further aggravating this situation is that the false positive rate reported here for the models is greatly underestimated compared to the false positive rate that is expected for new materials discovery. The MP database is largely populated with known materials extracted from the ICSD, and this results in ~40% of the entries in MP being on the hull.
The fraction of all conceivable hypothetical materials (from which new materials will be discovered) that are stable is likely several orders of magnitude lower than 40%. This necessitates that searches for new materials cover a huge number of possible compounds, and false positive rates in excess of 25% would lead to an enormous amount of predicted materials which are not stable, limiting the ability for these ML models to efficiently accelerate new materials discovery.
A key consideration when discussing the accuracy in classifying compounds as stable or unstable is the choice of threshold for stability, which we have chosen to be ΔHd = 0. In materials discovery or screening applications, compounds are often considered potentially synthesizable even for ΔHd > 0 to consider potential inaccuracies in the predicted stabilities and account for a range of accessible metastability. 36,37 To probe the effects of moving this threshold for stability to higher or lower values of ΔHd, we show the receiver operating characteristic (ROC) curves for each model in Figure S2. As the threshold for stability moves to larger positive values, increases in model accuracy are concomitant with increases in the rate of false positives and a decrease in the confidence that the compounds predicted to be stable are actually accessible. Conversely, as the threshold decreases below zero, the accuracy and false positive rate decrease together as less and less compounds meet this stricter threshold for stability. Ultimately, the conclusions we draw from setting a stability threshold of ΔHd = 0 are not affected by alternative stability thresholds. Figure 4. Classification of materials as stable (ΔHd ≤ 0) or unstable (ΔHd > 0) using each of the six compositional models. "Correct" predictions are those for which the ML models and MP both predict a given material to be either stable or unstable. The histograms are binned every 5 meV/atom with respect to ΔHd,MP to indicate how the correct and incorrect predictions and the number of compounds in our dataset vary as a function of the magnitude above or below the convex hull. Acc is the classification accuracy. F1 is the harmonic mean of precision and recall. FPR is the false positive rate. The moving average of the accuracy (computed within 20 meV/atom intervals) as a function of ΔHd,MP is shown as a blue line (right axis). As expected, the accuracy is lowest near the chosen stability threshold of ΔHd,MP = 0.

Predicting stability in sparse chemical spaces
While quantifying the accuracy of ML approaches on the entire MP dataset is instructive, it does not resemble the materials discovery problem because it assesses only the limited space of compositions that have been previously explored and therefore have many stable compounds. In order to simulate a realistic materials discovery problem, we identified a set of chemical spaces within the MP dataset that are sparse in terms of stable compounds. Lithium transition metal (TM) oxides are used as the cathode material for rechargeable Li-ion batteries and have attracted substantial attention for materials discovery in recent years. In particular, Li-Mn oxides have been considered as an alternative to LiCoO2 utilizing less or no cobalt: e.g., spinel LiMn2O4, 39 layered LiMnO2, 40 nickel-manganese-cobalt (NMC) cathodes, 41 and disordered rock salt cathodes. 42 For this work, the quaternary space, Li-Mn-TM-O with TM ∈{Ti, V, Cr, Fe, Co, Ni, Cu}, is an attractive space to test the efficacy of these models, as it contains only 9 stable compounds and 258 more that are unstable in MP. We tested the potential for ML models to discover these stable their stability. Importantly, we are again concerned with DFT-calculated stability at 0 K, so we are not considering the potential for compounds in this quaternary space to be stabilized due to entropic effects (e.g., configurational disorder). Figure S3 and reveals that all models have a higher accuracy predicting ΔHf for this subset of materials than for the entire dataset (Figure 3). The improved prediction of ΔHf is likely because the compounds in this subset have strongly negative ΔHf and are well-represented by the thousands of transition metal and lithium-containing oxides that comprise the MP dataset. Despite this improved accuracy on ΔHf, the models all have alarmingly poor performance in predicting ΔHd. In Figure 5, we show that none of the models are able to correctly detect more than three of the nine stable compounds, and even for the most successful model by this metric (AutoMat), the three true positives come with 24 false positives. It is noteworthy that in this experiment, the models are given a large headstart towards making these predictions because the composition space under investigation is restricted to those compounds that have DFT calculations tabulated in MP, which is biased towards stability compared to the space of all possible hypothetical compounds. To account for the MP stability bias and more closely simulate a realistic materials discovery problem, we assessed the potential for these models to identify the nine stable MP compounds when considering a much larger composition space. Using the approach defined in Ref. 14 Table 1). The compositional models each predict ~4-5% of these compounds to be stable, and all of the models fail to accurately predict the stability of more than one of the nine compounds that are actually stable in MP. A remarkable 159 compounds are predicted to be stable by all six models and 1,294 unique compounds are predicted to be stable by at least one model. While it is likely that the space of stable quaternary compounds in the Li-Mn-TM-O space has not yet been fully explored in MP (or by extension, the ICSD), our intuition suggests it is highly unlikely that the number of new stable materials in this well-studied space is orders of magnitude larger than the number of known stable materials. The false positive rates obtained on the entire MP dataset shown in Figure 4 suggest ~25-38% of these predicted stable Li-Mn-TM-O compounds are not actually stable, and these rates are likely underestimated, as discussed previously. The magnitude of compounds predicted to be stable by the ML models, and their false positive rates, imply that these models will inevitably identify a large number of unstable materials as candidates for further analysis (either with DFT calculations or experimental synthesis). This substantially impedes the capability of these formation energy models to accelerate the discovery of novel compounds that can be synthesized. Table 1. Predictions in the expanded Li-Mn-TM-O (TM ∈ {Ti, V, Cr, Fe, Co, Ni, Cu}) composition space. Candidate compounds were generated by combining all quaternary MP compounds in this space along with quaternary compounds generated by the approach described in Ref. 14 , resulting in 13,659 candidates. Among these candidates, 9 compounds are calculated to be stable in MP. The stability of all candidates was assessed using each compositional model for ΔHf. Note that while all models correctly predict 1 of 9 MPstable compounds to be stable, this compound is not the same for all models.

Direct training on decomposition energy
An alternative approach to consider is to train directly on ΔHd instead of using MLpredicted ΔHf to obtain ΔHd through the convex hull construction. Note that direct training on ΔHd is complicated by the fact that ΔHd for a given compound is dependent upon ΔHf for other compounds within a given chemical space. This is unlike ΔHf, which is intrinsic to a single compound. To assess the capability of each representation to directly predict stability, we repeated the analysis shown in Figures 3-5 and Table 1 Table S1, and results for ΔHf and ΔHd on the entire MP dataset are shown in Figure S3 and Figures S4-S5, respectively. While the prediction accuracy (MAE and stability classification) on the entire MP dataset is typically comparable to or slightly better when training on ΔHd (Figures S4-S5) instead of ΔHf (Figures 3-4), the capability of the trained model to predict stability in sparse chemical spaces is even worse than when training on ΔHf ( Figure 6, Table S1). None of the models are able to identify even one of the nine MP-stable quaternary compounds from the set of 267 Li-Mn-TM-O compounds in MP, and every model predicts all 267 Li-Mn-TM-O compounds to be unstable (Figure 6). It is especially notable that for all models except Roost and ElemNet, the predictions for all 267 quaternary compounds fall in a very small window (0.040 eV/atom < ΔHd,pred < 0.082 eV/atom), suggesting the models only learn that all compounds in this space should be within the vicinity of the convex hull and do nothing to distinguish between chemically similar compounds. When the space of potential compounds is expanded to 13,659 compounds, only Roost and ElemNet predict any compound to be stable, but again, none of the nine MP stable compounds are predicted to be stable by any model (Table S1). Beyond the poor performance associated with these models, the direct prediction of ΔHd is difficult to physically motivate because unlike ΔHf, ΔHd is not an intrinsic property of a material but depends on the energy at other compositions with which it may be in competition.
This non-locality of ΔHd also depends on the completeness of a given phase diagram: as new materials are discovered in a chemical space, ΔHd is subject to change for any compound in that space, even if that compound's energy itself does not change, complicating the application of ML models trained on ΔHd.

Revisiting stability predictions with a structural representation
In addition to compositional models, representations that rely on the crystal structure for predicting formation energy have also received substantial attention in recent years. 25,[27][28][29][43][44][45] These models perform a different task than compositional models because they evaluate the property of a material given both the composition and the structure. Nevertheless, it is interesting to assess whether these structural models can predict stability with improved accuracy relative to models that are only given composition.
Here we take the crystal graph convolutional neural network (CGCNN) 25 as a representative example of existing structural models. CGCNN is a flexible framework that uses message passing over the atoms and bonds of a crystal (see Methods for training details). In  (Figure 7a), constructing convex hulls with those predicted ΔHf to generate ΔHd (Figure 7b), assessing the capability of these ΔHd,pred values to classify materials as stable or unstable (Figure 7c), and probing the ability for this model to predict stability in the sparse Li-Mn-TM-O space (Figure 7d). It is clear that CGCNN improves substantially upon the direct prediction of ΔHf (Figure 7a) and the implicit prediction of ΔHd (Figure 7b), reducing the MAE by ~50% compared with the best performing compositional model  (Figure 7d). In addition to the improved predictive accuracy, the parity plot for this excluded set looks fundamentally different than for the compositional models. In the compositional models (Figure 5), the parity plot is scattered, and there is effectively no linear correlation between the actual and predicted ΔHd, whereas for CGCNN, there is a strong linear correlation (Figure 7d). It is not clear whether this improved correlation arises from beneficial error cancellation within each chemical space or from reducing the overall MAE from ~0.06 eV/atom (for the best performing compositional model -Roost) to ~0.03 eV/atom (for CGCNN).
The non-incremental improvement in stability predictions that arises from including structure in the representation is a strong endorsement for structural models and also sheds insight into the structural origins of material stability. While the thermodynamic driving force for forming a compound from its elements (formation energy) can be learned with high accuracy from only the composition, the structure dictates the subtle differences in thermodynamic driving force between chemically similar compounds and enables accurate machine learning predictions of material stability (decomposition energy). However, the glaring limitation of this approach is that it requires the structure as input, and the structure of new materials that are yet to be discovered is not known a priori. For example, because we do not know the ground-state structure for an arbitrary composition, we cannot repeat the test where we assess the ability of the ML model to find the stable Li-Mn-TM-O compounds among a large set of candidate compositions. While CGCNN shows substantially improved performance in predicting material stability, these results are obtained using the DFT-optimized ground-state crystal structures as input.

Outlook
There have been a number of recent successes in the application of machine learning for materials design problems. These models have given the impression that ML can predict formation energies with near-DFT accuracy. 20,23,46

General training approach
Five-fold cross validation was used to produce the model-predicted ΔHf,pred shown in Figure 2. Each predicted value corresponds with the prediction made on that compound when it was in the validation set (i.e., not used for training). ΔHf,pred was then used in the convex hull analysis to generate ΔHd,pred shown in Figure 3, from which stability classifications were made as shown in Figure 4. For the Li-Mn-TM-O examples ( Figure 5 and the results shown in Figure 6, Figure S4-S5, and Table S1.

Compositional model training
Three of the compositional models-ElFrac, Meredig, 20 and Magpie 21 -were implemented using matminer 50 and trained using gradient boosting as implemented in XGBoost 30 with 200 trees and a maximum depth of 5. Preliminary tests showed XGBoost and these hyperparameters led to the highest accuracy of tested algorithms. AutoMat 22 was used as implemented in Ref. 22 . Roost 24 was trained for 500 epochs using an Adam optimizer with an initial learning rate of 5⨯10 -4 and an L1 loss function. ElemNet was implemented as described in Ref. 23 using the Keras machine learning framework 51 . ElemNet was trained using an initial learning rate of 10 -4 with an Adam optimizer for 200 epochs. 10% of the input data was set aside for validation, and the model weights from the epoch with best loss on the validation set were used for predictions.

Author contributions
CJB, AT, and GC conceived the project. CJB, AT, QW, and AD designed the project. AT, QW, and AD implemented the machine learning models. CJB performed the stability analysis, processed the results, and drafted the manuscript. AJ and GC supervised the project. All authors reviewed and edited the manuscript.

Data availability
A public repository at https://github.com/CJBartel/TestStabilityML contains the following items: code for training each compositional model in this work, code for assessing the stability of compounds given predicted formation energies (for the models shown in this work or any new model), the formation and decomposition energy data for all models studied in this work, and code for generating all figures and tables in the main text and Supplementary Information. Figure S1. Distribution of ΔHf and ΔHd in Materials Project, with 85 bins and normalized such that the integral of the distribution is equal to 1. Figure S2. Receiver operating characteristic (ROC) curves for each model trained on ΔHf. TPR is the true positive rate and FPR the false positive rate. The colorbar indicates the stability threshold -i.e., a compound is classified as "stable" if ΔHd is less than the stability threshold. Note that the models are trained on ΔHf and are therefore insensitive to this changing threshold. Instead, the choice of threshold simply allows for an expanded analysis of the ΔHf model performance on ΔHd predictions.