Scaling deep learning for materials discovery

Novel functional materials enable fundamental breakthroughs across technological applications from clean energy to information processing1–11. From microchips to batteries and photovoltaics, discovery of inorganic crystals has been bottlenecked by expensive trial-and-error approaches. Concurrently, deep-learning models for language, vision and biology have showcased emergent predictive capabilities with increasing data and computation12–14. Here we show that graph networks trained at scale can reach unprecedented levels of generalization, improving the efficiency of materials discovery by an order of magnitude. Building on 48,000 stable crystals identified in continuing studies15–17, improved efficiency enables the discovery of 2.2 million structures below the current convex hull, many of which escaped previous human chemical intuition. Our work represents an order-of-magnitude expansion in stable materials known to humanity. Stable discoveries that are on the final convex hull will be made available to screen for technological applications, as we demonstrate for layered materials and solid-electrolyte candidates. Of the stable structures, 736 have already been independently experimentally realized. The scale and diversity of hundreds of millions of first-principles calculations also unlock modelling capabilities for downstream applications, leading in particular to highly accurate and robust learned interatomic potentials that can be used in condensed-phase molecular-dynamics simulations and high-fidelity zero-shot prediction of ionic conductivity.

1 Supplementary Note 1: Additional information on Materials Discovery Results

Details of Experimental Matches
In Fig. 1 we provide additional summary statistics of matched experimental crystals as aggregated by ICSD, as described in the main body of the paper.In Fig. 1a, we provide the journal year, and in Fig. 1b and c, we showcase that experimental matches occur across a variety of unique element sizes and number of atoms.The tails of both figures in finding quaternary / quintary structures as well as structures whose reduced formula is greater than 20 supports the usage machine-learning guidance in these more complex search spaces.Finally, in Fig. 1d, we highlight the final computed decomposition energies of the experimental structures.The findings suggest even metastable structures can be of interest for experimental applications, further supporting the complete GNoME dataset beyond materials that are only on the convex hull.

GNoME
Supplemental Figure 1: GNoME discoveries match experimental discoveries aggregated by the Inorganic Crystal Structure Database (ICSD).We present a number of summary statistics in the figures above, including the journal year, number of unique elements, and number of atoms.In the final plot, we plot the decomposition energy for all of the experimental matches.

Patterns in Formation Energies
Prior work [1,2] has postulated that stable materials with increasing number of elements should display increasingly negative formation energies.With the intuition that increasing the number of elements leads to a combinatorial increase in the number of competing phases, Fig. 2 displays distribution of formation energies for the stable materials in reproductions of Materials Project and OQMD in comparison to the GNoME dataset, indexed by the unique number of elements.While the same trend is observed in the GNoME dataset, the effect is significantly less pronounced.In addition, due to natural occurrence and research interests, current databases contain a relatively high population of highly ionic and covalent compounds with stronger bonds (e.g.oxides, fluorides etc.) yielding deeper formation energies.Our exploration in GNoME was not biased towards any particular chemistry and hence expanded into a wide range of chemical spaces where bonds are naturally not as strong (e.g.pnictides, chalcogenides, borides, alloys etc.), yielding formation energies that are not necessarily as Supplemental Figure 2: Formation energies of all stable materials reproduced from Materials Project and OQMD compared to the discoveries by GNoME.We note the drastic difference in distribution, suggesting that GNoME models are able to discover more complex patterns of stability.
-100 0 50 Formation Energy [meV/atom] Density (GNoME -Materials Project) Supplemental Figure 3: Difference in formation energies when using the GNoME pipeline to find improved structures for compositions already available in prior catalogs. deep.

Repeat Structure Comparisons
In the rest of this paper, we report results for compositions that are independent from downloaded databases such as Materials Project and OQMD [3,4].This decision ensures that the new structures discovered in this work are not slight modifications of already known materials, but are clearly novel.However, it is also known that the Materials Project [3] contains DFT calculations that are not intended to be global minima within a composition, for example, with the addition of many nano-wire structures between 2018 and 2019.There are also many compositions with only 1 viable structure, potentially indicating limited search heuristics used to find global minima.Therefore, repeating structure searches on compositions already in the Materials Project and OQMD using GNoME models allows us to assess model capabilities for discovering better structures for compositions lacking the global minima in these databases.Fig. 3 presents the relative formation energy difference for these repeated computation.While many of the compositions do appear to be at global minima structures (indicating the success of years of research into specific families of interest), there is a notable number of compositions that improve.For the main body of the paper, these "repeated" composition comparisons are included in the phase diagram calculations for the updated convex hull but are not counted in the presented 381,000 novel stable crystal structures due to limitations in available crystal structure matchers.

Data Limitations
In this paper, we are consistent with prior efforts in the structure prediction community to count stable materials relative to the known convex hull.Future discoveries can always displace crystal structures that are currently on the convex hull.For example, the GNoME discoveries displace approximately 5,000 of the "stable" structures from Materials Project and OQMD.The true convex hull will remain a continual goal in material science.
2 Supplementary Note 2: r 2 SCAN Comparison In the body of the paper, we validate the model predictions and materials discovered by the GNoME procedure by comparing energies between the commonly used PBE functionals and the more accurate but computationally expensive r 2 SCAN procedure.The overall agreement is high, with 84% of structures marked as stable via PBE calculated as stable via r 2 SCAN.Nevertheless, there is a somewhat noticeable discrepancy in phase separation energies near for elements close to the convex hull.Results from additional analysis are provided below.First, the discrepancy in phase separation energies near 0 appears standard and not a unique feature of the GNoME dataset.In particular, Fig 4 plots an equivalent graph comparing the phase separation energies of crystals available in the Materials Project and OQMD suggesting that the misclassification via PBE is also likely for public dataset.We note that calculations at the level of r 2 SCAN are also often intermediates between PBE and experimentation so the described errors in calculation are often fixed in preprocessing for downstream applications.As an exploratory measure, we have trained graph neural networks to explore transferring between PBE to r 2 SCAN energies; early results suggest that this delta can be predicted to within 13meV/atom.Due to the computational expense of r 2 SCAN compuations especially when not near local minima, this strategy has not yet been used for material discovery under the associated meta-GGA functional.

Supplementary Note 3: Patterns of Phase Separation
In Fig. 5, we provide greater context to the discussion surrounding phase separation energies of discoveries, to add to the quaternaries presented in the main body of the paper.There are improvements at all phase separation levels suggesting many structures found are meaningfully stable with respect to competing phases.Improvements to binaries are interesting as GNoME is able to find novel structures despite the simplicity and long-history of discovery in this space.Discoveries only increase in relative magnitude for ternaries and beyond.

Compositions Close to the Convex Hull
So far, we have placed predominant focus on compositions that are stable, as measured by being within floating point error (1e − 7) of the convex hull.This stringent definition, however, neglects problems with the numeric instability of density functional theory results as well as the error in estimating experimental energies via computational methods.Additionally, inorganic crystals may be metastable and viable for industrial applications though not being the ground state of a particular composition, for example the diamond form of carbon in comparison to graphite.Therefore, experimentalists hoping to produce inorganic crystals will include candidates that are close to convex hull.In Fig. 6

Probability Stable
Training Dataset

Materials Project GNoME
Supplemental Figure 8: Precision of machine learning models trained on GNoME vs. the Materials Project data, as a function of the prediction threshold.
showcase how the active learning process also introduces a significant number of the discoveries made by GNoME lie close to the convex hull, providing an even greater set of candidates for potential downstream applications.A promising avenue for future work would be to look into patterns of metastablility and contrast with the stable materials discovered.Fig. 7 further summarizes the counts of unique compositions or independent crystal structures that are close to the convex hull by filtering by decomposition energy.In particular, we find that with common thresholds of 25 or 50 meV/atom from the convex hull, the GNoME dataset provides a substantially large number of crystal structures.Even at the more conservative 25 meV/atom metric, there are over 2.4M unique compositions and associated structures in the GNoME dataset.

Filtration Improvement from Active Learning
In the main body of the paper, we reference the increasing hitrate as models improve via active learning.In Fig. 8, we break down the hitrate of GNoME models with respect to the threshold chosen.On subsets of materials generated from the initial round of active learning and the final round, we evaluate model performance and show that despite equivalent filters (e.g. with test-time augmentation and uncertainty quantification), GNoME models show improved performance across the board.Furthermore, the results suggest that a fraction of GNoME discoveries come from substitutions that would not have been immediately predicted as stable (before relaxation).

Remaining Materials
An open research question after the GNoME discoveries is how many materials remain to be discovered?In the paper, we discussed the results from 6 rounds of active learning, resulting models of high hitrate overall.Preliminary results suggest that a great deal more materials remain to be found, with active learning on the start of a 7th round and extrapolations of the hitrate suggesting at least an additional Supplemental Figure 9: Increase in the number of prototypes as a ratio to those available by the Materials Project.
1.6M (80% * 2 million candidates) materials that are likely to be stable with respect to the convex hull of Materials Project.It is likely some of these would drive up the number of stable crystals on the updated hull.These results suggest that GNoME can continue to aid in discovery either via high-throughput search or within specific families of interest.

Additional Prototype Analysis
Prototype analysis in the main body of the paper is presented in a log-scale making the relative contributions to individual number of elements difficult to ascertain.In Fig. 9, we present the relative increase in the number of prototypes across number of elements.This figure emphasizes the capabilities of GNoME models for discovering materials with an increasing number of unique elements.
4 Supplementary Note 4: Analysis of 'Newly' Discovered Materials

Composition Matching
In our work, Materials Project, OQMD and WBM were used as reference databases against which we check if a new stable material we identify should be labeled as a stable discovery or not.First two of these databases include structures from comprehensive curated sources such as ICSD and Pauling File, and hence are considered to encompass nearly all experimentally known ordered inorganic structures.There could, however, still be known materials reported in the literature but overlooked in these databases.To understand the extent of this gap, we randomly sampled 200 structures from the stable GNoME discoveries, and ran independent and deep manual literature searches, factoring in human chemical knowledge (e.g.searching for other potential ways of writing the formula).Out of these 200, we found only one composition (RbCaAsO 4 ) that was in the literature but was not curated into the reference databases.Albeit not a large sample, these results imply that the number of false positive stable discoveries due to missing entries in reference databases is expected to be small.

Targeted Search
While we focused on efficiently finding new stable materials throughout this effort, convex-hull by its nature is never complete and a deeper search can always find new compounds that displace the older ones from the convex hull.To gauge the impact of this on the number of new stable materials we found, we randomly picked chemical system of 100 structures from our final list of 381,000 stable entries, and ran a more comprehensive structure search for each system, including the subspaces (e.g. if the target system is a ternary, we also include its binaries).In this deeper search, we generate candidate structures by using our substitutional pipeline on all of the prototypes available in GNoME for each of the 100 chosen chemical systems, which yields > 10 million candidate structures.To downselect candidates for DFT, we further relax the filtering criteria used in our original pipeline, allowing (i) structures with a graph-network predicted energy within 100 meV/atom of our final convex hull and (ii) for each composition, up to 10 polymorphs to be included in the new batch of DFT calculations.
Supplemental Figure 10: Phase separation energies for materials removed from the convex hull by additional exploration within 6,500 selected chemical systems.Results showcase how stable materials can be displaced but often remain close to the updated convex hull.The results are likely still of interest to experimental researchers and for filtration in downstream applications.
This filtering yields approximately 100,000 new structures to calculate with DFT.After the DFT relaxations, we found that the total number of stable materials in these 100 chemical systems and their subspaces increased from 10,879 to 11,575.In addition, 270 materials were displaced from the convex-hull with this new search, for which we show the new decomposition energies remain close to the new hull (Fig. 10).These results indicate that while there is always room for finding new stable materials, our final convex hull provides a strong baseline for stabilities that do not significantly change with subsequent deeper searches.

Supplementary Note 5: Compositional model trained on AIRSS runs
As described in the methods section, we find that compositions for which more than 10 AIRSS runs have completed to be more reliable training and test labels for training compositional models.Fig. 11 shows that for compositions with fewer than 10 completed AIRSS runs, the decomposition energy can be anomalously large.Fig. 12 shows that while some compositional energy predictions from the GNN can seem too low, we find that these predictions almost always correspond to compositions for which fewer than 10 AIRSS runs have completed.When we restrict the comparison to compositions with at least 10 completed AIRSS runs, the agreement between the measured formation energies and predicted formation energies is much stronger.

Ionic conductivity prediction
Fig. 3 in the main text shows the performance of various pretrained MLIPs on the task of classifying whether an unseen material is a superionic conductor.These simulations are performed in a zero-shot manner, meaning that the pretrained GNoME potentials have neither seen any data sampled from AIMD, nor have been trained on the composition being simulated.In some cases, as common in MLIPs [5], one may encounter unstable simulations from which no measure of conductivity can be derived.When evaluating a model's ability to predict whether an unseen material displays superionic conductivity or not, including such simulations as misclassifications or discarding them from the test set may change the classifier performance.We report both.In Figure 3a in the main text, we report unstable simulations as not counting towards to the classification accuracy.In the supplementary information, in Fig. 13, we show the same experiment, but counting them as misclassified, thereby slightly increasing the classification error.13: The scaling of ionic conductivity classification error when the unstable GNN-MD simulations are considered to be classification errors.In contrast, in Fig. 3a the unstable GNN-MD simulations are considered as a no-prediction.We see that the general trends are similar for both assumptions.

Scaling behaviour of force errors on AIMD data
Deep learning has been empirically observed across various domains to exhibit predictable scaling behaviour, where the test error follows a power-law of the form = aN b , where and N are the predictive error and the number of training samples, respectively, and a and b are constants.Scaling laws have proven highly useful as they allow to predict what training set size would be required to obtain a given predictive error [6,7].We find that the mean absolute error in force components of downstream materials sampled from AIMD follows a power-law with respect to the pretraining size and consistently improves over multiple decades.Figures 14 -21 show the scaling behaviour of the mean absolute error in force component as a function of training set size.Blue points represent scaling of the GNoME potential on unique structures from AIMD trajectories, red denotes a NequIP potential trained on the M3GNET data, and purple shows a GNoME potential trained on the full relaxation trajectory, including repeated structures along the minimization trajectory up to approx.89 million structures.All four materials are compositions in the test set not included in training.We observe a clear power-law as we increase the pretraining data set size, for both T=400K and T=1,000K test evaluation.

Robustness
A common shortcoming of MLIPs is their inability to generalize to data distributions beyond what they are trained on.This can result in unstable simulations or unphysical configurations, if configurations that lie outside of the training distribution are encountered during the simulation.One example of such domain shift is simulations with higher temperatures than the training set.We find that pretraining on the large and diverse GNoME dataset of structural relaxations greatly improves the robustness under such distribution shifts.In an effort to assess the model's robustness, we train two types of NequIP potentials on data of increasing sizes, sampled from AIMD simulations at T=400K and evaluate them on structures sampled at T=1,000K.As discussed in the Methods section, we train a) a NequIP potential starting from randomly initialized weights, and b) a model that was pretrained on the GNoME data set and then fine-tuned on the T=400K data.We also compare to the performance of the zero-shot model, that was never trained on the 400K data or the material composition to begin with.In figures 22 to 33, we demonstrate the performance of the potentials under a distribution shift by testing them on four different materials that are not included in training.For each composition, we first show a scatter plot of the performance at T=400K vs T=1,000K, where a perfect y = x fit would indicate a perfectly robust potential.We find that pretraining consistently and substantially improves robustness.Second, we show the performance of the potentials trained from scratch and pretrained potentials as a function of fine-tuning data set size.We perform this experiment for evaluations at both 400K and 1,000K (note that in both cases, both the fine-tuning and training are performed at 400K).While in the 400K case, fine-tuning helps, the zero-shot model is quickly superseded in performance when training on data from the target distribution.The most significant improvement is observed on data sampled at T=1,000K where pretraining gives strong improvements, even at large data set size of > 1, 000 structures.In addition, here the zero-shot model is also often highly competitive, often outperform from-scratch models trained on hundreds of samples.
Interestingly, we find that for only three out of the four materials in the test set, fine-tuning improves over zero-shot performance on robustness.Similarly, with large enough data sets, training from scratch improves upon zero-shot robustness.But for one material, neither is true.Analysis of the AIMD data reveals that while the three materials that improve are all poor conductors at T=1,000K, the remaining material exhibits Li-ion conduction at T=1,000K, but not at 400K.This suggests that in this case, overfitting on the non-conducting data hurts model performance when evaluated on the conducting case.The pretrained potential appears to perform better in a zero-shot evaluation.
Supplemental Figure 22: Robustness, As 24 Ca 24 Li 24 .In-domain vs out-of-domain error of different fine-tuned and randomly initialized models trained at T=400K and evaluated at T=400K as well as T=1,000K on.Supplemental Figure 33: Robustness, Li 32 S 24 Si 4 , T=400K.Force errors on data sampled at T=400K of fine-tuned and randomly initialized models trained at T=400K as a function of training set size

Performance on elemental materials
We further test the zero-shot performance of the GNoME potential on a series of elemental systems previously used in the performance assessment of machine learning interatomic potentials [8].The data cover six elemental systems Li, Ni, Ge, Si, Cu, Mo, computed with DFT.The DFT computations were performed with VASP using the Perdew-Burke-Ernzerhof (PBE) generalized gradient approximation (GGA).The dataset from [8] covers a diverse set of structures including the ground state crystal structure, data sampled from NVT-AIMD at different temperatures (0.5x, 0.9x, 1.5x, and 2.0x the melting point), NVT-AIMD simulations of the bulk structure with a single vacancy at different temperatures (300K and 2.0x the melting point), strained structures, and slab structures.In [8], a series of MLIPs were trained on data from this distribution and tested on the same distribution.Here, we evaluate the pretrained GNoME potentials directly on the test set, without any training on the data.The data were obtained from the test set files from https: //github.com/materialsvirtuallab/mlearn/tree/master/data.
Tables 1 -6 show the zero-shot GNoME performance compared to different potentials trained on data from the same distribution, as well as to an evaluation of the pretrained, publicly available M3GNET and CHGNet potentials [9,10].We observe strong improvements from scaling up pretraining data sets.Moreover, we observe for the first time that a pretrained potential is competitive with machine learning interatomic potentials trained explicitly on the data.The pretrained GNoME potential is on par with a Behler-Parrinello neural network trained on hundreds of structures, outperforming it on four out of six materials.We note the special case of Nickel, where the composition is not present in our pretraining data, GNoME still performs highly competitively with MLIPs trained on the data.Finally, we note that the poor performance of the GNoME potential in Molybdenum is consistent with all other methods and with recent work [11,8], highlighting challenges of current existing MLIPs on certain elements.Fig. 3d in the main text displays the performance of the GNoME potential in comparison to other general-purpose potentials, including a zero-estimator (i.e.predicting a value of 0 for each force component), as well as the M3GNET [9] and CHGNet [10] potentials.Figures 38 -43 show the performance of the pretrained GNoME potential in comparison to the M3GNET potential trained on the M3GNET data.In addition they also show the performance of the GNoME potential on two subset of the combined test data, namely a) only the melted test structures including a vacancy as well as b) on the 300K bulk data.

Supplemental Figure 4 :
Comparison of phase separation energies from the PBE and r 2 SCAN functional for material structures directly from Materials Project or OQMD.

Supplemental Figure 5 :
Phase separation energies filtered by unique number of atoms on the complete GNoME dataset.Separation Energy [meV/atom] Separation Energy [meV/atom]

Supplemental Figure 6 :Supplemental Figure 7 :
Phase separation energies filtered by unique number of atoms on the complete GNoME dataset.A summary of the GNoME dataset, filtered by distance to the convex hull.

Supplemental Figure 11 :
Minimum decomposition energy for a composition as a function of AIRSS runs completed for that composition.f [eV/atom] Supplemental Figure 12: (a) A scatter plot of formation energy labels from DFT and formation energy predictions from the compositional model.Each point represents a composition, and all compositions in our training set is included in the figure.(b) A scatter plot of formation energy labels from DFT and formation energy predictions from the compositional model.Each point represents a composition, but only compositions with at least 10 completed AIRSS runs are included.

Figure 38 :
Zero-shot performance on Nickel.Force components in units of [eV/ Å] a) GNoME potential evaluated on all Nickel test set structures.The GNoME potential has never seen pure Nickel and has not been trained on data sampled from MD.In addition, the test data include surfaces, whereas the GNoME potential has only ever seen bulk structures.b) Performance of the M3GNET potential.c) GNoME potential evaluated on Nickel at T=3,000K in the melt, including a vacancy.The GNoME potential has never seen melted structures and has never seen a vacancy.d) GNoME potential evaluated on Nickel at T=300K.Supplemental Figure 39: Zero-shot performance on Copper.Force components in units of [eV/ Å] a) GNoME potential evaluated on all Copper test set structures.The GNoME potential has not been trained on data sampled from MD.In addition, the test data include surfaces, whereas the GNoME potential has only ever seen bulk structures.b) Performance of the M3GNET potential.c) GNoME potential evaluated on Copper at T=3,000K in the melt, including a vacancy.The GNoME potential has never seen melted structures and has never seen a vacancy.d) GNoME potential evaluated on Copper at T=300K.Supplemental Figure 40: Zero-shot performance on Lithium.Force components in units of [eV/ Å] a) GNoME potential evaluated on all Lithium test set structures.The GNoME potential has not been trained on data sampled from MD.In addition, the test data include surfaces, whereas the GNoME potential has only ever seen bulk structures.b) Performance of the M3GNET potential.c) GNoME potential evaluated on Lithium at T=907K in the melt, including a vacancy.The GNoME potential has never seen melted structures and has never seen a vacancy.d) GNoME potential evaluated on Lithium at T=300K.Supplemental Figure 41: Zero-shot performance on Germanium.Force components in units of [eV/ Å] a) GNoME potential evaluated on all Germanium test set structures.The GNoME potential has not been trained on data sampled from MD.In addition, the test data include surfaces, whereas the GNoME potential has only ever seen bulk structures.b) Performance of the M3GNET potential.c) GNoME potential evaluated on Germanium at T=2,422K in the melt, including a vacancy.The GNoME potential has never seen melted structures and has never seen a vacancy.d) GNoME potential evaluated on Germanium at T=300K.Supplemental Figure 43: Zero-shot performance on Silicon.Force components in units of [eV/ Å] a) GNoME potential evaluated on all Silicon test set structures.The GNoME potential has not been trained on data sampled from MD.In addition, the test data include surfaces, whereas the GNoME potential has only ever seen bulk structures.b) Performance of the M3GNET potential.c) GNoME potential evaluated on Silicon at T=3,374K in the melt, including a vacancy.The GNoME potential has never seen melted structures and has never seen a vacancy.d) GNoME potential evaluated on Silicon at T=300K.

Table 1 :
Zero-shot performance on GNoME model in comparison to methods trained on the Nickel data set.RMSE in units of [meV/ Å].

Table 2 :
Zero-shot performance on GNoME model in comparison to methods trained on the Copper data set.RMSE in units of [meV/ Å].

Table 3 :
Zero-shot performance on GNoME model in comparison to methods trained on the Lithium data set.RMSE in units of [meV/ Å].

Table 4 :
Zero-shot performance on GNoME model in comparison to methods trained on the Molybdenum data set.RMSE in units of [meV/ Å].

Table 5 :
Zero-shot performance on GNoME model in comparison to methods trained on the Silicon data set.RMSE in units of [meV/ Å].

Table 6 :
Zero-shot performance on GNoME model in comparison to methods trained on the Germanium data set.RMSE in units of [meV/ Å].