Network analysis of synthesizable materials discovery

Assessing the synthesizability of inorganic materials is a grand challenge for accelerating their discovery using computations. Synthesis of a material is a complex process that depends not only on its thermodynamic stability with respect to others, but also on factors from kinetics, to advances in synthesis techniques, to the availability of precursors. This complexity makes the development of a general theory or first-principles approach to synthesizability currently impractical. Here we show how an alternative pathway to predicting synthesizability emerges from the dynamics of the materials stability network: a scale-free network constructed by combining the convex free-energy surface of inorganic materials computed by high-throughput density functional theory and their experimental discovery timelines extracted from citations. The time-evolution of the underlying network properties allows us to use machine-learning to predict the likelihood that hypothetical, computer-generated materials will be amenable to successful experimental synthesis.

Assessing the synthesizability of inorganic materials is a grand challenge for accelerating their discovery using computations. [1][2][3][4] Synthesis of a material is a complex process that depends not only on its thermodynamic stability with respect to others, but also on factors from kinetics, to advances in synthesis techniques, to the availability of precursors. This complexity makes the development of a general theory or firstprinciples approach to synthesizability currently impractical. 3,5 Here we show how an alternative pathway to predicting synthesizability emerges from the dynamics of the "materials stability network": a scale-free network constructed by combining the convex free-energy surface of inorganic materials computed by high-throughput density functional theory [6][7][8][9] and their experimental discovery timelines extracted from citations. The time-evolution of the underlying network properties allows us to use machine-learning to predict the likelihood that hypothetical, computergenerated materials will be amenable to successful experimental synthesis. In other words, we can predict the synthesizability of novel compounds.
Thermodynamic stability strongly influences synthesizability of a material, but extracting it requires the knowledge of energetics of competing phases. This bottleneck has recently been addressed for inorganic materials by high-throughput (HT) density functional theory (DFT) databases, [6][7][8][9] which provide access to systematic DFT calculations of thousands of existing inorganic materials as well as hypothetical ones. These databases allow the construction of a comprehensive energy "convexhull": the multidimensional surface formed by the lowest energy combination of all phases. Phases that are on the convex-hull are thermodynamically stable, and tielines connecting two phases indicate two-phase equilibria. Given that it is composed of stable materials (nodes) connected by tie-lines (edges), the convex-hull is a naturally occurring thermodynamic network (Fig.1), analogous to the world-wide-web, social, citation, and protein networks, [10][11][12][13][14] and can be treated with the new paradigm of complex networks.
The chronology of "discoveries" can reveal the dynamics of this network of materials. The discovery of a material can be associated with the physical identification and The schematic illustrates phase diagrams with the order of the system ranging from 2-dimensional binary to the 89-dimensional "materials stability network" central to this work. The energy-composition convex-hull is shown for the binary system, and all higher order phase diagrams are projections of their respective N -dimensional convex-hulls to two dimensions, where materials are represented as nodes and tielines as edges. For clarity, only those tie-lines connected to high-degree nodes are shown in the materials stability network, where the sizes of the nodes are also scaled to reflect their degree.
recording of a new crystal structure and chemistry for a target application or general scientific exploration. With this definition, to be traceable as discovered, a material should (i) physically exist, i.e. be amenable to synthesis or occur in nature and not only exist on the computer as a "hypothetical" one, and (ii) have a record of structural characterization that can serve as a footprint for the onset of scientific interest. Both of these criteria can be traced from crystallographic databases, 15,16 which are dominated by structures of existing materials resolved with diffraction experiments. Assuming the time lag between the actual synthesis and/or characterization and the publication is not significant, the time of discovery of a material, and in particular the implied successful synthesis, can be approximated to be the earliest cited reference available in such collections. The thermodynamic information encoded in the convex-hull is important but not sufficient to explain the successful synthesis and discovery of a material. 4 However, when combined with information on the time of discovery, "temporal" stability networks encode circumstantial information beyond thermodynamics that may play a role in discovery. Such information implicitly includes scientific and non-scientific effects almost impossible to capture otherwise, such as the development of new synthesis techniques, availability of new precursors, changes 1950 1960 1970 1980 1990 2000 2010 2020 Year of discovery in interest or experience of researchers in a particular chemistry, structure or application, and even changes in policies that influence research directions. Here we combine the stability information from HT-DFT with the citation-extracted discovery timeline, both available in the Open Quantum Materials Database (OQMD), 6,7 and determine the temporal evolution of the stability network as more materials are discovered and added to it. Using the extracted network properties of materials, we demonstrate how a model can be developed to estimate the likelihood of synthesis of new, computationally-predicted stable materials.
The complete network formed by the current convexhull in the chemical space of all elements is extremely dense with 41 million tie-lines. 17 To find the most relevant set of tie-lines for synthesis, we subsample this network to obtain those that control the stability of at least one material, i.e. those in chemical subspaces where there is at least one stable material inside the composition simplex (See Methods). This process yields an informative subset of ∼ 2 × 10 5 tie-lines for synthesis that is also computationally tractable for repeated analysis, essential for building a predictive model as described later. Hereafter, we refer to this subset as the "materials stability network" to differentiate it from the complete network. We then trace retrospectively how this network was uncovered over time until it reached its present state. The number of stable materials discovered, N , and the number of tie-lines defining their equilibria as described above, E, are both growing with time ( Fig.2a). A poly- nomial fit to N (t) shows the number of stable materials discovered by year 2025 will reach ∼ 27 × 10 3 from the present number of ∼ 22 × 10 3 . The rate of stable materials discovery is ∼400 year −1 today and projected to reach ∼540 year −1 by 2025, suggesting that the discovery of stable materials is accelerating. E is increasing faster than N (Fig.2b), with α ≈ 1.04 in the densification power-law E(t) ∼ N (t) α . 18 Thus, the materials stability network is getting denser, which may be explained by researchers discovering materials closely connected with those already known, using the latter as stepping stones for the synthesis of new ones, 14,18 while uncovering the underlying "ultimate" network. 19 The degree distribution, p(k), where k is the degree of each node, is one measure of the topology of networks.
Here k corresponds to the number of tie-lines a material has. In recent years, scale-free networks that obey a power-law distribution, p(k) ∼ k −γ , have received significant attention. 13 While the materials stability network is far from a power-law in early times (e.g. 1960s), it has evolved into a distribution close to it, as shown in Fig.3 for 2010 (Supplementary Tables 1-2 and Supplementary Fig. 1). The exponent γ becomes constant at 2.6 ± 0.1 after the 1980s ( Supplementary Fig. 2), within the range 2 < γ < 3 as the other scale-free networks like the world-wide-web or collaborations 12,20 .
This scale-free character hints at the presence of "hubs" with significantly larger k compared to other nodes and a robust network connectivity, 21 implying that materials missing randomly from the network (because they have not been discovered yet) are not expected to hinder the discovery of others. However, if there are missing hubs, 13 new material classes disconnected from the present network may be awaiting discovery. The biggest hub here is O 2 with nearly 2,600 tie-lines, followed by Cu, H 2 O, H 2 , C and Ge with more than 1,100 tie-lines. Elemental N 2 , Ag, Si, Fe, Se, Mn, Co, K, Te, and Bi and oxides BaO, CaO, Li 2 O, SrO, Cu 2 O, MgO, SiO 2 , , and VO 2 are densely connected with 350 or more tie-lines. These are the species that play a dominant role in determining stabilities, and subsequently influencing synthesis of many materials. The predominance of oxide hubs suggests that identifying new hubs in chemistries such as pnictides, chalcogenides, halides or carbides may accelerate discovery in those spaces.
While the evolution of global properties of the network is slow (Supplementary Fig. 3), the network properties of individual nodes are evolving rapidly as their localenvironments change, as exemplified in Fig.4a-b for hightemperature superconductor YBa 2 Cu 3 O 6 and high-ZT thermoelectric BiCuSeO. Since this temporal evolution encodes circumstantial factors beyond thermodynamics that may contribute to discovery (and synthesis), properties that characterize the state of a material in relation to the rest of the network can realize a connection between these explicit or implicit factors and its discovery.
To reproduce that connection, we turn to designing a machine-learning model based on the network properties of materials, which we will then use to predict likelihood of synthesis of hypothetical materials: those created on the computer but have never been made. The present stability network has about 22,600 materials, of which ∼19,200 are physically-existing materials from crystallography databases and can be used in model building, and ∼3,400 are hypothetical, generated via high-throughput prototyping. 6,22-24 Prediction of synthesis likelihoods in the latter category can help bridge the gap between computational discovery and the real world.
We use six network properties for each material in model training; namely, degree and eigenvector centralities, degree, mean shortest path length, mean degree of neighbors, and clustering coefficient ( Fig. 4b and Supplementary Fig. 4). Degree and eigenvector centralities reflect the relative importance of a node in influencing stabilities, emphasizing the number of connections and importance of neighbors, respectively. We normalize these metrics such that they are mostly independent of the size of the network. 25 We find that degree without normalization is also useful for capturing the influence of the temporal state of the network on connectivity. Mean shortest path length, the mean of minimum number of tie-lines to travel from a node to every other node, and mean degree of neighbors serve as a proxy for ease of access to a particular material in synthesis. The clustering coefficient indicates how tightly-connected the neighborhood of a material is and may capture the local environment more immediately related to its synthesis.
Discovery is a time-irreversible event and its prediction is not a standard machine-learning problem in materials science. Here we use time-evolution of the aforementioned network properties of materials as features to form the basis of a sequential supervised-learning problem. 26 We adopt a sliding-window approach to train experimental "discovery" classifiers and estimate likelihood of discovery, and the implied synthesis for synthetic materials (See Methods). We employ two classification algorithms: L 2 -norm regularized logistic regression (LR) for well-calibrated probabilities, and random forest (RF) for classification accuracy. The performance metrics of models in the test set (unseen data) are shown in Supplementary Table 3 and in Fig.4c. With a window size of two, RF shows remarkable overall classification accuracy, reaching 97%, whereas LR performs reasonably at 77%. The most critical subclass is the sequences where a material changes from "undiscovered" to "discovered" in the time domain, where RF and LR perform well at a recall of 94% and 71%, respectively. To understand how these models accurately decide what is synthesizable, we investigate the correlations between the network properties, and how much they contribute to predictions (Figs.4d-e). Except the eigenvector and degree centralities, distinct features are not too strongly correlated. Identical features within a time sequence are naturally more correlated (e.g. degrees in a sequence), but distinct enough for the models to utilize them (Supplementary Fig. 5). Confirming the significance of tie-lines in influencing synthesis, degree and degree centrality, two closely connected but not highlycorrelated metrics, play the biggest roles in decisionmaking, with substantial contributions adding up to ∼90%. The rest of the features still play a non-negligible role, providing the remaining 10%.
The trained models can be used in multiple ways; for example, to predict class labels or probabilities for synthesizability in the present time or a past time. For the present time, models predict about 93% of hypothetical materials in the network have a synthesis probability, p, greater than 0.5. This prediction is in line with the notion that stable materials in HT-DFT databases are likely to be more amenable to synthesis. However, synthesis is a costly process and knowing its likelihood of success is critical before an attempt in the laboratory. Using LR's well-calibrated probabilities with RF's accurate classifications within the intersection of positive classification sets they predict (92% identical), we find that out of ∼3,400 stable hypothetical materials present, only about 9% have p > 0.95 for immediate synthesis (Supplementary Note 1).
Our approach can assist the decisions on where to allocate synthesis resources after computational design, and therefore has broad applications in materials discovery. ements), for which we find the likelihoods of synthesis to range from p = 0.52 for Li 4 SbRhO 6 to p = 0.85 for Li 4 NiTeO 6 . In fact, several of those predictions with p > 0.6 were synthesized. 27,28 For the novel ABO 3 perovskites identified in two recent computational studies, 23,29 we find the synthesis likelihoods to range from p = 0.54 for PuGaO 3 to p ≈ 1 for EuGeO 3 . For the inverse-Heusler alloys uncovered in a HT search for spintronics, 24 we predict p to be in the range 0.56 (TiInCo 2 ) -0.94 (FeGeRu 2 ) (a complete list of probabilities is available in Supplemen-tary Table 4). Today such computational studies can rapidly identify hundreds of new hypothethical materials with target functionalities, but the cost and complexity of synthesis often hinders systematic attempts for their realization. The ability to predict synthesis likelihoods is expected to bridge this gap between computational and experimental research groups. The network-based models can also be used to invert the discovery predictions and find at what point in time a hypothetical material could have already been made.
We find that only about 10% of the stable hypothetical materials that are predicted to be synthesizable today were synthesizable by 2005, and only about 30% were synthesizable by 2010 ( Supplementary Fig. 6), implying the advances and material discoveries within the last 10-15 year period are essential to their possible synthesis. The models assume that the mechanisms of materials discovery continue to follow similar trends to those in the past and present, and therefore by design they do not forecast the far "future". However, with the advances in materials discovery techniques including complex and high-throughput experimental and simulation capabilities, intuitively, the present models are likely a lower bound for the future synthesis likelihoods, as long as the non-scientific factors, such as the science policies and funding remain sustainable.

METHODS
The maximum-likelihood method was used to fit the distributions and the k min values were found by minimizing the Kolmogorov-Smirnov distance. 30,31 The powerlaw library was used in fitting the distributions. 31 Goodnessof-fit comparisons of different distributions are available in Supplementary Table 1. The method for subsampling of the complete network to obtain the materials stability network is further described in Supplementary Methods.
To prepare the input vectors for training the machine learning models, we create multiple sequential training examples (x i,t , y i,t ) for each material i from its temporal data, where feature vector for time t, x i,t extends to features for the past times within a window w, and where the target y i,t encodes binary labels 1 and 0 respectively indicating whether a material was discovered at that point in time or not. We adopted w = 2 for the present work. We analyzed the networks with 5year increments starting from 1945, and found that a window of width w = 2 (i.e. encompassing 10 years) provides sufficient prediction accuracy, without any need for recurrence (i.e. including past y as part of x). Network properties of a material pertaining to times when it was undiscovered are calculated by hypothetical, individual insertion of that material into the materials stability network (as if it existed at that point in time). Further details of each step in model creation can be found in Supplementary Methods. Discovery times of of "known" materials are approximated by their earliest dated reference for their structures reported in the ICSD, except for the elemental references, which are defined as discovered (y = 1) at all times.
Model training and evaluation was performed using scikit-learn. 32 We use an 80:20 splitting for training and testing of all models. In training the LR models, we use a larger weight (approximately 2.5 times) for the minority subclass of y = [0, 1] (i.e. a material transition-ing from "undiscovered" to "discovered"), compared to the other subclasses to obtain evenly distributed accuracies across all subclasses. RF models use 200 estimators. Models were also tested against baseline classifiers including class distribution prediction, majority class prediction and random classifiers, and found to outperform all by a significant margin. Feature importance in RF model is calculated as the Gini importance. For the analysis presented, we used the version 1.1 of the OQMD data, available for download at oqmd.org. Networks, network property data and the discovery times extracted are all available as Supplementary Data, along with a Jupyter notebook that can be used to reproduce the results presented in the text.