Predicting stable crystalline compounds using chemical similarity

We propose an efficient high-throughput scheme for the discovery of stable crystalline phases. Our approach is based on the transmutation of known compounds, through the substitution of atoms in the crystal structure with chemically similar ones. The concept of similarity is defined quantitatively using a measure of chemical replaceability, extracted by data-mining experimental databases. In this way we build 189,981 possible crystal phases, including 18,479 that are on the convex hull of stability. The resulting success rate of 9.72% is at least one order of magnitude better than the usual success rate of systematic high-throughput calculations for a specific family of materials, and comparable with speed-up factors of machine learning filtering procedures. As a characterization of the set of 18,479 stable compounds, we calculate their electronic band gaps, magnetic moments, and hardness. Our approach, that can be used as a filter on top of any high-throughput scheme, enables us to efficiently extract stable compounds from tremendously large initial sets, without any initial assumption on their crystal structures or chemical compositions.


INTRODUCTION
The quest for new materials is one of the most important endeavors of materials science 1,2 . The discovery of materials with tailored properties hold the promise of improving existing technologies, but also of enabling new disruptive applications 3 . Unfortunately, there exist many examples of technologies that remain in the realm of science fiction due to the unavailability of adequate materials 4,5 . This may happen because known compounds are toxic, rare, or too expensive for industrial, large scale use, or simply because no material is known with good enough properties 6-8 . It is clear that the number of imaginable materials is extremely large, as it derives from the combinatorial problem of arranging chemical elements of the periodic table in all possible stoichiometries and dynamically stable crystal structures 9 . This number is, however, reduced as most combinations are not prone to experimental synthesis 2 . There are several reasons for this: the crystal structure may describe a high-energy polymorph that can not be stabilized, the stoichiometry itself may be highly unstable to decomposition to other compounds, or it may simply be that there is no easy thermodynamically favored reaction path for experimental synthesis. In spite of these problems, there remains a very large number of experimentally reachable materials, of which we know only a small fraction 10 .
For the past decades, we have witnessed spectacular advances in computational materials science. One of the main reasons for this was the progression of density functional theory (DFT) 11,12 that, thanks to its excellent accuracy combined with remarkable computational efficiency, has become the workhorse method for the theoretical study of materials 13 . Favored by the advent of faster supercomputers and better software, DFT opened the way for extensive numerical studies of large datasets of compounds 14 . These so-called high-throughput studies 15 , whose results are conveniently stored in online databases, have greatly extended our knowledge of materials and have already lead to the discovery of a variety of compounds with improved properties 15-18 . There are several strategies that can be used for the theoretical search of materials 18,19 . One of the most prominent approaches for inorganic solids is "component prediction", following the definition of ref. 19 , meaning that one scans the composition space of a prototype structure searching for stable materials, instead of scanning the space of possible crystal structures for a given composition [19][20][21] .
In this context, we use the word "stable" to denote thermodynamical stability, i.e., compounds that do not transform or decompose (even in infinite time) to other different phases or stoichiometrically compatible compounds 9 . It is true that metastable materials, like diamond, are also synthesizable and advances in chemistry have made them more accessible 22,23 . Nevertheless, thermodynamically stable compounds are in general easier to produce and handle. The usual criterion for thermodynamic stability is based on the energetic distance to the convex hull 24 : the energy distance of a compound to the convex hull is hence a measure of its instability.
Using high-throughput approaches, the whole periodic table has already been scanned for a series of prototypes of relevant crystal structures. The most extensive studies of this kind can be found in the aflowlib database 25 that, at present, includes more than 2 million compounds. Unfortunately, this number is dwarfed by the total number of possibilities. Just for ternary intermetallics, there are 1391 structure-types known experimentally 26 and there are~500,000 possibilities of combining three metallic elements for each of these prototypes. Moreover, ternary structures can be rather complex: the average number of atoms in the unit cell turns out to be 14, but the majority of intermetallic ternary prototypes is considerably larger 26 . The situation is obviously even worse for quaternary or multinary systems. Considering that a DFT calculation scales with the cube of the number of atoms in the unit cell, we are quickly led to conclude that an exhaustive search of the composition space will be out of reach for the foreseeable future.
To mitigate the combinatorial curse, chemical constrains have been successfully applied to filter out compounds that are unlikely to be formed 27 . Alternatively, machine learning can be used to predict compounds and their properties 14,[28][29][30][31] . In view of the scarcity of experimental data, the machine is usually trained on DFT calculations and then used to predict which compositions and/or crystal structures are more likely to be stable 14,19,28,29 . Already in 2010, in the seminal work by Hautier et al. 32 , machine learning was used to predict the probability that a chemical substitution of an existing compound can give another stable compound. Predictions are then validated a posteriori performing DFT calculations of the candidate systems.
In this article, we propose an approach to scan efficiently the space of all possible stable materials that relies on data mining rather than empirical rules or chemical intuition, inspired from ref. 32 . We borrow the idea of component prediction [19][20][21] and combine it with the concept of chemical similarity. This means that the compositions to be tested are selected using a measure of the likelihood that a chemical element A can be replaced by another B in a given structure. Such a scale of similarity was obtained by statistical analysis through data mining in ref. 33 . To some extent, the concept of similarity can be intuitively understood from the graphical representation of the periodic table.
Elements that are neighbors in the periodic table are known to be similar chemically, a fact has been used by chemists to create materials for more than 100 years. However, statistical analysis goes beyond pure chemical intuition and can identify unexpected correspondences.
Any approach based on chemical similarity can be applied immediately to any crystal structure, and even to systems of reduced dimensionality, such as two-dimensional materials and nano-structures.

Thermodynamic stability
The number of substituted materials in each iteration that were not in the database, and hence were calculated is shown in Table 1. Our initial set was composed by elemental, binary, ternary, quaternary, and quintenary compounds. The first iteration is strongly biased by the distribution of materials in the database, which is mainly composed of binary and ternary compounds.
Before discussing in detail the results, we can better motivate the choice to set the threshold value of the element replaceabilty at 5%. We verified that higher values of the threshold would lead to a higher percentage of stable materials. In particular, our results indicate that a threshold of around 20% would maximize the fraction of stable compounds found in each iteration. However, the total number of stable compounds would be reduced by a factor of three. We believe therefore that setting the threshold value to around 5% is a more convenient compromise.
There are a total of 713 different prototypes in the first generation, and the most common one is the cubic full-Heusler compound, with a total of 10,653 systems. These are very simple ternary cubic compounds (from the crystallographic point of view) with composition ABC 2 , and that can be stable for a large variety of elements in the periodic table. This family has already been subject to extensive and systematic studies using either highthroughput or machine learning techniques, and the optimized crystal structures for most compounds can be found, e.g., in the Aflowlib database 25 . In the second generation, Heusler compounds remain the most common prototype, but with only 4238 systems. The situation changes in the third generation, where the most common prototype becomes the hexagonal ZrNiAl-Fe 2 P structure, with 5009 compounds.
It is interesting to analyze the distance to the convex hull (E hull ) of stability for all 189,981 materials. A histogram with this information can be found in Fig. 1. Note that we plot E hull with respect to the hull composed of compounds in the materials database solely. This means that stable structures not included in the database will appear with negative E hull . Of course, in this case, the hull has to be redefined to include these compounds. This will be further discussed in the following.
The first impression we get from the figure is that the distribution of E hull is very different from a skewed Gaussian we know for DFT calculations of families of materials (e.g., perovskites 30 or tI10 materials 31 ). In fact, we believe that the distribution displayed in Fig. 1 is a demonstration of the validity of our approach. In comparison with the distributions shown in refs. 30,31 , obtained by performing systematic substitutions, we observe an enhanced percentage of materials with a negative distance to the hull, while the histogram decays rapidly for positive distances. The large peak at zero is due to substitutions leading to materials already present in the database. We did check whether the transmuted material is already in the database, i.e., if an entry with the same composition and space group exists before running the calculation. However, often the geometry optimization procedure relaxes structures into other space groups (usually to more symmetric ones), and these final structures can sometimes be found in the database.  There are a total of 31,602 structures with a negative distance to the convex hull, but not all of these can be counted as stable structures. Firstly, the procedure we follow could find more than one structures with negative E hull with the same composition. And secondly, we have to redefine the convex hull including all our structures. After taking these two points into consideration, we found a total of 18,479 systems on the redefined convex hull. The structures of these materials are available in our website (see Section "Data Availability"). We crosschecked our list against the Aflowlib database 25 , and found that only 417 out of 18,479 (2.3%) stable structures are overlapping with entries of this database. Thus, almost all stable compounds reported in the present work are not included in materials databases.
We have to stress that our calculations are approximate (after all, we are using DFT with the PBE approximation to the exchangecorrelation functional), and that we are working at zero temperature, neglecting entropy effects. Systematic analysis reported that the error in DFT estimated stabilities are around several tens meV per atom, e.g. 24 meV per atom 34 , and 70 meV per atom 35 . Therefore, one can still expect that a large majority of these 18,479 structures may indeed be stable thermodynamically, and are therefore promising candidates for experimental synthesis.
In this work we decided not to take into account all systems that are "technically" unstable (having positive E hull ). In our opinion those structures that have a small positive distance to the theoretical convex hull should however not be completely discarded for two reasons: (i) Some might actually be stable, and only appear above the hull due to the Perdew-Burke-Ernzerhof (PBE) approximation; (ii) Some might be stabilized by temperature, pressure, defects, etc. and thus could be experimentally synthesizable. Nevertheless, due to the large number of structures, we decided, for the time being, to concentrate only on the theoretically stable materials and leave the rest for future investigations.
Comparing the number of stable structures (18,479) with the total amount of systems tried (189,981), we find a success rate of 9.72%. This result is encouraging if we compare it with the success rates of systematic high-throughput and machine learning studies. With a threshold set at 25 meV above the convex hull, Sarmiento-Perez et al. 36 have a success rate of 1%, while Körbel et al. in ref. 37 consider a much larger set of compositions and achieve only 0.25% unreported stable compounds. We should also consider that the success rate of a random search is already biased by restricting calculations to a specific family of compounds. In fact, one usually selects a family of systems that looks intuitively promising to start a materials search. In ref. 30 , the success rate of systematic calculations of the whole dataset of around 250,000 perovskites is 0.25%, while the proposed machine learning procedure allows to increase the rate by a factor of 4-5.
Indeed, by combining chemistry intuition with a highthroughput approach, our method provides a remarkably efficient overview of large portions of the phase space of stable compounds, at a strongly reduced computational effort. Furthermore, we should not forget that most of the "unstable" transmuted compounds are rather close to the hull, and might therefore be interesting for further research.
To further characterize our set of stable systems, we plot, in Fig. 2, the number of materials that contain one specific chemical element. We see that most stable materials include oxygen. One reason is probably the large number of oxides in our starting set, although other elements are also present in large numbers. We would also like to emphasize the abundance of predicted materials with lanthanide and actinide atoms. These elements are often overlooked in systematic studies, but of great importance in many areas of science. For example, they are often components of permanent magnets 38 , or are relevant to understand which materials are formed upon nuclear decay of radioactive waste 39 . In our work, we found 8970 and 2437 stable compounds including lanthanides and actinides, respectively, and the corresponding success rates were 11.6% and 12.2%, respectively. If we exclude entirely these chemical elements, we have 96,543 transmuted structures and a total of 7421 stable compounds. This gives a success rate of 7.7% for compounds that do not contain either lanthanides or actinides. Thus, replacements involving lanthanides and actinides are more likely to yield stable compounds, but 7.7% is still a rather high success rate. In contrast, we note the relatively small number of stable materials containing Be, and transition metals of the groups IVB-VIIB. These elements seem therefore to be harder to combine and form stable compounds.
Now we turn to how the distribution of stable structures and how the success rate changes across the periodic table. The number of stable (N new ) and initial structures (N ini ) that contain a certain chemical element are showed in Supplementary Fig. 1. We also show in that figure the success rates for substitutions that involved that element.
It can be seen that the distribution of compounds follows to some extent the distribution of the initial structures. For example, there are many oxides in both sets, and N new is approximately proportional to N ini for most 3d transition metals. However, there are several exceptions, e.g. for Al, Si, K, Ga, As, Rb, Cs, lanthanides, and actinides. In some of these cases, there are many more compounds than expected. In contrast, for Mo and W, there are much fewer than expected. This shows that the distribution is not completely biased by the initial database. Furthermore, there is some variation of the success rate through the periodic table, but most elements have a success rate around 10%. However, there are indeed some elements that yield very high success rates, especially some lanthanides or actinides like Pm or Pa.

Band gap
The electronic band-gap is certainly one of the most important properties of materials, and it can be used to determine the suitability of a given compound for opto-electronic applications. We plot a histogram of the electronic (indirect) band-gap in Fig. 3 for our stable materials. These were calculated with the PBE approximation to the exchange-correlation functional and are, therefore, underestimated by around 45% on average 40 . We find a total of 4840 systems with a gap larger than 0.1 eV, which is 26.1% of the total number of our stable systems. We should also expect a number of false negatives of around 5-10%, i.e., around 250-500 systems are likely misidentified as metals due to the PBE approximation.
Not surprisingly, the histogram decays with a fat tail as a function of the band-gap. We also show the distribution of semiconductors and insulators through the periodic table in Fig. 4. The most common non-metallic elements in the list of stable semiconductors and insulators is O and F, followed by halogens and other chalcogens. As expected from the electronegativity scale 41 , the largest gaps are obtained for fluorides, followed by oxides and chlorides. There are fewer, and still thousands, systems with narrower gaps that include pnictogens and hydrogen. For metallic elements, the most common one found in semiconductors and insulators are the heavy alkali metals Cs, Rb, and K. In all these systems, the largest PBE gap we found was around 7.8 eV for a series of tetragonal ternary fluorides, namely LiLnF 4 , where Ln is a lanthanide (Tm, Dy, Ho, Tb, Er, Sm, Nd, Pr in order of decreasing band gap).

Magnetic properties
Another property we analyzed is the magnetic moment. In Fig. 5 we plot a histogram of the total magnetization per unit volume (in μ B ⋅ Å −3 ) for all our compounds in the convex hull. Before analyzing the results, we would like to stress that each calculations started from an initial ferromagnetic configuration of the spin moments, as common in other high-throughput studies 15,37 . Thus we very likely obtain ferromagnetic states for most magnetic compounds after optimization. However, the correct identification of the ferromagnetic, antiferromagnetic, or ferrimagnetic ground states is crucial for understand the spin interactions in each system. Unfortunately, this would require accurate energy calculations for large supercells, drastically increasing the computational effort. Therefore, in present work we adopt the usual setup of high-throughput studies, and leave the precise identification of the correct ground-states magnetic phases for future research. In any case, from the energetic point of view, this problem is harmless, because the differences of total energy between different magnetic phases are often of the order of the meV per atom 42 , while the stability of the composition is evaluated on an energy scale that is one or two orders of magnitude larger.
As expected, from Fig. 5 one can see that a large majority of the systems is not spin-polarized (note that the y-axis is truncated). In fact, the probability of finding a magnetic compound is only 22.6% (4187 systems out of 18,479), and with the number of systems decreasing rapidly with the total magnetization. We show the number of magnetic systems containing each given element of the periodic table in Fig. 6. The ten most represented metallic elements in these magnetic compounds are, in decreasing order, Pu, Eu, Gd, Mn, Fe, Np, Ge, Ce, Ni, and Co. These include, evidently, the actinides (Pu and Np), the lanthanides (Eu, Gd, and Ce), and the 3d transition metals (Mn, Fe, Ni, and Co).   The fact that Ge appears in this list is actually interesting. By looking closer at the composition of the magnetic compounds containing this chemical element, we found that 91% of Gecontaining magnetic compounds include at least one other element included in the top-10 list. Moreover, the remaining 9% compounds also contain other rare-earth or transition metals. A quick look at some specific materials in our list reveals that the magnetic moments are not localized on Ge, but on the other (magnetic) atoms. Therefore, the reason why Ge appears in the list is that Ge is likely to form stable compounds together with magnetic elements. This also implies that Ge compounds could be a promising search ground for experimentalists aiming at the synthesis of magnetic compounds.
The most common non-metallic elements found in this set are O, and F. Among all systems, the highest magnetization is around 0.2 μ B ⋅ Å −3 for a cubic structure of SnGd 3 , followed by several other Gd and Eu compounds, often in the inverted perovskite structure (such as NAlGd 3 , CGeGd 3 , CGaGd 3 , and CSnGd 3 ). Finally, the most common crystal phase is the cubic double-perovskite structure with 215 compounds, while magnetic systems were found in a total of 253 different prototypes.
Having looked at magnetic systems and semiconductors, it is natural to ask how many magnetic semiconductors are found in our dataset. If the two properties are completely uncorrelated, the probability of finding a system exhibiting both is given by the product of the individual probabilities, yielding 22.6% × 26.1% = 5.9%. The actual number of systems that we found was 884, yielding a probability of 4.8%. This is consistent with the two properties being uncorrelated. We also performed a similar analysis on the Materials Project database. The fractions of stable systems with a gap above 0.1 eV and of magnetic systems are 45.7% and 31.5%, respectively. This yields a combined probability of 14.4% to find magnetic semiconductors if the two properties are uncorrelated. The actual percentage of stable magnetic semiconductors in the database is 12.1%, which also supports the hypothesis of absence of correlations.
Among all semiconducting magnetic systems, the most common prototype that we found was again the cubic doubleperovskite (75 systems). We note that most magnetic semiconductors could be, in fact, antiferromagnetic. Moreover, usually the antiferromagnetic state has a larger gap than the ferromagnetic one. Therefore, those band gaps could be "doubly" underestimated-due to the PBE approximation and the misidentification of magnetic phases. This subset of 884 materials is however, quite interesting, as it can serve, e.g., as a starting basis for the discovery of unreported transparent ferromagnets or antiferromagnets with high critical temperatures.

Mechanical properties
Finally, we performed a preliminary analysis of the mechanical properties by evaluating the hardness. The calculation of the Vicker's hardness for the predicted structures was based on the simple model by Zhang et al. 43 This model extends the work of Šimůnek and Vackář 44,45 and improves the earlier hardness models 46 based on bond strength by applying the Laplacian matrix 47 to account for highly anisotropic and molecular systems. It turns out that laminar systems are correctly described as having low hardness, but this model still fails for some molecular crystals that are incorrectly assigned large values for the hardness. This is, however, not a big problem as these false-positive cases can be easily identified and discarded.
Most systems are found to be extremely soft, with only a handfull of materials being hard or superhard (hardness > 40 GPa). These, usually a combination of light covalent elements with transition metals, are shown in Table 2, together with their bulk and shear moduli (calculated with the PBE). We found that the oxides in this list have low elastic moduli, which implies that the simple model has likely overestimated their hardness. This anomalous behavior can be explained by the unusual oxidation states and bonding patterns present in these structures. One should keep in mind that the stability of these oxides is likely overestimated, as it has been shown in several references 48,49 The remaining systems do exhibit large values of the hardness and of the bulk and shear moduli, indicating that they are probably hard or even super-hard. This is particularly true for three compounds, namely VRu 2 Sn, CrGeRu 2 , and MnH 2 .

DISCUSSION
In this work, we combined fundamental knowledge of chemistry with high-throughput calculations to efficiently search for stable crystals. To this end, we replaced chemical elements in known stable substances by choosing substitutions with "similar" chemical elements. The elusive concept of similarity was quantified by a similarity scale obtained by data-mining experimental databases of crystal structures. The transmuted compounds were then studied with DFT, and their stability was evaluated with respect to the convex hull of stability. The stable compounds were in their turn transmuted, and this cycle was repeated three times. We obtained in total 18,479 stable crystal structures out of 189,981 substitutions, resulting in a success rate of about 10%, one order of magnitude larger than the usual one of highthroughput methods. This success rate shows not only the validity of our approach, but also its high efficiency, leading to a significant reduction of the computational costs. Our set of stable materials include elements from across the periodic table, from main group elements to transition metals, to lanthanides and actinides.
We also performed a preliminary analysis of the physical properties of these crystals. We obtained 4840 semiconductors, with band gaps (calculated with the PBE approximation) extending almost to 8 eV. These include not only many oxides and fluorides, but also semiconductors with other halogens, chalcogens, pnictogens, etc. We also identified 4187 magnetic systems with magnetizations extending up to 0.2 μ B ⋅ Å −3 . As expected, these mostly include some actinides, some lanthanides, and some 3d transition metals. Combining both properties, we filtered out 884 structures having non-zero gap and magnetic moments.
Finally, we evaluated the hardness of our materials, and found few possible hard and super-hard systems that deserve further attention.
All in all, this work shows that with a systematic help of common chemistry knowledge, one can greatly improve the output of high-throughput calculations for material prediction. Thanks to this iterative procedure of transmutation, we efficiently gain access to large unknown portions of the phase space of stable materials, that may be hiding key materials for future technologies.

METHODS Prediction strategy
The starting point of our search is a set of stable compounds, i.e. the (experimental or theoretical) crystal structures and compositions of a series of materials on the convex hull of stability. We obtained these structures from the materials project database 50 . For computational affordability, we limited crystal structures to a maximum of 12 atoms in the unit cell. The starting set is composed of 9524 compounds in 713 different prototype crystal structures. For each material in this set, we mutate the composition by replacing each chemical element by another "similar" element, if the probability of a successful replacement is higher than a certain threshold. We will see below how we define this probability. Note that only one element is replaced at a time, and that we do not perform partial substitutions, i.e. all atoms of a given element in the crystal structure are replaced simultaneously.
The outcome of this procedure is a set of hypothetical materials. We observe that it is impossible to perform systematic substitutions of all elements in known stable crystal structures, employing all other atoms of the periodic table. Assuming 84 atomic species, from H up to Bi, excluding noble gases and including Ac, Th, Pa, U, Np, and Pu, and considering 713 prototype crystal structures, we can build 59,892 elementary crystals, Fig. 7 Work flow. An illustration of a work flow for predicting stable materials based on substitution.
H.-C. Wang et al. almost 5 million binaries, 400 million ternaries, and 33 billion quaternary compositions. We can clearly see that we need to filter out the most unlikely substitutions and focus on the most promising ones.
At any iteration, we validate the set by performing a geometry optimization of the resulting structure with DFT, and calculating its distance to the convex hull of stability. This step is performed with PYMATGEN 51 , using all materials present in the Materials Project database 50 as reservoirs. All stable phases (with negative distances to the materials project convex hull) are then collected, and the construction of the convex hull is repeated including our structures. A new cycle of substitutions starts then for the stable compounds identified in the previous iteration. In total we performed three iterations of this kind, replacing always one chemical species per iteration. Thus, the prediction procedure is illustrated in Fig. 7.
Of course, the crucial part of this approach is the knowledge of the probability that replacing an element by another will yield a stable compound. We could just take advantage of the periodic table, and define this probability as the (geometrical) distance between the two elements in its usual two-dimensional representation. A couple of counter-examples show, however, that this is clearly not the ideal approach. For example, it turns out that H can be much more easily replaced by F and not by Li, or Ba can be replaced by Eu more often than by Cs.
One can certainly use for filtering empirical rules based, e.g., on ionic radii and oxidation states 27 . However, in the age of data-driven research, we have the option to let computer algorithms transform empirical chemical knowledge into a similarity scale between the chemical elements. Recently, by performing a statistical analysis of stable crystal phases present in the inorganic crystal structure database 52,53 , some of us determined such a scale 33 . The first step was the calculation of the likelihood that an element A can be replaced by another B in a given structure. This information was then used to construct a matrix where each entry (A, B) is a measure of this likelihood. To obtain a probability, every entry of this matrix has to be normalized in some way. This is a rather nontrivial step that is complicated by the fact that our knowledge of materials is unfortunately rather incomplete. Here, we used the quantity 33 where δ IJ AB ¼ 1 if materials I and J are both in the experimental database and are connected by the substitution of the chemical element A by B, and is 0 otherwise. The normalization factor (N A ) is the total number of materials including the given chemical element that are present in the database.
We also need a threshold value of the element replaceability, below which we do not consider as likely the corresponding element mutation. We set the threshold to a value that is a good compromise to keep affordable the total number of substituted compounds and to have at the same time a sufficient variety of substitution pairs. A threshold lower than 20% is necessary to include all substitutions within each group of the periodic table. This means that fixing this threshold to 20% would lead to include only "obvious" cases, while we would miss other less intuitive and less common substitutions. We therefore decided to favor a practical approach and include as many substitutions as possible, selecting the lowest threshold that our computational resources could reasonably support. We have to keep in mind that the number of substitution increases rapidly with the number of substitution pairs, because we have a large initial set of materials (see Table 3 and discussion in Section "Thermodynamic stability"). We chose a threshold value equal to 5% that gives 992 pairs (see List 1 in Supplementary Notes), a number that is approximately twice as large as the number of in-group substitution pairs.
A schema depicting the result of this procedure can be found in Fig. 8. To improve readability, we gathered the chemical elements in groups. There are a series of immediate conclusions we can draw from the figure.
First of all, with the chosen threshold, almost no first-row element can be replaced by any other element. In chemistry this is known as the first-row anomaly 54 , i.e., the small-core elements of the first row are in some sense special and are only vaguely similar to second-row elements. Second, many elements only accept replacements with elements within the same group of the periodic table. This is in particular true for the alkali metals, the halogens, etc. Third, we identify two main groups of metals in Fig. 8, one centered around the lanthanides and the other around Fe, Co, and Ni.
It is rather interesting that our threshold roughly divides the metals in two families. The subdivision is simply related to the geometry of the periodic table, namely family I includes the left side of the periodic table (groups 2-5, as well as the lanthanides and actinides), while family II contains the remaining groups (6)(7)(8)(9)(10)(11)(12)(13)(14)(15). Furthermore, we find no substitutions between group 5 and 6. This would indeed indicate that there seems to be a significant discontinuity in the periodic table. In fact, we can see some indications of this discontinuity by looking, for example, at the typical oxidation states that show from a monotonous increase from +2 (group 2), +3 (group 3), +4 (group 4), +5 (group 5) back to +3 and +4 in group 6. However, we emphasize that this analysis depends on our choice for the threshold, and that a more detailed investigation, using more powerful statistical tools, is required to achieve general conclusions.

DFT calculations
We used the code VASP 55,56 , where all parameters were set to guarantee compatibility with the data available in the Materials Project database 50 . We used the PAW 57 datasets of version 5.2 with a cutoff of 520 eV. The Brillouin zone was sampled by Γ-centered k-point grids with a uniform density calculated to yield 1000 points per reciprocal atom, i.e. the same kpoint density used by the Materials Project 58 . All energies were converged to better than 2 meV per atom and the geometry optimization was stopped when forces were smaller than 0.005 eV per Å. We used a denser k-point mesh of 5000 points per atom to calculate band structures. All calculations were performed with spin-polarization using the PBE 59 exchange-correlation functional, with the exception of oxides and fluorides containing Co, Cr, Fe, Mn, Mo, Ni, V, W, where an on-site Coulomb repulsive interaction U with a value of 3.32, 3.7, 5.3, 3.9, 4,38, 6.2, 3.25, and 6.2 eV, respectively, was added to correct the d-state (https://docs. materialsproject.org/methodology/gga-plus-u/#calibration-of-u-values).
A correction scheme which allows to mix GGA and GGA + U calculations to obtain the correct formation energy and distance to the convex hull is applied 60 .

DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request. All stable structures found in this work are available on https://tddft.org/bmg/data.php.