The rule of four: anomalous distributions in the stoichiometries of inorganic compounds

Why are materials with specific characteristics more abundant than others? This is a fundamental question in materials science and one that is traditionally difficult to tackle, given the vastness of compositional and configurational space. We highlight here the anomalous abundance of inorganic compounds whose primitive unit cell contains a number of atoms that is a multiple of four. This occurrence—named here the rule of four—has to our knowledge not previously been reported or studied. Here, we first highlight the rule’s existence, especially notable when restricting oneself to experimentally known compounds, and explore its possible relationship with established descriptors of crystal structures, from symmetries to energies. We then investigate this relative abundance by looking at structural descriptors, both of global (packing configurations) and local (the smooth overlap of atomic positions) nature. Contrary to intuition, the overabundance does not correlate with low-energy or high-symmetry structures; in fact, structures which obey the rule of four are characterized by low symmetries and loosely packed arrangements maximizing the free volume. We are able to correlate this abundance with local structural symmetries, and visualize the results using a hybrid supervised-unsupervised machine learning method.


I. INTRODUCTION
Computational materials discovery is a fast-growing discipline leading to innovation in many fields. Within a specific technological sector (e.g., communications, renewable energies, medical), the choice of material is critical for the long-lasting success of a given product. It is therefore important, and of fundamental interest, to efficiently identify the structural and energetic characteristics of materials through data analysis in order to select structures for innovative applications. The emerging field of materials informatics has demonstrated its potential as a springboard for materials development, alongside first-principles techniques such as density-functional theory (DFT) 1,2. The increase in computational power, together with large-scale experimental 3 and computational high-throughput studies 4, is paving the way for data-intensive, systematic approaches to classify materials' features and to screen for optimal experimental candidates. In addition, the collection of statistical methods offered by machine learning (ML) has accelerated these efforts, both within fundamental and applied research [5][6][7][8][9][10].
However, the success of these endeavors is ultimately limited by the quality and diversity of the underlying data. Understanding the space of materials spanned by a dataset is integral to data-driven materials searches or machine-learning workflows. Thus, when anomalous correlations arise in datasets, it is useful to understand and investigate the origins, and potential implications, of such peculiarities. We use here the name rule of four (RoF) to describe the unusually high relative abundance of structures with primitive unit cells containing a multiple of 4 atoms. This occurrence is explored within two different databases of inorganic crystal structures: the Materials Project (MP) 11 database, which contains crystal structures that have been relaxed with first-principles calculations starting from experimental databases or from structure-prediction methods, and the Materials Cloud 3-dimensional crystal structures 'source' database (MC3D-source) 12; the latter combines experimental structures from the Crystallography Open Database (COD) [13][14][15][16], the Inorganic Crystal Structure Database (ICSD) 17 and the Materials Platform for Data Science (MPDS) 1. Figure 1 is a visual representation of this striking abundance, while Table I demonstrates the RoF by comparing the relative abundance of structures with primitive unit cells made up of multiples of 3, 4, 5, 6 and 7 atoms.
TABLE I. Percentages of structures in the MP and MC3D-source databases whose primitive unit cells contain a number of atoms that is a multiple of the column header. The RoF emerges from the higher abundance of structures with a primitive unit cell containing a multiple of 4 atoms. Primitive unit cells with a number of atoms that is a multiple of two or more headers contribute to each of those columns; hence, the percentages sum to more than 100.
Within the context of this study, we label a structure whose primitive unit cell contains a multiple of four atoms as a magic structure, and any other structure as a non-magic structure. In Figure 1 the x axis is capped at 100 atoms to best represent the RoF, as 97.51% and 91.00% of structures in the MP and MC3D-source databases, respectively, contain 100 atoms or fewer (the largest cell in the MC3D-source database contains 4986 atoms).
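The bookkeeping behind Table I and the magic/non-magic labels can be sketched in a few lines. This is a minimal illustration, not the code used in the study: the function names and the toy list of atom counts are hypothetical, and only the definitions (a structure is magic when its primitive-cell atom count is a multiple of four; a count that is a multiple of several divisors contributes to each column) come from the text.

```python
def is_magic(n_atoms):
    """A structure is 'magic' if its primitive cell holds a multiple of 4 atoms."""
    return n_atoms % 4 == 0


def multiple_percentages(atom_counts, divisors=(3, 4, 5, 6, 7)):
    """Percentage of structures whose atom count is a multiple of each divisor.

    A count such as 12 is a multiple of 3, 4 and 6, so it contributes to all
    three columns; the percentages can therefore sum to more than 100.
    """
    total = len(atom_counts)
    return {
        d: 100.0 * sum(1 for n in atom_counts if n % d == 0) / total
        for d in divisors
    }


# Hypothetical toy data: number of atoms in ten primitive cells.
toy_counts = [4, 8, 12, 6, 5, 20, 24, 3, 16, 10]
percentages = multiple_percentages(toy_counts)
magic_fraction = sum(is_magic(n) for n in toy_counts) / len(toy_counts)
```

On this toy list the column percentages add up to 160, which illustrates why the columns of Table I need not sum to 100.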
Before delving into a more extensive analysis, we want to rule out that the RoF is simply an artifact of how structures are mathematically described, or of how this description is curated and processed for storage in the aforementioned databases. When materials structure datasets are prepared, it is standard procedure to 'primitivize' unit cells, i.e., to reduce the unit cell to its minimum volume. As many conventional unit cells contain exactly four times the number of atoms found in their respective primitive unit cell, misclassifying conventional unit cells as primitive ones could lead to an artificial emergence of the RoF. Both the MP and MC3D-source databases obtain the primitive unit cell using the spglib software 18. When primitivizing a structure, one needs to set the symprec tolerance parameter, which allows for slight deviations in the atomic positions stemming from thermal motion or experimental noise. To rule out primitivization as the source of the RoF, we show in Fig. 2 that changing the symprec parameter (from 1E-8 to 1E-1 Å) has little effect on the RoF distribution, converting around 1% of magic structures into non-magic ones. It is only when one increases symprec to unreasonably large values (close to 1 Å) that the slope changes. This is expected: such a large tolerance effectively treats sites occupied by the same element that should be distinct as identical, producing primitive unit cells with a reduced number of sites that no longer correctly describe the structure.
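To make the concern about conventional cells concrete, the sketch below uses the standard number of lattice points per conventional cell for each centring type (P: 1; A, B, C, I: 2; R in the hexagonal setting: 3; F: 4) to check when a conventional cell would look magic although its primitive cell is not. It is only an illustration of the argument; the function names are hypothetical and it does not reproduce the spglib-based check performed in the study.

```python
# Lattice points per conventional cell for each centring type
# (R refers to the hexagonal setting of a rhombohedral lattice).
LATTICE_POINTS = {"P": 1, "A": 2, "B": 2, "C": 2, "I": 2, "R": 3, "F": 4}


def primitive_atom_count(n_conventional, centring):
    """Number of atoms in the primitive cell, given the conventional cell."""
    factor = LATTICE_POINTS[centring]
    if n_conventional % factor != 0:
        raise ValueError("atom count is incompatible with the centring type")
    return n_conventional // factor


def spuriously_magic(n_conventional, centring):
    """True if the conventional cell holds a multiple of 4 atoms while the
    primitive cell does not, i.e. the structure would be mislabelled as
    magic if the conventional cell were mistaken for the primitive one."""
    n_primitive = primitive_atom_count(n_conventional, centring)
    return n_conventional % 4 == 0 and n_primitive % 4 != 0


# Rock salt: 8 atoms in the F-centred conventional cell, 2 in the primitive one.
rock_salt_mislabelled = spuriously_magic(8, "F")
```

Every F-centred conventional cell contains a multiple of four atoms by construction, which is why an unreduced cell would inflate the magic population.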
Encouraged by these results, we probe the RoF more deeply and attempt to understand its origins and impact. First, we examine the RoF with respect to traditional materials science metrics, including energies and symmetries, and find that the RoF is largely correlated with loosely packed polyatomic systems. We then use symmetry-adapted machine learning techniques to relate the RoF to local atomic environments and determine that it has few implications for energetic stability. Finally, we classify magic and non-magic structures by providing the algorithm only with information on local, rather than global, structural symmetry.
FIG. 2. Percentage of magic structures that become labelled non-magic as a function of the symmetry tolerance parameter used for reduction to the primitive cell. The black and green lines correspond to structures in the MP and MC3D-source datasets, respectively. At typical symmetrization parameters, there is little to no change in the number of magic structures (roughly 1% of magic structures become non-magic). At larger symmetrization parameters (≈ 1 Å), this increases to roughly 6%, owing to the large deviations allowed when considering sites as symmetrically equivalent.

II. RESULTS AND DISCUSSION
Within this study, we ensure that the data is sufficiently diverse for the training set to cover the whole design space 19 by procuring the structural data from open and FAIR repositories [20][21][22]; the same analytical workflow is applied to two different databases of bulk, crystalline, stoichiometric compounds. One database is the Materials Project, which contained 83 989 data entries obtained via high-throughput DFT calculations as of 10/18/2018, corresponding to the mp_all_20181018 dataset retrieved with the matminer.datasets module 23. The other data source, the MC3D-source, contains 79 854 unique structures extracted from the MPDS, ICSD and COD, which have been curated via an AiiDA 24 workflow, as explained in Section I of the SI.

A. Energetic stability
We first test whether the RoF is correlated with energetic stability, as this would provide a straightforward explanation for the phenomenon. To test this assumption, we analyze the information contained in the MP dataset, namely the formation energy per atom of each compound. This is the energy of the compound with respect to the standard states of its constituent elements, normalized per atom 2. It is computed at a temperature of 0 K and a pressure of 0 atm. This quantity is often a good approximation for the formation enthalpy at ambient conditions, where a negative formation energy implies stability with respect to the elemental phases.
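As a reminder of how this quantity is assembled, the following sketch computes a formation energy per atom from a total energy and per-atom elemental reference energies. The function name, the element labels "A" and "B" and all numerical values are hypothetical placeholders; the MP values used in this work are taken directly from the database and are not recomputed this way here.

```python
def formation_energy_per_atom(total_energy, composition, elemental_energies):
    """Formation energy per atom (same energy unit as the inputs).

    total_energy       -- energy of the cell containing `composition`
    composition        -- mapping element -> number of atoms in the cell
    elemental_energies -- mapping element -> energy per atom of the
                          elemental reference phase
    """
    n_atoms = sum(composition.values())
    reference = sum(n * elemental_energies[el] for el, n in composition.items())
    return (total_energy - reference) / n_atoms


# Hypothetical binary compound AB with made-up energies in eV.
e_form = formation_energy_per_atom(
    total_energy=-5.0,
    composition={"A": 1, "B": 1},
    elemental_energies={"A": -1.0, "B": -2.0},
)
```

In this made-up case the result is negative, which under the convention above would indicate stability with respect to the elemental phases.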
Our initial results provide no evidence of a correlation between magic compounds and their energetic stability, as shown in Figure 3. Nevertheless, structures obeying the RoF do appear to have a longer positive tail of large formation energies, seen towards the bottom right of the figure. However, this result can be misleading, as it does not take into consideration the large variance in structural composition across the database; we must instead compare the energies of similar structures within the magic and non-magic subsets, as we do in later sections.

B. Correlation with symmetry descriptors
The crystal symmetries of compounds, defined by the set of symmetry operations that leave the structure unchanged, are captured by their space groups and point groups. Higher symmetry space groups inherit the symmetry operations of their 'parent' point groups; for example, cubic space groups inherit the one-fold, two-fold, and four-fold rotational symmetries 3. Figure 4 shows histograms of inherited symmetries and their relative abundance within each of the two sets (magic in red and non-magic in blue). The point groups are ordered from those with the fewest symmetry operations (bottom) to those of highest order (top). Symmetry groups that are equally represented in both sets (e.g., the 1-fold rotation, since all compounds are invariant under the simplest symmetry) have tails of equal length, whereas symmetries seen in a larger percentage of magic structures have a red tail to the right of the histogram. Figure 4 reveals the relative abundance of non-magic structures in the high-symmetry point groups. In contrast, most magic structures in both databases are grouped in the lowest-symmetry point groups (2, m, 2/m, mm2, 222 and mmm), in which magic structures are generally overrepresented, with one exception (the MC3D-source presents a slightly higher relative abundance of non-magic structures in the mm2 point group). This analysis shows that 4-fold symmetry is not a determining descriptor for classifying the phenomenon.
The symmetrical 'disorder' (or higher asymmetry) characterising magic compounds may be caused by a more heterogeneous composition of atoms, as compared to non-magic compounds. We can quantify this heterogeneity by counting the number of atomic species ($N_{\mathrm{species}}$) composing the structures (first column of Figure 5): from this analysis we see that magic materials are mostly composed of 4 or more elements, while non-magic structures more often have simpler compositions containing 1, 2, or 3 elements. Another property that emerges from our analysis, and is more evident in the MP dataset (second column of Figure 5(b)), is the relative scarcity of smaller atomic radii within magic compounds, as often defined by the parameter $x = N_S / (N_S + N_L)$, where $N_S$ and $N_L$ are the counts, in a given structure, of the atoms with the smallest and largest radii, respectively. The scarcity of small radii in magic compounds (lower $x$ parameter) partly explains the lower symmetries that characterise them, as no atoms will easily be inserted as 'interstitial' elements in a given structure. On the other hand, the MC3D-source dataset also presents a peak at higher values of $x$, in which case the largest atoms are far fewer than the smallest ones. In this case, the smallest atoms might be seen as 'imperfections', lowering the overall structural symmetry of the point groups analyzed in Figure 4.
The low abundance of smaller radii in magic compounds would likely lower their overall crystal symmetry. In general, the symmetry type of atomic crystal systems is strictly linked to packing mechanisms 26,27. While the mathematical problem of sphere packing is not hard to pose (Kepler conjecture), it was historically difficult to prove 28, and the complexity of its solution rises exponentially with polydispersity 29. Despite this, a qualitative analysis of magic configurations shows that they contain chemical elements whose size variance is much greater than that of the non-magic population. This size variance is quantified by the parameter $\alpha = R_S / R_L$ (where $R_L$ is the radius of the largest atom and $R_S$ that of the smallest one), namely the ratio between the smallest and the largest atomic radii within each compound (third column of Figure 5). The MP dataset presents an abundance of magic structures with a smallest-to-largest ratio between 0.8 and 1; this feature characterises the lower variance in the elements that make up magic compounds. Magic compounds from the MC3D-source exhibit a greater standard deviation between largest and smallest atoms, with the α parameter presenting a peak at around 0.35; this finding suggests the presence of very small radii filling the interstitial spaces, which contribute to keeping the symmetry of magic compounds low. The packing fraction (PF), defined as $\mathrm{PF} = V_{\mathrm{tot,atoms}} / V_{\mathrm{cell}}$, is another related property of sphere packing. This quantity is noticeably lower (with peaks at values around 0.1-0.2) for magic structures, as can be seen in the last column of Figures 5(a) and (b), pointing away from packing arguments as the cause of this database anomaly 26,27 which have been determined to be outliers for the MC3D-source dataset. This aligns with the thesis of Hopkins 30, namely that entropic (free-volume maximizing) particle interactions contribute to the structural diversity of mechanically stable and ground-state structures of atomic, molecular, and granular solids.
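The three packing descriptors discussed above (x, α and PF) can be sketched as follows, treating each atom as a hard sphere. This is an illustrative reading of the definitions rather than the implementation used for Figure 5: the function name and the toy radii are hypothetical, and the choice of atomic radii (which the text does not specify here) is left to the caller.

```python
import math


def packing_descriptors(radii, cell_volume):
    """Hard-sphere packing descriptors for one unit cell.

    radii       -- one atomic radius per atom in the cell (length unit L)
    cell_volume -- unit-cell volume (L**3)
    Returns alpha = R_S / R_L, x = N_S / (N_S + N_L) and the packing
    fraction PF = V_tot,atoms / V_cell.
    """
    r_small, r_large = min(radii), max(radii)
    n_small = sum(1 for r in radii if r == r_small)
    n_large = sum(1 for r in radii if r == r_large)
    v_atoms = sum(4.0 / 3.0 * math.pi * r**3 for r in radii)
    return {
        "alpha": r_small / r_large,
        "x": n_small / (n_small + n_large),
        "PF": v_atoms / cell_volume,
    }


# Sanity check: FCC with touching spheres, conventional cell of edge a = 1,
# four atoms of radius a * sqrt(2) / 4.
fcc = packing_descriptors([math.sqrt(2) / 4] * 4, cell_volume=1.0)

# Hypothetical binary cell: two large atoms and one small atom.
binary = packing_descriptors([1.0, 1.0, 0.5], cell_volume=20.0)
```

For a unary compound this literal reading gives α = 1 and x = 0.5, since the smallest and largest radii coincide; the FCC case recovers the familiar close-packing value of about 0.74.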

C. Employing symmetry-adapted descriptors for further insight
Up until this point we have employed classical techniques for analyzing crystal structures; here, we aim to understand the RoF using modern data-driven techniques. In the field of atomistic modeling, it has been common, albeit nontrivial, to represent crystal structures through symmetrized density correlations 9,31,32 in order to predict broad swaths of materials properties. Here, we represent the compounds using the Smooth Overlap of Atomic Positions (SOAP) 31, a popular ML representation for structure-energy relations that contains information on the average three-body local environment for atomic arrangements. SOAP vectors provide an avenue for a statistical analysis of local environments, offering a robust framework through which we can explore and visualize the chemical and configuration space of the materials studied 33. We use two parameterizations of SOAP vectors, detailed in Section II of the SI: one that uses separate channels to represent different chemical species and another that ignores the chemical identities in order to highlight the geometry of the local symmetry. The former, hereafter called the species-tagged representation, is necessary in energetic analysis, as similar geometries can correspond to wildly different energetics depending on the elements present; however, this representation is computationally cumbersome (roughly 100 000 sparse features for each compound, from which we take a diverse subset of 2 000 features). Thus, in later analyses where the chemical identities play a smaller role, it is beneficial and conceptually more straightforward to use the latter, more lightweight representation (roughly 80 features for each compound), hereafter called the species-invariant representation.
Earlier, we noted that simply presenting a histogram of magic and non-magic energetics did not provide any specific understanding of the RoF; it might be more insightful to compare the energies of chemically similar structures. To determine whether the magic structures exhibit lower energy than structurally similar non-magic ones, we use Principal Covariates Regression (PCovR) 34,35, as implemented in the scikit-matter library 36,37. PCovR is an ML method that constructs a latent-space projection by expanding regression models to incorporate information on the structure of the input data, which allows us to explore the correlation between stability and local symmetries within the dataset. In this mixing model, the projection is weighted towards the property of interest using a mixing parameter (of which a more extensive explanation is given in Section III of the SI), and, where the input linearly correlates with the target property, the resulting embedding will reflect this property along the first component, with subsequent components representing orthogonal dimensions in structure space. In our case the PCovR is always trained on the species-tagged SOAP vectors and their formation energies. We plot the first two principal covariates in Figure 6. The first principal covariate is strongly correlated with the energetic descriptor, as can be seen in Figure 6, where in the lower plots we have colored each point in the projection by its magic classification (left) and formation energy (right). However, the second covariate (and all significant subsequent covariates, see Section IV of the SI) fails to separate the dataset into two distinct populations corresponding to this phenomenon. This implies that for structurally similar compounds, there is no significant difference in energy between magic and non-magic samples. We also see little difference in the spread of magic versus non-magic structures in the latent space, as shown by the kernel density probability map in the upper panel of Figure 6. Further principal covariates for the same PCovR representation are plotted in Section IV of the SI, as well as other relevant energetic descriptors (the energy above the convex hull, i.e., the envelope connecting the lowest energy compounds in the chemical space, and the band gap energy), in order to show that these targets yield similar results. Thus the RoF is not correlated with the energetics, nor are magic structures lower in formation energy than chemically similar non-magic ones.
The linear correlation between the average local symmetries and the RoF is not particularly strong (a logistic regression on the SOAP vectors results in an $R^2$ on the order of 0.6, as listed in Table 1 of the SI); thus, we turn to non-linear classifications to understand if the RoF is potentially correlated with these local neighborhoods. We ignore the species information to focus solely on the average local symmetries. We build a Random Forest (RF) classification 38 on both datasets, first varying the interaction cutoff that defines the local environment (see Fig. 7). We see a plateau in accuracy at 87% as we consider local environments of 4.0 Å, suggesting that the differentiating local symmetries occur within the first two neighbor shells; this is also supported by the high false positive (FP) rate at small cutoff radii. From the learning curve on the 4.0 Å descriptors (inset), we see that the classification has a positive learning rate, although it shows little saturation despite the large training set. This result implies that local features are sufficient for the ML model to pick up the complexity of the datasets and to predict the correct classification with good probability. We report the accuracy achieved by other classification algorithms in Section V of the SI.
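For readers less familiar with the quantities quoted here and tabulated below Fig. 7, the sketch that follows shows how accuracy and the false positive rate derive from the counts of true/false positives and negatives, with "positive" meaning "classified as magic". The function name and the counts are hypothetical and are not the values obtained in this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Summary metrics from confusion-matrix counts (positive = magic)."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,          # fraction of correct labels
        "false_positive_rate": fp / (fp + tn),  # non-magic labelled as magic
        "true_positive_rate": tp / (tp + fn),   # magic correctly recovered
    }


# Hypothetical counts for a test set of 100 structures.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

A classifier that labels almost everything as magic can still reach a deceptively high accuracy on an imbalanced set, which is why the false positive rate is reported alongside it.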

III. CONCLUSIONS
Through an extensive investigation, in this work we highlight and analyze for the first time the anomalous abundance of inorganic compounds whose primitive unit cell contains a number of atoms that is a multiple of four, a property that we name the rule of four (RoF), observed in both experimental and DFT-generated structure databases. Here, we:
• highlight the rule's existence, especially notable when restricting oneself to mostly experimentally known compounds;
• explore its possible relationship with established energetic descriptors, namely formation energies, and utilise hybrid ML methods combining regression and principal component analysis to rule out, surprisingly, the possibility that the relative abundance has the (expected) effect of stabilising compounds, bringing them to a lower energy state;
• conclude, through a global structural composition analysis of point groups and packing fractions, that the overabundance does not correlate with high-symmetry structures either, but rather with low symmetries and loosely packed arrangements maximising the free volume;
• predict, with an accuracy of 87%, whether a compound obeys the rule of four by providing a random forest classification algorithm with local structural descriptors (the smooth overlap of atomic positions) only, highlighting the importance of local rather than global symmetry for the emergence of the rule of four.
This analysis constitutes a valuable reference for further systematic studies targeting the classification of materials' features with novel ML approaches in order to screen for optimal experimental candidates.

IV. METHODS

A. Reduction to the primitive cell
All the structures in both databases are reduced to the primitive cell using the find_primitive function of the spglib 18 package, varying the symprec value in the range of 1E-8 to 1 Å.

B. Scalar global descriptors
The symmetry of compounds is investigated by looking at space groups and point groups. The point group of a given space group is the subgroup of symmetry operations over which the space group is invariant. With a total of 32 point groups, it is easier to convey the symmetry properties of the vast variety of compounds via their point groups rather than their space groups; while space groups uniquely identify geometric properties, point groups identify symmetry classes and reduce the parameter space when investigating the symmetries of all compounds. The point groups are calculated through the spglib 18 and seekpath 39 packages for the MC3D-source database, while we used the SymmetryAnalyzer pymatgen module, which also relies on the spglib package developed by Togo and Tanaka 18, to find the symmetry operators and point groups for the MP dataset. As concerns packing mechanisms, we extend the conventions employed by Hopkins 30 to n-element packing and employ the α, PF and x parameters. In structures with FCC and HCP symmetry, the maximum packing fraction is 0.74. A value of α = 1 denotes a unary compound. Conversely, when α ∼ 0 the compounds contain elements whose radii distribution presents a wider spread.

C. Local symmetry descriptors and ML pipeline
We adopt the following ML pipeline to study local symmetries and energetic effects. First, the atomic representation of each compound is obtained with SOAP vectors (see Section II of the SI), computed with the librascal library 33. The SOAP features are then averaged within each compound, and the representations from the two datasets are normalised simultaneously. We then select a diverse subset of 2 000 features through the Furthest Point Sampling (FPS) algorithm 36,37,40,41 (see Section II of the SI), efficiently reducing the dataset size without losing important information. For Sec. II C, we perform a linear ridge regression with 4-fold cross-validation, which optimises the regularisation parameter to prevent overfitting, on the formation energies retrieved from the MP database to ascertain the accuracy of the model. Table II illustrates the RMSE and the accuracy, in units of eV, of the predicted energetic quantities. Compared to results in the literature, which achieve an accuracy in formation energy prediction of 0.173 eV (Automatminer 42) and 0.0332 eV (Crystal Graph Convolutional Neural Networks 43), the accuracy of 0.4002 eV is sufficient for this study, since our aim is not to find the most efficient way to predict energies, but rather to provide a regression of sufficient quality to employ in the PCovR analysis. We use the species-invariant SOAP vectors to classify the RoF phenomenon using scikit-learn's 44 RandomForestClassifier algorithm 45, which accepts binary labels as target properties (magic or non-magic) and outputs a probability between 0 and 1 for each compound to fall into the magic subset. The training and testing sets constitute 90% and 10% of the whole dataset, respectively. Our random forest classification comprises 100 random decision trees. This classifier performs better in our case than Support Vector Machine (SVM) and Logistic Regression (LR) classifiers, signifying a need for a stochastic model.
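The greedy idea behind the FPS step in this pipeline can be sketched in plain Python. This is a generic illustration and not the scikit-matter implementation used in the study: the function name and the toy points are hypothetical, and to select features rather than samples one would pass the columns of the feature matrix as the "points".

```python
import math


def farthest_point_sampling(points, n_select, start=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen.

    points   -- list of equal-length coordinate tuples
    n_select -- number of points to select (at most len(points))
    start    -- index of the first selected point (an arbitrary choice)
    Returns the list of selected indices in order of selection.
    """
    if n_select > len(points):
        raise ValueError("cannot select more points than are available")
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < n_select:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


# Hypothetical 2D points: two tight pairs and one distant outlier.
toy_points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.9, 0.1), (0.5, 2.0)]
chosen = farthest_point_sampling(toy_points, n_select=3)
```

On the toy data the selection skips the near-duplicate of the starting point and picks the outlier first, which is the behaviour that makes FPS useful for extracting a diverse subset.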

V. DATA AVAILABILITY
The full dataset employed for the analysis can be downloaded from the Materials Cloud Archive 46, where the MC3D-source data is only provided in SOAP format, as the experimental structures cannot be released due to licensing constraints. Instead, we provide the full list of structure IDs for each database, including the version of the database at the time of extraction. The DFT-relaxed counterpart of the MC3D-source is available at the following address (https://archive.materialscloud.org/record/2022.38).

VI. CODE AVAILABILITY
The codes to reproduce the results and figures can be found in the following repository (https://github.com/epfl-theos/r4-project). As the MC3D-source structure data cannot be made publicly available due to licensing constraints, the repository contains example data from a reduced random subset of the publicly available MP dataset in order to test run a preliminary analysis.

Supporting Information. The rule of four: anomalous stoichiometries of inorganic compounds
Elena Gazzarrini

I. Materials Databases
In this section we introduce the two datasets employed in the study and explain how the raw data is obtained.
The Materials Cloud 3-dimensional crystals Database - MC3D
The MC3D [1] is a database of structures optimized with the Quantum ESPRESSO code [2, 3] using fully-automated workflows developed in AiiDA [4, 5]. The starting set of structures for the geometry optimization is obtained from the COD [6], the ICSD [7] and the MPDS [8] databases. Each CIF file is parsed via an AiiDA workflow that removes unnecessary tags, performs minor corrections to the syntax, and parses the contents to extract the corresponding structure. The parsed structures are subsequently normalized and primitivized using SeeK-path [9], and a uniqueness analysis is performed to remove duplicate structures. Finally, hydrogen-containing structures from the COD are removed due to the prevalence of molecular crystals in this database, and any structure containing an actinide is also excluded from the database. The resulting 79 854 structures before geometry optimization are labelled as MC3D-source and used for the analysis in this paper. In this early version of MC3D-source, most (63 093) of the structures came from the MPDS, 13 798 were obtained from the ICSD and 2 963 from the COD. Although the vast majority of the structures in the MC3D-source are experimental, some of the structures extracted from the ICSD and COD were found to be flagged as theoretical, i.e., hypothesized in a theoretical study instead of being observed experimentally. Screening the metadata for these flags, we find 3 071 theoretical structures, i.e., approximately 3.85% of the full structure set. Due to licensing constraints, we are not allowed to publish the full MC3D-source structure set. Instead, we provide a YAML file on the Materials Cloud archive [10] called MC3D ids.yaml that contains the list of versions and IDs for each structure extracted from the three databases.

Materials Project -MP
The Materials Project (MP) [11] dataset used contains a total of 83 989 bulk, crystalline, inorganic compounds that have been relaxed with first-principles calculations starting from experimental databases or from structure-prediction methods. It is retrieved through the Matminer [12] Python library. The version of the database employed in the study dates back to 10/18/2018, corresponding to the mp_all_20181018 dataset retrieved with the matminer.datasets module [13].

III. PCovR: tuning the mixing parameter
Principal Covariates Regression (PCovR) [14] combines the losses of Linear Ridge Regression (LRR) and Principal Component Analysis (PCA) through the mixing parameter β. The feature matrix, which embeds the reduced SOAP representation, is projected into latent space with an orthogonal projection. Finding the optimal projection to the latent space amounts to minimizing the loss, which happens when the projection is built out of the principal eigenvectors of the covariance matrix of the initial feature matrix.
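For orientation, the combined loss can be written in the form commonly used in the PCovR literature. The notation below is an assumption on our part rather than taken from this document: X is the (standardized) feature matrix, Y the target property, T = X P_{XT} the latent-space projection, P_{TX} and P_{TY} the maps from latent space back to features and to properties, and we assume the convention in which β = 1 recovers PCA and β = 0 recovers pure regression.

```latex
\ell(\beta) \;=\;
  \beta \,\bigl\lVert \mathbf{X} - \mathbf{X}\mathbf{P}_{XT}\mathbf{P}_{TX} \bigr\rVert^{2}
  \;+\;
  (1-\beta)\,\bigl\lVert \mathbf{Y} - \mathbf{X}\mathbf{P}_{XT}\mathbf{P}_{TY} \bigr\rVert^{2}
```

The first term is the PCA-like reconstruction error of the features and the second is the regression error on the target, so intermediate values of β trade off how faithfully the projection preserves structural variance against how well it aligns with the property of interest.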

IV. Energetic analysis with PCovR
The following section explores the PCovR energetic analysis performed on the MP dataset with the aim of classifying the structures into the two subgroups by performing a linear regression on local energetic descriptors only. Different covariates are plotted against the first principal covariate (on the x axis each time) to explore the full database variance. Each image is reported in two different views: on the left, the compounds are coloured according to their energetic property, i.e., formation energy per atom (Figure 2), energy above the convex hull (Figure 3) and band gap energy (Figure 4), while on the right the same data is coloured according to the subset it belongs to, using a kernel density probability estimation (KDE) normalised to the whole set of data. The isolated areas containing only magic structures mostly contain structures with Mg-O square bonds, ionic bonds with high bond energy and therefore lower formation energy per atom. They validate the usefulness of the SOAP representation in separating compounds into subgroups, but are not enough to draw insightful conclusions on the RoF.

FIG. 1. The rule of four. The two datasets (the Materials Project (MP) 11 and the Materials Cloud 3-dimensional crystal structures 'source' database (MC3D-source) 12) contain a disproportionate number of compounds with a primitive unit cell containing multiples of 4 atoms.

FIG. 3. Probability distribution of formation energies for the 83 989 compounds from the Materials Project, normalized for each subgroup. Magic compounds are colored in red and non-magic ones in blue.

FIG. 4. Proportion of structures in both databases ([a] MC3D-source and [b] MP) that belong to each point group represented on the y axis, counted based on their inherited symmetries.Magic compounds are coloured in red, while non-magic ones in blue.

FIG. 5. Different geometric properties of each compound are analysed for the (a) MC3D-source and (b) MP databases. From left to right, the plots represent the distribution of the number of elemental species ($N_{\mathrm{species}}$), the relative abundance of small ($N_S$) to large ($N_L$) radii ($x$), the ratio between smallest ($R_S$) and largest ($R_L$) atomic radii (α) and the packing fraction (PF) for compounds with a unit cell size between 0 and 100 atoms. All of the results are plotted for the two sets, magic (red) and non-magic (blue), with the probability normalized to each set.

FIG. 6. PCovR representation of the MP dataset with a mixing parameter of β=0.5.The model is regressed on the formation energy per atom.The three plots contain the same data, represented on the top through a kernel density probability distribution (the magic subset is coloured in red and the non-magic one in blue), coloured according to the subset classification (lower left) and according to the formation energy per atom (lower right).

FIG. 7. Random forest classification on local symmetries. Here we use the species-invariant 3-body SOAP vectors to build a random forest ensemble classifier. Accuracy saturates at approximately 4.0 Å, with little additional gain at larger cutoff radii. Below the figure we show the table of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) results, showing that the classifier is unable to differentiate magic and non-magic structures at lower cutoff radii, leading to a high false positive (FP) rate. Inset in the upper figure is a learning curve for a cutoff radius of 4.0 Å, which shows a positive learning rate, albeit no saturation, an indication that secondary effects beyond the local environments play a role (or, less likely, that the dataset is not sufficiently large).

Figure 1: The combination of LR (far left) and PCA (far right) in the PCovR analysis on the MP database. The resulting projections and regressions are shown at the indicated β values. Magic compounds are coloured in red, while non-magic ones in blue.

Figure 2: PCovR representation with β = 0.5 containing information on SOAP representation and regressed on the formation energy per atom.

Figure 3: PCovR representation with β = 0.5 containing information on SOAP representation and regressed on the energy above the convex hull.

Figure 4: PCovR representation with β = 0.5 containing information on SOAP representation and regressed on the band gap energy.

TABLE II. RMSE and uncertainty, in units of eV, on the predicted energetic quantities for the MP database. The ML algorithm is a LRR with a 4-fold cross-validation. We report the formation energy per atom, the energy above the convex hull and the band gap energy.