Learning from data to design functional materials without inversion symmetry

Accelerating the search for functional materials is a challenging problem. Here we develop an informatics-guided ab initio approach to accelerate the design and discovery of noncentrosymmetric materials. The workflow integrates group theory, informatics and density-functional theory to uncover design guidelines for predicting noncentrosymmetric compounds, which we apply to layered Ruddlesden-Popper oxides. Group theory identifies how configurations of oxygen octahedral rotation patterns, ordered cation arrangements and their interplay break inversion symmetry, while informatics tools learn from available data to select candidate compositions that fulfil the group-theoretical postulates. Our key outcome is the identification of 242 compositions after screening ∼3,200 that show potential for noncentrosymmetric structures, a 25-fold increase in the projected number of known noncentrosymmetric Ruddlesden-Popper oxides. We validate our predictions for 19 compounds using phonon calculations, among which 17 have noncentrosymmetric ground states including two potential multiferroics. Our approach enables rational design of materials with targeted crystal symmetries and functionalities.

The bracketed numbers at each leaf node give the total number of RP compositions that reach the leaf. Where two numbers appear, the first is the total number of compositions reaching the leaf node and the second is the number of misclassified compositions in that leaf node.

We used PCA to reduce the dimensionality of the data from 22 to 8 column vectors while still capturing >90% of the variation in the data. Each PC is a linear combination of the weighted contributions of the orbital radii, and all PCs are shown in Supplementary Figure 2. We now turn to the decision tree shown in Supplementary Figure 7 and follow the path PC1 ≤ −2.6796 AND PC2 ≤ −0.1335 AND PC5 ≤ 0.152 → X₃⁺(η₁, η₂) in the leaf node. From Supplementary Figure 2, the following orbital radii are identified as important for predicting the irrep X₃⁺(η₁, η₂) with our decision tree, because their weighted contributions are relatively larger than those of the other orbital radii:
• PC1: A-5p, A-6s, A-4f, B-4s, B-3d, B-5s and B-4d
• PC2: A-2p, A-3s, B-6s, B-5d, B-4p, B-5s and B-4d
• PC5: A-4p, A-5s, A-4d, A-5p, A-6s, B-4s, B-3d, B-5s and B-4d

Projected density of states (PDOS) from DFT calculations for RP compounds with X₃⁺(η₁, η₂) octahedral distortions in the ground state would allow us to validate this finding. Examining changes in orbital bandwidths and shifts in their centers of mass would provide the insight needed to describe the stability of a crystal structure (or of its distortions). Thus, one can potentially extract physical meaning from PCA and decision trees. We do not carry out the electronic structure calculations here, because we expect the decision trees to evolve as more compounds are validated and fed back to retrain our models.
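The decision-tree path quoted above can be read as a simple conjunction of threshold tests on PC scores. The following is a minimal sketch of that reading, not the J48 model itself: the thresholds are copied from the text, while the function names, the example PC scores, the toy loadings and the 50% cutoff used to flag "relatively larger" contributions are all hypothetical.

```python
# Thresholds copied from the decision-tree path quoted in the text.
PATH = [("PC1", -2.6796), ("PC2", -0.1335), ("PC5", 0.152)]
LEAF_LABEL = "X3+(eta1, eta2)"


def follows_path(scores):
    """Return True if every PC score satisfies its '<=' threshold on the path."""
    return all(scores[pc] <= threshold for pc, threshold in PATH)


def dominant_features(loadings, fraction=0.5):
    """Names whose |loading| is at least `fraction` of the largest |loading|.

    The cutoff is a hypothetical choice; the text only says 'relatively larger'.
    """
    largest = max(abs(w) for w in loadings.values())
    return sorted(name for name, w in loadings.items() if abs(w) >= fraction * largest)


# Hypothetical PC scores for two made-up compositions (not from the dataset).
candidate_a = {"PC1": -3.10, "PC2": -0.50, "PC5": 0.10}
candidate_b = {"PC1": -3.10, "PC2": 0.20, "PC5": 0.10}

# Hypothetical loadings for a single PC (illustrative values only).
toy_loadings = {"A-5p": 0.42, "A-6s": -0.38, "A-2p": 0.05, "B-3d": 0.30}

for name, scores in [("a", candidate_a), ("b", candidate_b)]:
    label = LEAF_LABEL if follows_path(scores) else "other leaf"
    print(name, label)
print(dominant_features(toy_loadings))
```

In this sketch a composition reaches the X₃⁺(η₁, η₂) leaf only if all three tests pass, and the absolute value of each loading is used so that large negative contributions count as important too.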
Confusion matrix data for the five decision trees based on 10-fold cross-validation. Rows represent observed (true) irrep labels and columns indicate the output of the decision tree classifier. Diagonal elements give the number of compositions for which the true label and the classifier output agree.

Synthetic minority oversampling (SMOTE) was performed using WEKA for the two irrep labels P₄ and X₃⁺(η₁, η₂), using three nearest neighbors (k) and the default random seed. Three and six synthetic data points were added for the P₄ and X₃⁺(η₁, η₂) labels, respectively. The five bootstrapped samples for classification learning were generated using the function sample() in R. To generate the random samples, we passed the following five arguments to the set.seed() function in R: 877, 963, 837, 212 and 505. Default metaparameters were used for J48 decision tree induction.
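To illustrate the oversampling step, the sketch below shows the core SMOTE idea: each synthetic point is placed at a random position on the line segment between a minority-class sample and one of its k nearest minority-class neighbors. It is a simplified stand-in, not the WEKA filter used in this work; the function names, the seed and the toy two-dimensional data are assumptions, and WEKA's default seed is not reproduced.

```python
import math
import random


def k_nearest(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean distance)."""
    return sorted(others, key=lambda q: math.dist(point, q))[:k]


def smote(minority, n_synthetic, k=3, seed=0):
    """Generate synthetic minority samples by interpolating toward neighbours."""
    rng = random.Random(seed)  # arbitrary seed, chosen only for repeatability
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbour = rng.choice(k_nearest(base, others, k))
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy 2-D minority class (hypothetical values, not orbital radii from the dataset).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_synthetic=3, k=3)
print(new_points)
```

Because every synthetic point is a convex combination of two existing minority samples, it always lies within the region already spanned by that class, which is why the method augments rather than extrapolates the minority labels.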
All potential NCS chemical compositions predicted from the five decision trees are given separately in an Excel sheet that can be downloaded from figshare (ref. 1). The chemical IDs in the Excel sheet should be cross-referenced with Supplementary Table 2 to identify the exact chemical composition. The starting dataset, with 69 RP chemical compositions, 22 orbital radii features and the irrep label, is also available on figshare (ref. 1). The full dataset of 3,253 chemical compositions, comprising the 69 original compositions, 9 from SMOTE and 3,175 virtual compounds, is given in the Excel sheet that can be downloaded from figshare (ref. 1). The data for experimentally known RP compounds were collected and organized by surveying the literature.

The decomposition reaction pathways for the RP compositions explored in this work were obtained from the Grand Canonical Linear Programming (GCLP) method as implemented in the Open Quantum Materials Database (OQMD). For all NaRSnO₄, where R = La, Pr, Nd, Gd and Y, the ground-state P4̄2₁m space group was used to compute the total energies from DFT. For NaLaRuO₄, NaPrRuO₄ and NaNdRuO₄, we considered the P4̄2₁m space group with ferromagnetic spin order. For NaGdRuO₄ and NaYRuO₄, on the other hand, we considered the Pca2₁ space group with ferromagnetic spin order. For the RP compound Ca₂IrO₄, we considered both the theoretical ground-state (Pbca) and high-symmetry (I4/mmm) structures. For the remaining RP compounds, the ground-state structures given in Supplementary Table 3 were used. We also note that RP BaLaGaO₄ lies 23.7 meV/atom above the convex hull relative to another compound with the identical chemical formula but a different atomic arrangement (it contains GaO₄ tetrahedral units and its crystal structure belongs to the P2₁2₁2₁ space group; see Supplementary Figure 12). Thus, non-equilibrium synthesis techniques may be required to stabilize the RP phase in BaLaGaO₄.
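As a worked illustration of how a decomposition energy is assembled from total energies, the sketch below uses the competing phases named for KBaNbO₄ in the decomposition-reaction discussion (K₂O, Ba₃Nb₂O₈ and KNbO₃). The 1/3 coefficients were obtained here by balancing the elements and are our assumption about the reaction; the sign convention (compound minus competing phases, so a negative value favors the compound) is likewise assumed, and all energies are placeholders rather than DFT results.

```python
from fractions import Fraction

# Element counts per formula unit.
KBaNbO4 = {"K": 1, "Ba": 1, "Nb": 1, "O": 4}
PRODUCTS = {
    "K2O": {"K": 2, "O": 1},
    "Ba3Nb2O8": {"Ba": 3, "Nb": 2, "O": 8},
    "KNbO3": {"K": 1, "Nb": 1, "O": 3},
}
# Coefficients found by balancing the elements by hand (an assumption).
COEFFS = {name: Fraction(1, 3) for name in PRODUCTS}


def is_balanced(reactant, products, coeffs):
    """Check that every element is conserved across the reaction."""
    totals = {}
    for name, composition in products.items():
        for element, count in composition.items():
            totals[element] = totals.get(element, 0) + coeffs[name] * count
    return totals == reactant


def decomposition_energy(e_reactant, e_products, coeffs, atoms_per_formula_unit):
    """Energy of the compound minus that of the competing phases, in eV/atom."""
    e_competing = sum(float(coeffs[name]) * e_products[name] for name in coeffs)
    return (e_reactant - e_competing) / atoms_per_formula_unit


# Placeholder total energies in eV per formula unit (NOT DFT results).
toy_e_products = {"K2O": -10.0, "Ba3Nb2O8": -90.0, "KNbO3": -38.0}
toy_e_reactant = -46.7

print(is_balanced(KBaNbO4, PRODUCTS, COEFFS))
print(decomposition_energy(toy_e_reactant, toy_e_products, COEFFS, 7))
```

Exact fractions are used for the balance check so that element conservation is tested without floating-point round-off; the division by 7 converts the energy per KBaNbO₄ formula unit to an energy per atom.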

Decomposition Reactions from OQMD:
In the main manuscript, we provide the decomposition energy (∆E_D) data for KBaNbO₄ in Table 7. We calculated ∆E_D to be −832 meV/atom, which appears too low relative to the other compounds in Table 7. To test the reliability of this value, we performed additional calculations using a different set of pseudopotentials (GBRV ultrasoft PBEsol pseudopotentials, ref. 27) but the same crystal structures (P2₁, Fm3̄m, R3m and P4mm for KBaNbO₄, K₂O, Ba₃Nb₂O₈ and KNbO₃, respectively) and the same decomposition reaction. We then fully relaxed the structures using the