Abstract
Compositional disorder induces myriad captivating phenomena in perovskites. Target-driven discovery of perovskite solid solutions has been a great challenge due to the analytical complexity introduced by disorder. Here, we demonstrate that an unsupervised deep learning strategy can find fingerprints of disordered materials that embed perovskite formability and underlying crystal structure information by learning only from the chemical composition, manifested in \(({{\rm{A}}}_{1-{\rm{x}}}{{\rm{A}}^{\prime} }_{{\rm{x}}}){{\rm{BO}}}_{3}\) and \({\rm{A}}({{\rm{B}}}_{1-{\rm{x}}}{{\rm{B}}^{\prime} }_{{\rm{x}}}){{\rm{O}}}_{3}\) formulae. This phenomenon can be capitalized to predict the crystal symmetry of experimental compositions, outperforming several supervised machine learning (ML) algorithms. The educated nature of material fingerprints has led to the conception of analogical materials discovery that facilitates inverse exploration of promising perovskites based on similarity investigation with known materials. The search space of unstudied perovskites is screened from ~600,000 feasible compounds using experimental data powered ML models and automated web mining tools at a 94% success rate. This concept further provides insights on possible phase transitions and computational modelling of complex compositions. The proposed quantitative analysis of materials analogies is expected to bridge the gap between the existing materials literature and the undiscovered terrain.
Similar content being viewed by others
Introduction
Perovskite oxides, identified in ABO3 general form, are utilized in broad spectrum of technological applications. The crystal structure of these materials is conveniently interpreted by an array of corner sharing BO6 octahedra and A-cations sitting at interstitial cages1 (Fig. 1b). However, in real solids, the perfect periodicity of the lattice is disrupted by several factors. As opposed to the long-standing notion of ordered crystals usually perform better than their disordered counterparts, it is now understood that chemically ‘messy’ or alloyed perovskite oxides indeed perform exceptionally better in numerous applications. These include ferroelectric photovoltaics2, electrocatalysts and photocatalysts3, tunable microwave ferroelectrics with shiftable Curie point4, solid oxide fuel cells5,6 and resistive switching memories that are promising to be tailored as memristors and neuromorphic circuit components7,8. Thus, perovskite oxides emanate as an imperative class of materials in their own right and carefully engineered disorder can potentially burnish their properties beyond theoretical limits.
Goldschmidt tolerance factor (t) and octahedral factor (μ) are simple yet powerful descriptors proposed nearly a century ago to predict the formability of perovskite structure9. Recent studies revisited these factors and even developed more accurate descriptors to predict promising ordered perovskites10,11,12. Computational methods of assessing the stability of perovskites include evolutionary algorithms13, molecular dynamics12, semi-empirical modelling14,15 and density functional theory (DFT) calculations of formation enthalpy and energy above convex hull16. Nevertheless, adopting high-throughput ab initio methods17 to model disordered solid solutions with mixed site-occupations is limited due to extreme computational power demanded by large supercells and, several random structure generations and optimizations. Alternatively, machine learning (ML) approaches underpinned by experimental data have been applied to rapidly predict complex functional materials, eventually synthesized to exhibit excellent properties18,19,20,21,22,23. Meanwhile, the complexities brought by compositional disorder such as random atom occupancy of the disordered site, non-stoichiometry and the presence of fractional occupations in the average unit cell have hindered the exploration of the complete composition space for target-driven perovskites discovery. Current supervised ML studies are limited to a minute fraction of disordered perovskites19,20,24, calling for a general strategy to inversely search the undiscovered territory. Fortunately, unsupervised learning techniques can extract hidden knowledge from raw input features and embed those information in a compressed numerical latent vector25. This opens up the avenue of ‘analogical materials discovery’ where unstudied materials analogous to an existing material are identified by quantitatively comparing their respective latent representations.
In this study, we focus on A-site or B-site alloyed quaternary perovskites, typically formulated as \(({{\rm{A}}}_{1-{\rm{x}}}{{\rm{A}}^{\prime} }_{{\rm{x}}}){{\rm{BO}}}_{3}\) and \({\rm{A}}({{\rm{B}}}_{1-{\rm{x}}}{{\rm{B}}^{\prime} }_{{\rm{x}}}){{\rm{O}}}_{3}\), respectively. The rules governing A (larger) and B (smaller) cationic radii are well established, yet lenient enough as we identified 49 and 62 periodic table elements that can entirely or partially occupy A-site and B-site, respectively, by incorporating both experimental and theoretical studies11,12 (Fig. 1a). Such elemental combinations propagate into a staggering and virtually endless diversity of disordered perovskites. Efficient screening of this massive compound pool, first to discover potential perovskites and second, elect promising candidates analogous to a target material/structure (i.e., inverse design) is paramount to supplement the increasing demand for high-performing, inexpensive and eco-friendly materials. By envisaging this objective, we first compiled an all possible candidates pool by capturing the combinations of A and B cations and incrementing x from 0.05 to 0.95 in 0.05 intervals, subject to basic chemistry rules such as charge neutrality and Pauling’s valency rule10,26. This resulted in a grand total of 591,129 distinct compositions. We then collected an experimental database of \(({{\rm{A}}}_{1-{\rm{x}}}{{\rm{A}}^{\prime} }_{{\rm{x}}}){{\rm{BO}}}_{3}\) and \({\rm{A}}({{\rm{B}}}_{1-{\rm{x}}}{{\rm{B}}^{\prime} }_{{\rm{x}}}){{\rm{O}}}_{3}\) compositions by exhaustively querying the Inorganic Crystal Structure Database (ICSD)27. After discarding the duplicates, we obtained 1758 distinct perovskites and 227 non-perovskites along with their crystal structure information, indicating less than only 0.4% of composition space has been experimentally synthesized before (see Fig. 1c and Supplementary Note 1). This database also comprised double perovskites in the format of \({{\rm{A}}}_{2}{\rm{BB}}^{\prime} {{\rm{O}}}_{6}\) and \({\rm{AA}}^{\prime} {{\rm{B}}}_{2}{{\rm{O}}}_{6}\), because the composition is the same to the disordered equivalents \({\rm{A}}({{\rm{B}}}_{0.5}{{\rm{B}}^{\prime} }_{0.5}){{\rm{O}}}_{3}\) and \(({{\rm{A}}}_{0.5}{{\rm{A}}^{\prime} }_{0.5}){{\rm{BO}}}_{3}\), respectively.
An appropriate feature space is developed starting from the chemical formula to represent each material in the experimental database and candidate pool. While characterization equipment such as X-ray photoemission spectroscopy is required to investigate the oxidation state of constituent elements28, here we implement a heuristic algorithm to unroll fractional oxidation states, closely reflecting the actual cation valences. We further add ten best perovskite descriptors procured from the sure independence screening and sparsifying operator (SISSO) algorithm11,29 to the feature space, consolidating hand-crafted and data-driven descriptors. A sequential screening strategy that involves ML classification, chemistry informed filtering and automated web scraping is employed to downselect promising unknown perovskites with 93.9% fidelity. We construct a variational autoencoder (VAE) model30 and train it on a sufficiently large, unlabelled experimental database, sanguinely retrieving a low-dimensional representation of compositions, designated as the material fingerprint (Fig. 1d). Notably, VAEs have been purposed as molecule generative models31 and feature extractors where the latent vector is sent through another ML model to predict element combinations that will likely form a specific topology, requiring crystallographic data as the input32. VAE-obtained latent vector has also been used to predict optical properties of materials33. Here, we demonstrate the exclusive potential of material fingerprints learned only from the chemical composition to carry crystal structure and perovskite formability information as numerical vectors, stressing their applicability in analogical materials discovery. Material attributes such as bandgap and phase transition details can be recovered by proximity analysis of fingerprint topology assembled by the VAE. Through a series of statistical testing on crystal symmetry prediction, we illustrate that structurally and physiochemically similar materials likely possess similar fingerprint representations. This phenomenon is employed to discover lead-free analogues of several lead-based ceramics. We further exemplify the versatility of material fingerprints in initializing the DFT simulation of disordered solid solutions by modelling six promising lead-free perovskites (see Fig. 1).
Results
Feature engineering
The roadmap to analogical materials discovery begins by establishing a solid and adequate collection of candidate compounds for the parent experimental material or class of materials. We use pymatgen34, an open-source python library for analyzing the materials in our database. Pymatgen is equipped to guess the most probable oxidation states of elements in a composition based on ICSD statistics. In the case of mixed valence constituent element, this may lead to a fractional oxidation state that must be unfolded to estimate the effective ionic radius of that element and the associated crystallographic site in the average unit cell. Considering the smallest possible supercell that relates to the integer formula, our algorithm first tries to solve for the valence states of individual ions by exploiting the common oxidation states of the mixed-valence element in question. If no solution is found, it looks through other less common oxidations states. Whenever the algorithm finds more than one configuration of valence states can result in the same fractional oxidation state, the one with the smallest oxidation state numbers and the lowest variation in ionic radii is selected (see Supplementary Note 2). We expect the perovskite structure is more likely to be retained under these conditions because, if there is large difference in size between the constituent cations, the octahedral units share faces, distorting the perovskite lattice more6. However, it should be noted that mixed valence states are highly dependent on the element, composition, vacancies etc., requiring experimental characterization for precise investigation. Our algorithm provides a systematic estimation of the mean ionic radius of such elements, permitting us to calculate t, μ and develop the feature space. Subsequently, t and μ of \(({{\rm{A}}}_{1-{\rm{x}}}{{\rm{A}}^{\prime} }_{{\rm{x}}}){{\rm{BO}}}_{3}\) type compositions are computed as1;
Similarly, for \({\rm{A}}({{\rm{B}}}_{1-{\rm{x}}}{{\rm{B}}^{\prime} }_{{\rm{x}}}){{\rm{O}}}_{3}\) type compositions;
where RA, \({R}_{A^{\prime} }\), RB, \({R}_{B^{\prime} }\) and RO are Shannon’s ionic radii35 of A, A\(^{\prime}\), B, B\(^{\prime}\) cations and oxygen anion, respectively. Mean ionic radius is used for mixed valence elements. The feature space is composed of 55 elemental and 11 compositional features, 10 SISSO descriptors and 5 feature combinations including t and μ (provided in Supplementary Table 1). SISSO is an efficient, data-driven screening approach that finds near-optimal material descriptors relating to a property of interest from a colossal number of potential descriptors obtained by applying a set of mathematical operations to one or many primary features29. In our specific case, we have 71 primary features, yielding a total of ~1071 SISSO descriptors. By training the SISSO algorithm on around 500 randomly sampled compositions from our database and considering a subspace size of 1000, we extracted 10 best descriptors reflecting perovskite formability (Supplementary Note 3). The features related to oxygen are common to all compositions (i.e., no relative importance) and have thus been omitted when mapping the feature space into perovskite or non-perovskite using ML.
Quaternary compositions classification
We examined the performance of four ML algorithms in the present classification task to select the best model to proceed. These include decision tree (DT), random forest (RF), support vector machine (SVM) and gradient boosting classifier (GBC). In order to assess the performance metrics, we iteratively trained and evaluated the individual models on sub-databases generated by integrating 227 non-perovskites with 250 randomly sampled perovskites from 1758 total perovskites. Such a strategy is required to maintain the class-balance of individual training sets, minimizing the potential overfitting to a particular class. It also ensures that all available perovskites are exploited without being dissipated. All four algorithms are trained and validated for 100 independent iterations by splitting each sub-database into 80% training and 20% test sets19. Table 1 summarizes tenfold cross validation (CV) results. GBC is selected to proceed as it provided the best performance with 93.91% (±0.96%) and 93.33% (±2.26%) tenfold CV and test mean classification accuracies, respectively. More details about these metrics including hyperparameter tuning are available in Methods, and Supplementary Fig. 1 and Supplementary Note 4. Figure 2b,c depict confusion matrix and receiver operating characteristic (ROC) curves, respectively, as obtained by tenfold CV. Moreover, five SISSO descriptors make it to the top ten features ranked by their relative importance (Fig. 2a). However, we note that the GBC model achieves 93.71% and 93.16% tenfold CV and test mean accuracies, respectively, even in the absence of SISSO descriptors in the feature space, endorsing the fact that perovskite formability is heavily dependent on geometrical features such as cationic radii.
We then classify the total possible compounds pool (0.6 Million candidates) by following the same procedure of 100 iterations, except that, this time, each GBC model is trained on the full corresponding sub-database without train/test splitting. Figure 2d highlights the geometric region that most probably results in a stable perovskite structure. Moreover, B-site doping is a prevalent technique used to enhance and tune the properties of many perovskites. Figure 3 shows the classification map of such systems for commonly observed A and B cations. Nonetheless, our objective here is to screen down a huge composition space to a sufficient collection of potential perovskites, such that the materials will have a greater chance of potentially being synthesized in single phase perovskite structure. Therefore, we extract the compositions that have over 95% mean perovskite likelihood and, that were classified 100 out of 100 times into the perovskite category, as the most promising perovskites19. This increases the precision (how many selected items are relevant) of the predicted set at the cost of some possible perovskites not being selected. This tactic is also expected to address the ongoing criticism that ML models tend to predict a high rate of materials to be stable, which are actually not36. Likewise, 46,228 highly-probable candidate perovskites are screened, narrowing the composition space by 92%. We further downselect the compositions made up of elements in their common oxidation states. This was followed by removing all the compositions whose quaternary system is available in the ICSD relating to our format as already studied materials. For example, when we find Pb(Zr0.5Ti0.5)O3 is present in the ICSD, we remove all Pb(Zr1−xTix)O3 compositions that belong to the quaternary system Pb-Zr-Ti-O . Furthermore, an automated web scraping tool is implemented to search for any mentioning of the resulting materials/systems on the internet, which were removed in turn. The competency of this tool is tested by scraping 576 ABX3-type experimental materials reported in ref. 11. It found the existence of 562 compounds on the web indicating over 97% success rate. This multi-step, novelty-ensuring screening flow further decreased the perovskite candidate headcount to 10,790, which will serve as the search space for analogical discovery of perovskites.
Unsupervised material fingerprinting
Learning numerical vector representations of materials, that is fingerprinting them, based only on the features extracted from their chemical formulae facilitates the similarity investigation of two or more materials by geometrical distance arithmetics between those vectors. By design, the compositions that are chemically and structurally alike should fall close to each other and disparate compositions should be further apart in the fingerprint space. An autoencoder neural network, composed of an encoder that efficiently compresses discrete material features into a latent vector and a decoder that learns to reconstruct the same features starting from those encodings, is an unsupervised technique of retrieving the material fingerprint, which refers to the encoding itself. However, a fundamental problem of such a strategy is the possibility of forming blind spots in the latent space, which, when decoded, may have no relevance to a real material. Variational autoencoders (VAEs), in contrast to vanilla autoencoders, resolve this issue by enforcing the encoder to produce two vectors relating to a set of means μ and standard deviations σ that closely resemble a standard Gaussian distribution (Z ~ N(0, 1)), which is then sampled to obtain the latent vector, causing the latent space to be continuous (Fig. 4a). Intuitively, this means that the latent encoding of a material is not unique, but stochastically distributed (Z ~ N(μ, σ2)) around the centre determined by the mean vector, enabling the decoder to ‘see’ the whole latent space during training and driving the encoder to generate interpretable representations. While the robustness of VAEs lies in the stochastic nature of the latent embedding, this vector is not adoptable as the material fingerprint due to its non-uniqueness. Therefore, we designate the mean vector (μ) produced by the encoder as the material fingerprint. In our VAE architecture, μ, σ and latent vector all are two-dimensional (2D), yielding 2D numerical fingerprints (see Methods for more details on VAE implementation). We further discuss other dimensionality reduction algorithms in Supplementary Note 5, namely, vanilla autoencoder (Supplementary Fig. 3), principle component analysis (Supplementary Fig. 4) and t-SNE (Supplementary Fig. 5).
Owing to the fact that training VAEs is unsupervised, that is no labelling of compositions is required, we used all available \(({{\rm{A}}}_{1-{\rm{x}}}{{\rm{A}}^{\prime} }_{{\rm{x}}}){{\rm{BO}}}_{3}\) and \({\rm{A}}({{\rm{B}}}_{1-{\rm{x}}}{{\rm{B}}^{\prime} }_{{\rm{x}}}){{\rm{O}}}_{3}\) compositions totalling 2104 to train and validate the model. Over 90% of experimental data relate to ambient conditions, so do our predictions. The database included 119 materials that were difficult to be reliably deposited into either of perovskite or non-perovskite categories. Clearly, there is no hard boundary between these two classes, separating them perfectly. Ultimately, it comes down to which degree of distortion from the ideal cubic perovskite lattice can be tolerated to still regard the resulting structure as perovskite? We let the VAE model to place the materials in the region that they ‘deserve’. The input to the VAE contains the same features as used in the previous classification step. Despite not being explicitly trained as perovskite or non-perovskite, the material fingerprints capture this information and form separate regions in the fingerprint space as can be observed from Fig. 4b. In addition, the alloyed site is clearly recognizable from the 2D fingerprint. Notably, some of the non-perovskite-labelled materials fall within the perovskite region as circled in Fig. 4b. Instead of disregarding them as ambiguous, we scrutinized the root cause for this and found that at least 14 out of 20 materials falling inside the circle were described with the terms ‘perovskite’ or ‘hexagonal perovskite’ in other areas of the literature (see Supplementary Table 2). Apparently, these compounds are more ‘perovskite’ than ‘non-perovskite’. This is a prime example to elucidate that the materials are fingerprinted on their own merits. It also draws the attention to what appears to be outlier data for cross checking the correctness of data labelling.
Now, the cardinal question is, are these fingerprint vectors really learned such that they embed the underlying chemical and structural properties of materials? To answer this question, we investigated small local clusters in the fingerprint space. The key idea is that the unsupervised model finds similar fingerprint representations for analogous compositions, which can be corroborated with a k-nearest neighbours approach. Anticipating that similar materials should find themselves close to each other, we compared the crystal system (cubic, orthorhombic, tetragonal, etc.) of each material with that of the five nearest neighbours (NNs) in the 2D fingerprint space. This can be viewed as a leave one out cross validation (LOOCV) technique, because, despite never being trained to predict the crystal system, all the other materials except the one being tested are present on the fingerprint space. In order to quantify materials analogies in native domain, we define euclidean distance derived similarity measure (SE) as; SE(p, q) = 1/(1 + dE(p, q)) where p and q are two composition points and dE(p, q) is the euclidean distance between p and q. If majority of neighbours vote for (belong to) the same crystal system, that will be assigned as the predicted crystal system of the original composition. In the case of equal number of votes, the one relating to the lowest euclidean distance is selected. By performing this analysis for all 2104 materials, we found that the 5-NNs can determine the correct experimental crystal system of the parent composition with a success rate of 71.8%. This implies that the unsupervised material fingerprints indeed carry crystal structure information concealed in its relative locality. Note that the optimal number of neighbours is found to be 5 by heuristic tuning. Figure 4c illustrates crystal system prediction performance as a normalized confusion matrix. Moreover, this strategy is extensible to predicting the space group of compositions, which resulted in 59.9% prediction accuracy (see Supplementary Fig. 6 for confusion matrix). Given that the total number of distinct space groups present in our database is 66 and the model is not explicitly trained to predict them, nearly 60% accuracy figure reflects that even such diminutive information is inherent in the material fingerprint. Although the data circumstances are different, material fingerprint based prediction of crystal system proposed here offers around 6% accuracy improvement over existing multi-class supervised approaches that rely on multitude of magpie features obtained only from the composition24,37. Furthermore, space group detection of experimental compounds, even by training a supervised convolutional neural network (CNN) on about 150,000 powder XRD patterns is proved to be challenging, with a reported accuracy of 81%38. With no supplemental knowledge on crystallographic or XRD data of complex solid solutions, our model performs appreciably close to the state-of-the-art CNNs on this aspect, albeit the differences between sample sizes and data contexts should be accounted.
To assess how well this fingerprint proximity oriented approach contrasts with supervised parametric models, we evaluated the crystal symmetry detection performance of GBC, DT, RF and SVM classifiers using LOOCV for unbiased comparison. The feature space is chosen to be the same as used in perovskites classification step. The performance metrics are summarized in Table 2. GBC achieves the highest crystal system classification accuracy of 78.8% (see Fig. 4d) while the proposed scheme, relying only on the 2D fingerprints, outperforms supervised algorithms such as SVM and RF. The trend is similar for space group classification. Most importantly, unsupervised deep learning techniques are quite rigorous when finding their own representations, and advantageous in inversely exploring the compositional space, which is not feasible in supervised ML regime. For instance, when the database is of sufficient size and diversity, the representations learned by VAE form an information-embedded topology that can be used not only to predict the crystal structure of unknown compositions but also to discover compositions analogous to target experimental materials, a point that we come back later in this paper.
Phase identification and bandgap prediction
Some perovskites such as ferroelectrics display temperature-dependent phase transitions. There can also be coexisting phases at the same temperature39. We emphasize that chemical composition is not duplicated in our database, leading to an elimination of additional phases, if there are any. Taking a step back, we examined whether the NNs can reflect phase information of a material that has multiple crystallization phases. We found that if the NNs imply strong relation to two or more different symmetries by means of low euclidean distance (i.e., SE > ~ 95%), it may indicate a potential phase transition in the material of interest. For example, consider ferroelectric (Sr0.6Pb0.4)TiO3 which is not recorded in the ICSD (neither ICSD nor our database contains any record of Sr − Pb − Ti − O system). Four experimental NNs of this composition vote for the cubic system and the remaining NN shows an SE of 95.7% to the original ferroelectric and votes for the tetragonal system. There is indeed a cubic to tetragonal phase transition at 66 °C in this material40. 219 compositions in our experimental database can crystallize in multiple phases (upto 4 crystal systems). In 97 cases, all the crystal systems shown by each material are present in the set of crystal systems of corresponding 5-NNs. We stress that the same experimental fingerprints obtained above are used throughout with no further training. Notably, 33 compositions show 3 or more phases. The stated approach can predict 2 or more phases in 27 out of 33 compositions as depicted in Fig. 5a.
We search for the compositions with no known phase transformations by probing those which are reported to have only one phase (one crystal system) for all alloying ratios within the system. We found that 1283 compositions have no potential phase transformations. Out of these, 573 were predicted to have only one phase as expected. Two phases were predicted for 475 compositions. We conjecture that a considerable number of materials have not been fully characterized and even some of the known phases are not yet recorded in the ICSD. For example, ferroelectric (Ag0.9Li0.1)NbO3 favors a trigonal phase at room temperature and displays temperature-induced sequence trigonal → orthorhombic → tetragonal → cubic41. However, the ICSD reports only the trigonal phase of this material (as of March 2021). Therefore, it was wrongly labelled to have no phase transformations. We notice that 4-NNs of (Ag0.9Li0.1)NbO3 vote for the trigonal system and one votes for the cubic system. This is sensible given the above phases of this material. While capturing the Curie point is too much to ask from this unsupervised approach, it does convey some useful insights on possible phase transitions. We further tested whether the bandgaps of the compositions located in close proximity are similar by predicting the DFT computed bandgaps of 1016 quaternary perovskites reported in ref. 5. These compositions are fingerprinted and the bandgap is predicted by averaging this figure of 5-DFT-NNs of the considered composition (see Fig. 5b). The mean absolute error of the predictions is 0.385 eV, consistent with existing ML based bandgap prediction studies12,42. We further employed several supervised approaches for predicting the bandgap. Although the error rate above is very similar to LOOCV MAE of support vector regressor (0.369 eV), the lowest MAE is achieved by RF regressor (0.17 ± 0.004 eV). Based on these predictions, it is safe to assume that geographically close materials likely possess similar properties.
Analogical discovery of perovskites
The fingerprints of the 10,790 predicted perovskites were extracted at the μ layer by sending the associated features through the trained VAE network. This brings the candidate compositions and the experimental observations to a common ground, that is the fingerprint space, where unexplored perovskites landing close to a target experimental material are identified as analogous compositions to that material. We first tested the effectiveness of fingerprints for a set of predicted perovskites. During the materials screening stage, our web scraping tool recovered 24 GBC-predicted perovskites from literature that were not available in the ICSD. 17 out of 24 were reported in credible experimental studies that contained crystal structure information. As predicted, all 17 compositions have been experimentally identified as perovskites. By placing them in the fingerprint space and taking the majority vote of 5-experimental-NNs into account, we correctly identified the crystal system of 13 perovskites out of 17 (Table 3). Not only does this underpin the accuracy of our ML prediction, but it also confirms the generalizability of the material fingerprints to unseen data.
The inauguration of analogical materials discovery is driven by the exigent demand for alternatives to some of the existing industrial materials that may be toxic, expensive, unstable or lossy. For example, discovering lead-free ceramics and solar cell materials is a heavily invested specialism43. Identifying lead-free promising perovskites that potentially resemble the crystal structure and physiochemical properties of a Pb-based material is achievable by fingerprinting the Pb-composition and capturing the nearest candidates in the fingerprint space. For instance, our analysis suggests that unstudied Bi(Mg0.65V0.35)O3 could be a non-toxic substitution for room temperature multiferroic Pb(Fe0.5Nb0.5)O3 (PFN). PFN ceramic undergoes several phase transitions, namely, cubic → tetragonal → rhombohedral → monoclinic with decreasing temperature44. Rather unexpectedly, 5-NNs to Bi(Mg0.65V0.35)O3 evidence the relation to all these phases with tetragonal P4mm receiving two votes and the rest obtaining one vote each. Moreover, Bi(Sc0.2Co0.8)O3 and Bi(Ti0.5Zn0.5)O3 locate near the relaxor ferroelectric Pb(Mg1/3Nb2/3)O3. Note that double perovskite equivalent of Bi(Ti0.5Zn0.5)O3 has experimentally been identified as an analogue of PbTiO345. Furthermore, (K0.5Bi0.5)ZrO3 shows similarity to the studied ceramic (Ba0.67Pb0.33)TiO3. It is well regarded that bismuth-based ceramics are eco-friendly alternatives to lead-rich piezoelectrics, with the downside of usually requiring high-pressure and high-temperature synthesis conditions. Intriguingly, many compositions with bismuth occupying A-site, when fingerprinted, lie close to those containing lead. This implies that our model can rediscover known phenomena even with no domain-specific knowledge explicitly supplied. Full list of suggested alternatives to current selected perovskites is tabulated in Supplementary Table 3.
DFT validation
One of the major hindrances to modelling unstudied materials with ab initio methods such as DFT is the lack of a priori knowledge about the starting geometry. This is especially challenging for perovskites having previously unknown cation combinations. In the era of experimental records reaching big data status, chances are structurally similar materials to an unstudied material do exist. Not only does material fingerprinting facilitate analogical materials discovery, but it also is readily applicable to find a reasonable experimental structure required to initialize DFT geometry optimization. As a demonstration, we perform DFT calculations to realize the optimized structures of six selected promising perovskites. Tractable computational methods of modelling random solid solutions in conjunction with the DFT framework include the special quasirandom structures (SQS) approach46, the virtual crystal approximation47 and the supercell approach that signifies disorder aspects at local level. While the SQS method has been particularly successful in predicting the properties of random solid solutions48, we note that the supercell approach is significantly faster than SQS generation and the associated groundwork. Since property prediction is not intended here, we opt for the supercell method and use the recent supercell program especially designed to construct reflective supercells of solids having mixed/partial site occupancies49. It potentially includes SQS systems within its exhaustive search space and implements a combinatorial structure ranking method based on their respective electrostatic (Coulomb) energies. Owing to the importance of conformer search, here, we select the supercell configuration with the lowest Coulomb energy to start geometry optimization. Note that, supercell generation itself could be very computationally demanding depending on the number of combinatorial structures, not to mention the subsequent DFT calculation. Figure 6 elaborates the DFT structure optimization steps of the selected promising perovskite (Ce0.5Ag0.5)ZrO3. By analyzing the fingerprints, we found that (La0.8Al0.2)FeO3 orthorhombic Pbnm, is the most similar experimental material in our database, to the selected candidate. Hence, it is used as the starting geometry. Table 4 details the structure generation and the optimized lattice constants of the six selected promising perovskites. The structures are fully relaxed, allowing to change the ionic positions, cell volume and cell shape. All the optimized structures preserve the perovskite nature as well as the starting structural integrity with minimal change in lattice parameters, seemingly indicating that the initial geometry found by material fingerprint analogy is a plausible basis.
Discussion
There are several extensions to the present study worthwhile considering. Virtually all materials in our experimental database are published single phase structures. Collecting non-perovskite materials from the literature or databases essentially means that we miss failed syntheses of perovskites or those having major secondary phases, which are, strictly speaking, non-perovskites too. While this data is not usually published, the possible inclusion of such materials as non-perovskites would enhance the reliability of ML predictions. Another possibility would be to use a transfer learning technique by training a neural network classification model subsequently on a large computational database of perovskites/non-perovskites and available experimental data to mitigate data limitation problem15,50. Furthermore, because over 90% of our data has been acquired near room temperature, we did not scrutinize temperature dependence for crystal structure prediction. All our predictions are based only on the composition, assuming ambient conditions. Present study can be extended to include temperature by curating a sufficient database having enough diversity in the temperature field. Due to data limitations, this method is restricted to 66 space groups available in our database. Therefore, the phases outside of these space groups cannot be predicted. This is one of the reasons why we focus on the crystal system, as the data is abundant for all the categories. Moreover, the typical black-box nature of artificial neural networks makes the material fingerprint a black-box too. Investigating the ties of the fingerprint to a physical meaning would make the materials design process more transparent. A relatively large and ideally experimental database of heterogeneous compositions is essential for the proposed framework to succeed. This may be viewed as a constraint to apply our concept in very specific domains where materials data is scarce. We believe the database used in this study is reasonably sized to follow a neural network strategy, and we recommend at least a similar sized database to apply our method in a different context.
In summary, by establishing a comprehensive experimental database of disordered quaternary compositions and involving powerful ensemble ML methods, we have rapidly screened potential perovskites from total feasible compounds. A further series of filtering steps including automated web scraping were imposed to ensure the originality and potential synthesisability of the predicted perovskites. Concurrently, we introduced an unsupervised deep learning strategy to retrieve fingerprints of materials. We elucidated the effectiveness and the educated essence of fingerprints by unravelling hidden salient information such as perovskite formability and underlying crystal structure, that can be used to identify the crystal symmetry of materials. The fingerprints essentially regulate all compositions into a common numerical platform where similar compositions tend to locally stick together. This phenomenon can be capitalized to predict the possible phases and the bandgap of materials. This further led to the conception of analogical materials discovery, that could accelerate the pursuit of finding inexpensive green materials as a replacement for current hazardous materials. We also emphasized the importance of fingerprints to pick up an initial experimental structure required for DFT modelling. Material fingerprint uncovered in this study is not universal, meaning that, it is mutable in a different data context. Nevertheless, due to the versatility of unsupervised techniques, the materials analogy quantification conceptualized here is expected to be valid for other material systems as well. Therefore, we believe the proposed framework has far-reaching implications for accelerated target-driven discovery of materials as well as computational modelling of complex compounds.
Methods
Machine learning
All supervised ML algorithms used in this study, including KNN and GBC are adapted from the scikit-learn package51 in a python programming language environment. Gradient boosting algorithms create an ensemble of weak learners, typically by adding DTs over time to produce a powerful predictive model. The learning takes place by applying gradient descent algorithm to optimizing a differentiable cost function, usually logarithmic loss for classification tasks. The hyper-parameters are external to the ML model and should be tuned in advance to control the training process and achieve better predictive power. We tuned the hyper-parameters of the models (learning rate and No. of estimators in the case of GBC) using tenfold CV and grid search method. In k-fold CV, the database is randomly partitioned into k equal size subsets. The ML model is trained by combining k-1 of them and validated on the remaining subset. This is repeated k times such that each data-point in the original database gets to be in the test set exactly once. LOOCV is the extreme version of k-fold CV where k is equal to the number of samples in the database. Material features were acquired and analyzed using pymatgen34 and mendeleev52 python packages. The computer codes required to train the models and reproduce the results presented in this paper may be found online at https://github.com/ihalage/analogmat.
VAE architecture and training
The VAE model is implemented in python using Keras deep learning framework53 with TensorFlow backend54. The depth of the VAE neural network and the dimension of the latent vector should be carefully controlled to obtain meaningful representations. A high depth (i.e., more parameters) means low reconstruction loss, but this often comes with the risk of data memorization rather than learning anything useful. High dimensional latent space can also give rise to this problem. We found that, using a significantly low dimension for the latent vector not only forces the encoder to find near optimal compression due to the added pressure by high reconstruction loss, but it also facilitates convenient data visualization without incorporating other dimensionality reduction algorithms. After some extensive heuristic tuning, we eventually constructed the VAE with five hidden layers with 2D μ, σ and latent vectors. The remaining four hidden layers have 128, 64, 32, and 64 neurons, chronologically. The weights were initialized using Glorot uniform initializer. Rectified linear unit is used as the activation function of the hidden layers and linear activation is used for the output layer. L2 regularization is applied in hidden layers to avoid the overfitting phenomenon. Kullback–Leibler divergence (KL loss)55 and mean squared error (MSE) were used additively as the loss function, which was optimized using Adam optimizer with a learning rate of 0.001. The exponential decay parameters β1, β2 and the stability constant ϵ were set to 0.9, 0.999 and 10−7, respectively. The model is trained on an RTX 2080 Ti graphics processing unit (GPU) with a batch size of 32 and the validation data (20%) loss was monitored for an early-stopping method to prevent possible data memorization.
Density functional theory
Spin-polarized DFT calculations were performed with projector augmented wave pseudopotentials56 using the generalized gradient approximation as implemented in the Vienna Ab initio Simulation Package (VASP)57. We used perdew–burke–ernzerhof exchange-correlation functional for the present calculation. The Brillouin zone was sampled with a 10 × 10 × 10 Monkhorst-Pack k-point grid. Cutoff energy for the plane wave basis set was set to 520 eV. The ionic positions, cell volume and cell shape were allowed to change during the structure relaxation until reaching an energy convergence threshold of 2 × 10−6 eV and net atomic force less than 0.01 eV/Å on any atom. The solid solutions were approximated with a supercell having the lowest Coulomb energy as generated by the supercell package49. The supercell dimensions used in this study are sufficient to demonstrate the geometry optimization process, however, larger supercells or a SQS based method is recommended if further calculation of physical properties is intended. The crystal structures were analyzed using pymatgen and visualized with VESTA58. We used the GPU port of VASP59 to run the calculations on an RTX 2080 Ti GPU.
Data availability
The experimental database is provided in Supplementary Data 1. Predicted compositions having over 98% perovskite likelihood are provided in Supplementary Data 2 along with their respective five most similar experimental materials. All other data needed to evaluate the findings of this study are present in the paper and/or the Supplementary Information.
Code availability
The codes required to reproduce the results of this study are publicly available at https://github.com/ihalage/analogmat.
References
Tilley, R. J. The ABX3 Perovskite Structure, chap. 1, 1–41 (John Wiley & Sons, Ltd, 2016).
Grinberg, I. et al. Perovskite oxides for visible-light-absorbing ferroelectric and photovoltaic materials. Nature 503, 509–512 (2013).
Yin, W.-J. et al. Oxide perovskites, double perovskites and derivatives for electrocatalysis, photocatalysis, and photovoltaics. Energy Environ. Sci. 12, 442–462 (2019).
Ahmed, A., Goldthorpe, I. A. & Khandani, A. K. Electrically tunable materials for microwave applications. Appl. Phys. Rev. 2, 011302 (2015).
Jacobs, R., Mayeshiba, T., Booske, J. & Morgan, D. Material discovery and design principles for stable, high activity perovskite cathodes for solid oxide fuel cells. Adv. Energy Mater. 8, 1702708 (2018).
Fop, S. et al. High oxide ion and proton conductivity in a disordered hexagonal perovskite. Nat. Mater. 19, 752–757 (2020).
Lähteenlahti, V., Schulman, A., Huhtinen, H. & Paturi, P. Transport properties of resistive switching in ag/pr0.6ca0.4mno3/al thin film structures. J. Alloys Compd. 786, 84 – 90 (2019).
Panda, D. & Tseng, T.-Y. Perovskite oxides as resistive switching memories: a review. Ferroelectrics 471, 23–64 (2014).
Goldschmidt, V. M. Die gesetze der krystallochemie. Naturwissenschaften 14, 477–485 (1926).
Filip, M. R. & Giustino, F. The geometric blueprint of perovskites. PNAS 115, 5397–5402 (2018).
Bartel, C. J. et al. New tolerance factor to predict the stability of perovskite oxides and halides. Sci. Adv. 5, eaav0693 (2019).
Lu, S., Zhou, Q., Ma, L., Guo, Y. & Wang, J. Rapid discovery of ferroelectric photovoltaic perovskites and material descriptors via machine learning. Small Methods 3, 1900360 (2019).
Woodley, S. M. & Catlow, R. Crystal structure prediction from first principles. Nat. Mater. 7, 937–946 (2008).
Marchenko, E. I. et al. Transferable approach of semi-empirical modeling of disordered mixed-halide hybrid perovskites ch3nh3pb(i1–xbrx)3: prediction of thermodynamic properties, phase stability, and deviations from vegard’s law. J. Phys. Chem. C 123, 26036–26040 (2019).
Liu, N. et al. Interactive human-machine learning framework for modelling of ferroelectric-dielectric composites. J. Mater. Chem. C 8, 10352–10361 (2020).
Emery, A. A. & Wolverton, C. High-throughput dft calculations of formation energy, stability and oxygen vacancy formation energy of abo3 perovskites. Sci. Data 4, 170153 (2017).
Im, J. et al. Identifying pb-free perovskites for solar cells by machine learning. npj Comput. Mater. 5, 37 (2019).
Schmidt, J., Marques, M. R. G., Botti, S. & Marques, M. A. L. Recent advances and applications of machine learning in solid-state materials science. npj Comput. Mater. 5, 83 (2019).
Balachandran, P. V., Kowalski, B., Sehirlioglu, A. & Lookman, T. Experimental search for high-temperature ferroelectric perovskites guided by two-step machine learning. Nat. Commun. 9, 1668 (2018).
Weng, B. et al. Simple descriptor derived from symbolic regression accelerating the discovery of new perovskite catalysts. Nat. Commun. 11, 3513 (2020).
Yuan, R. et al. Accelerated discovery of large electrostrains in batio3-based piezoelectrics using active learning. Adv. Mater. 30, 1702884 (2018).
Raccuglia, P. et al. Machine-learning-assisted materials discovery using failed experiments. Nature 533, 73–76 (2016).
Ren, F. et al. Accelerated discovery of metallic glasses through iteration of machine learning and high-throughput experiments. Sci. Adv. 4, eaaq1566 (2018).
Zhao, Y. et al. Machine learning-based prediction of crystal systems and space groups from inorganic materials compositions. ACS Omega 5, 3596–3606 (2020).
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
Pauling, L. The nature of the chemical bond. application of results obtained from the quantum mechanics and from a theory of paramagnetic susceptibility to the structure of molecules. J. Am. Chem. Soc. 53, 1367–1400 (1931).
Nist inorganic crystal structure database, nist standard reference database number 3, national institute of standards and technology, gaithersburg md, 20899. https://icsd.nist.gov/.
Nechache, R. et al. Bandgap tuning of multiferroic oxide solar cells. Nat. Photonics 9, 61–67 (2015).
Ouyang, R., Curtarolo, S., Ahmetcik, E., Scheffler, M. & Ghiringhelli, L. M. Sisso: A compressed-sensing method for identifying the best low-dimensional descriptor in an immensity of offered candidates. Phys. Rev. Mater. 2, 083802 (2018).
Kingma, D. P. & Welling, M., Auto-Encoding Variational Bayes, in 2nd International Conference on Learning Representations (ICLR), Banff, AB, Canada, April 14-16 (2014).
Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).
Ryan, K., Lengyel, J. & Shatruk, M. Crystal structure prediction via deep learning. J. Am. Chem. Soc. 140, 10158–10168 (2018).
Stein, H. S., Guevarra, D., Newhouse, P. F., Soedarmadji, E. & Gregoire, J. M. Machine learning of optical properties of materials - predicting spectra from images and images from spectra. Chem. Sci. 10, 47–55 (2019).
Ong, S. P. et al. Python materials genomics (pymatgen): a robust, open-source python library for materials analysis. Comput. Mater. Sci. 68, 314–319 (2013).
Shannon, R. D. Revised effective ionic radii and systematic studies of interatomic distances in halides and chalcogenides. Acta Crystallogr. Sect. A 32, 751–767 (1976).
Bartel, C. J. et al. A critical examination of compound stability predictions from machine-learned formation energies. npj Comput. Mater. 6, 97 (2020).
Ward, L., Agrawal, A., Choudhary, A. & Wolverton, C. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Comput. Mater. 2, 16028 (2016).
Park, W. B. et al. Classification of crystal structure using a convolutional neural network. IUCrJ 4, 486–494 (2017).
Li, M.-D. et al. Large electrocaloric effect in lead-free ba(hfxti1–x)o3 ferroelectric ceramics for clean energy applications. ACS Sustain. Chem. Eng. 6, 8920–8925 (2018).
Huang, X.-X. et al. Dielectric relaxation and pinning phenomenon of (sr,pb)tio3 ceramics for dielectric tunable device application. Sci. Rep. 6, 31960 (2016).
Farid, U., Gibbs, A. S. & Kennedy, B. J. Impact of li doping on the structure and phase stability in agnbo3. Inorg. Chem. 59, 12595–12607 (2020).
Pilania, G. et al. Machine learning bandgaps of double perovskites. Sci. Rep. 6, 19375 (2016).
Koruza, J. et al. Requirements for the transfer of lead-free piezoceramics into application. J. Materiomics 4, 13 – 26 (2018).
Raymond, O., Font, R., Portelles, J. & Siqueiros, J. M. Magnetoelectric coupling study in multiferroic pb(fe0.5nb0.5)o3 ceramics through small and large electric signal standard measurements. J. Appl. Phys. 109, 094106 (2011).
Suchomel, M. R. et al. Bi2zntio6: a lead-free closed-shell polar perovskite with a calculated ionic polarization of 150 μc cm-2. Chem. Mater. 18, 4987–4989 (2006).
Zunger, A., Wei, S.-H., Ferreira, L. G. & Bernard, J. E. Special quasirandom structures. Phys. Rev. Lett. 65, 353–356 (1990).
Bellaiche, L. & Vanderbilt, D. Virtual crystal approximation revisited: application to dielectric and piezoelectric properties of perovskites. Phys. Rev. B 61, 7877–7882 (2000).
Von Pezold, J., Dick, A., Friák, M. & Neugebauer, J. Generation and performance of special quasirandom structures for studying the elastic properties of random alloys: application to al-ti. Phys. Rev. B 81, 094203 (2010).
Okhotnikov, K., Charpentier, T. & Cadars, S. Supercell program: a combinatorial structure-generation approach for the local-level modeling of atomic substitutions and partial occupancies in crystals. J. Cheminformatics 8, 17 (2016).
Jha, D. et al. Enhancing materials property prediction by leveraging computational and experimental data using deep transfer learning. Nat. Commun. 10, 5316 (2019).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Mentel, L. mendeleev – a python resource for properties of chemical elements, ions and isotopes, ver. 0.6.0 (2014). https://github.com/lmmentel/mendeleev.
Chollet, F. et al. Keras (2015). https://keras.io.
Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous systems (2015). https://www.tensorflow.org/.
Kullback, S. & Leibler, R. A. On information and sufficiency. Ann. Math. Statist. 22, 79–86 (1951).
Blöchl, P. E. Projector augmented-wave method. Phys. Rev. B 50, 17953–17979 (1994).
Kresse, G. & Furthmüller, J. Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set. Phys. Rev. B 54, 11169–11186 (1996).
Momma, K. & Izumi, F. VESTA: a three-dimensional visualization system for electronic and structural analysis. J. Appl. Crystallogr. 41, 653–658 (2008).
Hacene, M. et al. Accelerating vasp electronic structure calculations using graphic processing units. J. Comput. Chem. 33, 2581–2589 (2012).
Ge, P.-Z. et al. Composition dependence of giant electrocaloric effect in pbxsr1-xtio3 ceramics for energy-related applications. J. Materiomics 5, 118–126 (2019).
Cavalcante, L. et al. Dielectric properties of ca(zr0.05ti0.95)o3 thin films prepared by chemical solution deposition. J. Solid State Chem. 179, 3739–3743 (2006).
He, T., Chen, J., Calvarese, T. & Subramanian, M. Thermoelectric properties of la1-xaxcoo3 (a=pb, na). Solid State Sci. 8, 467–469 (2006).
Sutar, B., Das, P. & Choudhary, R. N. Synthesis and electrical properties of sr(bi0.5v0.5)o3 electroceramic. Adv. Mater. Lett. 5, 131–137 (2014).
Zhengfa, L. et al. Synthesis and structures of k0.5nd0.5tio3 and k0.5sm0.5tio3. RARE METAL MAT. ENG. 42, 476–478 (2013).
Yang, J. H., Choo, W. K., Lee, J. H. & Lee, C. H. The crystal structure of the B-site ordered complex perovskite Sr(Yb0. 5Nb0. 5)O3. Acta Crystallogr. Section B: 55, 348–354 (1999).
Yan, L. et al. High permittivity srhf0.5ti0.5o3 films grown by pulsed laser deposition. Appl. Phys. Lett. 94, 232903 (2009).
Tarrida, M., Larguem, H. & Madon, M. Structural investigations of (ca,sr)zro3 and ca(sn,zr)o3 perovskite compounds. Phys. Chem. Miner. 36, 403–413 (2009).
Wan ali, Wff et al. Synthesis and characterization of ba0.3sr0.7zro3 ceramic thick films prepared by sol-gel technique. Adv. Mat. Res. 620, 435–439 (2012).
Chen, K., Zheng, J., Diao, C., Yang, C. & Chiu, Y. Photoluminescence characteristics of perovskite eu-doped (ba0.9sr0.1)zro3 ceramic. In 2016 International Conference on Advanced Materials for Science and Engineering (ICAMSE), 419–422 (2016).
Ahmed, M., Selim, M. S. & Arman, M. Novel multiferroic la0.95sb0.05feo3 orthoferrite. Mater. Chem. Phys. 129, 705–712 (2011).
Acknowledgements
The authors thank Sachini Amararathna for collecting the experimental data analyzed in this study. We thank M.J. Reece for stimulating discussions on disordered materials and their attributes. The authors acknowledge funding received by The Institution of Engineering and Technology (IET) under the AF Harvey Research Prize. This work is supported in part by Engineering and Physical Sciences Research Council (EPSRC) ANIMATE grant (No. EP/R035393/1).
Author information
Authors and Affiliations
Contributions
A.I. conceived the idea, performed ML and DFT analysis and wrote the paper. Y.H. directed and coordinated the research. All authors discussed the results and reviewed the paper.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ihalage, A., Hao, Y. Analogical discovery of disordered perovskite oxides by crystal structure information hidden in unsupervised material fingerprints. npj Comput Mater 7, 75 (2021). https://doi.org/10.1038/s41524-021-00536-2
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41524-021-00536-2