Introduction

Perovskite oxides, identified in ABO3 general form, are utilized in broad spectrum of technological applications. The crystal structure of these materials is conveniently interpreted by an array of corner sharing BO6 octahedra and A-cations sitting at interstitial cages1 (Fig. 1b). However, in real solids, the perfect periodicity of the lattice is disrupted by several factors. As opposed to the long-standing notion of ordered crystals usually perform better than their disordered counterparts, it is now understood that chemically ‘messy’ or alloyed perovskite oxides indeed perform exceptionally better in numerous applications. These include ferroelectric photovoltaics2, electrocatalysts and photocatalysts3, tunable microwave ferroelectrics with shiftable Curie point4, solid oxide fuel cells5,6 and resistive switching memories that are promising to be tailored as memristors and neuromorphic circuit components7,8. Thus, perovskite oxides emanate as an imperative class of materials in their own right and carefully engineered disorder can potentially burnish their properties beyond theoretical limits.

Fig. 1: Workflow demonstrating the analogical discovery of perovskites.
figure 1

a Elements that can fully or partially occupy octahedral and interstitial sites. b Average unit cells of \(({{\rm{A}}}_{1-{\rm{x}}}{{\rm{A}}^{\prime} }_{{\rm{x}}}){{\rm{BO}}}_{3}\) and \({\rm{A}}({{\rm{B}}}_{1-{\rm{x}}}{{\rm{B}}^{\prime} }_{{\rm{x}}}){{\rm{O}}}_{3}\) materials, usually realized by X-ray diffraction (XRD) analysis. During total possible chemical formulae generation, Pauling’s valency rule is imposed by restricting the mean oxidation state of the A-site (\({\bar{n}}_{A}\)) to be less than or equal to that of B-site (\({\bar{n}}_{B}\)); (\({\bar{n}}_{A}\le {\bar{n}}_{B}\)). We further set the condition, mean ionic radius of A-site (\({\bar{R}}_{A}\)) should be greater than or equal to that of B-site (\({\bar{R}}_{A}\)); (\({\bar{R}}_{A}\ge {\bar{R}}_{B}\)), because otherwise the composition relates to interchanged A–B cations. A gradient boosting classifier is iteratively trained on sampled experimental data (c), extracted from the ICSD. Total possible compounds pool is classified on a probability threshold (L), in order to improve precision. The originality of positively classified candidates is confirmed by high-throughput web scraping. d VAE model is trained on 2104 unlabelled experimental data samples. The euclidean distances between an experimental composition and candidate perovskites are calculated in fingerprint space. Nearest candidates are reported as analogous to the original composition. DFT structure optimization reveals that the analogous candidates preserve the native crystal structure.

Goldschmidt tolerance factor (t) and octahedral factor (μ) are simple yet powerful descriptors proposed nearly a century ago to predict the formability of perovskite structure9. Recent studies revisited these factors and even developed more accurate descriptors to predict promising ordered perovskites10,11,12. Computational methods of assessing the stability of perovskites include evolutionary algorithms13, molecular dynamics12, semi-empirical modelling14,15 and density functional theory (DFT) calculations of formation enthalpy and energy above convex hull16. Nevertheless, adopting high-throughput ab initio methods17 to model disordered solid solutions with mixed site-occupations is limited due to extreme computational power demanded by large supercells and, several random structure generations and optimizations. Alternatively, machine learning (ML) approaches underpinned by experimental data have been applied to rapidly predict complex functional materials, eventually synthesized to exhibit excellent properties18,19,20,21,22,23. Meanwhile, the complexities brought by compositional disorder such as random atom occupancy of the disordered site, non-stoichiometry and the presence of fractional occupations in the average unit cell have hindered the exploration of the complete composition space for target-driven perovskites discovery. Current supervised ML studies are limited to a minute fraction of disordered perovskites19,20,24, calling for a general strategy to inversely search the undiscovered territory. Fortunately, unsupervised learning techniques can extract hidden knowledge from raw input features and embed those information in a compressed numerical latent vector25. This opens up the avenue of ‘analogical materials discovery’ where unstudied materials analogous to an existing material are identified by quantitatively comparing their respective latent representations.

In this study, we focus on A-site or B-site alloyed quaternary perovskites, typically formulated as \(({{\rm{A}}}_{1-{\rm{x}}}{{\rm{A}}^{\prime} }_{{\rm{x}}}){{\rm{BO}}}_{3}\) and \({\rm{A}}({{\rm{B}}}_{1-{\rm{x}}}{{\rm{B}}^{\prime} }_{{\rm{x}}}){{\rm{O}}}_{3}\), respectively. The rules governing A (larger) and B (smaller) cationic radii are well established, yet lenient enough as we identified 49 and 62 periodic table elements that can entirely or partially occupy A-site and B-site, respectively, by incorporating both experimental and theoretical studies11,12 (Fig. 1a). Such elemental combinations propagate into a staggering and virtually endless diversity of disordered perovskites. Efficient screening of this massive compound pool, first to discover potential perovskites and second, elect promising candidates analogous to a target material/structure (i.e., inverse design) is paramount to supplement the increasing demand for high-performing, inexpensive and eco-friendly materials. By envisaging this objective, we first compiled an all possible candidates pool by capturing the combinations of A and B cations and incrementing x from 0.05 to 0.95 in 0.05 intervals, subject to basic chemistry rules such as charge neutrality and Pauling’s valency rule10,26. This resulted in a grand total of 591,129 distinct compositions. We then collected an experimental database of \(({{\rm{A}}}_{1-{\rm{x}}}{{\rm{A}}^{\prime} }_{{\rm{x}}}){{\rm{BO}}}_{3}\) and \({\rm{A}}({{\rm{B}}}_{1-{\rm{x}}}{{\rm{B}}^{\prime} }_{{\rm{x}}}){{\rm{O}}}_{3}\) compositions by exhaustively querying the Inorganic Crystal Structure Database (ICSD)27. After discarding the duplicates, we obtained 1758 distinct perovskites and 227 non-perovskites along with their crystal structure information, indicating less than only 0.4% of composition space has been experimentally synthesized before (see Fig. 1c and Supplementary Note 1). This database also comprised double perovskites in the format of \({{\rm{A}}}_{2}{\rm{BB}}^{\prime} {{\rm{O}}}_{6}\) and \({\rm{AA}}^{\prime} {{\rm{B}}}_{2}{{\rm{O}}}_{6}\), because the composition is the same to the disordered equivalents \({\rm{A}}({{\rm{B}}}_{0.5}{{\rm{B}}^{\prime} }_{0.5}){{\rm{O}}}_{3}\) and \(({{\rm{A}}}_{0.5}{{\rm{A}}^{\prime} }_{0.5}){{\rm{BO}}}_{3}\), respectively.

An appropriate feature space is developed starting from the chemical formula to represent each material in the experimental database and candidate pool. While characterization equipment such as X-ray photoemission spectroscopy is required to investigate the oxidation state of constituent elements28, here we implement a heuristic algorithm to unroll fractional oxidation states, closely reflecting the actual cation valences. We further add ten best perovskite descriptors procured from the sure independence screening and sparsifying operator (SISSO) algorithm11,29 to the feature space, consolidating hand-crafted and data-driven descriptors. A sequential screening strategy that involves ML classification, chemistry informed filtering and automated web scraping is employed to downselect promising unknown perovskites with 93.9% fidelity. We construct a variational autoencoder (VAE) model30 and train it on a sufficiently large, unlabelled experimental database, sanguinely retrieving a low-dimensional representation of compositions, designated as the material fingerprint (Fig. 1d). Notably, VAEs have been purposed as molecule generative models31 and feature extractors where the latent vector is sent through another ML model to predict element combinations that will likely form a specific topology, requiring crystallographic data as the input32. VAE-obtained latent vector has also been used to predict optical properties of materials33. Here, we demonstrate the exclusive potential of material fingerprints learned only from the chemical composition to carry crystal structure and perovskite formability information as numerical vectors, stressing their applicability in analogical materials discovery. Material attributes such as bandgap and phase transition details can be recovered by proximity analysis of fingerprint topology assembled by the VAE. Through a series of statistical testing on crystal symmetry prediction, we illustrate that structurally and physiochemically similar materials likely possess similar fingerprint representations. This phenomenon is employed to discover lead-free analogues of several lead-based ceramics. We further exemplify the versatility of material fingerprints in initializing the DFT simulation of disordered solid solutions by modelling six promising lead-free perovskites (see Fig. 1).

Results

Feature engineering

The roadmap to analogical materials discovery begins by establishing a solid and adequate collection of candidate compounds for the parent experimental material or class of materials. We use pymatgen34, an open-source python library for analyzing the materials in our database. Pymatgen is equipped to guess the most probable oxidation states of elements in a composition based on ICSD statistics. In the case of mixed valence constituent element, this may lead to a fractional oxidation state that must be unfolded to estimate the effective ionic radius of that element and the associated crystallographic site in the average unit cell. Considering the smallest possible supercell that relates to the integer formula, our algorithm first tries to solve for the valence states of individual ions by exploiting the common oxidation states of the mixed-valence element in question. If no solution is found, it looks through other less common oxidations states. Whenever the algorithm finds more than one configuration of valence states can result in the same fractional oxidation state, the one with the smallest oxidation state numbers and the lowest variation in ionic radii is selected (see Supplementary Note 2). We expect the perovskite structure is more likely to be retained under these conditions because, if there is large difference in size between the constituent cations, the octahedral units share faces, distorting the perovskite lattice more6. However, it should be noted that mixed valence states are highly dependent on the element, composition, vacancies etc., requiring experimental characterization for precise investigation. Our algorithm provides a systematic estimation of the mean ionic radius of such elements, permitting us to calculate t, μ and develop the feature space. Subsequently, t and μ of \(({{\rm{A}}}_{1-{\rm{x}}}{{\rm{A}}^{\prime} }_{{\rm{x}}}){{\rm{BO}}}_{3}\) type compositions are computed as1;

$$t=\frac{[(1-x){R}_{A}+x{R}_{A^{\prime} }+{R}_{O}]}{\sqrt{2}({R}_{B}+{R}_{O})}$$
(1)
$$\mu ={R}_{B}/{R}_{O}$$
(2)

Similarly, for \({\rm{A}}({{\rm{B}}}_{1-{\rm{x}}}{{\rm{B}}^{\prime} }_{{\rm{x}}}){{\rm{O}}}_{3}\) type compositions;

$$t=\frac{({R}_{A}+{R}_{O})}{\sqrt{2}[(1-x){R}_{B}+x{R}_{B^{\prime} }+{R}_{O}]}$$
(3)
$$\mu =[(1-x){R}_{B}+x{R}_{B^{\prime} }]/{R}_{O}$$
(4)

where RA, \({R}_{A^{\prime} }\), RB, \({R}_{B^{\prime} }\) and RO are Shannon’s ionic radii35 of A, A\(^{\prime}\), B, B\(^{\prime}\) cations and oxygen anion, respectively. Mean ionic radius is used for mixed valence elements. The feature space is composed of 55 elemental and 11 compositional features, 10 SISSO descriptors and 5 feature combinations including t and μ (provided in Supplementary Table 1). SISSO is an efficient, data-driven screening approach that finds near-optimal material descriptors relating to a property of interest from a colossal number of potential descriptors obtained by applying a set of mathematical operations to one or many primary features29. In our specific case, we have 71 primary features, yielding a total of ~1071 SISSO descriptors. By training the SISSO algorithm on around 500 randomly sampled compositions from our database and considering a subspace size of 1000, we extracted 10 best descriptors reflecting perovskite formability (Supplementary Note 3). The features related to oxygen are common to all compositions (i.e., no relative importance) and have thus been omitted when mapping the feature space into perovskite or non-perovskite using ML.

Quaternary compositions classification

We examined the performance of four ML algorithms in the present classification task to select the best model to proceed. These include decision tree (DT), random forest (RF), support vector machine (SVM) and gradient boosting classifier (GBC). In order to assess the performance metrics, we iteratively trained and evaluated the individual models on sub-databases generated by integrating 227 non-perovskites with 250 randomly sampled perovskites from 1758 total perovskites. Such a strategy is required to maintain the class-balance of individual training sets, minimizing the potential overfitting to a particular class. It also ensures that all available perovskites are exploited without being dissipated. All four algorithms are trained and validated for 100 independent iterations by splitting each sub-database into 80% training and 20% test sets19. Table 1 summarizes tenfold cross validation (CV) results. GBC is selected to proceed as it provided the best performance with 93.91% (±0.96%) and 93.33% (±2.26%) tenfold CV and test mean classification accuracies, respectively. More details about these metrics including hyperparameter tuning are available in Methods, and Supplementary Fig. 1 and Supplementary Note 4. Figure 2b,c depict confusion matrix and receiver operating characteristic (ROC) curves, respectively, as obtained by tenfold CV. Moreover, five SISSO descriptors make it to the top ten features ranked by their relative importance (Fig. 2a). However, we note that the GBC model achieves 93.71% and 93.16% tenfold CV and test mean accuracies, respectively, even in the absence of SISSO descriptors in the feature space, endorsing the fact that perovskite formability is heavily dependent on geometrical features such as cationic radii.

Table 1 Performance metrics obtained by tenfold CV for perovskites classification.
Fig. 2: GBC classification metrics and prediction results.
figure 2

a Relative importance of top 20 features. Notably, tolerance factor (t) has dropped to the 10th place, which may indicate that t alone is not sufficient to predict the perovskite formability of complex solid solutions at a good accuracy. b Normalized confusion matrix obtained by tenfold CV. c ROC curves obtained for each fold in tenfold CV and the mean ROC curve. Area under the curve (AUC) summarizes the performance of each GBC into a single measure where high AUC indicates the model has a greater capability of distinguishing between classes. This demonstrates that our model performs consistently even with different random data divisions. Shaded region shows the standard deviation of mean ROC curve. d Classification probability heatmap for 0.6 Million compounds pool, plotted as μ vs t.

We then classify the total possible compounds pool (0.6 Million candidates) by following the same procedure of 100 iterations, except that, this time, each GBC model is trained on the full corresponding sub-database without train/test splitting. Figure 2d highlights the geometric region that most probably results in a stable perovskite structure. Moreover, B-site doping is a prevalent technique used to enhance and tune the properties of many perovskites. Figure 3 shows the classification map of such systems for commonly observed A and B cations. Nonetheless, our objective here is to screen down a huge composition space to a sufficient collection of potential perovskites, such that the materials will have a greater chance of potentially being synthesized in single phase perovskite structure. Therefore, we extract the compositions that have over 95% mean perovskite likelihood and, that were classified 100 out of 100 times into the perovskite category, as the most promising perovskites19. This increases the precision (how many selected items are relevant) of the predicted set at the cost of some possible perovskites not being selected. This tactic is also expected to address the ongoing criticism that ML models tend to predict a high rate of materials to be stable, which are actually not36. Likewise, 46,228 highly-probable candidate perovskites are screened, narrowing the composition space by 92%. We further downselect the compositions made up of elements in their common oxidation states. This was followed by removing all the compositions whose quaternary system is available in the ICSD relating to our format as already studied materials. For example, when we find Pb(Zr0.5Ti0.5)O3 is present in the ICSD, we remove all Pb(Zr1−xTix)O3 compositions that belong to the quaternary system Pb-Zr-Ti-O . Furthermore, an automated web scraping tool is implemented to search for any mentioning of the resulting materials/systems on the internet, which were removed in turn. The competency of this tool is tested by scraping 576 ABX3-type experimental materials reported in ref. 11. It found the existence of 562 compounds on the web indicating over 97% success rate. This multi-step, novelty-ensuring screening flow further decreased the perovskite candidate headcount to 10,790, which will serve as the search space for analogical discovery of perovskites.

Fig. 3: Perovskite likelihood of selected \({\rm{A}}({{\rm{B}}}_{1-{\rm{x}}}{{\rm{B}}^{\prime} }_{{\rm{x}}}){{\rm{O}}}_{3}\) systems as predicted by the GBC model.
figure 3

Each panel depicts the classification probability variation as the content x of different \({\rm{B}}^{\prime}\) cations increases, for some commonly observed A and B element combinations. Empty spaces imply no charge balanced probable compositions exist. Caesium cation is too large to occupy A-site and the formation of a stable perovskite is generally improbable. ICSD reports one quaternary perovksite oxide, K(Ta0.77Nb0.23)O3, with potassium fully occupying A-site, as opposed to four stable ternary perovskites, KNbO3, KTaO3, KUO3 and KIO3. Despite having no training data with A-site K and B-site U occupied perovskites, the model predicts that U can be alloyed with Nb, Ta or even Ti and Zr to forge a perovskite. This claim is somewhat supported by above ternary oxides. Supplementary Fig. 2 shows the composition map where A-site is partially occupied by K. Strontium, an ideally sized cation sitting at A-site can form plenty of stable perovskites, supporting diverse B-site substitutions for tailoring the properties of materials. Alkaline earth metals in A-site generally show high perovskite likelihood. Likewise, these elements can replace toxic lead in many useful perovskites.

Unsupervised material fingerprinting

Learning numerical vector representations of materials, that is fingerprinting them, based only on the features extracted from their chemical formulae facilitates the similarity investigation of two or more materials by geometrical distance arithmetics between those vectors. By design, the compositions that are chemically and structurally alike should fall close to each other and disparate compositions should be further apart in the fingerprint space. An autoencoder neural network, composed of an encoder that efficiently compresses discrete material features into a latent vector and a decoder that learns to reconstruct the same features starting from those encodings, is an unsupervised technique of retrieving the material fingerprint, which refers to the encoding itself. However, a fundamental problem of such a strategy is the possibility of forming blind spots in the latent space, which, when decoded, may have no relevance to a real material. Variational autoencoders (VAEs), in contrast to vanilla autoencoders, resolve this issue by enforcing the encoder to produce two vectors relating to a set of means μ and standard deviations σ that closely resemble a standard Gaussian distribution (Z ~ N(0, 1)), which is then sampled to obtain the latent vector, causing the latent space to be continuous (Fig. 4a). Intuitively, this means that the latent encoding of a material is not unique, but stochastically distributed (Z ~ N(μ, σ2)) around the centre determined by the mean vector, enabling the decoder to ‘see’ the whole latent space during training and driving the encoder to generate interpretable representations. While the robustness of VAEs lies in the stochastic nature of the latent embedding, this vector is not adoptable as the material fingerprint due to its non-uniqueness. Therefore, we designate the mean vector (μ) produced by the encoder as the material fingerprint. In our VAE architecture, μ, σ and latent vector all are two-dimensional (2D), yielding 2D numerical fingerprints (see Methods for more details on VAE implementation). We further discuss other dimensionality reduction algorithms in Supplementary Note 5, namely, vanilla autoencoder (Supplementary Fig. 3), principle component analysis (Supplementary Fig. 4) and t-SNE (Supplementary Fig. 5).

Fig. 4: Unsupervised material figerprinting using VAE.
figure 4

a VAE architecture. Encoder must find an efficient compression, discarding bulk of irrelevant information from the input to support the decoder to reconstruct the input itself from a highly compressed dimension. Because latent vector is sampled from a probability distribution, during the training process, the decoder learns not only a single point in latent space is relevant to a particular material, but also the nearby points. This removes blind spots to the decoder and enables placing unseen compositions on the fingerprint map. b Crystal system vs 2D fingerprint visualization of the experimental materials. XY plane relates to the fingerprint space. 14 non-perovskite labelled materials out of 20 landing inside the circled area were identified as ‘perovskite’ or ‘hexagonal perovskite’ in separate literature, which may not have been examined during data labelling. The crystal system of the indicated candidate is predicted to be orthorhombic based on the plurality vote of 5-NNs. c and d display the normalized confusion matrices detailing the crystal system prediction of 2104 experimental materials as obtained by proximity analysis of fingerprints and supervised gradient boosting classifier, respectively.

Owing to the fact that training VAEs is unsupervised, that is no labelling of compositions is required, we used all available \(({{\rm{A}}}_{1-{\rm{x}}}{{\rm{A}}^{\prime} }_{{\rm{x}}}){{\rm{BO}}}_{3}\) and \({\rm{A}}({{\rm{B}}}_{1-{\rm{x}}}{{\rm{B}}^{\prime} }_{{\rm{x}}}){{\rm{O}}}_{3}\) compositions totalling 2104 to train and validate the model. Over 90% of experimental data relate to ambient conditions, so do our predictions. The database included 119 materials that were difficult to be reliably deposited into either of perovskite or non-perovskite categories. Clearly, there is no hard boundary between these two classes, separating them perfectly. Ultimately, it comes down to which degree of distortion from the ideal cubic perovskite lattice can be tolerated to still regard the resulting structure as perovskite? We let the VAE model to place the materials in the region that they ‘deserve’. The input to the VAE contains the same features as used in the previous classification step. Despite not being explicitly trained as perovskite or non-perovskite, the material fingerprints capture this information and form separate regions in the fingerprint space as can be observed from Fig. 4b. In addition, the alloyed site is clearly recognizable from the 2D fingerprint. Notably, some of the non-perovskite-labelled materials fall within the perovskite region as circled in Fig. 4b. Instead of disregarding them as ambiguous, we scrutinized the root cause for this and found that at least 14 out of 20 materials falling inside the circle were described with the terms ‘perovskite’ or ‘hexagonal perovskite’ in other areas of the literature (see Supplementary Table 2). Apparently, these compounds are more ‘perovskite’ than ‘non-perovskite’. This is a prime example to elucidate that the materials are fingerprinted on their own merits. It also draws the attention to what appears to be outlier data for cross checking the correctness of data labelling.

Now, the cardinal question is, are these fingerprint vectors really learned such that they embed the underlying chemical and structural properties of materials? To answer this question, we investigated small local clusters in the fingerprint space. The key idea is that the unsupervised model finds similar fingerprint representations for analogous compositions, which can be corroborated with a k-nearest neighbours approach. Anticipating that similar materials should find themselves close to each other, we compared the crystal system (cubic, orthorhombic, tetragonal, etc.) of each material with that of the five nearest neighbours (NNs) in the 2D fingerprint space. This can be viewed as a leave one out cross validation (LOOCV) technique, because, despite never being trained to predict the crystal system, all the other materials except the one being tested are present on the fingerprint space. In order to quantify materials analogies in native domain, we define euclidean distance derived similarity measure (SE) as; SE(p, q) = 1/(1 + dE(p, q)) where p and q are two composition points and dE(p, q) is the euclidean distance between p and q. If majority of neighbours vote for (belong to) the same crystal system, that will be assigned as the predicted crystal system of the original composition. In the case of equal number of votes, the one relating to the lowest euclidean distance is selected. By performing this analysis for all 2104 materials, we found that the 5-NNs can determine the correct experimental crystal system of the parent composition with a success rate of 71.8%. This implies that the unsupervised material fingerprints indeed carry crystal structure information concealed in its relative locality. Note that the optimal number of neighbours is found to be 5 by heuristic tuning. Figure 4c illustrates crystal system prediction performance as a normalized confusion matrix. Moreover, this strategy is extensible to predicting the space group of compositions, which resulted in 59.9% prediction accuracy (see Supplementary Fig. 6 for confusion matrix). Given that the total number of distinct space groups present in our database is 66 and the model is not explicitly trained to predict them, nearly 60% accuracy figure reflects that even such diminutive information is inherent in the material fingerprint. Although the data circumstances are different, material fingerprint based prediction of crystal system proposed here offers around 6% accuracy improvement over existing multi-class supervised approaches that rely on multitude of magpie features obtained only from the composition24,37. Furthermore, space group detection of experimental compounds, even by training a supervised convolutional neural network (CNN) on about 150,000 powder XRD patterns is proved to be challenging, with a reported accuracy of 81%38. With no supplemental knowledge on crystallographic or XRD data of complex solid solutions, our model performs appreciably close to the state-of-the-art CNNs on this aspect, albeit the differences between sample sizes and data contexts should be accounted.

To assess how well this fingerprint proximity oriented approach contrasts with supervised parametric models, we evaluated the crystal symmetry detection performance of GBC, DT, RF and SVM classifiers using LOOCV for unbiased comparison. The feature space is chosen to be the same as used in perovskites classification step. The performance metrics are summarized in Table 2. GBC achieves the highest crystal system classification accuracy of 78.8% (see Fig. 4d) while the proposed scheme, relying only on the 2D fingerprints, outperforms supervised algorithms such as SVM and RF. The trend is similar for space group classification. Most importantly, unsupervised deep learning techniques are quite rigorous when finding their own representations, and advantageous in inversely exploring the compositional space, which is not feasible in supervised ML regime. For instance, when the database is of sufficient size and diversity, the representations learned by VAE form an information-embedded topology that can be used not only to predict the crystal structure of unknown compositions but also to discover compositions analogous to target experimental materials, a point that we come back later in this paper.

Table 2 LOOCV performance metrics of crystal system and space group classification.

Phase identification and bandgap prediction

Some perovskites such as ferroelectrics display temperature-dependent phase transitions. There can also be coexisting phases at the same temperature39. We emphasize that chemical composition is not duplicated in our database, leading to an elimination of additional phases, if there are any. Taking a step back, we examined whether the NNs can reflect phase information of a material that has multiple crystallization phases. We found that if the NNs imply strong relation to two or more different symmetries by means of low euclidean distance (i.e., SE > ~ 95%), it may indicate a potential phase transition in the material of interest. For example, consider ferroelectric (Sr0.6Pb0.4)TiO3 which is not recorded in the ICSD (neither ICSD nor our database contains any record of Sr − Pb − Ti − O system). Four experimental NNs of this composition vote for the cubic system and the remaining NN shows an SE of 95.7% to the original ferroelectric and votes for the tetragonal system. There is indeed a cubic to tetragonal phase transition at 66 °C in this material40. 219 compositions in our experimental database can crystallize in multiple phases (upto 4 crystal systems). In 97 cases, all the crystal systems shown by each material are present in the set of crystal systems of corresponding 5-NNs. We stress that the same experimental fingerprints obtained above are used throughout with no further training. Notably, 33 compositions show 3 or more phases. The stated approach can predict 2 or more phases in 27 out of 33 compositions as depicted in Fig. 5a.

Fig. 5: Predicting the properties of materials via fingerprint analogies.
figure 5

a Phase prediction of experimental compositions that can crystallize in three or more phases. We focus only on the crystal system (i.e., the phase transitions occurring within a single crystal system are ignored). The phase is considered to be predicted if at least one of the 5-NNs belong to the crystal system of original composition. b Bandgap prediction results for a DFT database of quaternary perovskites5. It should be noted that generalized gradient approximation is used for the calculations in ref. 5 that tends to underestimate the bandgap.

We search for the compositions with no known phase transformations by probing those which are reported to have only one phase (one crystal system) for all alloying ratios within the system. We found that 1283 compositions have no potential phase transformations. Out of these, 573 were predicted to have only one phase as expected. Two phases were predicted for 475 compositions. We conjecture that a considerable number of materials have not been fully characterized and even some of the known phases are not yet recorded in the ICSD. For example, ferroelectric (Ag0.9Li0.1)NbO3 favors a trigonal phase at room temperature and displays temperature-induced sequence trigonal → orthorhombic → tetragonal → cubic41. However, the ICSD reports only the trigonal phase of this material (as of March 2021). Therefore, it was wrongly labelled to have no phase transformations. We notice that 4-NNs of (Ag0.9Li0.1)NbO3 vote for the trigonal system and one votes for the cubic system. This is sensible given the above phases of this material. While capturing the Curie point is too much to ask from this unsupervised approach, it does convey some useful insights on possible phase transitions. We further tested whether the bandgaps of the compositions located in close proximity are similar by predicting the DFT computed bandgaps of 1016 quaternary perovskites reported in ref. 5. These compositions are fingerprinted and the bandgap is predicted by averaging this figure of 5-DFT-NNs of the considered composition (see Fig. 5b). The mean absolute error of the predictions is 0.385 eV, consistent with existing ML based bandgap prediction studies12,42. We further employed several supervised approaches for predicting the bandgap. Although the error rate above is very similar to LOOCV MAE of support vector regressor (0.369 eV), the lowest MAE is achieved by RF regressor (0.17 ± 0.004 eV). Based on these predictions, it is safe to assume that geographically close materials likely possess similar properties.

Analogical discovery of perovskites

The fingerprints of the 10,790 predicted perovskites were extracted at the μ layer by sending the associated features through the trained VAE network. This brings the candidate compositions and the experimental observations to a common ground, that is the fingerprint space, where unexplored perovskites landing close to a target experimental material are identified as analogous compositions to that material. We first tested the effectiveness of fingerprints for a set of predicted perovskites. During the materials screening stage, our web scraping tool recovered 24 GBC-predicted perovskites from literature that were not available in the ICSD. 17 out of 24 were reported in credible experimental studies that contained crystal structure information. As predicted, all 17 compositions have been experimentally identified as perovskites. By placing them in the fingerprint space and taking the majority vote of 5-experimental-NNs into account, we correctly identified the crystal system of 13 perovskites out of 17 (Table 3). Not only does this underpin the accuracy of our ML prediction, but it also confirms the generalizability of the material fingerprints to unseen data.

Table 3 Crystal system prediction of perovskites not present in the ICSD using material fingerprints.

The inauguration of analogical materials discovery is driven by the exigent demand for alternatives to some of the existing industrial materials that may be toxic, expensive, unstable or lossy. For example, discovering lead-free ceramics and solar cell materials is a heavily invested specialism43. Identifying lead-free promising perovskites that potentially resemble the crystal structure and physiochemical properties of a Pb-based material is achievable by fingerprinting the Pb-composition and capturing the nearest candidates in the fingerprint space. For instance, our analysis suggests that unstudied Bi(Mg0.65V0.35)O3 could be a non-toxic substitution for room temperature multiferroic Pb(Fe0.5Nb0.5)O3 (PFN). PFN ceramic undergoes several phase transitions, namely, cubic → tetragonal → rhombohedral → monoclinic with decreasing temperature44. Rather unexpectedly, 5-NNs to Bi(Mg0.65V0.35)O3 evidence the relation to all these phases with tetragonal P4mm receiving two votes and the rest obtaining one vote each. Moreover, Bi(Sc0.2Co0.8)O3 and Bi(Ti0.5Zn0.5)O3 locate near the relaxor ferroelectric Pb(Mg1/3Nb2/3)O3. Note that double perovskite equivalent of Bi(Ti0.5Zn0.5)O3 has experimentally been identified as an analogue of PbTiO345. Furthermore, (K0.5Bi0.5)ZrO3 shows similarity to the studied ceramic (Ba0.67Pb0.33)TiO3. It is well regarded that bismuth-based ceramics are eco-friendly alternatives to lead-rich piezoelectrics, with the downside of usually requiring high-pressure and high-temperature synthesis conditions. Intriguingly, many compositions with bismuth occupying A-site, when fingerprinted, lie close to those containing lead. This implies that our model can rediscover known phenomena even with no domain-specific knowledge explicitly supplied. Full list of suggested alternatives to current selected perovskites is tabulated in Supplementary Table 3.

DFT validation

One of the major hindrances to modelling unstudied materials with ab initio methods such as DFT is the lack of a priori knowledge about the starting geometry. This is especially challenging for perovskites having previously unknown cation combinations. In the era of experimental records reaching big data status, chances are structurally similar materials to an unstudied material do exist. Not only does material fingerprinting facilitate analogical materials discovery, but it also is readily applicable to find a reasonable experimental structure required to initialize DFT geometry optimization. As a demonstration, we perform DFT calculations to realize the optimized structures of six selected promising perovskites. Tractable computational methods of modelling random solid solutions in conjunction with the DFT framework include the special quasirandom structures (SQS) approach46, the virtual crystal approximation47 and the supercell approach that signifies disorder aspects at local level. While the SQS method has been particularly successful in predicting the properties of random solid solutions48, we note that the supercell approach is significantly faster than SQS generation and the associated groundwork. Since property prediction is not intended here, we opt for the supercell method and use the recent supercell program especially designed to construct reflective supercells of solids having mixed/partial site occupancies49. It potentially includes SQS systems within its exhaustive search space and implements a combinatorial structure ranking method based on their respective electrostatic (Coulomb) energies. Owing to the importance of conformer search, here, we select the supercell configuration with the lowest Coulomb energy to start geometry optimization. Note that, supercell generation itself could be very computationally demanding depending on the number of combinatorial structures, not to mention the subsequent DFT calculation. Figure 6 elaborates the DFT structure optimization steps of the selected promising perovskite (Ce0.5Ag0.5)ZrO3. By analyzing the fingerprints, we found that (La0.8Al0.2)FeO3 orthorhombic Pbnm, is the most similar experimental material in our database, to the selected candidate. Hence, it is used as the starting geometry. Table 4 details the structure generation and the optimized lattice constants of the six selected promising perovskites. The structures are fully relaxed, allowing to change the ionic positions, cell volume and cell shape. All the optimized structures preserve the perovskite nature as well as the starting structural integrity with minimal change in lattice parameters, seemingly indicating that the initial geometry found by material fingerprint analogy is a plausible basis.

Fig. 6: DFT modelling workflow of the composition (Ce0.5Ag0.5)ZrO3.
figure 6

The initial geometry cif (crystallographic information file) is borrowed from an analogous experimental material (La0.8Al0.2)FeO3 having an orthorhombic Pbnm structure. The atomic species, occupancies and oxidation states were replaced accordingly. A supercell of size 2 × 1 × 1 having the lowest Coulomb energy is selected. DFT optimization of ionic positions, cell volume and cell shape results in an orthorhombic Pmn21 structure, preserving perovskite nature and staring structural attributes.

Table 4 DFT geometry optimization results of 6 selected promising perovskites.

Discussion

There are several extensions to the present study worthwhile considering. Virtually all materials in our experimental database are published single phase structures. Collecting non-perovskite materials from the literature or databases essentially means that we miss failed syntheses of perovskites or those having major secondary phases, which are, strictly speaking, non-perovskites too. While this data is not usually published, the possible inclusion of such materials as non-perovskites would enhance the reliability of ML predictions. Another possibility would be to use a transfer learning technique by training a neural network classification model subsequently on a large computational database of perovskites/non-perovskites and available experimental data to mitigate data limitation problem15,50. Furthermore, because over 90% of our data has been acquired near room temperature, we did not scrutinize temperature dependence for crystal structure prediction. All our predictions are based only on the composition, assuming ambient conditions. Present study can be extended to include temperature by curating a sufficient database having enough diversity in the temperature field. Due to data limitations, this method is restricted to 66 space groups available in our database. Therefore, the phases outside of these space groups cannot be predicted. This is one of the reasons why we focus on the crystal system, as the data is abundant for all the categories. Moreover, the typical black-box nature of artificial neural networks makes the material fingerprint a black-box too. Investigating the ties of the fingerprint to a physical meaning would make the materials design process more transparent. A relatively large and ideally experimental database of heterogeneous compositions is essential for the proposed framework to succeed. This may be viewed as a constraint to apply our concept in very specific domains where materials data is scarce. We believe the database used in this study is reasonably sized to follow a neural network strategy, and we recommend at least a similar sized database to apply our method in a different context.

In summary, by establishing a comprehensive experimental database of disordered quaternary compositions and involving powerful ensemble ML methods, we have rapidly screened potential perovskites from total feasible compounds. A further series of filtering steps including automated web scraping were imposed to ensure the originality and potential synthesisability of the predicted perovskites. Concurrently, we introduced an unsupervised deep learning strategy to retrieve fingerprints of materials. We elucidated the effectiveness and the educated essence of fingerprints by unravelling hidden salient information such as perovskite formability and underlying crystal structure, that can be used to identify the crystal symmetry of materials. The fingerprints essentially regulate all compositions into a common numerical platform where similar compositions tend to locally stick together. This phenomenon can be capitalized to predict the possible phases and the bandgap of materials. This further led to the conception of analogical materials discovery, that could accelerate the pursuit of finding inexpensive green materials as a replacement for current hazardous materials. We also emphasized the importance of fingerprints to pick up an initial experimental structure required for DFT modelling. Material fingerprint uncovered in this study is not universal, meaning that, it is mutable in a different data context. Nevertheless, due to the versatility of unsupervised techniques, the materials analogy quantification conceptualized here is expected to be valid for other material systems as well. Therefore, we believe the proposed framework has far-reaching implications for accelerated target-driven discovery of materials as well as computational modelling of complex compounds.

Methods

Machine learning

All supervised ML algorithms used in this study, including KNN and GBC are adapted from the scikit-learn package51 in a python programming language environment. Gradient boosting algorithms create an ensemble of weak learners, typically by adding DTs over time to produce a powerful predictive model. The learning takes place by applying gradient descent algorithm to optimizing a differentiable cost function, usually logarithmic loss for classification tasks. The hyper-parameters are external to the ML model and should be tuned in advance to control the training process and achieve better predictive power. We tuned the hyper-parameters of the models (learning rate and No. of estimators in the case of GBC) using tenfold CV and grid search method. In k-fold CV, the database is randomly partitioned into k equal size subsets. The ML model is trained by combining k-1 of them and validated on the remaining subset. This is repeated k times such that each data-point in the original database gets to be in the test set exactly once. LOOCV is the extreme version of k-fold CV where k is equal to the number of samples in the database. Material features were acquired and analyzed using pymatgen34 and mendeleev52 python packages. The computer codes required to train the models and reproduce the results presented in this paper may be found online at https://github.com/ihalage/analogmat.

VAE architecture and training

The VAE model is implemented in python using Keras deep learning framework53 with TensorFlow backend54. The depth of the VAE neural network and the dimension of the latent vector should be carefully controlled to obtain meaningful representations. A high depth (i.e., more parameters) means low reconstruction loss, but this often comes with the risk of data memorization rather than learning anything useful. High dimensional latent space can also give rise to this problem. We found that, using a significantly low dimension for the latent vector not only forces the encoder to find near optimal compression due to the added pressure by high reconstruction loss, but it also facilitates convenient data visualization without incorporating other dimensionality reduction algorithms. After some extensive heuristic tuning, we eventually constructed the VAE with five hidden layers with 2D μ, σ and latent vectors. The remaining four hidden layers have 128, 64, 32, and 64 neurons, chronologically. The weights were initialized using Glorot uniform initializer. Rectified linear unit is used as the activation function of the hidden layers and linear activation is used for the output layer. L2 regularization is applied in hidden layers to avoid the overfitting phenomenon. Kullback–Leibler divergence (KL loss)55 and mean squared error (MSE) were used additively as the loss function, which was optimized using Adam optimizer with a learning rate of 0.001. The exponential decay parameters β1, β2 and the stability constant ϵ were set to 0.9, 0.999 and 10−7, respectively. The model is trained on an RTX 2080 Ti graphics processing unit (GPU) with a batch size of 32 and the validation data (20%) loss was monitored for an early-stopping method to prevent possible data memorization.

Density functional theory

Spin-polarized DFT calculations were performed with projector augmented wave pseudopotentials56 using the generalized gradient approximation as implemented in the Vienna Ab initio Simulation Package (VASP)57. We used perdew–burke–ernzerhof exchange-correlation functional for the present calculation. The Brillouin zone was sampled with a 10 × 10 × 10 Monkhorst-Pack k-point grid. Cutoff energy for the plane wave basis set was set to 520 eV. The ionic positions, cell volume and cell shape were allowed to change during the structure relaxation until reaching an energy convergence threshold of 2 × 10−6 eV and net atomic force less than 0.01 eV/Å on any atom. The solid solutions were approximated with a supercell having the lowest Coulomb energy as generated by the supercell package49. The supercell dimensions used in this study are sufficient to demonstrate the geometry optimization process, however, larger supercells or a SQS based method is recommended if further calculation of physical properties is intended. The crystal structures were analyzed using pymatgen and visualized with VESTA58. We used the GPU port of VASP59 to run the calculations on an RTX 2080 Ti GPU.