Introduction

Perovskite materials derived from the mineral of CaTiO3 have the general formula ABX3, in which A and B are cations, and X is an anion. Perovskites have a typical crystal structure shown in Fig. 1. The smaller B-site cation has six-fold coordination and forms a BX6 octahedron. Eight BX6 octahedra share corners to form a dodecahedron interstitial space for accommodating a twelve-fold coordinated larger A-site cation. The X is usually an oxygen, halogen or chalcogen anion, which can tolerate many vacancies. Perovskites have demonstrated ubiquitous applications to energy conversion/storage/harvesting, catalysis, sensor, superconductor, ferroelectric, piezoelectric, magnetic, and luminescence due to the numerous unique properties caused by their versatile compositions and flexible crystal structure symmetries.

Fig. 1: The schematic drawing of the crystal structure of fractionally doped perovskites.
figure 1

The chemical formula of fractionally doped perovskites is AxA’1-xByB’1-yO3-δ, where 1 ≥ x ≥ 0, 1 ≥ y ≥ 0 and 3-δ represents oxygen nonstoichiometry.

Hybrid organic-inorganic perovskites (HOIPs) and inorganic perovskites (IPs) are the two main perovskites. HOIPs have a positively charged organic group (e.g., methylammonium, MA; formamidinium, FA) in A site, a metal cation (e.g., Pb, Sn) in B site, and a halogen anion (e.g., I, Cl, Br) in X site1. As semiconductors, HOIPs showed excellent optical and electrical properties and found extensive applications to solar cells2,3,4,5, photodetectors6, photocatalysis7, and light-emitting8. IPs usually denote perovskites oxides with metal cations in A and B sites and oxygen anions in the X site. Perovskite oxides allow fractionally doping multiple metals in both A and B sites and forming crystal structures with ~15 symmetries, enabling almost infinite compositions with unique properties. Instead of the simple ABO3 perovskite oxides, the fractionally doped perovskite oxides (FDPOs) with a generic formula of \({{{{{{\rm{A}}}}}}}_{{{{{{{\rm{x}}}}}}}_{1}}^{1}{{{{{{\rm{A}}}}}}}_{{{{{{{\rm{x}}}}}}}_{2}}^{2}\cdots {{{{{{\rm{A}}}}}}}_{{{{{{{\rm{x}}}}}}}_{{{{{{\rm{m}}}}}}}}^{{{{{{\rm{m}}}}}}}{{{{{{\rm{B}}}}}}}_{{{{{{{\rm{y}}}}}}}_{1}}^{1}{{{{{{\rm{B}}}}}}}_{{{{{{{\rm{y}}}}}}}_{2}}^{2}\cdots {{{{{{\rm{B}}}}}}}_{{{{{{{\rm{y}}}}}}}_{{{{{{\rm{n}}}}}}}}^{{{{{{\rm{n}}}}}}}{{{{{{\rm{O}}}}}}}_{3}\) (Am and Bn are metal cations, m ≥ 1, n ≥ 1, x1 + x2 + …xm = 1, y1 + y2 + …yn = 1) have demonstrated pervasive applications. Mn and Sr doped LaAlO3 was the first experimentally proved perovskite oxide to efficiently split water and carbon dioxide to produce hydrogen and carbon monoxide based on the high-temperature and low-temperature solar thermochemical redox cycles9. Ba0.5Sr0.5Co0.8Fe0.2O3-δ showed extremely high performance as an oxygen-permeable membrane and the cathode of oxygen-ion conducting solid oxide fuel cells (O-SOFCs)10,11. Perovskite oxides of BaCe0.7Zr0.1Y0.1Yb0.1O3-δ and BaCo0.4Fe0.4Zr0.1Y0.1O3-δ demonstrated promising performance as an electrolyte and an oxygen electrode for protonic ceramic fuel/electrolysis cells (PCFCs and PCECs)12,13. Doped lanthanum gallate oxide (La1-xSrxGa1-yMgyO3-δ) showed high ionic conductivity at intermediate temperatures (~600 °C) as an O-SOFC electrolyte material14. YBa2Cu3O7 has become a state-of-the-art superconductor for many years as one perovskite-derived structure15. BaTiO3 and BaNi0.5Nb0.5O3 showed promising ferroelectric and piezoelectric properties16,17. Perovskite manganites with a general formula R1-xAxMnO3 (R = rare-earth cation, A = divalent alkaline-earth cation) attracted much attention due to their colossal magnetoresistance and magnetocaloric effect18,19. Therefore, discovering new FDPOs with desired geometrical symmetries and properties for targeted applications has become one of the most important research topics in many materials science fields. However, FDPOs allow nearly 90% of the elements from the periodic table to come into the perovskite structures and fractionally doping, resulting in almost infinite complex compositions, which may be the targeted perovskite oxides. The prerequisite for discovering new perovskite oxides is to make quick and accurate predictions to determine if the designed compositions are perovskite or not before the time-consuming and expensive theoretical computation, experimental synthesis, and property characterization.

Establishing a link descriptor between composition and structure via basic elemental and simple oxide properties has attracted a significant effort to predict new perovskite oxides. Historically, simple geometrical factors were in use for predicting the formability of perovskite oxides. The Goldschmidt tolerance factor calculated from the ionic radii of A and B cations and oxygen anion (\(0.75\le {{{{{\rm{t}}}}}}=\frac{\left({{{{{{\rm{r}}}}}}}_{{{{{{\rm{A}}}}}}}+{{{{{{\rm{r}}}}}}}_{{{{{{\rm{O}}}}}}}\right)}{\sqrt{2}\left({{{{{{\rm{r}}}}}}}_{{{{{{\rm{B}}}}}}}+{{{{{{\rm{r}}}}}}}_{{{{{{\rm{O}}}}}}}\right)}\le 1.05\)) was the most commonly used criterion to determine if a new composition was a perovskite or not20,21,22,23,24. Considering that the stable octahedron BO6 was the basic unit for perovskite structure, the octahedral factor (\(0.414\le {{{{{\rm{\mu }}}}}}={{{{{{\rm{r}}}}}}}_{{{{{{\rm{B}}}}}}}/{{{{{{\rm{r}}}}}}}_{{{{{{\rm{O}}}}}}}\le 0.732\)) defined as the ratio between B-site cation and oxygen anion was a prerequisite to forming the perovskite structure25,26. The combination of the octahedral factor and the tolerance factor showed improvement in the prediction accuracy21. Most recently, a new tolerance factor (\({{{{{\rm{\tau }}}}}}=\frac{{{{{{{\rm{r}}}}}}}_{{{{{{\rm{O}}}}}}}}{{{{{{{\rm{r}}}}}}}_{{{{{{\rm{B}}}}}}}}-{{{{{{\rm{n}}}}}}}_{{{{{{\rm{A}}}}}}}({{{{{{\rm{n}}}}}}}_{{{{{{\rm{A}}}}}}}-\frac{{{{{{{\rm{r}}}}}}}_{{{{{{\rm{A}}}}}}}/{{{{{{\rm{r}}}}}}}_{{{{{{\rm{B}}}}}}}}{{{{{{\rm{ln}}}}}}\left({{{{{{\rm{r}}}}}}}_{{{{{{\rm{A}}}}}}}/{{{{{{\rm{r}}}}}}}_{{{{{{\rm{B}}}}}}}\right)}) \, < \, 4.18\)) (where nA is the oxidation state of A, ri is the ionic radius of ion i, and rA > rB by definition) integrated A-site cation oxidation state information with the geometry properties demonstrated an improved prediction accuracy for simple (ABO3) and double perovskite oxides (A2B1B2O6)27. To further improve the prediction accuracy, based on two-dimensional structure map technology, a series of scientific works also studied the effect of different feature pairs on the formability of simple perovskite oxides, which involved the ionic radii of A and B ions21, the bond length between A or B site and oxygen atom23, the tolerance factor and octahedral factor21,26. However, the two-dimensional structure map only utilized one feature pair to predict, which did not simultaneously consider more than two features to improve prediction accuracy further. Besides, first-principles density functional theory (DFT) calculations were also adopted to predict the formability of perovskite oxides. Emery et al. calculated the formation energies for 5,329 ABO3 combinations generated by inserting 73 metals and semi-metals into A and B sites, respectively, which quickly predicted 395 thermodynamically stable simple perovskite oxides28. Although DFT calculation showed a promising future for predicting perovskite oxides, it still has not shown very successful predicting performance for the infinite FDPOs. Computing many complicated compositions needs more complex unit cells related to time-consuming and expensive DFT computations.

In recent years, the machine learning method has emerged as a powerful tool to accelerate new materials discovery. Several research groups applied machine learning techniques with DFT calculated results as training data to identify promising simple perovskite and double perovskite candidates29,30,31,32. Balachandran et al. used the random forest and gradient tree boosting classifiers to correctly classify 90% of their database as perovskites or non-perovskites using DFT as a source of training data33. Ye et al. utilized deep neural networks to predict the formation energy of ABO3 perovskites using DFT training data34. However, most machine learning prediction work focused on the simple ABO3 or double perovskites calculated from the DFT method. Utilizing the machine learning tool to predict new complex FDPOs based on experimental perovskite structure data has not been reported yet. The more reliable perovskite and non-perovskite experimental data are not ready yet. The extensive selection of features for predicting perovskite structures still needs significant effort. In the current work, we confined the FDPO study to a specific application of the solar thermochemical hydrogen (STCH) production for ensuring high-quality machine learning using a limited amount of data, serving as a case study for other functional FDPOs. We established a reasonable experimental database, selected 21 basic features, and screened six models to predict FDPOs. The verification based on literature data and the synthesis by solid-state reaction and modified Pechini method and XRD characterization proved the successful prediction of FDPOs using our model.

Results

Data collection

The prerequisite to using machine learning to predict FDPOs is to collect enough reliable data for the given compositions with identified crystal structures, either perovskite or non-perovskite. Instead of the data derived from DFT calculation, the experimental data from Inorganic Crystal Structure Database (ICSD) and sporadic literature served as the crucial data sources. However, all data must be manually examined and compared to the original literature by the person with experience in FDPOs to ensure data reliability. Collecting a large quantity of data (e.g., more than 1000) takes too long or costs too much to perform efficiently. Furthermore, sometimes we don’t have enough data available to train the model. Therefore, we confined the FDPO study to a specific application of the solar thermochemical hydrogen (STCH) production for ensuring high-quality machine learning using a limited amount of data, serving as a case study for other functional FDPOs.

We searched the ICSD database by inputting key STCH elements, setting the total element numbers, and downloaded eligible FDPO data experimentally obtained at room temperature under atmospheric pressure. Since the first experimental report of SrxLa1-xAlyMn1-yO3-δ as excellent STCH materials with promising oxygen exchange capacity and high hydrogen production yield, FDPOs containing La, Sr, Mn, and O have become one of the most popular STCH material systems9. With La, Sr, Mn, and O as STCH key elements and 4, 5, 6, and 7 as the total element numbers, we downloaded 578 perovskite data. Furthermore, the most recent studies (e.g., CaTi1-xFexO335 and BaCexMn1-xO336) indicated that some Ca, Ba, or Ti-doped FDPOs exhibited fast reaction kinetics as the potential STCH materials. So we also searched the ICSD database with Ca, Mn, and O as key STCH elements and 4, 5, and 6 as total element numbers, resulting in another 661 eligible perovskite data. In addition, using Ba and Ti as crucial elements and a total element number of 5, we got 166 eligible perovskite data. In summary, we downloaded around 1400 experimental FDPO data from the ICSD database by confining the specific application of STCH materials.

We established several rules to thoroughly clean up the raw data to improve the data’s reliability. First, many repeated data in our data set resulted from the overlapped searching and the repeated reports. We manually checked the data set and removed all the repeated data points. Secondly, we were interested in the FDPOs, which were stable under room temperature and atmospheric pressure, which urged us to rule out some thermodynamically metastable compounds from our data set. Furthermore, some compounds did not show the desired unity for A and B metal sites, which did not qualify them for reliability. We also removed some unrealistic perovskite data involved in radioactive elements or expensive rare earth elements by considering the safety and economic viability. While looking into original literature to verify the data reliability during the cleaning process, we found some perovskite oxides containing one or more elements among La, Sr, Mn, Ca, Ba, and Ti elements did not show in our data set. We treated these as reliable data and added them to our data set. Finally, we obtained a data set consisting of 516 FDPOs (Supplementary Data 1).

The collection of the non-perovskite compounds for the negative data point for machine learning was not a trivial work. We first searched the ICSD database for pyroxene, ilmenite, or other non-perovskite structures with a doped ABO3 formula. The stability at room temperature under atmospheric pressure was also additional confinement. The inputting elements of these compositions were the same as the chosen perovskites, e.g., Ca, Ba, Mn, and Ti. We also double-checked these non-perovskites by tracing back to the original literature. Some mislabeled non-perovskite compounds in the ICSD database, which were confirmed to be perovskite by further studying literature, were removed from our non-perovskite data set. We obtained 96 non-perovskite data points from the ICSD database. The non-perovskites with mixed phases commonly occurred during the experimental synthesis, which did not show in the ICSD database. We extensively searched sporadic literature and collected another 20 non-perovskites with mixed phases. Finally, we obtained 116 non-perovskite data in our data set (Supplementary Data 1).

Figure 2 summarizes the chemical space of these compounds in our data set. Our STCH FDPO data set contains 31 elements (viz. Li, Na, Mg, Al, K, Ca, Sc, Ti, Cr, Mn, Fe, Co, Ni, Zn, Ga, Sr, Y, Nb, Ag, Ba, La, Ce, Pr, Nd, Sm, Gd, Tb, Dy, Yb, Pb, and Bi) in A site, 26 elements (viz. Mg, Al, Si, Sc, Ti, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, Ga, Ge, Y, Zr, Nb, Mo, In, Sn, Ce, Yb, Ta, W, and Pb) in B site, and 41 elements in total. The numbers at the corner of each element show the perovskite compound numbers, which contain that element. The chemical space significantly expanded the key STCH elements we used to search perovskite compounds in the ICSD database, which provides us more opportunities to establish new potential PDFOs to serve as STCH materials for future study.

Fig. 2: The chemical space of the database.
figure 2

The background colors indicate different subcategories and numbers exhibit the appearance frequency of different elements in A or B site.

Features selection and models selection

The feature selection was initialized by summarizing the popular perovskite structure descriptors commonly reported in the literature. Historically, the widely used descriptors for predicting the formability of perovskites were the Goldschmidt tolerance factor (t) and the octahedral factor (µ) defined by the radii of the metal and oxygen ions21,37,38. Recently, the new tolerance factor (τ) further included the information of the oxidation state of A (nA) besides the ionic radii to improve the predictability27. These three features are closely related to the A and B elements’ ionic radii and oxidation states. In addition, although it has not been studied extensively as a perovskite formation descriptor, the ratio between A-site cation to oxygen anion (rA/rO) is a parallel factor to the octahedral factor. Therefore, we selected Goldschmidt tolerance factor (t), octahedral factor (µ), rA/rO, new toleration factor (τ), ionic radii (rA and rB), and oxidation states (OA and OB) of A site and B site elements as our features. The previous prediction of simple ABO3 perovskite oxides by machine learning indicated that, besides the ionic radii of A-site and B-site cations, the tolerance factor and the octahedral factor, the A-O and B-O bond lengths and the Villars’ Mendeleev numbers also played a critical role to decide if a known composition could form perovskite structure or not30. Therefore, we also chose the dAO, dBO, MA, and MB as our features for predicting FDPOs.

Besides the 12 commonly used features (rA, rB, OA, OB, t, µ, rA/rO, τ, dAO, dBO, MA, and MB) for predicting the formability of simple perovskites, we further introduced another 9 features for emphasizing the contribution of multiple dopant ions in both A and B sites of FDPOs. The multiple dopants introduced the size discrepancy, resulting in phase separation of perovskite or the transformation to other structures. The size variance factor (\({{{{{{\rm{\sigma }}}}}}}^{2}={\sum }_{i}{{{{{{\rm{y}}}}}}}_{{{{{{\rm{i}}}}}}}{{{{{{\rm{r}}}}}}}_{{{{{{\rm{i}}}}}}}^{2}-{({\sum }_{i}{{{{{{\rm{y}}}}}}}_{{{{{{\rm{i}}}}}}}{{{{{{\rm{r}}}}}}}_{{{{{{\rm{i}}}}}}})}^{2}\))39, which described the mismatch in A-site ion radii in perovskite, was correlated to phase separation of perovskites. We adapted the size variance factors for both A-site and B-site ions as our new features. Furthermore, the differences in the size scales of different atoms pairs (i.e., rA max/rO, rBmax/rO, rAmin/rO, rBmin/rO, rAmax/rAmin, and rBmax/rBmin) were also defined as new features further to investigate the size effect on the formability of FDPOs. Moreover, the multiple dopants commonly broke electroneutrality in FDPOs, relating to the formation of oxygen nonstoichiometry in the perovskite structures. However, the large nonstoichiometric oxygen or the big offset of electroneutrality can break the perovskite structure. Therefore, we defined a new factor of (OA+OB)/6 to include the electroneutrality effect. As summarized in Supplementary Table 1, we selected 21 features of rA, rB, OA, OB, t, τ, rB/rO(µ), rA/rO, dAO, dBO, MA, MB, rAmax/rO, rBmax/rO, rAmin/rO, rBmin/rO, rAmax/rAmin, rBmax/rBmin, σ2(A),σ2(B), and (OA + OB)/6 to perform our model training. Here, the fractionally doped ABO3 involved multiple A-site and B-site cations. Therefore, we took the average using a weighted linear combination based on the fractions of the multiple elements to get the features for each fractionally doped ABO3 composition (including both positive and negative data). Here, we simplified the expression by using the original symbols to express the average features.

The feature calculation started from the basic element properties such as ionic radius, stable oxidation state, bond length, and Mendeleev number for the individual primary elements from an easily accessible database and literature (Supplementary Tables 2 and 3). Then we either took average or did calculations based on the feature definition equations on getting the values for the 21 features. We adapted the ionic radii and oxidation states for all the A-site and B-site elements shown in Fig. 2. Typically, coordination number (CN), oxidation state, electronic spin, etc. affect the ionic radius for each element40. The CNs for A-site and B-site ions in the perovskite structure are 12 and 6, respectively. Usually, the 12-fold coordinated A-site ions have ionic radii much bigger than B-site ions, relating to lower oxidation states (e.g., +1, +2, and +3). The smaller 6-fold coordinated B-site ions have higher and sometimes multiple oxidation states (e.g., +4, +3, +2). It is easy to find the Shannon ionic radii for the 6-fold coordinated B-site cations. However, some A-site elements don’t have the Shannon ionic radii available with a CN of 12. Instead of directly deriving the ionic radii from the Shannon database, we adapted the newly complemented Shannon ionic radii by using the data from Ouyang’s work41. The data-driven method SISSO extrapolated the Shannon radii to unusual oxidation states and arbitrary CNs, allowing the easy access of 12-coordinated A-site ions. As for the bond length, an empirical correlation \({{{{{{\rm{s}}}}}}}_{{{{{{\rm{ij}}}}}}}={{\exp }}(\frac{{{{{{{\rm{d}}}}}}}_{0}-{{{{{{\rm{d}}}}}}}_{{{{{{\rm{ij}}}}}}}}{{{{{{\rm{B}}}}}}})\) was used to describe the relationship between the bond length dij and the bond valence sij in the bond-valence theory, where d0 and B were constants whose values were determined empirically for each pair of atoms that form bonds42,43. Assuming sij equal to the ideal valence of the atom, the quotient of the natural valence divided by the number of the nearest oppositely charged ions (CN), we estimated the ideal bond length dij. In our case, atoms in the A site and B site were both bonded to oxygen atoms. Then we could get the ideal bond length dAO and dBO. As for the Mendeleev numbers, Villars et al. found an expression involving the Mendeleev number to predict the formability of compounds for any binary, ternary, or quaternary system44. Many researchers have adopted the Mendeleev number for the prediction of simple perovskite oxides30,33,45.

Based on the primary element properties of ionic radius, stable oxidation state, bond length, and Mendeleev number for both A-site and B-site elements, we could obtain the feature values of the average rA, rB, OA, OB, dAO, dBO, MA, and MB for each FDPOs. After that, we could calculate the values for the features of t, τ, rB/rO(µ), rA/rO, and (OA + OB)/6 from the above calculated average values. By comparing the ionic radii for the A-site cations and B-site cations in each composition, we could get rAmax, rBmax, rAmin, and rBmin. Therefore, we could quickly get rAmax/rO, rBmax/rO, rAmin/rO, rBmin/rO, rAmax/rAmin, rBmax/rBmin. Following the definition equations for σ2(A),σ2(B), we could calculate their values easily from the ionic radii of multiple A-site and B-site cations. All the values of the 21 features for the 632 perovskite and non-perovskite data used for our model training are summarized in the Supplementary Data 1.

We evaluated the performances of the six models using the data (632 compositions with 21 features). Table 1 provides the average accuracy, F1 score, and their corresponding standard deviations for the trained six different models. We can see that the gradient boosting classifier exhibits the best accuracy and the best F1 score among all models, which indicates the gradient boosting classifier is the most suitable model for this classification problem. Consequently, we chose the gradient boosting classifier to study prediction of FDPOs further.

Table 1 A summary of average accuracy, average F1 score, and their corresponding standard deviations of six models.

We have found out that the gradient boosting classifier with 21 features has demonstrated higher accuracy and F1 score during the training process. However, a large number of features usually increases the model complexity46. It is desirable to identify and employ the most relevant features to develop a relatively simple model to predict the FDPOs efficiently and accurately. We employed the Pearson correlation coefficient (R), feature importance, and univariate feature selection to select necessary features. Figure 3 provides the Pearson correlation matrix showing the correlation for each pair of the 21 features used in this study. The features of rA, rA/rO, rAmin/rO, and dAO strongly correlate. The features of rB, rB/rO, rBmin/rO, and dBO strongly correlate. The strong linear correlations for the two groups of the A-site cation and B-site cation-related features make it possible to shrink the necessary features. We may only use two features to represent the eight features for our model development.

Fig. 3: Pearson correlation map for 21 features based on the dataset.
figure 3

The number represents the linear correlation between two variables.

The feature importance given by the model was adopted to guarantee the rationality and correctness of the feature selection overall. Table 2 indicates that the top 10 features for each fivefold cross-validation experiment are not entirely the same. However, the features of dAO, dBO, MA, (OA + OB)/6, τ and σ2(B) showed up in all five experiments, indicating that these six features are the most important ones for achieving good performance. The rAmax/rAmin feature showed up in four experiments, while MB and rBmin/rO features showed up in three experiments. Compared with other features, these 9 features have a more significant influence on the prediction results.

Table 2 Top 10 crucial features given by the gradient boosting classifier with 5-fold cross-validation.

Furthermore, we performed univariate feature selection with F-test to select the top-n most significant features on the whole data set. First, we computed the ANOVA (Analysis of Variance) F-value between features and labels to see any statistically significant relationship between them. Then we selected features according to the n highest F-values. Here we set n from 9 to 4 (top 9 features to top 4 features). Aiming at further down-selecting features, we explored the performance of the gradient boosting classifier by employing different feature combinations among nine features, respectively.

Supplementary Table 4 showed the specific different feature combinations among top nine features of the gradient boosting classifier. Table 3 summarizes the performance of different feature combination choices by evaluating their average accuracy, average F1 score, and corresponding standard deviations. From Table 3, we can see that the model with 5 features (dAO, dBO, MA, (OA + OB)/6 and σ2(B)) has the highest average accuracy and average F1 score. Based on the Pearson correlation map in Fig. 3, the absolute Pearson correlation coefficient of any pairwise features of these 5 features is much smaller than 0.85, indicating no highly linear correlation among these 5 features. And the importance of these 5 features is consistent with the previous theoretical and experimental researches30,45,47,48. Thus, using the gradient boosting classifier with these 5 features is reasonable and trustworthy to predict the unknown compounds’ formability in the target chemical space. It is noteworthy that adopting as many features as possible would not guarantee higher validation accuracy here49 because the presence of uncorrelated or nonsignificant features is likely to make the prediction performance of models degraded. It indicates that selecting high-quality features is crucial to utilize machine learning for predicting new material discovery.

Table 3 A summary of average accuracy, average F1 score, and corresponding standard deviations of the gradient boosting classifier with different feature combinations.

Verification of the established machine learning method

Using the machine learning method established from the gradient boosting classifier, the five features, and 632 perovskite and non-perovskite data, we predicted the perovskite structure formation for 47 compositions. Among them, 36 were reported in the literature, and 11 were newly synthesized in this work (Supplementary Table 5). Among the 36 compositions with known crystal structures from the literature, we obtained 34 accurate predictions. Although the verification data came from the FDPOs for extensive applications (i.e., STCH, SOFC electrolyte, SOFC electrode, and dual-phase perovskite materials), the established method achieved a prediction accuracy as high as 94.4%. Therefore, our method can predict potential FDPOs extensively without considering the chemical space too strictly. Among the 11 new compositions, 9 were predicted to be perovskite, while two were non-perovskite. We synthesized the powder samples using the conventional solid-state reaction method and characterized their crystal structure using XRD for all 11 compositions. Supplementary Figures 1-4 provide the XRD patterns for all 11 samples of Sr-La-Ti-Fe-O, Sr-La-Cr-Mn-O, Ca-La-Cr-Mn-O, and Ba-Ce-Co-Fe-O groups. For Sr-La-Ti-Fe-O group in Supplementary Figure 1, the Miller indices of the main reflection planes were be identified according to the diffraction files of International Centre for Diffraction Data (ICDD) (04-021-6619). And the presence of impurities was observed in Sr0.4La0.6Ti0.8Fe0.2O3 and Sr0.6La0.4Ti0.8Fe0.2O3 samples. For Sr-La-Cr-Mn-O group in Supplementary Figure 2, Sr0.1La0.9Cr0.5Mn0.5O3 were well assigned to ICDD standard data card (04-013-5401). It was seen that the rest diffraction peaks of the Sr0.7La0.3Cr0.5Mn0.5O3 and Sr0.9La0.1Cr0.5Mn0.5O3 samples of the impurities exist. All three compositions of Ca-La-Cr-Mn-O group in Supplementary Figure 3 were assigned to ICDD standard data card (04-007-6214). As for BaCe0.5Co0.5O3 and BaCe0.3Y0.2Fe0.3Co0.2O3, secondary phases clearly existed except the perovskite structure in XRD patterns of Supplementary Figure 4.

The XRD patterns indicate that 4 samples (Sr0.7La0.3Cr0.5Mn0.5O3, Sr0.9La0.1Cr0.5Mn0.5O3, Sr0.4La0.6Ti0.8Fe0.2O3, and Sr0.6La0.4Ti0.8Fe0.2O3) synthesized via the solid-state methods did not form pure perovskite. However, our model predicted that they should be able to form a perovskite structure. Interestingly, the 2 samples of Sr0.7La0.3Cr0.5Mn0.5O3 and Sr0.9La0.1Cr0.5Mn0.5O3 contain many Mn and Cr, which could easily vaporize and make the solid-state reaction synthesis difficult. The Mn and Cr loss at the high-temperature calcination might cause the wrong structure results.

We suspected that Sr0.7La0.3Cr0.5Mn0.5O3 and Sr0.9La0.1Cr0.5Mn0.5O3 synthesized by solid-state reaction might encounter the Cr or Mn problem, causing non-perovskite structure and inconsistency with the predicted result (perovskite structure). After XRD characterization, XRD patterns in Figs. 4a and 5a exhibited the presence of SrCrO4 phase, one common impurity for similar compositions50, in two compositions synthesized by the solid-state method. Besides, the XRF test results of the two samples synthesized by the solid-state reaction method in Supplementary Table 6 and Supplementary Table 7 showed apparent Cr loss and Mn loss (For Sr0.7La0.3Cr0.5Mn0.5O3, the Mn loss is apparent, and for Sr0.9La0.1Cr0.5Mn0.5O3, the Mn loss is negligible). The fitted Cr spectra in Supplementary Figure 5 show that Cr6+ ions resulting from SrCrO4 impurity exist in Sr0.7La0.3Cr0.5Mn0.5O3 and Sr0.9La0.1Cr0.5Mn0.5O3, which matched with XRD results well. XPS peak fitting for Cr 2p peaks was determined based on published literature51,52. Therefore, we further synthesized the compositions of Sr0.7La0.3Cr0.5Mn0.5O3 and Sr0.9La0.1Cr0.5Mn0.5O3 using the modified Pechini method and did XRD characterization again for the as-synthesize powders. We also optimized the calcination conditions to ensure the synthesis quality. Finally, as shown in Figs. 4d and 5d, both Sr0.7La0.3Cr0.5Mn0.5O3 and Sr0.9La0.1Cr0.5Mn0.5O3 formed phase-pure perovskite structures. We can conclude that the prediction can even help correct the wrong synthesis process to achieve the correct crystal structure.

Fig. 4: XRD patterns of Sr0.7La0.3Cr0.5Mn0.5O3 in different experimental conditions.
figure 4

a calcination at 1350 °C for 12 hours via the solid-state method, b calcination at 1100 °C for 16 hours via the modified Pechini method, c calcination at 1350 °C for 12 hours via the modified Pechini method, and d calcination at 1500 °C for 2 hours via the modified Pechini method.

Fig. 5: XRD patterns of Sr0.9La0.1Cr0.5Mn0.5O3 in different experimental conditions.
figure 5

a calcination at 1350 °C for 12 hours via the solid-state method, b calcination at 1350 °C for 12 hours via the modified Pechini method, c calcination at 1500 °C for 2 hours via the modified Pechini method, and d calcination at 1500 °C for 12 hours via the modified Pechini method.

As for Sr0.4La0.6Ti0.8Fe0.2O3 and Sr0.6La0.4Ti0.8Fe0.2O3, it is not easy to synthesize Ti-containing oxides using wet chemistry because of the strong hydrolysis effect of Ti4+ in an aqueous solution. We have done multiple-time grinding and calcination to improve the synthesis quality based on solid-state reaction synthesis. However, we still did not get a phase-pure compound as the model predicted. Therefore, we counted these compositions as wrong predictions. We are developing new wet chemistry to synthesize the Ti-containing FDPOs and hope to clarify the composition in a future publication.

In summary, the machine learning prediction guided us to improve the experimental synthesis results for some synthesis-history-sensitive compounds. The model obtained 9 correct predictions out of 11 new compounds. The prediction accuracy for new compounds is 81.8%. Figure 6 shows the visualization result of the confusion matrix for the best prediction on a total of 47 experimental data. The confusion matrix is a common way to evaluate and visualize the performance of binary classification problems. The confusion matrix is a table with two rows and two columns, where rows report the true classification and columns report the predicted classification. There are four areas in the confusion matrix. All correct predictions where prediction labels are the same with true labels are in the diagonal of the table (a red area and a grey area). The accuracy of this model is 0.915 (43 true predictions of the total 47 data points), which means excellent reliability in predicting the formability of FDPOs.

Fig. 6: Confusion matrix of the gradient boosting classifier with 5 features on 47 experimental data.
figure 6

The rows represent true labels, and the columns represent the predicted labels.

Comparison with traditional approaches

For comparison, we performed predictions for the 47 compositions using conventional tolerance factor (t), octahedral factor (µ), and new tolerance factor (τ). Figure 7 provides the prediction accuracy results for the five prediction methods. The model has a much better prediction accuracy (91.5%) than the prediction accuracies (76.6%, 48.9%, 48.9%, and 76.6%) of tolerance factor (t), octahedral factor (µ), the feature pair (t and µ), and new tolerance factor (τ). Furthermore, both tolerance factor and new tolerance factor predicted that all the 47 compositions could form perovskites. Therefore, the most popular conventional method could not identify the compositions with complicated elemental compositions. The simple octahedral factor shows the worst prediction accuracy (48.9%), consistent with a simple consideration of only one feature. Compared with the prediction results using the single octahedral factor or the tolerance factor, using the feature pair does not guarantee the better prediction accuracy of the formability of FDPOs. All the values of these features for the 47 compositions used for our model training are summarized in the Supplementary Data 2.

Fig. 7: The comparison in accuracy between traditional methods and machine learning prediction with five features.
figure 7

The feature pair refers to using both octahedral factor and tolerance factor for the prediction.

Conclusions

A function-confined facile machine learning method was discovered to predict new fractional and multicomponent perovskite oxides from the basic properties of metal elements and their corresponding simple oxides using typical classification models and a small number of experimental data. As a case study for discovering new perovskite oxides for solar thermochemical hydrogen production by water splitting, we obtained 632 experimental training data (516 are perovskites and 116 are non-perovskites), selected 12 available features, and created 9 new features. Based on high prediction accuracy (0.954), high F1 score (0.921), and the corresponding low standard errors (0.018 and 0.037), we selected the gradient boosting classifier to predict perovskite formation. The further verification of the derived machine learning algorithm using 47 data not included in the training data set (36 are from the literature, and 11 newly synthesized data) showed that the prediction accuracy reached 91.5%, much higher than the prediction performance of the conventional tolerance factor (76.6%), octahedral factor (48.9%), and new tolerance factor (76.6%). Furthermore, the machine learning prediction helped correct two wrong solid-state reaction syntheses. For future research, our model also can be expanded to discover other functional perovskite oxides.

Methods

Machine Learning

Model training

We applied models to classify input compositions into perovskites or non-perovskites using the above-selected 21 features and 632 perovskite and non-perovskite data. A feature vector (21 features) described each sample and labeled it with y (1 or 0, 1: perovskites, 0: non-perovskites). We employed six different known models in our classification work. Three models among them were often used for prediction in the material field: SVM (support vector machine)30,53, random forest30,53, and gradient boosting classifier30,45. The XGBoost54, AdaBoost55, and MLP (multilayer perceptron) classifier56 were widely used models in classification tasks and showed excellent performance. Based on the nonlinear algorithm, SVM constructed a hyperplane with the maximum-margin having the largest separation between the nearest training data set of two classes in high dimensional space, guaranteeing accurate prediction results on unseen data53. The random forest, gradient boosting, XGBoost, and AdaBoost were ensemble learning methods used with the decision-tree algorithm. The different paths from the root to leaf represented classification rules57. The random forest classifier got the final prediction based on the majority voting of classification results from the decision-tree algorithm by learning several classification rules from the data set. Gradient boosting classifier trained the decision tree models sequentially, and new models corrected the previous ones’ errors and improved the performance by training. AdaBoost could learn from the weak classifiers sequentially and give feedback to build a predictive classifier. XGBoost was an open-source library, providing an implementation of the regularizing gradient boosting algorithm. MLP was an artificial neural network and could model nonlinear relationships between targets and features because of the multiple layers and nonlinear activation.

We split the entire data set into a training set and a testing set. Models only trained by the training data are easily overfitting and overoptimistic. The testing set helped us check how models performed on unseen examples. We used 5-fold cross-validation to objectively estimate different models’ performance. We divided the total data set into 5 subsets of the same size and chose one subset as the testing set, making the remaining subsets the training set. We performed 5 iterations for every algorithm. Over 5 iterations, the models were trained on 4 subsets and used to predict the held-out subset.

Model evaluation

Our data classification using models resulted in four possible cases, including true positive (TP), true negative (TN), false positive (FP), and false negative (FN). The commonly used metric to evaluate classification involves the accuracy (\(\frac{{{{{{\rm{TP}}}}}}+{{{{{\rm{TN}}}}}}}{{{{{{\rm{TP}}}}}}+{{{{{\rm{TN}}}}}}+{{{{{\rm{FP}}}}}}+{{{{{\rm{FN}}}}}}}\)) parameter. The accuracy parameter equals the number of correctly predicted data points divided by the total number of all data points in the data set, which is one essential criterion to evaluate the performance of classification prediction. The model with higher accuracy has more accurate predictions on unseen data. Since our data set includes more positive data points than negative data points, our task is an imbalanced classification problem. Therefore, the typical F1 score (\(\frac{2{{{{{\rm{TP}}}}}}}{2{{{{{\rm{TP}}}}}}+{{{{{\rm{FP}}}}}}+{{{{{\rm{FN}}}}}}}\)) was used to assess the imbalanced classification problem for further evaluating our models’ performance. Since we performed the 5-fold cross-validation over five times and got five sets of accuracy and F1 score values, we adapted the average values to demonstrate the models’ performance.

Features selection

We employed the Pearson correlation coefficient (R), feature importance, and univariate feature selection to select necessary features. The R with a value ranging from −1 to +1 measures the linear correlation between two variables. The R with a value of zero indicates that the studied two variables have no linear correlation. The positive and negative values represent positive linear correlation and negative linear correlation, respectively. The high absolute value of R indicates a strong correlation between the two variables. The closer to 1 the absolute value of R is the stronger correlation the variables have. Usually, we can regard the two features are highly correlated when |R | > 0.8546.

The feature importance given by the model was adopted to guarantee the rationality and correctness of the feature selection overall. The value of each feature importance is between 0 and 1. The total sum of all feature importance equals 1.0. The bigger the value of the feature importance is, the more essential the feature is for the prediction.

Model validation and prediction of FDPOs

After finding the best model from the mentioned six models and selecting the most important features out of the 21 features, we validated the performance using 47 unseen experimental data (Supplementary Table 5) from recently reported literature and newly synthesized in this work (not included in our training data set). The validation data consisted of FDPOs used for STCH production, solid oxide fuel cell electrolytes, mixed ionic and electronic conducting electrode materials, dual-perovskite composites, and some non-perovskites. The selection of FDPOs with extensive applications rather than STCH materials only is to find out the effectiveness of the established method for extensive study of the FDPO formation. Supplementary Table 5 indicated that the verification data includes seven materials groups (Sr-La-Mn-Al-O, Sr-La-Ti-Fe-O, Sr-La-Cr-Mn-O, Ca-La-Cr-Mn-O, Ba-Ce-Fe-O, Ba-Ce-Co-Fe-O, and Ba-Ce-Zr-Y-O). 36 compositions are the materials reported in the literature, and the rest 11 compositions were the newly designed and synthesized ones for discovering new potential STCH or SOFC materials. We also compared our prediction performance with those obtained using commonly used descriptors, the tolerance factor, the new tolerance factor, and the octahedral factor.

Experimental section

Synthesis of the predicted FDPOs

We first synthesized all the 11 proposed compositions using the conventional solid-state reaction method. The stoichiometric amounts of raw materials of carbonates and oxides calculated from the proposed FDPO formula were weighed and mixed by ball-milling in isopropanol with 3 mm YSZ ball grinding media for 48 h. The ball-milled slurry was dried at 90 °C for 24 h in the oven to remove the solvent. After calcining the well-mixed powders at 1100–1500 °C for 12–48 h in stagnant air in a box furnace, we got the final powders for structure characterization.

For the compositions synthesized by the solid-state reaction method, which have shown opposite results to the prediction, we adapted a more powerful wet-chemistry synthesis method of the modified Pechini method to synthesize them again to ensure the synthesis accuracy. Using the synthesis of Sr0.7La0.3Cr0.5Mn0.5O3 as an example, we can briefly describe the modified Pechini synthesis process. The stoichiometric amounts of water-soluble materials such as nitrates calculated from the proposed FDPO formula were weighed and dissolved into distilled water. The chelating organic compounds of ethylenediaminetetraacetic acid (Alfa Aesar 99.4%) and citric acid monohydrate (ACROS Organics 99.5%) with a mole ratio of 1.5: 1.5: 1 for EDTA: citric acid: total metal ions were added to the metal salt aqueous solution. Then the pH of the obtained solution was adjusted to be ~9 by adding ammonium hydroxide or nitric acid. After slow water vaporization at ~80–90 °C, the solution turned into a homogeneous viscous gel. Any solid precipitation during the evaporation process should be avoided by controlling the vaporization temperature, initial pH, and relative amount of chelates and metal ions. The bouffant charcoal-like powders obtained after heating the viscous gel in the oven at 150 °C for 48 h were further calcined at 600 °C for 5 hours in stagnant air to burn out organic substances. The final calcination at 900–1500°C for 2–16 h allows us to obtain the final powders.

Characterization

The crystal structure of as-synthesized powders was characterized by powder X-ray diffraction using Cu Kα radiation. The XRD patterns were collected by Rigaku Ultima IV diffractometer at the voltage of 40 kV and current of 15 mA. The XRD patterns were recorded in the 2θ range of 15–85° with a step size of 0.02°. XRD pattern analysis was conducted using HighScore software.

The elemental compositions for some samples were characterized by an X-ray fluorescence (XRF) spectrometer (Thermo ARL Perform’x wavelength-dispersive spectrometer). The powder materials are grinded in a puck mill, and they are all mixed with cellulose binder, then pressed into a disk for analysis. The cellulose was added in a ratio of 1.5 g of Cellulose to 12 g of sample (0.125: 1). The OXSAS software by ThermoFisher commands the instrument to go to the correct detector angle and counts the intensity for each element. Then the UniQuant software was adopted to convert each intensity into a mass percent.

X-ray photoelectron spectroscopy (XPS) spectra were recorded by a PHI VersaProbe III spectrometer using monochromatic Al Kα x-ray source ( = 1486.7 eV) at an accelerating voltage of 15 kV and output power of 25 W. The high-resolution spectra for Cr 2p were collected at 90 degrees with a pass energy of 69 eV, 0.125 eV step size, and a dwell time of 50 ms. To calibrate the binding energy scale, the C 1 s level was assumed to be 284.8 eV. Ar ion etching at 3 kV was used to characterize chemical information of deeper position. Etch steps were 5 min with the etching speed of 136.5623 Å min−1. Peaks were fitted using Shirley baseline with the peak fitting parameters of GL(60) and a FWHM of 1.4 eV for peaks within CasaXPS.