Introduction

The band gap (Eg), which is non-existent for a metal but is positive for semiconductors and other insulators, is a fundamental property of a periodic solid and directly influences its electrical conductivity. Based on their band gap, functional electronic materials are used in a diverse array of applications such as field effect transistors1, LEDs2, photovoltaics3, and scintillators4. By controlling the composition or structure of the material, systematic tuning of the band gap may be achieved, allowing for materials that are tailored to the desired application. Consequently, the band gap is widely used as a screening criterion in data sets generated via high-throughput calculations for application-based discovery5,6,7.

The compositional and structural complexity afforded by perovskites of the form ABX3, in conjunction with their fascinating electrical and magnetic properties such as piezoelectricity8, optical properties9, high-temperature superconductivity10, ferroelectricity11, and magnetostrictive effects12, makes them especially attractive candidates for band gap tuning. The prototypical ABX3 cubic perovskite structure, as indicated in Fig. 1a, is composed of a three-dimensional BX3 network of corner-sharing BX6 octahedra. The A-site cations occupy the 12-coordinate sites formed by the octahedral units, and each A cation is surrounded by 12 equidistant anions. The perovskite structure can accommodate 90% of the metallic ions in the periodic table13 with a wide variety of different anions, increasing its utility in a wide range of applications. As we will show in this work, the band gap in these materials is closely correlated with their physical and chemical properties. Thus, varying the composition by replacing any of the atoms in these structures can be used to precisely tune the physical and/or chemical properties of interest and thereby the band gap. This tunability may be amplified by increasing the types of cations occupying the A- and/or B-sites, giving rise to double perovskites with formulae \(\mathrm{A}_{x}\mathrm{A}^{\prime}_{2-x}\mathrm{B}_{2}X_{6}\), \(\mathrm{A}_{2}\mathrm{B}_{y}\mathrm{B}^{\prime}_{2-y}X_{6}\) and, in the most general case, \(\mathrm{A}_{x}\mathrm{A}^{\prime}_{2-x}\mathrm{B}_{y}\mathrm{B}^{\prime}_{2-y}X_{6}\) for 0 ≤ x, y ≤ 2, as shown in Fig. 1a.

Fig. 1: Hierarchical discovery framework for wide band gap double perovskites.

a Double perovskite crystal structure with rocksalt ordering of both A- and B-site cations. b Histogram of DFT-calculated band gaps in the training dataset of oxide perovskites used in the development of the ML model presented in this work. c Chemical space of the perovskite oxides explored in the present study. Cations appearing at the A-site and/or the B-site are highlighted. d Hierarchical down-selection framework implemented in this work. Starting with more than 5 million potential chemistries, through a series of ML models we identify about 300 double perovskites that are likely to exhibit a wide band gap.

In recent years, to accelerate the discovery of novel electronic materials, tremendous strides have been made by using high-throughput Density Functional Theory (DFT) techniques to determine the electronic structure of materials. However, the accuracy of the computed band gap of a material depends on the particular exchange-correlation functional employed in the DFT calculation. If we consider the band gap to be an excitation energy, it is naive to expect an accurate description using ground state DFT. In fact, when using DFT computations with local and semi-local exchange-correlation functionals, the errors in semiconductor and insulator band gaps can be as large as 50%14, which is especially significant in fields involving semiconductors, optical and photovoltaic materials, and thermoelectrics. This underestimation of the band gap is attributed to the inherent lack of a derivative discontinuity15, as well as to the self-interaction error (SIE)16,17 and delocalization error18, within conventional DFT functionals such as the local density approximation (LDA) or the generalized gradient approximation (GGA)19. Consequently, much attention has focused on solving the problem of the underestimation of the DFT band gap both within the Kohn-Sham DFT formalism and outside it20,21. However, most of these techniques, such as the GW approximation22 and the delta self-consistent-field (ΔSCF) method20, as well as the use of hybrid functionals23, improve the accuracy of band gap estimation at an increased computational cost, which makes them untenable for high-throughput calculation efforts. One feasible way to circumvent this problem is to screen large datasets using low-fidelity GGA-type calculations and down-select to a tractable subset of potential candidates. High-fidelity band gap estimation techniques or experiments can then be used to accurately determine the electronic structure of this smaller subset.

Machine learning (ML) has enjoyed much popularity in the field of materials science and condensed matter physics as an efficient tool to predict a physical property or quantity, particularly when the target property cannot be directly determined without the use of extensive resources, either experimental or computational24,25,26,27,28. In the last two decades, statistical learning frameworks in conjunction with regression techniques have been widely used to predict band gaps of large datasets in order to overcome the band gap estimation problem. Earliest among these was the work of Gu et al.29, who used experimental band gaps of 25 binary compounds and 31 ternary compounds to construct support vector regression (SVR) and artificial neural network (ANN) models, using elemental predictors. Since then, SVR has featured prominently as the regression method of choice in band gap prediction30,31,32,33, in addition to ordinary least squares regression (OLSR)30,34,35,36 and least absolute shrinkage and selection operator (LASSO)30,35,36 methods. Pilania et al.37 presented a general formalism to discover decision rules that can be used to make ultra-fast, yet accurate, property predictions. They used ML to establish a mapping between chemo-structural fingerprints and the electronic charge density distribution of polymeric insulators and their properties, including the band gap. In 2016, they demonstrated a robust learning framework for efficient and accurate predictions of electronic band gaps of double perovskites using a systematic feature-engineering approach38. They also implemented a multi-fidelity Gaussian process (GP)-based co-Kriging regression model to predict Eg for the class of elpasolites39. Recently, Na et al.40 and Li et al.41 reported multi-fidelity band gap prediction models based on graph neural networks (GNNs) that can use band gap data computed and/or measured at varying levels of fidelity to provide best estimates at the highest level of fidelity.
Omprakash et al.42 also used GNNs trained on experimental data gathered from literature to predict band gaps for a variety of different 2D, 3D, organometal and inorganic single perovskites. In a departure from traditional targets of high accuracy, Gladkikh et al.43 used alternating conditional expectations (ACE), an ML technique suitable for small data sets which performs worse than more commonly used ML methods but presents its results in graphical form, which aids interpretation, the lack of which is a core critique of ML applications today44. Kauwe et al.45 used the example property of the band gap to demonstrate an ensemble learning approach which allows the efficient modeling of experimental data by combining models trained on otherwise disparate computational and experimental data. Most of these approaches favor the use of elemental predictors, lending credence to the idea of physics-informed models. Recently, Stanley et al.33 used formation energy as a predictor to develop highly accurate band gap prediction models for ABO3 perovskites. However, for models intended to be applied to large prediction datasets, even properties that are relatively inexpensive to compute can be prohibitive as model predictors.

Thus we see that prior research in the area has focused on exploring the potential of using different ML techniques to predict the band gap in organic as well as inorganic materials. While considerable work has been done on predicting the band gap in perovskites, and oxide perovskites in particular, these studies have focused either on single perovskites32,43,46,47,48 or on only small datasets of double perovskites38,49,50, which limits the generalization of these models beyond certain chemistries and warrants more work to extend the models’ applicability. Further, these works often demonstrate the accuracy of their models but do not take the next step of predicting new compounds that satisfy design criteria.

In this work, we make use of an ML-based screening framework, employing the low-fidelity semi-local exchange-correlation Perdew–Burke–Ernzerhof (PBE) flavor51 of the GGA functional, to down-select a tractable number of promising compounds from large candidate datasets. Specifically, we investigate a previously identified52 exhaustive chemical space of formable and stable cubic single and double oxide perovskites. These predictions of formability and stability have been independently validated by other researchers who have synthesized some of the compositions that were originally predicted in our past work53,54. Starting from these, we first identify novel cubic compositions that are likely to have relatively wide band gaps (Eg ≥ 0.5 eV) and then predict their DFT band gaps with high accuracy using ML models based on Random Forests (RF)55. We use this two-step strategy to separate the materials which have or are predicted to have no band gap or a very small band gap (within the precision limits of PBE-calculated band gaps) from the materials which have a significant band gap. We use a threshold value of 0.5 eV for the sake of illustration. In principle, the framework may be modified to account for any application-specific cutoff or a targeted range. We employ a very large dataset of DFT-based band gaps that are generated in-house for training and implement a hierarchical screening process, wherein we build two ML models independently, the first being a classification model to separate wide band gap materials from narrow band gap materials, and the second a regression model that quantitatively predicts the band gap of designated wide band gap materials. The trained models are then applied sequentially to our large candidate dataset. We intentionally do this to screen out the large number of materials which have a vanishing band gap and thereby avoid biasing the subsequent regression model.
To implement this scheme, we use a threshold band gap of 0.5 eV to demarcate narrow (Eg < 0.5 eV) and wide (Eg ≥ 0.5 eV) band gap materials. Thus, the regression model is applied to only those candidates that are likely to have a band gap of at least 0.5 eV as determined by the classification model. It is to be noted that the terminology of narrow and wide band gap materials used in this manuscript to differentiate between the two material classes is distinct from the similar terminology used with reference to semiconductor materials. A histogram of DFT-calculated band gaps used in the development of the ML model presented in this work and the associated chemical space are shown in Fig. 1b and c, respectively. In Fig. 1d, the number of down-selected candidates is shown at each step, details of which are discussed later in the manuscript. Since we focus on a very large chemical space, the developed models are deemed generalizable to the entire space of perovskites and double perovskite oxides, potentially containing millions of compounds.

Results

Hierarchical Down-selection workflow

We note that when we refer to oxide perovskite structures used for model training as well as the novel perovskite compositions that are predicted to have a wide band gap, we only consider the cubic variants. In our previous work52, we considered the formability and stability of oxide perovskites and, in building our models, used a stability criterion of energy above hull (Ec) ≤50 meV/atom. We showed that cubic thermodynamic stability is a very conservative criterion for perovskite synthesizability, and in many cases, it is also possible to further reduce the total energy (leading to an increased thermodynamic stability) by relaxing to several lower energy phases (e.g., tetragonal, orthorhombic, rhombohedral, etc.) that are commonly found in perovskites. Therefore, if a compound is predicted to be stable in a cubic symmetry, it is only going to be more stable in these other possible reduced symmetry phases. Furthermore, these stabilizing local distortions are also known to slightly widen the band gap in perovskites. Accounting for all the possible reduced symmetry configurations for a given double perovskite composition can lead to several tens of lower symmetry phases. Therefore, restricting ourselves to only cubic symmetry allows us to keep the number of computations in this study limited to a practically feasible level. However, a downside of this assumption is the possible omission of some promising compounds whose PBE band gaps are greater than 0.5 eV in the lower symmetry phases but less than 0.5 eV in the cubic phase.

The novelty of this complete hierarchical framework is multi-fold in that no prior work exists, to the best of our knowledge, in which multiple ML models have been used to connect the stability, formability, insulating nature, and band gap for such a large perovskite oxide chemical space. We consider all possible double oxide perovskite combinations for 68 elements from the periodic table, resulting in a very large, chemically diverse set of candidate materials, which we contend is the largest and most diverse that has been evaluated to date. Owing to this heterogeneous and large dataset, our machine learning models are able to attain a very high prediction accuracy over such a vast chemical space (we note that in the past this level of accuracy has only been demonstrated on chemical spaces which exhibit a rather limited chemical diversity, eventually limiting the exploration potential of the developed surrogate models). The models are trained adaptively in order to achieve highly accurate and efficient predictive performance during the model building stages and are analyzed rigorously via performance metrics and inter-dependence of model features in an effort to gain physical insight into the band gap prediction problem. Our ML models allow for instant band gap predictions in the vast perovskite chemical space and screening for a variety of applications. The exhaustive design space that we explore here lends insight into design rules and dopant selection for band gap and band edge engineering56.

Figure 2a shows an overview of the model building and model application workflows adopted in this work. First, to build the ML models, DFT is used to compute the band gaps of more than 5000 materials to compile a training dataset. These materials are then classified as having either a narrow or wide band gap; in this work, a threshold value of 0.5 eV is used to separate them. Note that this value is chosen for the sake of illustration only and, in principle, any cutoff value depending on a target application can be implemented within the workflow. Simultaneously, the training descriptors are generated for this dataset and the descriptors along with the band gaps are used to build two ML models: i) a wide/narrow band gap classification model (MC) trained on both wide and narrow band gap data and ii) a band gap regression model (MR) trained on only wide band gap data. Then, the models are applied sequentially to the large chemical space of potentially formable and thermodynamically stable perovskite materials, first to identify wide band gap candidates via MC, and then to predict the band gap of those wide band gap candidates using MR.
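The sequential application of MC and MR described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `hierarchical_screen`, the two callables, and the toy candidates are hypothetical stand-ins for the trained classifier and regressor.

```python
from typing import Callable, Dict, List, Tuple

Candidate = Dict[str, float]


def hierarchical_screen(
    candidates: List[Candidate],
    wide_gap_probability: Callable[[Candidate], float],
    predict_band_gap: Callable[[Candidate], float],
    prob_cutoff: float = 0.5,
) -> List[Tuple[Candidate, float, float]]:
    """Apply a classifier, then a regressor, to a list of candidates.

    Step 1 (stand-in for M_C): keep candidates whose predicted probability
    of having a wide band gap is at least `prob_cutoff`.
    Step 2 (stand-in for M_R): predict the band gap only for the survivors.
    """
    selected = []
    for cand in candidates:
        prob = wide_gap_probability(cand)
        if prob >= prob_cutoff:
            selected.append((cand, prob, predict_band_gap(cand)))
    return selected


# Toy stand-ins (hypothetical): each candidate carries precomputed model outputs.
toy_candidates = [
    {"p_wide": 0.91, "eg_pred": 2.4},
    {"p_wide": 0.12, "eg_pred": 0.0},
    {"p_wide": 0.50, "eg_pred": 0.7},
]
screened = hierarchical_screen(
    toy_candidates,
    wide_gap_probability=lambda c: c["p_wide"],
    predict_band_gap=lambda c: c["eg_pred"],
)
for cand, prob, gap in screened:
    print(f"P(wide) = {prob:.2f}, predicted Eg = {gap:.2f} eV")
```

Keeping the regressor behind the classifier means the regressor is never asked to predict for compounds outside the range of band gaps it was trained on.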

Fig. 2: Schematic representation of workflow and associated datasets.

a Computational workflow for model development and predictions. DFT-calculated band gaps serve as the foundation of two ML models, one for classification of insulators and a second regression model that predicts the band gaps of the insulating compounds. The two ML models are then applied sequentially to candidate datasets DFS and DW, respectively. b Venn diagram representation of training dataset and candidate datasets used in this work.

Training data, prediction data and descriptors

To build training and candidate datasets, the 68 elements highlighted in the periodic table in Fig. 1c were considered and all possible \(AB\)O\(_{3}\), \(A_{2}BB^{\prime}\)O\(_{6}\), \(AA^{\prime}B_{2}\)O\(_{6}\) and \(AA^{\prime}BB^{\prime}\)O\(_{6}\) compounds that could be formed by substituting them at the A- and B-sites were enumerated. Considering all possible combinations and accounting for charge neutrality, this resulted in a set of 946,292 unique single and double perovskite compositions (some of which have multiple valence combinations). From these, 5152 oxide perovskite compounds were adaptively selected to form the training dataset DP and the structures were optimized (while constrained to remain cubic) and their band gaps were calculated with DFT. Further technical details of our DFT calculations are provided in the methods section. Initially, all experimentally known oxide perovskites were calculated to form the training dataset. The initial classification model predictions of wide and narrow band gap materials and their corresponding band gaps were then used to adaptively augment this training dataset.

In total, 5152 compounds were evaluated to create a robust training dataset DP for the wide/narrow band gap classification model. Structures with calculated band gaps equal to or greater than 0.5 eV were labeled as wide band gap materials (insulators) while those with band gaps less than 0.5 eV were labeled as narrow band gap materials. Applying this 0.5 eV threshold criterion, of the 5152 calculated perovskites, 1575 (i.e., about 30%) were found to have a wide band gap, while the remaining 3577 were found to have a narrow band gap. These 1575 structures constitute the training dataset DBG for the band gap regression model, and the distribution of their band gaps is shown in Fig. 1b. By construction, DBG is a subset of DP. The various training datasets used and referenced in this work are listed in Table 1 and are also represented in a Venn diagram in Fig. 2b. The complete training dataset is included in Supplementary Data 1.
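The labeling step and the construction of the regression subset can be illustrated with a short sketch. The compound names and band gap values below are hypothetical and are not taken from the dataset; only the 0.5 eV threshold and the reported counts come from the text.

```python
from typing import Dict, Tuple

THRESHOLD_EV = 0.5  # cutoff used in the text; an application-specific value could be substituted


def label_and_split(
    band_gaps: Dict[str, float], threshold: float = THRESHOLD_EV
) -> Tuple[Dict[str, int], Dict[str, float]]:
    """Return (classification labels, regression subset).

    Labels: 1 for a wide band gap (Eg >= threshold), 0 for a narrow one.
    The regression subset keeps only the wide band gap entries.
    """
    labels = {name: int(eg >= threshold) for name, eg in band_gaps.items()}
    wide_only = {name: eg for name, eg in band_gaps.items() if eg >= threshold}
    return labels, wide_only


# Hypothetical compound names and band gaps in eV (illustrative only).
toy_gaps = {"cmpd_1": 0.0, "cmpd_2": 0.49, "cmpd_3": 0.5, "cmpd_4": 3.1}
labels, wide_only = label_and_split(toy_gaps)
print(labels)
print(sorted(wide_only))

# Consistency check on the counts reported in the text.
n_total, n_wide, n_narrow = 5152, 1575, 3577
assert n_wide + n_narrow == n_total
print(f"wide fraction = {n_wide / n_total:.3f}")
```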

Table 1 Definition of training and candidate oxide perovskite datasets used in this work.

Candidate dataset for prediction

Removing the 5152 compounds that comprise the training dataset from the exhaustive dataset of 946,292 unique single and double perovskite compositions yields the foundational chemically compatible candidate dataset DC comprising 941,140 perovskite oxide compounds. The ML models described in ref. 52 that predict formability and thermodynamic stability of oxide perovskites were then applied to DC to identify the subset DFS of 462,248 oxide perovskites (using a cutoff probability of 0.5) that are predicted to be formable and thermodynamically stable. This dataset DFS is then our candidate dataset, to which we sequentially apply the classification model MC and band gap regression model MR built in this work to identify the oxide perovskites that are likely to have a wide band gap and then predict the actual band gaps. It is worth repeating that in order to maintain zero overlap between the training and candidate datasets, all compounds that comprise the training dataset were removed from the DC dataset and hence do not feature in the DFS dataset. Additionally, as mentioned earlier, all experimentally known oxide perovskites are part of the training dataset. Thus the compounds in the candidate dataset DFS are all perovskite oxide compositions that have never been experimentally synthesized, to the best of our knowledge. The nested Venn diagram in Fig. 2b delineates the relationship between the training dataset, the candidate dataset and the sets of down-selected candidates at each step in our hierarchical strategy.

Descriptors for machine-learning

In this work, we use a combination of geometric and atomic descriptors to train the ML models. As far as possible, the intention was to use the simplest possible inputs and minimize the computational overhead necessary to compile inputs while achieving high prediction accuracies by effectively incorporating notions of chemical similarity across different chemistries. To this end, elemental properties corresponding to the A- and B-site cations were used as descriptors for the single perovskites and their symmetric and anti-symmetric compound variants were used for the double perovskites. For a double perovskite \(AA^{\prime}BB^{\prime}\)O\(_{6}\) and a given property P, the symmetric compound feature for the A-site is calculated as \(P_{A}^{+}=(P_{A}+P_{A^{\prime}})/2\) and the anti-symmetric compound feature as \(P_{A}^{-}=|P_{A}-P_{A^{\prime}}|/2\), where \(P_{A}\) and \(P_{A^{\prime}}\) are the elemental properties of A and \(A^{\prime}\); similar descriptors are defined for the B-site. Such compound descriptors have been previously adopted by the authors38,52,57 and others58 and are an effective and well-known technique to account for multiple cations occupying a single site. We intentionally do not include compound properties such as total energy, formation energy or electronic charge densities as descriptors, even though they may be calculated relatively easily, since the goal is to create robust ML models which may be universally applied to large prediction datasets.
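As a minimal sketch of the compound descriptors, the snippet below computes the symmetric and anti-symmetric features for one site, assuming the anti-symmetric feature is the absolute half-difference of the two elemental values. The function name `compound_features` and the numerical values are hypothetical and serve only as an illustration.

```python
from typing import Tuple


def compound_features(p_site: float, p_site_prime: float) -> Tuple[float, float]:
    """Return (symmetric, anti-symmetric) features for one crystallographic site.

    symmetric      = (P + P') / 2
    anti-symmetric = |P - P'| / 2
    """
    symmetric = (p_site + p_site_prime) / 2.0
    anti_symmetric = abs(p_site - p_site_prime) / 2.0
    return symmetric, anti_symmetric


# Hypothetical elemental property values for two A-site cations (illustrative only).
chi_a, chi_a_prime = 0.95, 1.10
chi_plus, chi_minus = compound_features(chi_a, chi_a_prime)
print(f"symmetric = {chi_plus:.3f}, anti-symmetric = {chi_minus:.3f}")

# When A' = A the anti-symmetric feature vanishes and the symmetric one equals P.
print(compound_features(chi_a, chi_a))
```

Because both features are invariant under exchange of the two cations, the same composition yields the same descriptor vector regardless of which cation is labeled as primed.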

All descriptors were normalized to ensure zero mean and a standard deviation of unity. In the preliminary stages of model building, we started out with a very comprehensive list of 68 structural and chemical properties that may be indicators of the insulating nature of the perovskites and the band gap. Using Pearson correlation59 values as a first-step screening process and then using the recursive feature elimination (RFE)60 technique during model development with the open-source Scikit-learn61 Python package, the least important descriptors were pruned and the final set of relevant descriptors was identified. RFE selects features by recursively removing those that are assigned the smallest weights by an extra trees classifier. The estimator is first trained on the initial set of features and the importance of each feature is determined. The least important features are then pruned from the current set of features. This procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
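The normalization and the Pearson correlation computation can be sketched in plain Python as below. The descriptor names and values are hypothetical, and the text does not state which rule (for example, a cutoff on the magnitude of the correlation) was applied during screening, so only the coefficients themselves are computed here.

```python
import math
from typing import List


def standardize(column: List[float]) -> List[float]:
    """Scale a descriptor column to zero mean and unit (population) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    if std == 0.0:
        raise ValueError("a constant column cannot be standardized")
    return [(v - mean) / std for v in column]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient between two equal-length columns."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(zx)


# Hypothetical descriptor columns (illustrative values only).
descriptors = {
    "d1": [1.0, 2.0, 3.0, 4.0],
    "d2": [2.1, 3.9, 6.2, 8.0],  # nearly proportional to d1
    "d3": [0.5, -1.0, 0.7, -0.2],
}
names = list(descriptors)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = pearson(descriptors[a], descriptors[b])
        print(f"r({a}, {b}) = {r:+.3f}")
```

A pair of descriptors with a correlation magnitude close to one, such as `d1` and `d2` here, carries largely redundant information, which is the kind of redundancy a first-step correlation screen is meant to expose.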

The final set of atom-specific descriptors include the Highest Occupied Molecular Orbital (HOMO) and Lowest Unoccupied Molecular Orbital (LUMO) energies62, ionization energy (IE), electro-negativity63, Zunger’s pseudopotential radius (Z radius), and electron affinity (EA). For the Zunger’s pseudopotential radius the sum of the radii for the s and p orbitals was used. As mentioned earlier, symmetric and anti-symmetric compound descriptors for the A- and B-sites were used for these 6 atom-specific descriptors resulting in 24 compound descriptors. We further included perovskite-specific geometric descriptors – the Goldschmidt tolerance factor (t)64, octahedral factor (μ)65, and mismatch factors52,66 for the A-site (\(\bar{{\mu }_{A}}\)) and B-sites (\(\bar{{\mu }_{B}}\)) which are defined as:

$$\mu =\frac{{r}_{B}}{{r}_{O}},$$
(1)
$$t=\frac{{r}_{A}+{r}_{O}}{\sqrt{2}({r}_{B}+{r}_{O})},$$
(2)
$$\bar{{\mu }_{A}}=| {r}_{A}-{r}_{{A}^{{\prime} }}| /2{r}_{O},$$
(3)
$$\bar{{\mu }_{B}}=| {r}_{B}-{r}_{{B}^{{\prime} }}| /2{r}_{O},$$
(4)

where \({r}_{A},{r}_{{A}^{{\prime} }},{r}_{B},{r}_{{B}^{{\prime} }}\) and \({r}_{O}\) are the coordination-dependent Shannon ionic radii67 of the A-site cations, the octahedrally coordinated B-site cations and the oxygen anion, respectively. Including these four geometric descriptors, we were left with a total of 28 descriptors. Using RFE, for the insulator classification model, it was found that all 28 descriptors are important, while for the band gap regression model, 21 atom-specific descriptors and all the geometric descriptors were relevant. The specific descriptors identified for both models are indicated in Table 2. Values of these 28 descriptors for the \({D}_{P}^{{\prime} }\) database are provided in Supplementary Data 1. The dataset also includes the DFT-calculated band gap of the compounds. The HOMO and LUMO Kohn-Sham levels of isolated atoms were computed using DFT without spin polarization, in a large orthorhombic supercell (15 Å × 14.5 Å × 14 Å, to break the cubic symmetry) with respect to the vacuum level38,68.
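Equations (1) to (4) translate directly into code, as in the following sketch. The radii are approximate Shannon values quoted for illustration only, the function names are hypothetical, and the use of site-averaged radii in the tolerance factor of a double perovskite is an assumption made here, since the text does not state how the two cations on a site enter Eq. (2).

```python
import math


def octahedral_factor(r_b: float, r_o: float) -> float:
    """Eq. (1): mu = r_B / r_O."""
    return r_b / r_o


def tolerance_factor(r_a: float, r_b: float, r_o: float) -> float:
    """Eq. (2): t = (r_A + r_O) / (sqrt(2) * (r_B + r_O))."""
    return (r_a + r_o) / (math.sqrt(2.0) * (r_b + r_o))


def mismatch_factor(r_site: float, r_site_prime: float, r_o: float) -> float:
    """Eqs. (3) and (4): |r - r'| / (2 * r_O) for the A- or B-site."""
    return abs(r_site - r_site_prime) / (2.0 * r_o)


# Approximate Shannon radii in angstroms (illustrative values):
# Sr2+ and Ba2+ in 12-fold coordination, Ti4+ and O2- in 6-fold coordination.
r_sr, r_ba, r_ti, r_o = 1.44, 1.61, 0.605, 1.40

# Single perovskite SrTiO3.
print(f"mu = {octahedral_factor(r_ti, r_o):.3f}")
print(f"t  = {tolerance_factor(r_sr, r_ti, r_o):.3f}")

# Double perovskite with Sr and Ba sharing the A-site.
# ASSUMPTION: the site-averaged radius is used in the tolerance factor.
r_a_mean = (r_sr + r_ba) / 2.0
print(f"t (site-averaged) = {tolerance_factor(r_a_mean, r_ti, r_o):.3f}")
print(f"A-site mismatch   = {mismatch_factor(r_sr, r_ba, r_o):.3f}")
```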

Table 2 List of descriptors used in the ML models.

The two ML models MC and MR were built using the descriptors listed in Table 2 and the respective training databases \({D}_{P}^{{\prime} }\) and DBG. To gauge the robustness of the models and to identify the minimum significant value for the feature importance, both models were also tested using random descriptors. The models were then tested on the test datasets created using the 90/10 shuffle split as described previously. Subsequent to the training and testing phases, the models were applied to the DFS dataset of predicted stable and formable perovskites. In the following sub-sections, we will discuss the models and their performance in detail. Technical details regarding the ML models used may be found in the methods section.

Wide/narrow band gap classification

The accuracy, precision, and recall metrics of MC were evaluated using five-fold cross validation for different combinations of training/test splits; the values are shown in Table 3. For the final model, we used a 90/10 training/test split and achieved a model accuracy of 0.94 and 0.95 and model precision of 0.95 and 0.93 on the training and test sets, respectively.

Table 3 Performance metrics for MC averaged over 100 different runs for varying percentages of training/test data splits.

In Fig. 3a, we present the 24 most important descriptors for the perovskite insulator classification problem. It is seen that the symmetric HOMO energy, electronegativity, and ionization energy for the B-site are the most important descriptors, followed by the symmetric electronegativity and LUMO energy for the A-site, in differentiating between compounds that are likely to have a wide or narrow band gap. Figure 3b shows the average test confusion matrix for 100 runs of the model. The off-diagonals (in red) indicate the False Positives (FP) and False Negatives (FN) while the diagonal elements (in green) indicate the True Positives (TP) and True Negatives (TN).

Fig. 3: Results for perovskite insulator character using MC.

a Feature importance plot for all the descriptors with significant values, b Confusion matrix, c Receiver operating characteristic (ROC) curves, and d Precision-recall curves of the cross-validated RF classification on test data using MC.

In Fig. 3c, we plot the receiver operating characteristic (ROC)69 curves of the cross-validated classification model MC on the test data. The closer the curve is to the ideal point (top left corner of the plot), the greater the area under the curve (AUC), and the better the performance of the classifier. The plot emphasizes the value of the curve for a threshold value of 0.5, i.e., compounds having a probability of ≥0.5 of being an insulator are classified as wide band gap materials while those with a probability <0.5 are classified as narrow band gap materials. For this classifier, we achieve a very good AUC value of 0.98. Figure 3d showcases the precision-recall (PR) curves for the classifier, which plot the precision on the ordinate against the recall on the abscissa for varying threshold values. The point (1,1) on this plot corresponds to a perfect predictive model, and the closer the curve is to this point, the more robust the model. As is evident, the PR curve for a threshold value of 0.5 is very close to the ideal (1,1) point, yielding an AUC of 0.96 and indicating a very skillful model. The f1-score is the harmonic mean of the precision and recall for a particular probability threshold value. On the PR curve, we indicate the iso-curves for various f1-score values.
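The metrics discussed above follow directly from the four entries of the confusion matrix, as the sketch below shows. The counts are hypothetical and do not correspond to the confusion matrix in Fig. 3b; the function name `classification_metrics` is likewise illustrative.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and f1-score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        # Harmonic mean of precision and recall.
        "f1": 2.0 * precision * recall / (precision + recall),
    }


# Hypothetical counts for a wide (positive) vs. narrow (negative) classifier.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name:9s} = {value:.3f}")
```

Each point on a PR curve corresponds to one such set of counts, obtained by changing the probability threshold at which a compound is assigned to the wide band gap class.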

Band gap regression

As mentioned in prior sections, the band gap prediction model was continually improved by adaptively augmenting the size of the training dataset until the MAE for the model on the test data was below 0.2 eV. The regression coefficient (R2) and MAE values for different combinations of training/test data splits evaluated using five-fold cross validation are shown in Table 4. The band gap prediction model results are shown in Fig. 4. To visualize the accuracy of our band gap predictions, the machine-learning predictions of the band gap as a function of the calculated band gap for the training and test data are shown in Fig. 4a and b, respectively. Regression coefficient (R2) values of 0.97 and 0.86 and MAE values of 0.07 eV and 0.18 eV were achieved for the training and test data, respectively, averaged over 100 different runs, each with randomly selected 90% training and 10% test sets. These values compare very favorably with and are at least as good as previously reported band gap prediction models38,39,40 which are trained on much smaller datasets.
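For reference, the two regression metrics and a random 90/10 split can be written out explicitly, as in the sketch below. It assumes that R2 denotes the coefficient of determination; the function names, the seed, and the toy values are hypothetical and are not results from this work.

```python
import random
from typing import List, Sequence, Tuple


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """MAE = mean of |y_true - y_pred|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def shuffle_split(
    n_samples: int, test_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train indices, test indices) for one random split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


# Toy band gaps in eV (illustrative only).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(f"MAE = {mean_absolute_error(y_true, y_pred):.3f} eV")
print(f"R2  = {r_squared(y_true, y_pred):.3f}")

train_idx, test_idx = shuffle_split(20, test_fraction=0.1, seed=1)
print(len(train_idx), len(test_idx))
```

Averaging the metrics over many such splits, each drawn with a different seed, gives the kind of run-averaged values reported in the text.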

Table 4 Performance metrics (regression coefficient R2 and MAE in eV) for MR, averaged over 100 different runs for different combinations of training and test set sizes.
Fig. 4: Results for prediction of band gaps of oxide perovskites using regression model MR.

a Machine-learning predictions for training data and (b) machine-learning predictions for test data. Learning curves representing the variation in (c) regression coefficient (R2) and (d) MAE in eV with increase in size of training dataset. Each step represents the average R2 and MAE of 100 iterations of shuffle split cross-validation with 10% of the data for validation. The shaded regions indicate the averaged standard deviation. e Feature importance plot for all the descriptors with significant values.

Figure 4c, d shows the learning curves for the regression model for the R2 and MAE performance metrics. Learning curves are a widely used diagnostic tool in ML algorithms to indicate the incremental improvement in learning with respect to an increase in the size of the training dataset. The model is evaluated for a chosen evaluation metric on the training dataset and on a hold-out validation dataset after each update during training; the measured performance, when plotted, constitutes the learning curves. Here, we plot the learning curves for our chosen metrics of R2 (Fig. 4c) and MAE (Fig. 4d). In both these plots, we see that the test curve follows the trajectory of the training curve with a moderate gap between them, indicating that our model exhibits low bias and moderate variance, as is desired.

Figure 4e indicates the regression feature importances. Here, we see that the geometric factors (\(t,\bar{{\mu }_{B}}\)) feature very low on the importance list while descriptors such as the electronegativities, HOMO and LUMO energies, Zunger’s pseudopotential radius, and octahedral factor (μ) prominently affect the band gap. Overall, we see that the features corresponding to the B-site appear more prominently. This is expected since, in general, we observe (as indicated in the SI) that for a given oxide perovskite altering the A-site cation results in a smaller band gap variation as compared to an alteration in the B-site cation. This observation may be further attributed to the ionic nature of the perovskites, wherein the valence band maximum (VBM) states are predominantly oxygen p states while the conduction band minimum (CBM) states are predominantly B element d states. Thus, if the anion chemistry is kept constant, one would expect the nature of the B cations to have the greatest effect on the band gap.

To gain additional insight into the interplay of the model, the model’s features, and their effect on the band gap, we also calculate the two dimensional (binary) partial dependence for the most important features identified. Partial dependence plots (PDP)70 which are calculated after the model has been fit to the data, show the marginal effect a descriptor has on the predicted outcome of the ML model. PDPs are useful in determining whether the relationship between the target property and the descriptors is linear, monotonic, or more complex. This is done by varying the value of the descriptor of interest and using the model to predict the target value for multiple instances; the average target value is then plotted as a function of the descriptor. Thus, the partial dependence method considers all instances and gives a statement about the global relationship of a feature with the predicted outcome. While the single feature PDPs capture the average trend of a given feature with the target property, the binary PDPs visualize the partial dependence of two descriptors simultaneously. Note that partial dependence works by marginalizing the machine learning model output over the distribution of the remaining features, so that the function shows the relationship between the features we are interested in and the predicted outcome. Marginalization is a method that requires summing over the possible values of one variable to determine the marginal contribution of another. By marginalizing over the other features, we get a function that depends only on the features of interest.
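The marginalization described above can be sketched in a few lines: the descriptors of interest are fixed to each grid value for every instance, and the model predictions are averaged over the remaining features. The function name partial_dependence and the analytic toy model are assumptions made for illustration; any fitted model with a predict-like callable could be substituted.

```python
from itertools import product

import numpy as np


def partial_dependence(predict, X, features, grids):
    """Average prediction with the chosen columns of X fixed to grid values.

    predict  : callable mapping an (n, d) array to n predictions
    features : column indices of the descriptors of interest (one or two)
    grids    : one 1-D array of values per selected descriptor
    """
    shape = tuple(len(g) for g in grids)
    result = np.empty(shape)
    for index in product(*(range(s) for s in shape)):
        X_mod = X.copy()
        for feature, grid, k in zip(features, grids, index):
            X_mod[:, feature] = grid[k]          # fix the descriptor of interest
        result[index] = predict(X_mod).mean()    # marginalize over the others
    return result


# Toy stand-in for a trained model (assumption): linear in x0, quadratic in x1.
def toy_model(X):
    return 2.0 * X[:, 0] + X[:, 1] ** 2 + X[:, 2]


rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 3))
grid = np.linspace(-1.0, 1.0, 5)

pd_single = partial_dependence(toy_model, X, [0], [grid])           # unary PDP
pd_pair = partial_dependence(toy_model, X, [0, 1], [grid, grid])    # binary PDP
print(pd_single)
print(pd_pair.shape)
```

For this toy model the unary PDP recovers the linear dependence on the first descriptor, and the binary PDP recovers the additive linear-plus-quadratic surface, each shifted by the mean of the feature that was averaged out.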

Figure 5a shows the binary PDPs for the top 5 descriptors, in which the yellow regions indicate the largest values of predicted band gaps. Figure 5b shows the expanded PDP for the top two features, i.e., XB+ and XA+. This plot indicates that the maximum values of band gaps may be found for approximate value ranges of 0 < XB+ < 1.6 and 0 < XA+ < 1.0. Such feature value ranges may be obtained for various feature pairs, or even for single features if one refers to unary PDPs. Figure 5c shows the corresponding scatter plot for the training data, i.e., the marginalized DFT calculated band gaps for varying values of XB+ on the ordinate and XA+ on the abscissa. On comparison of Fig. 5b, c, we see that the training data do not include a significant number of data points in the corresponding high band gap region predicted by the PDP. Thus, while PDPs may be used to delineate ranges of feature values to obtain specific ranges of band gaps, it is imperative to keep in mind the limitation that the predictions are values averaged over the remaining features and that the training data rarely cover the entire targeted feature ranges in a uniform manner.

Fig. 5: Partial dependence plots (PDPs) for model features.

a PDPs for the five top ranked features of the band gap prediction regression model. The units of the features used here are consistent with those reported in Table 2. b Expanded PDP for the top two features, i.e., XB+ and XA+, and c the marginalized DFT calculated band gaps for varying values of XB+ on the ordinate and XA+ on the abscissa.

The statistical nature of predictions obtained using ML is concomitant with uncertainties in the predictions themselves. Hence, while it is important to ensure high prediction accuracy when building ML models, it is equally if not more important to be able to explain a prediction and to state the confidence that we have in it. Also, predictions are interpolative for data points that lie within the range of previously seen data, and extrapolative for data points that fall beyond that range. The uncertainty arising from this distance from the domain of the training data needs to be quantified; hence, confidence intervals for the test data were calculated using our band gap regression model and are shown in Fig. 6a. Here, we see that for a large number of data points, the error bars cross the prediction-equals-calculation diagonal, indicating the lack of residual noise in the predicted band gaps and that the features that constitute the regression model adequately describe the oxide perovskite band gaps. The variation in the lengths of the error bars indicates that our model is more confident about some predictions than others. Reassuringly, we observe that, in general, the model is more confident about those predictions for which the predicted and calculated band gaps are similar.

Fig. 6: Band gap prediction results.

a Confidence intervals for test data for the band gap prediction regression. b Histogram of the predicted probabilities of candidate compounds that are likely to be wide band gap materials, binned by 0.01. The inset indicates the zoomed-in histogram for the 310 oxide perovskites with a predicted wide band gap probability between 0.9 and 1. c Histogram of predicted band gaps for the 310 oxide perovskites, binned in 0.125 eV intervals. d Parity plot of calculated vs. predicted band gaps for 310 down-selected candidates.

Prediction of new oxide perovskites with wide band gaps

Subsequent to the training and testing phases, the wide/narrow band gap classification and band gap regression models were applied to the dataset DFS of predicted formable and stable oxide perovskites to predict new wide band gap materials and their band gaps, respectively. This prediction dataset DFS consists of 462,248 distinct compounds. Using a 50% probability cutoff, this was reduced to 13,589 (<3%) compounds that were predicted to possess a wide band gap; this set of compounds is designated DW. The distribution of the wide band gap prediction probabilities for these 13,589 compounds is shown in Fig. 6b. As discussed earlier, the candidate compounds do not contain any experimentally known perovskites, and hence these 13,589 oxide perovskite compositions are novel compounds that, to the best of our knowledge, have never been synthesized. To down-select a tractable number of wide band gap candidates for further exploration from these 13,589 compounds, we retained those compounds (Dw) for which the prediction probabilities of experimental formability, thermodynamic stability, and wide band gap nature were all greater than 90%, which amounted to 310 compounds. The inset in Fig. 6b reflects the wide band gap prediction probabilities for these 310 compounds. For these 310 compounds, we then predicted the band gaps using MR, the distribution of which is shown in Fig. 6c. The 310 compounds are listed in Table 5, and complete descriptions of these compounds along with their predicted band gaps are included in Supplementary Data 2.
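The two-stage down-selection described above amounts to a pair of probability filters, which can be sketched as follows. The Candidate record, the screen function, and the placeholder labels and probabilities are all hypothetical and serve only to show the logic; they are not outputs of the models in this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    label: str          # placeholder identifier, not a real composition
    p_formable: float   # predicted probability of experimental formability
    p_stable: float     # predicted probability of thermodynamic stability
    p_wide_gap: float   # predicted probability of a wide band gap


def screen(candidates, wide_cutoff=0.5, strict_cutoff=0.9):
    """Return (wide-gap set, shortlist) using the two probability cutoffs."""
    # Stage 1: keep compounds classified as wide band gap at the 50% cutoff.
    wide = [c for c in candidates if c.p_wide_gap > wide_cutoff]
    # Stage 2: require all three probabilities to exceed the stricter cutoff.
    shortlist = [
        c for c in wide
        if min(c.p_formable, c.p_stable, c.p_wide_gap) > strict_cutoff
    ]
    return wide, shortlist


# Made-up probabilities for five placeholder candidates (assumption).
pool = [
    Candidate("cand-1", 0.95, 0.97, 0.99),
    Candidate("cand-2", 0.95, 0.80, 0.99),
    Candidate("cand-3", 0.99, 0.99, 0.40),
    Candidate("cand-4", 0.92, 0.91, 0.93),
    Candidate("cand-5", 0.99, 0.99, 0.60),
]

wide, shortlist = screen(pool)
print([c.label for c in wide])
print([c.label for c in shortlist])
```

Only the shortlisted compounds would then be passed to the regression model for a quantitative band gap prediction.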

Table 5 List of 310 predicted wide band gap oxide perovskite compounds satisfying cutoff probabilities >0.9 for stability, formability and wide/narrow band gap prediction models.

The DFT band gaps of these 310 compounds were calculated in an effort to computationally confirm our predictions, as shown in Fig. 6d. Calculations showed that all of these 310 compounds are indeed wide band gap materials as defined in this work, i.e., all their band gaps were greater than 0.5 eV, thus confirming that our insulator classification model is indeed very accurate. Further, the calculated band gaps agree very well with the predicted band gaps, with a maximum error of 0.48 eV, an MAE of 0.21 eV, a mean square error (MSE) of 0.07 and an R2 value of 0.84. The absolute differences between predicted and calculated band gaps for these compounds are also given in eV in parentheses in Table 5. In the calculations shown in Fig. 6d, we examine 31 distinct AA’ pairs. For a given AA’ pair, the band gap varies widely with the BB’ pair. For example, for A = A’ = Ba, we examine 32 distinct BB’ pairs and find that the calculated band gap varies from 1.57 to 4.1 eV. In Supplementary Note 1 and Supplementary Note 2 we include detailed heat maps for different combinations of AA’ and BB’.
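For reference, the four agreement metrics quoted above can be computed from paired calculated and predicted values as in the following sketch. The function name parity_metrics and the four made-up band gap pairs are assumptions for illustration; the values are not data from this study.

```python
import numpy as np


def parity_metrics(calculated, predicted):
    """Maximum absolute error, MAE, MSE and R2 for paired band gaps."""
    calculated = np.asarray(calculated, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residual = predicted - calculated
    ss_res = np.sum(residual ** 2)                             # residual sum of squares
    ss_tot = np.sum((calculated - calculated.mean()) ** 2)     # total sum of squares
    return {
        "max_error": float(np.max(np.abs(residual))),
        "mae": float(np.mean(np.abs(residual))),
        "mse": float(np.mean(residual ** 2)),
        "r2": float(1.0 - ss_res / ss_tot),
    }


# Made-up calculated and predicted gaps in eV (assumption).
calc = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.0, 2.5, 4.0]

metrics = parity_metrics(calc, pred)
print(metrics)
```

Note that the MSE carries units of eV squared, whereas the maximum error and MAE are in eV and R2 is dimensionless.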

Discussion

As mentioned in prior sections, with a 50% probability cutoff, 13,589 compounds were predicted to be wide band gap materials, i.e., materials with a DFT band gap greater than 0.5 eV. These 13,589 compounds are double perovskites with \({A}_{2}B{B}^{{\prime} }\)O6, \(A{A}^{{\prime} }{B}_{2}\)O6 and \(A{A}^{{\prime} }B{B}^{{\prime} }\)O6 compositions. Thus, our predictions enable us to compile design maps that give insight into the band gap variation and associated trends of a basic ABO3 structure when another element (\({A}^{{\prime} }/{B}^{{\prime} }\)) replaces half of the A-site, B-site, or A- and B-site cations. Figure 7 shows one such design map for Ba2\(B{B}^{{\prime} }\)O6 for selected formable and stable combinations of B and \({B}^{{\prime} }\). The white squares indicate combinations that are predicted to be narrow band gap materials (Eg < 0.5 eV) and hence do not have an accompanying band gap prediction. Elements are arranged on the x- and y-axes in order of increasing atomic number (left to right and bottom to top, respectively). Here, we see that Ta, Sb, In, and La on the B-site result in larger band gaps across the board, with the specific combination of Ta and In resulting in the largest band gap of 3.9 eV. On the other hand, the presence of Bi on the B-site always lowers the band gap. Supplementary Note 3 includes Design maps of predicted band gaps with respect to average values of the five top ranked features individually and pairwise are included in Supplementary Note 4 and Supplementary Note 5 respectively. Average values of variation in band gap for base single perovskite oxides with addition of element in A- and B-sublattices are shown in Supplementary Note 5.

Fig. 7: Band gap prediction map for selected combinations of Ba2BB’O6.

White squares indicate compositions that are predicted to be narrow band gap materials and hence do not have an accompanying band gap prediction.

Thus, a hierarchical screening strategy (summarized in Fig. 1d) is presented to identify 310 novel formable, stable double oxide perovskites exhibiting wide band gaps via an ML-guided exploration of a very large fraction of the double perovskite chemical space. We employ a large training dataset spanning a vast chemical space to train the ML models and achieve high predictive accuracies. Confidence intervals were derived for the quantitative band gap prediction model, and our predictions are validated with DFT computations. We find excellent agreement between our predictions and these validating calculations.

The calculated band gap data that constitute the training data, as well as the predicted band gaps, may be used to generate design maps for desired combinations of elements on the A- and B-sites. Such maps offer insight into the variation of the band gap due to doping and can be used to identify preliminary candidates for specific applications. The 310 identified candidates are novel chemistries that have not been explored experimentally to date. These candidates are predicted to have band gaps ranging from 0.5 to 4 eV and, consequently, can potentially find application in a wide range of areas, from infrared radiation detection, solar cells, and light-emitting diodes (LEDs) to scintillator materials.

The efficiency of our multi-step hierarchical screening approach, which may be generalized to investigate other classes of materials in addition to the oxide perovskites examined here, provides further impetus to the application of physics-based ML models to the discovery of novel functional materials. This approach - of creating models and then using them to identify specific novel candidates that will be of value to the community at large and that are suggested for follow-on experimental studies - is not always followed in the prior literature. We note that this hierarchical strategy, where we first classify materials with perovskite structure as either insulators or metals and then train regression models on the insulators only with such high accuracies, is unique in the literature and provides a route for improved models for these types of properties.

Lastly, we note that the absolute DFT gaps are underestimated due to our choice of the low-fidelity but less expensive PBE functional. However, the relative changes in band gap as a function of chemistry are well estimated by PBE-GGA and correlate with the experimental band gaps, since the underestimation is systematic and generally proportional to the band gap itself. Chemical trends, which in themselves are very useful, are therefore expected to be well captured using our approach. A high-throughput study such as the present one necessitates a low-cost technique, and recent studies56,71 have indicated that ground state properties calculated using the PBE-GGA functional are sufficiently accurate, particularly for changes in electronic structure with chemistry. However, to ensure that this is true for the current application, 100 materials were randomly selected from our training dataset of wide band gap materials and their HSE band gaps were calculated; the results are included in Supplementary Note 6. It is seen that PBE-GGA indeed systematically underestimates the band gap for most materials.

Methods

DFT calculation details

The Vienna ab initio simulation package (VASP)72,73 implementation of the DFT framework was used in this work to calculate the band gaps of the training data. The parameterization proposed by Perdew, Burke, and Ernzerhof51 of the GGA74 approach was used. A Monkhorst-Pack mesh was used to perform the Brillouin zone integrations with at least 5000 k-points per reciprocal atom. The structures were fully relaxed using the Methfessel–Paxton smearing method75 of order one and a final self-consistent static calculation was carried out. The calculations were spin polarized and we used a cutoff energy of 533 eV for all of the structures. All relaxations were carried out until changes in the total energy between relaxation steps were within \(1\times {10}^{-6}\) eV and atomic forces on each of the atoms were smaller than 0.01 eV/Å.

ML models using random forests

The choice to use RFs to build our ML models was made to leverage the inherent robustness, low bias, and moderate variance afforded by the technique. The RF (also known as a random decision forest) is a bagging-based ensemble learning method. It has been shown that using ensembles of trees, where each tree in the ensemble is grown in accordance with the realization of a random vector, results in consequential gains in classification or prediction accuracy. Final predictions are obtained by aggregate voting over the ensemble, using equal weights in most cases. RFs seek to induce randomness by using subsets of descriptors drawn at random to determine the optimal split of a given node of a tree, thereby reducing the correlation between the quantities being averaged and consequently enhancing the variance gains. We used the Scikit-learn package to implement our RF models.

For the RF classification model MC, the training and test dataset selections were stratified over the wide/narrow band gap chemistries using a 90/10 training/test split. Thus, we used 90% of the dataset \({D}_{P}^{{\prime} }\) to train the classification model MC and then tested on the remaining 10%. The maximum tree depth was set at 25 and the number of estimators or trees was 200 for the classification model. For the RF regression model MR, we also used 90% of the dataset DBG to train the regression model MR and then tested on the remaining 10%. The maximum tree depth was set to 50 and the number of estimators was chosen to be 200. To maximize accuracy while minimizing the standard deviation on unseen data, we used 5-fold cross-validation using a 90% training subset to determine the hyper-parameters for both classification and regression.
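A minimal sketch of this setup in Scikit-learn is given below, using the tree depths and numbers of estimators stated above together with a stratified 90/10 split for the classifier. The datasets are synthetic stand-ins generated with make_classification and make_regression, and the random seeds are arbitrary; these are assumptions for illustration, and the cross-validated hyper-parameter search is not shown.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Classification: synthetic wide (1) / narrow (0) band gap labels (assumption).
Xc, yc = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    weights=[0.7, 0.3], random_state=0,
)
# 90/10 split, stratified over the two classes.
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    Xc, yc, test_size=0.1, stratify=yc, random_state=0
)
clf = RandomForestClassifier(n_estimators=200, max_depth=25, random_state=0)
clf.fit(Xc_train, yc_train)
p_wide = clf.predict_proba(Xc_test)[:, 1]   # probability of the "wide" class

# Regression: synthetic continuous target standing in for band gaps (assumption).
Xr, yr = make_regression(n_samples=500, n_features=10, noise=1.0, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    Xr, yr, test_size=0.1, random_state=0
)
reg = RandomForestRegressor(n_estimators=200, max_depth=50, random_state=0)
reg.fit(Xr_train, yr_train)

print("classification accuracy:", clf.score(Xc_test, yc_test))
print("regression R2:", reg.score(Xr_test, yr_test))
```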

Uncertainty quantification

RF models are difficult to interpret owing to their black box nature, and it is difficult to quantify the associated modeling and input uncertainties. Two methods are most widely used for the quantification of confidence intervals in RF-based regression models: i) U-Statistic-based RFs76 and ii) bootstrap77, jackknife-after-bootstrap78, and infinitesimal jackknife79 based methods, which we use in this work.

U-statistics80 is a class of statistics in which a minimum-variance unbiased prediction is derived by drawing a predetermined number of times through all combinatorial selections of the training data set and then averaging over the possible results of these sub-samples. Mentch and Hooker76 showed that under a strict sub-sampling scheme, predictions for individual feature vectors are asymptotically normal, allowing for application of U-statistics to RF predictions. Relevant statistical measures can then be used to quantify the uncertainty related to the reducible error of the RF prediction and construct confidence intervals.

Bootstrap sampling and jackknifing rely on estimating the variance of a prediction by using the variability between re-samples rather than using statistical distributions. In bootstrapping, numerous prediction models are developed by re-sampling the training data with replacement, so that each re-sample omits a different subset of the training data, and the mean and variance of the predictions are estimated from these bootstrapped models. This method attempts to quantify the sensitivity of the model with respect to slight perturbations in the training data. Jackknifing is another re-sampling technique, predating bootstrapping, in which each training data point is systematically left out, the model is trained on the remaining data, and an estimate is calculated; the jackknife estimate is found by evaluating the average of these calculations. These two ideas may be combined in a jackknife-after-bootstrap method, which is used to find an error estimate (for example, the variance) of a bootstrap estimate. As opposed to jackknifing and jackknifing-after-bootstrapping, where the behavior of a prediction is studied after excluding one or more observations at a time, the infinitesimal jackknife (IJ) looks at the effect on a prediction after down-weighting each observation by an infinitesimal amount. In 2014, Wager et al.81 demonstrated that both the jackknife-after-bootstrap and the infinitesimal jackknife methods suffer from considerable Monte Carlo bias, and they proposed a bias-corrected version of the method, the implementation of which is used to calculate the confidence intervals for the test data using our band gap regression model.
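To make the infinitesimal jackknife idea concrete, the sketch below estimates the variance of a bagged prediction from the covariance, across bootstrap replicates, between how often each training point is drawn and the resulting prediction, and then subtracts a Monte Carlo bias term of the form proposed by Wager et al. The function name, the use of a bagged sample mean as the "model", and the synthetic data are assumptions for illustration; this is not the implementation used in this work.

```python
import numpy as np


def infinitesimal_jackknife_variance(counts, preds, bias_correct=True):
    """IJ variance estimate for a bagged prediction at one query point.

    counts : (B, n) array, times each training point appears in each re-sample
    preds  : (B,) array, prediction of each bootstrapped model
    """
    B, n = counts.shape
    centered = preds - preds.mean()
    # Covariance over replicates between inclusion counts and predictions.
    cov = (counts - counts.mean(axis=0)).T @ centered / B
    variance = np.sum(cov ** 2)
    if bias_correct:
        # Subtract the Monte Carlo bias that arises from using a finite B.
        variance -= n / B ** 2 * np.sum(centered ** 2)
    return variance


rng = np.random.default_rng(0)
n, B = 40, 5000
x = rng.normal(loc=2.0, scale=1.0, size=n)      # synthetic training targets

# Bootstrap re-samples: indices drawn with replacement.
idx = rng.integers(0, n, size=(B, n))
counts = np.stack([np.bincount(row, minlength=n) for row in idx])

# Toy "model": each bootstrapped learner predicts the mean of its re-sample.
preds = x[idx].mean(axis=1)

v_raw = infinitesimal_jackknife_variance(counts, preds, bias_correct=False)
v_corrected = infinitesimal_jackknife_variance(counts, preds)
reference = x.var(ddof=1) / n                   # textbook variance of a mean

print(v_raw, v_corrected, reference)
```

For this toy learner the corrected estimate should be close to the textbook variance of a sample mean, which gives a simple sanity check; a confidence interval then follows from the square root of the variance under a normal approximation.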