Band gap predictions of double perovskite oxides using machine learning

Talapatra, Anjana; Uberuaga, Blas Pedro; Stanek, Christopher Richard; Pilania, Ghanshyam

doi:10.1038/s43246-023-00373-4

Article
Open access
Published: 10 June 2023

Communications Materials volume 4, Article number: 46 (2023)

Abstract

The compositional and structural variety inherent to oxide perovskites spawns wide-ranging applications. In perovskites, the band gap E_g, a key material parameter for these applications, can be optimally controlled by varying the composition. Here, we implement a hierarchical screening process in which two cross-validated and predictive machine learning models for band gap classification and regression, trained using exhaustive datasets that span 68 elements of the periodic table, are applied sequentially. The classification model separates wide band gap materials, with E_g ≥ 0.5 eV, from materials which have zero or relatively small band gaps, namely E_g < 0.5 eV, and the second regression model quantitatively predicts the gap value of the wide band gap compounds. The study down-selects 13,589 cubic oxide perovskite compositions that are predicted to be experimentally formable, thermodynamically stable, and have a wide band gap. Of these, a subset of 310 compounds, which are predicted to be stable and formable with a confidence greater than 90%, is identified for further investigation. Our models are methodically analyzed via performance metrics and inter-dependence of model features to gain physical insight into the band gap prediction problem. Design maps to identify the variation of band gap with substitution of different elements are also presented.

Introduction

The band gap (E_g), which is non-existent for a metal but is positive for semiconductors and other insulators, is a fundamental property of a periodic solid and directly influences its electrical conductivity. Based on their band gap, functional electronic materials are used in a diverse array of applications such as field effect transistors¹, LEDs², photovoltaics³, and scintillators⁴. By controlling the composition or structure of the material, systematic tuning of the band gap may be achieved, allowing for materials that are tailored to the desired application. Consequently, the band gap is widely used as a screening criterion in data sets generated via high-throughput calculations for application-based discovery^5,6,7.

The compositional and structural complexity afforded by perovskites of the form ABX₃ in conjunction with their fascinating electrical and magnetic properties such as piezoelectricity⁸, optical properties⁹, high-temperature superconductivity¹⁰, ferro-electricity¹¹, and magnetostrictive effects¹² make them especially attractive candidates for band gap tuning. The prototypical ABX₃ cubic perovskite structure as indicated in Fig. 1a is composed of a three-dimensional BX₃ network of corner-sharing BX₆ octahedra. The A-site cations occupy the 12 coordinate sites formed by the octahedral units and each A cation is surrounded by 12 equidistant anions. The perovskite structure can accommodate 90% of the metallic ions in the periodic table¹³ with a wide variety of different anions, increasing their utility in a wide range of applications. As we will show in this work, the band gap in these materials is closely correlated with their physical and chemical properties. Thus, varying the composition by replacing any of the atoms in these structures can be used to precisely tune the physical and/or chemical properties of interest and thereby the band gap. This tunability may be amplified by increasing the types of cations occupying the A- and/or B-sites, giving rise to double perovskites with formulae $\mathrm{A}_{x}\mathrm{A}^{\prime}_{2-x}\mathrm{B}_{2}X_{6}$, $\mathrm{A}_{2}\mathrm{B}_{y}\mathrm{B}^{\prime}_{2-y}X_{6}$ and, in the most general case, $\mathrm{A}_{x}\mathrm{A}^{\prime}_{2-x}\mathrm{B}_{y}\mathrm{B}^{\prime}_{2-y}X_{6}$ for 0 ≤ x, y ≤ 2, as shown in Fig. 1a.

**Fig. 1: Hierarchical discovery framework for wide band gap double perovskites.**

In recent years, to accelerate the discovery of novel electronic materials, tremendous strides have been made by using high-throughput Density Functional Theory (DFT) techniques to determine the electronic structure of materials. However, the accuracy of the computed band gap of a material depends on the particular exchange-correlation functional employed in the DFT technique that is used. If we consider the band gap to be an excitation energy, it is naive to expect an accurate description using ground state DFT. In fact, when using DFT computations with local and semi-local exchange-correlation functionals, the errors in semiconductor and insulator band gaps can be as large as 50%¹⁴, which is especially significant in research in fields involving semiconductors, optical and photovoltaic materials, and thermoelectrics. This underestimation of the band gap is attributed to the inherent lack of derivative discontinuity¹⁵, self-interaction error (SIE)^16,17, and delocalization error¹⁸ within conventional DFT functionals such as the local density approximation (LDA) or the generalized gradient approximation (GGA)¹⁹. Consequently, much attention has focused on solving the problem of the underestimation of the DFT band gap both within the Kohn-Sham DFT formalism and outside it^20,21. However, most of these techniques, such as the GW approximation²² and delta self-consistent-field (ΔSCF) method²⁰, as well as the use of hybrid functionals²³, improve the accuracy of band gap estimation at increased computational cost, which makes them untenable for high-throughput calculation efforts. One feasible way to circumvent this problem is to screen large datasets using low fidelity GGA-type calculations and down-select to a tractable subset of potential candidates. High-fidelity band-gap estimation techniques or experiments can then be used to accurately determine the electronic structure of this smaller subset.

Machine learning (ML) has enjoyed much popularity in the field of materials science and condensed matter physics as an efficient tool to predict a physical property or quantity, particularly when the target property cannot be directly determined without the use of extensive resources, either experimental or computational^{24,25,26,27,28}. In the last two decades, statistical learning frameworks in conjunction with regression techniques have been widely used to predict band gaps of large datasets in order to overcome the band gap estimation problem. Earliest among these was the work of Gu et al.²⁹ who used experimental band gaps of 25 binary compounds and 31 ternary compounds to construct support vector regression (SVR) and artificial neural network (ANN) models, using elemental predictors. Since then, SVR has featured prominently as the regression method of choice in band gap prediction^30,31,32,33 in addition to ordinary least squares regression (OLSR)^30,34,35,36 and least absolute shrinkage and selection operator (LASSO)^30,35,36 methods. Pilania et al.³⁷ presented a general formalism to discover decision rules that can be used to make ultra-fast, yet accurate, property predictions. They used ML to establish a mapping between chemo-structural fingerprints and the electronic charge density distribution of polymeric insulators and their properties, including the band gap. In 2016, they demonstrated a robust learning framework for efficient and accurate predictions of electronic band gaps of double perovskites using a systematic feature-engineering approach³⁸. They also implemented a multi-fidelity Gaussian process (GP)-based co-Kriging regression model to predict E_g for the class of elpasolites³⁹. 
Recently, Na et al.⁴⁰ and Li et al.⁴¹ reported multi-fidelity band gap prediction models based on graph neural networks (GNNs) that can use band gap data computed and/or measured at varying levels of fidelity to provide best estimates at the highest level of fidelity. Omprakash et al.⁴² also used GNNs trained on experimental data gathered from literature to predict band gaps for a variety of different 2D, 3D, organometal and inorganic single perovskites. In a departure from traditional targets of high accuracy, Gladkikh et al.⁴³ used alternating conditional expectations (ACE), an ML technique suitable for small data sets which performs worse than more commonly used ML methods but presents its results in a graphic form, helping in interpretation, the lack of which is a core critique of ML applications today⁴⁴. Kauwe et al.⁴⁵ used the example property of the band gap to demonstrate an ensemble learning approach which allows the efficient modeling of experimental data by combining models trained on otherwise disparate computational and experimental data. Most of these approaches favor the use of elemental predictors, lending credence to the idea of physics-informed models. Recently, Stanley et al.³³ used formation energy as a predictor to develop highly accurate band gap prediction models for ABO₃ perovskites. However, when a model is intended for large prediction datasets, using even relatively inexpensive computed properties as model predictors can still be prohibitive.

Thus we see that prior research in the area has focused on exploring the potential of using different ML techniques to predict the band gap in organic as well as inorganic materials. While considerable work has been done on predicting the band gap in perovskites, and oxide perovskites in particular, these studies have either focused on single perovskites^{32,43,46,47,48} or only small datasets of double perovskites^38,49,50, which limits the generalization of these models beyond certain chemistries and warrants more work to extend the models’ applicability. Further, these works often demonstrate the accuracy of their models but do not take the next step of predicting new compounds that satisfy design criteria.

In this work, we make use of an ML-based screening framework, employing the low-fidelity semi-local exchange-correlation Perdew–Burke–Ernzerhof (PBE) flavor⁵¹ of the GGA functional, to down-select a tractable number of promising compounds from large candidate datasets. Specifically, we investigate a previously identified⁵² exhaustive chemical space of formable and stable cubic single and double oxide perovskites. These predictions of formability and stability have been independently validated by other researchers who have synthesized some of the compositions that were originally predicted in our past work^53,54. Starting from these, we first identify novel cubic compositions that are likely to have relatively wide band gaps (E_g ≥ 0.5 eV) and then predict their DFT band gaps with high accuracy using ML models based on Random Forests (RF)⁵⁵. We use this two-step strategy of separating the materials which have or are predicted to have no band gap or a very small band gap (within the precision limits of PBE calculated band gaps) from the materials which have a significant band gap. We use a threshold value of 0.5 eV for the sake of illustration. In principle, the framework may be modified to account for any application-specific cutoff or a targeted range. We employ a very large dataset of DFT-based band gaps that are generated in-house for training and implement a hierarchical screening process, wherein we build two ML models independently, the first being a classification model to separate wide band gap materials from narrow band gap materials, and, the second, a regression model that quantitatively predicts the band gap of designated wide band gap materials. The trained models are then applied sequentially to our large candidate dataset. We intentionally do this to screen out a large number of materials which have a vanishing band gap to avoid biasing the subsequent regression model.
To implement this scheme, we use a threshold band gap of 0.5 eV to demarcate narrow (E_g < 0.5 eV) and wide (E_g ≥ 0.5 eV) band gap materials. Thus, the regression model is applied to only those candidates that are likely to have a band gap greater than 0.5 eV as determined by the classification model. It is to be noted that the terminology of narrow and wide band gap materials used in this manuscript to differentiate between the two material classes is distinct from the similar terminology used with reference to semiconductor materials. A histogram of DFT-calculated band gaps used in the development of the ML model presented in this work and the associated chemical space are shown in Fig. 1b and c, respectively. In Fig. 1d, the number of down-selected candidates is shown at each step, details of which are discussed further in the manuscript. Since we focus on a very large chemical space, the developed models are deemed generalizable to the entire space of perovskites and double perovskite oxides, potentially containing millions of compounds.

Results

Hierarchical Down-selection workflow

We note that when we refer to oxide perovskite structures used for model training as well as the novel perovskite compositions that are predicted to have a wide band gap, we only consider the cubic variants. In our previous work⁵², we consider the formability and stability of oxide perovskites, and in building our models, use a stability criterion of energy above hull (Ec) ≤50 meV/atom. We showed that cubic thermodynamic stability is a very conservative criterion for perovskite synthesizability, and in many cases, it is also possible to further reduce the total energy (leading to an increased thermodynamic stability) by relaxing to several lower energy phases (e.g., tetragonal, orthorhombic, rhombohedral etc.) that are commonly found in perovskites. Therefore, if a compound is predicted to be stable in a cubic symmetry, it is only going to be more stable in these other possible reduced symmetry phases. Furthermore, these stabilizing local distortions are also known to slightly widen the bandgap in perovskites. Accounting for all the possible reduced symmetry configurations for a given double perovskite composition can lead to several tens of lower symmetry phases. Therefore, this assumption of restricting ourselves to only cubic symmetry allows us to keep the number of computations in this study limited to a practically-feasible level. However, a downside of this assumption is possible omission of some promising compounds whose PBE bandgaps are greater than 0.5 eV in the lower symmetry phases but less than 0.5 eV in the cubic phase.

The novelty of this complete hierarchical framework is multi-fold in that no prior work exists, to the best of our knowledge, in which multiple ML models have been used to connect the stability, formability, insulating nature, and band gap for such a large perovskite oxide chemical space. We consider all possible double oxide perovskite combinations for 68 elements from the periodic table, resulting in a very large chemically diverse set of candidate materials, which we contend is the largest and most diverse that has been evaluated to date. Owing to this heterogeneous and large dataset, our machine learning models are able to attain a very high prediction accuracy over such a vast chemical space (we note that in the past this level of accuracy has only been demonstrated on chemical spaces which exhibit a rather limited chemical diversity, eventually limiting the exploration potential of the developed surrogate models). The models are trained adaptively in order to achieve highly accurate and efficient predictive performance during the model building stages and are analyzed rigorously via performance metrics and inter-dependence of model features in an effort to gain physical insight into the band gap prediction problem. Our ML models allow for instant band gap predictions in the vast perovskite chemical space and screening for a variety of applications. The exhaustive design space that we explore here lends insight into design rules and dopant selection for band gap and band edge engineering⁵⁶.

Figure 2a shows an overview of the model building and model application workflows adopted in this work. First, to build the ML models, DFT is used to compute the band gaps of more than 5000 materials to compile a training dataset. These materials are then classified as having either a narrow or wide band gap; in this work, a threshold value of 0.5 eV is used to separate them. Note that this value is chosen for the sake of illustration only and, in principle, any cutoff value depending on a target application can be implemented within the workflow. Simultaneously, the training descriptors are generated for this dataset and the descriptors along with the band gaps are used to build two ML models: i) a wide/narrow band gap classification model (M_C) trained on both wide and narrow band gap data and ii) a band gap regression model (M_R) trained on only wide band gap data. Then, the models are applied to the large chemical space of potentially formable and thermodynamically stable perovskite materials sequentially to first identify wide band gap candidates via M_C, and then predict the band gap of those wide band gap candidates using M_R.
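The sequential application of M_C and M_R can be sketched in a few lines of Python. This is only an illustration of the control flow: `toy_classifier`, `toy_regressor`, the candidate names, and the single-number feature vectors are hypothetical stand-ins, not the trained Random Forest models or descriptors of this work.

```python
from typing import Callable, Dict, List, Sequence

THRESHOLD_EV = 0.5  # wide/narrow cutoff quoted in the text


def label_wide(band_gaps_ev: Sequence[float], threshold: float = THRESHOLD_EV) -> List[int]:
    """Label DFT band gaps for classifier training: 1 = wide (E_g >= threshold), 0 = narrow."""
    return [1 if eg >= threshold else 0 for eg in band_gaps_ev]


def hierarchical_screen(
    candidates: Dict[str, List[float]],
    wide_probability: Callable[[List[float]], float],
    predict_gap: Callable[[List[float]], float],
    p_cut: float = 0.5,
) -> Dict[str, float]:
    """Apply the classifier first, then the regressor only to predicted wide-gap candidates."""
    predicted = {}
    for name, features in candidates.items():
        if wide_probability(features) >= p_cut:      # step 1: classification (M_C)
            predicted[name] = predict_gap(features)  # step 2: regression (M_R)
    return predicted


# Hypothetical stand-ins for trained models, for illustration only.
def toy_classifier(features: List[float]) -> float:
    return 0.9 if features[0] > 1.0 else 0.1


def toy_regressor(features: List[float]) -> float:
    return 2.0 * features[0]


toy_candidates = {"X1": [1.5], "X2": [0.2], "X3": [1.1]}
screened = hierarchical_screen(toy_candidates, toy_classifier, toy_regressor)
print(screened)
```

Keeping the regressor behind the classifier means narrow-gap candidates such as the toy entry `X2` never receive a regression prediction, which mirrors the intent of training M_R on wide band gap data only.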

**Fig. 2: Schematic representation of workflow and associated datasets.**

Training data, prediction data and descriptors

To build training and candidate datasets, the 68 elements highlighted in the periodic table in Fig. 1c were considered and all possible ABO₃, A₂BB′O₆, AA′B₂O₆ and AA′BB′O₆ compounds that could be formed by substituting them at the A- and B-sites were enumerated. Considering all possible combinations and accounting for charge neutrality, this resulted in a set of 946,292 unique single and double perovskite compositions (some of which have multiple valence combinations). From these, 5152 oxide perovskite compounds were adaptively selected to form the training dataset D_P and the structures were optimized (while constrained to remain cubic) and their band gaps were calculated with DFT. Further technical details of our DFT calculations are provided in the methods section. Initially, all experimentally known oxide perovskites were calculated to form the training dataset. The initial classification model predictions of wide and narrow band gap materials and their corresponding band gaps were then used to adaptively augment this training dataset.
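A minimal sketch of charge-neutral enumeration is given below, assuming that neutrality for AA′BB′O₆ means the four cation charges sum to +12 (six O²⁻ anions). The element lists and oxidation states are a small illustrative subset chosen for this example, not the 68-element table used in the study, and the function name is hypothetical.

```python
from itertools import combinations_with_replacement, product

# Illustrative oxidation states only (assumption); one state per element for simplicity.
A_STATES = {"K": [1], "Sr": [2], "La": [3]}
B_STATES = {"Fe": [3], "Ti": [4], "Nb": [5]}


def neutral_double_perovskites(a_states, b_states):
    """Yield (A, A', B, B', charges) for AA'BB'O6 whose cation charges sum to +12.

    A == A' together with B == B' corresponds to a single perovskite ABO3.
    This sketch does not deduplicate permuted valences of identical elements.
    """
    a_pairs = list(combinations_with_replacement(sorted(a_states), 2))
    b_pairs = list(combinations_with_replacement(sorted(b_states), 2))
    for (a1, a2), (b1, b2) in product(a_pairs, b_pairs):
        for charges in product(a_states[a1], a_states[a2], b_states[b1], b_states[b2]):
            if sum(charges) == 12:  # balances six O(2-) anions
                yield (a1, a2, b1, b2, charges)


compositions = list(neutral_double_perovskites(A_STATES, B_STATES))
for entry in compositions:
    print(entry)
```

With multiple oxidation states per element the same loop also produces the "multiple valence combinations" mentioned in the text, since each neutral charge assignment is yielded separately.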

In total, 5152 compounds were evaluated to create a robust training dataset D_P for the wide/narrow band gap classification model. Structures with calculated band gaps equal to or greater than 0.5 eV were labeled as wide band gap materials (insulators) while those with band gaps less than 0.5 eV were labeled as narrow band gap materials. Applying this 0.5 eV threshold criterion, of the 5152 calculated perovskites, 1575 (i.e. about 30%) were found to have a wide band gap, while the remaining 3577 were found to have a narrow band gap. These 1575 structures constitute the training dataset D_BG for the band gap prediction regression model and the distribution of their band gaps is shown in Fig. 1b. As is self-evident, D_BG is a subset of D_P. The various training datasets used and referenced in this work are listed in Table 1 and are also represented in a Venn diagram in Fig. 2b. The complete training dataset is included in Supplementary Data 1.

Table 1 Definition of training and candidate oxide perovskite datasets used in this work.

Candidate dataset for prediction

Removing the 5152 training compounds from the exhaustive dataset of 946,292 unique single and double perovskite compositions yields the foundational chemically compatible candidate dataset D_C comprising 941,140 perovskite oxide compounds. ML models described previously⁵² that predict formability and thermodynamic stability of oxide perovskites were then applied to D_C, to identify the subset D_FS of 462,248 oxide perovskites (using a cutoff probability of 0.5) that are predicted to be formable and thermodynamically stable. This dataset D_FS is then our candidate dataset to which we apply the classification model M_C and band gap regression model M_R built in this work sequentially to identify the oxide perovskites that are likely to have a wide band gap and then predict the actual band gaps. It is worth repeating that in order to maintain zero overlap between the training and candidate datasets, all compounds that comprise the training dataset were removed from the D_C dataset and hence do not feature in the D_FS database. Additionally, as mentioned earlier, all experimentally known oxide perovskites are part of the training dataset. Thus the compounds in the candidate dataset D_FS are all perovskite oxide compositions that have never been experimentally synthesized to the best of our knowledge. The nested Venn diagram in Fig. 2b delineates the relationship between the training dataset, the candidate dataset and the sets of down-selected candidates at each step in our hierarchical strategy.

Descriptors for machine-learning

In this work, we use a combination of geometric and atomic descriptors to train the ML models. As far as possible, the intention was to use the simplest possible inputs and minimize computational overhead necessary to compile inputs while achieving high prediction accuracies by effectively incorporating notions of chemical similarity across different chemistries. To this end, elemental properties corresponding to the A- and B-site cations were used as descriptors for the single perovskites and their symmetric and anti-symmetric compound variants were used for the double perovskites. For a double perovskite AA′BB′O₆, for a given property P, the symmetric compound feature is calculated as $P^{A+} = (P_{A} + P_{A^{\prime}})/2$ and the anti-symmetric compound feature as $P^{A-} = |P_{A} - P_{A^{\prime}}|/2$ for the A-site, where $P_{A}$ and $P_{A^{\prime}}$ are the elemental properties of A and A′; similar descriptors are defined for the B-site. Such compound descriptors have been previously adopted by the authors^38,52,57 and others⁵⁸ and are an effective and well-known technique to account for the multiple cations in a single site scenario. We intentionally do not include compound properties such as total energy, formation energy or electronic charge densities as descriptors even though they may be calculated relatively easily since the goal is to create robust ML models which may be universally applied to large prediction datasets.
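As a small worked example, the symmetric and anti-symmetric compound features can be computed as follows. The property names (`EN`, `IE`) and numerical values are hypothetical placeholders chosen to make the arithmetic easy to follow, and the helper names are not from the original work.

```python
from typing import Dict, Tuple


def compound_descriptors(p: float, p_prime: float) -> Tuple[float, float]:
    """Return (symmetric, anti-symmetric) features for two cations sharing one site."""
    symmetric = (p + p_prime) / 2.0
    antisymmetric = abs(p - p_prime) / 2.0
    return symmetric, antisymmetric


def site_features(site: str, props: Dict[str, float], props_prime: Dict[str, float]) -> Dict[str, float]:
    """Build P^{site+} and P^{site-} for every elemental property of one site."""
    features = {}
    for name in props:
        sym, anti = compound_descriptors(props[name], props_prime[name])
        features[f"{name}_{site}+"] = sym
        features[f"{name}_{site}-"] = anti
    return features


# Hypothetical elemental property values for A and A' (illustration only).
a_site = {"EN": 1.0, "IE": 6.0}
a_prime_site = {"EN": 1.5, "IE": 5.0}
features = site_features("A", a_site, a_prime_site)
print(features)
```

When A and A′ are the same element the anti-symmetric feature is zero and the symmetric feature reduces to the elemental property itself, so single perovskites fit the same descriptor layout.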

All descriptors were normalized to ensure zero mean and standard deviation of unity. In the preliminary stages of model building, we started out with a very comprehensive list of 68 structural and chemical properties that may be indicators of the insulator nature of the perovskites and the band gap. Using Pearson correlation⁵⁹ values as a first step screening process and then using the recursive feature elimination (RFE)⁶⁰ technique during model development using the open-source Scikit-learn⁶¹ Python package, the least important descriptors were pruned and the final set of relevant descriptors was identified. RFE selects features by recursively removing those with the smallest weights, as assigned by an extra trees classifier. The estimator is first trained on the initial set of features and the importance of each feature is determined. The most important features are then retained from the current set of features. This procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
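The normalization and RFE steps could look roughly like the sketch below, which uses scikit-learn's `StandardScaler`, `RFE`, and `ExtraTreesClassifier`. The data are synthetic random numbers, and the number of trees, the number of features to keep, and the random seeds are assumptions made for this illustration rather than settings reported in the work.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler

# Synthetic descriptors and labels (not the paper's data): only features 0 and 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=3.0, size=(200, 8))
y = ((X[:, 0] - 2.0) + 0.5 * (X[:, 3] - 2.0) > 0).astype(int)

# Normalize every descriptor to zero mean and unit standard deviation.
X_scaled = StandardScaler().fit_transform(X)

# Recursively drop the least important feature until four remain (assumed target).
selector = RFE(
    estimator=ExtraTreesClassifier(n_estimators=100, random_state=0),
    n_features_to_select=4,
    step=1,
)
selector.fit(X_scaled, y)

kept = np.flatnonzero(selector.support_)
print("retained feature indices:", kept)
print("elimination ranking (1 = retained):", selector.ranking_)
```

With `step=1` the estimator is refit after each single-feature removal, which is slower than removing several features per round but matches the one-at-a-time pruning described above.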

The final set of atom-specific descriptors includes the Highest Occupied Molecular Orbital (HOMO) and Lowest Unoccupied Molecular Orbital (LUMO) energies⁶², ionization energy (IE), electro-negativity⁶³, Zunger’s pseudopotential radius (Z radius), and electron affinity (EA). For the Zunger’s pseudopotential radius the sum of the radii for the s and p orbitals was used. As mentioned earlier, symmetric and anti-symmetric compound descriptors for the A- and B-sites were used for these 6 atom-specific descriptors resulting in 24 compound descriptors. We further included perovskite-specific geometric descriptors: the Goldschmidt tolerance factor (t)⁶⁴, octahedral factor (μ)⁶⁵, and mismatch factors^52,66 for the A-site ($\bar{\mu_{A}}$) and B-site ($\bar{\mu_{B}}$), which are defined as:

$$\mu = \frac{r_{B}}{r_{O}}, \tag{1}$$

$$t = \frac{r_{A} + r_{O}}{\sqrt{2}\,(r_{B} + r_{O})}, \tag{2}$$

$$\bar{\mu_{A}} = \frac{|r_{A} - r_{A^{\prime}}|}{2 r_{O}}, \tag{3}$$

$$\bar{\mu_{B}} = \frac{|r_{B} - r_{B^{\prime}}|}{2 r_{O}}, \tag{4}$$

where $r_{A}$, $r_{A^{\prime}}$, $r_{B}$, $r_{B^{\prime}}$ and r_O are the coordination dependent Shannon’s ionic radii⁶⁷ of the A-site cations, the octahedrally coordinated B cations and the oxygen anion, respectively. Including these four geometric descriptors, we were left with a total of 28 descriptors. Using RFE, for the insulator classification model, it was found that all 28 descriptors are important, while for the band gap regression model, 21 atom-specific descriptors and all the geometric descriptors were relevant. The specific descriptors identified for both models are indicated in Table 2. Values of these 28 descriptors for the $D_{P}^{\prime}$ database are provided in Supplementary Data 1. The dataset also includes the DFT-calculated band gap of the compounds. The HOMO and LUMO Kohn-Sham levels of isolated atoms were computed using DFT without spin polarization, in a large orthorhombic supercell (15 Å × 14.5 Å × 14 Å, to break the cubic symmetry) with respect to the vacuum level^38,68.
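The four geometric descriptors follow directly from Eqs. (1) to (4). The sketch below evaluates them for SrTiO₃ as a familiar example; the radii used (Sr²⁺ 1.44 Å, Ti⁴⁺ 0.605 Å, O²⁻ 1.40 Å) are illustrative values close to commonly tabulated Shannon radii and should be treated as assumptions, as should the second pair of radii used for the mismatch example.

```python
from math import sqrt


def octahedral_factor(r_b: float, r_o: float) -> float:
    """Eq. (1): mu = r_B / r_O."""
    return r_b / r_o


def tolerance_factor(r_a: float, r_b: float, r_o: float) -> float:
    """Eq. (2): Goldschmidt tolerance factor t."""
    return (r_a + r_o) / (sqrt(2.0) * (r_b + r_o))


def mismatch_factor(r_1: float, r_2: float, r_o: float) -> float:
    """Eqs. (3) and (4): |r - r'| / (2 r_O) for two cations on the same site."""
    return abs(r_1 - r_2) / (2.0 * r_o)


# Illustrative ionic radii in angstrom (assumed values, see lead-in).
R_SR, R_TI, R_O = 1.44, 0.605, 1.40

mu = octahedral_factor(R_TI, R_O)
t = tolerance_factor(R_SR, R_TI, R_O)
mu_a_single = mismatch_factor(R_SR, R_SR, R_O)  # single perovskite: identical A cations
mu_a_mixed = mismatch_factor(1.44, 1.34, R_O)   # hypothetical pair of A-site radii

print(f"mu = {mu:.3f}, t = {t:.3f}")
print(f"A-site mismatch: single = {mu_a_single:.3f}, mixed = {mu_a_mixed:.3f}")
```

A tolerance factor close to 1 indicates that the A and B cations fit the ideal cubic geometry, which is why t is a standard perovskite-specific descriptor.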

Table 2 List of descriptors used in the ML models.

The two ML models M_C and M_R were built using the descriptors listed in Table 2 and the respective training databases $D_{P}^{\prime}$ and D_BG. To gauge the robustness of the models and to identify the minimum significant value for the feature importance, both models were also tested using random descriptors. The models were then tested on the test datasets created using the 90/10 shuffle split as described previously. Subsequent to the training and testing phases, the models were applied to the D_FS dataset of predicted stable and formable perovskites. In the following sub-sections, we will discuss the models and their performance in detail. Technical details regarding the ML models used may be found in the methods section.

Wide/narrow band gap classification

The accuracy, precision, and recall metrics of M_C were evaluated using five-fold cross validation for different combinations of training/test splits; the values are shown in Table 3. For the final model, we used a 90/10 training/test split and achieved a model accuracy of 0.94 and 0.95 and model precision of 0.95 and 0.93 on the training and test sets, respectively.

Table 3 Performance metrics for M_C averaged over 100 different runs for varying percentages of training/test data splits.

In Fig. 3a, we present the 24 most important descriptors for the perovskite insulator classification problem. It is seen that the symmetric HOMO energy, electronegativity, and ionization energy for the B-site are the most important descriptors, followed by the symmetric electronegativity and LUMO energy for the A-site, in differentiating between compounds that are likely to have a wide or narrow band gap. Figure 3b shows the average test confusion matrix for 100 runs of the model. The off-diagonals (in red) indicate the False Positives (FP) and False Negatives (FN) while the diagonal elements (in green) indicate the True Positives (TP) and True Negatives (TN).
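For reference, accuracy, precision, and recall follow from the four confusion-matrix entries as shown in this short sketch. The label vectors are made up for illustration (1 = wide band gap, 0 = narrow band gap) and do not correspond to any result in this work.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # fraction of predicted wide-gap labels that are correct
    recall = tp / (tp + fn)     # fraction of true wide-gap compounds that are found
    return accuracy, precision, recall


# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

counts = confusion_counts(y_true, y_pred)
accuracy, precision, recall = classification_metrics(*counts)
print(counts, accuracy, precision, recall)
```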

**Fig. 3: Results for perovskite insulator character using M_C.**

In Fig. 3c, we plot the receiver operating characteristic (ROC)⁶⁹ curves of the cross-validated classification model M_C on the test data. The closer the curve is to the ideal point (top left corner on the plot), the greater the area under the curve (AUC), and the better the performance of the classifier. The plot emphasizes the value of the curve for a threshold value of 0.5, i.e., compounds having a probability of ≥0.5 of being an insulator are classified as wide band gap materials while those with a probability <0.5 are classified as narrow band gap materials. For this classifier, we achieve a very good AUC value of 0.98. Figure 3d showcases the precision-recall (PR) curves for the classifier, which is a plot of the precision rate on the ordinate and the recall rate on the abscissa for varying threshold values. The point (1,1) on this plot implies a perfect predictive model, and the closer the curve is to this point, the more robust the model is. As is evident, the PR curve for a threshold value of 0.5 is very close to the ideal (1,1) point, yielding an AUC of 0.96 and indicating a very skillful model. The f1-score is the harmonic mean of the precision and recall rates for a particular probability threshold value. On the PR curve, we indicate the iso-curves for various f1-score values.
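The quantities discussed here can be reproduced on a toy example. The sketch below computes the ROC AUC through its pairwise-ranking interpretation (the probability that a randomly chosen positive is scored above a randomly chosen negative, which equals the area under the ROC curve) and the f1-score at a probability threshold of 0.5. The labels and scores are invented for illustration.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall_at(y_true, scores, threshold=0.5):
    """Precision and recall when scores >= threshold are labeled as wide band gap."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# Invented labels and predicted wide-gap probabilities (illustration only).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

auc = roc_auc(y_true, scores)
precision, recall = precision_recall_at(y_true, scores, threshold=0.5)
f1 = f1_score(precision, recall)
print(auc, precision, recall, f1)
```

Sweeping `threshold` from 0 to 1 and collecting the resulting (recall, precision) pairs traces out a PR curve of the kind shown in Fig. 3d.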

Band gap regression

As mentioned in prior sections, the band gap prediction model was continually improved by adaptively augmenting the size of the training dataset until the MAE for the model on the test data was below 0.2 eV. The regression coefficient (R²) and MAE values for different combinations of training/test data splits evaluated using five-fold cross validation are shown in Table 4. The band gap prediction model results are shown in Fig. 4. To visualize the accuracy of our band gap predictions, the machine-learning predictions of the band gap as a function of the calculated band gap for the training and test data are shown in Fig. 4a and b, respectively. Regression coefficient (R²) values of 0.97 and 0.86 and MAE values of 0.07 eV and 0.18 eV were achieved for the training and test data, respectively, averaged over 100 different runs, each with randomly selected 90% training and 10% test sets. These values compare very favorably with and are at least as good as previously reported band gap prediction models^38,39,40 which are trained on much smaller datasets.
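The two regression metrics and the 90/10 shuffle split can be written out explicitly. This sketch assumes that R² denotes the usual coefficient of determination; the band gap values are invented, and `shuffle_split` is a hypothetical helper rather than the routine used in this work.

```python
import random


def mean_absolute_error(y_true, y_pred):
    """Mean absolute error, in the same units as the inputs (eV here)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def shuffle_split(n_samples, test_fraction=0.1, seed=0):
    """Return (train_indices, test_indices) for one random shuffle split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


# Invented DFT and ML band gaps in eV (illustration only).
gap_dft = [1.0, 2.0, 3.0, 4.0]
gap_ml = [1.1, 1.9, 3.2, 3.8]
mae = mean_absolute_error(gap_dft, gap_ml)
r2 = r2_score(gap_dft, gap_ml)
print(f"MAE = {mae:.2f} eV, R2 = {r2:.2f}")

# One 90/10 split of a dataset the size of D_BG; repeating with different
# seeds and averaging the metrics mimics the 100-run averages quoted above.
train_idx, test_idx = shuffle_split(1575, test_fraction=0.1, seed=0)
```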

Table 4 Performance metrics (regression coefficient R² and MAE in eV) for M_R, averaged over 100 different runs for different combinations of training and test set sizes.

**Fig. 4: Results for prediction of band gaps of oxide perovskites using regression model M_R.**

Figure 4c, d shows the learning curves for the regression model for the R² and MAE performance metrics. Learning curves are a widely used diagnostic tool in ML algorithms to indicate the incremental improvement in learning with respect to an increase in the size of the training dataset. The model is evaluated for a chosen evaluation metric on the training dataset and on a hold-out validation dataset after each update during training; the measured performance, when plotted, constitutes the learning curves. Here, we plot the learning curves for our chosen metrics of R² (Fig. 4c) and the MAE (Fig. 4d). In both these plots, we see that the test curve follows the trajectory of the training curve with a moderate gap between them - indicating that our model exhibits low bias and moderate variance, as is desired.

Figure 4e indicates the regression feature importances. Here, we see that the geometric factors ($t, \bar{\mu}_{B}$) feature very low on the importance list while descriptors such as the electronegativities, HOMO and LUMO energies, Zunger’s pseudopotential radius, and octahedral factor (μ) prominently affect the band gap. Overall, we see that the features corresponding to the B-site appear more predominantly. This is expected since, in general, we observe (as indicated in the SI) that for a given oxide perovskite altering the A-site cation results in a smaller band gap variation as compared to an alteration in the B-site cation. This observation may be further attributed to the ionic nature of the perovskites, wherein the valence band maximum (VBM) states are predominantly oxygen p states while the conduction band minimum (CBM) states are predominantly B element d states. Thus, if the anion chemistry is kept constant, one would expect the nature of the B cations to have the greatest effect on the band gap.

To gain additional insight into the interplay of the model, the model’s features, and their effect on the band gap, we also calculate the two-dimensional (binary) partial dependence for the most important features identified. Partial dependence plots (PDPs)⁷⁰, which are calculated after the model has been fit to the data, show the marginal effect a descriptor has on the predicted outcome of the ML model. PDPs are useful in determining whether the relationship between the target property and the descriptors is linear, monotonic, or more complex. This is done by varying the value of the descriptor of interest and using the model to predict the target value for multiple instances; the average target value is then plotted as a function of the descriptor. Thus, the partial dependence method considers all instances and gives a statement about the global relationship of a feature with the predicted outcome. While the single-feature PDPs capture the average trend of a given feature with the target property, the binary PDPs visualize the partial dependence of two descriptors simultaneously. Note that partial dependence works by marginalizing the machine learning model output over the distribution of the remaining features, so that the function shows the relationship between the features we are interested in and the predicted outcome. Marginalization is a method that requires summing over the possible values of one variable to determine the marginal contribution of another. By marginalizing over the other features, we get a function that depends only on the features of interest.
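The marginalization described above can be sketched in a few lines of NumPy. The function names and the toy model below are hypothetical stand-ins for a fitted model, used only to make the averaging step explicit; they are not the code used in this work.

```python
import numpy as np

def partial_dependence_1d(predict, X, feature, grid):
    """Average prediction over all rows with one feature forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value               # overwrite the feature of interest
        averages.append(predict(X_mod).mean())  # marginalize over the remaining features
    return np.array(averages)

def partial_dependence_2d(predict, X, f1, f2, grid1, grid2):
    """Binary version: force two features at once and average over the rest."""
    out = np.empty((len(grid1), len(grid2)))
    for i, v1 in enumerate(grid1):
        for j, v2 in enumerate(grid2):
            X_mod = X.copy()
            X_mod[:, f1] = v1
            X_mod[:, f2] = v2
            out[i, j] = predict(X_mod).mean()
    return out

# Toy stand-in for a fitted model (assumption): additive in x0 plus an x1*x2 interaction.
def toy_model(X):
    return 2.0 * X[:, 0] + X[:, 1] * X[:, 2]

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 3))
grid = np.linspace(0.0, 1.0, 5)

pd_x0 = partial_dependence_1d(toy_model, X, feature=0, grid=grid)
pd_x1_x2 = partial_dependence_2d(toy_model, X, 1, 2, grid, grid)
print(pd_x0)
print(pd_x1_x2)
```

For this toy model the single-feature curve for x0 is a straight line of slope 2, while the binary surface for (x1, x2) exposes the product interaction that neither single-feature curve would show on its own.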

Figure 5a shows the binary PDPs for the top 5 descriptors, in which the yellow regions indicate the largest values of predicted band gaps. Figure 5b shows the expanded PDP for the top two features, i.e., X^B+ and X^A+. This plot indicates that the maximum values of band gaps may be found for approximate value ranges of 0 < X^B+ < 1.6 and 0 < X^A+ < 1.0. Such feature value ranges may be obtained for various feature pairs, or even single features if one refers to unary PDPs. Figure 5c shows the corresponding scatter plot for the training data, i.e., the marginalized DFT calculated band gaps for varying values of X^B+ on the ordinate and X^A+ on the abscissa. On comparison of Fig. 5b, c, we see that the training data does not include a significant number of data points in the corresponding high band gap region predicted by the PDP. Thus, while PDPs may be used to delineate ranges of feature values to obtain specific ranges of band gaps, it is imperative to keep in mind the limitation that the predictions are averaged values over the remaining features and the training data rarely cover the entire targeted feature ranges in a uniform manner.

**Fig. 5: Partial dependence plots (PDPs) for model features.**

The statistical nature of predictions obtained using ML is concomitant with uncertainties in the predictions themselves. Hence, while it is important to ensure high prediction accuracy while building ML models, it is equally if not more important to explain a prediction, and the confidence that we have in the prediction made by a model. Also, predictions are interpolative for data points that lie within the range of previously seen data, and extrapolative for data points that fall beyond the range of previously seen data. The uncertainty arising due to this distance from the domain of the training data needs to be quantified; hence, confidence intervals for the test data using our band gap regression model were calculated and are shown in Fig. 6a. Here, we see that for a large number of data points, the error bars cross the prediction-equals-calculation diagonal, indicating the lack of residual noise in the predicted band gaps and that the features that constitute the regression model adequately describe the oxide perovskite band gaps. The variation in the lengths of the error bars indicates that our model is more confident about some predictions than others. Reassuringly, we observe that, in general, the model is more confident about those predictions for which the predicted and calculated band gaps are similar.

**Fig. 6: Band gap prediction results.**

Prediction of new oxide perovskites with wide band gaps

Subsequent to the training and testing phases, the wide/narrow band gap classification and band gap regression models were applied to the dataset D_FS of predicted formable and stable oxide perovskites to predict new wide band gap materials and their band gaps respectively. This prediction dataset D_FS consists of 462,248 distinct compounds. Using a 50% probability cutoff, this was reduced to 13,589 (<3%) compounds that were predicted to possess a wide band gap; this set of compounds is designated D_W. The distribution of the wide band gap prediction probabilities for these 13,589 compounds is shown in Fig. 6b. As discussed earlier, the candidate compounds do not contain any experimentally known perovskites, and hence these 13,589 oxide perovskite compositions are novel compounds that have never been synthesized to the best of our knowledge. To down-select a tractable number of wide band gap candidates for further exploration, from these 13,589 compounds, we retained those compounds (D_w) for which the prediction probabilities of experimental formability, thermodynamic stability, and wide band gap nature were greater than 90% which amounted to 310 compounds. The inset in Fig. 6b reflects the wide band gap prediction probabilities for these 310 compounds. For these 310 compounds, we then predicted the band gaps using M_R, the distribution of which is shown in Fig. 6c. The 310 compounds are listed in Table 5 and complete descriptions of these compounds along with their predicted band gaps are included in Supplementary Data 2.
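The two cutoffs described above amount to a pair of successive filters on the predicted probabilities. The sketch below illustrates that logic only; the `Candidate` record, the `screen` function, the placeholder labels, and all probability values are made up for illustration and are not taken from the study's data.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    formula: str
    p_formable: float   # predicted probability of experimental formability
    p_stable: float     # predicted probability of thermodynamic stability
    p_wide_gap: float   # predicted probability that E_g >= 0.5 eV

def screen(candidates, gap_cutoff=0.5, strict_cutoff=0.9):
    """Two-stage filter: wide-gap set first, then the high-confidence subset."""
    wide = [c for c in candidates if c.p_wide_gap > gap_cutoff]
    confident = [
        c for c in wide
        if min(c.p_formable, c.p_stable, c.p_wide_gap) > strict_cutoff
    ]
    return wide, confident

# Made-up probabilities and placeholder labels, for illustration only.
pool = [
    Candidate("candidate-1", 0.95, 0.97, 0.93),
    Candidate("candidate-2", 0.95, 0.60, 0.99),
    Candidate("candidate-3", 0.99, 0.99, 0.40),
    Candidate("candidate-4", 0.80, 0.85, 0.55),
]
wide, confident = screen(pool)
print([c.formula for c in wide])
print([c.formula for c in confident])
```

Only the compounds in the high-confidence subset would then be passed to the regression model for a quantitative band gap prediction.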

Table 5 List of 310 predicted wide band gap oxide perovskite compounds satisfying cutoff probabilities >0.9 for stability, formability and wide/narrow band gap prediction models.


The DFT band gaps of these 310 compounds were calculated in an effort to computationally confirm our predictions, as shown in Fig. 6d. Calculations showed that all of these 310 compounds are indeed wide band gap materials as defined in this work, i.e., all their band gaps were greater than 0.5 eV, thus demonstrating that our insulator classification model is indeed very accurate. Further, the calculated band gaps agree very well with the predicted band gaps, with a maximum error of 0.48 eV, an MAE of 0.21 eV, a mean square error (MSE) of 0.07 eV², and an R² value of 0.84. Absolute differences between predicted and calculated band gaps for these compounds are also given in eV in parentheses in Table 5. In the calculations shown in Fig. 6d, we examine 31 distinct AA’ pairs. For a given AA’ pair, the band gap varies widely with the BB’ pair. For example, for A = A’ = Ba, we examine 32 distinct BB’ pairs and find that the calculated band gap varies from 1.57 to 4.1 eV. In Supplementary Note 1 and Supplementary Note 2 we include detailed heat maps for different combinations of AA’ and BB’.
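For reference, the four agreement measures quoted above can be computed as follows. This is a sketch that assumes R² denotes the usual coefficient of determination; the function name and the numbers are illustrative and are not data from the study.

```python
import numpy as np

def regression_report(y_calc, y_pred):
    """Maximum error, MAE, MSE and R^2 between calculated and predicted values."""
    y_calc = np.asarray(y_calc, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_calc
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_calc - y_calc.mean()) ** 2)   # total sum of squares
    return {
        "max_error": np.max(np.abs(err)),
        "mae": np.mean(np.abs(err)),
        "mse": np.mean(err ** 2),
        "r2": 1.0 - ss_res / ss_tot,
    }

# Illustrative values in eV (assumption, not taken from Table 5).
calc = [1.0, 2.0, 3.0, 4.0]
pred = [1.2, 1.9, 3.3, 3.8]
report = regression_report(calc, pred)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

Note that the MSE carries units of eV², whereas the maximum error and MAE are in eV.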

Discussion

As mentioned in prior sections, with a 50% probability cutoff, 13,589 compounds were predicted to be wide band gap materials, i.e., materials with a DFT band gap greater than 0.5 eV. These 13,589 compounds are double perovskites with A₂BB’O₆, AA’B₂O₆ and AA’BB’O₆ compositions. Thus our predictions enable us to compile design maps which can give an insight into the band gap variation and associated trends of a basic ABO₃ structure when another element (A’/B’) replaces half of the A-site, B-site or A- and B-site cations. Figure 7 indicates one such design map for Ba₂BB’O₆ for selected formable and stable combinations of B and B’. The white squares indicate combinations which are predicted to be narrow band gap materials (E_g < 0.5 eV) and hence do not have an accompanying band gap prediction. Elements are arranged on the x- and y-axes in order of increasing atomic number (left to right and bottom to top respectively). Here, we see that Ta, Sb, In, La on the B-site result in larger band gaps across the board, with the specific combination of Ta and In resulting in the largest band gap of 3.9 eV. On the other hand, the presence of Bi on the B-site always lowers the band gap. Supplementary Note 3 includes Design maps of predicted band gaps with respect to average values of the five top ranked features individually and pairwise are included in Supplementary Note 4 and Supplementary Note 5 respectively. Average values of variation in band gap for base single perovskite oxides with addition of element in A- and B-sublattices are shown in Supplementary Note 5.

**Fig. 7: Band gap prediction map for selected combinations of Ba₂*BB’*O₆.**

Thus, a hierarchical screening strategy (summarized in Fig. 1d) is presented to identify 310 novel formable, stable double oxide perovskites exhibiting wide band gaps via an ML-guided exploration of a very large fraction of the double perovskite chemical space. We employ a large training dataset spanning a vast chemical space to train the ML models and achieve high predictive accuracies. Confidence intervals were derived for the quantitative band gap prediction model and our predictions are validated with DFT computations. We find excellent agreement between our predictions and these validating calculations.

The calculated band gap data that comprises the training data as well as the predicted band gaps may be used to generate design maps for desired combinations of elements on the A- and B- sites that can offer insight into the variation of the band gap due to doping and be used to identify preliminary candidates for specific applications. The 310 identified candidates are novel chemistries that have not been explored experimentally to date. These candidates are predicted to have band gaps ranging from 0.5 to 4 eV and, consequently, can potentially find application in a wide range of areas ranging from infrared radiation detection, solar cells, and other light emitting devices (LEDs), to scintillator materials.

The efficiency of our multi-step hierarchical screening approach, which may be generalized to investigate other classes of materials in addition to the oxide perovskites examined here, provides further impetus to the application of physics-based ML models to the discovery of novel functional materials. This approach - of creating models and then using them to identify specific novel candidates that will be of value to the community at large and that are suggested for follow-on experimental studies - is not always followed in the prior literature. We note that this hierarchical strategy, where we first classify materials with perovskite structure as either insulators or metals and then train regression models on the insulators only with such high accuracies, is unique in the literature and provides a route for improved models for these types of properties.

Lastly, we note that although the absolute DFT gaps are underestimated due to our choice of the low-fidelity but less expensive PBE functional, the relative changes in band gap as a function of chemistry are well estimated by PBE-GGA and correlate with the experimental band gaps, since the underestimation is systematic and generally proportional to the band gap itself. Chemical trends, which in themselves are very useful, are therefore expected to be well captured using our approach. A high-throughput study such as the present one necessitates a low-cost technique and recent studies^56,71 have indicated that ground state properties calculated using the PBE-GGA functional are sufficiently accurate, particularly for changes in electronic structure with chemistry. However, to ensure that this is true for the current application, 100 materials were randomly selected from our training dataset of wide band gap materials and their HSE band gaps were calculated; the results are included in Supplementary Note 6. It is seen that PBE-GGA indeed systematically underestimates the band gap for most materials.

Methods

DFT calculation details

The Vienna ab initio simulation package (VASP)^72,73 implementation of the DFT framework was used in this work to calculate the band gaps of the training data. The parameterization proposed by Perdew, Burke, and Ernzerhof⁵¹ of the GGA⁷⁴ approach was used. A Monkhorst-Pack mesh was used to perform the Brillouin zone integrations with at least 5000 k-points per reciprocal atom. The structures were fully relaxed using the Methfessel–Paxton smearing method⁷⁵ of order one and a final self-consistent static calculation was carried out. The calculations were spin polarized and we used a cutoff energy of 533 eV for all of the structures. All relaxations were carried out until changes in the total energy between relaxation steps were within 1 × 10⁻⁶ eV and atomic forces on each of the atoms were smaller than 0.01 eV/Å.

ML models using random forests

The choice to use RFs to build our ML models was made to leverage the inherent robustness, low bias, and moderate variance afforded by the technique. The RF (also known as random decision forests) is a bagging-based ensemble learning method. It has been shown that using ensembles of trees, where each tree in the ensemble is grown in accordance with the realization of a random vector, results in consequential gains in classification or prediction accuracy. Final predictions are obtained by aggregate voting over the ensemble using equal weights in most cases. RFs seek to induce randomness by using subsets of descriptors drawn at random to determine the optimal split of a given node of a tree, thereby reducing the correlation between the quantities being averaged and consequently enhancing the variance gains. We used the Scikit-learn package to implement our RF models.

For the RF classification model M_C, the training and test dataset selections were stratified over the wide/narrow band gap chemistries using a 90/10 training/test split. Thus, we used 90% of the dataset D_P’ to train the classification model M_C and then tested on the remaining 10%. The maximum tree depth was set at 25 and the number of estimators or trees was 200 for the classification model. For the RF regression model M_R, we also used 90% of the dataset D_BG to train the regression model M_R and then tested on the remaining 10%. The maximum tree depth was set to 50 and the number of estimators was chosen to be 200. To maximize accuracy while minimizing the standard deviation on unseen data, we used 5-fold cross-validation using a 90% training subset to determine the hyper-parameters for both classification and regression.
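A minimal scikit-learn sketch of this setup is given below. The tree depths, numbers of estimators, and 90/10 splits follow the values stated above; the synthetic data, the variable names, and the `random_state` seeds are assumptions made only so that the example runs, and the actual descriptors and datasets are those described in the main text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 8 descriptors, target loosely mimics a gap in eV.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(600, 8))
gap = np.clip(4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.1, 600), 0.0, None)
is_wide = gap >= 0.5  # wide/narrow label as defined in the text

# Classification model: stratified 90/10 split, maximum depth 25, 200 trees.
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(
    X, is_wide, test_size=0.1, stratify=is_wide, random_state=0)
clf = RandomForestClassifier(n_estimators=200, max_depth=25, random_state=0)
clf.fit(Xc_tr, yc_tr)

# Regression model: wide-gap samples only, 90/10 split, maximum depth 50, 200 trees.
Xw, yw = X[is_wide], gap[is_wide]
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xw, yw, test_size=0.1, random_state=0)
reg = RandomForestRegressor(n_estimators=200, max_depth=50, random_state=0)
reg.fit(Xr_tr, yr_tr)

acc = clf.score(Xc_te, yc_te)
pred = reg.predict(Xr_te)
r2 = r2_score(yr_te, pred)
mae = mean_absolute_error(yr_te, pred)
print(f"classifier accuracy: {acc:.2f}")
print(f"regressor R2: {r2:.2f}  MAE: {mae:.2f} eV")
```

Repeating the split with different seeds and averaging the resulting scores would mirror the averaging over random 90/10 partitions reported for M_R.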

Uncertainty quantification

RF models are difficult to interpret owing to their black box nature, and it is difficult to quantify the associated modeling and input uncertainties. Two methods are most widely used for the quantification of confidence intervals in RF-based regression models: i) U-Statistic-based RFs⁷⁶ and ii) bootstrap⁷⁷, jackknife-after-bootstrap⁷⁸, and infinitesimal jackknife⁷⁹ based methods, which we use in this work.

U-statistics⁸⁰ is a class of statistics in which a minimum-variance unbiased prediction is derived by drawing a predetermined number of times through all combinatorial selections of the training data set and then averaging over the possible results of these sub-samples. Mentch and Hooker⁷⁶ showed that under a strict sub-sampling scheme, predictions for individual feature vectors are asymptotically normal, allowing for application of U-statistics to RF predictions. Relevant statistical measures can then be used to quantify the uncertainty related to the reducible error of the RF prediction and construct confidence intervals.

Bootstrap sampling and jackknifing rely on estimating the variance of a prediction by using the variability between re-samples rather than using statistical distributions. In bootstrapping, numerous prediction models are developed by randomly excluding varied small subsets of the training data and the mean and variance of the predictions are estimated from these bootstrapped models. This method attempts to quantify the sensitivity of the model with respect to slight perturbations in the training data. Jackknifing is another re-sampling technique predating bootstrapping, in which each training data point is systematically left out, the model is trained on the remaining data and an estimate is calculated; the jackknife estimate is found by evaluating the average of these calculations. These two ideas may be combined in a jackknife-after-bootstrap method, which is used to find an error estimate (for example, the variance) of a bootstrap estimate. As opposed to jackknifing and jackknifing-after-bootstrapping, where the behavior of a prediction is studied after excluding one or more observations at a time, the infinitesimal jackknife (IJ) looks at the effect on a prediction after down-weighting each observation by an infinitesimal amount. In 2014, Wager et al.⁸¹ demonstrated that both the jackknife-after-bootstrap and the infinitesimal jackknife methods suffer from considerable Monte Carlo bias, and they proposed a bias-corrected version of the method, the implementation of which is used to calculate the confidence intervals for the test data using our band gap regression model.
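To make the two basic re-sampling ideas concrete, the sketch below estimates the standard error of a simple statistic (the sample mean) with the leave-one-out jackknife and with the standard with-replacement bootstrap. It is an illustration on synthetic numbers with hypothetical function names; it is not the bias-corrected infinitesimal jackknife of Wager et al. that was applied to the RF model.

```python
import numpy as np

def jackknife_se(sample, statistic):
    """Leave-one-out jackknife standard error of `statistic`."""
    n = len(sample)
    # Recompute the statistic n times, each time with one observation removed.
    loo = np.array([statistic(np.delete(sample, i)) for i in range(n)])
    return np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))

def bootstrap_se(sample, statistic, n_boot=2000, seed=0):
    """Bootstrap standard error: resample with replacement, take the spread."""
    rng = np.random.default_rng(seed)
    n = len(sample)
    reps = np.array([
        statistic(sample[rng.integers(0, n, size=n)]) for _ in range(n_boot)
    ])
    return reps.std(ddof=1)

# Synthetic sample (assumption): 50 draws from a normal distribution.
rng = np.random.default_rng(1)
sample = rng.normal(2.0, 0.5, size=50)

jk = jackknife_se(sample, np.mean)
bs = bootstrap_se(sample, np.mean)
analytic = sample.std(ddof=1) / np.sqrt(len(sample))
print(f"jackknife SE: {jk:.4f}")
print(f"bootstrap SE: {bs:.4f}")
print(f"analytic SE:  {analytic:.4f}")
```

For the sample mean, the jackknife reproduces the textbook standard error exactly and the bootstrap approaches it as the number of re-samples grows, which makes this a convenient sanity check before applying the same ideas to model predictions.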

Data availability

The training dataset used in this study is described in Supplementary Data 1 and the predicted candidates are listed in Supplementary Data 2. Additional data related to this work is available on reasonable request.

Code availability

The source code used in this study may be downloaded from GitHub (https://github.com/anjanatalapatra/perovskite_oxide_discovery).

References

Ueno, K. et al. Field-effect transistor based on KTaO₃ perovskite. Appl. Phys. Lett. 84, 3726–3728 (2004).
Schubert, E. F. & Kim, J. K. Solid-state light sources getting smart. Science 308, 1274–1278 (2005).
Goetzberger, A. & Hebling, C. Photovoltaic materials, past, present, future. Solar Energy Mater. Solar Cells 62, 1–19 (2000).
Van Loef, E., Dorenbos, P., Van Eijk, C., Krämer, K. & Güdel, H.-U. High-energy-resolution scintillator: Ce³⁺ activated LaBr₃. Appl. Phys. Lett. 79, 1573–1575 (2001).
Yu, L. & Zunger, A. Identification of potential photovoltaic absorbers based on first-principles spectroscopic screening of materials. Phys. Rev. Lett. 108, 068701 (2012).
Castelli, I. E. et al. New light-harvesting materials using accurate and efficient bandgap calculations. Adv. Energy Mater. 5, 1400915 (2015).
Huo, Z., Wei, S.-H. & Yin, W.-J. High-throughput screening of chalcogenide single perovskites by first-principles calculations for photovoltaics. J. Phys. D Appl. Phys. 51, 474003 (2018).
Uchino, K. Glory of piezoelectric perovskites. Sci. Technol. Adv. Mater. 16, 046001 (2015).
DiDomenico Jr, M. & Wemple, S. Optical properties of perovskite oxides in their paraelectric and ferroelectric phases. Phys. Rev. 166, 565 (1968).
Galasso, F. Perovskite type compounds and high T_c superconductors. JOM 39, 8–10 (1987).
Towler, M., Dovesi, R. & Saunders, V. R. Magnetic interactions and the cooperative Jahn-Teller effect in KCuF₃. Phys. Rev. B 52, 10150 (1995).
Visser, D., Ramirez, A. & Subramanian, M. Thermal conductivity of manganite perovskites: colossal magnetoresistance as a lattice-dynamics transition. Phys. Rev. Lett. 78, 3947 (1997).
Dulian, P. Solid-state mechanochemical syntheses of perovskites. In Perovskite Materials: Synthesis, Characterisation, Properties, and Applications, 1 (eds Pan, L. & Guang, Z.) (BoD–Books on Demand, 2016).
Sham, L. J. & Schlüter, M. Density-functional theory of the energy gap. Phys. Rev. Lett. 51, 1888 (1983).
Perdew, J. P. & Levy, M. Physical content of the exact Kohn-Sham orbital energies: band gaps and derivative discontinuities. Phys. Rev. Lett. 51, 1884 (1983).
Anisimov, V. I., Aryasetiawan, F. & Lichtenstein, A. First-principles calculations of the electronic structure and spectra of strongly correlated systems: the LDA+U method. J. Phys. Condensed Matter 9, 767 (1997).
Wang, L., Maxisch, T. & Ceder, G. Oxidation energies of transition metal oxides within the GGA+U framework. Phys. Rev. B 73, 195107 (2006).
Cohen, A. J., Mori-Sánchez, P. & Yang, W. Fractional charge perspective on the band gap in density-functional theory. Phys. Rev. B 77, 115123 (2008).
Martin, R. M. Electronic structure: basic theory and practical methods (Cambridge University Press, 2020).
Chan, M. & Ceder, G. Efficient band gap prediction for solids. Phys. Rev. Lett. 105, 196403 (2010).
Crowley, J. M., Tahir-Kheli, J. & Goddard III, W. A. Resolution of the band gap prediction problem for materials design. J. Phys. Chem. Lett. 7, 1198–1203 (2016).
Aryasetiawan, F. & Gunnarsson, O. The GW method. Rep. Progr. Phys. 61, 237 (1998).
Heyd, J., Scuseria, G. E. & Ernzerhof, M. Hybrid functionals based on a screened Coulomb potential. J. Chem. Phys. 118, 8207–8215 (2003).
Pilania, G. Machine learning in materials science: from explainable predictions to autonomous design. Comput. Mater. Sci. 193, 110360 (2021).
Ramprasad, R., Batra, R., Pilania, G., Mannodi-Kanakkithodi, A. & Kim, C. Machine learning in materials informatics: recent applications and prospects. npj Comput. Mater. 3, 1–13 (2017).
Morgan, D. & Jacobs, R. Opportunities and challenges for machine learning in materials science. Ann. Rev. Mater. Res. 50, 71–103 (2020).
Schmidt, J., Marques, M. R., Botti, S. & Marques, M. A. Recent advances and applications of machine learning in solid-state materials science. npj Comput. Mater. 5, 1–36 (2019).
Choudhary, K. et al. Recent advances and applications of deep learning methods in materials science. npj Comput. Mater. 8, 1–26 (2022).
Gu, T., Lu, W., Bao, X. & Chen, N. Using support vector regression for the prediction of the band gap and melting point of binary and ternary compound semiconductors. Solid State Sci. 8, 129–136 (2006).
Lee, J., Seko, A., Shitara, K., Nakayama, K. & Tanaka, I. Prediction model of band gap for inorganic compounds by combination of density functional theory calculations and machine learning techniques. Phys. Rev. B 93, 115104 (2016).
Huang, Y. et al. Band gap and band alignment prediction of nitride-based semiconductors using machine learning. J. Mater. Chem. C 7, 3238–3245 (2019).
Li, J., Pradhan, B., Gaur, S. & Thomas, J. Predictions and strategies learned from machine learning to develop high-performing perovskite solar cells. Adv. Energy Mater. 9, 1901891 (2019).
Stanley, J. C., Mayr, F. & Gagliardi, A. Machine learning stability and bandgaps of lead-free perovskites for photovoltaics. Adv. Theory Simul. 3, 1900178 (2020).
Setyawan, W., Gaume, R. M., Lam, S., Feigelson, R. S. & Curtarolo, S. High-throughput combinatorial database of electronic band structures for inorganic scintillator materials. ACS Combinatorial Sci. 13, 382–390 (2011).
Dey, P. et al. Informatics-aided bandgap engineering for solar materials. Comput. Mater. Sci. 83, 185–195 (2014).
Khmaissia, F. et al. Accelerating band gap prediction for solar materials using feature selection and regression techniques. Comput. Mater. Sci. 147, 304–315 (2018).
Pilania, G., Wang, C., Jiang, X., Rajasekaran, S. & Ramprasad, R. Accelerating materials property predictions using machine learning. Sci. Rep. 3, 1–6 (2013).
Pilania, G. et al. Machine learning bandgaps of double perovskites. Sci. Rep. 6, 1–10 (2016).
Pilania, G., Gubernatis, J. E. & Lookman, T. Multi-fidelity machine learning models for accurate bandgap predictions of solids. Comput. Mater. Sci. 129, 156–163 (2017).
Na, G. S., Jang, S., Lee, Y.-L. & Chang, H. Tuplewise material representation based machine learning for accurate band gap prediction. J. Phys. Chem. A 124, 10616–10623 (2020).
Li, X.-G. et al. Graph network based deep learning of bandgaps. J. Chem. Phys. 155, 154702 (2021).
Omprakash, P. et al. Graph representational learning for bandgap prediction in varied perovskite crystals. Comput. Mater. Sci. 196, 110530 (2021).
Gladkikh, V. et al. Machine learning for predicting the band gaps of ABX₃ perovskites from elemental properties. J. Phys. Chem. C 124, 8905–8918 (2020).
Baker, N. et al. Workshop report on basic research needs for scientific machine learning: core technologies for artificial intelligence. Technical Report, (USDOE Office of Science (SC), 2019).
Kauwe, S. K., Welker, T. & Sparks, T. D. Extracting knowledge from DFT: experimental band gap predictions through ensemble learning. Integr. Mater. Manuf. Innov. 9, 213–220 (2020).
Li, W. et al. Predicting band gaps and band-edge positions of oxide perovskites using density functional theory and machine learning. Phys. Rev. B 106, 155156 (2022).
Zhang, S. et al. Predicting the formability of hybrid organic–inorganic perovskites via an interpretable machine learning strategy. J. Phys. Chem. Lett. 12, 7423–7430 (2021).
Liu, H. et al. Screening stable and metastable ABO₃ perovskites using machine learning and the Materials Project. Comput. Mater. Sci. 177, 109614 (2020).
Yang, Z. et al. Machine learning accelerates the discovery of light-absorbing materials for double perovskite solar cells. J. Phys. Chem. C 125, 22483–22492 (2021).
Wu, Y., Lu, S., Ju, M.-G., Zhou, Q. & Wang, J. Accelerated design of promising mixed lead-free double halide organic–inorganic perovskites for photovoltaics using machine learning. Nanoscale 13, 12250–12259 (2021).
Perdew, J. P., Burke, K. & Ernzerhof, M. Generalized gradient approximation made simple. Phys. Rev. Lett. 77, 3865 (1996).
Talapatra, A., Uberuaga, B. P., Stanek, C. R. & Pilania, G. A machine learning approach for the prediction of formability and thermodynamic stability of single and double perovskite oxides. Chem. Mater. 33, 845–858 (2021).
Bondzior, B., Vu, T., Stefańska, D., Winiarski, M. & Dereń, P. Tunable broadband emission by bandgap engineering in (Ba,Sr)₂(Mg,Zn)WO₆ inorganic double-perovskites. J. Alloys Compounds 888, 161567 (2021).
Jia, L., Lloyd, M., Lees, M., Huang, L. & Walton, R. Limits of solid solution and evolution of crystal morphology in (La₁₋ₓREₓ)FeO₃ perovskites by low temperature hydrothermal crystallization. Inorg. Chem. 62, 4503–4513 (2023).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Yadav, S. K., Uberuaga, B. P., Nikl, M., Jiang, C. & Stanek, C. R. Band-gap and band-edge engineering of multicomponent garnet scintillators from first principles. Phys. Rev. Appl. 4, 054012 (2015).
Pilania, G., Balachandran, P. V., Gubernatis, J. E. & Lookman, T. Data-based methods for materials design and discovery: basic ideas and general methods. Synth. Lect. Mater. Optics 1, 1–188 (2020).
Ward, L., Agrawal, A., Choudhary, A. & Wolverton, C. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Comput. Mater. 2, 1–7 (2016).
Pearson, K. & Lee, A. Mathematical contributions to the theory of evolution. viii. on the inheritance of characters not capable of exact quantitative measurement. part i. introductory. part ii. on the inheritance of coat-colour in horses. part iii. on the inheritance of eye-colour in man. Philos. Trans. R. Soc. Lond. Ser. A 195, 79–150 (1900).
Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Beurich, H., Madach, T., Richter, F. & Vahrenkamp, H. Experiments on the HOMO-LUMO nature of metal-metal bonds. Angew. Chemie Int. Ed. Engl. 18, 690–691 (1979).
Zunger, A. A pseudopotential viewpoint of the electronic and structural properties of crystals. Struct. Bond. Cryst. 1, 73–135 (1981).
Goldschmidt, V. M. Die Gesetze der Krystallochemie. Naturwissenschaften 14, 477–485 (1926).
Li, C. et al. Formability of ABX₃ (X= F, Cl, Br, I) Halide Perovskites. Acta Crystallogr. Sect. B Struct. Sci. 64, 702–707 (2008).
Filip, M. R. & Giustino, F. The geometric blueprint of perovskites. Proc. Natl Acad. Sci. 115, 5397–5402 (2018).
Shannon, R. T. & Prewitt, C. Revised values of effective ionic radii. Acta Crystallogr. Sect. B Struct. Crystallogr. Crystal Chem. 26, 1046–1048 (1970).
Ghiringhelli, L. M., Vybiral, J., Levchenko, S. V., Draxl, C. & Scheffler, M. Big data of materials science: critical role of the descriptor. Phys. Rev. Lett. 114, 105503 (2015).
Hanley, J. A. & McNeil, B. J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36 (1982).
Friedman, J., Hastie, T. & Tibshirani, R. The elements of statistical learning, vol. 1, Springer series in statistics (Springer, 2001).
Fasoli, M. et al. Band-gap engineering for removing shallow traps in rare-earth Lu₃Al₅O₁₂ garnet scintillators using Ga³⁺ doping. Phys. Rev. B 84, 081102 (2011).
Kresse, G. & Furthmüller, J. Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set. Phys. Rev. B 54, 11169 (1996).
Kresse, G. & Furthmüller, J. Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set. Comput. Mater. Sci. 6, 15–50 (1996).
Perdew, J. P. & Wang, Y. Accurate and simple analytic representation of the electron-gas correlation energy. Phys. Rev. B 45, 13244 (1992).
Methfessel, M. & Paxton, A. High-precision sampling for Brillouin-zone integration in metals. Phys. Rev. B 40, 3616 (1989).
Mentch, L. & Hooker, G. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J. Mach. Learn. Res. 17, 841–881 (2016).
Efron, B., Halloran, E. & Holmes, S. Bootstrap confidence levels for phylogenetic trees. Proc. Natl Acad. Sci. 93, 13429–13429 (1996).
Efron, B. Jackknife-after-bootstrap standard errors and influence functions. J. R. Stat. Soc. Ser. B 54, 83–111 (1992).
Efron, B. Estimation and accuracy after model selection. J. Am. Stat. Assoc. 109, 991–1007 (2014).
Hoeffding, W. A class of statistics with asymptotically normal distribution. In Breakthroughs in statistics, 308–334 (Springer, 1992).
Wager, S., Hastie, T. & Efron, B. Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J. Mach. Learn. Res. 15, 1625–1651 (2014).

Acknowledgements

Research presented in this paper was supported by the Laboratory Directed Research and Development program of Los Alamos National Laboratory under project numbers 20190656PRD4 and 20190043DR. Computational support for this work was provided by LANL's high-performance computing clusters. This work was supported by the U.S. Department of Energy through the Los Alamos National Laboratory. Los Alamos National Laboratory is operated by Triad National Security, LLC, for the National Nuclear Security Administration of the U.S. Department of Energy (Contract No. 89233218CNA000001).

Author information

Ghanshyam Pilania
Present address: GE Research, Schenectady, NY, 12309, USA

Authors and Affiliations

Materials Science and Technology Division, Los Alamos National Laboratory, Los Alamos, NM, 87544, USA
Anjana Talapatra, Blas Pedro Uberuaga, Christopher Richard Stanek & Ghanshyam Pilania

Contributions

B.P.U. and G.P. proposed and supervised the entire project, and C.R.S. was involved in the initial conceptualization. A.T. developed and tested the model and performed the DFT simulations. B.P.U., G.P. and A.T. analyzed and discussed the theory results. A.T. prepared the final draft of the manuscript, which was then reviewed and edited by all the authors.

Corresponding author

Correspondence to Anjana Talapatra.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Communications Materials thanks Felix Mayr, Seunghun Jang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Milica Todorović and Aldo Isidori. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Peer Review File

Supplementary Information

Description of Additional Supplementary Files

Supplementary Data 1

Supplementary Data 2

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Talapatra, A., Uberuaga, B.P., Stanek, C.R. et al. Band gap predictions of double perovskite oxides using machine learning. Commun Mater 4, 46 (2023). https://doi.org/10.1038/s43246-023-00373-4

Received: 28 April 2022
Accepted: 22 May 2023
Published: 10 June 2023
DOI: https://doi.org/10.1038/s43246-023-00373-4
