Introduction

Identifying optimal materials in applications research is a time-consuming step due to the vast scope of possible materials composed of three-dimensional (3D) networks of elements selected from the periodic table. Data-driven research has recently received attention as a new route to accelerating this step.1,2,3,4,5,6,7,8,9 This approach uses a pre-computed materials database and statistical tools that efficiently screen candidates in a search for optimal materials. The availability of open-access databases of material properties,10,11,12,13,14 along with machine learning (ML) techniques, has rapidly advanced research in this area. Over the last decade, ML has been applied to materials science problems in a variety of directions, such as prediction and classification of crystal structures,1,15,16,17 development of interatomic potentials,18,19,20 finding of optimal density functionals for density functional theory,21,22,23 and building of predictive models of material properties.24,25,26,27

The use of ML in materials science, however, has been hindered by the limited accuracy and interpretability of predictive models. Complex interactions among the chemical constituents of materials lead to highly nonlinear relationships between material features and target properties. To accurately describe such relationships, nonlinear ML algorithms have been utilized due to their flexible functional forms. However, the lack of interpretability of most nonlinear ML predictive models prevents further mechanistic understanding, such as identifying the key ingredients of a target property. Thus, finding ML algorithms that achieve both accurate prediction and interpretability is crucial to the further advance of data-driven materials research.

Tree-based learning algorithms are one such candidate due to their advantages in both accuracy and interpretability.28 Utilizing tree-based algorithms, we here focus on finding optimal candidates for double perovskite solar cells. While recent solar cell technology has been propelled by the development of hybrid lead perovskites, which offer increasing power conversion efficiencies and low-cost manufacturing, the inclusion of lead ions raises environmental and health issues that hinder commercialization.29,30 Alternatively, a new strategy using mixed mono- and tri-valent cations, in the form of the double perovskite A2B1+B3+X6, has been introduced to replace lead-based perovskite solar cell materials.31,32,33,34 In this approach, a sizable number of double perovskite combinations is possible, and thus the combination of high-throughput computations and ML techniques can be a powerful tool to explore the large combinatorial space.

Here, employing the gradient-boosted regression tree (GBRT) algorithm and a dataset of calculated electronic structures of A2B1+B3+X6, we present an ML-based investigation that can ultimately be used to identify Pb-free double perovskite solar cell materials. The GBRT method allows us to obtain highly accurate predictive models for the heat of formation (ΔHF) and bandgap (Eg), together with importance scores for each material feature. Based on these scores, we extract the features crucial to determining the values of ΔHF and Eg of halide double perovskites, enabling an overall understanding of the relationships between features and properties. Finally, we discuss the relevance of the extracted features to the chemical and physical aspects of ΔHF and Eg, and practical approaches of the ML model toward finding optimal candidates for Pb-free halide double perovskite solar cell materials.

Results

Dataset of Pb-free halide double perovskites

For the ML investigation, we first generated a dataset of the electronic structures of halide double perovskites. Figure 1a presents the crystal structure of the double perovskite A2B1+B3+X6. Compared to the original perovskite, this structure incorporates two different types of cations, B1+ and B3+, instead of a single B cation. With the anion X, both the B1+ and B3+ cations form octahedral units. Perovskites commonly undergo structural phase transitions driven by tilting and rotation of these octahedral units. As shown in Fig. 1b, two possible crystal structures can be considered: one has a cubic space group; the other has an orthorhombic space group.

Fig. 1
figure 1

a Crystal structure of double perovskite with A, B1+, B3+, and X-sites denoted by light blue, blue, red, and gray spheres, respectively. b Structural deformation by tilting of the octahedral unit. c List of chemical elements considered in the dataset of halide double perovskites. Distribution of calculated d heat of formation (∆HF) and e bandgap (Eg). In each panel, average values are depicted as a white point

In this study, we considered coinage elements and lower group XIII elements for B1+, and upper group XIII and lower group XV elements for B3+. The resulting combinations that can substitute for the di-valent lead ion number more than 30. Furthermore, we considered the series of alkali metals from K to Cs for the A-site (Li and Na were not considered because of their small size). Here, for simplicity of calculation, organic molecules, such as methyl ammonium, were not included. Halogen ions were assigned to the X-site. Figure 1c summarizes all combinations of chemical constituents. In sum, along with the two space groups of the crystal structure, 540 hypothetical compounds of A2B1+B3+X6 were considered.

Using first-principles density functional theory (DFT), we generated a dataset including values of ΔHF and Eg for the 540 compounds. ΔHF indicates the stability of a compound relative to the elemental phases of its chemical constituents; generally, a more negative value of ΔHF indicates a more stable compound. On the other hand, Eg represents the capability to absorb solar energy, which is critical to achieving high solar cell performance. The optimal value for solar absorption is reported to range from 1.1 eV to 1.8 eV.35 However, note that Eg is severely underestimated in this work due to the limitations of standard DFT.36 Previous studies have shown that a DFT bandgap from 0.3 eV to 0.8 eV can recover to a value optimal for a solar cell material when more accurate computational methods, such as hybrid DFT or GW, are used.37,38 Further computational details for these two quantities can be found in the Methods section.

We can detect a few notable characteristics of ΔHF and Eg without relying on ML analysis. In the case of ΔHF, all candidate materials have negative values, indicating that all can be stably synthesized (see Fig. 1d). Another prominent observation about ΔHF is its dependence on the halogen anion, which contrasts with its weak dependence on the other elements. As the halogen atom changes from iodine to chlorine, ΔHF decreases. On the other hand, the relationship of Eg to the atomic species and space group is more complicated; we summarize two remarkable characteristics in the following. First, in most cases, Eg increases upon changing the space group (SG) from cubic to orthorhombic (Fig. 1e). Second, Eg mostly increases as the halogen atom changes from iodine to chlorine. Beyond these, no other clear dependencies are observed in the mapping between Eg and the materials.

Machine learning and features

We next applied machine learning algorithms to the dataset of halide double perovskites. In general, a machine learning study requires the appropriate selection of a learning algorithm and an optimized set of input features. In this study, we employed the gradient-boosted regression tree (GBRT), one of the tree-based machine learning algorithms. Decision tree learning is a machine learning method that uses a tree-like diagram, usually a binary tree, to predict a target variable. The goal is to create a tree in which each node represents a split based on one of the input features, and each leaf represents the prediction of the target variable. The prediction can be nonlinear because the partitioning of the input variable space is repeated recursively.39 Compared to other ML methods, the decision tree is advantageous for its accuracy and speed, although it is prone to overfitting.40 Using ensemble methods such as bagging and boosting can prevent overfitting and thus improve the accuracy.41,42,43 GBRT adopts the gradient boosting method, which combines weak learners into one strong learner using the gradient descent algorithm.

Furthermore, the predictive model can record the improvement in prediction attributable to a specific feature as each node corresponding to that feature is added to the trees. In this manner, one can measure feature importance automatically.44 This feature importance fosters interpretability of the predicted results and leads to the extraction of key-features, which we will show later in the results. In the present work, we adopted the gradient boosting method to generate a regression tree ensemble as implemented in the XGBoost library.28 See further details in the Methods section.

Another critical step for achieving good prediction performance is the selection of appropriate input features, referred to as feature engineering. For a dataset of materials, features should clearly describe a single given material and also discriminate between separate materials. In this study, we selected 32 features, including chemical information of the atomic constituents and geometric information such as bond length and crystal symmetry. The total of 32 features includes the following:

Pauling's electronegativity (χ), ionization potential (IP), highest occupied atomic level (h), lowest unoccupied atomic level (l), and s-, p-, and d-valence orbital radii rs, rp, and rd of the isolated neutral atoms A, B1+, B3+, and X; atomic distance (D) between each cation and the nearest halogen atom; space group of the crystal (SG). Unlike the other features, SG is considered a categorical variable for cubic and orthorhombic symmetry.

Here, we note that the GBRT algorithm cannot appropriately evaluate the importance scores of two strongly correlated features, because such features cannot be distinguished during the learning process. Thus, reducing the dimension of the feature space can improve the quality of prediction while simultaneously decreasing the computing cost. To this end, we implemented a dimensionality reduction based on the square of the Pearson correlation matrix among features. For each pair of features x and y, the square of the correlation coefficient \(R_{xy}^2\) is defined as

$$R_{xy}^2 = \left( {\frac{{\mathop {\sum}\nolimits_{i = 1}^n {(x_i - \bar x)(y_i - \bar y)} }}{{\sqrt {\mathop {\sum}\nolimits_{i = 1}^n {(x_i - \bar x)^2} \mathop {\sum}\nolimits_{i = 1}^n {(y_i - \bar y)^2} } }}} \right)^2,$$

where \(\bar x\) and \(\bar y\) are the sample means of the features xi and yi of the i-th material over a total of n compounds. As shown in Fig. 2, strong correlations are found in several pairs of features: (1) all atomic features of the halogen atoms at the X-site, (2) rs and rp for the A-, B1+-, or B3+-site atoms, and (3) IP and h for the A-, B1+-, or B3+-site atoms. The lack of atomic variation at the X-site (only three atoms: Cl, Br, and I) leads to strong correlations among all pairs of atomic features. The same trend was observed at the A-site, although it was not as strong as at the X-site. The correlation matrices were used to downselect features of the halide double perovskite dataset. We selected χ as the representative of all atomic features of the X-site atoms. Furthermore, we selected only rs and excluded rp for the A-, B1+-, and B3+-site atoms. Of IP and h of the A-, B1+-, and B3+-site atoms, we considered only h. We found that the machine learning performance is almost unchanged under this dimensionality reduction based on the Pearson correlation matrix among features.
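The screening above can be sketched in a few lines of numpy; the feature matrix below is a toy stand-in (one nearly collinear pair plus an independent column), and the 0.9 threshold is an illustrative choice, not the value used in the paper.

```python
# Sketch: squared-Pearson-correlation downselection on a toy feature matrix
# (rows = compounds, columns = features).
import numpy as np

rng = np.random.default_rng(1)
n = 100
f1 = rng.normal(size=n)
f2 = 2.0 * f1 + 0.01 * rng.normal(size=n)   # nearly collinear with f1
f3 = rng.normal(size=n)                      # independent feature
F = np.column_stack([f1, f2, f3])

R2 = np.corrcoef(F, rowvar=False) ** 2       # squared Pearson correlation matrix

# keep a feature only if it is not strongly correlated with any kept feature
threshold = 0.9
keep = []
for j in range(F.shape[1]):
    if all(R2[j, k] < threshold for k in keep):
        keep.append(j)
```

Here `keep` retains the first column of each strongly correlated pair, mirroring the choice of a single representative (e.g., χ for the X-site features).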

Fig. 2
figure 2

Squared Pearson correlation coefficient matrix for halide double perovskites. The size and color of the circles vary with the values

Predictive model and feature importance

We performed regression using GBRT to predict the values of Eg and ΔHF of the halide double perovskites A2B1+B3+X6. Figure 3a presents the prediction of ΔHF. The results show that the averaged root-mean-square error (RMSE) of the test sets for ΔHF is 0.021 eV/atom. Even though the size of the current dataset was limited, it is noteworthy that the accuracy of the GBRT predictive model of ΔHF is comparable to the fundamental error of 0.024 eV/atom arising from differences between experimental and DFT-based values of ΔHF for ternary oxides.45 In the case of Eg, the averaged RMSE of the test sets is 0.223 eV (Fig. 3b), which is larger than the error for ΔHF. However, as the effective range (1.1–1.8 eV) of the bandgap of a solar cell material is much larger than this RMSE, the lower accuracy for Eg could be acceptable for solar cell applications.

Fig. 3
figure 3

Prediction of a heat of formation and b bandgap for halide double perovskites. Orange filled circles correspond to training dataset and blue circles to test dataset. Red solid lines indicate the reference line corresponding to the perfect fit. Feature importance from GBRT for c heat of formation and d bandgap of halide double perovskite

The high accuracy of the predictive models can be attributed to several factors. One is the nonlinear nature of the GBRT algorithm, in contrast to that of most linear algorithms (see Supplementary Information 1). Another reason could be the structural similarity of the materials in the dataset. In the present study, the crystal structures of all materials are perovskite. Given the same structure, materials with similar chemical constituents might have similar properties, which makes interpolation of properties in the predictive models more feasible. Discussing structural similarity is beyond the scope of our study, but it is becoming a significant topic in ML studies of materials science.46,47

On top of providing highly accurate predictive models, the GBRT method provides interpretation of the results via feature importance scores, which is the main advantage of this method. Figure 3c, d show the importance scores of all features for ΔHF and Eg, respectively. For ΔHF, the type of halogen anion, represented by χX, is revealed to be the most important feature (Fig. 3c). Remarkably, the importance score of χX is more than two times higher than that of the second-ranked feature, \(D_{B^{3 + }}\). Beyond the first two features, the importance scores decrease steeply, indicating that ΔHF strongly depends on only a few material features. On the other hand, the feature importance used to predict Eg is more dispersed (Fig. 3d). This implies a more complex relationship between Eg and the material features, which is consistent with the tendency observed without ML techniques (see Fig. 1e). We found that although SG is the most important feature, the following features, such as χX, \(h_{B^{1 + }}\), \(l_{B^{1 + }}\), \(D_{B^{3 + }}\), \(rd_{B^{3 + }}\), and \(rd_{B^{1 + }}\), are not negligible.

Extraction of key-features and application

In this section, we suggest possible uses of the feature importance scores in an efficient search for target materials. First, we show that the feature importance scores can be utilized to extract the key-features determining the target properties. To this end, we recursively excluded the least important feature and built a predictive model using only the remaining set of features, learning new decision trees each time. This process was repeated until only the top three features remained for each target property. Figure 4a, b show how the errors in the predictions of ΔHF and Eg change during this key-feature extraction. Remarkably, we found that the RMSEs increase almost monotonically in both cases, which indicates that a feature with a higher importance score has stronger predictive power.

Fig. 4
figure 4

Root-mean-square-error (RMSE) of a heat of formation and the b bandgap of halide double perovskite as a function of the number of features. In each panel, the blue curve corresponds to the training set and the orange curve to the test set

Through this process, we selected the most important features for each target property. For ΔHF, the RMSE increases abruptly when the number of features is smaller than five (Fig. 4a). This means that at least five features are required to predict ΔHF within an RMSE of 0.036 eV. In the case of Eg, seven features comprise the minimal set needed to predict Eg within an RMSE of 0.322 eV (Fig. 4b). The list of selected features includes χX, \(D_{B^{3 + }}\), DA, \(h_{B^{1 + }}\), and \(D_{B^{1 + }}\) for ΔHF, and SG, χX, \(h_{B^{1 + }}\), \(l_{B^{1 + }}\), \(D_{B^{3 + }}\), \(rd_{B^{3 + }}\), and \(rd_{B^{1 + }}\) for Eg.

Key-features extracted from the feature importance scores of the GBRT method can have various implications for a fast search for new materials. For instance, the selected key-features can be utilized as an optimal set of features in other types of ML algorithms, such as classification. To search for optimal solar cell candidates, we considered a binary classification with class 1 for Eg in the range of 0.3–0.8 eV and class 0 otherwise. Figure 5 presents the accuracy of the classification tasks, as well as the corresponding confusion matrices, where the classification was performed with different numbers of features. With a smaller number of features, the prediction accuracy is lower than with all features. However, it is notable that classification with the top seven key-features selected for Eg provides a good approximation to that with the full feature set. As the classification can be accelerated in the reduced feature space, materials screening prior to further investigation becomes feasible for a massive dataset.

Fig. 5
figure 5

a The accuracy of classification as a function of the number of features. Confusion matrices of classification results for the fivefold test set with b 3 features, c 7 features, and d 12 features. Class “1” denotes the set of materials with a bandgap in the range of [0.3 eV, 0.8 eV] and class “0” all other materials

The mechanistic understanding of the relationship between features and target properties can provide a practical guide in the search for optimal double perovskite solar cell materials. For example, Cs2InAgCl6 is one of the newly synthesized Pb-free halide double perovskites, but its bandgap is too large for it to be used as a solar cell material.34 Knowing the key-features that determine Eg, one can consider several plausible remedies to decrease Eg: anion mixing, motivated by the electronegativity of the halogen anions; and partial substitution or alloying of In and Ag, motivated by the highest occupied and lowest unoccupied orbitals of the B1+ ions, the distance between the B3+ ion and the anion, and the d-orbital radii of the B1+ and B3+ ions.

However, we note that an explicit relationship between features and properties is still difficult to obtain directly from the feature importance scores, and additional steps would be needed. To this end, one may utilize the feature selection process described above and fit the properties using basis functions of the reduced features. Such an explicit relationship could be used to deepen the mechanistic understanding of target properties, for example by revealing interactions between important features, but this is beyond the scope of our paper.

Discussion

The importance-score-based selection process of the GBRT method does not incorporate scientific knowledge. Here, we examine whether the roles of the features selected by the GBRT method in determining the target properties are consistent with previously established scientific knowledge.

First, we investigated the roles of the features selected for ΔHF in terms of the bonding mechanism. The top five selected features for ΔHF were χX, \(D_{B^{3 + }}\), DA, \(h_{B^{1 + }}\), and \(D_{B^{1 + }}\) (Fig. 4), among which the most important was χX. Interestingly, it is well known that the difference between the electronegativities of bonding partners is a good indicator of the bonding character and bonding strength, which strongly affect ΔHF. Compared to other groups of elements, the halogen group shows a relatively large variation in electronegativity, which can explain the strong dependency of ΔHF on χX. Along with electronegativity, the distances between cation and anion (DA, \(D_{B^{1 + }}\), and \(D_{B^{3 + }}\)) are also good indicators of bonding strength: strong bonding is usually accompanied by a shorter bond length. As for \(h_{B^{1 + }}\), \(IP_{B^{1 + }}\) is strongly correlated with \(h_{B^{1 + }}\) (see Fig. 2), and IP is also an important chemical quantity for explaining ionic bonding. In this way, the top five selected features for ΔHF are all relevant to the bonding mechanism, which is central to determining ΔHF.

In the case of Eg, more complicated theories are required to understand the role of the selected features. For this purpose, we performed DFT analysis of the band structure, explicitly focusing on the selected features. Generally, symmetry-lowering by tilting of the octahedral unit of a halide perovskite increases Eg because of bandwidth shrinkage.48 In the case of halide double perovskite, a similar transformation of band structure is found. Figure 6a–d show the band structures of cubic and orthorhombic phases of the representative compounds. In all cases, the bandwidths of the conduction and valence bands were reduced by the tilted octahedral units, and this led to increases of the bandgap in orthorhombic phases. For other compounds, the same trend occurs, as shown in Fig. 1e. Thus, SG is relevant in determining Eg.

Fig. 6
figure 6

ad Present cubic and orthorhombic crystal structures and corresponding band structures of Rb2AgBiBr6, Rb2AgInBr6, Cs2TlBiBr6, and Rb2TlGaBr6, respectively. e Schematic illustration of a band diagram for halide double perovskite

To check the validity of the features χX, \(h_{B^{1 + }}\), \(l_{B^{1 + }}\), \(D_{B^{3 + }}\), \(rd_{B^{3 + }}\), and \(rd_{B^{1 + }}\), we plotted a schematic diagram of the orbital hybridization between cations and anions in halide double perovskites based on the DFT band structure calculations (see Fig. 6e). The diagram shows that Eg originates from the energy difference between two hybridized states: the valence band maximum (VBM), composed of an anti-bonding state involving B1+-site and X-site atoms, and the conduction band minimum (CBM), composed of an anti-bonding state involving B3+-site and X-site atoms. Even though complicated interactions exist, the energy levels of the valence electrons of B1+, B3+, and X play important roles in determining the value of Eg of a given compound. The highest occupied and lowest unoccupied atomic levels can be good indicators of these energy levels. In addition to the atomic levels, the electronegativity can play a crucial role in determining Eg by controlling the energy splitting between the bonding and anti-bonding states, denoted by ΔE in Fig. 6e. This splitting indicates the strength of the hybridization among the orbitals: high electronegativity leads to a tightly bound electron distribution around the atoms, and strong hybridization is reflected in a short bond length, such as \(D_{B^{3 + }}\).

In this study, we used the GBRT method to investigate the ML prediction of the values of ΔHF and Eg of halide double perovskites. The GBRT method provided outstanding prediction performance for those properties, as well as an interpretable feature importance score. Notably, for ΔHF, GBRT worked accurately, with an error comparable to the fundamental error associated with the difference between experimental and DFT values. The prediction of Eg was also acceptable for use in the search for solar cell materials. Key-features extracted based on the importance scores provide a better mechanistic understanding of ΔHF and Eg. On top of the accurate and interpretable predictive models, we further verified that the key-features are consistent with established scientific knowledge.

Methods

Density functional theory (DFT)

Structural optimization and calculations of the total energy and electronic band structure of the 540 halide double perovskites were performed within the density functional theory (DFT) formalism. We utilized a plane-wave basis set (cutoff energy = 350 eV) and the projector augmented wave method,49 as implemented in the Vienna Ab-initio Simulation Package (VASP).50,51 For the exchange-correlation functional, the generalized gradient approximation was adopted in the Perdew–Burke–Ernzerhof formalism.52 A 5 × 5 × 5 regular grid was employed for momentum space sampling. The heat of formation, ΔHF, was calculated using the following formula:

$${\it{\Delta }}H_{\mathrm{F}} = E_{{\mathrm{tot}}}\left( {A_2B^{1 + }B^{3 + }X_6} \right) - \left( {2E_{{\mathrm{ref}}}\left( A \right) + E_{{\mathrm{ref}}}\left( {B^{1 + }} \right) + E_{{\mathrm{ref}}}\left( {B^{3 + }} \right) + 6E_{{\mathrm{ref}}}\left( X \right)} \right),$$

where Etot is the total energy and Eref is the reference energy. For the band structures, spin–orbit interaction was considered.
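The bookkeeping in the formula above can be written as a short function; the energy values used to exercise it below are made-up placeholders, not DFT results.

```python
# Sketch of the heat-of-formation formula for A2(B1+)(B3+)X6.
# All energy arguments are in eV; the result is per atom
# (10 atoms per formula unit).
def heat_of_formation(E_tot, E_ref_A, E_ref_B1, E_ref_B3, E_ref_X, n_atoms=10):
    """ΔHF per atom: total energy minus the elemental reference energies."""
    dHf = E_tot - (2 * E_ref_A + E_ref_B1 + E_ref_B3 + 6 * E_ref_X)
    return dHf / n_atoms

# hypothetical numbers purely to exercise the formula
dHf = heat_of_formation(E_tot=-45.0, E_ref_A=-1.0, E_ref_B1=-2.0,
                        E_ref_B3=-3.0, E_ref_X=-6.0)   # -> -0.2 eV/atom
```

A negative result, as here, indicates stability relative to the elemental phases.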

Atomic features

The highest occupied atomic level (h) and the lowest unoccupied atomic level (l) were taken from the atomic parameters of the VASP pseudopotentials.50,51 We set h as the highest orbital energy of the partially or fully occupied orbitals and l as the lowest orbital energy of the unoccupied ones. If the orbital energies of the unoccupied orbitals were not available in the parameter files, the highest orbital energy was set as l. In determining h and l, we treated degenerate atomic orbitals as the same orbital.

Gradient-boosted regression tree

A supervised learning model has a loss function to be minimized. In XGBoost, the loss function of the model (an ensemble of trees fk) is

$${\cal L} = \mathop {\sum}\limits_i {{\kern 1pt} {\mathrm{l}}(y_i,{\hat{ y}}_i)} + \mathop {\sum}\limits_{\mathrm{k}} {{\kern 1pt} {\mathrm{\Omega }}(f_{\mathrm{k}})}$$

where l is a function that measures the difference between the prediction and the target, and Ω is a regularization term (the complexity of the tree) that prevents overfitting. This loss function cannot be optimized using traditional optimization methods: the model is trained in an additive manner, adding one tree at a time, namely the tree that most improves the model (most decreases the loss) given the existing set of trees. Let xi be the i-th sample and \({\hat{ y}}_i^{(t - 1)}\) be its prediction with the current set of t − 1 trees. The model needs to add the t-th tree ft to minimize the loss function

$${\cal L}^{({\mathrm{t}})}(f_{\mathrm{t}}) = \mathop {\sum }\limits_{i = 1}^{\mathrm{n}} {\mathrm{l}}(y_i,{\hat{ y}}_i^{(t - 1)} + f_{\mathrm{t}}({\mathbf{x}}_i)) + {\mathrm{\Omega }}(f_{\mathrm{t}}).$$
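The additive training above can be illustrated with a minimal from-scratch sketch for squared loss, where the negative gradient is simply the current residual: each new tree is fitted to the residuals and added with a shrinkage factor. This is a bare illustration of the principle (using sklearn trees as weak learners), not XGBoost's regularized, second-order algorithm.

```python
# Sketch: additive gradient boosting with squared loss on toy data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2

learning_rate = 0.1
prediction = np.zeros_like(y)            # \hat{y}^{(0)} = 0
trees = []
for t in range(100):
    residual = y - prediction            # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residual)                # the t-th tree fits the residuals
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

rmse = float(np.sqrt(np.mean((y - prediction) ** 2)))
```

After enough rounds the ensemble prediction closely tracks the target; the shrinkage factor plays the role of XGBoost's learning_rate.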

It is impossible, however, to check all possible tree structures ft to be added. Therefore, the algorithm starts from a root (a single leaf) and greedily adds branches to the tree. At each step, the model finds a leaf to split, together with the feature and splitting value that maximize the loss reduction. If the current tree structure is \(f_{\mathrm{t}}^{({\mathrm{current}})}\) and the structure after the split is \(f_{\mathrm{t}}^{({\mathrm{split}})}\), the loss reduction from branching is the difference of the loss, \(D_{{\mathrm{score}}}(f_{\mathrm{t}}) = {\cal L}^{({\mathrm{t}})}\left( {f_{\mathrm{t}}^{({\mathrm{current}})}} \right) - {\cal L}^{({\mathrm{t}})}\left( {f_{\mathrm{t}}^{({\mathrm{split}})}} \right)\). This means that for each branch of the trees, the model knows which feature is used for the split and its loss reduction.28 For each split during model training, the algorithm finds (approximately) the feature and splitting point that provide the largest decrease in the loss. One can then count the number of times each feature is selected for a split, or average the gains of the splits that use each feature; the library offers these values as F-scores of type weight and type gain, respectively. In this study, we used gain.

The hyperparameters of each model were optimized using a cross-validated grid search or a randomized search over the parameter settings. The RMSE of the target variable was used as the cost function for all models. For each model's training, a randomly chosen 80% of the samples was used for training and the remaining 20% for the test set. We averaged over 200 evaluations (40 sets of fivefold training/test splits with random shuffling, which likewise use 80% of the data for training) to calculate the importance scores shown in the figures. Several hyperparameters exist in the gradient-boosted regression tree. The parameter subsample (the subsample ratio of the training instances, used for bagging) and colsample_bytree (the subsample ratio of features, as in random forests) were optimized to prevent overfitting. The regularization parameters were also optimized. The values of the hyperparameters used here are:

$${\mathrm{max}}\_{\mathrm{depth}} = {\mathrm{6}},{\mathrm{min}}\_{\mathrm{child}}\_{\mathrm{weight}} = {\mathrm{1}},{\mathrm{colsample}}\_{\mathrm{bytree}} = {\mathrm{0}}{\mathrm{.5}},$$
$${\mathrm{subsample}} = {\mathrm{0}}{\mathrm{.7}},{\mathrm{reg}}\_{\mathrm{alpha}} = {\mathrm{0}}{\mathrm{.1}},{\mathrm{learning}}\_{\mathrm{rate}} = {\mathrm{0}}{\mathrm{.03}}{\mathrm{.}}$$

The parameter max_depth sets the maximum depth of a tree, and min_child_weight sets the minimum number of instances needed in each node. The smaller max_depth and the larger min_child_weight are, the less likely the training is to overfit. The parameter reg_alpha is an L1 regularization term on the weights.

Gradient-boosted classification trees

The classifications were performed on the halide double perovskite dataset with gradient-boosted classification trees (GBCT). For these classifications and the predictions over all data, the dataset was separated into five disjoint sub-datasets; the training data then consisted of four sub-datasets and the test data of the remaining one. This yields five predictions that together cover the full halide double perovskite dataset. In establishing the predictive model, typical hyperparameter values of GBCT were adopted for the classifications. The parameter values used here are:

$${\mathrm{max}}\_{\mathrm{depth}} = {\mathrm{4}},\,{\mathrm{min}}\_{\mathrm{child}}\_{\mathrm{weight}} = {\mathrm{4}},\,{\mathrm{colsample}}\_{\mathrm{bytree}} = {\mathrm{0}}{\mathrm{.8}},$$
$${\mathrm{subsample}} = {\mathrm{0}}{\mathrm{.8}},\,{\mathrm{reg}}\_{\mathrm{alpha}} = {\mathrm{0}}{\mathrm{.1}},\,{\mathrm{learning}}\_{\mathrm{rate}} = {\mathrm{0}}{\mathrm{.1}}{\mathrm{.}}$$