Accelerated discovery of stable lead-free hybrid organic-inorganic perovskites via machine learning

Rapidly discovering functional materials remains an open challenge because the traditional trial-and-error methods are usually inefficient especially when thousands of candidates are treated. Here, we develop a target-driven method to predict undiscovered hybrid organic-inorganic perovskites (HOIPs) for photovoltaics. This strategy, combining machine learning techniques and density functional theory calculations, aims to quickly screen the HOIPs based on bandgap and solve the problems of toxicity and poor environmental stability in HOIPs. Successfully, six orthorhombic lead-free HOIPs with proper bandgap for solar cells and room temperature thermal stability are screened out from 5158 unexplored HOIPs and two of them stand out with direct bandgaps in the visible region and excellent environmental stability. Essentially, a close structure-property relationship mapping the HOIPs bandgap is established. Our method can achieve high accuracy in a flash and be applicable to a broad class of functional material design.


Supplementary Figures
Supplementary Figure 1 | Training dataset ratio selection. The test percentage of training data set is from 1% to 99%. Each training data set is used to train the ML model and record the model score R 2 . As is shown in Supplementary Fig. 1, when the training data radio is up to 80%, the ML model performs best. So we split the input data set in to training dataset (80%) and test dataset (20%).

Supplementary Tables
Supplementary Table 1 | Thirty initial features with description.

Features Description
A,eff , B , and X Iron radii for the A-, B-and X-site atoms [7][8][9] Tf Tolerance factor defined as A,eff + X √2( B + X ) 10,11 Of Octahedral factor defined as B  Supplementary Fig. 3a. Firstly, 30 initial features are ranked by GBR algorithm according to the relative importance. Then, we remove the least important feature (i.e., the 30th feature) out of the whole feature set.
The remaining 29 features constitute a new feature set for the next step feature selection.
Repeatedly, we rank the rest of features and remove the least important one. We record the model score (R 2 ) of trained ML model during each selection step and find that the ML model shows the best performance when the feature set includes 14 features, as is shown in Supplementary Fig. 3b. Moreover, it clearly shows when the number of features reaches 14, the addition of features has little impact on the prediction performance of the ML model. In other words, the rest 16 features removed have little effect on the bandgap of HOIPs. Supplementary Fig. 6, GBR algorithm has an advantage in terms of R 2 and MSE. When predicting unknown data, the performances of DTR and MLPR algorithm are not stable (the standard deviations of their R 2 are large). Although SVR, KRR and GPR have no standard deviations, their MSE is a little larger than GBR algorithm. As a result, we chose GBR algorithm, whose performance is the best among the six algorithms.

Supplementary Note 3. As is shown in
On the other hand, we compared GBR algorithm with other five algorithms according to R 2 . We found that all of their P values (Sig.) are equal to zero (< 0.05) after test of normality for five paired samples' R 2 differences (Supplementary Table 2).
It demonstrates that their R 2 differences are all normally distributed and we are able to apply paired samples t test to their R 2 values. Table 3, the mean R 2 differences between GBR algorithm and other five ML algorithms are all positive, which means the prediction performance of GBR algorithm is better in general. What's more, the P values (Sig. (2-tailed)) of five pairs are all equal to zero (< 0.05), which shows that GBR algorithm is significantly different from the other five algorithms. This result can also be obtained from the 95% confidence interval (CI) of the average difference. If zero is not included in 95% CI, P<0.05. In five paired samples t tests, none of them contains zero. Supplementary Fig. 8 As is shown in Supplementary Fig. 11

Supplementary Note 4. As shown in
The value of r is between -1 and +1. If r is larger than zero, it indicates that the two variables are positively correlated, that is, the larger of one variable is, the larger of the other variable will be. If r is less than zero, it suggests that the two variables are negatively correlated. In addition, the greater the absolute value of r, the stronger the correlation.
Density functional theory. All first-principles calculations for selected HOIPs were carried out using the projector-augmented wave (PAW) method with the generalized gradient approximation (GGA), implemented in the Vienna Ab initio Simulation Package package 23 . The exchange-correlation functional was described by Perdew-Burke-Ernzerh (PBE) 24 functional considering the PBE method is more consistent with the experimental results for the HOIP materials due to fortuitous error-error offset 25,26 .
The cutoff energy for the plane-wave basis was set as 520 eV. Furthermore, the DFT-D3 method was adopted for the van der Waals correction 27 . The structure optimization process ended until an energy convergence threshold of 10 -5 eV and atomic force less than 0.01 eV/Å. The initial HOIP structures in a ( √2 × √2 × 2 ) unit cell were constructed within periodic boundary condition. The Brillouin zone integration was performed using a 4 × 3 × 4 k-point mesh for the orthorhombic phase.
Ab initio molecular dynamics (AIMD) simulations were performed to confirm dynamics stability of the selected materials, which is in supercells of 2√2 × 2√2 × 2 of unit cell. The entire MD simulation lasted 5 ps with the step of 1 fs. The temperature was controlled at 300K by using the Nosé-Hoover method 28,29 .
The adsorptions of H2O/O2 on the (001) surface of the HOIP structures were investigated and the H2O/O2 binding energy Eads was defined as: Eads = EHOIP-H2O/O2 -EHOIP -EH2O/O2, where EHOIP-H2O/O2, EHOIP and EH2O/O2 are the total energies of the H2O/O2-adsorbed HOIP structures, the HOIP structures and H2O/O2 respectively 30 . It was calculated in supercells with a vacuum space larger than 18 Å above the structure along the z-axis. Initially, one water molecule (oxygen molecule) was put at the top of the organic molecule on ABr-terminated surface.