Introduction

Perovskite materials have attracted much attention in many scientific fields for the composition diversity, easily available synthetic conditions and a variety of attractive properties1,2. For instance, hybrid organic–inorganic perovskite has widely applied in the fields of solar cells, light-emitting diodes, lasers, and photodetectors due to its longer charge diffusion lengths both for electrons and holes, higher carrier mobility and broad tunable bandgap (Eg)3,4. ABO3-type perovskite oxide has gradually become a research hotspot in modern industrial catalysis and thermoelectricity for the controllable structure, outstanding stability and low cost5,6. Inorganic double perovskite has aroused an interest in solar cells and light-emitting diodes because of adjustable photoelectric properties7,8. The trends of published papers searched on the website ‘web of science’ from 1961 to December 2020 are shown in Fig. 1. The number of papers under the keyword of perovskite shows an alarming increase. Especially after 2013, since the perovskite solar cell was proposed, the related publications has increased exponentially, indicating that perovskite materials have always been a hotspot for scientists.

Fig. 1: Number of published papers.
figure 1

a On keyword of ‘perovskite’ (from 1961 to December 2020). b On key words of ‘machine learning and material’ and ‘machine learning and perovskite’ (from 2002 to December 2020).

The traditional way to develop materials is usually based on trial and error, continuous synthesis and characterization keep trying until the properties of virtual materials meet the target. The method requires a long-time study on a limited quantity of materials and complicated experimental procedures, which can be a time-consuming and expensive endeavor. Under this limitation, important scientific progress often comes from the researchers’ experience and intuition or even was discovered by accident9,10. Besides, the discovery of high-performance materials needs a long cycle from experimental design to commercialization. With the development of synthesis and characterization techniques, the corresponding data become more and more complex. It is a great challenge to figure out the relationship between materials descriptors and properties by traditional experimental methods. To overcome this shortcoming, material simulated methods, including Density Functional Theory (DFT)11, Monte Carlo simulation12 and molecular dynamics13 are employed to explore the relationship between the structural, compositional, and technological descriptors and performance of materials at different scales. In particular, DFT could be used to obtain some key properties of the material without the need for experimental synthesis. However, most computational methods only aim at a specific system, leading to an unbearable amount of computation for complex systems. Some theoretical methods still cannot meet the requirements of quantitative description of material properties. Moreover, computational simulation methods require high computational costs and professional skills.

In recent years, artificial intelligence (AI), known as the ‘fourth paradigm of science’, has attracted worldwide attentions14. Since the 1980s, machine learning (ML) has been the core of AI for the power of reorganizing existing knowledge structures and mining implicit relationships. ML can extract valuable information from existing data, even failed experimental data15,16. For material science, ML has been becoming a powerful tool to assist design and screen various materials. A series of achievements about ML have been made in superconductor, photovoltaic materials and high entropy alloys17,18,19. As shown in Fig. 1, ML has also been applied widely in perovskite materials. Considering that ML has conducted a lot of researches in this field, reviewing their progress and providing an outlook for future work will be helpful in the development of perovskite materials.

In this review, we briefly discuss the successful application of ML in properties prediction and stability assessment of perovskite material. In section 2, the basic workflow of ML in material science is outlined. In section 3, we introduce the different types of perovskites and applications of ML in various properties of perovskite materials. In section 4, some of the current challenges and opportunities encountering in ML applications to perovskite design and discovery are briefly discussed. Our work is to provide practical guidance for accelerating the design of perovskite materials.

Workflow of machine learning

ML is an interdisciplinary subject that combines knowledge of computer science, statistics, mathematics and engineering to form an important branch of artificial intelligence20,21. The most common application of ML is to construct a statistical model used for data analysis and prediction. The main purpose of ML aims at evaluating or predicting the objects after training the model with historical data and specific conditions22. With the development of materials genome initiative (MGI), ML keeps playing an important role in materials science with the ability of mapping the relationships and trends through the available data without the physical mechanisms. In addition, the constructed ML model could also be applied again for reverse discovery of high-performance materials. To screen virtual materials with desired properties, ML could be allowed to develop quantitative structure-property relationships to predict the properties of virtual materials. The general workflow (Fig. 2) of ML in material science includes data preparation, feature engineering, model selection, model evaluation, and model application.

Fig. 2: The general workflow of ML in perovskite materials and related applications.
figure 2

Workflow of ML in perovskite materials includes data preparation, feature engineering, model selection, model evaluation and model application.

Data preparation

The dataset used for ML usually contains dependent and independent variables associated with the materials. Independent variables, also known as features or descriptors, refer to the representative information related to the structure and characteristics of materials, including the chemical composition, atomic or molecular parameters, structural parameters, as well as the technological conditions for synthesis process. The dependent variables refer to the target property of the materials affected by the independent variables, also known as the target variables23,24. the quantity and quality of data are key factors in the discovery of materials.

The amount of required samples depends on the ML model, but a general rule of thumb is that a reasonable ML model requires the number of data more than three times of descriptors at least, but some models such as neural networks and deep learning require large amounts of samples25. The quality of the data depends on the spatial coverage of the target properties and the uncertainties associated with the data. In general, data with a normal distribution is better for ML. Insufficient data of specific target or poor coverage of specific properties may not form an appropriate data distribution for ML. Also, data uncertainty, such as experimental error or calculation error, could affect the quality of data. The roughness of the modelling data directly determines the results of the constructed model. In general, the prediction error of the model is higher than the error of the training data. The methods adopted to reduce the roughness of the data include deletion of missing values, completion of experimental conditions, data normalization and other methods.

Data can be collected from available databases (Table 1) or published papers. The database contains many types of data, which could be generated from experiments, simulations and ML. A large amount of data could be collected from database, while the reproducibility of the data is uncertain, which might not ensure the quality of data. To guarantee the quantity and quality, data should be collected from the authoritative databases if the data are available. Using autonomous workflows to generate data could be a very convenient and fast way, but the quality of data obtained in this way may be inferior to the data from the database. If the database or autonomous workflows does not obtain the needed data, the dataset could also be generated through lab-scale calculations. Lab-scale calculations could be performed in many existing open sources and various software platforms, such as Materials Studio (MS), Vienna Ab initio Simulation Package (VASP), Car-Parrinello Molecular Dynamics (CPMD). Lab-scale calculations could generate a large amount of required data with good reproducibility to guarantee the quality of data. Generally, ML models constructed by the calculated data have relatively good evaluation metrics. However, the calculations of complex materials may take up too many calculation resources and take a long time.

Table 1 Publicly accessible databases of various materials.

The origin data collected from computational simulations or experimental measurements are often presented with incompleteness, noise, and inconsistency26. Thus, data preprocessing should be performed to ensure consistency and integrity from origin data. Specifically, individual data need to be reformatted into a single tabular form, imputed missing values, eliminated erroneous or incomparable data points, and normalized and rescaled the data. Data standardization can improve model accuracy and convergence speed. The results of several ML algorithms can vary with whether any standardization or scaled. It is worth noting that both feature variables and target variables can be normalized or scaled27.

Feature engineering

The properties of each material depend on a specific set of features, also called descriptors. Before model construction, it is crucial to identify the key features closely related to the target properties28. Features are generally derived from known properties of the constituent elements, such as atomic radius and electronegativity. The quantity of features should be less than that of dataset samples for effectively training and avoiding overfitting. Therefore, feature selection should reduce the dimension of input space as much as possible without losing important information. In particular, redundant and high self-correlation features should be removed to guarantee the efficiency and accuracy of models25,29. Reasonable material features should meet the following three conditions of perfect representation of material properties, sensitive to target properties, and easy to obtain30.

Model selection

ML algorithms could be briefly divided into two categories: supervised learning and unsupervised learning. Supervised learning is the process of using a set of samples with known labels to adjust the parameters of the models and achieve the required performance, which be further divided into regression and classification31. With the target property being a continuous value, the process is called regression. If the target is a discrete value, the process of searching the prediction function is called classification. Tables 2, 3 have summarized common ML algorithms in material design. Generally, the best model is obtained by comparing multiple algorithms. The criteria of algorithm selection are mainly based on the results of cross validation and independent test. The commonly used evaluation metrics include mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), determination coefficient (R2), correlation coefficient (R) for regression; confusion matrix, precision, recall, receiver operating characteristic curve (ROC), and area under ROC curve (AUC) for classification.

Table 2 Common ML algorithms in material design.
Table 3 Overview of application cases of ML in perovskite materials.

It is important for researchers to construct optimal models that map existing data without overfitting or underfitting23. After the model selection, it is generally necessary to optimize the internal hyper-parameters of the model algorithm to balance overfitting and underfitting32. Overfitting refers to that the model focuses too much on each individual data point in the training set, the unknown data cannot be well predicted. While underfitting means that the model is too simple to capture the general information of the training set, resulting in a large deviation. Models insensitive to hyper-parameters usually do not require repeated adjustments to achieve satisfying results33,34. For the hyper parametric sensitive model, the performance of the model will be closely related to the selection. It is necessary to adjust different hyper-parameters to achieve the robust ML model.

Model evaluation

The core of ML is to realize accurate prediction of unknown samples based on known information. There are inevitably some statistical errors in the calculation, which should rationally be checked and evaluated in the process model evaluation for the subsequent model application. There are three commonly used model evaluation methods: independent test, cross validation, and bootstrapping.

In general, the generalization error of the model can be evaluated by testing, but the goal of the model is the prediction of unknown samples. Therefore, a testing set is needed to test the generalization ability. The error obtained with the testing set can be taken as an approximation of the generalization error. The smaller error of independent test usually indicates the stronger generalization of the model available. It is worth noting that the independent testing set should be mutually exclusive to the training set35.

Cross validation (CV) could be used to evaluate the reliability of the ML models. In the CV, the input data is divided into k mutually exclusive subsets of the similar size, each subset is generated by ‘stratified samples’. Then, the union of k-1 subsets is used as the training set with the remaining one used as the testing set. After k times of training and testing, all test results are averaged to represent the final ML performance. The stability and fidelity of the evaluation results of the CV method depend to a large extent on the value of k. Hence, the CV method is usually called k-fold cross validation. In the k-fold CV, k is a specified number, the commonly used values are 5, 10, and 20. When k is equal to the sample number of input data, this method is called leave-one-out cross validation (LOOCV). LOOCV is not affected by random sample partitioning and the results are often considered to be more accurate. However, LOOCV may cost a long time and a lot of computational resources, which is not suitable for large dataset.

Bootstrap method is based on bootstrap sampling36. Given a dataset D containing m samples, the bootstrapping method would randomly copy a sample from D to the dataset D’ at a time until D’ contains m samples. In this process, some data may be sampled repeatedly while some data may never be sampled. Finally, D is designated as the training set and DD’ is used as the testing set. The number of training samples obtained by the bootstrapping method is equal to the original dataset. The bootstrapping method is effective under the condition of a small dataset. Nevertheless, the dataset generated by the bootstrap method may change the distribution of the original dataset and introduce the deviation.

Model application

The purpose of ML is to generalize the hidden patterns between descriptors and material properties of existing data samples. The properties could be accurately predicted with the constructed model. Therefore, the developed ML model can be applied to high-throughput screening. First, many virtual samples could be designed, and then the properties could be predicted with ML model. Finally, the materials with desired properties would be selected from the hypothetical samples for the experiments.

To further develop the application of ML model, it has also become one of the hotspots to develop the online prediction model for sharing. The network model enables more users to predict target properties. For example, Shi et al.37 developed the online server for predicting the specific surface area of ABO3 perovskites. Furmanchuk et al.38 developed an online application to predict the Seebeck coefficient of crystalline materials. The approach of developing models into online servers not only exposes sharing models, but also makes model applications easier and faster.

Applications of machine learning in perovskite materials

Perovskite, named after Russian geologist Perovski, originally referred to a specific compound, calcium titanate (CaTiO3). Now it is used to stand for a group of compounds with the same crystal structure as CaTiO339. The structural formula of perovskite material is usually represented as ABX3 or AA’BB’X6, where A and B are cations, the ionic radius of A is larger than that of B, and X usually means halogen ions or oxygen ions with small radius40. ABX3 perovskite is called simple perovskite, while AA’BB’X6 perovskite is known as double perovskite (DP). According to whether A-site cations are organic small molecules or metal ions, the simple perovskite could be classified into inorganic perovskite and hybrid organic–inorganic perovskite (HOIP)39,41. As shown in Fig. 3a, the ideal structure of simple perovskite generally presents a cubic structure. The eight vertex angles of the cube are occupied with inorganic cations or small organic groups A, the body center position is occupied with cations B, and the six face center positions are occupied with anions X42,43. The BX6 regular octahedron consists with six face-centered X anions and the body-centered B cations. Furthermore, the crystal structure of DP can be composed of the regular alternate arrangement of BX6 and B’X6 octahedrons (Fig. 3b). Normally, B and B’ are different transition metals, and A and A’ could be the same or different alkaline-earth or rare-earth metals44. Due to the flexibility of perovskite crystal structure, the ions at the Ca, Ti, or O positions of CaTiO3 can be replaced by elements or groups with similar radius, making the types of perovskite rich and diverse. The number of potential perovskites could reach tens of thousands. Taking the element doping into consideration, the potential number of perovskites could easily exceed 10741,45. Up to now, there are about 1000 perovskites that have been developed through experiments46. There is still a huge space for stable perovskites to be excavated. It would be a time-consuming and inefficient project to find stable and high-performance perovskites simply by experiments or DFT calculations. Based on many existing experimental and computational data, ML technology has gradually played an important role in the perovskites discovery.

Fig. 3: Structures of different perovskites.
figure 3

a Simple perovskite cubic crystal structure and b Double perovskite crystal structure.

This review mainly introduces the application of ML in the discovery and rational design of ABX3 inorganic perovskites, HOIPs, and DPs. In addition, two-dimensional layered perovskites are often classified as perovskites. However, the related works in layered perovskites are too seldom to discuss.

ABX3 inorganic perovskites

ABX3 inorganic perovskite is one of the most active materials. The diversity and flexibility make them a wide variety, and also lead to many different material properties, such as ferroelectricity and piezoelectricity47,48. In many applications, these properties are unmatched by other known materials, which makes inorganic perovskites greatly important in various fields, such as magnetic refrigeration49,50, solid oxide fuel cells51,52, and photocatalysis53,54.

Theoretically, most elements in the periodic table can replace the A or B of ABX3 to form perovskites (Fig. 4). However, not all compounds with ABX3 stoichiometry are perovskite structures. Therefore, finding an efficient way to determine whether a compound with the formula ABX3 exhibits a perovskite structure has been the first challenge in perovskite discovery and design. In many researches, the Goldschmidt tolerance factor (t)55 (Formula (1)) is usually used to judge the structure formability and phase stability of perovskite. However, the Goldschmidt tolerance factor is insufficient with an increasing variety of perovskites. Some researchers have proposed methods to determine the formability of perovskite structure. For instance, Sun et al.56 proposed a descriptor based on the tolerance factor and the octahedral factor, which accuracy reached 90%. Bartel et al.57 developed a tolerance factor (Formula (5)) that can be used to determine the formability of simple perovskite and double perovskite, defining when τ of the compounds less than 4.18 represents perovskite with 91% accuracy.

$$t = \frac{{r_A + r_X}}{{\sqrt 2 \left( {r_B + r_X} \right)}}$$
(1)
$$\mu = \frac{{r_B}}{{r_X}}$$
(2)
$$\eta = \frac{{V_A + V_B + 3 \cdot V_X}}{{a^3}}$$
(3)
$$\left( {\mu + t} \right)^\eta$$
(4)
$$\tau = \frac{{r_X}}{{r_B}} - n_A\left( {n_A - \frac{{\frac{{r_A}}{{r_B}}}}{{ln\left( {\frac{{r_A}}{{r_B}}} \right)}}} \right)$$
(5)

Where rA, rB, and rX are ionic radii of A, B, and X, respectively; μ is the octahedral factor; η is the atomic packing fraction; VA, VB, and VX are atomic volumes of A, B, and X, respectively, based on the rigid sphere model; a is the lattice constant of cubic cell; nA is the oxidation state of A.

Fig. 4: The elements of the periodic table to form perovskites.
figure 4

Elements that have the probability of form the perovskite structure in the periodic table.

Although these descriptors can well evaluate the formability and stability of perovskite with high accuracy, researchers still try to find the factors and patterns controlling the formability of perovskite structure through ML to develop a method that can fully judge the formability and stability of perovskite in a faster and more accurate way. The energy beyond the convex hull (Ehull) is a measure of the decomposition of the compound into a linear combination of the stable phases present on the phase diagram58. It is significant to evaluate the materials dynamic stability of. Normally, thermodynamically stable compounds have zero Ehull, while more positive values of Ehull indicate decreasing stability59. Ehull can be calculated by the DFT, but the huge computational costs limit the power of DFT in materials with a large chemical search space. In 2017, Schmidt et al.60 constructed a dataset containing 250,000 ABX3 compounds, from which about 20,000 ABX3 perovskite compounds were randomly extracted for model construction. An ML model with Ehull as the target variable was built to predict the stability of the compound. The ML model was used to predict the thermodynamic stability of the remaining approximately 230,000 virtual samples, in which there were 641 formally candidates with Ehull less than 5 meV/atom. Li et al.61 developed a ML model to predict the thermodynamic phase stability of perovskite oxides using a dataset of more than 1900 Ehull predicted by DFT. Two ML models were constructed respectively to classify and regress the Ehull, and then predict 15 perovskite candidate materials. Finally, four stable perovskites (La0.5Y0.5Co0.5Mn0.5O3、Y0.75Sr0.25VO3、CeReO3 and Dy0.75Nd0.25RuO3) were presented. In 2020, Liu et al.62 screened stable and metastable ABO3 perovskites using ML and the materials project based on the dataset of 397 ABO3 compounds (Fig. 5a). The ML classification model was applied to divide 891 ABO3 compounds into perovskite and non-perovskite compounds. The results showed that 331 compounds had perovskite structures, in which 174 had a formation probability of ≥85%. In addition, 37 thermodynamically stable ABO3 perovskites (0 meV/atom < Ehull < 36 meV/atom) and 13 metastable perovskites (36 meV/atom < Ehull < 70 meV/atom) were screened through the ML regression model for further synthesis and application. These researches have proven that the ML model could provide effective guidance to determine the stability of various perovskite oxides.

Fig. 5: Workflows of ML in ABO3 perovskites.
figure 5

a Workflow for predicting stability and metastability of ABO3 perovskite. Reproduced with permission from ref. 62. Copyright Elsevier 2020 b Workflow for the ABO3 cubic perovskite65.

In addition to Ehull, the formation energy of compounds could also be used to evaluate the formability and stability of perovskite. Li et al.63 proposed a transfer learning strategy to evaluate the stability of the ABX3 inorganic perovskites. First, an ML transfer learning model was constructed by taking the formation energies of 570 perovskites as the target variable and the physics-informed structural and elemental parameters of perovskites as descriptors. Then the transfer learning model was applied to predict the formation energies of 578 compounds with unknown target. With the combination of two datasets above, 1148 data were used to train a convolutional neural network model for high-throughput screening. Finally, 764 promising perovskite materials with the tolerance factor τ less than 4.8 were selected from 21316 assumed perovskites by the screening model, 98 of which have been validated to be stable by DFT calculation. In typical ML-based material discovery and large-scale screening of hypothetical perovskites, transfer learning is a recently developing ML method in dealing with the small data problems.

It is also an effective strategy to use the ML model constructed with experimental perovskite and non-perovskite to predict the formation probability of large quantities of unknown potential perovskites. In 2016, Pilania et al.64 demonstrated the powerful function and practicality of ML via SVM based classifier, which used elemental parameters to evaluate the formability of ABX3 halides in the perovskite crystal structure. After the exploration of vast descriptors, ionic radii, tolerance factor, and octahedral factor are identified as the most crucial related features for the model, indicating that steric and geometric packing effects have a great impact on the stability of these halides64. 40 ABX3 with perovskite-type crystal structures were proposed through predicting the perovskite formability of 455 ABX3 compounds with ML. Balachandran et al.65 developed two decision tree classifiers to acquire many potential perovskite materials and cubic perovskites, as shown in Fig. 5b. Two models with accuracy more than 90% were trained to predict unknown 625 compounds, in which 235 were perovskite and 20 were cubic perovskites. Besides, 87 promising perovskite candidates were selected for further experimental guide. Analyzing the results, potential perovskites may locate at (a) A and B atoms are a lanthanide or actinide elements, (b) A atom is an alkali metal, alkali earth metal or late transition metal atom, or (c) B atom is a p-block element65.

In 2019, Jain et al.66 constructed an ML classification model based on SVM with 189 ABX3 inorganic samples to predict the perovskite formability of 454 ABX3 compositions, among which the formation probability of 45 compounds is equal to or higher than 0.8. After comparing the thermodynamic stability information of perovskite in MP, AFLOW, and OQMD, 18 compounds were subject to carry out the DFT-based bulk structural optimizations and electronic structure predictions. According to the overall DFT results, two promising stable photovoltaic candidates, RbSnCl3 and RbSnBr3, were represented for further study. This work is an important step towards a basic understanding of the interfacial properties of perovskites, facilitating further breakthroughs in photovoltaic technology.

Recently, Park et al.67 proposed a method to identify the stability of perovskite. A series of ML models were developed for the target properties of the perovskite, namely, octahedral deformation parameters including the energy difference (ΔHc) between the relaxed and ideal cubic structures, quadratic elongation (λ), as well as octahedral angle variance (σ2) (Fig. 6a). The possibility of a known cation embedded in the perovskite was systematically analyzed. The influence of A-site cation on the phase stability of the perovskite was evaluated by measuring the degree of octahedral deformation when a given cation embedded in [BC6]4−67. This work shows that the combination of advanced electronic structure theory and ML analysis can provide an effective strategy that is superior to the conventional trial-and-error method in material design. More importantly, it provides a powerful guide for exploring a broad composition space of inorganic and mixed perovskites.

Fig. 6: Applications of ML in inorganic perovskites.
figure 6

a The parameters ΔH, λ, and σ2 were used to quantify the distortion out of the ideal cubic perovskites67. b the formation energy and Eg of predicted Li(Na)BX3 perovskite with DFT. The perovskites in the red box have the ideal formation energy and Eg values69. c Relationship between the total energy values for BaNbO2N supercells in the training set (rhombuses) and the test set (triangles) predicted by RR and the data calculated by DFT. Reproduced with permission from ref. 98. Copyright Elsevier 2019 d The predicted phase-transition energy difference ΔE versus DFT calculations. Reproduced with permission from ref. 100. Copyright John Wiley and Sons, Inc. 2019 e Comparison between test bandgap EHSE gand predicted bandgap EML g. Reproduced with permission from ref. 100. Copyright John Wiley and Sons, Inc. 2019 f 151 promising perovskites with different types of X-site compositions. Reproduced with permission from ref. 100. Copyright John Wiley and Sons, Inc. 2019.

Eg is a significant parameter in the applications of electrical conductivity, light-harvesting capability, photoelectric conversion, and other functions in perovskites, which is directly related with the properties of various photovoltaic devices68. The theoretical models of Eg could accelerate the discovery of perovskites, and help navigate the broad space of potential perovskite materials, and guide chemists to screen out candidates for experiments.

In 2018, Takahashi et al.69 used the RF to predict the Eg of ABX3 perovskite to determine whether the Eg values of the candidates meet the requirement of the applicable range of solar cells (1.7–3.0 eV). After model training with the Eg data of 15,000 perovskite materials, 9328 potential perovskite materials with Eg at the range of 1.7–3.0 eV were extracted from 414,736 candidates. Then the Eg values of the selected candidates based on Li and Na were calculated and evaluated with DFT, where 11 undiscovered Li (Na) based perovskite materials fell into the ideal Eg and formation energy ranges for solar cell applications (Fig. 6b). In addition to using the classification models to screen promising candidates with appropriate Eg, the ML regression models also have excellent predictive performance. Li et al.45 constructed a ML model with ABO3 perovskite formation energy (Ef) as the target. Then, the Ef predicted by the model was used as the instrumental variable to build a progressive learning model to predict the Eg of the perovskite materials. The results of the model indicated that the addition of predicted Ef as an instrumental descriptor can promote the prediction accuracy of Eg regression model (R2 = 0.855). This progressive learning strategy with instrumental descriptors provides an approach to widen the feature pool and reduce the computational effort instead of high-cost DFT calculations.

Even in the era of big data, limited samples are still the majority in material science. How to make full use of the limited samples for ML has also been a research potential in recent years. It is believed that the emergence of each method for small sample datasets would bring a bit of dawn to the development of ML in material science. Gladkikh et al.68 presented an ML technology suitable for small datasets-alternating conditional expectations (ACE). ACE has an advantage in that it shows the results in a graphic form, which can help for model interpretation. The graphic form of the ACE transformations can view the impact of each descriptor on the target property. Furthermore, ACE does not suffer from the curse of dimensionality due to it is estimated by univariate functions. They used ACE to study nonlinear mappings between Eg and descriptors of component elements and constructed a model to predict the Eg of the perovskites. The R2 and RMSE of the training set were 0.824 and 0.836 eV, respectively. Their study indicated that the Eg values of ABO3 perovskites mostly depend on the electronegativities, electron affinities, ionization energies, and atomic radii of the constituents.

The critical temperature at which ferroelectric materials convert from the ferroelectric to the paraelectric phase is called Curie temperature (Tc), also known as the Curie point70. Tc has been a key indicator in property measurement of ferroelectric materials. Most inorganic perovskites represented by ABO3 structure have excellent ferroelectric properties and become one of the most promising materials for electronic and magnetic components such as multilayer capacitors and sensors71,72. Tc has a considerable influence on many applications of perovskite materials in the magnetic recording, sensor, actuators, and refrigeration73,74. Therefore, it is quite meaningful to predict Tc of perovskite materials quickly and effectively before experiments.

In 2018, Zhai et al.75 developed a prediction model of Tc with physicochemical parameters based on ML. In the meanwhile, the potential perovskite material (La0.66Sr0.3Ba0.04MnO3) with high Tc of 390.35 °C were found from the virtual samples by the SVR model combined with the genetic algorithm search strategy. Similarly, Yang et al.76 used RR, SVM, ERT and other ML methods to train the Tc of lead-based perovskite ferroelectrics. The ML model integrating with the above three algorithms was used to predict the Tc of more than 200,000 kinds of lead-based perovskite materials outside the dataset. Then two lead-based perovskite solid solution ferroelectric materials were screened with high Tc of 481 °C and 466 °C were selected for further experiments. In addition, the integrated ML model was also applied to analyze the Tc prediction results of PbGa1/2Nb1/2-PbMn1/2Nb1/2O3-PbTiO3 system.

Neel temperature (TN) is the critical temperature at which antiferromagnetic material becomes paramagnetic77. It has been reported that TN is closely related to the applications of ABO3 perovskite in the fields of magnetic refrigeration, colossal magnetoresistance, etc78,79. Therefore, accurate and rapid prediction of TN is a very significant work in the design and discovery of perovskite oxides. Xiao et al.79 mapped the relationship between the main atomic parameters of Mn-based perovskite oxide and TN with SVM. It is worth noting that SVM is an algorithm especially appropriate for small sample datasets, which can build ML model with high generalization in a limited sample size. This work is helpful for the simple and rapid prediction of the TN of Mn-based perovskite.

Energy efficiency and sustainable development are the priority topics in modern society. However, refrigeration and air-conditioning consume a large amount of electric energy among various end-uses of energy in both commercial and residential areas80. Most refrigeration technologies rely on traditional conventional gas compression technologies, which have come under increasing criticism for their inefficiency and the use of air pollutant gases. The latest development of magnetic refrigeration technology based on the magnetocaloric effect of magnetic materials (especially near room temperature) has provided a promising alternative to vapor compression refrigeration81,82. In order to design a magnetic refrigerator with an operating temperature close to room temperature, much attention has been paid to the magnetocaloric material with a large maximum magnetic entropy change (MMEC) over a wide temperature range83,84. Zhang et al.85 established a GPR model to elucidate the statistical relationship between the MMEC and lattice parameters of magnetocaloric lanthanum manganite perovskites. The model demonstrated a high accuracy and stability with RMSE, MAE and correlation coefficients being 0.0121, 0.0054, and 99.997%, respectively. In addition, the model could be used as part of ML to get a better understanding of magnetic phase transformations and magnetocaloric effects in various types of doped magnetocaloric lanthanum manganite.

The dielectric breakdown strength refers to the highest electric field strength that a material can withstand without being destroyed under the action of an electric field, which is the key property to assess the performance of electrical and electronic devices86,87,88. Dielectric materials with high dielectric breakdown strength are necessary for high energy density electric energy storage applications in combination with continued miniaturization of electronic devices89,90. It is not only determined by the intrinsic factors of the material (chemical constituents, nature of the chemical bonding, crystal structure, etc.) but also affected by the extrinsic factors (defects, morphology, impurities, degradation, interfaces, etc.)91,92 Therefore, it is very challenging to accurately calculate the dielectric breakdown strength of complex materials entirely by DFT method and perform high-throughput screening from a large number of promising candidates. By contrast, ML may be a more potential approach to predict dielectric breakdown strength.

Kim et al.91 applied the ML technology to train and validate on a limited amount of accurate data from DFT calculations, then to predict the dielectric breakdown strength of hundreds of ABX3 compounds in a highly efficient manner. After making predictions on these compounds using the ML model, the dielectric breakdown strength of the most promising candidates was further validated by DFT calculations. The research results have shown that boron-containing perovskites may be extremely tolerant toward high electric fields. The prediction results of BSiO2F and SrBO2F showed a breakdown strength of almost 2 GV/m, which is worthy for further experimental studies. Gao et al.93 studied the dielectric permittivity of perovskites based on ML. They employed the GPR algorithm to obtain the relationship between the composition of perovskites and the dielectric permittivity to find the maximum dielectric permittivity in Ba (Ti1-x%Hfx%)O3 ceramic material. According to ML prediction, the optimal composition is found to be x = 11 with the highest dielectric permittivity εr = 4.5 × 104. The predicted materials are synthesized experimentally to further verify the accuracy of the model. This strategy combined with ML shows higher efficiency compared with the traditional experimental search.

ABO3-type perovskite oxides have also been considered as the potential materials for solid electrolytes in solid oxide fuel cells (SOFCs). Conductivity is an essential parameter to describe the ease of charge flow in a material. In addition to external factors like oxygen pressures and operating temperature, the conductivity of perovskite oxides is also affected by its composition and structure94,95,96. To discover or design perovskite oxides with high ionic conductivity, it is necessary to figure out the relationships between the molecular composition parameters and the oxygen ionic conductivity of the perovskite oxides. Liu et al97. explored the correlation between atomic parameters and ionic conductivity properties of 117 perovskite oxide data via partial least squares, backpropagation artificial neural network and SVR, in which model constructed by SVR processed the best generalization. It was found that and the ratio of O–O charge population to the O–O band length (P/L) and logarithm of oxide ionic conductivity (Lnσ) have a quadratic curving relationship. The value of P/L is one of the important quantum chemical parameters to predict the ion conductivity of perovskite oxides. Based on the calculation of P/L, a semi-empirical formula can be used to predict the oxide ion conductivity of the doped ABO3 perovskite.

Kaneko et al.98 proposed a regression model built by ML based on the data with DFT calculations to predict the stability of anion ordering in perovskite-type BaNbO2N supercells. DFT was used to calculate the total energies of 560 small BaNbO2N supercells with random anion ordering. Using the total energy of 420 BaNbO2N supercells as the training set, an ML model was established with prediction accuracy reaching 94% (Fig. 6c). The conclusion indicates that the most stable perovskite BaNbO2N supercells had each Nb atom coordinated with two N atoms, along with NbN chains in a cis conformation98. This work has suggested an approach for the property predictions of complex-compositions materials at a reasonable computational cost and provided guidance for the design of stable perovskite oxynitrides.

The specific surface area (SSA) of the photocatalyst plays a significant role in the photocatalytic reaction. Generally, the larger the SSA of the photocatalyst is, the more reaction sites and the better the photocatalytic performance are. ABO3 perovskite has been widely applied as the photocatalyst or photocatalytic active component in photocatalytic reactions. Shi Li et al.37 used GA and SVM algorithms to explore the relationship between the SSA of perovskites and the composition as well as experimental conditions. After virtual screening with the developed model, five visual perovskites with larger SSA and photocatalytic potential were proposed. In addition, the author has also developed the established model into an online forecasting application, making the model more available to researchers to predict the required large SSA perovskites. This method should be extended to ML-aided design of other properties and other materials.

Zheng et al.99 established a series of models by RF, RR and SVM with the electronegativity of atoms at A, B and the effective atomic radii of atoms at A, B, and X as descriptors for the predictions of four properties including density and formation energy, Eg and crystal volume. The results showed that RF method could effectively predict the density and Eg of perovskite materials; RR method could realize the prediction of density; SVM with linear kernel function method could achieve the prediction of formation energy. The research demonstrated that different ML algorithms have different sensitivity to the distribution of data samples. In the process of building ML models with different properties, different algorithms need to be evaluated and screened to optimize the evaluation function.

Lu et al.100 combined DFT calculation and ML technology to propose a multistep screening scheme for all-inorganic perovskite with stability, high spontaneous polarization, and proper Eg. The phase-transition energy difference was adopted as the target property to directly judge whether the compound can be exposed spontaneous polarization. As shown in Fig. 6d, e, the ML prediction accuracy of both energy difference and Eg regressions exceeds 90%, which is highly consistent with DFT calculations. After screening, 151 promising ferroelectric photovoltaic (FPV) perovskites were successfully extracted from 19,841 compositions (Fig. 6f). The accuracy of the ML predictions is further verified by DFT calculations, and 8 randomly selected FPV perovskites exhibited good thermal stability, appropriate Eg (1.01–1.62 eV), and considerable spontaneous polarization (7.10–32.78 µC cm−2). This scheme realized the ML for accelerating the material design of multi-property and the extension of materials database.

Hybrid organic–inorganic perovskite

HOIPs have become a major hotspot in the field of optoelectronics in recent years due to its easy synthesis, low cost and excellent optoelectronic properties, such as tunable optical Eg high optical absorption coefficient, high carrier mobility, and long load of diffusion length101,102. It has been widely applied in fields of solar cells103,104, light-emitting diodes105,106, and photodetectors107,108, and its performance is comparable to traditional materials. In addition, the development of HOIPs is still in continuous improvement and breakthroughs.

In 2009, Kojima et al.109 used perovskite-type organic–inorganic hybrid materials to prepare thin film solar cells and obtained a 3.8% power conversion efficiency (PCE). Since then, perovskite-type solar cells (PSCs) have attracted many interests of researchers for the huge development potential and the title of new hope in the field of photovoltaic110. The highest certified PCE of PSCs to date has reached 25.2%, according to the National Renewable Energy Laboratory111,112. It is reported that Pb is the key factor in the high performance of PSCs due to the strong antibonding coupling between the 6 s lone pairs of Pb and the 5p states of I, resulting in a small effectively masses and a direct Eg with a p-p transition113. However, Pb based halide perovskites are easily degraded spontaneously under exposure to moisture, air, light, heat, and other environments, resulting in the degradation product of carcinogenic PbI2114,115. These obvious shortcomings have hindered the industrial application of HOIPs solar cells, prompting researchers to seek high-performance perovskite materials with better chemical stability and environmentally friendly composition. ML method may accelerate the discovery of such materials.

The Eg of HOIPs is an important parameter for evaluating high-efficiency photovoltaic perovskite materials. Electrons in the valence band could be excited to the conduction band only under the condition of enough energy. Therefore, a comprehensive understanding of Eg and its relationship with HOIPs composition and structure would be very necessary before looking for photovoltaic materials with high light absorption coefficient. Lu et al.116 developed a target-driven method based on ML and DFT calculations to discover stable Pb-free HOIPs. Taking the Eg values of 212 HOIPs calculated by DFT as the training set, an ML model was built based on the GBR algorithm to predict the Eg values of 5158 unexplored possible HOIPs. After further screening, six orthorhombic lead-free HOIPs with proper Eg for solar cells and room temperature thermal stability were selected. And two of them have direct Eg in the visible region, excellent thermal stability, and excellent environmental stability. As shown in Fig. 7a, the maximum error of Eg obtained by ML prediction and DFT calculation is less than 0.1 eV, which shows that ML has a huge advantage in Eg prediction, and its accuracy is comparable to DFT calculation. The workload of DFT calculation has greatly reduced with the assist of ML, which is very important for large-scale screening of materials. Besides, HOIPs with small Eg (less than 0.9 eV) can be used in infrared sensors, and large Eg of HOIPs (larger than 3 eV) may serve as good insulating materials. Therefore, ML not only accelerates the prediction of Eg in photovoltaic materials but also in other related fields.

Fig. 7: Applications of ML in HOIPs.
figure 7

a A comparison of ML-predicted with DFT-calculated data of six HOIPs116. b The Eg energy changes with Cs (mol %) in CsxMA0.85-xDMA0.15PbI3 and the cubic structure begins to recover when x = 0.02118. c Ternary diagram denoting the different-stoichiometry crystal structures118. d The correlation of the ML Eg data and the experimental Eg data. Reproduced with permission from ref. 121. Copyright John Wiley and Sons, Inc. 2019 e Blackline: the maximum PCE predicted by ML corresponding to each Eg (1.2–1.3 eV); redline: the PCE of Shockley–Queisser limit. Reproduced with permission from ref. 121. Copyright John Wiley and Sons, Inc. 2019 f 4D-plot of PCE with respect to Eg, ΔH, and ΔL, indicating the highest PCE values with the bandgap in the range of 1.2–1.3 eV. Reproduced with permission from ref. 121. Copyright John Wiley and Sons, Inc. 2019.

In 2020, Saidi et al.117 used DFT to calculate the Eg and structural parameters of 862 HOIPs for modelling. Then a hierarchical convolutional neural network (CNN) was used to construct an ML model to predict the Eg of HOIPs. The results show that the lattice constant and the octahedral till angle play the key role in the prediction of the Eg. When these two features are removed from the dataset, the RMSE increases from 0.07 to 0.16 eV. In addition, applying hierarchical CNN to alleviate problems related to the imbalanced target values is also the key to success. In material design, small samples are a common problem, which are usually unevenly distributed. And this well-designed hierarchical ML approach is expected to be used in the design of other materials with uneven data distribution.

In addition to Eg, stability is also a key parameter affecting the overall performance of PSCs. Recently, Ali et al.118 constructed a dataset of A-site cation-doped HOIPs containing 852 data with the target of the energy difference (ΔHC) between the of cubic structure and the fully relaxed structure and 12 descriptors. These descriptors include the period and group numbers, the effective radius and the number of lone pairs to describe the A-site cations, the ionization energy and the electron affinity of the inorganic elements in B- and X-sites in combination with the tolerate factor and the octahedral factor118. The deep learning method was employed to train the model to predict the cubic phase stability, which was further applied to accelerate the search and discovery of HOIPs with stable cubic phase from the enormous material search space. A series of mixed-cation perovskites were synthesized under the guidance of the model, namely CsxMA0.85-xDMA0.15PbI3, where x = 0, 0.02, 0.05, 0.1, 0.2, 0.255, 0.595, and 0.765118. The experimental characterization results indicate that the cubic structure began to be restored when x equals 2 mol% Cs in CsxMA0.85-xDMA0.15PbI3 (Fig. 7b). This work also showed that the cubic structure could be recovered through converting the severely unstable double-cation perovskite (MA0.85DMA0.15PbI3) in the cubic structure at room temperature into a triple-cation compound by the incorporation of Cs cation (Fig. 7c)118. The work shows that the ML perovskite structure stability prediction model has greatly sped up the experimental process of cubic perovskites and reduced experimental costs.

There are many factors that affect the applications of HOIPs in photovoltaics. Li et al.119 using the ML approach and non-equilibrium Green’s function together with DFT to explore the electronic transport properties of MAPbI3. The band structure of MAPbI3 calculated with DFT indicated that the ferroelectric and antiferroelectric dipole configurations have very little effect on the Eg119. They tested the tunnel junctions composed of MAPbI3 and 48 different metal electrodes with the same fixed lattice constant as MAPbI3 and found that the electron transmission coefficient of Mg electrodes is the highest, and the conductivity of the Pt electrodes is the least119. In addition, as the perovskite unit cell number increases, the electron transmission coefficients usually exponentially decrease. The ML algorithms were employed to explore the correlations of the transport properties of MAPbI3 with different metal electrodes and tunnel barrier lengths119. This work could quickly and effectively predict the electron transmission coefficients of MAPbI3 under different metal electrodes and different tunnel barrier lengths, thereby stimulating more experimental and theoretical interests in other tunnel junction systems and electron transport problems with the “DFT+ML” strategy119.

PCE is a momentous indicator to evaluate the performance of solar cells. It would be strategically significant to study the PCE of reported PSCs with ML. In 2019, Odabaşı et al.120 collected 1921 samples of HOIPs solar cell devices to propose an effective strategy to improve the PCE of PSCs. RF algorithm was used to build the ML model for predicting the PCE of PSCs. The RMSE for training set and testing set were 1.70 and 3.29 for regular cells, 1.51 and 2.91 for the inverted cells, respectively. In addition, the factors were explored with association rules to provide theoretical guidance for the design of PSCs with high PCE. The results revealed that the factors like mixed-cation perovskites, dimethylformamide and dimethyl sulfoxide as solvents, chlorobenzene as the antisolvent were crucial to obtain the PSCs with PCE higher than 18.0%. Li et al.121 established a ML model for predicting Eg of perovskite materials with the material composition as descriptors. Taking the perovskite Eg, the energy difference (ΔH) between the HOMO of the hole transport layers and the HOMO of the perovskite material, and the energy difference (ΔL) between the LUMO of the perovskite material and the LUMO of the electron transport layers as features, a series of ML models were established to predict the open-circuit voltage (Voc), short-circuit current density (Jsc) and fill factor (FF) of PSCs. The performances of PSCs and the physical principles behind getting high-performance PSCs devices were fully studied based on the models. Moreover, perovskite materials were synthesized experimentally to verify the model. As shown in Fig. 7d, the Eg of the synthesized perovskite materials was highly consistent with the result predicted by ML, which strongly proved the reliability of the ML model prediction. In addition, the PCE tendency of the PSCs predicted by the model was also consistent with that by the theory of the Shockley–Queisser limit (Fig. 7e). The relationship between Eg, ΔL, ΔH, and PCE was further analyzed to derive a strategy for developing high-performance PSCs with different Eg (Fig. 7f). These findings indicate that ML has been very promising in terms of properties prediction and a deeper understanding of the physical phenomena associated with PSCs.

Double perovskite

In order to solve the instability and toxicity of HOIPs and the wide b Eg of ABO3 perovskite, the researchers replaced A-site or B-site cations of perovskite with two cations, forming a type of stable perovskite called double perovskites (DPs)122,123. Theoretically, DPs could achieve both the excellent performance of HOIPs and the stability of ABO3 inorganic perovskite, but the properties of DPs reported so far is not ideal. For example, Cs2AgBiBr6, which has been more popular in the DPs research direction recently, showed only 2.79% PCE in PSCs application, and the hydrogen production rate in photocatalytic water splitting was 48.9188 μmol h−1 g−1124,125. The average oxygen production rate and hydrogen production rate of Sr2CoWO6 in the application of photocatalytic water splitting were 188 μmol h−1 g−1 and 30 μmolэh−1 g−1, respectively126. These properties are still relatively limited compared with more mature perovskite materials. Therefore, exploring high-performance DPs materials still has huge research and development prospects.

In 2016, Pilania et al.127 proposed a robust ML model based on elemental descriptors, which effectively predicted the electronic Eg values of AA’BB’O6 double perovskite. The statistical learning model of KRR was used to train and test the dataset consisting of the Eg values of ~1300 double perovskites calculated with Gritsenko, van Leeuwen, van Lenthe, and Baerends potential and further optimized for solids (GLLB-SC) functional. The most important chemical pattern derived from the adopted learning framework is that the Eg is mainly controlled with the LUMO energy of the A-site 1 of the B-site. The R2 of cross validation of the best model reached 0.993 and the RMSE was 0.132 eV; the R2 of the test set was 0.947 and the RMSE was 0.36 eV. The results of the test set proved the strong generalization ability of the ML model and its high consistence with the DFT calculation results. In addition, this ML technique can be applied to any materials in a restricted chemical space with a given crystal structure to obtain the accurate prediction of Eg. In 2018, Xu et al.128 developed a procedure to identify the perovskites formability of all ABX3 and AA′BB′X6 compounds stored in the Materials Projects database. This program could identify the perovskite-forming properties of ABX3 and A2BB’X6 compounds with the crystal structure stored in the material project database. A variety of ML algorithms are employed to comprehensively analyze the correlation between atomic number, ionic radius, electronegativity, tolerance factor, and octahedral factor and perovskite formation to provide an intuitive view of these data. The prediction accuracy of best ML model reached more than 90%, which was used to identify suspicious data about the perovskite formation of A2BB’O6 compounds. Excluding those suspicious data, ML could achieve a prediction accuracy of up to 96.3%. In addition, the program also identified 11 ABO3 compounds, which showed different formative properties compared with previous publications. This work has largely enriched the perovskite formability and corrected the possible errors in the previous data of the ABO3 compounds.

In 2019, Agiorgousis et al.113 used ML to explore chalcogenide DPs to identify photovoltaic absorbers that can replace CH3NH3PbI3. After considering the thermodynamic stability, kinetic stability, and optical absorption, five promising perovskite photovoltaic absorbers (Ba2AlNbS6, Ba2GaNbS6, Ca2GaNbS6, Sr2InNbS6, and Ba2SnHfS6) were screened from more than 450 possible chalcogenide DPs candidates. Li et al.129 proposed a strategy with the combination of ML and DFT to engineer stable halide DPs. By choosing 283 DFT-calculated perovskite decomposition energy (ΔHD) as the training set, the ML mapping between the stability of the perovskite and the compositional ionic radii was established. The ML model was applied to predict the ΔHD of 14190 possible A2B(I)B(III)X6 type halide DPs, in which 2275 were stable (ΔHD > 0) and 11915 were unstable (ΔHD < 0). The ML method combined with DFT calculation could not only provide guidance for the experimental engineering of stable perovskites, but also offer enlightenment for the design and discovery of other materials without redundant experimental engineering and complex calculation simulation process.

Magnetism is the significant property of materials in many different applications. In 2019, Halder et al.130 used a combination method of computational tools to predict virtual magnetic DPs: an ML technique for the screen of stable candidate DPs, an evolutionary algorithm for the determination of crystal structure, and DFT calculations for characterization of electronic and magnetic properties. ML technique was applied to screen the most likely B/B’ combination to predict a stable perovskite structure. Among the 412 screened candidates of A2BB’O6 composition with 3d, 4d or 5d transition metals at B and B′ sites, 33 compounds were found to form stable DP structures, 25 of which were further considered for characterization of their structure and properties. Twenty-one DPs with different magnetic and electronic properties are predicted, ranging from ferromagnetic half metals to ferromagnetic, from antiferromagnetic insulators to ferromagnetic metals, and then to a rare example of antiferromagnetic metals. This ML study is expected to help the discovery of magnetic DPs.

It is very challenging to solve the model overfitting caused by data scarcity. In 2020, Li et al.131 developed an adaptive learning strategy to find high-performance AA’B2O6 cubic perovskites for catalyzing the oxygen evolution reaction (OER). Through mapping the correlations between a large amount of available informatics and the adsorption energies (i.e., *O and *OH), the probabilistic Gaussian processes quickly estimated the adsorption energies of reaction intermediates and the corresponding uncertainties of a rich material space. This adaptive learning strategy gradually improves the robustness of the model by verifying promising samples, albeit with large uncertainties. After iteratively validating/refining the candidates with theoretical overpotentials <0.5 V, an excellent ML model with RMSE less than 0.5 eV was attained. The model rapidly predicted nearly 4000 AA’B2O6 compounds and proposed nine stable cubic perovskite candidates with the optimal OER performance (OER overpotential is about 0.5 V, tolerance factor > 0.9): KRbCo2O6, BaSrCo2O6, KBaCo2O6, KCaCo2O6, BaPbTi2O6, BaRbTi2O6, BaSnTi2O6, BaTnTi2O6, RbEuTi2O6. Furthermore, they also revealed the potential relationship between the electronic structure descriptors and the OER activity of the perovskites, indicating that the orbital electronic structure characteristics of the B-site ion might be latent factors governing the OER activity. This work indicated that adaptive learning is a cost-effective strategy that can reduce the uncertainty of model predictions in high-dimensional feature spaces with the least computational cost.

Conclusions and outlook

This paper has briefly summarized the basic process of the ML method in material discovery and design and reviewed part of applications of ML in the large-scale screening and rational design of perovskite materials. The applications of ML in perovskites can be divided into the following four categories. The first type is using ML to explore the better evaluation indexes to describe the stability of perovskite materials. The second type aims to perform the high-throughput screen with the constructed ML model and many virtual samples to screen out the potential perovskite candidates with better properties for experimental guidance. The third type is to deeply dig out the relationship between the descriptors and perovskite properties to get a better understanding of the properties. The last type is the combination of ML and DFT calculation to deal with the problem of limited data. From this review, we have realized that ML has great potential and advantages in discovering materials and revealing the relationship between structural, compositional, and technological descriptors and performance based on known material information. In spite of some successful researches, the applications of ML in material research are still in its infancy and a lot of work needs to be further deepened in the future. Here, we propose some possible directions for ML in the field of perovskite materials:

(1) The combination of ML models and experiments/simulations: The ML development in the field of perovskites is still in its infancy. ML has focused on only a small part of the many excellent material properties of perovskite. Therefore, ML should be used to predict more material properties of perovskites and optimize the synthesis process of perovskites. Besides, ML could be effectively combined with DFT, molecular dynamics, Monte Carlo, and other theoretical simulation methods to accelerate the screening of large-scale perovskites and other materials. More importantly, experiments are the basis for the synthesis and characterization of materials. Therefore, it is necessary to strengthen the combination of ML and experiment to shorten experiment time, reduce experiment cost, and improve experiment efficiency. There are many recent studies on the combination of ML and perovskite experiments132,133,134,135. Sun et al.132 used deep neural network methods to build ML models based on experimental X-ray diffraction data to assist in structural analysis. The data synthesized by the experiment could serve back to ML again to increase the amount of data and then improve the generalization ability of the ML model. For example, the perovskites with required specific surface area (SSA) values could be discovered by integrating ML with experiments. First, a ML model could be constructed with collected data to predict the perovskite SSA. The potential candidates would be screened out after visual screening for experiments. Then the experimental data could be added back to the dataset for model reconstruction. The loop would keep performing until the perovskites with targeted SSA are obtained.

(2) The establishment and sharing of perovskite databases: ML is the data-driven method that strongly depends on the quantity and quality of data. Compared with speech recognition, image processing, and other fields with millions of data, the amount of data in material science is extremely limited. The ML method is more prone to overfitting with limited data, leading to the reduction of the generalization ability of the ML model126. Although there has existed a database containing a large amount of materials data, more data in the published papers has not yet been entered into the database. It is necessary to establish a more comprehensive, more standard, and more general perovskite information database to speed up the realization of data sharing and reduce the barriers to data access. In the meanwhile, researchers could also obtain more theoretical data through high-throughput calculations, as well as develop methods for intelligently reading literature, access and obtain a large number of related experimental and theoretical data from publications and enter these data into the database.

(3) Development of ML algorithms for small samples: Many powerful ML algorithms have been developed to be successfully applied in various fields. However, these algorithms usually have their own limitations, such as not suitable for small sample data, difficult to adjust parameters, etc. Therefore, developing faster, more accurate, advanced, and intelligent learning algorithms to deal with the challenge of insufficient data would be very indispensable, especially when most data in publications about ML in perovskite materials belong to small samples. A common method to deal with small samples is meta-learning, that is, learning knowledge within or across a specific field136,137. The development of new technologies such as neural Turing machines138 and imitation learning139 could make it possible. It has recently been reported that the Bayesian program learning framework can reach the level of human experience through one-shot learning under limited data conditions140. This may have a huge boost in materials science with scarce and expensive data. It would greatly improve the applicability of the ML method in perovskites and improve the efficiency and generalization of the model.

(4) ML computation platform: The current ML work is more about using programming languages to call various ML algorithms for modeling. For many non-computer researchers, it would very inconvenient and difficult due to the lack of basic programming knowledge. Even if some ML platforms and toolkits have been developed, problems of higher fees, fewer algorithms, simple functions, and inaccessibility are particularly prominent. Therefore, it is urgent to develop computing platforms with free access, complete algorithms, powerful functions, and smart computing. In addition, associating computing platforms with various material databases is also an in-depth direction.

(5) Descriptor interpretation and construction: The predictions or decisions made by ML are mainly based on classical probability theory and mathematical statistics. The physical and chemical meanings of the model still need further research and explanation. Therefore, discovering physical descriptors and making the black box model of statistical ML interpretable is a promising direction for data-driven perovskites. It would not only help experimenters to quickly design and screen visual materials with desired targets, but also enable them to understand the underlying physical laws behind the characteristics for further perovskite design. Alternatively, the accurate and interpretable descriptors could be created with existing descriptors, domain knowledge and ML algorithms.

In summary, with the continuous improvement of high-tech requirements for materials and the rapid development of computer technology and computational methods, ML will be more widely applied in other materials. It is believed that ML will become an indispensable auxiliary tool for experiments and computations in the field of materials science in the future.