Active learning for the power factor prediction in diamond-like thermoelectric materials

The Materials Genome Initiative requires the crossing of material calculations, machine learning, and experiments to accelerate the material development process. In recent years, data-based methods have been applied to the thermoelectric field, mostly on the transport properties. In this work, we combined data-driven machine learning and first-principles automated calculations into an active learning loop, in order to predict the p-type power factors (PFs) of diamond-like pnictides and chalcogenides. Our active learning loop contains two procedures (1) based on a high-throughput theoretical database, machine learning methods are employed to select potential candidates and (2) computational verification is applied to these candidates about their transport properties. The verification data will be added into the database to improve the extrapolation abilities of the machine learning models. Different strategies of selecting candidates have been tested, finally the Gradient Boosting Regression model of Query by Committee strategy has the highest extrapolation accuracy (the Pearson R = 0.95 on untrained systems). Based on the prediction from the machine learning models, binary pnictides, vacancy, and small atom-containing chalcogenides are predicted to have large PFs. The bonding analysis reveals that the alterations of anionic bonding networks due to small atoms are beneficial to the PFs in these compounds.

INTRODUCTION Thermoelectric (TE) materials have aroused widespread interest owing to their potential applications in waste heat harvesting and refrigeration [1][2][3] . The conversion efficiency of TE materials is evaluated by the dimensionless TE figure-of-merit ZT, defined as ZT ¼ S 2 σT κLþκe , where S, σ, κ L , κ e , and T, respectively, stand for the Seebeck coefficient, electrical conductivity, lattice thermal conductivity, electronic thermal conductivity, and the absolute temperature. Because of the intercorrelation between the transport parameters, the improvement of ZT values is challenging [4][5][6] .
As computational materials science is emerging, the highthroughput (HTP) calculation methods have been introduced to the TE material field. In 2014, Carrete et al. scanned~79,000 half-Heusler structures and finally recommended 3 semiconductors with low lattice thermal conductivities 7 . Chen et al. screened 25,000 semiconductors out of 48,000 inorganic compounds and performed the calculations of their electrical transport properties 8,9 . In 2018, Xi et al. applied HTP ab initio calculation to 161 ptype chalcogenides and experimentally verified the recommended TE compound Cd 2 Cu 3 In 3 Te 8 with ZT >1.0 10 . Li et al. studied both p-type pnictides and chalcogenides in the atomic ratio 1:1:2, and pnictides showed exceptionally high power factors (PFs) 11 .
Although HTP theoretical and experimental means bring a revolutionary leap in predicting properties of energy materials, their scales are limited by the high cost. Meanwhile, data-driven machine learning (ML) methods have attracted a lot of attention because it can efficiently search the huge space with extremely low cost. Recently, ML has been widely used in the development and design of TE materials. In 2017, Zhan et al. trained the ML model based on the collected experimental thermal boundary resistance data and achieved better prediction accuracy than the commonly used acoustic mismatch model 12 13 . In 2019, an ML model for predicting the κ L was proposed based on the experimentally measured κ L s of~100 inorganic materials 14 . In the same year, Tshitoyan et al. employed the text mining method on the material literature and sought potential TE materials by their similarities with the word "thermoelectric" 15 .
In most of the ML works, the train-test splitting scores or crossvalidation results are usually adopted to evaluate the accuracy of the ML models 16 . However, the high scores on the testing set do not necessarily represent superior extrapolation ability. On the other hand, the model extrapolation plays a decisive role in seeking potential materials. Although some algorithms can improve the model extrapolation ability in some degree 17 , the poor extrapolation performance is fundamentally inevitable due to the lack of information outside the data set. Thus iterative data verification that provides external information to ML models is a promising method to improve the model extrapolation. To build reliable models with as less validation samples as possible, active learning, a verification-by-learning framework, is suitable 18 .
In this work, active learning is used in the TE field to accurately predict the p-type PFs. Our active learning loop contains both the ML module and density functional theory (DFT) verification. As long as the extrapolation accuracy of the model is not high enough, the DFT verification will continue to provide reliable data to the ML module. We adopt three strategies of selecting validation candidates, including Top, Random, and Query by Committee (QBC) 19 . The model with the highest extrapolation accuracy comes from the QBC strategy. Finally, the bonding analysis on the screened high PF compounds is conducted to reveal the physical reasons for the good TE performance.

Data source
The diamond-like materials investigated include four types of compositions ( Fig. 1b), ABX 2 , AX, A 2 BCX 4 , and A 2 BX 4 , where the X site is a chalcogen or phosphorus group element. The atoms on A, B, and C sites are ordered by the valence of the elements among the IB, IIB, IIIA, and IVA groups (orange-marked in Fig. 1a). The original 158 entries of chalcogenides and pnictides are quoted from our previous HTP works and referred in the DFT database later 10,11 . By exhausting all the possible combinations among the aforementioned cations and anions, we construct a search space of diamond-like compounds with 482 entries (158 DFT calculated and 324 uncalculated) 10 . The target properties, maximum p-type PFs, are calculated with the constant electron-phonon coupling approximation (with the uniform deformation potential 4 eV and Young's modulus 100 GPa) at 700 K under optimized carrier concentrations in theory, similar to our previous works 10,11 . PF obtained by this method purely reflects the influence of electronic structure on group velocities and electronic relaxation times.
Active learning workflow A classic active learning strategy, Bayesian optimization, has been used many times to find materials with breakthrough properties. These works prove the effectiveness of active learning 20,21 . However, models in Bayesian optimization are limited to the probabilistic regression ones, excluding many other ML methods that also have outstanding performance, such as Support Vector Regression (SVR) 22 . In this work, the active learning strategies with unlimited model types are adopted to integrate active learning with more effective ML algorithms. Figure 2 shows our active learning loop with the key ingredients, i.e., the search space including DFT database, the ML module (including the models and strategies for candidate selections), and the DFT verification module. In order to be available for both calculated and uncalculated materials with diamond-like structures, the descriptors are all element-related, such as valence electron number, atomic weight, electronegativity, Mendeleev number, etc., with~60 descriptors per atom. The reason for not taking structure-related descriptors into account is that their generation for the uncalculated materials require the DFT structural relaxation, which is costly if applied to the whole search space.
Based on the DFT database, the models to predict the unexplored materials are built by ML algorithms. Then the candidate selection strategies are carried out according to the model results. There are three strategies, including one several-model strategy and two single-model strategies. The several-model strategy means that the selection of candidates requires the prediction results from multiple different models. In this work, QBC is a several-model strategy in which 15 candidates with large ambiguity are selected. The ambiguity is measured by the variance of five ML models, respectively, using different algorithms, SVR 22 , Gradient Boosting Regression (GBR) 23 , Random Forest Regression (RFR) 24 , Adaptive Boosting Regression 25 , and Kernel Ridge Regression (KRR) 26 . The other two single-model strategies are, respectively, Top and Random. Top strategy chooses the 15 candidates with high predicted PFs, and Random strategy just randomly recommends the candidates.
In each round, the recommended 15 candidates are verified by the DFT calculations. Based on the package TransOpt of the electrical transport calculation method and the HTP workflow, the entire verification process can be automatically proceeded 10,27 . Since the validation set is independent of model-learned data, the score of the validation set can be regarded as the measure of the model extrapolation accuracy. If the extrapolation is not satisfactory, the already verified samples will be added to the DFT database and the whole loop updated. The active learning loop is terminated when the extrapolated Pearson R is >0.9 or the number of iterations reaches the set maximum 10.
Active learning results From Fig. 3a, the root mean square error (RMSE) curves of all strategies with RFR algorithm show a generally decreasing trend with the number of iterations. Notably, the results of the first round in the active learning loop are equivalent to the performance of supervised learning for the untrained data (RMSẼ 20 μW cm −1 K −2 , Pearson R −0.11). The poor performance of the first round implies that the introduction of active learning is essential for ML methods to improve the prediction power on unknown data. The performance of Random and QBC strategies is similar; the falling RMSE curve and the rising Pearson R curve show that the accuracies are gradually improved with the number of iterations. Although there is a small range of RMSE fluctuations (~3 μW cm −1 K −2 ) in both Random and QBC strategies, it is reasonable because of the sample difference in each iteration. However, the Pearson R curve of the Top strategy does not maintain an upward trend after the first round, indicating that the extrapolation ability of the Top strategy does not improve with iterations. Nevertheless, the RMSE curve has a slight downward trend. It is possibly caused by lowered absolute values of PFs due to the nature that the Top strategy selects candidates with PFs from high to low.
Because the PFs cover a large range of absolute values (10-100 μW cm −1 K −2 ), RMSE cannot fully describe the accuracy of models. Therefore, we introduce a measure for the relative error, i.e., the mean absolute percentage error (MAPE). The formula is expressed as MAPE ¼ P n i¼0 PFDFTÀPFpre PFDFT 100 n %, where n represents the number of samples in each iteration. As shown in Supplementary Fig. 3 with both RMSE and MAPE, there is no downward trend in the MAPE curve after the first round of the Top strategy. The overall trends of RMSE and MAPE curves are similar. After the sixth generation, the values of MAPE for QBC and Random strategies basically fluctuate between 10 and 15%, while the MAPE values for Top strategy float between 20 and 30%.
As shown in Fig. 3b, all the ML models in the QBC strategy eventually converged to high accuracies indicated by low RMSE (~4 μW cm −1 K −2 ) and high Pearson R (>0.9, Supplementary Fig. 1). These models have been improved tremendously after ten round iterations, especially for the KRR model. From the results of the first round, the RMSE of KRR model reaches the maximum 40 μW cm −1 K −2 . Figure 3c, d show the data deviations of predicted and DFT PFs in the first and tenth round, respectively. From Fig. 3c, the sample points of KRR are the farthest from the line with a slope of 1, implying that KRR model performs the worst. Some PF values of the predictions of KRR model are even unreasonably negative. On the other hand, after ten rounds of iterations, the points of all algorithms, including KRR, are obviously close to the line with a Fig. 2 The workflow of active learning loop. The loop contains three ingredients, Search Space including DFT database, machine learning module and DFT verification. Machine learning model is built based on the DFT database to recommend candidates, and then the candidates are verified with DFT calculations. If the extrapolation accuracy does not meet the convergence criterion, the verification data will be merged into the DFT database, and the whole loop will proceed. slope of 1 (Fig. 3d). The dramatic improvement demonstrates that the DFT verification provides sufficient external data to enhance their extrapolation capabilities.
The efficiency of the selection strategy can be considered from two aspects, the divergence and information. Compared with the other two strategies, the candidates of the Top strategy are localized in high PF area in each iteration (low divergence). The accuracy of the model still increases in the first round because the data in the high PF area is sparse (high information). After the first round, the provided PFs contain less information due to the decreasing of the data sparsity. The low divergence of the Top strategy sometimes reduces the extrapolation ability. On the other hand, the divergence of the QBC strategy is comparable to the Random strategy in the PF prediction. Based on the fact that the Random and QBC perform comparably (Fig. 3a), thus in the case of PF prediction, the data divergence plays a vital role.

Material analysis
The last-round GBR algorithm in the QBC strategy, which performs the best (Pearson R 0.95), was used to predict the p-type PFs of the whole search space. The compounds in top 20% PFs are shown in Fig. 4a. The overall TE performance depends not only on PFs but also on many other factors. Here we choose two other parameters for further screening potential high-performance TE materials, including "band gap," relating to electrical properties, and "average atomic weight," relating to lattice thermal conductivity. The band gap criterion is 0.7 ± 0.4 eV, considering the uncertainties of the band gap in DFT calculations and the optimal band gaps for TE applications (10k B T op , where T op is the operating temperature) 28 . In addition, the compounds with average atomic weight >80 might have low lattice thermal conductivities and therefore be screened out. Figure 4a shows the results under the two criteria with the highlighted box. The compounds with a relatively large PF are marked with triangles and their chemical formulas are labeled, and they are HgB 2 Te 4 , ZnSiSb 2 , AuBSe 2 , Zn 2 GeTe 4 , and Zn 2 SnTe 4 . Combining with the calculations of electronic and lattice thermal conductivities 11 , the DFT predicted ZT max s at 700 K of HgB 2 Te 4 , ZnSiSb 2 , AuBSe 2 , Zn 2 GeTe 4 , and Zn 2 SnTe 4 are, respectively, 1.19, 0.97, 1.26, 1.30, and 1.41. ZT max represents the maximum value of DFT-calculated ZT when the carrier concentration is fully optimized within the range of 5 × 10 19 -1 × 10 21 cm −3 .
In order to explore the underlying mechanisms for high PFs, all the compounds with top 20% PFs and the corresponding optimal carrier concentrations are plotted in Fig. 4b (pnictides) and Fig. 4c (chalcogenides). Three major phenomena relating to the PFs can be concluded: (1) among all the studied diamond-like materials, the PFs of pnictides are generally larger; (2) among chalcogenides, the PFs of the compounds in IIB 1 :IIIA 2 :VIA 4 atomic ratio are relatively large; (3) the PFs of IIB 1 :IIIA 2 :VIA 4 chalcogenides with smaller atomic radius elements such as Si or B are relatively large.
Pnictides own extremely high PFs, mainly due to the low valence band effective masses, and therefore high group velocities and low scattering phase space in relaxation times 11 . For quantitative comparison, we calculated the effective masses and group velocities of the pnictide GaAs and chalcogenide ZnSe. The effective mass of the valence band maximum (VBM) in GaAs (2.12 m e , m e is the mass of a free electron) is smaller than that of Observing chalcogenide in compounds with the top 20% p-type PFs (Fig. 4b), we found that a large percentage of compounds are in IIB 1 :IIIA 2 :VIA 4 atomic ratio. This conclusion is consistent with our previous work 10 . From the crystal structures, this series of compounds can be seen as vacancy-containing chalcogenides (VCCs). In order to further explain why the PFs of VCCs are relatively high, two compounds with similar atomic masses but in different chemical formulas, ZnGa 2 Te 4 and CuGaTe 2 , were investigated ( Supplementary Fig. 2). We introduce the energy integral of the negative density of energy (-DOE) at the VBM to quantify the degree of the destabilizing contribution, which is  Table 1). Although the relaxation time and group velocity are slightly decreased, the electrical conductivity increased significantly due to the large enhancement in DOS and carrier concentration.
In addition to vacancies, the lattice distortion caused by small atoms might further increase the PFs. For example, both Zn 2 SnTe 4 and Zn 2 SiTe 4 are in the vacancy-containing structure, but the PF of Zn 2 SiTe 4 is~6 μW cm −1 K −2 higher than Zn 2 SnTe 4 . From the view of the structure, small silicon atoms cause the short Si-Te bonds, thereby shortening the distance between neighboring Te atoms (Fig. 5a). The anti-bonding interactions are raised between the originally non-interacting Te-Te in Zn 2 SiTe 4 (Fig. 5c). Comparing with the band structure of Zn 2 SnTe 4 (Fig. 5b), the anti-bonding interaction of Te-Te leads to the increase of band energy at X point, causing a better band convergence with the VBM at Γ point.

DISCUSSION
The scores of the train-test splitting in supervised learning models are generally good; however, the accuracy of extrapolation could be very poor. In most material problems, the reason for the inaccurate extrapolation results from ML models lies in the lack of samples. Therefore, a method of guiding material exploration is needed, which aims at providing reasonable estimate of the material property in the whole search space by supplying a small scale of samples. Hence, active learning, a framework for updating ML models through external verification, is implemented to improve the extrapolation accuracy, exemplified by the TE PFs for chalcogenides and pnictides with diamond-like structures. Several candidate selection strategies in active learning are tested. Finally, the extrapolation accuracy of the GBR model in QBC strategy is the highest (Pearson R 0.95), ensuring the reliability of extrapolation. Hence, this model is applied to predict the full search space to seek high PF materials. Materials with the top 20% PFs are analyzed by band structures and bonding conditions. It is found that the diamond-like materials with three special structures are more likely to have higher PFs: (1) binary pnictides, (2) IIB 1 :IIIA 2 : VIA 4 compounds with VCC structure, and (3) materials containing elements with small atomic radii. This work demonstrates the ability of active learning on accurately proposing potential materials based on small sample set. Fig. 4 The distribution of the top 20% ML-predicted PFs. a Band gaps and average atomic weights for compounds with the top 20% MLpredicted PFs. b Pnictides with top 20% of PF and PF > 60 μW cm −1 K −2 is marked. 11P and 112P means the pnictides in the form of AX and ABX 2 . c Chalcogenides with top 20% of PF and PF > 45 μW cm −1 K −2 is marked. 11S, 124S, 112S, and 1124S, respectively, means the pnictides in the form of AX, AB 2 X 4 , ABX 2 , and A 2 BCX 4 .

DFT computational methods
DFT calculations are carried out using projector augmented wave method as implemented in the Vienna ab initio Simulation Package 30,31 . Perdew-Burke-Ernzerhof-type generalized gradient approximation (GGA) is applied as exchange-correlation functional 32 . Self-consistent calculation is performed with an energy convergence criterion of 10 −4 and 520 eV plane-wave energy cutoff. The strongly constrained and appropriately normed meta-GGA potential is adopted 33 . The Monkhorst-Pack uniform k-point sampling was used with k = 180/L (L represents the lattice parameter) for electrical transport properties 34 . Chemical-bonding information was obtained using the band-resolved projected crystal orbital Hamilton populations as implemented in the Local Orbital Basis Suite Towards Electronic-Structure Reconstruction package [35][36][37][38][39] .
In order to get the ZT value, the electrical properties, including the Seebeck coefficient, electrical conductivity, and the electronic thermal conductivity are calculated by Boltzmann transport theory. The lattice thermal conductivity is obtained by the Slack model, which has proved to be suitable for diamond-like compounds 11,40,41 .

DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.