Introduction

Thermoelectric (TE) materials have aroused widespread interest owing to their potential applications in waste heat harvesting and refrigeration1,2,3. The conversion efficiency of TE materials is evaluated by the dimensionless TE figure-of-merit ZT, defined as \({\mathrm{ZT}} = \frac{{S^2\sigma T}}{{\kappa _{\mathrm{L}} + \kappa _{\mathrm{e}}}}\), where S, σ, κL, κe, and T, respectively, stand for the Seebeck coefficient, electrical conductivity, lattice thermal conductivity, electronic thermal conductivity, and the absolute temperature. Because of the intercorrelation between the transport parameters, the improvement of ZT values is challenging4,5,6.

As computational materials science is emerging, the high-throughput (HTP) calculation methods have been introduced to the TE material field. In 2014, Carrete et al. scanned ~79,000 half-Heusler structures and finally recommended 3 semiconductors with low lattice thermal conductivities7. Chen et al. screened 25,000 semiconductors out of 48,000 inorganic compounds and performed the calculations of their electrical transport properties8,9. In 2018, Xi et al. applied HTP ab initio calculation to 161 p-type chalcogenides and experimentally verified the recommended TE compound Cd2Cu3In3Te8 with ZT >1.010. Li et al. studied both p-type pnictides and chalcogenides in the atomic ratio 1:1:2, and pnictides showed exceptionally high power factors (PFs)11.

Although HTP theoretical and experimental means bring a revolutionary leap in predicting properties of energy materials, their scales are limited by the high cost. Meanwhile, data-driven machine learning (ML) methods have attracted a lot of attention because it can efficiently search the huge space with extremely low cost. Recently, ML has been widely used in the development and design of TE materials. In 2017, Zhan et al. trained the ML model based on the collected experimental thermal boundary resistance data and achieved better prediction accuracy than the commonly used acoustic mismatch model12. In 2018, Miller et al. viewed diamond-like semiconductors from the perspective of carrier concentration range with ML method and quantified their dopabilities by linear regression13. In 2019, an ML model for predicting the κL was proposed based on the experimentally measured κLs of ~100 inorganic materials14. In the same year, Tshitoyan et al. employed the text mining method on the material literature and sought potential TE materials by their similarities with the word “thermoelectric”15.

In most of the ML works, the train–test splitting scores or cross-validation results are usually adopted to evaluate the accuracy of the ML models16. However, the high scores on the testing set do not necessarily represent superior extrapolation ability. On the other hand, the model extrapolation plays a decisive role in seeking potential materials. Although some algorithms can improve the model extrapolation ability in some degree17, the poor extrapolation performance is fundamentally inevitable due to the lack of information outside the data set. Thus iterative data verification that provides external information to ML models is a promising method to improve the model extrapolation. To build reliable models with as less validation samples as possible, active learning, a verification-by-learning framework, is suitable18.

In this work, active learning is used in the TE field to accurately predict the p-type PFs. Our active learning loop contains both the ML module and density functional theory (DFT) verification. As long as the extrapolation accuracy of the model is not high enough, the DFT verification will continue to provide reliable data to the ML module. We adopt three strategies of selecting validation candidates, including Top, Random, and Query by Committee (QBC)19. The model with the highest extrapolation accuracy comes from the QBC strategy. Finally, the bonding analysis on the screened high PF compounds is conducted to reveal the physical reasons for the good TE performance.

Results

Data source

The diamond-like materials investigated include four types of compositions (Fig. 1b), ABX2, AX, A2BCX4, and A2BX4, where the X site is a chalcogen or phosphorus group element. The atoms on A, B, and C sites are ordered by the valence of the elements among the IB, IIB, IIIA, and IVA groups (orange-marked in Fig. 1a). The original 158 entries of chalcogenides and pnictides are quoted from our previous HTP works and referred in the DFT database later10,11. By exhausting all the possible combinations among the aforementioned cations and anions, we construct a search space of diamond-like compounds with 482 entries (158 DFT calculated and 324 uncalculated)10. The target properties, maximum p-type PFs, are calculated with the constant electron–phonon coupling approximation (with the uniform deformation potential 4 eV and Young’s modulus 100 GPa) at 700 K under optimized carrier concentrations in theory, similar to our previous works10,11. PF obtained by this method purely reflects the influence of electronic structure on group velocities and electronic relaxation times.

Fig. 1: The search space and schematic diagram of the crystal structure.
figure 1

a Element replacement range; orange indicates the cation range, green is the anion range. b Schematic diagram of four types of diamond-like materials.

Active learning workflow

A classic active learning strategy, Bayesian optimization, has been used many times to find materials with breakthrough properties. These works prove the effectiveness of active learning20,21. However, models in Bayesian optimization are limited to the probabilistic regression ones, excluding many other ML methods that also have outstanding performance, such as Support Vector Regression (SVR)22. In this work, the active learning strategies with unlimited model types are adopted to integrate active learning with more effective ML algorithms.

Figure 2 shows our active learning loop with the key ingredients, i.e., the search space including DFT database, the ML module (including the models and strategies for candidate selections), and the DFT verification module. In order to be available for both calculated and uncalculated materials with diamond-like structures, the descriptors are all element-related, such as valence electron number, atomic weight, electronegativity, Mendeleev number, etc., with ~60 descriptors per atom. The reason for not taking structure-related descriptors into account is that their generation for the uncalculated materials require the DFT structural relaxation, which is costly if applied to the whole search space.

Fig. 2: The workflow of active learning loop.
figure 2

The loop contains three ingredients, Search Space including DFT database, machine learning module and DFT verification. Machine learning model is built based on the DFT database to recommend candidates, and then the candidates are verified with DFT calculations. If the extrapolation accuracy does not meet the convergence criterion, the verification data will be merged into the DFT database, and the whole loop will proceed.

Based on the DFT database, the models to predict the unexplored materials are built by ML algorithms. Then the candidate selection strategies are carried out according to the model results. There are three strategies, including one several-model strategy and two single-model strategies. The several-model strategy means that the selection of candidates requires the prediction results from multiple different models. In this work, QBC is a several-model strategy in which 15 candidates with large ambiguity are selected. The ambiguity is measured by the variance of five ML models, respectively, using different algorithms, SVR22, Gradient Boosting Regression (GBR)23, Random Forest Regression (RFR)24, Adaptive Boosting Regression25, and Kernel Ridge Regression (KRR)26. The other two single-model strategies are, respectively, Top and Random. Top strategy chooses the 15 candidates with high predicted PFs, and Random strategy just randomly recommends the candidates.

In each round, the recommended 15 candidates are verified by the DFT calculations. Based on the package TransOpt of the electrical transport calculation method and the HTP workflow, the entire verification process can be automatically proceeded10,27. Since the validation set is independent of model-learned data, the score of the validation set can be regarded as the measure of the model extrapolation accuracy. If the extrapolation is not satisfactory, the already verified samples will be added to the DFT database and the whole loop updated. The active learning loop is terminated when the extrapolated Pearson R is >0.9 or the number of iterations reaches the set maximum 10.

Active learning results

From Fig. 3a, the root mean square error (RMSE) curves of all strategies with RFR algorithm show a generally decreasing trend with the number of iterations. Notably, the results of the first round in the active learning loop are equivalent to the performance of supervised learning for the untrained data (RMSE ~20 μW cm−1 K−2, Pearson R −0.11). The poor performance of the first round implies that the introduction of active learning is essential for ML methods to improve the prediction power on unknown data. The performance of Random and QBC strategies is similar; the falling RMSE curve and the rising Pearson R curve show that the accuracies are gradually improved with the number of iterations. Although there is a small range of RMSE fluctuations (~ 3 μW cm−1 K−2) in both Random and QBC strategies, it is reasonable because of the sample difference in each iteration. However, the Pearson R curve of the Top strategy does not maintain an upward trend after the first round, indicating that the extrapolation ability of the Top strategy does not improve with iterations. Nevertheless, the RMSE curve has a slight downward trend. It is possibly caused by lowered absolute values of PFs due to the nature that the Top strategy selects candidates with PFs from high to low.

Fig. 3: Extrapolation accuracy results of different strategies.
figure 3

a Comparison of Pearson R and RMSE of RFR algorithms in, respectively, Top, Random, and QBC strategies. b Changes in RMSE of five algorithms in QBC strategy with the number of iterations. c The comparison of the first-generation validation set between the ML prediction and the DFT calculation with QBC strategy. The shaded part indicates the largest root mean square error (RMSE) range of the five algorithms. d The comparison of the last-generation (tenth) validation set between the ML prediction and the DFT calculation with QBC strategy.

Because the PFs cover a large range of absolute values (10–100 μW cm−1 K−2), RMSE cannot fully describe the accuracy of models. Therefore, we introduce a measure for the relative error, i.e., the mean absolute percentage error (MAPE). The formula is expressed as \({\mathrm{MAPE}} = \mathop {\sum}\limits_{i = 0}^n {\left| {\frac{{{\mathrm{PF}}_{{\mathrm{DFT}}} - {\mathrm{PF}}_{{\mathrm{pre}}}}}{{{\mathrm{PF}}_{{\mathrm{DFT}}}}}} \right| \times \frac{{100}}{n}} \%\), where n represents the number of samples in each iteration. As shown in Supplementary Fig. 3 with both RMSE and MAPE, there is no downward trend in the MAPE curve after the first round of the Top strategy. The overall trends of RMSE and MAPE curves are similar. After the sixth generation, the values of MAPE for QBC and Random strategies basically fluctuate between 10 and 15%, while the MAPE values for Top strategy float between 20 and 30%.

As shown in Fig. 3b, all the ML models in the QBC strategy eventually converged to high accuracies indicated by low RMSE (~4 μW cm−1 K−2) and high Pearson R (>0.9, Supplementary Fig. 1). These models have been improved tremendously after ten round iterations, especially for the KRR model. From the results of the first round, the RMSE of KRR model reaches the maximum 40 μW cm−1 K−2. Figure 3c, d show the data deviations of predicted and DFT PFs in the first and tenth round, respectively. From Fig. 3c, the sample points of KRR are the farthest from the line with a slope of 1, implying that KRR model performs the worst. Some PF values of the predictions of KRR model are even unreasonably negative. On the other hand, after ten rounds of iterations, the points of all algorithms, including KRR, are obviously close to the line with a slope of 1 (Fig. 3d). The dramatic improvement demonstrates that the DFT verification provides sufficient external data to enhance their extrapolation capabilities.

The efficiency of the selection strategy can be considered from two aspects, the divergence and information. Compared with the other two strategies, the candidates of the Top strategy are localized in high PF area in each iteration (low divergence). The accuracy of the model still increases in the first round because the data in the high PF area is sparse (high information). After the first round, the provided PFs contain less information due to the decreasing of the data sparsity. The low divergence of the Top strategy sometimes reduces the extrapolation ability. On the other hand, the divergence of the QBC strategy is comparable to the Random strategy in the PF prediction. Based on the fact that the Random and QBC perform comparably (Fig. 3a), thus in the case of PF prediction, the data divergence plays a vital role.

Material analysis

The last-round GBR algorithm in the QBC strategy, which performs the best (Pearson R 0.95), was used to predict the p-type PFs of the whole search space. The compounds in top 20% PFs are shown in Fig. 4a. The overall TE performance depends not only on PFs but also on many other factors. Here we choose two other parameters for further screening potential high-performance TE materials, including “band gap,” relating to electrical properties, and “average atomic weight,” relating to lattice thermal conductivity. The band gap criterion is 0.7 ± 0.4 eV, considering the uncertainties of the band gap in DFT calculations and the optimal band gaps for TE applications (10kBTop, where Top is the operating temperature)28. In addition, the compounds with average atomic weight >80 might have low lattice thermal conductivities and therefore be screened out. Figure 4a shows the results under the two criteria with the highlighted box. The compounds with a relatively large PF are marked with triangles and their chemical formulas are labeled, and they are HgB2Te4, ZnSiSb2, AuBSe2, Zn2GeTe4, and Zn2SnTe4. Combining with the calculations of electronic and lattice thermal conductivities11, the DFT predicted ZTmaxs at 700 K of HgB2Te4, ZnSiSb2, AuBSe2, Zn2GeTe4, and Zn2SnTe4 are, respectively, 1.19, 0.97, 1.26, 1.30, and 1.41. ZTmax represents the maximum value of DFT-calculated ZT when the carrier concentration is fully optimized within the range of 5 × 1019–1 × 1021 cm−3.

Fig. 4: The distribution of the top 20% ML-predicted PFs.
figure 4

a Band gaps and average atomic weights for compounds with the top 20% ML-predicted PFs. b Pnictides with top 20% of PF and PF > 60 μW cm−1 K−2 is marked. 11P and 112P means the pnictides in the form of AX and ABX2. c Chalcogenides with top 20% of PF and PF > 45 μW cm−1 K−2 is marked. 11S, 124S, 112S, and 1124S, respectively, means the pnictides in the form of AX, AB2X4, ABX2, and A2BCX4.

In order to explore the underlying mechanisms for high PFs, all the compounds with top 20% PFs and the corresponding optimal carrier concentrations are plotted in Fig. 4b (pnictides) and Fig. 4c (chalcogenides). Three major phenomena relating to the PFs can be concluded: (1) among all the studied diamond-like materials, the PFs of pnictides are generally larger; (2) among chalcogenides, the PFs of the compounds in IIB1:IIIA2:VIA4 atomic ratio are relatively large; (3) the PFs of IIB1:IIIA2:VIA4 chalcogenides with smaller atomic radius elements such as Si or B are relatively large.

Pnictides own extremely high PFs, mainly due to the low valence band effective masses, and therefore high group velocities and low scattering phase space in relaxation times11. For quantitative comparison, we calculated the effective masses and group velocities of the pnictide GaAs and chalcogenide ZnSe. The effective mass of the valence band maximum (VBM) in GaAs (2.12 me, me is the mass of a free electron) is smaller than that of ZnSe (2.62 me), and the electron group velocity of GaAs (2.93 × 105 m s−1) is higher than that of ZnSe (2.13 × 105 m s−1). Meanwhile, the relaxation time of GaAs (4.83×10−14 s) is increased compared to that of ZnSe (1.50 × 10−14 s).

Observing chalcogenide in compounds with the top 20% p-type PFs (Fig. 4b), we found that a large percentage of compounds are in IIB1:IIIA2:VIA4 atomic ratio. This conclusion is consistent with our previous work10. From the crystal structures, this series of compounds can be seen as vacancy-containing chalcogenides (VCCs). In order to further explain why the PFs of VCCs are relatively high, two compounds with similar atomic masses but in different chemical formulas, ZnGa2Te4 and CuGaTe2, were investigated (Supplementary Fig. 2). We introduce the energy integral of the negative density of energy (-DOE) at the VBM to quantify the degree of the destabilizing contribution, which is written as \(E_{{\rm{band}}} = \mathop {\int}\limits_{E_{f} - 2}^{E_f} { - {\rm{DOE}}\left( E \right){\rm{d}}E}\)29. The Eband of ZnGa2Te4 and CuGaTe2 are, respectively, −19.67 and −31.89 eV. A smaller Eband means that the anti-bonding interaction at VBM is weaker, resulting in a flat band structure and high density of states (DOS) at the Fermi levels (Supplementary Table 1). Although the relaxation time and group velocity are slightly decreased, the electrical conductivity increased significantly due to the large enhancement in DOS and carrier concentration.

In addition to vacancies, the lattice distortion caused by small atoms might further increase the PFs. For example, both Zn2SnTe4 and Zn2SiTe4 are in the vacancy-containing structure, but the PF of Zn2SiTe4 is ~6 μW cm−1 K−2 higher than Zn2SnTe4. From the view of the structure, small silicon atoms cause the short Si-Te bonds, thereby shortening the distance between neighboring Te atoms (Fig. 5a). The anti-bonding interactions are raised between the originally non-interacting Te-Te in Zn2SiTe4 (Fig. 5c). Comparing with the band structure of Zn2SnTe4 (Fig. 5b), the anti-bonding interaction of Te-Te leads to the increase of band energy at X point, causing a better band convergence with the VBM at Γ point.

Fig. 5: Schematic diagram of the bond length and band structure of Zn2SnTe4 and Zn2SiTe4.
figure 5

a Schematic diagram of lattice distortion. b Band structure of compound Zn2SnTe4. c The band-resolved projected crystal orbital Hamilton populations (COHPs) for Te-Te offsite of Zn2SiTe4.

Discussion

The scores of the train-test splitting in supervised learning models are generally good; however, the accuracy of extrapolation could be very poor. In most material problems, the reason for the inaccurate extrapolation results from ML models lies in the lack of samples. Therefore, a method of guiding material exploration is needed, which aims at providing reasonable estimate of the material property in the whole search space by supplying a small scale of samples. Hence, active learning, a framework for updating ML models through external verification, is implemented to improve the extrapolation accuracy, exemplified by the TE PFs for chalcogenides and pnictides with diamond-like structures. Several candidate selection strategies in active learning are tested. Finally, the extrapolation accuracy of the GBR model in QBC strategy is the highest (Pearson R 0.95), ensuring the reliability of extrapolation. Hence, this model is applied to predict the full search space to seek high PF materials. Materials with the top 20% PFs are analyzed by band structures and bonding conditions. It is found that the diamond-like materials with three special structures are more likely to have higher PFs: (1) binary pnictides, (2) IIB1:IIIA2:VIA4 compounds with VCC structure, and (3) materials containing elements with small atomic radii. This work demonstrates the ability of active learning on accurately proposing potential materials based on small sample set.

Methods

DFT computational methods

DFT calculations are carried out using projector augmented wave method as implemented in the Vienna ab initio Simulation Package30,31. Perdew–Burke–Ernzerhof-type generalized gradient approximation (GGA) is applied as exchange–correlation functional32. Self-consistent calculation is performed with an energy convergence criterion of 10−4 and 520 eV plane-wave energy cutoff. The strongly constrained and appropriately normed meta-GGA potential is adopted33. The Monkhorst–Pack uniform k-point sampling was used with k = 180/L (L represents the lattice parameter) for electrical transport properties34. Chemical-bonding information was obtained using the band-resolved projected crystal orbital Hamilton populations as implemented in the Local Orbital Basis Suite Towards Electronic-Structure Reconstruction package35,36,37,38,39.

In order to get the ZT value, the electrical properties, including the Seebeck coefficient, electrical conductivity, and the electronic thermal conductivity are calculated by Boltzmann transport theory. The lattice thermal conductivity is obtained by the Slack model, which has proved to be suitable for diamond-like compounds11,40,41.