High-throughput prediction of the carrier relaxation time via data-driven descriptor

It has been demonstrated that many promising thermoelectric materials, such as tetradymite compounds are also three-dimensional topological insulators. In both cases, a fundamental question is the evaluation of carrier relaxation time, which is usually a rough task due to the complicated scattering mechanisms. Previous works using the simple deformation potential theory or considering complete electron-phonon coupling are, however, restricted to small systems. By adopting a data-driven method named SISSO (Sure Independence Screening and Sparsifying Operator) with the training data obtained via deformation potential theory, we propose an efficient and physically interpretable descriptor to evaluate the relaxation time, using tetradymites as prototypical examples. Without any input from first-principles calculations, the descriptor contains only several elemental properties of the constituent atoms, and could be utilized to quickly and reliably predict the carrier relaxation time of a substantial number of tetradymites with arbitrary stoichiometry.


INTRODUCTION
Substantial advances have been made in screening or discovering functional materials via high-throughput computational method, which involves physics, statistics, computer science, and artificial intelligence 1 . Over the last decade, such an approach has been extensively applied in materials science, such as identifying potential photovoltaic absorbers 2 , predicting the stability of compounds 3 , and screening topological insulators (TIs) [4][5][6] . Among various applications, the search for good thermoelectric (TE) materials 7-9 is a particularly intriguing one due to the increasingly serious energy crisis and environmental pollution. Presently, there have been a few works focusing on the high-throughput prediction of compounds with low thermal conductivity [10][11][12] , which is essential for high TE conversion efficiency. Another critical transport coefficient closely related to the TE performance is the so-called power factor 7 , where a troublesome issue is suitable treatment of the relaxation time owing to the complex scattering mechanisms involved.
Generally speaking, the electron-acoustic phonon interaction is the dominant scattering mechanism in the operational temperature range of TE materials 13 . Within the effective mass approximation, Bardeen and Shockley 14 derived an analytical method named deformation potential (DP) theory for the relaxation time around the band edges. Furthermore, the energy dependent k-resolved relaxation time can be obtained by a complete electron-phonon coupling (EPC) calculation 15,16 based on the Wannier interpolation techniques 17 . However, both approaches involve quite complex first-principles calculations and huge computational efforts are needed. To date, it is still a challenging task to evaluate the relaxation time, which even becomes prohibitive for system with large unit cell.
In this work, a high-throughput investigation combined with the compressed sensing approach 18,19 is employed to quickly predict the carrier relaxation time of the tetradymite compounds.
Using first-principles calculations, we first compute the electronic band structures of a small set of tetradymites with integer stoichiometry. For simplicity, we use the DP theory to calculate the corresponding relaxation times (τ cal ) considering the dominant electron-acoustic phonon scattering. It should be mentioned that many compounds in the tetradymite family were demonstrated to be TIs 20 , where the electronic band edges exhibit distinct energy dispersions and electron-phonon transition probabilities compared with those of normal insulators (NIs) 21,22 . We thus divide the calculated τ cal into two training sets according to their different topological natures. For each set, we then adopt the compressed sensing method to construct a physically intuitive descriptor for the prediction of the relaxation time (τ pre ). We shall see that both NIs and TIs exhibit good quantitative agreement between the τ cal and τ pre , with Pearson correlation coefficient higher than 90%. Importantly, generalizing the proposed descriptors allows us to rapidly predict the relaxation time of a vast number of tetradymites with fractional stoichiometry, and selective crosschecks find good agreement between our predictions and explicit first-principles calculations.

Reliable training data
The candidate tetradymite compounds investigated in this work are constructed by combining group-VA elements (As, Sb, and Bi) with group-VIA elements (S, Se, and Te), which respectively occupy the A/B and C/D/E sites in the rhombohedral primitive cell shown in Fig. 1a. The hexagonal unit cell is plotted in Fig. 1b, where the covalently bonded quintuple layers (QLs) are held together via van der Waals (vdW) interactions. Among a total of 243 obtained tetradymites with integer stoichiometry, 85 NIs and 67 TIs were predicted to be mechanically stable 20 . By performing accurate first-principles calculations, the carrier relaxation time of the 152 tetradymites can be obtained. Compared with those of NIs, there are obvious band inversion around the Fermi level of TIs, which can lead to distinct effective mass and electron-phonon scattering phase space 23 . It is thus necessary to separate the calculated relaxation times into NIs and TIs parts, which are respectively summarized in Supplementary Tables 1 and 2. As nearly all the tetradymite compounds exhibit better TE performance with hole doping [24][25][26] , we focus on the p-type relaxation time at room temperature in the present work.
Data-driven descriptors with strong predictive power By performing SISSO training (see details in the "Methods") for the τ cal of 85 NIs, an optimized descriptor is obtained for predicting the relaxation time: Here m is the atomic mass and χ denotes the Pauling electronegativity of the constituent elements. The subscripts A/ B/C/D/E and "ave" indicate the occupied site and the average value, respectively. For the case of TIs, the best descriptor is given by: where the extra feature r indicates the radius of the p-orbital and the subscript "mse" refers to the mean square error. Figure 2a, b show the quantitative comparison of the τ pre and τ cal for the 85 NIs and 67 TIs, respectively. It is clear that the relaxation times predicated by the optimized descriptors agree very well with those obtained from first-principles calculations. The Pearson correlation coefficient is as high as 94% (92%) for the NIs (TIs), corroborating the strong predictive power of the obtained descriptors. Besides, it is quite necessary to compare the calculated and predicted carrier relaxation time with the experimental results. However, very few works reported the direct measurement of relaxation time, which can be alternatively derived by fitting the experimental transport properties such as the electrical conductivity or carrier mobility. In Supplementary  Fig. 1, the τ cal and τ pre of five tetradymites in the training data are compared with those obtained experimentally (τ exp ) [27][28][29][30] . We see that τ cal and τ pre exhibit good quantitative agreement, but are somewhat larger than τ exp . This is reasonable since τ cal in the training data are calculated around the band edges, where the carrier concentrations are relatively lower (10 17 cm −3 -10 18 cm −3 ). In contrast, the five τ exp obtained from limited experimental works correspond to higher carrier concentrations (10 18 cm −3 -10 20 cm −3 ) with optimized TE performance, which lead to enhanced scattering and thus smaller relaxation time. Besides, the impurity and grain boundary scattering in the experiments could also reduce the relaxation time. On the other hand, we find that τ cal and τ pre exhibit almost the same behavior with τ exp for the five investigated tetradymites, which further confirms the reliability of our training data and the strong predictive power of the proposed descriptors. Since it is a delicate combination of the elemental properties of the constituent atoms, the descriptor allows us to pinpoint the determinant factors governing the value of relaxation time, as further demonstrated below.
With an intensive analysis of Eq. (1), it is found that the first term (mass term) dominate the predicted relaxation time of NIs. In particular, we see from Fig. 3 that the average atomic mass (m ave ) exhibits the strongest correlation with the predicted relaxation time. To have a better understanding, the mass term is rewritten as: It is obvious that the relaxation time of NIs increases with m ave . As evidenced by our additional first-principles calculations, we find that the tetradymites containing light elements usually exhibit relatively flat band structures around the Fermi level, which correspond to bigger effective mass (m*) and thus lower relaxation time (τ cal ∝ m* −3/2 ). Indeed, we find that the hole effective mass of the 85 NIs decrease obviously with increasing average atomic mass, as shown in Supplementary Fig. 2a. Such a fact can be understood by a simple tight-binding analysis, where the effective mass is believed to be inversely related to the lattice constant (or bond length). As heavier mass of the constituent Fig. 1 The crystal structure of tetradymite compounds. a Primitive cell. b Unit cell. Note that the A/B and C/D/E sites can be occupied by the As/Sb/Bi and S/Se/Te atoms, respectively. atom usually corresponds to larger orbital radius and thus longer bond length, the inverse relation between m ave and m* can be qualitatively convinced, which strongly confirms the delicate physical insight of the descriptor. In comparison, the second term of Eq. (1) is less important, which can be rewritten as: According to previous work 20 , the χ C and χ D of NIs are usually higher than χ ave while the χ E is relatively lower. For the NIs with different elements in the C and D sites, we thus find lower relaxation time in systems with larger χ C and χ D . The physical origin is that larger Pauling electronegativity of the anions usually have smaller atomic radii, which leads to smaller lattice constants. According to a simple tight-binding analysis, this would result in larger effective mass, and consequently give rise to lower relaxation time.
For the case of TIs, we see from Fig. 3 that the relaxation time is mostly related to the r mse . Owing to the band inversions induced by the SOC, the valance band edges of the TIs are mainly contributed by the p-orbitals of both the cations and anions 31 , which greatly affect the energy dispersions and the hole effective mass. As r mse is less than 1, we find from the first term of Eq. (2) that lower relaxation time is obtained in the topological tetradymites with larger r mse . This is consistent with the fact that the effective mass (m*) of the 67 TIs in the training data is roughly inversely related to the |In r mse |, as shown in Supplementary Fig.  2b. This effect becomes pronounced when separately comparing the results of Bi 2 Se 2 S with other TIs, where the former one exhibits the highest r mse corresponding to the largest effective mass (2.3 m e ) and the lowest relaxation time (6.8 fs). Such an observation is consistent with the established understanding that bigger r mse means larger discrepancy between the p-orbitals of the anions and cations, causing weaker orbital hybridization and thus narrower band width near the Fermi level. Consequently, larger effective mass and lower relaxation time can be expected. On the other hand, we find that the effect of the second term of Eq. (2) is somewhat hidden due to the ultrasmall value of r 3 mse χ ave À χ E j j (less than 10 −4 ). For instance, we see that the m ave in the second term is positively related to the predicted relaxation time. However, the corresponding Pearson correlation coefficient is very small as shown in Fig. 3. Of course, the second term cannot be ignored, since the χ mse is also strongly related to the predicted relaxation time as shown in Fig. 3. In principle, the TIs with larger χ mse tend to be polar covalent or even ionic compounds. Hence, the valence electrons are largely localized which lead to relatively flat bands with bigger effective mass and thus lower relaxation time. We see from the second term of Eq. (2) that the τ pre is also positively related to the m ave . Compared with that of the NIs, the descriptor of the TIs also correctly captures the essential physical factor (m*) governing the relaxation time, whereas the delicate mechanism is quite different suggesting the necessity in classifying the investigated tetradymites as mentioned above.
Predictions beyond the training data Beyond the above-mentioned 152 tetradymites with integer stoichiometry, there are more doped systems with fractional stoichiometry used for favorable TE applications [24][25][26] . In practice, doping at various concentrations is a typical approach to enhance the TE performance by achieving the band convergence 32 or reducing the lattice thermal conductivity 33 . Unfortunately, it is quite difficult or even impossible to obtain the carrier relaxation time of these systems using first-principles calculations since prohibitively large supercells are needed to model the structures. Such a problem can be effectively solved by utilizing the optimized descriptors given by Eqs. (1) and (2). In general, the doped tetradymites can be represented by a nominal formula of As x Sb y Bi 2ÀxÀy S x 0 Se y 0 Te 3Àx 0 Ày 0 (0 x; y; x þ y 2 and 0 x 0 ; y 0 ; x 0 þ y 0 3). According to the virtual crystal approximation 34 , the site-specific properties of the randomly doped systems should be weighted averages for first-principles modeling. For any tetradymites, the site-specific atomic mass can be defined as: Here the coefficients (0 x i ; y i 1; 1 i 5) refer to the fractional occupancy of each site, which define the stoichiometry via x = x 1 + x 2, y = y 1 + y 2, x′ = x 3 + x 4 + x 5, and y 0 ¼ y 3 þ y 4 þ y 5 . With the site-specific Pauling electronegativity and the sitespecific radius of the p-orbital given by the same way, our datadriven descriptors can thus be used to rapidly predict the relaxation time of any tetradymites with arbitrary stoichiometry. As an example, we list the descriptor predicted relaxation time of 13 NIs and 13 TIs with fractional stoichiometry in Supplementary  Table 3. Note that all of them have medium supercells and thus first-principles calculations are still reasonably tractable. We see that τ pre shows good agreement with first-principles results for both NIs and TIs. Indeed, as evidenced from Supplementary Fig.  3a, b, the Pearson correlation coefficient between the τ pre and τ cal of the 13 NIs (TIs) are 96% (93%). Moreover, we compare in Supplementary Fig. 4 the τ pre and τ exp for five additional tetradymites with fractional stoichiometry 27,29,35 , where they exhibit the same behaviors as discussed above for the systems with integer stoichiometry. Once again, such an observation demonstrates that the optimized descriptors are robust and reliable for evaluating the relaxation time of the tetradymites with either integer or fractional stoichiometry. It should be emphasized that we focus on the relaxation time around the band edge of tetradymites and the carrier concentration is usually in the range of 10 17 cm −3 -10 18 cm −3 , which is lower than those found at optimized TE performance as mentioned above. Since the relaxation time usually decreases with increasing carrier concentration, the τ pre is found to be smaller than the τ exp , which is similar to those found for systems with integer stoichiometry. Such an observation also indicates that our descriptors may not be suitable for predicting the relaxation time of heavily doped systems.
By varying each stoichiometry (x i and y i ) at an interval of 1/6, we can obtain over sixteen million different tetradymites and quickly predict their carrier relaxation time using the generalized datadriven descriptors. Figure 4a, b show the distributions of the room temperature τ pre for over eight million NIs and eight million TIs, respectively. It is found that most NIs have the relaxation time in the range of 60-70 fs, whereas the corresponding region for the TIs is 30~40 fs. Remarkably, we see from Fig. 4a that more than 2 × 10 5 NIs exhibit large τ pre exceeding 120 fs, which are very desirable for achieving promising TE performance 23 . On the other hand, Liang et al. 36 theoretically suggested that the decreased relaxation time of the bulk state can lead to enhanced TE performance for both p-and n-type systems in the TI films of tetradymites. Interestingly, we see from

DISCUSSION
In summary, by employing the data-driven SISSO approach, we report two simple descriptors for predicting the carrier relaxation time of NIs and TIs in the tetradymite family, where the training sets are obtained from accurate first-principles calculations. Unlike those from the DP theory or EPC calculations which is forbidden for systems with large unit cell, our descriptors enable an effective prediction for the relaxation time of the tetradymite compounds with either integer or fractional stoichiometry at negligible computational cost. The robust predictive power originates from the fact that the descriptors could perfectly capture the essential physical mechanism governing the value of relaxation time. Additionally, such physically meaningful descriptors also clarify how to efficiently tune the value of the relaxation time so that maximum TE performance can be realized in both NIs and TIs of the tetradymite family. It should be also emphasized that the obtained descriptors contain only the intrinsic properties of the constituent elements, which can be essentially adapted to fast predict the carrier relaxation time of many other TE materials such as the half-Heusler 38 and rock-salt-type 39 compounds.

Accurate first-principles calculations
The electronic band structures of these 152 tetradymites were calculated within the framework of density functional theory 40,41 using the Vienna abinitio simulation package 42 . The Perdew-Burke-Ernzerhof functional with generalized gradient approximation 43 was adopted, and the spin-orbit coupling was explicitly included in the calculations. The van der Waals interaction between adjacent quintuple layers was taken into account by means of the optB86b-vdW functional 44 . We adopted a plane-wave cutoff energy of 550 eV and a 13 × 13 × 13 Monkhorst-Pack k-mesh in the Brillouin zone. The energy convergence threshold was 10 −6 eV and the relaxed structure was determined until the magnitude of the force acting on each atom is less than 0.01 eV Å −1 . By considering the dominant acoustic phonon scattering, we used the DP theory 14 to evaluate the relaxation time, which can be expressed as Here C is the elastic modulus, h is the reduced Planck constant, k B is the Boltzmann constant, T is the absolute temperature, and E is the DP constant represents the shift of band edge per unit strain. In addition to the case of optimized structure, four reference calculations with lattice deformation of ±1% and ±2% were performed to evaluate C and E. Such a method has been successfully used to predict the relaxation time of many known materials 45-47 . SISSO training based on the compressed sensing technique Using the calculated τ cal of both NIs and TIs as two independent training sets, we could obtain the simple and physically meaningful descriptors based on the SISSO method 48,49 . In our case, three elemental properties of the constituent atoms (the atomic mass, the radius of the p-orbital, and the Pauling electronegativity, see the details in Table 1) as well as their average value and mean square error were considered as input features. It should be mentioned that the band edges of tetradymites are dominated by the p-orbitals of both the cations and anions, which is one of the crucial factors in determining the relaxation time. The intrinsically linear relationship observables in the compressed-sensing formalism was made nonlinear by equipping the features with nonlinear operators H fI; þ; À; ; =; exp; log; À j j; ffi p ; À1 ; 2 ; 3 g. At each iteration, H operates on all available combinations, and more than 10 10 features were constructed up to a complexity cutoff of 3. The sure independence screening (SIS) scores each feature with a correlation magnitude and keeps only the top ranked. The  subset extracted by the SIS algorithm was set to 80,000. After the reduction, the sparsifying operators (SO) was used to pinpoint the optimal descriptor. In contrast to previous machine learning methods 50,51 , SISSO can identify descriptors and capture the underlying mechanisms of physical properties, which is highly desirable for the accelerative discovery of functional materials. Meanwhile, SISSO shows superior advantages compared with other established method which suffers with large and highly correlated features spaces. In particular, SISSO is built to work even when only relatively small training sets are available. As a step forward with respect to previous algorithms, SISSO is an effective tool for high-throughput investigation in materials science.

DATA AVAILABILITY
All relevant data are included in the paper and/or its Supplementary Information files.