Accelerating materials-space exploration for thermal insulators by mapping materials properties via artificial intelligence

Purcell, Thomas A. R.; Scheffler, Matthias; Ghiringhelli, Luca M.; Carbogno, Christian

doi:10.1038/s41524-023-01063-y

Download PDF

Article
Open access
Published: 24 June 2023

Accelerating materials-space exploration for thermal insulators by mapping materials properties via artificial intelligence

npj Computational Materials volume 9, Article number: 112 (2023) Cite this article

3436 Accesses
5 Citations
57 Altmetric
Metrics details

Subjects

Abstract

Reliable artificial-intelligence models have the potential to accelerate the discovery of materials with optimal properties for various applications, including superconductivity, catalysis, and thermoelectricity. Advancements in this field are often hindered by the scarcity and quality of available data and the significant effort required to acquire new data. For such applications, reliable surrogate models that help guide materials space exploration using easily accessible materials properties are urgently needed. Here, we present a general, data-driven framework that provides quantitative predictions as well as qualitative rules for steering data creation for all datasets via a combination of symbolic regression and sensitivity analysis. We demonstrate the power of the framework by generating an accurate analytic model for the lattice thermal conductivity using only 75 experimentally measured values. By extracting the most influential material properties from this model, we are then able to hierarchically screen 732 materials and find 80 ultra-insulating materials.

Synthesis of goldene comprising single-atom layer gold

Article Open access 16 April 2024

Giant nanomechanical energy storage capacity in twisted single-walled carbon nanotube ropes

Article Open access 16 April 2024

Scaling deep learning for materials discovery

Article Open access 29 November 2023

Introduction

Artificial-intelligence (AI) techniques have the potential to significantly accelerate the search for novel, functional materials, especially for applications where different physical mechanisms compete with each other non-linearly, e.g., quantum materials¹, and where the cost of characterizing the materials makes a large-scale search intractable, e.g., thermoelectrics². Due to this inherent complexity, only limited amounts of data are currently available for such applications, which in turn severely limits the applicability and reliability of AI techniques³. Using thermal transport as an example, we propose a route to overcome this hurdle by presenting an AI framework that is applicable to scarce datasets and that provides heuristics able to steer further data creation into regions of interest in materials space.

Heat transport, as measured by the temperature-dependent thermal conductivity, κ_L, is a ubiquitous property of materials and plays a vital role for numerous scientific and industrial applications including energy conversion⁴, catalysis⁵, thermal management⁶, and combustion⁷. Finding new crystalline materials with either an exceptionally low or high thermal conductivity is a prerequisite for improving these and other technologies or making them commercially viable at all. Accordingly, finding new thermal insulators and understanding where in materials space to search for such compounds is an important open challenge in this field. From a theory perspective, thermal transport depends on a complex interplay of different mechanisms, especially in thermal insulators, for which strongly anharmonic, higher-order effects can be at play⁸. Despite significant progress in the computational assessment of κ_L in solids^9,10, these ab initio approaches are too costly for a large-scale exploration of material space. For this reason, computational high-throughput approaches have so far covered only a small subset of materials^11,12,13. Experimentally, an even smaller number of materials have had their thermal conductivities measured, and <150 thermal insulators identified^14,15.

Recently, increased research efforts have been devoted to leveraging AI frameworks to extend our knowledge in this field. In particular, various regression techniques have been proven to successfully interpolate between the existing data and approximate κ_L using only simpler properties^11,14,16,17; however, using these techniques to extrapolate into new areas of materials space is a known challenge. More importantly, the explainbility of these models is limited by their inherent complexity. Physically motivated, semi-empirical models, e.g., the Slack model¹⁸, perform slightly better in this regard because they encapsulate information about the actuating mechanism. Recent efforts have used AI to extend the capabilities of these models^2,16,19,20 to increase their accuracy in estimating κ_L. However, the applicability of such models is still limited by the physical assumptions entering the original expressions^2,19. A general model that removes these assumptions and achieves the quantitative accuracy of AI approaches, while retaining the qualitative interpretability of analytical models, is however, still lacking.

In this work, we tackle this challenge by using a symbolic regression technique to quantitatively learn κ_L, using easily calculated materials properties. While symbolic regression methods are typically more expensive to train than other kernel based methods, such as Kernel-Ridge Regression (KRR) and Gaussian Process Regression (GPR), their prediction errors are typically equivalent to other methods and their natural feature reduction and resulting analytical expressions make them a useful method for explainable AI, as further illustrated below²¹. Furthermore, the added cost of training does not affect the evaluation time of the given models, meaning the extra time only has to be spent at the beginning. The inherent uncertainty estimate in methods like GPR, allows for a prediction of where the resulting models are expected to perform worse; however, we also propose a method to get an ensemble uncertainty estimate for symbolic regression that can be applied more generally to these types of models. We further exploit the feature reduction of SISSO and expand upon its interpretability by using a global sensitivity analysis method to distill out the key material properties that are most important for modelling κ_L and to find the conditions necessary for obtaining an ultra-low thermal conductivity. From here, we use this analysis to learn the conditions needed to screen materials in each step of a hierarchical, high-throughput workflow to discover new thermal insulators. Using this workflow we can then establish qualitative design principles that lend themselves to general application across material space and use them to find 80 materials with an ultra-low κ_L.

Results

Symbolic regression models for thermal conductivity

We use the sure-independence screening and sparsifying operator (SISSO) method as implemented in the SISSO++ code²². This method has been used to successfully describe multiple applications including the stability of materials²³, catalysis²⁴, and glass transition temperatures²⁵. To find the best low-dimensional models for a specific target property, in our case the room temperature, lattice thermal conductivity, ${\kappa }_{{{{\rm{L}}}}}\left({{300\,{\rm{K}}}}\right)$, SISSO first builds an exhaustive set of analytical, non-linear functions, i.e., trillions of candidate descriptors, from a set of mathematical operators and primary features, the set of user-provided properties that will be used to model the target property. Here we are focusing on room temperature data only because that is what is the most abundant in the literature and relevant for potential applications; however, some temperature dependence will be inherently included via the temperature dependence of our anharmonicity factor σ^A. For this application the primary features are both the structural and dynamical properties for seventy-five materials with experimentally measured ${\kappa }_{{{{\rm{L}}}}}\left({{300\,{\rm{K}}}}\right)$^{17,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43} (see Section IV D and Supplementary Note 1 for more details). By using the experimentally measured values for κ_L we avoid the issues related to the inconsistent reliability of different approaches to calculating κ_L for different material classes^44,45, and hopefully create a universal model for it. For many of the materials of interest here the standard Boltzmann Transport approach will be unreliable^44,45, but the fully anharmonic ab initio Green-Kubo approach is unnecessarily expensive to use for all materials⁴⁵. Combining theoretical and experimental data in this way allows one to avoid both the cost or unreliability of calculating, κ_L and the challenges of experimentally synthesizing and characterizing candidate materials. As long as all samples are consistent across each feature, AI and ML based models will adapt the computational features to the experimental target.

Figure 1b illustrates the main goal of the work: to learn which primary features are important for modeling κ_L and what thresholds of those indicate where thermal insulators are present. As a result the figure also represents the workflow used to calculate κ_L and generate the primary features for the model. All of the data generated in this workflow will be calculated using ab initio methods, with each step representing an increasing cost of calculation, as shown in Fig. 1a. The total cost of calculating these primary features is several orders of magnitude smaller than explicitly calculating κ_L, either with the Boltzmann Transport Equation or aiGK. While using only compositional and structural features would further reduce the cost of generating them, it comes at the expense of decreasing the reliability and explainability of the models. A goal of this work is to learn the screening conditions needed to remove materials at each step of the workflow in Fig. 1b and only perform the intensive κ_L calculations on the most promising materials. Because of this, we feel that using the features generated from this workflow is the most logical set to use. Importantly, as described in Section IV D we use a consistent and accurate formalism for calculating all features in this workflow, and therefore expect a quantitative agreement between these features and their experimental counterparts. Even if this framework were restricted to explore only high-symmetry materials, the overall cost of the calculations in a supercell would be reduced by a factor of one hundred as shown by the non-green bars in Fig. 1a. In the more general case we would be able to screen closer to 1000 more materials using this procedure over the brute-force workflows of calculating κ_L for all materials. With the learned conditions one could then create a prescreening procedure by learning models for each of the relevant structural or harmonic properties using only compositional inputs, and use those to estimate κ_L⁴⁶; however, that is outside of the scope of this work.

**Fig. 1: The motivation for the work is reducing the number of calculations needed to approximate the thermal conductivity of a material.**

In practice, we model the $\log \left({\kappa }_{{{{\rm{L}}}}}\left({{{\rm{300\,K}}}}\right)\right)$ instead of ${\kappa }_{{{{\rm{L}}}}}\left({{{\rm{300\,K}}}}\right)$ itself to better handle the wide range of possible thermal conductivities. The parity plot in Fig. 2(a) illustrates the performance of the identified SISSO model when the entire dataset is used (see Section IV A for more details). The resulting expression is characterized by d₁ and d₂

$$\begin{array}{ll}\log \left({\kappa }^{{{{\rm{SISSO}}}}}\left({{{300\,\rm{K}}}}\right)\right)\,=\,{a}_{0}+{a}_{1}{d}_{1}+{a}_{2}{d}_{2}\\ \qquad \qquad \qquad\quad \,\,{d}_{1}\,=\,\frac{{\left({m}_{{{{\rm{avg}}}}}+200.3{{{\rm{Da}}}}\right)}^{2}}{\sqrt{\mu }{\left({V}_{{{{\rm{m}}}}}+218.9{\mathring{\rm A} }^{3}\right)}^{3}{\Theta }_{{{{\rm{D,\infty }}}}}{\sigma }^{{{{\rm{A}}}}}}\\ \qquad \qquad \qquad\quad \,\,{d}_{2}\,=\,{\sigma }^{{{{\rm{A}}}}}\frac{{V}_{{{{\rm{m}}}}}\rho }{{m}_{{{{\rm{avg}}}}}}+{{{{\rm{e}}}}}^{\frac{-{\omega }_{\Gamma ,\max }}{27.11{{{\rm{THz}}}}}}+{{{{\rm{e}}}}}^{{\sigma }^{{{{\rm{A}}}}}}\end{array}$$

(1)

where a₀ = 6.327, a₁ = − 8.219 × 10⁴, and a₂ = − 1.704 are constants found by least-square regression and all variables are defined in Table 1. We find that this model has a training root-mean squared error (RMSE) of 0.14, with an R² of 0.98 for $\log \left({\kappa }^{{{{\rm{SISSO}}}}}\left({{300\,{\rm{K}}}}\right)\right)$. To better understand how these error terms translate to ${\kappa }_{{{{\rm{L}}}}}\left({{300\,{\rm{K}}}}\right)$, we also use the average factor difference (AFD)

$${{{\rm{AFD}}}}=1{0}^{x}$$

(2a)

$$x=\frac{1}{n}\mathop{\sum }\limits_{i}^{n}\left\vert \log \left({\kappa }_{{{{\rm{L}}}}}\right)-\log \left({\kappa }_{{{{\rm{L}}}}}^{{{{\rm{pred}}}}}\right)\right\vert ,$$

(2b)

where n is the number of training samples. Here, we find an AFD of 1.30 that is on par if not smaller than models previously found by other methods (e.g., 1.36 ± 0.03 for a Gaussian Process Regression model¹⁷ and 1.48 for a semi-empirical Debye-Callaway Model²). However, differences in the training sets and cross-validation scheme prevent a fair comparison of these studies for the prediction error. To see a complete representation of the training error for all models refer to Supplementary Note 2.

**Fig. 2: Error evaluation for the presented models.**

Table 1 List of the primary features used in this calculation.

Full size table

To get a better estimate of the prediction error, we use a nested cross-validation scheme further defined in Section IV E. As expected, the prediction error is slightly higher than the training error with an RMSE of 0.22 ± 0.02 and an AFD of 1.45 ± 0.03. As shown in Fig. 2(b), these errors are comparable to those of a KRR and GPR model trained on the same data, following the procedures listed in Sections IV B and IV C, respectively. We chose to retrain the models using the same dataset and cross-validation splits in order to single out the effect of the methodology itself, and not changes in the data set and splits. These results show that the performance of SISSO and more traditional regression methods are similar, but the advantage of the symbolic regression models is that only seven of the primary features are selected. Another advantage of the nested cross-validation scheme is that it creates an ensemble of independent models, which can also be used to approximate the uncertainty of the predictions. These results substantiates that our symbolic regression approach performs as well as interpolative methods and it outperforms the Slack model, which was originally developed for elemental cubic solids¹⁸. Interestingly, offering the features of the Slack model to SISSO does not improve the results, and even some primary features previously thought to be decisive, e.g., the Grüneisen parameter, γ. are not even selected by SISSO (see Supplementary Note 5).

A key advantage of using symbolic regression techniques over interpolative methods such as KRR and GPR is that the resulting models not only yield reliable quantitative predictions, but also allows for a qualitative inspection of the underlying mechanisms. To get a better understanding of how the thermal conductivity changes across materials space we map the model in Fig. 2c. From this map we can see that the thermal conductivity of a material is mostly controlled by d₂ with d₁ providing only a minor correction. While these observed trends are already helpful, the complex non-linearities in both d₁ and d₂ impedes the generation of qualitative design rules. Furthermore, some primary features such as V_m and σ^A enter both d₁ and d₂, with contrasting trends, e.g., σ^A lowers d₁ but increases d₂. To accelerate the exploration of materials space, one must first be able to disentangle the contradicting contributions of the involved primary features.

Extracting physical understanding by identifying the most physically relevant features via sensitivity analysis

The difficulties in interpreting the “plain” SISSO descriptors described above can be overcome by performing a sensitivity analysis or a feature importance study to identify the most relevant primary features that build d₁ and d₂. For this purpose, we employ both the Sobol indices, i.e., the main effect index S_i and the total effect index ${S}_{i}^{{{{\rm{T}}}}}$⁴⁷, and the Shapley Additive Explanations (SHAP)⁴⁸ metric for the model predictions. To calculate the Sobol indices we use an algorithm that includes correlative effects first described by Kucherenko et al.⁴⁹, and later implemented in UQLAB^50,51. The main advantage of this approach is its ability to include correlative effects between the inputs, which if ignored can largely bias or even falsify the sensitivity analysis results⁵². Qualitatively, S_i quantifies how much the variance of $\log \left({\kappa }_{{{{\rm{L}}}}}\left({{{\rm{300\,K}}}}\right)\right)$ correlates with the variance of a primary feature, ${\hat{x}}_{i}$, and ${S}_{i}^{{{{\rm{T}}}}}$ quantifies how much the variance of $\log \left({\kappa }_{{{{\rm{L}}}}}\left({{{\rm{300\,K}}}}\right)\right)$ correlates with ${\hat{x}}_{i}$ including all interactions between ${\hat{x}}_{i}$ and the other primary features. For example, Sobol indices of 0.0 indicate that $\log \left({\kappa }_{{{{\rm{L}}}}}\left({{{\rm{300\,K}}}}\right)\right)$ is fully independent of ${\hat{x}}_{i}$, whereas a value of 1.0 indicates that $\log \left({\kappa }_{{{{\rm{L}}}}}\left({{{\rm{300\,K}}}}\right)\right)$ can be completely represented by changes in ${\hat{x}}_{i}$⁵¹. Moreover, ${S}_{i}^{{{{\rm{T}}}}} < {S}_{i}$ implies that correlative effects are significant, with an ${S}_{i}^{{{{\rm{T}}}}}=0$ indicating that a primary feature is perfectly correlated to the other inputs⁵¹.

The SHAP values constitute a local measure of how each feature influences a given prediction in the data set. This metric is based on the Shapley values used in game theory for assigning payouts to players in a game based on their contribution toward the total reward⁴⁸. In the context of machine learning models each input to the model represents the players and the difference between individual predictions from the global mean prediction of a dataset represents the payouts⁵³. The SHAP values then perfectly distribute the difference from the mean prediction to each feature for each sample, with negative values indicating that the feature is responsible for reducing the prediction from the mean and a positive value is responsible for increasing it.⁵³. A similar metric is the Local Interpretable Model-agnostic Explanations (LIME) values⁵⁴. LIME first defines a local neighborhood for each data point, and then uses a similar algorithm to SHAP to compare each prediction against their corresponding local area. Because of the computational complexity of calculating SHAP values makes their exact calculation intractable with a large number of features, these values can be approximated by the Kernel SHAP method⁴⁸. Originally the Kernel SHAP method assumed feature independence⁴⁸, but was recently advanced to include feature dependence via sampling over a multivariate distribution represented by a set of marginal distributions and a Gaussian Copula⁵³. However, there are some cases for small data sets with highly correlated features where the SHAP values are qualitatively different from the true Shapley values⁵⁵.

Figure 3 compares the different sensitivity metrics including and excluding feature dependence. To get the global values of the SHAP and LIME indexes we take the mean absolute value for each feature across all 75 materials, but other metrics have been proposed in the literature and it is not clear which one is best^56,57,58. However the local information contained in metrics such as SHAP and LIME is an advantage they have over global metrics such as the Sobol indexes as it allows for the identification of regions in the material space that do not follow the global trends. Comparing the plots in Fig. 3a and b illustrates the importance of not treating the input primary features as independent, as all four sensitivity analysis metrics are qualitatively wrong under that assumption. This is likely a result of sampling over physically unreachable parts of the feature space, e.g., areas with a high density, low mass, and high molar volume, and suggests that caution should be used when applying these techniques to highly correlated datasets. The impact of this is demonstrated in Supplementary Fig. 3, where we explicitly simplify the model to remove some of the dependencies. All three indexes that include correlative effects show that σ^A, V_m, Θ_D,∞, and ω_Γ,max predominately control the variance of ${\kappa }^{{{{\rm{SISSO}}}}}\left(300\,{{{\rm{K}}}}\right)$. The main difference between S_i and the kernel SHAP metrics is the relative importance of Θ_D,∞ and ω_Γ,max when compared against V_m and σ^A. The difference between these results could be from the the Sobol indexes globally sampling the region of Θ_D,∞ > 1300 K instead of relying on the two materials in that regime or S_i over-estimating its importance because the higher correlation between Θ_D,∞ and the other inputs. In fact, the low values of ${S}_{i}^{{{{\rm{T}}}}}$ also imply that there are significant correlative effects in place between these inputs, and no single feature can be singled out as primarily responsible for changes in ${\kappa }^{{{{\rm{SISSO}}}}}\left({{{\rm{300\,K}}}}\right)$. For instance, the similarity between the importance of ω_Γ,max and Θ_D,∞ is because they are strongly correlated to each other, and only one of them needs to be considered (see the Supplementary Fig. 2). The importance of these features is further substantiated in Fig. 2b, where we compare the performance of the models calculated using the full dataset and one that only includes σ^A, V_m, and Θ_D,∞. For all tested models, we see only a slight deterioration in performance with a predictive AFD of 1.87, 1.77, and 1.77 for the SISSO, KRR, and GPR models, respectively, compared to 1.45 for the models trained with all features. This result highlights that the trends and the underlying mechanisms describing the dependence of ${\kappa }_{{{{\rm{L}}}}}\left({{{\rm{300\,K}}}}\right)$ in materials space are fully captured by those features alone.

**Fig. 3: The feature importance metrics for the models.**

Even more importantly, our model captures the interplay between these features across materials, as demonstrated in the maps in Fig. 4. These maps showcase the strong correlation between ${\kappa }^{{{{\rm{SISSO}}}}}\left({{{\rm{300\,K}}}}\right)$ and σ^A, V_m, and Θ_D,∞, and that materials with high anharmonicity, low-energy vibrational modes, and a large molar volume will be good thermal insulators. Figure 4 shows the expected value of ${\kappa }^{{{{\rm{SISSO}}}}}\left({{{\rm{300\,K}}}}\right)$, ${E}_{\hat{{{{\mathcal{X}}}}}}\left(\left.{\kappa }^{{{{\rm{SISSO}}}}}\left({{{\rm{300\,K}}}}\right)\right\vert \hat{{{{\mathcal{X}}}}}\right)$, for different sets of input features, $\hat{{{{\mathcal{X}}}}}$, shown on the axes of each plot. We then overlay the maps with the actual values of each input for all materials in the training set to evaluate the trends across different groups of materials. Figure 4c confirms that σ^A is already a good indicator for finding thermal insulators, with most of the materials having ${\kappa }_{{{{\rm{L}}}}}\left({{{\rm{300\,K}}}}\right)$ within one standard deviation of the expected value. For the more harmonic materials with σ^A < 0.2, the vanishing degree of anharmonicity is, alone, not always sufficient for quantitative predictions. In this limit, a combination of σ^A and V_m can produce correct predictions for the otherwise underestimated white triangles with a σ^A < 0.2, as seen in Fig. 4a. In order to fully describe the low thermal conductivity of the remaining highlighted materials both Θ_D,∞ and V_m are needed as can be seen in Fig. 4a, b, d and e. Generally, this reflects that the three properties σ^A, Θ_D,∞, and V_m are the target properties to optimize to obtain ultra-low thermal conductivities.

**Fig. 4: The expected value of ${\kappa }^{{{{\rm{SISSO}}}}}\left({{{\rm{300\,K}}}}\right)$ relative to select primary features.**

These results can also be rationalized within our current understanding of thermal transport and showcase which physical mechanisms determine κ_L in material space. Qualitatively, it is well known that good thermal conductors typically exhibit a high degree of symmetry with a smaller number of atoms, e.g., diamond and silicon, whereas thermal insulators, e.g., glass-like materials, are often characterized by an absence of crystal symmetries and larger primitive cells. In our case, this trend is quantitatively captured via V_m, which reflects that larger unit cells have smaller thermal conductivities. Furthermore, it is well known that phonon group velocities determine how fast energy is transported through the crystal in the harmonic picture⁵⁹, and that it is limited by scattering events arising due to anharmonicity. In our model, these processes are captured by Θ_D,∞, which describes the degree of dispersion in the phonon band structure, and the anharmonicity measure, σ^A respectively. In this context, it is important to note that, in spite of the fact that these qualitative mechanisms were long known, there had hitherto been no agreement on which material property would quantitatively capture these mechanisms best across material space. For instance, both the γ, the lattice thermal expansion coefficient, and now σ^A, have been used to describe the anharmonicity of a material. However, when both γ and σ^A are included as primary features, only σ^A is chosen (see Supplementary Note 5 for more details). This result indicates that the σ^A measure is the more sensitive choice for modeling the strength of anharmonic effects. While γ also depends on anharmonic effects, they are also influenced by the bulk modulus, the density, and the specific heat of a material.

Validating the predictions with ab initio Green-Kubo calculations

To confirm that the discovered models produce physically meaningful predictions, we validate the estimated thermal conductivity of four materials using the ab initio Green-Kubo method (aiGK)^10,45. This approach has recently been demonstrated to be highly accurate when compared to experiments⁴⁵, using similar DFT settings for what is done in this work. In particular aiGK is highly accurate in the low thermal conductivity regime that we are studying here. For details of how we calculate κ_L see the methodology in Section IV J. For this purpose, we chose BrBaCl, LiScS₂, CaF₂, and GaLiO₂, since these materials represent a broad region of the relevant feature space that also test the boundary regions of the heuristics found by the sensitivity analysis and mapping, as demonstrated by the yellow stars in Fig. 4. Figure 5 shows the convergence of the thermal conductivity of the selected materials, as calculated from three aiMD trajectories. All of the calculated thermal conductivities fall within the 95% confidence interval of the model, with the predictions for both CaF₂ and ClBaBr being especially accurate. The better performance of the model for these materials is expected, as they are more similar to the training data than the hexagonal Caswellsilverite like materials. In addition, quantum nuclear effects play a more important role in LiScS₂ and GaLiO₂ than CaF₂ and ClBaBr, which can also explain why those predictions are worse than CaF₂ and ClBaBr. Overall these results demonstrate the predictive power of the discussed model.

**Fig. 5: Validation of the predictions of the model.**

Discovering improved thermal insulators

Using the information gained from the sensitivity analysis and statistical maps of the model, we are now able to design a hierarchical and efficient high-throughput screening protocol split into three stages: structure optimization, harmonic model generation, and anharmonicity quantification. We demonstrate this procedure by identifying possible thermal insulators within a set of 732 materials, within those compounds available in the materials project⁶⁰ that feature the same crystallographic prototypes^61,62 as the ones used for training. Once the geometry is optimized we remove all materials with V_m < 35.5 Å(60 materials) and all (almost) metallic materials (GGA bandgap < 0.2 eV), and are left with 302 candidate compounds. We then generate the converged harmonic model for the remaining materials and screen out all materials with Θ_D,∞ > 547 K or have an unreliable harmonic model, e.g., materials with imaginary harmonic modes, leaving 148 candidates. Finally we evaluate the anharmonicity, σ^A, for the remaining materials (see Section IV D) and exclude all materials with σ^A < 0.206, and obtain 110 candidate thermal insulators. To avoid unnecessary calculations, we first estimate σ^A via ${\sigma }_{{{{\rm{OS}}}}}^{{{{\rm{A}}}}}$ and then refine it via aiMD when ${\sigma }_{{{{\rm{OS}}}}}^{{{{\rm{A}}}}} > 0.4$⁸. For these candidate materials, we evaluate ${\kappa }^{{{{\rm{SISSO}}}}}\left({{{\rm{300\,K}}}}\right)$ using Eq. (1). Of the 110 materials that passed all checks, 96 are predicted to have have a ${\kappa }^{{{{\rm{SISSO}}}}}\left({{{\rm{300\,K}}}}\right)$ below 10 Wm⁻¹K⁻¹, illustrating the success of this method.

Finally, let us emphasize that the proposed strategy is not limited to the discovery of thermal insulators, but can be equally used to find, e.g., good thermal conductors. This is demonstrated in Fig. 6, in which we predict the thermal conductivity of all non-metallic and stable materials using the SISSO and KRR models. Generally, both the SISSO and KRR models agree with each other with only 28 of the 227 materials having a disagreement larger than a factor of two and one (LiHF₂) with a disagreement larger than a factor of 5, further illustrating the reliability of these predictions. We expect that the large deviation for LiHF₂ is a result of the large σ^A value for that material (0.54), which is significantly larger than the maximum in the training data. We can see from the outset histograms of both the SISSO and KRR models that the hierarchical procedure successfully finds the good thermal insulators, with only 26 of the 122 materials with a ${\kappa }_{{{{\rm{L}}}}}\left({{{\rm{300\,K}}}}\right)\le 10$ Wm⁻¹K⁻¹ and 10 of the 80 materials with a ${\kappa }_{{{{\rm{L}}}}}\left({{{\rm{300\,K}}}}\right)\le 5$ Wm⁻¹K⁻¹ not passing all tests. Of these eight only the thermal insulating behavior of CuLiF₂ and Sr₂HN can not be described by the values of the other two tests that passed. Conversely, materials that do not pass the test show high conductivities. When one of the tests fail the average estimated value of $\log \left({\kappa }_{{{{\rm{L}}}}}\left({{{\rm{300\,K}}}}\right)\right)$ increases to 1.38 ± 0.490 (24.0 Wm⁻¹K⁻¹), with a range of 0.95 Wm⁻¹K⁻¹ to 741.3 Wm⁻¹K⁻¹. In particular, screening the materials by their molar volumes alone is a good marker for finding strong thermal conductors as all of the 15 materials with ${\kappa }_{{{{\rm{L}}}}}\left({{{\rm{300\,K}}}}\right)\ge 100$ Wm⁻¹K⁻¹ have a V_m ≤ 45 Å³.

**Fig. 6: A scatter plot of the prediction of both the SISSO and KRR generated models for an additional 227 materials from the same classes as the training set.**

Discussion

We have developed an AI framework to facilitate and accelerate material space exploration, and demonstrate its capabilities for the urgent problem of finding thermal insulators. By combining symbolic regression and sensitivity analysis, we are able to obtain accurate predictions for a given property using relatively easy to calculate materials properties, while retaining strong physical interpretability. Most importantly, this analysis enables us to create hierarchical, high-throughput frameworks, which we used to screen over a set of more than 700 materials and find a group of ~100 possible thermal insulators. Notably, almost all of the good thermal conductors in the set of candidate materials are discarded within the first iteration of the screening, in which we only discriminate by molar volume, i.e., with an absolutely negligible computational cost compared to full calculations of κ_L. Accordingly, we expect this approach to be extremely useful in a wide range of materials problems beyond thermal transport, especially whenever (i) few reliable data are available, (ii) additional data are hard to produce, and/or (iii) multiple physical mechanisms compete non-trivially, limiting the reliability of simplified models.

Although the proposed approach is already reliable for small dataset sizes, it obviously becomes more so when applied to larger ones. Here, the identified heuristics can substantially help steer data creation toward more interesting parts of material space. Along these lines, it is possible to iteratively refine both the SISSO model and the rules from the sensitivity analysis during material space exploration while the dataset grows. Furthermore, one can also apply the proposed procedure to the most influential primary features in a recursive fashion, learning new expressions for the computationally expensive features, e.g., σ^A, using simpler properties. In turn, this will further accelerate material discovery, but also allow for gaining further physical insights. Most importantly, this method is not limited to just the thermal conductivity of a material, and can be applied to any target property. Further extending this framework to include information about where the underlying electronic structure calculations are expected to fail, also provides a means of accelerating materials discovery more generally⁶³.

Methods

SISSO

We use SISSO to discover analytical expressions for ${\kappa }_{{{{\rm{L}}}}}\left({{{\rm{300\,K}}}}\right)$⁶⁴. SISSO finds low-dimensional, analytic expressions for a target property, P, by first generating an exhaustive set of candidate features, $\hat{{{\Phi }}}$, for a given set of primary features, ${\hat{{{\Phi }}}}_{0}$, and operators ${\hat{{{{\mathcal{H}}}}}}_{m}$, and then performing an ℓ₀-regularization over a subset of those features to find the n-dimensional subset of features, whose linear combination results in the most descriptive model. $\hat{{{\Phi }}}$ is recursively built in rungs, ${\hat{{{{\mathcal{F}}}}}}_{r}$, from ${\hat{{{\Phi }}}}_{0}$ and ${\hat{{{{\mathcal{H}}}}}}_{m}$, by applying all elements, ${\hat{{{{\rm{h}}}}}}^{m}$, of ${\hat{{{{\mathcal{H}}}}}}^{m}$ on all elements ${\hat{f}}_{i}$ and ${\hat{f}}_{j}$ of ${\hat{{{{\mathcal{F}}}}}}_{r-1}$

$${\hat{{{{\mathcal{F}}}}}}_{r}\equiv {\hat{{{\mbox{h}}}}}^{m}\left[{\hat{f}}_{i},{\hat{f}}_{j}\right],\forall \ {\hat{{{\mbox{h}}}}}^{m}\in {\hat{{{{\mathcal{H}}}}}}^{m}\ \,{{\mbox{and}}}\,\ \forall \ {\hat{f}}_{i},{\hat{f}}_{j}\in {\hat{{{{\mathcal{F}}}}}}_{r-1}.$$

${\hat{{{\Phi }}}}_{r}$ is then the union of ${\hat{{{\Phi }}}}_{r-1}$ and ${\hat{{{{\mathcal{F}}}}}}_{r}$. Once $\hat{{{\Phi }}}$ is generated, the n_SIS features most correlated to P are stored in ${\hat{{{{\mathcal{S}}}}}}_{1}$, and the best one-dimensional models are trivially extracted from the top elements of ${\hat{{{{\mathcal{S}}}}}}_{1}$. Then the n_SIS features most correlated to any of the residuals, ${{{{\boldsymbol{\Delta }}}}}_{1}^{i}$, of the ${n}_{{{{\rm{res}}}}}$ best one-dimensional descriptors are stored in ${\hat{{{{\mathcal{S}}}}}}_{2}$. We define this projection as

$$s=\max \left({s}_{0},{s}_{1},...,{s}_{i},...,{s{}_{{{{\rm{n}}}}}}_{{{{\rm{res}}}}}\right)$$

(3)

$${s}_{i}={R}^{2}\left(\hat{\phi },{{{{\boldsymbol{\Delta }}}}}_{1}^{i}\right),$$

(4)

where $\hat{\phi }\in \hat{{{\Phi }}}$, and R is the Pearson correlation function. We call this approach the multiple residual approach, which was first introduced by the authors⁶⁵ and later fully described in Ref. ⁶⁶. From here, the best two-dimensional models are found by performing an ℓ₀-regularized optimization over ${\hat{{{{\mathcal{S}}}}}}_{1}\cup {\hat{{{{\mathcal{S}}}}}}_{2}$⁶⁷. This process is iteratively repeated until the best n-dimensional descriptor is found⁶⁴.

For this application ${\hat{{{{\mathcal{H}}}}}}_{m}$ contains: A + B, A − B, A*B, $\frac{A}{B}$, $\left\vert A-B\right\vert$, $\left\vert A\right\vert$, ${\left(A\right)}^{-1}$, ${\left(A\right)}^{2}$, ${\left(A\right)}^{3}$, $\sqrt{A}$, $\root 3 \of {A}$, $\exp \left(A\right)$, $\exp \left(-1.0* A\right)$, and $\ln \left(A\right)$. In addition to ensure the units of the primary features do not affect the final results, we additionally include the following operators: ${\left(A+\beta \right)}^{-1}$, ${\left(A+\beta \right)}^{2}$, ${\left(A+\beta \right)}^{3}$, $\sqrt{\alpha A+\beta }$, $\root 3 \of {A+\beta }$, $\exp \left(\alpha A\right)$, $\exp \left(-1.0* \alpha A\right)$, and $\ln \left(\alpha A+\beta \right)$, where α and β are scaling and bias constants used to adjust the input data on the fly. We find the optimal α and β terms using non-linear optimization for each of these operators^22,66,68. To ensure that the parameterization does not result in mathematically invalid equations for data points outside of the training set, the range of each candidate feature is derived from the range of the primary features, and the upper and lower bounds for the features are set appropriately. When generating new expressions these ranges are then used as a domain for the operations, and any expression that would lead to invalid results are excluded⁶⁶. The range of the primary features are set to be physically relevant for the systems we are studying and are listed in Table 1. Hereafter, we call the use of these operators parametric SISSO. For more information please refer to⁶⁶.

All hyperparameters were set following the cross-validation procedures described in Section IV E.

Kernel-Ridge regression

To generate the kernel-ridge regression models we used the utilities provided by scikit-learn⁶⁹, using a radial basis function kernel with optimized regularization term and kernel length scale. The hyperparameters were selected using with a 141 by 141 point logarithmic grid search with possible parameters ranging from 10⁻⁷ to 10⁰. Before performing the analysis each input feature, x_i is standardized

$${{{\bf{{x}}}_{{{{\rm{i}}}}}^{{{{\rm{Stand}}}}}}}=\frac{{{{{\bf{x}}}}}_{{{{\rm{i}}}}}-{\mu }_{{{{\rm{i}}}}}}{{\sigma }_{{{{\rm{i}}}}}}$$

(5)

where ${{{\bf{{x}}}_{{{{\rm{i}}}}}^{{{{\rm{stand}}}}}}}$ is the standardized input feature, μ_i is the mean of the input feature for the training data, and σ_i is the standard deviation of the input feature for the training data.

Gaussian process regression

To generate the Gaussian Process Regression Models we used the utilities provided by scikit-learn⁶⁹, using a radial basis function kernel with an optimized regularization term and kernel length scale. The hyperparameters were selected using with a 141 by 141 point logarithmic grid search with possible parameters ranging from 10⁻⁷ to 10⁰. Before performing the analysis each input feature, x_i is standardized

$${{{\bf{{x}}}_{{{{\rm{i}}}}}^{{{{\rm{Stand}}}}}}}=\frac{{{{{\bf{x}}}}}_{{{{\rm{i}}}}}-{\mu }_{{{{\rm{i}}}}}}{{\sigma }_{{{{\rm{i}}}}}}$$

(6)

where ${{{\bf{{x}}}_{{{{\rm{i}}}}}^{{{{\rm{stand}}}}}}}$ is the standardized input feature, μ_i is the mean of the input feature for the training data, and σ_i is the standard deviation of the input feature for the training data. All uncertainty values were taken from the results of the GPR predictions, and in the case of the nested cross-validation the uncertainty was propagated using

$${\kappa }_{{{{\rm{GPR}}}}}^{{{{\rm{pred}}}}}=\frac{1}{3}\mathop{\sum }\limits_{i=1}^{3}{\kappa }_{{{{\rm{GPR}}}},i}^{{{{\rm{pred}}}}}$$

(7)

$${\sigma }_{{{{\rm{GPR}}}}}^{{{{\rm{pred}}}}}=\frac{1}{3}\sqrt{\mathop{\sum }\limits_{i=1}^{3}{\left({\sigma }_{{{{\rm{GPR}}}},i}^{{{{\rm{pred}}}}}\right)}^{2}},$$

(8)

where ${\kappa }_{{{{\rm{GPR}}}},i}^{{{{\rm{pred}}}}}$ and ${\kappa }_{{{{\rm{GPR}}}},i}^{{{{\rm{pred}}}}}$ are the respective prediction and uncertainty of the i^th GPR model for a given data point and ${\kappa }_{{{{\rm{GPR}}}}}^{{{{\rm{pred}}}}}$ and ${\kappa }_{{{{\rm{GPR}}}}}^{{{{\rm{pred}}}}}$ are the respective mean prediction and uncertainty for a prediction.

Creating the dataset

In this study we focus on only room-temperature data for κ_L, since values for other temperatures are even scarcer. However, we note that an explicit temperature dependence can be straightforwardly included using multi-task SISSO⁷⁰, and it is at least partially included via, the anharmonicity factor, σ^A⁸ (see below for more details). For ${\kappa }_{{{{\rm{L}}}}}\left({{{\rm{300\,K}}}}\right)$, we have compiled a list of seventy-five materials from the literature (see Supplementary Table 1 for complete list with references), whose thermal conductivity has been experimentally measured. This list was curated from an initial set of over 100 materials, from which we removed all samples that are either thermodynamically unstable or are electrical conductors. This list of materials covers a diverse set of fourteen different binary and ternary crystal structure prototypes^61,62,71.

With respect to the primary features, ${\hat{{{\Phi }}}}_{0}$, compound specific properties are provided for each material. All primary features can be roughly categorized in two classes: Structural parameters that describe the equilibrium structure and dynamical parameters that characterize the nuclear motion. For the latter case, both harmonic and anharmonic properties have been taken into account. As shown in Supplementary Note 5, additional features, such as the parameters entering the Slack model, i.e., γ, Θ_a, and V_a, can be included. However, these features do not benefit the model and when included only V_a, and not γ or Θ_a are selected. For a complete list of all primary features, and their definitions refer to Table 1.

The structural parameters relate to either the mass of the atoms (μ, ${m}_{\min }$, ${m}_{\max }$, m_avg), the lattice parameters of the primitive cell (V_m, ${L}_{\min }$, ${L}_{\max }$, L_avg), the density of the materials (ρ), or the number of atoms in the primitive cell (n_at). For all systems a generalization of the reduced mass, μ, is used so it can be extended to non-binary systems,

$$\mu ={\left(\mathop{\sum }\limits_{i}^{{n}_{{{{\rm{emp}}}}}}\frac{1}{{m}_{i}}\right)}^{-1},$$

(9)

where n_emp is the number of atoms in the empirical formula and m_i is the mass of atom, i. Similarly, the molar volume, V_m, is calculated by

$${V}_{{{{\rm{m}}}}}=\frac{{V}_{{{{\rm{prim}}}}}}{Z},$$

(10)

where ${V}_{{{{\rm{prim}}}}}$ is the volume of the primitive cell and $Z=\frac{{n}_{{{{\rm{at}}}}}}{{n}_{{{{\rm{emp}}}}}}$. Finally, ρ is calculated by dividing the total mass of the empirical cell by V_m

$$\rho =\mathop{\sum }\limits_{i}^{{n}_{{{{\rm{emp}}}}}}\frac{{m}_{i}}{{V}_{{{{\rm{m}}}}}}.$$

(11)

All of the harmonic properties used in these models are calculated from a converged harmonic model generated using phonopy⁷². For each material, the phonon density of states of successively larger supercells are compared using a Tanimoto similarity measure

$$S=\frac{{g}_{{{{\rm{p,L}}}}}\left(\omega \right)\cdot {g}_{{{{\rm{p,S}}}}}\left(\omega \right)}{\parallel {g}_{{{{\rm{p,L}}}}}\left(\omega \right){\parallel }^{2}+\parallel {g}_{{{{\rm{p,S}}}}}\left(\omega \right){\parallel }^{2}-{g}_{{{{\rm{p,L}}}}}\left(\omega \right)\cdot {g}_{{{{\rm{p,S}}}}}\left(\omega \right)},$$

(12)

where S is the similarity score, ${g}_{{{{\rm{p,L}}}}}\left(\omega \right)$ is the phonon density of states of the larger supercell, ${g}_{{{{\rm{p,S}}}}}\left(\omega \right)$ is the phonon density of states of the smaller supercell, $A\left(\omega \right)\cdot B\left(\omega \right)=\int\nolimits_{0}^{\infty }A\left(\omega \right)B\left(\omega \right)d\omega$, and $\parallel A\left(\omega \right){\parallel }^{2}=\int\nolimits_{0}^{\infty }{A}^{2}\left(\omega \right)d\omega$. If S > 0.80, then the harmonic model is considered converged. From here C_V is calculated from phonopy as a weighted sum over the mode dependent heat capacities. Both approximations to the Debye temperature are calculated from the moments of the phonon density of states

$$\langle {\varepsilon }^{n}\rangle =\frac{\int\,d\varepsilon \,{g}_{{{{\rm{p}}}}}(\varepsilon )\,{\varepsilon }^{n}}{\int\,d\varepsilon {g}_{{{{\rm{p}}}}}\left(\varepsilon \right)}$$

(13)

$${\Theta }_{{{{\rm{P}}}}}=\frac{1}{{k}_{B}}\langle \varepsilon \rangle$$

(14)

$${\Theta }_{{{{\rm{D,\infty }}}}}=\frac{1}{{k}_{B}}\sqrt{\frac{5}{3}\langle {\varepsilon }^{2}\rangle },$$

(15)

where ${g}_{p}\left(\varepsilon \right)$ is the phonon density of states at energy ε⁷³. Finally v_s is approximated from the Debye frequency, ω_D, by²⁰

$${v}_{{{{\rm{s}}}}}={\left(\frac{{V}_{{{{\rm{a}}}}}}{6{\pi }^{2}}\right)}^{1/3}{\omega }_{{{{\rm{D}}}}},$$

(16)

where ω_D is approximated as

$${\omega }_{{{{\rm{D}}}}}=\root 3 \of {\frac{9{n}_{{{{\rm{at}}}}}}{a}}$$

(17)

and a is found by fitting ${g}_{p}\left(\omega \right)$ in the range $\left[0,\frac{{\omega }_{\Gamma ,max}}{8}\right]$ to

$${g}_{p,D}\left(\omega \right)=a{\omega }^{2}.$$

(18)

To measure the anharmonicity of the materials we use σ^A as defined in⁸

$${\sigma }^{{{{\rm{A}}}}}(T)=\sqrt{\frac{\mathop{\sum}\limits_{I,\alpha }{\left\langle {\left({F}_{I,\alpha }-{F}_{I,\alpha }^{{{{\rm{ha}}}}}\right)}^{2}\right\rangle }_{(T)}}{\mathop{\sum}\limits_{I,\alpha }{\left\langle {F}_{I,\alpha }^{2}\right\rangle }_{(T)}}}\,,$$

(19)

in which 〈⋅〉_(T) denotes the thermodynamic average at a temperature T, F_I,α is the α component of the force calculated from density functional theory (DFT) acting on atom I, and ${F}_{I,\alpha }^{{{{\rm{ha}}}}}$ is the same force approximated by the harmonic model⁸. First we calculate ${\sigma }_{{{{\rm{OS}}}}}^{{{{\rm{A}}}}}$, which uses an approximation to the thermodynamic ensemble average using the one-shot method proposed by Zacharias and Giustino⁷⁴. In the one-shot approach the atomic positions are offset from their equilibrium positions by a vector ΔR,

$$\Delta {R}_{I}^{\alpha }=\frac{1}{\sqrt{{M}_{I}}}\mathop{\sum}\limits_{s}{\zeta }_{s}\left\langle {A}_{s}\right\rangle {e}_{sI}^{\alpha },$$

(20)

where I is the atom number, α is the component, e_s are the harmonic eigenvectors, $\left\langle {A}_{s}\right\rangle =\sqrt{2{k}_{B}T}/{\omega }_{s}$ is the mean mode amplitude in the classical limit⁷⁵, and ζ_s = (−1)^s−1⁷⁴. These displacements correspond to the turning-points of the oscillation estimated from the harmonic force constants, and is a good approximation to σ^A in the harmonic limit. Because of this, if ${\sigma }_{{{{\rm{OS}}}}}^{{{{\rm{A}}}}} < 0.2$ we accept that value as the true σ^A. Otherwise we calculate σ^A using aiMD in the canonical ensemble at 300 K for 10 ps, using the Langevin thermostat. When performing the high-throughput screening the threshold for when to use aiMD is increased to 0.4 because that is the point where ${\sigma }_{{{{\rm{OS}}}}}^{{{{\rm{A}}}}}$ becomes qualitatively unreliable⁸.

All electronic structure calculations are done using FHI-aims⁷⁶. All geometries are optimized with symmetry-preserving, parametric constraints until all forces are converged to a numerical precision better than 10⁻³ eV/Å⁷⁷. The constraints are generated using the AFlow XtalFinder Tool⁷¹. All calculations use the PBEsol functional to calculate the exchange-correlation energy and an SCF convergence criteria of 10⁻⁶ eV/Å and 5 × 10⁻⁴ eV/Å for the density and forces, respectively. Relativistic effects are included in terms of the scalar atomic ZORA approach and all other settings are taken to be the default values in FHI-aims. For all calculations we use the light basis sets and numerical settings in FHI-aims. These settings were shown to ensure a convergence in lattice constants of ± 0.1 Å and a relative accuracy in phonon frequencies of 3%⁸.

All primary features are calculated using the workflows defined in FHI-vibes⁷⁸.

Error evaluation

To estimate the prediction error for all models we perform a nested cross-validation, where the data are initially separated into different training and test sets using a ten-fold split. Two hyperparameters (maximum dimension and parameterization depth) are then optimized using a fivefold cross-validation on each of the training sets, and the overall performance of the model is evaluated on the corresponding test set. The size of the SIS subspace, number of residuals, and rung were all set to 2000, 10, and 3, respectively, because they did not have a large impact on the final results. We then repeat the procedure three times and average over each iteration to get a reliable estimate of the prediction error for each sample⁷⁹.

Calculating the inputs to the Slack model

The individual components for the Slack model were the same as the ones used for the main models, with the exception of γ, V_a and Θ_a. For Θ_a, we first calculate the Debye temperature, Θ_D

$${\Theta }_{{{{\rm{D}}}}}=\frac{\hslash {\omega }_{{{{\rm{D}}}}}}{{k}_{B}}$$

(21)

where ω_D is the same Debye frequency used for calculating v_s (see Section IV D), k_B is the Boltzmann constant, and ℏ is Planck’s constant. From here we calculate Θ_a using

$${\Theta }_{{{{\rm{a}}}}}=\frac{{\Theta }_{{{{\rm{D}}}}}}{\root 3 \of {{n}_{{{{\rm{at}}}}}}}.$$

(22)

We use the phonopy definition of Θ_D instead of Θ_D,∞ because it is better aligned to the original definition of Θ_a. However, it is not used in the SISSO training because the initial fitting procedure to find ω_D does not produce a unique value for Θ_D and it is already partially included via v_s. To calculate the thermodynamic Grüneisen parameter we use the utilities provided by phonopy⁷². The atomic volume was calculated by taking the volume of the primitive cell and dividing it by the total number of atoms.

Calculating the Sobol indexes

Formally, the Sobol indices are defined as

$${S}_{i}=\frac{{{{{\rm{Var}}}}}_{{\hat{x}}_{i}}\left({E}_{{\widetilde{{{{\mathcal{X}}}}}}_{i}}\left(\log \left({\kappa }_{{{{\rm{L}}}}}\left({{{\rm{300\,K}}}}\right)\right)| {\hat{x}}_{i}\right)\right)}{{{{\rm{Var}}}}\left(\log \left({\kappa }_{{{{\rm{L}}}}}\left({{{\rm{300\,K}}}}\right)\right)\right)}$$

(23)

$${S}_{i}^{{{{\rm{T}}}}}=1-\frac{{{{{\rm{Var}}}}}_{{\widetilde{{{{\mathcal{X}}}}}}_{i}}\left({E}_{{\hat{x}}_{i}}\left(\log \left({\kappa }_{{{{\rm{L}}}}}\left({{{\rm{300\,K}}}}\right)\right)| {\widetilde{{{{\mathcal{X}}}}}}_{i}\right)\right)}{{{{\rm{Var}}}}\left(\log \left({\kappa }_{{{{\rm{L}}}}}\left({{{\rm{300\,K}}}}\right)\right)\right)}$$

(24)

where ${\hat{x}}_{i}\in \hat{{{{\mathcal{X}}}}}$ is one of the inputs to the model, ${{{{\rm{Var}}}}}_{a}\left(B\right)$ is the variance of B with respect to a, ${E}_{a}\left(B\right)$ is the mean of B after sampling over a, and ${\widetilde{{{{\mathcal{X}}}}}}_{i}$ is the set of all variables excluding ${\hat{x}}_{i}$.

Normally, it is assumed that all elements of $\hat{{{{\mathcal{X}}}}}$ are independent of each other, and this assumption is preserved when calculating S_i and ${S}_{i}^{T}$ in Fig. 3b. As a result of this, the variance of $\log \left({\kappa }^{{{{\rm{SISSO}}}}}\left({{{\rm{300\,K}}}}\right)\right)$ and the required expectation values would be calculated from sampling over an n_v-dimensional hypercube covering the full input range, ignoring the correlation between the input variables. However, in order to properly model the correlative effects between elements of $\hat{{{{\mathcal{X}}}}}$, Kucherenko et al. modify this sampling approach^49,51. The first step of the updated algorithm is to fit the input data to a set of marginal univariate distributions coupled together via a copula^49,51. The algorithm then samples over an n_v-dimensional unit-hypercube and transforms these samples into the correct variable space using a transform defined by the fitted distributions and copulas (see Supplementary Note 3 for more details). It was later demonstrated that when using the approach proposed by Kucherenko and coworkers to calculate the Sobol indices, S_i includes effects from the dependence of ${\hat{x}}_{i}$ on those in ${\widetilde{{{{\mathcal{X}}}}}}_{i}$, while ${S}_{i}^{{{{\rm{T}}}}}$ is independent of these effects⁸⁰. We use this updated algorithm to calculate S_i and ${S}_{i}^{{{{\rm{T}}}}}$ in Fig. 3a. In both cases we use the implementation in UQLab⁵⁰ to calculate S_i and ${S}_{i}^{T}$.

Calculating the SHAP indexes

The SHAP values are calculated by treating the features as independent variables using the original method proposed by Lundberg and Lee⁴⁸, as implemented in the python package SHAP, and as dependent variables using shapr by Aas, et al.⁵³. The SHAP values are an extension of the Shapley values from cooperative game theory, that distributes the contribution, $v\left({{{\mathcal{S}}}}\right)$, of each player or subset of players, ${{{\mathcal{S}}}}\subseteq {{{\mathcal{M}}}}=\left\{1,\cdot ,M\right\}$, where ${{{\mathcal{M}}}}$ is the set of all players^48,53. The Shapley value, ${\phi }_{j}\left(v\right)={\phi }_{j}$, can then be calculated by taking a weighted mean over the contribution function differences for all ${{{\mathcal{S}}}}$ not containing the player, j,

$${\phi }_{j}=\mathop{\sum}\limits_{{{{\mathcal{S}}}}\subseteq {{{\mathcal{M}}}}\setminus \left\{j\right\}}\frac{\left\vert {{{\mathcal{S}}}}\right\vert !\left(M-\left\vert {{{\mathcal{S}}}}\right\vert -1\right)!}{M!}\left(v\left({{{\mathcal{S}}}}\cup \left\{j\right\}\right)-v\left({{{\mathcal{S}}}}\right)\right),j=1,\cdots \,,M,$$

(25)

where $\left\vert {{{\mathcal{S}}}}\right\vert$ is the number of members in ${{{\mathcal{S}}}}$⁵³. For a machine learning problem with a training set ${\left\{{y}^{i},{{{{\boldsymbol{x}}}}}^{i}\right\}}_{i = 1,\cdots ,{n}_{{{{\rm{train}}}}}}$, where yⁱ is the property value and xⁱ are the target property value and input feature values for the i^th data point in the training set with n_train data points^48,53, we can explain the prediction of the model, $f\left({{{{\boldsymbol{x}}}}}^{* }\right)$ for a particular point, x^*, with

$$f\left({{{{\boldsymbol{x}}}}}^{* }\right)={\phi }_{0}+\mathop{\sum }\limits_{j=1}^{M}{\phi }_{j}^{* },$$

(26)

where ϕ₀ is the mean prediction and ${\phi }_{j}^{* }$ is the Shapley value for the j^th feature for a prediction x = x^*. Essentially the Shapley value for the model describes the difference between a prediction, ${y}^{* }=f\left({{{{\boldsymbol{x}}}}}^{* }\right)$, and the mean of all predictions^48,53. The contribution function is then defined as

$$v\left({{{\mathcal{S}}}}\right)=E\left[\left.f\left({{{\boldsymbol{x}}}}\right)\right\vert {{{{\boldsymbol{x}}}}}_{{{{\mathcal{S}}}}}={{{{\boldsymbol{x}}}}}_{{{{\mathcal{S}}}}}^{* }\right],$$

(27)

which is the expectation value of the model conditional on ${{{{\boldsymbol{x}}}}}_{{{{\mathcal{S}}}}}={{{{\boldsymbol{x}}}}}_{{{{\mathcal{S}}}}}^{* }$^48,53. The expectation value can be calculated as

$$\begin{array}{ll}E\left[\left.f\left({{{\boldsymbol{x}}}}\right)\right\vert {{{{\boldsymbol{x}}}}}_{{{{\mathcal{S}}}}}={{{{\boldsymbol{x}}}}}_{{{{\mathcal{S}}}}}^{* }\right]&=E\left[\left.f\left({{{{\boldsymbol{x}}}}}_{\widetilde{{{{\mathcal{S}}}}}},{{{{\boldsymbol{x}}}}}_{{{{\mathcal{S}}}}}\right)\right\vert {{{{\boldsymbol{x}}}}}_{{{{\mathcal{S}}}}}={{{{\boldsymbol{x}}}}}_{{{{\mathcal{S}}}}}^{* }\right]\\ &=\int\,f\left({{{{\boldsymbol{x}}}}}_{\widetilde{{{{\mathcal{S}}}}}},{{{{\boldsymbol{x}}}}}_{{{{\mathcal{S}}}}}\right)p\left(\left.{{{{\boldsymbol{x}}}}}_{\widetilde{{{{\mathcal{S}}}}}}\right\vert {{{{\boldsymbol{x}}}}}_{{{{\mathcal{S}}}}}={{{{\boldsymbol{x}}}}}_{{{{\mathcal{S}}}}}* \right)d{{{{\boldsymbol{x}}}}}_{\widetilde{{{{\mathcal{S}}}}}},\end{array}$$

(28)

where ${{{{\boldsymbol{x}}}}}_{\widetilde{{{{\mathcal{S}}}}}}$ is the subset of all features not included in ${{{\mathcal{S}}}}$ and $p\left(\left.{{{{\boldsymbol{x}}}}}_{\widetilde{{{{\mathcal{S}}}}}}\right\vert {{{{\boldsymbol{x}}}}}_{{{{\mathcal{S}}}}}={{{{\boldsymbol{x}}}}}_{{{{\mathcal{S}}}}}* \right)$ is the conditional probability distribution of ${{{{\boldsymbol{x}}}}}_{\widetilde{{{{\mathcal{S}}}}}}$ given ${{{{\boldsymbol{x}}}}}_{{{{\mathcal{S}}}}}={{{{\boldsymbol{x}}}}}_{{{{\mathcal{S}}}}}*$^48,53. In the case where the features are treated independently, $p\left(\left.{{{{\boldsymbol{x}}}}}_{\widetilde{{{{\mathcal{S}}}}}}\right\vert {{{{\boldsymbol{x}}}}}_{{{{\mathcal{S}}}}}={{{{\boldsymbol{x}}}}}_{{{{\mathcal{S}}}}}* \right)$ is replaced by $p\left({{{{\boldsymbol{x}}}}}_{\widetilde{{{{\mathcal{S}}}}}}\right)$ and $v\left({{{\mathcal{S}}}}\right)$ can be approximated by Monte Carlo integration

$$v\left({{{\mathcal{S}}}}\right)=\frac{1}{K}\mathop{\sum }\limits_{k=1}^{K}f\left({{{{\boldsymbol{x}}}}}_{\widetilde{{{{\mathcal{S}}}}}}^{k},{{{{\boldsymbol{x}}}}}_{{{{\mathcal{S}}}}}^{* }\right),$$

(29)

where ${{{{\boldsymbol{x}}}}}_{\widetilde{{{{\mathcal{S}}}}}}^{k}$ are samples from the training data, and K is the number of samples taken^48,53. To include feature dependence the marginal distributions of the training data are converted into a Gaussian copula and that is used to generate samples for the Monte Carlo integration⁵³.

Because the number of subsets that need to be explored grows as 2^M for the number of features, calculating the exact Shapley values for a large number of inputs becomes intractable. To remove this constraint the problem can be approximated as the optimal solution of a weighted least squares problem, which can be described as Kernel SHAP, which is described in refs. ^48,53.

Calculating the LIME indexes

For the LIME values we use the LIME package in python⁵⁴. The values were calculated using the standard tabular explainer using all features in the model and the mean absolute value of each prediction for each feature was used to asses the global feature importance. The methodology assumes the features are independent and for algorithmic details see ref. ⁵⁴

Calculating the thermal conductivity

To calculate κ_L, we use the ab initio Green-Kubo (aiGK) method^10,81. The aiGK method calculates the αβ component of the thermal conductivity tensor, κ^αβ, of a material for a given volume V, pressure p, and temperature T with

$${\kappa }^{\alpha \beta }\left(T,p\right)=\frac{V}{{k}_{B}{T}^{2}}\mathop{\lim }\limits_{\tau \to \infty }\int\nolimits_{0}^{\tau }{\langle G{\left[{{{\bf{J}}}}\right]}^{\alpha \beta }\left({\tau }^{{\prime} }\right)\rangle }_{\left(T,p\right)}d{\tau }^{{\prime} }$$

(30)

where k_B is Boltzmann’s constant, ${\langle \cdot \rangle }_{\left(T,p\right)}$ denotes an ensemble average, ${{{\bf{J}}}}\left(t\right)$ is the heat flux, and $G\left[{{{\bf{J}}}}\right]$ is the time-(auto)correlation functions

$$G{\left[{{{\bf{J}}}}\right]}^{\alpha \beta }=\mathop{\lim }\limits_{{t}_{0}\to \infty }\frac{1}{{t}_{0}}\int\nolimits_{0}^{{t}_{0}-\tau }{J}^{\alpha }\left(t\right){J}^{\beta }\left(t+\tau \right)dt.$$

(31)

The heat flux of each material is calculated from aiMD trajectories using the following definition

$${{{\bf{J}}}}\left(t\right)=\mathop{\sum}\limits_{I}{{{{\boldsymbol{\sigma }}}}}_{I}{\dot{{{{\bf{R}}}}}}_{I},$$

(32)

where R_I is the position of the i^th-atom and σ_I is the contribution of the i^th atom to the stress tensor, σ = ∑_Iσ_I¹⁰. From here κ_L is calculated as

$${\kappa }_{{{{\rm{L}}}}}=\frac{1}{3}{{{\rm{Tr}}}}\left[{{{\boldsymbol{\kappa }}}}\right]$$

(33)

All calculations were done using both FHI-vibes⁷⁸ and FHI-aims with the same settings as the previous calculations⁸ (see Section IV D for more details). The molecular dynamics calculations were done using a 5 fs time step in the NVE ensemble, with the initial structures taken from a 10 ps NVT trajectory. Three MD calculations were done for each material and the κ_L was taken to be the average of all three runs.

Data availability

All raw electronic structure data can be found on the NOMAD archive (https://doi.org/10.17172/NOMAD/2022.04.27-1)⁸². All processed data and figure creation scripts can be found on figshare (https://doi.org/10.6084/m9.figshare.22068749.v4)⁸³. A reproduction notebook can be found on the NOMAD AI Toolkit (https://nomad-lab.eu/aitutorials/kappa-sisso).

Code availability

SISSO++²² and FHI-vibes⁷⁸ were used to generate all data and analysis in the paper and are freely available online in the cited publications. All electronic structure calculations were done using FHI-aims⁷⁶, which is freely available for use for academic use (with a voluntary donation) (https://fhi-aims.org/get-the-code-menu/get-the-code). The Sobol indexes are calculated with UQLab^50,51 (https://www.uqlab.com/download) and the kernel SHAP values were found with shapr⁵³ (https://github.com/NorskRegnesentral/shapr) which are open source. The python SHAP library⁴⁸ was also used for the independent SHAP values, and is open source (https://github.com/slundberg/shap).

References

Stanev, V., Choudhary, K., Kusne, A. G., Paglione, J. & Takeuchi, I. Artificial intelligence for search and discovery of quantum materials. Commun. Mater. 2, 105 (2021).
Article Google Scholar
Miller, S. A. et al. Capturing Anharmonicity in a Lattice Thermal Conductivity Model for High-Throughput Predictions. Chem. Mater. 29, 2494 (2017).
Article CAS Google Scholar
Gomes, C. P., Selman, B. & Gregoire, J. M. Artificial intelligence for materials discovery. MRS Bull. 44, 538 (2019).
Article Google Scholar
Zhang, Q., Uchaker, E., Candelaria, S. L. & Cao, G. Nanomaterials for energy conversion and storage. Chem. Soc. Rev. 42, 3127 (2013).
Article CAS Google Scholar
Christian Enger, B., Lødeng, R. & Holmen, A. A review of catalytic partial oxidation of methane to synthesis gas with emphasis on reaction mechanisms over transition metal catalysts. Appl. Catal. A Gen. 346, 1 (2008).
Article Google Scholar
Wu, W. et al. Preparation and thermal conductivity enhancement of composite phase change materials for electronic thermal management. Energy Convers. Manag. 101, 278 (2015).
Article CAS Google Scholar
Pollock, T. M. Alloy design for aircraft engines. Nat. Mater. 15, 809 (2016).
Article CAS Google Scholar
Knoop, F., Purcell, T. A. R., Scheffler, M. & Carbogno, C. Anharmonicity measure for materials. Phys. Rev. Mater. 4, 083809 (2020).
Article CAS Google Scholar
Broido, D. A., Malorny, M., Birner, G., Mingo, N. & Stewart, D. A. Intrinsic lattice thermal conductivity of semiconductors from first principles. Appl. Phys. Lett. 91, 231922 (2007).
Article Google Scholar
Carbogno, C., Ramprasad, R. & Scheffler, M. Ab Initio Green-Kubo Approach for the Thermal Conductivity of Solids. Phys. Rev. Lett. 118, 175901 (2017).
Article Google Scholar
Carrete, J., Li, W., Mingo, N., Wang, S. & Curtarolo, S. Finding Unprecedentedly Low-Thermal-Conductivity Half-Heusler Semiconductors via High-Throughput Materials Modeling. Phys. Rev. X 4, 11019 (2014).
CAS Google Scholar
Seko, A. et al. Prediction of Low-Thermal-Conductivity Compounds with First-Principles Anharmonic Lattice-Dynamics Calculations and Bayesian Optimization. Phys. Rev. Lett. 115, 205901 (2015).
Article Google Scholar
Xia, Y. Revisiting lattice thermal transport in PbTe: The crucial role of quartic anharmonicity. Appl. Phys. Lett. 113, 073901 (2018).
Article Google Scholar
Zhu, T. et al. Charting lattice thermal conductivity for inorganic crystals and discovering rare earth chalcogenides for thermoelectrics. Energy Environ. Sci. 14, 3559 (2021).
Article CAS Google Scholar
Springer Materials. http://materials.springer.com.
Zhang, Y. & Ling, C. A strategy to apply machine learning to small datasets in materials science. npj Comput. Mater. 4, 25 (2018).
Article CAS Google Scholar
Chen, L., Tran, H., Batra, R., Kim, C. & Ramprasad, R. Machine learning models for the lattice thermal conductivity prediction of inorganic materials. Comput. Mater. Sci. 170, 109155 (2019).
Article CAS Google Scholar
Slack, G.A. The Thermal Conductivity of Nonmetallic Crystals. In Solid State Phys. - Adv. Res. Appl., vol. 34, 1–71 (Academic Press, 1979).
Yan, J. et al. Material descriptors for predicting thermoelectric performance. Energy Environ. Sci. 8, 983 (2015).
Article Google Scholar
Toberer, E. S., Zevalkink, A. & Snyder, G. J. Phonon engineering through crystal chemistry. J. Mater. Chem. 21, 15843 (2011).
Article CAS Google Scholar
Wang, Y., Wagner, N. & Rondinelli, J. M. Symbolic regression in materials science. MRS Commun. 9, 793–805 (2019).
Article Google Scholar
Purcell, T. A. R., Scheffler, M., Carbogno, C. & Ghiringhelli, L. M. SISSO++: A C++ Implementation of the Sure-Independence Screening and Sparsifying Operator Approach. J. Open Source Softw. 7, 3960 (2022).
Article Google Scholar
Schleder, G. R., Acosta, C. M. & Fazzio, A. Exploring Two-Dimensional Materials Thermodynamic Stability via Machine Learning. ACS Appl. Mater. Interfac. 12, 20149 (2020).
Article CAS Google Scholar
Han, Z.-K. et al. Single-atom alloy catalysts designed by first-principles calculations and artificial intelligence. Nat. Commun. 12, 1833 (2021).
Article CAS Google Scholar
Pilania, G., Iverson, C. N., Lookman, T. & Marrone, B. L. Machine-Learning-Based Predictive Modeling of Glass Transition Temperatures: A Case of Polyhydroxyalkanoate Homopolymers and Copolymers. J. Chem. Inf. Model. 59, 5013 (2019).
Article CAS Google Scholar
Morelli, D. T. & Slack, G. A. High Lattice Thermal Conductivity Solids. In High Therm. Conduct. Mater., 37–68 (Springer, New York, NY, New York, 2006).
Slack, G. A. Thermal Conductivity of MgO, Al₂O₃, MgAl₂O₄, and Fe₃O₄ Crystals from 3^∘ to 300^∘K. Phys. Rev. 126, 427–441 (1962).
Article CAS Google Scholar
Martin, J. Thermal conductivity of Mg₂Si, Mg₂Ge and Mg₂Sn. J. Phys. Chem. Solids 33, 1139–1148 (1972).
Article CAS Google Scholar
Takahashi, T. & Kikuchi, T. Porosity dependence on thermal diffusivity and thermal conductivity of lithium oxide Li₂O from 200 to 900^∘C. J. Nucl. Mater. 91, 93–102 (1980).
Article CAS Google Scholar
Turkes, P., Pluntke, C. & Helbig, R. Thermal conductivity of SnO₂ single crystals. J. Phys. C. Solid State Phys. 13, 4941–4951 (1980).
Article Google Scholar
Gerlich, D. & Andersson, P. Temperature and pressure effects on the thermal conductivity and heat capacity of CsCl, CsBr and CsI. J. Phys. C. Solid State Phys. 15, 5211 (1982).
Article CAS Google Scholar
Williams, R. K., Graves, R. S. & McElroy, D. L. Thermal Conductivity of Cr2O₃ in the Vicinity of the Neel Transition. J. Am. Ceram. Soc. 67, C–151 (2006).
Article Google Scholar
Valeri-Gil, M. & Rincón, C. Thermal conductivity of ternary chalcopyrite compounds. Mater. Lett. 17, 59 (1993).
Article CAS Google Scholar
Morelli, D. T. et al. Low-temperature transport properties of p -type CoSb₃. Phys. Rev. B 51, 9622–9628 (1995).
Article CAS Google Scholar
Hohl, H. et al. Efficient dopants for ZrNiSn-based thermoelectric materials. J. Phys. Condens. Matter 11, 1697–1709 (1999).
Article CAS Google Scholar
Young, D. P., Khalifah, P., Cava, R. J. & Ramirez, A. P. Thermoelectric properties of pure and doped FeMSb (M=V,Nb). J. Appl. Phys. 87, 317–321 (2000).
Article CAS Google Scholar
Li, J.-G., Ikegami, T. & Mori, T. Fabrication of transparent Sc2O₃ ceramics with powders thermally pyrolyzed from sulfate. J. Mater. Res. 18, 1816–1822 (2003).
Article CAS Google Scholar
Kawaharada, Y., Kurosaki, K., Muta, H., Uno, M. & Yamanaka, S. High temperature thermoelectric properties of CoTiSb half-Heusler compounds. J. Alloy. Compd. 384, 308–311 (2004).
Article CAS Google Scholar
Víllora, E. G., Shimamura, K., Yoshikawa, Y., Ujiie, T. & Aoki, K. Electrical conductivity and carrier concentration control in β-Ga₂O₃ by Si doping. Appl. Phys. Lett. 92, 202120 (2008).
Article Google Scholar
Toher, C. et al. High-throughput computational screening of thermal conductivity, Debye temperature, and Grüneisen parameter using a quasiharmonic Debye model. Phys. Rev. B 90, 174107 (2014).
Article Google Scholar
Lu, Y. et al. Fabrication of thermoelectric CuAlO₂ and performance enhancement by high density. J. Alloy. Compd. 650, 558 (2015).
Article CAS Google Scholar
Huang, W. et al. Investigation of thermodynamics properties of chalcopyrite compound CdGeAs₂. J. Cryst. Growth 443, 8 (2016).
Article CAS Google Scholar
Pantian, S., Sakdanuphab, R. & Sakulkalavek, A. Enhancing the electrical conductivity and thermoelectric figure of merit of the p-type delafossite CuAlO₂ by Ag₂O addition. Curr. Appl. Phys. 17, 1264 (2017).
Article Google Scholar
Xia, Y. et al. High-Throughput Study of Lattice Thermal Conductivity in Binary Rocksalt and Zinc Blende Compounds including Higher-Order Anharmonicity. Phys. Rev. X 10, 041029 (2020).
CAS Google Scholar
Knoop, F., Purcell, T. A. R., Scheffler, M. & Carbogno, C. Anharmonicity in Thermal Insulators - An Analysis from First Principles. Phys. Rev. Lett. 130, 236301 (2023).
Article Google Scholar
Foppa, L., Purcell, T. A., Levchenko, S. V., Scheffler, M. & Ghringhelli, L. M. Hierarchical Symbolic Regression for Identifying Key Physical Parameters Correlated with Bulk Properties of Perovskites. Phys. Rev. Lett. 129, 55301 (2022).
Article CAS Google Scholar
Sobol’, I. M. Sensitivity estimates for nonlinear mathematical models. Math. Model. Comput. Exp. 1, 407–414 (1993).
Google Scholar
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In (eds Guyon, I. et al.) Advances in Neural Information Processing Systems 30, 4765–4774 (Curran Associates, Inc., 2017).
Kucherenko, S., Tarantola, S. & Annoni, P. Estimation of global sensitivity indices for models with dependent variables. Comput. Phys. Commun. 183, 937 (2012).
Article CAS Google Scholar
Marelli, S. & Sudret, B.UQLab: A Framework for Uncertainty Quantification in Matlab, 2554–2563 (American Society of Civil Engineers, Reston, VA, 2014).
Wiederkehr, P.Global Sensitivity Analysis with Dependent Inputs. Ph.D. thesis, (ETH Zurich, 2018).
Razavi, S. et al. The Future of Sensitivity Analysis: An essential discipline for systems modeling and policy support. Environ. Model. Softw. 137, 104954 (2021).
Article Google Scholar
Aas, K., Jullum, M. & Løland, A. Explaining individual predictions when features are dependent: More accurate approximations to shapley values. Artif. Intell. 298, 103502 (2021).
Article Google Scholar
Ribeiro, M. T., Singh, S. & Guestrin, C. “why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13–17, 2016, 1135–1144 (2016).
Roder, J., Maguire, L., Georgantas, R. & Roder, H. Explaining multivariate molecular diagnostic tests via shapley values. BMC Med. Inform. Decis. Mak. 21, 1–18 (2021).
Article Google Scholar
Lee, Y. G., Oh, J. Y., Kim, D. & Kim, G. Shap value-based feature importance analysis for short-term load forecasting. J. Electr. Eng. Technol. 18, 579–588 (2022).
Article Google Scholar
Ittner, J., Bolikowski, L., Hemker, K. & Kennedy, R. Feature synergy, redundancy, and independence in global model explanations using shap vector decomposition. Preprint at https://arxiv.org/abs/2107.12436v1 (2021).
Nohara, Y., Matsumoto, K., Soejima, H. & Nakashima, N. Explanation of machine learning models using improved shapley additive explanation. In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB ’19, 546 (Association for Computing Machinery, New York, NY, USA, 2019).
Peierls, R. E. Quantum theory of solids (Oxford University Press, 1955).
Jain, A. et al. Commentary: The materials project: A materials genome approach to accelerating materials innovation. APL Mater. 1, 011002 (2013).
Article Google Scholar
Mehl, M. J. et al. The AFLOW Library of Crystallographic Prototypes: Part 1. Comput. Mater. Sci. 136, S1 (2017).
Article CAS Google Scholar
Hicks, D. et al. The AFLOW Library of Crystallographic Prototypes: Part 2. Comput. Mater. Sci. 161, S1 (2019).
Article CAS Google Scholar
Duan, C., Liu, F., Nandy, A. & Kulik, H. J. Putting Density Functional Theory to the Test in Machine-Learning-Accelerated Materials Discovery. J. Phys. Chem. Lett. 12, 4628 (2021).
Article CAS Google Scholar
Ouyang, R., Curtarolo, S., Ahmetcik, E., Scheffler, M. & Ghiringhelli, L. M. SISSO: A compressed-sensing method for identifying the best low-dimensional descriptor in an immensity of offered candidates. Phys. Rev. Mater. 2, 83802 (2018).
Article CAS Google Scholar
Foppa, L. et al. Materials genes of heterogeneous catalysis from clean experiments and artificial intelligence. MRS Bull. 46, 1016 (2021).
Article CAS Google Scholar
Purcell, T.A., Scheffler, M. & Ghiringhelli, L.M. Recent advances in the sisso method and their implementation in the sisso++ code.Preprint at https://arxiv.org/abs/2305.01242 (2023).
Ghiringhelli, L. M. et al. Learning physical descriptors for materials science by compressed sensing. N. J. Phys. 19, 023017 (2017).
Article Google Scholar
Johnson, S. G. The NLopt nonlinear-optimization package. http://github.com/stevengj/nlopt.
Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825 (2011).
Google Scholar
Ouyang, R., Ahmetcik, E., Carbogno, C., Scheffler, M. & Ghiringhelli, L. M. Simultaneous learning of several materials properties from incomplete databases with multi-task SISSO. J. Phys. Mater. 2, 24002 (2019).
Article Google Scholar
Hicks, D. et al. AFLOW-XtalFinder: a reliable choice to identify crystalline prototypes. npj Comput. Mater. 7, 30 (2021).
Article CAS Google Scholar
Togo, A. & Tanaka, I. First principles phonon calculations in materials science. Scr. Mater. 108, 1 (2015).
Article CAS Google Scholar
Pässler, R. Basic moments of phonon density of states spectra and characteristic phonon temperatures of group IV, III-V, and II-VI materials. J. Appl. Phys. 101, 093513 (2007).
Article Google Scholar
Zacharias, M. & Giustino, F. One-shot calculation of temperature-dependent optical spectra and phonon-induced band-gap renormalization. Phys. Rev. B 94, 75125 (2016).
Article Google Scholar
Dove, M. Introduction to lattice dynamics (Cambridge University Press, 1993).
Blum, V. et al. Ab initio molecular simulations with numeric atom-centered orbitals. Comput. Phys. Commun. 180, 2175 (2009).
Article CAS Google Scholar
Lenz, M.-O. et al. Parametrically constrained geometry relaxations for high-throughput materials science. npj Comput. Mater. 5, 123 (2019).
Article Google Scholar
Knoop, F., Purcell, T. A. R., Scheffler, M. & Carbogno, C. FHI-vibes: Ab Initio Vibrational Simulations. J. Open Source Softw. 5, 2671 (2020).
Article Google Scholar
Krstajic, D., Buturovic, L. J., Leahy, D. E. & Thomas, S. Cross-validation pitfalls when selecting and assessing regression and classification models. J. Cheminform. 6, 10 (2014).
Article Google Scholar
Mara, T. A., Tarantola, S. & Annoni, P. Non-parametric methods for global sensitivity analysis of model output with dependent inputs. Environ. Model. Softw. 72, 173 (2015).
Article Google Scholar
Ravichandran, N. K. & Broido, D. Unified first-principles theory of thermal properties of insulators. Phys. Rev. B 98, 085205 (2018).
Article CAS Google Scholar
Purcell, T. A., Scheffler, M., Ghiringhelli, L. M. & Carbogno, C. Thermal Conductivity Screening Data https://doi.org/10.17172/NOMAD/2022.04.27-1 (2022).
Purcell, T. A., Scheffler, M., Ghiringhelli, L. M. & Carbogno, C. Accelerating Materials-Space Exploration for Thermal Insulators by Mapping Materials Properties via Artificial Intelligence: Figures https://doi.org/10.6084/m9.figshare.22068749.v4 (2023).

Download references

Acknowledgements

T.A.R.P. thanks Florian Knoop for valuable discussions and providing scripts for the ab initio Green-Kubo analysis. This work was funded by the NOMAD Center of Excellence (European Union’s Horizon 2020 research and innovation program, grant agreement N${}^{\underline{{{{\rm{o}}}}}}$ 951786), the ERC Advanced Grant TEC1p (European Research Council, grant agreement N${}^{\underline{{{{\rm{o}}}}}}$ 740233), and the project FAIRmat (FAIR Data Infrastructure for Condensed-Matter Physics and the Chemical Physics of Solids, German Research Foundation, project N${}^{\underline{{{{\rm{o}}}}}}$ 460197019). T.A.R.P. would like to thank the Alexander von Humboldt (AvH) Foundation for their support through the AvH Postdoctoral Fellowship Program. This research used resources of the Max Planck Computing and Data Facility and the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

The NOMAD Laboratory at Fritz-Haber-Institut der Max-Planck-Gesellschaft and IRIS-Adlershof of the Humboldt-Universität zu Berlin, Faradayweg 4-6, D-14195, Berlin, Germany
Thomas A. R. Purcell, Matthias Scheffler, Luca M. Ghiringhelli & Christian Carbogno
Physics Department and IRIS Adlershof Humboldt Universität zu Berlin, Berlin, Germany
Matthias Scheffler & Luca M. Ghiringhelli

Authors

Thomas A. R. Purcell
View author publications
You can also search for this author in PubMed Google Scholar
Matthias Scheffler
View author publications
You can also search for this author in PubMed Google Scholar
Luca M. Ghiringhelli
View author publications
You can also search for this author in PubMed Google Scholar
Christian Carbogno
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

T.A.R.P. implemented all methods and performed all calculations. T.A.R.P. and C.C. ideated the workflow. M.S., L.M.G., and C.C. supervised the project. All authors analyzed the data and wrote the paper.

Corresponding authors

Correspondence to Thomas A. R. Purcell, Luca M. Ghiringhelli or Christian Carbogno.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Purcell, T.A.R., Scheffler, M., Ghiringhelli, L.M. et al. Accelerating materials-space exploration for thermal insulators by mapping materials properties via artificial intelligence. npj Comput Mater 9, 112 (2023). https://doi.org/10.1038/s41524-023-01063-y

Download citation

Received: 08 June 2022
Accepted: 09 June 2023
Published: 24 June 2023
DOI: https://doi.org/10.1038/s41524-023-01063-y

This article is cited by

Application of Artificial Intelligence in Aerospace Engineering and Its Future Directions: A Systematic Quantitative Literature Review
- Kamal Hassan
- Amit Kumar Thakur
- Rajesh Singh
Archives of Computational Methods in Engineering (2024)
Using orbital sensitivity analysis to pinpoint the role of orbital interactions in thermoelectric power factor
- Wenhao Zhang
- Jean-François Halet
- Takao Mori
npj Computational Materials (2023)