Introduction

Polymers are extensively used in industry and daily life, owing to various advantages of chemical inertness, mechanical flexibility, and lightweight1. As the organic electronics are becoming smaller while the power density keeps increasing, the thermal management and heat dissipation capability have attracted significant attention2,3. However, conventional polymers are thermal insulators with reported thermal conductivity (TC) in the range from 0.1 to 0.5 W m−1 K−1, preventing the development of organic electronics4. Polymers with high TC are urgently demanded in organic energy storage and electronic devices to accommodate revolutionary innovations in organic electronics and optoelectronics5. The polymer morphology and topology were found to be closely related to TC6. Increasing the crystallite orientation and crystallinity can significantly reduce the phonon scattering and enhance the TC along the chain directions, which has been demonstrated by both experiments7,8,9,10,11,12 and theoretical simulations13,14,15. A recent study has fabricated polyethylene (PE) films by disentanglement and alignment of amorphous chains with a metal-like TC of 62 W m−1 K−1, over two orders of magnitude greater than that of classical amorphous polymers7. Moreover, molecular dynamics (MD) simulations have suggested that individual crystalline PE chains have a very high or even divergent TC16. These findings provide opportunities for solving the heat dissipation problem of polymer devices.

Intra-chain atomic interactions are usually much stronger than inter-chain interactions in polymers, so enhancing intra-chain thermal transport is essential to improving their TC. Experimental techniques such as micro-mechanical stretching7,8, electrostatic spinning9,10, and nanoscale templating11,12 are effective in improving the crystallinity of polymers and obtaining more consistent chain orientation, increasing the TC of amorphous polymers by 1–3 orders of magnitude. MD simulations employing strategies such as applying mechanical strains17,18, parallel-linking of chains19, and modulation of dihedral energy20 suggest that ordered chains and a large radius of gyration (Rg) are favorable for high TC. According to Debye’s theory, the TC of a polymer \(k\) can be expressed in terms of the phonon group velocity \({v}_{g}\), mean free path \(l\), and volumetric heat capacity \({C}_{v}\), i.e., \(k={v}_{g}{C}_{v}l\). In general, the \({v}_{g}\) and \({C}_{v}\) of polymers are mainly determined by the characteristics of the repeating units and the strength of the backbone bonding in the individual chain21. Thus, it is possible to realize high-TC polymers by adjusting the repeating unit of the polymer chains, which also facilitates understanding of the heat transport mechanisms along the chain for polymers with different hierarchical structures.

Although the chain structure of polymers strongly influences their thermal characteristics, the polymer library is very large, with as many as \({10}^{8}\) monomeric organic molecules known to exist in chemical space22. Current research on the TC of polymers is still an Edisonian process, guided by intuition or experience in a trial-and-error approach that is time-consuming and expensive23. Most studies are conducted on simple structures such as PE5,7,16, which makes it difficult to grasp general rules for the factors affecting the TC of polymers and to discover polymer molecular structures with high TC in the huge chemical space.

The field of polymer informatics24, associated with the development of artificial intelligence and machine learning (ML) methods, attempts to utilize the data-driven centric method for physical property regulation or device development of organic materials to resolve the conflict between structural freedom and efficiency/cost in the traditional trial-and-error approach. The research on polymer informatics has attracted extensive attention and succeeded in recent years25,26,27, involving the prediction of organic optical28,29,30, electrical31,32,33, and thermal properties34,35,36,37,38. Particularly, several efforts emerged in the search or design of structures with high TC as related to crystalline polymers36, amorphous polymers37,38, and copolymers39. Most of these studies have employed graph descriptors37 or polymer chemistry fragment statistics36,38,39 to describe monomer structures in informatics algorithms, also called fingerprints or representations. The graph descriptors generated rely on molecular/monomer graph information, formulated by knowledge domain feature engineering40 or by attempting to form general descriptors41. Moreover, descriptors such as molecular access systems (MACCS)42 are obtained through statistics of different chemical fragments, and are closely related to molecular graphs. Subsequently, they are collectively referred to as graph descriptors. Fingerprints are required for the unique, complete, minimal representation of each candidate, and successful fingerprints are a challenging task43. Besides, polymers are composed of many repeating units, which are more complex than organic small molecules and require accurate capture of information on monomer connection sites32. 
Graph descriptors have long been applied and validated in the development of drug-like small molecules44, and the availability of open-source toolkits such as RDKit45 and Mol2vec40 has facilitated their accessibility, which is also one reason that graph descriptors are popular in polymer informatics. However, the graph descriptor is in the form of a string of numeric vectors. The completeness of the molecular structure determines the coupling association between the digits. Hence, the relationship between molecular monomers and material properties is difficult to grasp.

Exploring the ensemble of physically independent descriptors for the representation of molecular structures is important in qualitative structure-property relationship modeling and enables more intuitive guidelines for molecular structure evaluation46. Feature engineering for the collection and reduction of physical descriptors are critical steps in determining effective capabilities in polymer informatics. The development of automatic, universal and efficient tools for the calculation of descriptors of organic molecules is of interest to researchers, which translates the chemical information encoded in the symbolic representation of molecules into useful numbers or some standardized experimental results47. Several open-source and commercial software47,48,49 are available to calculate various types of molecular descriptors such as carbon atomic number, molecular weight, and Extended Topochemical Atom50, which have been successfully applied in organic chemistry synthesis51, molecular antibacterial activity prediction52, and so on. In addition, the parameter conditions in experiments or simulations affect the molecular properties. For instance, the force-field-inspired descriptors such as types of bond, angle, and dihedral have been validated for the prediction of the specific heat of polymers, even if the datasets are from experiments35. The dimensionality reduction of polymer features is another concern, as some descriptors may have little relevance to the target property, and a low-dimensional descriptor space is much easier to build up for the ML model53. Feature extraction and selection are the dominant approaches to reduce the dimensionality of features. Feature extraction creates subsets from the original data space, such as principal component analysis (PCA), where the specific meaning of the new features obtained is difficult to understand54. 
Feature selection retains the physical meaning of individual descriptors, while filters based on correlation evaluation have dependencies on mathematical models, like the Pearson and Spearman coefficients that consider the linear and monotonic relationships of the data, respectively55. Further, the filter methods do not involve ML models, which may lead to the inapplicability of the gained features. The wrapper-based feature selection techniques combine ML models to eliminate redundant features, including recursive feature elimination (RFE), sequential feature selection (SFS), and exhaustive feature selection56. Testing different subsets of descriptors for informatics algorithms is the crucial feature of the wrapper approaches, and the key is the strategy of combining different descriptors. Typical RFE seeks to improve model performance by continuously reducing the low-impact features from the remaining features in iteratively constructed ML models, which refer to the ranking of feature weights assigned by models such as random forests54. Thus, the RFE relies on the feature weight evaluation mechanism of the ML models.

Herein, focusing on the challenges of polymer monomer representation and feature selection, we propose an interpretable ML framework integrated with high-throughput MD simulations for the discovery of polymer structures with high TC, as illustrated in Fig. 1. It consists of four components: 1) polymer library construction; 2) MD simulation of the TC of polymers; 3) monomer feature representation and hierarchical down-selection; 4) ML model construction for TC prediction. The training data were collected from the literature57,58, and candidates from the PoLyInfo59 and PI1M60 databases were used for the virtual screening of high-TC structures. All polymer monomers were identified by SMILES (simplified molecular input line entry system) strings and were replicated to form one-dimensional polymer chains. The TC of the training datasets was calculated by MD simulations with the second generation of the general AMBER force field (GAFF2)61. Inspired by drug-like molecular representation and molecular force fields, we obtained 320 physical descriptors through Mordred47 calculations and force field parameter file extraction, and retained 20 optimized descriptors by hierarchical down-selection. We then separately trained the tree-based random forest (RF) and extreme gradient boosting (XGBoost) models and a multilayer perceptron (MLP) neural network model to establish the relationship between the optimized descriptors and the TC of these benchmark polymer datasets. Further, we analyzed the feature importance of each optimized descriptor and extracted chemical heuristics for the design of high-TC polymers through SHAP analysis62. Using the trained ML models, 107 promising polymers with TC greater than 20.00 W m−1 K−1 were identified, which were then used in symbolic regression to derive mathematical formulas expressing the TC of promising polymers.
Last, we discussed the thermal transport mechanisms of polymer chains and analyzed the intra-chain thermal transport linkages of polymers with different hierarchical structures. Overall, the proposed approach is beneficial for theoretical or experimental investigations of high TC polymers.

Fig. 1: Schematics of high-throughput screening of polymers with high TC via interpretable machine learning.

The framework is implemented in four components: 1) polymer library construction; 2) MD simulation of the TC of polymers; 3) monomer feature representation and hierarchical down-selection; 4) ML model construction for TC prediction.

Results

Distribution of polymer datasets in chemical space

Polymer data from the literature57,58 were used as the benchmark database for training the ML models, and the PoLyInfo59 and PI1M60 databases were used for the virtual screening of polymer structures with high TC. The polymers are classified into 19 classes, such as polyolefins, polyethers, and polyamides, according to their elements and chemical functional groups63. To validate the distribution of the 1735 selected benchmark data relative to the other two datasets, their chemical structures were visualized in 2D space by uniform manifold approximation and projection (UMAP)64, where the chemical structure of each monomer was transformed into a 1024-dimensional Morgan fingerprint vector41 with a radius of two atoms. The polymer structures in the selected benchmark dataset (Fig. 2a) are well covered by the chemical space distribution of those in the PoLyInfo (Fig. 2b) and PI1M (Fig. 2c) databases. Note that the PI1M dataset was generated by a recurrent neural network generative model trained with data from PoLyInfo; it fills the sparse regions of the chemical space of the PoLyInfo dataset while keeping a consistent distribution60. Thus, the ML models trained with the selected data are able to learn the chemical features of all candidates and can be effectively adopted for the virtual screening of polymer structures with high TC. In addition, the distribution of polymer TC in the benchmark dataset is shown in Supplementary Fig. 1. The TC spans a wide range: most of the polymers have TC less than 10 W m−1 K−1, and only a few have TC greater than 30 W m−1 K−1 (inset in Supplementary Fig. 1a). This unbalanced data distribution makes the discovery of high-TC polymer structures difficult. To improve the generalization of the ML models across the entire TC range, our learning problem was framed on a logarithmic scale, i.e., log2TC was used as the target property for the ML models65.
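As a small illustration of the logarithmic target, the sketch below converts TC values to log2TC and back. The function names are hypothetical; the 38.98 W m−1 K−1 value for PE and the 20 W m−1 K−1 screening threshold are the values quoted in this work.

```python
import math


def tc_to_log2(tc_w_per_mk: float) -> float:
    """Map a thermal conductivity (W m^-1 K^-1) to the log2 learning target."""
    if tc_w_per_mk <= 0:
        raise ValueError("TC must be positive to take a logarithm")
    return math.log2(tc_w_per_mk)


def log2_to_tc(log2_tc: float) -> float:
    """Invert the transform to recover TC in W m^-1 K^-1."""
    return 2.0 ** log2_tc


# PE value reported in this work and the high-TC screening threshold.
pe_log2 = tc_to_log2(38.98)
threshold_log2 = tc_to_log2(20.0)
```

Because the target is logarithmic, a unit error in the prediction corresponds to a factor of two in TC, which keeps the few high-TC polymers from dominating the loss.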

Fig. 2: Visualization of polymer data distribution in a 2D space by UMAP.

a, b and c correspond to the selected, PoLyInfo, and PI1M datasets, respectively.

Polymer descriptors hierarchical down-selection and ML Models Training

Polymer descriptors are hierarchically down-selected in three stages: removal of features with low variance, primary filtering based on different correlation coefficients, and final selection assisted by the ML model (shown in Supplementary Note 2). The initial set of monomer physical descriptors is composed of 286 Mordred-based and 34 MD-inspired descriptors, which are listed in Supplementary Note 3. The removal of low-variance descriptors eliminates descriptors with variance below a specific threshold, whose contribution to the target property (log2TC in this work) is considered nearly the same for all polymer data. With the variance threshold set to 0.10, 264 descriptors were retained for the next stage. For further primary filtering of the descriptors, we established a weight assignment mechanism based on different correlation coefficients, because their mathematical models emphasize different aspects of the data. The Pearson, Spearman, and Distance coefficients are used to evaluate linear, monotonic, and non-linear relationships between data, respectively, while the maximum information coefficient (MIC) reflects the association of two variables through information entropy, whether linear or nonlinear. The reliability of MIC depends on the sample size, and its value is reliable only for large datasets. The four coefficients (Pearson, Spearman, Distance, and MIC) were combined, each with a weighting factor of 0.25, and the thresholds were set to 0.05, 0.05, 0.153, and 0.132, respectively. The 53 descriptors with a cumulative weight value of 1 were retained through VAM. Random sequential feature selection (RFSF) combined with the RF model was then developed to determine the optimized descriptors.
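The weighted correlation filter described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses only the Pearson and Spearman coefficients (each with an assumed weight of 0.5 instead of four coefficients at 0.25), omits the Distance and MIC coefficients, and all function names and toy data are hypothetical. A descriptor is kept only when its cumulative weight reaches 1, i.e., when it passes every threshold.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def vote_filter(descriptors, target, metrics):
    """Keep descriptors whose summed weights over passed thresholds equal 1."""
    kept = []
    for name, column in descriptors.items():
        total = sum(w for func, thr, w in metrics if abs(func(column, target)) >= thr)
        if math.isclose(total, 1.0):
            kept.append(name)
    return kept


# Hypothetical toy data: two informative descriptors and one uncorrelated one.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
descriptors = {
    "up": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "down": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
    "uncorrelated": [1.0, 0.0, 0.0, 0.0, 0.0, 1.0],
}
metrics = [(pearson, 0.05, 0.5), (spearman, 0.05, 0.5)]  # (function, threshold, weight)
kept = vote_filter(descriptors, target, metrics)
```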
Considering all possible combinations of descriptors for ML model training is time-consuming and expensive, so traditional SFS usually leads to sub-optimal solutions, where the recommended ensemble of optimized descriptors is not unique, and is influenced by the input order of the descriptors66. Here, we disrupted the order of the input descriptors before each run, then combined them with 100 RF model training cycles, and acquired the final optimized descriptors based on a statistical approach. The threshold value depends on the occurrence times of the descriptors in 100 RF model training runs, and descriptors with a frequency larger than the threshold value were retained. We measured the performance of RF models trained by different descriptor ensembles with thresholds ranging from 0.39 to 0.32 in Supplementary Fig. 3, separately. By balancing the mean-square error (MSE) of ML model and the number of descriptors, 20 optimized descriptors were finally selected with a threshold of 0.34. The results of the optimized descriptors based on VAM and RSFS are shown in Fig. 3a, and their detailed descriptions are listed in Supplementary Note 4. Moreover, Fig. 3e exhibits the Pearson correlation matrices of the correlations among optimized descriptors (Other metrics, see Supplementary Fig. 2). It is found that most descriptors are positively correlated with each other and negatively correlated with TC. Only three descriptors are positive for TC, two of which are MD-inspired descriptors. For example, the descriptor MW_ratio reflects the ratio of the molecular weight of the mainchain to the molecular weight of the monomer, with values between 0 and 1. The MW_ratio of 1 indicates that the polymer is without side chains, which reduces the loss of heat flux along the chain and makes it possible to get large TC.
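The shuffle-and-count idea behind the randomized sequential selection can be sketched as below. This is a minimal illustration under stated assumptions: the `score` callable is a hypothetical stand-in for the (negated) error of an RF model trained on a descriptor subset, the greedy pass is a simple order-dependent forward selection, and the toy descriptors are invented. Only the 100 runs and the 0.34 frequency threshold come from the text.

```python
import random
from collections import Counter


def forward_pass(features, score, rng):
    """One order-dependent forward selection over a shuffled feature list."""
    order = list(features)
    rng.shuffle(order)
    selected, best = [], score([])
    for feature in order:
        candidate = score(selected + [feature])
        if candidate > best:  # keep the feature only if the score improves
            selected.append(feature)
            best = candidate
    return selected


def randomized_selection(features, score, n_runs=100, freq_threshold=0.34, seed=0):
    """Repeat shuffled passes and keep features selected often enough."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_runs):
        counts.update(forward_pass(features, score, rng))
    kept = [f for f in features if counts[f] / n_runs > freq_threshold]
    return kept, counts


# Hypothetical toy problem: "b" and "b_dup" carry the same information,
# so which of them is selected depends on the input order.
INFO = {"a": "A", "b": "B", "b_dup": "B", "noise": None}


def toy_score(subset):
    """Reward distinct information, penalize subset size."""
    distinct = {INFO[f] for f in subset if INFO[f] is not None}
    return len(distinct) - 0.1 * len(subset)


kept, counts = randomized_selection(list(INFO), toy_score)
```

Counting selections over many shuffled runs removes the dependence on any single input order: a redundant pair splits the selections between its members, while an uninformative descriptor is never selected.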

Fig. 3: Polymer descriptors down-selection and ML models training.

a Optimized descriptors acquired by down-selection with four coefficients (Pearson, Spearman, Distance, and MIC) and the RF model. b Accuracy of the RF model based on the optimized descriptors, where the training R2 is 0.875 and the test R2 is 0.844. c Mean-square error (MSE) of ML models at different down-selection stages, including the initial (Init.), mathematical correlation coefficient screening (Cor.), and RF model optimization (Opt.) stages. A PCA approach was additionally applied for comparison. d MSE of ML models with different polymer representation approaches. The violin plot represents the distribution of values, individual subsamples are shown in gray, and the mean and standard deviation of the MSE are shown in black. e Pearson correlation matrices showing correlations among the optimized descriptors and TC. The inset shows the distribution of the Pearson coefficients.

Figure 3b shows the results of the RF model trained with the optimized descriptors, with training and test R2 of 0.87 and 0.84, respectively. To verify the extensibility of the optimized descriptors, XGBoost and MLP models were also trained (see Supplementary Fig. 4). The training and test R2 are 0.95 and 0.87 for XGBoost and 0.81 and 0.88 for MLP, respectively, which is comparable to or even better than the RF model. Therefore, these three models are used in the subsequent discussion.

The prediction accuracy of ML models at different down-selection stages is illustrated in Fig. 3c (training and test data set prediction in Supplementary Fig. 5). The extra PCA with more than 95% variance was performed to compare with RFSF technology. According to the relationship between the number of principal components and the cumulative variance in Supplementary Fig. 6, at least 19 components are required to exceed 95% variance. It is close to the number of sets of optimized descriptors. As seen in Fig. 3c, the tree-based models of the RF and XGBoost maintain relatively low MSE and high accuracy R2 (See Supplementary Fig. 8) even with large descriptor dimensions because of their strong ability to prevent overfitting of the data. Moreover, the feature down-selection process is usually accompanied by the loss of information, which results in a decrease of model accuracy. However, the feature down-selection process also reduces the redundancy between data which suppresses the overfitting and improves the accuracy of the MLP model. Overall, the accuracy of all three models trained with the optimized descriptors from RFSF is higher than that of the models trained with the PCA-derived descriptors, which demonstrates the effectiveness of our approach.

ML models trained with different graph descriptors are compared in Fig. 3d (training and test data set predictions in Supplementary Fig. 7). Mol2vec40 is an unsupervised ML approach that learns vector representations of molecular substructures and requires a benchmark dataset for molecular structure training. Here, the pre-trained polymer embedding model was taken from a previous study60, where it was created using the PoLyInfo and PI1M datasets. The MACCS42 descriptor is a structural key-based descriptor with a 166-bit keyset. The Morgan and Morgan count (cMorgan)41 descriptors are extended connectivity fingerprints that capture molecular features relevant to molecular activity. The results in Fig. 3d and Supplementary Fig. 8 reflect the superiority of the ML models trained with the optimized descriptors for all three models (RF, XGBoost, and MLP). The down-selection of physical descriptors examines individual and combined descriptors in relation to TC, while the graph descriptors aim to represent molecular/monomeric information as completely as possible. Although the elements or groups in the molecular graph have been shown to correlate with the TC of polymer chains36, it is more intuitive and effective to predict the log2TC of polymer chains using the associated physical descriptors. This advantage is not absolute, however, since the TC is also related to parameters such as chain stiffness67. We also evaluated ML models with a hybrid descriptor set composed of the optimized descriptors and one of the graph descriptors in Supplementary Fig. 9a and 9b. The ML models trained with hybrid descriptors perform only slightly better than, or comparably to, those trained with the optimized descriptors alone, which indicates that the optimized descriptors cover relatively complete information about the polymer structures.
Furthermore, we applied the optimized descriptors or graph descriptors to train the directed message passing neural network (DMPNN) models in Chemprop68, as shown in Supplementary Fig. 9c, d. Although the limited amount of data in the available benchmark dataset makes it difficult to output a high-performance DMPNN model, the performance of the optimized descriptors is the best compared to other descriptors. This illustrates the potential of optimized descriptors for applications in diverse and complex ML models.

Physical insights from an interpretable ML model

Figure 4 summarizes the effect of the features using SHAP, for the RF model trained on optimized descriptors. The SHAP approach attempts to address the unexplainable black-box challenge of ML algorithms by calculating the marginal contribution of features to the model output62. Hence, the features of each polymer structure in training data sets are assigned the SHAP values separately. As shown in Fig. 4a, the importance ranking of the optimized descriptors was referenced to the average SHAP value. Among the top 8 optimized descriptors, the number of MD-inspired and Mordred-based descriptors is equal, which reflects that the construction of the RF model is a joint contribution of these two types of descriptors. The distribution of SHAP values for each descriptor is displayed in Fig. 4b, and the depth of shade of data points in the beeswarm plot represents the magnitude of TC of polymer structures in the training set. The distribution of SHAP values for the top-ranked features is relatively wide, and is monotonic about the feature values overall (Supplementary Fig. 10).
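To make the notion of a marginal contribution concrete, the sketch below computes exact Shapley values for a toy three-feature model by averaging each feature's marginal contribution over all feature orderings. It is a brute-force illustration only: the SHAP library relies on much more efficient model-specific algorithms, and the model, inputs, and baseline here are hypothetical.

```python
from itertools import permutations


def exact_shapley(model, x, baseline):
    """Average marginal contribution of each feature over all orderings.

    Features not yet 'switched on' take their baseline value.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = x[i]  # switch feature i from baseline to its actual value
            updated = model(current)
            phi[i] += updated - previous
            previous = updated
    return [p / len(orderings) for p in phi]


def toy_model(v):
    """Hypothetical model with two linear terms and one interaction term."""
    return 2.0 * v[0] - 3.0 * v[1] + 0.5 * v[0] * v[2]


x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = exact_shapley(toy_model, x, baseline)
```

The contributions sum to the difference between the model output at `x` and at the baseline, and the interaction term is shared equally between the two features involved.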

Fig. 4: Analysis of feature importance using SHAP on RF model trained by optimized descriptors.

a Average SHAP values for the 20 optimized descriptors. b SHAP values of each descriptor for the training data set polymers, shown as a beeswarm plot. c, d SHAP values for the Cross-section and Kd_average of the training data set polymers as a function of descriptor value. The Cross-section is the effective cross-sectional area of the polymer chain, and the Kd_average is the average value of the dihedral angle force constants from the GAFF2 force field.

Here, we highlight the two MD-inspired descriptors cross-sectional and Kd_average. The most important descriptor, cross-sectional, indicates the effective cross-sectional area of the polymer chain, which is intuitively related to the TC. As shown in Fig. 4c, the SHAP value for cross-sectional decreases monotonically with the descriptor value. In 1-D polymer chains, the effective cross-sectional area depends on factors such as the complexity of the side chains and the chain orientation. Polymers with small cross-sectional areas facilitate the construction of centralized phonon transport channels along the backbone and reduce the heat flux dissipated through the side chains. Thus, the TC is negatively related to the cross-sectional area, and polymers with high TC usually have a small cross-sectional area (Supplementary Fig. 11a). Moreover, the polymer chain structure is free of disorder compared to the amorphous structure, which maintains the symmetry of the crystal and reduces phonon scattering. However, the polymer chains may rotate and become disordered due to temperature and other effects, resulting in a rapid decrease in TC69. The close correlation between the dihedral energy constant and polymer chain stiffness has been demonstrated, and the dihedral angle force constant Kd has been artificially increased in MD simulations to maintain PE chain stiffness and increase TC20,69. The Kd_average is the average of all types of dihedral force constants from the GAFF2 force field for the polymer chain, and it is roughly proportional to the corresponding SHAP value in Fig. 4d. In particular, polymer structures with large Kd_average (>4 kcal mol−1) usually have large SHAP values and TC (Supplementary Fig. 11b). Notably, the TC of polymer chains is influenced by multiple parameters, and it is difficult for an individual descriptor to determine its value.
One example is that crystalline polynorbornene has been proven to be weakly sensitive to chain stiffness, even if increasing the dihedral angular force constant term in MD simulations69. This confirms the significance of our proposed ML framework for predicting the log2TC of polymers.

Discovery of high TC polymers

The reliability of the optimized descriptors has been demonstrated by the performance of the various trained ML models. Next, we applied these ML models to predict the log2TC of polymer structures in the PoLyInfo and PI1M databases, in order to virtually screen promising polymers with high TC. The predicted polymer TC versus cross-sectional area from the ensemble of optimized descriptors combined with RF, XGBoost, and MLP is visualized in Fig. 5a–c, respectively, where stars indicate polyethylene, with log2TC of 3.91, 4.66, and 5.30 predicted by RF, XGBoost, and MLP, respectively, compared with 5.28 calculated by MD simulation. The dependence of TC on the cross-sectional area is evident here, as almost all of the predicted high log2TC polymers have small cross-sectional areas. Moreover, since PI1M has the same chemical distribution space as PoLyInfo and fills its sparse regions, it covers most of the log2TC range of PoLyInfo and enriches the polymer structures in the high log2TC region.

Fig. 5: Prediction of high TC polymers in PoLyInfo and PI1M databases using constructed ML models.

a, b and c based on RF, XGBoost, and MLP models, respectively. d Synthetic accessibility (SA) score versus calculated log2TC of screened high TC polymers (TC > 20.00 W m−1 K−1). The star indicates PE, and the TC in this work is 38.98 W m−1 K−1.

Comparing the results from different ML models, the tree-based models of RF and XGBoost predict the log2TC of polymers in a narrower range than the MLP. Despite the excellent performance of the tree-based models in preventing overfitting, their extrapolation is usually inadequate, and their predictions are still limited to the range of log2TC of the polymer structures in the training set. In contrast, the MLP neural network model usually has better extrapolation capability and is superior in finding sparsely represented data such as high log2TC polymer structures, despite its relatively low training accuracy R2. This finding is similar to a previous study on predicting the permeability of gas separation membranes using ML23. In addition, previous work has revealed the length dependence of the TC of polymer chains. Within a certain length range, the diverging thermal conductivity \(k\) and chain length \(L\) can be fitted by \(k\propto {L}^{\beta }\), where \(\beta\) indicates the relatively dominant phonon transport mechanism70. Here, we considered polymer chains with TC greater than 20.00 W m−1 K−1 at an effective length of 50 nm to be outstanding polymers with high TC. Then, a balanced strategy integrating the performance of the three ML models was devised to recommend promising polymer structures for the calculation of TC by MD simulations. We identified the polymer structures in the PoLyInfo dataset with RF, XGBoost, and MLP predictions of log2TC up to 3.51, 3.50, and 4.33, and only the polymer structures with no less than 2 occurrences were picked for MD simulations. As a result, 24 polymer structures with high TC were discovered and verified. Similarly, we applied this method to identify 84 high-TC polymer structures in the PI1M database. After de-duplication, a total of 107 high-TC polymer structures were found in this work, and their Synthetic Accessibility (SA) scores were calculated as shown in Fig. 5d. The specific polymer structures can be seen in Supplementary Note 8.
From Supplementary Fig. 12, we can see that most of the high-TC polymers are simple linear chains or contain aromatic rings in the main chain, with small repeating unit lengths and no side chains. The SA score was initially used to estimate the synthetic accessibility of drug-like molecules based on molecular complexity and fragment contributions71, and was subsequently adopted for polymers37,38. The SA score ranges from 1 to 10, and synthesis becomes more difficult as the value increases. To take into account the effect of monomer linkages, the SA score was calculated for polymer molecules with a polymerization degree of 6. Among them, 28 polymer structures have SA scores of no more than 3.00, including polyethylene, polytetrafluoroethylene, and poly(p-phenylene). Although it is currently difficult to fabricate each of these structures, we believe that more polymers like PE chains will be prepared in the near future to explore the TC limits of polymers by combining advanced processes such as micromechanical stretching, electrostatic spinning, and nanoscale templating5,7,16.
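The consensus step used to recommend candidates for MD simulation can be sketched as a simple vote across the three models. This sketch assumes that the quoted log2TC values (3.51, 3.50, and 4.33) act as per-model lower cutoffs and that a polymer is kept when at least two models place it above their cutoff. The function name and the predictions for the non-PE monomers are hypothetical; the PE predictions are the values quoted in this work.

```python
def consensus_candidates(predictions, cutoffs, min_votes=2):
    """Return polymers that at least `min_votes` models predict above their cutoff."""
    votes = {}
    for model_name, predicted in predictions.items():
        cutoff = cutoffs[model_name]
        for polymer, log2_tc in predicted.items():
            if log2_tc >= cutoff:
                votes[polymer] = votes.get(polymer, 0) + 1
    return sorted(p for p, n in votes.items() if n >= min_votes)


# Assumed per-model cutoffs on log2TC, taken from the values quoted in the text.
cutoffs = {"RF": 3.51, "XGBoost": 3.50, "MLP": 4.33}

# Predicted log2TC per monomer SMILES; only the PE row ([*]CC[*]) is from the text.
predictions = {
    "RF": {"[*]CC[*]": 3.91, "[*]C(C)C[*]": 1.20, "[*]C=C[*]": 3.60},
    "XGBoost": {"[*]CC[*]": 4.66, "[*]C(C)C[*]": 1.10, "[*]C=C[*]": 3.40},
    "MLP": {"[*]CC[*]": 5.30, "[*]C(C)C[*]": 4.40, "[*]C=C[*]": 4.50},
}

selected = consensus_candidates(predictions, cutoffs)
```

Requiring agreement between at least two models balances the conservative range of the tree-based models against the more extrapolative MLP.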

Symbolic regression for TC prediction of promising polymers

Since the TC of polymer chains is influenced by multiple coupled parameters, it is difficult to predict trends in TC values for different polymers from any single descriptor. Symbolic regression (SR) attempts to accelerate the discovery of materials with superior properties by relating available descriptors through mathematical formulas to construct new combinatorial features72. SR does not require massive datasets, as long as the data are highly consistent and accurate73. The 107 promising polymer structures (TC > 20.00 W m−1 K−1) with optimized descriptors were used for SR, with a training-to-test ratio of 3:1. The mathematical formula was obtained and selected using an efficient stepwise strategy with SR based on genetic programming (GPSR), as implemented in the gplearn code74. The hyperparameter setup and the detailed formula determination process can be found in Supplementary Note 9. Pearson coefficients are first applied to filter the optimized descriptors and create sub-descriptors, yielding an updated ensemble of 22 descriptors. The frequency of occurrence of the optimized descriptors in 158 mathematical formulas (PC values \(\ge\)0.85 and complexity \(\le\)10) is displayed in Fig. 6a, and the first eight descriptors were finally retained. It is worth emphasizing that the MD-inspired descriptors of cross-sectional area (cross-sectional) and dihedral force constants (Kd_average) appeared in each of the formulas. In Fig. 6b, we calculated the Pearson coefficients of the new set of descriptors with the TC; the results suggest that these descriptors are closely associated with the TC. Subsequently, we reset the grid search hyperparameters in gplearn and used R2 as the evaluation metric. Only formulas with high R2 and low complexity (length of formula) are considered suitable for the prediction of the log2TC of polymer structures75.
Thus, 9073 mathematical formulas with complexity within 30 and R2 over 0.6 were obtained, and their complexity and accuracy R2 are characterized via the density plot in Fig. 6c. The four points c, d, e, and f at the Pareto front were identified by the Latin hypercube sampling approach76,77, and their corresponding formulas are given in Supplementary Table 8. The complexities of the four formulas are in the range of 20–30, and the fitting accuracies are all greater than 0.70. Moreover, the training accuracy generally increases with complexity. For example, the formula represented by point c with a complexity of 20 has a relatively low accuracy R2 among the four points, but its fitting results are consistent with the MD calculated log2TC, as demonstrated in Fig. 6d. Meanwhile, all four identified formulas include the descriptors Cross-sectional, Kd_average and Nd_average, which verifies that the TC of a polymer chain is strongly correlated with parameters such as cross-sectional area and dihedral stiffness. These formulas are useful for the initial rapid screening of high TC polymer chain structures.
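The accuracy versus complexity trade-off described above can be illustrated with a minimal sketch that keeps only non-dominated formulas, i.e., those for which no other formula is both simpler and more accurate. This is a plain non-dominated filter, not the Latin hypercube sampling procedure used in this work, and the function name `pareto_front` and the (complexity, R2) pairs are hypothetical values chosen for illustration.

```python
from typing import List, Tuple

Formula = Tuple[int, float]  # (complexity, R2) of one candidate formula


def pareto_front(candidates: List[Formula]) -> List[Formula]:
    """Return the candidates that are not dominated by any other candidate.

    A candidate is dominated if another one has complexity <= and R2 >=,
    with at least one of the two comparisons being strict.
    """
    front = []
    for c_i, r_i in candidates:
        dominated = any(
            (c_j <= c_i and r_j >= r_i) and (c_j < c_i or r_j > r_i)
            for c_j, r_j in candidates
        )
        if not dominated:
            front.append((c_i, r_i))
    return sorted(set(front))


# Hypothetical (complexity, R2) pairs, not values taken from this work.
candidates = [(20, 0.71), (22, 0.70), (25, 0.74), (28, 0.73), (30, 0.78), (30, 0.65)]
front = pareto_front(candidates)
print(front)
```

In this toy set, a formula survives only if every simpler formula is less accurate, which mirrors the selection criterion of high R2 at low complexity.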

Fig. 6: GPSR for TC prediction of promising polymers.

a Frequency of occurrence of optimized descriptors in 158 mathematical formulas (PC values \(\ge\) 0.85 and complexity \(\le\) 10). b Pearson correlation matrices showing correlations among 22 descriptors and TC, where the descriptors d1–d8 correspond to descriptors 1 to 8 in (a). c Pareto front of accuracy R2 vs. complexity of 9073 mathematical formulas shown via density plot. d MD calculated log2TC vs. fitting results of the formula (point c) with a complexity of 20 and training accuracy R2 of 0.71.

Thermal transport mechanism of promising individual polymer chains

Taking into account factors such as TC and SA score, eight polymer structures (see Fig. 7a) were chosen for the analysis of phonon dispersion relations. Currently, polymer structures like [*]C=C[*] and [*]N=N[*] are challenging to synthesize experimentally, but they contribute to our understanding of polymer thermal transport mechanisms. All of these polymer molecules are π-conjugated structures except for PE and polytetrafluoroethylene (PTFE), which are simple linear structures. In π-conjugated polymer molecules, the overlap of p-orbitals strongly restrains chain rotation and forms a rigid backbone15. Figure 7b illustrates the phonon dispersion relations, which were obtained by phonon spectral energy density (Phonon-SED) analysis78. A detailed description of the Phonon-SED approach can be found in the Methods section. Since acoustic modes dominate the thermal transport in polymer chains, only phonon modes with frequencies below 25 THz are shown. Moreover, the phonon group velocity \({v}_{g}\) is approximated as the average of the slopes of all acoustic branches15,67. The volumetric heat capacity \({C}_{v}\) of each structure was evaluated from the corresponding amorphous polymer: for each repeating unit, we constructed an amorphous system containing about 10,000 atoms and calculated \({C}_{v}\) after an equilibrium simulation at 300 K. More details can be found in Supplementary Note 10. Finally, the phonon mean free path was derived from \(l=k/({v}_{g}{C}_{v})\). Since the MD simulation did not reach equilibrium for the model of [*]N=N[*], we could not obtain its \({C}_{v}\) and \(l\). Owing to the approximations above, the results are rough estimates, but they do help us to understand the underlying thermal conductivity mechanisms of these promising polymer structures by comparing the relative trends of the relevant parameters, as listed in Table 1.
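As a worked illustration of the mean free path estimate from the Debye relation, the sketch below converts the units used in this section (TC in W m−1 K−1, group velocity in m/s, heat capacity in J cm−3 K−1) and returns \(l\) in nanometers. The function name and the input values are hypothetical and are not entries of Table 1.

```python
def mean_free_path_nm(k_w_mk: float, vg_m_s: float, cv_j_cm3_k: float) -> float:
    """Estimate the phonon mean free path l = k / (v_g * C_v) in nm.

    k_w_mk      thermal conductivity in W m^-1 K^-1
    vg_m_s      average phonon group velocity in m s^-1
    cv_j_cm3_k  volumetric heat capacity in J cm^-3 K^-1
    """
    cv_si = cv_j_cm3_k * 1.0e6          # J cm^-3 K^-1 -> J m^-3 K^-1
    l_m = k_w_mk / (vg_m_s * cv_si)     # mean free path in meters
    return l_m * 1.0e9                  # meters -> nanometers


# Hypothetical inputs for illustration only.
l_example = mean_free_path_nm(k_w_mk=30.0, vg_m_s=6000.0, cv_j_cm3_k=3.0)
print(f"l = {l_example:.3f} nm")
```

Because \({C}_{v}\) varies little between structures, differences in \(l\) obtained this way mainly reflect differences in \(k\) and \({v}_{g}\).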

Fig. 7: Structure and phonon dispersion relations for the eight promising polymers.

a Polymer chain structures. b Phonon dispersion relations. q is the wavevector and \(\omega\) is the phonon frequency. The average phonon group velocity of one branch is estimated as the slope of the line from the origin to the maximum frequency point, as shown by the red dashed line for the PHTC001 structure.

Table 1 Thermal properties for eight promising polymers.

The volumetric heat capacity of the eight polymer structures varies from 2.70 to 3.74 J cm−3 K−1, which is not critical to the high TC of polymer chains21. As for the phonon group velocity, the six π-conjugated polymers have large values (more than 5900 m/s) due to overlapping p-orbitals and delocalized electrons. Additionally, a small atomic mass enables a large phonon group velocity. PTFE has a smaller phonon group velocity than PE because fluorine atoms are heavier than hydrogen atoms. The phonon mean free path provides valuable insights into phonon transport in the polymer chains. Overall, simple linear polymer chains tend to have long phonon mean free paths, especially linear π-conjugated polymers such as [*]C=C[*]. These structures have large chain stiffness and few atoms outside the backbone, and therefore exhibit weak phonon-disorder scattering.

Thermal transport linkages between the various hierarchical polymer structures

To explore the thermal transport linkages between the different hierarchical structures of polymer chains and amorphous polymers, we selected 58 structures from the 107 promising high TC polymer chains and calculated the TC of the corresponding amorphous polymers (ATC) using reverse non-equilibrium molecular dynamics (NEMD) simulations79, as listed in Supplementary Table 4 and shown in Fig. 8a. Here, ATC is specifically defined as the TC of the amorphous polymer to distinguish it from that of polymer chains. Amorphous polymers normally have much lower TC than polymer chains due to their internal disordered chain entanglement, and thus polymers with ATC greater than 0.40 W m−1 K−1 can be considered to have outstanding thermal conductivity37,80. Half of the amorphous polymers simulated in this work have an ATC greater than 0.40 W m−1 K−1, while the equivalent percentage is only 2.3% in the reference (ref. 63). In Fig. 8b, the radius of gyration (\({R}_{g}\)) of amorphous polymers has a close positive correlation with ATC, and this work extends the upper limit of \({R}_{g}\) beyond that in ref. 63. Since polymer chains with high TC are associated with strong atomic interactions and large chain stiffness, their corresponding amorphous structures are also conducive to maintaining large rigid chain segments.
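For reference, the radius of gyration of a single chain can be computed from atomic coordinates and masses as sketched below. This assumes the standard mass-weighted definition about the center of mass; the exact averaging over chains and frames used in this work is not reproduced here, and the function name and toy coordinates are hypothetical.

```python
import math
from typing import Sequence


def radius_of_gyration(coords: Sequence[Sequence[float]],
                       masses: Sequence[float]) -> float:
    """Mass-weighted radius of gyration of one chain.

    Rg^2 = sum_i m_i |r_i - r_cm|^2 / sum_i m_i
    """
    total_mass = sum(masses)
    # Center of mass, component by component.
    com = [
        sum(m * r[axis] for m, r in zip(masses, coords)) / total_mass
        for axis in range(3)
    ]
    weighted_sq = sum(
        m * sum((r[axis] - com[axis]) ** 2 for axis in range(3))
        for m, r in zip(masses, coords)
    )
    return math.sqrt(weighted_sq / total_mass)


# Toy example: two equal masses 2 length units apart (illustrative only).
rg = radius_of_gyration([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [12.0, 12.0])
print(rg)
```

A stiffer, more extended chain places its atoms farther from the center of mass and therefore yields a larger \({R}_{g}\).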

Fig. 8: Thermal conductivity of amorphous polymers (ATC).

a ATC of 58 structures randomly selected from the 107 promising polymers (this work), and the reference (Ref.) data calculated by Hayashi et al.63 for 1051 polymers, using the same simulation parameters as set in this work. b Radius of gyration (\({R}_{g}\)) versus ATC for polymers of this work and Ref., where the diamond markers indicate the six typical amorphous polymers in (c). c Six typical amorphous polymers, including Poly(p-phenylene) (PPP), Poly(p-phenylenevinylene) (PPPV), Polyacetylene (PA), Poly[(E)-1-fluoroethene-1,2-diyl] (PEFD), Polyethylene (PE) and Polytetrafluoroethylene (PTFE), where the black balls indicate carbon atoms, the golden balls indicate fluorine atoms, the red balls indicate the connection positions, and the hydrogen atoms are hidden. d Contributions of convection and different types of interactions to the ATC of the six polymers. The ATC of each amorphous polymer was decomposed into six components: bond, angle, dihedral (dihed), convection (conv), nonbonded (non), and improper.

According to their different values of \({R}_{g}\), we selected the six structures in Fig. 8c, namely Poly(p-phenylene), Poly(p-phenylenevinylene), Polyacetylene (PA), Poly[(E)-1-fluoroethene-1,2-diyl] (PEFD), PE and PTFE, to understand the thermal transport mechanism via energy flux decomposition analysis63. The ATC of each amorphous polymer was decomposed into six components: bond, angle, dihedral, convection, nonbonded and improper, where the nonbonded contribution contains pairwise and K-space contributions. From Fig. 8d, the intra-chain interactions of bonds, angles, and dihedrals dominate the ATC of amorphous polymers. For π-conjugated polymers in particular, the direct contribution of the dihedral term to the ATC is pronounced. Comparing the PA/PEFD and PE/PTFE pairs suggests that heavy atoms such as fluorine may inhibit the propagation of phonons and reduce the ATC. To gain a unified understanding of how the dihedral term contributes to the TC of different hierarchical structures, we investigated the role of chain orientation and chain rotation of polymers in Supplementary Note 10. Our results reveal that polymers with low dihedral energy are prone to poorly consistent chain orientation (Supplementary Fig. 16) and severe chain rotation (Supplementary Fig. 17), which are undesirable for heat flux transport in the intended direction. Furthermore, the TC of strained amorphous polymers or polymer chains is more sensitive to a reduction in dihedral energy than that of strain-free amorphous polymers, because the former have a large original orientational order parameter.

Discussion

In summary, we have developed an interpretable ML framework for exploring high thermal conductivity polymer chains via high-throughput MD simulations. Inspired by the drug-like small molecule representation and the molecular force field, we reduced the initially calculated/collected 320 physical descriptors to 20 optimized descriptors by hierarchical down-selection. The constructed ML models effectively reflect the relationship between the optimized descriptors and the property, and exhibit high accuracy in TC prediction. The RF, XGBoost and MLP models all achieved an R2 of more than 0.80, which is superior to the accuracy obtained with conventional graph descriptors. Moreover, the promotion or inhibition of TC by optimized descriptors such as cross-sectional area and dihedral stiffness was captured by the RF model using SHAP analysis.

Using the trained ML models, we discovered 107 promising polymers with TC greater than 20.00 W m−1 K−1, 29 of which have SA scores of no more than 3.00. These polymer structures have been validated through high-fidelity MD simulations. Further, we used SR with optimized descriptors to fit the TC of promising polymers, and the derived mathematical formulas enable a preliminary fast screening of high TC polymers without relying on ML models, which is convenient for experimental studies. Finally, we calculated phonon dispersion relations for eight typical polymer structures via phonon spectral energy density analysis to reveal the underlying TC mechanisms. Notably, most of these structures are π-conjugated polymers, whose overlapping p-orbitals help maintain strong chain stiffness and large group velocities.

Currently, pure individual polymer chains are still not accessible by experiments. There have been many efforts to fabricate polymer nanofibers with consistently oriented chains by techniques such as micromechanical stretching, electrostatic spinning, and nanoscale templating, but these techniques place requirements on the inherent properties of the polymers. Although mechanical tensile disentanglement of amorphous polymers may be realized by adjusting conditions such as strain rate and temperature, it is demanding on the mechanical properties of the polymers80. Many π-conjugated polymers are not suitable for the stretching process because their elastic modulus is not comparable to that of PE, whereas electrostatic spinning and nanoscale templating technologies are probably applicable81,82. The conjugated polymer can be dissolved in a matching solvent and then formed into consistently oriented nanofibers by electrostatic induction or by nanoscale template confinement, for example with anodic aluminum oxide templates. Moreover, the thermal properties of the hierarchical structures of polymers are closely related. We calculated 58 amorphous polymers whose repeating units were randomly extracted from the set of 107 promising polymers, and half of them have a high ATC greater than 0.40 W m−1 K−1. As indicated by the \({R}_{g}\) data, strong interatomic interactions are also beneficial for obtaining large rigid chain segments in the amorphous system, achieving significant intra-chain thermal transport and high ATC. The proposed approach may assist in the research of high-performance polymers beyond TC, and aid in understanding the linkage between the properties of different hierarchical structures.

Methods

Polymer modeling and cross-sectional area calculation

Polymer modeling is a monomer-to-chain process, implemented in the STK tool, with input parameters of monomer SMILES and degree of polymerization83. The length of the polymer chains was set uniformly to 50 nm, and the degree of polymerization was obtained by dividing the chain length by the monomer length and rounding up to an integer. Starting from the polymer SMILES, a molecular chain with polymerization degree 2 was generated by RDKit and optimized using the MMFF force field84. Then, the monomer length was determined by measuring the distance between equivalent atoms in two repeating units in the heat transport direction. Following the modeling, a Python pipeline of PYSIMM realized the assignment of GAFF2 force field parameters and the generation of MD simulation input structure data files85.

The cross-sectional area is one of the important parameters for thermal conductivity analysis. In molecular dynamics simulations, the calculation of the cross-sectional area is difficult for systems that do not occupy the entire simulation box. Here, the cross-sectional area was estimated as the ratio of the van der Waals volume to the length of the monomer13. The van der Waals volume of the monomer was calculated as the sum of atomic and bond contributions, an approach that has been successfully tested and applied to drug compounds in previous work86.
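The two geometric quantities described in this subsection reduce to simple arithmetic, as the following minimal sketch shows: the degree of polymerization is the chain length divided by the monomer length, rounded up, and the cross-sectional area is the van der Waals volume divided by the monomer length. The function names and the numerical inputs are hypothetical and serve only as an illustration.

```python
import math


def degree_of_polymerization(chain_length_A: float, monomer_length_A: float) -> int:
    """Number of repeating units needed to reach the target chain length,
    rounded up to an integer."""
    return math.ceil(chain_length_A / monomer_length_A)


def cross_sectional_area(vdw_volume_A3: float, monomer_length_A: float) -> float:
    """Cross-sectional area (A^2) as van der Waals volume over monomer length."""
    return vdw_volume_A3 / monomer_length_A


# Hypothetical monomer: length 2.5 A, van der Waals volume 34.0 A^3.
n_units = degree_of_polymerization(chain_length_A=500.0, monomer_length_A=2.5)  # 50 nm chain
area_A2 = cross_sectional_area(vdw_volume_A3=34.0, monomer_length_A=2.5)
print(n_units, area_A2)
```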

Calculation of TC by MD simulations

The TC of polymer chains was obtained by NEMD simulations performed in the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS)85. The implementation of NEMD simulations is similar to steady-state measurement experiments for TC, in which 1-D steady-state heat transfer is generated by adding a heat source and a heat sink at the two ends of the sample87. NEMD simulations have been extensively applied in the calculation of the TC of low-dimensional systems, as they have been proven able to identify non-Fourier heat conduction phenomena induced by nanoconfinement70,88,89,90. As for polymers, Liu et al.70 demonstrated through NEMD simulations that competition between ballistic phonon transport and diffusive phonon transport in single polymer chains leads to a diverging length-dependent thermal conductivity. Shrestha et al.91 experimentally examined the temperature dependence of the TC of polymer nanofibers under ultra-high stretch, and indicated that the TC at the high-temperature end matched well with the results from NEMD calculations90.

In the NEMD method for TC calculation of polymer chains, the heat energy exchange was achieved by an enhanced version of the heat exchange algorithm, which rescales and shifts the velocities of particles inside the reservoirs to impose a constant heat flux92. The polymer chains were placed in a box of 540 × 60 × 60 Å (x × y × z), where the dimensions in the y and z directions were set to 60 Å to avoid interaction with the neighboring polymer chains. Before TC calculation, the polymer chain structures were relaxed to reach a stable conformation. Then, the polymer chain was divided into 50 slabs in the x direction, and the fixed regions at the two ends of the chain were set as heat-insulating walls. In the NEMD simulation, the system was run sequentially under the NVT (constant number of atoms, volume, and temperature) and NVE (constant number of atoms, volume, and energy) ensembles for 1 ns at 300 K to release chain stress36,93. After that, heat was added to and extracted from the heat source and sink regions (20 Å each) at the ends of the polymer chain at a constant rate to create a constant heat flux. The applied heat rate varies for different polymer chain structures and ranges from 0.01 eV/ps to 0.08 eV/ps. Finally, the temperature profile was averaged over the last 2–3 ns and used for TC calculation according to Fourier's law, \(k=-J/(\mathrm{d}T/\mathrm{d}x)\), where \(J\) is the heat flux and \(\mathrm{d}T/\mathrm{d}x\) is the temperature gradient. In addition, the ATC of amorphous polymers and the \({C}_{v}\) calculations were implemented by an automated pipeline in the RadonPy toolkit (Supplementary Fig. 15)63.
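The final step of the NEMD procedure, converting a time-averaged temperature profile and an imposed heat rate into a TC value, can be sketched as follows. This is a minimal illustration that assumes a linear least-squares fit of the slab temperatures, a heat flux equal to the heat rate divided by the cross-sectional area, and one-directional heat flow from source to sink; the function names and the profile, heat rate, and area values are hypothetical.

```python
from typing import Sequence

EV_PER_PS_TO_W = 1.602176634e-7   # 1 eV/ps expressed in watts
A2_TO_M2 = 1.0e-20                # 1 A^2 expressed in m^2
K_PER_A_TO_K_PER_M = 1.0e10       # 1 K/A expressed in K/m


def linear_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Least-squares slope of y versus x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    num = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    den = sum((xi - x_mean) ** 2 for xi in x)
    return num / den


def thermal_conductivity(x_A: Sequence[float], temp_K: Sequence[float],
                         heat_rate_eV_ps: float, area_A2: float) -> float:
    """TC in W m^-1 K^-1 from Fourier's law, k = -J / (dT/dx)."""
    grad_K_m = linear_slope(x_A, temp_K) * K_PER_A_TO_K_PER_M
    flux_W_m2 = heat_rate_eV_ps * EV_PER_PS_TO_W / (area_A2 * A2_TO_M2)
    return -flux_W_m2 / grad_K_m


# Hypothetical slab-averaged profile between heat source (left) and sink (right).
x_slabs = [0.0, 100.0, 200.0, 300.0, 400.0]       # slab centers in A
t_slabs = [320.0, 310.0, 300.0, 290.0, 280.0]     # temperatures in K
k_example = thermal_conductivity(x_slabs, t_slabs, heat_rate_eV_ps=0.05, area_A2=18.0)
print(f"k = {k_example:.2f} W/m/K")
```

In practice the slabs adjacent to the source and sink are usually excluded from the fit, since the temperature profile is nonlinear near the reservoirs.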

Descriptors calculation and ML models construction

Ideal polymer descriptors should represent polymer information completely with as few features as possible, and they are one of the key factors in determining the prediction accuracy of ML algorithms. The physical descriptors for this work were sourced from both Mordred software calculations and GAFF2 force field parameter extraction. The Mordred software was initially developed for small molecule characterization in cheminformatics and can calculate more than 1800 descriptors47. However, since we consider the two connecting sites of polymer monomers, only 286 valid descriptors were obtained. Therefore, as a complement, we additionally extracted parameters from each polymer force field file as descriptors. For graph descriptors, MACCS, Morgan, and cMorgan fingerprints were calculated with the RDKit package45. The Mol2vec fingerprints were embedded via Mol2vec40. We referred to the polymer representation model trained using the PoLyInfo and PI1M databases for generating Mol2vec fingerprints60.

The ML models of RF, XGBoost, and MLP were implemented using Scikit-learn94. Hyperparameter optimization for RF, XGBoost, and MLP was performed with the Bayesian Optimization package95, a global optimization tool, to achieve good prediction accuracy R2. Gaussian process regression and an acquisition function with ten random sets of parameters were selected for initial training, and the ideal parameters for each ML model were determined after 100 optimization iterations52.

To explain the association of optimized descriptors with TC, we used the SHAP toolkit with RF model to evaluate the feature importance62. The SHAP analysis is based on a game-theoretic approach that associates the optimal credit allocation with the local explanations of the model, which considers the model performance by neglecting each feature and provides the direction of each descriptor effect52.

Mathematical formulas for TC fitted by symbolic regression (SR)

The mathematical formulas were acquired and selected using an efficient stepwise strategy with GPSR as implemented in gplearn74. The 107 polymer structures with TC greater than 20.00 W m−1 K−1 were randomly divided into training and test sets at a ratio of 3:1. At first, Pearson coefficients were used as the evaluation metric of training fitness to filter optimized descriptors and generate sub-descriptors, and a new dataset containing 22 descriptors was generated. Further, the grid search strategy with the hyperparameters and metric R2 listed in Table 2 was applied to determine the mathematical formulas. We ultimately discussed four formulas at the Pareto front that were identified by the Latin hypercube sampling approach77. More information about SR can be found in Supplementary Note 9.

Table 2 Setup of hyperparameters in gplearn toolkit for GPSR.

Analysis of phonon dispersion relations by phonon spectral energy density (Phonon-SED)

To understand the TC mechanism of polymers, MD simulations coupled with the Phonon-SED approach78 were employed to calculate the dispersion relations of polymers. A polymer chain with a length of 100 Å was constructed from the input SMILES and placed into a box with a cross section of 60 × 60 Å. After energy minimization, the system was run under the NVT (constant number of atoms, volume, and temperature) ensemble for 0.25 ns at 2 K to release chain stress. Subsequently, the system was run under the NVE (constant number of atoms, volume, and energy) ensemble for 2 million steps with a timestep of 0.25 fs. During this period, the velocity and position of each atom in the polymer backbone were recorded at intervals of 20 steps. The Phonon-SED approach converts the time-domain information of atomic velocities and positions into wave vectors versus angular frequencies via a two-dimensional Fourier transform, expressed as

$$\varPhi \left(q,\omega \right)=\frac{1}{4\pi {\tau }_{0}{N}_{T}}\sum_{\alpha }^{\{x,y,z\}}\sum_{b}^{B}{m}_{b}{\left|\int_{0}^{{\tau }_{0}}\sum_{n}^{{N}_{T}}{\dot{u}}_{\alpha }\left(n,b;t\right)\,{e}^{iq\cdot r\left(n,0;t\right)-i\omega t}\,\mathrm{d}t\right|}^{2} \tag{1}$$

where \(q\) is the wavevector, \(\omega\) is the frequency, \({\tau }_{0}\) is the simulation time, \({m}_{b}\) is the mass of atom b, \(\alpha\) is the Cartesian direction, \({N}_{T}\) is the number of unit cells in the polymer chain, \({\dot{u}}_{\alpha }\left(n,b;t\right)\) is the velocity of atom b in unit cell n at time t in the \(\alpha\) direction, and \(r\left(n,0;t\right)\) is the equilibrium position of unit cell n.
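To make Eq. (1) concrete, the sketch below evaluates the spectral energy density at a single \((q,\omega)\) point by direct summation, replacing the time integral with a discrete sum over saved frames. It is an illustration only: a production analysis would use a fast two-dimensional Fourier transform over all \((q,\omega)\) points, and the function name `sed_point`, the data layout `velocities[frame][cell][atom][direction]`, and the synthetic traveling-wave input are assumptions made for this example.

```python
import cmath
import math
from typing import Sequence


def sed_point(velocities: Sequence, cell_positions: Sequence[float],
              masses: Sequence[float], dt: float, q: float, omega: float) -> float:
    """Spectral energy density at one (q, omega) for a 1-D chain.

    velocities[t][n][b][alpha]: velocity of atom b in unit cell n at frame t
    cell_positions[n]: equilibrium position of unit cell n along the chain
    masses[b]: mass of atom b in the unit cell
    """
    n_frames = len(velocities)
    n_cells = len(cell_positions)
    tau0 = n_frames * dt
    total = 0.0
    for b, m_b in enumerate(masses):
        for alpha in range(3):
            acc = 0j
            for t_idx in range(n_frames):
                t = t_idx * dt
                for n in range(n_cells):
                    phase = q * cell_positions[n] - omega * t
                    acc += velocities[t_idx][n][b][alpha] * cmath.exp(1j * phase)
            acc *= dt  # discrete approximation of the time integral
            total += m_b * abs(acc) ** 2
    return total / (4.0 * math.pi * tau0 * n_cells)


# Synthetic test signal: one traveling wave with known (q0, omega0).
n_cells, n_frames, a, dt = 8, 64, 1.0, 1.0
q0 = 2.0 * math.pi * 2 / (n_cells * a)
omega0 = 2.0 * math.pi * 5 / (n_frames * dt)
positions = [n * a for n in range(n_cells)]
vel = [[[[math.cos(q0 * positions[n] - omega0 * t * dt), 0.0, 0.0]]
        for n in range(n_cells)] for t in range(n_frames)]

peak = sed_point(vel, positions, [1.0], dt, q0, omega0)
off_omega = sed_point(vel, positions, [1.0], dt, q0, 2.0 * math.pi * 9 / (n_frames * dt))
off_q = sed_point(vel, positions, [1.0], dt, 2.0 * math.pi * 3 / (n_cells * a), omega0)
print(peak, off_omega, off_q)
```

For this synthetic input the density is concentrated at the wave's own \((q_0,\omega_0)\) and is negligible at the other sampled points, which is how bright ridges in the \((q,\omega)\) map trace out the dispersion branches.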