Exploring high thermal conductivity polymers via interpretable machine learning with physical descriptors

The efficient and economical exploitation of polymers with high thermal conductivity is essential to solve the issue of heat dissipation in organic devices. Currently, the experimental preparation of functional thermal conductivity polymers remains a trial and error process due to the multi-degrees of freedom during the synthesis and characterization process. In this work, we have proposed a high-throughput screening framework for polymer chains with high thermal conductivity via interpretable machine learning and physical-feature engineering. The polymer thermal conductivity datasets for training were first collected by molecular dynamics simulation. Inspired by the drug-like small molecule representation and molecular force field, 320 polymer monomer descriptors were calculated and the 20 optimized descriptors with physical meaning were extracted by hierarchical down-selection. All the machine learning models achieve a prediction accuracy R2 greater than 0.80, which is superior to that of represented by traditional graph descriptors. Further, the cross-sectional area and dihedral stiffness descriptors were identified for positive/negative contribution to thermal conductivity, and 107 promising polymer structures with thermal conductivity greater than 20.00 W/mK were obtained. Mathematical formulas for predicting the polymer thermal conductivity were also constructed by using symbolic regression. The high thermal conductivity polymer structures are mostly {\pi}-conjugated, whose overlapping p-orbitals enable easily to maintain strong chain stiffness and large group velocities. The proposed data-driven framework should facilitate the theoretical and experimental design of polymers with desirable properties.


INTRODUCTION
Polymers are extensively used in industry and daily life, owing to various advantages of chemical inertness, mechanical flexibility and light weight 1 .As the organic electronics are becoming smaller while the power density keeps increasing, the thermal management and heat dissipation capability have attracted significant attention 2 .However, conventional polymers are thermal insulators with reported thermal conductivity in the range from 0.1 to 0.5 W/mK, preventing the development of organic electronics 3 .Polymers with high thermal conductivity are urgently demanded in organic energy storage and electronic devices to accommodate revolutionary innovations in organic electronics and optoelectronics 4 .The polymer morphology and topology were found to be closely related to thermal conductivity 5 .Increasing the crystallite orientation and crystallinity can significantly reduce the phonon scattering and enhance the thermal conductivity along the chain directions, which has been demonstrated by both experiments [6][7][8] and theoretical simulations [9][10][11][12] .A recent study has fabricated polyethylene (PE) films by disentanglement and alignment of amorphous chains with a metal-like thermal conductivity of 62 W/mK, over two orders of magnitude greater than that of classical amorphous polymers 6 .Moreover, molecular dynamics simulations have suggested that individual crystalline PE chains have a very high or even divergent thermal conductivity 11 .These findings provide opportunities for solving the heat dissipation problem of polymer devices.
Despite the fact that chain alignment, crystallinity, polymer fibers or even single-chain polymer structures exhibit great influence on the thermal characteristic [13][14][15] , the polymer library is quite large, with as many as 10 8 monomeric organic molecules known to exist in chemical space 16 .Current research on the thermal conductivity of polymers is still an Edisonian process, guided by intuition or experience in a trial-and-error approach that is time-consuming and expensive 17 .Moreover, most of the studies are conducted on simple structures such as PE 4,6,11 , which makes it difficult to grasp the general rule of the factors affecting the thermal conductivity of polymers and to discover polymer molecular structures with high thermal conductivity in huge chemical space.
The field of polymer informatics 18 , associated with the development of artificial intelligence and machine learning (ML) methods, attempts to utilize the data-driven centric method for physical property regulation or device development of organic materials to resolve the conflict between structural freedom and efficiency/cost in the traditional trial-and-error approach.The research on polymer informatics has attracted extensive attention and succeeded in recent years [19][20][21] , involving the prediction of organic optical [22][23][24] , electrical [25][26][27] and thermal properties [28][29][30][31][32] .Particularly, several efforts emerged in the search or design of structures with high TC as related to crystalline polymers 30 , amorphous polymers 31,32 and copolymers 33 .Most of these studies have employed graph descriptors 31 or polymer chemistry fragment statistics 30,32,33 to describe monomer structures in informatics algorithms, also called fingerprints or representations.The graph descriptors generated rely on molecular/monomer graph information, formulated by knowledge domain feature engineering 34 or by attempting to form general descriptors 35 .
Moreover, descriptors such as molecular access systems (MACCS) 36 are obtained through statistics of different chemical fragments, and are closely related to molecular graphs.Subsequently, they are collectively referred to as graph descriptors.The fingerprint is required for the unique, complete, minimal representation of each candidate, and the successful fingerprint is a challenging task 37 .Besides, polymers are composed of many repeating units, which are more complex than organic small molecules and require accurate capture of information on monomer connection sites 26 .The graph descriptions have long been applied and validated in the development of drug-like small molecules 38 , and the availability of open-source toolkits such as RDKit 39 and Mol2vec 34 has facilitated their accessibility, which is also one reason that graph descriptions are popular in polymer informatics.However, the graph descriptor is in the form of a string of numeric vectors.The completeness of the molecular structure determines the coupling association between the digits.Hence, the relationship between molecular monomers and material properties is difficult to grasp.
Exploring the ensemble of physically independent descriptors for the representation of molecular structures is important in qualitative structure-property relationship (QSPR) modeling and enables more intuitive guidelines for molecular structure evaluation 40 .Feature engineering for the collection and reduction of physical descriptors are critical steps in determining effective capabilities in polymer informatics.The development of automatic, universal and efficient tools for the calculation of descriptors of organic molecules is of interest to researchers, which translates the chemical information encoded in the symbolic representation of molecules into useful numbers or some standardized experimental results 41 .Several open-source and commercial software [41][42][43] are available to calculate various types of molecular descriptors such as carbon atomic number, molecular weight and Extended Topochemical Atom (ETA) 44 , which have been successfully applied in organic chemistry synthesis 45 , molecular antibacterial activity prediction 46 and so on.In addition, the parameter conditions in experiments or simulations affect the molecular properties.For instance, the force-field-inspired descriptors such as types of bond, angle and dihedral have been validated for the prediction of the specific heat of polymers, even if the datasets are from experiments 29 .The number reduction of polymer features is another concern, as some descriptors may have little relevance to the target property, and a low-dimensional descriptor space is much easier to build up for the ML model 47 .Feature extraction and selection are the dominant approaches to reduce the dimensionality of features.Feature extraction creates subsets from the original data space, such as principal component analysis (PCA), where the specific meaning of the new features obtained is difficult to understand 48 .Feature selection retains the physical meaning of individual descriptors, while filters based on correlation evaluation have dependencies on mathematical models, like the Pearson and Spearman coefficients that consider the linear and monotonic relationships of the data, respectively 49 .Further, the filter methods do not involve ML models, which may lead to the inapplicability of the gained features.The wrapper-based feature selection techniques combine ML models to eliminate redundant features, including recursive feature elimination (RFE), sequential feature selection (SFS) and exhaustive feature selection (EFS) 50 .Testing different subsets of descriptors for informatics algorithms is the crucial feature of the wrapper approaches, and the key is the strategy of combining different descriptors.Typical RFE seeks to improve model performance by continuously reducing the low impact features from the remaining features in iteratively constructed ML models, which refer to the ranking of feature weights assigned by models such as random forests 48 .Thus, the RFE relies on the feature weight evaluation mechanism of the ML models.
Herein, focusing on the challenges of polymer monomer representation and feature selection, we proposed an ML interpretable framework integrated with high-throughput MD simulations for the discovery of polymer structures with high TC, as illustrated in Fig. 1.It consists of four components: 1) polymer library construction, 2) MD simulation for the TC of polymers; 3) monomer feature representation and hierarchical down-selection; 4) ML models construction for TC prediction.The training data was collected from the literatures 51,52 , and candidates from the databases of PolyInfo 53 and PI1M 54 were applied for the virtual screening of high TC structures.All polymer monomers were identified by the SMILES (simplified molecular input line entry system) strings and formed onedimensional polymer chains by replication.The TC of training datasets was calculated by MD simulations with the second generation of the general AMBER force field -GAFF2 55 .Inspired by druglike molecular representation and molecular force fields, we obtained 320 physical descriptors by mordred software 41 calculation and force field parameter file extraction, and retained 20 optimized descriptors by hierarchical down-selection.We then trained random forest (RF), extreme gradient boosting (XGBoost) tree-based models, and multilayer perceptron (MLP) neural network model separately to establish the relationship between the optimized descriptors and the TC of these benchmark polymer datasets.Further, we analyzed the feature importance of each optimized descriptor and extracted the chemical heuristic for high TC polymers design through SHAP analysis 56 .Using the trained ML models, 107 promising polymers with TC greater than 20.00 W/mK were identified, which are served for symbolic regression to derive mathematical formulas for expressing the TC of promising polymers.
Last, we discussed the TC mechanisms of eight typical polymers.Overall, the proposed approach is beneficial for theoretical or experimental investigations of high TC polymers.

Distribution of polymer datasets in chemical space
Polymer data from literatures 51,52 were utilized as the benchmark database for training machine learning models, as well as PolyInfo 53 and PI1M 54 databases were used for the virtual screening of polymer structures with high TC.The polymers are classified into 19 classes such as polyolefins, polyethers and polyamides according to different elements and chemical functional groups 57 .To validate the distribution of the selected 1735 benchmark data over the other two datasets, their chemical structures were visualized in 2D space by the uniform manifold approximation and projection (UAMP) 58 , where the chemical structure of each monomer was transformed into the Morgan fingerprint 35 of a 1024 vector with a radius of 2 atoms.It is observed that the polymer structures in the selected benchmark dataset (Fig. 2a) are well covered by the chemical space distribution of those in the PolyInfo (Fig. 2b) and PI1M (Fig. 2c) databases.Note that the PI1M dataset was generated by a generative model of a recurrent neural network trained with data from PolyInfo, which fills the sparse region of the chemical space of the PolyInfo dataset, but the distribution is consistent 54 .Thus, the ML models trained with the selected data are well able to learn the chemical features of all candidates and can be effectively adopted for the virtual screening of polymer structures with high TC.

Polymer descriptors hierarchical down-selection and ML Models Training
Polymer descriptors are hierarchically down-selected in three stages: removing features with low variance, primary filtering referred to different correlation coefficients, and final selection assisted with the ML model (shown in Supplementary Section S1).The collected initial monomer physical descriptors are composed of 286 Mordred-based and 34 MD-inspired descriptors.The descriptors of MD-inspired and Mordred-based descriptors are listed in Supplementary Section S2.The removal of low variance descriptors is intended to eliminate descriptors with variance less than a specific threshold, whose contribution to the target property of all polymer data (TC in this work) is considered to be nearly consistent.After the variance threshold was set to 0.01, the 264 descriptors were reserved for the next stage.We established the weight assignment mechanism (WAM) based on the different correlation coefficients for further primary filtering of the descriptors, due to the various attentions of their mathematical models.The Pearson, Spearman and Distance coefficients are used to evaluate linear, monotonic and non-linear relationships between data respectively, while the maximum information coefficient (MIC) reflects the association of two variables through information entropy, whether linear or nonlinear.The reliability of MIC depends on the data sample size and the value is reliable only with large datasets.The four metric coefficients of Pearson, Spearman, Distance and MIC were incorporated and each was assigned a weighting factor of 0.25, and the thresholds were set to 0.05, 0.05, 0.153 and 0.132, respectively.The 53 descriptors with a cumulative weight value of 1 were retained through VAM.
Random sequential feature selection (RFSF) combined with the RF model was then developed for optimized descriptors determination.Considering all possible combinations of descriptors for ML model training is time-consuming and expensive, so traditional SFS usually leads to sub-optimal solutions, where the recommended ensemble of optimized descriptors is not unique, and is influenced by the input order of the descriptors 59 .Here, we disrupted the order of the input descriptors, combined them with 100 RF model training cycles, and acquired the final optimized descriptors based on a statistical approach.The threshold was set to 0.34, that is to maintain descriptors that occur more than 34 times in 100 RF model training runs.The results of the optimized descriptors based on VAM and RSFS are shown in Fig. 3a and their detailed descriptions are listed in Supplementary Section S3.Moreover, Fig. 3e exhibits the Pearson correlation matrices of the correlations among optimized descriptors (Other metrics see in Fig. S1).It is found that most descriptors are positive correlated with each other and negative correlated with TC.Only three descriptors are positive for TC, two of which are MD-inspired descriptors.For example, the descriptor MW_ratio reflects the ratio of the molecular weight of the mainchain to the molecular weight of the monomer, with values between 0 and 1.The MW_ratio of 1 indicates that the polymer is without side chains, which reduces the loss of heat flux along the chain and makes it possible to get large TC.
Figure 3b shows the results of the RF model trained with the optimized descriptors, with training and test R 2 of 0.87 and 0.84 respectively.To verify the extensibility of the optimized descriptors, XGBoost and MLP models were deployed for training (see Fig. S2).The accuracy of the training and test sets for XGBoost is 0.95 and 0.87, and that for MLP are 0.81 and 0.88 respectively, which is comparable or even better than the RF model.Therefore, these three models are utilized in the subsequent discussion.
The prediction accuracy of ML models at different down-selection stages illustrated in Figure 3c (training and test data set prediction in Fig. S3).The extra PCA more than 95% variance was performed to compare with RFSF technology.According to the relationship between the number of principal components and the cumulative variance in Fig. S4, at least 19 components are required to exceed 95% variance.This is close to the number of sets of optimized descriptors.As seen in Fig. 3c, the tree-based models of the RF and XGBoost maintain relatively high accuracy even with large descriptor dimensions because of their strong ability to prevent overfitting of the data.Moreover, the feature down-selection process is usually accompanied by the loss of information, which results in the decrease of model accuracy.However, the feature down-selection process also reduces the redundancy between data which suppresses the overfitting and improves the accuracy of the MLP model.Overall, the accuracy of all three models trained with the optimized descriptors from RFSF is higher than that of the models trained with the PCA-derived descriptors, which demonstrates the effectiveness of our approach.
The ML models with different graph descriptors were applied for comparison in Fig. 3d (training and test data set prediction in Fig. S5).The Mol2vec 34 is an unsupervised ML approach to learn vector representations of molecular substructures, which requires a benchmark dataset for molecular structure training.Here, the pre-trained polymer embedding model was from elsewhere 54 , which was created using the PolyInfo and PI1M datasets.The MACCS 36 descriptor is the structural key-based descriptor with 166-bit keyset.The Morgan and Morgan count (cMorgan) 35 descriptors are the extended connectivity fingerprints that capture molecular features relevant to molecular activity.The results in Fig. 3d reflect the superiority of ML models trained with the optimized descriptor, no matter the models of RF, XGBoost and MLP.The down-selection processes of physical descriptors examine individual/combined descriptors in relation to TC, while the graph descriptors aim to represent molecular/monomeric information as completely as possible.Whilst the elements or groups in the molecular graph have been indicated to correlate with the TC of polymer chains 30 , it is more intuitive and effective to predict the TC of polymer chains using the associated physical descriptors.But not absolute, which is also related to the parameters such as chain stiffness 60 .

Physical insights from interpretable ML model
Figure 4 summarizes the effect of the features using SHAP, for the RF model trained on optimized descriptors.The SHAP approach attempts to address the unexplainable black-box challenge of ML algorithms by calculating the marginal contribution of features to the model output 56 .Hence, the features of each polymer structure in training data sets are assigned the SHAP values separately.As shown in Fig. 4a, the importance ranking of the optimized descriptors was referenced to the average SHAP value.Among the top 8 optimized descriptors, the number of MD-inspired and Mordred-based descriptors is equal, which reflects the fact that the construction of the RF model is a joint contribution of these two types of descriptors.The distribution of SHAP values for each descriptor is displayed in Fig. 4b, and the depth of shade of datapoints in the beeswarm plot represents the magnitude of TC of polymer structures in the training set.The distribution of SHAP values for the top-ranked features is relatively wide, and is monotonic about the feature values overall (Fig. S6).
Here, we highlight the two MD-inspired descriptors of cross-sectional and Kd_average.The most important descriptor of cross-sectional indicates the effective cross-sectional area of polymer chain, which is intuitive in relation to the TC.From the Fig. 4c, the SHAP value for cross-sectional decreases monotonically with the descriptor.According to the Fourier's law, the heat flow rate along 1-D polymer chain can be expressed as    ⁄ = −     ⁄ , where Q is the heat flow,   is the time interval,  is the thermal conductivity, A is the cross-sectional area, and   and   are the temperature difference and distance between the hot and cold ends, respectively 61 .Therefore, the TC is negatively related to the cross-sectional area, and polymers with high TC usually have a small cross-sectional area (Fig. S7a).Moreover, the polymer chain structure is absent of disorder compared to the amorphous structure, maintaining the symmetry of the crystal and reducing phonon scattering.However, the polymer chains may rotate and become disordered due to temperature and other effects, resulting in a rapid decrease in TC 62 .The closely correlation between dihedral angle energy constant and polymer chain stiffness has been demonstrated, and the dihedral angle force constant Kd has been artificially increased in MD simulations to maintain PE chain stiffness and increase thermal conductivity 62,63 .The Kd_average is the average of all types of dihedral force constants from GAFF2 force field for polymer chain, which is roughly proportional to the corresponding SHAP value in Fig. 4d.Especially for polymer structures with great kd_average (>4 kcal/mol) usually have large SHAP values and TC (Fig S7 b).Notably, the TC of polymer chains is influenced by multiple parameters and it is difficult to have the individual descriptor to determine its value.One example is that crystalline polynorbornene has been proved to be weakly sensitive with the chain stiffness, even if increasing the dihedral angular force constant term in MD simulations 62 .This confirms the significance of our proposed ML framework for predicting the TC of polymers.

Discovery of high TC polymers
The optimized descriptors were validated reliability in combination with different ML models for predicting the TC of polymer chains.Next, we applied these ML models to predict the TC of polymer structures in the PolyInfo and PI1M databases, in order to virtually screen promising polymers with high TC.The predicted polymer TC versus cross-sectional area from the ensemble of optimized descriptors combined with RF, XGBoost and MLP are visualized in Fig. 5a-c, respectively.Where stars indicate polyethylene with log2TC of 3.91, 4.66, and 5.30 predicted by RF, XGBoost, and MLP respectively, and that calculated by MD simulation is 5.28.The dependence of TC on the cross-sectional area is evident here, as almost all of the predicted high TC polymers have small cross-sectional areas.
Moreover, since PI1M has the same chemical distribution space as PolyInfo and fills the sparse area, which covers most of the TC range of PolyInfo and enriches the polymer structures in the high TC region.
Comparing the results from different ML models, the tree-based models of RF and XGBoost predict the TC of polymers in a narrower space than that of the MLP.Though the excellent performance of the tree-based models in preventing overfitting, the extrapolation of the models is usually inadequate and the predictions are still limited to the range of TC of the polymer structures in the training set.In contrast, neural network model of MLP usually has better extrapolation capability, and is superior in finding small data such as high TC polymer structures, despite the relatively low training accuracy R 2 of the model.This finding is similar to previous study of predicting the permeability of gas separation membranes using ML 17 .As well, previous work has revealed the length dependence of the thermal conductivity of polymer chains.Within a certain length range, the diverging thermal conductivity k and chain length L can be fitted by k ∼ L β , where β indicates the relatively dominant phonon transport mechanism 13 .Here, we considered polymer chains with TC greater than 20.00 W/mK with an effective length of 50 nm as the outstanding polymers with high TC.Then, balanced strategy to integrate the three ML models performance were devised to recommend promising polymer structures for calculation of TC by MD simulations.We identified the polymer structures in the PolyInfo dataset with RF, XGBoost and MLP predictions of log2TC up to 3.51, 3.50 and 4.33, and only the polymer structures with no less than 2 occurrences were picked for MD simulations.As a result, 24 polymer structures with high TC were discovered and verified.Similarly, we implemented this method to identify 84 high TC polymer structures in the PI1M database.After de-duplication, totally 107 high TC polymer structures were found in this work, and the Synthetic accessibility (SA) scores were calculated as shown in Fig. 5d.The specific polymer structures can be seen in Supplementary Section S8.From Figure S8, we can see that most of the high TC polymers are simple linear or contain aromatic rings in the mainchain, which have small repeating unit lengths and no side chains.The SA score was initially utilized to estimate the synthetic accessibility of drug-like molecules based on molecular complexity and fragment contributions 64 , and was subsequently adopted for polymers 31,32 .The SA score values ranged from 1 to 10, and synthesis is more difficult as the value increases.To take into account the effect of monomer linkages, polymer molecules with a polymerization degree of 6 were calculated for the SA score.Among them, 28 polymer structures with SA no more than 3.00, including polyethylene, polytetrafluoroethylene and poly(p-phenylene), and etc.Although it is currently difficult to fabricate each of these structures, we believe that more polymers like PE chain will be prepared for exploring the limits TC of polymers by combining advanced processes such as micromechanical stretching, electrostatic spinning and nanotemplate preparation in the near future 4,6,11 .

Symbolic regression for prediction of promising polymers
Since the TC of polymer chains is influenced by complex multi-parameters, it is difficult to predict frequencies below 25 THz are demonstrated.Moreover, the phonon group velocity is approximated as the average of the slopes of all acoustic branches 12,60 .In the one-dimensional polymer chain systems, the TC is analyzed as  =     , where   is the phonon group velocity,   is the volumetric heat capacity and  is the phonon mean free path.Due to the limitation of classical MD simulations, the volumetric heat capacity can be expressed as   = 3    ⁄ , where   is Boltzmann constant, N is the number of atoms and V is the volume.Thus, phonon mean free path can be calculated by the ratio of the TC to the multiplication of the phonon group velocity and the volumetric heat capacity.The approximations of the above calculations allow the results to be rough, but it do help us to understand the underlying thermal conductivity mechanisms of these promising polymer structures by comparing the relative trends of the relevant parameters, as listed in Table 1.
The volumetric heat capacity of the eight polymer structures varies from 3.28 to 6.13, which is not critical to the high TC of polymer chains.As for the phonon group velocity, the six π-conjugated polymers have large values (more than 5900 m/s) due to overlapped p-orbital and delocalized electrons.
Additionally, the small atomic mass enables a large phonon group velocity.The PTFE has smaller phonon group velocity than that of PE due to the relatively larger mass of fluorine atoms compared to hydrogen atoms.The phonon mean free path provides valuable insights into phonon transport in the

CONCLUSIONS
We have developed an interpretable ML framework for exploring high thermal conductivity polymer chains via high-throughput MD simulations.Inspired by the drug-like small molecule representation and the molecular force field, we reduced the initially calculated/collected 320 physical descriptors to 20 optimized descriptors by hierarchical down-selection.The constructed ML models are capable of effectively reflecting the relationship between optimized descriptors and property, and exhibit high accuracy in TC prediction.All the models of RF, XGBoost and MLP achieved the R 2 of more than 0.80, which is superior to that of represented by conventional graph descriptors.Moreover, the promotion or inhibition of TC by optimized descriptors like cross-sectional area and dihedral stiffness was captured by RF model using SHAP analysis.
Using the trained ML models, we discovered 107 promising polymers with TC greater than 20.00 W/mK, and 29 of which have SA scores no more than 3.00.These polymer structures have been validated through high-fidelity MD simulations.Further, we used SR with optimized descriptors to fit the TC of promising polymers, and the derived mathematical formulas enable a preliminary fast screening of high TC polymers without relying on ML models, which is friendly for experimental studies.In closing, we calculated phonon dispersion relations for eight typical polymer structures via phonon spectral energy density analysis to reveal the underlying TC mechanisms.Notably, most of these structures are π-conjugated polymers, whose overlapping p-orbitals enable easily to maintain strong chain stiffness and large group velocities.Our approach may assist in the research of high-performance polymers that are not limited to TC.

Polymer modeling and cross-sectional area calculation
Polymer modeling is a monomer to chain process, implemented in the STK tool, with input parameters of monomer SMILES and degree of polymerization 73 .The length of the polymer chains was set uniformly to 50 nm, and the degree of polymerization was obtained by dividing the chain length by the monomer length and rounding up to an integer.Starting from the polymer SMILES, a molecular chain with polymerization degree 2 was generated by RDKit and optimized using the MMFF force field 74 .Then, the monomer length was determined by measuring the distance between equivalent atoms in two repeating units in the heat transport direction.Following the modeling, a Python pipeline of PYSIMM realized the assignment of GAFF2 force field parameters and the generation of MD simulation input structure data files 75 .
The cross-sectional area is one of the important parameters for thermal conductivity analysis.In molecular dynamics simulations, the calculation of the cross-sectional area is difficult for systems that do not occupy the entire simulation box.The cross-sectional area was estimated by the ratio of the van der Waals volume to the length of the monomer 9 .The Van der Waals volume of the monomer was calculated by the sum of atomic and bond contributions, and has been successfully tested and applied in previous drug compounds 76 .

Calculation of TC by MD simulations
The TC of Polymer chains were obtained by non-equilibrium molecular dynamics simulations (NEMD) performed in a Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) 75 .
In terms of the NEMD method, the heat energy exchange was achieved by an enhanced version of the heat exchange algorithm, which rescales and shifts the velocities of particles inside reservoirs to impose a constant heat flux 77 .The polymer chains were placed in a box of 540×60×60 (x×y×z) Å box, where the dimension in the y and z directions was set to 60 Å to avoid interaction with the neighboring polymer chains.Before TC calculation, the polymer chain structures were relaxed to reach a stable conformation.
Then, the polymer chain was divided into 50 slabs in the x direction, and the fixed regions at two ends of the chain were set as a heat-insulating walls.In the NEMD simulation, the system was run under NVT (constant number of atoms, volume, and temperature) and NVE (constant number of atoms, volume, and energy) ensembles for 1 ns at 300 K sequentially to release chain stress 30,78 .After that, the heat was added/extracted to the heat source/sink regions (20 Å of each region) at the end of the polymer chain in a regular rate to create a constant heat flux.The applied heat varies for different polymer chain structures and ranged from 0.01 eV/ps to 0.08 eV/ps.At last, the temperature profile was averaged over the last 2~3 ns and used for TC calculation based on Fourier's law  = −(    ⁄ ), where  is heat flux,     ⁄ is the temperature gradient.

Descriptors calculation and ML models construction
The ideal polymer descriptors are required to minimize and completely represent polymer information, and is one of the key factors in determining the prediction accuracy of ML algorithms.The physical descriptors for this work were sourced from both Mordred software calculations and GAFF2 force field parameters extraction.The Mordred software was initially developed for small molecule characteristics in cheminformatics, which can calculate more than 1800 descriptors 41 .However, since we consider two linkages of polymer monomers, only 286 valid descriptors were obtained.Therefore, as a complement, we additionally extracted parameters from each polymer force field file as the descriptor.For graph descriptors, MACCS, Morgan and cMorgan fingerprints were calculated in the RDKit package 39 .The Mol2vec fingerprints were embedded via Mol2vec 34 .We referred the polymer representation model trained using PoLyInfo and PI1M databases 54 .
The ML models of RF, XGBoost and MLP were implemented by using Scikit-learn 79 .
Hyperparametric optimization for RF, XGBoost and MLP was operated with the Bayesian Optimization package 80 which is a global optimization tool to achieve good prediction accuracy R 2 .The Gaussian regression process and acquisition function with 10 randomly pairs of parameters were selected for initial training, and the ideal parameters for each ML model were determined after 100 optimization iterations 46 .
To explain the association of optimized descriptors with TC, we used the SHAP toolkit with RF model to evaluate the feature importance 56 .The SHAP analysis is based on a game-theoretic approach that associates the optimal credit allocation with the local explanations of the model, which considers the model performance by neglecting each feature and provides direction of each descriptor effect 46 .

Mathematical formulas for TC fitted by symbolic regression (SR)
The mathematical formulae were acquired and selected using an efficient stepwise strategy with SR based on genetic programming (GPSR) as implemented in gplearn 68 .The 107 polymer structures with TC greater than 20.00 W/mK were randomly divided into 3:1 as training and test sets respectively.
At first, Pearson coefficients were used as evaluation metrics of training fitness to filter optimized descriptors and generate sub-descriptors, and a new dataset containing 22 descriptors was generated.
Further, the grid search strategy with the hyperparameters and metric R 2 as listed in Table 2 was applied to determine the mathematical formulas.We ultimately discussed four formulas at Pareto front were identified by Latin hypercube sampling approach 70,71 .More information about SR can be found in the Supplementary Section S9.

Analysis of phonon dispersion relations by phonon spectral energy density (Phonon-SED)
To understand the TC mechanism of polymers, MD simulations coupled with Phonon-SED approach 72 were employed to calculate the dispersion relations of polymers.The polymer chain with a length of 100 Å was constructed as an input of SMILES and placed into a box with the cross section of 60×60 Å.After energy minimization, the system was run under the NVT (constant number of atoms, volume, and temperature) ensemble for 0.25 ns at 2 K sequentially to release chain stress.Subsequently, Where  is the wavevector,  is the frequency,  0 is the simulation time,   is the mass of atom b,  is the cartesian direction,   is the number of the unit cell in the polymer chain, ̇(, ; ) is the velocity of atom b in the unit cell n at time t in the  direction, and (, 0; ) is the equilibrium position of unit cell n.The MD-inspired descriptors include force field related (FF-related) descriptors and thermal conductivity related (TC-related) descriptors, listed in Table S1.The ratio of the molecular weight of the main chain to that of the monomer Vdw Van der Waals volume of the monomer Cross-sectional Equivalent cross-sectional area of polymer chain The Mordred-based descriptors were calculated by Mordred software 1 , the 286 Mordred-based descriptors in this work are listed in Table S2, and the detailed meaning could be found at https://mordred-descriptor.github.io/documentation/master/descriptors.html(accessed Aug 10, 2022 ).S3.     S3.In this work, a total of 107 polymer structures with thermal conductivity greater than 20 W/mK were identified and validated by molecular dynamics simulations, as listed in  Polymer properties can be found in Table S4 using the PID.

Fig. 1 .
Fig. 1.Schematics of high-throughput screening of polymers with high TC via interpretable machine learning.

Fig. 2 .
Fig. 2. Visualization of polymer data distribution in a 2D space by UMAP.(a), (b) and (c) are corresponding to the selected, PolyInfo and PI1M datasets, respectively.

Fig. 3 .
Fig. 3. Polymer descriptors down-selection and ML models training.(a) Optimized descriptors acquired by down-selection with four coefficients -Pearson, Spearman, Distance, and MIC coefficients -and RF model.(b) Accuracy of RF model based on optimized descriptors, where training R 2 is 0.875 and test R 2 is 0.844.(c) Accuracy of ML models at different down-selection processes, including initial (Init.),mathematical correlation (Cor.)coefficients screening and RF model optimization (Opt.)stages.And, an additional PCA approach was applied to compare.(d) Accuracy of ML models with different polymer representation approaches.The violin plot represents the distribution of values, individual subsamples are shown in gray, and the mean and standard of R 2 in black.(e) Pearson correlation matrices showing correlations among optimized descriptors and TC.The inset is the statistics of the Pearson coefficients distribution.

Fig. 4 .
Fig. 4. Analysis of feature importance using SHAP on RF model trained by optimized descriptors.(a) Average SHAP values for 20 optimized descriptors.(b) Represent the SHAP values of each descriptor related to training data set polymers in a beeswarm diagram.(c) and (d) SHAP values for the Crosssection and Kd_average of the training data set polymers as a function of descriptor value.The Crosssection is the effective cross-sectional area of polymer chain, and the Kd_average is the average value of force constants of the dihedral angle from GAFF2 force field.

Fig. 5 .
Fig. 5. Prediction of high TC polymers in PolyInfo and PI1M databases using constructed ML models.(a), (b) and (c) based on RF, XGBoost and MLP models, respectively.(d) Synthetic accessibility (SA) score versus calculated log2K of screened high TC polymers (TC > 20.00 W/mK).The star indicates PE, and the TC in this work is 38.98 W/mK.

Fig. 6 .
Fig.6c.The four points of c, d, e and f at Pareto front were identified by Latin hypercube sampling approach70,71  , and their corresponding formulas are expressed in TableS8.The complexities of the four formulas are in the range of 20 to 30, and the fitting accuracies are all greater than 0.70.Moreover, the training accuracy is mostly positive to complexity.For example, the formula represented by point c with complexity of 20 has a relatively low accuracy R 2 among the four points, but the fitting results are consistent with the MD labeled log2K, as demonstrated in Fig.6d.Meanwhile, all four identified formulas include the descriptors of the Cross-sectional, Kd_average and Nd_average, which verified that the TC of polymer chain is strongly correlated with the parameters such as cross-sectional area and dihedral stiffness.These formulas are meaningful in the initial rapid screening of high TC polymer chain structures.
polymer chains.Overall, simple linear polymer chains are easily to have long phonon mean free paths, especially for linear π-conjugated polymers of [*]C=C[*] and [*]N=N[*].These structures have large chain stiffness and few atoms except for the backbone, thereby having weak phonon-disorder scattering.

Fig. 7 .
Fig. 7. Structure and phonon dispersion relations for the eight promising polymers.(a) Polymer chain structures.(b) Phonon dispersion relations.The q is wavevector, the  is phonon frequency and the average phonon group velocity of one branch is estimated as the slope of the origin to the maximum frequency point as shown in the red dashed line in the PHTC001 structure.
the system was run under the NVE (constant number of atoms, volume, and energy) ensemble for 2 million steps with the timestep of 0.25 fs.During this period, the velocity and position of each atom in the polymer backbone were recorded with intervals of 20 steps.The Phonon-SED converted the time domain information of atomic velocities and positions into wave vectors versus angular frequencies via two-dimensional Fourier transform, expressed as (, ) , ; ) ×  ⋅(,0;)− | 2(1)

FF
FigureS2shows the RF, XGBoost and MLP models trained with the optimized descriptors.The test set is consisted of 100 randomly selected polymer structures from the benchmark dataset.

Fig. S2 .
Fig. S2.Construction of different ML models with optimized descriptors

Fig. S5 .
Fig. S5.Accuracy of ML models with various graph descriptors

Fig. S9 .
Fig. S9.Filtering optimized descriptors and creating sub-descriptors in gplearn through Pearson coefficient (PC) values.(a) Mathematical formula complexity versus PC values.(b) Statistics of formulas with Pearson coefficient not less than 0.85 by complexity.(c) and (d) Frequency of occurrence of optimized descriptors and sub-descriptors in 158 mathematical formulas (PC values >=0.85 and complexity <=10).

Table 1 .
Volumetric heat capacity, phonon group velocity, phonon mean free path and phonon thermal conductivity for the eight promising polymers

Table 2 .
Setup of hyperparameters in gplearn for GPSR

Table S1 .
Descriptions of all MD-inspired descriptors

Descriptions of optimized descriptors for polymer thermal conductivity (TC) prediction
Polymer optimized descriptors were obtained by downselection, listed in Table

Table S3 .
Descriptions of all MD-inspired descriptors

Table 4
, and the relevant monomer structure can be viewed in Figure S8 by polymer ID (PID).The TC obtained from RF, XGBoost and MLP model predictions and MD simulations are log2TC in W/mK.The effective cross-sectional area of the polymer is given in Å 2 .

Table S4 .
MD-validated polymer structures with thermal conductivity greater than 20 W/mK

Table S6 .
New descriptors ensemble for GPSR