Materials informatics (MI) is a growing interdisciplinary field of materials science that has attracted significant attention in recent years. MI uses machine learning to model, predict, and optimize the properties of new materials1,2. The most essential resource in MI is data. Hence, significant efforts have been made to develop open databases for inorganic materials and light-weight organic molecules, such as the Materials Project3 (~140,000 inorganic compounds), the Automatic-Flow4 (AFLOW: ~3,000,000 inorganic compounds), the Open Quantum Materials Database5 (OQMD: ~1,000,000 inorganic compounds), and QM96 (~134,000 organic molecules). In particular, the huge databases of computational properties built using high-throughput first-principles calculations have brought remarkable progress in MI and promoted its widespread use in science and technology. However, for polymeric materials, despite their industrial usefulness and unique characteristics, such as lightness, high tenacity, elasticity, and ease of processing, the development of open databases has lagged considerably behind that of other material systems7. This is due to the following reasons: (1) the high cost of data production, (2) the difficulty of creating common data given the diversity of polymeric materials in terms of structures and processing conditions, and (3) cultural barriers arising from the desire to avoid information leakage to competitors2. In addition, the technical difficulty and high computational cost of high-throughput calculations have hindered the development of computational property databases for polymeric materials.

PoLyInfo8 is currently the largest database of polymer properties, built from manually collected literature data. It contains ~100 properties of more than 18,000 polymers. However, the data in PoLyInfo are rather sparse, as there are few cases where more than one property is recorded for the same polymer. Polymer Genome9,10,11,12 is a database containing seven different electronic and optical properties of crystalline and single-chain states of polymers from first-principles calculations, together with several experimental properties of amorphous polymers. The computational properties include the crystal bandgap (562 polymers), polymer chain bandgap (3881 polymers), static dielectric constant of polymer crystals (383 polymers), and refractive index of polymer crystals (383 polymers). A common feature of these databases is that they do not provide application programming interfaces (APIs) and therefore do not allow automatic batch downloading of the data. Consequently, the creation of data resources conducive to data-driven research is vital for advancing polymer informatics.

Large-scale data of computational properties have proven to be an essential resource for machine learning applications in MI. For example, such big data have been used as source data for transfer learning when dealing with limited data in materials research. Transfer learning is a statistical methodology for reusing knowledge, data, or models acquired in one domain (the source domain) in another (the target domain)13,14. Suppose that establishing a machine learning model from scratch is difficult because too little experimental data are available. In such cases, a model is first trained on a large amount of computational property data, and the pretrained model is then fine-tuned using a small amount of experimental data to build a highly accurate prediction model in the target domain. Successful examples of cross-domain transfer between computational and experimental data have been reported for various material systems15,16,17,18,19,20, including our previous work on the prediction and synthesis of thermally conductive amorphous polymers using neural networks transferred from computational properties, in which only 28 samples were available in the target domain21.

For polymer properties, even computational data are rather scarce. Polymer Genome9,10,11,12 is the only existing database, and it is constructed using first-principles electronic structure calculations of polymers in crystalline states. However, the number of samples is currently small, and the calculations are limited to seven electrical and optical properties. For all-atom classical molecular dynamics (MD) simulations, which are powerful techniques for computing the equilibrium and nonequilibrium properties of condensed-phase systems of polymeric materials, only a few reported works have constructed large datasets with high-throughput calculations22,23,24. Afzal et al. created a dataset of 315 polymers using high-throughput MD simulations; however, the target properties were limited to the glass transition temperature (Tg) and thermal expansion coefficient24. To build a computational polymer property database, a workflow for high-throughput MD simulations must be established, which is technically challenging. The entire workflow of an MD simulation comprises several sub-modules, such as the specification of an empirical potential, the initialization of polymer chains, equilibrium and nonequilibrium MD simulations, and the calculation of properties from the simulated molecular trajectories; this complicates job control and error handling when fully automating the workflow. While this workflow can be partially streamlined using the pysimm Python package25, there is still no open-source software that facilitates building the entire workflow. In addition, various conditional parameters, such as the degree of polymerization, number of polymer chains, and annealing schedules, need to be determined appropriately. Furthermore, a unified platform is required to create various polymeric states such as amorphous structures, oriented structures, and polymer blends. Such a workflow also requires vast computational resources. For example, an equilibrium MD simulation of a conventional amorphous polymer requires an average run time of 30–50 h or more, based on our experiments conducted on a workstation with dual CPUs (Intel Xeon Gold 6148; 2.4 GHz) with 40 cores in total.

Herein, we present RadonPy, an open-source Python library for the fully automated calculation of a comprehensive set of polymer properties using all-atom classical MD simulations. For a given polymer repeating unit with its chemical structure, the entire process of the MD simulation can be performed fully automatically, including molecular modeling, equilibrium and nonequilibrium MD simulations, automatic determination of the completion of equilibration, scheduling of restarts in case of failure to converge, and property calculations in the post-processing step. In this first release, the library supports the calculation of 15 properties in the amorphous state, such as the thermal conductivity, density, specific heat capacity, thermal expansion coefficient, and refractive index. In this study, we calculated these 15 properties for more than 1000 unique amorphous polymers. The calculated properties were systematically validated against experimental values obtained from PoLyInfo. In particular, the focus here is on the thermal conductivity of polymers, which is becoming an important performance metric for designing polymeric materials used as insulating resins, molding resins, adhesives, and coating agents for mobile devices, given the increase in heat generation brought on by the miniaturization and performance improvement of such devices. During the high-throughput data production, we computationally identified eight amorphous polymers with extremely high thermal conductivities (>0.4 W ∙ m–1 ∙ K–1), including six polymers with unreported thermal conductivities. These polymers exhibited a high density of hydrogen bonding units or rigid, linear backbones. In addition, a decomposition analysis of the heat conduction, which is implemented in RadonPy, revealed the underlying mechanisms that yield such high thermal conductivities: heat transfer via hydrogen bonds and dipole–dipole interactions between polymer chains having hydrogen bonding units, or via covalent bonds of polymer backbones with high rigidity and linearity.

Results and discussion

Software overview

RadonPy is compatible with Python 3.7 to 3.9. RadonPy is designed to be used jointly with the chemoinformatics Python library RDKit26, with high compatibility between the input/output systems of each module in RadonPy and those of RDKit. The input parameter set for RadonPy comprises a simplified molecular input line entry system (SMILES)27 string with two asterisks representing the connecting points of a repeating unit, the polymerization degree, the number of polymer chains in a simulation cell, and the temperature. Subsequently, the following processes are fully automated (Fig. 1): the conformation search for the repeating unit; calculation of the electronic properties, such as the atomic charge and dipole polarizability, based on density functional theory (DFT); generation of initial configurations of polymer chains based on the self-avoiding random walk; assignment of the force field parameters; creation of a simulation cell such as an isotropic amorphous cell; MD simulation to equilibrate the system; determination of whether equilibrium has been reached; execution of nonequilibrium MD (NEMD) simulation for thermal conductivity calculation; and calculation of various physical property values. RadonPy is mainly designed to run on a supercomputer; multiple polymers are calculated independently in parallel using many computation nodes. The DFT and MD calculations are performed using Psi428 and the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS)29, respectively, through the RadonPy interface. Each step is detailed in the Methods section (see also Supplementary Notes in the Supplementary Information for a “Getting started with RadonPy” guide).
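As a minimal illustration of the input parameter set described above, the following sketch collects the four inputs in a small container and checks that the repeating-unit SMILES carries exactly two asterisks. The class name, field names, and example values are hypothetical and chosen only for illustration; this is not the RadonPy API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolymerJobInput:
    """Hypothetical container for the four inputs named in the text (not a RadonPy class)."""

    smiles: str                    # repeating unit with two '*' connecting points
    degree_of_polymerization: int  # number of repeating units per chain
    n_chains: int                  # number of polymer chains in the simulation cell
    temperature_k: float           # simulation temperature in kelvin

    def validate(self) -> None:
        """Raise ValueError if the inputs cannot describe a linear homopolymer job."""
        if self.smiles.count("*") != 2:
            raise ValueError("repeating-unit SMILES must contain exactly two '*'")
        if self.degree_of_polymerization < 1 or self.n_chains < 1:
            raise ValueError("polymerization degree and chain count must be positive")
        if self.temperature_k <= 0:
            raise ValueError("temperature must be positive")


# "*CC*" is the polyethylene repeating unit; the numbers are arbitrary placeholders.
job = PolymerJobInput("*CC*", 100, 10, 300.0)
job.validate()
```

Checking the asterisk count before any expensive step is a cheap way to reject malformed inputs early in a high-throughput run.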

Fig. 1: Flowchart of the automated MD calculation workflow for polymer properties using RadonPy.

RadonPy can automate each process to perform all-atom classical molecular dynamics simulations. Multiple polymers are calculated independently in parallel using many computation nodes in a supercomputer.

With the current release, the following properties are calculated from the equilibrium calculations: density, radius of gyration (Rg), specific heat capacities at constant pressure (CP) and at constant volume (CV), isothermal/isentropic compressibility, isothermal/isentropic bulk modulus, volume expansion coefficient, linear expansion coefficient, self-diffusion coefficient, refractive index, static dielectric constant, and nematic order parameter. Thermal conductivity and thermal diffusivity are calculated from the NEMD.

RadonPy outputs and stores trajectory data, including atomic coordinates and velocities, and thermodynamic data in the text-based dump files of LAMMPS. Calculated physical properties are stored in CSV format. In addition, the final system states of the equilibration and NEMD runs, including atomic coordinates and velocities, are saved as Python pickle files, so that a final state can be restored to restart further MD calculations.


The PoLyInfo database contains 15,335 homopolymers composed only of the following 10 elements: H, C, N, O, F, P, S, Cl, Br, and I. Among these, we selected 1138 unique homopolymers as the calculation targets in this study, choosing those for which as many experimental properties as possible were recorded. The selected polymer set covered a wide variety of polymer backbones, such as polystyrenes, polyvinyls, polyacrylates, polyamides, polycarbonates, polyurethanes, and polyimides. The validation data for the density, thermal conductivity, refractive index, specific heat capacity CP, linear expansion coefficient, and volume expansion coefficient were collected from PoLyInfo. The data used were limited to homopolymers meeting the following conditions: the material type was labeled as neat resin; the samples contained no additives, fillers, or dopants; the measurement temperature was in the range of 273–323 K; the post-forming state was amorphous or unidentified; and the topology of the polymer was linear or unidentified.

Distribution of calculated polymers in chemical space

The automated MD calculations were conducted for the 1138 homopolymers selected from the PoLyInfo database. Five independent calculations were run for each polymer; the automatic calculation succeeded at least once for 1070 polymers, at least three times for 1001 polymers, and in all five runs for 759 polymers. The failed calculations were classified into four cases: the structural optimization in the DFT calculation did not converge, the MD simulation did not reach equilibrium, the system was partially oriented (nematic order parameter > 0.1), or the temperature gradient in the NEMD calculation did not become linear.

To investigate how the backbones of the 1070 calculated polymers are distributed among the 15,335 polymers in PoLyInfo, their chemical structures were projected onto a 2D space using uniform manifold approximation and projection (UMAP)30. The chemical structure of each polymer was transformed into a 2048-bit vector using an extended connectivity fingerprint with a radius of three atoms (ECFP6)31. To account for the repeating structure of polymers, the ECFP6 descriptor was constructed after generating a macrocyclic 10-mer of the repeating unit. UMAP with the Hamming distance was used to create the 2D representation of the 15,335 fingerprinted polymers, as shown in Fig. 2a; the subset corresponding to the 1070 polymers successfully calculated at least once is shown in Fig. 2b. The plot colors indicate the 21 classes of polymer backbones defined in the PoLyInfo database. The two distributions exhibited similar patterns in the UMAP plot, indicating no significant selection bias in the calculated polymers. In addition, the calculated polymers covered 20 of the 21 classes, the exception being the class labeled “others”.
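The Hamming distance used for the embedding simply measures how many bit positions differ between two fingerprints. The sketch below is a plain-Python illustration on short toy bit vectors rather than real 2048-bit ECFP6 fingerprints; the function name is hypothetical, and returning the fraction of differing bits (rather than the raw count) is an assumption that does not affect neighbor rankings for fixed-length vectors.

```python
def hamming_fraction(fp_a, fp_b):
    """Return the fraction of positions at which two equal-length bit vectors differ."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    if not fp_a:
        raise ValueError("fingerprints must not be empty")
    differing = sum(a != b for a, b in zip(fp_a, fp_b))
    return differing / len(fp_a)


# Toy 8-bit vectors standing in for 2048-bit fingerprints.
fp_1 = [1, 0, 1, 0, 0, 1, 0, 0]
fp_2 = [1, 1, 0, 0, 0, 1, 0, 0]
distance = hamming_fraction(fp_1, fp_2)  # two of eight bits differ
```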

Fig. 2: UMAP plot visualizing the distribution of the polymer backbones.

The UMAP plots show the distribution of (a) 15,335 homopolymers in PoLyInfo and (b) 1070 homopolymers calculated in this study. The 21 classes of the polymer backbones are color-coded according to the definition of PoLyInfo.

Validation of the calculated physical properties

To evaluate the performance of the automated MD pipeline, the calculated properties were systematically compared with the experimental values from PoLyInfo (Fig. 3). In the validation process, we used the 1001 polymers for which the automatic calculation was successfully completed at least three times out of the five independent trials. We also examined the effect of the simulation box size on the calculated properties (see “Examination of box size effects” in the Supplementary Discussion in the Supplementary Information). For this box size test, 28 different polymers were chosen by taking structural variations into account (Supplementary Fig. 11). The number of polymer chains was varied from 10 to 50, with the number of atoms in each polymer chain set to 1000–2000. According to the results of these tests, the box size had no significant effect on the calculated properties under these conditions (Supplementary Figs. 6–10).

Fig. 3: Comparison between the MD-calculated properties of various amorphous polymers (vertical axis) and their experimental values in PoLyInfo (horizontal axis).

The six panels show parity plots of (a) density (N = 382), (b) thermal conductivity (N = 34), (c) refractive index (N = 107), (d) specific heat capacity (N = 66), (e) linear expansion coefficient (N = 165), and (f) volume expansion coefficient (N = 144). The error bar indicates the standard deviation of the calculated or measured properties within the same polymer. The dashed black line indicates the y = x line. The red line is the regression line fitted to the calculated and experimental values.
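The two summary statistics quoted for each parity plot, the slope of the fitted line and R², can be sketched as follows. This is an illustrative implementation under stated assumptions: the regression is taken to be an ordinary least-squares fit of calculated on experimental values, and R² is taken to be the squared Pearson correlation; the text does not specify either choice, and the function name and toy numbers are hypothetical.

```python
from statistics import fmean


def parity_stats(experimental, calculated):
    """Return (slope, intercept, r_squared) of a least-squares fit of calculated vs experimental."""
    n = len(experimental)
    if n != len(calculated) or n < 2:
        raise ValueError("need two equal-length sequences with at least two points")
    mean_x = fmean(experimental)
    mean_y = fmean(calculated)
    sxx = sum((x - mean_x) ** 2 for x in experimental)
    syy = sum((y - mean_y) ** 2 for y in calculated)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(experimental, calculated))
    if sxx == 0 or syy == 0:
        raise ValueError("both sequences must have nonzero variance")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Toy data in which every calculated value is 80% of the experimental one.
slope, intercept, r_squared = parity_stats([1.0, 2.0, 3.0, 4.0], [0.8, 1.6, 2.4, 3.2])
```

A slope below one combined with a high R², as in this toy case, corresponds to a systematic underestimation rather than random scatter.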

The calculated density reproduced the experimental values well (R² = 0.890), albeit with a slight underestimation, as the slope of the fitted straight line in the parity plot was 0.805 (Fig. 3a). The standard deviation (SD) of the calculated values over the five independent trials was low. The slight underestimation can be explained as follows: since the polymerization degree in the present MD simulations is smaller than that under the experimental conditions in PoLyInfo, the mobility of the simulated polymer chains is larger than that in the real systems, resulting in an overestimation of the free volume. In addition, unobserved partial crystallization in the experimental samples could result in a higher experimental density than that of the fully amorphous state. Note also that, as reported in a previous study, MD calculations of the density of organic molecular liquids using the GAFF2 force field often perform poorly in high-density regions32. In our calculations, by contrast, no such discrepancy occurred in the high-density regions. This is because of the use of the modified force field parameters developed by Träg and Zahn33 for fluorocarbon polymers (see “Assignment of force field parameters” in the Methods section).

The calculated thermal conductivities showed reasonable agreement with the experimental values in PoLyInfo, although the correlation was only moderate (R² = 0.490), as shown in Fig. 3b. This could be because the experimental values of the thermal conductivity fluctuate owing to differences in the measurement methods and to temperature dependence. Moreover, there is a gap between the real and model systems, owing to differences in various factors, including the degree of polymerization and its distribution, degree of orientation, crystallinity, impurities, and polymer chain entanglement. In addition, as the thermal conductivity increases, the fluctuation in the calculated values among the independent trials increases significantly (Supplementary Fig. 1). Thus, for polymers with potentially high thermal conductivity, the number of independent MD trials should be increased to improve the accuracy.

The calculated refractive index reproduced the PoLyInfo data well (R² = 0.809), with a slight underestimation: the slope of the fitted straight line in the parity plot was 0.839 (Fig. 3c). This underestimation likely arises from the underestimation of the density noted above, because the refractive index is an increasing function of the density. The variation among the MD simulations was quite small. We conclude that a sufficiently high prediction accuracy was obtained for the refractive index.

Figure 3d shows the correlation of CP between the calculated and experimental values (R² = 0.602). The calculated CP showed an evident overestimation, as the fitted slope in the parity plot was 1.430. This overestimation is inevitable in classical MD because classical MD calculations do not include quantum effects: the thermally excited vibrational energy, and hence the heat capacity, of a classical harmonic oscillator is significantly higher than that of a quantum harmonic oscillator at the same frequency. The quantum-corrected-to-classical CP ratio decreases monotonically with increasing vibrational frequency. Thus, the ratio of the CP in PoLyInfo (CP,PoLyInfo) to the MD-calculated value (CP,MD) should decrease as the mean of the bond-stretching and bond-bending force constants increases. The theoretical consideration and the experimental observations are described in the Supplementary Discussion and Supplementary Fig. 12 in the Supplementary Information, respectively. Fortunately, since the observed correlation is relatively clear, it should not be difficult to correct the systematic bias by applying, e.g., transfer learning or multi-fidelity learning.
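The frequency dependence of this quantum effect can be illustrated with the standard single-oscillator (Einstein) expression, in which the ratio of the quantum to the classical heat capacity of one vibrational mode is x² eˣ / (eˣ − 1)² with x = hν / (kB T). The sketch below evaluates this textbook formula; it is not the correction used in the Supplementary Discussion, it describes a single harmonic mode rather than the full CP of a polymer, and the function name and example wavenumbers are chosen only for illustration.

```python
import math

PLANCK = 6.62607015e-34              # Planck constant, J s
BOLTZMANN = 1.380649e-23             # Boltzmann constant, J / K
LIGHT_SPEED_CM_S = 2.99792458e10     # speed of light, cm / s


def quantum_to_classical_ratio(wavenumber_cm, temperature_k):
    """Heat capacity of a quantum harmonic mode divided by its classical value kB."""
    if temperature_k <= 0 or wavenumber_cm < 0:
        raise ValueError("temperature must be positive and wavenumber non-negative")
    if wavenumber_cm == 0:
        return 1.0  # zero-frequency limit: classical behavior
    x = PLANCK * LIGHT_SPEED_CM_S * wavenumber_cm / (BOLTZMANN * temperature_k)
    # Equivalent to x^2 e^x / (e^x - 1)^2, written with exp(-x) to avoid overflow.
    return x * x * math.exp(-x) / math.expm1(-x) ** 2


# Ratios at 300 K for a soft mode, a mid-frequency mode, and a stiff stretching mode.
ratios = [quantum_to_classical_ratio(w, 300.0) for w in (10.0, 200.0, 1000.0, 3000.0)]
```

The ratio is close to one for soft modes and falls toward zero for stiff modes, which is consistent with the expectation that the overestimation grows with the mean force constant.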

The linear and volume expansion coefficients showed weak correlations (R² = 0.178 and 0.217, respectively) between the calculated and experimental values (Fig. 3e, f). The variations in the linear and volume expansion coefficients within the same polymer were significant both experimentally and computationally. Previous studies also reported that MD-calculated values of the volume expansion coefficients of molecular organic liquids are highly variable and reproduce the experimental values poorly32. Longer simulation timescales and larger simulation cells may be required for an accurate calculation of these properties.

Data distribution

The marginal distributions of the six properties for the calculated 1070 unique amorphous polymers are presented in the diagonal panels of Fig. 4, and their statistics are summarized in Table 1. The calculated thermal conductivities were distributed between 0.082 and 0.619 W ∙ m–1 ∙ K–1, with a mean of 0.240 W ∙ m–1 ∙ K–1. The thermal conductivity of unoriented polymers in the amorphous state is known to be typically less than 0.3 W ∙ m–1 ∙ K–1. Nevertheless, a few of the calculated polymers exhibited exceedingly high thermal conductivities. However, as mentioned above, in the high-thermal-conductivity regions, the fluctuations in the MD-calculated properties became significant. Thus, we narrowed the candidates down to eight highly reliable polymers, shown in Fig. 5, with small variation in the repeated calculations (SD < 0.05 W ∙ m–1 ∙ K–1). For polyethylene (PI1) and poly(vinyl alcohol) (PI241), experimental thermal conductivities were recorded in PoLyInfo. The calculated thermal conductivity of PI1 was 0.456 W ∙ m–1 ∙ K–1, which is consistent with the reported values (0.39–0.53 W ∙ m–1 ∙ K–1) of the polyethylene neat resin in PoLyInfo. On the other hand, the calculated thermal conductivity of PI241 was 0.439 W ∙ m–1 ∙ K–1, which is an overestimate compared to the reported value (0.31 W ∙ m–1 ∙ K–1) of the poly(vinyl alcohol) neat resin in PoLyInfo. The experimental thermal conductivities of the other six polymers were not recorded in PoLyInfo. Apart from polyethylene (PI1), the structural features of these polymers fall into three types: (1) polymers with a high density of hydrogen bonding units (PI241 and PI305), (2) aromatic polyamides with rigid, linear backbones (PI626), and (3) aromatic polyimides (PI687, PI711, PI715, and PI1093).
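The screening described above, a high mean thermal conductivity combined with a small spread over the repeated calculations, can be sketched as a simple filter. The thresholds (mean > 0.4 and SD < 0.05 W ∙ m–1 ∙ K–1) are taken from the text; the use of the sample standard deviation, the function name, and the toy trial values (which are not results from this study) are assumptions made for illustration.

```python
from statistics import mean, stdev


def reliable_high_conductors(trials_by_polymer, min_mean=0.4, max_sd=0.05):
    """Return {polymer_id: (mean, sd)} for polymers passing both thresholds."""
    selected = {}
    for polymer_id, values in trials_by_polymer.items():
        if len(values) < 2:
            continue  # a spread cannot be estimated from a single trial
        avg = mean(values)
        spread = stdev(values)  # sample standard deviation (an assumption)
        if avg > min_mean and spread < max_sd:
            selected[polymer_id] = (avg, spread)
    return selected


# Made-up thermal conductivities (W/m/K) from five independent trials per polymer.
toy_trials = {
    "poly_A": [0.45, 0.46, 0.44, 0.47, 0.45],  # high and reproducible
    "poly_B": [0.30, 0.62, 0.41, 0.55, 0.48],  # high on average but noisy
    "poly_C": [0.21, 0.22, 0.20, 0.21, 0.22],  # reproducible but low
}
selected = reliable_high_conductors(toy_trials)
```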

Fig. 4: Joint distribution of the six properties calculated from the automatic MD simulation, including the thermal conductivity (W∙m–1∙K–1), density (g∙cm−3), specific heat capacity CP (J∙kg−1∙K−1), bulk modulus (Pa), linear expansion coefficient (K−1), refractive index, and radius of gyration Rg (scaled).

The diagonal panels show the histograms of the individual property values. In the upper off-diagonal panels, a scatter plot of each pair of properties is shown, with its Pareto front set displayed as large dots that indicate the higher and lower bounds of specific heat capacity and thermal conductivity, refractive index and thermal conductivity, and thermal conductivity and refractive index. The lower off-diagonal panels show kernel density estimates of the bivariate joint distributions, displayed as contours.

Table 1 Summary statistics of the calculated properties, including the mean, standard deviation (SD), minimum, and maximum.

Fig. 5: Repeating units of identified polymers exhibiting a high thermal conductivity in amorphous states.

The compound identifier corresponds to the polymer ID in the calculated dataset.

In addition, because the calculated values of the density and refractive index correlated well with the experimental values, we investigated the distributions of these properties quantitatively. The calculated density values were distributed between 0.742 and 1.914 g∙cm−3, with a mean of 1.133 g∙cm−3. Twelve amorphous polymers were identified as having a high density (>1.75 g∙cm−3). These polymers were found to be rich in halogen atoms (Supplementary Fig. 2). The calculated values of the refractive index ranged from 1.274 to 1.839, with a mean of 1.550. Nine polymers were identified as high-refractive-index polymers in the amorphous state, with a refractive index greater than 1.75. These polymers had large π-conjugated backbones (Supplementary Fig. 3), indicating that the calculated high refractive index originated from the high polarizability of the large π-conjugated systems.

The joint distributions of the multiple properties, shown in the off-diagonal panels of Fig. 4, provide hypothetical insights into the hidden dependencies among the properties, and into the existence and location of the Pareto frontiers together with the chemical features of the constituent polymers. The observed Pareto frontier of the specific heat capacity and thermal conductivity suggests the difficulty of achieving both a high specific heat capacity and a low thermal conductivity in amorphous polymers. Polymers distributed around the Pareto frontier included mainly polystyrenes, polyacrylates, and hydrocarbon polymers. On the other hand, no Pareto frontier was observed in the region of higher thermal conductivity. The joint distribution of the thermal conductivity and refractive index shows that there are still unexplored regions of amorphous polymers combining a lower thermal conductivity with a higher refractive index, or a higher thermal conductivity with a lower refractive index. The thermal conductivity was approximately proportional to the scaled Rg. The scaled Rg was defined as Rg multiplied by 1/M^0.6 to remove the dependency on the molecular weight (M), based on the following scaling rule34:

$$R_{\mathrm{g}} \propto M^{0.6}$$
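The scaling used to remove the molecular weight dependency can be sketched in a few lines. The function name and the toy chains below are hypothetical; the only ingredient taken from the text is the division of Rg by M raised to the power 0.6.

```python
import math


def scaled_rg(radius_of_gyration, molecular_weight, exponent=0.6):
    """Return Rg / M**exponent, which removes the M**0.6 size dependence."""
    if molecular_weight <= 0:
        raise ValueError("molecular weight must be positive")
    return radius_of_gyration / molecular_weight ** exponent


# Two toy chains that follow Rg = c * M**0.6 exactly, with the same prefactor c.
prefactor = 0.05
short_chain = scaled_rg(prefactor * 5_000 ** 0.6, 5_000)
long_chain = scaled_rg(prefactor * 20_000 ** 0.6, 20_000)
same_after_scaling = math.isclose(short_chain, long_chain)
```

After scaling, chains of different length that obey the same power law collapse onto the same value, so that remaining differences reflect chain conformation rather than chain size.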

Another study confirmed computationally that the thermal conductivity is positively correlated with Rg for amorphous polyethylene35. Our study demonstrated that this dependency holds for a wide variety of amorphous polymers. The specific heat capacity was inversely proportional to the density. This observation can be explained by the Dulong–Petit law36: the specific heat capacity is inversely proportional to the mean atomic weight (Supplementary Fig. 4) because the molar heat capacity is typically almost constant across materials, whereas the density is proportional to the mean atomic weight (Supplementary Fig. 5). In the joint distribution of the density and refractive index, no clear correlation was observed. According to the Lorentz–Lorenz equation (Eq. 17 in the Methods section), the refractive index is a function of the density and polarizability. The observed distribution implies that, for polymers in amorphous states, the polarizability is dominant in determining the refractive index.
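The Dulong–Petit argument can be written out as a short worked relation. This is an idealized sketch, assuming the classical high-temperature limit in which each mole of atoms contributes approximately 3R to the heat capacity; the symbols (c for the mass-specific heat capacity, ρ for the density, M̄ for the mean atomic weight, n for the number of atoms per unit volume) are introduced here for illustration and do not appear in the original text.

$$c \approx \frac{3R}{\bar{M}}, \qquad \rho = \frac{n\,\bar{M}}{N_{\mathrm{A}}}$$

Multiplying the two expressions eliminates the mean atomic weight:

$$c\,\rho \approx \frac{3R\,n}{N_{\mathrm{A}}} = 3 k_{\mathrm{B}}\, n$$

Hence, to the extent that the atomic number density n varies little among polymers, c is approximately inversely proportional to ρ.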

Decomposition analysis of thermal conductivity

A decomposition analysis was performed to understand the mechanisms underlying the high thermal conductivity of the eight polymers shown in Fig. 5 (see “NEMD simulation for thermal conductivity calculation” in the Methods section). As shown in Fig. 6, for each calculated thermal conductivity, the decomposition analysis quantified the contributions of six components corresponding to convection, bond, angle, dihedral, improper, and nonbonded terms, where the nonbonded contribution represents the sum of the pairwise and K-space contributions described in Eq. 4 in the Methods section. Since the contribution of the improper term was negligible, it is included in the dihedral term in Fig. 6. Notably, the AMBER-type force field describes the dihedral potential as the sum of the dihedral term and nonbonded 1–4 interactions; thus, a part of the nonbonded contribution is essentially attributable to the dihedral contribution37.
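The bookkeeping behind such a decomposition can be sketched as follows, assuming that the six component contributions are additive so that they sum to the total thermal conductivity. The function name and the numerical values are hypothetical placeholders, not results from this study; the merging of the improper term into the dihedral term mirrors the presentation described for Fig. 6.

```python
def contribution_fractions(components):
    """Merge 'improper' into 'dihedral' and return (total, fraction per component)."""
    merged = dict(components)
    improper = merged.pop("improper", 0.0)
    merged["dihedral"] = merged.get("dihedral", 0.0) + improper
    total = sum(merged.values())
    if total == 0:
        raise ValueError("total thermal conductivity must be nonzero")
    fractions = {name: value / total for name, value in merged.items()}
    return total, fractions


# Placeholder contributions in W/m/K (illustrative only).
toy_components = {
    "convection": 0.020,
    "bond": 0.100,
    "angle": 0.080,
    "dihedral": 0.030,
    "improper": 0.001,
    "nonbonded": 0.120,
}
total, fractions = contribution_fractions(toy_components)
dominant = max(fractions, key=fractions.get)
```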

Fig. 6: Contributions of convection and different types of interactions to the calculated high thermal conductivities of the eight polymers.

The colors in the bar chart represent the convection (red), bond stretching (purple), bond angle bending (violet), dihedral (blue), and nonbonded (green) terms.

The calculated thermal conductivity of PI1 (polyethylene) was 0.456 W ∙ m–1 ∙ K–1. The high thermal conductivity of PI1 was due to the significant contribution of bond bending.

The high thermal conductivities of PI241 (poly(vinyl alcohol)) and PI305 (poly(vinylene carbonate)) were largely due to the contributions of nonbonded interactions. The polymer chains of PI241 and PI305 contain highly condensed hydroxyl groups. This indicates that a high density of hydrogen bonding units provides strong intermolecular interactions through hydrogen bonds and dipole–dipole interactions, resulting in a significant contribution of nonbonded interactions. Thus, the thermal conductivities of PI241 and PI305 are enhanced by heat transfer via hydrogen bonds and dipole–dipole interactions.

In the aromatic polyamide PI626 (poly(p-phenylene terephthalamide), also known as Kevlar), the bond, angle, dihedral, and nonbonded interactions all showed moderately large contributions. These results can be explained as follows: the backbone of PI626 is relatively rigid, resulting in a significant contribution to the thermal conductivity through covalent bonds, and its amide groups can form hydrogen bonds and dipole–dipole interactions, resulting in moderately high contributions through nonbonded interactions.

The high thermal conductivities of the aromatic polyimides PI687, PI711, PI715, and PI1093 were largely due to the contributions of bond stretching. PI687 in particular had a significantly large bond-stretching contribution. Aromatic polyimides have rigid backbones, and that of PI687 is especially rigid and linear. The results show that the rigid and linear characteristics of a polymer backbone can help enhance the thermal conductivity through the contribution of bond stretching. PI1093 is an aromatic polyimide containing an amide group, and its nonbonded contribution was the largest among the four identified aromatic polyimides. This suggests that combining hydrogen bonding units with a rigid and linear backbone can further increase the thermal conductivity in the amorphous state.

Figure 7 shows the joint distribution of the total thermal conductivity with each quantified contribution. The correlations with the total thermal conductivity can be clearly observed in the bond, angle, dihedral, and nonbonded terms. On the other hand, the convection term did not correlate significantly with the thermal conductivity. In summary, thermally conductive amorphous polymers can be designed, in principle, by increasing the contributions of the bond, angle, dihedral, and nonbonded terms.

Fig. 7: Distribution of the thermal conductivity (W∙m−1∙K−1) and categorization in terms of its contributions from convection, bond, angle, dihedral, and nonbonded terms.

Diagonal panels represent the histograms of individual quantities. In the upper off-diagonal panels, the scatter plots of the six quantities are displayed. The lower off-diagonal panels represent their kernel density estimation, which is displayed with contours.

Transfer learning from MD values to experimental values

As described above, several properties showed significant discrepancies, including systematic bias, between the experimental and the MD-calculated values. The dependence of the MD simulations on initial conditions resulted in large fluctuations in the calculated properties, especially in the linear expansion and volume expansion coefficients. The experimental values of these two properties also fluctuated considerably, making them insufficiently reliable as a validation set. We believe that the application of machine learning can contribute to the reduction of these biases and variances. Hereafter, we demonstrate an example of calibrating the discrepancy of MD simulations by using transfer learning.

The target properties to be predicted were the specific heat capacity, linear expansion coefficient, and volume expansion coefficient. As discussed previously, the specific heat capacity exhibited a large bias between the experimental and MD-calculated values, which would originate from the presence or absence of quantum effects, and the latter two had significantly large variations even within the same polymer in both experimental and calculated properties. For each property, the source task of the transfer learning was to predict the MD-calculated properties, and the target task was to predict the experimental properties in PoLyInfo. A predictive model defines a mapping from a fingerprinted chemical structure of a given polymer repeating unit to the experimental or MD-calculated property. The workflow of the shotgun transfer learning18 is outlined as follows (see the Supplementary Methods in the Supplementary Information for more details):

  1. All samples in the MD properties dataset with their polymers included in the PoLyInfo experimental dataset were removed to obtain the dataset for the source task.

  2. Using the source dataset, we trained 100 neural networks with randomly generated network structures.

  3. We randomly selected 80%, 10%, and 10% of the experimental dataset for the training, validation, and test datasets, respectively, for the target task.

  4. Each pretrained neural network was fine-tuned using the training set of the target task, in order to obtain a transferred calibration model.

  5. The root mean squared error (RMSE) of each transferred model with respect to the validation set was calculated, and the prediction performance on the test set was examined using the model exhibiting the best transferability that achieved the smallest validation RMSE.
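Steps 3 and 5 of the workflow above can be illustrated with a minimal sketch. It assumes that the validation-set predictions of each fine-tuned network are already available as plain lists; the function names `split_indices`, `rmse`, and `select_best_model` are hypothetical and are not part of RadonPy or of the shotgun transfer learning code.

```python
import math
import random


def split_indices(n_samples, seed=0):
    """Shuffle sample indices and split them 80/10/10 (train/validation/test)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = n_samples * 8 // 10
    n_val = n_samples // 10
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def select_best_model(val_predictions, y_val):
    """Return (index, scores) of the model with the smallest validation RMSE."""
    scores = [rmse(y_val, pred) for pred in val_predictions]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

Only the selected model would then be evaluated on the held-out test set, so that the test error is not used for model selection.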

As shown in Fig. 8, for all the three properties, the transferred models showed significant improvements in predicting the experimental data, compared to the direct predictions from the MD calculations. The systematic bias in the specific heat capacity almost disappeared. Interestingly, for the linear expansion coefficient and volume expansion coefficient, the transferred model not only corrected for the systematic bias of the MD properties, but also significantly reduced the variability of the experimental values. The improvement in MAE for these property predictions reached 69% and 87%, respectively (Table 2), compared to that in the MD-based predictions.

Fig. 8: Comparison of the predictive performance of MD simulation and transfer learning.

The parity plots of (a) MD simulation and (b) transfer learning for experimental values of specific heat capacity (CP), linear expansion coefficient, and volume expansion coefficient. The experimental and predicted values are shown on the horizontal and vertical axes, respectively. Predictions are plotted in red; for transfer learning, fits to the training data are plotted in blue.

Table 2 Comparison of prediction performance on the PoLyInfo experimental dataset between the MD simulations and the machine learning model based on transfer learning (TL).

It is inevitable that any dataset mass-produced from a fully automated MD simulation will be subject to various kinds of biases and variances, because there are no calculation conditions universally applicable to a wide variety of polymer systems. The results shown here imply that machine learning techniques have great potential to bridge the gap between real systems and inherently incomplete computational models.

Summary and outlook

We presented RadonPy, which is the first open-source Python library to fully automate polymer property calculations using all-atom classical MD simulations. The high-throughput calculation using RadonPy was successfully performed for more than 1000 unique amorphous polymers with a wide variety of thermophysical properties, such as the thermal conductivity, refractive index, density, and specific heat capacity CP. For systems other than amorphous homopolymers, such as copolymers, blend polymers, and uniaxially oriented systems, as well as for other properties, automated calculation capabilities have already been implemented; however, no calculation protocols based on experimental data have been established. In RadonPy, automatic calculation protocols for various polymer properties can be implemented as an add-on feature. We will continue to promote the development of RadonPy.

In this study, the agreement between a total of six properties obtained from the high-throughput MD calculation and experimental values was comprehensively verified. As a result, the refractive index, density, and thermal conductivity successfully reproduced the experimental values quantitatively. The calculated values of the specific heat capacity were also highly correlated with the experimental values, although the classical MD calculation had a systematic bias due to its inability to represent quantum effects. For the linear and volume expansion coefficients, the correlation between the calculated and experimental values was weak due to large variations and uncertainties in both the calculations and experiments. There has been no previous work on a comprehensive validation of high-throughput MD simulations of polymer properties on such a scale. More rigorous comprehensive validation with experimental values, including other properties not discussed in this study, should be conducted to determine appropriate calculation conditions and protocols.

This study also revealed various issues related to the creation of a polymer properties database using high-throughput MD calculations. Properties, such as glass transition temperature, dielectric loss tangent, and cohesive energy density, are expected to be predictable under the same or nearly the same setting as the current calculation conditions. On the other hand, mechanical and viscoelastic properties, which are largely affected by polymer chain entanglement, would be difficult to predict with the current settings for molecular weight, timescale, generation of initial structure, high-order structure, etc. It is necessary to determine appropriate conditions for automated calculations according to individual properties. It is also important to produce temperature-dependent and molecular weight-dependent physical property profiles. We will then be faced with the problem of lacking a comprehensive set of experimental data necessary to determine appropriate calculation conditions. In this study, PoLyInfo was employed as the benchmark dataset, but as indicated by the observed large fluctuations of linear expansion coefficients and volume expansion coefficients, the quality and reliability of the current data are far from satisfactory. Data cleansing, which involves an enormous amount of work to trace back to the original paper for each record, will need to be performed. Alternatively, we will eventually be faced with the need to construct an experimental dataset for benchmarking, acquired in a controlled environment.

In addition, several issues related to data storage need to be further considered when building a large database. Currently, RadonPy stores and outputs all intermediate trajectory data, including atomic coordinates and velocities, in LAMMPS dump files. However, in the future database development, intermediate trajectory files may be discarded, except for the final states and the last several nanoseconds in the equilibration and the NEMD because of the enormous data size (~20 GB per polymer on average). The issue of data storage when building a large database is unavoidable. In addition, the issue of data formatting will be an obstacle. In the first stage, we focused only on linear polymers, so their representation could be handled with the SMILES notation. However, in the future, block or alternating copolymers and branched polymers will also be included in the automated calculation pipeline. Then, it will be necessary to introduce an advanced notation for polymers such as BigSMILES38.

Compared with other material systems, polymer research has lagged in terms of constructing open databases available for data-driven research. The primary objective in the development of RadonPy was to use it to create a systematically designed polymer property database. In the early days of MI in inorganic chemistry, the development of an open database was strategically promoted. In particular, huge computational property databases constructed using high-throughput first-principles calculations drove the evolution and widespread applications of MI. Large-scale computational property data have historically proven to be an important resource in MI, and RadonPy was designed for the rapid production of large amounts of polymer property data using highly parallel computers such as supercomputers. In this study, more than 1000 unique amorphous polymers were computed in ~2 months mainly using the supercomputer, Fugaku. In the future, our growing data will significantly facilitate the evolution of polymer informatics, just like the first-principles computational database for inorganic crystals.


Conformation search of a repeating unit

For a given SMILES string of a polymer repeating unit, 3D atomic coordinates of up to 1000 different molecular conformations were generated using the ETKDG version 2 method39,40,41 implemented in the Python library RDKit26. The SMILES string has two asterisk symbols representing the two attachment points of the repeating unit. These symbols were capped with hydrogen atoms. The potential energy of each conformation of a repeating unit was evaluated using the molecular mechanics calculation with the general Amber force field version 2 (GAFF2)37,42 after the geometry optimization. Subsequently, the optimized conformers were clustered by performing the Butina clustering43 based on the torsion fingerprint deviation44. The four most stable conformations were further optimized by performing DFT calculations with the ωB97M-D3BJ functional45,46 combined with the 6-31G(d,p)47,48 basis set. The most stable conformation was determined based on the DFT total energies.

Calculation of electronic property of a repeating unit

The atomic charges of a repeating unit were calculated using the restrained electrostatic potential (RESP) charge model49 with a single-point calculation of the Hartree–Fock method50 combined with the 6-31G(d) basis set on the optimized geometry of the most stable conformation. The total energy, the highest occupied molecular orbital (HOMO) energy level, the lowest unoccupied molecular orbital (LUMO) energy level, and the dipole moment were calculated with the single-point calculation using the ωB97M-D3BJ functional combined with the 6-311G(d,p) basis set45,46,51,52,53 for H, C, N, O, F, P, S, Cl, and Br atoms and with the LanL2DZ basis set54 for the I atom. In addition, the dipole polarizability tensor was obtained by applying the finite field method under an electric field of 1.0 × 10⁻⁴ a.u. using the ωB97M-D3BJ functional combined with the 6-311+G(2d,p) basis set45,46,51,52,53,55,56 for H, C, N, O, F, P, S, and Cl atoms, with the 6-311G(d,p) basis set for the Br atom, and with the LanL2DZ basis set for the I atom. The 6-311+G(2d,p) basis set was used because a basis set including double polarization and diffuse functions is required for appropriate polarizability calculations57. The isotropic dipole polarizability was defined as the mean of the diagonal values of the dipole polarizability tensor.
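The finite field evaluation and the isotropic average can be sketched as follows. This is an illustration under stated assumptions: a central-difference formula is used, which the text does not specify, and `dipole` is a hypothetical callable that returns the dipole vector from a single-point calculation under a given field vector (all quantities in atomic units).

```python
def polarizability_tensor(dipole, field_strength=1.0e-4):
    """Central-difference estimate of alpha_ij = d(mu_i)/d(F_j).

    `dipole` is a hypothetical callable mapping a field vector (Fx, Fy, Fz)
    to the dipole vector (mu_x, mu_y, mu_z); it stands in for a DFT run.
    """
    alpha = [[0.0] * 3 for _ in range(3)]
    for j in range(3):
        plus = [0.0, 0.0, 0.0]
        minus = [0.0, 0.0, 0.0]
        plus[j] = field_strength
        minus[j] = -field_strength
        mu_plus = dipole(plus)
        mu_minus = dipole(minus)
        for i in range(3):
            alpha[i][j] = (mu_plus[i] - mu_minus[i]) / (2.0 * field_strength)
    return alpha


def isotropic_polarizability(alpha):
    """Mean of the diagonal elements of the polarizability tensor."""
    return (alpha[0][0] + alpha[1][1] + alpha[2][2]) / 3.0
```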

Generation of polymer chains

A polymer chain was constructed by connecting repeating units with the self-avoiding random walk algorithm. To prevent unintended chiral inversions and cis/trans conversions caused by highly strained structures during chain growth, the bond between the head and capping atoms in the growing polymer chain and the bond between the tail and capping atoms in the next repeating unit were arranged to be coaxial and anti-parallel; the two capping atoms were then deleted, and a new bond between the head and tail atoms was created. The length of the new bond was 1.5 Å, and the dihedral angle around the new bond was randomized in the range of −180° to +180° during the self-avoiding step. Charge neutrality was ensured by adding the charges of the capping H atoms to the atoms to which they were bonded. In this study, polymer chains were created to include ~1000 atoms; thus, the degree of polymerization varies across polymers. By keeping the number of atoms at the same level for different polymers, the molecular weights were controlled to be almost the same. Thus, all calculated properties were obtained under conditions where the molecular weights were approximately the same. In addition, we investigated the sensitivity of the calculated properties to the change in the number of atoms. As a result, we confirmed that the number of atoms in the simulation cell has a negligible effect on the calculated properties, which is detailed in “Validation of the calculated physical properties” in the Results and Discussion section and in the Supplementary Discussion in the Supplementary Information. The tacticity of a polymer chain can also be controlled in this process using RadonPy. In this study, all the polymers were generated as atactic polymers.
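The charge bookkeeping described above, in which the charge of each deleted capping atom is added to its bonded neighbor, can be sketched as follows. The dictionary-based data layout and the function name `fold_cap_charges` are assumptions made for illustration and do not reflect RadonPy's internal representation.

```python
def fold_cap_charges(charges, caps):
    """Remove capping atoms and add their charges to the bonded atoms.

    charges: dict mapping atom id -> partial charge (hypothetical layout)
    caps:    dict mapping capping-atom id -> id of the atom it is bonded to
    """
    merged = {atom: q for atom, q in charges.items() if atom not in caps}
    for cap, neighbor in caps.items():
        merged[neighbor] += charges[cap]
    return merged
```

Because the total charge is conserved by construction, a repeating unit that is neutral with its caps remains neutral after the caps are deleted.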

Assignment of force field parameters

The GAFF2 force field is expressed as follows37:

$$\begin{array}{l}E_{{{{\mathrm{MM}}}}} = \mathop {\sum}\limits_{{{{\mathrm{bonds}}}}} {K_{{{\mathrm{b}}}}\left( {r - r_0} \right)^2} + \mathop {\sum}\limits_{{{{\mathrm{angles}}}}} {K_{{{\mathrm{a}}}}\left( {\theta - \theta _0} \right)^2} + \mathop {\sum}\limits_{{{{\mathrm{dihedrals}}}}} {K_{{{\mathrm{d}}}}\left[ {1 + \cos \left( {n_{{{\mathrm{d}}}}\varphi - \delta } \right)} \right]} \\ \quad \quad \quad + \mathop {\sum}\limits_{{{{\mathrm{impropers}}}}} {K_{{{\mathrm{i}}}}\left( {\chi - \chi _0} \right)^2} + \mathop {\sum}\limits_{{{{\mathrm{i}}}},{{{\mathrm{j}}}}} {\frac{{q_{{{\mathrm{i}}}}q_{{{\mathrm{j}}}}}}{{4\pi \varepsilon _0r_{{{{\mathrm{ij}}}}}}}} + \mathop {\sum}\limits_{{{{\mathrm{i}}}},{{{\mathrm{j}}}}} {4\varepsilon _{{{{\mathrm{ij}}}}}\left[ {\left( {\frac{{\sigma _{{{{\mathrm{ij}}}}}}}{{r_{{{{\mathrm{ij}}}}}}}} \right)^{12} - \left( {\frac{{\sigma _{{{{\mathrm{ij}}}}}}}{{r_{{{{\mathrm{ij}}}}}}}} \right)^6} \right]} \end{array}$$

where r, θ, φ, χ, and rij are the bond length, bond angle, dihedral angle, improper angle, and distance between atoms i and j, respectively; Kb, Ka, Kd, and Ki denote the force constants of the bond, bond angle, dihedral angle, and improper angle, respectively; r0, θ0, and χ0 are the equilibrium structural parameters of the bond, bond angle, and improper angle, respectively; nd is the multiplicity, and δ is the phase angle for the torsional angle parameters; qi and qj are the atomic charges of atoms i and j, and ε0 is the dielectric constant of vacuum; and εij and σij are the Lennard–Jones parameters determining the depth of the potential well and the distance at which the Lennard–Jones potential crosses zero, respectively. Compared with GAFF, GAFF2 has improved parameter values of Kb, r0, Ka, and θ0 to reproduce molecular geometries, vibrational spectra, and potential energy surfaces from higher-level quantum mechanics calculations, and improved non-bonded parameters to better reproduce ab initio interaction energies and experimental neat liquid properties42. This parameter set is suitable for thermal conductivity calculations because the reproducibility of the vibrational properties was considered in its development. The modified parameters for fluorocarbon developed by Träg and Zahn33 were used for fluorocarbon polymers. The GAFF2 parameters were automatically assigned to each polymer chain in RadonPy. If the pre-defined parameter set lacked the bond angle parameters Ka and θ0 for a certain atom group, these parameter values were empirically estimated in the same manner as GAFF2.
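For reference, the individual terms of the force field expression can be written as short functions. This is a minimal sketch, not RadonPy code; it assumes energies in kcal∙mol−1, lengths in Å, angles in radians, and charges in units of the elementary charge, with an approximate Coulomb conversion constant for that unit system.

```python
import math

# Approximate 1/(4*pi*eps0) in kcal*Angstrom/(mol*e^2); assumed unit system.
COULOMB_CONSTANT = 332.0637


def bond_energy(k_b, r, r0):
    return k_b * (r - r0) ** 2


def angle_energy(k_a, theta, theta0):
    return k_a * (theta - theta0) ** 2


def dihedral_energy(k_d, n_d, phi, delta):
    return k_d * (1.0 + math.cos(n_d * phi - delta))


def improper_energy(k_i, chi, chi0):
    return k_i * (chi - chi0) ** 2


def coulomb_energy(q_i, q_j, r_ij):
    return COULOMB_CONSTANT * q_i * q_j / r_ij


def lennard_jones_energy(epsilon, sigma, r_ij):
    s6 = (sigma / r_ij) ** 6
    return 4.0 * epsilon * (s6 * s6 - s6)
```

With this form, the Lennard–Jones term is zero at rij = σij and reaches its minimum of −εij at rij = 2^(1/6) σij.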

Generation of a simulation cell

A simulation cell containing amorphous polymers was constructed by randomly arranging and rotating 10 polymer chains such that they did not overlap with each other, resulting in an amorphous cell having ~10,000 atoms. Initially, the density of the amorphous cell was set to 0.05 g∙cm–3 and was then increased by conducting a packing simulation as described below.
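The size of the initial cell follows directly from the target density and the total mass of the chains. The sketch below assumes a cubic cell, which is not stated explicitly in the text, and the function name `cubic_cell_length` is hypothetical.

```python
AVOGADRO = 6.02214076e23  # 1/mol


def cubic_cell_length(total_molar_mass, density):
    """Edge length (Angstrom) of a cubic cell.

    total_molar_mass: molar mass summed over all chains in the cell (g/mol)
    density:          target mass density (g/cm^3)
    """
    volume_cm3 = total_molar_mass / (AVOGADRO * density)
    return volume_cm3 ** (1.0 / 3.0) * 1.0e8  # 1 cm = 1e8 Angstrom
```

For example, ten chains of 7000 g∙mol−1 each at 0.05 g∙cm−3 correspond to an edge length of roughly 132 Å.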

Packing simulation

The initial structure of the generated amorphous cell had a very low density. A packing simulation was performed to increase the density of the amorphous polymers to an appropriate value for subsequent calculations. A 1 ns NVT simulation with a Nosé−Hoover thermostat was performed while the temperature was increased from 300 K to 700 K; in the next 1 ns NVT simulation, the calculation cell was isotropically reduced to a density of 0.8 g∙cm−3 at 700 K. In this packing simulation, to prevent the self-aggregation of a polymer chain by intramolecular interactions leading to a globule-like structure, the Coulomb interaction was turned off, and the cutoff of the Lennard–Jones potential was set to 3.0 Å. Under this condition, the polymer chains remained in random-coil structures and could not pass through each other. Thus, the polymer chains were entangled in the final structure of the packing simulation. The time step was set to 1 fs, the periodic boundary condition (PBC) was applied, and all the bonds and angles, including those of the hydrogen atoms, were constrained by the SHAKE algorithm58 in this packing simulation.

Equilibration simulation

The amorphous polymers after the packing simulation were equilibrated by the 21-step compression/decompression equilibration protocol59 proposed by Larsen and co-workers. In this protocol, a temperature rise to 600 K and a drop to 300 K were repeated for ~1.5 ns while the system was compressed to 50,000 atm and then decompressed to 1 atm by combining the NVT and NpT simulations with a Nosé−Hoover thermostat and a barostat. After the 21-step equilibration, NpT simulations were run for more than 5 ns at 300 K and 1 atm until equilibrium was achieved. The achievement of the equilibrium was checked every 5 ns after the 21-step equilibration. In this study, the equilibrium state was defined as being reached when the following conditions were met: the relative standard deviations (RSDs) of the total, kinetic, bonding, bond angle, dihedral, van der Waals (vdW), and long-range Coulomb energy fluctuations were less than 0.05, 0.05, 0.1, 0.1, 0.2, 0.2, and 0.1%, respectively, and the RSDs of the density and radius of gyration fluctuations were less than 0.1 and 1%, respectively. Calculations that did not achieve equilibrium after 50 ns of equilibration calculations were treated as failures. The time step was set to 1 fs, the PBC was applied, and the SHAKE constraint58 was applied to all the bonds and angles, including those of the hydrogen atoms, in this equilibration simulation. The twin-range cutoff method60 was used for nonbonded interactions with a short cutoff of 8 Å and a long cutoff of 12 Å. The long-range Coulomb interaction was treated using the particle-particle particle-mesh (PPPM) method61. When the nematic order parameter decreased below 0.1, it was judged that the amorphous structure was appropriately generated; otherwise, it was treated as a failed calculation and removed from the data.
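The equilibrium criteria listed above can be expressed as a simple threshold check. The following sketch assumes that each monitored quantity is available as a time series over the checked window and uses the population standard deviation; the dictionary keys and function names are hypothetical and do not correspond to RadonPy identifiers.

```python
import statistics

# RSD thresholds in percent, taken from the criteria described in the text.
RSD_LIMITS_PERCENT = {
    "total_energy": 0.05,
    "kinetic_energy": 0.05,
    "bond_energy": 0.1,
    "angle_energy": 0.1,
    "dihedral_energy": 0.2,
    "vdw_energy": 0.2,
    "coulomb_long_energy": 0.1,
    "density": 0.1,
    "radius_of_gyration": 1.0,
}


def rsd_percent(values):
    """Relative standard deviation of a time series, in percent."""
    return 100.0 * statistics.pstdev(values) / abs(statistics.mean(values))


def is_equilibrated(series):
    """True if every monitored quantity is below its RSD threshold."""
    return all(rsd_percent(series[name]) < limit
               for name, limit in RSD_LIMITS_PERCENT.items())
```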

NEMD simulation for thermal conductivity calculation

To calculate the thermal conductivity, we performed the reverse NEMD simulation62 proposed by Müller-Plathe. The simulation box of the reverse NEMD was constructed by triplication of an equilibrated amorphous cell in the x-axis direction under the PBC. The reverse NEMD simulation involved dividing the simulation box into N slabs along the direction of the heat flux, which was generated in the system with temperature gradients induced by exchanging the velocity between the coldest atom in slab N/2 and the hottest atom in slab 0, as shown in Fig. 9. As a result, slab N/2 becomes the hottest in the cell, and the temperature gradually decreases towards slab 0 and slab N because of the PBC. To prevent the occurrence of temperature shifts due to cell replication, a preheating step with the NVT ensemble was run for 2 ps at 300 K. Subsequently, the reverse NEMD with the NVE ensemble was run for 1 ns. The number of slabs was set to 20, and the interval of velocity swapping was set to 200 fs. The time step was set to 0.2 fs, and the SHAKE constraint was not applied in the reverse NEMD simulation. The twin-range cutoff method was used for nonbonded interactions with a short cutoff of 8 Å and a long cutoff of 12 Å. The long-range Coulomb interaction was treated using the PPPM method. To validate the adequacy of the reverse NEMD calculation, RadonPy checked the linearity of the temperature gradient. A calculation result with poor linearity in the temperature gradient (R² < 0.95) was treated as a failure and removed from the data.
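The linearity check on the temperature profile can be sketched with an ordinary least-squares fit. This is an illustration only: which slabs enter the fit and how the two halves of the periodic profile are handled are not specified in the text, so the function below simply fits whatever slab positions and temperatures it is given.

```python
def linear_fit_r2(x, y):
    """Least-squares slope and coefficient of determination R^2 of y vs. x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, 1.0 - ss_res / ss_tot


def gradient_is_linear(positions, temperatures, threshold=0.95):
    """Apply the R^2 >= 0.95 acceptance criterion described in the text."""
    _, r2 = linear_fit_r2(positions, temperatures)
    return r2 >= threshold
```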

Fig. 9: Schematic representation of the simulation box for reverse nonequilibrium molecular dynamics.

The red and blue slabs are the hottest and coldest regions, respectively, in the simulation box.

A thermal conductivity decomposition analysis was performed for 100 ps. According to the Irving-Kirkwood equation63 as modified by Torii and co-workers64, the energy flux can be expressed as follows:

$$J_{{{\mathrm{u}}}} = \frac{1}{V}\left\{ {\mathop {\sum}\limits_{{{{\mathrm{i}}}} \in V} {e_{{{\mathrm{i}}}}v_{{{{\mathrm{i}}}},{{{\mathrm{u}}}}}} + \mathop {\sum}\limits_{{{{\mathrm{i}}}} \in V} {\left( {{{{\bf{S}}}}_{{{\mathrm{i}}}}{{{\bf{v}}}}_{{{\mathrm{i}}}}} \right)_{{{\mathrm{u}}}}} } \right\}$$

where Ju is the energy flux along the direction of the unit vector u, V is the volume, ei is the sum of the per-atom potential and kinetic energies, vi,u is the velocity component of atom i along u, vi is the velocity vector of atom i, Si is the per-atom stress tensor, and i is the atom index. The first and second terms represent the contribution to the energy flux via convection and interatomic interactions, respectively. The second term can be further divided into each component of the interactions. The component (a, b) of the stress tensor can be written as65,66,67

$${{{\bf{S}}}}_{{{{\mathrm{ab}}}}} = \mathop {\sum}\limits_{n = 1}^{N_{{{\mathrm{p}}}}} {{{{\bf{r}}}}_{{{{\mathrm{i}}}}0,{{{\mathrm{a}}}}}{{{\bf{F}}}}_{{{{\mathrm{i}}}},{{{\mathrm{b}}}}}} + \mathop {\sum}\limits_{n = 1}^{N_{{{\mathrm{b}}}}} {{{{\bf{r}}}}_{{{{\mathrm{i}}}}0,{{{\mathrm{a}}}}}{{{\bf{F}}}}_{{{{\mathrm{i}}}},{{{\mathrm{b}}}}}} + \mathop {\sum}\limits_{n = 1}^{N_{{{\mathrm{a}}}}} {{{{\bf{r}}}}_{{{{\mathrm{i}}}}0,{{{\mathrm{a}}}}}{{{\bf{F}}}}_{{{{\mathrm{i}}}},{{{\mathrm{b}}}}}} + \mathop {\sum}\limits_{n = 1}^{N_{{{\mathrm{d}}}}} {{{{\bf{r}}}}_{{{{\mathrm{i}}}}0,{{{\mathrm{a}}}}}{{{\bf{F}}}}_{{{{\mathrm{i}}}},{{{\mathrm{b}}}}}} + \mathop {\sum}\limits_{n = 1}^{N_{{{\mathrm{i}}}}} {{{{\bf{r}}}}_{{{{\mathrm{i}}}}0,{{{\mathrm{a}}}}}{{{\bf{F}}}}_{{{{\mathrm{i}}}},{{{\mathrm{b}}}}} + {{{\mathrm{Kspace}}}}({{{\bf{r}}}}_{{{{\mathrm{i}}}},{{{\mathrm{a}}}}},{{{\bf{F}}}}_{{{{\mathrm{i}}}},{{{\mathrm{b}}}}})}$$

The first to fifth terms denote the pairwise, bond, angle, dihedral, and improper contributions, respectively, where Fi denotes the force acting on atom i due to the interaction, ri0 denotes the relative position of atom i to the geometric center of its interacting atoms, and Np, Nb, Na, Nd, and Ni are the numbers of non-bonding atom pairs, bonds, bond angles, dihedral angles, and improper angles, respectively. The sixth term is the K-space contribution from the long-range Coulombic interactions. The partial thermal conductivity λpartial is given by

$$\lambda _{{{{\mathrm{partial}}}}} = \frac{{J_{{{{\mathrm{partial}}}}}}}{{J_{{{{\mathrm{total}}}}}}}\lambda _{{{{\mathrm{total}}}}}$$

where Jtotal is the total heat flux calculated by Eq. 3, Jpartial is the partial heat flux subdivided by each term of Eqs. 3 and 4, and λtotal is the total thermal conductivity calculated by the reverse NEMD.
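The decomposition amounts to distributing the total thermal conductivity in proportion to the partial heat fluxes. The sketch below assumes that the listed partial fluxes sum to the total flux; the function name and the dictionary keys are hypothetical.

```python
def decompose_thermal_conductivity(lambda_total, partial_fluxes):
    """Split the total thermal conductivity in proportion to partial fluxes.

    partial_fluxes: dict mapping a contribution name (e.g. "convection",
    "bond", "angle") to its time-averaged heat flux; the total flux is
    assumed to be the sum of all listed contributions.
    """
    j_total = sum(partial_fluxes.values())
    return {name: lambda_total * j / j_total
            for name, j in partial_fluxes.items()}
```

By construction, the partial thermal conductivities sum to the total thermal conductivity.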

Calculation of physical properties

The density in an NpT simulation was computed using the mass m and volume V of the system as follows:

$$\rho = \frac{m}{{\left\langle V \right\rangle }}$$

where the angular brackets \(\left\langle \cdot \right\rangle\) represent time averaging.

The radius of gyration Rg was calculated using the following equation:

$${{{\mathrm{Rg}}}} = \sqrt {\frac{1}{N}\mathop {\sum}\limits_{{{{\mathrm{k}}}} = 1}^N {\left( {{{{\bf{r}}}}_{{{\mathrm{k}}}} - {{{\bf{r}}}}_{{{{\mathrm{mean}}}}}} \right)^2} }$$

where rk is the position of a repeating unit k, and rmean denotes the mean position of the repeating units in a polymer chain.
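As a minimal sketch, the radius of gyration can be evaluated from a list of repeating-unit positions. How the position of a repeating unit is defined (for example, its center of mass) is not specified here and is left to the caller.

```python
import math


def radius_of_gyration(positions):
    """Radius of gyration of a chain from repeating-unit positions (x, y, z)."""
    n = len(positions)
    mean = [sum(p[d] for p in positions) / n for d in range(3)]
    squared = sum(sum((p[d] - mean[d]) ** 2 for d in range(3))
                  for p in positions)
    return math.sqrt(squared / n)
```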

The specific heat capacity at constant pressure CP was calculated from the fluctuations in the enthalpy H68:

$$C_{{{\mathrm{P}}}} = \frac{{\left\langle {{{{\mathrm{\delta }}}}H^2} \right\rangle }}{{k_{{{\mathrm{B}}}}T^2m}}$$

where kB is the Boltzmann constant, and T is the temperature. The enthalpy was calculated using the constant pressure of 1 atm because the calculated pressure value in the NpT simulations has a significant fluctuation, leading to inaccurate CP calculation.
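The fluctuation formula for CP can be sketched as follows, assuming SI units (enthalpy in J, temperature in K, mass in kg) and an enthalpy time series sampled from the NpT run. The function name is hypothetical.

```python
BOLTZMANN = 1.380649e-23  # J/K


def specific_heat_cp(enthalpy, temperature, mass):
    """C_P = <dH^2> / (k_B T^2 m) in J/(kg K), from an enthalpy time series."""
    n = len(enthalpy)
    mean_h = sum(enthalpy) / n
    var_h = sum((h - mean_h) ** 2 for h in enthalpy) / n
    return var_h / (BOLTZMANN * temperature ** 2 * mass)
```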

The isothermal compressibility βT and isothermal bulk modulus KT were calculated from the fluctuations in the volume V68:

$$\beta _{{{\mathrm{T}}}} = \frac{{\left\langle {\delta V^2} \right\rangle }}{{k_{{{\mathrm{B}}}}T\left\langle V \right\rangle }}$$
$$K_{{{\mathrm{T}}}} = \frac{1}{{\beta _{{{\mathrm{T}}}}}}$$

The volume expansion coefficient αP was calculated from the covariance of the volume V and enthalpy H68:

$$\alpha _{{{\mathrm{P}}}} = \frac{{\left\langle {\delta V\delta H} \right\rangle }}{{k_{{{\mathrm{B}}}}T^2\left\langle V \right\rangle }}$$

Here, the enthalpy was calculated at a constant pressure of 1 atm. The linear expansion coefficient αP,l in the isotropic systems was calculated using the following equation68:

$$\alpha _{{{{\mathrm{P}}}},{{{\mathrm{l}}}}} = \frac{{\alpha _{{{\mathrm{P}}}}}}{3}$$

The specific heat capacity at constant volume CV was calculated from the following equation, associated with CP, αP, and βT68:

$$C_{{{\mathrm{V}}}} = C_{{{\mathrm{P}}}} - \frac{{\alpha _{{{\mathrm{P}}}}^2T\left\langle V \right\rangle }}{{\beta _{{{\mathrm{T}}}}m}}$$

The isentropic compressibility βS and isentropic bulk modulus KS were calculated using the following equations:

$$\beta _{{{\mathrm{S}}}} = \beta _{{{\mathrm{T}}}}\frac{{C_{{{\mathrm{V}}}}}}{{C_{{{\mathrm{P}}}}}}$$
$$K_{{{\mathrm{S}}}} = \frac{1}{{\beta _{{{\mathrm{S}}}}}}$$
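The chain of fluctuation formulas above can be evaluated together from volume and enthalpy time series. The following is a sketch under the assumption of SI units (volume in m³, enthalpy in J, temperature in K, mass in kg); the function names and dictionary keys are hypothetical.

```python
BOLTZMANN = 1.380649e-23  # J/K


def _mean(x):
    return sum(x) / len(x)


def _cov(x, y):
    """Population covariance <dx dy> of two equal-length time series."""
    mx, my = _mean(x), _mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def fluctuation_properties(volume, enthalpy, temperature, mass):
    v_mean = _mean(volume)
    kt = BOLTZMANN * temperature
    beta_t = _cov(volume, volume) / (kt * v_mean)
    alpha_p = _cov(volume, enthalpy) / (kt * temperature * v_mean)
    c_p = _cov(enthalpy, enthalpy) / (kt * temperature * mass)
    c_v = c_p - alpha_p ** 2 * temperature * v_mean / (beta_t * mass)
    beta_s = beta_t * c_v / c_p
    return {
        "beta_T": beta_t, "K_T": 1.0 / beta_t,
        "alpha_P": alpha_p, "alpha_P_linear": alpha_p / 3.0,
        "C_P": c_p, "C_V": c_v,
        "beta_S": beta_s, "K_S": 1.0 / beta_s,
    }
```

Note that CV is nonnegative whenever it is computed this way, because the squared covariance of V and H cannot exceed the product of their variances.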

The self-diffusion coefficient was calculated using the Einstein equation68:

$$D = \mathop {{\lim }}\limits_{t \to \infty } \frac{1}{{6t}}\left\langle {\left| {{{{\bf{r}}}}\left( {t + t_0} \right) - {{{\bf{r}}}}\left( {t_0} \right)} \right|^2} \right\rangle$$

where t is the time, and r denotes the atomic position at the time.
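In practice, the long-time limit in the Einstein equation is commonly estimated from the slope of the mean squared displacement (MSD) in its linear regime. The sketch below takes this route; it assumes that the MSD has already been averaged over atoms and time origins and that only the linear regime is passed in, and it is not necessarily the procedure used in RadonPy.

```python
def diffusion_coefficient(times, msd):
    """Estimate D as (least-squares slope of MSD vs. time) / 6."""
    n = len(times)
    mean_t = sum(times) / n
    mean_m = sum(msd) / n
    stt = sum((t - mean_t) ** 2 for t in times)
    stm = sum((t - mean_t) * (m - mean_m) for t, m in zip(times, msd))
    return stm / stt / 6.0
```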

The refractive index n was obtained from the Lorentz–Lorenz equation:

$$\frac{{n^2 - 1}}{{n^2 + 2}} = \frac{{4{{{\mathrm{\pi }}}}}}{3}\frac{\rho }{M}\alpha _{{{{\mathrm{polar}}}}}$$

where αpolar is the isotropic dipole polarizability of a repeating unit computed from the DFT calculation, and M is the molecular weight of a repeating unit.
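Solving the Lorentz–Lorenz equation for n gives n² = (1 + 2x)/(1 − x), where x denotes the right-hand side. The sketch below assumes ρ in g∙cm−3, M in g∙mol−1, and the polarizability expressed as a polarizability volume in Å³; with these units, converting ρ/M into a number density requires Avogadro's constant. The function name is hypothetical.

```python
import math

AVOGADRO = 6.02214076e23  # 1/mol


def refractive_index(density, molar_mass, polarizability_a3):
    """Refractive index from the Lorentz-Lorenz equation.

    density:           g/cm^3
    molar_mass:        g/mol, of the repeating unit
    polarizability_a3: isotropic polarizability volume in Angstrom^3 (assumed unit)
    """
    number_density = density * AVOGADRO / molar_mass  # repeating units per cm^3
    x = 4.0 * math.pi / 3.0 * number_density * polarizability_a3 * 1.0e-24
    return math.sqrt((1.0 + 2.0 * x) / (1.0 - x))
```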

The static dielectric constant ε(0) was calculated using the equation69:

$$\varepsilon \left( 0 \right) = \frac{{\left\langle {{{{\bf{\mu }}}}^2} \right\rangle - \left\langle {{{\bf{\mu }}}} \right\rangle ^2}}{{3\varepsilon _0k_{{{\mathrm{B}}}}T\left\langle V \right\rangle }} + \varepsilon _{{{{\mathrm{el}}}}}$$

where μ is the dipole moment of the system, ε0 is the dielectric constant of vacuum, and εel is the contribution of the electronic polarization in the dielectric constant, which is evaluated as the square of the refractive index, n².

The nematic order parameter was calculated as the largest eigenvalue of the second-rank ordering tensor68 Qαβ, defined by the following equation:

$${{{\bf{Q}}}}_{{{{\mathrm{\alpha \beta }}}}} = \frac{1}{N}\mathop {\sum}\limits_{{{{\mathrm{i}}}} = 1}^N {\frac{1}{2}\left( {3{{{\bf{u}}}}_{{{{\mathrm{i\alpha }}}}}{{{\bf{u}}}}_{{{{\mathrm{i\beta }}}}} - {{{\bf{\delta }}}}_{{{{\mathrm{\alpha \beta }}}}}} \right)}$$

where uiα and uiβ (α, β = x, y, or z) are the components of the unit vector along the molecular axis of repeating unit i, δαβ is the Kronecker delta, and N is the number of repeating units. The molecular axis of each repeating unit is defined as the long axis found from the inertia tensor. The nematic order parameter takes on a value between 0 for an isotropic structure and 1 for a completely ordered structure.

The thermal conductivity λ was calculated according to Fourier’s law:

$$\lambda = \frac{{J_Q}}{{(\partial T/\partial x)}} = \frac{{\Delta E}}{{2A\Delta t(\partial T/\partial x)}}$$

where JQ is the heat flux, and ∂T/∂x is the temperature gradient of the NEMD simulation. The heat flux JQ can be calculated from the exchanged energy obtained using the Müller-Plathe algorithm ΔE, the cross-sectional area in the heat flux direction A, and the simulation time Δt. The thermal diffusivity κ was obtained from the calculated thermal conductivity λ, density ρ, and heat capacity CP:

$$\kappa = \frac{\lambda }{{\rho C_{{{\mathrm{P}}}}}}$$
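The last two relations can be sketched as follows, assuming consistent SI units throughout and taking the magnitude of the temperature gradient. The factor of 2 in the heat flux reflects that, under the PBC, heat flows from the hot slab towards the cold slab in both directions. The function names are hypothetical.

```python
def thermal_conductivity(exchanged_energy, area, elapsed_time, temperature_gradient):
    """lambda = dE / (2 A dt (dT/dx)), with all inputs in SI units."""
    heat_flux = exchanged_energy / (2.0 * area * elapsed_time)
    return heat_flux / temperature_gradient


def thermal_diffusivity(conductivity, density, cp):
    """kappa = lambda / (rho C_P), in m^2/s for SI inputs."""
    return conductivity / (density * cp)
```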