Development of scalable and generalizable machine learned force field for polymers

Understanding and predicting the properties of polymers is vital to developing tailored polymer molecules for desired applications. Classical force fields may fail to capture key properties, for example, the transport properties of certain polymer systems such as polyethylene glycol. As a solution, we present an alternative potential energy surface, a charge recursive neural network (QRNN) model trained on DFT calculations made on smaller atomic clusters that generalizes well to oligomers comprising larger atomic clusters or longer chains. We demonstrate the validity of the polymer QRNN workflow by modeling the oligomers of ethylene glycol. We apply two rounds of active learning (addition of new training clusters based on current model performance) and implement a novel model training approach that uses partial charges from a semi-empirical method. Our developed QRNN model for polymers produces stable molecular dynamics (MD) simulation trajectory and captures the dynamics of polymer chains as indicated by the striking agreement with experimental values. Our model allows working on much larger systems than allowed by DFT simulations, at the same time providing a more accurate force field than classical force fields which provides a promising avenue for large-scale molecular simulations of polymeric systems.

In this document, we elucidate a few of the methods and the results that have been truncated for brevity in the main text.

S1. Methods
In this particular section, we discuss the quantitative specifics and details of the various methods employed in generating the training dataset for the QRNN model training and inference.

S1.1. Sample preparation
The ten-step relaxation carried out on the initialized configurations is given by: • Brownian Dynamics on the initialized configuration for 100 ps at 10 K with a timestep of 1 fs.
• NVT simulation at 300 K for 50 ps with a timestep of 1 fs.
• NVT simulation at 70 K for 200 ps with a timestep of 1 fs.
• NPT simulation with an external pressure of 1 atm at 300K for 50 ps with a timestep 1 fs.
• NPT simulation with an external pressure of 1 atm at 300K for 200 ps with a timestep of 2 fs.
• NPT simulation with an external pressure of 100 atm at 300K for 10 ns with a timestep 2 fs.
• NPT simulation with an external pressure of 1 atm at 300K for 10 ns with a timestep of 2 fs.
• The cell size is averaged over the last 2 ns of the previous stage for the volume of the NVT simulation in the next stage.
• NVT simulation at 300 K for 5 ns with a timestep of 2 fs.
The relaxed configuration at the end of this is used as the starting configuration for QRNN MD.

S1.2. Dataset preparation for QRNN training
The sampled conformational clusters also include density variations in addition to the bond, angle, and dihedral deformations.On the other hand, decomposed samples only include bond breakage (i.e.extreme bond sampling).The breakdown of the regularly identified clusters and the subsequent number of conformational and decomposed clusters for the different N-molecule clusters for the monomer, dimer, and trimer of ethylene glycol are given in Table S1.

S1.3. Active learning
We generated additional 800,000 clusters using the split between the different Nmolecule clusters of the monomer, dimer, and trimer of ethylene glycol, as shown in Table S2.For the second round of active learning, we carry out the same analysis only from the extracted molecular clusters from our QRNN simulations, as shown in Table S3.We do not carry out any conformational sampling at this stage.

S2. Results
The model performance is in good agreement with the reference results with R 2 ∼ 1, as shown in Fig. S1, with an RMSE of 1.84 × 10 −3 eV/atom for energy, RMSE of 7.80 × 10 −3 for charges, and RMSE of 3.08 × 10 −1 eV/Å for forces .The collated experimental values of the properties we are interested in can be found in Table S7.
Molecule ρ (g/cm 3 ) c p (J/g-K) D (µm 2 /s) In addition, we also look at the figures of predicted density, specific heat capacity, and self-diffusivity along with how well they correlate with the experimentally obtained values.A high correlation even with an inaccurate prediction can prove fruitful since the experimental values can be obtained by applying a linear correction factor to the MD-predicted properties.We see that the first three models of QRNN -M, QRNN -M,D, and QRNN -M,D,T do have their benefits in the prediction of self-diffusivity, however, they are not well correlated with the experimental values of the properties that we are exploring.However, the models generated after the first round of active learning, show not just more accurate predictions in the experimental properties but they also show a better correlation with the experimental values.In total, we carry out two rounds of active learning with combinations of either training a whole new model or transfer learning from the previously learned model.Here we present the results from the most promising set of models trained from the two rounds of active learning.

S2.1. First round of active learning
In this section, we look at the performance of two models trained after adding the new samples from the active learning analysis.We compare both these models against QRNN -M,D,T which is one of the initially trained models.The two models we compare are: (i) QRNN -AL-TL which is transfer learns on QRNN -M,D,T and uses the composite dataset of the old samples and the new active learning ones; (ii) QRNN -AL-TL-xTB which uses transfer learns in a similar way as QRNN -AL-TL but instead of using the dipole moments as a feature for learning, we use the extended tight binding (xTB) charges.These newer models are not just qualitatively closer to predicting the experimental values but they also allow us to come up with a linear correction relation to be able to estimate the experimental density, diffusivity or specific heat capacity directly from the QRNN MD predictions.
The dimer scanning results from the first round of active learning which are not shown in the manuscript are given below.

S2.2. Second round of active learning
As established from the first round of learning, training on xTB charges can turn out to be beneficial.As a result, we carry out a second round active-learning and add these new samples to our training dataset.We then train a number of models that either use the dipole moments or xTB charges along with the forces and energy as the learning features.The best model QRNN -AL2.0-TL-xTB is compared against QRNN -M,D,T and QRNN -AL-TL from the previous rounds to show the superior performance.

Figure S1 :
Figure S1: Parity plot showing the model prediction of energies, charges, and forces against the reference values for 1486 72-atom configurations.

Figure S2 :
Figure S2: Calculated force to separate a pair of tetramer molecules from a given separation distance.The force is calculated by taking the gradient of the energy curves shown in Fig. 1 of the main text.

Figure
Figure S3: (a) Predicted density against the experimental density and the (b) correlation between them.

Figure
Figure S4: (a) Predicted diffusivity against the experimental diffusivity and the (b) correlation between them.

Figure
Figure S5: (a) Predicted specific heat capacity against the experimental specific heat capacity and the (b) correlation between them.

Figure
Figure S6: (a) Predicted density against the experimental density and the (b) correlation between them.

Figure
Figure S7: (a) Predicted diffusivity against the experimental diffusivity and the (b) correlation between them.

Figure
Figure S8: (a) Predicted specific heat capacity against the experimental specific heat capacity and the (b) correlation between them.

Figure S9 :
Figure S9: Separation energy between two molecules as a function of separation distance compared between the DFT calculation, OPLS4 force field, and the machine-learned force field for the (a) monomers, (b) dimers, (c) trimers and (d) tetramers of ethylene glycol.

Figure
Figure S10: (a) Predicted density against the experimental density and the (b) correlation between them.

Figure
Figure S11: (a) Predicted diffusivity against the experimental diffusivity and the (b) correlation between them.

Figure
Figure S12: (a) Predicted specific heat capacity against the experimental specific heat capacity and the (b) correlation between them.

Table S1 :
Number of original clusters extracted from OPLS4 MD, conformational sampled clusters, and decomposed sampled clusters for ethylene glycol, diethylene glycol, and triethylene glycol.

Table S2 : Number of clusters extracted
from QRNN MD and corresponding conformational sampled clusters for ethylene glycol, diethylene glycol, and triethylene glycol that are used for filtering for the first round of active learning.

Table S4 :
Comparison of the experimental density of ethylene glycol oligomers with the classically obtained density from MD simulations using an OPLS4 force field and with the machine-learned force field simulations.

Table S5 :
Comparison of the experimental self-diffusivity of ethylene glycol oligomers with the classically obtained density from MD simulations using an OPLS4 force field and with the machine-learned force field simulations.

Table S6 :
Comparison of the experimental specific heat capacity of ethylene glycol oligomers with the classically obtained density from MD simulations using an OPLS4 force field and with the machinelearned force field simulations.