Exploring the frontiers of condensed-phase chemistry with a general reactive machine learning potential

Atomistic simulation has a broad range of applications from drug design to materials discovery. Machine learning interatomic potentials (MLIPs) have become an efficient alternative to computationally expensive ab initio simulations. For this reason, chemistry and materials science would greatly benefit from a general reactive MLIP, that is, an MLIP that is applicable to a broad range of reactive chemistry without the need for refitting. Here we develop a general reactive MLIP (ANI-1xnr) through automated sampling of condensed-phase reactions. ANI-1xnr is then applied to study five distinct systems: carbon solid-phase nucleation, graphene ring formation from acetylene, biofuel additives, combustion of methane and the spontaneous formation of glycine from early earth small molecules. In all studies, ANI-1xnr closely matches experiment (when available) and/or previous studies using traditional model chemistry methods. As such, ANI-1xnr proves to be a highly general reactive MLIP for C, H, N and O elements in the condensed phase, enabling high-throughput in silico reactive chemistry experimentation.



Figure 1. Ignition delay time (IDT) for biofuel simulations based on each of the major products CO, CO2, and H2O. To remove anomalies when only a single molecule is produced significantly prior to "true" ignition, we define IDT as the earliest time at which at least five molecules of a given product have been produced. The manuscript uses the average IDT value between CO, CO2, and H2O. (a) biofuel + O2 system; (b) biofuel with ethanol + O2; (c) biofuel with 2-butanol + O2; (d) biofuel with MTBE + O2.
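The IDT definition in the Figure 1 caption can be illustrated with a minimal sketch. The function names, the sampling interval, and the molecule counts below are hypothetical and chosen only for illustration; they are not taken from the simulations or from the analysis code used in this work.

```python
from statistics import mean


def ignition_delay_time(times, counts, threshold=5):
    """Return the earliest time at which `counts` reaches `threshold`, or None."""
    for t, n in zip(times, counts):
        if n >= threshold:
            return t
    return None


def average_idt(times, product_counts, threshold=5):
    """Average the per-product IDTs (here CO, CO2 and H2O)."""
    idts = {}
    for name, counts in product_counts.items():
        idt = ignition_delay_time(times, counts, threshold)
        if idt is None:
            raise ValueError(f"{name} never reaches {threshold} molecules")
        idts[name] = idt
    return mean(list(idts.values())), idts


# Hypothetical molecule counts sampled every 10 time units (illustrative only).
# The single early CO molecule at t = 10 does not trigger ignition because the
# threshold is five molecules.
times = [0, 10, 20, 30, 40, 50]
product_counts = {
    "CO": [0, 1, 1, 5, 9, 12],
    "CO2": [0, 0, 1, 2, 6, 10],
    "H2O": [0, 0, 3, 7, 11, 15],
}
avg_idt, per_product = average_idt(times, product_counts)
```

With a threshold of one molecule, the early CO event at t = 10 would dominate the CO-based IDT, which is the kind of anomaly the five-molecule criterion is meant to suppress.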

Figure 5. Energy correlation plot. Root-mean-squared error (RMSE) and mean absolute error (MAE) are reported as the average of eight ensemble models against a held-out test dataset.

Figure 7. Energy correlation plots for all ensemble members (members 1-8 from top-left to bottom-right). Root-mean-squared error (RMSE) and mean absolute error (MAE) are reported for each individual ensemble member. Recall that each ensemble member is evaluated against a different held-out test dataset. Therefore, although ensemble member 8 has a higher energy RMSE (with an MAE that is similar to other ensemble members), this appears to be the result of some large outliers, potentially from some poor DFT calculations.
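The observation that a few large outliers can raise the RMSE while leaving the MAE nearly unchanged can be checked with a small numerical sketch. The error values below are synthetic and purely illustrative; they are not drawn from the ANI-1xnr test data.

```python
import math


def mae(pred, ref):
    """Mean absolute error."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


def rmse(pred, ref):
    """Root-mean-squared error."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))


# Synthetic data: 100 reference values of zero, and predictions whose
# errors all have magnitude 1 (arbitrary units).
ref = [0.0] * 100
typical = [1.0 if i % 2 else -1.0 for i in range(100)]

# Same predictions, but one point is replaced by a large outlier.
with_outlier = list(typical)
with_outlier[0] = -20.0

mae_typical, rmse_typical = mae(typical, ref), rmse(typical, ref)
mae_outlier, rmse_outlier = mae(with_outlier, ref), rmse(with_outlier, ref)
```

In this synthetic case a single outlier changes the MAE from 1.00 to 1.19 but more than doubles the RMSE, because squared errors weight large deviations much more heavily.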

Figure 9. Validation of conservation of energy for ANI-1xnr. All results correspond to NVE simulations for the CH4 + O2 system. Panel (a) compares the energy drift over a 10 ps simulation with different time-steps. This figure justifies the choice of a 0.1 fs time-step for this system. Panel (b) presents the energy drift for a 100 ps simulation with a 0.1 fs time-step. The average energy drift with this time-step is approximately -7.6 × 10^-8 eV/ps/atom. Panel (c) demonstrates that the average energy fluctuations scale approximately linearly on a log-log plot with respect to time-step size. Compare to Figure 5 in de Oca Zapiain et al. (ref. 2).
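One common way to quantify the quantities in Figure 9 is a least-squares fit: the slope of the per-atom total energy against time gives a drift in eV/ps/atom, and the slope of log(fluctuation) against log(time-step) gives the scaling exponent. The sketch below assumes this fitting approach, which the caption does not specify, and uses synthetic trajectories; all function names are hypothetical.

```python
import math


def linear_slope(x, y):
    """Ordinary least-squares slope of y against x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    num = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    den = sum((xi - x_mean) ** 2 for xi in x)
    return num / den


def energy_drift(times_ps, total_energies_ev, n_atoms):
    """Drift of the total energy in eV/ps/atom (assumed: linear fit)."""
    per_atom = [e / n_atoms for e in total_energies_ev]
    return linear_slope(times_ps, per_atom)


def loglog_slope(time_steps_fs, fluctuations):
    """Exponent of a power law, i.e. the slope on a log-log plot."""
    log_dt = [math.log10(dt) for dt in time_steps_fs]
    log_fl = [math.log10(f) for f in fluctuations]
    return linear_slope(log_dt, log_fl)


# Synthetic 100 ps trajectory with a built-in drift (illustrative only).
n_atoms = 100
true_drift = -7.6e-8  # eV/ps/atom, the value quoted in the caption
times_ps = [0.1 * i for i in range(1001)]
energies = [n_atoms * (-5.0 + true_drift * t) for t in times_ps]
drift = energy_drift(times_ps, energies, n_atoms)

# Synthetic fluctuations following a quadratic power law in the time-step.
time_steps_fs = [0.1, 0.25, 0.5, 1.0]
fluctuations = [1e-6 * dt ** 2 for dt in time_steps_fs]
exponent = loglog_slope(time_steps_fs, fluctuations)
```

The quadratic exponent in the synthetic fluctuation data is an arbitrary choice made to test the fit, not a statement about the scaling observed for ANI-1xnr.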

Figure 11. Histogram of the system composition of all systems in the training dataset, colored by element.

Figure 13. Distribution of the molecule size (i.e., number of heavy atoms) in the ANI-1xnr training dataset.

The ANI neural networks used in this work were implemented in the NeuroChem C++/CUDA software package. A batch size of 32 was used while training the ANI-1xnr model. A weight of 1.0 was used on both the energy and force loss terms. Learning rate annealing was used during training, starting at a learning rate of 0.001 and converging at a learning rate of 0.00001. The ADAM update algorithm was used during training (ref. 1). The network architecture is provided in Table 2. The symmetry function parameters are provided in Table 3.

Table 2. ANI-1xnr neural network architecture.

0.500000, 0.646875, 0.793750, 0.940625, 1.087500, 1.234375, 1.381250, 1.528125, 1.675000, 1.821875, 1.968750, 2.115625, 2.262500, 2.409375, 2.556250, 2.703125, 2.850000, 2.996875, 3.143750, 3.290625, 3.437500, 3.584375, 3.731250, 3.
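The training description states only the starting (0.001) and final (0.00001) learning rates, not the annealing rule itself. As a minimal sketch, the code below assumes a reduce-on-plateau rule: the learning rate is multiplied by a fixed factor whenever the validation loss fails to improve for a set number of epochs, and training stops once the rate falls below the final value. The factor, the patience, and the function name are assumptions made for illustration and are not NeuroChem settings.

```python
def anneal_learning_rate(val_losses, lr_start=1e-3, lr_stop=1e-5,
                         factor=0.5, patience=2):
    """Return the learning rate used at each epoch.

    Assumed rule (not from the source): multiply the rate by `factor` after
    `patience` consecutive epochs without improvement in validation loss,
    and stop as soon as the rate drops below `lr_stop`.
    """
    lr = lr_start
    best = float("inf")
    stale = 0
    schedule = []
    for loss in val_losses:
        schedule.append(lr)
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor
                stale = 0
                if lr < lr_stop:
                    break  # converged: rate has annealed below the final value
    return schedule


# Hypothetical validation losses that never improve after the first epoch,
# so the rate is reduced as quickly as the rule allows.
schedule = anneal_learning_rate([1.0] * 100)
```

With these assumed settings the rate is halved seven times before it drops below 0.00001, so the smallest rate actually used is 1.5625 × 10^-5.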

Table 4. Model performance against held-out test dataset. Root-mean-squared error (RMSE) and mean absolute error (MAE) are reported as the average of eight ensemble models with the corresponding standard deviation. Energy errors are reported both as unnormalized and per-atom normalized values.
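The two reporting conventions in the Table 4 caption, per-atom normalization and ensemble mean with standard deviation, can be sketched as follows. The numbers are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the caption does not state which estimator was used.

```python
from statistics import mean, stdev


def energy_mae(pred, ref, n_atoms=None):
    """MAE of energies; if `n_atoms` is given, each error is first divided
    by the number of atoms in that structure (per-atom normalization)."""
    if n_atoms is None:
        n_atoms = [1] * len(ref)
    errors = [abs(p - r) / n for p, r, n in zip(pred, ref, n_atoms)]
    return mean(errors)


def ensemble_summary(member_metrics):
    """Mean and sample standard deviation across ensemble members."""
    return mean(member_metrics), stdev(member_metrics)


# Hypothetical energies for two structures of 10 and 20 atoms.
pred = [10.0, 21.0]
ref = [9.0, 19.0]
n_atoms = [10, 20]
mae_total = energy_mae(pred, ref)
mae_per_atom = energy_mae(pred, ref, n_atoms)

# Hypothetical metric values for the eight ensemble members.
member_rmse = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
rmse_mean, rmse_std = ensemble_summary(member_rmse)
```

Per-atom normalization makes errors comparable across systems of different size, which matters here because the training systems span a range of compositions.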