Exploring the frontiers of condensed-phase chemistry with a general reactive machine learning potential

Zhang, Shuhao; Makoś, Małgorzata Z.; Jadrich, Ryan B.; Kraka, Elfi; Barros, Kipton; Nebgen, Benjamin T.; Tretiak, Sergei; Isayev, Olexandr; Lubbers, Nicholas; Messerly, Richard A.; Smith, Justin S.

doi:10.1038/s41557-023-01427-3

Article
Open access
Published: 07 March 2024

Nature Chemistry (2024)

Abstract

Atomistic simulation has a broad range of applications from drug design to materials discovery. Machine learning interatomic potentials (MLIPs) have become an efficient alternative to computationally expensive ab initio simulations. For this reason, chemistry and materials science would greatly benefit from a general reactive MLIP, that is, an MLIP that is applicable to a broad range of reactive chemistry without the need for refitting. Here we develop a general reactive MLIP (ANI-1xnr) through automated sampling of condensed-phase reactions. ANI-1xnr is then applied to study five distinct systems: carbon solid-phase nucleation, graphene ring formation from acetylene, biofuel additives, combustion of methane and the spontaneous formation of glycine from early earth small molecules. In all studies, ANI-1xnr closely matches experiment (when available) and/or previous studies using traditional model chemistry methods. As such, ANI-1xnr proves to be a highly general reactive MLIP for C, H, N and O elements in the condensed phase, enabling high-throughput in silico reactive chemistry experimentation.

Main

Over the past several decades, atomic-scale simulation has become an invaluable computational tool for providing microscopic explanations of experimentally observed phenomena. Many scientifically crucial chemical and materials properties can be evaluated through molecular dynamics (MD) simulation, wherein atomic motion is dictated by integrating the second law of Newtonian physics. The quantitative predictiveness of MD depends almost entirely on the accuracy of the underlying model potential energy surface (the potential) used to compute the forces acting on each atom. However, standard physics-based paradigms, such as classical force fields (FFs) and quantum mechanics (QM) methods, straddle a historical trade-off between computational cost, accuracy and generality, that is, being applicable to a broad range of systems without further specialization. This trade-off is especially pronounced in the context of modelling reactions, that is, the making and breaking of chemical bonds. Although computationally efficient, reactive FFs often need to be reparameterized to pre-determined reactions to be quantitatively accurate. By contrast, while QM methods are often quite reliable and generally applicable, their computational cost is prohibitive for many reactive MD studies. For this reason, a fast, accurate and general reactive potential is of paramount importance to many scientific applications, as it would fulfil the long-sought promise for predictive MD simulations that can provide reliable reaction rates, discover entire reaction networks and warn of dangerous conditions, all before entering the laboratory.

Recently, machine learning interatomic potentials (MLIPs) have been proposed to overcome the trade-off that has existed in physics-based computational models for many decades. MLIPs often achieve computational efficiency similar to classical FFs but with QM-level accuracy^{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}. Among the many different types of MLIPs that have been proposed, neural network (NN)-based MLIPs are especially capable of describing a broad range of chemical systems without additional specialization and, thereby, represent a top candidate for developing a truly general MLIP. For example, ANAKIN-ME (or ANI) is an NN-based MLIP that has been trained to large and chemically diverse datasets of organic molecules containing the elements C, H, N, O, S, F and Cl (refs. 17,18). While previous ANI MLIPs proved to be extremely accurate for near-equilibrium conformations of organic molecules in vacuo, these potentials do not address the challenges of modelling condensed phases (that is, periodic systems of liquids, supercritical fluids or solids) and reactive chemistry.

Several MLIPs have been developed for studying both condensed-phase and gas-phase (or in vacuo) reactive chemistry of a specific system^{19,20,21,22,23,24,25}. However, each of these studies required considerable domain and MLIP expertise and enormous computational resources to build a non-general reactive MLIP. For this reason, a highly general reactive MLIP would be transformational towards the usage and impact of MLIPs among non-experts. While recent endeavours have yielded groundbreaking results towards a general MLIP for approximately one third of the periodic table^26,27,28, these studies do not directly target reactive chemistry. Targeted, model-aware sampling strategies for dataset generation of three-dimensional atomic positions are especially essential for modelling rare events, such as chemical reactions^17,29.

Active learning (AL)³⁰ is a class of model-aware algorithms designed to automatically sample, select and label new data with the goal of efficiently generating a diverse and relevant dataset to train a more robust ML model. AL aims to ameliorate human bias through automating the decision-making process for adding new data to a training dataset. Recently, AL has been applied to develop numerous MLIPs trained to datasets of atomic positions labelled with energies and atomic forces from expensive QM calculations^{17,19,31,32,33,34,35,36}.

To develop a general reactive MLIP with AL, existing methodologies for selecting, labelling and training are relatively straightforward to apply (Methods). However, for sampling atomic positions, adequately exploring reactive chemical space in an automated fashion is extremely challenging³⁷ because it requires the exploration of chemical variance of molecular species in tandem with structural variance associated with non-equilibrium thermodynamic processes. The traditional approach of fitting reactive FFs^{38,39,40,41,42} to a limited dataset of pre-determined gas-phase reaction pathways based on chemical intuition is insufficient for developing a general reactive MLIP, as the resulting MLIP would be biased to only perform well on the assumed reaction network. Similarly, while recent work (performed simultaneously with and independently of this study) presented an automated approach to sample transition states and minimum-energy-path structures for gas-phase (or in vacuo) reactions of C, H, N and O molecules⁴³, this sampling procedure is unlikely to result in an MLIP that is robust for condensed-phase high-temperature reactive MD simulations. By contrast, training an MLIP directly to condensed-phase QM reactive data would ensure that the potential is reliable for the density ranges typically used in reactive MD simulations.

Wang and co-workers developed an elegant approach for the MD-based exploration of reaction pathways in the condensed phase, using QM methods, referred to as the ab initio nanoreactor (NR)^44,45. The NR was designed to model high-velocity molecular collisions of small molecules by using a fictitious biasing force to promote chemical reactions and the formation of new molecules, thus automatically exploring reaction pathways between arbitrary reactants and products. The ab initio NR was successfully able to predict graphene ring formation from pure acetylene as well as reaction pathways to form glycine, one of the building blocks of life, from small early earth molecules.

Although Wang et al. clearly demonstrated the promise of the ab initio NR to discover reactive chemistry, QM-driven MD sampling is extremely computationally intensive for generating a large training dataset within AL. In this Article, inspired by the work of Wang et al., we design an MLIP-driven NR sampling procedure that targets arbitrary reactive chemical processes and compositions of C, H, N and O elements, including near pure elemental systems and mixtures. Combined with the ANI model architecture and applying AL at scale, we aim to produce a robust and general reactive MLIP. Figure 1 shows a summary of the NR–AL workflow and the specific applications investigated in this work with the final model, referred to as ANI-1xnr.

**Fig. 1: Summary of the nanoreactor active learning workflow and specific applications considered.**

To evaluate the reliability of ANI-1xnr in practical research scenarios, we conduct several condensed-phase reactive chemistry simulations inspired by other literature with the ANI-1xnr potential, namely carbon solid-phase nucleation^46,47,48, graphene ring formation from acetylene with varying O₂ concentrations^44,49, biodiesel ignition with different fuel additives⁵⁰, methane combustion²⁴ and the spontaneous formation of glycine from early earth molecules^44,51,52. Across this wide range of applications, we show that ANI-1xnr provides results that are consistent with chemical intuition, experimental data, QM calculations (density-functional theory (DFT), Hartree–Fock and density-functional tight-binding (DFTB)) and classical reactive MD simulations (reactive force field (ReaxFF) and an application-specific MLIP) all without retraining.

This study demonstrates the capability of automated chemical exploration workflows to build a general-purpose reactive potential, resulting in ANI-1xnr, a fast, accurate and general potential capable of simulating a wide range of real-world reactive systems containing C, H, N and O elements.

Results

Nanoreactor active learning

Before assessing the performance of the ANI-1xnr model on the different case studies, we evaluated both the diversity and the completeness of the ANI-1xnr dataset. Figure 2 provides a two-dimensional visualization of a high-dimensional dataset by clustering together similar local atomic environments for the elements H (Fig. 2a), C (Fig. 2b), N (Fig. 2c) and O (Fig. 2d). Figure 2a–d compare the ANI-1xnr dataset and a non-reactive, near-equilibrium, molecule in vacuo, AL dataset (ANI-1x). Clearly, the ANI-1xnr dataset not only effectively encompasses the entire ANI-1x dataset but it also extends substantially beyond the local atomic environment space covered by ANI-1x. More importantly, the ANI-1xnr dataset provides pathways between many of the clusters in the ANI-1x dataset. These pathways probably correspond to reactions in a low-dimensional representation. Furthermore, Fig. 2e provides select examples of the over 1,000 unique molecules (consisting of ten or fewer CNO atoms; Methods) that are identified in the ANI-1xnr training dataset. Since the NR sampling simulations were initialized with only small molecules (consisting of two or fewer CNO atoms; Methods), the NR–AL procedure automatically discovered hundreds, if not thousands, of reaction pathways leading to these distinct molecular structures.

Carbon solid-phase nucleation

Accurate simulation of amorphous carbon systems has long been one of the top interests among chemists and materials scientists, as some distinct materials (for example, graphene, diamond and carbon nanotubes) form from amorphous carbon under different conditions. Understanding this behaviour would assist in the development of functional materials by controlling the solid-phase nucleation process. Many reactive FFs have been employed to simulate amorphous carbon in MD^48,53,54. With the widespread use of ML methods, researchers recently developed application-specific MLIPs to investigate amorphous carbon systems^46,55. These application-specific MLIPs proved accurate at predicting pure carbon fragments and mechanical properties of the bulk system. Despite these achievements, MLIPs trained on application-specific datasets would have very poor generality to new chemistry as the model has only been fit to a limited number of structures and reactions. On the other hand, the NR–AL approach presented in this work does not sample any specific form of carbon explicitly. We rely on the NR sampler and AL algorithm to automatically select physically relevant and unbiased configurations of carbon atoms. To validate ANI-1xnr in carbon solid-phase nucleation simulations under different conditions, we perform simulations at high (3.52 g cc⁻¹), medium (2.25 g cc⁻¹) and low (0.50 g cc⁻¹) densities.
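To give a feel for what these three densities mean for the simulation cell, the following minimal sketch converts a target density into the edge length of a cubic periodic box of pure carbon. The atom count `N_ATOMS` and the function name are hypothetical placeholders (the system size is not stated in this section); only the densities come from the text.

```python
# Avogadro constant (exact SI value) and the molar mass of carbon in g/mol.
AVOGADRO = 6.02214076e23
CARBON_MOLAR_MASS = 12.011


def cubic_box_length_angstrom(n_atoms: int, density_g_cc: float) -> float:
    """Edge length (in Å) of a cubic periodic box holding n_atoms carbon atoms."""
    mass_g = n_atoms * CARBON_MOLAR_MASS / AVOGADRO
    volume_cm3 = mass_g / density_g_cc
    volume_a3 = volume_cm3 * 1.0e24  # 1 cm^3 = 1e24 Å^3
    return volume_a3 ** (1.0 / 3.0)


N_ATOMS = 5000  # hypothetical atom count, not taken from the article
for rho in (3.52, 2.25, 0.50):  # densities quoted in the text, in g/cc
    length = cubic_box_length_angstrom(N_ATOMS, rho)
    print(f"{rho:.2f} g/cc -> {length:.2f} Å")
```

As a sanity check, eight atoms at 3.52 g cc⁻¹ give an edge of about 3.57 Å, close to the conventional lattice constant of cubic diamond, which is why this density favours the diamond phase.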

Figure 3 summarizes the product of each simulation. For each of the high- (Fig. 3a), medium- (Fig. 3b) and low-density (Fig. 3c) carbon simulations, ANI-1xnr produces the expected structure of carbon for the respective density^46,47,48. Specifically, for the system with the highest density (3.52 g cc⁻¹), diamond, graphene and hexagonal diamond phases coexist after 246 ps, with 70% of the carbon atoms in the simulation box forming the diamond cubic crystal structure. After another 2.3 ns, the high-density system contains 86% of carbon atoms in the diamond cubic crystal structure, with very few graphene and hexagonal diamond sites. In the medium-density (2.25 g cc⁻¹) system, 31% of atoms rapidly form graphene after 8.2 ps, and the system contains 83% graphene after another 2.3 ns. Graphene sheets tend to form a stacked and more ordered graphite-like structure, which is observed for the system slice in Fig. 3b. The low-density (0.50 g cc⁻¹) system forms carbon atom chains after 250 ps, with 11% of atoms forming graphene sheets. After another 3 ns, 88% of the atoms in the system are in graphene sheets. However, the graphene sheets in this low-density case are more disordered and appear to form fullerene-like closed or partially closed meshes. Extended Data Table 1 provides an analysis of the ANI-1xnr crystal lattice constants for diamond and graphite.

**Fig. 3: Carbon solid-phase nucleation simulation results for ANI-1xnr.**

Effect of oxygen on graphene ring formation

Wang et al.⁴⁴ applied the original ab initio NR method to observe ring formation (that is, the early stages of graphene formation) from a pure acetylene (C₂H₂) system. Subsequently, Lei et al.⁴⁹ presented DFTB NR simulations of acetylene in the presence of different amounts of oxygen, where O₂/C₂H₂ = 0, 0.1, …, 1 is the ratio of added O₂ while the number of C₂H₂ molecules is fixed to 40. Graphene formation is the dominant process for pure C₂H₂, as the generation of free radicals enables the rapid growth of hydrocarbon rings. By contrast, the addition of O₂ to the system deters or, at high enough O₂/C₂H₂ ratios, completely eliminates ring formation⁴⁹. Similar to the work of Lei et al.⁴⁹, we perform reactive simulations with varying ratios of C₂H₂ and O₂.

Figure 4 shows the amount of three-, four-, five-, six- and seven-membered rings formed with respect to simulation time for eight different O₂/C₂H₂ ratios. Increasing the oxygen ratio decreases the number of rings formed, which is in good agreement with the simulations from Lei et al. and experimental literature⁵⁶. Furthermore, although the branching ratios (that is, the relative production of different ring sizes) are not completely converged for all systems, the branching ratios are clearly in qualitative agreement with Lei et al. Specifically, six-membered rings are the predominant product, followed by five-membered and seven-membered rings at noticeably lower, but nearly equal, branching ratios. However, in contrast with Lei et al., six-membered rings form even for an O₂/C₂H₂ ratio of 0.5. In comparison, the simulations of Lei et al. predict ring formation for O₂/C₂H₂ ratios up to 0.2, but negligible ring formation for an O₂/C₂H₂ ratio of 0.4. The ANI-1xnr results are in much closer agreement with experimental data, which report graphene formation for O₂/C₂H₂ ratios between 0.4 and 0.8 (no experimental measurements were reported outside of this range). A clear explanation for the improved agreement between ANI-1xnr and experiment is the longer simulation timescales and the larger system sizes achievable by ANI-1xnr compared with DFTB (Methods). Specifically, for an O₂/C₂H₂ ratio of 0.5, six-membered rings only begin to form after 1 ns with ANI-1xnr. Considering that the DFTB simulations of Lei et al. ran for only 0.5 ns, our results suggest that six-membered rings could form under higher oxygen ratio conditions using DFTB at longer timescales. Although it is possible that even longer MD simulations could result in ring formation at even higher O₂/C₂H₂ ratios, this case study demonstrates the value of the lower computational cost of ANI-1xnr compared with traditional methods, such as DFTB, for discovering interesting phenomena that can only be observed during long-timescale simulations. Further validation of the ANI-1xnr simulation results is provided in Extended Data Fig. 1.

**Fig. 4: Effect of oxygen on graphene ring formation simulation results for ANI-1xnr.**

Comparison of biofuel additives

To promote combustion processes of liquid fuel, fuel additives are utilized as detergents, oxygenates, emission depressors, corrosion inhibitors, dyes and to increase the octane number. Chen et al.⁵⁰ performed high-temperature high-pressure MD simulations with ReaxFF^40,42 to predict the mechanisms and kinetics of several fuel additives, including ethanol, 2-butanol and methyl tert-butyl ether (MTBE). According to their results, 2-butanol was the best fuel additive at enhancing ignition while MTBE demonstrated similar ignition enhancement to 2-butanol. By contrast, ethanol was the worst fuel additive, having a negligible effect on the O₂ consumption rate and ignition delay time (IDT) compared with the clean biofuel.

To validate the reliability of ANI-1xnr for simulating biodiesel and to investigate the reported ignition enhancement of fuel additives, we reproduced four systems simulated by Chen et al.⁵⁰, namely, clean biodiesel, biodiesel with ethanol as additive, biodiesel with 2-butanol as additive and biodiesel with MTBE as additive. Figure 5 shows that the main products (CO, CO₂ and H₂O) are produced in very similar quantities to the ReaxFF simulations of Chen et al. Despite a quantitative difference between ANI-1xnr and ReaxFF IDTs (Extended Data Table 2), the additive effect on ignition delay for ANI-1xnr agrees qualitatively with ReaxFF, namely, all three additives cause product formation to occur at earlier times compared with clean biodiesel. Furthermore, ANI-1xnr predicts that 2-butanol and MTBE both result in the enhancement of O₂ consumption, similar to ReaxFF (Extended Data Table 2). The primary qualitative discrepancy with ReaxFF is that ANI-1xnr predicts that ethanol also enhances O₂ consumption. However, experimental work demonstrates that ethanol can actually accelerate fuel ignition at relatively high pressures, in agreement with our simulation results⁵⁷. Extended Data Fig. 3 provides further justification for the ANI-1xnr ethanol simulation results.

**Fig. 5: Biofuel additive simulation results for ANI-1xnr.**

Methane combustion

Emerging research has shown the success of application-specific MLIPs on systems such as radical reactions in hydrocarbon combustion and well-known gas-phase mechanisms^58,59. Zeng et al.²⁴ trained an NN-based potential to a dataset of QM-calculated fragment clusters sampled from a ReaxFF simulation of the combustion process of a mixture of CH₄ and O₂. They showed that their application-specific MLIP could then simulate the combustion process of methane with a reasonable mechanism. Though our ANI-1xnr potential was trained for a more general purpose, we compare the performance of our MLIP with the application-specific MLIP of Zeng et al. for methane combustion under high temperatures and pressures. Specifically, we reproduce their MD simulation of methane combustion under the same conditions with the ANI-1xnr potential. Figure 6a shows that the ANI-1xnr potential produces very similar major products and species profiles to those of Zeng et al. However, by comparison with the CH₄ and O₂ consumption rates of Zeng et al., ANI-1xnr predicts an overall reaction rate that is approximately 40 times faster. Specifically, while their system required 0.5 ns of simulation time to consume half of the initial CH₄, our system required only 0.012 ns. Similar to the biofuel case, the difference in the overall reaction rate is probably due to the difference in the reference DFT reaction energy barriers (Methods). Extended Data Figs. 4 and 5 provide further explanation as to the potential cause of this discrepancy.

Due to the extreme simulation conditions, no experimental reference data are available for comparison. However, the similar trend for species concentration with respect to time in comparison with the work of Zeng et al. indicates that our general-purpose MLIP was able to learn the relevant physics and mechanisms as well as the application-specific MLIP of Zeng et al. Also, the CH₄ and O₂ consumption curves for the ANI-1xnr model are much closer to exponential decay, which is more physically reasonable than the near-linear decay plots of Zeng et al.
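The rate comparison above can be checked with a few lines of arithmetic. The sketch below assumes simple first-order (exponential) decay of CH₄, which is an idealization suggested by, but not stated in, the text; the two half-consumption times are the values quoted above, and the function names are illustrative.

```python
import math


def first_order_rate_constant(t_half_ns: float) -> float:
    """Rate constant (ns^-1) of a first-order decay with the given half-life."""
    return math.log(2.0) / t_half_ns


def fraction_remaining(k_per_ns: float, t_ns: float) -> float:
    """Fraction of reactant left after time t_ns for first-order decay."""
    return math.exp(-k_per_ns * t_ns)


T_HALF_ZENG_NS = 0.5   # time to consume half of the CH4 (Zeng et al.), from the text
T_HALF_ANI_NS = 0.012  # same quantity for ANI-1xnr, from the text

k_zeng = first_order_rate_constant(T_HALF_ZENG_NS)
k_ani = first_order_rate_constant(T_HALF_ANI_NS)
speedup = k_ani / k_zeng  # identical to T_HALF_ZENG_NS / T_HALF_ANI_NS

print(f"k (Zeng et al.) = {k_zeng:.2f} ns^-1")
print(f"k (ANI-1xnr)    = {k_ani:.2f} ns^-1")
print(f"speed-up        = {speedup:.1f}x")
```

Under this assumption the ratio of rate constants equals the ratio of half-lives, 0.5/0.012, or roughly 42, consistent with the factor of approximately 40 quoted above.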

Miller experiment

In 1959, Stanley Miller designed a famous experiment to elucidate the origins of life on earth⁵¹. Miller applied an electric field to a gaseous system consisting of simple small-molecule species (for example, NH₃, CO, H₂O, H₂ and CH₄) and reported the formation of amino acids such as glycine (C₂H₅NO₂). This revolutionary experiment led to the formation of the field of prebiotic chemistry, which aims to discover the reaction networks that produce molecules that are essential for the formation of life. In this spirit, computational studies have attempted to imitate the reaction conditions of the Miller experiment to predict the key reaction pathways that lead to the formation of glycine. Recently, Saitta and Saija performed relatively short (≈40 ps) near-ambient temperature (400 K) condensed-phase (≈1 g cc⁻¹) DFT-MD simulations, wherein an electric field is applied directly to ‘spark’ chemical reactions⁵². As our MLIP does not contain the necessary electronic information to apply an electric field, we instead encourage reactions to occur on picosecond timescales by performing high-temperature high-density MD simulations, similar to the Miller NR simulation of Wang et al.⁴⁴ Due to the low computational cost of our MLIP, we are able to run our Miller experiment simulation considerably longer (≈4 ns) than the ab initio NR simulations of Wang et al. (≈1 ns) with the same system size of 228 atoms but with periodic boundary conditions. For this reason, we use a constant condensed-phase density (with corresponding pressures around 1 GPa) rather than applying an artificial piston to periodically compress the non-periodic gas-phase system to around 10 GPa, as was the approach employed by Wang et al.

Figure 7 shows the ANI-1xnr reaction mechanism to form glycine starting from the initial reactants. During our Miller simulation, glycine is formed three times and persists for approximately 225 fs, 375 fs and 913 fs. Dissociation of glycine in less than 1 ps is expected, considering the relatively high temperature of this simulation. The final step to form glycine is hydrogen addition to C₂H₄NO₂, similar to the mechanism of Saitta and Saija. However, hydrogen addition occurs at an oxygen atom in our mechanism, rather than at the α-carbon as in the Saitta and Saija mechanism. In one instance, our Miller simulation produced the same C₂H₄NO₂ isomer as reported by Saitta and Saija. In contrast to the Saitta and Saija mechanism, this C₂H₄NO₂ isomer dissociated in our simulation rather than forming glycine. The key precursor to C₂H₄NO₂ is CH₄N, which is formed through several pathways. The pathway to form CH₄N that proceeds through the CH₂O intermediate is very similar to the mechanism reported by Wang et al.⁴⁴ The mechanisms to form the intermediates formaldehyde (CH₂O) and hydrogen cyanide (CHN) from the initial reactants CO, NH₃ and H₂O were nearly identical to those reported by Wang et al.⁴⁴ and Saitta and Saija⁵². Overall, there are several similarities between our mechanism and those of Wang et al. and Saitta and Saija.

**Fig. 7: Miller experiment simulation results for ANI-1xnr.**

Conclusions

Here, we introduced a sampling procedure, dataset and MLIP (ANI-1xnr) based on the NR for organic condensed-phase MD, including reactions. The NR-based AL process builds a reactive dataset spanning elemental compositions of C, H, N and O under a wide range of conditions starting from nine small seed molecules. The NR–AL procedure provided data with unprecedented chemical environment diversity and relevance compared with prior non-reactive AL, and uncovered more than 1,000 unique molecules in total, under condensed-phase reactive atomistic configurations. Each unique molecular species formed by MD simulation in our NR sampler was the result of one or more reaction pathways that did not need to be known or specified before runtime.

We validated the generality of the ANI-1xnr potential on five real-world condensed-phase reactive case studies: carbon solid-phase nucleation, effect of oxygen on graphene ring formation from acetylene, ignition of biodiesel with various fuel additives, combustion of methane and the spontaneous formation of glycine in early earth conditions, all without retraining. In carbon solid-phase nucleation and graphene ring formation studies, we show that ANI-1xnr reproduces the experiment well. In other cases, in extreme simulation conditions where an experiment is not available for comparison, ANI-1xnr produces results that are generally consistent with traditional modelling approaches, such as DFT, DFTB, ReaxFF and even an application-specific MLIP. The effectiveness of the NR–AL approach demonstrates the power of coupling and automating the system exploration, data generation and model training processes to produce a robust MLIP.

Although the ANI-1xnr potential is already a broadly applicable tool for studying condensed-phase reactive chemistry, we envision continuous improvement of this MLIP. Future work could augment the condensed-phase ANI-1xnr dataset with low-density or in vacuo reactive data, for example, by sampling pathways for pre-determined reactions^37,43 or for reactions identified in the NR simulations⁴⁴. Future work could also extend the dataset to additional elements¹⁸. As the current dataset was computed using an affordable plane-wave DFT method, future work could also investigate the prospect for higher-accuracy QM methods (for example, double-hybrid DFT or post-Hartree–Fock) to obtain improved reaction barriers. In addition to simple retraining, any of these improvements could use more advanced ML training paradigms, such as transfer learning⁶⁰, meta-learning⁶¹ and lifelong learning⁶². Concerning the model form, the ANI-1xnr potential is fully local, meaning long-range effects, such as London dispersion and Coulombic interactions, are not described explicitly beyond the model cutoff radius (Methods). Certain applications may require more direct treatment of long-range effects. Future work could investigate incorporating recent developments, such as explicit long-range terms⁶³, charge equilibration schemes⁶⁴ or graph NN models^{5,6,13,14,15,16} that can implicitly account for long-range interactions. A recent advancement in ML for natural language processing is the concept of foundational models, that is, large, general models usually trained with unlabelled data that can be specialized to specific tasks quickly with very small amounts of data⁶⁵. As ANI-1xnr is trained to a large, general dataset, a clear future direction is to evaluate whether it can act as a foundational model for application-specific MLIPs when greater accuracy is required.

We are providing the ANI-1xnr dataset for future research. We are also providing the resulting ANI-1xnr potential to the community. We advise potential users to exercise strong caution if applying ANI-1xnr outside of the training domain (CHNO condensed-phase reactive chemistry). Nonetheless, considering that ANI-1xnr was developed independently of the five case study systems, the generality of ANI-1xnr is truly remarkable.

Methods

Model description and training details

The ANI-1xnr model was trained similarly to ANI models within other contexts¹⁸, including materials science³² and chemistry⁶⁶. We use the ANI descriptors⁶⁷, which are a modified form of the Behler and Parrinello NN descriptors¹². ANI-1xnr uses a local cutoff of 5.2 Å for the radial descriptors and 3.5 Å for the angular descriptors. The model is trained for the elements C, H, N and O, each of which has its own specialized NN-based potential. The NN architecture for each element and symmetry functions are reported as Supplementary Tables 2 and 3, respectively. Similar to previous ANI models, ANI-1xnr predicts energies based solely upon the atomic positions and element types. Therefore, unlike more complex MLIPs, for example, SpookyNet¹⁶ and AIMNet⁶⁸, the ANI-1xnr-predicted energies do not explicitly depend on the charge or spin multiplicity. Thus, similar to ReaxFF, ANI-1xnr predicts only the ground-state energy, regardless of whether the lowest-energy electronic state corresponds to a radical or an ion.
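To make the idea of a local radial descriptor concrete, here is a minimal sketch of a Behler–Parrinello-style radial symmetry function with a cosine cutoff, the general form on which the ANI radial terms are based. It is deliberately simplified: it ignores periodic images, does not separate neighbours by element and omits the angular terms. The values of `ETA` and `SHIFTS` and the three-atom geometry are hypothetical; only the 5.2 Å cutoff comes from the text (the actual parameters are in Supplementary Table 3).

```python
import math

RADIAL_CUTOFF = 5.2  # Å, radial cutoff quoted in the text


def cosine_cutoff(r: float, r_cut: float) -> float:
    """Decays smoothly from 1 at r = 0 to 0 at r = r_cut; zero beyond r_cut."""
    if r > r_cut:
        return 0.0
    return 0.5 * math.cos(math.pi * r / r_cut) + 0.5


def radial_descriptor(positions, i, eta, shifts, r_cut=RADIAL_CUTOFF):
    """Radial features of atom i: one cutoff-weighted Gaussian sum per shift."""
    features = [0.0] * len(shifts)
    for j, pos_j in enumerate(positions):
        if j == i:
            continue  # an atom is not its own neighbour
        r_ij = math.dist(positions[i], pos_j)
        weight = cosine_cutoff(r_ij, r_cut)
        for m, r_s in enumerate(shifts):
            features[m] += math.exp(-eta * (r_ij - r_s) ** 2) * weight
    return features


# Hypothetical parameters and geometry (Å), chosen only for illustration.
ETA = 16.0
SHIFTS = [0.9 + 0.5 * m for m in range(8)]
water_like = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]

print(radial_descriptor(water_like, 0, ETA, SHIFTS))
```

Because the features depend only on interatomic distances inside the cutoff, they are unchanged by translating or rotating the system, and atoms beyond the cutoff contribute nothing. This is the sense in which the model is "fully local".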

Training

The ANI-1xnr model was trained using both energy and force terms in the loss function as described in previous work⁶⁹. During training, we employ early stopping to prevent overfitting with learning rate annealing to ensure a high-fidelity fit. The model training is considered converged when the learning rate drops below 1.0 × 10⁻⁵. Model performance against the held-out test dataset is presented in Supplementary Tables 4 and 5 and Supplementary Figs. 5–8. Note that, similar to previous MLIP studies^27,28, we report the per-atom energy errors. This is because energy is an extensive property and the ANI-1xnr dataset consists of systems spanning almost two orders of magnitude in the number of atoms. Therefore, the unnormalized energy spans an enormous range and, thus, the corresponding unnormalized energy error is skewed by larger systems.
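The effect of per-atom normalization can be seen in a small numerical sketch. The data below are hypothetical: every structure is given the same error of 0.01 kcal mol⁻¹ per atom, while the system sizes span two orders of magnitude, mirroring the situation described above.

```python
import math


def rmse(values):
    """Root mean squared value of a list of errors."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def energy_errors(predicted, reference, n_atoms):
    """Return (total-energy RMSE, per-atom RMSE) over a set of structures."""
    raw = [p - r for p, r in zip(predicted, reference)]
    per_atom = [d / n for d, n in zip(raw, n_atoms)]
    return rmse(raw), rmse(per_atom)


# Hypothetical data: identical per-atom error in systems of very different size.
n_atoms = [10, 100, 1000]
reference = [-50.0, -500.0, -5000.0]                              # kcal/mol
predicted = [r + 0.01 * n for r, n in zip(reference, n_atoms)]    # kcal/mol

total_rmse, per_atom_rmse = energy_errors(predicted, reference, n_atoms)
print(f"unnormalized RMSE: {total_rmse:.3f} kcal/mol")
print(f"per-atom RMSE:     {per_atom_rmse:.3f} kcal/mol/atom")
```

Although the model is equally accurate per atom in all three structures, the unnormalized error is dominated by the 1,000-atom system, which is the skew the per-atom metric avoids.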

Although the ANI-1xnr root mean squared errors are approximately an order of magnitude larger than those of an MLIP trained to near-equilibrium data (for example, ANI-1x), this is not surprising since the ANI-1xnr dataset is considerably more challenging to train due to the wider range of atomic environments and system energies present in a reactive dataset. While errors between MLIPs trained on different datasets are not precisely comparable, we note that the ANI-1xnr mean absolute errors are of similar magnitude to those for TeaNet²⁷, which was also trained on a very challenging dataset aimed at developing a universal potential. Furthermore, the ANI-1xnr dataset is much more general than most reactive datasets that are limited to a single system of interest, for example, the Zeng et al. reactive dataset for CH₄ + O₂. By comparison, the force root mean squared errors of ANI-1xnr are only about 30% higher than those of the MLIP trained to the Zeng et al. single-reactive-system dataset, despite the ANI-1xnr dataset covering a substantially wider range of reactive chemistry. A validation that ANI-1xnr conserves energy is shown in Supplementary Fig. 9.

Training dataset generation

The ANI-1xnr training dataset was generated through an iterative AL process, where sampling of new atomic configurations is obtained with NR-inspired MD simulations. To bootstrap the AL process, periodic cells containing randomly placed and oriented small molecules with fewer than three non-H atoms and with a randomly selected composition of C, H, N and O are generated. Starting from this small initial bootstrap dataset, the AL algorithm is applied iteratively, yielding generations of datasets designed to improve upon their ancestors. Iterations of sampling, selecting, labelling and training are performed until the resulting MLIP is no longer improving. Training was described in the previous section. Details regarding sampling, selecting and labelling are provided in the following paragraphs.
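The iterative structure described above can be summarized as a short control loop. The sketch below is schematic rather than the authors' implementation: the four steps are passed in as callables, and the stopping rule (stop when nothing is selected, or after a maximum number of iterations) is an assumed stand-in for "no longer improving". The toy `train`, `sample`, `select` and `label` functions are hypothetical and exist only to make the loop runnable.

```python
def active_learning(dataset, train, sample, select, label, max_iterations):
    """Generic sample -> select -> label -> train loop."""
    model = train(dataset)
    for _ in range(max_iterations):
        candidates = sample(model)          # explore with the current model
        chosen = select(model, candidates)  # keep only poorly described cases
        if not chosen:                      # assumed stopping rule
            break
        dataset = dataset + [label(s) for s in chosen]  # reference calculations
        model = train(dataset)              # next-generation model
    return model, dataset


# --- Toy stand-ins (hypothetical): "structures" are numbers on a line. ---
def train(dataset):
    return [x for x, _ in dataset]          # "model" = inputs seen so far


def sample(model):
    return [0.5 * k for k in range(9)]      # fixed candidate grid 0.0 .. 4.0


def select(model, candidates, threshold=0.75):
    # "Uncertainty" = distance to the nearest labelled input.
    return [x for x in candidates if min(abs(x - m) for m in model) > threshold]


def label(x):
    return (x, x * x)                       # stand-in for a QM calculation


final_model, final_dataset = active_learning(
    [(0.0, 0.0)], train, sample, select, label, max_iterations=10
)
print(len(final_dataset), "labelled points")
```

Keeping the four steps as interchangeable functions reflects the point made in the main text: selecting, labelling and training reuse existing methodology, while the sampler is the component that had to be redesigned for reactive chemistry.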

Sampling

Atomic structures (that is, positions) are sampled by performing NR simulations with the current AL-generation MLIP. The MLIP-driven NR simulations are initialized with random compositions of small molecules, containing in the order of 100 atoms. Random oscillations of temperature and density (that is, simulation box volume) promote reactions and the formation of new products during the allotted simulation time (less than 1 ns).

Selecting

From all the atomic structures sampled along the NR trajectories, only high-uncertainty structures are selected for the ever-growing dataset, as these structures are deemed to be poorly described by the current AL-generation MLIP. Similar to previous ANI studies, we utilize a query-by-committee⁷⁰ uncertainty metric, that is, the normalized ensemble standard deviation in energy and atomic forces^32,60. To achieve a balance between exploration of chemical space and exploitation of the most important regions of the potential energy surface, the uncertainty thresholds vary between AL iterations, where the later AL iterations generally have larger thresholds than earlier iterations. The final uncertainty threshold values for the normalized energy and forces were 1.85 $\mathrm{kcal\;mol^{-1}\;N^{-1/2}}$ and 6.92 kcal mol⁻¹ Å⁻¹, respectively.
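
A query-by-committee check of this kind might look like the following sketch. The thresholds are the final values quoted above. Everything else is an assumption made for illustration: N is taken to be the number of atoms (as the units of the energy threshold suggest), the energy metric is the population standard deviation across the ensemble divided by √N, and the force metric is the largest per-atom norm of the component-wise ensemble standard deviation. The exact force normalization used for ANI-1xnr is not specified in this section and may differ.

```python
import numpy as np

# Final thresholds quoted in the text.
ENERGY_THRESHOLD = 1.85  # kcal mol^-1 N^-1/2
FORCE_THRESHOLD = 6.92   # kcal mol^-1 A^-1

def committee_uncertainty(energies, forces):
    """Return (energy, force) uncertainty for one structure.

    energies: shape (M,)      one predicted system energy per ensemble member
    forces:   shape (M, N, 3) predicted forces for M members and N atoms
    """
    energies = np.asarray(energies, dtype=float)
    forces = np.asarray(forces, dtype=float)
    n_atoms = forces.shape[1]
    # Energy: ensemble standard deviation normalized by sqrt(N).
    e_unc = energies.std() / np.sqrt(n_atoms)
    # Forces (assumed form): per-component std across members, then the
    # vector norm for each atom, then the worst atom in the structure.
    f_std = forces.std(axis=0)                    # shape (N, 3)
    f_unc = np.linalg.norm(f_std, axis=1).max()
    return float(e_unc), float(f_unc)

def is_high_uncertainty(energies, forces,
                        e_thr=ENERGY_THRESHOLD, f_thr=FORCE_THRESHOLD):
    """True if either metric exceeds its threshold."""
    e_unc, f_unc = committee_uncertainty(energies, forces)
    return e_unc > e_thr or f_unc > f_thr
```

A structure flagged by `is_high_uncertainty` would be a candidate for DFT labelling. Flagging on either metric rather than on both is also an assumption of this sketch.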

Labelling

Each selected structure is then labelled with QM system energy and atomic forces. These QM calculations are computed with the open-source CP2K software⁷¹ using unrestricted Kohn–Sham DFT⁷², Becke, Lee, Yang and Parr (BLYP) functional^73,74, triple-zeta valence basis set with two sets of polarisation functions (TZV2P)⁷⁵, Goedecker, Teter and Hutter (GTH) pseudopotentials⁷⁶, Grimme third-generation dispersion (D3) correction with zero damping⁷⁷ and energy cutoffs of 600 and 60 Ry, respectively, for the plane-wave and Gaussian contributions to the basis set, as recommended in previous work⁷⁸. Ensuring that each DFT calculation converges to the global-minimum energy is challenging for complex condensed-phase systems with a large number of molecules and partially broken bonds. Indeed, it is likely that the few large outliers observed in Supplementary Fig. 5 are DFT calculations that converged to a meta-stable electronic state. Fortunately, the fraction of these outliers is relatively low. Thus, these presumed meta-stable calculations do not meaningfully impede the MLIP from learning the dominant branch of convergence for the DFT calculations.

The overall spin multiplicity for each DFT calculation is constrained to a singlet state. Note that this is the spin for the entire box, not just a single molecule or radical. The assumption that a condensed-phase system does not accumulate an impactful amount of spin is effectively an infinite-system size approximation. This choice of spin multiplicity is consistent with previous studies that perform CP2K simulations of bulk systems containing radical species⁷⁹. However, a singlet spin multiplicity for low-density gas phase or in vacuo calculations would not always be appropriate (for example, for radicals or molecules with partially formed bonds). The use of a singlet spin multiplicity may explain, in part, why ANI-1xnr performs poorly for in vacuo bond-breaking calculations (Extended Data Fig. 4).

Below is a detailed step-by-step description of the AL workflow (for a high-level overview, see Fig. 1):

1.
Generate a bootstrap dataset (labelled with energies and forces) of 100 randomly generated periodic cells containing randomly placed and oriented small molecules including C₂, H₂, N₂, O₂, NH₃, CH₄, CO₂, H₂O and C₂H₂ with random composition
2.
Train an ensemble of ANI potentials to the current training dataset using an eightfold (16 blocks) cross-validation scheme (14/1/1: train/validation/test split)
3.
Prepare for NR–AL sampling: build a new random box of small molecules with random size, density, placements and orientations. Define a random schedule function for oscillating temperature (T) and density (ρ). The oscillating functional form is the same for temperature and density (see equations below), where t is time, $t_{\max}$ is a hyperparameter for the maximum time the simulation will run, and T_start, T_end, T_amp, ρ_start, ρ_end, ρ_amp and t_per are randomly selected values within a pre-determined range (Supplementary Table 6):
$$\begin{array}{rcl}T(t)&=&T_{\mathrm{start}}+\frac{t}{t_{\max}}\left(T_{\mathrm{end}}-T_{\mathrm{start}}\right)+T_{\mathrm{amp}}\sin^{2}\left(\frac{t}{t_{\mathrm{per}}}\right)\\ \rho(t)&=&\rho_{\mathrm{start}}+\frac{t}{t_{\max}}\left(\rho_{\mathrm{end}}-\rho_{\mathrm{start}}\right)+\rho_{\mathrm{amp}}\sin^{2}\left(\frac{t}{t_{\mathrm{per}}}\right)\end{array}$$
4.
Run the NR MD simulation using forces from current AL-generation MLIP
5.
Monitor energy and force uncertainty metrics every 5–50 MD steps (with an MD time step of 0.5 fs). If the uncertainty values exceed a pre-selected threshold value, end the simulation and add the configuration to a new batch of structures. Otherwise, continue running MD for a maximum of 1 ns
6.
Run DFT single-point calculations on the new batch of structures to obtain energy and force labels
7.
Add new labelled data to the training dataset
8.
Go back to step 2 and repeat until the MLIP converges. We define convergence as the point at which MLIP-driven MD sampling simulations run for on the order of 50 ps on average. In other words, convergence is achieved when the MLIP is confident in all new MD simulations
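
The oscillating schedule defined in step 3 can be sketched as a small Python closure. The functional form follows the equations given in that step. The numerical ranges below are placeholders chosen only for illustration, since the actual ranges are listed in Supplementary Table 6, and `make_schedule` is a hypothetical name.

```python
import math
import random

def make_schedule(start, end, amp, t_per, t_max):
    """Build x(t) = start + (t / t_max) * (end - start) + amp * sin^2(t / t_per)."""
    def schedule(t):
        linear_drift = (t / t_max) * (end - start)
        oscillation = amp * math.sin(t / t_per) ** 2
        return start + linear_drift + oscillation
    return schedule

# Placeholder ranges (NOT the values from Supplementary Table 6).
rng = random.Random(0)
t_max = 1000.0                      # ps, matching the 1 ns upper bound
t_per = rng.uniform(5.0, 50.0)      # ps, shared by both schedules
temperature = make_schedule(
    start=rng.uniform(300.0, 1000.0),   # K
    end=rng.uniform(1000.0, 3000.0),    # K
    amp=rng.uniform(0.0, 1000.0),       # K
    t_per=t_per,
    t_max=t_max,
)
density = make_schedule(
    start=rng.uniform(0.5, 1.0),        # g cc^-1
    end=rng.uniform(1.0, 2.0),          # g cc^-1
    amp=rng.uniform(0.0, 0.5),          # g cc^-1
    t_per=t_per,
    t_max=t_max,
)

for t in (0.0, 250.0, 500.0):
    print(f"t = {t:6.1f} ps  T = {temperature(t):7.1f} K  rho = {density(t):.3f} g/cc")
```

Because the oscillating term is a squared sine, it only ever adds to the linear drift, so each schedule stays between the drift line and the drift line plus its amplitude.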

Resulting training dataset

After more than 50 iterations of AL, the resulting training dataset includes 26,650 simulation cells of atomic positions with corresponding DFT system energy and atomic forces. Two-dimensional visualizations of the local atomic environments present in the dataset (Fig. 2a–d) are generated using t-distributed stochastic neighbour embeddings⁸⁰. Distributions of the system sizes, compositions and densities can be found in Supplementary Figs. 10–12, respectively. The average system size is 139 atoms. The vast majority (≈95%) of configurations in the training dataset have a density between 0.5 and 2.0 g cc⁻¹. While the minimum density in the dataset is ≈0.03 g cc⁻¹, less than 1% of the configurations in the dataset have a density less than 0.1 g cc⁻¹, suggesting that ANI-1xnr should not be trusted for low-density gas-phase simulations or in vacuo calculations.

By cross-referencing the ANI-1xnr training dataset with the existing PubChem database⁸¹ for only CHNO molecules with ten or fewer CNO atoms, we conclude that the ANI-1xnr dataset contains 1,212 unique known PubChem molecules, or approximately 0.2% of the ≈555k PubChem CHNO molecules with ten or fewer CNO atoms. Supplementary Fig. 13 shows a histogram of the sizes of all molecules that are found in the ANI-1xnr dataset, which includes one molecule up to 145 atoms. The majority are small molecules of similar size or slightly larger than those from which the systems were initialized. There are many occurrences of molecules in the range of 10–90 atoms. The largest structures, ascertained by visual inspection, are graphene sheets. Furthermore, the 1,212 unique PubChem molecules only represent the simulation frames that were selected by the uncertainty estimate. Therefore, 1,212 should be considered a lower bound of molecules discovered during AL. There are probably many more molecules formed over all NR–AL sampling, which is estimated to be hundreds of nanoseconds of MD simulation time in aggregate.

To automate the extraction of common molecular entities that formed during the AL process, we developed a NetworkX-based package called MolFind. This Python software tool employs user-prescribed cutoff distances for deciding whether two atoms are bonded and discovers clusters of atoms connected via bonds. The three-dimensional molecular architecture is partially captured through a graphical representation (that is, nodes and edges) of the bonding topology where atoms are nodes and bonds are edges. Graphs are encoded according to the open-source Python package called NetworkX⁸². The graphical representation and the NetworkX package enable (1) the counting of the number of topologically distinct molecular species in a simulation via a graph isomorphism check and (2) a comparison to known molecular entities with a specified topology. Previously, we tabulated a large database of known molecules and associated topologies by scraping the entirety of the PubChem database up to ten non-hydrogen atoms. The existence of a species in the database is not required for MolFind to extract a bonded atomic cluster but, if found, it can affix a chemical/species name to the entity.
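
The graph-based approach described above can be illustrated with a short NetworkX sketch. This is not the MolFind code. The cutoff values, function names and toy coordinates are hypothetical, and periodic boundary conditions are ignored for brevity.

```python
import itertools
import math

import networkx as nx
from networkx.algorithms.isomorphism import categorical_node_match

# Hypothetical bond cutoffs in angstrom, keyed by alphabetically sorted pair.
CUTOFFS = {("H", "H"): 0.9, ("C", "H"): 1.3, ("H", "O"): 1.2,
           ("C", "C"): 1.8, ("C", "O"): 1.7, ("O", "O"): 1.7}

def build_bond_graph(symbols, positions, cutoffs=CUTOFFS):
    """Atoms become nodes; atom pairs closer than their cutoff become edges."""
    graph = nx.Graph()
    for index, symbol in enumerate(symbols):
        graph.add_node(index, element=symbol)
    for i, j in itertools.combinations(range(len(symbols)), 2):
        cutoff = cutoffs.get(tuple(sorted((symbols[i], symbols[j]))))
        if cutoff is not None and math.dist(positions[i], positions[j]) < cutoff:
            graph.add_edge(i, j)
    return graph

def extract_molecules(graph):
    """Each connected component of the bond graph is one bonded cluster."""
    return [graph.subgraph(c).copy() for c in nx.connected_components(graph)]

def count_distinct_species(molecules):
    """Group clusters by element-aware graph isomorphism; return [graph, count] pairs."""
    same_element = categorical_node_match("element", None)
    species = []
    for molecule in molecules:
        for entry in species:
            if nx.is_isomorphic(entry[0], molecule, node_match=same_element):
                entry[1] += 1
                break
        else:
            species.append([molecule, 1])
    return species

# Toy frame: two water molecules and one H2 molecule (coordinates in angstrom).
symbols = ["O", "H", "H", "O", "H", "H", "H", "H"]
positions = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0),
             (5.0, 0.0, 0.0), (5.96, 0.0, 0.0), (4.76, 0.93, 0.0),
             (10.0, 0.0, 0.0), (10.74, 0.0, 0.0)]

molecules = extract_molecules(build_bond_graph(symbols, positions))
species = count_distinct_species(molecules)
print(len(molecules), "clusters,", len(species), "distinct topologies")
```

Matching an extracted cluster against a database of known topologies, such as the tabulated PubChem set mentioned above, could reuse the same isomorphism check with the database graphs as references.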

Simulation details

All MD simulations in this study are performed with the NeuroChem package⁶⁷ and the Atomic Simulation Environment⁸³. The average computational speed of our Atomic Simulation Environment–NeuroChem MD simulations was approximately 50k atomic gradients per second on a single NVIDIA Titan V graphics processing unit (GPU). We acknowledge that a more optimized code, such as Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS)⁸⁴, would be noticeably more computationally efficient. For example, recent studies have demonstrated that highly optimized MLIPs within LAMMPS can obtain 1–10 million atomic gradients per second on a single NVIDIA A100 GPU^15,85. However, because of the relatively small system sizes of our AL simulations (less than 500 atoms) and our case study simulations (no greater than 5,000 atoms), it was not necessary to fully optimize our computational performance by utilizing a code such as LAMMPS. To demonstrate that there is an opportunity to greatly improve our performance, we achieved 90k atomic gradients per second simply by increasing our simulation size to 25k atoms and, thereby, more efficiently utilizing the GPU.
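
To put these throughput figures in perspective, the following back-of-the-envelope sketch converts atomic gradients per second into simulated time per day. It assumes that one MD step costs one gradient evaluation per atom and that the 0.5 fs time step quoted for AL sampling also applies. Both are simplifying assumptions, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400

def md_steps_per_second(atom_gradients_per_second, n_atoms):
    """MD steps per second, assuming one gradient per atom per step."""
    return atom_gradients_per_second / n_atoms

def ns_per_day(atom_gradients_per_second, n_atoms, timestep_fs):
    """Simulated nanoseconds per wall-clock day."""
    steps_per_day = md_steps_per_second(atom_gradients_per_second, n_atoms) * SECONDS_PER_DAY
    return steps_per_day * timestep_fs * 1e-6  # convert fs to ns

# 50k atomic gradients per second on a 5,000-atom case study system.
print(ns_per_day(50_000, 5_000, 0.5))
# 90k atomic gradients per second on a 25k-atom system.
print(ns_per_day(90_000, 25_000, 0.5))
```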