Background & Summary

Introduction

The development of data-driven molecular force fields is predominantly benchmarked against the MD17 dataset introduced by Chmiela et al.1 and its extension, the rMD17 dataset2. These datasets consist of dynamics data for ten small to medium-sized gas-phase molecules. In molecular dynamics, data are intrinsically time series, so they must be sampled carefully to prevent unintended leakage of information from future states. A detailed analysis of MD17 and its variants reveals a significant sampling bias towards a narrow potential energy surface (PES) region close to the equilibrium structure. This narrow exploration of the PES leads to limited sampling of conformation and energy space, as our internal coordinate analysis shows. Thus, these datasets are suboptimal in terms of both their segmentation strategy and the molecular conformation space they cover.

For our discussion, we refer to these conventional molecular dynamics datasets as in-distribution datasets. Yet, many chemical processes of interest occur out-of-distribution. Consider a basic chemical reaction depicted in Fig. 1: the nuclear configuration space includes reactants, transition states, and products. Sampling exclusively from the reactant region fails to capture the full dynamics of chemical reactions. As a result, neural force field (NFF) models trained on such skewed datasets are biased towards reactant configurations, potentially leading to qualitatively inaccurate predictions for a complete chemical reaction.

Fig. 1

Trajectories on a representative potential energy surface. The contour plot represents the energy landscape, with the color gradient indicating various energy levels. Trajectories are usually confined to regions near the minima, reflecting the system’s preference for low-energy states close to or at equilibrium.

To overcome these challenges, we introduce the extended excited-state molecular dynamics (xxMD) dataset in this work. xxMD retains the core objective of capturing trajectory data for small to medium-sized gas-phase molecules but distinguishes itself by incorporating nonadiabatic trajectories, which include the dynamics of excited electronic states. Comprising four photochemically active molecules, xxMD begins with significantly higher initial energies, enabling it to traverse a more extensive nuclear configuration space and to represent the entire chemical reaction PES more faithfully: reactants, transition states, and products. Notably, xxMD captures regions near conical intersections, which are critical to transitions between the potential energy surfaces of different electronic states3,4,5,6,7,8. By including these key regions, the xxMD dataset aims to establish new benchmarks and challenges for NFF models, providing a more comprehensive and chemically accurate dataset for the development of predictive models.

We note that xxMD is not the first attempt to go beyond the (r)MD17 datasets. For example, the recently developed WS22 database9 includes nuclear configurations from multiple minima and interpolates among them. Although WS22 goes beyond (r)MD17, the xxMD datasets developed in the current work involve much more complex configurations, for example, regions that correspond to conical intersections and locally avoided crossings.

Existing datasets: MD17 and its variant

Chmiela et al. performed adiabatic ab initio molecular dynamics (AIMD) simulations on small gas-phase molecules at room temperature, with the electronic potential energies computed at the Kohn-Sham density functional theory (KS-DFT) level1. However, the original publication did not specify the density functional, basis set, spin polarization, integration grid, or software used. This lack of transparency presents a challenge for reproducibility and may limit the utility of the dataset for certain types of chemical simulation. To address this, Christensen et al. revisited the geometries of the MD17 dataset, recalculating them using the PBE density functional with the def2-SVP basis set and enhanced grid precision2. This effort led to the creation of the rMD17 dataset, which has since been widely adopted in NFF studies10,11. Nonetheless, it is crucial to note the limitations of the PBE functional and def2-SVP basis set for simulating chemical reactions accurately. While these computational tools can produce a continuous PES that varies with nuclear configuration, their ability to yield accurate results for chemically complex reactions, especially those involving bond breaking and formation, is often questioned. Despite these concerns, MD17 and its refined counterpart, rMD17, are still considered to be well-behaved datasets for benchmarking purposes within certain constraints.

Adiabatic molecular dynamics datasets generated in a low energy range are inherently limited in their sampling diversity and may not benefit fully from techniques such as uniform sampling and cross-validation. This is particularly true for adiabatic AIMD simulations, where low-energy initial conditions substantially constrain the accessible nuclear configuration space. This limitation results in trajectories that predominantly occupy the reactant region of the PES, as depicted in Fig. 1.

To evaluate the breadth of configurations in the MD17 and rMD17 datasets, we conducted an analysis focused on internal coordinate distributions for azobenzene (the C-N=N-C dihedral angle and the N=N bond length) and malonaldehyde (the C-C-C=O dihedral angle and the C=O bond length). These distributions, along with the corresponding relative electronic potential energies and force norms, are illustrated in Fig. 2. The visual representation confirms that the internal coordinate distributions are notably narrow. Consequently, we observe a significant overlap between the training and testing samples within these datasets. Such overlap raises concerns about potential data leakage, which could inadvertently lead to overly optimistic results in benchmarking studies, as discussed in the literature10,11,12,13,14. The findings underscore the need for datasets that encompass a more diverse and extensive sampling of the PES to ensure robust and reliable benchmarks for NFF models.
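The internal coordinates used in this kind of analysis can be computed directly from Cartesian positions. The sketch below is a minimal illustration and not the analysis code behind Fig. 2; the function names `bond_length` and `dihedral_angle` and the example coordinates are hypothetical, and the sign of the dihedral depends on the atom ordering chosen.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def bond_length(p_i, p_j):
    """Distance between two atoms, in the units of the input coordinates."""
    d = _sub(p_j, p_i)
    return math.sqrt(_dot(d, d))

def dihedral_angle(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) for the atom sequence p0-p1-p2-p3.

    For azobenzene one would pass the C, N, N, C positions; for
    malonaldehyde the C, C, C, O positions.
    """
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)   # central bond, used as the rotation axis
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    axis = tuple(x / norm for x in b1)
    # Project the outer bonds onto the plane perpendicular to the axis.
    v = _sub(b0, tuple(_dot(b0, axis) * x for x in axis))
    w = _sub(b2, tuple(_dot(b2, axis) * x for x in axis))
    return math.degrees(math.atan2(_dot(_cross(axis, v), w), _dot(v, w)))

# Hypothetical planar four-atom arrangements (arbitrary units).
cis = dihedral_angle((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.25, 0.0, 0.0), (1.25, 1.0, 0.0))
trans = dihedral_angle((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.25, 0.0, 0.0), (1.25, -1.0, 0.0))
central = bond_length((0.0, 0.0, 0.0), (1.25, 0.0, 0.0))
```

Applying both functions to every frame of a trajectory gives the two horizontal axes of a plot like Fig. 2, which makes the narrowness (or breadth) of the sampled region directly visible.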

Fig. 2

Illustration of training and testing sets using the reference split indices for azobenzene and malonaldehyde datasets in rMD17. The X-axis depicts dihedral angles (marked by ‘C’, ‘N’, and ‘O’), the Y-axis denotes bond distances (highlighted by bold letters), and the Z-axis shows relative energy. Training and testing samples are differentiated by color, correlating to force norms. Note that training samples overlap with testing ones.

Dataset requirement

In classical MD and adiabatic AIMD simulations, chemical reactions are characterized by the system’s transition across different minima on the PES. These transitions correspond to changes in electronic potential energy as the system moves through various nuclear configurations. Systems naturally tend to follow the path of least resistance, referred to as the reaction pathway. To develop accurate NFFs, two fundamental elements are required: a comprehensive quantum chemical dataset that captures the full range of molecular transformations from various regions, and an advanced machine learning model with the capacity to interpolate and extrapolate across the PES. Fig. 1 illustrates typical low energy adiabatic AIMD trajectories on a PES. It’s evident that these low energy adiabatic AIMD trajectories tend to be localized around the ground state minima.

In contrast, datasets derived from nonadiabatic dynamics simulations are particularly valuable as they provide a more diverse array of nuclear configurations, going beyond the limitations of low energy adiabatic AIMD. These enriched datasets allow for the exploration of PES regions that are critical for understanding complex chemical processes, which are often not adequately represented in low energy adiabatic simulations.

Summary

In summary, the xxMD dataset developed in the current work includes four molecular systems: azobenzene, malonaldehyde, stilbene, and dithiophene, with crucial geometries along their reaction pathways illustrated in Figure S3. Notably, azobenzene and malonaldehyde are also part of the MD17 and rMD17 datasets, allowing for direct comparison.

The geometries in the xxMD dataset are sampled from nonadiabatic dynamics. The potential energies and gradients, i.e. forces, for the first three singlet electronic states at the state-averaged complete active space self-consistent field (SA-CASSCF) level of theory15 are included in the xxMD-CASSCF dataset. In addition, spin-polarized KS-DFT calculations with the M06 functional16 are performed on the same geometries as in the xxMD-CASSCF dataset, and the resulting ground singlet electronic state potential energies and gradients are included in the xxMD-DFT dataset. Therefore, the xxMD datasets developed in the current work comprise a multi-state dataset (xxMD-CASSCF) and a single-state dataset (xxMD-DFT).

Method

For our xxMD dataset, we employ the trajectory surface hopping (TSH) semiclassical nonadiabatic dynamics algorithm3,4,17 with SA-CASSCF electronic structure theory15. SA-CASSCF is a multireference electronic structure theory that provides a qualitatively correct description of strong correlation, which is critical for deformed geometries and conical intersections, whereas linear-response time-dependent Kohn-Sham density functional approximations fail qualitatively in these regions18,19. We ensured that only energy-conserving trajectories were sampled. The size of the data samples is detailed in Table S6 in the supplementary material.

Nevertheless, to ensure compatibility with prevalent datasets like MD17, we also computed single-point spin-polarized KS-DFT (also called unrestricted KS-DFT) values. These calculations employ the M0616 exchange-correlation functional, a meta-GGA functional that is notably superior to PBE for chemical reactions. This dual approach culminates in two datasets: xxMD-CASSCF and xxMD-DFT. The former captures potential energies and forces across the first three electronic states for azobenzene, dithiophene, malonaldehyde, and stilbene. The latter provides recomputed ground-state energy and force values on the same trajectories. All computational details are described in Section G (Computational details) of the Supplementary Information. Note that SA-CASSCF PESs can be more complicated than DFT surfaces because the SA-CASSCF algorithm involves additional choices, such as the active space. Both xxMD datasets are structured via a temporal split method, partitioning training and testing data based on trajectory timesteps. We want to emphasize that the xxMD datasets do not include nonadiabatic coupling vectors (NACs) for two reasons. First, advances in the field of nonadiabatic dynamics have enabled NAC-free nonadiabatic dynamics simulations, for example, curvature-driven dynamics20,21,22,23,24. Second, the purpose of the current work is to provide a database that covers a wide nuclear configuration space for which the energies and gradients of multiple electronic states are available. Therefore, machine learning force field models can be tested against each surface. We note that an appropriate fit of coupled PESs with multiple electronic states for a single system requires a diabatic representation, which is beyond the scope of the current work25,26,27.

We evaluated six message-passing NFF models on the xxMD datasets: SchNet28, DimeNet++ (DPP)29, SphereNet (SPN)14, NequIP10, Allegro30, and MACE11. Each model was used mostly with its default parameters, and in line with convention, we trained the NFFs with more weight on the force loss than on the energy loss. While hyperparameter optimization could potentially improve performance (see the Supplementary Information for an example), it remains outside the scope of this study. Therefore, the presented results might not showcase the absolute best performance for each model. Given our observations, we encourage researchers aiming to apply NFFs in practical scenarios to conduct rigorous re-benchmarks tailored to their specific chemical systems and objectives.

Temporal splitting was chosen over random splitting to partition the xxMD datasets. This method involves dividing time-series data based on timesteps, reserving a specific range for testing and applying a 50:25:25 split for training, validation, and testing sets. Such a split allows for a rigorous assessment of a model’s ability to predict unexplored areas of the PES. This is highlighted in Fig. 3, where deviations in trajectories over time emphasize the datasets’ capability to challenge and evaluate the extrapolative power of NFFs. However, it is possible to use random splitting on xxMD datasets considering the wide coverage of conformation space.
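The difference between the two splitting strategies can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the train, validation, test ordering along the time axis is an assumption consistent with the 50:25:25 ratio quoted above, and whether the split is applied per trajectory or to pooled frames is not specified here. The released archives are already pre-split, so this is not needed to use the data.

```python
import random

def temporal_split(n_frames, fractions=(0.50, 0.25, 0.25)):
    """Split frame indices by time: earliest frames train, latest frames test."""
    n_train = int(n_frames * fractions[0])
    n_val = int(n_frames * fractions[1])
    train = list(range(0, n_train))
    val = list(range(n_train, n_train + n_val))
    test = list(range(n_train + n_val, n_frames))  # remainder goes to test
    return train, val, test

def random_split(n_frames, fractions=(0.50, 0.25, 0.25), seed=0):
    """Split frame indices at random, ignoring their order in time."""
    indices = list(range(n_frames))
    random.Random(seed).shuffle(indices)
    n_train = int(n_frames * fractions[0])
    n_val = int(n_frames * fractions[1])
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])

train, val, test = temporal_split(100)
r_train, r_val, r_test = random_split(100)
```

With the temporal split every test frame lies later in time than every training frame, so the test error measures extrapolation along the trajectory. With the random split, test frames are interleaved with training frames and are typically close neighbours of them.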

Fig. 3

Comparison of Average RDFs and MSDs Across Multiple Trajectories. Each row corresponds to a group of trajectories, with RDF on the left (indicating particle density as a function of distance) and MSD on the right (showing particle displacement over time). Shaded regions represent standard deviations.

Data Records

The xxMD-CASSCF and xxMD-DFT datasets have been made publicly available on GitHub at the following URL: https://github.com/zpengmei/xxMD; and on Zenodo31 at the following URL: https://doi.org/10.5281/zenodo.10393859. These datasets are stored in compressed archives, each containing extended XYZ format files that are pre-split based on temporal information. The files have been processed using the Atomic Simulation Environment (ASE) software package32. The GitHub repository is structured into two main directories, each corresponding to one of the datasets: xxMD-CASSCF and xxMD-DFT.

Within each directory, data is further organized into subdirectories named after the four molecules studied: malonaldehyde, azobenzene, stilbene, and dithiophene. Each molecule’s subdirectory contains the associated dataset files. Notably, the xxMD-CASSCF dataset includes an additional subdirectory structure that segregates the state-specific data for the first three electronic states.

Technical Validation

Dynamic properties

Through the ensemble-averaged radial distribution function (RDF) and mean square displacement (MSD), the xxMD datasets exhibit a comprehensive sampling of the nuclear configuration space, surpassing that observed in MD17. Illustrated in Fig. 3, the RDF and MSD track nuclear configurations over time, offering insights into the spatial distribution and mobility of particles, respectively. The RDF measures the likelihood of particle presence at varying radial distances from a reference point, whereas the MSD quantifies the average squared distance that molecules travel over a time interval.
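For concreteness, the MSD of a single trajectory relative to its first frame can be written as follows. This is a minimal sketch under stated assumptions: the function name and input layout are hypothetical, and the ensemble averaging over trajectories used for Fig. 3, as well as any removal of overall translation or rotation, is not reproduced here.

```python
def mean_square_displacement(trajectory):
    """MSD per frame, relative to the first frame.

    trajectory: list of frames; each frame is a list of (x, y, z) tuples,
    one per atom, with the same atom ordering in every frame.
    Returns a list with one value per frame: the squared displacement
    from frame 0, averaged over atoms.
    """
    reference = trajectory[0]
    n_atoms = len(reference)
    msd = []
    for frame in trajectory:
        total = 0.0
        for (x, y, z), (x0, y0, z0) in zip(frame, reference):
            total += (x - x0) ** 2 + (y - y0) ** 2 + (z - z0) ** 2
        msd.append(total / n_atoms)
    return msd

# Hypothetical two-atom, three-frame trajectory (arbitrary length units).
toy_trajectory = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 2.0, 0.0), (1.0, 0.0, 2.0)],
]
toy_msd = mean_square_displacement(toy_trajectory)
```

A trajectory that stays near one minimum gives an MSD that plateaus at a small value, whereas a trajectory that crosses into a different region of the PES shows a sustained rise.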

The pronounced shifts in nuclear configurations captured by nonadiabatic dynamics in the xxMD datasets, as reflected in the dynamic breadth of the RDF and MSD, underline the enhanced diversity of PES regions sampled. Consequently, the complexity of mastering the PESs for molecules in the xxMD dataset is expected to be significantly elevated, presenting a robust challenge for the accuracy of NFFs.

Benchmarks on xxMD-CASSCF and xxMD-DFT datasets

We selected six representative NFFs to benchmark. The hyperparameters and training details of the models are described in the Supplementary Information. We weighted the energy and force losses in a 1:1000 ratio. We stress that our purpose is not to perform an extensive comparison of models over multiple choices of hyperparameters. Rather, we limit ourselves to showing the performance of the models in their default configurations.
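A 1:1000 weighting of energy and force terms can be illustrated as below. This is a sketch, not the loss implemented by any of the benchmarked packages: the function name is hypothetical, the use of mean squared error for both terms is an assumption, and individual packages may normalize energies per atom or use a different error measure.

```python
def weighted_energy_force_loss(e_pred, e_ref, f_pred, f_ref,
                               w_energy=1.0, w_force=1000.0):
    """Weighted sum of energy and force mean squared errors.

    e_pred, e_ref: lists of energies, one per structure.
    f_pred, f_ref: lists of flat force-component lists, one per structure.
    """
    energy_mse = sum((p - r) ** 2 for p, r in zip(e_pred, e_ref)) / len(e_ref)
    squared = [(p - r) ** 2
               for fp, fr in zip(f_pred, f_ref)
               for p, r in zip(fp, fr)]
    force_mse = sum(squared) / len(squared)
    return w_energy * energy_mse + w_force * force_mse

# Hypothetical predictions for two structures with one atom each.
loss = weighted_energy_force_loss(
    e_pred=[1.0, 2.0], e_ref=[1.0, 4.0],
    f_pred=[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    f_ref=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
)
```

The large force weight reflects the fact that each structure supplies one energy label but three force components per atom, and that accurate forces are what a dynamics simulation consumes.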

We first evaluate the regression precision of all models on the first three electronic states, labeled S0, S1, and S2 (the label S denotes the singlet spin state, a widely used notation in quantum chemistry), using the temporal splitting approach for the data in the xxMD-CASSCF dataset. The mean absolute errors (MAEs) of the predicted energies and forces for the test sets are shown in Table 1. Similarly, we present the corresponding results for the xxMD-DFT datasets in Table 2. The best performance in each row is shown in bold. Additional results on the validation sets are available in the Supplementary Information. Note that, because of the temporal splitting, the validation sets contain nuclear configurations that are closer to those in the training sets. Therefore, the MAEs on the validation sets are in general lower than those on the test sets.

Table 1 Comparison of predictive MAE of energy (E, meV) and forces (F, meV/Å) on the hold-out testing sets for different models on the temporally split xxMD-CASSCF datasets and tasks.
Table 2 Comparison of predictive MAE of energy (E, meV) and forces (F, meV/Å) on the hold-out testing sets for different models on the temporally split xxMD-DFT datasets and tasks.

Comparison with existing datasets

In this section, we analyze model behavior for two molecules, namely azobenzene and malonaldehyde. These two molecules are both available in xxMD and (r)MD17 datasets. Benchmarks for (r)MD17 reveal that the accuracy of MACE, NequIP, and SPN exceeds that of traditional electronic structure methods10,11,14,33. It’s essential to note that typical errors for KS-DFT in predicting relative transition state energy can be several kcal/mol. For instance, the MAEs of HTBH38 (Hydrogen transfer barrier heights) and NHTBH38 (non-Hydrogen transfer barrier heights) databases are about 9.1 kcal/mol for PBE and 2.4 kcal/mol for M06. Thus, an NFF fitting error below 50 meV would surpass the accuracy of modern density functional calculations. However, such claims are pertinent mainly to ground state potential energies, given that excited state calculations are often less precise. Therefore, given the reported MAEs, these NFF models perform admirably on (r)MD17 datasets.
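The comparison between barrier-height errors in kcal/mol and fitting errors in meV rests on a unit conversion, sketched below. The constants are standard (1 kcal/mol = 4.184 kJ/mol and 1 eV = 96.485332 kJ/mol); the function names are hypothetical.

```python
KJ_PER_MOL_PER_KCAL_PER_MOL = 4.184
KJ_PER_MOL_PER_EV = 96.485332

def kcal_per_mol_to_mev(value):
    """Convert an energy from kcal/mol to meV."""
    return value * KJ_PER_MOL_PER_KCAL_PER_MOL / KJ_PER_MOL_PER_EV * 1000.0

def mev_to_kcal_per_mol(value):
    """Convert an energy from meV to kcal/mol."""
    return value / 1000.0 * KJ_PER_MOL_PER_EV / KJ_PER_MOL_PER_KCAL_PER_MOL

pbe_error_mev = kcal_per_mol_to_mev(9.1)   # barrier-height MAE quoted for PBE
m06_error_mev = kcal_per_mol_to_mev(2.4)   # barrier-height MAE quoted for M06
threshold_kcal = mev_to_kcal_per_mol(50.0) # the 50 meV fitting-error threshold
```

With these factors, 9.1 kcal/mol is about 395 meV, 2.4 kcal/mol is about 104 meV, and 50 meV is about 1.15 kcal/mol, so a 50 meV fitting error is roughly half of the M06 barrier-height error.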

However, this conclusion might be deceiving. Previous discussions highlight the constrained nuclear configuration space in MD17 and rMD17. A comparative analysis of MAEs for the six NFF models on azobenzene and malonaldehyde from xxMD-DFT and (r)MD17 is presented in Table 3. Literature-derived MD17/rMD17 results indicate that all models used 1,000 training samples10,11,14. Predictably, the predictive prowess of NFF models diminishes when applied to the xxMD dataset.

Table 3 Comparison of predictive MAE on hold-out testing sets of NFF models on azobenzene and malonaldehyde in (r)MD17 and xxMD-DFT datasets. (r)MD17 benchmarks with 1,000 samples are taken from11,14,28.

The differences in MAE for the same NFF model between rMD17 and xxMD arise from two sources: the differences between the datasets and the differences in splitting method. The xxMD datasets contain much more complex nuclear configurations than (r)MD17. For the splitting method, one can use either random splitting or temporal splitting. For certain purposes, for example, if one uses the trajectory data to construct a global PES for the system, random splitting would be a good approach. For the purpose of extending trajectory simulations beyond existing trajectory data, temporal splitting may be favored, because the ultimate goal is to look for unknown chemical events that may not be observed in short trajectory simulations. In that spirit, we use temporal splitting in the current work. For the purpose of extended trajectory simulation, random splitting, which has been used in tests on the (r)MD17 datasets, entails a severe leakage of future information. In practice, if we would like to model a chemical reaction, it would be impractical to manually sample every relevant region of the potential energy surfaces. Therefore, it is desirable for an NFF model to be capable of physical extrapolation to some extent. Physical extrapolation has been achieved in several models, for example, reactive force fields34 and the use of a parametrically managed activation function35.

The effectiveness of NFF models largely depends on the datasets they are benchmarked against. Historically, the (r)MD17 datasets have been the gold standard for this purpose. However, our study highlights the potential shortcomings of relying solely on (r)MD17 datasets. Given that they primarily capture a narrow nuclear configuration space from low-energy ground-state AIMD, they fall short of encompassing the full nuclear configuration space pertinent to chemical reactions. Training NFF models on such datasets can be somewhat trivial and could result in misleading conclusions about their true capabilities. For instance, computational chemists have a long history of using system-specific force fields, which can be easily developed by computing a Hessian at the ground-state equilibrium geometry36,37.

To address this gap, we introduced the xxMD dataset, derived from nonadiabatic dynamics trajectories. The xxMD dataset offers a comprehensive representation of the nuclear configuration space, encapsulating the reactant, transition state, product, and conical intersection regions of PESs. Its inclusion of several low-lying excited state potential energy surfaces underscores its importance and the challenges it presents for NFF model development. Our benchmarks of prevailing NFF models on the xxMD dataset have revealed pronounced difficulties. Utilizing default hyperparameters, the chosen NFF models struggled to offer quantitatively or even qualitatively accurate force field models for specific systems. We anticipate that our findings will galvanize the community towards pioneering more advanced NFF models better equipped to study intricate chemical reactions.

Code availability

Nonadiabatic dynamics are performed with the Surface Hopping including Arbitrary Couplings (SHARC) code, which is available at https://github.com/sharc-md/sharc. SchNet, DimeNet++ and SphereNet are available as implemented in the Dive Into Graphs package (https://github.com/divelab/DIG.git). The NequIP package is available at https://github.com/mir-group/nequip.git. The Allegro package is available at https://github.com/mir-group/allegro. The MACE package is available at https://github.com/ACEsuit/mace.git. All packages are up to date as of the date of publication. All training is done in single-precision floating-point format. The SchNet, DPP and SPN models are initialized using the default hyperparameters shipped with the packages. Allegro hyperparameters can be found at https://github.com/mir-group/allegro/blob/main/configs/example.yaml, NequIP hyperparameters are available at https://github.com/mir-group/nequip/blob/main/configs/example.yaml, and MACE hyperparameters are available at https://github.com/ACEsuit/mace. Since the Dive Into Graphs package does not implement scaling and shifting of the energy, we manually rescaled the energy by subtracting the energy of the configuration with the lowest potential energy.
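The manual energy shift described above amounts to referencing every energy to the lowest one in the set. A minimal sketch follows; the function name and the example energies are hypothetical, and whether the minimum is taken per molecule, per electronic state, or over the training set only is an assumption left to the user.

```python
def shift_to_lowest_energy(energies):
    """Return energies relative to the lowest value in the list.

    Forces are unaffected, because subtracting a constant from the
    energy does not change its gradient.
    """
    lowest = min(energies)
    return [energy - lowest for energy in energies]

# Hypothetical total energies in eV for three configurations.
relative = shift_to_lowest_energy([-5.0, -3.5, -4.0])
```

Shifting in this way keeps the regression targets small in magnitude, which matters when training in single precision with models that do not apply their own energy normalization.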