A benchmark dataset for Hydrogen Combustion

The generation of reference data for deep learning models is challenging for reactive systems, and more so for combustion reactions due to the extreme conditions that create radical species and alternative spin states during the combustion process. Here, we extend intrinsic reaction coordinate (IRC) calculations with ab initio MD simulations and normal mode displacement calculations to more extensively cover the potential energy surface for 19 reaction channels for hydrogen combustion. A total of ∼290,000 potential energies and ∼1,270,000 nuclear force vectors are evaluated with a high quality range-separated hybrid density functional, ωB97X-V, to construct the reference data set, including transition state ensembles, for the deep learning models to study hydrogen combustion reaction.


Background & Summary
The expectation behind training deep learning models to predict molecular energies and atomic forces of molecules is the requirement of large data sets. However, very recently it has become recognized that deep learning methods that are designed with rotationally equivariant operators offer a significant reduction in data needed for training relative to invariant ML models [1][2][3][4] , and often outcompete even kernal methods that have traditionally been considered advantageous due to their low data requirements 5 . However, the promise in regards equivariant deep learning models must be further validated by construction of more challenging data sets than encountered up until now. For example, the recent SN2 data set provides reference energy and forces for more than 450,000 structures calculated using Density Functional Theory (DFT), but ultimately is data on highly similar individual reactions of methyl halides with one of four substituted halogens, F, Cl, Br, and I 6 .
Capturing the energy release in hydrogen combustion is a proposed energy solution for zero CO 2 emissions, and many of the elementary reactions of H 2 combustion are also present in other types of fuel generation 7 . Under realistic reaction conditions of very high temperature and high pressure make it extremely difficult to study H 2 combustion reactions experimentally. Because hydrogen combustion is difficult to study experimentally under these extremes 8 , theoretical models must play an active role in filling the breach, but fundamentally relies on an accurate potential energy model of not only the elementary reactions 9 but the excursions away from the reaction coordinate.
Hydrogen combustion, despite being the simplest combustion system, is nonetheless still quite chemically complicated because it can encounter one or more 19 reaction channels during the combustion event depending on the physical conditions of high temperatures and pressures 8 . This compounds the need for high quality data that is expensive to generate given the need for extensive sampling and the presence of metastable points such as transition states. For non-reacting chemical systems, conventional MD simulations are well-suited for generating a large number of configurations, which are then used as input into single point quantum-chemical energy and force calculations [10][11][12] . However, for reactive systems, conventional force-field based MD simulations are not useful as they don't allow breaking and forming of chemical bonds. Recent work has attempted to address this deficiency through graph-based methods that generate reference data for reactive systems 13,14 , but they are also prone to produce large numbers of specious chemical states and unrealistic intermediates such as highly unstable radicals. Therefore fully ab initio sampling methods are a necessity for creation of the many molecular fragments involved in combustion chemistry, including the presence of stable and unstable intermediates, high energy transition states, and a variety of product molecules that can be formed during the reaction that is dependent on the reactive channel 8,9,[15][16][17][18] .
Our goal here is to characterize the potential energy surface (PES) of hydrogen combustion through the reaction channels proposed by Li et al. 19 using a systematic approach in ab initio data generation that samples off the intrinsic reaction coordinate (IRC). This study provides a data set of ∼290,000 potential energies and ∼1,270,000 nuclear force vectors for structures that are sampled near and far from the IRC for 19 hydrogen combustion sub-reactions, some of which are barrierless transitions, others are dominated by large activation barriers, and even reactions involving changes in spin state 19 . This data set offers a new ML benchmark set that allows systematic investigation of data reduction when using emerging equivariant deep learning model, as well as being of interest in its own right as a source of data for machine learning of energy and forces that drive an MD engine for combustion under extreme thermodynamic conditions.

Methods
We have used fully ab initio methods for sampling 19 reactive channels for hydrogen combustion as summarized in Table 1. For each reaction we used the ωB97X-V DFT functional 20 with the cc-pVTZ basis set. All calculations were performed as unrestricted open shell, using an ultrafine integration grid of 99 radial points and 590 angular points, with an SCF convergence of − 10 8 using the GDM method 21 . All potential energies for each configuration of the 19 reactions are reported as ΔE total i atom using the atomic energies E H = −0.5004966690 a.u. and E o = −75.0637742413 a.u., and with ΔE converted to units of kcal/mole. All calculations were performed using the Q-Chem program 22,23 .
We have organized the PES data into four categories that classify the reaction mechanism involved in the elementary steps for each reactive channel: association/dissociation reactions (channels 5-9 and 15), substitution reactions (channel 16), oxygen transfer (channels 1, 11, and 12), and hydrogen transfer (channels 2-4, 10, 13, 14, 17-19). We have kept the same numbering scheme as Li and co-workers 19 in these categories so that readers can refer back to any particular IRC of that work if desired. www.nature.com/scientificdata www.nature.com/scientificdata/ The PES for each reaction channel are visualized by means of two collective variables of coordination numbers (CN) represented by where r 0 is the equilibrium distance and σ = .
3 0 controls the sharpness of the function. Reaction channels 5-7 involve only two atoms, and thus only a 1-D distance scan is performed.
Finally, we developed a strategy for extensive sampling of the PES for the 19 reaction channels for hydrogen combustion as follows: 1. Transition States and IRCs. Approximate TS structures were found using the freezing string method 24,25 , and refined by the partitioned-rational function optimization eigenvector following method (P-RFO) 26 . An IRC scan is then generated, and vibrational frequency analysis was performed to confirm that reactants and products have no imaginary frequencies and the TS has only one imaginary frequency. As the IRC configurations connect the minimum energy pathway, and therefore span a meaningful fraction of the configurational space of a given reaction, they serve as useful starting geometries for systematic normal mode displacements and stochastic generation of structures using AIMD at finite temperatures to explore the PES for each reaction channel in more detail. 2. AIMD Simulations. We employed AIMD simulations to sample configurations around the IRC structures using the TS as the initial configuration for each of the reaction channels. The AIMD simulations were performed at four different high temperatures by initializing the Maxwell-Boltzmann distribution of velocities at temperatures of 500 K, 1000 K, 2000 K and 3000 K. Furthermore at each temperature three different simulation timescales are performed using a 1.21 fs (1.au.) time step: 10 independent (i.e. reinitialized velocities) long simulations of 121 fs, 20 independent short trajectories of 60.5 fs, and finally 25 very short simulations of 24.2 fs. In summary, the AIMD calculations yielded a total of 10000 configurations along www.nature.com/scientificdata www.nature.com/scientificdata/ with their potential energies and nuclear forces for each reaction channel (see Table 1). 3. Normal Mode Displacements. Systematic normal mode displacements along the IRC is performed. Starting from each IRC structure, the frequencies were calculated and all atoms were displaced along each normal mode (NM) with a ± . 0 01, ± . 0 025, 0 05 ± . , ± . 0 075, ± . 0 1, ± . 0 125, and ±0.15. increment. These sampled structures that compress or expand the IRC structures help to diversify the AIMD geometries for each reaction, yielding ∼ 127,000 configurations as summarized in Table 1. The IOData Python library was used for parsing the Q-Chem output files in generating these geometries 27 . Figure 1 provides a representative ab initio sampling of one of the hydrogen transfer reactions,

Technical Validation
in which two collective coordinates reasonably capture the potential energy surface of this reaction channel. Upon analyzing the AIMD generated geometries and their energies, it is noticed that both the reactant and product endpoint regions are well sampled ( Fig. 1(a)). However, near the transition state or in regions of high slope on the potential energy surface, data points from the AIMD simulations are more sparse. The addition of normal mode displacement points greatly improves sampling the configuration space of the PES along the IRC path ( Fig. 1(b)). Figure 2 shows that the AIMD and NM calculations are complementary for sampling different areas away from the IRC, particularly evident for reaction channel 1 involving oxygen transfer (Fig. 2(a)), reaction 8 that probes the association reaction mechanism (Fig. 2(b)), and for reaction channel 16 pertaining to a substitution mechanism (Fig. 2(c)). In all cases the use of two collective coordinates is sufficient to capture the IRC and its AIMD and NM extensions, borne out in the supplementary information Figures S1-S4 that provides the potential energy surfaces generated for the remaining reaction channels for these classes of hydrogen combustion reactions. Figure 3 shows the nature of the alternative potential energy surfaces that are represented by the changes in spin state from doublet to quartet for the oxygen transfer reaction channel 12. Figure 3(a) shows that the energy difference between the two spin states is very small near the reactant, less than 0.2 kcal/mol, but favors the quartet state substantially around the product. Figure 3(b) plots the IRC using either the doublet or quartet www.nature.com/scientificdata www.nature.com/scientificdata/ spin state energies using the quartet spin state static structures, and similarly Fig. 3(c) represents the two spin state energies using the doublet energy configurations. Figure 3(d) shows the minimum energy of the two spin states along a single generated IRC. These differences indicate that while the geometric effects may be small, the electronic energy differences between spin states are significant. In the supplementary information we also provides the potential energy surfaces generated for reaction channel 6 which also undergoes a spin state change.
In summary, we generated high quality DFT data for hydrogen combustion reaction channels using range separated hybrid meta-GGA functional ωB97X-V with the cc-pVTZ basis set. This level of theory is considered highly accurate for thermochemistry and reactive barriers 28,29 , and the IRC profiles compared against the gold standard CCSD(T)/cc-pVTZ methods determined very small errors with the DFT level of theory 7 . This work moves beyond benchmarks such as the IRC for H 2 combustion by extensive sampling off the reaction coordinate using ab initio MD simulation and normal mode analysis for each of the 19 reaction channels. Furthermore, we also consider multiple spin states of the species formed in the hydrogen combustion process. This high quality data is now available to benchmark deep learning models for chemical reactivity, and as a model of the PES for generating kinetic models of H 2 combustion, especially at high pressure.

Data Records
All data can be found in the figshare repository. For each reaction channel the IRC, AIMD and NM generated configurations and corresponding energies and atomic forces are provided in.npz file format; for reaction channel 5, 6 and 7 only IRC generated data are provided as discussed above. Each .npz file contains six keys including, "R" (atomic Cartesian coordinates), "Z" (atomic numbers), "N" (number of atoms), "ΔE" (reference potential energy), "F" (atomic force vectors), and "RXN" (reaction number). All the atomic position are in Å and energy and force vectors are provided in kcal/mol and kcal/mol/Å, respectively. Reaction channels such as 6 and 12 involve nuclear spin changes during the reaction, and therefore IRC calculations are performed for both spin states, with the data sorted to either (1) retain energies and forces consistent with one spin state, or (2) retaining Fig. 3 The changes in the PES for reaction channel 12 involving changes in spin state. (a) the spin cross over between the two closely spaced doublet and quartet spin state energy levels around the reactant region with widening differences progressing to product. (b) the IRC path defined by the doublet energy but geometries from the quartet (green), and from the doublet energy and geometries (red). (c) the IRC path defined by the quartet energy but geometries from the doublet (green) and from the quartet energies and geometries (red). (d) Resultant PES obtained reaction channel 12 (  + →  +  HO  O  OH O   2 2 ) by choosing the minimum energy between the two spin states. Each CN represents the formation or breaking of respective bond involved in the reaction process mentioned in the axes.