Machine learning in chemical reaction space

Chemical compound space refers to the vast set of all possible chemical compounds, estimated to contain $10^{60}$ molecules. While intractable as a whole, modern machine learning (ML) is increasingly capable of accurately predicting molecular properties in important subsets. Here, we therefore engage in the ML-driven study of the even larger reaction space. Central to chemistry as a science of transformations, this space contains all possible chemical reactions. As an important basis for ‘reactive’ ML, we establish a first-principles database (Rad-6) containing closed- and open-shell organic molecules, along with an associated database of chemical reaction energies (Rad-6-RE). We show that the special topology of reaction spaces, with central hub molecules involved in multiple reactions, requires a modification of existing compound-space ML concepts. Showcased by the application to methane combustion, we demonstrate that the learned reaction energies offer a non-empirical route to rationally extract reduced reaction networks for detailed microkinetic analyses.

In order to allow for a clear definition of reactions in terms of bond-breaking and formation, only those systems where UFF and DFT geometries describe the same molecular topology are included in the database. This leads to a drastic reduction from the initial set of over 27,000 systems to 10,712 structures in the final Rad-6 database. A positive side effect is that the structures in Rad-6 can be expected to be reasonably stable, since they represent local minima on the DFT-calculated potential energy surface.
Rad-6-BS database: To investigate the stability of the proposed ML approach with respect to changes in the data (in particular regarding the spin state), a second set of single-point energies was calculated for the Rad-6 database using broken-symmetry DFT. Here, calculations were performed with the revPBE functional and def2-TZVP basis set using Orca. 11,12 To enable symmetry breaking even for nominally closed-shell systems, the beta-spin orbitals in the initial guesses were perturbed by randomly mixing an occupied and an unoccupied orbital. After convergence, four additional calculations were performed for each system, reusing the converged wavefunctions from previous runs and further perturbing the orbitals. This procedure was used to avoid SCF convergence into local minima or saddle points. 13

Supplementary Figure 2. As mentioned in the text, a separate density is constructed for every species. Planar cuts through $\rho_a^{\mathrm{C}}(\mathbf{r})$ (left) and $\rho_a^{\mathrm{H}}(\mathbf{r})$ (right) around the carbon atom at the center of the cutoff sphere (star) in ethylene are shown. The black circle represents the radial cutoff distance.

Supplementary Note 2: Theory and Computational Methods
Smooth overlap of atomic positions (SOAP): SOAP is a local kernel that measures the similarity of atomic environments. 10 It has been highly successful in molecular and solid-state applications. 10,14-16 Below, a brief overview of the concept is given; more details can be found in the literature. 17,18 SOAP is based on the neighborhood density function $\rho_a(\mathbf{r})$ around a reference atom $a$,

$\rho_a(\mathbf{r}) = \sum_{i \in \chi_a} f_{\mathrm{cut}}(|\mathbf{r}_i|)\, \exp\!\left(-\frac{|\mathbf{r}-\mathbf{r}_i|^2}{2\sigma_{\mathrm{at}}^2}\right), \qquad \text{(Supplementary Eq. 1)}$

where the sum runs over all neighboring atoms $i$ (within a cutoff radius, defining the atomic environment $\chi_a$) and $f_{\mathrm{cut}}$ is a damping function ensuring that the density smoothly approaches zero at the cutoff. Each atom (including the reference atom) within the cutoff is broadened with a Gaussian of width $\sigma_{\mathrm{at}}$, leading to a smooth, local representation of the atomic environment.
The atom-centered neighborhood density in Supplementary Eq. 1 applies to a system containing only one type of atomic species. For systems with different types of elements, like molecules, the density is constructed individually for every atomic species $Z$ within the atomic environment $\chi_a$ of atom $a$ (see Supplementary Figure 2):

$\rho_a^{Z}(\mathbf{r}) = \sum_{\substack{i \in \chi_a \\ Z_i = Z}} f_{\mathrm{cut}}(|\mathbf{r}_i|)\, \exp\!\left(-\frac{|\mathbf{r}-\mathbf{r}_i|^2}{2\sigma_{\mathrm{at}}^2}\right). \qquad \text{(Supplementary Eq. 2)}$

The similarity between two such environments can be measured via a rotationally averaged overlap integral,

$\tilde{k}(\chi_a, \chi_b) = \int \mathrm{d}\hat{R} \left| \int \rho_a(\mathbf{r})\, \rho_b(\hat{R}\mathbf{r})\, \mathrm{d}\mathbf{r} \right|^2, \qquad \text{(Supplementary Eq. 3)}$

where the outer integral is over all rotations $\hat{R}$, so that $\tilde{k}(\chi_a, \chi_b)$ is invariant to rotations and permutations of atoms. The power of two in the inner integral ensures that the kernel retains angular information about the neighborhood density.
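To make Supplementary Eqs. 1 and 2 concrete, the following minimal Python/numpy sketch evaluates the species-resolved neighborhood density at arbitrary points. The cosine-shaped cutoff function and all numerical values are illustrative assumptions, not the exact choices made in quippy.

```python
import numpy as np

def f_cut(r, r_cut):
    # assumed cosine damping: smoothly goes to zero at the cutoff radius
    return np.where(r < r_cut, 0.5 * (np.cos(np.pi * r / r_cut) + 1.0), 0.0)

def neighborhood_density(points, rel_positions, elements, Z, sigma_at=0.3, r_cut=2.0):
    """Species-resolved density rho_a^Z (Supplementary Eq. 2) at the given points.

    rel_positions: neighbor coordinates relative to the reference atom a, shape (N, 3)
    elements:      element symbol of each neighbor
    Z:             species channel to evaluate (e.g. "C" or "H")
    """
    rho = np.zeros(len(points))
    for pos, elem in zip(rel_positions, elements):
        if elem != Z:
            continue                      # only atoms of species Z contribute
        r_i = np.linalg.norm(pos)
        if r_i > r_cut:
            continue                      # atom lies outside the environment chi_a
        gauss = np.exp(-np.sum((points - pos) ** 2, axis=1) / (2.0 * sigma_at ** 2))
        rho += f_cut(r_i, r_cut) * gauss  # Gaussian of width sigma_at, damped at the cutoff
    return rho
```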
Importantly, this integral can be solved analytically if the neighborhood density is expanded in an atom-centered basis of orthogonal radial basis functions $g_n(|\mathbf{r}|)$ and spherical harmonics $Y_{lm}$,

$\rho_a^{Z}(\mathbf{r}) = \sum_{nlm} c_{nlm}^{Z}\, g_n(|\mathbf{r}|)\, Y_{lm}(\hat{\mathbf{r}}).$

The coefficients $c_{nlm}^{Z}$ are then transformed into the so-called power spectrum for individual species,

$p_{n n' l}^{Z Z'}(\chi_a) = \pi \sqrt{\frac{8}{2l+1}} \sum_m \left(c_{nlm}^{Z}\right)^{*} c_{n'lm}^{Z'},$

which we truncate at $n \leq 8$ and $l \leq 8$.
The kernel from Supplementary Eq. 3 can now be computed as a simple dot product of the 'partial' power spectra,

$\tilde{k}(\chi_a, \chi_b) = \sum_{Z Z'} \mathbf{p}^{Z Z'}(\chi_a) \cdot \mathbf{p}^{Z Z'}(\chi_b).$

To obtain the final SOAP kernel, this function is normalized and squared, so that

$k(\chi_a, \chi_b) = \left(\frac{\tilde{k}(\chi_a, \chi_b)}{\sqrt{\tilde{k}(\chi_a, \chi_a)\, \tilde{k}(\chi_b, \chi_b)}}\right)^{2}.$

These atomic kernels are the basis for building global kernels that match whole structures (e.g. molecules) instead of local environments. Here, we use the average (Eq. 4) and sum kernel (Eq. 5) described in the main text. It is worth stressing that the average kernel, being an intensive kernel, has to be normalized:

$K_{\mathrm{int}}(A, B) = \frac{\bar{K}(A, B)}{\sqrt{\bar{K}(A, A)\, \bar{K}(B, B)}}.$

In this work, we use the quippy code 19 and the mltools package 20 to compute SOAP kernels. In order to have flexibility in the description of short- and mid-range contributions, a kind of 'multiscale' (ms) SOAP is used. Specifically, two global SOAP kernels with cutoff values of 2 Å ($K_2$) and 4 Å ($K_4$) are applied simultaneously, with $\sigma_{\mathrm{at}} = 0.3$ Å for $K_2$ and $\sigma_{\mathrm{at}} = 0.6$ Å for $K_4$. For the average kernel, short- and mid-range contributions are combined as the average of the normalized kernels $K_2$ and $K_4$, i.e. $K^{\mathrm{ms}}_{\mathrm{int}} = \tfrac{1}{2}\left(K_{2,\mathrm{int}} + K_{4,\mathrm{int}}\right)$. For the sum kernel, we simply add the individual sum kernels, $K^{\mathrm{ms}}_{\mathrm{ext}} = K_{2,\mathrm{ext}} + K_{4,\mathrm{ext}}$. 21

Kernel ridge regression: Kernel ridge regression (KRR) is a supervised machine learning technique to obtain function values for given input configurations $\mathbf{x}_i$. In this section we give a short overview of this technique; for a detailed description and mathematical derivations the reader is referred to the literature. 22 The prediction is expressed as a linear combination of kernel functions $K(\mathbf{x}_i, \mathbf{x})$,

$f(\mathbf{x}) = \sum_i \alpha_i\, K(\mathbf{x}_i, \mathbf{x}),$

where the kernel functions act as similarity measures between input configurations $\mathbf{x}$ and $\mathbf{x}_i$ with target properties $y$ and $y_i$. The $\mathbf{x}_i$ are feature vectors of training data points and the $\alpha_i$ are regression weights.
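These kernels are what enters the KRR fits described below. To make their assembly concrete, here is a schematic numpy sketch that builds the normalized average (intensive) and sum (extensive) kernels from per-environment power spectra and combines two cutoffs into the multiscale kernels. The function names and array layout are our own illustrative assumptions, not the quippy or mltools API.

```python
import numpy as np

def atomic_kernel(P_a, P_b, zeta=2):
    """SOAP kernel between all environment pairs of two molecules.

    P_a, P_b: arrays of shape (n_atoms, n_features), one concatenated partial
              power spectrum per atomic environment.
    """
    # normalize each power spectrum so that k(chi, chi) = 1, then square (zeta = 2)
    P_a = P_a / np.linalg.norm(P_a, axis=1, keepdims=True)
    P_b = P_b / np.linalg.norm(P_b, axis=1, keepdims=True)
    return (P_a @ P_b.T) ** zeta

def global_kernels(power_spectra):
    """Average (intensive, normalized) and sum (extensive) kernels for a list of molecules."""
    n = len(power_spectra)
    K_avg, K_sum = np.zeros((n, n)), np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            k_ij = atomic_kernel(power_spectra[i], power_spectra[j])
            K_avg[i, j] = k_ij.mean()   # average over environment pairs (Eq. 4, main text)
            K_sum[i, j] = k_ij.sum()    # sum over environment pairs (Eq. 5, main text)
    d = np.sqrt(np.diag(K_avg))
    K_int = K_avg / np.outer(d, d)      # normalization of the intensive kernel
    return K_int, K_sum

# multiscale combination of two cutoffs (2 and 4 Angstrom):
#   K_ms_int = 0.5 * (K2_int + K4_int)
#   K_ms_ext = K2_ext + K4_ext
```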
KRR provides a closed-form solution for the optimal set of weights $\boldsymbol{\alpha}$. It is obtained by minimizing the loss function $l$ of a regularized least-squares problem,

$l = \sum_i \left(f(\mathbf{x}_i) - y_i\right)^2 + \sigma\, \boldsymbol{\alpha}^{\mathrm{T}} \mathbf{K} \boldsymbol{\alpha}.$

The solution of this problem is then given in matrix-vector notation by

$\boldsymbol{\alpha} = \left(\mathbf{K} + \sigma \mathbf{I}\right)^{-1} \mathbf{y},$

where $\mathbf{K}$ is the kernel matrix of the training set (with $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$), $\sigma$ is the regularization parameter and $\mathbf{I}$ is the identity matrix. $\sigma$ is a hyperparameter that has to be determined empirically (see Supplementary Note 3). It represents the noise level of the reference data and is used to control over- and underfitting.
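A minimal sketch of this closed-form fit and the corresponding prediction step (pure numpy; the kernel matrices, targets and σ are assumed to be given):

```python
import numpy as np

def krr_fit(K_train, y_train, sigma):
    # solve (K + sigma * I) alpha = y for the regression weights
    return np.linalg.solve(K_train + sigma * np.eye(len(y_train)), y_train)

def krr_predict(K_test_train, alpha):
    # f(x) = sum_i alpha_i K(x_i, x); K_test_train has shape (n_test, n_train)
    return K_test_train @ alpha
```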
In our work, we applied mean-correction to the observables in the fits with the intensive kernel, while we did not for the extensive kernel.

Kernel principal component analysis (kPCA): For kPCA, the kernel matrix is constructed analogously to KRR and then 'centralized',

$\tilde{\mathbf{K}} = \mathbf{K} - \mathbf{1}_N \mathbf{K} - \mathbf{K} \mathbf{1}_N + \mathbf{1}_N \mathbf{K} \mathbf{1}_N,$

where $\mathbf{1}_N$ is a matrix with the same dimensions as the kernel matrix, in which every element is identically $1/N$ (with the number of data points $N$). For $\tilde{\mathbf{K}}$, the eigenvalue problem

$\tilde{\mathbf{K}} \mathbf{v}_i = \lambda_i \mathbf{v}_i$

has to be solved, where $\mathbf{v}_i$ is the $i$-th eigenvector and $\lambda_i$ the respective eigenvalue. The data can then be projected into the new space via

$y_i^{(k)} = \sum_j \frac{v_j^{(k)}}{\sqrt{\lambda_k}}\, \tilde{K}_{ij}.$

Supplementary Note 3: Training Set Selection with Farthest Point Sampling

Farthest point sampling (FPS): FPS starts with an arbitrary data point and sequentially adds new structures so that the distance between the newest structure and all previously selected ones is maximized. 10,18,21 This requires a distance matrix, which is constructed from the kernel according to

$D_{ij} = \sqrt{K_{ii} + K_{jj} - 2 K_{ij}}.$

A sequence is generated for the complete Rad-6 database. The last 1030 structures of this sequence constitute the test set, and the 100 structures before these the validation set. Since the distance matrix is a function of the kernel, we obtain different training, validation and test sets for the average and sum kernel. In this work, the FPS is performed with $K_{\mathrm{int}}$ and $K_{\mathrm{ext}}$ using UFF geometries, starting from the H atom in each case.
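A schematic implementation of this FPS procedure, starting from the kernel-induced distance matrix (a minimal version of our own, not the mltools implementation):

```python
import numpy as np

def kernel_distance_matrix(K):
    # D_ij = sqrt(K_ii + K_jj - 2 K_ij), the distance induced by the kernel
    diag = np.diag(K)
    return np.sqrt(np.maximum(diag[:, None] + diag[None, :] - 2.0 * K, 0.0))

def farthest_point_sampling(D, start_index=0):
    """Order all points such that each new point maximizes its minimal
    distance to the previously selected ones."""
    order = [start_index]
    min_dist = D[start_index].copy()
    for _ in range(D.shape[0] - 1):
        nxt = int(np.argmax(min_dist))
        order.append(nxt)
        min_dist = np.minimum(min_dist, D[nxt])
    return order

# e.g.: order = farthest_point_sampling(kernel_distance_matrix(K_int), start_index=idx_H)
# the last 1030 entries then form the test set, the 100 entries before them the validation set
```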
Hyperparameter search: Our ML models contain several hyperparameters in the SOAP kernel and one hyperparameter in kernel ridge regression. In this work we do not optimize the SOAP hyperparameters, but only the regularization parameter σ of KRR. This is done by evaluating the RMSE of the validation set in a grid search. 9 The results for all kernels and FPS splits are shown in Supplementary Figures 4-7.
Including the SOAP hyperparameters in the optimization could lead to even smaller prediction errors.
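A sketch of such a grid search over σ (the grid values below are placeholders, not the grid actually used):

```python
import numpy as np

def validation_rmse(K_train, y_train, K_val_train, y_val, sigma):
    # fit on the training kernel, evaluate the RMSE on the validation set
    alpha = np.linalg.solve(K_train + sigma * np.eye(len(y_train)), y_train)
    return np.sqrt(np.mean((K_val_train @ alpha - y_val) ** 2))

def grid_search_sigma(K_train, y_train, K_val_train, y_val,
                      grid=10.0 ** np.arange(-8.0, 1.0)):
    # return the sigma with the lowest RMSE on the validation set
    errors = [validation_rmse(K_train, y_train, K_val_train, y_val, s) for s in grid]
    return grid[int(np.argmin(errors))], min(errors)
```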
Learning curves: Learning curves of AE and RE for the respective kernels and FPS splits are shown in Supplementary Figures 8-11. These plots show the MAE and RMSE for training, validation and test set (two left subplots) as well as for the reaction network Rad-6-RE (two right subplots).

Supplementary Note 4: Learning AE with UFF Geometries
Supplementary Figure 12 displays the results for the predictions of atomization energies using DFT geometries (Fig. 4 main text) as well as the MAEs for AE using UFF geometries.
As mentioned in the main text, using UFF instead of DFT geometries leads to the same qualitative trends, albeit with higher MAEs.

kPCA of the Rad-6 database: kPCA is a data visualization tool with which large data sets can be presented intuitively, providing insight into the database. Here, kPCA is used to take a closer look at the Rad-6 database and to visualize similarities and differences between the intensive and extensive kernels (see Supplementary Figures 13 and 14).
The location of molecules in the PCA plot is determined by their structural topology (see also Supplementary Figure 15).
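For reference, a compact numpy sketch of the kernel centralization and projection behind these maps (our schematic version, following the equations in Supplementary Note 2):

```python
import numpy as np

def kpca_projection(K, n_components=2):
    """Centralize the kernel matrix and project every data point onto the
    leading principal components."""
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)
    K_cent = K - one_N @ K - K @ one_N + one_N @ K @ one_N
    eigval, eigvec = np.linalg.eigh(K_cent)           # eigenvalues in ascending order
    idx = np.argsort(eigval)[::-1][:n_components]     # keep the largest ones
    # y_i^(k) = sum_j v_j^(k) / sqrt(lambda_k) * K_cent_ij
    return K_cent @ (eigvec[:, idx] / np.sqrt(eigval[idx]))
```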

Supplementary Note 7: σ-Scaling
As discussed in the main text, using force-field or semiempirical geometries instead of DFT geometries is inevitable for real applications. The quality of UFF geometries varies from molecule to molecule, which implies that the noise level on the reference data is not constant (see Supplementary Figure 19). To address this, Bartók et al. 10 suggested weighting the training structures so that the ML model naturally assumes higher uncertainties for configurations with poor geometries. In their work, the difference between high- and low-level structures is quantified by the root mean square deviation (RMSD) $d$, and the regularization parameter $\sigma$ is scaled to be proportional to the factor

$f = \exp\!\left(\frac{d^2}{\lambda}\right).$

This introduces a new hyperparameter $\lambda$ that has to be determined empirically. To estimate the range of reasonable parameters, we plot the scaling factor $f$ as a function of $\lambda$ for the RMSD values occurring in the training data.

In contrast, the results change somewhat for the predictions using the extensive FPS split (left subplots in Supplementary Figure 21). In these cases, σ-scaling lowers the error of the validation set but increases the RMSE of the test set for both the extensive and the intensive kernel, and thus leads to some degree of over-fitting.
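The scaling enters the KRR fit as a structure-dependent regularization term; a minimal sketch under the functional form stated above, assuming the per-structure RMSD values d are available:

```python
import numpy as np

def krr_fit_sigma_scaled(K_train, y_train, sigma, d_rmsd, lam):
    """KRR weights with per-structure regularization sigma_i = sigma * exp(d_i^2 / lam)."""
    f = np.exp(d_rmsd ** 2 / lam)      # scaling factor per training structure
    reg = np.diag(sigma * f)           # structure-dependent regularization
    return np.linalg.solve(K_train + reg, y_train)
```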
To conclude, scaling the regularization parameter with the RMSD between UFF and DFT training configurations did not improve the AE predictions. This is likely due to the varying and often poor quality of the UFF geometries of open-shell structures.
In this work, the RMSD values are calculated with the rmsd code obtained from GitHub. 24,25 Since the molecules for the UFF and DFT geometry optimizations are created from the same SMILES string, no atom reordering was applied.
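For completeness, a generic numpy version of the Kabsch-aligned RMSD between two conformers with identical atom ordering (a minimal sketch, not the code of Refs. 24,25):

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between coordinate sets P and Q (shape (N, 3), same atom order)
    after optimal translation and rotation (Kabsch algorithm)."""
    P = P - P.mean(axis=0)                      # remove translation
    Q = Q - Q.mean(axis=0)
    U, S, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against improper rotations
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T     # optimal rotation mapping P onto Q
    return np.sqrt(np.mean(np.sum((P @ R.T - Q) ** 2, axis=1)))
```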

Supplementary Figure 21. Learning curves of AE predictions for validation and test structures
with and without σ-scaling, using the extensive and intensive kernels with both FPS splits. The RMSE is displayed because the hyperparameters are selected according to the minimum RMSE of the validation set.

Supplementary Note 8: Learning RE with UFF Geometries
Supplementary Figure 22 shows the correlation plots between the predicted atomization energies and reaction energies for DFT and UFF geometries. As for the learning of AE, the trends for RE with UFF geometries are similar to those with DFT geometries, but with a higher MAE.

Supplementary Note 9: Training Set Selection with Random Sampling
In this section we show the performance for the prediction of AE and RE using randomly sampled training sets. The corresponding results are shown in Supplementary Figures 23 and 24.
Hyperparameter search: The hyperparameter search was performed as described in Supplementary Note 3 (see also Supplementary Figures 25 and 27).
However, an exception was made in the case of the intensive kernel. Here, only 99 molecules were used in the validation set to determine the regularization parameter σ. This is because large errors for the carbon dimer lead to a poor choice of σ in this case (i.e. the models were severely underfitted). This illustrates the dangers of pure random sampling: C₂ has low similarity with all other molecules in the dataset and should therefore be included in the training set (see also Supplementary Figure 26).
Learning curves: The learning curves are analogous to Supplementary Figures 8-11 but use training sets from random sampling.

Supplementary Note 11: Microkinetic Simulation
In the main text, we explore a realistic reaction network consisting of 21,392 reactions using an approximate microkinetic simulation. This network contains bond-breaking, transfer and rearrangement reactions of the general form

$\mathrm{A} + \mathrm{B} \rightarrow \mathrm{C} + \mathrm{D},$

where molecules B and/or D can be 'empty' placeholders for bond-breaking and rearrangement reactions. 2 The kinetics of this reaction network are governed by differential equations of the form

$\frac{\mathrm{d}\theta_{\mathrm{A}}}{\mathrm{d}t} = -\sum \theta_{\mathrm{A}}\, \theta_{\mathrm{B}}\, k_{\mathrm{AB}}^{\mathrm{CD}} + \sum \theta_{\mathrm{C}}\, \theta_{\mathrm{D}}\, k_{\mathrm{CD}}^{\mathrm{AB}},$

where $\theta_{\mathrm{A}}$ is the concentration of molecule A and $k_{\mathrm{AB}}^{\mathrm{CD}}$ is the rate constant for the reaction A + B → C + D. Note that the first sum runs over all elementary reactions that consume A, and the second sum over the corresponding reverse reactions, in which A is formed.
The term $\theta_{\mathrm{A}} \theta_{\mathrm{B}} k_{\mathrm{AB}}^{\mathrm{CD}}$ corresponds to the current rate of a given reaction, $r_{\mathrm{AB}}^{\mathrm{CD}}$. In other words, the rate depends on the concentrations of the educts and on the rate constant $k_{\mathrm{AB}}^{\mathrm{CD}}$, which in turn depends on the reaction and activation energies. As mentioned in the main text, all activation energies are assumed to be identical. We can then compute the rate constants from transition state theory via

$k_{\mathrm{AB}}^{\mathrm{CD}} = \nu\, \exp\!\left(-\frac{\Delta E}{k_{\mathrm{B}} T}\right).$

Here, the energy difference $\Delta E$ is the reaction energy plus the activation energy for an endothermic reaction, and the activation energy alone for an exothermic reaction. Under these circumstances, the actual value of the activation energy is not important (it is chosen to be 0.3 eV) and only changes the arbitrary time unit of the simulation. Similarly, we choose a constant pre-exponential factor $\nu$ of 1 for all reactions.
The simulation is initialized with equal concentrations of CH₄ and O₂, with all other concentrations set to 0. At the beginning of the simulation, all rates are thus also 0, except for the reactions involving CH₄ and O₂. We then propagate the differential equations specified above using a third-order Runge-Kutta integrator. 26 As the concentrations are updated, more rates become larger than zero. The subgraphs shown in the main manuscript contain all reactions with non-zero rates at a given simulation time.
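To illustrate the propagation scheme, the following is a self-contained sketch of such a microkinetic model for a toy list of reactions. The reaction list, reaction energies, thermal energy and time step are placeholder assumptions, and a simple fixed-step Euler update stands in for the third-order Runge-Kutta integrator actually used.

```python
import numpy as np

KB_T = 0.1    # placeholder thermal energy (arbitrary units)
E_ACT = 0.3   # constant activation energy in eV, as in the text

species = ["CH4", "O2", "CH3", "H", "HO2"]               # toy subset of the network
# each entry: (educts, products, reaction energy in eV); None marks an 'empty' placeholder
reactions = [
    (("CH4", None), ("CH3", "H"), 0.8),                  # bond breaking (toy value)
    (("CH3", "H"), ("CH4", None), -0.8),                 # corresponding reverse reaction
    (("H", "O2"), ("HO2", None), -0.5),                  # association (toy value)
]

def rate_constant(dE_reaction):
    # Delta E = activation energy (+ reaction energy if endothermic), nu = 1
    dE = E_ACT + max(dE_reaction, 0.0)
    return np.exp(-dE / KB_T)

def dtheta_dt(theta):
    """Right-hand side of the kinetic differential equations."""
    deriv = {s: 0.0 for s in species}
    for (A, B), (C, D), dE in reactions:
        rate = theta[A] * (theta[B] if B is not None else 1.0) * rate_constant(dE)
        for s in (A, B):
            if s is not None:
                deriv[s] -= rate      # educts are consumed
        for s in (C, D):
            if s is not None:
                deriv[s] += rate      # products are formed
    return deriv

# initial state: equal concentrations of CH4 and O2, everything else zero
theta = {s: 0.0 for s in species}
theta["CH4"] = theta["O2"] = 1.0

dt = 1e-2
for step in range(1000):
    deriv = dtheta_dt(theta)
    theta = {s: theta[s] + dt * deriv[s] for s in species}
```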

Supplementary Note 12: Validation of Out-Of-Sample Predictions
As discussed in the main manuscript, the reaction network used in the microkinetic analysis contains several systems that are not included in Rad-6 and thus represent a true out-of-sample application of the ML model. To evaluate the quality of these predictions, DFT calculations were performed on these out-of-sample systems. Unfortunately, these systems are missing from Rad-6 because they either decomposed upon geometry relaxation or had SCF convergence issues in the original high-throughput simulations for the database. We were, however, able to obtain single-point DFT energies on frozen UFF geometries (DFT@UFF, same computational settings as for Rad-6) for all but one of these systems.
In Supplementary Figure 34, correlation plots for DFT@UFF, DFT@DFT and ML predicted AEs are shown (with the out-of-sample systems highlighted in blue). As expected, the ML and DFT@DFT values display an excellent correlation. Meanwhile, both of these approaches consistently predict more negative AEs than the DFT@UFF approach, since the latter is missing geometry relaxation effects. Importantly, this is also the case for the out-of-sample predictions, meaning that the ML model can be used to estimate relaxation effects even when DFT relaxations are not available.
Overall, there is also a good correlation between the DFT@UFF values and the ML predictions, with R² = 0.994. To quantify the magnitude of geometry relaxation effects, we calculate the mean error (ME) between DFT@UFF and ML, in addition to the MAE. We find that the ME and MAE are nearly identical (ca. 0.6 eV, see Supplementary Table 2), confirming the systematic nature of the deviation. For comparison, the corresponding DFT@DFT values are also shown, again with identical ME and MAE. Note that the deviations relative to DFT@UFF are not identical for ML and DFT@DFT because the ML comparison includes more systems (the out-of-sample set). Taken as a whole, these observations provide a strong indication that our ML model predicts reasonable AEs for the out-of-sample molecules in the network.
This data is also used to verify the reaction energies that go into the microkinetic simulations, as shown in Supplementary Figure 35. Again, we find a good correlation between our ML model and the DFT@UFF calculations, with some scatter. Importantly, similar correlation and scatter are observed when comparing DFT@DFT and DFT@UFF, confirming the high quality of the ML predictions.