To address surface reaction network complexity using scaling relations machine learning and DFT calculations

Surface reaction networks involving hydrocarbons exhibit enormous complexity with thousands of species and reactions for all but the very simplest of chemistries. We present a framework for optimization under uncertainty for heterogeneous catalysis reaction networks using surrogate models that are trained on the fly. The surrogate model is constructed by teaching a Gaussian process adsorption energies based on group additivity fingerprints, combined with transition-state scaling relations and a simple classifier for determining the rate-limiting step. The surrogate model is iteratively used to predict the most important reaction step to be calculated explicitly with computationally demanding electronic structure theory. Applying these methods to the reaction of syngas on rhodium(111), we identify the most likely reaction mechanism. Propagating uncertainty throughout this process yields the likelihood that the final mechanism is complete given measurements on only a subset of the entire network and uncertainty in the underlying density functional theory calculations.

where E slab+ads is the raw DFT energy of the surface and adsorbate, E slab is the raw DFT energy of the surface slab, n i is the number of atoms of species i in the adsorbate, and µ i are reference energies. In this case the reference energies are given by electronic potential energies (as opposed to Gibbs free energies) and are computed by: where E j is the DFT gas-phase energy of species j. We applied corrections to gas-phase and surface adsorbates for species with known inaccuracies in descriptions by BEEF-vdW [6]. Corrections include gas-phase H 2 (+0.09 eV), Formic acid (+0.20 eV), CO 2 (+0.41 eV) and surface adsorbate COOH* (+0.20 eV). Binding energy correction for surface adsorbed species due to hydrogen-bond stabilization were included for OH* (-0.5 eV) and COOH* (-0.25eV).
Vibrational frequencies are calculated with a normal mode analysis by using a finite-difference approximation of the Hessian matrix as implemented in ASE. The finite difference delta is 0.01Å with 4 displacements per Cartesian coordinate. Imaginary frequencies are assumed to correspond to very low vibrational modes and are replaced by 6.8 meV due to the fact that the entropy associated with a mode diverges as the mode goes to zero. The cutoff of 6.8 meV corresponds to an entropy of ca. 3k B at 600 K; it is assumed that below this frequency the entropy would be bounded by other effects.
Gibbs free energies of adsorbed species are calculated using the harmonic approximation (all degrees of freedom are assumed to be vibrational and free energies are calculated from the harmonic oscillator partition function). Gasphase free energies are computed using the ideal gas approximation with vibrational modes calculated using DFT, and the rotational moments of inertia are calculated using the DFT optimized geometries. It was verified that the ideal gas approximation agrees with tabulated free energies to within 0.05 eV for small molecules. All free energies are calculated at 573 K with 1 atm partial pressure for each of the gas phase species (CO, H 2 , ethanol).

Supplementary Note 1: Reaction Mechanism Enumeration
The reaction network for intermediates with up to two carbons (C1/C2 chemistries) was built by constructing a list of elementary reaction steps and searching paths that connected the desired reactants and products. First, a list of C1 and C2 intermediates was constructed. All C1 intermediates were enumerated (e.g. CH 4 , CH 3 , CH 3 OH). C2 intermediates with CCO or OCCO backbones were also included. Adding other intermediates of interest such as elemental species (H,O,C), H 2 O, and OH resulted in a total of 99 intermediates of interest. Gas phase species of interest were also added, including CH 3 CH 2 OH, CH 3 OH, CH 4 , H 2 O, H 2 , CO 2 , CH 3 CHO, CO. Elementary reaction steps were then generated by including adsorption reactions for gas-phase species and generating all reactions that contained two adsorbed reactants and produced a valid intermediate (e.g. no C3 or above products). This process resulted in approximately 190 elementary reactions. Finally, pathways through the reaction network were generated by starting with the initial reactants, and performing a recursive search. The search was truncated at 8 reaction steps, and the resulting pathways were added if they reached one of the desired gas-phase final products. This recursive algorithm for mechanism enumeration is sufficient for networks of the size in this study, but the time complexity with respect to the number of reactions could be improved with more complex methods [7].

Supplementary Note 2: Prediction of Surface Intermediate Formation Energies
Formation free energies of intermediate surface species are predicted using a Gaussian process (GP) regression scheme with group-additivity based fingerprints as illustrated in Supplementary Figure 1. Given the chemical structure of an intermediate of interest, we first decompose the system into fragments using Extended Connectivity Fingerprints (ECFP) with radius-1 [8]. Each fragment represents the local environment of each atom by considering all atoms directly bonded. Bond types are used in the fingerprint. With only radius-1 fragments, there is one fragment per atom. Unlike other approaches that have aimed to develop the most accurate group-additivity approach for predicting these formation energies [9,10], we explicitly do not include bonds to the surface since doing so generally requires some insight or prediction into precisely how the molecule is adsorbed onto the surface. Instead, this same physical insight is effectively included in the regression scheme. This allows this approach to be very flexible in quickly building networks for various surface species where the precise adsorption configuration is not initially known, but limits the transferability of the model between facets or surface adsorption sites. It is expected that electronic-structure calculations represent the lowest-energy configuration for a given adsorption site, as the described DFT methodology provides.
In the network of reactions on Rh(111), there are approximately 100 species, and roughly 45 unique fragments in the EFCP fingerprints. The dimensionality of this parameter space is comparable to the number of intermediates and thus including all of these fragments in a fitting procedure would lead to overfitting. We know that many of these fragments do not consist of independent quantities, but are actually dependent and related as demonstrated by the wide applicability of linear scaling relations [11]. Instead, we take the fingerprint for each molecule (a vector of dimension approximately 50), and reduce the dimensionality to just 10 using principle component analysis (PCA). This dimensionality reduction was performed after each iteration of the exploration algorithm, so that the precise mapping of the initial to final fingerprint space is different at each iteration. This process is equivalent to re-constructing a new set of linear scaling relations at each iteration. The number of final dimensions after the PCA reduction (either 10 or 15) did not have a significant effect on the accuracy or convergence of the process, suggesting that there were sufficient degrees of freedom for the group additivity method, and is expected given the remarkably broad applicability of far simpler linear scaling relations. This number could instead be chosen in an automated fashion instead using cross-validation. The PCA reduction was only applied after the number of molecules in the training set was greater than the final number of dimensions in the PCA.
A GP regression scheme was used for the final prediction of the formation free energy for each intermediate relative to the gas phase reference species (CO, H 2 , H 2 O). The reduced dimension fingerprint (from the PCA analysis) was used as the input vector, and the free energy of formation (G f ) as the observable. The GP was trained on the available DFT measurements and re-trained at each iteration as the free energy of requested intermediates were added from DFT calculations. DFT uncertainty was propagated through the GP by constructing a GP for each of the BEEF ensembles. Uncertainty associated with the GP itself was not propagated in this scheme, as it would have required a more sophisticated sampling scheme to properly include the effects from multiple independent sources of uncertainty. The GP was used only for prediction of unmeasured species; if the formation energy of a species had been calculated with DFT, the measured free energy was simply used. A second GP was trained and used for the prediction of enthalpic energies (E f ) in an equivalent fashion. Both Gaussian processes were constructed using identical settings in the python module sklearn, with a squared exponential correlation function, a constant regression scheme, θ 0 = 0.01 and a nugget of 0.05 2 (representing an assumed uncertainty of 0.05 eV in the underlying DFT calculations. These parameters were chosen empirically to balance a GP that would be accurate near points in the training set while still yielding useful predictions for trial points.

Supplementary Note 3: Scaling Relations Using CatApp
Estimates of transition state enthalpies were provided with linear scaling relations. Scaling relations between the enthalpy of reaction and enthalpic activation energy were constructed using the CatApp database [12]. Two scaling relations were constructed: one for hydrogenation reactions, and one for all other reactions, as illustrated in Supplementary Figure 2. The resulting linear relation was of the standard form where α, β are fitted constants for each reaction type and the energy of the transition state relative to the reference state (gas phase enthalpies) was where E IS represents the combined enthalpies of reactants, also relative to the reference gas phase species (H 2 O, CO, and H 2 ). The slope of the BEP relation α was taken from fitting the measurements in CatApp. The measurements in this work were observed to lie within uncertainty of the CatApp data, but were slightly below the CatApp trend line. A linear regression for the measured Rh(111) hydrogenation data (in the form above, α∆E rxn + β) using the same slope (α = 0.69) resulted in an offset about 0.2 eV lower than the CatApp trend line (β = 0.70 eV vs β = 0.95 eV for CatApp data). A similar process for non-hydrogenation reactions resulted in a significantly lower trend line (β = 1.00 eV vs β = 1.57 eV for CatApp), suggesting that the non-hydrogenation reactions in this work (mainly C-O scission, C-C scission and adsorption/desorption), were not representative of all of the reactions included in the CatApp database. Also contributing to this discrepancy is the inclusion of van der Waals interactions in this work, which were not present in most of the calculations in the CatApp database. Similar relations were used to construct an estimate of the transition state free energies. We assumed that the BEP relations for free energy followed the same form In the absence of a database of transition state free energies (analogous to CatApp with enthalpic information) to calculate α G , β G , we found that the linear scaling relations for enthalpies were reasonable approximations for the free energy linear scaling relations, as shown in Supplementary Figure 2. For both hydrogenations and non-hydrogenation classes of reactions, the linear fits were nearly identical as to the fits for enthalpy, with β about 0.05 eV lower in both cases. Using just two classifications represents a conservative estimate of the uncertainty in this process. Improved classification of reactions (more than just hydrogenation, non-hydrogenation used here) would likely lead to more accurate BEP relations but would require a larger database of surface reactions.

Supplementary Note 4: Identification of Rate Limiting Steps
Pathways through the reaction network were evaluated by the rate-limiting transition state along the reaction pathway. In transition state theory, the rate-limiting step for reaction pathways in series with intermediate energies all larger than the initial state will be reaction with the highest transition state free energy, G max TS , relative to the initial state. Pathways with varying sinks before the rate-limiting step would have different kinetics and require a more complicated treatment, such as the energetic-span approach from literature [13]. Pathways with smaller G max TS were assumed to have higher fluxes through the rate-limiting steps. We ignore effects of coverage on transition state energies. With this simple criterion, we were then able to rapidly identify the likelihood of each reaction being the most-likely reaction, without resorting to expensive and more computationally demanding microkinetic models of the entire reaction pathway. This will not hold for reactions that have intermediate energies substantially lower than the initial state. Given the proximity of the rate-limiting step to the initial species in this work, this is not a concern, but could be addressed for other systems with a more detailed microkinetic model.
The highest transition state for each pathway G max TS was evaluated using the predicted G TS and corresponding uncertainty for each transition state in the pathway as outlined above. Uncertainty in the predicted G TS for each species each of these samples, the pathway with the lowest G max TS was evaluated. The probability of each path being the one with the lowest G max TS was evaluated by counting the number of samples for which this was the case. This process allowed for correlations in the uncertainty of different species in the BEEF-vdW ensemble to propagate through to the end result. For example, uncertainty in the relative G max TS for two pathways with the same rate-limiting step is substantially lower than for pathways with different rate-limiting steps. Reduced networks were then constructed by including pathways with progressively less likelihood of containing a lower G max TS until until a desired probability of completeness was accumulated.

Supplementary Note 5: Final Reduced Mechanism at Reaction Conditions
Included is the most likely pathway to ethanol on Rh(111) at the given conditions according to first-principles DFT, as illustrated in Figure 3 of the main text. Reaction numbers correspond to the full list of elementary steps in the final section of the SI.