## Introduction

The advancement in density functional theory (DFT) has enabled mechanism development and in silico catalyst design1. DFT calculations have been performed for several small-molecule chemistries, including hydrogen evolution and oxidation reactions2,3, oxygen reduction and evolution reactions4,5,6,7,8, CO2 reduction, N2 reduction9,10, and CH4 activation11. Computing the species configurations and thermochemistry is essential, as correlated uncertainty quantification reveals that more thermodynamic parameters than activation energy parameters affect the kinetics12. Adsorbate configurations are prerequisites in computing activation energies of elementary reactions. While manual DFT calculations have been adequate for small molecules, they are impractical for large molecules due to the combinatorial size of the reaction network that includes all intermediates13. Thus, an extension of computations to large molecules on transition-metal catalysts has been lagging. Establishing a framework for modeling large molecules would thus be essential to significantly accelerate mechanistic and discovery studies, for example, in renewable energy, such as biomass pyrolysis and gasification14,15, biomass upgrade via hydrogenation16,17,18 and hydrodeoxygenation19,20, and hydrogen production via biomass reforming21, and recycling of plastics.

Surprisingly, the challenge in DFT calculations of adsorbates is not merely the computational cost; databases on the order of 10⁶ entries are becoming commonplace22,23,24,25. A key challenge is the automated generation of stable adsorbate configurations on surfaces. The adsorption configurations of large molecules are combinatorially intractable to enumerate in practice because of the multiple adsorption sites and several surface-binding atoms26. Each stable configuration can undergo different chemistry, and the reaction network thus depends critically on identifying all (or at least most) stable configurations. It turns out that this task escapes chemical intuition.

Several tools can ease the generation of stable configurations. Peterson et al.27 developed a global adsorbate configuration optimization method based on constrained minima hopping, but its scalability is limited because DFT-based annealing is used. Medford and coworkers combined minima hopping with the faster density functional tight-binding (DFTB) method for bidentates, but obtaining reliable DFTB parameters is not trivial28. Bligaard and coworkers implemented graph-based enumeration of bidentate adsorbate configurations29, and Greeley and coworkers developed a Python-based graph theory package that encodes adsorption structures as graphs to identify them uniquely and to generate high-coverage configurations30. Currently, no general strategy exists that systematically identifies stable adsorbate configurations with three or more surface-binding atoms, which are needed to adequately describe the chemical reactions of large molecules on metal surfaces.

Here, we introduce a general framework to predict a nearly complete set of stable adsorbate configurations on metal surfaces. We introduce expert knowledge-based enumeration rules to generate the configuration space, which contains most, if not all, stable configurations. The configurations are optimized using a force field, and strained configurations are removed. For the configurations with ≤3 heteroatoms (non-hydrogen organic atoms), we perform multi-fidelity DFT calculations to assess the configuration stability. With these data, we train a machine learning (ML) model and use it as a screening tool to predict the stability of larger adsorbates before performing DFT calculations. The workflow is summarized in Fig. 1. We apply the framework to close-packed surfaces of Ag, Au, Co, Cu, Ir, Ni, Pd, Pt, and discover 4,979 stable configurations. The predictive ability of the ML model-based screening is further demonstrated for 1,650 configurations with 4 to 6 heteroatoms, also computed via DFT. We find that distinct trends in stable configurations among catalysts explain the observed selectivity in experimental systems, and the clustering in the adsorbate data is rationalized by the d-band/adsorbate interactions. We propose that stable intermediates are essential for a catalyst to carry out a specific reaction, and the extensive library created here can be leveraged to pre-screen catalysts for all commonly metal-catalyzed chemistries. This work lays the foundation for mechanistic insights into, and design principles for, large-molecule conversion.

## Results

### Skeleton enumeration

We introduce graph transformation rules to enumerate “skeleton” configurations, which contain carbons and their connectivity patterns to the surface, inspired by Ruddigkeit et al.31 The initial pool of configurations is built by adding a carbon on the top, bridge, and hollow sites of a large surface lattice graph. Then, rules that each add exactly one carbon atom are repeatedly applied to build larger configurations. Hydrogen additions and electronic effects are considered later.

Four types of rules can comprehensively enumerate all possible adsorbate configurations. The first type adds an adsorbed carbon to an adsorbed carbon (surface propagation rules). These rules are constructed systematically in four steps. First, find all possible one-atom binding sites on close-packed surfaces (top, bridge, and hollow sites; inset 1 in Fig. 2a). Second, enumerate two-atom configurations by exhaustively varying (1) the number of metal atoms shared by the two binding sites, e.g., one metal atom involved in the bonding of two bridge sites, and (2) the total number of adsorbate-surface bonds, where a top, bridge, or hollow site contributes 1, 2, or 3 bonds, respectively (Fig. 2a). Third, remove unreasonable configurations with unrealistic bond distances. Fourth, convert the two-atom configurations (e.g., green box in inset 2 in Fig. 2a) to graph transformation rules (e.g., blue box in inset 2 of Fig. 2a). A rule consists of a pattern graph (left-hand side of the blue box) and a replacement graph (right-hand side of the blue box). A graph transformation is applied to a configuration by searching for occurrences of the pattern graph in the configuration and replacing each found occurrence with the replacement graph. The two-atom configuration (the green box) becomes the replacement graph (right side of the blue box). The pattern graph (left side of the blue box) is made by removing one atom from the two-atom configuration (the green box). The key postulates are that (1) all possible two-atom configurations are systematically enumerated and (2) larger configurations are composed of two-atom configurations (e.g., a six-atom skeleton can be decomposed into two-atom configurations). This framework applies to other planar surfaces, such as fcc(100), hcp(10$\bar{1}$0), and bcc(110).
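The idea of growing skeletons by repeatedly applying a one-carbon rule can be illustrated with a deliberately simplified sketch. It is not the rule set of Fig. 2a: the lattice (a periodic 4 × 4 grid of top sites only), the single top-to-top rule, and all names are assumptions made for illustration, and no bond-distance or symmetry filtering is applied.

```python
# Toy illustration of rule-based skeleton growth (assumed, simplified setup).
# Sites are top sites on a periodic N x N triangular lattice, indexed (i, j).
N = 4
NEIGHBOR_OFFSETS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]


def neighbors(site):
    """Return the six nearest-neighbor top sites of a site (periodic lattice)."""
    i, j = site
    return [((i + di) % N, (j + dj) % N) for di, dj in NEIGHBOR_OFFSETS]


def apply_rule(config):
    """Apply one hypothetical surface propagation rule wherever it matches.

    Pattern graph: an adsorbed carbon next to an empty top site.
    Replacement graph: the same carbon bonded to a new carbon on that site.
    A configuration is (occupied sites, C-C bonds), both frozensets.
    """
    sites, bonds = config
    results = set()
    for carbon in sites:
        for site in neighbors(carbon):
            if site in sites:
                continue  # pattern requires an empty neighboring site
            new_sites = sites | {site}
            new_bonds = bonds | {frozenset((carbon, site))}
            results.add((new_sites, new_bonds))
    return results


def enumerate_skeletons(n_carbons):
    """Grow skeletons one carbon at a time, starting from a single carbon."""
    pool = {(frozenset({(0, 0)}), frozenset())}
    for _ in range(n_carbons - 1):
        pool = {new for config in pool for new in apply_rule(config)}
    return pool


print(len(enumerate_skeletons(2)))  # six two-carbon skeletons in this toy model
```

Storing configurations as frozensets removes exact duplicates automatically; symmetry-equivalent duplicates would still need a canonical graph representation, which this sketch does not attempt.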

The second type of rule accounts for non-surface-bonding carbons (e.g., -CH2- and -CH3). Non-surface-bonding carbon can be added to an adsorbed carbon on top, bridge, or hollow site. Also, non-surface-bonding carbon can be added to another non-surface-bonding carbon to increase the chain length. We call these rules vacuum propagation rules (Fig. 2b).

As shown in Fig. 2c, adsorbates form an “arc” containing a non-surface-bonding atom chain and two anchoring adsorbed atoms (e.g., (CH2)x). Rules that add an adsorbed carbon to a non-surface-bonding atom can be used to construct arcs, but two anchoring atoms cannot be too far apart. Thus we introduce two metrics, as shown in Fig. 2c, where dsurface and dnearest neighbor are the distance between the two anchoring surface atoms and the distance between two nearest neighbor surface atoms, respectively. The ratio of the two defines a normalized length threshold for the arc to be stable (Fig. 2d), which we estimate using DFT with (CH2)x on Pt(111). The line between the stable and unstable data indicates the decision boundary we used to decide the arcs’ stability. Figure 2e demonstrates a rule for anchoring an arc (called anchoring rules), the pattern graph of which has to respect the distance constraint of Fig. 2d.
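The distance criterion for arcs can be sketched as follows. The ratio follows the definition in the text; the decision boundary itself is only given graphically (Fig. 2d), so the lookup table `max_ratio_by_chain_length` and its values are placeholders, not fitted numbers.

```python
import math


def arc_length_ratio(anchor_a, anchor_b, d_nearest_neighbor):
    """Normalized arc span: d_surface / d_nearest_neighbor.

    anchor_a, anchor_b: Cartesian positions of the two anchoring surface atoms.
    d_nearest_neighbor: nearest-neighbor distance between surface atoms.
    """
    d_surface = math.dist(anchor_a, anchor_b)
    return d_surface / d_nearest_neighbor


def arc_is_allowed(ratio, n_chain_atoms, max_ratio_by_chain_length):
    """Accept an arc if its normalized span is within the assumed boundary."""
    return ratio <= max_ratio_by_chain_length[n_chain_atoms]


# Placeholder boundary values (hypothetical, for illustration only).
boundary = {1: 1.0, 2: 2.0, 3: 2.5}
ratio = arc_length_ratio((0.0, 0.0, 0.0), (5.6, 0.0, 0.0), 2.8)
print(arc_is_allowed(ratio, 3, boundary))
```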

The last type of rule adds an adsorbed carbon to two adsorbed carbons, forming a ring (ring rules). Figure 2f shows the ring rules developed by enumerating three adsorbed-carbon chains and building the pattern graph by removing the central atom.

After the enumeration, surface atoms in each enumerated configuration are systematically pruned to build a unique, unambiguous graph (see Supplementary Fig. 1). Duplicate configurations are removed by comparing their hash, such as the SMILES string.

### Force field screening

We remove strained configurations by optimizing the skeleton configurations with the universal force field32, augmented with heuristic adsorbate-surface interaction terms (see methods for details). Structures with any C–C bond length outside the range 0.8–1.65 Å are removed; this broad window is based on the covalent radii of carbon and oxygen.
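A minimal sketch of the strain filter, assuming the optimized geometry is available as Cartesian coordinates and a list of C–C bonds (the function and argument names are illustrative):

```python
import math

# Allowed C-C bond length window in angstrom, as stated in the text.
CC_MIN, CC_MAX = 0.8, 1.65


def is_strained(positions, cc_bonds):
    """Return True if any C-C bond length falls outside the allowed window.

    positions: list of (x, y, z) coordinates after force field optimization.
    cc_bonds: list of (i, j) index pairs of bonded carbons.
    """
    for i, j in cc_bonds:
        length = math.dist(positions[i], positions[j])
        if length < CC_MIN or length > CC_MAX:
            return True
    return False


print(is_strained([(0, 0, 0), (1.54, 0, 0)], [(0, 1)]))  # False
print(is_strained([(0, 0, 0), (1.90, 0, 0)], [(0, 1)]))  # True
```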

### Transformation to an adsorbate

The unstrained skeleton configurations produce realistic configurations on which we substitute carbon with oxygen at all possible locations and add hydrogens to carbons and oxygens while respecting the valency rule. A varying number of hydrogens is added to the skeleton to represent all possible degrees of saturation; thus, the number of configurations significantly increases in this step.
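The growth in the number of configurations at this step can be illustrated with a small sketch. It assumes that every skeleton bond and every surface bond consumes one valence (single bonds only), with maximum valences of 4 for carbon and 2 for oxygen; the representation of a skeleton as a list of per-atom bond counts is a simplification introduced here.

```python
from itertools import product

# Assumed maximum valences.
MAX_VALENCE = {"C": 4, "O": 2}


def decorate(skeleton_bond_counts):
    """Enumerate element/hydrogen assignments for a skeleton.

    skeleton_bond_counts: for each skeleton atom, the number of bonds it
    already has (to other skeleton atoms plus to surface atoms).
    Returns a list of tuples of (element, n_hydrogens), one entry per atom.
    """
    per_atom_options = []
    for used in skeleton_bond_counts:
        options = []
        for element, valence in MAX_VALENCE.items():
            free = valence - used
            if free < 0:
                continue  # this element cannot support the existing bonds
            # Any degree of saturation from 0 up to the free valence.
            options.extend((element, n_h) for n_h in range(free + 1))
        per_atom_options.append(options)
    return list(product(*per_atom_options))


# Two bonded atoms, each on a top site: 2 bonds already used per atom.
print(len(decorate([2, 2])))  # 16
```

Each atom in this example has four options (C with 0, 1, or 2 hydrogens, or O with none), so a single two-atom skeleton already yields 16 candidate adsorbates.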

### Multi-fidelity DFT screening

We perform low-fidelity DFT calculations of configurations with ≤3 heteroatoms, with an early stopping criterion upon configuration divergence, to assess the stability. The parameters used for the low-fidelity DFT setup result in less accurate but more efficient calculations (see methods). In a benchmark on Pd(111), 43 of the 52 configurations that are stable with the standard DFT setup are also stable with the low-fidelity setup (see methods). The connectivity graph of each DFT-calculated structure is built by connecting atoms i and j when dij < t(rcov,i + rcov,j), where dij is the distance between atoms i and j, t is the tolerance factor (1.18 used), and rcov,i is the covalent radius of atom i. The stable configurations are further refined using high-fidelity DFT calculations.
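The connectivity criterion and the resulting stability label can be sketched as below. The covalent radii are assumed values (Cordero et al. tabulation), since the text does not state which set is used, and periodic boundary conditions are ignored for brevity.

```python
import math
from itertools import combinations

# Assumed covalent radii in angstrom; the source does not specify the set used.
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "O": 0.66, "Pt": 1.36}
TOLERANCE = 1.18  # tolerance factor t from the text


def connectivity(symbols, positions, t=TOLERANCE):
    """Return the set of bonded index pairs using d_ij < t * (r_i + r_j)."""
    edges = set()
    for i, j in combinations(range(len(symbols)), 2):
        cutoff = t * (COVALENT_RADIUS[symbols[i]] + COVALENT_RADIUS[symbols[j]])
        if math.dist(positions[i], positions[j]) < cutoff:
            edges.add((i, j))
    return edges


def is_stable(symbols, initial_positions, relaxed_positions):
    """A configuration is labeled stable if relaxation keeps its connectivity."""
    before = connectivity(symbols, initial_positions)
    after = connectivity(symbols, relaxed_positions)
    return before == after


symbols = ["C", "O"]
print(is_stable(symbols, [(0, 0, 0), (0, 0, 1.15)], [(0, 0, 0), (0, 0, 1.20)]))
```

Comparing edge sets by atom index works here because relaxation preserves atom ordering; comparing configurations from different calculations would need a graph isomorphism check instead.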

### ML-based stability prediction

We rapidly screen the stability of the configurations with >3 heteroatoms by introducing a fingerprint-like descriptor-based logistic regression (FLDLR), shown in Fig. 3a, with fingerprint-like descriptors as input features33. In this method, all possible subgraphs of the adsorbate are enumerated, and, for each subgraph, the surface atoms connected to the adsorbate are added. The output feature vector contains the number of occurrences of each fingerprint. The training data set is obtained by performing DFT calculations for configurations with ≤3 heteroatoms, where the stability is quantified as 1 (stable) or 0 (unstable). A configuration is labeled stable if the connectivity does not change after the DFT relaxation (i.e., the configuration represents a local or global minimum on the potential energy surface). If the connectivity pattern changes upon DFT relaxation, we label the configuration unstable, as it represents an unstable point on the potential energy surface. As the model will primarily be used to predict configurations of larger adsorbates, we devise a similar extrapolation test: we train the model with adsorbates of ≤2 heteroatoms and assess its error on adsorbates with three heteroatoms. Logistic regression calculates the probability (a continuous value between zero and one) that a configuration is stable. The probability threshold is used as a tunable parameter for screening. Its effect on the model performance is assessed by the test set recall, precision, F1 score, selectivity, and accuracy in Fig. 3b–e and Supplementary Fig. 2. As we are interested in a comprehensive database containing nearly all stable configurations, a high recall, TP/(TP + FN), is desired. Here T, F, P, and N are true, false, positive, and negative, respectively. A low threshold of 0.2 (Fig. 3b) ensures that 95% of all stable configurations are sampled (a high recall). However, a low threshold also implies that many unstable configurations are selected (undesired). The precision, TP/(TP + FP), in Fig. 3c shows that only 10% of the selected configurations will be stable (a low precision). The F1 score in Fig. 3d shows the harmonic mean of the precision and recall. A threshold of 0.76 most efficiently samples the stable configurations, at the cost of missing some stable configurations. The selectivity, TN/(TN + FP), in Fig. 3e indicates the DFT cost-saving from the ML screening: we would screen out 44% of the unstable configurations using ML at the threshold of 0.2. Supplementary Figure 2 shows that the accuracy is high at higher thresholds, as most of the enumerated configurations are unstable.
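The threshold-dependent metrics defined above can be computed as in the following sketch. The scores and labels are made-up toy values, not data from this work.

```python
def screening_metrics(scores, labels, threshold):
    """Compute recall, precision, F1, selectivity, and accuracy.

    scores: predicted stability probabilities (0 to 1).
    labels: DFT stability labels, 1 (stable) or 0 (unstable).
    A configuration is selected (predicted stable) if score >= threshold.
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        selected = score >= threshold
        if selected and label == 1:
            tp += 1
        elif selected and label == 0:
            fp += 1
        elif not selected and label == 0:
            tn += 1
        else:
            fn += 1
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    selectivity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / len(labels) if labels else 0.0
    return {"recall": recall, "precision": precision, "f1": f1,
            "selectivity": selectivity, "accuracy": accuracy}


# Toy example (hypothetical values).
print(screening_metrics([0.9, 0.3, 0.1, 0.8, 0.05], [1, 1, 0, 0, 0], 0.2))
```

Lowering the threshold moves configurations from the rejected to the selected set, which can only raise recall and lower selectivity; this is the trade-off explored in Fig. 3b–e.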

Incorporating FLDLR as a screening tool before performing DFT calculations can significantly reduce the computational cost for larger adsorbates. We retrained the model with ≤3 heteroatoms configurations, and randomly sampled 50 configurations each for 4, 5, and 6 heteroatoms on 11 metals using the uniform distribution over stability score, and performed DFT calculations. The FLDLR calculated score and the DFT inferred stability are compared in Supplementary Fig. 3. We find that 99% of the configurations with low scores (<0.05) are unstable. Since the configurations with low scores (<0.05) comprise most of the large molecule configuration space (84%, 95%, and 99% for 4, 5, and 6 heteroatoms, respectively), one to two orders of magnitude reduction in DFT calculations is expected using the low score as a screening criterion. We believe that ML predictions in the low score region extrapolate well to larger adsorbates; the fingerprints causing instability in the small adsorbate configurations are also present and also cause instability in larger adsorbate configurations. Some of the converged structures with 6 heteroatoms are shown in Fig. 4.
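As a rough worked estimate, assume that every configuration with a score below 0.05 is skipped and all others are computed. If $f_{\mathrm{low}}$ is the fraction of low-score configurations (the symbol is introduced here for illustration), the number of DFT calculations shrinks by a factor of $1/(1-f_{\mathrm{low}})$:

$$\frac{1}{1-0.84}\approx 6.3,\qquad \frac{1}{1-0.95}=20,\qquad \frac{1}{1-0.99}=100$$

for 4, 5, and 6 heteroatoms, respectively, so the saving grows with adsorbate size.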

### Enumerated data distribution

The number of configurations in the various methodological stages is shown in Fig. 5. It increases exponentially with the number of atoms, reaching ~10⁸ configurations for six heteroatoms. The number of DFT-calculated stable structures (green points) scales less steeply than the enumerated ones. The ML screening (using a threshold of 0.2) reduces the number of calculations by two orders of magnitude for adsorbates with 6 heteroatoms.

Some molecules do not adsorb on some metals. For example, ethylene (CH2CH2) does not adsorb on Au(111) and Ag(111) but adsorbs on Cu(111) in η adsorption mode, in agreement with previous DFT calculations38. Thus, we perform a principal component analysis of a binary matrix with dimension (number of metals) × (number of molecules), where a matrix element is set to 1 if the given molecule adsorbs on the given metal and 0 otherwise (molecule adsorption stability matrix in Fig. 6c). Compared to the previous matrix, the second dimension runs over molecules. There are essentially three clusters of data. The first cluster contains mostly strongly binding metals (Pt, Ni, Rh, Co, Ir, Re, Pd, and Ru); on these metals, most of the molecules have multiple stable adsorbed configurations. The second cluster reflects several multidentate molecules and molecules with high valency that do not adsorb on Au. This explains the poor performance of Au for C–C scission (encountered, for example, in steam and dry reforming of larger fuels, e.g., ethanol) and isomerization, as important dehydrogenated reaction intermediates, such as CH3CHO, CH2CH2, and CH2C, do not adsorb on Au39,40. Similarly, Au is a poor catalyst for Fischer-Tropsch synthesis, as important intermediates for C–C coupling typically have high valency41,42,43. The third cluster contains Ag and Cu, which can adsorb three-membered ring structures that are unstable and typically dissociate on the other metals. Some of these molecules are dehydrogenated derivatives of ethylene oxide (epoxide). Ag and Cu have long been used for selectively producing ethylene oxide44,45. Hence, these metals’ affinity for the stable ethylene oxide derivatives may be the key to their high selectivity.

### Predicting selective catalysts

Building on the idea that adsorbate stability is crucial for selectivity, we predict selective catalysts for four-heteroatom closed-shell molecules using ethylene oxide as a reactant. We enumerate all possible reaction paths between ethylene oxide and four-heteroatom closed-shell molecules by adding and removing C, H, and O in the enumeration rules. For each metal, the shortest reaction paths containing only stable intermediates were extracted. The stability of adsorbates was assessed using DFT for ≤3 heteroatom adsorbates and FLDLR with a threshold of 0.95 (high probability of stability) for >3 heteroatom adsorbates. The paths to closed-shell molecules with fewer than 5 viable metals are shown in Fig. 7 as examples of selective catalysts. The thermochemistry and kinetics were not assessed; thus, realizing these chemistries requires further investigation. We find that Au(111) is selective to all nine molecules, whereas seven other surfaces are selective to a few. Notably, eight out of nine molecules contain rings, which are typically produced by homogeneous organic reactions. Specifically, homogeneous gold catalysts produce small rings with fewer than six atoms46, and some cyclization chemistry transfers to gold nanoparticles47. These facts indicate that the discovered pathways could be experimentally viable.
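Extracting the shortest path through stable intermediates amounts to a breadth-first search restricted to stable nodes. The sketch below uses a made-up network with placeholder species labels; it also assumes that the target closed-shell molecule need not itself be in the stable set, which is a choice made here rather than a detail stated in the text.

```python
from collections import deque


def shortest_stable_path(network, stable, start, target):
    """Breadth-first search for the shortest path via stable intermediates.

    network: dict mapping each species to the species reachable in one step.
    stable: set of species that are stable on the metal under consideration.
    Returns the path as a list of species, or None if no such path exists.
    """
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in network.get(node, ()):
            if nxt in visited:
                continue
            if nxt != target and nxt not in stable:
                continue  # skip unstable intermediates
            visited.add(nxt)
            queue.append(path + [nxt])
    return None


# Hypothetical network: "EO" is the reactant, "P" the product.
network = {"EO": ["I1", "I2"], "I1": ["P"], "I2": ["I3"], "I3": ["P"]}
print(shortest_stable_path(network, {"EO", "I2", "I3"}, "EO", "P"))
```

Running the same search with a different `stable` set per metal shows how the viable route, or the absence of one, can differ between surfaces.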

## Discussion

The conversion of large molecules is poorly understood due to the large size of the reaction network and the lack of automation for initializing DFT calculations of large adsorbates. This, in turn, stems from the combinatorial explosion of complex adsorbate configurations that dictate thermochemistry and reaction pathways. The intuitive binding of adsorbates, based on the heteroatom valency, has long been used. We discover that it can fail, yet certain clusters of data are observed based on the d-band/adsorbate orbital interaction. To the best of our knowledge, this work presents the first systematic enumeration of multidentate adsorbate configurations with arbitrary binding motifs. Importantly, we also find correlations between configurations and the d-band. We observe that the stability of intermediates is essential for highly selective catalysis, as a correlation between intermediate stability and selectivity is demonstrated for ethylene oxide production and the Fischer-Tropsch process. More generally, a catalyst cannot produce a molecule if its reaction intermediates are not stable on it, and the library of molecules we built can be leveraged to understand whether a metal catalyst can conduct specific chemistry. Potentially, highly selective catalysts can be made by designing catalytic sites that selectively adsorb desired adsorbates. Furthermore, in creating volcano curves for materials discovery, we often assume that the reaction pathway, intermediates, and rate-determining step are the same on all catalysts. Our results clearly identify clusters of materials for which this is true but expose profound differences among clusters. The developed database could aid in the theoretical investigation of large molecules by predicting adsorbate thermodynamic properties and enabling a database for lateral interaction models30, Brønsted−Evans−Polanyi relations48 (scaling relationship between reaction energy and activation energy), and transition state structures. These investigations could enable microkinetic model development toward elucidating catalyst design principles. We emphasize that, while we focused on the widely studied close-packed surfaces, the framework can be expanded to other surfaces such as fcc(100), stepped surfaces, and alloys by constructing an appropriate surface lattice and differentiating surface atoms by element and location (e.g., step-edge, corner, terrace). Other heteroatoms, such as nitrogen and sulfur, with pharmaceutical applications, can trivially be considered.

The number of enumerated configurations becomes computationally vast, reaching 10⁸ for adsorbates with six C and O atoms, posing a significant challenge in studying large molecules. The difference in the slopes of enumerated configurations and DFT-calculated stable configurations is notable, underscoring that an improved enumeration algorithm could potentially be developed. We expect the performance of the ML model to improve significantly by adding structures of four C and O atoms, as an adsorbed carbon has a maximum of three neighbors. We are expanding the database to improve the ML model.

Our scheme can be further improved in several directions. Lateral interactions between adsorbates are well-known to affect the adsorption energy and potentially change the preferred site49. While we used a relatively low coverage, the effect of lateral interactions on the configuration stability remains unclear. We also did not assess the vibrational modes of adsorbates, and thus, some adsorbates may be on unstable saddle points on the potential energy surface. Our scheme faces an additional challenge for larger biomass molecules, such as glucose, which involves 12 C and O atoms and would require >10⁶ DFT calculations. Potentially, online learning, where we repeat the cycle of data sampling and model training, can improve model accuracy and reduce the number of candidates continuously on the fly. Our scheme has similarities with global optimization techniques aiming to identify all minima in a high-dimensional space. Integration with advanced global optimization algorithms50,51,52 can improve scalability as well. As we focused on the enumeration of adsorbates’ connectivity patterns, our scheme does not account for cis/trans isomers not implicitly accounted for by the connectivity pattern (Supplementary Fig. 4). The assessment of the quality of the data is critical. While we addressed the challenge of the enumeration of connectivity patterns, future work should include the curation of the data, which can include manual curation and the use of statistics to identify faulty data.

## Methods

### Force field optimization

The universal force field (UFF) as implemented in RDKit (Rdkit.org) is modified to generate the structures. In addition to the standard UFF parameters, distance and angle constraints are added using the quadratic relations,

$$\begin{aligned} E_{\mathrm{r}} &= \tfrac{1}{2}k\,(r-r_{\mathrm{eq}})^{2}\\ E_{\theta} &= k\,(\theta-\theta_{\mathrm{eq}})^{2} \end{aligned} \tag{1}$$

where Er and Eθ are the distance and angle energies, k is the force constant, r and θ are the distance and angle, and the subscript eq denotes the equilibrium value. Forces that hold surface atoms in their lattice positions and describe the adsorbate atom–surface atom bonds are added, as shown in Table 1. Also, various angle-constraining forces are added to generate reasonable structures, as shown in Table 2. The heuristic forces provide a plausible initial guess structure for DFT calculations, typically better than manually guessed structures. As a strong force constant is used for the adsorbate-surface bond, the strain manifests as distorted adsorbate-adsorbate bonds, which we used to identify strained, unstable configurations.
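The two constraint terms of Eq. (1) translate directly into code. The sketch below only evaluates the energies as written; units and the example force constants are placeholders, and the actual parameters are those of Tables 1 and 2.

```python
def distance_energy(r, r_eq, k):
    """Distance constraint energy, E_r = 1/2 * k * (r - r_eq)^2."""
    return 0.5 * k * (r - r_eq) ** 2


def angle_energy(theta, theta_eq, k):
    """Angle constraint energy, E_theta = k * (theta - theta_eq)^2.

    Note that, as written in Eq. (1), this term carries no factor of 1/2.
    """
    return k * (theta - theta_eq) ** 2


# Placeholder values for illustration only.
print(distance_energy(r=2.2, r_eq=2.0, k=100.0))
print(angle_energy(theta=100.0, theta_eq=109.5, k=0.01))
```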

### DFT calculations

We performed DFT calculations using the Vienna ab initio Simulation Package53. The electron exchange and correlation energies were computed using the PBE functional54. Our previous study found that the choice of functional and dispersion correction does not significantly affect the geometry of a large molecule, namely furan55. The core electrons were described with the projector augmented-wave (PAW) method56. Electronic occupancies were treated with Methfessel-Paxton smearing with a width of 0.1 eV57.

To construct the slab, the lattice constants of the metals were optimized using a 15 × 15 × 15 Monkhorst-Pack k-point mesh with Blöchl correction58,59, D3 dispersion correction60, and a plane-wave cutoff energy of 500 eV. Close-packed surfaces (fcc(111), hcp(0001)) were modeled with a four-layer-deep 4 × 4 unit cell with a 20 Å vacuum, in which the bottom two layers were fixed.

For assessing configuration stability, we used low-fidelity parameters: a cutoff energy of 300 eV, non-spin-polarized calculations, and Gamma-point sampling of the Brillouin zone. The quasi-Newton algorithm was used to relax the structure into its instantaneous ground state. The DFT calculations were stopped early if the configuration diverged to another configuration. To detect this, molecular graphs were constructed by adding an edge between two atoms if the distance between them is less than the sum of the two elements’ covalent radii multiplied by 1.1833. If the calculation did not converge after 200–800 ionic steps, we used the conjugate-gradient algorithm to relax the structure; in this case, early stopping was not used, so that the final structure could be observed. To test the stability convergence, we compared the stability of adsorbates with ≤2 C and O atoms on Pd(111) between the low-fidelity and standard DFT parameters. The high-fidelity calculations entail a cutoff energy of 400 eV with a 3 × 3 × 1 Monkhorst-Pack k-point mesh58, spin polarization, and D3 dispersion correction60. Here, 9 of the 52 configurations that are stable in the standard DFT calculations diverged in the low-fidelity calculations (see confusion matrix in Supplementary Table 2). Of these, four have binding energies >0.5 eV higher than the binding energy of the ground-state configuration of the respective adsorbate. Four are local minima (0.13, 0.19, 0.12, and 0.01 eV with respect to each molecule’s ground-state configuration). Only one ground-state configuration was not predicted to be stable in the low-fidelity calculation, because early stopping terminated the calculation prematurely before convergence.