Statistical learning goes beyond the d-band model providing the thermochemistry of adsorbates on transition metals

The rational design of heterogeneous catalysts relies on the efficient survey of mechanisms by density functional theory (DFT). However, massive reaction networks cannot be sampled effectively as they grow exponentially with the size of reactants. Here we present a statistical principal component analysis and regression applied to the DFT thermochemical data of 71 C\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${}_{1}$$\end{document}1–C\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${}_{2}$$\end{document}2 species on 12 close-packed metal surfaces. Adsorption is controlled by covalent (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$d$$\end{document}d-band center) and ionic terms (reduction potential), modulated by conjugation and conformational contributions. All formation energies can be reproduced from only three key intermediates (predictors) calculated with DFT. The results agree with accurate experimental measurements having error bars comparable to those of DFT. The procedure can be extended to single-atom and near-surface alloys reducing the number of explicit DFT calculation needed by a factor of 20, thus paving the way for a rapid and accurate survey of whole reaction networks on multimetallic surfaces.

2. Calculate the average µ j (in eV) from Supplementary Equation 3 for each column of E.
3. Center E to get X from Supplementary Equation 4. It is not necessary to normalize E by including standard deviations. 1, 2 X has size n × m and its elements are in eV.
For PCA, standard deviations are commonly included to normalize input matrices. 1, 2 However, our input matrix E contains only thermochemical data in consistent units, eV. Thus, including stardard deviations gives more weight to molecules whose adsorption energy is rather constant, such as CH 4 , and lower the prediction accuracy of larger molecules with higher number of unsaturations. 4. Calculate the covariance matrix, C, from Supplementary Equation 5. The size of C is m × m and its elements are in eV 2 .
5. Diagonalize C as a symmetric matrix, Supplementary Equation 6. The matrix D contains the eigenvalues of C in descending order. V is the matrix of eigenvectors associated to D. D and V have the same size as C, m × m. The elements of D are in eV 2 and the ones of V are dimensionless.
6. Select the number of principal components, k max .
7. Generate the matrix W taking the first k max columns of V. W has size m×k max .
8. Project X onto the basis vectors contained in the columns of W, to get matrix T. T has size n × k max and its components are in eV.
9. The values of X can be estimated from Supplementary Equation 8 taking k max principal components. Get the formation energy for molecule j on metal i, E ij by undoing the centering step, Equations 4 and 9.
10. Increase the number of principal components, k max , until reaching the desired accuracy. To this end, two criteria were used in this study: 1 • Compare the accumulated variance for different values of k max . This can be obtained by adding up the k max first elements of the diagonal matrix D. A percentage of the variance can be obtaining by dividing that number by the trace of D. In this study we used 95% of the total variance as threshold.
• Compare the maximum, minimum, mean average, and standard errors for different values of k max until the error reduction stagnates. At that point, increasing k max may lead to overfitting. 1

Selection of representative predictor molecules
To provide a rapid survey on the adsorption energies of adsorbates on a given alloy, it is desirable to calculate by DFT only a small subset of adsorbates, called predictors in the main text. Their number should be at least equal to the number of principal components, k max . However, even small variations in their energies may introduce a large noise in the thermochemistry being estimated. This noise can be reduced by taking more predictors than principal components and then following a Principal Component Regression. In case the number of principal components were equal to the number of predictors, the results would be analogous to the reported in Ref. [3,4] The representative predictors were obtained following these steps: 1. Apply a principal component analysis. Obtain the eigenvalues and eigenvectors matrices, D and V, and determine how many principal components should be used.
2. Compare the predicted and DFT formation energies for all the species to get the prediction errors ε j by Supplementary Equation 10 and calculate the standard error for the predictions σ j Supplementary Equation 11. The species with high standard error within the selected principal components are not good predictors as they increase the systematic error when estimating the thermochemistry of other adsorbates.
3. Obtain the robustness for species j for becoming a predictor, ι j , using Supplementary Equation 12. There, λ k is the k-th eigenvalues of matrix C, in eV 2 , w kj is the k-th descriptor for the molecule j. By definition, ι j is a dimensionless, real positive number, although in the present study it took values between 8 and 83.
4. Select a set of predictors. Give preferences to the ones that have a high ι j , and whose descriptors w kj expands a large area (i.e., they are orthogonal).

Principal Component Regression (PCR) algorithm
For PCR, the metals are divided into training and validation (prediction) sets. In the training set, the formation energies of all molecules needs to be known. In the validation (prediction) set, only the formation energies of the predictors are needed. The PCR was applied in two ways: i) To validate the PCA results by a Leave-One-Out (L1O) test, and ii) To estimate the formation energies of all species on single-atom and near-surface alloys. The procedure, summarized in Figure 1 and Supplementary Figure 3, was applied as follows: 1. Split the energies in the training set into a matrix containing only the predictors, E , and another containing the rest of the species, E . In the main text, this matrix contained 3 predictors (O*, OH*, and CCHOH*) on 12 metals. For PCR-L1O, only 11 metals were taken.
2. Center E to get X , Supplementary Equation 4. It is not necessary to center and E .
3. Apply a Principal Component Analysis on X to obtain the matrices D (eigenvalues) and V (eigenvectors).
4. Take the first k max columns of V to get W , which is associated only to the predictors.
5. For each metal in the validation (prediction) set, put the energies of the predictors into the row matrix E val . Center E val using the averages from matrix E to get X val 6. Project X val onto the basis vectors contained in the columns of W , to get the row matrix T , Supplementary Equation 13. T has size 1 × k max and its components are in eV.
7. Approximate the descriptors for the species in the validation (prediction) set, {w kj,val } and µ j,val , by a linear regression on Supplementary Equation 14.
8. Finally, the energies from the validation (prediction) set can be estimated from Supplementary Equation 15.
Generation of near-surface and single-atom alloys 3. Click on the title of any calculation to see its structure, energy, and other related metadata.
• For adsorbed C 0 -C 2 species on clean metals, tags have the format MM-XXXX, where Mm is the metal: Cu, Ag, Au, etc., and XXXX is a code shown in Ref. [6]. The description includes the formula of all intermediates.
4. To download a .csv file containing the full set of energies with their tags, click on "Export .csv" on the top right corner.

5.
The .csv file should be opened using the UTF-8 encoding.
• In LibreOffice Calc, UTF-8 is the default encoding. Use commas as delimiters when importing the data. No further actions are requested.
• In Microsoft Excel: Open a new spreadsheet, click on "Data" menu, "From Text" button, select the .csv file, on "File Origin" choose "65001 Unicode UTF-8", and use commas as delimiters.

Supplementary Tables
Supplementary Table 1: Structural and van der Waals parameters for metals. DFT and experimental 7 lattice parameters (a, inÅ) for fcc and hcp metals in this study. [ c a ] ratio is included for hcp metals. R 0 and C 6 (inÅ and J nm 6 mol −1 ) parameters for the Grimme's D2 method 8 were obtained following the procedure on Ref. [9]. The lattice parameters were obtained by regression using a linearized form of the Birch Murnaghan equation . Both a dft and [ Table 3: Prediction errors as a function of the number of principal components: Mean absolute, standard, minimum, and maximum errors: mae, σ, ε − , and ε + . All errors in eV. The arithmetic average of the errors is 0.00 eV for all values of k max , as the PCA includes a centering step. The eigenvalues of the covariance matrix, λ (in eV 2 ), are also included. The total variance is λ = 550.26 eV 2 , i.e., the trace of D. Only 11 eigenvalues are non-zero due to the data centering (Supplementary Equation 4).

Supplementary Figures
Supplementary Figure 1: First and second descriptors for the metal, t i1 and t i2 , plotted against the d-band centers (a,c) and the reduction potentials 7 (b,d). The d-band center modulates the adsorption strength on late transition metals (groups 8-11), 19, 20 but it cannot describe the behavior of Zn and Cd.
Supplementary Figure 3: Data flow diagram for PCA and PCR. The formation energy of species j on metal i is obtained from DFT and Equations 1-2. All the energies are grouped in thermochemistry matrices, in red. The approximations done via PCA/PCR are shown in green. Variables associated with metals and species are shown in blue and orange, respectively. Variables associated with mathematical procedures are in black. The size of the matrices used in this study are shown below. Those in parentheses corresponds to the PCR-L1O procedure when they were different from the PCR applied on single-atom and near-surface alloys.
Supplementary Figure 5: Second descriptor for the metals, t i2 , plotted against a Pauling and b Mulliken electronegativity. 7 Pauling's electronegativity scale, χ i in eV −0.5 , is defined from valence bond theory as the additional stabilization of a heteronuclear bond due to ionic contribution. Mulliken's electronegativity, χ i in eV, is the average of electron affinity and ionization potential, which are defined for isolated neutral atoms in gas phase, instead of bulk metals.