Accuracy of theoretical catalysis from a model of iron-catalyzed ammonia synthesis

Density functional theory is central to the study of catalytic processes, but its accuracy is widely debated, and lack of data complicates accuracy estimates. To address these issues, this work explores a simple eight-step process of iron-catalyzed ammonia synthesis. The models’s importance lies in the availability of experimental data and the accessibility of coupled-cluster CCSD(T) calculations, enabling direct assessment of method accuracy for all reaction steps. While many functionals accurately describe the net process N2 + 3H2 → NH3, errors of +100 kJ mol−1 occur in many individual steps for popular functionals such as PBE, RPBE, and B3LYP, which are much worse than commonly assumed. Inclusion of the stoichiometric reaction coefficients reveals major accuracy bottlenecks surprisingly distinct from the N–N dissociation step and dependent on the applied functional. More focus should be directed to these problematic steps in order to improve the accuracy of modeling the catalytic process. DFT is widely used to study catalytic processes but its accuracy is debated. This paper shows major variations in DFT outcome for a simple model of iron-catalyzed ammonia synthesis benchmarked against experimental and high-level quantum mechanical data.

C hemical catalysis is the central basis for life and at the same time provides the material basis for the modern world [1][2][3] . Prediction and rational design of chemical reactions require estimates of the involved bond energies with as little error as possible 4,5 . Kohn-Sham density functional theory (DFT) can describe many chemical processes at relatively good accuracy with modest computational cost and has been successfully applied to both heterogeneous and homogenous catalysis [6][7][8][9][10][11][12][13] . Recently, the accuracy and reproducibility of DFT has become widely debated, with variable conclusions [14][15][16][17][18] . Most notably, while different DFT codes are generally precise and hence produce reproducible results 15 , there is little evidence for the accuracy of DFT applied to each of the individual catalytic steps due to the scarcity of benchmark data.
Many density functionals were developed to reproduce thermochemical data for main-group molecules with simple closedshell wave functions and for one reaction step at the time 19 . For such cases, accuracy can be quite high, with errors as small as 10 kJ mol −1 even for closed-shell transition metal systems 20 . However, the most important catalytic processes commonly involve transition metals with unpaired d-electrons whose treatment depends critically on the type of DFT applied and multiple steps having very distinct changes in electronic structure 5,21 . Strong double and triple bonds that are often broken during catalysis (e.g., N 2 , CO, O 2 ) pose an additional challenge to DFT, as the error in electron correlation typically scales with the bond strength 22 . Despite this, many theoretical studies of transition metal catalysis routinely use a single popular functional such as RPBE or B3LYP for all steps as the basis for conclusions, with assumptions of accuracy largely extrapolated from previous onestep benchmarks [23][24][25] .
We thus need to understand (i) the benchmarking of full catalytic processes and how errors arise in all individual catalytic steps, (ii) whether performance is determined by specific steps of the processes, i.e., "accuracy bottlenecks", (iii) to what extent overall accuracy can benefit from cancellation of systematic errors, and (iv) if there are clear performance issues due to the type of DFT used that should be solved. Such assessments are difficult because most available data reflect overall rates and adsorption energies of the full systems 26 . This prevents mapping the error of a theoretical method to the individual molecular events of the catalytic process.
In order to address this problem, this work considers ammonia synthesis from N 2 and H 2 as a simplified iron-catalyzed reaction, with the net reaction N 2 + 3H 2 → 2NH 3 . 27 The model reaction is uniquely defined as the minimal catalytic model where each ligand atom is allowed to bind only one iron atom at a time. The particular advantage of this model benchmark reaction is the availability of experimental thermochemical data and advanced CCSD(T) computations with sufficiently large basis sets to serve as benchmark data. With this model reaction as framework, we can address the four questions raised above by computing the reaction enthalpy of each step using a variety of density functional methods.
Ammonia synthesis was chosen as benchmark as one of the most important and remarkable industrial catalytic processes: ammonia is used as fertilizer to increase crop yields throughout the world to sustain the growing human population 28 . Industrially, ammonia is produced by the Haber-Bosch process 29,30 , which was explored by Ertl and co-workers in their Nobel-prizeawarded work 31 . The process also occurs within nitrogen-fixating bacteria using the nitrogenase enzyme, which contains the remarkable FeMo cofactor 32 . In the heterogeneous catalytic reaction, dissociation of the extremely strong (945 kJ mol −1 ) triple N-N bond on the catalyst surface is the rate-determining step of the total reaction 33 . DFT has been widely used to study this process, and the impact of various approximations is described elsewhere; 27,34,35 these approximations generally make predictions very precise (and thus reproducible and useful in trend predictions), but not necessarily accurate 15 . The fact that the reaction is well-explored makes it particularly suitable for addressing the problem of accuracy dissected into individual catalytic steps.
The present work shows major (+100 kJ mol −1 ) differences in the computed reaction energies with commonly used functionals. When including the stoichiometric reaction coefficients, the accuracy bottlenecks of the overall catalytic modeling become evident: Other steps than the energy-requiring N-N dissociation become critical to accuracy and the performance of a functional is very dependent on the chemical step studied. Thus, while DFT calculations are generally precise and reproducible 15 , accuracy is extremely reaction step-dependent and no universal best method exists. This conclusion is not dependent on the accuracy of the benchmark CCSD(T) calculations that may be challenged for some steps since the large variations in DFT outcome prevail regardless of the exact energy of each step. Accordingly, we conclude that studies in theoretical catalysis must necessarily rely on major error cancellation in the reaction steps of interest to become predictive. As a necessary practice, one should therefore always test results using several functionals for the relevant steps of the catalytic process, as performance for one step can be very misleading. On the positive side, the weaknesses of each functional to each type of electronic transformation points to the improvements needed, e.g. in the loosely bound metal hydride states.

Results
Performance of methods for the net reaction. The catalytic model process studied in this work involves eight individual steps, which are electronically highly diverse (reaction 1-8 below): Fe þ Fe À N 2 ! 2Fe À N ð3Þ It is initiated by bonding of N 2 and H 2 to iron with subsequent cleavage of the bonds into atomic nitrogen and hydrogen, and then gradual formation of ammonia by consecutive hydrogen atom transfer to nitrogen. The optimized geometries for all systems are shown in Supplementary Table 1, and the studied density functionals are listed in Supplementary Table 2. The performance of all methods for the net reaction, which does not involve iron, is shown in Fig. 1 (for detailed electronic energies, see Supplementary Tables 3-5; results are in all cases corrected for zero-point energy, Supplementary Table 6). Many of the methods are accurate, largely because it is a closed-shell reaction of small molecules with no net effect on performance due to the treatment of unpaired electrons. All the methods MP2, CAMB3LYP, CCSD(T), CCSD, PW6B95, M06-2×, B3LYP, and B3LYP* are within chemical accuracy (4 kJ mol −1 ) for the net process. CCSD(T) includes the essential electron correlation and performs at chemical accuracy, as expected since the applied basis set is saturated within this window of accuracy 22,36 . The zeropoint energy corrections provide most of the correction of the electronic energies to enable comparison to experiment, with a minor additional contribution from thermal enthalpy of~RT (~2.5 kJ mol −1 ) at 298 K. At higher temperature where real processes take place, these corrections become larger but are derived from the same (BP86) frequency calculation representing the geometry optimization and thus do not differentially affect method performance. Thus, errors reported below are primarily due to the methods themselves and not to other approximations applied in the computational treatment. Figure 1 shows that the hybrid functionals perform decently, as expected. However, the performance of PBE-based functionals, even including the hybrid PBE0, is modest. PBE and its revised version RPBE show errors of 20 and 18 kJ mol −1 , respectively, but notably in different directions; i.e., as PBE favors the forward reaction too much, whereas RPBE favors it too little. Figure 1 shows the errors that can be expected when modeling the net energetics of a chemical process where bonds are formed and broken. However, the net reaction tells us little about the catalytic process which involves different steps with large changes in electron correlation; these steps are discussed below.  Table 8). The gray dashed lines shows the mean signed error (MSE) for the method, averaged over all five systems, whereas the orange dashed line shows the corresponding mean absolute error (MAE). The methods are ranked according to their MAE. The quantum mechanical goldstandard CCSD(T) agrees well with most of the experimental values, but for Fe-NH 3 there is disagreement of 36 kJ mol −1 between experiment and CCSD(T), i.e., one of the numbers is less accurate, and thus both numbers may be considered when evaluating the DFT methods for this particular enthalpy. Major changes in static correlation effects upon ligand dissociation could plausibly cause CCSD(T) to fail, but generally CCSD(T) achieves chemical accuracy with the applied basis set 22 .
The HF method produces the clearly largest MAE of 204 kJ mol −1 and a MSE of −204 kJ mol −1 (not shown in Fig. 2; numbers and errors in Supplementary Tables 7 and 8). HF lacks correlation energy in the Löwdin definition and thus underestimates the experimental D 0 by more than 200 kJ mol −1 on average for the five bonds, but clearly most for the strong triple bond of N 2 , as found previously 22,36 . The local density method PWLDA displays a MAE of 78 kJ mol −1 but with an opposite tendency of strong over-binding (MSE =+78 kJ mol −1 ) (Supplementary Tables 7/8), which is also well-known and illustrates the original rationale for mixing local DFT and HF exchange in hybrids 37 . As these extremes illustrate, both the overall accuracy, measured by MAE, and the binding strength, measured by MSE, are important when assessing performance.
The most accurate methods reside to the left of Fig. 2. Given the uncertainty in the benchmark data, the first five methods perform insignificantly different. Among the best methods, one expects to see CCSD(T), which indeed ranks #4. The doublehybrid B2PLYP and the strongly parameterized M06-2X and M06 perform very well for these one-step benchmarks, as does PW6B95. The M06 functionals were developed for good performance on this type of data 38 and the test in Fig. 2   (OPBE, OLYP). After these functionals follow hybrid functionals with a HF percentage of 15-25% (from B3LYP* to PBE0). CCSD cannot compete with hybrid DFT in terms of accuracy, in agreement with previous findings 22 and in stark contrast to its high accuracy for closed-shell 2-electron and 4-electron systems 16,17 . The performance of the GGA functionals PBE 41 and revPBE/RPBE (which gives essentially similar results) is of particular interest as they are widely used in theoretical catalysis, including in the study of the Haber-Bosch process 27,35,42 . PBE performs poorly in direct comparison to experimental data, with considerable over-binding bias (MSE = 22 kJ mol −1 and MAE = 37 kJ mol −1 ). RPBE and revPBE perform better with a MSE of only 6-7 kJ mol −1 , largely removing the overbinding tendency of PBE, but still lacks accuracy across the diverse types of bonds, with MAE = 26 kJ mol −1 . Interestingly, the OPBE functional performs markedly better than RPBE for the net reaction.
Performance for iron-catalyzed ammonia synthesis. From the above comparison to experimental D 0 values we conclude that CCSD(T) is a suitable benchmark for the steps of the model reaction where experimental data are not available. Accordingly, it is meaningful to study a simple but complete model of ironcatalyzed ammonia synthesis; this model reaction is uniquely defined by the requirement that every ligand can be bound only to one iron atom at the time. While this model of course does not reflect the multi-metal bond types involved in a real catalytic process, it includes the fundamental treatment of all the Fe-N, N-N, H-H, and Fe-H bonds involved in the real process while at the same time being computationally accessible with the costly CCSD(T) methodology that was validated above. This makes the reaction uniquely suited for testing the performance of density functionals where experimental data are scarce or do not entirely reflect the individual steps of the process. Studying several iron atoms and ligand atoms is beyond reach with CCSD(T), and smaller basis sets will lose critical accuracy as shown previously 22,43 . Accordingly, the described benchmark model reaction may be an important validation tool of DFT methods for the study of iron-based catalysis. Figure 3 shows the reaction enthalpies of each reaction step 1-8 computed with the studied methods (excluding HF and PWLDA; all numbers are found in Supplementary Tables 9-12). Each step occurs in its relevant stoichiometry for the overall reaction. The benchmark CCSD(T) values are represented by the dashed line. Figure 3 immediately establishes the performance of each functional and identifies weaknesses in each functional specific to each step separately, which is important since these steps are very electronically distinct. Figure 3 reveals enormous spread in the performance of the studied methods. Even leaving out HF and MP2, which fail massively in most cases, the density functionals spread over hundreds of kJ mol −1 in their estimate of the reaction enthalpy of many steps. The largest variation in performance, and hence the most problematic steps to model, are the central steps 3-6, with estimates varying by more than 500 kJ mol −1 . For all steps, the spread in results exceeds 200 kJ mol −1 .
Step 3, which involves chemisorbed N-N cleavage, produces the largest variation in computational outcome, as one might expect; however step 5 also produces extremely diverse results, as it involves major electronic changes as iron-adsorbed nitrogen and hydrogen combine to form the first N-H bond and free iron. Even for normally wellperforming GGA functionals, the errors in the strong, highly correlated bonds such as N 2 can exceed 100 kJ mol −1 , making these bonds an "accuracy bottleneck" in theoretical catalysis; 22 as these steps are combined with other changes in a real catalytic process, results become even more heterogeneous, unfortunately.
Upon closer inspection, the performance of each functional is due to underestimation of some steps and overestimation of other steps. For example, the best performing functional is PBE0 (MAE  Figures 1-5), it also has the strongest performance (R 2 = 0.92), with many other functionals having R 2~0 .80 ± 0.05 and some (such as the M06 functionals and PWLDA) even lower. Despite its general performance, even PBE0 displays major inconsistencies in its accuracy: It performs excellently for step 1 but underestimates step 2 and 5 considerably (−95 and −133 kJ mol −1 ), and overestimates step 3 (+136 kJ mol −1 ). Correspondingly, PBE has particular problems with steps 2, 6, 8, and 1, in order of difficulty. This behavior is, importantly, similar for all the GGA nonhybrids. Thus, one can conclude that the tendencies relate to the treatment of the open shells, and in particular the amount of HF exchange in the functional (25% for PBE0 vs. 0% for PBE and other non-hybrid GGAs). Figure 4 displays the errors for each reaction step separately. The energy-requiring step of breaking the N-N triple bond is the most difficult for the HF and post-HF methods (HF, MP2, and CCSD) because the HF reference cannot explain the correlation energy of the strongly correlated N-N triple bond. The muchused hybrid PBE0 also has surprisingly large problems with this step. However interestingly, for most other density functionals, steps 2 (orange) and 6 (light green), and to a minor extent steps 1 (light blue) and 8 (brown), are the accuracy bottlenecks of the overall process. Thus, while step 3 (N-N dissociation) is the slowest step of the real process due to the 945 kJ mol −1 bond energy, the critical steps in terms of predictive accuracy (i.e., the "accuracy bottlenecks") are, remarkable, distinctly other steps. The accuracy-wise critical step 2 is the chemisorption of H 2 onto iron to produce Fe-H 2 bonds.
Step 6 is the more complex breaking of Fe-H and hydrogen abstraction to NH to form NH 2 . Consistent with Fig. 2, modeling the Fe-H bond is challenging for almost all functionals, probably because the loosely bound hydrides are subject to self-interaction error.
Step 8 is the dissociation of the ammonia gas product, which also poses a challenge. These steps are the accuracy bottlenecks that prevent DFT from reaching predictive accuracy, the errors are much more diverse and larger than commonly stated in the literature, and new efforts seem needed to address such "pathological" steps.

Discussion
While the precision of DFT programs is by now wellestablished 15 , there is much debate on the accuracy of current DFT in midst of its massive use throughout chemistry [16][17][18] , partly because accurate benchmark data for complex processes are limited. This study explored the accuracy of DFT for a simple model process of iron-catalyzed ammonia synthesis for which accurate CCSD(T) calculations are feasible and experimental data are available. The main feature is the use of a model reaction for a full catalytic process where the accuracy for individual steps can be directly assessed. Very few model reactions are simple enough to serve as benchmarks while at the same time representing a full catalytic process (many benchmarks exist for individual steps such as bond dissociation energies). This probably explains why the accuracy of an applied functional is not generally addressed in many applications.
Two main limitations can be argued to exist in the present approach: One is the use of CCSD(T) as a benchmark, rather than full CI or multi-reference methods. The other is the simplicity and possible lack of generality of the model reaction. As to the first point, CCSD(T) is considered a golden standard in the field and thus its performance will always be of interest. However, the main results of the work are that the performance of functionals differs massively for many steps, and the performance for each step is also very functional-dependent. These variations will persist regardless of the exact true values of the energies of each step. As to the second point, despite its resemblance to the important Haber-Bosch process, the model process is very simple and specific which can be considered as a necessary sacrifice to achieve benchmark data. The development of additional benchmark models of real catalytic processes should be a main priority of future work as errors will depend on the specific steps of the process studied. However, even if the presence of more metals and compensating interactions in a condense phase will surely reduce errors in larger more realistic models, the tendencies of the functionals will probably persist and reflect methods that overbind and underbind, respectively. Accordingly, the observed large variations in both functional and step performance will also persist. Most steps of the full process are electronically very distinct and thus cover a good portion of the real chemistry of the bonds, except of course for the missing metal-metal modulation on a real surface. The heterogeneous outcome arising from the diversity of electronic-structure changes should be generic to many important catalytic processes involving transition metals with unpaired d-electrons, and as the process is probably not unique in its electronic complexity, the negative results are probably also partly transferable to other open-shell processes. The difference in accuracy when modeling the overall net reaction and the individual catalytic steps become very clear in this work: For the net reaction N 2 + 3H 2 → 2NH 3 , which only involves closed-shell species, many methods are within chemical accuracy (4 kJ mol −1 ). However, for the individual steps involving complex open-shell electronic configurations, errors commonly exceed several hundred kJ mol −1 . Most importantly, the most energy-requiring N-N dissociation step is often not the accuracy bottleneck of modeling the overall process, and different density functionals experience different accuracy bottlenecks. Accordingly, it will be hard to develop simple interpolations that generically reduce DFT errors for transition metal catalysis, as is sometimes possible for organic molecules due to their modularity 44 . This has negative implications for the accuracy of many studies in theoretical catalysis, which must necessarily rely on extreme error cancellations to become predictive.
The question becomes, how often is DFT really used predictively, and when it is merely used as rationalization of known facts? When does a particular functional work, how predictive is it really, and how far can one rely on e.g., scaling behavior and error cancellation to remedy the major deficiencies in modeling the individual steps which haunt even the best functionals? If only one step of the process is really interesting (e.g., the slowest step) then focus may be put on this step regardless of the major errors for the full cycle. This would need to be explicitly explained and justified. If interest is only in the relative performance vs. a reference catalyst (e.g., for catalyst design) then errors may be substantially reduced, but this has to be shown by a sensitivity test using several functionals. It is very interesting that different functionals have difficulties with different types of electronic transformations. Identification of further accuracy bottlenecks of full, but simple catalytic model reactions as the one in this work should be valuable in future efforts to make DFT accurate enough to predict full catalytic cycles at uniform accuracy.

Methods
Applied software and basis sets. All computations were performed using the Turbomole software, version 7.0 45 . All densities and energies were converged to 10 −7 a.u. Geometries were optimized using the BP86 functional, which produces accurate metal-ligand bond lengths (mean absolute errors of~0.02 Å) 46 , using the def2-TZVPP basis set 47 with polarization functions on all atoms including hydrogen, as is important during this particular process where six hydrogen atoms repeatedly bind to iron and nitrogen. High-quality energies were subsequently computed using def2-QZVPP for iron and aug-cc-pV5Z basis set for N and H 48 .
The large basis set with diffuse functions is required for the electronegative ligand atoms as they contain surplus electron density upon binding to iron, in particular for the loosely bound hydride (M-H − ) states. Polarization functions are required for adequate description of differential electron correlation during bonding. The basis set is saturated to within chemical accuracy (4 kJ mol −1 ) for bond energies, including the strongest, most correlated N-N bond; 22 smaller basis than this prevent accurate benchmarking since basis set effects may affect the apparent outcome of the chosen DFT method.
Studied computational methods. To obtain a detailed and general overview of method performance, 25 methods were explored for all eight steps. Four ab initio methods were studied: The Hartree-Fock method (HF), Møller-Plesset secondorder perturbation theory (MP2), Coupled-Cluster with single and double excitations included (CCSD), and Coupled-Cluster with the perturbative triple-excitation corrections, CCSD(T), all based on the same HF reference state for each electronic system to ensure consistency. In addition, 21 density functional methods were investigated (Supplementary Table 2; the local PWLDA has been omitted from some figures as the errors are off-scale). They vary in design type and expected performance 7,49,50 and thus effectively help to discern the impact of different DFT approximations on the various steps of the process. The HF fraction of hybrids substantially affects computed chemical energies and is thus explicitly stated in Supplementary Table 2. The functionals were studied using their standard keywords or using the xcfun library implemented with Turbomole 51 . PBE and PBE0 41 represent non-empirical GGA and hybrid functionals, which are widely used in theoretical catalysis. The revised RPBE functional is also widely used 52 , as was the similar revPBE 53 (both are widely used in the study of surface catalysis). BLYP and the 20% HF-exchange hybrid B3LYP 54,55 and its 15% version B3LYP* 56 were studied as representative functionals using the LYP correlation functional; B3LYP is the most popular functional in computational chemistry and B3LYP* is reportedly accurate for 3d metals and specifically iron 57,58 . B3P86 was included to test the change from LYP to another correlation functional, P86. TPSS and TPSSh 59,60 represent non-hybrid and hybrid meta functionals; TPSSh was reported to perform well for transition metal chemistry 61,62 . PWLDA represents the local density approximation (LDA), which uses only the electron density in the description of the energy is fast to compute, and known to over-bind; its performance is very bad and thus not included in the figures for readability but its data can be found in the Supplementary Information. The optimized exchange functional (O) by Handy and Cohen 39,40 may provide a more balanced description of d-block chemistry due to its parameterization toward HF energies to reduce the GGA over-binding of metal-ligand bonds and bias towards low-spin states 5 . To ensure that its effect is robust vs. correlation functional, both OLYP and OPBE were studied. B2PLYP 63 was studied as an example of a double hybrid performing well in previous tests of transition metal thermochemistry 20,50 . The PBEH-3C method was also included for academic interest, although it is designed for structures and frequencies using smaller basis sets 64 . Finally, the hybrid PW6B95 65 and the meta hybrids M06 (with 23% HF exchange) and M06-2× (with 54% HF exchange), and the local meta functional M06-L 66 were included as examples of functionals parameterized specifically for broadly accurate thermochemistry.  Tables 3 and 4); FeNH (2); FeNH 2 (3/2); FeNH 3 (2); Fe + (3/2); H (½); N (3/2); NH 2 (½). H 2 , N 2 , and NH 3 are closed-shell systems and were computed as such. In cases where lower-spin open-shell configurations were required, these were carefully optimized starting from higher multiplets and testing for local minima by using several types of start orbitals to derive the lowest possible HF energy, whose orbitals were then used consistently as starting point for all other calculations to ensure full consistency of the electronic states.
Experimental data and corrections applied. Data for the D 0 at 298 K of H-H (436 kJ mol −1 ), N-N (945 kJ mol −1 ), Fe-NH 3 (31 kJ mol −1 ), NH 2 -H (450 kJ mol −1 ) and Fe-H (148/179 kJ mol −1 ) are available and serve as important benchmark data 67 . The D 0 for Fe-H of 148 kJ mol −1 reported in the Handbook of Chemistry and Physics 67 has been revised to 179 kJ mol −1 based on very accurate multi-configurational ab initio methods 68 . The need for upwards revision of the low FeH value has been previously addressed by others 68,69 and thus we used the value 179 kJ mol −1 . Dispersion corrections are negligible for these reactions: The largest correction computed by the D3 method 70 amounts to −0.76 kJ mol −1 for Fe-N 2 . For direct comparison to CCSD(T), it suffices to use the non-relativistic benchmark, since relativistic corrections affect CCSD(T) and DFT values to the same extent; thus relativistic corrections were not included; they may potentially contribute up to~10 kJ mol −1 to the real enthalpy of each step 46,49,69 .

Data availability
All data required to produce all results of this work (Figs. 1-4) can be found in the supplementary information. Additional raw data are available from the author upon request. xyz coordinates of the systems, details on the applied density functionals, the raw electronic energies computed for all electronic systems and methods and used for constructing the figures, errors for each method, zero-point energy and vibrational and thermal corrections are supplied (Supplementary Tables 1-12