Introduction

In the past few decades, density functional theory (DFT) has become the workhorse of materials simulations and owes its impressive success to its good compromise between accuracy and numerical efficiency1. The achievements of DFT come, of course, at a cost: the unknown exchange-correlation functional has to be approximated. Very simple approximations derived from the uniform electron gas, such as the local density approximation1,2, already provide satisfactory accuracy in describing the properties of periodic materials. However, standard DFT functionals are known to fail for certain classes of systems, particularly where weak interactions are important or where the electronic correlation is strong3,4; importantly, chemical accuracy is not systematically achieved, and this often limits the predictive power of DFT-based materials simulations. Because the accuracy cannot be controlled systematically, it is often difficult to understand whether a failure to reproduce or predict experimental results is due to the inadequacy of the particular model under consideration or to the approximations involved in the DFT functional.

Correlated quantum chemical methods based on post-Hartree-Fock (post-HF) approximations are instead systematically improvable and could potentially overcome some of the limitations of DFT for materials simulations. Among those, second-order Møller-Plesset perturbation theory (MP2)5 and coupled cluster theory6 have been recently implemented for periodic materials7,8,9,10,11. However, their computational cost is significant for most practical applications in materials science and this issue becomes even more dramatic when finite-temperature effects have to be included by performing molecular dynamics (MD) simulations or Monte Carlo sampling. For example, a brute-force computation of the enthalpy of adsorption considered in this work would require billions of CPU hours and hundreds of real-time years to be completed.

In the context of MD simulations, machine learning (ML) is nowadays a well-established tool to achieve larger system sizes and longer time scales12,13,14,15. This is achieved by decomposing the total energy into atomic contributions and using ML regression models to “fit” the interatomic potential. Still, ML-accelerated MD typically requires large amounts of data and rapidly becomes challenging for the more expensive approximations. In this work, we show how finite-temperature observables for periodic materials can be evaluated using the ‘gold standard’ coupled cluster ansatz, including single, double, and perturbative triple particle-hole excitation operators (CCSD(T)), in combination with machine learning techniques coupled with thermodynamic perturbation theory and Monte Carlo sampling. Within this approach, the computational cost is limited to a small number (a few tens) of single-point energy calculations that are then used to train a data-efficient ML model.

For molecular systems, the application of ML techniques has already been proven to be effective in enhancing the efficiency of CCSD(T) MD simulations13,16,17,18,19. Very recently, applications to molecular condensed phase systems, specifically to liquid water, have also been considered. In ref. 20, the ML model for periodic water was trained with data produced for finite water clusters using near-linear scaling coupled cluster theory. In ref. 21, CCSD(T) calculations were restricted to very small periodic models based on a box of 16 H2O molecules, and the ML model was then used to compute radial distribution functions, diffusion coefficients, and vibrational densities of states. To the best of our knowledge, the application of CCSD(T) to finite-temperature simulations of periodic solid materials has not been previously reported in the literature. Our study was challenging for several reasons. First, we dealt with a system containing more than 200 electrons, far more than used in any previous report on ML-assisted MD CCSD(T) simulations. Second, we focused on a measurable thermodynamic quantity, the enthalpy of adsorption, whose prediction imposes high demands on the quality of the ML model. This is because any error in the energy of a configuration affects not only the underlying phase space function used in ensemble averaging but also the statistical weight of that contribution (see Eq. (2)).

This work is based on the combination of two ingredients. The first is an efficient periodic coupled cluster theory implementation. This implementation is based on a plane-wave basis set and finite size and basis set correction techniques that accelerate the convergence to the complete basis set limit and thermodynamic limit significantly22,23. Using these techniques, it is possible to obtain well-converged correlation energies at the CCSD(T)-level of theory for periodic solids and surfaces containing more than 100 electrons on modern supercomputers24,25,26,27.

The second fundamental ingredient is an approach that couples machine learning and thermodynamic perturbation theory (TPT)28, which will be denoted as MLPT29,30,31,32. Within this approach, an ab initio molecular dynamics simulation is first performed at an affordable level of theory (semi-local DFT), and the statistical distribution is subsequently reweighted to obtain observables at a higher level of theory (e.g., coupled cluster). While this TPT procedure requires, in principle, a large number of single-point calculations at the expensive level of theory, in practice, those can be replaced to a large extent by inexpensive machine-learning predictions. By using efficient machine learning algorithms based on the smooth overlap of atomic positions (SOAP) kernel33,34 and Δ-ML35, MLPT requires only a limited amount of training data. For example, the calculation of enthalpies of adsorption at the random phase approximation (RPA) level of theory achieved convergence with as few as 10 single configuration energies29. This is particularly important when employing expensive approximations in a finite-temperature context since otherwise the number of single-point calculations and the associated computational cost would be too large.

As a specific application of our approach, we consider the calculation of the enthalpy of adsorption of carbon dioxide in protonated chabazite (HChab). The adsorption of molecules in zeolites is fundamental for many applications, including depollution, separation of chemicals, and catalysis36,37,38. In this field, more quantitative and systematically improvable theoretical predictions are instrumental in interpreting experimental findings and predicting new materials. Although the calculations presented in this work are still significantly more expensive than those based on standard density functional theory, our proof-of-principle work paves the way to a more systematic use of highly accurate post-HF methods in materials simulations.

Results

Enthalpy of adsorption from first principles

In this work, we consider the calculation of the enthalpy of adsorption of carbon dioxide in a porous zeolitic material, chabazite. In practice, this quantity is computed as

$$\begin{array}{l}\Delta_{\mathrm{ads}}H(\mbox{M@Z}) = \Delta_{\mathrm{ads}}U(\mbox{M@Z}) + \Delta_{\mathrm{ads}}(pV)(\mbox{M@Z})\\ \qquad\qquad\qquad\quad = \langle E(\mbox{M@Z})\rangle - \left(\langle E(\mbox{M})\rangle + \langle E(\mbox{Z})\rangle\right) - k_{\mathrm{B}}T,\end{array}$$
(1)

where ΔadsU is the internal energy of adsorption, 〈E(i)〉 denotes the ensemble average of the total energy of the system i corresponding to a gas phase molecule (M), clean zeolite (Z), and the adsorbed system (M@Z), and the identity Δads(pV)(M@Z) = −kBT is obtained by assuming an ideal gas behavior of M and a negligible change of pV of the zeolite due to adsorption. The value of T is fixed at 300 K in all our simulations. The canonical ensemble energy can be evaluated by directly performing an ab initio molecular dynamics (AIMD) simulation, but because of the high computational cost of CCSD(T) and MP2, this approach is impractical at these levels of theory. The plane-wave coupled cluster calculations are performed in several steps: Hartree–Fock and MP2 calculations are carried out first to obtain the corresponding energies and optimized approximate natural orbitals39. Once the natural orbitals have been computed, the Cc4s40 interface to VASP41,42 is used to compute intermediate quantities43 that are needed for the subsequent coupled cluster energy calculations, including the corresponding finite size22 and basis set corrections23. In the present calculation, 10 unoccupied approximate natural orbitals per occupied orbital are used for the CCSD calculations, whereas only 5 unoccupied approximate natural orbitals per occupied orbital are employed to evaluate the (T) contribution. A single CCSD(T) calculation for the given structures containing up to 40 atoms took about 10,000 core hours.
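As an illustration of Eq. (1), the following minimal sketch evaluates the enthalpy of adsorption from per-configuration energies. It assumes energies in kcal mol−1 and plain (unweighted) ensemble averages, which is appropriate only when the configurations were sampled at the same level of theory as the energies. The function name and the sample energies are hypothetical and are not taken from this work.

```python
from statistics import fmean

# Boltzmann constant in kcal mol^-1 K^-1 (gas constant R per mole).
K_B_KCAL_PER_MOL_K = 0.0019872041


def adsorption_enthalpy(e_complex, e_molecule, e_zeolite, temperature=300.0):
    """Eq. (1): <E(M@Z)> - (<E(M)> + <E(Z)>) - kB*T, all in kcal/mol."""
    delta_u = fmean(e_complex) - (fmean(e_molecule) + fmean(e_zeolite))
    return delta_u - K_B_KCAL_PER_MOL_K * temperature


# Hypothetical per-configuration energies (kcal/mol), for illustration only.
e_mz = [-1010.2, -1009.8, -1010.0]
e_m = [-100.1, -99.9, -100.0]
e_z = [-902.0, -901.8, -902.2]

print(adsorption_enthalpy(e_mz, e_m, e_z))
```

At 300 K the pV term contributes about −0.6 kcal mol−1, so it is not negligible on the scale of chemical accuracy.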

Machine learning approach

The large number of high-level calculations required to estimate the enthalpy (or other finite-temperature quantities) can be significantly decreased using machine learning techniques. Specifically, starting from an AIMD trajectory obtained using numerically affordable semi-local DFT with empirical van der Waals corrections (PBE + D2)44,45, the post-HF ensemble energies are estimated using the MLPT approach trained on a small number of single-point calculations. This approach is described in detail in refs. 29,30,31,32, and the two main steps are summarized here:

  1. Given a set of configurations \(\{\mathbf{R}_{i}\}_{i=1}^{M}\) from an AIMD trajectory in an NVT ensemble with the PBE + D2 reference Hamiltonian H0 and potential energy E0, the ensemble average energy generated by the target Hamiltonian H1 with potential energy E1 (MP2 or CC level) can be obtained from thermodynamic perturbation theory by reweighting:

    $$\langle E_{1}\rangle_{1}=\frac{\sum_{i=1}^{M}E_{1}(\mathbf{R}_{i})\exp(-\beta \Delta E(\mathbf{R}_{i}))}{\sum_{i=1}^{M}\exp(-\beta \Delta E(\mathbf{R}_{i}))},$$
    (2)

    where ΔE(R) denotes the energy difference E1(R) − E0(R) for a specific atomic configuration R. In this work, E1 denotes either the MP2 or the CCSD(T) target method potential energy. The trajectory obtained with the reference Hamiltonian is called the production trajectory.

  2. While the application of Eq. (2) requires a large number of high-level calculations, in practice, those can be largely replaced by inexpensive predictions of a machine learning model. MLPT limits the amount of data required for the training by using efficient algorithms based on kernel ridge regression with the SOAP kernel33,34 and Δ-ML35. Since E0(R) is already known from the production trajectory, the model only needs to predict the energy difference ΔE(R) in order to evaluate Eq. (2).
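The reweighting of Eq. (2) can be sketched in a few lines. The example below is a minimal illustration, not the implementation used in this work: it assumes that the reference energies E0 and the (ML-predicted) differences ΔE are available as plain lists in the same energy unit as kBT, and it subtracts the smallest ΔE before exponentiating, which leaves the ratio in Eq. (2) unchanged but avoids numerical overflow. The function and argument names are hypothetical.

```python
import math


def reweighted_average(e0, delta_e, kt):
    """Eq. (2): target-level ensemble average from a reference-level trajectory.

    e0      -- reference (production) energies E0(R_i)
    delta_e -- energy differences E1(R_i) - E0(R_i), e.g. ML predictions
    kt      -- kB*T in the same energy unit
    """
    shift = min(delta_e)  # a constant shift cancels between numerator and denominator
    weights = [math.exp(-(d - shift) / kt) for d in delta_e]
    norm = sum(weights)
    e1 = [a + d for a, d in zip(e0, delta_e)]  # E1 = E0 + Delta E
    return sum(e * w for e, w in zip(e1, weights)) / norm


# Illustrative numbers only: three configurations at kB*T = 0.6 (arbitrary units).
print(reweighted_average([1.0, 2.0, 3.0], [0.5, 0.4, 0.7], kt=0.6))
```

Configurations with a lower ΔE receive a larger weight, so a few strongly weighted configurations can dominate the sum when the two levels of theory differ substantially.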

Since MLPT is based on thermodynamic perturbation theory, a limited overlap between the production and target configurational spaces can lead to inaccurate results. If a suboptimal overlap is suspected, a Monte Carlo (MC) resampling can be performed. This procedure, described in detail in ref. 32, uses Metropolis MC46 to resample the canonical ensemble at the CCSD(T) and MP2 levels of theory. At each MC step, configurational energies are computed with the production approximation (PBE + D2) and subsequently evaluated at the post-HF level using the same ML model of MLPT. The Metropolis acceptance criterion is applied at the target level of theory, and accordingly, the correct target configurational space is sampled without bias from the starting point.

Calculation of molecular adsorption enthalpies in zeolites

Let us now present and discuss the adsorption enthalpies of CO2 in protonated chabazite as computed at the MP2 and CCSD(T) levels of theory. The latter approximation is commonly described as the ‘gold standard’ of quantum chemical simulations and is routinely used to produce reference test sets to benchmark the accuracy of other methods47,48. The primitive cell of the model considered here is shown in Fig. 1.

Fig. 1: Adsorbed system.

The unit cell of the system studied in this work: CO2 in protonated chabazite.

The experimental value of the enthalpy of adsorption of CO2 in HChab, −8.41 kcal mol−1 49, is used as a reference for the computational results. This experimental estimate is obtained by extrapolating measurements to the zero coverage limit. The errors possibly arising from this procedure are not discussed in ref. 49 and we cannot exactly quantify the uncertainty in the experimental reference.

The computed results are presented in Table 1, where the error bars related to the finite sampling and the ML model are also indicated29. The molecular dynamics at the PBE + D2 level leads to an estimate of the adsorption enthalpy that is more than 1 kcal mol−1 below the experimental value, corresponding to a deviation well beyond chemical accuracy. This MD trajectory is used as a starting point for MLPT to obtain post-HF enthalpies. The MP2 approximation obtained from MLPT also tends to overbind and leads to results that do not qualitatively differ from PBE + D2. This is not surprising, and we believe that this overestimation is caused by the lack of screening of long-ranged correlation effects in MP2 theory. The computational estimate of the enthalpy significantly improves at the CCSD(T) level, which provides a value in excellent agreement with the experiment. This result demonstrates the high accuracy and predictive power of the CCSD(T) approximation also for finite-temperature simulations of materials.

Table 1 Enthalpy of adsorption of CO2 in protonated chabazite (kcal mol−1) computed using different target and sampling methods

In a previous work, we demonstrated that the RPA also provides accurate enthalpies of adsorption of molecules in zeolites29. Specifically, the value for CO2 in protonated chabazite is −8.01 kcal mol−1. Although the RPA has a diagrammatic structure, its accuracy is not as straightforward to improve systematically as that of post-HF methods50,51,52,53,54,55,56,57. In practice, the RPA often provides more realistic results starting from a DFT approximation rather than from HF52, and this starting point dependence makes this approximation less reliable as a general predictive method.

Discussion

To fully prove the accuracy of the MLPT approach for MP2 and CCSD(T), a crucial point concerns the reliability of the PBE + D2 trajectory used as a starting point for thermodynamic perturbation theory. Specifically, if the target (MP2 or CCSD(T)) configurational space has a small overlap with the production (PBE + D2) configurational space, the results of TPT may be affected by a strong systematic error. As discussed thoroughly in ref. 32 for systems similar to the one considered here, the occurrence of this issue can be identified even if the exact target trajectory is unknown. Thermodynamic perturbation theory is based on the reweighting of the statistics sampled by the production trajectory to obtain the target level statistics (see Eq. (2)); in case of a poor overlap, only a few configurations contribute to the total weight, leading to poor ensemble estimates. In practice, this effect can be measured by the Iw index, as defined in ref. 32. This index assumes the value of 0.5 in the optimal configuration overlap case and tends to 0 for decreasing overlaps. For the adsorption of molecules in zeolites, it has been shown that even relatively small values of Iw around 0.03–0.05 still allow for reliable MLPT estimates32. The reweighting of the trajectories at the MP2 level provides large values for Iw (>0.15), and the corresponding enthalpies in Table 1 should be considered fully reliable. For the CCSD level, a very low Iw value for the adsorbed system (0.008) precludes making any reliable predictions of adsorption enthalpy; for this reason, this level of theory is not discussed here. For the CCSD(T) level of theory, the Iw coefficient is one order of magnitude higher: 0.07 for HChab and 0.05 for the adsorbed system, indicating that the PBE + D2 configurational space matches the CCSD(T) one better than the CCSD one.
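A weight-concentration diagnostic of this kind can be sketched as follows. The exact definition of Iw is given in ref. 32 and is not reproduced in this text; the sketch assumes that the index is the smallest fraction of configurations whose largest weights together account for half of the total weight. This assumption is consistent with the limits quoted above (0.5 for uniform weights, approaching 0 when a few configurations dominate), but it should be checked against ref. 32 before use. The function name is hypothetical.

```python
import math


def half_weight_fraction(delta_e, kt):
    """Assumed I_w-like index: fraction of configurations carrying half the weight.

    delta_e -- energy differences E1(R_i) - E0(R_i)
    kt      -- kB*T in the same energy unit
    """
    shift = min(delta_e)  # shifting by a constant does not change relative weights
    weights = sorted(
        (math.exp(-(d - shift) / kt) for d in delta_e), reverse=True
    )
    half = 0.5 * sum(weights)
    cumulative = 0.0
    for count, w in enumerate(weights, start=1):
        cumulative += w
        if cumulative >= half:
            return count / len(weights)
    return 1.0  # only reached for an empty or degenerate weight list


# Uniform weights (perfect overlap) versus a single dominant configuration.
print(half_weight_fraction([0.0] * 10, kt=0.6))
print(half_weight_fraction([0.0] + [50.0] * 9, kt=0.6))
```

For small or odd numbers of configurations the uniform-weight value can slightly exceed 0.5 because the count is an integer; for trajectories with many thousands of configurations this effect is negligible.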
While these Iw values are likely to be sufficient to confirm the reliability of our results32, considering the pioneering nature of our work and the lack of any previous finite-temperature benchmark results for periodic CCSD(T), we further investigated the robustness of the MLPT estimate by resampling the CCSD(T) trajectory. This is achieved by performing a Metropolis Monte Carlo (MC) sampling of the canonical ensemble at the CCSD(T) level by replacing the expensive coupled-cluster calculations with the predictions of the same machine learning model previously trained for MLPT. Differently from most machine learning-based MD approaches12,13,14,15, this machine learning Monte Carlo (MLMC) approach avoids training on atomic forces, which are not readily available in the current periodic CCSD(T) implementation and would require a significant overhead cost. Since thermodynamic perturbation theory is not used and a new trajectory is instead sampled from scratch, MLMC avoids the starting point bias. The corresponding result for the enthalpy of adsorption, shown in Table 1, differs by only 0.2 kcal mol−1 from the MLPT value. In the MLMC case, the error bar is sizeably larger because of the long auto-correlation length of this trajectory (about a factor of 10 longer than for the MD trajectory), but the agreement is still sufficient to support our conclusion that the PBE + D2 trajectory provides a reliable starting point to compute CCSD(T) ensemble energies. This is also qualitatively confirmed by visualizing the (high-dimensional) geometries sampled by the MD and MC methods with the t-distributed stochastic neighbor embedding (t-SNE) algorithm58. As shown in Fig. 2, the PBE + D2 molecular dynamics and the CCSD(T) Monte Carlo trajectories span configurational spaces that overlap well. This figure also demonstrates that the training set provides a rather uniform sampling of the data, as required for a balanced training of the ML model.

Fig. 2: Visualization of the configurational spaces.

t-SNE representation of the configurational spaces spanned by the PBE + D2 molecular dynamics (MD) trajectory and the CCSD(T) machine learning Monte Carlo (MLMC) trajectory. The configurations included in the training set are also shown to demonstrate that they cover essentially the whole relevant part of the configurational space sampled at the CCSD(T) target level. The axes represent the two components of the t-SNE projection.

To further analyze the overlap between the configurational space of the PBE + D2 functional and that of the post-HF methods, we consider the structure of the protonated chabazite cage. For this purpose, the radial distribution function of the Si–O pairs has been computed for the PBE + D2 molecular dynamics trajectory and for the MP2 and CCSD(T) approaches using MLPT and MLMC. As previously shown in ref. 32, the most spectacular failures of MLPT are encountered when the production approximation predicts equilibrium distances of covalent bonds that differ from the target theory; this translates to very different configurational spaces and fully unreliable perturbative estimates. For protonated chabazite, Fig. 3 clearly shows that the radial distribution functions computed for the Si–O pairs are similar at different levels of theory, and problematic behaviors of MLPT should not be expected.
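For readers who wish to reproduce this kind of structural check, a partial radial distribution function can be sketched as below. This is a simplified illustration under stated assumptions: it uses an orthorhombic box with the minimum-image convention (the chabazite primitive cell is not orthorhombic, so a general-cell treatment would be needed in practice), it requires r_max to be at most half of the shortest box edge, and it gives every frame equal weight (for MLPT, each frame would instead carry its reweighting factor). All names are hypothetical.

```python
import math


def partial_rdf(frames, idx_a, idx_b, box, r_max, n_bins):
    """Histogram-based g(r) for A-B pairs in an orthorhombic periodic box."""
    dr = r_max / n_bins
    counts = [0] * n_bins
    for positions in frames:
        for i in idx_a:
            for j in idx_b:
                if i == j:
                    continue
                d2 = 0.0
                for k in range(3):
                    d = positions[i][k] - positions[j][k]
                    d -= box[k] * round(d / box[k])  # minimum-image convention
                    d2 += d * d
                r = math.sqrt(d2)
                if r < r_max:
                    counts[int(r / dr)] += 1
    volume = box[0] * box[1] * box[2]
    n_pairs = sum(1 for i in idx_a for j in idx_b if i != j)
    r_mid, g = [], []
    for b, c in enumerate(counts):
        shell = 4.0 / 3.0 * math.pi * ((b + 1) ** 3 - b ** 3) * dr ** 3
        ideal = len(frames) * n_pairs * shell / volume  # uniform-density count
        r_mid.append((b + 0.5) * dr)
        g.append(c / ideal)
    return r_mid, g


# One hypothetical frame: an "Si" at the origin and an "O" across the boundary.
frames = [[(0.0, 0.0, 0.0), (8.95, 0.0, 0.0)]]
r_mid, g = partial_rdf(frames, [0], [1], box=(10.0, 10.0, 10.0), r_max=5.0, n_bins=50)
peak = max(range(len(g)), key=g.__getitem__)
print(r_mid[peak])
```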

Fig. 3: Radial distribution functions.

First (a) and second (b) series of peaks of the partial radial distribution function for the Si–O pairs determined at different levels of theory.

Finally, it is important to notice that the effects included by MLPT do not correspond to a trivial energy correction. Within a simplified approach, the coupled cluster and MP2 enthalpies could be approximated as

$$\Delta_{\mathrm{ads}}H_{\mathrm{CCSD(T)/MP2}} \approx \Delta_{\mathrm{ads}}H_{\mathrm{PBE+D2}} + \left(\Delta_{\mathrm{ads}}E_{\mathrm{CCSD(T)/MP2}} - \Delta_{\mathrm{ads}}E_{\mathrm{PBE+D2}}\right).$$
(3)

In this case ΔadsECCSD(T)/MP2 = ECCSD(T)/MP2(M@Z) − ECCSD(T)/MP2(M) − ECCSD(T)/MP2(Z) corresponds to the CCSD(T) or MP2 adsorption energies computed using the structures corresponding to potential energy minima determined at the PBE + D2 level; ΔadsEPBE+D2 is analogously defined for PBE + D2. This approximation is efficient since a single post-HF calculation is required for each one of the three systems. However, this “static” approach is based on the strong and not generally valid assumption that the post-HF and DFT energy surfaces are parallel, i.e., that they differ only by a constant shift. MLPT instead requires only a reasonable overlap between the configurational spaces of production and target approximations, which can be tested using the Iw index, and “deformation” effects of the energy surface are taken into account by the reweighting in Eq. (2). By applying Eq. (3) we obtain ΔadsHCCSD(T) = −8.69 kcal mol−1 and ΔadsHMP2 = −9.03 kcal mol−1. The CCSD(T) enthalpy obtained in this way agrees fairly well with the MLPT value. However, within this static correction approach, CCSD(T) and MP2 provide very similar results, while MLPT showed that these results should differ by about 1.2 kcal mol−1 (see Table 1). This observation shows that the static correction approach of Eq. (3), while providing reasonable estimates in some cases with fortuitous error cancellations, is not reliable in general and can lead to misconceptions.
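The parallel-surface assumption behind Eq. (3) can be made explicit with a short derivation. As a sketch, assume that for a given system the energy difference is the same constant for every configuration, ΔE(R_i) = c (the symbol c and the subscripts used below are introduced here for illustration only). Eq. (2) then reduces to

$$\langle E_{1}\rangle_{1}=\frac{\sum_{i=1}^{M}\left(E_{0}(\mathbf{R}_{i})+c\right)\exp(-\beta c)}{\sum_{i=1}^{M}\exp(-\beta c)}=\frac{1}{M}\sum_{i=1}^{M}E_{0}(\mathbf{R}_{i})+c=\langle E_{0}\rangle_{0}+c.$$

Inserting this result for the three systems into Eq. (1), the −kBT term is common to both levels of theory and one obtains

$$\Delta_{\mathrm{ads}}H_{1}=\Delta_{\mathrm{ads}}H_{0}+\left(c_{\mathrm{M@Z}}-c_{\mathrm{M}}-c_{\mathrm{Z}}\right).$$

If each constant is evaluated at the PBE + D2 potential energy minimum, the term in parentheses equals the difference between the target and PBE + D2 adsorption energies, which recovers Eq. (3). Any dependence of ΔE on the configuration breaks the first chain of equalities, and this is the effect that the reweighting in Eq. (2) accounts for.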

In conclusion, we have presented an application of CCSD(T) to compute the enthalpy of adsorption of carbon dioxide in a periodic model of zeolite. Due to the high computational cost, applications of CCSD(T) to periodic materials are so far limited, and direct calculations of finite-temperature observables are impractical in terms of required computational resources and execution time. Here we showed that these challenges can be overcome by coupling machine learning models requiring small training sets with an efficient implementation of periodic coupled cluster theory. The computed enthalpy of adsorption of carbon dioxide in protonated chabazite was found to be in excellent agreement with the experiment. While still significantly more expensive than approaches based on density functional theory, our pioneering work opens the door to more reliable and predictive simulations of materials at finite temperature. Future work will be aimed at demonstrating the accuracy of ML-based CCSD(T) in broader classes of problems, including, for example, the computation of free energies of activation, which play a fundamental role in the modeling of catalytic reactions.

Methods

Coupled cluster calculations

The coupled cluster theory calculations are performed using the Cc4s code40, which is interfaced with the Vienna ab initio simulation package (VASP)41,42. All individual steps are described in ref. 24, where they are combined with an embedding approach; embedding was not necessary for the present system because of its relatively small unit cell, which contains at most 40 atoms. The convergence of the CCSD and (T) correlation energy contributions to the molecular adsorption energy was tested on a single configuration, and the details are reported in the Supplementary Information.

Ab initio molecular dynamics

Ab initio molecular dynamics simulations based on PBE + D244,45 were performed in the NVT ensemble, and the simulation temperature of 300 K was controlled using the Andersen thermostat59 with a collision probability of 0.05. Two hundred thousand configurations were sampled with a timestep of 0.5 fs for a total of 100 ps. The first 10 ps of each trajectory were discarded as the equilibration period. The cell parameters were fixed to the values obtained by optimizing the chabazite cell at the PBE level60. The VASP electronic structure program was used for all AIMD simulations and single-point calculations within the Γ point approximation. The hydrogen atomic mass was set to 3.0 au.

ML methods

In this work, the training set consists of 100 uncorrelated configurations evenly spaced along the PBE + D2 trajectories, and the test set consists of 10 randomly chosen configurations. The MP2 and CCSD(T) calculations are performed only for those selected geometries. Kernel ridge regression, using the REMatch kernel34 and the smooth overlap of atomic positions (SOAP) descriptor, was used as implemented in the DScribe library61. The model is trained to predict the differences between the post-HF and the PBE + D2 energies35. Details of the hyperparameter tuning and model accuracy are provided in the Supplementary Information.
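The Δ-learning step can be illustrated with a self-contained kernel ridge regression sketch. This is not the model used in this work: a Gaussian kernel on short fixed-length vectors stands in for the SOAP descriptor with the REMatch kernel, the DScribe library is not used, and the small dense solver is included only to keep the example dependency-free. All function names, descriptors, and energies are hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Stand-in similarity kernel (the paper uses SOAP with the REMatch kernel)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_delta_model(descriptors, e_high, e_low, reg=1e-8, sigma=1.0):
    """Train on E_high - E_low and return a predictor for the difference."""
    targets = [h - l for h, l in zip(e_high, e_low)]
    n = len(descriptors)
    k = [
        [
            gaussian_kernel(descriptors[i], descriptors[j], sigma)
            + (reg if i == j else 0.0)  # ridge regularization on the diagonal
            for j in range(n)
        ]
        for i in range(n)
    ]
    alpha = solve(k, targets)

    def predict(x):
        return sum(a * gaussian_kernel(x, d, sigma) for a, d in zip(alpha, descriptors))

    return predict


# Hypothetical training data: one-dimensional descriptors and two energy levels.
train_x = [[0.0], [1.0], [2.0]]
predict_delta = fit_delta_model(train_x, e_high=[1.5, 2.2, 3.9], e_low=[1.0, 2.0, 3.0])
print(predict_delta([1.0]))
```

Learning the difference rather than the total energy is the design choice that keeps the training set small: the low-level method already captures most of the variation, so the model only has to represent a smoother and smaller correction.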

In the machine learning Monte Carlo resampling, new configurations xnew are proposed by sampling velocities v from a Maxwell–Boltzmann distribution and propagating the positions over a timestep Δt of 0.5 fs: xnew = xold + vΔt. For the adsorbed system, the molecule is additionally subject to a random translation (up to 0.5 Å) and rotation (up to 35°). The newly proposed configuration is then accepted or rejected according to the Metropolis criterion based on the energy predicted by the machine learning model.
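The proposal and acceptance steps described above can be sketched as follows. This is a minimal illustration in reduced units, not the production code: the energy function is a hypothetical stand-in for the ML-predicted target energy, coordinates are stored as a flat list with one mass per coordinate, and the additional rigid translation and rotation of the adsorbed molecule is omitted. Because the velocities are drawn from a zero-mean Gaussian, the proposal is symmetric and the plain Metropolis criterion applies.

```python
import math
import random


def propose(x_old, masses, kt, dt, rng):
    """x_new = x_old + v*dt with v drawn per coordinate from N(0, kT/m)."""
    return [
        x + rng.gauss(0.0, math.sqrt(kt / m)) * dt for x, m in zip(x_old, masses)
    ]


def metropolis_accept(e_old, e_new, kt, u):
    """Accept downhill moves always, uphill moves with probability exp(-dE/kT)."""
    return e_new <= e_old or u < math.exp(-(e_new - e_old) / kt)


def run_mc(energy, x0, masses, kt, dt, n_steps, seed=0):
    rng = random.Random(seed)
    x, e = list(x0), energy(x0)
    samples = []
    for _ in range(n_steps):
        x_new = propose(x, masses, kt, dt, rng)
        e_new = energy(x_new)
        if metropolis_accept(e, e_new, kt, rng.random()):
            x, e = x_new, e_new
        samples.append(list(x))  # a rejected move repeats the old configuration
    return samples


def harmonic(x):
    """Toy potential standing in for the ML-predicted energy."""
    return 0.5 * sum(c * c for c in x)


samples = run_mc(harmonic, [0.0], [1.0], kt=1.0, dt=1.0, n_steps=20000)
mean_x2 = sum(s[0] ** 2 for s in samples) / len(samples)
print(mean_x2)  # should be close to kT/k = 1 for this toy potential
```

Only energies enter the acceptance test, which is why this scheme does not require atomic forces at the target level of theory.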