Quantum chemical accuracy from density functional approximations via machine learning

Kohn-Sham density functional theory (DFT) is a standard tool in most branches of chemistry, but accuracies for many molecules are limited to 2-3 kcal·mol−1 with presently available functionals. Ab initio methods, such as coupled-cluster, routinely produce much higher accuracy, but computational costs limit their application to small molecules. In this paper, we leverage machine learning to calculate coupled-cluster energies from DFT densities, reaching quantum chemical accuracy (errors below 1 kcal·mol−1) on test data. Moreover, density-based Δ-learning (learning only the correction to a standard DFT calculation, termed Δ-DFT) significantly reduces the amount of training data required, particularly when molecular symmetries are included. The robustness of Δ-DFT is highlighted by correcting “on the fly” DFT-based molecular dynamics (MD) simulations of resorcinol (C6H4(OH)2) to obtain MD trajectories with coupled-cluster accuracy. We conclude, therefore, that Δ-DFT facilitates running gas-phase MD simulations with quantum chemical accuracy, even for strained geometries and conformer changes where standard DFT fails.


Water dataset
The water molecule has three internal degrees of freedom, two O-H bonds and the H-O-H angle, with bond lengths of 0.97 Å and a bond angle of 104.0° (1.82 radians) for the geometry optimized with the PBE functional. A preliminary dataset composed of 1000 molecular geometries was generated by uniformly sampling O-H bond lengths and H-O-H angles. The 102 training-set geometries were selected around the minimum-energy conformer of the 1000 structures (1.6 kcal·mol−1 above the optimized geometry), with bond lengths in the range 0.97 ± 0.25 Å and the bond angle in the range 115 ± 26° (2.00 ± 0.45 radians). To simplify the learning problem, the molecules were aligned in the xy-plane with the bisector of the H-O-H angle along the y-axis and the longer O-H bond in the same quadrant. The resulting geometries are shown in Supplementary Fig. 1. The conventional coupled-cluster calculations were run in the Molpro quantum chemistry software [1] or Orca v.3.0.3 [2] using CCSD(T)/aug-cc-pVTZ [3]. For all water geometries in this work, using a single-reference wave function is appropriate (T1 diagnostic < 0.02).
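The alignment convention above can be sketched as follows (illustrative NumPy, not our production code; the exact sampler used is an assumption):

```python
import numpy as np

def water_geometry(r1, r2, theta):
    """Build aligned water coordinates (Angstrom) from two O-H bond lengths
    and the H-O-H angle theta (radians).

    Convention from the text: molecule in the xy-plane, O at the origin,
    the H-O-H bisector along the y-axis, and the longer O-H bond kept in
    a fixed (here: positive-x) quadrant.
    """
    r_long, r_short = max(r1, r2), min(r1, r2)
    half = theta / 2.0
    O = np.zeros(3)
    H1 = np.array([r_long * np.sin(half), r_long * np.cos(half), 0.0])
    H2 = np.array([-r_short * np.sin(half), r_short * np.cos(half), 0.0])
    return O, H1, H2

# Uniformly sample the training ranges quoted above (0.97 +/- 0.25 A,
# 2.00 +/- 0.45 rad); this particular sampler is illustrative only.
rng = np.random.default_rng(0)
r1, r2 = rng.uniform(0.72, 1.22, size=2)
theta = rng.uniform(1.55, 2.45)
O, H1, H2 = water_geometry(r1, r2, theta)
```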
The model is trained on total calculated energies, but there is a shift in the energy range due to the differing bonding energy between electronic structure methods. Therefore, we report relative energies in Fig. 1b based on the lowest energy in the grand training set. These relative energy errors are also shown in Supplementary Fig. 2 and are used to calculate the MAE reported in Fig. 1c. The potential energy surfaces in Fig. 1d are shown relative to the lowest-energy conformer for each energy method. The symmetric water dataset was constructed for the range of bond lengths and angles sampled during a DFT-MD simulation at 300 K using deviations of 3σ from the average value. The resulting bond range is 0.97 ± 0.075 Å and the angle range is 104.0 ± 15.5° (1.81 ± 0.27 radians). For the symmetric water conformers, the errors for predicted relative energies are qualitatively and quantitatively different for the direct ML energy functionals and the ∆-DFT approach, as shown in Supplementary Fig. 3. The errors in the direct methods reflect the overlap between the training set geometries and the out-of-sample test set, with the largest errors for small angles on the edge of the training set (coordinate range shown in Supplementary Fig. 1). The ∆-DFT errors are significantly smaller and qualitatively different, as they depend much more strongly on the bond length than on the bond angle.

Ethanol dataset
Training and test set data for ethanol are from the MD17 dataset [4,5]. The 1000 unique geometries in the training set are aligned with the three heavy atoms defining a reflection plane for the symmetry-augmented dataset. Supplementary Fig. 4 shows the atomic distributions of the ethanol datasets after alignment and inclusion of symmetry-generated training points. The specific data generation methods are reported in Refs. [4,5]. Briefly, the conformers were generated using DFT-MD at 500 K (PBE+TS) and the coupled-cluster energies use CCSD(T)/cc-pVTZ. For all ethanol coupled-cluster calculations, the T1 diagnostic is below 0.02. In addition to DFT- and CC-optimized structures, we also optimize each local minimum using MP2/6-31G* in Gaussian09 [6]. All geometries are similar, with the largest difference seen for the C-O bond length (DFT > MP2 > CC), as shown in Fig. 2. We report relative energies for ethanol based on the lowest-energy minimum for each electronic structure method (anti for CC by 0.08 kcal·mol−1 and gauche for DFT by 0.10 kcal·mol−1). These relative energy errors are shown in Supplementary Fig. 5, with maximum relative energy errors of 5.6 kcal·mol−1 and 5.9 kcal·mol−1 for geometries included in the training and test sets, respectively. The local minima of ethanol differ most clearly in the dihedral angle of the alcohol OH, but the optimized geometries also differ in the C-C-O angle. The training set shown in Fig. 2a is also shown in Supplementary Fig. 6 sorted by OH dihedral angle. Using ∆-DFT with force-based methods. ∆-learning is a universal approach that can be applied with any machine learning model to improve its accuracy; however, some models stand to benefit more from ∆-learning than others. To explore this, we took sGDML [5], one of the state-of-the-art force-field-based ML models, and trained it on the differences between the CC and DFT forces of 1000 ethanol geometries.
Table 1 compares the accuracies of our density-based model and sGDML when predicting DFT, CC and ∆-DFT energies. We notice that ∆-learning with sGDML shows essentially no difference from the originally reported sGDML results for directly learning CC energies, while our model improves by a factor of 10 using the ∆-DFT approach. A likely reason for this is that sGDML uses forces for training, which contain more information than just the energy. This means that the prediction accuracy for sGDML converges more quickly, so it cannot benefit from ∆-learning as much.
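The Δ-learning idea can be illustrated with a minimal one-dimensional sketch (all functions are synthetic stand-ins for the reference and target methods, not our actual models): only the smooth difference between a cheap "reference" energy and an expensive "target" energy is learned, and the prediction adds the learned correction back onto the reference.

```python
import numpy as np

# Toy 1-D Delta-learning: learn only the smooth difference between a cheap
# reference method and an expensive target method (synthetic stand-ins).
def e_cc(x):   return np.sin(x) + 0.05 * x**2          # "target" energy
def e_dft(x):  return e_cc(x) - 0.3 * np.cos(0.5 * x)  # "reference" energy

def krr_fit(X, y, sigma=1.0, lam=1e-8):
    # Gaussian-kernel ridge regression in closed form.
    K = np.exp(-(X[:, None] - X[None, :])**2 / (2 * sigma**2))
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(Xtrain, alpha, X, sigma=1.0):
    K = np.exp(-(X[:, None] - Xtrain[None, :])**2 / (2 * sigma**2))
    return K @ alpha

Xtr = np.linspace(-3, 3, 15)
alpha = krr_fit(Xtr, e_cc(Xtr) - e_dft(Xtr))        # learn the correction only
Xte = np.linspace(-2.5, 2.5, 50)
e_pred = e_dft(Xte) + krr_predict(Xtr, alpha, Xte)  # Delta-corrected energy
```

Because the correction is much smoother than the total energy, very few training points suffice, mirroring the data-efficiency argument above.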

Benzene dataset
Training and test set data for benzene are from the MD17 dataset [4,5], with the addition of the geometry optimized using MP2/6-31G* in Gaussian09 [6]. The specific data generation methods are reported in Refs. [4,5]. Briefly, the conformers were generated using DFT-MD at 500 K (PBE+TS) and the coupled-cluster energies use CCSD(T)/cc-pVDZ. To normalize the molecular positions, the optimized geometry is aligned with all C atoms in the xy-plane and two atoms on the y-axis. The 1000 unique geometries in the MD17 training set are aligned by minimizing the root mean squared deviation of C atoms (RMSD_C) from the optimized structure, with an RMSD_C of 0.045 Å for the training data and 0.046 Å for the test data. Benzene is a highly symmetric molecule, and using our symmetrization approach increases the effective dataset size by a factor of 24. Supplementary Fig. 7 shows the atomic distributions of the benzene datasets after alignment and inclusion of symmetry-generated training points. For each dataset and electronic structure method (including the ML models), we report energies relative to the optimized MP2/6-31G* geometry (included in the training set). The relative energy errors using the conventional DFT method are shown in Supplementary Fig. 8, with maximum relative energy errors of 3.2 kcal·mol−1 and 3.0 kcal·mol−1 for geometries included in the training and test sets, respectively. We note that for all benzene coupled-cluster calculations, the T1 diagnostic is below 0.02, so a single-reference method is appropriate. In addition to the self-consistent densities from fully converged DFT calculations, we also generate input electron densities from a standard initial guess in electronic structure calculations: the superposition of atomic densities (SAD). The energy returned by the DFT functional prior to any density optimization steps is E_DFT[n_SAD]. Supplementary Fig. 9 shows the relative energy errors compared to the CC values for each conformer, using E_SAD of the MP2/6-31G* optimized geometry as the reference. The E_SAD range is considerably larger than for the converged calculations, with a variance that is seven times larger and a qualitatively different distribution of energy errors.
Supplementary Figure 9: Relative energy errors (CC-DFT) plotted against the relative conformer energy calculated using CC for the MD17 training data evaluated using superpositions of atomic densities (open circles) and converged densities (filled circles, same data as in Supplementary Fig. 8a).
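The factor-of-24 symmetry augmentation for planar benzene can be sketched as follows (our assumed realization: 6 rotations about z, each optionally combined with an in-plane mirror and with reflection through the molecular plane; illustrative only):

```python
import numpy as np

# Symmetry augmentation sketch for benzene aligned in the xy-plane:
# 6 rotations x 2 in-plane mirrors x 2 out-of-plane reflections = 24 ops.
def benzene_symmetry_ops():
    ops = []
    for k in range(6):
        c, s = np.cos(k * np.pi / 3), np.sin(k * np.pi / 3)
        rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        for mx in (1.0, -1.0):          # in-plane mirror (x -> mx * x)
            for mz in (1.0, -1.0):      # reflection through the plane
                ops.append(rot @ np.diag([mx, 1.0, mz]))
    return ops

def augment(coords, ops):
    """coords: (n_atoms, 3) array; returns symmetry-equivalent copies."""
    return [coords @ op.T for op in ops]

ops = benzene_symmetry_ops()            # 24 distinct orthogonal matrices
```

Each copy is assigned the same energy label, so the effective training set grows by the group order without extra electronic structure calculations.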
Relevant dimensionality estimation of different density types. To further analyze the effect of using different types of densities as input descriptors, we used relevant dimensionality estimation (RDE) [8]. For kernel-based methods, the information relevant to a supervised learning problem is contained, up to a negligible error, in the feature subspace spanned by a number of leading components given by kernel principal component analysis (PCA) [9]. RDE allows us to estimate the dimensionality of the subspace containing the label-relevant information, thereby giving us a way to quantify the effective complexity that each of the ML models requires to predict the energy labels using different types of density descriptors.
To obtain the RDE of a kernel ridge regression model, we perform kernel PCA [9] in the feature space of the kernel, where we obtain the kernel PCA components c_1, ..., c_M as the eigenvectors of the kernel matrix. Subsequently, we determine the number of label-relevant dimensions d in the feature space, which is the number of leading kernel PCA components needed to accurately reconstruct the labels. The labels are reconstructed by projecting them onto the leading d kernel PCA components, Ŷ(d) = Σ_{i=1..d} c_i (c_i^T Y); in our case, the labels Y denote the respective energies (CC or PBE). The number of relevant dimensions d is then estimated by minimizing an objective that models the leading and trailing projection coefficients as two zero-mean Gaussians [8], d = argmin_d [ d log s_1(d) + (M − d) log s_2(d) ], where s_1(d) and s_2(d) are the mean squared projection coefficients (c_i^T Y)² for i ≤ d and i > d, respectively. Given an optimized number of relevant dimensions d, we can estimate the noise level of the labels as the error of the reconstructed labels Ŷ(d) with respect to the true labels Y, and the signal-to-noise ratio as ‖Ŷ(d)‖² / ‖Y − Ŷ(d)‖². The resulting estimates for the relevant dimensions, noise level, and signal-to-noise ratio obtained by applying RDE to the different ML models are shown in Supplementary Tables 4, 5, and 6.
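A minimal sketch of this RDE procedure (our reading of Ref. [8]; the two-Gaussian objective and the numerical details are illustrative, not the exact published implementation):

```python
import numpy as np

# Minimal RDE sketch: project labels onto leading kernel-PCA components
# and pick the number of relevant dimensions d by a two-component Gaussian
# likelihood over the projection coefficients (illustrative assumption).
def rde(K, Y):
    M = len(Y)
    evals, evecs = np.linalg.eigh(K)            # kernel PCA components
    order = np.argsort(evals)[::-1]
    C = evecs[:, order]                          # columns: c_1, ..., c_M
    proj = C.T @ Y                               # coefficients c_i^T Y
    best_d, best_obj = 1, np.inf
    for d in range(1, M):
        s1 = np.mean(proj[:d] ** 2) + 1e-12      # leading-coefficient power
        s2 = np.mean(proj[d:] ** 2) + 1e-12      # trailing-coefficient power
        obj = d * np.log(s1) + (M - d) * np.log(s2)
        if obj < best_obj:
            best_d, best_obj = d, obj
    Y_rec = C[:, :best_d] @ proj[:best_d]        # reconstructed labels
    noise = np.linalg.norm(Y_rec - Y) ** 2 / np.linalg.norm(Y) ** 2
    return best_d, noise, Y_rec
```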

Resorcinol dataset
In order to construct the ML functionals, the resorcinol datasets were generated using standard classical MD simulations. To sample more extreme geometries, training set points were selected from a 500 K trajectory, while the test set was taken from a 300 K simulation. After aligning the molecules to have all C atoms in one plane, RMSD_C is 0.045 Å for the 300 K dataset and 0.053 Å for the 500 K dataset. Supplementary Fig. 10 shows the geometries and C-C-O-H dihedral angles in the original training set, the training set with symmetry operations applied, and the test set. The finite temperature classical MD simulations primarily sample geometries around the energy minima, but all conformers are more than 7 kcal·mol−1 higher in energy than the global minimum. The all-electron CCSD(T)/cc-pVDZ [3] calculations were run using Orca v.3.0.3 [2], and the single-reference wave functions have a maximum T1 diagnostic of 0.012 for all conformers. Resorcinol rotational barrier. The sparse sampling of geometries away from the minima is due to an energy barrier for each OH group rotation. For an optimized geometry, rotation of a single OH requires crossing an energy barrier of almost 4 kcal·mol−1 based on CC calculations, as shown in Supplementary Fig. 12. The barrier height based on PBE energies is even larger, but can successfully be corrected using the ∆-DFT approach.
Supplementary Figure 12: Relative energies for DFT and CC showing the difference in the OH rotational barrier between the two methods, along with the energy predicted by the E^CC_s∆-DFT[n^DFT_sML] energy map.
Resorcinol model performance. Supplementary Fig. 13 shows the MAE of the different ML models for different training set sizes of the resorcinol dataset. Each training set is chosen from the original 1000 points using k-means and then expanded using the four symmetry operations prior to training. The ∆-DFT model has an error below 1 kcal·mol−1 when using only 25 data points. In this manuscript, model performance results and MD simulations are run using the full 1004-point training set to facilitate comparison between the best direct energy models and the E^CC_s∆-DFT approach. Supplementary Figure 13: Comparison of the out-of-sample prediction performance for the DFT, CC, and ∆-DFT energy prediction models for different training set sizes.
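The k-means selection step can be sketched as follows (a NumPy-only Lloyd's-algorithm sketch; keeping the actual sample nearest each centroid is an assumed detail of the selection):

```python
import numpy as np

# Sketch of training-set selection: cluster a descriptor matrix with
# k-means and keep the sample closest to each centroid (assumed detail),
# so the selected subset spans the sampled geometries.
def kmeans_select(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):                      # Lloyd's iterations
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.unique(d2.argmin(0))   # index of the sample nearest each centroid

X = np.random.default_rng(1).normal(size=(1000, 6))   # stand-in descriptors
idx = kmeans_select(X, k=25)
```

The selected indices would then be expanded by the symmetry operations before training, as described above.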
For resorcinol, using true densities rather than ML-HK densities as input improves the performance of the direct ML models more than in the case of benzene (see Supplementary Table 7). In contrast, the ∆-DFT model error is considerably smaller than that of either direct model and is unaffected by the accuracy of the electron density, indicating again that the ∆-DFT approach is robust and that the energy difference landscape is smoother and easier to learn than the total energy itself. Starting from an initial condition close to the barrier crossing event, a DFT-based MD simulation was run using NVE for 1.5 ps. The relative CC energies for snapshots along this trajectory are shown in Supplementary Fig. 15 along with the relative energy errors of the DFT calculations themselves (relative to CC).

Supplementary Figure 15: Relative energy errors (CC-DFT) plotted against the relative conformer energy calculated using CC for an NVE DFT-based MD trajectory for resorcinol starting from an initial condition near the barrier crossing.
Resorcinol ML-MD. The E^CC_s∆-DFT[n^DFT_sML] model can be used to generate a self-consistent MD trajectory, as seen in Supplementary Fig. 16. Starting from a random training point, the 150 fs trajectory has an MAE of only 0.2 kcal·mol−1 relative to the true CC energies. There is no drift in the energy error as the trajectory proceeds, indicating that the ∆-DFT approach is stable for a range of conformers (Supplementary Fig. 17b), with similar performance seen for an m = 3 trajectory (Supplementary Fig. 17c). The trajectories shown in Supplementary Fig. 17c and d are the same as in Fig. 3b.

Phenol dataset
Phenol datasets were generated using standard classical MD simulations. As for resorcinol, training set points were selected from a 500 K trajectory, while the test set was taken from a 300 K simulation. After aligning the molecules based on C atom RMSD to the equilibrium geometry, RMSD C is 0.043 Å for the 300 K dataset and 0.052 Å for the 500 K dataset. In the finite temperature classical MD simulations, all sampled conformers are higher in energy than the global minimum by more than 5 kcal·mol −1 for the 500 K training trajectory and more than 2 kcal·mol −1 for the 300 K test trajectory. The all-electron CCSD(T)/cc-pVDZ [3] calculations were run using Orca v.3.0.3 [2], and the single reference wave functions have a maximum T1 diagnostic of 0.010 for all conformers.

Combining densities for improved sampling
Electron density datasets can include data from multiple molecules since the Fourier basis representation does not depend on the type, number, or order of atoms in a molecule of interest. As an example, we combine the previously discussed training data for benzene (from MD17 dataset [4,5]) and resorcinol (from classical MD) with phenol (sampled from classical MD, see Methods section). The models are evaluated on the resorcinol test set. When adding the phenol molecules to the resorcinol dataset, the molecule is aligned so that the OH group is in the same orientation as one of the OH groups of resorcinol. When symmetrizing the 1000 sampled geometries, we obtain an effective dataset of 4000 points. Supplementary Table 8 shows the performance for the combined model of resorcinol and phenol aligned with one OH group. Since resorcinol has more than one alcohol, we also generated an 8000 point training set by reflecting each symmetrized phenol molecule to align the OH with each of the two OH groups of resorcinol. Despite both models using the same 1000 energy labels, the doubled model shows improved performance and these results are reported in Tables 4, 5, and Supplementary Table 10.
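Because every molecule is encoded by the same fixed-length vector of basis coefficients, combining and doubling the datasets reduces to stacking and reflecting coefficient arrays. A sketch (the coefficient arrays are random stand-ins, and modeling the OH-alignment reflection as a sign flip on odd-parity coefficients is a hypothetical bookkeeping choice for illustration):

```python
import numpy as np

# Sketch of combining density datasets: all molecules share the same
# length-L Fourier coefficient representation, so training sets stack.
L = 128
rng = np.random.default_rng(2)
u_resorcinol = rng.normal(size=(1000, L))   # stand-in coefficient vectors
u_phenol     = rng.normal(size=(1000, L))

# Hypothetical parity bookkeeping: a mirror flips the sign of coefficients
# of basis functions that are odd in the mirrored coordinate.
odd_in_x = np.arange(L) % 2 == 1
u_phenol_mirror = u_phenol.copy()
u_phenol_mirror[:, odd_in_x] *= -1.0        # reflected copies, same labels

X = np.vstack([u_resorcinol, u_phenol, u_phenol_mirror])
```

The reflected phenol copies reuse the original 1000 energy labels, which is how the "doubled" 8000-point set is obtained without new electronic structure calculations.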

Using non-self-consistent densities
Our model appears to violate one of the basic results of density functional theory, as it produces CCSD(T) energies from PBE densities. Hohenberg and Kohn [10] showed that, for any given approximate energy functional, one can minimize the energy functional to find a self-consistent formula for the density. The Kohn-Sham scheme is defined [11] to find that density when only the exchange-correlation contribution to the energy is approximated. The purpose of this section is to show how the self-consistent density could be found, at least in principle, and also to argue that the energetic consequences of using the PBE density would be negligible here.
There is an exact formula for extracting the density from any approximation to the ground-state energy for a given external potential [12]:

n(r) = δE[v]/δv(r),    (3)

where E[v] is the ground-state energy associated with one-body potential v(r). This could be used to extract a density pointwise from any such approximation: add a small narrow Gaussian centered at r_0 to v(r), and note the corresponding change in energy. In the limit of infinitely narrow, infinitely weak perturbations, this yields the density at r_0. Of course, such a procedure is highly impractical in a standard basis set of atom-centered Gaussians, but could be easily employed to find specific moments of the density. If the perturbation is a weak static electric field, the prescription yields the dipole moment, as can be seen by multiplying both sides by r and integrating over all space. Almost all electronic structure calculations in chemistry and materials science are aimed at finding accurate ground-state energies and the many properties that can be derived from them, such as geometries and barriers. The error in any DFT calculation can be split into two contributions, the functional error and a density-driven error (the energy error due to an incorrect density) [13]. In most DFT calculations (including all those given here), the self-consistent density is so accurate that the energy error is dominated by the functional error [14]: using the exact density in the approximate functional has negligible effect on the energy error. Recent arguments that attempt to distinguish the quality of functionals by constructing metrics of density errors [15] have not held up when analyzed in terms of energies [14].
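This prescription can be checked numerically on a one-dimensional toy system (grid model and parameters are illustrative): perturb the potential with a weak linear field and recover the dipole, i.e. the first moment of the density, from a finite difference of the ground-state energy.

```python
import numpy as np

# 1-D illustration of n = dE/dv: a weak linear field eps*x added to the
# potential changes the ground-state energy by eps * <x> to first order.
n = 200
x = np.linspace(-5, 5, n)
dx = x[1] - x[0]
v = 0.5 * (x - 0.3) ** 2                     # off-center well -> nonzero <x>

def ground_state(v):
    # Second-order finite-difference Hamiltonian, hartree-like units.
    T = (-np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
         + 2 * np.eye(n)) / (2 * dx ** 2)
    E, psi = np.linalg.eigh(T + np.diag(v))
    phi = psi[:, 0] / np.sqrt(dx)            # normalize on the grid
    return E[0], phi ** 2                    # ground-state energy and density

eps = 1e-4
E_plus, _ = ground_state(v + eps * x)
E_minus, _ = ground_state(v - eps * x)
dipole_from_energy = (E_plus - E_minus) / (2 * eps)   # dE/d(eps)
_, dens = ground_state(v)
dipole_from_density = np.sum(dens * x) * dx           # integral of n(x) x dx
```

The two estimates agree, which is exactly the dipole-moment special case of the functional-derivative formula above.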
We can use Supplementary Eq. (3) to analyze the present situation. We know it must be satisfied by the PBE density and energy functional. Thus the difference between the PBE and CCSD(T) densities is simply

n_PBE(r) − n_CC(r) = δ(E_PBE[v] − E_CC[v])/δv(r).

This will be a very small quantity for normal systems: the fact that the energy difference is easier to learn than the PBE energy itself suggests a smoothness of the energy difference with respect to the potential, making the density differences tiny. We also note an additional twist on this question in the context of machine learning. Long ago, Görling and Levy [16] and others pointed out that one could define an exact energy functional on an approximate density, such as the HF density. In fact, as was noted by Li et al. [17], to learn accurate energies, a very crude representation of the density suffices, so long as it forms a sufficiently useful feature for the energy. In a prototype problem (a particle in a box with a potential well), even with a very small grid (far too coarse to find accurate solutions to the Schrödinger equation) and essentially exact energies, one could still use kernel ridge regression to find a highly accurate ML functional.
Thus the use of PBE densities to find CCSD(T) energies is practical, theoretically justified, and well understood. On the practical side, it completely avoids the need to extract CCSD(T) densities to train upon; because the density is not needed to perform a CCSD(T) calculation, it is not available from many CCSD(T) codes. On the theoretical side, we know (a) that the errors in energies will be very small, (b) how to correct them if need be, and (c) that kernel ridge regression has no difficulty learning on the PBE density. We note that the models using the SAD approximation perform reasonably well when assigned energy labels corresponding to the self-consistent PBE density. When using the PBE energies evaluated on the SAD density, the model errors are considerably worse, indicating that using a density that reflects the bonding environment is an important consideration if accurate energies are required.
On the other hand, use of non-self-consistent densities breaks the standard relation between energies and forces from the Hellmann-Feynman theorem, and small errors in energies do not automatically imply small errors in forces. Since we use numerical derivatives of energies throughout this paper, we extract the correct forces, but there will be corrections if analytical derivatives are attempted. Whether or not these are significant is beyond the scope of the present work.

Challenges when comparing combined models
The combined density model is a step towards a more universal machine learned density functional that can be used for CC level simulations across a wider range of molecules. Other published ML models aim to provide energies for many molecules, and it appears tempting to compare their performance to our density-based approach. For example, ANI-CC is a general neural network potential that has been trained on approximate CCSD(T) energies of a wide range of molecules [18]. However, a number of differences in model construction mean that a direct comparison of these energies with the current model is not one-to-one. As the ML methods are trained for different energy targets, we instead are required to use relative energies, with the global minimum geometry of resorcinol as the zero point.
When evaluated on our resorcinol test set, the ANI-CC model has a mean absolute difference (MAD) of 1.78 kcal·mol−1. Comparing this error to the results for our combined models, we see that our combined model of resorcinol, phenol, and benzene improves on ANI-CC by more than a factor of two when directly predicting E_CC, and by more than a factor of 20 for the ∆-DFT variant.
This result was to be expected, since ANI-CC and our model differ in many respects, including the energy targets to be returned and the training data selection. The density-based ML models have been trained specifically to reproduce the geometry fluctuations of a few related molecules, rather than including many functional groups. In general, while the relative energies of both model types are strongly correlated, they still represent different training goals. Users of the ∆-DFT approach need to be aware that the energy correction is trained to link a specific DFT calculation (functional and basis set) to a specific CC energy (type of excitations and basis set).

Kernel ridge regression
Kernel ridge regression (KRR) [19] is a powerful machine learning method for non-linear regression. It extends linear ridge regression, which finds the optimal linear mapping from the inputs to the labels, by incorporating the kernel trick. The resulting model takes the form

f(x) = Σ_{i=1..M} α_i k(x, x_i),

where k is a kernel function and α = [α_1, ..., α_M]^T are the model weights. The model weights are obtained by solving the following optimization problem:

α = argmin_α Σ_{i=1..M} (f(x_i) − y_i)² + λ α^T K α,

where λ is a regularization parameter and K is the kernel matrix with K_ij = k(x_i, x_j). The analytical solution to the minimization problem is then given by

α = (K + λI)^{−1} y.

In this paper we use the Gaussian (radial basis function) kernel

k(x, x') = exp(−‖x − x'‖² / (2σ²)),

where the kernel width σ is a model parameter that needs to be tuned using cross-validation.
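These equations translate directly into a few lines of NumPy (synthetic data; σ and λ are fixed here rather than cross-validated):

```python
import numpy as np

# Direct implementation of the KRR equations above with a Gaussian kernel.
def gaussian_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(60, 2))              # training inputs
y = np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1])     # training labels

sigma, lam = 0.5, 1e-8
K = gaussian_kernel(X, X, sigma)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # analytical solution

Xtest = rng.uniform(-0.8, 0.8, size=(200, 2))
y_pred = gaussian_kernel(Xtest, X, sigma) @ alpha      # f(x) = sum_i a_i k(x, x_i)
```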

Decomposability of the ML-HK map
When using the ML-HK map to predict the electron density, the contribution of the prediction error to the cost function is given by

err = Σ_i ∫ |n^ML[v_i](r) − n_i(r)|² dr.

Writing the density in terms of its basis representation, n(r) = Σ_{l=1..L} u_l φ_l(r), and assuming orthogonality of the basis functions, we obtain

err = Σ_i Σ_{l=1..L} (u_l^ML[v_i] − u_l^(i))².

The resulting equation shows that the error decomposes into independent contributions for each of the basis coefficients. By viewing the errors independently we obtain L separate KRR minimization problems, and analogously to the KRR equations above we obtain the analytical solutions

β^(l) = (K_σ(l) + λ^(l) I)^{−1} u^(l),    l = 1, ..., L,

where for each basis function φ_l, λ^(l) is a regularization parameter, u^(l) is a vector containing the training set coefficients for the l-th basis function, and K_σ(l) is a Gaussian kernel matrix with width σ^(l).
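The decomposition amounts to fitting one independent KRR model per basis coefficient, as in this sketch (kernel width and regularization values are fixed for illustration rather than cross-validated):

```python
import numpy as np

# Sketch of the decomposed ML-HK map: one independent KRR model per basis
# coefficient, each with its own kernel matrix and regularization.
def fit_coefficient_models(K_list, U, lam_list):
    """K_list[l]: kernel matrix for basis function l; U[:, l]: its training
    coefficients. Returns one weight vector beta^(l) per basis function."""
    M = U.shape[0]
    return [np.linalg.solve(K_list[l] + lam_list[l] * np.eye(M), U[:, l])
            for l in range(U.shape[1])]
```

Because the problems are independent, each coefficient model can use its own hyperparameters, which is exactly what the per-basis-function σ^(l) and λ^(l) above express.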

Cross-validation
All hyperparameters used in the model are estimated solely on the training set. The width γ and spacing ∆ hyperparameters for the artificial Gaussian potentials, as well as the kernel width σ and the regularization parameter λ, were optimized individually for each molecule. In both cases the hyperparameter optimization was performed using cross-validation [20] on the training set. After training and cross-validation, the model is fixed and is applied unchanged to the out-of-sample test set. The optimal hyperparameters for the artificial potentials grid selected using the cross-validation procedure are given in the Supplementary Tables.
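A minimal grid-search cross-validation for the KRR hyperparameters (σ, λ), using only training data, could look as follows (synthetic 1-D data; the grid values and fold count are illustrative):

```python
import numpy as np

# k-fold grid-search CV for (sigma, lambda), on the training set only.
def cv_score(X, y, sigma, lam, n_folds=5):
    folds = np.array_split(np.arange(len(X)), n_folds)
    errs = []
    for f in folds:
        train = np.setdiff1d(np.arange(len(X)), f)
        K = np.exp(-(X[train, None] - X[None, train]) ** 2 / (2 * sigma ** 2))
        alpha = np.linalg.solve(K + lam * np.eye(len(train)), y[train])
        Kt = np.exp(-(X[f, None] - X[None, train]) ** 2 / (2 * sigma ** 2))
        errs.append(np.mean(np.abs(Kt @ alpha - y[f])))   # held-out MAE
    return np.mean(errs)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, 80)
y = np.sin(X)
grid = [(s, l) for s in (0.1, 0.5, 1.0, 2.0) for l in (1e-10, 1e-6, 1e-2)]
best = min(grid, key=lambda p: cv_score(X, y, *p))
```

After selecting `best`, the model is refit once on the full training set and then frozen, matching the protocol described above.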

Multiple time-step molecular dynamics in the ML framework
The generation of a molecular dynamics (MD) trajectory proceeds via numerical solution of the equations of motion

Ṙ_α = P_α/M_α,   Ṗ_α = F_α,

possibly also coupled to a thermostat, where α = 1, ..., N indexes the N atoms, M_α is the mass of the αth atom, P_α is its momentum, R_α is its position, and F_α is the force on this atom. In DFT, if the true external potential is v_ext(r, R), where R denotes the full set of atomic coordinates, then the force is given by

F_α = −∫ dr n(r) ∇_{R_α} v_ext(r, R) + F_α,NN,

where F_α,NN is the force due to the nuclear-nuclear repulsion and the first term is the force originating from the electron-nuclear interaction. In the machine-learning framework, the nesting of functional dependencies leads to the progression R → v → n → E, such that the electronic force is given by

F_α^elec = −∇_{R_α} E^ML[n^ML[v(R)]].

In a standard MD calculation, the numerical integration algorithm for generating P steps of MD using a time step ∆t is structured according to the pseudocode shown below:

for i in range(P):
    for α in range(N):
        P_α ← P_α + ∆t · F_α/2
        R_α ← R_α + ∆t · P_α/M_α
    recompute the forces F_α
    for α in range(N):
        P_α ← P_α + ∆t · F_α/2

The obvious bottleneck in an MD calculation, which often restricts the value of P, i.e., the time scale that can be accessed, is the computational overhead associated with the force calculation, as each step requires a full calculation of F_α. The computational time required to generate an MD trajectory can be reduced if the force can be subdivided into a component that has a low computational overhead and a correction that varies on a slower time scale and carries most of the computational overhead of the full force calculation. Denoting the former as the reference force F_α^(ref) and the latter as the correction δF_α, the full force is then F_α = F_α^(ref) + δF_α.
With this force decomposition, a reversible, symplectic multiple time-step integration algorithm can be constructed [21] based on the assumption that the correction δF_α only needs to be updated every m steps, where typically m ∼ 5. If the computational overhead of the reference force calculation is negligible compared to the correction, this reduces the computational cost of the calculation by a factor of m. The algorithm, called the reversible reference system propagator algorithm (reversible RESPA), has the same structure as the integration pseudocode shown above, with the following provision: in each step, the default is to use F_α^(ref) in place of F_α, and the force update is only an update of the reference force; every m steps, however, one uses the force F_α^(ref) + m·δF_α, and the force update requires calculation of both the reference force and the correction.
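The RESPA scheme described above can be sketched for a single 1-D particle (the reference and correction forces here are toy harmonic stand-ins for the ML reference and the DFT-based correction, and the parameters are illustrative):

```python
import numpy as np

# Reversible RESPA sketch: apply the cheap reference force every step and
# the expensive correction, scaled by m, only every m-th step.
def f_ref(x):  return -1.0 * x            # cheap "reference" force
def f_corr(x): return -0.1 * x            # slowly varying "correction"

def respa(x, p, mass, dt, n_steps, m):
    traj = []
    for i in range(n_steps):
        # Opening half-kick: add m * correction at the start of each outer cycle.
        F = f_ref(x) + (m * f_corr(x) if i % m == 0 else 0.0)
        p += 0.5 * dt * F
        x += dt * p / mass
        # Closing half-kick: add m * correction at the end of each outer cycle.
        F = f_ref(x) + (m * f_corr(x) if (i + 1) % m == 0 else 0.0)
        p += 0.5 * dt * F
        traj.append(x)
    return x, p, np.array(traj)

x_fin, p_fin, traj = respa(1.0, 0.0, mass=1.0, dt=0.01, n_steps=5000, m=5)
```

Because the half-kicks with the scaled correction bracket each block of m inner steps, this reproduces the reversible, symplectic structure of RESPA, and the total energy stays bounded over the run.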
In our ∆-machine learning approach, a natural force decomposition arises from the expression for the energy

E^CC_s∆-DFT[n] = E^DFT[n] + ∆E^ML[n]
             = E^ML[n] + (E^DFT[n] + ∆E^ML[n] − E^ML[n])
             ≡ E^ML[n] + δE^ML[n],

where the density n(r) is either the explicit PBE density or the density from the Hohenberg-Kohn map, and E^ML[n] is the direct CCSD(T) ML model. Note that, in the second line, we have added and subtracted the direct CCSD(T) ML model. We thus associate the force obtained from E^ML[n] with the reference force F_α^(ref), the computational overhead of which is quite low, and we associate the correction δF_α with the energy correction δE^ML[n]. As this term requires a full DFT calculation, its computational overhead is significantly higher, and the overall reduction in computational time is very nearly equal to the value of m.