Abstract
Molecular dynamics (MD) simulations employing classical force fields constitute the cornerstone of contemporary atomistic modeling in chemistry, biology, and materials science. However, the predictive power of these simulations is only as good as the underlying interatomic potential. Classical potentials often fail to faithfully capture key quantum effects in molecules and materials. Here we enable the direct construction of flexible molecular force fields from high-level ab initio calculations by incorporating spatial and temporal physical symmetries into a gradient-domain machine learning (sGDML) model in an automatic data-driven way. The developed sGDML approach faithfully reproduces global force fields at quantum-chemical CCSD(T) level of accuracy and allows converged molecular dynamics simulations with fully quantized electrons and nuclei. We present MD simulations for flexible molecules with up to a few dozen atoms and provide insights into the dynamical behavior of these molecules. Our approach provides the key missing ingredient for achieving spectroscopic accuracy in molecular simulations.
Introduction
Molecular dynamics (MD) simulations within the Born–Oppenheimer (BO) approximation constitute the cornerstone of contemporary atomistic modeling. In fact, the 2013 Nobel Prize in Chemistry clearly highlighted the remarkable advances made by MD simulations in offering unprecedented insights into complex chemical and biological systems. However, one of the widely recognized and increasingly pressing issues in MD simulations is the lack of accuracy of the underlying classical interatomic potentials, which hinders truly predictive modeling of the dynamics and function of (bio)molecular systems. One possible solution to the accuracy problem is provided by direct ab initio molecular dynamics (AIMD) simulations, where the quantum-mechanical forces are computed on the fly for atomic configurations at every time step^{1}. The majority of AIMD simulations employ the current workhorse method of electronic-structure theory, namely density-functional approximations (DFA) to the exact solution of the Schrödinger equation for a system of nuclei and electrons. Unfortunately, different DFAs yield contrasting results^{2} for the structure, dynamics, and properties of molecular systems. Furthermore, DFA calculations are not systematically improvable. Alternatively, explicitly correlated methods beyond DFA could also be used in AIMD simulations; unfortunately, this leads to a steep increase in the required computational resources: for example, a nanosecond-long MD trajectory for a single ethanol molecule executed with the CCSD(T) method would take roughly a million CPU years on modern hardware. An alternative is a direct fit of the potential-energy surface (PES) from a large number of CCSD(T) calculations; however, this is only practically achievable for rather small and rigid molecules^{3,4,5}.
To solve this accuracy and molecular size dilemma, and furthermore to enable converged AIMD simulations close to the exact solution of the Schrödinger equation, here we develop an alternative approach using symmetrized gradient-domain machine learning (sGDML) to construct force fields with the accuracy of high-level ab initio calculations. Recently, a wide range of sophisticated machine learning (ML) models for small molecules and elemental materials^{6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46} have been proposed for constructing PESs from DFA calculations. While these results are encouraging, direct ML fitting of molecular PESs relies on the availability of large reference datasets to obtain an accurate model. Frequently, those ML models are trained on thousands or even millions of atomic configurations. This prevents the construction of ML models using high-level ab initio methods, for which energies and forces can practically be computed for only hundreds of conformations.
Instead, we propose a solution that allows converged MD simulations with fully quantized electrons and nuclei for molecules with up to a few dozen atoms. This is enabled by two novel aspects: a reduction of the problem complexity through data-driven discovery of the relevant spatial and temporal physical symmetries, and an enhancement of the information content of the data samples by exercising these identified static and dynamic symmetries, hence implicitly increasing the amount of training data. Using the proposed sGDML approach, we carry out MD simulations at the ab initio coupled-cluster level of electronic-structure theory and provide insights into the dynamical behavior of the studied molecules. Our approach contributes the key missing ingredient for achieving spectroscopic accuracy and rigorous dynamical insights in molecular simulations.
Results
Symmetrized gradientdomain machine learning
The sGDML model is built on the previously introduced gradient-domain machine learning (GDML) model^{47}, but now incorporates all relevant physical symmetries, hence enabling MD simulations with high-level ab initio force-field accuracy. One can classify physical symmetries of molecular systems into symmetries of space and time and specific static and dynamic symmetries of a given molecule (see Fig. 1). Global spatial symmetries include rotational and translational invariance of the energy, while homogeneity of time implies energy conservation. These global symmetries were already successfully incorporated into the GDML model^{47}. Additionally, molecules possess well-defined rigid space-group symmetries (e.g., reflection operations), as well as dynamic non-rigid symmetries (e.g., methyl-group rotations). For example, the benzene molecule with only six carbon and six hydrogen atoms can already be indexed in \(6!6! = 518400\) different, but physically equivalent, ways. However, not all of these symmetric variants are accessible without crossing impassable energy barriers; only the 24 symmetry elements in the D_{6h} point group of this molecule are relevant. While methods for identifying molecular point groups of polyatomic rigid molecules are readily available^{48}, Longuet-Higgins^{49} has pointed out that non-rigid molecules have extra symmetries. These dynamical symmetries arise upon functional-group rotations or torsional displacements, and they are usually not incorporated in traditional force fields and electronic-structure calculations. Typically, extracting non-rigid symmetries requires chemical and physical intuition about the system at hand. Here we develop a physically motivated algorithm for data-driven discovery of all relevant molecular symmetries from MD trajectories.
MD trajectories consist of smooth consecutive changes in nearly isomorphic molecular graphs. When sampling from these trajectories, the combinatorial challenge is to correctly identify the same atoms across the examples, so that the learning method can use consistent information when comparing two molecular conformations in its kernel function. While so-called bipartite matching allows atoms \({\mathbf{R}} = \left( {{\mathbf{r}}_1, \ldots ,{\mathbf{r}}_N} \right)\) to be assigned locally for each pair of molecules in the training set, this strategy alone is not sufficient: the assignments need to be made globally consistent by multipartite matching in a second step^{50,51,52}.
We start with adjacency matrices as the representation of the molecular graph^{9,13,47,53,54}. To solve the pairwise matching problem we therefore seek the assignment τ that minimizes the squared Euclidean distance between the adjacency matrices **A** of two isomorphic graphs G and H with entries \(\left( {\mathbf{A}} \right)_{ij} = \left\| {{\mathbf{r}}_i - {\mathbf{r}}_j} \right\|\), where **P**(τ) is the permutation matrix that realizes the assignment:

$$\tilde \tau = \mathop {{\arg \min }}\limits_\tau \left\| {{\mathbf{P}}(\tau ){\mathbf{A}}_G{\mathbf{P}}(\tau )^ \top - {\mathbf{A}}_H} \right\|^2 \qquad (1)$$
Adjacency matrices of isomorphic graphs have identical eigenvalues and eigenvectors; only their assignment differs. Following the approach of Umeyama^{55}, we identify the correspondence of eigenvectors **U** by projecting both sets **U**_{G} and **U**_{H} onto each other to find the best overlap. After sorting the eigenvalues and resolving the sign ambiguity, we use the overlap matrix

$${\mathbf{M}} = {\mathrm{abs}}\left( {{\mathbf{U}}_G} \right){\mathrm{abs}}\left( {{\mathbf{U}}_H} \right)^ \top \qquad (2)$$
Then −**M** is provided as the cost matrix for the Hungarian algorithm^{56}, maximizing the overall overlap and finally returning the approximate assignment \(\tilde \tau\) that minimizes Eq. (1); this provides the result of step one of the procedure. As indicated, global inconsistencies may arise, e.g., violations of the transitivity property \(\tau _{jk} \circ \tau _{ij} = \tau _{ik}\) of the assignments. Therefore a second step is necessary, based on the composite matrix \(\tilde{\cal P}\) of all pairwise assignment matrices \({\tilde{\mathbf P}}_{ij} \equiv {\mathbf{P}}(\tilde \tau _{ij})\) within the training set.
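As a concrete sketch, the bipartite step (eigenvector overlap plus Hungarian assignment) can be written in a few lines of Python; `bipartite_match` and the toy geometries below are illustrative names, not code from the paper, and the element-mismatch penalty from the Methods is omitted:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

def bipartite_match(R_G, R_H):
    """Approximate atom assignment between two conformations of the same
    molecule via Umeyama-style eigenvector overlap (a minimal sketch)."""
    # Adjacency matrices with pairwise Euclidean distances as entries.
    A_G = np.linalg.norm(R_G[:, None] - R_G[None, :], axis=-1)
    A_H = np.linalg.norm(R_H[:, None] - R_H[None, :], axis=-1)
    # Eigenvectors of the symmetric adjacency matrices, sorted by eigenvalue.
    _, U_G = np.linalg.eigh(A_G)
    _, U_H = np.linalg.eigh(A_H)
    # Overlap matrix; abs() removes the sign ambiguity of the eigenvectors.
    M = np.abs(U_G) @ np.abs(U_H).T
    # Hungarian algorithm on -M maximizes the total overlap.
    _, tau = linear_sum_assignment(-M)
    return tau  # tau[i]: atom index in H matched to atom i in G
```

For two geometries that differ only by a relabeling of the atoms, this recovers the exact permutation; for thermally distorted geometries it returns the approximate assignment \(\tilde \tau\) described above.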
We propose to reconstruct a rank-limited \({\cal P}\) via the transitive closure of the minimum spanning tree (MST) that minimizes the bipartite matching cost (see Eq. (1), Fig. 1) over the training set. The MST is constructed from the most confident bipartite assignments and represents the rank-\(N\) skeleton of \(\tilde{\cal P}\), also defining \({\cal P}\).
The resulting consistent multipartite matching \({\cal P}\) enables us to construct symmetric kernel-based ML models of the form

$$\hat f\left( {\mathbf{x}} \right) = \mathop {\sum }\limits_i^M \mathop {\sum }\limits_j^S \alpha _{ij}\,\kappa \left( {{\mathbf{x}},{\mathbf{P}}_{ij}{\mathbf{x}}_i} \right) \qquad (3)$$
by augmenting the training set with the symmetric variations of each molecule (see Supplementary Note 1 for a comparison with alternative symmetry-adapted kernel functions). A particular advantage of our solution is that it can fully populate all recovered permutational configurations even if they do not form a symmetric group, considerably reducing the computational effort of evaluating the model. Even if we limit the range of j to the S unique assignments only, a major downside of this approach is that multiplying the training set size leads to a drastic increase in the complexity of the cubically scaling kernel ridge regression learning algorithm. We overcome this drawback by exploiting the fact that the set of coefficients α for the symmetrized training set exhibits the same symmetries as the data; hence the linear system can be contracted to its original size, while still defining the full set of coefficients exactly.
For notational convenience we transform all training geometries into a canonical permutation \({\mathbf{x}}_i \equiv {\mathbf{P}}_{i1}{\mathbf{x}}_i\), enabling the use of uniform symmetry transformations \({\mathbf{P}}_j \equiv {\mathbf{P}}_{1j}\) (see Supplementary Note 2). Simplifying Eq. (3) accordingly gives rise to the symmetric kernel that we originally set out to construct,

$$\kappa _{{\mathrm{sym}}}\left( {{\mathbf{x}},{\mathbf{x}}\prime } \right) = \mathop {\sum }\limits_j^S \kappa \left( {{\mathbf{x}},{\mathbf{P}}_j{\mathbf{x}}\prime } \right) \qquad (4)$$
and yields a model with the exact same number of parameters as the original, nonsymmetric one.
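As a minimal toy illustration of this construction (with a plain RBF kernel standing in for the GDML force kernel, and hypothetical random data), summing the base kernel over the recovered permutations of one argument yields a permutation-invariant kernel without adding any coefficients:

```python
import numpy as np
from itertools import permutations

def rbf(x, y, sigma=1.0):
    """Base kernel on geometries (sum over all coordinate differences)."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

def kappa_sym(x, y, perms, sigma=1.0):
    """Symmetrized kernel: sum the base kernel over the S recovered atom
    permutations applied to one argument only (cf. the contraction above)."""
    return sum(rbf(x, y[list(p)], sigma) for p in perms)

# Toy "molecule" with 3 interchangeable atoms in 3D.
perms = list(permutations(range(3)))
rng = np.random.default_rng(1)
x, y = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
# kappa_sym(x, y, perms) is unchanged if the atoms of y are relabeled.
```

Because the permutations here form a group, relabeling the atoms of either argument leaves the kernel value unchanged, which is exactly why the coefficient set can be contracted to its original size.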
Our symmetric kernel is an extension of regular kernels and can be applied universally, in particular to kernel-based force fields. Here we construct a symmetric variant of the GDML model, sGDML. This symmetrized GDML force-field kernel takes the form:
Accordingly, the trained force-field estimator collects the contributions of the 3N partial derivatives of all M training points and S symmetry transformations to compile the prediction for a new input **x**. It takes the form
and a corresponding energy predictor is obtained by integrating \({\hat{\mathbf f}}_{\mathbf{F}}\) with respect to the Cartesian geometry. Due to the linearity of integration, the expression for the energy predictor is identical up to the order of the derivative operator acting on the kernel function.
Every (s)GDML model is trained on a set of reference examples that reflects the population of energy states a particular molecule visits during an MD simulation at a certain temperature. For our purposes, the corresponding set of geometries is subsampled from a 200-ps DFT MD trajectory at 500 K following the Boltzmann distribution. Subsequently, a globally consistent permutation graph is constructed that jointly assigns all geometries in the training set, providing a small selection of physically feasible transformations that define the training-set-specific symmetric kernel function. In the interest of computational tractability, we shortcut this sampling process to construct sGDML@CCSD(T) and only recompute the energy and force labels at this higher level of theory.
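The subsampling step can be sketched as follows; since an equilibrated NVT trajectory already follows the Boltzmann distribution, a uniform random draw of frames preserves the energy statistics (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def subsample_trajectory(energies, n_train, seed=0):
    """Pick training frames from an equilibrium MD trajectory.
    Uniform random selection without replacement preserves the
    Boltzmann-distributed energy histogram of the full trajectory."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(energies), size=n_train, replace=False)
    return np.sort(idx)

# e.g. reduce the 400,000 frames of a 200 ps / 0.5 fs run to a few hundred
# geometries whose energies and forces are then relabeled at CCSD(T) level.
```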
The sGDML model can be trained in closed form, which is both faster and more accurate than iterative numerical optimization. Model selection is performed through a grid search on a suitable subset of the hyperparameter space. Throughout, cross-validation with dedicated datasets for training, testing, and validation is used to estimate the generalization performance of the model.
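Closed-form training amounts to solving a regularized linear system. A generic energy-only sketch with an RBF kernel is shown below; sGDML itself uses the Hessian force kernel described above, and the hyperparameter values here are placeholders for the grid search:

```python
import numpy as np

def train_krr(X, y, sigma=1.0, lam=1e-10):
    """Closed-form kernel ridge regression: solve (K + lam*I) alpha = y."""
    d2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)  # pairwise sq. dists
    K = np.exp(-d2 / (2 * sigma**2))                      # RBF kernel matrix
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_krr(X, alpha, x, sigma=1.0):
    """Evaluate the trained model at a new point x."""
    d2 = np.sum((X - x) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma**2)) @ alpha
```

Model selection then simply repeats `train_krr` over a grid of `sigma` and `lam` values and keeps the combination with the lowest validation error.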
Forces and energies from GDML to sGDML@DFT to sGDML@CCSD(T)
Our goal is to demonstrate that it is possible to construct compact sGDML models that faithfully recover CCSD(T) force fields for flexible molecules with up to 20 atoms using only a small set of a few hundred molecular conformations. As a first step, we investigate the gain in efficiency and accuracy of the sGDML model versus the GDML model employing MD trajectories of ten molecules from benzene to azobenzene computed with DFT (see Fig. 2 and Supplementary Table 1). The benefit of a symmetric model is directly linked to the number of symmetries in the system. For toluene, naphthalene, aspirin, malonaldehyde, ethanol, paracetamol, and azobenzene, sGDML improves the force prediction by 31.3–67.4% using the same training sets in all cases (see Table 1). As expected, uracil and salicylic acid have no exploitable symmetries; hence the performance of sGDML is unchanged with respect to GDML. The inclusion of symmetries leads to a stronger improvement in force prediction performance compared to energy predictions. This is most clearly visible for the naphthalene dataset, where the force predictions even improve unilaterally. We attribute this to the difference in complexity of the two quantities and the fact that an energy penalty is intentionally omitted in the cost function to avoid a tradeoff.
A minimal force accuracy required for reliable MD simulations is MAE = 1 kcal mol^{−1} Å^{−1}. While the GDML model can achieve this accuracy at around 800 training examples for all molecules except aspirin, sGDML only needs 200 training examples to reach the same quality. Note that energybased ML approaches typically require two to three orders of magnitude more data^{47}.
Given that the novel sGDML model is data efficient and highly accurate, we are now in a position to tackle the CCSD(T) level of accuracy with modest computational resources. We have trained sGDML models on CCSD(T) forces for benzene, toluene, ethanol, and malonaldehyde. For the larger aspirin molecule, we used CCSD forces (see Supplementary Table 2). The sGDML@CCSD(T) model achieves a high accuracy for energies, reducing the prediction error of sGDML@DFT by a factor of 1.4 (for ethanol) to 3.4 (for toluene). This finding leads to the interesting hypothesis that sophisticated quantum-mechanical force fields are smoother and, as a convenient side effect, easier to learn. Note that the accuracy of the force prediction in sGDML@CCSD(T) and sGDML@DFT is comparable, with the benzene molecule as the only exception. We attribute this to slight shifts in the locations of the minima on the PES between DFT and CCSD(T), which means that the data sampling process for CCSD(T) can be further improved. In principle, we envision a corrected resampling procedure for CCSD(T), using the sGDML@CCSD(T) model, as future work.
MD with ab initio accuracy
The predictive power of a force field can only be truly assessed by computing dynamical and thermodynamical observables, which requires sufficient sampling of the configuration space, for example by employing MD or Monte Carlo simulations. We remark that global error measures, such as the mean absolute error (MAE) and the root mean squared error, are typically prone to overestimate the reconstruction quality of the force field, as they average out local topological properties. However, these local properties can become highly relevant when the model is used for an actual analysis of MD trajectories. As a demonstration we use the ethanol molecule, which has three minima, gauche± (M_{g±}) and trans (M_{t}), shown in Fig. 3a; experimentally it has been confirmed that M_{t} is the ground state and M_{g} is a local minimum^{57}. The energy difference between these two minima is only 0.12 kcal mol^{−1} and they are separated by an energy barrier of 1.15 kcal mol^{−1}. Obviously, the widely discussed ML target accuracy of 1 kcal mol^{−1} is not sufficient to describe the dynamics of ethanol and other molecules.
This brings us to another crucial issue for predictive models: the accuracy of the reference data. Computing the energy difference between M_{t} and M_{g} using DFT (PBE+TS), we find that M_{g} is 0.08 kcal mol^{−1} more stable than M_{t}, contradicting the experimental measurements. Repeating the same calculation using CCSD(T)/cc-pVTZ, we find that M_{t} is more stable than M_{g} by 0.08 kcal mol^{−1}, in excellent agreement with experiment. From this analysis and subsequent MD simulations we conclude that CCSD(T) accuracy, or sometimes even higher, is necessary for truly predictive insights.
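The practical impact of such sub-kcal mol^{−1} differences is easy to quantify with a two-state Boltzmann estimate (a classical back-of-the-envelope sketch that ignores conformer degeneracy and nuclear quantum effects):

```python
import numpy as np

KB = 0.0019872  # Boltzmann constant in kcal mol^-1 K^-1

def pop_ratio(delta_e, temp=300.0):
    """Relative population of a conformer lying delta_e kcal/mol above
    the ground state (two-state classical Boltzmann estimate)."""
    return np.exp(-delta_e / (KB * temp))
```

With the ~0.12 kcal mol^{−1} trans/gauche gap of ethanol, `pop_ratio(0.12)` ≈ 0.82, i.e., both minima stay heavily populated at room temperature; an error of ~1 kcal mol^{−1} in the PES can therefore invert the predicted conformer ordering entirely.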
In addition to requiring highly accurate quantum-chemical approximations, the ethanol molecule also belongs to a category of fluxional molecules sensitive to nuclear quantum effects (NQE). This is because the internal rotational barriers of the ethanol molecule (M_{g} ↔ M_{t}) are on the order of ~1.2 kcal mol^{−1} (see Fig. 3), which is neither low enough to generate frequent transitions nor high enough to avoid them. In classical MD at room temperature, thermal fluctuations lead to inadequate sampling of the PES. By correctly including NQE via path-integral MD (PIMD), the ethanol molecule is able to transition between the M_{g} and M_{t} configurations, radically increasing the transition frequency (see Supplementary Figure 1) and generating statistical weights in excellent agreement with experiment. Figure 3b shows the statistical occupations of the different minima of ethanol using classical MD and PIMD for the sGDML@CCSD(T) and sGDML@DFT models in comparison with the experimental results. Overall, our MD results for ethanol highlight the necessity of combining a highly accurate force field with an equally accurate treatment of NQE to achieve a reliable and quantitative understanding of molecular systems.
Having established the accuracy of the statistical occupations of the different states of ethanol, we are now in a position to discuss for the first time the CCSD(T) vibrational spectrum of ethanol, computed using the velocity–velocity autocorrelation function based on centroid PIMD (see Fig. 3c). As a reference, in Fig. 3c (top) we compare the vibrational spectra from the DFT and CCSD(T) sGDML models in the fingerprint region; as expected, the sGDML@CCSD(T) model generates higher frequencies, but both spectra share similar shapes with slightly different peak intensities. Molecular vibrational spectra at finite temperature include anharmonic effects, hence anharmonicities can be studied by comparing the sGDML@CCSD(T) spectrum with the harmonic approximation. Figure 3c (middle) shows such a comparison and demonstrates that low-frequency and non-symmetric vibrations are most affected by finite-temperature contributions. The thermal frequency shift is better seen in Fig. 3c (bottom), where the sGDML@CCSD(T) spectrum is compared at two different temperatures. We observe that each normal mode is shifted in a specific manner and not by a simple scaling factor, as typically assumed. The most striking finding from our simulations is the resolution of the apparent mismatch between theory and experiment concerning the origin of the torsional frequency of the hydroxyl group. Experimentally, the low-frequency region of ethanol, around ~210 cm^{−1}, is not fully understood; there are frequency measurements for the hydroxyl rotor ranging between ~202^{58,59} and ~207^{60} cm^{−1} for gas-phase ethanol, while theoretically we found 243.7 cm^{−1} at the sGDML@CCSD(T) level of theory in the harmonic approximation. From the middle and bottom panels in Fig. 3c, we observe that with increasing temperature the lowest peak shifts to substantially lower frequencies compared to the rest of the spectrum. The origin of this phenomenon is the strong anharmonic behavior of the lowest normal mode 1, shown in Fig. 3c (middle), which mainly corresponds to hydroxyl-group rotations. At room temperature the frequency of this mode drops to ~215 cm^{−1}, corresponding to a redshift of 12% and coming closer to the experimental results, demonstrating the importance of dynamical anharmonicities.
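The spectral analysis can be sketched via the standard Wiener–Khinchin route: the vibrational density of states as the Fourier transform of the velocity–velocity autocorrelation function. This is a minimal generic sketch, not the paper's actual centroid-PIMD pipeline, and all names are illustrative:

```python
import numpy as np

def vacf_spectrum(vel, dt):
    """Vibrational density of states from the velocity-velocity
    autocorrelation function (VACF) of an MD trajectory.
    vel: array (n_steps, n_atoms, 3); dt: sampling time step."""
    v = vel.reshape(len(vel), -1)
    n = len(v)
    # VACF via FFT with zero padding (Wiener-Khinchin theorem).
    f = np.fft.rfft(v, n=2 * n, axis=0)
    acf = np.fft.irfft(f * np.conj(f), axis=0)[:n].sum(axis=1)
    acf /= acf[0]
    # Power spectrum of the normalized VACF.
    spectrum = np.abs(np.fft.rfft(acf))
    return np.fft.rfftfreq(n, d=dt), spectrum
```

A single-frequency test trajectory produces a peak at that frequency, which is the basic sanity check for this kind of post-processing.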
Finally, we illustrate the wider applicability of the sGDML model to molecules more complex than ethanol by performing a detailed analysis of MD simulations for malonaldehyde and aspirin. In Fig. 4a, we show the joint probability distributions of the dihedral angles (PDDA) for the malonaldehyde molecule. This molecule has a peculiar PES with two local minima featuring a symmetric O\(\cdots\)H\(\cdots\)O interaction (structure (1)), and a shallow region where the molecule fluctuates between two symmetric global minima (structure (2)). The dynamical behavior represented in structure (2) is due to the interplay of two molecular states dominated by an intramolecular O\(\cdots\)H interaction and a low crossing barrier of ~0.2 kcal mol^{−1}. An interesting result is that structure (1) is nearly unvisited by the sGDML@DFT model in comparison to the sGDML@CCSD(T) model, despite the great similarity of their PESs; this illustrates the observable consequences of subtle energy differences in the PES of molecules with several degrees of freedom. In terms of spectroscopic differences, the two approximations generate spectra with very few discrepancies (Fig. 4a, right), the most prominent being the one between the two peaks around 500 cm^{−1}. This difference can be traced back to the enhanced sampling of structure (1), and additionally it could be associated with the different ways in which the two methods describe the intramolecular O\(\cdots\)H coupling.
For aspirin, the consequences of properly including electron correlation are even more significant. Figure 4b shows the PIMD-generated PDDA for the DFT- and CCSD-based models. Comparing the two distributions, we find that sGDML@CCSD generates localized dynamics in the global energy minimum, whereas the DFT model yields a rather delocalized sampling of the PES. These two contrasting results are explained by the difference in the energetic barriers along the ester dihedral angle: the incorporation of electron correlation in CCSD increases the internal barriers by ~1 kcal mol^{−1}. This prediction was corroborated with explicit CCSD(T) calculations along the dihedral-angle coordinate (black dashed line in Fig. 4b, PES). Furthermore, the difference in sampling is also due to the fact that the DFT model generates consistently softer interatomic interactions compared to CCSD, which leads to large and visible differences in the vibrational spectra between DFT and CCSD (Fig. 4b, right).
Discussion
The present work enables MD simulations of flexible molecules with up to a few dozen atoms with the accuracy of high-level ab initio quantum mechanics. Such simulations pave the way to computing dynamical and thermodynamical properties of molecules with an essentially exact description of the underlying PES. On the one hand, this is a required step towards molecular simulations with spectroscopic accuracy. On the other hand, our accurate and efficient sGDML model leads to unprecedented insights when interpreting the experimental vibrational spectra and dynamical behavior of molecules. The contrasting demands of accuracy and efficiency are satisfied by the sGDML model through a rigorous incorporation of physical symmetries (spatial, temporal, and local) into a gradient-domain ML approach. This is a significant improvement over symmetry adaptation in traditional force fields and electronic-structure calculations, where usually only (global) point groups are considered. Global symmetries become increasingly unlikely with growing molecule size, providing diminishing returns; local symmetries, on the other hand, are independent of system size and are preserved even when the molecule is fragmented for large-scale modeling.
In many applications of machine-learned force fields the target error is chemical or thermochemical accuracy (1 kcal mol^{−1}), but this value was conceived for thermochemical experimental measurements, such as heats of formation or ionization potentials. Consequently, the accuracy of ML models for predicting molecular PESs should not be tied to this value. Here, we propose a framework for force-field accuracy that satisfies the stringent demands of molecular spectroscopists, typically in the range of wavenumbers (≈0.03 kcal mol^{−1}). Reaching this accuracy will be one of the greatest challenges for ML-based force fields. We remark that energy differences between molecular conformers are often on the order of 0.1–0.2 kcal mol^{−1}; hence, reaching spectroscopic accuracy in molecular simulations is needed to generate predictive results.
A comparable accuracy is not obtainable with traditional force fields (see Fig. 5). In general, they miss most of the crucial quantum effects due to their rigid, handcrafted analytical form. For example, the absence of a term for electron lone pairs in AMBER leads to uncoupled rotors in ethanol. Furthermore, the oversimplified harmonic description of bonded interactions generates unphysical harmonic sampling at room temperature (see Fig. 5a). In the case of malonaldehyde (Fig. 5b), the two distributions misleadingly resemble each other; however, they emerge from different types of interactions. For AMBER, the dynamics are purely driven by Coulomb interactions, while the sampling with sGDML@CCSD(T) (structure (2) in Fig. 4a) is mostly guided by electron-correlation effects. Lastly, a complete mismatch between the regular force field and sGDML is evident for aspirin (see Fig. 5c), where the interactions dominated by Coulomb forces generate a completely different PES with spurious global and local minima. It is worth mentioning that the observed shortcomings of the AMBER force field can be addressed for a particular molecule, however only at the cost of losing generality and computational efficiency.
In the context of ML, our work connects to recent studies on the use of invariance constraints for learning and representations in vision. In the human visual system, and also in computer-vision algorithms, incorporating invariances such as translation, scaling, and rotation of objects can in principle permit higher performance with greater data efficiency^{61}; learning-theoretic bounds furthermore show that the amount of data required is reduced by a factor equal to the number of parameters of the invariance transformation^{62}. Interestingly, our study goes empirically beyond this factor, i.e., our gain in data efficiency is often more than two orders of magnitude when combining the invariances (physical symmetries). We speculate that this finding may indicate that the learning problem itself becomes less complex, i.e., that the underlying problem structure becomes significantly easier to represent.
A number of challenges remain to be solved to extend the applicability of the sGDML model and its scaling to larger molecular systems. Given an extensive set of individually trained sGDML models, an unseen molecule could be represented as a nonlinear combination of those models; this would allow scaling up and transferable predictions for molecules of similar size. Advanced sampling strategies could be employed to combine forces from different levels of theory to minimize the need for computationally intensive ab initio calculations. Our focus in this work was on intramolecular forces in small and medium-sized molecules. Looking ahead, it is sensible to integrate the sGDML model with an accurate intermolecular force field to enable predictive simulations of condensed molecular systems (ref.^{63} presents an intermolecular model that would be particularly suited for coupling with sGDML). Many other avenues for further development exist^{64}, including incorporating additional physical priors, reducing the dimensionality of complex PESs, computing reaction pathways, and modeling infrared, Raman, and other spectroscopic measurements.
Methods
Reference data generation
The data used for training the DFT models were created by running ab initio MD in the NVT ensemble with the Nosé–Hoover thermostat at 500 K for 200 ps with a time step of 0.5 fs. We computed forces and energies using all-electron calculations at the generalized gradient approximation level of theory with the Perdew–Burke–Ernzerhof (PBE)^{65} exchange-correlation functional, treating van der Waals interactions with the Tkatchenko–Scheffler (TS) method^{66}. All calculations were performed with FHI-aims^{67}. The final training data were generated by subsampling the full trajectory while preserving the Maxwell–Boltzmann distribution of the energies.
To create the coupled-cluster datasets, we reused the same geometries as for the DFT models and recomputed energies and forces using all-electron coupled cluster with single, double, and perturbative triple excitations (CCSD(T)). Dunning's correlation-consistent cc-pVTZ basis set was used for ethanol and cc-pVDZ for toluene and malonaldehyde; for aspirin, CCSD/cc-pVDZ was used. All calculations were performed with the Psi4^{68} software suite.
Molecular dynamics
To incorporate the crucial effects induced by quantum nuclear delocalization, we used PIMD, which incorporates quantum-mechanical effects into MD simulations via Feynman's path-integral formalism. The PIMD simulations were performed with the sGDML model interfaced to the i-PI code^{69}. The integration time step was set to 0.2 fs to ensure energy conservation along the MD runs in the NVE and NVT ensembles. The total simulation time was 1 ns for ethanol (Fig. 3), using 16 beads in the PIMD, to obtain converged sampling of the PES.
Bipartite matching cost matrix
For the bipartite matching of a pair of molecular graphs, we solve the optimal assignment problem for the eigenvectors of their adjacency matrices using the Hungarian algorithm^{56}. As input, this algorithm expects a matrix with all pairwise assignment costs \({\mathbf{C}}_{\mathbf{M}} = - {\mathbf{M}}\), which is constructed as the negative overlap matrix from Eq. (2). We add a penalty matrix with entries \(({\mathbf{C}}_{\mathbf{z}})_{ij} = {\mathrm{abs}}(({\mathbf{z}})_i - ({\mathbf{z}})_j)\varepsilon\) that prevents the matching of non-identical nuclei for sufficiently large ε > 0. The final cost matrix is then \({\mathbf{C}} = {\mathbf{C}}_{\mathbf{M}} + {\mathbf{C}}_{\mathbf{z}}\).
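In code, the composite cost matrix can be sketched as follows; the value of ε is an illustrative penalty scale (not a value from the paper), and `M` would come from the eigenvector overlap of Eq. (2):

```python
import numpy as np

def assignment_cost(M, z, eps=10.0):
    """Cost matrix for the Hungarian step: negative overlap plus an
    element-mismatch penalty (C_z)_ij = |z_i - z_j| * eps, which blocks
    the matching of non-identical nuclei for large enough eps."""
    C_z = np.abs(z[:, None] - z[None, :]) * eps
    return -M + C_z
```

The diagonal of `C_z` is always zero (identical nuclei carry no penalty), so assignments between like atoms compete purely on overlap, while cross-element assignments are priced out.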
Training sGDML
The symmetric kernel formulation approximates the similarities in the kernel matrix between different permutational configurations of the inputs, as they would appear with a fully symmetrized training set. We construct this object as the sum over all relevant atom assignments for each training geometry, such that the kernel matrix retains its original size. This procedure is used to symmetrize the GDML model^{47}, where the symmetric kernel function takes the form
Note that the rows and columns of the Hessian in the summand are permuted (using \({\mathbf{P}}_p^{\rm \top}\) and **P**_{q}) such that the corresponding partial derivatives align. When evaluating the model, the free variable **x** (the first argument of the kernel function) is not permuted and the normalization factor is dropped (see Eq. (5)). See Supplementary Note 3 for information on how to use the sGDML model when the input is represented by a descriptor.
Data availability
All datasets used in this work are available at http://quantummachine.org/datasets/. Additional data related to this paper may be requested from the authors.
Additional information
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
 1.
Tuckerman, M. Statistical Mechanics: Theory and Molecular Simulation (Oxford University Press, Oxford, UK, 2010).
 2.
Koch, W. & Holthausen, M. C. A Chemist's Guide to Density Functional Theory (John Wiley & Sons, Hoboken, New Jersey, USA, 2015).
 3.
Partridge, H. & Schwenke, D. W. The determination of an accurate isotope dependent potential energy surface for water from extensive ab initio calculations and experimental data. J. Chem. Phys. 106, 4618–4639 (1997).
 4.
Mizukami, W., Habershon, S. & Tew, D. P. A compact and accurate semiglobal potential energy surface for malonaldehyde from constrained least squares regression. J. Chem. Phys. 141, 144310 (2014).
 5.
Schran, C., Uhl, F., Behler, J. & Marx, D. High-dimensional neural network potentials for solvation: the case of protonated water clusters in helium. J. Chem. Phys. 148, 102310 (2018).
 6.
Behler, J. & Parrinello, M. Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys. Rev. Lett. 98, 146401 (2007).
 7.
Bartók, A. P., Payne, M. C., Kondor, R. & Csányi, G. Gaussian approximation potentials: the accuracy of quantum mechanics, without the electrons. Phys. Rev. Lett. 104, 136403 (2010).
 8.
Jose, K. V. J., Artrith, N. & Behler, J. Construction of highdimensional neural network potentials using environmentdependent atom pairs. J. Chem. Phys. 136, 194111 (2012).
 9.
Rupp, M., Tkatchenko, A., Müller, K.-R. & von Lilienfeld, O. A. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett. 108, 058301 (2012).
 10.
Montavon, G. et al. Machine learning of molecular electronic properties in chemical compound space. New J. Phys. 15, 095003 (2013).
 11.
Bartók, A. P., Kondor, R. & Csányi, G. On representing chemical environments. Phys. Rev. B 87, 184115 (2013).
 12.
Hansen, K. et al. Assessment and validation of machine learning methods for predicting molecular atomization energies. J. Chem. Theory Comput. 9, 3404–3419 (2013).
 13.
Hansen, K. et al. Machine learning predictions of molecular properties: accurate manybody potentials and nonlocality in chemical space. J. Phys. Chem. Lett. 6, 2326–2331 (2015).
 14.
Rupp, M., Ramakrishnan, R. & von Lilienfeld, O. A. Machine learning for quantum mechanical properties of atoms in molecules. J. Phys. Chem. Lett. 6, 3309–3313 (2015).
 15.
Bartók, A. P. & Csányi, G. Gaussian approximation potentials: a brief tutorial introduction. Int. J. Quantum Chem. 115, 1051–1057 (2015).
 16.
Botu, V. & Ramprasad, R. Learning scheme to predict atomic forces and accelerate materials simulations. Phys. Rev. B 92, 094306 (2015).
 17.
Li, Z., Kermode, J. R. & De Vita, A. Molecular dynamics with on-the-fly machine learning of quantum-mechanical forces. Phys. Rev. Lett. 114, 096405 (2015).
 18.
Eickenberg, M., Exarchakis, G., Hirn, M., Mallat, S. & Thiry, L. Solid harmonic wavelet scattering for predictions of molecule properties. J. Chem. Phys. 148, 241732 (2018).
 19.
Behler, J. Perspective: machine learning potentials for atomistic simulations. J. Chem. Phys. 145, 170901 (2016).
 20.
De, S., Bartok, A. P., Csányi, G. & Ceriotti, M. Comparing molecules and solids across structural and alchemical space. Phys. Chem. Chem. Phys. 18, 13754–13769 (2016).
 21.
Brockherde, F. et al. Bypassing the Kohn-Sham equations with machine learning. Nat. Commun. 8, 872 (2017).
 22.
Artrith, N., Urban, A. & Ceder, G. Efficient and accurate machine-learning interpolation of atomic energies in compositions with many species. Phys. Rev. B 96, 014112 (2017).
 23.
Podryabinkin, E. V. & Shapeev, A. V. Active learning of linearly parametrized interatomic potentials. Comput. Mater. Sci. 140, 171–180 (2017).
 24.
Bartók, A. P. et al. Machine learning unifies the modeling of materials and molecules. Sci. Adv. 3, e1701816 (2017).
 25.
Glielmo, A., Sollich, P. & De Vita, A. Accurate interatomic force fields via machine learning with covariant kernels. Phys. Rev. B 95, 214302 (2017).
 26.
Gastegger, M., Behler, J. & Marquetand, P. Machine learning molecular dynamics for the simulation of infrared spectra. Chem. Sci. 8, 6924–6935 (2017).
 27.
Schütt, K. T., Arbabzadah, F., Chmiela, S., Müller, K.-R. & Tkatchenko, A. Quantum-chemical insights from deep tensor neural networks. Nat. Commun. 8, 13890 (2017).
 28.
Yao, K., Herr, J. E. & Parkhill, J. The manybody expansion combined with neural networks. J. Chem. Phys. 146, 014106 (2017).
 29.
Dral, P. O., Owens, A., Yurchenko, S. N. & Thiel, W. Structure-based sampling and self-correcting machine learning for accurate calculations of potential energy surfaces and vibrational levels. J. Chem. Phys. 146, 244108 (2017).
 30.
John, S. & Csányi, G. Many-body coarse-grained interactions using Gaussian approximation potentials. J. Phys. Chem. B 121, 10934–10949 (2017).
 31.
Huang, B. & von Lilienfeld, O. The “DNA” of chemistry: scalable quantum machine learning with “amons”. Preprint at https://arxiv.org/abs/1707.04146 (2017).
 32.
Faber, F. A. et al. Prediction errors of molecular machine learning models lower than hybrid DFT error. J. Chem. Theory Comput. 13, 5255–5264 (2017).
 33.
Huan, T. D. et al. A universal strategy for the creation of machine learningbased atomistic force fields. NPJ Comput. Mater. 3, 37 (2017).
 34.
Schütt, K. et al. SchNet: a continuous-filter convolutional neural network for modeling quantum interactions. Adv. Neural Inf. Process. Syst. 31, 991–1001 (2017).
 35.
Mardt, A., Pasquali, L., Wu, H. & Noé, F. VAMPnets for deep learning of molecular kinetics. Nat. Commun. 9, 5 (2018).
 36.
Glielmo, A., Zeni, C. & De Vita, A. Efficient nonparametric n-body force fields from machine learning. Phys. Rev. B 97, 184307 (2018).
 37.
Zhang, L., Han, J., Wang, H., Car, R. & Weinan, E. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Phys. Rev. Lett. 120, 143001 (2018).
 38.
Lubbers, N., Smith, J. S. & Barros, K. Hierarchical modeling of molecular energies using a deep neural network. J. Chem. Phys. 148, 241715 (2018).
 39.
Tang, Y.-H., Zhang, D. & Karniadakis, G. E. An atomistic fingerprint algorithm for learning ab initio molecular force fields. J. Chem. Phys. 148, 034101 (2018).
 40.
Grisafi, A., Wilkins, D. M., Csányi, G. & Ceriotti, M. Symmetry-adapted machine learning for tensorial properties of atomistic systems. Phys. Rev. Lett. 120, 036002 (2018).
 41.
Ryczko, K., Mills, K., Luchak, I., Homenick, C. & Tamblyn, I. Convolutional neural networks for atomistic systems. Comput. Mater. Sci. 149, 134–142 (2018).
 42.
Kanamori, K. et al. Exploring a potential energy surface by machine learning for characterizing atomic transport. Phys. Rev. B 97, 125124 (2018).
 43.
Pronobis, W., Tkatchenko, A. & Müller, K.-R. Many-body descriptors for predicting molecular properties with machine learning: analysis of pairwise and three-body interactions in molecules. J. Chem. Theory Comput. 14, 2991–3003 (2018).
 44.
Hy, T. S., Trivedi, S., Pan, H., Anderson, B. M. & Kondor, R. Predicting molecular properties with covariant compositional networks. J. Chem. Phys. 148, 241745 (2018).
 45.
Smith, J. S. et al. Outsmarting quantum chemistry through transfer learning. Preprint at https://chemrxiv.org/articles/Outsmarting_Quantum_Chemistry_Through_Transfer_Learning/6744440 (2018).
 46.
Yao, K., Herr, J. E., Toth, D. W., Mckintyre, R. & Parkhill, J. The TensorMol-0.1 model chemistry: a neural network augmented with long-range physics. Chem. Sci. 9, 2261–2269 (2018).
 47.
Chmiela, S. et al. Machine learning of accurate energy-conserving molecular force fields. Sci. Adv. 3, e1603015 (2017).
 48.
Wilson, E. Molecular Vibrations: The Theory of Infrared and Raman Vibrational Spectra (McGrawHill Interamericana, São Paulo, Brasil, 1955).
 49.
Longuet-Higgins, H. The symmetry groups of non-rigid molecules. Mol. Phys. 6, 445–460 (1963).
 50.
Pachauri, D., Kondor, R. & Singh, V. Solving the multi-way matching problem by permutation synchronization. Adv. Neural Inf. Process. Syst. 26, 1860–1868 (2013).
 51.
Schiavinato, M., Gasparetto, A. & Torsello, A. Transitive Assignment Kernels for Structural Classification (Springer International Publishing, Cham, Switzerland, 2015).
 52.
Kriege, N. M., Giscard, P.-L. & Wilson, R. C. On valid optimal assignment kernels and applications to graph classification. Adv. Neural Inf. Process. Syst. 30, 1623–1631 (2016).
 53.
Vishwanathan, S. V. N., Schraudolph, N. N., Kondor, R. & Borgwardt, K. M. Graph kernels. J. Mach. Learn. Res. 11, 1201–1242 (2010).
 54.
Ferré, G., Haut, T. & Barros, K. Learning potential energy landscapes using graph kernels. J. Chem. Phys. 146, 114107 (2017).
 55.
Umeyama, S. An eigendecomposition approach to weighted graph matching problems. IEEE Trans. Pattern Anal. Mach. Intell. 10, 695–703 (1988).
 56.
Kuhn, H. W. The Hungarian method for the assignment problem. Nav. Res. Logist. 2, 83–97 (1955).
 57.
González, L., Mó, O. & Yáñez, M. Density functional theory study on ethanol dimers and cyclic ethanol trimers. J. Chem. Phys. 111, 3855–3861 (1999).
 58.
Durig, J. & Larsen, R. Torsional vibrations and barriers to internal rotation for ethanol and 2,2,2-trifluoroethanol. J. Mol. Struct. 238, 195–222 (1990).
 59.
Wassermann, T. N. & Suhm, M. A. Ethanol monomers and dimers revisited: a Raman study of conformational preferences and argon nanocoating effects. J. Phys. Chem. A 114, 8223–8233 (2010).
 60.
Durig, J., Bucy, W., Wurrey, C. & Carreira, L. Raman spectra of gases. XVI. Torsional transitions in ethanol and ethanethiol. J. Phys. Chem. A 79, 988–993 (1975).
 61.
Poggio, T. & Anselmi, F. Visual Cortex and Deep Networks: Learning Invariant Representations (MIT Press, Cambridge, MA, 2016).
 62.
Anselmi, F., Rosasco, L. & Poggio, T. On invariance and selectivity in representation learning. Inf. Inference 5, 134–158 (2016).
 63.
Bereau, T., DiStasio, R. A. Jr, Tkatchenko, A. & von Lilienfeld, O. A. Non-covalent interactions across organic and biological subsets of chemical space: physics-based potentials parametrized from machine learning. J. Chem. Phys. 148, 241706 (2018).
 64.
De Luna, P., Wei, J., Bengio, Y., AspuruGuzik, A. & Sargent, E. Use machine learning to find energy materials. Nature 552, 23 (2017).
 65.
Perdew, J. P., Burke, K. & Ernzerhof, M. Generalized gradient approximation made simple. Phys. Rev. Lett. 77, 3865–3868 (1996).
 66.
Tkatchenko, A. & Scheffler, M. Accurate molecular van der Waals interactions from ground-state electron density and free-atom reference data. Phys. Rev. Lett. 102, 073005 (2009).
 67.
Blum, V. et al. Ab initio molecular simulations with numeric atomcentered orbitals. Comput. Phys. Commun. 180, 2175–2196 (2009).
 68.
Parrish, R. M. et al. Psi4 1.1: an open-source electronic structure program emphasizing automation, advanced libraries, and interoperability. J. Chem. Theory Comput. 13, 3185–3197 (2017).
 69.
Ceriotti, M., More, J. & Manolopoulos, D. E. i-PI: a Python interface for ab initio path integral molecular dynamics simulations. Comput. Phys. Commun. 185, 1019–1026 (2014).
 70.
Case, D. et al. Amber 2018 (The Amber Project, 2018).
Acknowledgements
We thank Michael Gastegger for providing the AMBER force fields. S.C., A.T., and K.-R.M. thank the Deutsche Forschungsgemeinschaft (project MU 987/201) for funding this work. A.T. is funded by the European Research Council with ERC-CoG grant BeStMo. K.-R.M. gratefully acknowledges the BK21 program funded by the Korean National Research Foundation grant (no. 2012005741). Part of this research was performed while the authors were visiting the Institute for Pure and Applied Mathematics, which is supported by the NSF.
Author information
Affiliations
Machine Learning Group, Technische Universität Berlin, 10587, Berlin, Germany
 Stefan Chmiela
 & Klaus-Robert Müller
Fritz-Haber-Institut der Max-Planck-Gesellschaft, 14195, Berlin, Germany
 Huziel E. Sauceda
Department of Brain and Cognitive Engineering, Korea University, Anam-dong, Seongbuk-gu, Seoul, 136-713, Korea
 Klaus-Robert Müller
Max Planck Institute for Informatics, Stuhlsatzenhausweg, 66123, Saarbrücken, Germany
 Klaus-Robert Müller
Physics and Materials Science Research Unit, University of Luxembourg, L-1511, Luxembourg, Luxembourg
 Alexandre Tkatchenko
Contributions
S.C. conceived and constructed the sGDML models. S.C., H.E.S., A.T., and K.-R.M. developed the theory. H.E.S. and A.T. designed the analyses. H.E.S. performed the DFT and CCSD(T) calculations and MD simulations. S.C. and H.E.S. created the figures, with help from other authors. All authors wrote the paper, discussed the results and commented on the manuscript.
Competing interests
The authors declare no competing interests.
Corresponding authors
Correspondence to Klaus-Robert Müller or Alexandre Tkatchenko.
Electronic supplementary material
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.