SpookyNet: Learning force fields with electronic degrees of freedom and nonlocal effects

Machine-learned force fields combine the accuracy of ab initio methods with the efficiency of conventional force fields. However, current machine-learned force fields typically ignore electronic degrees of freedom, such as the total charge or spin state, and assume chemical locality, which is problematic when molecules have inconsistent electronic states, or when nonlocal effects play a significant role. This work introduces SpookyNet, a deep neural network for constructing machine-learned force fields with explicit treatment of electronic degrees of freedom and nonlocality, modeled via self-attention in a transformer architecture. Chemically meaningful inductive biases and analytical corrections built into the network architecture allow it to properly model physical limits. SpookyNet improves upon the current state-of-the-art (or achieves similar performance) on popular quantum chemistry data sets. Notably, it is able to generalize across chemical and conformational space and can leverage the learned chemical insights, e.g. by predicting unknown spin states, thus helping to close a further important remaining gap for today’s machine learning models in quantum chemistry.


Completeness of atomic descriptors in SpookyNet
Many ML algorithms for constructing potential energy surfaces make use of some sort of descriptor to represent atoms in their chemical environment. As long as this description is complete, any atom-centered property (including atomic decompositions of extensive properties such as the energy) can be predicted from the descriptors [2]. In this context, completeness means that structures which are not convertible into each other (by translations, rotations, or permutations of equivalent atoms) map to different descriptors.
A simple descriptor for the environment of an atom $i$ at position $\vec{r}_i$ consists of the set of distances $r_{ij} = \lVert\vec{r}_{ij}\rVert$ (with $\vec{r}_{ij} = \vec{r}_j - \vec{r}_i$) to neighboring atoms $j$, and the set of angles between all possible combinations of neighboring atoms $j$ and $k$. For environments consisting of multiple different species, separate sets of distances (and angles) are necessary for each element (or combination of elements). However, for simplicity, it is assumed here that all atoms are identical. A disadvantage of using angles in the descriptor is that their computation scales $\mathcal{O}(n^2)$ with the number of neighbors $n$, because all combinations must be considered. An alternative way to encode angular information, which scales $\mathcal{O}(n)$, is to replace the set of angles with invariants of the form

$$a_{i,l} = \sum_{m=-l}^{l} \left\lvert \sum_{j} Y_l^m\!\left(\frac{\vec{r}_{ij}}{r_{ij}}\right) \right\rvert^2 \,, \qquad (2)$$

derived from the angular power spectrum, where $Y_l^m$ are the spherical harmonics (see Eq. 17 in the main text). In the following, $a_{i,l}$ for $l = 0, 1, 2, 3, 4$ are called s, p, d, f, and g invariants because of their relation to the symmetries of atomic orbitals. The disadvantage here is that when using a finite number ($l = 0, \dots, L$) of power spectrum invariants as angular descriptor, some environments with different sets of angles may lead to the same descriptor. For example, square planar and tetrahedral environments have the same s and p invariants ($L = 1$), so it is necessary to include at least d invariants ($L = 2$) in the descriptor to differentiate them (see Fig. 1B).
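To make Eq. 2 concrete, the following Python sketch (purely illustrative, not SpookyNet's actual implementation; the function name and the unit-weight treatment of all neighbors are assumptions) computes the invariants for the square planar and tetrahedral environments of Fig. 1B, confirming that they agree up to $l = 1$ but differ at $l = 2$:

```python
# Minimal sketch of the angular power spectrum invariants of Eq. 2
# (illustrative only; SpookyNet's learned features are more elaborate).
import numpy as np
from scipy.special import sph_harm  # Y_l^m(azimuth, polar), complex-valued

def power_spectrum_invariants(r_ij, l_max=4):
    """a_{i,l} for l = 0..l_max from (n, 3) neighbor vectors; scales O(n)."""
    r_ij = np.asarray(r_ij, dtype=float)
    r = np.linalg.norm(r_ij, axis=1)
    polar = np.arccos(np.clip(r_ij[:, 2] / r, -1.0, 1.0))
    azimuth = np.arctan2(r_ij[:, 1], r_ij[:, 0])
    a = np.zeros(l_max + 1)
    for l in range(l_max + 1):
        for m in range(-l, l + 1):
            # sum over neighbors first, then take the squared modulus
            a[l] += np.abs(np.sum(sph_harm(m, l, azimuth, polar))) ** 2
    return a

# Fig. 1B: square planar vs. tetrahedral (all neighbor distances identical).
square_planar = [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0]]
tetrahedral = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]) / 3**0.5
print(power_spectrum_invariants(square_planar))  # s, p agree; d differs
print(power_spectrum_invariants(tetrahedral))
```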
There is a widespread belief in the literature that sets of distances and angles are sufficient for a complete description of atomic environments [3,4]. However, it was recently demonstrated that this is not the case, and even including the set of dihedrals between triplets of neighboring atoms does not lead to a complete description in general [1]. In this context, it is interesting to investigate the completeness of the atomic descriptors $\mathbf{f}$ (see Eq. 3 in the main text) learned by SpookyNet and to compare its ability to distinguish different structures with that of other popular approaches. For this purpose, five pairs of distinct atomic environments (shown in Fig. 1), with geometries that are particularly difficult to separate, are considered. Then, different models are trained to predict scalar labels of 1 (for one of the environments) and −1 (for the other environment) from the descriptors of the central atoms (blue) in each pair. It can be observed that models either learn to predict the labels with virtually zero error (up to numerical precision), i.e. the environments can be distinguished, or a value of 0 is predicted for both central atoms, i.e. their environments are mapped to the same descriptor and a compromise between the contradictory labels has to be found. The results are summarized in Table 1; models based on hand-crafted descriptors [10,11] are evaluated using the QML package [12]. All models were trained on multiple randomly rotated versions of the environments shown in Fig. 1. This was done to prevent models from picking up on differences due to floating point imprecision, which otherwise may make environments distinguishable even when their descriptors are degenerate (up to numerical noise).
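A minimal sketch of this distinguishability test could look as follows, using kernel ridge regression as a simple stand-in regressor (an assumption for illustration, not the actual models compared in Table 1): if two environments map to identical descriptors, any regressor must compromise between the labels +1 and −1 and predicts ≈ 0 for both.

```python
# Sketch of the +1/-1 distinguishability test, with kernel ridge regression
# as a simple stand-in regressor (not the actual models from Table 1).
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def distinguishable(descriptor_a, descriptor_b, atol=1e-2):
    """True if a regressor can separate the two descriptors with labels +1/-1.

    Degenerate descriptors force a compromise: both predictions collapse to ~0.
    """
    X = np.stack([descriptor_a, descriptor_b])
    y = np.array([1.0, -1.0])
    model = KernelRidge(kernel="rbf", alpha=1e-12).fit(X, y)
    return bool(np.allclose(model.predict(X), y, atol=atol))

# Example with the invariants from the previous snippet (truncated at L):
# distinguishable(a_sq[:2], a_td[:2]) -> False (s, p invariants degenerate)
# distinguishable(a_sq[:3], a_td[:3]) -> True  (d invariant differs)
```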
Most models based on hand-crafted descriptors can only distinguish environments when their sets of distances and angles (and in some cases dihedrals) differ. Message-passing neural networks (MPNNs), on the other hand, can learn to distinguish all environments shown in Fig. 1, provided that at least $T \geq 2$ message-passing steps are used. The amount of information that can be resolved with a single message-passing step ($T = 1$) is often related to the power spectrum invariants (see Eq. 2), and different MPNNs mainly differ in the maximum order $L$ which they can resolve in a single update (with the exception of DimeNet [6], which uses angles directly, but scales $\mathcal{O}(n^2)$ with the number of neighbors $n$). SpookyNet uses an update with a maximum order of $L = 2$, which is sufficient to differentiate most common chemical environments (as long as they are distinguishable by distances and angles). It would be possible to introduce higher-order interactions with the symmetry of f- or even g-orbitals into the update step (see Eq. 12 in the main text), so that additional environments (e.g. Fig. 1d) become distinguishable with a single update, but this increases the computational cost and is found to give little benefit (in terms of additional accuracy for predictions) in practice.

Fig. 1: Pairs of atomic environments that are difficult to distinguish. The power spectrum invariants (see Eq. 2) for different angular momenta $l = 1, \dots, 4$ (p, d, f, g) are given for each structure (the s invariant simply counts the number of neighbors and is therefore omitted). Since all distances to neighboring atoms are identical, descriptors need to be able to at least resolve angular information to distinguish the structures (a). However, for some structures, the power spectrum invariants may be degenerate for small values of $l$ (b and d). Some structures even have identical angular distributions, in which case the power spectrum invariants are equal for all $l = 0, \dots, \infty$ and information about dihedrals is necessary to distinguish the environments (c). Note that some environments cannot even be distinguished when information about dihedrals is included (e) [1].

To probe the practical consequences of descriptor (in)completeness, SpookyNet is trained on a data set of randomly distorted methane structures (see Ref. 1 for details). Due to the strongly distorted geometries, potential energies in this data set vary by ∼1400 kcal mol⁻¹ and forces by ∼10700 kcal mol⁻¹ Å⁻¹. Further, this way of sampling leads to many structures with (nearly) degenerate sets of angles (see Fig. 1c) and is thus particularly challenging to learn. For this task, it is to be expected that models relying on incomplete descriptors improve at a slower rate (and eventually cease to improve at all) when increasing the number of training data [1]. The performance of SpookyNet on this data set for different training set sizes is summarized in Table 2. With only 10 training points (∼0.00013% of the data), SpookyNet reaches prediction errors that correspond to a relative absolute error of just ∼1% (with respect to the energy range covered in the data set). Chemical accuracy (absolute errors < 1 kcal mol⁻¹) is reached with as few as 1000 training points. The learning curve (see Fig. 2) shows that the performance of SpookyNet improves steadily when more data is used for training, while being about two orders of magnitude more data-efficient than other methods. The increased data efficiency is largely due to a much lower y-axis intercept, which indicates a high target similarity of the learned descriptor [16].

Table 2: Performance of SpookyNet for different training set sizes on the random methane data set. Results are averaged over 16 ($n_\mathrm{train} \leq 100$) or 4 ($n_\mathrm{train} \geq 1000$) random splits and the standard deviation between runs is given in brackets.
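The relation between the learning-curve intercept and data efficiency can be made explicit with the common power-law ansatz err(N) ≈ a·N^b: on a log–log plot, log err = log a + b·log N, and at equal slope b, a lower intercept log a translates into a constant factor (a₁/a₂)^(−1/b) less training data for the same error. The snippet below illustrates this with synthetic numbers (not the actual benchmark values from Fig. 2):

```python
# Illustration of learning-curve fits err(N) = a * N**b on a log-log scale.
# The error values below are synthetic, chosen only to illustrate the point.
import numpy as np

n_train = np.array([10.0, 100.0, 1000.0, 10000.0])
err_ref = 50.0 * n_train**-0.75    # hypothetical reference method
err_new = 0.5 * n_train**-0.75     # lower intercept, same slope

b_ref, log_a_ref = np.polyfit(np.log(n_train), np.log(err_ref), 1)
b_new, log_a_new = np.polyfit(np.log(n_train), np.log(err_new), 1)

# Equal error err = a * N**b is reached at N = (err / a)**(1 / b), so the
# reference method needs (a_ref / a_new)**(-1 / b) times more data:
factor = np.exp(log_a_ref - log_a_new) ** (-1.0 / b_ref)
print(f"slope: {b_ref:.2f}, data-efficiency factor: {factor:.0f}x")
```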
Fig. 3: Local chemical potential for a random carbene chosen from the QMSpin database (the optimized geometries for the singlet and triplet states are shown). The chemical potential of a model without spin embeddings lacks features, whereas a model with spin embeddings learns a rich representation with significant differences between singlet and triplet states.

Conformer benchmark
As an additional test of extrapolation to larger molecules, the different models trained on QM7-X were applied to structures from the conformer benchmark introduced in Ref. 20. All structures that were non-neutral or contained elements other than H, C, N, O, S, or Cl were filtered out, because they are not covered by the QM7-X data set. In total, 5178 structures with an average of 23 non-hydrogen atoms are considered (the largest structure contains 48 non-hydrogen atoms). All models were evaluated using the metrics introduced in Ref. 20 and compared to reference data computed at the same level of theory as QM7-X (see Ref. 19 for details). Both SpookyNet and PaiNN predict relative energies with sub-kcal mol⁻¹ accuracy (see Table 5); however, all models systematically overpredict absolute energies for systems larger than those contained in QM7-X (SpookyNet: 0.92 kcal mol⁻¹ per atom, PaiNN: 1.12 kcal mol⁻¹ per atom, SchNet: 3.78 kcal mol⁻¹ per atom). As such, training on QM7-X alone is not sufficient if absolute energies of large structures are of interest.
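The element/charge filter described above could be implemented along the following lines (a hypothetical sketch assuming the benchmark structures are available as ASE Atoms objects with the total charge stored in their info dictionary; Ref. 20 may distribute the data differently):

```python
# Hypothetical filter for the conformer benchmark structures, assuming
# ASE Atoms objects with the total charge stored in atoms.info["charge"].
from ase import Atoms

QM7X_ELEMENTS = {"H", "C", "N", "O", "S", "Cl"}

def covered_by_qm7x(atoms: Atoms) -> bool:
    """Keep only neutral structures composed of elements present in QM7-X."""
    neutral = atoms.info.get("charge", 0) == 0
    elements_ok = set(atoms.get_chemical_symbols()) <= QM7X_ELEMENTS
    return neutral and elements_ok

# usage: benchmark = [a for a in all_structures if covered_by_qm7x(a)]
```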
In this context, it is also illuminating to compare wall-clock times. All DFT reference calculations were performed on 72-core Intel Xeon IceLake-SP processors and took on average 24 min to complete. In contrast, evaluating a single structure with SpookyNet on a 6-core Intel Core i7 takes 30 ms on average (speedup w.r.t. DFT > 10⁴). However, it is usually more efficient to evaluate multiple structures in parallel on a GPU. For example, with a batch size of 250 structures, evaluating a single structure on an NVIDIA A100 SXM4 40 GB GPU takes only 0.25 ms on average (speedup w.r.t. DFT > 10⁶).
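Per-structure latencies of this kind can be measured with a simple timing loop like the one below (a generic PyTorch sketch; `model` and `make_batch` are placeholders, not part of the SpookyNet code base):

```python
# Generic timing sketch for batched GPU inference (PyTorch); "model" and
# "make_batch" are placeholders, not part of the SpookyNet code base.
import time
import torch

@torch.no_grad()
def ms_per_structure(model, make_batch, batch_size, device="cuda", reps=10):
    """Average wall-clock time per structure when evaluating in batches."""
    batch = make_batch(batch_size).to(device)
    model(batch)                       # warm-up (CUDA kernels, caching)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(reps):
        model(batch)
    torch.cuda.synchronize()           # wait for all GPU work to finish
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / (reps * batch_size)
```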

Why the name SpookyNet?
In a famous letter to Max Born, Albert Einstein referred to the nonlocal nature of quantum systems as "spooky actions at a distance". SpookyNet also incorporates nonlocality in its architecture, for example by "spreading" (or rather delocalizing) electronic information over atoms and allowing nonlocal interactions between them, hence the name.