Introduction

Quantitative modeling of atomic-scale phenomena is central to scientific insights and technological innovations in many areas of physics, chemistry, and materials science. Solving the equations that govern quantum mechanics (QM), such as Schrödinger’s or Dirac’s equation, allows accurate calculation of the properties of molecules, clusters, bulk crystals, surfaces, and other polyatomic systems. For this, numerical simulations of the electronic structure of matter are used, with tremendous success in explaining observations and making quantitative predictions.

However, the high computational cost of these ab initio simulations (Supplementary Note 1) often limits investigations to between tens of thousands of small systems with a few dozen atoms and a few large systems with thousands of atoms, particularly for periodic structures. In contrast, the number of possible molecules and materials grows combinatorially with the number of atoms: 13 or fewer C, N, O, S, Cl atoms can form a billion possible molecules1, and for 5-component alloys, there are more than a billion possible compositions when choosing from 30 elements (Supplementary Note 2). This limits systematic computational study and exploration of molecular and materials spaces. Similar considerations hold for ab initio dynamics simulations, which are typically restricted to systems with a few hundred atoms and sub-nanosecond timescales.
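As an illustrative back-of-the-envelope check of the alloy estimate, the following Python sketch counts element choices and compositions; the 1 at.% composition grid used here is an assumption for illustration, not necessarily the convention of Supplementary Note 2.

```python
from math import comb

# Choose 5 distinct elements out of 30 candidates.
element_choices = comb(30, 5)                    # 142,506

# Compositions x1 + ... + x5 = 100 in integer percent, each component >= 1 at.%
# (stars-and-bars count; the 1 at.% grid is an illustrative assumption).
compositions_per_choice = comb(100 - 1, 5 - 1)   # 3,764,376

total = element_choices * compositions_per_choice
print(f"{total:.2e}")                            # ~5.4e11, i.e. well over a billion
```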

Such situations require many simulations of systems correlated in structure, implying a high degree of redundancy. Machine learning2,3 (ML) exploits this redundancy to interpolate between reference simulations4,5,6,7 (Fig. 1). This ansatz replaces most ab initio simulations by ML predictions, based on a small set of reference simulations. Effectively, it maps the problem of repeatedly solving a QM equation for many related systems onto a regression problem. This approach has been demonstrated in benchmark settings4,8,9 and applications5,10,11, with reported speed-ups between zero and six orders of magnitude12,13,14,15. It is currently regarded as a highly promising avenue towards extending the scope of ab initio methods.

Fig. 1: Sketch illustrating the interpolation of quantum-mechanical simulations by machine learning.

The horizontal axis represents chemical or materials space, the vertical axis the predicted property. Instead of conducting many computationally expensive ab initio simulations (solid line), machine learning (dashed line) interpolates between reference simulations (dots).

The most relevant aspect of ML models for interpolation of QM simulations (QM/ML models) after data quality (Supplementary Note 3) is the definition of suitable input features, that is, representations of atomistic systems. Representations define how systems relate to each other for regression and are the subject of this perspective.

Scope and structure

QM/ML models require a space in which interpolation takes place. Such spaces can be defined explicitly, often as vector spaces, or implicitly, for example, via a kernel function in kernel-based machine learning16,17. This work reviews and compares explicit Hilbert-space representations of finite and periodic polyatomic systems for accurate interpolation of QM observables via ML, focusing on representations that satisfy the requirements discussed in section “Requirements” and energy predictions.

This excludes features that do not encode all input information (atomic numbers and coordinates), for example, descriptors or fingerprints used in cheminformatics and materials informatics to interpolate between experimental outcomes18, as well as implicit representations learned by end-to-end deep neural networks19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37 or defined via direct kernels between systems38,39,40,41,42 (Supplementary Notes 4 and 10).

Characteristics and requirements of representations are discussed in the sections “Role and types of representations” and “Requirements” followed by a short description of a unified mathematical framework for representations (“A unified framework”). Specific representations are delineated (benchmarked ones in “Selected representations”, others in “Other representations”), qualitatively compared (“Analysis”), and empirically benchmarked (“Empirical comparison”). We conclude with an outlook on open problems and possible directions for future research in section “Conclusions and outlook”. See Table 1 for a glossary of covered representations and technical terms.

Table 1 Glossary.

Related work

Studies of QM/ML models often compare their performance estimates with those reported in the literature. While such comparisons have value, they carry considerable uncertainty due to differences in datasets, learning algorithms (including the choice of hyperparameters, HPs, that is, free parameters), sampling, validation procedures, and reported quantities. Accurate, reliable performance estimates require a systematic comparison that controls for the above factors, which we perform in this work.

Several recent studies systematically measured and compared prediction errors of representations (Table 2). We distinguish between studies that automatically (as opposed to manually) optimize numerical HPs of representations, for example, the width of a normal distribution; structural HPs of representations, for example, choice of basis functions; and HPs of the regression method, for example, regularization strength. Supplementary Note 5 discusses the individual studies from Table 2.

Table 2 Related work. See Supplementary Note 5 for details.

Role and types of representations

An N-atom system formally has 3N−6 degrees of freedom. Covering those with M samples per dimension requires \({M}^{3N-6}\) reference calculations, which is infeasible except for the smallest systems. How then is it possible to learn high-dimensional energy surfaces?

Part of the answer is that learning the whole energy surface is unnecessary, as configurations high in energy become exponentially unlikely—it is sufficient to learn low-energy regions. Another reason is that the regression space’s formal dimensionality is less important than the data distribution in this space. (Supplementary Note 6) Representations can have thousands of dimensions, but their effective dimensionality43 can be much lower if they are highly correlated. The role of representations is, therefore, to map atomistic systems to spaces amenable to regression. These spaces, together with the data’s distribution, determine the efficiency of learning.
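As an illustration of the gap between formal and effective dimensionality, the following sketch counts how many principal components retain most of the variance of a feature matrix; this is one simple proxy for illustration, not the specific measure of ref. 43.

```python
import numpy as np

def effective_dimensionality(X, variance_kept=0.99):
    """Number of principal components needed to retain the given fraction of
    total variance of the feature matrix X (n_samples x n_features)."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values of centred data
    var = s**2 / (len(X) - 1)                 # per-component variances
    ratio = np.cumsum(var) / var.sum()
    return int(np.searchsorted(ratio, variance_kept) + 1)

# Highly correlated features: 1000 nominal dimensions driven by 5 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
X = latent @ rng.normal(size=(5, 1000)) + 0.01 * rng.normal(size=(500, 1000))
print(effective_dimensionality(X))   # close to 5, despite 1000 formal dimensions
```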

We distinguish between local representations that describe parts of an atomistic system, such as atoms in their environment8,44, and global ones that describe the whole system. For global representations, represented systems are either finite, such as molecules and clusters, or periodic, such as bulk crystals and surfaces (Table 3).

Table 3 Types of representations.

Local representations are directly suitable for local properties, such as forces, nuclear magnetic resonance shifts, or core-level excitations45, which depend only on a finite-size environment of an atom. Extensive global properties (Supplementary Note 7) such as energies can be modeled with local representations via additive approximations, summing over atomic contributions (Supplementary Note 8). Since local representations require only finite support, it does not matter whether the surrounding system is finite or periodic. Global representations are suited for properties of the whole system, such as energy, band gap, or the polarizability tensor. Since periodic systems are infinitely large, global representations usually need to be designed for or adapted to these. Trade-offs between local and global representations are discussed in the section “Analysis”.

Interpolation has been used from the beginning to reduce the effort of numerically solving quantum problems. Early works employing ML techniques such as Tikhonov regularization and reproducing kernel Hilbert spaces in the late 1980s and throughout the 1990s were limited to small systems46,47,48,49. Representations for high-dimensional systems appeared a decade later8,9,50, underwent rapid development, and constitute an active area of research today. Table 4 presents an overview.

Table 4 Overview of representations.

Requirements

Figures of merit for QM/ML models include computational efficiency, predictive accuracy, and sample efficiency, that is, the number of reference simulations required to reach a given target accuracy. Imposing physical constraints on representations improves their sample efficiency by removing the need to learn these constraints from the training data. The demands of speed, accuracy, and sample efficiency give rise to specific requirements, some of which depend on the predicted property:

  1. (i)

    Invariance to transformations that preserve the predicted property, including (a) changes in atom indexing (input order, permutations of like atoms), and often (b) translations, (c) rotations, and (d) reflections. Predicting tensorial properties requires (e) covariance (equivariance) with rotations6,25,26,29,51,52,53,54. Dependence of the property on a global frame of reference, for example, due to the presence of a non-isotropic external field, can affect variance requirements.

  2. (ii)

    Uniqueness, that is, variance against all transformations that change the predicted property: Two systems that differ in property should be mapped to different representations. Systems with equal representation that differ in property introduce errors55,56,57: Because the ML model cannot distinguish them, it predicts the same value for both, resulting in at least one erroneous prediction. Uniqueness is necessary and sufficient for reconstruction, up to invariant transformations, of an atomistic system from its representation44,58.

  3. (iii)

    (a) Continuity, and ideally (b) differentiability, with respect to atomic coordinates.

    Discontinuities work against the regularity assumptions in ML models, which try to find the least complex function compatible with the training data. Intuitively, continuous functions require less training data than functions with jumps. Differentiable representations enable differentiable ML models. If available, reference gradients can further constrain the interpolation function ("force matching”), improving sample efficiency59,60,61.

  4. (iv)

Computational efficiency relative to the reference simulations. For an advantage over simulations alone (without ML), overall computational costs should be reduced by one or more orders of magnitude to justify the effort. The difference in cost between running reference simulations and computing representations usually dominates. (Supplementary Note 9) Therefore, the results of sufficiently cheaper simulations, for example, from a lower level of theory, can be used to construct representations62,63 or to predict properties at a higher level of theory (“Δ-learning”)63,64,65.

  5. (v)

    Structure of representations and the resulting data distribution should be suitable for regression. (Supplementary Notes 6 and 10) It is useful if feature vectors always have the same length66,67. Representations often have a Hilbert space structure, featuring an inner product, completeness, projections, and other advantages. Besides the formal space defined by the representation, the structure of the subspace spanned by the data is critical57,68. This requirement is currently less well understood than (i)–(iv) and evaluated mostly empirically (see section “Empirical comparison”).

  6. (vi)

Generality, in the sense of being able to encode any atomistic system. While current representations handle finite and periodic systems, less work has been done on charged systems, excited states, continuous spin systems, isotopes, and systems subjected to external fields.

Simplicity, both conceptually and in terms of implementation, is, in our opinion, a desirable quality of representations, albeit hard to quantify.

The above requirements preclude direct use of Cartesian coordinates, which violate requirement (i), and internal coordinates, which satisfy (i.b)–(i.d) but are still system-specific, violating (v) and possibly (i.a) if not defined uniquely. Descriptors and fingerprints from cheminformatics18 and materials informatics violate (ii) and (iii.a).
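A minimal numerical check illustrates why raw Cartesian coordinates violate requirement (i) while even a crude invariant feature does not; the sorted-distance feature used here is for illustration only (it is not unique in the sense of requirement (ii)).

```python
import numpy as np
from scipy.spatial.transform import Rotation

def sorted_distances(coords):
    """Toy permutation-, rotation-, and translation-invariant feature:
    the sorted list of all pairwise distances."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    iu = np.triu_indices(len(coords), k=1)
    return np.sort(d[iu])

rng = np.random.default_rng(1)
coords = rng.normal(size=(5, 3))
# Permute atom order, rotate, and translate the same structure.
transformed = (coords[rng.permutation(5)]
               @ Rotation.random(random_state=1).as_matrix().T
               + np.array([1.0, -2.0, 0.5]))

print(np.allclose(coords, transformed))                                       # False: raw coordinates change
print(np.allclose(sorted_distances(coords), sorted_distances(transformed)))   # True: the invariant feature does not
```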

Simple representations such as the Coulomb matrix (section “Other representations”) suffer either from coarse-graining, violating (ii), or from discontinuities, violating (iii.a). In practice, representations do not satisfy all requirements exactly (section “Analysis”) but can achieve high predictive accuracy regardless; for example, for some datasets, modeling only a fraction of the higher-order terms can already be sufficiently unique69. The optimal interaction orders to utilize in a representation also depend on the type and amount of data available42.

A unified framework

Based on recent work6,70,71,72 we describe concepts and notation towards a unified treatment of representations in order to highlight their common foundation. For this, we successively build up Hilbert spaces of atoms, k-atom tuples, local environments, and global structures, using group averaging to ensure physical invariants and tensor products to retain desired information and construct invariant features.

Representing atoms, environments, and systems

Information about a single atom, such as position and proton number, is represented as an abstract ket \(\left|\alpha \right\rangle\) in a Hilbert space \({{{{\mathcal{H}}}}}_{\alpha }\). Relations between k atoms, where their order can matter, are encoded as k-body functions \({g}_{k}:{{{{\mathcal{H}}}}}_{\alpha }^{\times k}\to {{{{\mathcal{H}}}}}_{g}\). (Supplementary Note 11) These functions can be purely geometric, such as distances or angles, but could also be of (al)chemical or mixed nature. Tuples of atoms and associated many-body properties are thus elementary tensors of a space \({{{\mathcal{H}}}}\equiv {{{{\mathcal{H}}}}}_{\alpha }^{\otimes k}\otimes {{{{\mathcal{H}}}}}_{g}\),

$$\left|{{{{\mathcal{A}}}}}_{{\alpha }_{1}...{\alpha }_{k}}\right\rangle \equiv \left|{\alpha }_{1}\right\rangle \otimes ...\otimes \left|{\alpha }_{k}\right\rangle \otimes {g}_{k}(\left|{\alpha }_{1}\right\rangle ,...,\left|{\alpha }_{k}\right\rangle ).$$
(1)

A local environment of an atom \(\left|\alpha \right\rangle\) is represented via the relations to its k−1 neighbors by keeping \(\left|\alpha \right\rangle\) fixed:

$$\left|{{{{\mathcal{A}}}}}_{\alpha }\right\rangle \equiv \mathop{\sum}\limits_{{\alpha }_{1},\ldots ,{\alpha }_{k-1}}\left|{{{{\mathcal{A}}}}}_{\alpha ,{\alpha }_{1},\ldots ,{\alpha }_{k-1}}\right\rangle .$$
(2)

Weighting functions can reduce the influence of atoms far from \(\left|\alpha \right\rangle\); we include these in gk. An atomistic system as a whole is represented by summing over the local environments of all its atoms:

$$\left|{{{\mathcal{A}}}}\right\rangle =\mathop{\sum}\limits_{{\alpha }_{i}}\left|{{{{\mathcal{A}}}}}_{{\alpha }_{i}}\right\rangle =\mathop{\sum}\limits_{{\alpha }_{1},\ldots ,{\alpha }_{k}}\left|{{{{\mathcal{A}}}}}_{{\alpha }_{1},\ldots ,{\alpha }_{k}}\right\rangle .$$
(3)

For periodic systems, this sum diverges, which requires either exploiting periodicity, for example, by working in reciprocal space, or employing strong weighting functions and keeping one index constrained to the unit cell73.

Symmetries, tensor products, and projections

Representations incorporate symmetry constraints (section “Requirements”) by using invariant many-body functions gk, such as distances or angles, or through explicit symmetrization via group averaging70. Explicit symmetrization transforms a tensor \(\left|T\right\rangle\) by integrating over a symmetry group \({{{\mathcal{S}}}}\) with right-invariant Haar measure dS,

$${\left|T\right\rangle }_{{{{\mathcal{S}}}}}\equiv \int _{{{{\mathcal{S}}}}}S\left|T\right\rangle {\rm{d}}S,$$
(4)

where symmetry transformations \(S\in {{{\mathcal{S}}}}\) act separately on each subspace of \({{{\mathcal{H}}}}\) or parts thereof. For example, for rotational invariance, only the atomic positions in \({{{{\mathcal{H}}}}}_{\alpha }\) change. Rotationally invariant features can be derived from tensor contractions74, as any full contraction of contravariant with covariant tensors yields rotationally invariant scalars75.

Sometimes group averaging can integrate out desired information encoded in \(\left|T\right\rangle\). To counter this, one can perform tensor products of \(\left|T\right\rangle\) with itself, effectively replacing \({{{\mathcal{H}}}}\) by \({{{{\mathcal{H}}}}}^{\otimes \nu }\). Together, this results in a generalized transform

$${\left|{T}^{\nu }\right\rangle }_{{{{\mathcal{S}}}}}\equiv \int _{{{{\mathcal{S}}}}}{(S\left|T\right\rangle )}^{\otimes \nu }{\rm{d}}S.$$
(5)
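The Haar integral in Eq. (5) can be approximated numerically by averaging over randomly sampled rotations. The sketch below does this for a single vector-valued feature with ν = 2 and checks that the result is (approximately) unchanged for a rotated input; production representations such as SOAP evaluate the integral analytically instead.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def averaged_tensor_power(t, nu=2, n_samples=20000):
    """Monte Carlo estimate of Eq. (5) for a single 3-vector feature t:
    average the nu-fold tensor power of R @ t over random rotations R."""
    acc = 0.0
    for R in Rotation.random(n_samples).as_matrix():
        x = R @ t
        power = x
        for _ in range(nu - 1):
            power = np.tensordot(power, x, axes=0)   # build x ⊗ x ⊗ ...
        acc = acc + power
    return acc / n_samples

t = np.array([1.0, 2.0, 3.0])
R0 = Rotation.random(random_state=0).as_matrix()
a, b = averaged_tensor_power(t), averaged_tensor_power(R0 @ t)
print(np.abs(a - b).max())   # small: the averaged tensor is rotation invariant up to sampling noise
```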

To retain only part of the information in \({{{\mathcal{A}}}}\), one can project onto orthogonal elements \({\{\left|{h}_{l}\right\rangle \}}_{l = 1}^{m}\) of \({{{\mathcal{H}}}}\) via an associated projection operator \({{{\mathcal{P}}}}={\sum }_{l}\left|{h}_{l}\right\rangle \left\langle {h}_{l}\right|\). Inner products and induced distances between representations are then given by

$$\left\langle {{{\mathcal{A}}}}\right|{{{\mathcal{P}}}}\left|{{{\mathcal{A}}}}^{\prime} \right\rangle \ \,{{\mbox{and}}}\,\ {d}_{{{{\mathcal{P}}}}}(\left|{{{\mathcal{A}}}}\right\rangle ,\left|{{{\mathcal{A}}}}^{\prime} \right\rangle )={\left\Vert {{{\mathcal{P}}}}\left|{{{\mathcal{A}}}}\right\rangle -{{{\mathcal{P}}}}\left|{{{\mathcal{A}}}}^{\prime} \right\rangle \right\Vert }_{{{{\mathcal{H}}}}}.$$
(6)

Selected representations

We discuss three representations that fulfill the requirements in section “Requirements” and for which an implementation not tied to a specific regression algorithm and supporting finite and periodic systems was openly available. These representations are empirically compared in section “Empirical comparison”.

Symmetry functions

Symmetry functions8,66 (SFs) describe k-body relations between a central atom and the atoms in a local environment around it. (Supplementary Notes 11 and 12) They are typically based on distances (radial SFs, k = 2) and angles (angular SFs, k = 3). Each SF encodes a local feature of an atomic environment, for example, the number of H atoms at a given distance from a central C atom.

For each SF and k-tuple of chemical elements, contributions are summed. Sufficient resolution is achieved by varying the HPs of an SF. For continuity (and differentiability), a cut-off function ensures that SFs decay to zero at the cut-off radius. Two examples of SFs from ref. 66 (see Table 4 and Supplementary Note 22 for further references and SFs) are

$$\begin{array}{l}{G}_{i}^{2}=\mathop{\sum}\limits_{j}\exp \left(-\eta {({d}_{ij}-\mu )}^{2}\right){f}_{c}({d}_{ij})\\ {G}_{i}^{4}=\,{2}^{1-\zeta }\mathop{\sum}\limits_{j,k\ne i}{(1+\lambda \cos {\theta }_{ijk})}^{\zeta }\ \cdot \\ \exp \left(-\eta ({d}_{ij}^{2}+{d}_{ik}^{2}+{d}_{jk}^{2})\right){f}_{c}({d}_{ij})\,{f}_{c}({d}_{ik})\,{f}_{c}({d}_{jk})\end{array}$$
(7)

where η, μ, ζ, λ are numerical HPs controlling radial broadening, shift, angular resolution, and angular direction, respectively, dij is a distance, θijk is the angle between atoms i, j, k, and fc is a cut-off function. Figure 2 illustrates the radial SFs in Eq. (7). The choice of which SFs to use is a structural HP. Variants of SFs include partial radial distribution functions76, SFs with improved angular resolution77 and reparametrizations for improved scaling with the number of chemical species78,79,80.

Fig. 2: Symmetry functions.

Shown are radial functions \({G}_{i}^{2}(\mu ,\eta )\) (Eq. (7)) for increasing values of μ. The local environment of a central atom is described by summing contributions from neighboring atoms separately by element.

In terms of the unified notation, SFs use invariant functions gk based on distances and angles, multiplied by a cut-off function, to describe local environments \(\left|{{{{\mathcal{A}}}}}_{\alpha }\right\rangle\). Projections \({{{\mathcal{P}}}}\) onto tuples of atomic numbers Z then separate contributions from different combinations of chemical elements. For instance, for \({G}_{i}^{2}\) in Eq. (7), the representation of atom i is

$$\left|{{{{\mathcal{A}}}}}_{i},(\mu ,\eta )\right\rangle =\mathop{\sum}\limits_{j}\left(\left|{Z}_{i}\right\rangle \otimes \left|{Z}_{j}\right\rangle \right){G}^{2}({d}_{ij},\mu ,\eta ).$$
(8)

Here, \({G}^{2}({d}_{ij},\mu ,\eta )=\exp \left(-\eta {({d}_{ij}-\mu )}^{2}\right){f}_{c}({d}_{ij})\).
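A minimal sketch of the radial symmetry function \({G}_{i}^{2}\) of Eq. (7), using a commonly used cosine cut-off function; parameter values and the methane-like geometry are illustrative only.

```python
import numpy as np

def cutoff(d, r_cut):
    """Cosine cut-off: decays smoothly to zero at r_cut (continuity, requirement (iii))."""
    return np.where(d <= r_cut, 0.5 * (np.cos(np.pi * d / r_cut) + 1.0), 0.0)

def g2(coords, numbers, i, neighbor_element, eta, mu, r_cut):
    """Radial symmetry function G_i^2 of Eq. (7): Gaussian-weighted count of
    neighbors of a given element at distance ~mu from central atom i."""
    d = np.linalg.norm(coords - coords[i], axis=1)
    mask = (np.asarray(numbers) == neighbor_element) & (np.arange(len(coords)) != i)
    dij = d[mask]
    return np.sum(np.exp(-eta * (dij - mu) ** 2) * cutoff(dij, r_cut))

# Toy example: methane-like geometry (one C at the origin, four H around it; coordinates illustrative).
coords = np.array([[0, 0, 0], [0.63, 0.63, 0.63], [-0.63, -0.63, 0.63],
                   [-0.63, 0.63, -0.63], [0.63, -0.63, -0.63]], dtype=float)
numbers = np.array([6, 1, 1, 1, 1])
# Feature vector for the C atom: shifts mu probe different neighbor shells (values illustrative).
features = [g2(coords, numbers, 0, neighbor_element=1, eta=4.0, mu=m, r_cut=6.0)
            for m in (0.8, 1.1, 1.4, 1.7)]
print(np.round(features, 3))
```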

Many-body tensor representation

The global many-body tensor representation73 (MBTR) consists of broadened distributions of k-body terms, arranged by element combination. For each k-body function and k-tuple of elements, all corresponding terms (for example, all distances between C and H atoms) are broadened and summed up (Fig. 3). The resulting distributions describe the geometric features of an atomistic system:

$${f}_{k}(x,{z}_{1},\ldots ,{z}_{k})=\mathop{\sum}\limits_{{i}_{1},\ldots ,{i}_{k}}{w}_{k}\ {{{\mathcal{N}}}}(x| {g}_{k},\sigma )\mathop{\prod }\limits_{j=1}^{k}{\delta }_{{z}_{j},{Z}_{{i}_{j}}}\,,$$
(9)

where wk is a weighting function that reduces the influence of tuples with atoms far from each other, and gk is a k-body function; both wk and gk depend on atoms i1, …, ik. \({{{\mathcal{N}}}}(x| \mu ,\sigma )\) denotes a normal distribution with mean μ and variance \({\sigma }^{2}\), evaluated at x. The product of Kronecker δ-functions restricts to the given element combination z1, …, zk.

Fig. 3: Many-body tensor representation.

Shown are broadened distances (no weighting) arranged by element combination.

Periodic systems can be treated by using strong weighting functions and constraining one index to the unit cell. In practice, Eq. (9) can be discretized. Structural HPs include the choice of wk and gk; numerical HPs include the broadening width σ of the normal distributions. Requiring one atom in each tuple to be the central atom results in a local variant81.

In terms of the unified notation, MBTR uses distribution-valued functions gk, including weighting, with distributions centered on k-body terms such as (inverse) distances or angles. The outer-product structure of \(\left|{{{\mathcal{A}}}}\right\rangle\) corresponds to the product of δ-functions in Eq. (9), which selects for specific k-tuples of chemical elements. For k = 2, for example, the geometry and weighting functions depend on pairwise distances dij:

$$\begin{array}{rlr}\left|{{{\mathcal{A}}}},x\right\rangle \,&=\mathop{\sum}\limits_{i}\left|{{{{\mathcal{A}}}}}_{i},x\right\rangle &\\ \left|{{{{\mathcal{A}}}}}_{i},x\right\rangle &\propto \mathop{\sum}\limits_{j}\left(\left|{Z}_{i}\right\rangle \otimes \left|{Z}_{j}\right\rangle \right){G}_{i}^{2}\left({g}_{2}\left({d}_{ij}\right),x,\frac{1}{2}{\sigma }^{-2}\right)\,.\end{array}$$
(10)
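A minimal sketch of the discretized k = 2 term of Eq. (9) for a single element pair; the exponential weighting, hyperparameter values, and geometry are illustrative only, and production implementations provide the full representation.

```python
import numpy as np

def mbtr_k2(coords, numbers, pair, grid, sigma, decay):
    """Discretized k = 2 term of Eq. (9) for one element pair (z1, z2):
    broadened, weighted distribution of the corresponding pairwise distances."""
    z1, z2 = pair
    f = np.zeros_like(grid)
    for i in range(len(coords)):
        for j in range(len(coords)):
            if i == j or numbers[i] != z1 or numbers[j] != z2:
                continue
            d = np.linalg.norm(coords[i] - coords[j])
            w = np.exp(-decay * d)  # illustrative exponential weighting w_k
            f += w * np.exp(-0.5 * ((grid - d) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return f

# Toy example: C-H distance distribution of a methane-like geometry (coordinates illustrative).
coords = np.array([[0, 0, 0], [0.63, 0.63, 0.63], [-0.63, -0.63, 0.63],
                   [-0.63, 0.63, -0.63], [0.63, -0.63, -0.63]], dtype=float)
numbers = np.array([6, 1, 1, 1, 1])
grid = np.linspace(0.5, 3.0, 200)
f = mbtr_k2(coords, numbers, pair=(6, 1), grid=grid, sigma=0.05, decay=0.5)
print(grid[f.argmax()])  # peak near the C-H distance of ~1.09 in the units of coords
```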

Smooth overlap of atomic positions

Smooth overlap of atomic positions44 (SOAP) representations expand a central atom’s local neighborhood density, a scalar function of position r, approximated by Gaussian functions located at atom positions, in orthogonal radial and spherical harmonics basis functions (Fig. 4):

$$\rho ({\mathbf{r}})=\mathop{\sum}\limits_{n,l,m}{c}_{nlm}\,{g}_{n}({\mathbf{r}})\,{Y}_{lm}({\mathbf{r}}),$$
(11)

where cnlm are expansion coefficients, gn are radial, and Ylm are (angular) spherical harmonics basis functions. From the coefficients, rotationally invariant quantities can be constructed, such as the power spectrum

$${p}_{nn^{\prime} l}=\mathop{\sum}\limits_{m}{c}_{nlm}\,{c}_{n^{\prime} lm}^{* }$$
(12)

which is equivalent to a radial and angular distribution function15, and therefore captures up to three-body interactions. Numerical HPs are the maximal number of radial and angular basis functions, the broadening width, and the cut-off radius.
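Given expansion coefficients \({c}_{nlm}\) from Eq. (11), the power spectrum of Eq. (12) is a direct contraction over m. A minimal sketch, assuming the coefficients are already available as a complex array indexed by n, l, m:

```python
import numpy as np

def power_spectrum(c):
    """Rotationally invariant power spectrum p_{n n' l} (Eq. (12)) from expansion
    coefficients stored as c[n, l, m] with shape (n_max, l_max+1, 2*l_max+1).
    For a real neighborhood density the result is real; here it is kept complex."""
    n_max, l_count, _ = c.shape
    p = np.zeros((n_max, n_max, l_count), dtype=complex)
    for l in range(l_count):
        c_l = c[:, l, : 2 * l + 1]       # the 2l+1 coefficients with |m| <= l
        p[:, :, l] = c_l @ c_l.conj().T  # sum over m of c_{nlm} c*_{n'lm}
    return p

# Stand-in coefficients (in practice obtained by projecting the neighborhood density).
rng = np.random.default_rng(0)
c = rng.normal(size=(4, 3, 5)) + 1j * rng.normal(size=(4, 3, 5))
print(power_spectrum(c).shape)   # (4, 4, 3): one invariant per (n, n', l)
```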

Fig. 4: Smooth overlap of atomic positions.

The local density around a central atom is modeled by atom-centered normal distributions and expanded into radial and spherical harmonics basis functions.

An alternative to the power spectrum is the bispectrum9 (BS), a set of invariants that couples multiple angular momentum and radial channels. The Spectral Neighbor Analysis Potential (SNAP) includes quadratic terms in the BS components82. Extensions of the SOAP framework include recursion relations for faster evaluation83 and alternative radial basis functions gn, such as third- and higher-order polynomials83, Gaussian functions84, and spherical Bessel functions of the first kind58,85.

In terms of the unified notation, SOAP uses vector-valued gk to compute the basis set coefficients in Eq. (11). Analytic group-averaging (symmetry integration) then results in invariant features such as the power spectrum (ν = 2, Eq. (5)) or bispectrum (ν = 3). The SOAP (ν = 2) representation is therefore

$$\left|{{{{\mathcal{A}}}}}_{i},nn^{\prime} l\right\rangle =\mathop{\sum}\limits_{j}\left(\left|{Z}_{i}\right\rangle \otimes \left|{Z}_{j}\right\rangle \right){p}_{nn^{\prime} l}\,.$$
(13)

Other representations

Many other representations have been proposed.

The Coulomb matrix4 (CM) globally describes a system via inverse distances between atoms but does not contain higher-order terms. It is fast to compute, easy to implement, and in the commonly used sorted version (see footnote reference 25 in ref. 4) allows reconstruction of an atomistic system via a least-squares problem. However, its direct use of atomic numbers to encode elements is problematic, and it suffers either from discontinuities in the sorted version or from information loss in the diagonalized version as its eigenspectrum is not unique55,86. A local variant exists87.
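A minimal sketch of the sorted CM of ref. 4: \(0.5\,{Z}_{i}^{2.4}\) on the diagonal, \({Z}_{i}{Z}_{j}/{\left\Vert {{\bf{r}}}_{i}-{{\bf{r}}}_{j}\right\Vert }_{2}\) off the diagonal, with rows and columns ordered by row norm for permutation invariance; the geometry below is illustrative.

```python
import numpy as np

def coulomb_matrix(coords, numbers):
    """Sorted Coulomb matrix: 0.5*Z_i^2.4 on the diagonal, Z_i*Z_j/|r_i - r_j|
    off the diagonal, rows/columns ordered by decreasing row norm."""
    z = np.asarray(numbers, dtype=float)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    with np.errstate(divide="ignore"):
        m = np.outer(z, z) / d
    np.fill_diagonal(m, 0.5 * z ** 2.4)
    order = np.argsort(-np.linalg.norm(m, axis=1))  # sorting gives permutation invariance
    return m[order][:, order]

# Toy example: a methane-like geometry (coordinates illustrative).
coords = np.array([[0, 0, 0], [0.63, 0.63, 0.63], [-0.63, -0.63, 0.63],
                   [-0.63, 0.63, -0.63], [0.63, -0.63, -0.63]], dtype=float)
numbers = [6, 1, 1, 1, 1]
print(coulomb_matrix(coords, numbers).round(2))
```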

The bag-of-bonds88 (BoB) representation uses the same inverse-distance terms as the CM but arranges them by element pair instead of by atom pair. The “BA-representation”89 extends this to higher-order interactions by using bags of dressed atoms, distances, angles, and torsions. The inverse-distance many-body representation90 (IDMBR) employs higher powers of inverse distances and separation by element combinations.

Histograms of distances, angles, and dihedral angles91 (HDAD) are histograms of geometric features organized by element combination. This global representation is similar to MBTR but typically uses fewer bins, without broadening or explicit weighting.

The Faber-Christensen-Huang-von Lilienfeld representation92,93 (FCHL) describes atomic environments with normal distributions over row and column in the periodic table (k = 1), interatomic distances (k = 2), and angles (k = 3), scaled by power laws. In the FCHL18 variant92, the full continuous distributions are used, requiring an integral kernel for regression. Among other optimizations, FCHL1993 discretizes these distributions, similar to the approach taken by SFs, and can be used with standard vector kernels.

Wavelet scattering transforms94,95,96,97,98,99,100 (WST) use a convolutional wavelet frame representation to describe variations of (local) atomic density at different scales and orientations. Integrating non-linear functions of the wavelet coefficients yields invariant features, where second- and higher-order features couple two or more length scales. Variations use different wavelets (Morlet94,95, solid harmonic, or atomic orbital96,97,98,100) and radial basis functions (exponential96, Laguerre polynomials97,100).

Moment-tensor potentials74 (MTP) describe local atomic environments using a spanning set of efficiently computable, rotationally and permutationally invariant polynomials derived from tensor contractions. Related representations include Gaussian moments75 (GM), based on contractions of tensors from (linear combinations of) Gaussian-type atomic orbitals; the N-body iterative contraction of equivariants (NICE) framework71, which uses recursion relations to compute higher-order terms efficiently; atomic cluster expansion53,101,102 (ACE), which employs a basis of isometry- and permutation-invariant polynomials from trigonometric functions and spherical harmonics; and moment invariants as (local) atomic descriptors (MILAD), which are non-redundant invariants constructed from Zernike polynomials.

Overlap-matrix fingerprints62,103,104 (OMF) and related approaches30,35 employ the sorted eigenvalues (and derived quantities) of overlap matrices based on Gaussian-type orbitals as representation. Eigenvalue crossings can cause derivative discontinuities, requiring post-processing104 to ensure continuity. Using a molecular orbital basis (MOB63,105 and related approaches36) adds the cost of computing the basis, for example, localized molecular orbitals via a Hartree–Fock self-consistent field calculation. Other matrices can be used, such as Fock, Coulomb, and exchange matrices, or even the Hessian, for example, from a computationally cheaper reference method. Density-encoded canonically-aligned fingerprints106 (DECAF) represent the local density in a canonical, invariant coordinate frame found by solving an optimization problem related to kernel principal component analysis.

Tensor properties require covariance (equivariance). Proposed solutions include local coordinates from eigendecompositions45, which exhibit discontinuities when eigenvalues cross, related local coordinate systems106, and internal vectors107 (IV), based on inner products of summed neighbor vectors at different scales, as well as covariant extensions of SOAP6,52 and ACE53.

Analysis

We discuss relationships between specific representations, to which degree they satisfy the requirements in section “Requirements”, trade-offs between local and global representations, and relationships to other models and modeling techniques, including systematic selection and generation of features.

Relationships between representations

Most representations in sections “Selected representations” and “Other representations” are related through the concepts in section “A unified framework”. We distinguish two primary strategies to deal with invariances, the use of invariant k-body functions (BoB, CM, FCHL, HDAD, IDMBR, MBTR, SF) and explicit symmetrization (ACE, BS, GM, MILAD, MOB, MTP, NICE, OMF, SOAP, WST). A similar distinction can be made for kernels40. Some representations share specific connections:

Comparing Eqs. (8) and (10) reveals that for suitable choices of hyperparameters, SFs can be identified with the local terms of distance-based MBTR, as both can be seen as histograms of geometric features, similar to HDAD. This suggests a local MBTR or HDAD variant by restricting summation to atomic environments81, and a global variant of SFs by summing over the whole system.

ACE, BS, GM, MILAD, MTP, NICE, and SOAP share the idea of generating tensors that are then systematically contracted to obtain rotationally invariant features. These tensors should form an orthonormal basis, or at least a spanning set, for atomic environments. Formally, expressing a local neighborhood density in a suitable basis before generating derived features avoids asymptotic scaling with the number of neighboring atoms101, although HPs, and thus runtime, still depend on it. Within a representation, recursive relationships can exist between many-body terms of different orders71,83,102. References 53,101,102 discuss technical details of the relationships between ACE and SFs, BS, SNAP, SOAP, MTP.

Requirements

Some representations, in particular early ones such as the CM, do not fulfill all requirements in section “Requirements”. Most representations fulfill some requirements only in the limit, that is, absent practical constraints such as truncation of infinite sums, short cut-off radii, and restriction to low-order interaction terms. The degree of fulfillment often depends on HPs, such as truncation order, the cut-off radius, or the highest interaction order k used. Effects can be antagonistic; for example, in Eq. (11), both (ii) uniqueness and (iv) computational effort increase with n, l, m44. In addition, not all invariances of a property might be known, or they might require additional effort to model, for example, symmetries51.

Mathematical proof or systematic empirical verification that a representation satisfies a requirement or related property is sometimes provided: The symmetrized invariant moment polynomials of MTPs form a spanning set for all permutationally and rotationally invariant polynomials74; basis sets can also be constructed102. For SOAP, systematic reconstruction experiments demonstrate the dependence of uniqueness on parametrization44.

While (ii) uniqueness guarantees that reconstruction of a system up to invariances is possible in principle, accuracy and complexity of this task vary with representation and parametrization. For example, reconstruction is a simple least-squares problem for the global CM as it comprises the whole distance matrix \({D}_{ij}={\left\Vert {{\bf{r}}}_{i}-{{\bf{r}}}_{j}\right\Vert }_{2}\), whereas for local representations, (global) reconstruction is more involved.

If a local representation comprises only up to 4-body terms then there are degenerate environments that it cannot distinguish57, but that can differ in property. Combining representations of different environments in a system can break the degeneracy. However, by distorting feature space (v) structure, these degeneracies degrade learning efficiency and limit achievable prediction errors, even if the training set contains no degenerate systems57. It is currently unknown whether degenerate environments exist for representations with terms of order k > 4. The degree to which a representation is unique can be numerically investigated through the eigendecomposition of a sensitivity matrix based on a representation’s derivatives with respect to atom coordinates104.

Global versus local representations

Local representations can be used to model global properties by assuming that these decompose into atomic contributions. In terms of prediction errors, this tends to work well for energies. (Supplementary Note 7) Learning with atomic contributions adds technical complexity to the regression model and is equivalent to pairwise-sum kernels on whole systems, (Supplementary Note 8) with favorable computational scaling for large systems (see Supplementary Notes 9 and 27, and Table 5). Other approaches to creating global kernels from local ones exist108.
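The equivalence can be made concrete: modeling an extensive property as a sum of per-atom contributions, each predicted with the same kernel on local representations, induces a kernel between whole systems that sums the local kernel over all pairs of atoms. A minimal sketch, assuming a Gaussian kernel on fixed-length local feature vectors:

```python
import numpy as np

def local_kernel(x, y, length_scale=1.0):
    """Gaussian kernel between two local (per-atom) feature vectors."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * length_scale ** 2))

def system_kernel(A, B, length_scale=1.0):
    """Pairwise-sum kernel between two systems, each given as an array of per-atom
    feature vectors (n_atoms x n_features): K(A, B) = sum_ij k(a_i, b_j).
    This is the kernel induced by additive atomic-contribution models."""
    return sum(local_kernel(a, b, length_scale) for a in A for b in B)

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 8))   # 5 atoms, 8-dimensional local representation each
B = rng.normal(size=(7, 8))   # 7 atoms
print(system_kernel(A, B))
```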

Table 5 Computational cost of calculating representations.

Conversely, using global representations for local properties can require modifying the representation to incorporate locality and directionality of the property45,84. A general recipe for constructing local representations from global ones is to require interactions to include the central atom, starting from k = 2 (ref. 81).

Relationships to other models and techniques

Two modeling aspects directly related to representations are which subset of the features to use and the construction of derived features. Both modulate feature space dimensionality and (v) structure. Adding products of 2-body and 3-body terms as features, for example, can improve performance69, as these features relate to higher-order terms, (Supplementary Note 11) but can also degrade performance if the features are unrelated to the predicted property, or if there is insufficient data to infer the relationship. Feature selection tailors a representation to a dataset by selecting a small subset of features that still predict the target property accurately enough. Optimal choices of features depend on the data’s size and distribution.

In this work, we focus exclusively on representations. In kernel regression, however, kernels can be defined directly between two systems, without an explicit intermediate representation. For example, n-body kernels between atomic environments can be systematically constructed from a non-invariant Gaussian kernel using Haar integration, or using invariant k-body functions (Supplementary Note 11), yielding kernels of varying body-order and degrees of freedom40,42. Similarly, while neural networks can use representations as inputs, their architecture can also be designed to learn implicit representations from the raw data (end-to-end learning). In all cases, the requirements in section “Requirements” apply.

Empirical comparison

We benchmark prediction errors for all representations from section “Selected representations” on three benchmark datasets. Since our focus is exclusively on the representations, we control for other factors, in particular for data distribution, regression method, and HP optimization.

Datasets

The qm9 consensus benchmarking dataset109,110 comprises 133,885 organic molecules composed of H, C, N, O, F with up to 9 non-H atoms. (Supplementary Note 13) Ground state geometries and properties are given at the DFT/B3LYP/6-31G(2df,p) level of theory. We predict U0, the atomization energy at 0 K.

The ba10 dataset110,111 (Supplementary Note 14) contains the ten binary alloys AgCu, AlFe, AlMg, AlNi, AlTi, CoNi, CuFe, CuNi, FeV, and NbNi. For each alloy system, it comprises all structures with up to 8 atoms for face-centered cubic (FCC), body-centered cubic (BCC), and hexagonal close-packed (HCP) crystal types, 15,950 structures in total. Formation energies of unrelaxed structures are given at the DFT/PBE level of theory.

The nmd18 challenge112 dataset113 (Supplementary Note 15) contains 3000 ternary \({({{\rm{Al}}}_{x}{{\rm{Ga}}}_{y}{{\rm{In}}}_{z})}_{2}{{\rm{O}}}_{3}\) oxides, x + y + z = 1, of potential interest as transparent conducting oxides. Formation and band-gap energies of relaxed structures are provided at the DFT/PBE level of theory. The dataset contains both relaxed (nmd18r, used here) and approximate (nmd18u) structures as input. In the challenge, energies of relaxed structures were predicted from approximate structures.

Together, these datasets cover finite and periodic systems, organic and inorganic chemistry, and ground state as well as off-equilibrium structures. See Supplementary Notes 13–15 for details.

Benchmarking method

We estimate prediction errors as a function of training set size (learning curves, Supplementary Notes 16 and 17). To ensure that subsets are representative, we control for the distribution of elemental composition, size, and energy. (Supplementary Note 18) This reduces the variance of performance estimates and ensures the validity of the independent-and-identically-distributed data assumption inherent in ML. All predictions are on data never seen during training.

We use kernel ridge regression114 (KRR; predictions are equivalent to those of Gaussian process regression115, GPR) with a Gaussian kernel as an ML model. (Supplementary Note 19) KRR is a widely-used non-parametric non-linear regression method. There are two regression HPs, the length scale of the Gaussian kernel and the amount of regularization. (Supplementary Note 21) In this work, training is exclusively on energies; in particular, derivatives are not used. All HPs, that is, regression HPs, numerical HPs (e.g., a weight in a weighting function), and structural HPs (e.g., which weighting function to use), are optimized with a consistent and fully automatic scheme based on sequential model-based optimization and tree-structured Parzen estimators116,117. (Supplementary Note 20) This setup treats all representations on equal footing. See Supplementary Notes 21–24 for details on the optimized HPs.
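For concreteness, a minimal sketch of the regression step on precomputed, fixed-length representation vectors; the closed-form KRR solution exposes the two regression HPs (kernel length scale and regularization strength), which in this sketch are fixed by hand rather than optimized as in the benchmark.

```python
import numpy as np

def gaussian_kernel(X, Y, length_scale):
    """Gaussian (RBF) kernel matrix between rows of X and rows of Y."""
    d2 = np.sum(X ** 2, axis=1)[:, None] + np.sum(Y ** 2, axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * length_scale ** 2))

def krr_fit_predict(X_train, y_train, X_test, length_scale, regularization):
    """Closed-form kernel ridge regression: alpha = (K + lambda*I)^-1 y."""
    K = gaussian_kernel(X_train, X_train, length_scale)
    alpha = np.linalg.solve(K + regularization * np.eye(len(X_train)), y_train)
    return gaussian_kernel(X_test, X_train, length_scale) @ alpha

# Synthetic stand-in data (in the benchmark, X holds representation vectors, y energies).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
pred = krr_fit_predict(X[:150], y[:150], X[150:], length_scale=8.0, regularization=1e-6)
print(np.sqrt(np.mean((pred - y[150:]) ** 2)))   # hold-out RMSE
```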

Learning curves and compute times

Figure 5 presents learning curves for SF, MBTR, SOAP on datasets qm9, ba10, nmd18r (see Supplementary Note 25 for tabulated values). For each dataset, representation, and training set size, we trained a KRR model and evaluated its predictions on a separate hold-out validation set of size 10k (qm9), 1k (ba10), and 0.6k (nmd18r). This procedure was repeated 10 times to estimate the variance of these experiments.

Fig. 5: Learning curves for selected representations on datasets.

Datasets a qm9, b ba10, and c nmd18r. Shown is root mean squared error (RMSE) of energy predictions on out-of-sample data as a function of training set size. Boxes, whiskers, bars, crosses show interquartile range, total range, median, mean, respectively. Lines are fits to theoretical asymptotic RMSE. (Supplementary Note 16) See Glossary (Table 1) for abbreviations.

Boxes, whiskers, horizontal bars, and crosses show interquartile ranges, minimum/maximum value, median, and mean, respectively, of the root mean squared error (RMSE) of hold-out-set predictions over repetitions. We show RMSE as it is the loss minimized by least-squares regression such as KRR, and thus a natural choice. For other loss functions, see Supplementary Note 26. From statistical learning theory, RMSE decays as a negative power of training set size (a reason why learning curves are preferably shown on log-log plots)118,119,120. Lines show corresponding fits of mean RMSE, weighted by the standard deviation for each training set size.
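The power-law decay corresponds to a straight line on a log-log plot, so the fit reduces to weighted linear regression in log space. A minimal sketch with made-up RMSE values (the actual fits weight mean RMSE by the per-size standard deviation):

```python
import numpy as np

# Training set sizes and hypothetical mean RMSE / standard deviation per size.
n = np.array([100, 200, 400, 800, 1600, 3200])
rmse = np.array([0.52, 0.38, 0.27, 0.20, 0.145, 0.105])
std = np.array([0.04, 0.025, 0.018, 0.012, 0.008, 0.006])

# Fit log(rmse) = log(a) - b*log(n); one common choice is to weight each
# point by the inverse of its standard deviation.
coeffs = np.polyfit(np.log(n), np.log(rmse), deg=1, w=1.0 / std)
b, log_a = -coeffs[0], coeffs[1]
print(f"RMSE ≈ {np.exp(log_a):.2f} * N^(-{b:.2f})")
```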

Figure 6 shows the relationship between the time to compute representations for a training set (horizontal axis) and RMSE (vertical axis). When comparing observations in two dimensions, here time t and error e, there is no unique ordering, and we resort to the usual notion of dominance: Let \({\boldsymbol{x}},{\boldsymbol{x}}^{\prime}\in {{\mathbb{R}}}^{d}\); then \({\boldsymbol{x}}\) dominates \({\boldsymbol{x}}^{\prime}\) if \({x}_{i}\le {x}_{i}^{\prime}\) for all dimensions i and \({x}_{i} < {x}_{i}^{\prime}\) for some i. The set of all non-dominated points is called the Pareto frontier, shown by a line, with numbers indicating training set sizes. Table 5 presents compute times for representations (see Supplementary Note 27 for kernel matrices).
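The dominance criterion translates directly into a short routine for extracting the Pareto frontier from (time, error) pairs; a minimal sketch with hypothetical values:

```python
import numpy as np

def pareto_frontier(points):
    """Return the non-dominated rows of `points` (lower is better in every
    dimension), i.e., the Pareto frontier as used in Fig. 6."""
    points = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(points):
        dominated = any(np.all(q <= p) and np.any(q < p)
                        for j, q in enumerate(points) if j != i)
        if not dominated:
            keep.append(i)
    return points[keep]

# Hypothetical (compute time in s, RMSE) pairs for several parametrizations.
obs = [(10, 0.30), (25, 0.18), (40, 0.19), (90, 0.12), (120, 0.12), (60, 0.25)]
print(pareto_frontier(obs))   # keeps (10, 0.30), (25, 0.18), (90, 0.12)
```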

Fig. 6: Compute times of selected representations for datasets.

Datasets a qm9, b ba10, and c nmd18r. Shown is root mean squared error (RMSE) of energy predictions on out-of-sample data as a function of the time needed to compute all representations in a training set. Lines indicate Pareto frontiers; inset numbers show training set sizes. See Glossary (Table 1) for abbreviations.

Findings

Asymptotically, observed prediction errors for all representations on all datasets relate as

$$\begin{array}{llll}{{\mathrm{SF}}{\hbox{-}}2,3}\,\prec \,{{\mathrm{SF}}{\hbox{-}}2}\,,\qquad\qquad\,{{\mathrm{MBTR}}{\hbox{-}}2,3}\,\preccurlyeq \,{{\mathrm{MBTR}}{\hbox{-}}2}\,,\\ {{\mathrm{SOAP}}}\,\prec \,{{\mathrm{SF}}{\hbox{-}}2,3}\,,\qquad\qquad\quad\,\,\,{{\mathrm{SOAP}}}\,\prec \,{{\mathrm{MBTR}}{\hbox{-}}2,3}\,,\\ {{\mathrm{SF}}{\hbox{-}}2,3}\,\preccurlyeq \,{{\mathrm{MBTR}}{\hbox{-}}2,3}\,,\qquad\qquad{{\mathrm{SF}}{\hbox{-}}2}\,\prec \,{{\mathrm{MBTR}}{\hbox{-}}2}\,,\end{array}$$
(14)

where \(A\prec B\) (\(A\preccurlyeq B\)) indicates that A asymptotically has lower (or equal) estimated error than B. Except for the relation between MBTR-2,3 and SF-2 on dataset nmd18r,

$$\,{{\mbox{SOAP}}}\prec {{\mbox{SF-2,3}}}\preccurlyeq {{\mbox{MBTR-2,3}}}\prec {{\mbox{SF-2}}}\prec {{\mbox{MBTR-2}}}\,.$$
(15)

We conclude that, for energy predictions, accuracy improves with modeled interaction order and for local representations over global ones. The magnitude of, and between, these effects varies across datasets.

Dependence of predictive accuracy on interaction order has been observed by others82,84,90,92,121 and might be partially due to a higher resolution of structural features57. This effect would only become apparent with sufficient training data, as for dataset ba10 in Fig. 5. We do not observe it for dataset qm9, possibly because angular terms are immediately relevant for characterizing organic molecules’ carbon scaffolds90.

Better performance of local representations might be due to higher resolution and better generalization (both from representing only a small part of the whole structure), and has also been observed by others122,123. The impact of assuming additivity is unclear but likely depends on the structure of the modeled property. (Supplementary Note 7) Our comparison includes only a single global representation (MBTR), warranting further study of the locality aspect. For additional analysis details, see Supplementary Notes 28 and 29.

Computational costs tend to increase with predictive accuracy. Representations should therefore be selected based on a target accuracy, constrained by available computing resources.

Converged prediction errors are in reasonable agreement with the literature (Supplementary Note 30) considering the lack of standardized conditions such as sampling, regression method, HP optimization, and reported performance statistics. In absolute terms, prediction errors of models trained on 10k samples are closer to the differences between DFT codes than the (systematic) differences between the underlying DFT reference and experimental measurements. (Supplementary Note 31).

Conclusions and outlook

We review representations of atomistic systems, such as molecules and crystalline materials, for machine learning of ab initio quantum-mechanical simulations. For this, we distinguish between local and global representations and between using invariant k-body functions and explicit symmetrization to deal with invariances. Despite their apparent diversity, many representations can be formulated in a single mathematical framework based on k-atom terms, symmetrization, and tensor products. Empirically, we observe that when controlling for other factors, including distribution of training and validation data, regression method, and HP optimization, prediction errors of SFs, MBTR, and SOAP improve with interaction order and for local representations over global ones, at the cost of increased compute time.

Our findings suggest the following guidance:

  • If their prediction errors are sufficient for an application, we recommend two-body versions of simple representations such as SF and MBTR as they are fastest to compute.

  • For large systems, local representations should be used.

  • For strong noise or bias on input structures, as in dataset nmd18u, performance differences between representations vanish, (Supplementary Note 29) and computationally cheaper features that do not satisfy the requirements in section “Requirements” (descriptors) suffice.

We conclude by providing related current research directions, grouped by topic.

Directly related to representations:

  • Systematic development of representations via extending the mathematical framework (section “A unified framework”) to include more state-of-the-art representations. This would enable deriving “missing” variants of representations (see Table 3), such as a global SOAP108 and local MBTR81, on a principled basis, as well as understanding and reformulating existing representations in a joint framework, perhaps to the extent of an efficient general implementation124.

  • Representing more systems. Develop or extend representations for atomistic systems currently not representable, or only to a limited extent, such as charged atoms and systems28,53,79,125,126,127,128,129, excited states130,131,132,133,134, spin systems, isotopes, and systems in an applied external field135,136.

  • Alchemical learning. Further understand and develop alchemical representations92,137,138 that incorporate similarity between chemical species to improve sample efficiency. What are the salient features of chemical elements that need to be considered, also with respect to charges, excitations, spins, and isotopes?

  • Analysis of representations to better understand structure and data distribution in feature spaces and how they relate to physics and chemistry concepts. Possible approaches include quantitative measures of structure and distribution of datasets in these spaces, dimensionality reduction methods, analysis of data-driven representations from deep neural networks, and construction, or proof of non-existence, of non-distinguishable environments for representations employing terms of order higher than four.

  • Explicit complexity control. Different applications require different trade-offs between computational cost and predictive accuracy. This requires determination, and automatic adaptation as an HP, of the capacity (complexity, dimensionality) and computational cost of a representation to a dataset, for example, through selection, combination139, or systematic construction of features42,57.

Related to benchmarking of representations:

  • Extended scope. We empirically compare one global and two local representations on three datasets to predict energies using KRR with a Gaussian kernel. For a more systematic coverage, further representations and datasets, training with forces60,61, and more properties should be included while maintaining control over regression method, data distribution, and HP optimization. Deep neural networks23,126,140,141 could be included via representation learning. Comparison with simple baseline models such as k-nearest neighbors142 would be desirable.

  • Improved optimization of HPs: The stochastic optimizer used in this work required multiple restarts in practice to avoid sub-optimal results, and reached its limits for large HP search spaces. It would be desirable to reduce the influence and computational cost of HP optimization. Possible means include reducing the number of HPs in representations, employing more systematic and thus more robust optimization methods, and providing reliable heuristics for HP default values.

  • Multi-objective optimization. We optimize HPs for predictive accuracy on a single property. In practice, though, parametrizations of similar accuracy but lower computational cost would be preferable, and more than one property can be of interest. HPs should, therefore, be optimized for multiple properties and criteria, including computational cost and predictive uncertainties (see below). How to balance these is part of the problem143.

  • Predictive uncertainties. While prediction errors are frequently analyzed, and reasonable guidelines exist, this is not the case for predictive uncertainties. These are becoming increasingly important as applications of ML mature, for example, for human assessment and decisions, learning on the fly144, and active learning. Beyond global analysis of uncertainty estimates, local characterization (in input or feature space) of prediction errors is relevant143,145.

Related through context:

  • Long-range interactions. ML models appear to be well-suited for short- and medium-ranged interactions, but problematic for long-ranged interactions due to the increasing degrees of freedom of larger systems and larger necessary cut-off radii of atomic environments. Two approaches are to integrate ML models with physical models for long-range interactions125,128,146, and to adapt ML models to learn long-range interactions directly147.

  • Relationships between QM and ML. A deeper understanding of the relationships between QM and kernel-based ML could lead to insights and technical progress in both fields. As both share concepts from linear algebra, such relationships could be formal mathematical ones. For example, QM concepts such as matrix product states can parameterize non-linear kernel models148.