Representations of molecules and materials for interpolation of quantum-mechanical simulations via machine learning

Computational study of molecules and materials from first principles is a cornerstone of physics, chemistry, and materials science, but limited by the cost of accurate and precise simulations. In settings involving many simulations, machine learning can reduce these costs, often by orders of magnitude, by interpolating between reference simulations. This requires representations that describe any molecule or material and support interpolation. We comprehensively review and discuss current representations and relations between them, using a unified mathematical framework based on many-body functions, group averaging, and tensor products. For selected state-of-the-art representations, we compare energy predictions for organic molecules, binary alloys, and Al-Ga-In sesquioxides in numerical experiments controlled for data distribution, regression method, and hyper-parameter optimization.


Introduction
Quantitative modeling of atomic-scale phenomena is central for scientific insights and technological innovations in many areas of physics, chemistry, and materials science. Solving the equations that govern quantum mechanics (QM), such as Schrödinger's or Dirac's equation, allows accurate calculation of the properties of molecules, clusters, bulk crystals, surfaces, and other polyatomic systems. For this, numerical simulations of the electronic structure of matter are used, with tremendous success in explaining observations and quantitative predictions. However, the high computational cost of these ab initio simulations (SI 1) often limits investigations to a range from tens of thousands of small systems with a few dozen atoms to a few large systems with thousands of atoms, particularly for periodic structures. In contrast, the number of possible molecules and materials grows combinatorially with the number of atoms: 13 or fewer C, N, O, S, Cl atoms can form a billion possible molecules, 1 and for 5-component alloys, there are more than a billion possible compositions when choosing from 30 elements. (SI 2) This limits systematic computational study and exploration of molecular and materials spaces. Similar considerations hold for ab initio dynamics simulations, which are typically restricted to systems with a few hundred atoms and sub-nanosecond timescales.
Such situations require many simulations of systems correlated in structure, implying a high degree of redundancy. Machine learning 2, 3 (ML) exploits this redundancy to interpolate between reference simulations [4][5][6][7] (Figure 1). This ansatz replaces most ab initio simulations by ML predictions, based on a small set of reference simulations. Effectively, it maps the problem of repeatedly solving a QM equation for many related systems onto a regression problem. This approach has been demonstrated in benchmark settings 4,8,9 and applications, 5,10,11 with reported speed-ups between zero and six orders of magnitude. [12][13][14][15] It is currently regarded as a highly promising avenue towards extending the scope of ab initio methods.
The most relevant aspect of ML models for interpolation of QM simulations (QM/ML models) after data quality (SI 3) is the definition of suitable input features, that is, representations of atomistic systems. Representations define how systems relate to each other for regression and are the subject of this review.
Scope and structure. QM/ML models require a space in which interpolation takes place. Such spaces can be defined explicitly, often as vector spaces, or implicitly, for example, via a kernel function in kernel-based machine learning. 16,17 This work reviews and compares explicit Hilbert-space representations of finite and periodic polyatomic systems for accurate interpolation of QM observables via ML, focusing on representations that satisfy the requirements in Section 3 and on energy predictions.

Related work
Studies of QM/ML models often compare their performance estimates with those reported in the literature. While such comparisons have value, they carry considerable uncertainty due to differences in datasets, learning algorithms, choices of hyperparameters (HPs, free parameters), sampling, validation procedures, and reported quantities. Accurate, reliable performance estimates require a systematic comparison that controls for the above factors, which we perform in this work.
Several recent studies systematically measured and compared prediction errors of representations (Table 1). We distinguish between studies that automatically (as opposed to manually) optimize numerical HPs of representations, for example, the width of a normal distribution; structural HPs of representations, for example, choice of basis functions; and HPs of the regression method, for example, regularization strength. SI 5 discusses the individual studies in Table 1.

Role and types of representations
An N-atom system formally has 3N − 6 degrees of freedom.
Covering those with M samples per dimension requires M^{3N−6} reference calculations, which is infeasible except for the smallest systems. How, then, is it possible to learn high-dimensional energy surfaces?
Part of the answer is that learning the whole energy surface is unnecessary, as configurations high in energy become exponentially unlikely; it is sufficient to learn low-energy regions. Another reason is that the regression space's formal dimensionality is less important than the data distribution in this space. (SI 6) Representations can have thousands of dimensions, but their effective dimensionality 53 can be much lower if they are highly correlated. The role of representations is, therefore, to map atomistic systems to spaces amenable to regression. These spaces, together with the data's distribution, determine the efficiency of learning.
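The gap between formal and effective dimensionality can be illustrated numerically. The following sketch (all numbers hypothetical) builds highly correlated feature vectors in a 500-dimensional ambient space from only 10 latent degrees of freedom, then recovers the effective dimensionality as the number of principal components explaining 99% of the variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration: 1000 feature vectors in a 500-dimensional
# ambient space, generated from only 10 latent degrees of freedom.
latent = rng.normal(size=(1000, 10))
mixing = rng.normal(size=(10, 500))
features = latent @ mixing  # columns are highly correlated

# Effective dimensionality: number of principal components needed
# to explain 99% of the variance.
centered = features - features.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
explained = np.cumsum(s**2) / np.sum(s**2)
eff_dim = int(np.searchsorted(explained, 0.99) + 1)

print(features.shape[1], eff_dim)  # ambient vs effective dimensionality
```

Despite 500 nominal dimensions, the data occupy a subspace of much lower effective dimension, which is the situation favorable for regression.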
We distinguish between local representations that describe parts of an atomistic system, such as atoms in their environment, 8,54 and global ones that describe the whole system. For global representations, represented systems are either finite, such as molecules and clusters, or periodic, such as bulk crystals and surfaces. (Table 2) Local representations are directly suitable for local properties, such as forces, nuclear magnetic resonance shifts, or core-level excitations, 55 which depend only on a finite-size environment of an atom. Extensive global properties (SI 7) such as energies can be modeled with local representations via additive approximations, summing over atomic contributions. (SI 8) Since local representations require only finite support, it does not matter whether the surrounding system is finite or periodic. Global representations are suited for properties of the whole system, such as energy, band gap, or the polarizability tensor. Since periodic systems are infinitely large, global representations usually need to be designed for or adapted to these. Trade-offs between local and global representations are discussed in Section 7.

Table 2: Types of representations. We distinguish between local (atoms in their environment) and global (holistic, whole system) representations, as well as between representations for finite (molecules, clusters) and periodic systems (bulk crystals, surfaces). Local representations have finite support, and thus do not need to distinguish between finite and periodic systems. See Glossary for abbreviations.
Historically, interpolation has been used to reduce the effort of numerical solutions to quantum problems from the beginning. Early works employing ML techniques such as Tikhonov regularization and reproducing kernel Hilbert spaces in the late 1980s and throughout the 1990s were limited to small systems. [56][57][58][59] Representations for high-dimensional systems appeared a decade later, 8,9,60 underwent rapid development, and constitute an active area of research today. Table 3 presents an overview.

Requirements
The figure of merit of ML models for fast, accurate interpolation of ab initio properties is sample efficiency: The number of reference simulations required to reach a given target accuracy. Imposing physical constraints on representations improves their sample efficiency by removing the need to learn these constraints from the training data. The demands of speed, accuracy, and sample efficiency give rise to specific requirements, some of which depend on the predicted property: (i) Invariance to transformations that preserve the predicted property, including (a) changes in atom indexing (input order, permutations of like atoms), and often (b) translations, (c) rotations, and (d) reflections. Predicting tensorial properties requires (e) covariance (equivariance) with rotations. 6,25,26,29,114,[119][120][121] Dependence of the property on a global frame of reference, for example, due to the presence of a non-isotropic external field, can affect variance requirements.
(ii) Uniqueness, that is, variance against all transformations that change the predicted property: Two systems that differ in property should be mapped to different representations.
Systems with equal representation that differ in property introduce errors: 71,122,123 Because the ML model cannot distinguish them, it predicts the same value for both, resulting in at least one erroneous prediction. Uniqueness is necessary and sufficient for reconstruction, up to invariant transformations, of an atomistic system from its representation. 54,80

(iii) (a) Continuity, and ideally (b) differentiability, with respect to atomic coordinates.
Discontinuities work against the regularity assumptions in ML models, which try to find the least complex function compatible with the training data. Intuitively, continuous functions require less training data than functions with jumps. Differentiable representations enable differentiable ML models. If available, reference gradients can further constrain the interpolation function ("force matching"), improving sample efficiency. [124][125][126]

(iv) Computational efficiency relative to the reference simulations.
For an advantage over simulations alone (without ML), overall computational costs should be reduced by one or more orders of magnitude to justify the effort. The difference between running reference simulations and computing representations usually dominates costs. (SI 9) Therefore, the results of computationally sufficiently cheaper simulations, for example, from a lower level of theory, can be used to construct representations 82,108 or to predict properties at a higher level of theory ("∆-learning"). 45,73,108

(v) Structure of representations and the resulting data distribution should be suitable for regression. (SI 6 and 10) It is useful if feature vectors always have the same length. 61,127 Representations often have a Hilbert space structure, featuring an inner product, completeness, projections, and other advantages. Besides the formal space defined by the representation, the structure of the subspace spanned by the data is critical. 52,123 This requirement is currently less well understood than (i)-(iv) and evaluated mostly empirically (see Section 8).
(vi) Generality, in the sense of being able to encode any atomistic system.
While current representations handle finite and periodic systems, less work has been done on charged systems, excited states, continuous spin systems, isotopes, and systems subjected to external fields.
Simplicity, both conceptually and in terms of implementation, is, in our opinion, a desirable quality of representations, albeit hard to quantify.
The above requirements preclude direct use of Cartesian coordinates, which violate requirement (i), and internal coordinates, which satisfy (i.b)-(i.d) but are still system-specific, violating (v) and possibly (i.a) if not defined uniquely. Descriptors and fingerprints from cheminformatics 18 and materials informatics violate (ii) and (iii.a).
Simple representations such as the Coulomb matrix (Section 6) either suffer from coarse-graining, violating (ii), or from discontinuities, violating (iii.a). In practice, representations do not satisfy all requirements exactly (Section 7) but can achieve high predictive accuracy regardless; for example, for some datasets, modeling a fraction of higher-order terms can be sufficiently unique already. 128 The optimal interaction orders to utilize in a representation also depend on the type and amount of data available. 41

A unified framework
Based on recent work, 6,116,129 we describe concepts and notation towards a unified treatment of representations. For this, we successively build up Hilbert spaces of atoms, k-atom tuples, local environments, and global structures, using group averaging to enforce physical invariances and tensor products to retain desired information and construct invariant features.

Representing atoms, environments, systems
Information about a single atom, such as position and proton number, is represented as an abstract ket |α⟩ in a Hilbert space H_α. Relations between k atoms, where their order can matter, are encoded as k-body functions g_k : H_α^{×k} → H_g. (SI 11) These functions can be purely geometric, such as distances or angles, but could also be of (al)chemical or mixed nature. Tuples of atoms and associated many-body properties are thus elementary tensors of a space H ≡ H_α^{⊗k} ⊗ H_g,

|A_{α_1 ... α_k}⟩ ≡ |α_1⟩ ⊗ ... ⊗ |α_k⟩ ⊗ g_k(|α_1⟩, ..., |α_k⟩).   (1)
A local environment of an atom |α⟩ is represented via the relations to its k − 1 neighbors by keeping |α⟩ fixed:

|A_α⟩ ≡ Σ_{α_2, ..., α_k} |A_{α α_2 ... α_k}⟩.

Weighting functions can reduce the influence of atoms far from |α⟩; we include these in g_k. An atomistic system as a whole is represented by summing over the local environments of all its atoms:

|A⟩ ≡ Σ_α |A_α⟩.

For periodic systems, this sum diverges, which requires either exploiting periodicity, for example, by working in reciprocal space, or employing strong weighting functions and keeping one index constrained to the unit cell. 101
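As a concrete (and deliberately simplistic) illustration of the tensor-product construction, the sketch below assembles an elementary tensor for a two-atom tuple as a Kronecker product. The encodings are hypothetical stand-ins: one-hot vectors for the element channel of |α⟩, and a one-dimensional inverse distance for g_2:

```python
import numpy as np

# Minimal sketch (encodings are illustrative, not canonical): one-hot
# kets |alpha> encode the chemical element, and a two-body function g_2
# encodes the inverse distance between the two atoms.
elements = {"H": 0, "C": 1, "O": 2}

def ket(element, n_species=3):
    """One-hot vector |alpha> in H_alpha (element channel only)."""
    v = np.zeros(n_species)
    v[elements[element]] = 1.0
    return v

def g2(r1, r2):
    """Two-body function g_2: here a 1-dimensional inverse distance."""
    return np.array([1.0 / np.linalg.norm(np.asarray(r1) - np.asarray(r2))])

# Elementary tensor |A_{a1 a2}> = |a1> (x) |a2> (x) g_2(|a1>, |a2>).
r_C, r_H = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
A = np.kron(np.kron(ket("C"), ket("H")), g2(r_C, r_H))

print(A.shape)  # (3 * 3 * 1,) = (9,)
```

The single nonzero entry sits in the (C, H) block and carries the geometric information, mirroring how the outer product separates element combinations.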

Symmetries, tensor products, and projections
Representations incorporate symmetry constraints (Section 3) by using invariant many-body functions g_k, such as distances or angles, or through explicit symmetrization via group averaging. 129 Explicit symmetrization transforms a tensor |T⟩ by integrating over a symmetry group S with right-invariant Haar measure dS,

|T̄⟩ = ∫_S Ŝ |T⟩ dS,

where symmetry transformations Ŝ ∈ S act separately on each subspace of H or parts thereof. For example, for rotational invariance, only the atomic positions in H_α change.
Rotationally invariant features can be derived from tensor contractions, as any full contraction of contravariant with covariant tensors yields rotationally invariant scalars. 118 Sometimes group averaging can integrate out desired information encoded in |T⟩. To counter this, one can perform tensor products of |T⟩ with itself, effectively replacing H by H^{⊗ν}. Together, this results in a generalized transform

|Ā⟩ = ∫_S Ŝ |A⟩^{⊗ν} dS.   (2)

To retain only part of the information in |A⟩, one can project onto orthogonal elements {|h_l⟩}_{l=1}^{m} of H via an associated projection operator P = Σ_l |h_l⟩⟨h_l|. Inner products and induced distances between representations are then given by

⟨PA|PB⟩ = Σ_l ⟨A|h_l⟩⟨h_l|B⟩,   d(A, B) = ‖ |A⟩ − |B⟩ ‖ = √(⟨A|A⟩ − 2⟨A|B⟩ + ⟨B|B⟩).   (3)
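A Monte-Carlo sketch of group averaging over rotations illustrates both points: averaging a position vector itself integrates out all information, whereas averaging its tensor power ν = 2 retains the invariant content (here, the squared norm). The sampling scheme below is purely illustrative, not part of any production representation:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation():
    """Haar-distributed rotation in SO(3) via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))          # Haar on O(3)
    if np.linalg.det(q) < 0:             # restrict to SO(3)
        q[:, 0] = -q[:, 0]
    return q

x = np.array([1.0, 2.0, -0.5])
n = 20000

# Group averaging of |T> = x itself integrates out all information:
avg_vec = np.mean([random_rotation() @ x for _ in range(n)], axis=0)

# Averaging the tensor power nu = 2 retains invariant content (|x|^2):
avg_tensor = np.zeros((3, 3))
for _ in range(n):
    y = random_rotation() @ x
    avg_tensor += np.outer(y, y)
avg_tensor /= n

print(np.linalg.norm(avg_vec))  # ~ 0: first power averages away
print(np.trace(avg_tensor))     # = |x|^2: retained by nu = 2
```

The averaged vector vanishes, while the trace of the averaged rank-2 tensor reproduces the rotationally invariant scalar |x|², the same mechanism that underlies the power spectrum (ν = 2) and bispectrum (ν = 3) discussed later.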

Selected representations
We discuss three representations that fulfill the requirements in Section 3 and for which an implementation not tied to a specific regression algorithm and supporting finite and periodic systems was openly available. These representations are empirically compared in Section 8.

Symmetry functions
Symmetry functions 8,61 (SFs) describe k-body relations between a central atom and the atoms in a local environment around it. (SI 11 and 12) They are typically based on distances (radial SFs, k = 2) and angles (angular SFs, k = 3).
Each SF encodes a local feature of an atomic environment, for example the number of H atoms at a given distance from a central C atom.
For each SF and k-tuple of chemical elements, contributions are summed. Sufficient resolution is achieved by varying the HPs of an SF. For continuity (and differentiability), a cut-off function ensures that SFs decay to zero at the cut-off radius. Two examples of SFs from Reference 61 (see Table 3 and SI 22 for further references and SFs) are

G_i^rad = Σ_{j≠i} exp(−η (d_ij − µ)²) f_c(d_ij),
G_i^ang = 2^{1−ζ} Σ_{j≠i} Σ_{k≠i,j} (1 + λ cos θ_ijk)^ζ exp(−η (d_ij² + d_ik² + d_jk²)) f_c(d_ij) f_c(d_ik) f_c(d_jk),   (4)

where η, µ, ζ, λ are numerical HPs controlling radial broadening, shift, angular resolution, and angular direction, respectively, d_ij is a distance, θ_ijk is the angle between atoms i, j, k, and f_c is a cut-off function. Figure 2 illustrates the radial SFs in Equation 4. The choice of which SFs to use is a structural HP. Variants of SFs include partial radial distribution functions, 130 SFs with improved angular resolution 62 and reparametrizations for improved scaling with the number of chemical species. [63][64][65] In terms of the unified notation, SFs use invariant functions g_k based on distances and angles, multiplied by a cut-off function, to describe local environments |A_α⟩. Projections P onto tuples of atomic numbers then separate contributions from different combinations of chemical elements.
Figure 2: Radial SFs (Equation 4) for increasing values of µ. The local environment of a central atom is described by summing contributions from neighboring atoms, separately by element.
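A minimal sketch of a radial SF, a Gaussian of the distance with shift µ and broadening η times a cut-off function, might look as follows; the cosine cut-off and all HP values are illustrative choices, not a recommended parametrization:

```python
import numpy as np

def f_cut(d, r_c):
    """Smooth cutoff that takes SFs continuously to zero at r_c."""
    return np.where(d < r_c, 0.5 * (np.cos(np.pi * d / r_c) + 1.0), 0.0)

def radial_sf(positions, center, eta, mu, r_c):
    """Radial SF for one central atom: a Gaussian of the distance,
    shifted by mu and broadened by eta, times the cutoff."""
    d = np.linalg.norm(positions - positions[center], axis=1)
    d = np.delete(d, center)  # exclude the central atom itself
    return np.sum(np.exp(-eta * (d - mu) ** 2) * f_cut(d, r_c))

# Toy geometry (arbitrary units): three atoms on a line.
pos = np.array([[0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0],
                [2.5, 0.0, 0.0]])

# Varying the shift mu probes different radial shells (a structural
# choice; values here are illustrative only).
values = [radial_sf(pos, 0, eta=4.0, mu=mu, r_c=5.0) for mu in (1.0, 2.5, 4.0)]
print(values)
```

In practice, such values would additionally be accumulated separately per element of the neighboring atoms, as in the figure above.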

Many-body tensor representation
The global many-body tensor representation 101 (MBTR) consists of broadened distributions of k-body terms, arranged by element combination. For each k-body function and k-tuple of elements, all corresponding terms (for example, all distances between C and H atoms) are broadened and summed up (Figure 3). The resulting distributions describe the geometric features of an atomistic system:

f_k(x, z_1, ..., z_k) = Σ_{i_1, ..., i_k} w_k N(x | g_k) Π_{j=1}^{k} δ_{z_j, Z_{i_j}},   (5)

where w_k is a weighting function that reduces the influence of tuples with atoms far from each other, and g_k is a k-body function; both w_k and g_k depend on atoms i_1, ..., i_k. N(x|µ) denotes a normal distribution with mean µ, evaluated at x. The product of Kronecker δ-functions restricts to the given element combination z_1, ..., z_k. Periodic systems can be treated by using strong weighting functions and constraining one index to the unit cell. In practice, Equation 5 can be discretized. Structural HPs include the choice of w_k and g_k; numerical HPs include the variance of the normal distributions. Requiring one atom in each tuple to be the central atom results in a local variant. 131 In terms of the unified notation, MBTR uses distribution-valued functions g_k, including weighting, with distributions centered on k-body terms such as (inverse) distances or angles. The outer-product structure of |A⟩ corresponds to the product of δ-functions in Equation 5, which selects for specific k-tuples of chemical elements.
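A sketch of the k = 2 term for a single element pair, with illustrative choices of exponential weighting w_2 and inverse-distance g_2 (other choices are possible), could read:

```python
import numpy as np

def mbtr_k2(positions, numbers, z1, z2, grid, sigma, decay=0.5):
    """Sketch of a discretized k = 2 MBTR term: a broadened distribution
    of inverse distances for one element pair (z1, z2). Weighting and
    broadening choices here are illustrative."""
    out = np.zeros_like(grid)
    n = len(numbers)
    for i in range(n):
        for j in range(n):
            if i == j or numbers[i] != z1 or numbers[j] != z2:
                continue  # Kronecker deltas select the element pair
            d = np.linalg.norm(positions[i] - positions[j])
            w = np.exp(-decay * d)  # exponential weighting w_2
            g = 1.0 / d             # inverse distance g_2
            out += w * np.exp(-0.5 * ((grid - g) / sigma) ** 2) \
                   / (sigma * np.sqrt(2.0 * np.pi))
    return out

# Toy system: a water-like geometry, element pair (O = 8, H = 1).
pos = np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])
Z = [8, 1, 1]
x = np.linspace(0.0, 3.0, 300)
spectrum = mbtr_k2(pos, Z, 8, 1, x, sigma=0.05)
print(spectrum.sum() * (x[1] - x[0]))  # ~ summed weights of the O-H pairs
```

The full representation concatenates such distributions over all element combinations and all k-body functions used.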

Smooth overlap of atomic positions
Smooth overlap of atomic positions 54 (SOAP) representations expand a central atom's local neighborhood density, approximated by Gaussian functions located at atom positions, in orthogonal radial and spherical harmonics basis functions (Figure 4):

ρ_α(r) ≈ Σ_{n,l,m} c_{n,l,m} g_n(r) Y_{l,m}(θ, φ),   (6)

where c_{n,l,m} are expansion coefficients, g_n are radial, and Y_{l,m} are (angular) spherical harmonics basis functions. From the coefficients, rotationally invariant quantities can be constructed, such as the power spectrum p_{n,n',l} = Σ_m c_{n,l,m} c*_{n',l,m}, which is equivalent to a radial and angular distribution function, 15 and therefore captures up to three-body interactions. Numerical HPs are the maximal number of radial and angular basis functions, the broadening width, and the cut-off radius. An alternative to the power spectrum is the bispectrum 9 (BS), a set of invariants that couples multiple angular momentum and radial channels. The Spectral Neighbor Analysis Potential (SNAP) includes quadratic terms in the BS components. 67 Extensions of the SOAP framework include recursion relations for faster evaluation 78 and alternative radial basis functions g_n, such as third- and higher-order polynomials, 78 Gaussian functions, 43 and spherical Bessel functions of the first kind. 79,80 In terms of the unified notation, SOAP uses vector-valued g_k to compute the basis set coefficients in Equation 6. Analytic group-averaging (symmetry integration) then results in invariant features such as the power spectrum (ν = 2, Equation 2) or bispectrum (ν = 3).
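Constructing the power spectrum from given expansion coefficients is a small contraction over m; the sketch below uses random complex coefficients as a stand-in for an actual basis-set projection of a neighborhood density:

```python
import numpy as np

rng = np.random.default_rng(0)

n_max, l_max = 4, 3

# Hypothetical expansion coefficients c_{n,l,m}; random complex numbers
# stand in for a real basis-set projection of a neighborhood density.
c = {}
for n in range(n_max):
    for l in range(l_max + 1):
        for m in range(-l, l + 1):
            c[n, l, m] = rng.normal() + 1j * rng.normal()

# Power spectrum p_{n,n',l} = sum_m c_{n,l,m} * conj(c_{n',l,m}):
# summing over m contracts away the orientation dependence.
p = np.zeros((n_max, n_max, l_max + 1), dtype=complex)
for n in range(n_max):
    for n2 in range(n_max):
        for l in range(l_max + 1):
            p[n, n2, l] = sum(c[n, l, m] * np.conj(c[n2, l, m])
                              for m in range(-l, l + 1))

# The result is Hermitian in (n, n'), with a real, non-negative diagonal.
print(np.allclose(p, np.conj(p.transpose(1, 0, 2))))
```

For coefficients obtained from an actual density, a rotation of the environment transforms the c_{n,l,m} by Wigner D-matrices, under which this contraction is invariant.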

Other representations
Many other representations have been proposed.
The Coulomb matrix 4 (CM) globally describes a system via inverse distances between atoms but does not contain higher-order terms. It is fast to compute, easy to implement, and in the commonly used sorted version (see footnote reference 25 in Reference 4), allows reconstruction of an atomistic system via a least-squares problem. However, its direct use of atomic numbers to encode elements is problematic, and it suffers either from discontinuities in the sorted version or from information loss in the diagonalized version as its eigenspectrum is not unique. 71, 72 A local variant exists. 74 The bag-of-bonds 84 (BoB) representation uses the same inverse distance terms as the CM but arranges them by element pair instead of by atom pair. The "BA-representation" 85 extends this to higher-order interactions by using bags of dressed atoms, distances, angles, and torsions. The inverse-distance many-body representation 106 (IDMBR) employs higher powers of inverse distances and separation by element combinations.
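A minimal sketch of the sorted CM, using the standard Z_i Z_j / d_ij off-diagonal terms and 0.5 Z_i^2.4 diagonal:

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Sorted Coulomb matrix: off-diagonal entries Z_i Z_j / d_ij,
    diagonal entries 0.5 * Z_i^2.4; rows/columns ordered by descending
    row norm to fix the atom indexing."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    d = np.linalg.norm(R[:, None] - R[None, :], axis=2)
    with np.errstate(divide="ignore"):
        M = np.outer(Z, Z) / d          # diagonal becomes inf, fixed below
    np.fill_diagonal(M, 0.5 * Z ** 2.4)
    order = np.argsort(-np.linalg.norm(M, axis=1))
    return M[np.ix_(order, order)]

# Toy geometry: a water-like molecule (coordinates in Angstrom).
Z = [8, 1, 1]
R = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]
M = coulomb_matrix(Z, R)
print(M.shape, np.allclose(M, M.T))
```

Note that the sorting step is exactly what introduces the discontinuities mentioned above: a small geometric change that swaps two row norms reorders the matrix abruptly.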
Histograms of distances, angles, and dihedral angles 42 (HDAD) are histograms of geometric features organized by element combination. This global representation is similar to MBTR but typically uses fewer bins, without broadening or explicit weighting. The Faber-Christensen-Huang-von Lilienfeld representation 104, 105 (FCHL) describes atomic environments with normal distributions over row and column in the periodic table (k = 1), interatomic distances (k = 2), and angles (k = 3), scaled by power laws. In the FCHL18 variant, 104 the full continuous distributions are used, requiring an integral kernel for regression. Among other optimizations, FCHL19 105 discretizes these distributions, similar to the approach taken by SFs, and can be used with standard vector kernels.
Wavelet scattering transforms 87-93 (WST) use a convolutional wavelet frame representation to describe variations of (local) atomic density at different scales and orientations. Integrating non-linear functions of the wavelet coefficients yields invariant features, where second-and higher-order features couple two or more length scales. Variations use different wavelets (Morlet, 87, 88 solid harmonic, or atomic orbital 89-91, 93 ) and radial basis functions (exponential, 89 Laguerre polynomials 90, 93 ).
Moment-tensor potentials 95 (MTP) describe local atomic environments using a spanning set of efficiently computable, rotationally and permutationally invariant polynomials derived from tensor contractions. Related representations include Gaussian moments 118 (GM), based on contractions of tensors from (linear combinations of) Gaussian-type atomic orbitals; the N-body iterative contraction of equivariants (NICE) framework, 116 which uses recursion relations to compute higher-order terms efficiently; and atomic cluster expansion 112-114 (ACE), which employs a basis of isometry- and permutation-invariant polynomials from trigonometric functions and spherical harmonics.
Overlap-matrix fingerprints 50, 82, 83 (OMF) and related approaches 30, 34 employ the sorted eigenvalues (and derived quantities) of overlap matrices based on Gaussian-type orbitals as representation. Eigenvalue crossings can cause derivative discontinuities, requiring post-processing 50 to ensure continuity. Using a molecular orbital basis (MOB 108, 109 and related approaches 35) adds the cost of computing the basis, for example, localized molecular orbitals via a Hartree-Fock self-consistent field calculation. Other matrices can be used, such as Fock, Coulomb, and exchange matrices, or even the Hessian, for example, from a computationally cheaper reference method. Density-encoded canonically aligned fingerprints 102 (DECAF) represent the local density in a canonical, invariant coordinate frame found by solving an optimization problem related to kernel principal component analysis.
Tensor properties require covariance (equivariance). Proposed solutions include local coordinates from eigendecompositions, 55 which exhibit discontinuities when eigenvalues cross, related local coordinate systems, 102 and internal vectors 132 (IV), based on inner products of summed neighbor vectors at different scales, as well as covariant extensions of SOAP 6, 120 and ACE. 114

Analysis
We discuss relationships between specific representations, to which degree they satisfy the requirements in Section 3, trade-offs between local and global representations, and relationships to other models and modeling techniques, including systematic selection and generation of features.

Relationships between representations
All representations in Section 5 and most representations in Section 6 are related through the concepts in Section 4. We distinguish two primary strategies to deal with invariances, the use of invariant k-body functions (BoB, CM, FCHL, HDAD, IDMBR, MBTR, SF) and explicit symmetrization (ACE, BS, GM, MOB, MTP, NICE, OMF, SOAP, WST). A similar distinction can be made for kernels. 39 Some representations share specific connections: An evenly-spaced grid of SFs can be seen as a histogram of distances, angles, or higher-order terms, similar to MBTR and HDAD. This suggests a local MBTR or HDAD variant by restricting summation to atomic environments, 131 and a global variant of SFs by summing over the whole system. A difference is that MBTR explicitly broadens k-body terms, whereas SFs implicitly broaden them via the exponential functions in Equation 4. The original formulation of SFs represents each chemically distinct central atom separately, whereas MBTR represents each (unique) tuple of k elements in separate tensor components. Both approaches correspond to using Kronecker δ functions on element types. ACE, BS, GM, MTP, NICE, and SOAP share the idea of generating tensors that are then systematically contracted to obtain rotationally invariant features. These tensors should form an orthonormal basis, or at least a spanning set, for atomic environments. Formally, expressing a local neighborhood density in a suitable basis before generating derived features avoids asymptotic scaling with the number of neighboring atoms, 112 although HPs, and thus runtime, still depend on it. Within a representation, recursive relationships can exist between many-body terms of different orders. 78,113,116 References 112-114 discuss technical details of the relationships between ACE and SFs, BS, SNAP, SOAP, MTP.

Requirements
Some representations, in particular early ones such as the CM, do not fulfill all requirements in Section 3. Most representations fulfill some requirements only in the limit, that is, absent practical constraints such as truncation of infinite sums, short cut-off radii, and restriction to low-order interaction terms. The degree of fulfillment often depends on HPs, such as truncation order, the length of a cut-off radius, or the highest interaction order k used. Effects can be antagonistic; for example, in Equation 6, both (ii) uniqueness and (iv) computational effort increase with n, l, m. 54 In addition, not all invariances of a property might be known, or might require additional effort to model. 119 Mathematical proof or systematic empirical verification that a representation satisfies a requirement or related property is sometimes provided: The symmetrized invariant moment polynomials of MTPs form a spanning set for all permutationally and rotationally invariant polynomials; 95 basis sets can also be constructed. 113 For SOAP, systematic reconstruction experiments demonstrate the dependence of uniqueness on parametrization. 54 While (ii) uniqueness guarantees that reconstruction of a system up to invariances is possible in principle, accuracy and complexity of this task vary with representation and parametrization. For example, reconstruction is a simple least-squares problem for the global CM as it comprises the whole distance matrix D_ij = ‖r_i − r_j‖_2, whereas for local representations, (global) reconstruction is more involved.
If a local representation comprises only up to 4-body terms then there are degenerate environments that it cannot distinguish, 123 but that can differ in property. Combining representations of different environments in a system can break the degeneracy. However, by distorting feature space (v) structure, these degeneracies degrade learning efficiency and limit achievable prediction errors, even if the training set contains no degenerate systems. 123 It is currently unknown whether degenerate environments exist for representations with terms of order k > 4. The degree to which a representation is unique can be numerically investigated through the eigendecomposition of a sensitivity matrix based on a representation's derivatives with respect to atom coordinates. 50
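The sensitivity-matrix analysis can be sketched numerically. Below, a stand-in representation (the vector of pairwise distances, chosen purely for simplicity) is probed with a finite-difference Jacobian; the near-zero eigenvalues of the sensitivity matrix correspond to directions in which the representation does not change, here the six rigid-body motions it is invariant to by design:

```python
import numpy as np

def representation(coords):
    """Stand-in representation: the vector of pairwise distances."""
    r = coords.reshape(-1, 3)
    iu = np.triu_indices(len(r), k=1)
    return np.linalg.norm(r[iu[0]] - r[iu[1]], axis=1)

# Finite-difference Jacobian J = d(representation)/d(coordinates)
# for a non-planar 4-atom toy geometry (12 coordinates, 6 distances).
x0 = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
               [0.0, 1.2, 0.0], [0.3, 0.4, 1.1]]).ravel()
h = 1e-6
J = np.array([(representation(x0 + h * e) - representation(x0 - h * e))
              / (2 * h)
              for e in np.eye(len(x0))]).T

# Sensitivity matrix J^T J: near-zero eigenvalues mark directions the
# representation is insensitive to (3 translations + 3 rotations).
eigvals = np.linalg.eigvalsh(J.T @ J)
n_invariant = int(np.sum(eigvals < 1e-8))
print(n_invariant)
```

For a representation that is unique up to invariances, only the invariant transformations should appear as near-zero directions; additional ones would signal spurious insensitivity.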

Global versus local representations
Local representations can be used to model global properties by assuming that these decompose into atomic contributions. In terms of prediction errors, this tends to work well for energies. (SI 7) Learning with atomic contributions adds technical complexity to the regression model and is equivalent to pairwise-sum kernels on whole systems, (SI 8) with favorable computational scaling for large systems (see SI 9 and 27 and Table 4). Other approaches to creating global kernels from local ones exist. 76 Conversely, using global representations for local properties can require modifying the representation to incorporate locality and directionality of the property. 43, 55 A general recipe for constructing local representations from global ones is to require interactions to include the central atom, starting from k = 2. 131
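A pairwise-sum kernel over local environments can be sketched as follows; the per-atom feature vectors are random stand-ins for local representations:

```python
import numpy as np

def local_kernel(x, y, sigma=1.0):
    """Gaussian kernel between two atomic-environment feature vectors."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def global_kernel(A, B, sigma=1.0):
    """Pairwise-sum kernel between two systems, each given as an array
    of per-atom feature vectors (possibly of different lengths)."""
    return sum(local_kernel(a, b, sigma) for a in A for b in B)

rng = np.random.default_rng(0)
# Three toy systems with 3, 5, and 2 atoms, 4 features per atom.
systems = [rng.normal(size=(n, 4)) for n in (3, 5, 2)]

K = np.array([[global_kernel(A, B) for B in systems] for A in systems])

# The sum kernel is symmetric and positive semi-definite, hence a valid
# kernel for KRR/GPR on whole systems of varying size.
print(np.allclose(K, K.T))
```

Used in KRR, this kernel is equivalent to modeling the global property as a sum of atomic contributions, which is the additive approximation described above.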

Relationships to other models and techniques
Two modeling aspects directly related to representations are which subset of the features to use and the construction of derived features. Both modulate feature space dimensionality and (v) structure. Adding products of 2-body and 3-body terms as features, for example, can improve performance, 128 as these features relate to higher-order terms, (SI 11) but can also degrade performance if the features are unrelated to the predicted property, or if there is insufficient data to infer the relationship. Feature selection tailors a representation to a dataset by selecting a small subset of features that still predict the target property accurately enough. Optimal choices of features depend on the data's size and distribution.
In this work, we focus exclusively on representations. In kernel regression, however, kernels can be defined directly between two systems, without an explicit intermediate representation. For example, n-body kernels between atomic environments can be systematically constructed from a non-invariant Gaussian kernel using Haar integration, or using invariant k-body functions (SI 11), yielding kernels of varying body-order and degrees of freedom. 39,41 Similarly, while neural networks can use representations as inputs, their architecture can also be designed to learn implicit representations from the raw data (end-to-end learning). In all cases, the requirements in Section 3 apply.

Empirical comparison
We benchmark prediction errors for the representations from Section 5 on three datasets. Since our focus is exclusively on the representations, we control for other factors, in particular data distribution, regression method, and HP optimization.

Data
The qm9 consensus benchmarking dataset 133, 134 comprises 133 885 organic molecules composed of H, C, N, O, F with up to 9 non-H atoms. (SI 13) Ground state geometries and properties are given at the DFT/B3LYP/6-31G(2df,p) level of theory. We predict U 0 , the atomization energy at 0 K.
The nmd18 challenge 135 dataset 136 (SI 15) contains 3 000 ternary (Al_x Ga_y In_z)_2 O_3 oxides, x + y + z = 1, of potential interest as transparent conducting oxides. Formation and band-gap energies of relaxed structures are provided at the DFT/PBE level of theory. The dataset contains both relaxed (nmd18r, used here) and approximate (nmd18u) structures as input. In the challenge, energies of relaxed structures were predicted from approximate structures.
Together, these datasets cover finite and periodic systems, organic and inorganic chemistry, and ground state as well as off-equilibrium structures. See SI 13 to 15 for details.

Method
We estimate prediction errors as a function of training set size (learning curves, SI 16 and 17). To ensure that subsets are representative, we control for distribution of elemental composition, size, and energy. (SI 18) This reduces the variance of performance estimates and ensures the validity of the independent-and-identically-distributed data assumption inherent in ML. All predictions are on data never seen during training.
We use kernel ridge regression 137 (KRR; predictions are equivalent to those of Gaussian process regression, 138 GPR) with a Gaussian kernel as the ML model. (SI 19) KRR is a widely used non-parametric non-linear regression method. There are two regression HPs, the length scale of the Gaussian kernel and the amount of regularization. (SI 21) In this work, training is exclusively on energies; in particular, derivatives are not used. All HPs, that is, regression HPs, numerical HPs (e.g., a weight in a weighting function), and structural HPs (e.g., which weighting function to use), are optimized with a consistent and fully automatic scheme based on sequential model-based optimization with tree-structured Parzen estimators. 139

Figure 5 presents learning curves for SF, MBTR, and SOAP on datasets qm9, ba10, and nmd18r (see SI 25 for tabulated values). For each dataset, representation, and training set size, we trained a KRR model and evaluated its predictions on a separate hold-out validation set of size 10 k (qm9), 1 k (ba10), and 0.6 k (nmd18r). This procedure was repeated 10 times to estimate the variance of these experiments.

Results
Boxes, whiskers, horizontal bars, and crosses show interquartile ranges, minimum / maximum value, median, and mean, respectively, of the root mean squared error (RMSE) of hold-out-set predictions over repetitions. We show RMSE as it is the loss minimized by least-squares regression such as KRR, and thus a natural choice. For other loss functions, see SI 26. From statistical learning theory, RMSE decays as a negative power of training set size (a reason why learning curves are preferably shown on log-log plots). 141-143 Lines show corresponding fits of mean RMSE, weighted by the standard deviation for each training set size. Figure 6 reveals dependencies between the time to compute representations for a training set (horizontal axis) and RMSE (vertical axis). When comparing observations in two dimensions, here time t and error e, there is no unique ordering, and we resort to the usual notion of dominance: Let x, x′ ∈ R^d; then x dominates x′ if x_i ≤ x′_i for all dimensions i and x_i < x′_i for some dimension i. The set of all non-dominated points is called the Pareto frontier, shown by a line, with numbers indicating training set sizes. Table 4 presents compute times for representations (SI 27 for kernel matrices).
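The dominance relation above is straightforward to compute. The following sketch (our illustration, not part of the benchmark code) extracts the Pareto frontier from a set of (time, error) observations:

```python
import numpy as np

def pareto_frontier(points):
    """Return the non-dominated subset of (time, error) observations.

    x dominates x' if x_i <= x'_i in every dimension i and x_i < x'_i in
    at least one; the Pareto frontier is the set of points that no other
    point dominates.
    """
    pts = np.asarray(points, dtype=float)
    keep = [
        i for i, x in enumerate(pts)
        if not any(
            np.all(y <= x) and np.any(y < x)
            for j, y in enumerate(pts) if j != i
        )
    ]
    return pts[keep]

# Hypothetical (time in s, RMSE) pairs; (3.0, 4.0) is dominated by (2.0, 3.0).
front = pareto_frontier([(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0)])
```

The quadratic loop suffices for the handful of (representation, training set size) points compared here; larger point sets would warrant a sort-based sweep.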

Discussion
Asymptotically, observed prediction errors for all representations on all datasets follow a consistent ordering, where "A before B" indicates that A has lower (or equal) estimated error than B; the only exception is MBTR-2,3 versus SF-2 on dataset nmd18r. We conclude that, for energy predictions, accuracy improves with modeled interaction order, and for local representations over global ones. The magnitude of these effects, and the differences between them, vary across datasets.
Dependence of predictive accuracy on interaction order has been observed by others 43,67,104,106,144 and might be partially due to a higher resolution of structural features. 123 The latter would only show for sufficient training data, such as for dataset ba10 in Figure 5. We do not observe this for dataset qm9, possibly because angular terms might be immediately relevant for characterizing organic molecules' carbon scaffolds. 106 Better performance of local representations might be due to higher resolution and better generalization (both from representing only a small part of the whole structure), and has also been observed by others. 51,145 The impact of assuming additivity is unclear but likely depends on the structure of the modeled property. (SI 7) Our comparison includes only a single global representation (MBTR), warranting further study of the locality aspect. For additional analysis details, see SI 28 and 29.

Table 4: Computational cost of calculating representations in milliseconds of processor time. Shown are mean ± standard deviation over all training set sizes of a dataset for the time to compute the representation of a single molecule or unit cell (columns: representation, qm9, ba10, nmd18). See SI 27 for details.
Computational costs tend to increase with predictive accuracy. Representations should therefore be selected based on a target accuracy, constrained by available computing resources.
Converged prediction errors are in reasonable agreement with the literature (SI 30), considering the lack of standardized conditions such as sampling, regression method, HP optimization, and reported performance statistics. In absolute terms, prediction errors of models trained on 10 k samples are closer to the differences between DFT codes than to the (systematic) differences between the underlying DFT reference and experimental measurements. (SI 31)

Conclusions and outlook
We review representations of atomistic systems, such as molecules and crystalline materials, for machine learning of ab initio quantum-mechanical simulations. For this, we distinguish between local and global representations and between using invariant k-body functions and explicit symmetrization to deal with invariances. Despite their apparent diversity, many representations can be formulated in a single mathematical framework based on k-atom terms, symmetrization, and tensor products. Empirically, we observe that when controlling for other factors, including distribution of training and validation data, regression method, and HP optimization, prediction errors of SFs, MBTR, and SOAP improve with interaction order, and for local representations over global ones, at the price of increased compute time.

Our findings suggest the following guidance:
• If their prediction errors are sufficient for an application, we recommend two-body versions of simple representations such as SF and MBTR, as they are fastest to compute.
• For large systems, local representations should be used.
• For strong noise or bias on input structures, as in dataset nmd18u, performance differences between representations vanish, (SI 29) and computationally cheaper features that do not satisfy the requirements in Section 3 (descriptors) suffice.
We conclude by providing related current research directions, grouped by topic.
Directly related to representations:
• Systematic development of representations via extending the mathematical framework (Section 4) to include more state-of-the-art representations. This would enable deriving "missing" variants of representations (see Table 2), such as a global SOAP 76 and a local MBTR, 131 on a principled basis, as well as understanding and reformulating existing representations in a joint framework, perhaps to the extent of an efficient general implementation. 146
• Representing more systems. Develop or extend representations for atomistic systems currently not representable, or only to a limited extent, such as charged atoms and systems. 28
• Alchemical learning. Further understand and develop alchemical representations 104,158,159 that incorporate similarity between chemical species to improve sample efficiency. What are the salient features of chemical elements that need to be considered, also with respect to charges, excitations, spins, and isotopes?
• Analysis of representations to better understand structure and data distribution in feature spaces and how they relate to physics and chemistry concepts. Possible approaches include quantitative measures of structure and distribution of datasets in these spaces, dimensionality reduction methods, analysis of data-driven representations from deep neural networks, and construction, or proof of nonexistence, of non-distinguishable environments for representations employing terms of order higher than four.
• Explicit complexity control. Different applications require different trade-offs between computational cost and predictive accuracy. This requires determination, and automatic adaptation as an HP, of the capacity (complexity, dimensionality) and computational cost of a representation to a dataset, for example, through selection, combination, 160 or systematic construction of features. 41,123

Related to benchmarking of representations:
• Extended scope. We empirically compare one global and two local representations on three datasets to predict energies using KRR with a Gaussian kernel. For more systematic coverage, other representations (Section 6) and datasets, training with forces, 125,126 and more properties should be included while maintaining control over regression method, data distribution, and HP optimization. Deep neural networks 23,148,161,162 could be included via representation learning. Comparison with simple baseline models such as k-nearest neighbors 163 would be desirable.
• Improved optimization of HPs: The stochastic optimizer used in this work required multiple restarts in practice to avoid sub-optimal results, and reached its limits for large HP search spaces. It would be desirable to reduce the influence and computational cost of HP optimization. Possible means include reducing the number of HPs in representations, employing more systematic and thus more robust optimization methods, and providing reliable heuristics for HP default values.
• Multi-objective optimization. We optimize HPs for predictive accuracy on a single property. In practice, though, parametrizations of similar accuracy but lower computational cost would be preferable, and more than one property can be of interest. HPs should therefore be optimized for multiple properties and criteria, including computational cost and predictive uncertainties (see below). How to balance these is part of the problem. 164
• Predictive uncertainties. While prediction errors are frequently analyzed, and reasonable guidelines exist, this is not the case for predictive uncertainties. These are becoming increasingly important as applications of ML mature, for example, for human assessment and decisions, learning on the fly, 165 and active learning. Beyond global analysis of uncertainty estimates, local characterization (in input or feature space) of prediction errors is relevant. 164,166

Related through context:
• Long-range interactions. ML models appear to be well-suited for short- and medium-ranged interactions, but problematic for long-ranged interactions due to the increasing degrees of freedom of larger systems and larger necessary cut-off radii of atomic environments. Two approaches are to integrate ML models with physical models for long-range interactions, 147,149,167 and to adapt ML models to learn long-range interactions directly. 168
• Relationships between QM and ML. A deeper understanding of the relationships between QM and kernel-based ML could lead to insights and technical progress in both fields. As both share concepts from linear algebra, such relationships could be formal mathematical ones. For example, QM concepts such as matrix product states can parameterize non-linear kernel models. 169

Competing interests
The authors declare no competing financial or non-financial interests.

Data and code availability
The data that support the findings of this study are publicly available: datasets at https://qmml.org; hyper-parameter search spaces, machine-learning models, program code, and results at https://marcel.science/repbench. A tutorial introduction to the cmlkit Python framework developed for this work is part of the NOMAD Analytics Toolkit. 170

Doubling N thus increases compute time by roughly an order of magnitude, and a few such doublings will exhaust any computational resource. Advances in large-scale computing facilities, such as current exascale computing initiatives, will move this "computational wall" to larger systems, but cannot remove it. In practice, the large prefactor hidden in the asymptotic runtimes is relevant as well.

3 | The role of data quality for QM/ML models
Data are the basis for data-driven models, and errors in them can only be corrected to a limited extent ("garbage in, garbage out"). Even dealing with simple errors like independent identically distributed noise requires additional data, and more severe errors lead to qualitative problems such as outliers. Conversely, problems in fitting an ML model can be indicative of problems in the data.

4 | Explicit and implicit features
Features used for regression can be defined explicitly via representations, or implicitly, for example, via kernels or deep neural networks.

In this work, we focus on explicit Hilbert-space representations in conjunction with kernel-based regression with a Gaussian kernel. Technically, the features used for regression are the components of the kernel feature space, that is, the non-linear transformations of the representations' components via the Gaussian kernel. While used implicitly in this sense, the representations are still defined explicitly. This is in contrast to implicitly defined representations, for example, feature spaces of kernels defined directly on "raw inputs" such as atomic coordinates and numbers, without an intermediate explicit Hilbert-space representation, or, the layers of deep neural networks (end-to-end learning). For the latter, the requirements in Section 3 can be imposed via the network architecture, which can be seen as the conceptual analog to explicitly conformant representations or kernels.
5 | Related work
Faber et al. 7 compare combinations of representations and regression methods for atomization energies of organic molecules (qm9 dataset, see Section 8).
Only some of the tested features are representations that satisfy the requirements in Section 3; their HPs are not optimized.
Himanen et al. 8 investigate the representations in Section 5, also using kernel regression, to predict ionic charges of molecules from the qm9 dataset, as well as formation energies in a custom dataset of inorganic crystals obtained from the Open Quantum Materials Database. They optimize numerical HPs of representations and regression method, but not structural ones.
Zuo et al. 9 focus on dynamics simulations, and therefore include forces and stresses in training and evaluation. They also evaluate predictions of derived physical quantities, such as elastic constants or equation-of-state curves. Different combinations of representation, regression method, and HP tuning are evaluated on a dataset of elemental solids. Timings are discussed.
Schmitz et al. 10 compare regression methods for potential energy surfaces of 15 small organic molecules, using non-redundant internal coordinates as features. HPs of the representation are not optimized.
Nyshadham et al. 11 compare selected combinations of representations and regression methods on binary alloys (ba10 dataset, see Section 8). HPs of the representations are not optimized.
Stuke et al. 12 evaluate prediction of molecular orbital energies with kernel regression on three datasets: organic molecules (qm9 dataset, see Section 8), amino acids and dipeptides, as well as opto-electronically active molecules. Numerical HPs of representations and regression method are optimised via local grid search.
Onat et al. 13 empirically investigate effective dimensionality and sensitivity (to small perturbations of the underlying system) of representations using both materials and molecular datasets. HPs are not optimized. They do not investigate prediction errors.
Parsaeifard et al. 14 also study the sensitivity of representations to infinitesimal geometric perturbations (uniqueness), as well as correlations between the distances induced by different representations and with physical properties.
Käser et al. 15 empirically evaluate representations with kernel-based regression and a deep neural network for prediction of energies, forces, vibrational modes and infrared spectra of formaldehyde at different levels of theory.
Jäger et al. 16 compare SOAP, MBTR, CM and SFs with KRR for prediction of the adsorption free energy of hydrogen on the surface of nanoclusters. Numerical HPs of KRR, and some numerical HPs of representations are optimized.
Goscinski et al. 17 develop metrics to compare the feature spaces generated by different representations (SFs, SOAP, NICE), exploring the impact of HP choices in datasets of methane and solid carbon.
Role and types of representations

6 | Structure and distribution of data
Figure S1 illustrates the importance of representation space structure for regression with a toy example. Low-dimensional (here, essentially one-dimensional) data is embedded into a high-dimensional (here, two-dimensional) space. The spiral embedding is not suited for linear regression, whereas the linear embedding is.
7 | Extensive and intensive properties
A property whose magnitude is additive in the size (extent or mass) of an object is called extensive; a property whose magnitude is independent of the size of an object is called intensive. For example, internal energy is an extensive property, band gap energy an intensive one.
Originating from thermodynamics, 18,19 the application of these terms to microscopic quantities is limited by allowed changes in "size" of a system: For finite systems such as molecules, a property p is extensive if for any two non-interacting systems A and B, p(A + B) = p(A) + p(B), 20 and intensive if p(A) = p(A + A). For periodic systems such as bulk crystals, we take A and B to be supercells of the same unit cell. In this minimal sense, total and atomization energy of atomistic systems are extensive. However, energies are not additive for general changes in a system, such as changes in atomic position, and addition or removal of atoms. With respect to the requirements in Section 3, ML models for energies should be size-extensive in the (minimal) sense above. For global representations, this can be achieved via normalization in conjunction with the linear kernel, 20 whereas local representations as described in SI 8 automatically satisfy this requirement.

Figure S1: Structure of representation determines suitability for regression. Almost one-dimensional data is embedded into a two-dimensional space. The spiral embedding (left) is not suited for linear regression, but the "unrolled" embedding (right) is.
8 | Learning with atomic contributions
One ansatz to scale prediction of global properties to large atomistic systems is to predict atomic contributions. This assumes additivity, as the predicted property is a sum of predicted atomic contributions, and locality, as efficient scaling requires representations of atoms in their environment to have local support, often achieved through a finite-radius cut-off function.
Predicting atomic contributions requires a modification of the basic kernel regression scheme, which we derive here building on References 21 and 22: Let A_1, …, A_a denote the atoms of systems M_1, …, M_m and let D ∈ {0,1}^{m×a} be their incidence matrix, that is, D_{i,j} = 1 if A_j belongs to M_i and 0 otherwise. Let k̃ denote a kernel function on atoms, with kernel matrix K̃ ∈ R^{a×a}, K̃_{i,j} = k̃(A_i, A_j). The prediction for the i-th system is the sum of its predicted atomic contributions,

ỹ_i = sum_{j=1}^{a} D_{i,j} sum_{l=1}^{a} α̃_l k̃(A_l, A_j), that is, ỹ = D K̃ α̃.

The atomic weights α̃ are obtained by minimizing the regularized quadratic loss ||y − D K̃ α̃||^2 + λ α̃^T K̃ α̃. Since this is a quadratic form, it suffices to set its gradient to zero and solve for α̃:

α̃ = (D^T D K̃ + λ I)^{−1} D^T y = D^T (D K̃ D^T + λ I)^{−1} y,

where the last expression is preferable for numerical evaluation. Predictions for m′ new systems M′_1, …, M′_{m′} with a′ atoms A′_1, …, A′_{a′} can be expressed efficiently as

ỹ′ = D′ L̃^T α̃,

where D′ is the incidence matrix for the predicted systems and L̃ is the a × a′ kernel matrix between atoms A and A′.

This approach is equivalent to kernel regression (SI 19) on whole systems with a kernel k given by the sum of the atom kernel k̃ over all pairs of atoms in two systems,

k(M, M′) = sum_{A ∈ M} sum_{A′ ∈ M′} k̃(A, A′).

This follows from K = D K̃ D^T, L = D L̃ D′^T, and α̃ = D^T α (Equation 8): Predictions for whole systems using k are identical to predictions using k̃: y′ = L^T α = D′ L̃^T D^T α = D′ L̃^T α̃. In particular, atomic weights α̃ are blocks of system weights α.
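The equivalence between system-level KRR and summed atomic contributions can be verified numerically. A minimal sketch with toy data and a Gaussian atom kernel (our illustration; the experiments use cmlkit/qmmlpack, not this code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: m = 3 training systems with 2, 3, 2 atoms (a = 7 total).
sizes = [2, 3, 2]
m, a = len(sizes), sum(sizes)
D = np.zeros((m, a))  # incidence matrix: D[i, j] = 1 if atom j is in system i
start = 0
for i, s in enumerate(sizes):
    D[i, start:start + s] = 1.0
    start += s

X = rng.normal(size=(a, 4))  # per-atom feature vectors (stand-in representation)
Kt = np.exp(-np.sum((X[:, None] - X[None]) ** 2, axis=-1))  # atom kernel K~
K = D @ Kt @ D.T             # system kernel K = D K~ D^T
y = rng.normal(size=m)       # system-level labels
lam = 1e-3

alpha = np.linalg.solve(K + lam * np.eye(m), y)  # standard KRR system weights
alpha_t = D.T @ alpha                            # atomic weights (blocks of alpha)

# The other closed form from the derivation gives the same atomic weights:
alpha_t_alt = np.linalg.solve(D.T @ D @ Kt + lam * np.eye(a), D.T @ y)

# Predictions agree whichever route is taken:
pred_sys = K @ alpha           # whole-system kernel
pred_atom = D @ Kt @ alpha_t   # sum of predicted atomic contributions
```

The second closed form inverts an a × a matrix, the first only an m × m one, which is why the derivation above prefers the latter for numerical evaluation.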
Computing full atom kernel matrices K̃ and incidence matrices D can require large amounts of memory. In practice, we compute blocks of K̃ on the fly and directly sum over its entries. Learning with atomic contributions is extensive (SI 7).

Both representation and kernel computations depend on system size, often polynomially, with differences in asymptotic runtime as well as constant factors relevant in practice. Local representations require computing more kernel matrix entries (SI 8) than global representations, which can noticeably influence compute time (Table S6), but enable scaling with system size: Let c denote the (average) number of atoms per system, and d the (average) number of atoms in a local environment. In the following, we assume d to be constant (bounded from above), and representations to have constant size. The total computational effort to compute representations (first term) and kernel matrices (second term) is then O((n + m) c^k + nm) for global and O((n + m) c d^k + nm c^2) for local representations, where d^k is constant. For small systems, c ≈ d, and the additional overhead in computing kernel matrices will dominate runtime for small k. In the limit c → ∞ of large systems, the c^k term will dominate for global representations, while local representations enjoy quadratic scaling. This can be observed to some extent in Tables S5 to S7.
The above analysis applies to representations that explicitly compute k-body terms. Representations that compute k-body terms as contractions of basis set coefficients enjoy better scaling in the number of atoms. 23 Recursion relations between k-body terms can improve scaling as well. 24-26

10 | Role of representations
The role of the representation is to map atomistic systems into a space amenable to regression (linear interpolation). Strictly speaking, for kernel regression this is the kernel feature space, that is, the representation space transformed by the kernel. We limit our discussion to the representation itself; for the linear kernel this is exact, as the transformation is the identity, and many non-linear kernels like the Gaussian kernel act on the representation space, relying on its structure and implied metric.

A unified framework
11 | k-body functions
Informally, a k-body function maps information about k distinct atoms |α_1⟩, …, |α_k⟩, where order can matter, to an output space, here the real numbers, or a distribution on them. Atom information |α⟩ typically includes coordinates and proton number, but is not limited to those; for example, it could include neutron number to model isotopes.
Typical k-body functions include atomic number counts (k = 1); distances, sometimes inverted or squared (k = 2); angles or their cosine (k = 3); and dihedral or torsional angles and volume-related terms (k = 4). Less commonly, (al)chemical relationships can be included, for example, based on atoms' period and group in the periodic table. 27 In this work, we do not use k = 4 or higher-order interactions due to the computational cost from combinatorial growth of the number of terms when enumerated directly, which becomes a limiting factor for larger systems, such as those in the nmd18 dataset.
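For concreteness, the common low-order k-body functions can be written down directly from Cartesian coordinates. A self-contained sketch (illustrative only, not tied to any particular representation):

```python
import numpy as np

def distance(r1, r2):
    """k = 2: Euclidean distance between two atoms."""
    return np.linalg.norm(r2 - r1)

def angle(r1, r2, r3):
    """k = 3: angle at the middle atom r2, in radians."""
    u, v = r1 - r2, r3 - r2
    cos = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))  # clip guards against round-off

def dihedral(r1, r2, r3, r4):
    """k = 4: signed torsion angle around the bond r2-r3, in radians."""
    b1, b2, b3 = r2 - r1, r3 - r2, r4 - r3
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)  # normals of the two planes
    m = np.cross(n1, b2 / np.linalg.norm(b2))
    return np.arctan2(m @ n2, n1 @ n2)
```

The k = 1 case (atomic number counts) needs no geometry at all, which is why composition-only representations are cheapest but least informative.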
More formally, Glielmo et al. 28,29 define the order of a kernel of two local atomic environments as the smallest integer k for which differentiating by k different atomic coordinates always yields zero. Conceptually, in our notation, the body-order of a global representation f is the smallest k for which

∂^{k+1} f / (∂|α_{i_1}⟩ ⋯ ∂|α_{i_{k+1}}⟩) = 0

for all distinct |α_{i_1}⟩, …, |α_{i_{k+1}}⟩. For local representations, one of the atoms α_i is fixed, reducing the order of the derivative by one. This definition is not concerned with the extent to which k-body terms utilise k-body information. For instance, in a local representation, products of 2-body terms, say, distances from the central atom to one other atom, depend on two distances (from the central atom to two other atoms), and are therefore formally k = 3, but cannot resolve angular information because the distance between the two other atoms is not known. 28,29 Powers of k-body terms 28,29 and products of k,k′-body terms 30 can improve performance, but are less expressive than full kζ-body and kk′-body terms, respectively.

Representations
12 | Local atomic neighbourhoods
Local representations are computed for a local neighbourhood of a central atom, usually defined as { |α_i⟩ : d_i ≤ r_c }, where |α_i⟩ denotes atom i, d_i is the distance of |α_i⟩ to the central atom, and r_c ≥ 0 is a cut-off radius. Usually, a cut-off function that smoothly approaches zero for d_i → r_c is employed to prevent discontinuities at the threshold.
Both the quippy and DScribe implementations of SOAP include the central atom in the neighbourhood, and thus in the neighbourhood density, 31 in contrast to the original definition. 32 SFs do not take the central atom into account explicitly.
In periodic systems, the unit cell is replicated up to the cut-off radius to ensure that all interactions within the neighbourhood are included. In practice, some implementations may internally use a modified effective cut-off radius. For instance, DScribe ensures that atoms up to the tail of the radial basis function are taken into account.
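As an illustration, the widely used cosine cut-off function and a naive neighbourhood query can be written as follows (a sketch of the general idea; the implementations named above differ in details):

```python
import numpy as np

def cosine_cutoff(d, r_c):
    """Smooth cut-off: 1 at d = 0, 0 for d >= r_c, with zero slope at r_c."""
    d = np.asarray(d, dtype=float)
    return np.where(d < r_c, 0.5 * (np.cos(np.pi * d / r_c) + 1.0), 0.0)

def neighbourhood(positions, center, r_c):
    """Indices of atoms within r_c of the central atom (central atom excluded).

    For periodic systems, positions would first be expanded with the
    periodic images of the unit cell up to r_c, as described above.
    """
    d = np.linalg.norm(positions - positions[center], axis=1)
    return [i for i in range(len(positions)) if i != center and d[i] <= r_c]
```

Because both the cut-off value and its first derivative vanish at r_c, atoms entering or leaving the neighbourhood do not cause discontinuities in the representation or its gradients.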

Empirical comparison
13 | qm9 dataset
The qm9 dataset, 33,34 also known as gdb9-14, contains 133 885 small organic molecules composed of H, C, N, O, F with up to 9 non-H atoms. It is a subset of the "generated database 17" (GDB-17). 5 Molecular ground-state geometries and properties, including energetics, are computed at the density functional level of theory using the Becke 3-parameter Lee-Yang-Parr (B3LYP) 35 hybrid functional with the 6-31G(2df,p) basis set.
We use the version available at qmml.org, which offers a convenient format for parsing, and exclude all structures in the uncharacterized.txt file and those listed in the readme.txt file as "difficult to converge", as those are potentially problematic. Total energies were converted to energies of atomization by subtracting the atomic contributions given in file atomref.txt.
15 | nmd18 dataset
The nmd18 dataset 41 is a Kaggle challenge 42 dataset containing 3 000 ternary (Al_x Ga_y In_z)_2 O_3 oxides, x + y + z = 1, of potential interest as transparent conducting oxides. We predict formation and band-gap energies of relaxed structures, using either relaxed (nmd18r) or approximate (nmd18u) structures from Vegard's rule as input. Geometries and energies are computed at the density functional level of theory using the PBE functional as implemented in the all-electron code FHI-aims 43 with tight settings.
The challenge scenario is to predict formation and bandgap energies of relaxed structures from unrelaxed geometries obtained via Vegard's rule. This is equivalent to strong noise or bias in the inputs. Unlike pure benchmarking scenarios, where computationally expensive relaxed geometries are given, the challenge scenario is closer to a virtual screening application in that Vegard's rule geometries are computationally inexpensive to obtain.
The dataset contains all structures from the challenge training and leaderboard data. Unless otherwise noted, we report RMSE, not the root mean square logarithmic error used in the challenge.
16 | Learning curves
Plots of empirical prediction error ε as a function of training set size n are called "learning curves". Asymptotically, we assume the error to decay as a negative power, 44 ε = a n^{−b}. On a log-log plot, ε is therefore linear, log ε = a′ − b log n, and the offset a′ = log a and slope b can be used to characterize the predictive performance of models. 45 For QM/ML models the estimated quantities are noise-free (except for numerical noise, which is negligible for converged calculations) and representations are unique. We use base-10 logarithms for learning curves. For asymptotic fits we weight training set sizes by the standard deviation over their respective splits to attenuate small-sample effects, as the above equation is valid only in the limit n → ∞. All learning curves in Figures S2 and 5 show linear behaviour after at most a few hundred samples. See Table S1 for estimated offsets a′ and slopes b.
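The asymptotic fit amounts to weighted linear regression in log-log space. A simplified sketch (our illustration; we read the weighting as inverse-standard-deviation weighting, which down-weights noisy small-n estimates — this reading is an assumption, not the paper's exact scheme):

```python
import numpy as np

def fit_learning_curve(n, rmse, std=None):
    """Least-squares fit of log10(rmse) = a - b * log10(n).

    Returns (a, b) such that rmse ~ 10**a * n**(-b). If per-size standard
    deviations are given, each point is weighted by 1/std (assumed scheme).
    """
    logn, loge = np.log10(n), np.log10(rmse)
    w = None if std is None else 1.0 / np.asarray(std, dtype=float)
    slope, intercept = np.polyfit(logn, loge, deg=1, w=w)
    return intercept, -slope  # offset a and (positive) slope b
```

For a synthetic curve ε = 2 n^{−0.5}, the fit recovers offset log10(2) and slope 0.5, as expected for exactly collinear log-log data.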
17 | Subsets
For training and validation, data subsets were sampled as follows: An outer validation set was randomly drawn (10 k molecules for qm9, 1 k structures for ba10, 600 structures for nmd18). From the remaining entries, outer training sets of sizes 100, 250, 650, 1 600, 4 000 and 10 000 for datasets qm9 and ba10, and 100, 160, 250, 400, 650, 1 000 and 1 600 for dataset nmd18, were randomly drawn. These sizes were chosen to be equidistant in log-space. Each outer training set was then split into an inner training set and an inner validation set by randomly drawing the latter. We used an 80 / 20 % split, yielding inner validation sets of size 20, 50, 130, 320, 800 and 2 000 for datasets qm9 and ba10, and 20, 32, 50, 80, 130, 200 and 320 for nmd18. The whole procedure was repeated 10 times. We excluded structures with few atoms (6 or fewer non-H atoms for qm9, 5 or fewer atoms per unit cell for ba10, 10 atoms per unit cell for nmd18) as there are not enough of these for statistical learning.
18 | Sampling
To reduce variance, remove bias, and ensure that subsets faithfully represent the distribution of the whole dataset, subsets were drawn using Monte-Carlo sampling such that differences to the parent dataset in selected statistics were below pre-defined fractional thresholds.
For dataset qm9, these statistics were the number of N, O and F atoms, the number of molecules with 7, 8 and 9 non-H atoms, the binned number of atoms (with H), and binned energy.

19 | Kernel regression
Kernel regression predicts f(x) = sum_{i=1}^{n} α_i k(x, x_i), where x is the system to predict, x_1, …, x_n are the training systems, and k is a symmetric positive definite function (kernel). The regression coefficients α are obtained by minimizing the regularized quadratic loss sum_{i=1}^{n} (y_i − f(x_i))^2 + λ||f||^2_H, where y are property values of the training data and the regularization strength λ is a HP that controls the smoothness of the predictor. In this work, we use the Gaussian kernel k(x, z) = exp(−||x − z||^2 / 2σ^2), where the length scale σ is a HP (SI 21). We used the qmmlpack 48,49 implementation.
20 | HP optimization
For model selection we optimized the HPs of representations, kernel and regression method, including structural HPs (for example, which k-body functions to use) and numerical HPs (for example, the Gaussian kernel length scale). Specifically, the RMSE of an inner validation set was minimized using tree-structured Parzen estimators 50,51 in combination with local grid search. The same optimization scheme was used for all representations, using consistent grid spacings and parameter ranges to reduce human bias. Our corresponding cmlkit package 52 provides interfaces to the hyperopt optimization package 50 and to each representation's implementation(s); it is freely available under an open source license.
The space of possible models ("HP search space") is a tree-structured set of choices, for instance, between different k-body functions, or different values of a numerical HP. Tree-structured Parzen estimators treat this search space as a prior distribution over HPs, updated every time a loss is computed to increase prior weight around HP settings with better loss. We use uniform priors throughout, discretizing numerical HPs on logarithmic or linear grids as necessary. Once a HP search space has been defined, model selection is fully automatic.
HPs were optimized for each training set size as follows: For each trial, representation HPs and starting values for regression method HPs (Gaussian kernel length scale and regularization strength) were drawn from the prior. The latter were then refined through a randomized local grid search and the resulting HP values used to update the prior. All optimizations were run for 2 000 steps, and rerun three times, to minimize variance from stochastic optimization. To reduce computational cost, HPs were optimized on only one outer split.

21 | Kernel regression HPs
We used KRR with a single Gaussian kernel, a frequently used combination in the literature. Note that due to Requirement (iii.a), the Gaussian kernel is better suited than less smooth kernels such as the Laplacian kernel. 53 No post-processing of the kernel was performed. In particular, centering of kernel and labels, which together is equivalent to having an explicit bias term b in the regression, was not performed, as it is not necessary for the Gaussian kernel. 54 Depending on the representation used, labels were normalized for training as needed to represent values either per atom or per entire system. (SI 7) This setup entails two HPs: the width σ of the Gaussian kernel and the regularization strength λ. Search spaces for these two HPs (Table S2) were held constant across all representations and learning curves. See Reference 55 for HP search spaces and optimized model HPs.
22 | Symmetry function HPs
The SFs in Equations 10 to 15 are defined with the cut-off function f_c(d) = ½ [cos(π d / c) + 1] for d ≤ c and 0 otherwise. In these equations, index i denotes the central atom; j, k run over all atoms in the local environment around i within cutoff radius c; d_lm denotes the pairwise distance and θ_lmn the angle between three atoms; η and κ are broadening parameters, µ is a shift, and ζ determines the angular resolution. λ = ±1 determines whether the angular part of G⁴ᵢ and G⁵ᵢ peaks at 0° or 180°. We utilize the RuNNer 57-59 software to compute SFs and restrict ourselves to the radial SFs G²ᵢ and angular SFs G⁴ᵢ (RuNNer functions 2 and 3). We use the same SFs for all element combinations to minimize the size of the HP search space. This can result in constant or low-variance features, which are unproblematic for kernel regression with a Gaussian kernel (SI 19), as they enter only through the norm of feature-vector differences (for neural networks, such features could be more problematic). Similarly, we use an empirical parametrization scheme 60 to choose HPs µ and η for G²ᵢ and HPs η, ζ, and λ for G⁴ᵢ. For radial SFs we use two schemes, shifted and centered. For shifted, µ is chosen on a linear grid while η is held fixed. For centered, µ = 0 and η is chosen such that the standard deviation of each SF lies on the same grid points. For grid points i ∈ {0, 1, …, n}, with Δ = (c − 1.5)/(n − 1) and rᵢ = 1 + Δi, the centered scheme uses µᵢ = 0 and ηᵢ = 1/(2 rᵢ²), and the shifted scheme uses µᵢ = rᵢ and ηᵢ = (2Δ²)⁻¹. In this setting, the only HP is the number of grid points n + 1, which we allow to vary from 2 to 10 for each scheme.
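The shifted and centered grids can be written out as a short sketch; the function name and return format are ours, not RuNNer's.

```python
def radial_sf_params(c, n, scheme):
    """Grid parametrization of the radial SF HPs (mu_i, eta_i) for cutoff
    radius c and n + 1 grid points, following the scheme described above."""
    delta = (c - 1.5) / (n - 1)
    params = []
    for i in range(n + 1):
        r_i = 1.0 + delta * i
        if scheme == "centered":
            params.append((0.0, 1.0 / (2.0 * r_i ** 2)))   # mu = 0, eta from r_i
        elif scheme == "shifted":
            params.append((r_i, 1.0 / (2.0 * delta ** 2)))  # mu on grid, eta fixed
        else:
            raise ValueError(scheme)
    return params
```

For example, with c = 6 Å and n = 4, the shifted scheme places µ on a linear grid from 1.0 to 7.0 with fixed η, while the centered scheme keeps µ = 0 and varies η.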
For angular SFs, we choose λ = ±1 and ζ = 1, 2, 4. The only remaining HP is the broadening η, optimized on a log₂ grid between −20 and 1 with spacing 0.5. The radial SFs and the two angular SFs with λ = ±1 and ζ = 1 are always included, but the optimizer can enable or disable any of the remaining k = 3 SFs with λ = ±1 and ζ = 2, 4. Cutoff radii are varied in integer steps, starting from the integer above the smallest distance found in the dataset, up to at most 9 Å.
The output of RuNNer is post-processed to be suitable for KRR by placing all SFs for a given type of central atom in separate blocks of an atomic feature vector with (# elements × # SFs) components. For the Gaussian kernel, this leads to negligible kernel values between representations belonging to different elements. As SFs are local representations, labels were normalized to (extensive) per-system values.
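This blocking can be illustrated schematically; the element alphabet, SF values, and function name below are made up for illustration.

```python
ELEMENTS = ["H", "C", "O"]  # hypothetical element alphabet of a dataset

def blocked_vector(element, sf_values):
    """Place an atom's SF values in the block belonging to its element and
    zeros elsewhere, yielding an atomic feature vector with
    (# elements x # SFs) components."""
    n_sf = len(sf_values)
    vec = [0.0] * (len(ELEMENTS) * n_sf)
    start = ELEMENTS.index(element) * n_sf
    vec[start:start + n_sf] = sf_values
    return vec
```

Atomic vectors of different elements then differ in every non-zero component, so their squared distance is large and the Gaussian kernel value between them becomes negligible.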
See Reference 55 for HP search spaces and optimized model HPs.
23 | Many-body tensor representation HPs
We employed the MBTR implementation in qmmlpack, 49 adding optional normalization by the ℓ1 or ℓ2 norm. For k = 2, 3, the representations for k = 2 and k = 3 were concatenated. The MBTR exhibits several categorical HPs, with subsequent numerical HPs conditional on prior choices.
We used the k-body functions 1/distance and 1/dot for k = 2, and angle, cos angle, and dot/dotdot for k = 3. No one-body terms were used, as atomization and formation energies already contain linear contributions of element counts. Histogram ranges were chosen based on the whole dataset, as inter-atomic distance ranges are similar for all subsets. 100 discretization bins were used throughout. Broadening parameters were restricted to at least a single bin and at most a quarter of the range of the corresponding geometry function.
The latter two in each set introduce conditional HPs. For periodic systems, in particular the nmd18 dataset, the ranges of these parameters were manually restricted to avoid excessive computation times (above 30 s for one trial). The convergence threshold was set to 0.001.
We used the full indexing scheme, which generates all permutations of elements (as opposed to noreversals, which does not double-count element combinations, for example, CH and HC). This seems to lead to more consistent behaviour and higher predictive accuracy for supercells, or unit cells of different sizes, and similar accuracy for molecules, at the expense of higher computational cost. We used per-system energies for the qm9 dataset and per-atom energies for datasets ba10 and nmd18.

24 | Smooth Overlap of Atomic Positions HPs
We used the DScribe implementation of SOAP with Gaussian-type orbitals, 8,61 which we found to provide more accurate predictions at lower computational cost than the quippy 62 implementation. Results are already structured by element types; no post-processing was applied. HPs l_max and n_max were chosen between 2 and 8. Cut-off radii were chosen as for SFs, and the broadening was adapted to the resulting ranges (53 steps from −20 to 6 on a log₂ grid). We report results for the gto basis set, which resulted in lower prediction errors than the polynomial one and was faster to compute. Labels were normalized to per-system values. See Reference 55 for HP search spaces and optimized model HPs.

25 | Prediction errors
Tables S3 and S4 present numerical values underlying the learning curves for RMSE (Table S3) and MAE (Table S4). For rRMSE (SI 26), standard deviations of 239.31 kcal mol⁻¹, 178.86 meV, and 104.57 meV were used for datasets qm9, ba10, and nmd18, respectively, computed over the whole dataset (differences to standard deviations over validation sets were around 1 % or less in all cases).

26 | Error metrics
We measure predictive performance by two metrics, an absolute one and a relative one that facilitates comparison across datasets. In addition, we also provide a metric for qualitative comparison with the literature.
Let y_i, f_i, and e_i = f_i − y_i denote the i-th observed label, prediction, and residual. Root mean squared error (RMSE) and mean absolute error (MAE) are given by RMSE = (1/n Σ_{i=1}^n e_i²)^{1/2} and MAE = 1/n Σ_{i=1}^n |e_i|. The canonical loss for least-squares regression is the RMSE (as it is optimized by the regression). We also provide the MAE since it is often reported in the literature (Figures S2 and S3). RMSE and MAE are scale-dependent, and thus not suited for comparison across different datasets. We therefore also report the scale-independent relative RMSE (rRMSE), rRMSE = RMSE / σ_y, where ȳ is the mean of the observed labels and σ_y is their standard deviation. The rRMSE can be seen as the RMSE relative to the RMSE* of a baseline model that always predicts the mean of the labels. While the latter is more naturally computed using training labels and the former using validation labels, as long as the assumption of independent and identically distributed data holds, the number of samples is more important.
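These metrics can be computed in a few lines of plain Python; the helper below is our own sketch mirroring the definitions of RMSE, MAE, and rRMSE, using the population standard deviation for σ_y.

```python
import math

def error_metrics(y, f):
    """Return (RMSE, MAE, rRMSE) for observed labels y and predictions f."""
    n = len(y)
    e = [fi - yi for yi, fi in zip(y, f)]          # residuals e_i = f_i - y_i
    rmse = math.sqrt(sum(ei * ei for ei in e) / n)
    mae = sum(abs(ei) for ei in e) / n
    y_bar = sum(y) / n
    sigma_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / n)
    return rmse, mae, rmse / sigma_y               # rRMSE = RMSE / sigma_y
```

A single large residual inflates the RMSE more than the MAE, which is why the two diverge in the presence of outliers.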
See References 63, 64 and references therein for an extended discussion of error metrics.

Figure S2: Learning curves for mean absolute error (MAE) of representations in Section 5 on datasets qm9 (top), ba10 (middle), and nmd18r (bottom). Shown is the MAE of energy predictions on out-of-sample data as a function of training set size. Boxes, whiskers, bars, and crosses show interquartile range, total range, median, and mean, respectively. Lines are fits to the theoretical asymptotic error. See Glossary for abbreviations.

27 | Compute times
Tables S5 to S7 present empirical computational costs, measured by processor wall-time, for calculating representations and kernel matrices. Experiments were run on a single core of an Intel Xeon E5-2698v4 2.2 GHz processor. For Table S5, representations of the 10 k, 1 k, and 600 systems (datasets qm9, ba10, nmd18) in the first outer validation set were computed en bloc and the result divided by the number of systems; this was repeated three times.
Similarly, for Table S6, kernel matrices between the representations of these systems were computed, also over three repetitions. The results were divided by the number of entries in the respective kernel matrices, yielding average kernel evaluation times. Table S7 presents a summary overview of compute times for representations and kernel matrices.
28 | Analysis details
Predictive accuracy as measured by rRMSE is worse for the solid-state datasets than for the molecular qm9 one. This might indicate that periodic systems pose harder learning tasks than molecules.
MBTR performs worse for solid-state datasets than for the qm9 one, in particular for nmd18r. This might be due to increasing difficulty of the learning problem with system size (see discussion in Section 8) and lack of intrinsic scaling with number of atoms, impeding interpolation between unit cells of different size. The high computational cost of MBTR with k = 3 for large periodic systems also renders HP optimization more difficult.
For the qm9 dataset at 1 600 training samples, we observe an increase in the RMSE standard deviation compared to neighbouring training set sizes for most methods. Comparison with the MAE, which exhibits no such effect, and inspection of individual errors revealed that this is due to outliers, that is, a few predictions with high error in some, but not all, outer splits. The problematic structures are ring molecules and are not present in the outer training split used for HP optimization. This stresses the importance of carefully stratifying benchmark datasets.

29 | nmd18u dataset
Figures S4 and S5 and Tables S3 to S7 present results for energy predictions on the nmd18u dataset, that is, the nmd18 dataset with approximate geometries obtained from Vegard's rule. In contrast to relaxed structures, such geometries can be obtained at almost no cost and could be used in virtual screening campaigns.
We observe (i) a strong increase in prediction errors (14-21 % for rRMSE), (ii) a collapse of all representations to similar performance, and (iii) large differences between MAE and RMSE, indicating significant outliers. From this, we conclude that the map from unrelaxed structures to ground-state energies is harder to learn than the map from relaxed structures to their energies, and that here the representation is not the limiting factor; other sources of error dominate.
30 | Comparison with literature-reported errors
Due to different conditions, such as sampling, regression, and HP optimization methods, comparisons with performance estimates reported in the literature must remain qualitative. Frequently, only the MAE is reported, which tends to result in lower absolute values and to de-emphasize outliers. Table S8 presents selected performance estimates from the literature. Overall, errors in this work appear to be compatible with reported ones.

31 | Comparison with DFT and experimental errors
The error of DFT simulations against experimentally measured observations depends on the system and property, as well as on the choice of density functional and other parameters, such as convergence thresholds and k-point density. For heats of formation and the Becke 3-parameter Lee-Yang-Parr (B3LYP) functional used for the qm9 dataset, (systematic) MAEs relative to experiment of ≈ 2.6 kcal mol⁻¹ have been reported for small organic molecules containing only C, H, N, O. 65 For cohesive energies and the Perdew-Burke-Ernzerhof (PBE) functional used for the ba10 and nmd18 datasets, values of approximately 200 to 300 meV/atom have been reported. [66][67][68] For the PBE functional, reported MAEs in computed energies between different parametrizations of DFT codes and RMSEs between 20 different DFT codes on 71 elements in bulk crystalline form were approximately 2 meV/atom and 1.7 meV/atom, respectively; 69 the latter reduces to 0.6 meV/atom for all-electron codes only. The best models for bulk crystals reported here have RMSEs of 4.6 meV/atom and 3.3 meV/cation on the ba10 and nmd18 datasets. However, the former benchmark values are integrated over a ±6 % interval around the equilibrium volume, whereas the values reported here are computed at the minima themselves and therefore measure related but distinct quantities. This suggests that prediction errors are at least ≈ 2-6 times larger than DFT-intrinsic variations.

Table S5: Computational cost of calculating representations in milliseconds of processor wall-time on a single core. Shown are mean ± standard deviation over three repetitions of the time to compute a single system (molecule or unit cell).