Bayesian force fields from active learning for simulation of inter-dimensional transformation of stanene

We present a way to dramatically accelerate Gaussian process models for interatomic force fields based on many-body kernels by mapping both forces and uncertainties onto functions of low-dimensional features. This allows for automated active learning of models combining near-quantum accuracy, built-in uncertainty, and constant cost of evaluation that is comparable to classical analytical models, capable of simulating millions of atoms. Using this approach, we perform large-scale molecular dynamics simulations of the stability of the stanene monolayer. We discover an unusual phase transformation mechanism of 2D stanene, where ripples lead to nucleation of bilayer defects, densification into a disordered multilayer structure, followed by formation of bulk liquid at high temperature or nucleation and growth of the 3D bcc crystal at low temperature. The presented method opens possibilities for rapid development of fast accurate uncertainty-aware models for simulating long-time large-scale dynamics of complex materials.


INTRODUCTION
Density functional theory (DFT) is one of the most successful methods for simulating condensed matter thanks to a reasonable accuracy for a wide range of systems. Ab initio molecular dynamics (AIMD) offers a way to simulate the atomic motion using forces computed at the DFT level. Unfortunately, computational requirements limit the timescale and size of AIMD simulations to a few hundred atoms for a few hundred picoseconds of time, precluding investigation of phase transitions and heterogeneous reactions. Such large-scale molecular dynamics (MD) simulation must resort to empirically derived analytical interatomic force fields with fixed functional form [1][2][3][4] , trading accuracy and transferability for larger length and timescales. Classical analytical force fields often do not match the accuracy of ab initio results, limiting simulations to describing results qualitatively at best or, at worst, deviating from the correct behavior. In order to broaden the reach of computational simulations, it would be desirable to compute forces with ab initio accuracy at the same cost of classical interatomic force fields.
However, most ML models have no predictive distributions, which provide uncertainty for energy/force predictions. Without uncertainty, the training data are generally selected from ab initio calculations via a manual or random scheme. Determining the reliability of the force fields then becomes difficult, which could result in untrustworthy configurations in the MD simulations. Therefore, uncertainty quantification is a highly desirable capability 29,30 . NN potentials, e.g., ANI 11,31 uses ensemble disagreement as an uncertainty measure. This statistical approach is, however, not guaranteed to yield reliable calibrated uncertainty. In addition, NN approaches usually require tens of thousands of data for training, and are a few orders of magnitude slower than analytical force fields. Bayesian models are promising for uncertainty quantification in atomistic simulations since they have an internal principled uncertainty quantification mechanism, the variance of the posterior prediction distribution, which can be used to keep track of the error on forces during a MD run. For instance, Jinnouchi et al. 32,33 used high-dimensional SOAP descriptors with Bayesian linear regression 19 . Gaussian process (GP) regression 18,19,[34][35][36] is a Bayesian method that has been shown to learn accurate forces with relatively small training data sets. Bartok et al. 21 used GP uncertainty with the GAP/SOAP framework to obtain only qualitative estimates of the force field's accuracy. Recently Vandermause et al. 16 demonstrated a GP-based Bayesian active learning (BAL) scheme in the FLARE framework, utilizing rigorously principled and quantitatively calibrated uncertainty, applying it to a variety of multi-component materials.
In the most common form, however, GP models require using the whole training data set for prediction of both the force and uncertainty, meaning that the computational cost of prediction grows linearly with the size of the training set, and so accuracy increases together with computational cost. In complex multicomponent structures, more than O(10 2 ) training structures or O (10 3 ) local environments are typically required to construct an accurate GP model, which makes predictions slow 37,38 . Because of the linear scaling of GP, the on-the-fly training becomes slower as more data are collected. This precludes the active learning of GP on larger system sizes, which may be needed due to finite size effects, as well as longer timescales needed to explore phase space more thoroughly. To accelerate the on-the-fly training, fast and lossless mappings as approximations of both the force and uncertainty are desired, such that the mappings can replace GP to make predictions during the BAL.
The force mapping has been addressed by Glielmo et al. 34,[34][35][36] and Vandermause et al. 16 . They noted that, for a suitable choice of the n-body kernel function, it is possible to decompose GP force/ energy prediction into a summation of low-dimensional functions of atomic positions. As a result, one can construct a parametric mapping of the n-body kernel function combined with a fixed training set. This approach reaches a constant scaling, which increases the speed of the GP model without accuracy loss. We also present the formalism of the force mapping in "Methods".
Since the uncertainty plays a central role in the on-the-fly training 16 and also suffers from the linear-scaling complexity with respect to the training set size, it is desirable to have a similar mapping of the uncertainty as that of the force. However, this was not attempted in refs. 16,[34][35][36] , since the mathematical form of the predictive variance results in decomposition with twice the dimension of the corresponding mean function, dramatically increasing the computational cost of their evaluation with spline interpolation. Thus, to date there is no method that combines high accuracy, modest training data requirement and fast prediction of both forces and their principled Bayesian uncertainty. In this work, we introduce a dimensionality reduction technique for the uncertainty mapping, such that the interpolation can be done on the same dimensionality as the force mapping, enabling development and application of efficient Bayesian force field (BFF) models for complex materials. The mathematical formalism of the uncertainty mapping is presented in "Methods".
In this article, we present an accurate mapping of both force (mean) and uncertainty (variance) implemented as the mapped Gaussian process (MGP) method. As a result, the MGP method benefits from the capability of quantifying uncertainty, while at the same time retaining cost independent of training size. The original BAL with GP can then be accelerated by the fast evaluation of forces and uncertainties of MGP. The training can also be extended to larger system sizes and longer timescales that are challenging for the full GP. To illustrate the performance for large-scale dynamics simulations, we incorporate the MGP force field with mean-only mapping in the parallel MD simulation code LAMMPS 39 , and apply it to the investigation of phase transformation dynamics. The MGP force field is shown to be efficient for large-scale (million atom) simulations, achieving speeds comparable to classical analytical force fields, several orders of magnitude faster than available NN or full GP approaches.
As a test application, we focus on stanene, a 2D material that has recently gathered attention as a quantum spin Hall insulator 40 . Moreover, the only published force field for stanene is, to the best of our knowledge, the bond-order potential by Cherukara et al. 41 , which is fitted to capture stanene's low-temperature crystalline characteristics but, as common to many empirical interatomic potentials, suffers in accuracy near the melting temperature. In particular, we show that MGP is capable of rapidly learning to describe a wide range of temperatures, below and around the melting temperature, and that we can efficiently monitor the uncertainty on forces at each time step of the MD, and use this capability to iteratively increase the accuracy of the force field by hierarchical active training. Using parallel simulations of largescale structures, we characterize the unusual phase transition, where the 2D monolayer transforms to bcc bulk Sn at the temperature of 200 K, and melts to ultimately form a 3D liquid phase at the temperature of 500 K.

RESULTS
Accelerated Bayesian active learning with MGP In a MD simulation, it is likely that the system will evolve to atomic configurations unseen before, and are far from those in the training set. In this situation, the uncertainties of the predictions will grow to large values, which may be considered unsatisfactory. Therefore, it is desirable to obtain an accurate estimate of the forces for such configurations using a relatively expensive first principles calculation, and add this new information to the training set of the GP regression model. This procedure is referred to as BAL, where training examples are added on the fly as more information is obtained about the problem. Bayesian ML models such as GP are particularly wellsuited to such uncertainty-based active learning approaches, as they provide a well-defined probability distribution over model outputs, which can be used to extract rigorous estimates of model uncertainty for any test point 42,43 .
Here we adopt BAL to achieve automatic training of models for atomic forces, expanding on our earlier workflow 16 . This way the accuracy of the GP model increases with time, particularly in the configuration regions that are less explored and likely to be inaccurate.
As mentioned above, GP regression cost scales linearly with the training set size, so the prediction of forces and uncertainties becomes more expensive as the BAL algorithm keeps adding data to the training set, hindering simulations of complex materials systems. The MGP approach is an essentially exact mapping of both mean and variance of the predicted forces from a full GP model for a fixed training set to a low-dimensional spline model. This approach has the ability to accelerate prediction of not only the mean, but also GP's internal uncertainty without loss of accuracy. MGP is incorporated into the BAL workflow as depicted in Fig. 1. In this scheme, the mapping is constructed every time the GP model is updated. Atomic forces and their uncertainties for MD are produced by MGP, reducing the computational cost of the full original GP prediction by orders of magnitude. We stress that not only is the MGP BFF faster than the original GP, but also its speedup is more pronounced as the training data set size N T increases. This is especially critical in complex systems with multiple atomic and molecular species and phases, where more training configurations are needed.
In addition, fast evaluation of forces and their variances enables training the force field model on larger system sizes. Therefore, we can train the BFF using a hierarchical scheme, i.e., for different system sizes, we can run several iterations of active learning. First, a small system size is used to train the BFF using the BAL workflow. Then, we perform BAL on a larger system size, where DFT calculations are more expensive but are needed less frequently, in order to capture potentially new unexplored configurations such as defects that are automatically identified as high uncertainty by the model of the previous step. We note that DFT calculations scale as the 3rd power of the number of electrons in the system, so it is efficient to perform most of the training at smaller sizes, and to only require re-training on fewer large-scale structures. This way the hierarchical training scheme helps to overcome finite size effects and explore phenomena that cannot be captured with small system sizes, such as phase transformations.  Case study: 2D to 3D transformation of stanene To illustrate our method, we apply the mapped on-the-fly workflow (Fig. 1) to study the phase transformation of stanene, a two-dimensional slightly buckled honeycomb lattice of tin. This structure has received much attention recently due to its unique topological electronic properties. Stanene can be synthesized by molecular beam epitaxy 44 and has been shown to have topologically protected states in the band gap 45 . Moreover, trilayer stanene is a superconductor 46 . As a demonstration of our methodology, here we want to focus on the less-studied aspects of thermodynamic stability and phase transition mechanisms of free-standing stanene. We note that a substrate is currently always used in the growth of stanene monolayers, and the interaction between the monolayer and the substrate will be investigated in a future work. By studying the intrinsic thermodynamics properties of the free-standing stanene, we are able to examine its stability and decomposition timescales as a function of temperature, which are relevant for device engineering.
Despite the intense interest, capability is still lacking for accurate modeling of thermodynamic properties and phase stability of stanene, and most other 2D materials, for which~nm scale supercells are required. Such simulations are beyond the ability of DFT, and to the best of our knowledge, the only published classical force field for stanene is a bond-order potential (BOP) 41 . Our focus is on developing an accurate potential to model the 2D melting process, which is a notoriously difficult problem to tackle with existing classical force fields. In fact, describing the melting mechanism of a 2D material is a highly nontrivial problem, since it can be characterized by specific behaviors such as defect formation and nucleation. For example, the melting of graphene is thought to occur in a temperature range between 4000 and 6000 K [47][48][49] , and seems to be characterized by the formation of defects, while above 6000 K, the structure melts directly into a totally disordered configuration. Since bulk tin melts at a much lower temperature (~500 K) than graphite (~4300 K), we expect that stanene will transform at a relatively low temperature. However, little is known about its melting mechanism and transition temperature. The MGP method is uniquely suited to describe the transformation processes, as it is scalable to large systems and capable of accurately describing any configuration that is sufficiently close to the training data.
To train the force field for stanene, we start by training the MGP BFF using the BAL loop on a small system size, and gradually increase the size of the simulation, iteratively improving upon the force field developed for the smaller size.
The DFT settings and the hyperparameters of GP are shown in Supplementary 50 .
The first iteration of training is started with a 4 × 4 × 1 supercell of stanene (32 atoms in total) and a 20-ps long MD simulation at a temperature of 50 K. Subsequently, we increase the temperature of the system at 300, 500, and 800 K by velocity rescaling, and let the system evolve for 20, 30, and 30 ps, respectively. This way we efficiently augment the training set with relevant configurations. During the simulation, DFT was called whenever the uncertainty on any single force component exceeded the current optimized noise parameter of the GP, after which the N = 1 atomic environment with the highest force uncertainty was added to the training set. Figure 2b shows the mean square displacement of atoms during the on-the-fly BAL training. As one would expect, the DFT is called multiple times at the beginning of the MD simulation, since the model needs to build an initial training data set to make force predictions. In addition, the DFT is called whenever temperature is increased, as the trajectory explores new configurations that have not been visited before. Clearly, the variance increases rapidly when temperature is increased.
In this small simulation, the stanene monolayer decomposes at 800 K, where Sn atoms start moving around the simulation cell. This allows us to augment the training set with such disordered configurations. However, the small size of the simulation cell does not yet allow us to draw conclusions about the melting process of stanene: for this, we need a larger simulation cell.
We continue with our hierarchical active learning scheme to refine the force field, but retaining the training data from the previous smaller MD run. We now increase the number of atoms in the simulation, which allows us to explore additional atomic configurations and improve the quality of the model. Since the system becomes larger and more computationally expensive, instead of updating the model at each learning step, we update it at the end of each MD simulation. This "batching" approach is one of many possible training schemes and is chosen for reasons of efficiency. The procedure is as below: (1) Run MD simulation on a system of a given size with MGP; (2) Predict mapped uncertainties for the frames from the MD trajectory; (3) Select frames of highest uncertainties in different structures (crystal, transition state, and disordered phase) for additional DFT calculations; (4) Add atomic environments from those frames based on uncertainty to the training set of GP; (5) Construct a new MGP BFF using the updated GP; (6) Repeat the process for same or larger simulation using the new BFF.
The above steps form one iteration, and can be run several times to refine the force field. In our force field training for stanene, we performed one iteration of 32 atoms as described before, and two iterations all on the system size of 200 atoms at 500 K, a relatively high temperature to explore diverse structures in the configuration phase space. We denote the MGP BFF from the on-the-fly training of 32 atoms as MGP-1, the ones from the a b  Fig. 3a. First, there is a drastic increase of uncertainty within each trajectory at a certain time. This increase corresponds to the transition from crystal lattice to a disordered phase, so the uncertainty acts as an automatic indicator of the structural change. Next, the uncertainties of MGP-1, 2, 3 decrease iteration by iteration, indicating our model becomes more confident as it collects more data. The mean absolute error of force predictions of MGP BFF against DFT is investigated, as presented in Fig. 3b, showing that MGP BFF becomes also more accurate with each subsequent training iterations. The accuracy of force prediction from MGP force field is also much higher than BOP 41 for different phases of stanene, as shown in Supplementary Table 4.
It is worth noting that in the transition process of the simulation by MGP-1, there appear a bilayer stacked defect and a specific compact triangular lattice around the defect. It is a visibly new ordered structure different from the hexagonal stanene, and is thus missing in the initial training set, as shown in Fig. 3c. The uncertainty is low in honeycomb lattice, as expected, and becomes high around the bilayer defect, exhibiting also a strong correlation with the coordination number. Thus, the mapped uncertainty is an accurate automatic indicator of the novelty of the new configurations. We note that after iterative model refinement by the hierarchical training scheme, the triangular lattice no longer appears in the simulation. This example illustrates how the hierarchical active learning scheme helps avoid qualitatively incorrect results that arise due to the a priori unknown relevant configurations. In contrast, a single-pass training of a force field based on manually chosen configurations is very likely to result in nonphysical predictions.
To investigate the full phase transition mehanism of stanene, we consider low temperature (below 200 K) and high temperature (above 500 K) values, corresponding to the crystal and liquid phase for bulk tin. In studying the graphite-diamond transition, Khaliullin et al. 51 identified that nucleation and growth of the disordered phase play key roles in the phase transition process. We also find that in stanene, the phase transition begins with the appearance of defects and the consequent nucleation and growth of the disordered phase that eventually transforms the structure.
From DFT ab initio calculations, the ground state energy of honeycomb monolayer is~0.38 eV/atom higher than the bulk, so the 2D monolayer structure is expected to be meta-stable. The full transformation described below is therefore a sequence of 2D melting, followed by a kinetically controlled transition to the bulk phase of lower free energy. We set up the simulation of stanene monolayer of size 15.6 × 18.1 nm (3000 atoms), and run MD in the NPT ensemble for 3 ns at the temperature of 200 K. We discover that the ordered honeycomb monolayer loses long-range order and undergoes densification, and finally rearranges into the bcc bulk structure spontaneously, following three transformation steps. At first, the atoms vibrate around the honeycomb lattice sites, and the monolayer remains in its ordered 2D phase (Fig. 4a  -0.3 ns). The main deviation from the ideal lattice is the out-ofplane rippling due to membrane fluctuations. The fluctuating ripples are similar to what was reported in simulations of graphene [47][48][49] . Next, double-layer stacking defects appear due to buckling and densification. These defect regions grow and quickly expand to the whole system, indicative of a first order phase transition in 2D. The atomic configuration appears to be a "disordered" phase ( Fig. 4a-0.8 ns). In this stage, the structure becomes denser and increasingly three-dimensional driven by energetic gains from increasing Sn atom coordination, with the unit cell dimensions shrinking. Finally, patches of bcc lattice nucleate during the atomic rearrangement (Fig. 4a-1.6 ns), and those patches grow until the whole system forms the crystal bcc slab and atoms start vibrating around the lattice sites ( Fig. 4a-2 ns).
The stable phase of bulk tin below 286.35 K is expected to be the α-phase with the diamond structure, which is different from what we observe here. The reason for the formation of the bcc phase is kinetic rather than thermodynamic, caused by a lower free energy barrier separating the 3D disordered structure from the bcc phase than the α-phase, making the former phase more kinetically accessible. One well-known example is the graphite-diamond transition, in which the meta-stable hexagonal diamond (HD) is obtained from hexagonal graphite (HG) in most laboratory synthesis, instead of the more thermodynamically stable cubic diamond (CD) [52][53][54][55][56] . Computational study 51 reveals that the higher in-layer distortions result in higher energy barrier for HG-CD than HG-HD transformation, under most of the experimental conditions. We observe that the 2D stacking orderdisorder and the 3D crystallization transformation steps proceed via nucleation and growth, where the defect nuclei grow and turn the whole system into the intermediate disordered dense phase, while in the second step the bcc nuclei appear from the disordered configuration and grow, turning the system into the bcc crystal phase. Our simulation result implies the meta-stability of the free-standing stanene monolayer. The evidences provided by the MD simulation are consistent with the experimental fact that all the existing growths of the stanene monolayer are achieved on substrates 3,44-46,57-59 . The role that the substrate plays in stabilizing the two-dimensional stanene monolayer is thus of great interests and realistic significance, and will be investigated in our future work.
To investigate the melting behavior at higher temperature, we focus on a 600 ps long MD run that starts from a pristine sample of stanene with size 39.4 × 22.7 nm (10,000 atoms) at~500 K. A few frames of this simulations are shown in Fig. 4b. The melting process can also be identified as three stages. The first stage is the development of large out-of-plane ripples in the free-standing 2D lattice. In the second stage, at around 100 ps, double-layer defects in the monolayer form and agglomerate, causing densification. After 200 ps, densification due to large patches of disordered structures lead to the formation of holes, and a large void area grows. We observe breaking of the two-dimensionality of the crystal and formation of stacked configurations where atoms climb over other atoms of 2D stanene. In the third stage, the disordered stacked configurations shrink into spherical liquid droplets connected by a neck (400 ps). Subsequently the droplets merge to form a large sphere, minimizing the surface energy (600 ps). In Supplementary Fig. 3, we include more discussions and animations of the transition processes.

Performance
In this section, we discuss the computational cost of the MGP model in application to large-scale MD simulations, and its advantages and limitations. Timing results are shown in Fig. 5a, comparing different models including DFT, full GP (without mapping; implemented in Python), MGP (with mapped force and uncertainty; Python), MGP-F (MGP with force prediction only; Python), MGP (force only; implemented in LAMMPS), and BOP (LAMMPS force field). For DFT, we tested prediction time for system sizes of 32 and 200 atoms, where computational cost grows as the cube of the system size. The other algorithms' cost scales linearly with system size, since they rely on local environments with finite cutoff radius, we measure the same per atom computational cost for both system sizes.
It is important to highlight the acceleration of the MGP relative to the original full GP. We present the timings of GP and MGP built from different training set sizes (100 and 400 training data) in Fig.  5, where the GP scales linearly. MGP-F (without uncertainty) and MGP (with uncertainty, fixed rank) are independent of the training set size. In the stanene system, MGP BFF is two orders of magnitude faster than GP with O(10 2 ) training data points when both force and variance are predicted, and the speedup becomes more significant as the GP collects more training data, as reported in Supplementary Fig. 4. We also note that although the prediction of MGP gets rid of the linear scaling with training set size, the construction of the mappings does scale linearly with training set size, since it requires GP to make prediction on each grid point.
The cost of the MGP force field implemented in LAMMPS (force only) is comparable to that of empirical interatomic BOP force field, since both BOP and MGP models consider interatomic distances within the local environment of a central atom and thus have similar computational cost at the same cutoff radius. For accurate simulations, however, the MGP uses a larger cutoff radius (7.2 Å) than what is used by the BOP (6 Å), and thus a simulation with our MGP force field for stanene is slower by a factor of 4-7 approximately, see our test of 1 million atoms in Table 1. The computational cost of the MGP force field in LAMMPS as a function of the system size is shown in Fig. 5b. The prefactor of the linear dependence depends on the number of pairs and triplets within a local environment. The scaling factor for the 2D honeycomb lattice structure is lower than for the dense 3D liquid, because each atom in the 3D liquid has more neighbors than the honeycomb lattice within the same cutoff radius.
The hierarchical active learning scheme adopted here is an efficient path to systematically improve the force field. However, it must be noted that the training is still limited by the system sizes that can be tackled with a DFT calculation. In fact, DFT calculations with thousands of atoms are unaffordable. To circumvent this problem, regions of high uncertainties may be cropped out of the large simulation cell, in order to perform DFT calculation on a smaller subset of atoms with a relevant structure. In this direction, we note that uncertainty is a local predicted property in MGP BFF, indicating how each atom contributes to the total uncertainty. In Fig. 3c, we show that the largest uncertainty in an MD frame of 2048 atoms comes from regions close to defects, since these configurations are missing in the training set for that particular model. Therefore, one may use the uncertainty to automatically select a section of the crystal close to the high-uncertainty region, and use a DFT simulation to add a training data point to the MGP. Automatically constructing such structures remains an open challenge in complex materials. However, access to the local uncertainty for each force prediction already allows for assessment of reliability of simulations, and is central to the paradigm of BFFs.
Finally, we mention that the descriptive power of our current BFF is limited due to the lack of full many-body interactions in GP kernel, since the n-body kernel only uses low-dimensional functions of crystal coordinates. In this work, we have implemented the MGP for 2-and 3-body interactions, which, for the mapping problem, translate to interpolation problems in three dimensions. The approach may be extended to higher interaction orders at the expense of computational efficiency, but it becomes expensive in terms of both computation and storage requirements for splines on uniform grids to reach a satisfactory accuracy. Many-body descriptors can be taken into account to increase the descriptive power. To map it in the same way as the methodology we introduced, it is necessary to have a descriptor such that it can be decomposed into low-dimensional functions. For example, the SOAP approach of refs. 32,33 cannot be mapped using the methodology followed in this work, since it relies on a highdimensional descriptor. In particular, the interpolation procedure may be too computationally expensive to be a viable solution.

DISCUSSION
In conclusion, we present an extended method for mapping the GP regression model with many-body kernels, where we accurately parameterize both the force and its variance obtained from the original Bayesian model as functions of atomic configurations. The mapped forces and their variance are then utilized in an active learning workflow to obtain a machinelearned force field that includes its own uncertainty, the first realization of a fast BFF. The resulting model has near-quantum accuracy and computational cost comparable to empirical analytical interatomic potentials. Large-scale simulations are used as a demonstration to investigate the microscopic details of interdimensional transformation behavior of 2D stanene into 3D tin. We discover that the transformation proceeds by nucleation of bilayer defects, densification due to continued disordered multilayer stacking, and finally conversion into either crystalline or molten bulk tin, depending on temperature. This application shows that we can actively monitor the uncertainties of force predictions during MD simulations and iteratively improve the model as the material undergoes reorganization, exploring diverse structures in the configuration phase space. The ability to reach simulation sizes of over 1 million atoms is promising for application of ML force fields to study phase diagrams and transformation dynamics of complex systems, while automated active learning of force fields opens a way to accelerate widerange material design and discovery.

Background: Gaussian process force field
In this section, we summarize the GP model introduced in ref. 16 . From here on, we use bold letters to denote matrices/vectors and regular letters for scalars/numbers.
An atomic configuration is defined by atomic coordinates R α , a vector of dimension equal to the number of atoms N atoms , where α labels the three Cartesian directions α = x, y, z. While in general the crystal may contain atoms of different species, we here consider only a single species case for the sake of simplicity. The results can be extended to the multi-component case, which is discussed in Supplementary Method 1. The objective of the GP regression model is to predict the forces F α for a given atomic configuration R α . To this aim, we associate the forces of an atom i, F i, α , to the local environment around this atom ρ i . F i, α can then be computed by comparing its environment with a training data set of atomic environments with known forces, which is quantified by distance d between the target environment and the training set.
The atomic environment ρ i of the atom i is defined as all the atoms that are within a sphere of a cutoff radius r cut centered at the atom i. We proceed by partitioning this atomic environment ρ i into a set of n-atom subsets, denoted by ρ To each atomic pair p we associate an interatomic distance r p between the two atoms. The distance metric between two pairs p and q is then defined as ðr p À r q Þ 2 . And the distance metric between two atomic environments ρ ð2Þ i and ρ ð2Þ j is defined as: This distance d (2) will be used to determine the two-body contribution to the atomic forces, for the purposes of constructing the kernel function.
This construction can be extended to higher-order interatomic interactions. In this work, we also include three-body interactions, corresponding to n = 3. Similar to the two-body case, the set ρ ð3Þ i of triplets p between the central atom i and two other atoms i 0 and i″ in ρ i is defined as: This time, each triplet p is characterized by a vector r p = {r p,1 , r p,2 , r p,3 }, which represents the three interatomic distances of this atomic triplet p. Including permutation, the distance metric between two triplets p and q can be defined as P u;v ðr p;u À r q;v Þ 2 . And the 3-body distance metric between two atomic environments ρ Having now defined a distance between atomic environments, we can build a GP regression model for atomic forces. We define the equivariant force-force kernel function to compare two atomic environments as: where α and β label Cartesian coordinates, and ∂ ∂Riα indicates the partial derivative of the kernel with respect to the position of atom i. The n-body energy kernel function k ðnÞ ðρ ðnÞ i ; ρ ðnÞ j Þ is first expressed in terms of the partitions of n atoms of the two atomic environments ρ i and ρ j , and is defined as: wherek ðnÞ ðp; qÞ is a Gaussian kernel function, built using the definition of distance between n-atoms subsets as described before. The σ ðnÞ sig and l (n) are hyperparameters independent of p and q, that set the signal variance related to the maximum uncertainty and the length scale, respectively. The smooth cutoff function φ ðnÞ cut dampens the kernel to zero as interatomic distances approach r cut . In this work, we chose a quadratic form for the cutoff function φ ð2Þ cut ðp; qÞ ¼ ðr q À r cut Þ 2 ðr p À r cut Þ 2 ; The GP regression model uses training data as a reference set of N T atomic environments. In particular, predictions of force values and variances require a 3N T × 3N T covariance matrix where σ n is a noise parameter (indices iα and jβ are grouped together). We define an auxiliary 3N T dimensional vector η with components In order to make a prediction of forces and their errors on a target structure ρ i , we compute the Gaussian kernel vector k iα for atom i and a direction α (α = x, y, z), where the components of the vector are kernel distances ðk iα Þ jβ :¼ k αβ ðρ i ; ρ j Þ between the prediction iα and the training set points jβ, j = 1, ..., N T , β = x, y, z. Finally, the predictions of the mean value of the Cartesian component α of the force F iα acting on the atom at the center of the atomic environment ρ i and its variance V iα are given by ref. 60 : We notice that this formulation has two major computational bottlenecks, that we intend to address in this work. First, the vector k iα needs to be constructed for every force prediction; for each atom, the evaluation of k iα requires the evaluation of the kernel between the atomic environment and the entire training data set, i.e., N T calls to the Gaussian kernel. We discuss an approach to speed up the evaluation of k iα . Second, every variance calculation requires a matrix-vector multiplication that is computationally expensive, and thus will be further discussed in "Methods".

Mapped force field
In GP regression, a kernel function is used to build weights that depend on the distance from training data: training data that are similar to the test data contribute with a larger weight, and vice versa, dissimilar training data contribute less. To make a prediction on a data point (force on an atom in its local environment), the GP regression model requires the evaluation of the kernel between such point and all the N T training set data points. The GP mean prediction is a weighted average of the training labels, using the distances as measured by the kernel function as weights. In Eq. (11), we notice that the predictive mean requires the evaluation of a 3N Tdimensional vector k iα , which quantitatively weighs the contribution to the test atomic environment ρ i coming from the training atomic environments ρ j .
In the following discussion, we consider a specific subset size n, capturing n-body atomic structure information, i.e., ðk Starting with the idea introduced in ref. 34 , the kernel definition (Eq. (6)) allows us to decompose the force prediction into contributions from natom subsets, and to decompose the variance into contributions from pairs of n-atom subsets.
To see this, we first note that the kernel vector of atomic environments ρ i can be decomposed into the kernel of all the n-atom subsets in ρ i , and the kernel of an n-atom subset p can be further decomposed into the contributions from all the n − 1 neighbor atoms (except for the center atom i). We denote r s p :¼ R s À R i as a bond in p, where s labels the remaining n − 1 neighbor atoms, and introduce the unit vectorr s p :¼ r s p =r s p denoting the direction of the bond. Then the force kernel has decomposition  (15) are vectors in R 3NT space. In the first equality of Eq. (13), we used the decomposition of Eq. (6), and in the second equality we first convert the derivative from atomic coordinates to the relative distances of the n-atom subset, then take advantage of the fact that the kernel only depends on relative distances between atoms, i.e., r s p ¼k r s p k. Therefore, forces may be written as where f ðnÞ s ðpÞ :¼ μ ðnÞ s > ðpÞη (17) and η is defined in Eq. (10). As noted in ref. 34 , we can achieve a better scaling by a computationally efficient parametric method, i.e., an approximate way for constructing such functions f. In fact, a parametric function can quickly yield the value without evaluating kernel distances from all the GP training points, making the cost independent of the size of the training set, thus remove a computational bottleneck of the full GP implementation.
In order to achieve an effective parametrization of these weight functions, we use a cubic spline interpolation on a uniform grid of points, which benefits from having an analytic form and a continuous second derivative. By increasing the number of points in the uniform grid, the MGP BFF prediction can be made arbitrarily close to the original functions.
The construction of the mapped force field includes the following steps: Step 1: Define a uniform grid of points fx k 2 ½a; b ð n 2 Þ g in the space of distances given by the vectors r p . The lower bound a ≥ 0 is chosen to be smaller than the minimal interatomic distance, and the upper bound b can be set to the cutoff radius of atomic neighborhoods, above which the prediction vanishes (using the zero prior of GP), and n 2 ¼ nðnÀ1Þ 2 is the number of interatomic distances in the n-atom subset.
Step 2: A GP model with a fixed training set yields the values ff ðnÞ s ðx k Þg for each grid point x k using Eq. (17).
Step 3: Interpolate with a cubic spline functionf ðnÞ s through the points fðx ðnÞ k ; f ðnÞ s ðx k ÞÞg. In prediction, contributions from all n-bodies are added up, so the force of local environment ρ is predicted as: with the covariance matrix Σ defined in Eq. (9). We note that the domain of the variance function v has dimensionality twice that of f, so that when the interaction order n > 3, the number of grid points becomes prohibitively large to efficiently map v. Even for a 3-body kernel, when v is a 6-D function, we find that the evaluation of the variance, both in terms of computational time and memory footprint, limits its usage to simple benchmarks.
An efficient evaluation of the variance is critical for adopting BAL. We now discuss a key result of this work and introduce an accurate yet efficient approximate mapping field for the variance. In particular, we focus on simplifying the vector μ ðnÞ s ðpÞ, which, from our tests, is the most computationally expensive term for predicting variance. In particular, since μ ðnÞ s ðpÞ evaluates the kernel function between the test point p and all the training points, it scales linearly with the training set size. By implementing a mapping of the variance weight field, the cost of variance prediction depends only on the rank of our dimension reduction approach, and is independent of the training data set size.
To this aim, we first define the vector ψ ðnÞ s ðpÞ :¼ L À1 μ ðnÞ s ðpÞ 2 R 3NT , where L is the Cholesky decomposition of Σ, i.e., Σ = LL ⊤ , so that Eq. (19) can be written as: The domain of vector function ψ ðnÞ s ðpÞ is half the dimensionality of that of v ðn;n 0 Þ s;t ðp; qÞ. As a first idea, since each component of ψ (n) has domain of lower dimensionality, one may think of building 3N T spline functions to interpolate the 3N T components separately, which means the number of spline functions needed grows with the training set size.
To achieve a more efficient mapping of ψ ðnÞ s , we take advantage of the principal component analysis (PCA), a common dimensionality reduction method. The construction of the mapped variance field includes the following steps: Step 1: We start by evaluating the function values on the uniform grid fx k 2 ½a; b In this form, the variance would still require a multiplication between two matrices of size m × 3N T . However, we note that the computational cost of total variance can be reduced V α ðρÞ ¼ k α;α ðρ; ρÞ À Nominally, the calculation of the second term in the variance still scales with the data set size N T , since the size of V ðnÞ depends on N T . However, we find the cost of the vector-matrix multiplication negligible compared to evaluations of the self-kernel function (the 1st term in Eq. (22)). Therefore, we have formulated a way to significantly speed up the calculation of the variance, which only weakly depends on the training set size. In addition, by tuning the PCA rank m, we can optimize the balance between accuracy and efficiency, increasing the flexibility of this model.
The convergence of the accuracy of mapped forces and uncertainty with different grid numbers G and ranks m is discussed in Supplementary Fig. 2.

DATA AVAILABILITY
The data and related code in this paper are published in Materials Cloud Archive 61 with https://doi.org/10.24435/materialscloud:qg-99 62 .