On-the-fly active learning of interpretable Bayesian force fields for atomistic rare events

Machine learned force fields typically require manual construction of training sets consisting of thousands of first principles calculations, which can result in low training efficiency and unpredictable errors when applied to structures not represented in the training set of the model. This severely limits the practical application of these models in systems with dynamics governed by important rare events, such as chemical reactions and diffusion. We present an adaptive Bayesian inference method for automating the training of interpretable, low-dimensional, and multi-element interatomic force fields using structures drawn on the fly from molecular dynamics simulations. Within an active learning framework, the internal uncertainty of a Gaussian process regression model is used to decide whether to accept the model prediction or to perform a first principles calculation to augment the training set of the model. The method is applied to a range of single- and multi-element systems and shown to achieve a favorable balance of accuracy and computational efficiency, while requiring a minimal amount of ab initio training data. We provide a fully open-source implementation of our method, as well as a procedure to map trained models to computationally efficient tabulated force fields.


INTRODUCTION
Recent machine learned (ML) force fields have been shown to achieve high accuracy for a number of molecular and solid-state systems [1][2][3][4][5][6][7][8][9][10][11] . These methods provide a promising path toward long, large-scale molecular dynamics (MD) simulations driven by force predictions that approach the accuracy of quantum mechanical methods like density functional theory (DFT). However, most currently available ML force fields return point estimates of energies, forces, and stresses rather than predictive distributions that reflect model uncertainty, making the incorporation of accurate uncertainty estimates into ML force fields an outstanding challenge [12][13][14][15][16][17][18] . Without model uncertainty, a laborious fitting procedure is required, which usually involves manually or randomly selecting thousands of reference structures from a database of first principles calculations. In production MD runs, a lack of principled means to compute predictive uncertainties makes it difficult to determine when the force field is trustworthy, leading to unreliable results and lack of guidance on how to update the model in the presence of new data.
Here, we show that active learning based on Gaussian process (GP) regression can accelerate and automate the training of highquality force fields by making use of accurate internal estimates of model error. By combining DFT with low-dimensional GP regression models during molecular dynamics simulations, accurate force fields for a range of single-and multi-element systems are obtained with~100 DFT calculations. Moreover, we demonstrate that the model can be flexibly and automatically updated when the system deviates from previous training data. Such a reduction in the computational cost of training and updating force fields promises to extend ML modeling to a wider class of materials than has been possible to date. The method is shown to successfully model rapid crystal melts and rare diffusive events, and so we call our method FLARE: Fast Learning of Atomistic Rare Events, and make the open-source software freely available online (https://github.com/mir-group/flare).
The key contribution of this work that makes on-the-fly learning possible is the development of a fully interpretable lowdimensional and nonparametric force field that provides trustworthy estimates of model uncertainty. Typical ML force fields involve regression over a high-dimensional descriptor space chosen either on physical grounds 19,20 or learned directly from ab initio data 6,10 . These approaches involve highly flexible models with many physically non-interpretable parameters, complicating the task of inferring a posterior distribution over model parameters. We instead bypass the need for a high-dimensional descriptor by imposing a physical prior that constrains the model to n-body interactions, with high accuracy observed in practice with 2-and 3-body models. Because the low-dimensional descriptor space of our models can be sampled with a small amount of training data, our method avoids sparsification, a procedure that is used in Gaussian approximation potentials to make inference tractable with many-body descriptors like SOAP [20][21][22] , but that requires approximate treatment of GP uncertainty estimates 23,24 . The learning task is simplified as a result, making it possible to automatically tune the model's hyperparameters in a data-driven fashion and derive trustworthy estimates of model uncertainty. This opens the door to a practical uncertainty-driven method for selecting training points "on the fly" 25 , allowing an accurate force field to be trained with a minimal number of relatively expensive first principles calculations.
The resulting GP-based force fields are interpretable in three important respects. First, the underlying energy model of the GP is a physically motivated sum over n-body contributions, such that each cluster of n − 1 neighbors in an atom's environment makes a direct contribution to the force on that atom. This establishes a connection to previous physically motivated force fields, most notably the Stillinger-Weber force field 26 , which also sums over 2-and 3-body contributions but is limited to a specific analytic form. Our models, by contrast, learn nonparametric 2-and 3-body functions directly from ab initio data, allowing the models to generalize well to complex multi-element systems, as we show in the Results section below. Second, the model does not require a descriptor of the entire local environment of an atom, instead relying on a kernel that directly compares interatomic distances of small clusters of atoms. As a result, the only free parameters in the model are a small set of hyperparameters of the GP kernel function, each of which has a direct interpretation and can be rigorously optimized by maximizing the log marginal likelihood of the training data. Neural network and Gaussian approximation potentials, on the other hand, rely on complex high-dimensional descriptors of an atom's environment, making it less apparent how the force acting on an atom is related to the configuration of its neighbors. Finally, and most importantly for active learning, the uncertainty estimates of our GP models break down into two contributions: the epistemic uncertainty σ iα , which is assigned to each atom i and force component α and is determined by distance from the training set, and the noise uncertainty σ n , which characterizes fundamental variability in the training data that cannot be captured by the model. The latter source of error arises from several simplifying approximations that improve computational efficiency, including the exclusion of interactions outside the cutoff radius of the model, the decomposition of global energies into local contributions, and the restriction to 2-and 3body interactions 4,22 . By optimizing the noise uncertainty σ n of the GP, the combined magnitude of these errors can be learned directly from the data (see Methods). The interpretable uncertainties derived from the GP model provide a principled basis for automated training, in which a local environment is added to the training set of the model when the epistemic uncertainty σ iα on a force component exceeds a chosen multiple of the noise uncertainty σ n .
Other GP and active learning based methods for force field training have been proposed in the literature, and we discuss them briefly here to put our method in context. Bartók et al. pioneered the use of GP-based force fields in the Gaussian approximation potential (GAP) framework 21,22 , with subsequent applications combining 2-and 3-body descriptors with the manybody SOAP kernel to achieve high accuracy for a range of extended systems 4,7,20 . Recent GAP studies have reported uncertainty estimates on local energy predictions 8 and introduced self-guided protocols for learning force fields based on random structure searching rather than uncertainty-driven active learning 7,27 . Rupp et al. 28 and more recently Uteva et al. 29 used GP regression to model potential energy surfaces of small molecular systems with active learning, and Smith et al. recently proposed a query-by-committee procedure for actively learning neural network force fields for small molecules 30 . On-the-fly force field training for extended systems was first proposed by Li, Kermode, and De Vita 25 , but the method relied on performing DFT calculations to evaluate model error due to a lack of correlation between the internal error of their GP model and true model error 31 . Podryabinkin and Shapeev developed an on-the-fly method for their linear moment tensor potentials 32 using the Doptimality criterion, which provides an internal informationtheoretic measure of distance from the training set 13 , with subsequent applications to molecules, alloys, and crystal structure prediction 18,33,34 . The D-optimality criterion is usually restricted to linear models and does not provide direct error estimates on model predictions. More recently, Jinnouchi et al. combined a multi-element variant of the SOAP kernel with Bayesian linear regression to obtain direct Bayesian error estimates on individual force components, which was used to perform on-the-fly training of force fields to study melting points and perovskite phase transitions 35,36 . This approach relies on a decomposition of the atomic density of each atom into many-body descriptors based on spherical Bessel functions and spherical harmonics, with the number of descriptors growing quadratically with the number of elements in the system 37 . The machine learned force fields presented here possess four important features that have not been simultaneously achieved before: they are nonparametric, fully Bayesian, explicitly multi-element, and can be mapped to highly efficient tabulated force fields, making our automated method for training these models widely applicable to a range of complex materials.

RESULTS
FLARE: an on-the-fly learning method The goal of FLARE is to automate the training of accurate and computationally efficient force fields that can be used for largescale molecular dynamics simulations of multi-element systems. The low-dimensional GP kernel that we use throughout this work, sketched in Fig. 1a, is calculated by comparing interatomic distances of clusters of two and three atoms, similar to the singleelement kernel presented by Glielmo, Zeni, and De Vita 38 but here generalized to arbitrarily many chemical species. If the two clusters are not of the same type, as determined by the chemical species of the atoms in the cluster, the kernel is assigned a value of zero, allowing the GP to differentiate between chemical species while remaining low-dimensional (see Methods). Restricting the model to a sum over two-and three-dimensional contributions reduces the cost of training the model, allowing the descriptor space to be systematically sampled with a relatively small number of DFT calculations, and also reduces the cost of production MD runs with the final trained model, since the GP can be mapped onto efficient cubic spline models that allow the 2-and 3-body contributions to the force on an atom to be directly evaluated 38 . We have implemented this mapping as a pair style in the molecular dynamics software LAMMPS, allowing us to study multielement systems containing more than ten thousand atoms over nanosecond timescales (Fig. 5 below).
The low dimensionality of our models also makes it practically feasible to rigorously optimize the hyperparameters of the kernel function, which leads to trustworthy estimates of model uncertainty. The reliability of these uncertainties is the key feature of our approach that enables FLARE, an adaptive method for training force fields on the fly during molecular dynamics. As sketched in Fig. 1b, the algorithm takes an arbitrary structure as input and begins with a call to DFT, which is used to train an initial GP model on the forces acting on an arbitrarily chosen subset of atoms in the structure. The GP then proposes an MD step by predicting the forces on all atoms, at which point a decision is made about whether to accept the predictions of the GP or to perform a DFT calculation. The decision is based on the epistemic uncertainty σ iα of each GP force component prediction (defined in Eq. (5) of Methods), which estimates the error of the prediction due to dissimilarity between the atom's environment and the local environments stored in the training set of the GP. In particular, if any σ iα exceeds a chosen multiple of the current noise uncertainty σ n of the model, a call to DFT is made and the training set is augmented with the forces acting on the N added highest uncertainty local environments, the precise number of which can be tuned to increase training efficiency. All hyperparameters, including the noise uncertainty σ n , are optimized whenever a local environment and its force components are added to the training set, allowing the error threshold to adapt to novel environments encountered during the simulation (see Methods).

Characterization of model uncertainty
To justify an on-the-fly learning algorithm, we first characterize the noise and epistemic uncertainties of GP models constructed with the 2-and 3-body kernels described above, and compare them against test errors on out-of-sample structures. Importantly, the optimized noise uncertainty σ n and epistemic uncertainties σ iα are found to provide a sensitive probe of true model error, with the noise uncertainty capturing the baseline error level of model predictions on local environments that are well represented in the training set, and the epistemic uncertainties capturing error due to deviation from the training data. In Fig. 2a-c, we test the relationship between GP uncertainties and true error by performing a set of plane-wave DFT calculations on a 32-atom supercell of FCC aluminum with the atoms randomly perturbed from their equilibrium sites. In Fig. 2a, we examine the noise uncertainty σ n as a function of the cutoff radius of the model, which determines the degree of locality of the trained force field. 2-and 2+3-body GP models were trained on forces acting on atoms in a single structure and then tested on an independently generated structure, with the atomic coordinates in both cases randomly of the predictive posterior distribution of the GP (dotted). f Learning curves of GP models trained on 5-element high entropy alloy structures, with training environments selected randomly (red) and with active learning (blue). The RMSE on force components of an independent test structure is plotted along with the distribution of uncertainties, shown as a band between the minimum and maximum uncertainties on force components in the structure.
perturbed by up to 5% of the lattice parameter, a lat = 4.046 Å. For the 2-body models, the cutoff radius was swept from 3.5 to 8 Å in increments of 0.5 Å, and for the 2+3-body models, the 2-body cutoff was held fixed at 6 Å and the 3-body cutoff was swept from 3 to 4.5 Å. The optimized noise uncertainty σ n plotted in Fig. 2a closely tracks the root mean squared error (RMSE) on the test structure for the range of examined cutoff values. The observed correlation provides a principled way to select the cutoff radius of the GP, showing that the expected error of a model with a given cutoff can be directly estimated from the optimized noise uncertainty σ n when the GP model has been trained on sufficient data.
When the GP model is trained on insufficient data, the epistemic uncertainties σ iα rise above the noise uncertainty σ n , indicating that the model requires additional training data to make accurate force estimates. The utility of the epistemic uncertainty is illustrated in Fig. 2b, which examines GP uncertainties as a function of the amount of data in the training set. Using the same training and test structures as Fig. 2a, a 2+3-body GP model with a 6-Å 2-body cutoff and 4-Å 3-body cutoff was constructed by adding local environments one by one to the training set and evaluating the RMSE and GP uncertainty after each update. The average GP uncertainty ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi σ 2 n þ σ 2 iα q closely tracks the RMSE, where σ iα is the mean epistemic uncertainty over all force components in the test structure.
We also demonstrate in Fig. 2c that the epistemic uncertainty provides an accurate indicator of model error when the model is forced to extrapolate on local environments that are significantly different from local environments in the training set. To systematically investigate distance from the training set, a 2+3-body GP model was trained on a single aluminum structure with atomic coordinates perturbed by δ = 5% of the lattice parameter and tested on structures generated with values of δ ranging from 1 to 50%, with δ = 50% giving rise to a highly distorted structure with a mean absolute force component of 28.6 eV/Å and a maximum absolute force component of 200.5 eV/Å (compared to a mean of 0.50 eV/Å and maximum of 1.48 eV/Å for the training structure). As shown in Fig. 2c, the mean epistemic uncertainty σ iα increases with δ and exceeds the optimized noise uncertainty of σ n = 11.53 meV/Å for δ > 5%, demonstrating the ability of the GP to detect when it is predicting on structures that are outside the training set. This capability is crucial for on-the-fly learning, as the model must be able to flag when additional training data is needed in order to accurately estimate forces. We furthermore observe that the error is substantially underestimated for large values of δ due to an upper bound on the epistemic uncertainty imposed by the signal variance hyperparameters of the kernel function, with the bound nearly saturated for δ > 20% (see Methods for the definition of this bound). This emphasizes the importance of re-optimizing the hyperparameters when additional data is introduced to the training set, allowing the model to adapt to novel structures.
In Fig. 2d, e we demonstrate that GP uncertainties on individual force components can also provide valuable information about the expected errors on structures not represented in the training set. Figure 2d shows individual GP uncertainties ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi σ 2 iα þ σ 2 n p on the predicted force components of a relaxed vacancy structure when the GP was trained on bulk local environments only. Each atom is colored according to the maximum uncertainty of the three predicted force components acting on the atom, with atoms closer to the defect tending to have higher uncertainties. This test was repeated for ten randomly perturbed vacancy structures, with the true error plotted in Fig. 2e against the GP uncertainty ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi σ 2 iα þ σ 2 n p of each force component, showing that higher uncertainties coincide with a wider spread in the true error.
We finally demonstrate in Fig. 2f that the GP uncertainties are trustworthy for more complex multi-element systems. In this test, two GP models were trained on the five-element high entropy alloy (HEA) DFT forces of Zhang et al. 10 , with training environments selected randomly for the first GP model and with active learning for the second. Specifically, thirty-nine HEA structures were drawn from the "rand 1" portion of this dataset, and for each structure, twenty training environments were selected either at random or by identifying the highest uncertainty environments in the structure. After each update to the training set, both the GP uncertainties and true model error on an independent HEA structure were evaluated (with the test structure taken from the "rand2" portion of the dataset and having a different random allocation of elements). The distribution of total uncertainties ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi σ 2 iα þ σ 2 n p on force components in the test structure is shown for both models in Fig. 2f by plotting a band between the minimum and maximum uncertainties, which encloses the true RMSE. Actively selecting environments based on model uncertainty has the effect of shifting the learning curve downward, with the actively trained GP reaching a RMSE of 0.445 eV/Å on the test structure. The GP model obtained with active learning was subsequently mapped to a tabulated force field in order to rapidly evaluate forces on the entire "rand2" test set of Zhang et al. 10 , which consisted of 149 HEA structures with elements placed at random lattice sites. The RMSE averaged over all test structures was found to be 0.466 eV/Å for the tabulated GP model, comparable to the RMSE of 0.410 eV/Å reported for the deep neural network model of Zhang et al. 10 and outperforming the Deep Potential model of Zhang et al. 9 , which achieved a RMSE of 0.576 eV/Å on the same test set. We note that both neural network models were trained on 400 HEA structures 10 , which exceeds the number of structures the GP was trained on by more than an order of magnitude.
Aluminum crystal melt As a first demonstration of on-the-fly learning driven by GP uncertainties, we consider a 32-atom bulk aluminum system initialized in the FCC phase at low temperature, with N added ¼ 1 local environment added to the training set whenever the epistemic uncertainty on a force component exceeds the current noise uncertainty, σ thresh = σ n . As shown in Fig. 3a, DFT is called often at the beginning of the simulation as the GP model learns a force field suitable for FCC aluminum. After about 30 time steps, the model needs far fewer new training points, requiring fewer than 50 DFT calls in the first 5 ps of the simulation. To test the model's ability to adapt to changing conditions, the crystal is melted at time t = 5 ps by rescaling the velocities of the atoms to give the system an instantaneous temperature of 10 4 K, well above the experimental melting point of aluminum (933 K) due to the strong finite size effects of the 2 × 2 × 2 supercell. The subsequent temperature in the remaining 5 ps of the simulation stabilizes around 5000 K with a radial distribution function consistent with the liquid phase (Fig. 3c). As shown in Fig. 3b, which plots the cumulative number of DFT calls made during the training run, the GP model makes frequent calls to DFT immediately after the crystal melts, as the local environments in the liquid phase of aluminum are significantly different from the previous solid-state training environments. The noise uncertainty σ n of the model, shown in red in Fig. 3b, sharply increases as the system enters the liquid phase, reflecting the fact that it is more difficult to model, involving more diverse local environments and significantly larger force fluctuations. Because the error threshold σ thresh is set equal to the optimized noise uncertainty σ n , the threshold in the liquid phase is higher, and as a result the GP model requires a roughly similar number of DFT calls to learn the solid and liquid phases. Fewer than 100 calls are needed in total during the 10 ps of dynamics, with the majority of DFT calls made at the beginning of the simulation and immediately after melting.
The obtained force field is validated by testing the model on two independent 10-ps ab initio molecular dynamics (AIMD) simulations of the solid and liquid phases of aluminum. 100 structures were sampled from the AIMD trajectories with 0.1-ps spacing between structures. Force predictions on all test structures were obtained with a tabulated version of the GP force field of Fig. 3a and compared against the corresponding DFT values, with the RMSE in eV/Å plotted in Fig. 3d. For reference, the models are compared against recent EAM and AGNI ML force fields, which were also trained on plane-wave DFT calculations with GGA exchange-correlation functionals and PAW pseudopotentials 12,39 , though we note that they were not trained on exactly the same DFT calculations as our models. Also included for comparison is the performance of a 2-body FLARE model trained on the same local environments as the 2+3-body model. Each force field was tested on the same structures, with the FLARE force field reaching the lowest force errors for both trajectories. This is due in part to the fact that FLARE optimizes the force field for the specific simulation of interest, only augmenting the training set when necessary. This bypasses the need to anticipate all possible phases which a system might explore when creating the force field. To assess computational efficiency, 1000 MD steps were performed with the LAMMPS implementations of these four force fields on a single CPU core for a system of 1372 bulk Al atoms, with the cost of each force field plotted in Fig. 3e in s/atoms/timestep. The cost of the current LAMMPS implementation of the tabulated 2-body FLARE force field is found to be 5.6 × 10 −6 s/atom/timestep, which is the same order of magnitude as the EAM cost of 2.2 × 10 −6 s/atom/timestep. The 2+3-body model is about an order of magnitude slower at 4.9 × 10 −5 s/atom/ timestep, but still faster than AGNI, which directly predicts forces with a small neural network. This makes FLARE considerably less expensive than many-body models like GAP, with the cost of the recent GAP silicon model reported as 0.1 s/ atom/timestep 8 .

Bulk vacancy and surface adatom diffusion
We next demonstrate that FLARE can be used to train force fields that dramatically accelerate simulations of rare-event dynamics over timescales spanning hundreds of picoseconds by applying the method to aluminum bulk vacancy diffusion and surface adatom diffusion. For bulk vacancy training, a 1-ns simulation was initialized by removing one atom from an equilibrium 32-atom FCC structure and setting the instantaneous initial temperature to 1500 K, giving a mean temperature of 734 K across the simulation. The GP model was constructed with a 2-body kernel with cutoff r ð2Þ cut ¼ 5:4 Å, resulting in a final optimized noise uncertainty of σ n = 70.2 meV/Å. Discarding the 3-body contribution was found to significantly accelerate the simulation while still achieving low force errors due to the simplicity of the single-defect bulk crystalline phase, opening up nanosecond timescales during training. As shown in Fig. 4a, most DFT calls are made early on in the simulation, and after the first~400ps, no additional DFT calls are required. The model predicts vacancy hops every few hundred picoseconds, which appear as sharp jumps in the mean squared displacement plotted in Fig. 4a. To check the accuracy of the underlying energy model of the GP, DFT energies were computed along the high symmetry transition path sketched in the inset of Fig. 4b, with a nearest neighbor migrating into the vacancy while all other atoms in the simulation cell were kept frozen at their fcc lattice sites. GP forces and energies along the transition path were evaluated to give an estimate of the energy barrier, showing close agreement with the ab initio DFT values (Fig. 4c), with the DFT forces lying within one standard deviation of the GP force predictions (Fig. 4b). The entire FLARE training run, including DFT calculations, GP hyperparameter optimization, force evaluations and MD updates, were performed on a 32-core machine in 68.8 h of wall time. Individual DFT calls required over a minute of wall time on average, making FLARE over 300 times faster than an Fig. 3 Active learning of a multi-phase aluminum force field. a Instantaneous temperature during a 10-ps on-the-fly MD trajectory generated with the FLARE learning algorithm. The simulation begins in the FCC phase at low temperature and is melted at t = 5 ps. When the epistemic uncertainty σ iα on a force component rises above the current noise uncertainty σ n of the model, DFT is called (black dots). b The number of DFT calls (solid) and optimized noise uncertainty (dotted) throughout the simulation. A sharp increase is observed when the crystal is melted, illustrating the model's ability to actively learn the liquid phase. c During the first 5 ps of the simulation, the radial distribution function (RDF) is consistent with that of an fcc crystal (solid line), while in the final half of the simulation, the system exhibits an RDF characteristic of the liquid phase (dashed). d RMSE on AIMD forces of a tabulated version of the resulting force field compared with EAM, AGNI, and a tabulated 2-body FLARE force field. e Computational cost of LAMMPS implementations of these force fields on a single CPU core in s/atom/timestep.

equivalent AIMD run (see Supplementary Table 2 for a breakdown of GP prediction costs).
To test the accuracy of FLARE on a subtler transition with a significantly lower energy barrier, we consider aluminum adatom diffusion on a four-layer (111) aluminum slab, with a representative structure shown in the inset of Fig. 4d. As revealed in previous ab initio studies, an isolated Al adatom on the (111) Al surface exhibits a small but surprising preference for the hcp site 40,41 , making this system an interesting and challenging test for a machine learned force field. For this system, 3-body contributions were found to considerably increase the accuracy of the force field, with a 7-Å 2-body cutoff and 4.5 Å 3-body cutoff giving an optimized noise uncertainty of σ n = 44.2 meV/Å after the final DFT call at t = 62.2 ps (Fig. 4d). To validate the energetics of the force field, a 7-image nudged elastic band (NEB) calculation characterizing the transition from the hcp to fcc adatom sites was performed using the Atomic Simulation Environment 42 with the GP energy predictions shown in blue in Fig. 4f. The DFT energies of each image of the NEB calculation are shown in black, showing agreement to within ≈20 meV for each image and confirming the GP's prediction of a slight energetic preference for the hcp site in equilibrium, which is not reproduced by the EAM model of Sheng et al. 39 (red line in Fig. 4f). An independent DFT NEB calculation was performed for the same transition, showing good agreement with the DFT energies of the FLARE NEB images.

Fast-ion diffusion in AgI
As a third and more challenging example of diffusion, we apply FLARE to the fast-ion conductor silver iodide (AgI), which exhibits a structural phase transition at 420 K from the low-temperature γ∕β-phase to a cubic "superionic" α-phase, with silver ions in the α-phase observed to have a liquid-like diffusivity 43 . A 2+3-body FLARE model was trained in a 15 ps on-the-fly simulation of 48 AgI atoms in the α-phase, with the temperature increased at 5 and 10 ps (Fig. 5a). The uncertainty threshold was set to twice the noise uncertainty, σ thresh = 2σ n , making the model slightly less sensitive to changing temperature and contributing to the 1-ps delay observed between the temperature increase at 5 ps and the next call to DFT at t = 6.121 ps. Thirty-nine calls to DFT were made in total, with the N added ¼ 10 highest uncertainty local environments added to the training set after each DFT calculation.
After training, the model was mapped to a tabulated cubic spline model in LAMMPS, which was used to perform 1-ns simulations at zero pressure and fixed temperature, with each simulation requiring about three hours of wall time on 32 cpu cores (≈3.2 × 10 −5 cpu ⋅ s/atom/timestep). Ten MD simulations were performed in total with temperatures ranging from 200 to 650 K in intervals of 50 K. In each simulation, the system was initialized in a pristine 14 × 14 × 14 α-phase supercell (10,976 atoms total), with the silver ions placed at the energetically preferred tetragonal interstices of the bcc iodine sublattice. The diffusion coefficients of the Ag ions are plotted in Fig. 5b, showing a sharp increase between 400 K and 450 K, in good agreement with the experimental fast-ion transition temperature of 420 K. The diffusion coefficients are compared with an AIMD study of the α-phase of AgI 44 , which used a similar exchange-correlation functional, showing excellent agreement at 450 K and above. Both FLARE and AIMD show good agreement with experimentally observed α-phase Ag diffusion coefficients 45 , with a slight vertical offset but comparable activation energies of 0.107, 0.114, and 0.093 eV for FLARE, AIMD, and experiment, respectively. Below the transition temperature, the FLARE force field correctly predicts a phase transition to a non-diffusive and non-cubic hcp phase with a nearest neighbor I-I coordination of 12, consistent with the γ and β phases of AgI 46 . This accounts for the discrepancy between the FLARE and AIMD diffusion coefficients in the low-temperature regime, as the latter simulations were conducted in the α-phase with a fixed cubic cell. Example structures from the 400 and 450 K FLARE MD simulations are illustrated in Fig. 5c, with the lowtemperature structure giving a c∕a ratio of 1.46 and the hightemperature structure having a lattice parameter of a lat = 5.30 Å, in fair agreement with the corresponding experimental values of c∕a = 1.63 and a lat = 5.07 Å near these temperatures (for the βand α-phases, respectively) 47 .
General applicability Finally, we demonstrate in Fig. 6 that FLARE can be widely applied to diverse systems, including covalently bonded insulators and semiconductors, as well as oxides, alloys, and two-dimensional materials. FLARE training runs were performed for five representative systems-carbon, silicon, aluminum oxide, nickel titanium, and two-dimensional boron nitride-with the instantaneous temperature of each system rescaled at t = 5 ps to illustrate the model's ability to detect and adapt to novel local environments (see the left half of Table 1 for training details). To accelerate training of the nickel titanium model, which required expensive DFT calculations, the error threshold was set to twice the noise uncertainty, σ thresh = 2σ n , significantly reducing the total number of DFT calls needed to~20 (as shown in Fig. 6d and Table 1). Adding multiple local environments to the training set after each DFT call also had the effect of reducing the total number of DFT calls needed, as apparent in the aluminum oxide training run, for which N added ¼ 30 local environments were added after every DFT call and only 16 DFT calls were needed in total to train the model. Each training run was performed on a 32-core machine and took between 11.3 and 64.4 h of wall time (for silicon and carbon, respectively). We emphasize that the training procedure for each material is fully automatic, with the training set and hyperparameters updated on the fly without any human guidance.
To validate the models, independent NVE molecular dynamics trajectories of duration 10 ps were generated with each GP force field, with DFT calculations performed for ten MD frames spaced equally across the simulation and compared against the corresponding GP predictions. We find low root mean squared errors (RMSE) of around 0.1 eV/Å for four of the five systems, and for carbon we find a RMSE of 0.42 eV/Å due to the much higher temperature of the carbon validation run. The RMSE over all force component predictions in the ten representative frames is reported in Table 1. In order to illustrate the range of force magnitudes present in the simulation, we also report the 95th percentile of the absolute force components in these frames, with the ratio of the two reported in the final column of Table 1. The resulting ratios lie between 3 and 10%, similar to the ratios reported in a recent study of amorphous carbon with a Gaussian approximation potential 4 .

DISCUSSION
In summary, we have presented a method for automatically training low-dimensional Gaussian process models that provide accurate force estimates and reliable internal estimates of model uncertainty. The model's uncertainties are shown to correlate well with true out-of-sample error, providing an interpretable, principled basis for active learning of a force field model during molecular dynamics. The nonparametric 2-and 3-body FLARE models described here require fewer training environments than high-dimensional machine learning approaches, and are therefore well-suited to settings where large databases of ab initio calculations are too expensive to compute. Our models have a simple, accurate, and physically interpretable underlying energy model, which we have shown can be used to map the GP to a faster regression model approaching the speed of a classical force field. This provides a path toward force fields tailored to individual applications that give good agreement with DFT at several orders of magnitude lower computational cost, which we expect to considerably expand the range of materials that can be accurately studied with atomistic simulation. Particularly promising is the application of the FLARE framework to dynamical systems dominated by rare diffusion or reaction events that are very difficult to treat with existing ab initio, classical force field, or machine learning methods.
Extending this active learning method to complex systems like polymers and proteins is an important open challenge. The Bayesian force fields presented here may serve as a useful guide for selecting small, uncertain fragments from these systems that can then be evaluated with DFT to refine the force field, similar to other recent approaches that train on small portions of larger structures 48,49 . This may provide a path toward accurate machine learned force fields for chemical and biological systems that are currently outside the reach of DFT and other quantum mechanical methods.

Gaussian process force fields
As observed by Glielmo et al. 38,50,51 , the task of fitting a force field can be dramatically simplified by assuming that only small clusters of atoms in the local environment of an atom i contribute to its local energy E i . We define the n-body local environment ρ ðnÞ i of atom i to be the set of atoms within a cutoff distance r ðnÞ cut from atom i, and a cluster of n atoms to be the atom i and n − 1 of the atoms in ρ ðnÞ i . The energy ε si;i 1 ;:::;i nÀ1 ðd i;i1;:::;inÀ1 Þ of each cluster of n atoms is assumed to depend on the species of the atoms in the cluster, s i;i1;:::;inÀ1 ¼ ðs i ; s i1 ; ::; s inÀ1 Þ, and on a corresponding vector of interatomic distances between the atoms, d i;i1;:::;inÀ1 . For example, for clusters of two atoms, this vector consists of a single scalar, d i;i1 ¼ ðr i;i1 Þ; where r i;i1 is the distance between the central atom i and atom i 1 , and for clusters of three atoms, d i;i1;i2 ¼ ðr i;i1 ; r i;i2 ; r i1;i2 Þ. The local energy assigned to atom i may then be written as where the outer sum ranges over each n-body contribution to the energy up to a chosen maximum order N and the inner sum ranges over all clusters of n atoms inside the n-body environment ρ ðnÞ i . The regression task is to learn the functions ε si;i 1 ;:::;i nÀ1 ðd i;i1;:::;inÀ1 Þ, which for small n have much lower dimensionality than the full potential energy surface.
To learn the cluster contributions ε si;i 1 ;:::;i nÀ1 , we use ab initio force data to construct Gaussian process (GP) models, an established Bayesian approach to describing probability distributions over unknown functions 23 . In GP regression, the covariance between two outputs of the unknown function is related to the degree of similarity of the inputs as quantified by a kernel function. For our GP force fields, the covariance between n-body energy contributions (ε si;i 1 ;:::;i nÀ1 in Eq. (1)) is equated to a kernel function k n that directly compares the interatomic distance vectors while preserving rotational invariance. The local energy kernel between two local environments ρ i , ρ j is expressed as a sum over kernels between clusters of atoms, ðnÞ i j nÀ1 > >j 1 2ρ ðnÞ j X Pn δ si;i 1 ; ; i nÀ1 ;Pnsj;j 1 ; ; j nÀ1 k n ðd i;i1; ; inÀ1; P n d j;j1; ; jnÀ1 Þ: Importantly, this kernel function explicitly distinguishes between distinct species, with the delta function δ evaluating to 1 if the species vectors s i;i1;:::in of the clusters under comparison are equal and 0 otherwise. The Fig. 6 On-the-fly force field learning applied to a range of singleand multi-element systems. In each training run, the instantaneous temperature (blue) was increased at time t = 5.0 ps, triggering DFT calls and updates to the GP model (black dots) caused by model detection of novel local environments. Example structures from each simulation are shown in the insets.
innermost sum of Eq. (2) is over all permutations P n of indices of the species and distance vectors of the second cluster, guaranteeing invariance of the model under permutation of atoms of the same species. The resulting force kernel describing the covariance between force components is obtained by differentiating the local energy kernel with respect to the Cartesian coordinates r ! iα ; r ! jβ of the central atoms of ρ 1 and ρ 2 , giving an exactly rotationally covariant and energy conserving model of interatomic forces 5,38,50 . For completeness, we provide in Supplementary  Table 4 the formulas involved in computing the 3-body derivative kernel described by Eq. (3), along with its derivatives with respect to the hyperparameters of the kernel, which are used to calculate the gradient of the log marginal likelihood during hyperparameter optimization.
In this work, we choose N = 3, restricting the sum to 2-and 3-body contributions, as we have found the resulting GP models to be sufficiently expressive to describe with high accuracy a range of single-and multielement systems while remaining computationally efficient. This is consistent with the findings of Glielmo et al. 38 , which compared the performance of 2-, 3-, and many-body kernels and found that many-body models required substantially more training data while only modestly improving performance for several crystals, nanoclusters, and amorphous systems. Further investigation of model accuracy as a function of the maximum order N of the kernel for different types of materials is an interesting area for future study, as it may provide a systematic data-driven approach to characterizing many-body interactions in complex materials.
For the pair and triplet kernels k 2 and k 3 , we choose the squared exponential kernel multiplied by a smooth quadratic cutoff function f cut that ensures the model is continuous as atoms enter and exit the cutoff sphere, k 2 ðr i;i1 ; r j;j 1 Þ ¼ σ 2 s;2 exp À ðri;i 1 Àrj;j 1 Þ 2 2' 2 2 f cut ðr i;i1 ; r j;j 1 Þ; where σ s, (2,3) is the signal variance related to the maximum uncertainty of points far from the training set, ℓ (2,3) is the length scale of the 2-and 3body contributions, and ||. || denotes the vector 2-norm.
The force component f iα on each atom i and the square of the epistemic uncertainty σ 2 iα assigned to that force component are computed using the standard GP relations 23 , where k iα is the vector of force kernels between ρ i and the local environments in the training set, i.e. k iα;jβ ¼ k α;β ðρ i ; ρ j Þ, K is the covariance matrix K mα,nβ = k α,β (ρ m , ρ n ) of the training points, y is the vector of forces acting on the atoms in the training set, and σ n is a hyperparameter that characterizes observation noise. The total uncertainty on the force component, corresponding to the variance of the predictive posterior distribution of the predicted value, is obtained by adding σ 2 n , the square of the noise uncertainty 23 . Notice that the square of the epistemic uncertainty is bounded above by k α,α (ρ i , ρ i ), which for our kernel function is determined by the signal variances σ 2 s;2 and σ 2 s;3 . In all models in this work, the hyperparameters θ = {σ 2 , σ 3 , ℓ 2 , ℓ 3 , σ n } are optimized with SciPy's implementation of the BFGS algorithm 52 by maximizing the log marginal likelihood of the training data ρ = {ρ 1 , ρ 2 , . . ., ρ n }, which takes the form 23 log pðyjρ; θÞ ¼ À 1 2 y T ðK þ σ 2 n IÞ À1 y À 1 2 log jK þ σ 2 n Ij À n 2 log 2π: To efficiently maximize this quantity with BFGS, the gradient with respect to all hyperparameters is calculated with the analytic expression 23 , ∂ ∂θ i log pðyjρ; θÞ ¼ 1 2 tr ðα α T À K y À1 Þ ∂K y ∂θ j ; where α ¼ K À1 y y and K y ¼ K þ σ 2 n I. The formulas for the kernel derivatives with respect to the hyperparameters that appear in this expression, ∂Ky ∂θj , can be exactly calculated, and we list them in Supplementary Table 4 for the case of the 3-body kernel. The BFGS algorithm is terminated once the log marginal likelihood gradient falls below a threshold value ϵ = 10 −4 . Note that computation of the log marginal likelihood and its gradient involves inverting the covariance matrix K and is efficient if the model is trained on fewer than~1000 points. This data-driven approach to selecting model hyperparameters stands in contrast to other GP force fields, in which hyperparameters are chosen heuristically 4 . Mapping to tabulated spline models As shown by Glielmo et al. 38 for single-element systems, GP models built on n-body kernels can be mapped to efficient cubic spline models, eliminating the expensive loop over training points involved in the calculation of the kernel vector k iα in Eq. (5). We have extended this mapping procedure to our multi-element kernels by constructing cubic spline interpolants for each n-body force contribution À d d r ! i ε si;i 1 ;:::;i nÀ1 ðd i;i1;:::;inÀ1 Þ. The 2-and 3-body contributions require 1-and 3-dimensional cubic splines, respectively. The resulting spline model can be made arbitrarily accurate relative to the original GP model by increasing the number of control points of the spline. In Supplementary Table 3, we report the grid of control points used for each mapped force field in this work.

Computational details
All DFT calculations were performed using Quantum Espresso 6.2.1, with pseudopotentials, k-point meshes, plane-wave energy cutoffs, and charge density energy cutoffs for all calculations reported in Supplementary Table 1.
The on-the-fly learning algorithm is implemented with the FLARE package (https://github.com/mir-group/flare), which couples our Python-based MD and GP code with Quantum ESPRESSO 53 . Kernel and distance calculations are accelerated with the open-source just-in-time compiler Numba to enable training simulations spanning hundreds of picoseconds 54 . All onthe-fly molecular dynamics trajectories were performed in the NVE ensemble using the Verlet algorithm. LAMMPS simulations of AgI were performed in the NPT ensemble at zero pressure. Atomistic visualizations were created using Atomeye 55 . Training: N atoms is the number of atoms in the training simulation, r 2 and r 3 are the 2-and 3-body cutoffs of the GP models, σ thresh is the uncertainty threshold that determines when DFT is called, N added is the number of local environments added each time DFT is called, and t wall is the total wall time of the training simulation. Validation: T is the mean temperature during the validation simulation, RMSE is the root mean squared error on ten snapshots from the validation run, and P 95 is the 95 th percentile of force components in these 10 snapshots. The ratio between the RMSE and P 95 is reported in the final column.