Hyperactive learning for data-driven interatomic potentials

Data-driven interatomic potentials have emerged as a powerful tool for approximating ab initio potential energy surfaces. The most time-consuming step in creating these interatomic potentials is typically the generation of a suitable training database. To aid this process, hyperactive learning (HAL), an accelerated active learning scheme, is presented as a method for rapid, automated training database assembly. HAL adds a biasing term to a physically motivated sampler (e.g. molecular dynamics), driving atomic structures towards regions of high uncertainty and in turn generating unseen or valuable training configurations. The proposed HAL framework is used to develop atomic cluster expansion (ACE) interatomic potentials for the AlSi10 alloy and the polyethylene glycol (PEG) polymer, starting from roughly a dozen initial configurations. The HAL-generated ACE potentials are shown to determine macroscopic properties, such as the melting temperature and density, with close to experimental accuracy.


I. INTRODUCTION
Over the last decade there has been rapid progress in the development of data-driven interatomic potentials; see the review papers [1][2][3][4][5][6]. Many systems are too complex to be modelled by an empirical description yet inaccessible to electronic structure methods due to prohibitive computational cost. Richly parametrised data-driven interatomic potentials bridge this gap and are able to successfully describe the underlying chemistry and physics by approximating the potential energy surface (PES) with quantum mechanical accuracy [7][8][9]. This approximation is done by regressing a high-dimensional model to training data collected from electronic structure calculations.
Over the years many approaches have been explored using a range of different model architectures. These include artificial neural networks (ANN) based on atom-centred symmetry functions [10], used in models such as ANI [11,12] and DeepMD [13]. Another widely used approach is Gaussian process regression (GPR), implemented in models such as SOAP/GAP [14,15], FCHL [16] and sGDML [17]. Linear approximations of the PES have also been introduced, initially using permutation-invariant polynomials (PIPs) [18] and the more recent atomic PIPs variant [19,20]. Other linear models include spectral neighbour analysis potentials [21] based on the bispectrum [22], moment tensor potentials [23] and the atomic cluster expansion (ACE) [24][25][26]. More recently, message passing neural network (MPNN) architectures have been introduced [27][28][29][30][31][32][33][34], the most recent of which outperform the previously mentioned models on benchmarks such as MD17 [35] and ISO17 [36]. Central to all of these models is that they are fitted to a training database comprised of configurations R labelled with observations comprising the total energy E_R, forces F_R and perhaps virial stresses V_R, obtained from electronic structure simulations. By performing a regression on the training data, model predictions E of the total energy, and estimates of the respective forces F_i = −∇_i E, can be determined. Here, the ∇_i operator denotes the gradient with respect to the position of atom i.

* casv2@cam.ac.uk
Building suitable training databases remains a challenge and the most time-consuming task in developing general data-driven interatomic potentials [37][38][39]. Databases such as MD17 and ISO17 are typically created by performing molecular dynamics (MD) simulations on the structures of interest and selecting decorrelated configurations along the trajectory. This approach samples the potential energy surface according to its Boltzmann distribution. Once the training database contains a sufficient number of configurations, a high-dimensional model may be regressed in order to accurately interpolate the potential energy surface. The interpolation accuracy can be improved by further sampling, albeit with diminishing returns. However, it is by no means clear that the Boltzmann distribution is the optimal measure, or even a "good" measure, from which to draw samples for an ML training database. Indeed, it likely results in severe undersampling of configurations corresponding to defects and transition states, particularly for material systems with high barriers, which nevertheless have a profound effect on material properties and are often the subject of intense study.
A lack of training data in a sub-region can lead to deep unphysical energy minima in trained models, sometimes called "holes", which are well known to cause catastrophic problems for MD simulations: the trajectory can get trapped in these unphysical minima or even become numerically unstable for normal step sizes. A natural strategy to prevent such problems is active learning (AL): the simulation is augmented with a stopping criterion aimed at detecting when the model encounters a configuration for which the prediction is unreliable. Intuitively, one can think of such configurations as being "far" from the training set. When this situation occurs, a ground-truth evaluation is triggered, the training database extended, and the model refitted to the enlarged database. In the context of data-driven interatomic potentials, this approach was successfully employed by the linear moment tensor potentials [40,41] and the Gaussian process (GP) based methods FLARE [42,43] and GAP [44], which both use the site energy uncertainty arising from the GP to formulate a stopping criterion in order to detect unreliable predictions during simulations.
The key contribution of this work is the introduction of the hyperactive learning framework. Rather than relying on normal MD to sample the potential energy and waiting until an unreliable prediction appears (which may take a very long time once the model is decent), we continually bias the MD simulation towards regions of high uncertainty. By balancing the physical MD driving force with such a bias we accelerate the discovery of unreliably predicted configurations but retain the overall focus on the low-energy regions that are important for modelling. This exploration-exploitation trade-off originates from Bayesian optimisation (BO), a technique used to efficiently optimise a computationally expensive "black box" function [45]. BO has been shown to yield state-of-the-art results for optimisation problems while simultaneously minimising incurred computational costs by requiring fewer evaluations [46]. In atomistic systems BO has been applied in global structure search [47][48][49][50], where the PES is optimised to find stable structures. Other previous work balancing exploration and exploitation in data-driven interatomic potentials is also closely related, where configurations were generated by balancing high uncertainty and high likelihood (or rather low energy) [51]. There the PES was explored by perturbing geometries while monitoring uncertainty, rather than explicitly running MD. Note that upon the completion of this work, we discovered a closely related work that also uses uncertainty-biased MD [52]. The two studies were performed independently and appeared on preprint servers near-simultaneously.
In BO an acquisition function balances exploration and exploitation, controlled by a biasing parameter. In HAL this role is played by the biased potential energy surface

E_HAL = E − τ σ,    (1)

where E is the model energy, σ is an uncertainty measure, and τ is the biasing parameter.
The biasing strength, represented by the biasing parameter τ, controls the exploration of unseen parts of the PES and needs to be carefully tuned in order for the HAL-MD trajectory to remain energetically sensible. An on-the-fly autotuning of τ is presented in the Methods section.
The addition of a biasing potential, accelerating the exploration of relevant configurations, has a long history in the study of rare events and free energy computations, using adaptive biasing strategies such as meta-dynamics [53,54], umbrella sampling [55,56], and similar methods (e.g., [57,58]). While the biasing force in these methods is implicitly specified by the choice of a collective variable, the direction of the biasing force in HAL is the result of the choice of the uncertainty measure σ.
We make the general HAL concept concrete in the context of the ACE "machine learning potential" framework [24,25]; however, the methods we propose are immediately applicable to linear models and to Gaussian process type models, and are in principle also extendable to any other ML potential that comes with an uncertainty measure, including deep neural network models. In the context of linear ACE models, described in detail in the Methods section, the site energy is defined as a linear combination of basis functions, E_i = c • B_i, and the total energy is E = c • B with B = Σ_i B_i. The prediction of the uncertainty σ can, for example, be obtained through the use of an ensemble. Different methods of setting up such ensembles for linear, GP or NN frameworks can be used, such as dropout [59] or bootstrapping [60]. In this work, we leverage the linearity of the ACE model and adopt a Bayesian view of the regression problem so that we are able to use unbiased uncertainty estimation. The drawback of analytical estimates of uncertainty is that they are often expensive to compute, which would preclude their evaluation at every MD time step, as needed by HAL. We circumvent this problem by setting up a committee-based estimator for the unbiased Bayesian uncertainty measure, which yields an efficient algorithm with negligible overhead on top of ordinary MD. Assuming an isotropic Gaussian prior on the model parameters and Gaussian independent and identically distributed (i.i.d.) noise on observations yields an explicit posterior distribution π(c) of the parameters, from which one can deduce the variance σ_E^2 of the posterior-predictive distribution of total energies,

σ_E^2 = B^T Σ B,    (3)

where the covariance matrix Σ is defined as

Σ = (α Ψ^T Ψ + λ I)^{-1}.    (4)

Here, α, λ are hyperparameters whose treatment is detailed in the Methods section, and Ψ is the corresponding design matrix of the linear regression problem, which depends on the observations to which the ACE model is fitted. The evaluation of the uncertainty or variance σ_E^2 in equation (3) is computationally expensive for a large basis B, scaling as O(N_basis^2). To improve computational efficiency, σ_E^2 can be approximated using an ensemble {c_k}_{k=1}^K obtained by sampling from the posterior π(c) (see Methods for further details), resulting in

σ̂_E^2 = (1/K) Σ_{k=1}^K (c_k • B − Ē)^2,    (5)

where Ē = c̄ • B, with c̄ the mean of the posterior distribution whose closed form is provided in (22) of the Methods section. This is computationally efficient to evaluate, requiring a single basis evaluation B followed by K dot products with the ensemble parameters. Throughout the remainder of this article we fix the choice of uncertainty measure in the definition of the HAL energy to be the standard deviation of the posterior-predictive distribution of energy as outlined above, i.e., σ = σ_E, which we approximate as σ ≈ σ̂_E. From both a theoretical and modelling perspective, it would be of interest to consider other measures of uncertainty as biasing terms; further discussion of this aspect is provided in the Methods section.
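The two uncertainty estimates above can be sketched in a few lines of linear algebra. The following is a minimal illustration, not the ACE implementation: the design matrix, hyperparameters alpha (noise precision) and lam (prior precision), and all variable names are toy placeholders, and the committee estimate is checked against the exact O(N_basis^2) expression.

```python
import numpy as np

def posterior_covariance(Psi, alpha, lam):
    """Sigma = (alpha * Psi^T Psi + lam * I)^-1 for the Gaussian posterior pi(c)."""
    n = Psi.shape[1]
    return np.linalg.inv(alpha * Psi.T @ Psi + lam * np.eye(n))

def variance_exact(B, Sigma):
    """Exact predictive variance sigma_E^2 = B^T Sigma B; O(N_basis^2) per evaluation."""
    return B @ Sigma @ B

def variance_committee(B, c_samples, c_mean):
    """Monte-Carlo estimate: one basis evaluation B, then K cheap dot products."""
    E_k = c_samples @ B            # committee energies E_k = c_k . B
    E_bar = c_mean @ B             # mean-parameter energy E_bar = c_mean . B
    return np.mean((E_k - E_bar) ** 2)

# toy regression problem with 50 observations and an 8-function basis
rng = np.random.default_rng(0)
Psi = rng.normal(size=(50, 8))
Sigma = posterior_covariance(Psi, alpha=1.0, lam=0.1)
c_mean = np.zeros(8)               # placeholder posterior mean
c_samples = rng.multivariate_normal(c_mean, Sigma, size=4096)
B = rng.normal(size=8)
# the committee estimate converges to the exact variance as K grows
```

With K = 4096 samples the relative Monte-Carlo error is of order sqrt(2/K), a few percent, while each evaluation costs only K dot products instead of a matrix-vector product with Sigma.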
Having introduced HAL-MD, it remains to specify a stopping criterion that can be used to terminate the dynamics and extract new training configurations. To that end we introduce a relative force uncertainty, f_i, which is attractive from a modelling perspective since, for instance, liquid and phonon properties require vastly different absolute force accuracy but similar relative force accuracy, typically on the order of 3-10%. Given the model committee we introduced to define σ, we define

f_i = σ̂_{F_i} / (||F̄_i|| + ε),    (6)

where F̄_i is the mean force prediction and σ̂_{F_i} the committee standard deviation of the force on atom i. Further, ε is a regularising constant that prevents divergence of the fraction; it is to be specified by the user and is often set to around 0.2 eV/Å. During HAL simulations, f_i provides a computationally efficient means to detect emerging local (force) uncertainties and trigger new ab initio calculations once it exceeds a predefined tolerance,

max_i f_i ≥ f_tol.

The specification of f_tol is both training-data and model specific, and often requires careful tuning to achieve good performance. Too low an f_tol keeps triggering unnecessary ab initio calculations, whereas too high a value leads to the generation of unphysical high-energy configurations. To avoid manual tuning and aid generality, we normalise f_i onto [0, 1] through the application of the softmax function s(f_i), resulting in the new stopping criterion

max_i s(f_i) ≥ s_tol,

where we use the default tolerance s_tol = 0.5.

The paper is structured as follows. Following an initial discussion of the performance of the relative force error measure f_i, its ability to predict true error is investigated and its performance benchmarked by assembling a reduced diamond-structure silicon database. Next, the HAL framework is used to build training databases for an alloy (AlSi10) and a polymer (polyethylene glycol, PEG) from scratch, and the resulting ACE models are shown to accurately predict the AlSi10 melting temperature and PEG density.
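The relative force uncertainty and its triggering can be sketched as follows. This is an illustrative reading of Eq. (6), assuming per-atom committee force predictions; the exact form of the paper's softmax normalisation is not reproduced here, so a simple monotone map of f_i onto [0, 1) stands in for s(f_i), and all names (forces_k, eps, s_tol) are placeholders.

```python
import numpy as np

def relative_force_uncertainty(forces_k, eps=0.2):
    """f_i = committee force std / (|mean force| + eps), per atom.

    forces_k: array of shape (K, n_atoms, 3) with one force prediction per
    committee member.  eps (in eV/A) regularises near-zero mean forces.
    """
    f_bar = forces_k.mean(axis=0)                                  # (n_atoms, 3)
    var = np.mean(np.sum((forces_k - f_bar) ** 2, axis=2), axis=0) # (n_atoms,)
    return np.sqrt(var) / (np.linalg.norm(f_bar, axis=1) + eps)

def triggered(f_i, s_tol=0.5):
    """Stand-in normalisation onto [0, 1); trigger an ab initio call past s_tol."""
    s = f_i / (1.0 + f_i)          # monotone map; plays the role of s(f_i)
    return bool(np.max(s) >= s_tol)
```

A committee that disagrees strongly on a small mean force yields f_i >> 1 and trips the criterion, while perfect agreement gives f_i = 0.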

A. Filtering an existing training set
Before illustrating the HAL algorithm itself, we first demonstrate the ability of the relative force error estimate f_i in Eq. (6) to detect true relative force errors. To that end, we use the estimator to significantly reduce a large training set while maintaining accurate model properties relative to the DFT reference. The database we use for this demonstration was originally developed for a Si GAP model [38] and contains a wide range of structures, ranging from bulk crystals in various phases to amorphous, liquid and vacancy configurations. The filtering process builds a reduced database by starting from a single configuration and selecting the configuration containing the maximum f_i from the remaining test configurations. Iterating this process accelerates the learning rate and rapidly converges model properties with respect to the DFT reference. The models we train are linear ACE models with basis functions up to correlation order ν=3, polynomial degree 20, outer cutoff 5.5 Å and inner cutoff set to the closest interatomic distance in the training database. An auxiliary pair potential basis was used with polynomial degree 3, outer cutoff 7.0 Å and no inner cutoff. The weights for the energy w_E, forces w_F and virials w_V, which are described in detail in the Methods section, were set to 5.0/1.0/1.0. The size of the committees used to determine f_i was K = 32.

Si diamond: error correlation and convergence
Prior to training database reduction, the ability of the relative force error estimate f_i to predict the true relative force error is investigated. Fig. 1a compares the maximum relative force error in a configuration against the maximum of f_i for two different training databases, containing 4 and 10 silicon diamond configurations respectively. The test configurations are the remaining configurations among the 489 silicon diamond configurations that form part of the entire silicon database (totalling 16708 local environments). The regularising constant ε was set to the mean force magnitude as predicted by the mean parameterisation. Both figures show good correlation between the maximum relative force error and max f_i, making it a suitable criterion to be monitored during (H)AL strategies.
By leveraging the correlation of f_i with the true relative force error, the existing silicon diamond database can be reduced by iteratively selecting configurations containing the largest relative force uncertainty as part of a greedy selection strategy. To demonstrate this, a model was first fitted to a single configuration from the 489 silicon diamond configurations in the silicon database. Next, f_i was determined over the remaining configurations and the configuration containing the largest max f_i was added to the training database. This process was repeated; the train and test error of this filtering procedure for silicon diamond is shown in Fig. 1b. It is benchmarked against random selection whereby, starting from the same initial configuration, test configurations were chosen at random from the pool of remaining configurations. The result indicates that f_i accurately detects configurations with large errors and accelerates the learning rate significantly relative to random selection. Good generalisation between training and test errors is achieved using around 5% of the total environments contained in the original silicon diamond database.
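The greedy filtering loop above has a simple generic structure: fit, score the remaining pool, move the highest-scoring configuration into the training set, repeat. The sketch below uses a toy scalar "configuration" and a distance-based score standing in for max f_i of a refitted model; the refit step and score function are placeholders, not the ACE pipeline.

```python
import numpy as np

def greedy_select(pool, score, n_select):
    """Iteratively move the highest-scoring configuration from the pool to the train set."""
    train, remaining = [], list(pool)
    for _ in range(n_select):
        scores = [score(train, cfg) for cfg in remaining]
        best = int(np.argmax(scores))
        train.append(remaining.pop(best))   # a model refit would happen here
    return train

def score(train, cfg):
    """Toy uncertainty proxy: distance to the nearest already-selected configuration."""
    if not train:
        return np.inf
    return min(abs(cfg - t) for t in train)

pool = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0]
picked = greedy_select(pool, score, 3)   # spreads picks across the pool
```

Note how the greedy rule first covers the extremes of the pool before returning to clusters it has already seen, which is exactly the behaviour that accelerates the learning curve relative to random selection.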

Si diamond: property convergence
The significant acceleration of the learning rate shown in Fig. 1b shows that generalisation between train and test error is rapidly achieved, in turn suggesting that property convergence is accelerated too. This is investigated for the elastic constants, energy-volume curves, phonon spectrum and thermal properties of bulk silicon diamond.
Fig. 2 demonstrates that property convergence for the energy-volume curves, phonon spectrum and thermal properties is rapidly achieved by fitting to a fraction of the original database: fitting to 5% of the original database is sufficient to describe all properties with good accuracy with respect to the DFT reference. This is again confirmed by the elastic constants as predicted by the respective models. The convergence of the phonon spectrum in Fig. 2 is particularly noteworthy, as it requires relative errors on the order of a few percent on small forces ∼ 0.01 eV/Å.

B. AlSi10
This section outlines the general HAL protocol for building training databases for alloys and demonstrates how an AlSi10 linear ACE model is built from scratch in an automated fashion. Using the relative force error estimate f_i previously discussed as a stopping criterion to trigger ab initio evaluations, it will be shown how an ACE model is created for AlSi10 using HAL. The HAL-generated ACE model is able to accurately model the liquid-solid phase transition and predict its melting temperature with excellent accuracy. The ACE models used in this section contained basis functions up to correlation order ν = 2 and polynomial degree 13, as well as an outer cutoff of 5.5 Å. The ACE inner cutoff was set to 1.5 Å during the HAL stage of collecting data and moved towards the closest interatomic distance once all training data had been generated. An auxiliary pair potential V_2, added to aid stability, was also included in the basis, with functions up to polynomial degree 13 and an outer cutoff of 6.0 Å. The weights for the energy w_E, forces w_F and virials w_V were set to .
The HAL procedure for building ACE models for alloys starts by manually creating a set of random crystal structures, from which a random alloy and a liquid alloy training database are built in an iterative fashion. Once sufficient data for both phases has been collected, the HAL solid and liquid databases are combined. The DFT reference calculations used a Pulay density mixing scheme and a finite basis correction.

FIG. 4: Coefficient magnitude |c_i| for the 723 basis functions grouped per correlation order and element interaction for various ARD tolerances α. Large coefficients are assigned to pair interactions, partly captured by the auxiliary pair potential V_2, as most of the binding energy is contained in these interactions.
An adaptive biasing parameter τ_r = 0.05 was chosen (for the explicit definition see the Methods section) and the temperature set to T_solid = 800 K in order to build the random solid alloy database starting from the 10 initial structures previously described. Besides running biased dynamics, HAL performed cell-volume-adjusting (adding Gaussian noise to cell vectors) and element-swapping Monte Carlo (MC) steps during the simulation on the HAL potential energy surface E_HAL. These steps were accepted or rejected according to the Metropolis-Hastings algorithm [63].
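The Metropolis-Hastings acceptance for these MC moves can be sketched as below. This is a generic illustration on a placeholder energy function e_hal (standing in for E_HAL) with a toy element-swap move; kT, rng and all names are assumptions for the sketch.

```python
import numpy as np

def metropolis_accept(e_old, e_new, kT, rng):
    """Accept downhill moves always; uphill moves with probability exp(-dE/kT)."""
    dE = e_new - e_old
    return dE <= 0 or rng.random() < np.exp(-dE / kT)

def swap_step(elements, e_hal, kT, rng):
    """Propose swapping the species of two random sites; accept or reject the trial."""
    i, j = rng.choice(len(elements), size=2, replace=False)
    trial = elements.copy()
    trial[i], trial[j] = trial[j], trial[i]
    if metropolis_accept(e_hal(elements), e_hal(trial), kT, rng):
        return trial
    return elements
```

Cell-volume moves follow the same acceptance rule, with the trial state generated by perturbing the cell vectors with Gaussian noise instead of swapping species.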
During HAL dynamics the softmax-normalised relative force estimate s(f_i) is evaluated and a ground-truth evaluation triggered once the predefined tolerance of s_tol = 0.5 is met. A total of 42 HAL configurations were sampled, at which point the HAL dynamics was stable for 5000 steps. The pressure P, temperature T and f_i are shown in Fig. 3 for four iterations, with the first three included in the training database, i.e. below or equal to iteration 42. The strong oscillations in the pressure P are due to the volume and element-swapping MC steps being accepted.
HAL trajectories were simulated using a Langevin thermostat targeting a temperature of T_liquid = 3000 K and a proportional control barostat targeting a pressure of 0.1 GPa. No volume or swap MC steps were performed.
After generating 46 liquid alloy configurations using HAL, the dynamics was stable for 5000 steps; the resulting errors are shown in Table II. The test set was assembled by continuing HAL iterations for both the random alloy and the liquid. Increasing α lowers the relevance criterion for the linear ACE basis functions, in turn decreasing sparsity. A clear trade-off between sparsity and training error can be seen in Table II, which also includes model evaluation performance and fitting times. Increasing α decreases not only the training error but also the test error, up to α = 300k, for which the test error increases, a sign of overfitting. Due to the relatively small training database, the fitting time remains low, around a minute or less using 8 threads on an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz. Performance testing was done using LAMMPS and the PACE evaluator [64] on the same processor and shows that all models are within 100 µs/atom/core per MD step.

FIG. 5: NS-determined heat capacity for ARD-fitted linear AlSi10 ACE models (left) and schematic phase diagram for AlSi10 [61] (right). Excellent agreement with the melting temperature is demonstrated for fits with large α.
Further analysis of the ARD-fitted models was done by examining the absolute value of the coefficients |c_i|. Basis functions below the predefined threshold are pruned away, as can be seen in Fig. 4. Large coefficients are given to the pair interactions described by the auxiliary basis V_2 and the two-body components of the ACE basis for all models, which is intuitive as most of the binding energy is stored in these pair interactions. Increasing α results in more (less relevant) basis functions being included with relatively smaller coefficients. For α = 300k many of these low-relevance coefficients of around 10^-4 are included in the fit, indicating a degree of overfitting, as confirmed by the test set error increase in Table II.
Next the melting temperature for each of the previously ARD-fitted AlSi10 ACE models is determined. This was done using nested sampling (NS), which approximates the partition function of an atomic system by exploring the potential energy surface over decreasing energy (or enthalpy) levels, in turn determining the cumulative density of states [65]. From the partition function any thermodynamic quantity can be derived, such as the heat capacity. The heat capacity exhibits a signature peak for first-order phase transitions, including the liquid-solid transition occurring at the melting temperature. Extensive previous work has shown that NS is an accurate and reliable method for determining the melting temperature [66,67]. As it explores the entirety of configurational space, including gas, liquid and solid phases, NS also serves as an excellent test of model robustness. This robustness is partly achieved by the auxiliary pair potential V_2 described previously, which is added to ensure close-range repulsion between atoms.
Starting from the gas phase (initial cell volume of 500 Å^3/atom), the walkers explored the potential energy surface by iteratively cloning and decorrelating walkers at decreasing enthalpy levels, passing through the liquid phase and ending up at the ground structure. The decorrelation was done by running MD for 6 timesteps using a 0.1 fs timestep, and a total of 1024 MC steps changing the cell volume, shearing/stretching the cell and swapping atoms (ratio 6:6:6:6) were performed. The NS cell pressure was set to 0.1 GPa and the minimum aspect ratio of the cell set to 0.85. By summing the enthalpies from the NS simulation, the constant-pressure partition function is determined, from which the heat capacity can be derived through postprocessing. Three independent runs for each of the ARD models fitted to the AlSi10 HAL database were performed; the results are shown in Fig. 5. All models predicted the expected FCC ground structure, but a difference in predicted melting temperature for varying α can be seen. Only the α = 300k and α = 80k models accurately reproduce the melting temperature of 867 K as predicted by Thermo-Calc with the TCAL4 database [68].
C. Polyethylene glycol (PEG)

This section presents the application of HAL to build databases for polymers. Polyethylene glycol (PEG) has the formula H-(OCH2CH2)n-OH, where n is the number of monomer units [69]. From a modelling perspective these polymers are challenging to simulate in vacuum as they form configurations ranging from tightly coiled to fully stretched-out structures. Due to the OH end groups, the polymer can also exhibit hydrogen bonding, which further complicates its description. These hydrogen bonds typically correspond to low-energy configurations and are frequently formed and broken during long MD simulations. This section first presents a benchmark of HAL against AL, followed by a demonstration of HAL finding configurations exhibiting large errors. Finally, the potential fitted to small polymer units in vacuum is used to predict the density of a long PEG(n=200) polymer in bulk with excellent accuracy relative to experiment. All DFT reference calculations in this section are carried out with the ORCA code [70] using the ωB97X DFT exchange-correlation functional [71] and the 6-31G(d) basis set.

PEG(n=2): HAL vs AL
In order to test whether HAL accelerates training database assembly relative to standard AL, a benchmark test was performed. An initial database containing 20 PEG(n=2) polymers was created by running a 500 K NVT molecular dynamics simulation using the general-purpose ANI-2x forcefield [11], sampling every 7 ps. This database was fitted using an ACE basis containing basis functions up to correlation order ν=3 and polynomial degree 10, with an outer cutoff of 4.5 Å and an inner cutoff of 0.5 Å. The auxiliary pair potential basis had polynomial degree up to 10, an outer cutoff of 5.5 Å and no inner cutoff. The weights for the energy w_E and forces w_F were set to 15.0 and 1.0 and remain constant throughout this section on PEG. AL (non-biasing, τ = 0.0) and HAL simulations with varying biasing strengths τ_r were performed using a timestep of 0.5 fs at 500 K. Configurations were evaluated using ORCA DFT once s_tol = 0.5 was reached.
The linear ACE models generated during the AL/HAL simulations were saved and subsequently used in a regular MD stability test, run for 1 million MD steps at 500 K using a 1 fs timestep for 100 separate runs. An MD simulation was deemed stable if the CC and CO bonds along the chain remained within 1.0-2.0 Å and the CH and OH bonds within 0.8-2.0 Å throughout the simulation. The minimum number of stable MD timesteps out of the 100 different simulations is shown in Fig. 6 and demonstrates that up to τ_r = 0.20 a total of 80 (H)AL iterations are required in order to achieve a minimum MD stability of 1 million steps. The large biasing strength of τ_r = 0.25 results in unstable MD dynamics, as too strong a bias causes exceedingly high-energy configurations far away from the desired potential energy surface to be included in the training database. Fitting to these configurations leads to a poorly performing model, as many unphysical configurations enter the training database.

FIG. 6: HAL vs AL benchmark comparing MD stability for 1 million MD steps over 100 seeds. Turning on biasing (non-zero τ_r) creates ACE models achieving stable MD (1 million timesteps) faster than standard AL by up to an order of magnitude.
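The bond-window stability criterion used above is easy to state as code. The following sketch checks a single frame against the windows quoted in the text (C-C/C-O: 1.0-2.0 Å, C-H/O-H: 0.8-2.0 Å); the bond list, positions and function names are illustrative, and no periodic boundary handling is included.

```python
import numpy as np

# allowed distance windows in Angstrom, keyed by alphabetically sorted species pair
WINDOWS = {("C", "C"): (1.0, 2.0), ("C", "O"): (1.0, 2.0),
           ("C", "H"): (0.8, 2.0), ("H", "O"): (0.8, 2.0)}

def frame_is_stable(positions, bonds, species):
    """Return True if every bonded pair (i, j) stays inside its allowed window."""
    for i, j in bonds:
        d = np.linalg.norm(positions[i] - positions[j])
        lo, hi = WINDOWS[tuple(sorted((species[i], species[j])))]
        if not (lo <= d <= hi):
            return False
    return True
```

In the benchmark, a trajectory counts as stable only if every frame over the 1 million steps passes a check of this kind.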
The HAL run using a biasing strength of τ_r = 0.20 achieves minimum 1-million-step MD stability after an order of magnitude fewer exploratory MD timesteps compared to standard AL.

PEG(n=4): rare events
Using PEG(n=4) polymers, this section investigates the ability of HAL to generate and detect configurations with large errors. First, a training database was built using the general-purpose ANI-2x forcefield [11] at 500 K and 800 K using a timestep of 1 fs. Configurations were sampled every 7000 timesteps (7 ps) and used to assemble 500 K and 800 K databases. The 500 K database was divided into 750 training configurations and 250 test configurations. The 800 K training and test databases both contained 250 configurations. The linear ACE model was extended to include basis functions up to polynomial degree 12 for both the ACE and pair potential bases, while keeping the cutoffs and correlation order (ν=3) the same as in the previous section on PEG(n=2).
Using the 500 K MD-sampled training database, HAL was started using τ_r = 0.10 and a timestep of 0.5 fs, with the stopping criterion s_tol set to 0.5. A total of 200 HAL configurations were generated, forming a HAL database used for both a train and a test set. Using the previously described basis, three models were fitted: to 500K, to 500K+800K and to 500K+HAL. Energy scatter plots for these three models are shown in Fig. 7, demonstrating that the errors on the HAL-found configurations are large for both the 500K and 500K+800K fits, despite the fact that these HAL-found configurations are also low in energy! Only by including the HAL configurations in the training database can the errors on these configurations be reduced, as shown in Table III. Inspection of the HAL-generated structures exposes a shared characteristic: most of them contain (double) hydrogen bonding across the polymer, an example of which is shown in Fig. 7. Such hydrogen-bond formation is a rare event in this system, because only the two ends of the molecule are capable of hydrogen bonding. It is difficult to find these configurations using regular MD (even at elevated temperatures), whereas HAL finds them easily.

PEG(n=200): bulk density
As a final investigation, the density of a PEG(n=200) polymer containing 1400 atoms is determined using an ACE model fitted to a HAL-generated PEG training database containing polymer sizes ranging from n=2 to n=32 monomer units. This database contained configurations from the previous PEG sections, extended with configurations of size n=8, n=16 and n=32. The training database included standard ANI MD-sampled configurations at 500 K comprising 1000 PEG(n=4) configurations (from the previous section), as well as 50 PEG(n=2), 100 PEG(n=8), 100 PEG(n=16) and 18 PEG(n=32) configurations. Starting from this data, HAL was used to generate an extra 64 PEG(n=16) and 91 PEG(n=32) configurations until the dynamics was deemed stable. The linear ACE basis used for the regression task was identical to that of the previous section on PEG(n=4), and any force components with magnitude greater than 20 eV/Å were excluded from the fit.
Using the ACE model, a PEG(n=200) polymer was simulated in LAMMPS [73] with the PACE evaluator pair style and periodic boundary conditions. Since the training database only contained small polymer segments in vacuum, this periodic simulation demonstrates a large degree of extrapolation to configurations far away from the training database. Furthermore, the DFT code used to evaluate the training energies and forces does not support periodic boundary conditions, making a DFT simulation of the 1400-atom PEG(n=200) box not just computationally infeasible but practically impossible in this case.
The resulting linear ACE model was timed at 220 µs/atom/core per MD step. LAMMPS NPT simulations were performed at 1 bar using a 1 fs timestep at 300 K, 400 K, 500 K and 600 K. The recorded density as a function of simulation time is plotted in Fig. 8. Using the last 500 ps of the 300 K simulation, the density was determined to be 1.238 g/cm^3, around 3% higher than the experimental value of 1.2 g/cm^3 [72].

A. Hyperactive Learning (HAL)
The HAL potential energy E_HAL as defined in Eq. (1) biases MD simulations during the exploration step in AL towards uncertainty by shifting the potential energy surface and assigning lower energies to configurations with high uncertainty. We have shown that when σ is the standard deviation of the posterior-predictive uncertainty of the energy, it can be cheaply approximated by the Monte-Carlo estimate σ̂; see (5). Likewise, the derivative of σ̂ can be computed as

∇σ̂ = (1 / (K σ̂)) Σ_{k=1}^K (E_k − Ē)(∇E_k − ∇Ē),

where E_k = c_k • B and Ē = c̄ • B, with the parameters c_k sampled from the posterior distribution as specified in (22). The sum over K is over the ensemble or committee of models, which in this work were linear (ACE) models. Other architectures, such as neural network ensembles, may be considered in future work. This quantity is in essence a computationally cheap means of determining the gradient towards (total) energy uncertainty and may be interpreted as a conservative biasing force,

F^σ_i = τ ∇_i σ̂.

HAL dynamics adds this biasing force to MD in order to accelerate the generation of configurations with high uncertainty, which sets HAL apart from AL. Setting τ = 0 recovers standard MD, and in this sense HAL generalises AL. Interestingly, previous work employed a biasing force using a neural network interatomic potential [74], but biased away from uncertainty in order to stabilise the MD dynamics.
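For a linear committee the gradient of σ̂ follows directly from the chain rule applied to the committee variance, and can be validated against a finite difference. The sketch below uses a toy one-dimensional "configuration" x with a two-function basis; the basis, its Jacobian and the committee parameters are all placeholders, not ACE quantities.

```python
import numpy as np

def sigma_and_grad(B, dB, c_samples, c_mean):
    """Committee sigma and its gradient for linear models E_k = c_k . B(x).

    sigma^2 = mean_k (E_k - E_bar)^2
    grad sigma = (1 / (K * sigma)) * sum_k (E_k - E_bar) * (grad E_k - grad E_bar)

    B: basis values, shape (n_basis,); dB: Jacobian dB/dx, shape (n_basis, n_dim).
    """
    dE = c_samples @ B - c_mean @ B          # (K,) energy deviations E_k - E_bar
    sigma = np.sqrt(np.mean(dE ** 2))
    gdev = (c_samples - c_mean) @ dB         # (K, n_dim) gradient deviations
    grad = (dE @ gdev) / (len(dE) * sigma)
    return sigma, grad

c_samples = np.array([[1.0, 0.5], [2.0, -0.5], [0.5, 1.0]])   # toy 3-member committee
c_mean = c_samples.mean(axis=0)
x = 0.7
B, dB = np.array([x, x * x]), np.array([[1.0], [2 * x]])      # toy basis and Jacobian
sigma, grad = sigma_and_grad(B, dB, c_samples, c_mean)
```

The biasing force is then τ times this gradient; because σ̂ is a function of positions only, the resulting bias is conservative.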
The biasing strength $\tau$ can either be set as a constant or adapted during the HAL simulation. Controlling it is important: too strong a bias quickly leads to unphysical configurations, whereas too weak a bias generates valuable configurations only slowly. The adaptive scheme works by first setting a relative biasing strength $\tau_r$ and performing a burn-in period recording the magnitudes (norms) of $F^{\sigma}$ and $F$, typically over the latest 100 timesteps $\delta t$. The biasing strength $\tau$ is then determined by satisfying

$$\tau = \tau_r\,\frac{\langle \|F_i\| \rangle}{\langle \|F^{\sigma}_i\| \rangle},$$

where $\langle\cdot\rangle$ denotes the average over the burn-in window. The parameter $\tau_r$ is generally set between 0.05 and 0.25 and may be interpreted as the approximate relative average strength of the biasing force compared to the average force of the fitted model. This adaptive scheme aids usability and tunes the biasing strength so that HAL gently drives MD towards high uncertainty; larger $\tau_r$ increases the biasing strength and the rate at which high-uncertainty configurations are generated. In order to sample configurations at desired pressures and temperatures, a proportional-control barostat and a Langevin thermostat were added.

FIG. 8: HAL protocol for building a linear ACE PEG model, determining the PEG(n=200) density within the experimental accuracy of 1.2 g/cm³ at 297 K (shaded area) [72]. The training database only included small polymers ranging from n=2 to n=32 in isolation.
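The adaptive scheme amounts to a running ratio of force norms over a trailing window. A minimal sketch follows; the class name, the per-atom averaging and the window handling are assumptions made for illustration.

```python
import numpy as np
from collections import deque

class AdaptiveBias:
    """Adaptive biasing strength tau = tau_r * <|F|> / <|F_sigma|>,
    averaged over a trailing burn-in window (~100 steps in the text)."""

    def __init__(self, tau_r=0.1, window=100):
        self.tau_r = tau_r
        self.F_norms = deque(maxlen=window)    # model-force magnitudes
        self.Fs_norms = deque(maxlen=window)   # biasing-force magnitudes

    def update(self, F, F_sigma):
        # record mean per-atom force magnitudes for this timestep
        self.F_norms.append(np.linalg.norm(F, axis=1).mean())
        self.Fs_norms.append(np.linalg.norm(F_sigma, axis=1).mean())

    @property
    def tau(self):
        denom = np.mean(self.Fs_norms) if self.Fs_norms else 0.0
        return 0.0 if denom == 0.0 else self.tau_r * np.mean(self.F_norms) / denom
```

Because the window is bounded, the estimate automatically forgets the burn-in once the simulation has run long enough, keeping $\tau$ responsive to the current dynamics.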

B. Atomic Cluster Expansion (ACE)
The ACE model decomposes the total energy $E$ of a configuration $R$ into a sum of parameterised atomic energies,

$$E(R) = \sum_i E_i.$$

The atomic energies $E_i$ are then parameterised by a linear model, $E_i = c \cdot B_i$, where $B_i$ denotes the ACE basis. The present work employs a particularly simple variant, which we review briefly. Given relative atomic positions $r_{ji} = r_j - r_i$ and associated chemical elements, one evaluates a one-particle basis $\phi_v(r_{ji})$ followed by a pooling operation,

$$A_{iv} = \sum_{j} \phi_v(r_{ji}),$$

resulting in features $A_{iv}$ that are denoted the atomic basis in the context of the ACE model. Taking a $\nu$-order (tensor) product results in many-body correlation functions incorporating $(\nu+1)$ body-order interactions,

$$\boldsymbol{A}_{i\boldsymbol{v}} = \prod_{t=1}^{\nu} A_{i v_t}, \qquad \boldsymbol{v} = (v_1, \dots, v_\nu).$$

The $\boldsymbol{A}$-basis is a complete basis of permutation-invariant functions but does not incorporate rotation or reflection symmetry. An isometry-invariant basis $B$ is constructed by averaging over rotations and reflections. Representation theory of the orthogonal group O(3) shows that this can be expressed as a sparse linear operation, resulting in

$$B = \mathcal{C}\,\boldsymbol{A},$$

where $\mathcal{C}$ contains generalised Clebsch-Gordan coefficients; we refer to [24,25] for further details.
A major benefit of the linear ACE model is that the computational cost of evaluating a site energy E i scales only linearly with the number of neighbouring atoms, as well as with the body order ν + 1.
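To make the pooling-then-product structure concrete, here is a toy scalar version of the atomic-basis evaluation. It is a deliberate simplification: real ACE one-particle functions carry radial, angular and species indices and the product basis is symmetrised via Clebsch-Gordan coefficients, none of which is modelled here.

```python
import numpy as np

def site_energy(r_neighbors, radial_funcs, products, coeffs):
    """Toy linear ACE-like site energy with scalar radial features only.

    r_neighbors : (J,) neighbour distances of site i
    radial_funcs: list of callables phi_v(r), the one-particle basis
    products    : list of index tuples; each tuple (v_1,...,v_nu) defines one
                  nu-order correlation A_{v_1} * ... * A_{v_nu}
    coeffs      : one linear coefficient per product (the vector c)
    """
    # pooling: atomic basis A_v = sum_j phi_v(r_ij); cost is linear in J
    A = np.array([phi(r_neighbors).sum() for phi in radial_funcs])
    # nu-order products give (nu+1) body-order correlations
    B = np.array([np.prod(A[list(idx)]) for idx in products])
    return coeffs @ B   # E_i = c . B_i
```

Note that the neighbour sum is evaluated once per one-particle function, so the cost of a site energy grows linearly with the number of neighbours, as stated above.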

C. (Bayesian) Linear Regression
The parameters of linear ACE models are fitted by solving a linear regression problem. The associated squared loss function $L(c)$, minimised over configurations $R$ in the training set $\mathcal{R}$ with corresponding (DFT) observations, is

$$L(c) = \sum_{R \in \mathcal{R}} \Big[ w_E \,\big|E(c;R) - E_R\big|^2 + w_F \sum_i \big|F_i(c;R) - F_{i,R}\big|^2 \Big],$$

where $w_E$ and $w_F$ are weights specifying the relative importance of the DFT observations. When fitting materials a third term, $w_V \,|V(c;R) - V_R|^2$, is added for the virial stress components of the configuration $R$. This minimisation problem can be recast in the generic form

$$\arg\min_c \; \|\Psi c - y\|^2,$$

where $\Psi$ is the design matrix and $y$ collects the observations to which the parameters are fitted.
Here, we also add a Tychonov regularisation term $\eta \|c\|^2$ with regularisation parameter $\eta > 0$, which is commonly determined through a model selection criterion such as cross-validation. This linear regression model can be cast in a Bayesian framework by specifying a prior distribution $p(c)$ over the regression parameters; an (additive) probabilistic noise model then gives rise to the likelihood function

$$p(y \,|\, c, \lambda) = \mathcal{N}\big(y \,|\, \Psi c, \lambda^{-1} I\big),$$

with noise precision $\lambda$. By restricting ourselves to this Gaussian error model, and assuming the prior to be Gaussian as well, i.e. $p(c) = \mathcal{N}(c \,|\, 0, \Sigma_0)$, it is ensured that the posterior distribution $\pi(c) = p(c \,|\, \mathcal{R}, y, \lambda)$ is Gaussian, with closed-form expressions for both the mean $\bar c$ and covariance $\Sigma$:

$$\Sigma = \big(\lambda\, \Psi^\top \Psi + \Sigma_0^{-1}\big)^{-1}, \qquad \bar c = \lambda\, \Sigma\, \Psi^\top y. \qquad (22)$$

In the context of this work, having closed-form expressions for both quantities is desirable as it (i) allows for conceptually easy and fast generation of independent samples $\{c_k\}_{k=1}^{K}$ from the posterior distribution, and (ii) allows the fitted model to be parametrised with the exact posterior mean $\bar c$.
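The closed-form posterior can be sketched directly. This is generic Bayesian linear regression, not the paper's code; the function names are illustrative.

```python
import numpy as np

def bayesian_posterior(Psi, y, lam, Sigma0_inv):
    """Closed-form Gaussian posterior for a linear model with Gaussian noise
    of precision lam and Gaussian prior with inverse covariance Sigma0_inv."""
    Sigma = np.linalg.inv(lam * Psi.T @ Psi + Sigma0_inv)   # posterior covariance
    c_bar = lam * Sigma @ Psi.T @ y                         # posterior mean
    return c_bar, Sigma

def sample_committee(c_bar, Sigma, K, seed=None):
    """Draw K independent coefficient vectors from the Gaussian posterior."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(c_bar, Sigma, size=K)
```

For an isotropic prior with precision $\alpha$, the posterior mean coincides with the ridge-regression solution with penalty $\eta = \alpha/\lambda$, which connects this Bayesian view back to the Tychonov-regularised least-squares problem.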
In what follows we briefly describe two Bayesian regression techniques, Bayesian Ridge Regression (BRR), which we use to produce Bayesian fits during the HAL data generation phase, and the Automatic Relevance Determination (ARD), which we typically use to obtain a final model fit after the data generation is complete.

D. Bayesian Ridge Regression (BRR)
In Bayesian Ridge Regression the covariance of the prior is assumed to be isotropic, i.e.

$$p(c \,|\, \alpha) = \mathcal{N}\big(c \,|\, 0, \alpha^{-1} I\big), \qquad (23)$$

for some hyper-parameter $\alpha > 0$, the precision of the prior distribution. Under this choice of prior, the logarithm of the posterior distribution takes the form

$$\ln \pi(c) = -\frac{\lambda}{2}\,\|\Psi c - y\|^2 - \frac{\alpha}{2}\,\|c\|^2 + C,$$

where $C$ is a constant. Thus, maximising the (log-)posterior for this choice of prior is equivalent to solving the regularised least-squares problem of Eq. (24) with ridge penalty $\eta = \alpha/\lambda$. This shows that the prior naturally gives rise to a regularised solution, keeping the coefficients small. The hyper-parameters $\alpha$ and $\lambda$ are determined in BRR by maximising the marginal log likelihood, also known as evidence maximisation [75]. One first defines the evidence function

$$p(y \,|\, \alpha, \lambda) = \int p(y \,|\, c, \lambda)\, p(c \,|\, \alpha)\, dc, \qquad (25)$$

which marginalises out the coefficients $c$ and describes the likelihood of observing the data given the hyper-parameters $\alpha$ and $\lambda$. Completing the square in the exponent and taking the logarithm yields the marginal log likelihood $\ln p(y \,|\, \alpha, \lambda)$, which can be maximised with respect to $\alpha$ and $\lambda$ to obtain the statistically most probable solution given the basis and data.
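BRR with evidence maximisation is available off the shelf, e.g. in scikit-learn's `BayesianRidge`. A sketch with a synthetic design matrix follows; note that scikit-learn's naming is swapped relative to the text: its `alpha_` is the noise precision ($\lambda$ here) and its `lambda_` the prior precision ($\alpha$ here).

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# synthetic design matrix Psi and targets y, standing in for ACE basis/DFT data
rng = np.random.default_rng(0)
Psi = rng.normal(size=(200, 10))
c_true = rng.normal(size=10)
y = Psi @ c_true + 0.01 * rng.normal(size=200)

# evidence maximisation over (alpha, lambda) happens inside fit()
brr = BayesianRidge(fit_intercept=False).fit(Psi, y)
c_bar = brr.coef_                      # posterior mean of the coefficients
Sigma = brr.sigma_                     # posterior covariance
alpha, lam = brr.lambda_, brr.alpha_   # mapped back to the text's notation
```

The returned mean and covariance are exactly the quantities needed in Eq. (22) for drawing the HAL committee.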

E. Automatic Relevance Determination (ARD)
Automatic Relevance Determination (ARD) modifies BRR by relaxing the isotropy of the prior, assigning an individual hyper-parameter $\alpha_i$ to regularise each coefficient $c_i$. The corresponding prior is given by

$$p(c \,|\, \boldsymbol{\alpha}) = \mathcal{N}\big(c \,|\, 0, A^{-1}\big), \qquad A = \mathrm{diag}(\alpha_1, \dots, \alpha_{N_{\rm basis}}), \qquad (28)$$

where $N_{\rm basis}$ is the dimensionality of $c$. This prior determines the relevance of each parameter $c_i$, or basis function, effectively performing feature selection: basis functions are ranked by relevance and pruned if deemed irrelevant, producing a sparse solution. In practice, sparse models obtained through ARD often generalise better than BRR. Using ARD requires the specification of a threshold parameter $\alpha$ setting the minimum relevance for a basis function to be included in the fit; adjusting this parameter controls the balance between accuracy and sparsity of the model.
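ARD is likewise available in scikit-learn as `ARDRegression`, whose `threshold_lambda` parameter plays the role of the relevance threshold described above (coefficients whose precision $\alpha_i$ grows past it are pruned to zero). The data below are synthetic, chosen so that only three features are relevant.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(0)
Psi = rng.normal(size=(200, 20))
c_true = np.zeros(20)
c_true[:3] = [1.0, -2.0, 0.5]          # only the first 3 basis functions matter
y = Psi @ c_true + 0.01 * rng.normal(size=200)

# basis functions with alpha_i > threshold_lambda are deemed irrelevant and pruned
ard = ARDRegression(fit_intercept=False, threshold_lambda=1e4).fit(Psi, y)
n_kept = int(np.sum(ard.coef_ != 0))   # size of the resulting sparse basis
```

Raising `threshold_lambda` keeps more basis functions (denser, potentially more accurate model); lowering it prunes more aggressively, trading accuracy for sparsity, mirroring the accuracy/sparsity trade-off in the text.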

F. Posterior Predictive Distribution
A key property of the Bayesian approach is that it provides a way to quantify uncertainty in terms of the posterior-predictive distribution, which accounts both for parameter uncertainty, as given by the posterior distribution, and for uncertainty due to observation error.
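For the Gaussian model above the posterior predictive has a closed form; the following is the standard result written in the notation of the text (with $\psi^*$ the basis vector of a new configuration):

```latex
p(y^* \mid \psi^*, y)
  = \int p(y^* \mid c, \lambda)\, \pi(c)\, dc
  = \mathcal{N}\!\left(y^* \,\middle|\, \psi^{*\top} \bar{c},\;
      \lambda^{-1} + \psi^{*\top} \Sigma\, \psi^* \right).
```

The first variance term, $\lambda^{-1}$, is the observation noise; the second, $\psi^{*\top}\Sigma\,\psi^*$, is the parameter uncertainty propagated through the linear model, and it is this second contribution that the HAL committee estimate $\bar\sigma$ approximates by Monte-Carlo sampling.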

FIG. 1: a) Maximum relative force error estimate max f_i versus error correlation plots for silicon diamond containing 4 and 10 training configurations. b) Learning-rate comparison between filtering and random selection for silicon diamond.

FIG. 3: HAL dynamics over several iterations for the AlSi10 random alloy, showing the softmax-normalised relative force error estimate s(f), temperature and pressure. DFT calculations are triggered when the tolerance (red) is reached. Pressure fluctuations are due to swap/volume MC steps on the HAL potential energy surface E_HAL.
FIG. 2: Property convergence for the energy-volume curve (top), thermal properties (middle) and phonon spectrum (bottom) for filtered silicon diamond ACE models.

Elastic moduli (GPa, Table I):
ACE 3% envs: 98.2  188.1  53.3  79.7
ACE 4% envs: 84.2  159.8  46.4  75.7
ACE 5% envs: 82.5  148.7  49.3  73.7
DFT:         82.6  147.2  50.3  73.1

TABLE I: Convergence of the elastic moduli (GPa) of the filtered ACE models relative to the CASTEP DFT reference.

TABLE II: Train/test error splits for the HAL-generated AlSi10 database for varying ARD tolerance α. A larger ARD tolerance α includes more basis functions, which increases performance but also fitting time.

TABLE III: Train (†) and test (*) errors for energies (E) in meV and forces (F) in meV/Å for the 500K, 500K+800K and 500K+HAL databases using ACE.