Introduction

The most direct computational strategy to monitor the atomic-level time-evolution of chemical and physical processes is by the molecular dynamics (MD) method. Starting with an initial atomic configuration and atomic velocities, MD simulations require the atomic forces as input to propagate the atoms and their velocities to the next time-step (at which point, the atomic forces are re-evaluated); the cycle continues, thus allowing for an iterative time-evolution of the system. The atomic forces at each time-step may be obtained either using quantum mechanics-based methods, such as density functional theory (DFT), or parameterized classical force fields.

A variety of DFT-based methods have already contributed to the rational design of application-specific materials.1,2,3 Nevertheless, DFT-based MD simulations are not practical and routine at the present time, especially to track chemical processes with long time scales (1 nanosecond) and large length scales (10 nanometers). The repetitive and expensive DFT force computations during MD and the necessary small MD time steps (of the order of femtoseconds), lead to the primary bottlenecks of DFT-based MD simulations. Parameterized classical force fields (which are 6–10 orders of magnitude faster than DFT) may be used to access truly long time scales and large length scales. But, these approaches are not satisfactory either, as such force fields lack accuracy and versatility, i.e., they are not transferable to situations that were not originally used in the parameterization.

Owing to this scenario, monitoring the atomistic details underlying major classes of physical and chemical processes with high fidelity is still beyond the reaches of modern computational methods. Here, we attempt to take a step in remedying this situation via a universal data-driven atomic force prediction recipe that is as fast as classical force fields but as accurate and versatile as quantum mechanics-based methods.4,5,6,7,8,9 Simply put, our hypothesis is that the force experienced by an atom is purely a function of the arrangement of the other atoms around it-a notion inspired, and originally suggested, by Feynman within the context of quantum mechanics, which has lead to the celebrated Hellmann–Feynman theorem.10,11 If we are able to numerically represent this atomic arrangement, and if we have sufficient “atomic arrangement vs. force” examples, we should be able to train a machine to learn the arrangement-force relationship, and make future atomic force predictions based purely on atomic arrangement information.

Machine learning (ML) methods using neural networks, Gaussian processes, and other algorithms have been successful in the development of ML potentials for MD simulations.12,13,14,15,16,17,18,19,20,21,22,23,24 The approach of the present contribution, namely, learning to predict atomic forces directly (from which the total potential energy of the entire system can be determined through appropriate integration), as previously introduced in ref. 25, is far more powerful. This is because the atomic force is a local quantity that can be formally defined for atoms, whereas the potential energy can be formally defined only for the entire system or unit cell. Other force fields that directly predict the potential energy define this quantity as a sum of atomic energies, for which a formal basis does not exist within quantum mechanical treatments.

Figure 1 portrays the essential steps involved in the creation of a ML force field, which are: (1) Generation of reference data, e.g., using DFT; (2) Fingerprinting the atomic environments, so that at the end of this step, the data has been reduced to a set of numerical atomic fingerprints and the corresponding atomic forces; (3) Creation of compact “training sets” from this data through clustering and sampling; and (4) Learning from these datasets, such that at the end of this step, a robust mapping is established connecting fingerprints and forces. A rudimentary version of this procedure has been recently demonstrated, leading to a class of force fields we refer to as AGNI force fields. For the case of Al, several static and dynamic applications, including stress-strain behavior, melting, proper description of dislocation core regions, point defect diffusion in bulk, adatom organization (island formation) at surfaces, etc., were also explored with the AGNI force field.4,5,7,8

Fig. 1
figure 1

Strategy for the creation of machine learning force fields

In the present contribution, we establish the true power and universality of this force field by applying it to a set of elemental solids, displaying a diversity of chemical bonding, including Al, Cu, Ti, W, Si, and C, encompassing metals and insulators occurring in a variety of different crystal structures. We also show that predictions of atomic forces and total potential energies (obtained through appropriate integration of the forces along an MD trajectory) reach unprecedented chemical accuracy levels for all cases considered, provided fundamental improvements to each of the four steps of Fig. 1 are made. For instance, the reference data should have sufficiently low intrinsic errors, and the fingerprinting scheme should have sufficient “resolving power” to represent the atomic configurations with high fidelity, carrying as much information about the atomic configuration as possible. Moreover, an important innovation of the present development is the clustering of the parent dataset so that different atomic environments are separated into different clusters, and each of them is sampled separately so that compact, non-redundant, and diverse “training sets” can be collected. The ML force field thus created can be improved progressively by including new configurations (i.e., new clusters), as required.

The proposed class of enhanced AGNI force fields can potentially (1) reach an arbitrary level of accuracy approaching that of the reference data, (2) be about 6–8 orders of magnitude faster than DFT calculations, (3) be systematically improved in quality, transferability and versatility by progressively adding new training set configurations, and (4) be generalized to any element (or combination of elements) of the periodic table for which reference (one-time) quantum mechanics-based calculations can be done. Below, we describe the key steps of this strategy and the results in detail.

Results

Strategy

All four steps of the strategy, shown in Fig. 1, are crucial for the fidelity of enhanced AGNI force fields. In what follows, the critical aspects of these steps are discussed while the technical details can be found in section Methods.

Reference data preparation

The reference data used for creating AGNI force fields must be prepared accurately and consistently, ensuring sufficiently low intrinsic errors.26,27,28,29,30,31,32,33 Except for the aforementioned computational cost, first-principles-based methods for computing atomic forces are considered to meet these requirements. Our reference data was prepared using a two-step procedure. First, atomic configurations and forces were obtained from several DFT-based MD simulations performed for bulk Al, Cu, Ti, W, Si, and C at a sufficient level of accuracy. Then, by rotating the collected atomic configurations and forces, new data was accumulated, making the reference data truly large in volume, and providing access to a lot more force components than in the original dataset.

Fingerprinting

A fingerprint is needed to numerically represent atomic configurations, especially for force field development.5,6,7,8,13,15,19,20 Typically, a fingerprint must be invariant with respect to translations and rotations of the whole system, and to permutations of like atoms. To be useful for a ML force field, it must also be directionally resolved and continuous, i.e., proportionately change with small changes of the atomic arrangement. Our proposal is a d-dimensional vector V i,α , representing the atomic environment of atom i viewed along the Cartesian α direction. For elemental materials, the k th component (k {1, d}) of this vector is defined as5,6,13

$${V_{i,\alpha ;k}} = \mathop {\sum}\limits_{j\not = i} \frac{{r_{ij}^\alpha }}{{{r_{ij}}}}\frac{1}{{\sqrt {2\pi } w}}{\rm{exp}}\left[ { - \frac{1}{2}{{\left( {\frac{{{r_{ij}} - {a_k}}}{w}} \right)}^2}} \right]{f_{\rm{c}}}\left( {{r_{ij}}} \right).$$
(1)

Here, r i and r j are the positions of atoms i and j, \({r_{ij}} = \left| {{{\bf{r}}_j} - {{\bf{r}}_i}} \right|\), \(r_{ij}^\alpha\) is the projection of r j  − r i onto the α direction, and the sum runs over the neighbor list {j} of atom i. In our strategy, V i,α , schematically illustrated in Fig. 2, is used to learn (and to predict) the Cartesian α component of the atomic force exerted on atom i. The definition (1) can readily be generalized to multi-element materials.

Fig. 2
figure 2

The k-dimensional fingerprint vector, V i,α , of reference atom i, (determined by the positions r ij of other atoms j with respect to atom i), to be mapped to its α Cartesian force component. As shown in a, r ij makes a θ angle with the α direction. In b, the first term of Eq. (1), i.e., \(r_{ij}^\alpha {\rm{/}}{r_{ij}} = {\rm{cos}}\,{\boldsymbol{\theta}}\), is represented. The distance \({r_{ij}} \equiv \left| {{{\bf{r}}_{ij}}} \right|\) is projected onto a series of Gaussian functions, yielding G(a k , w; r ij ), as depicted in c. The damping factor f c (r ij ) is represented in d and a schematic illustration of V i,α is given in e

By construction, V i,α satisfies the translation and permutation invariants. Although this fingerprint is directionally specific, the reference dataset is symmetrized by rotations, as described in the previous section of “Reference data preparation”, thus the force field, i.e., the ML model established in the training set, can handle any orientation of the simulated system. Moreover, three factors included in each term of (1) are designed to handle those directly related to the ML force fields. First, \(r_{ij}^\alpha {\rm{/}}{r_{ij}}\) characterizes the contribution of atom j to the α component of the force on atom i. Next, the Gaussian factor \(G\left( {{a_k},w;r} \right) \equiv \frac{1}{{\sqrt {2\pi } w}}{\rm{exp}}\left[ {{{\left( {r - {a_k}} \right)}^2}{\rm{/}}\left( {2{w^2}} \right)} \right]\) captures, in a continuous manner, the possibility of atom j falling within the kth spherical shell centered at atom i. Finally, \({f_{\rm{c}}}\left( {{r_{ij}}} \right) \equiv \frac{1}{2}\left[ {{\rm{cos}}\left( {\frac{{\pi {r_{ij}}}}{{{R_c}}}} \right) + 1} \right]\) gradually diminishes contributions from distant atoms, and setting them to zero when r ij  > R c, a cutoff radius. The last two terms, namely G(a k , w; r) and f c(r ij ), have been used for radial symmetry function, a similar fingerprint introduced to construct a neural-network potential.13

Starting from the same intuitive idea, a related fingerprint was recently defined,5,6 leading to an early version of our AGNI ML force fields.5,6,7,8,9 Instead of G(a k , w; r), \(G\left( {0,{\eta _k};r} \right) \equiv {\rm{exp}}\left( {{r^2}{\rm{/}}\eta _k^2} \right)\) was used in the previo′us work to handle atoms j residing within the kth sphere centered at atom i. Consequently, the earlier fingerprint, denoted by V \(_{i,a}^\prime\), was defined as5,6

$$V_{i,\alpha ;k}^\prime = \mathop {\sum}\limits_{j\not = i} \frac{{r_{ij}^\alpha }}{{{r_{ij}}}}{\rm{exp}}\left( {\frac{{r_{ij}^2}}{{\eta _k^2}}} \right){f_{\rm{c}}}\left( {{r_{ij}}} \right).$$
(2)

Two principal parameters of V i,α (the new fingerprint defined in (1)) are the dimensionality d and the width w of the Gaussian factors (although the G(a k , w; r)s are not required to have the same w, we use this simplification). The particular distribution of \(\left\{ {{a_k}} \right\}_{k = 1}^d\) is not important, especially for large d, given that the shells occupied by G(a k ,w;r)s cover the sphere of radius R c from atom i. Similarly, d is the main parameter of V \(_{i,\alpha }^\prime\) while the manner by which \(\{ {\eta _k}\} _{k = 1}^d\) is distributed is of minor importance when d is large. The optimal parameters of V i,α and V \(_{i,\alpha }^\prime\) can be determined using the well-defined procedure described below. The fundamental difference between the two fingerprint schemes is explained by using a principal component analysis discussed in detail in the Supporting Information.

Sampling and clustering

An AGNI force field is a ML model established on a training dataset selected from the reference data. The training data should be compact and non-redundant, allowing the development and applications of the force fields to be time-efficient. However, the main assumption in training data selection is that the diversity of the reference data could be properly captured. In the context of generic ML, this work is non-trivial, especially when the original dataset is big and diverse.34,35 Within the most common approach, the training data is selected randomly from the reference data.5,6,7,8,9,27,28,29 Such training data is likely dominated by the configurations from the highly-populated domains while other domains are under-represented. This scenario is visualized in Fig. 3a, where the training data randomly selected from the reference data contains essentially no configuration with large-amplitude forces.

Fig. 3
figure 3

An illustration of three methods for selecting a training set, including random a, force amplitude sampling b, and fingerprint space clustering c. In this Figure, the reference dataset is projected onto a 2-dimensional manifold spanned by the first two principal components (Component 1 and Component 2), identified by a principal component analysis. Different colors are used in a, b for representing force magnitude while in c, they are used to label different data clusters

We propose several training data selection methods for better capturing the diversity of the reference data we prepared. In the force-binning method, we arrange the reference data into a number of force amplitude intervals and select training data from all the intervals, properly capturing the force amplitude profile of the reference data. In the clustering approach, the reference data is split into a given number of clusters in fingerprint space, from each of which a part of the training data is selected. Figure 3b, c visualize the improved diversity of the training set prepared by these two methods.

In fact, the clustering technique can potentially lead to a far more powerful concept in creating ML force fields. The current idea used for creating baseline AGNI ML force fields5,6,7,8,9 and predicting material properties27,28,29,30,31 is to use the predictive model (mapping) created on a single training set for the whole domain occupied by the reference data. The error of such a baseline ML force field grows unavoidably when the reference data becomes highly diverse, regardless of how big the training set size N t is. Our approach, as sketched in Fig. 1, is that for each data cluster (domain), a separate training set is selected and then, a fingerprint-force mapping is established. By assembling these mappings, we obtain a ML force field, which is domain-specific in the sense that the atomic force of an atomic configuration is evaluated using the mapping created for the closest domain. Because any domain is significantly less diverse than the reference data, the domain-specific force fields can readily surpass the aforementioned intrinsic limit of the baseline ML force fields. A caveat though is that care must be taken to ensure that discontinuities are not introduced in the predicted atomic forces, say, during the course of an MD simulation, if the atomic environment shifts from one cluster to another. This aspect has not been explored in the present work. In the discussion below, all of the approaches will be critically examined.

Machine learning

Given a set of training data, a learning algorithm is needed to establish the fingerprint-force mapping. Here, we use kernel ridge regression (KRR),36,37 a powerful method that has widely been used in materials informatics.5,6,7,8,9,27,28,29,38 KRR predicts the atomic force F i corresponding to the configuration i as

$${F_i} = \mathop {\sum}\limits_{j = 1}^{{N_{\rm{t}}}} {\alpha _j}\,{\rm{exp}}\left[ { - \frac{1}{2}{{\left( {\frac{{{d_{ij}}}}{\sigma }} \right)}^2}} \right].$$
(3)

Here, the sum runs over N t configurations, indexed by j, of the training set while d ij is the “distance” between configurations i and j, chosen to be the Euclidean norm, in fingerprint space. The “length” scale in this space is specified by σ. During the training phase, the regression weights α j and σ are evaluated by optimizing a regularized objective function via 5-fold cross-validation.

Critical assessment of force fields

Fingerprint optimization

We consider five classes of ML models categorized in terms of fingerprint type and sampling method. These include using (I) V \(_{i,\alpha }^\prime\) with random sampling, (II) V i,α with random sampling, (III) V i,α with fingerprint space clustering, (IV) V i,α with force binning sampling, and (V) V i,α with the domain-specific learning approach. In the last class, each force field includes five fingerprint-force mappings established on five data clusters. When a new atomic configuration is encountered, the cluster whose centroid is closest to this point in fingerprint space is identified, and the ML model established for this cluster is used to evaluate the force desired. Since (I) and (II) differ only by the fingerprint choice, we use these two classes to critically compare fingerprints V i,α and V \(_{i,\alpha }^\prime\), and also to determine the optimal choices of the fingerprint parameters.

V i,α and V \(_{i,\alpha }^\prime\) were optimized by determining the primary parameters that minimize the error of the ML force field predictions. Although d can be arbitrarily large, we examined d = 4, 8, 16, 32, or 48. For V i,α , w was chosen to be 0.05, 0.10, 0.20, 0.30, 0.40, 0.50, or 0.60 Å while a k were uniformly distributed between a half of the nearest-neighbor distance of the material considered to R c , chosen to be 8 Å following refs. 5,6,7,8. In the case of V \(_{i,\alpha }^\prime\), η k were uniformly distributed on the log scale between a half of the nearest-neighbor distance to η max, ranging up to R c . A training set of N t  = 1,000 atomic configurations was prepared using the random sampling method, leaving the complement as the test set. Baseline ML force fields were then created, utilizing V i,α and V \(_{i,\alpha }^\prime\). The root-mean-square error δ RMS of the force predicted on the test set was used to identify the optimal parameters. Details of this procedure can be found in Fig. S1 and related discussions.

We now summarize our findings for these baseline ML force fields. First, the general trend is that the higher the dimensionality d, the better the fingerprint, i.e., d = 48 is optimal for both V i,α and V \(_{i,\alpha }^\prime\). Second, the optimized V i,α is superior to the optimized V \(_{i,\alpha }^\prime\) in terms of δ RMS. Finally, for a given d, the optimal w is of the order of a k+1 − a k , the spacing between the centers of two adjacent Gaussian functions. Such a value allows for some overlap between these radial basis functions, maximizing the sensitivity of V i,α with respect to changes of the atomic environment (for more information, see Fig. S2 and related discussions for a principal component analysis of V i,α and V \(_{i,\alpha }^\prime\)).

Figure 4 shows the learning curves, constructed from δ RMS, for the baseline ML force fields (i.e., those of classes (I) and (II), which are based on random data sampling) created for Al, Cu, Ti, W, Si, and C using the optimized V i,α and V \(_{i,\alpha }^\prime\). For both fingerprints, δ RMS scales inversely with N t initially39 but reaches a limit at N t 500. By using V i,α instead of V \(_{i,\alpha }^\prime\), δ RMS drops from 0.16 eV/Å to 0.09 eV/Å in case of C. Significant improvement was also obtained for W, Ti, and Cu. For Al and Si, whose δ RMS obtained with V \(_{i,\alpha }^\prime\) is consistent with that reported in refs. 6,9, the reduction of δ RMS is smaller but remains noticeable.

Fig. 4
figure 4

Learning curves constructed from δ RMS, the root-mean-square error of the AGNI force field created for Al, Cu, Ti, W, Si, and C using the random sampling method and either optimized V i,α (with d = 48 and w = 0.1 Å) or optimized V \(_{i,\alpha }^\prime\) (with d = 48). Each data point (and its error bar) is obtained from 10 force fields created on 10 independently selected training sets

Force field performance

In this section, we critically examine the five classes of AGNI ML force fields considered via their prediction error. Because the forces to be evaluated may span over several orders of magnitude, δ RMS may not be sufficient for describing its performance. We propose a few other error metrics. They are (1) the absolute error δ ABS, (2) the standard deviation of the error distribution (assumed to be normal) δ STD and (3) the average of the top 1% of the force errors δ MAX. While δ ABS should be examined on the whole range of force amplitude (see Fig. 5), δ STD and δ MAX capture the regimes of small and large errors, respectively.

Fig. 5
figure 5

Atomistic forces evaluated by the domain-specific force fields developed for Al, Cu, Ti, W, Si, and C, shown against the reference DFT forces. The absolute errors of the force predictions are shown in insets as a function of reference force magnitudes. These force fields are created with the training set size N t = 1000

We first focus on the ML force fields created by four recipes that employ different fingerprints and sampling methods, i.e., (I), (II), (III), and (IV), whose δ RMS, δ STD, and δ MAX are summarized in Table 1. The evolution of the error measures from (I) to (II) demonstrates that V i,α is clearly better than V \(_{i,\alpha }^\prime\), especially for W and C. The errors associated with force binning and clustering methods are comparable, being significantly smaller than those of the force fields created using the random sampling method, especially for Ti. As shown in Fig. 5, the absolute error δ ABS of these ML force fields falls within 0.02 − 0.09 eV/Å in the whole range of force magnitude. This respectable level of error is equivalent or better than those of other contemporary ML force fields, e.g., 0.1 eV/Å for bulk Si,25 0.2 eV/Å for Si n clusters,19 and 0.04 eV/Å for W.18

Table 1 Force evaluation errors of the force fields developed, obtained for a training set size N t = 1000

Next, we discuss the applicability of the AGNI force fields within MD simulations, especially pertaining to our ability to obtain total potential energies by appropriate integration of the atomic forces along a MD trajectory, and the stability of such simulations with respect to total energy conservation. In fact, the force field of class (I) created for Al (employing V \(_{i,\alpha }^\prime\)) has been successfully used for a variety of MD simulations.5,7,8 Herein, we tested the ML force fields created using (IV), combining V i,α and the force binning sampling method [we note that (III) and (IV) offer essentially similar performance]. For this purpose, we carried our several microcanonical (NVE) and canonical (NVT) MD simulations, employing the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS).40 The simulations were performed at 200 and 400 K over 500 ps, using Δt = 0.5 fs for the time step. Along the MD trajectories, the total potential energy E pot at a time (t) is computed from that of the previous step (t − Δt) as7

$${E_{{\rm{pot}}}}(t) = {E_{{\rm{pot}}}}\left( {t - \Delta t} \right) - \Delta t\mathop {\sum}\limits_{i,\alpha } F_i^\alpha v_i^\alpha ,$$
(4)

where the sum runs over the atom index i and the Cartersian index α (with E pot(t = 0) = 0). The kinetic energy E kin, the potential energy E pot, and the total energy E tot of the NVE and NVT simulations performed on a 256 atom supercell of Cu are shown in Fig. 6a, b, respectively, while those for other materials are given in the Supporting Information. For the NVE simulations, E tot is clearly conserved, presumably due to the intrinsic error cancellation of the ML force fields used, evidenced by the symmetric distribution of the prediction errors frequently reported.5,7 The NVT simulations suggest that E pot constructed from Eq. (4) works well with the thermostats implemented in LAMMPS. The agreement between E pot computed using Eq. (4) and those computed using DFT for several snapshots taken at 100, 200, 300, 400, and 500 ps, is shown in Fig. 6c, strongly supporting the accuracy of our force fields (with respect to both atomic force and total potential energy predictions) and their applicability in MD simulations.

Fig. 6
figure 6

Kinetic energy E kin, potential energy E pot, and total energy E tot obtained for a two NVE and b two NVT MD simulations of bulk Cu, performed at 200 and 400 K. Our simulations employ the force field created using V i,α and force binning sampling method. In c, the potential energies computed during the MD simulations using Eq. (4) and DFT for the snapshots taken at 100, 200, 300, 400, and 500 ps from these simulations are shown

Table 1 and Fig. 5 also show that the domain-specific force fields created by (V) are superior to (I), (II), (III), and (IV) in terms of the considered error measurements, approaching chemical accuracy for all the materials considered. Learning curves for the domain-specific force field recipe (V) (analogous to those shown in Fig. 4 for the (I) and (II) recipes) are provided in the Supporting Information. The applicability of the domain-specific force fields within MD simulations has not been tested extensively at this point. As alluded to earlier, as configurations evolve during the course of an MD simulation, atomic environments may shift from one domain to another, and it is imperative that the force prediction during such shifts do not go through discontinuities. These aspects have to be addressed appropriately before recipe (V) can be reliably used within MD simulations. Nevertheless, as the reference dataset grows in size, and the scaling of training and prediction times become intensive, domain-specific learning approaches may become a necessary alternative in the future, and may be a powerful pathway for the creation of adaptive and high-fidelity force fields.

Discussion

ML has emerged as a powerful approach for developing a new generation of highly accurate force fields.5,6,7,8,9,12,13,14,15,16,23 Accuracy and transferability are not the only advantages of ML force fields. Their most appealing aspect is that they can be created and improved in a highly automatic manner with minimal human intervention within one unified framework. This contribution discusses one such force field, using which atomic forces can be efficiently and accurately computed given just the configuration of atoms. The total potential energy during the course of an MD simulation may then be computed through appropriate integration of the atomic forces along the trajectory. The true versatility of the present single unified strategy is rigorously demonstrated in this work for a variety of elemental solids, displaying a diversity of chemical bonding types, namely Al, Cu, W, Ti, Si, and C. For all of them, the ML force fields reach chemical accuracy while being several orders of magnitude faster than corresponding quantum mechanics-based computations. The accuracy and applicability of the ML force fields can be systematically improved without adversely affecting their efficiency. This maybe accomplished by (1) preparing the reference data at a higher level of theory for computing the force more accurately and/or (2) including new training datasets selected from new domains that have not been considered yet. We hope that after being fully developed, such ML force fields may provide a pathway for high-fidelity MD simulations to enter the regimes of time scales ( nanoseconds) and/or length scales (10 nanometers) presently difficult to reach using purely quantum mechanical simulations.

Methods

Reference data preparation with DFT

Our reference data was prepared by first-principles computations performed using the DFT,41,42 a plane-wave basis set, and the projector augmented wave (PAW) method43 as implemented in the Vienna Ab initio simulation package 44,45,46,47,48 The Perdew, Burke, and Ernzerhof functional49 was used for the exchange-correlation energies.

The elemental (bulk) materials considered are face-centered cubic (fcc) aluminum Al, fcc copper Cu, hexagonal close packed Ti, body-centered cubic, tungsten (W), diamond silicon (Si), and diamond carbon (C). For each of them, we constructed a suitable supercell of \(n_{{\rm{at}}}^{{\rm{sc}}}\) atoms, and then performed DFT-based MD simulations at two temperatures, T = 300 K and T = 600 K (for the purpose of data accumulation, the temperature is a parameter for spanning diverse environments). Atomic configurations and forces were then extracted from the MD trajectories obtained. We note that to create a reliable force field for a specific application, e.g., for systems with defects, vacancies, and grain boundaries, the reference data should be more comprehensive, including the atomic environments expected to be encountered during the MD simulations.7,8

The convergence of force calculations by DFT is generally slower than that of energy calculations. To ensure the convergence of our data, the kinetic energy cutoff E c was taken to be 130% of the default value while the projection operators (involved in the calculations of the non-local part of the PAW pseudopotentials) were evaluated in the reciprocal space. The convergence of the forces with respect to the k-point mesh, generated without considering possible symmetries of the crystals, was carefully tested. We note that the accuracy level adopted for the reference data generation is significantly higher than used for typical DFT-based MD simulations (see, for example, refs. 50,51). All the parameters used for the data generation step are given in Table 2.

Table 2 Parameters used to generate the dataset for force field generation

Details of sampling and clustering

The key idea underlying the force binning method is that the training set should reflect the force-amplitude profile of the reference dataset while ensuring sparsely populated domains have their own representatives. Therefore, this procedure involves splitting the reference dataset into a given number of equal-size bins, covering the entire range of force amplitude. Then, 30% of the desired training set was equally selected from the prepared bins. The contribution of each bin to the remaining 70% of the training set was proportional to the population of this bin. Similarly, the clustering method makes sure that any domain in fingerprint space that is occupied by the reference dataset has their own representatives in the training set. In this work, we used the k-means method to cluster the reference dataset into five clusters in fingerprint space based on the distance used in KRR. Then, 20% of the desired training set was selected from each data cluster. The number of clusters was specified beforehand.

Data availability

All the data used for this work, i.e., the MD trajectories of the six materials, is freely available in http://khazana.uconn.edu with record identification number ranging from 3,278 to 3,283.