Discrepancies and Error Evaluation Metrics for Machine Learning Interatomic Potentials

Machine learning interatomic potentials (MLPs) are a promising technique for atomic modeling. While high accuracy and small errors are widely reported for MLPs, an open concern is whether MLPs can accurately reproduce atomistic dynamics and related physical properties in their applications in molecular dynamics (MD) simulations. In this study, we examine the current state-of-the-art MLPs and uncover a number of discrepancies related to atom dynamics, defects, and rare events (REs), in their MD simulations compared to ab initio methods. Our findings reveal that low averaged errors by current MLP testing are insufficient, leading us to develop novel quantitative metrics that better indicate the accurate prediction of related properties by MLPs in MD simulations. The MLPs optimized by the RE-based evaluation metrics are demonstrated to have improved prediction in multiple properties. The identified errors, the developed evaluation metrics, and the proposed process of developing such metrics are general to MLPs, thus providing valuable guidance for future testing, development, and improvements of accurate, robust, and reliable MLPs for atomistic modeling.


Introduction.
Atomistic modeling, which simulates physical phenomena based on the interactions of atoms, is a crucial research technique in a wide range of disciplines including physics, chemistry, biology, and materials science. Density functional theory (DFT) calculation have been the standard technique for evaluating atom interactions among a diverse range of configurations and chemistries, but their applications are limited to small system sizes of a few hundred atoms (up to a few nm) due to high computation costs. [1][2][3] By contrast, classical interatomic potentials, also known as force fields, have significantly lower computation costs and thus can be employed for atomistic simulations with much larger length-scale (nm -m) and longer time-scale (ns -s), but they lack the transferability to different atomistic configurations that are not considered in the potential fitting. 1,4,5 As an emerging technique to bridge the gaps among different computational (SOAP) descriptors, 11,12 Neural Network Potential (NNP), 13,14 Spectral Neighbor Analysis Potential (SNAP), 15 Moment Tensor Potential (MTP), 16 and Deep Potential (DeePMD) models, 5 and many MLIP variances derived from different modifications and combinations of ML models and descriptors. MLIPs are trained using DFT-calculated energies and forces from a diverse range of atomistic configurations, typically encompassing bulk and defected structures, equilibrium and non-equilibrium structures, and solid and liquid phases. Current state-of-the-art MLIPs are claimed to achieve accuracies similar to ab 3 initio calculations, 1,5,11,13,[15][16][17][18]  reported to achieve small, average errors of energies and atomic forces as low as 1 meV atom -1 and 0.05 eV Å -1 , 1,5,8,11,19,20 respectively, in conventional ML testing. These low averaged errors reported have created the impression that MLIPs are as accurate as DFT calculations. However, these MLIPs with small average errors may not always accurately reproduce the physical phenomena in atomistic simulations, as shown in the following examples. 10,21,22 An MLIP of Al by Botu et al. was reported a low MAE force error of 0.03 eV Å -1 , but its MD simulations predicted the activation energy of Al vacancy diffusion with an error of 0.1 eV compared to the DFT value of 0.59 eV, even though vacancy structures and vacancy diffusion were included in the training dataset. 23 In Vandermause et al. 20 , an Al MLIP with a low RMSE force error of 0.05 eV Å -1 for solid phase and 0.12 eV Å -1 for liquid phase also exhibited discrepancies with DFT in in surface adatom migration, which were considered during the on-the-fly training. Zuo et al. 1  the PES beyond their equilibrium sites, but the direct testing of MLIPs on these atomisticlevel details in MD simulations still shows errors and failures. 24 As reported by Fu et al. 24 , the MD simulations based on MLIPs observe errors, such as radial density functions, and even the failure of the MD simulations after a certain duration. These results suggest there are errors in the MLIPs causing these errors and failures from actual MD simulations.
It is crucial to examine the accuracy of MLIPs in simulating the atomic dynamics and reproducing physical properties, understand potential discrepancies, and develop appropriate testing metrics.
Typical approaches to improve the MLIPs include adjusting the fraction or weights of certain structures in the training dataset, modifying the cost/loss functions, and tuning hyperparameters. 25 The average errors in energies and forces or a few easily computable properties, such as elastic constants, energy vs. volume curves for different crystal structures, and formation energies of point defects, 1,26 are often used to optimize and select the MLIP models. However, to test and quantify those errors that can only be directly observed in actual MD simulations, such as the errors in diffusional properties, one would need to conduct numerous tests in MD simulations for an extended duration 5 before selecting the final MLIP models. 24 This approach requires a large computational cost of running MD simulations to select MLIPs, which may be impractical for optimizing MLIPs with many combinations of training variables and hyperparameters. Therefore, appropriate testing metrics should be developed to thoroughly gauge the ability of an MLIP in reproducing atomic dynamics and physical properties in a range of typically encountered physical situations, and such quantitative metrics are crucial for the further improvement of MLIPs.
In this study, by comparing atomic dynamics from MLIP-MD simulations and ab initio MD (AIMD) simulations, we reveal that state-of-the-art MLIPs, even with carefully selected training datasets and small average errors evaluated by conventional testing, may not fully reproduce atomic dynamics or related properties (Section 2.1). The tested MLIPs show discrepancies in diffusions or rare events (Section 2.2), defect configurations (Section 2.3), and atomic vibrations (Section 2.4). We then develop the error evaluation metrics based on the atomic forces of RE atoms (Section 2.5) and demonstrate them for indicating the performance of MLIPs on atomistic dynamics in MD simulations (Section 2.6). The MLIPs trained with enhanced RE data and selected by newly developed metrics show improved predictions of atom dynamics and diffusional properties. In the end, we summarize our process of developing the evaluation metrics for the observed simulationbased discrepancies. The identified discrepancies, our evaluation metrics, and their development process are general to all MLIPs and can serve as guidance for future development and improvements of accurate, robust, and reliable MLIPs for atomistic modeling.

Discrepancies of MLIPs on atomic dynamics between MLIP and DFT occur even when low average errors are reported
We conduct our study on a number of MLIPs to show the observed phenomena are general among MLIPs. In order to perform a consistent comparison, we directly retrieve the MLIP (GAP, GAPPRX, NNP, SNAP, and MTP) models of Si from previous studies 1, 26 , besides the DeePMD model 27  . In order to quantify the errors on predicted energies and atomic forces of these MLIPs, we construct interstitial-RE and vacancy-RE testing sets, RE-I Testing and RE-V Testing , respectively, each consisting of 100 snapshots of atomic configurations with a single migrating vacancy or interstitial, respectively, from AIMD simulations at 1230 K ( Fig. 1a and b, Methods), with the true values of energies and atomic forces evaluated by DFT calculations with a k-point mesh of 4×4×4 (DFT K4). The MLIPs accurately predict the energies and atomic forces in the training dataset and the testing dataset in consistency with the original studies, 1,26 showing low root-mean-square errors (RMSEs) below 10 meV atom -1 for energies and 0.3 eV Å -1 for forces (with the median force magnitude of 1.67 eV Å -1 ) for most MLIPs on the vacancy-RE testing set RE-V Testing (Fig. 1e,  We also test the MLIPs on the structures not included in the training data (Fig. 1c), specifically the interstitial-RE testing dataset RE-I Testing , which comprises the snapshots from the AIMD simulations of Si supercell with an interstitial. Most MLIPs (except for GAPPRX) do not include the configurations with an interstitial in the training dataset. While the MLIPs prediction of atomistic energies show a bias with an average offset of 10 -13 meV atom -1 lower than DFT values, the overall RMSEs are below 15 meV atom -1 and 0.3 eV Å -1 (with the median force magnitude of 1.69 eV Å -1 ) for these interstitial structures for most MLIPs (

Rare events are sources of discrepancies
Here the MD simulations using these MLIPs are performed to simulate atom dynamics and related physical properties in a Si supercell with a single vacancy or a single interstitial, to identify potential discrepancies between MLIP-MD and AIMD simulations. The Si diffusivities are evaluated at a range of temperatures 730 -1600 K from the mean-squared-displacements of Si over time (Methods) ( Fig. 2a and b). Given because AIMD simulations based on K4 is takes prohibitively long to obtain enough number of atom hops. In addition, these lower accuracy settings are commonly utilized for AIMD simulations in previous studies. 4,29,30 The errors with lower-accuracy DFT calculations also can be used as error ranges for comparison.
For vacancy diffusion, which is covered in the training data, most MLIPs predict diffusivities within a reasonable error range of the diffusivities predicted by K2 AIMD simulations. Most MLIPs perform better than DFT-K1. DeePMD gives higher diffusivities than AIMD simulations, but an agreement on the fitted activation energy (0.17 eV compared with 0.2 eV given by AIMD K2). Some MLIPs show discrepancies and deviations among each other in the diffusivities at lower temperatures, leading to discrepancies in the fitted activation energies and extrapolated room temperature diffusivities. Nonetheless, the comparison of the MLIPs and AIMD simulations should take into consideration the stochastic nature of estimating diffusivities from MD simulations and the limited number of data points, which would lead to large uncertainty of the fitted activation energy and extrapolated diffusivity. 4 As shown in the previous study 4 Table 2). The diffusivities predicted by GAP and MTP agree reasonably with AIMD simulations over the temperature range 1000 -1600 K, and the fitted activation energies of interstitial diffusion, Ea I , show minor differences. GAPPRX, which is the only MLIP considered interstitial in the training, also shows discrepancies in interstitial diffusivities with a low Ea I of 0.12 eV.
SNAP shows a Ea I of 0.74 eV, much higher than AIMD K2 (0.30 eV). For NNP, the crystalline Si structure melted during the MD simulations at the temperature of 730 -1600 K (Supplementary Note 4). According to the tests, many MLIPs show some discrepancies in predicting diffusivity, activation energy, or both, and thus the ability to fully reproduce the atom dynamics of Si interstitial diffusion is limited. The discrepancies in diffusional properties indicate that having low average errors in energies and forces are insufficient error evaluation metrics to judge whether the atomic dynamics of diffusions in MD simulations are accurate.

Configurations with similar energies are discrepancy sources
In order to reveal the discrepancies of MLIPs in predicting interstitial diffusions, we analyze the snapshots from MLIP-MD simulations of interstitials in comparison with AIMD simulations. DFT studies [31][32][33] reported three types of Si interstitials, such as the groundstate split-<110> (Fig. 3a), tetrahedral (Fig. 3b), and hexagonal ( Fig. 3c) interstitials, with the formation energies Ef of 3.56, 3.72, and 3.59 eV, respectively, from DFT (K4) calculations (Fig. 3). Consistent with the trend of the formation energies, the split-<110> interstitial has higher occurrence frequency in the AIMD simulations than tetrahedral or hexagonal (Methods) ( Fig. 3d and e). 31 give lower Ef for the tetrahedral interstitial than the ground-state split-<110> interstitial

Vibration near defects is a discrepancy source
The good performance of MLIPs on Si vacancies is expected given the dynamical snapshots of Si vacancies in a wide range of conditions are well covered in the training dataset. However, the phonon dispersion of the vacancy structure predicted by the MLIPs In summary, we have identified discrepancies based on observations of atomic dynamics in MD simulations, such as diffusions, configurations of defects, and atomic vibrations. In order to train MLIPs that can more accurately reproduce these dynamical phenomena, we need to quantify these discrepancies by developing corresponding error evaluation metrics, which can be further used to train and select the MLIPs with the highest metric scores.

Quantifying the force errors on RE migrating atoms
Using the discrepancies on atom diffusions as an example, we here develop corresponding error evaluation metrics, and improve the performances of MLIPs on diffusional properties. The process is as follows. We first develop a number of metrics for quantifying the aforementioned sources of discrepancies (Section 2.5). The evaluation metrics are then statistically verified to effectively indicate diffusional properties derived from atom dynamics in MD simulations (Section 2.6). This process can be generalized to improve the training and testing of MLIPs, as summarized in Section 2.7.
Given the aforementioned discrepancies in diffusional properties, we identify the sources of errors on migrating atoms, which are the atoms of interstitial or vacancy in the middle of the hopping from the current equilibrium site to the neighboring site (Methods).
These atom hops are known as rare events (REs) in MD simulations. To quantify the errors of MLIP predictions for RE atoms, we compare the predicted atomic forces with 14 DFT results on these RE atoms. We evaluate the error of atomic force in the magnitude and the direction of as Eq. (2) where is the benchmark true force values calculated by DFT K4, and is the force predicted by the MLIPs (or DFT K1, or K2). For the RE atoms in the RE-I Testing and RE-V Testing datasets, large errors were found in the force magnitude F (Fig. 5a) and force direction ( Fig. 5f) (Fig. 5b).
Among these atoms with F > 0.5 eV Å -1 , 10 -35% interstitials and 3 -15% vacancies also exhibit significant force direction errors of > 30º (Fig. 5g), which would lead to major errors in predicting atom dynamics in MD simulations. Therefore, these large errors in MLIP-predicted forces on RE atoms (Fig. 5d, e, i, and j) and their nearby atoms (Supplementary Table 7 These quantitative metrics can be effectively used in training, validation, and testing to improve MLIPs as demonstrated in the next section.

Process of developing error evaluation metrics
Developing error evaluation metrics that are indicative of the predictions of atomic dynamics is essential for the development of MLIPs. In the conventional training process, optimizing MLIPs on properties other than the errors of predicted energies and forces can be either done by 1) adding additional terms and weights into the loss functions as training targets (Fig. 7a-i), 1,25 or 2) adding additional metrics or material properties, such as elastic tensors in Ref. 1 , when deciding the optimal hyperparameters for MLIPs (Fig. 7a-ii).
However, this conventional approach would be computationally expensive for evaluating atomic dynamics (Fig. 6a-

Trade-offs in MLIP performances: Pareto fronts of MLIPs
Additionally, we compare the performance and force accuracies of MLIPs based on a variety of ML algorithms and atomic descriptors and observe the trade-off the accuracies on different properties (Fig. 6c). MTP gives accurate predictions for atomic forces and activation energies but shows discrepancies in the predicted atomic vibrations near vacancy (Fig. 3f, Fig. 6a, and c).
GAP has good predictions on diffusivities in MD simulations but shows discrepancies on phonon spectra (Fig. 1f, g, and h). DeePMD reproduces activation energies of diffusion but shows errors for diffusivities, atomic forces, and interstitial configurations (Supplementary Table 2, Supplementary Table 3, Fig. 1, Fig. 2, Fig. 4, and Supplementary Figure 12). Therefore, the performance of MLIPs should not be judged by a single property or a few properties. Even if an MLIP may give good performance for a number of properties, good performance for other properties should not be assumed.
Furthermore, overcoming the trade-off of MLIPs' performance on different properties is required to further improve MLIPs for a wide variety of physical simulations. While a careful selection and balance for different types of defect data are known to be essential, it is also important to have a systematic process and quantitative metrics to train MLIPs with balanced accuracies in a range of structures and properties.

Discussion.
Our study presents a systematic testing on the current state-of-the-art MLIPs,  (Fig. 6a) and testing of MLIPs (Fig. 3, 6c). Hoover thermostat was adopted. To obtain diffusional properties, AIMD simulations were performed at different temperatures 730, 840, 1000, 1230, 1500, and 1600 K following the same scheme in Ref. 4,45 . Missing diffusivities at certain temperatures in Fig. 2a   The values of the distance to the nearest static site, rs, the distances to its nearest vacancy, rv, and the distances to its nearest RE atom, r, were quantified correspondingly from these static sites. All distributions in Fig. 3c  For the selection of optimal MLIP models, which were used to study the force performances of MLIPs in Fig. 6c Using these evaluation scores, we first selected those MLIPs with the lowest RMSE or the highest NAC in any one of the eight criteria, giving a total of 46 MLIPs as interstitial-enhanced MLIPs. The same optimization process using eight criterion was also applied on selecting additional MLIPs trained by the original dataset in Ref. 1

Data availability statement
The structural (POSCAR files), energies, and forces data to support the finding of this study, including original training dataset from Ref. 1 , the enhanced validation set EVS , the interstitial-enhanced training set, the interstitial-RE testing set RE-I Testing , the vacancyenhanced training set, the vacancy-RE testing set RE-V Testing are available from: https://github.com/mogroupumd/Silicon_MLIP_datasets

Code availability statement
The computation codes and programs to support the finding of this study is available from the corresponding author on reasonable request. All DFT calculations are