Pretraining of attention-based deep learning potential model for molecular simulation

Machine learning assisted modeling of the inter-atomic potential energy surface (PES) is revolutionizing the field of molecular simulation. With the accumulation of high-quality electronic structure data, a model that can be pretrained on all available data and finetuned on downstream tasks with a small additional effort would bring the field to a new stage. Here we propose DPA-1, a Deep Potential model with a novel attention mechanism, which is highly effective for representing the conformation and chemical spaces of atomic systems and learning the PES. We tested DPA-1 on a number of systems and observed superior performance compared with existing benchmarks. When pretrained on large-scale datasets containing 56 elements, DPA-1 can be successfully applied to various downstream tasks with a great improvement in sample efficiency. Surprisingly, for different elements, the learned type embedding parameters form a spiral in the latent space and have a natural correspondence with their positions on the periodic table, showing interesting interpretability of the pretrained DPA-1 model.


I. INTRODUCTION
Reliably representing the inter-atomic potential energy surface (PES) is at the core of studying the properties of molecules and materials in computational physics, chemistry, materials science, biology, etc. While electronic structure methods typically give accurate and transferable PES, they are prohibitively expensive for scaling to systems of more than thousands of atoms. On the other hand, empirical force fields are much more efficient but are inherently limited in accuracy for many applications. By properly integrating machine learning (ML) methodologies and physical requirements like extensiveness and symmetries, various methods have emerged to address the accuracy vs. efficiency dilemma in the realm of PES modeling [1][2][3][4][5][6][7][8][9][10][11]. Arguably, a new paradigm is forming: electronic structure methods are no longer used to generate the driving forces during molecular dynamics simulations, but are instead used to generate data for training their alternatives, ML-based PES models.
Despite the remarkable achievements of ML-based PES models [12][13][14], challenges still remain. For a domain expert who would like to apply such methodologies in their applications, a natural first question concerns the effort needed to obtain a reliable PES model: Are there ready-to-use PES models? If not, what amount of training data and time cost would be required? Can we take advantage of the ever-increasing publicly available training data?
To address these issues, there have been several efforts. On one hand, general-purpose models for various systems, such as silicon [15], phosphorus [16], water [17], metals and alloys [18][19][20][21][22], etc., have been developed and are directly applicable to relevant studies. However, the range of applicability of such models is typically limited to a small conformation or chemical space. For example, for alloys, the majority of general-purpose ML models are developed for systems with at most two element types. On the other hand, several efficient data generation protocols have been developed [23][24][25][26]; a representative is DP-GEN [25,26], a concurrent learning procedure that iteratively explores the configuration space using models trained with existing data and then labels only those configurations with a high uncertainty level. Even with these protocols, the computational effort needed for complicated systems is still prohibitive. For example, to train a fairly general-purpose model for the AlMgCu alloy system, 100k density functional theory (DFT) [27,28] calculations were ultimately performed, at a cost of ten million CPU core hours [18].
With the accumulation of high-quality electronic structure data covering almost all the elements on the periodic table, it is becoming possible to systematically develop pretraining schemes, which have been widely adopted in areas like computer vision (CV) [29,30] and natural language processing (NLP) [31,32]. In these schemes, one first trains a unified model on large-scale datasets and then finetunes it for downstream tasks, expecting that a good representation is learned in the first stage so that the amount of supervised data needed for the second stage is significantly reduced. Recently, the pretraining-finetuning idea has been applied to organic molecular systems for energy and force predictions [33,34], and to tackle tasks beyond representing the PES [35][36][37]. Unfortunately, most ML-based PES models are premature for such schemes at scale in materials applications. Taking the two widely used versions of Deep Potential models [6,7] as examples, the ML parameters are element-type-dependent, making them highly inefficient when the training data contain many elements.
Constant efforts have been devoted to adapting the architecture of ML-based PES models to large datasets. Among them, one class of models, equivariant graph neural networks (GNNs) [38], built upon convolutions over atomic graphs with node and edge equivariant representations, has shown promise for training on large datasets. SchNet [5], PaiNN [39], GemNet-OC [40], DimeNet++ [41], PFP [42], SCN [43], SpinConv [44] and Equiformer/EquiformerV2 [45,46] are trained on the OC20/OC2M [47] dataset containing about 133M/2M data frames covering 56 elements. These models are benchmarked by the accuracy of energy, force and stable-structure predictions. Very recently, it has been shown that introducing the attention architecture [45] in a GNN model improves the performance on the OC20/OC2M dataset [46]. Chen and Ong [48] proposed M3GNet, which was trained on a subset of the Materials Project [49] containing 187,687 configurations encompassing 89 elements, labeled at the generalized gradient approximation (GGA) [50] or GGA+U level. Takamoto et al. [42] introduced the PFP model, trained on a dataset composed of molecular and crystal configurations including approximately 9 × 10^6 frames of 45 elements. Choudhary et al. [51] developed the ALIGNN model and trained it on a subset of the JARVIS-DFT dataset [52] composed of 307,113 data frames of 89 elements. The M3GNet, PFP and ALIGNN models are proposed as "universal" potential models; however, their accuracies are not on par with PES models trained for specific materials applications.
The equivariant GNN models are potential candidates for pretraining, but several issues deserve special attention before applying them in downstream real-world applications. First, GNN approaches are not well-suited for massively parallel molecular dynamics simulations [53]. The update of each GNN layer requires communication between spatially decomposed sub-regions of the system. In each evaluation of the energy and forces, several to a dozen such updates are required in total, which may lead to substantial communication overhead on massively parallel high-performance supercomputers. Second, some models, such as PaiNN, GemNet-OC, SCN and Equiformer/EquiformerV2, directly predict forces using rotationally equivariant networks [39,40,45,54] instead of taking energy gradients with respect to atomic coordinates. The predicted force is therefore not conservative, a property that serves as a basic assumption in guaranteeing the accuracy of molecular simulations [55]. Last but not least, some models, such as GemNet-OC, SpinConv, M3GNet and ALIGNN, are not smooth, i.e., a sudden energy jump may happen as the positions of atoms vary infinitesimally. This leads to non-conserved energy in Hamiltonian dynamics simulations, which are used to compute dynamical properties like the diffusion constant and viscosity.
So far, it has remained unclear how much downstream materials applications may benefit from ML models trained on large-scale datasets. To answer this question, in this article we propose DPA-1, a Deep Potential model with a novel attention mechanism. Designed with a local descriptor, this model is exceptionally well-suited for parallel simulations on large-scale systems containing millions of atoms [56]. Notably, DPA-1 predicts conservative forces, ensures smoothness, and demonstrates outstanding efficacy in learning inter-atomic interactions. Moreover, once pretrained, DPA-1 can significantly decrease the supplementary effort needed for subsequent downstream tasks. We tested DPA-1 on various systems and observed superior performance compared with existing benchmarks. We then took AlMgCu alloy systems [18] as an example, showing that after pretraining with single-element and binary samples, DPA-1 can save around 90% of the ternary samples compared with the DeepPot-SE model [7]. Finally, we pretrained DPA-1 using the OC20 dataset, which covers 56 elements, and successfully applied it to various downstream tasks. We checked the interpretability of the pretrained model by looking into the learned embedding parameters for different element types, finding that the 56 elements are arranged on a spiral in the latent space, which has a natural correspondence with their physical properties on the periodic table. We believe that DPA-1 and the pretraining scheme will bring the field of molecular simulation to a new stage.

II. METHOD
Consider a system of N atoms whose elemental types are A = {α_1, α_2, ..., α_i, ..., α_N} and whose atomic coordinates are R = {r_1, r_2, ..., r_i, ..., r_N}, with r_i being the three Cartesian coordinates of atom i. The PES of the system is denoted by E, a function of the elemental types and coordinates, i.e., E = E(A, R). For each atom i, consider its neighbors {j | j ∈ N_rc(i)}, where N_rc(i) denotes the set of atom indices j such that r_ji < r_c, with r_ji being the Euclidean distance between atoms i and j. E is represented as the summation of atomic energies {e_1, e_2, ..., e_i, ..., e_N}, where the atomic energy e_i only depends on the information of N_rc(i). We define N_i = |N_rc(i)|, the cardinality of the set N_rc(i). We use A_i to denote the element types in N_rc(i), and R_i ∈ R^(N_i×3) their corresponding coordinates relative to i. The atomic energy e_i is thus a function of A_i and R_i. The atomic force on atom i, F_i, is defined as the negative gradient of the total energy with respect to the coordinate of i:

F_i = -∂E/∂r_i.   (1)

We refer to Ref. [7] for a detailed discussion of several requirements on PES modeling. In particular, the PES has to be invariant under translation, rotation, and permutation of the indices of atoms with the same element type.
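The energy-to-force relation of Eq. 1 can be illustrated with a toy example. The sketch below is NOT the DPA-1 network: it uses a hypothetical pairwise PES and obtains forces as the negative gradient of the total energy, checked by central finite differences.

```python
import math

# Toy illustration of E = sum of atomic contributions and F_i = -dE/dr_i.
# Hypothetical pairwise PES: E = sum_{i<j} (r_ij - r0)^2, r0 = 1.
R0 = 1.0

def energy(coords):
    """Total energy of a list of [x, y, z] coordinates."""
    e = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            r = math.dist(coords[i], coords[j])
            e += (r - R0) ** 2
    return e

def forces(coords, h=1e-6):
    """F_i = -dE/dr_i, approximated by central finite differences."""
    f = []
    for i in range(len(coords)):
        fi = []
        for d in range(3):
            cp = [list(c) for c in coords]
            cm = [list(c) for c in coords]
            cp[i][d] += h
            cm[i][d] -= h
            fi.append(-(energy(cp) - energy(cm)) / (2 * h))
        f.append(fi)
    return f
```

For two atoms separated by 2.0 along x, each atom is pulled toward the other, and the forces obey Newton's third law by construction.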
The details of the model architecture are introduced below. We refer to Fig. 1 for the overall pipeline to predict the atomic energy e_i: from the embedded neighboring environment, through the self-attention scheme, to the symmetry-preserving descriptors, and finally to the fitting network.

A. Local embedding matrix with type information
We obtain the local embedding matrix with the following three steps. First, R_i is mapped to the generalized coordinates R̃_i ∈ R^(N_i×4). In this mapping, each row of R_i, {x_ji, y_ji, z_ji}, is transformed into a row of R̃_i:

(R̃_i)_j = {s(r_ji), s(r_ji) x_ji / r_ji, s(r_ji) y_ji / r_ji, s(r_ji) z_ji / r_ji},

where {x_ji, y_ji, z_ji} denotes the Cartesian coordinates of the relative position r_ji = r_j − r_i, and s(r_ji): R → R is a continuous and differentiable scalar weighting function applied to each component, defined as

s(r) = 1/r,  r < r_cs,
s(r) = (1/r) [u^3 (−6u^2 + 15u − 10) + 1],  u = (r − r_cs)/(r_c − r_cs),  r_cs ≤ r < r_c,
s(r) = 0,  r ≥ r_c.

Here r_cs is a smooth cutoff parameter that allows the components in R̃_i to smoothly go to zero at the boundary of the local region defined by r_c. Second, we add the atomic type embedding as supplemental information. For atom i, the type embedding map T_i is defined as

T_i = ϕ_T(α_i),

where α_i is the atomic type of atom i and ϕ_T is a one-hot-like embedding network mapping from α_i to a fixed-length vector. Third, given both R̃_i and the type embeddings {T_i} ∪ {T_j | j ∈ N_rc(i)}, we define the local embedding matrix G_i ∈ R^(N_i×M_1):

(G_i)_j = G(s(r_ji), T_i, T_j),

where G is a neural network mapping from the scalar weight s(r_ji) and the type embeddings of both the center and neighbor atoms, through multiple hidden layers, to M_1 outputs.
Here we simply feed the concatenated inputs into G at once, as shown in Fig. 1(b).
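The smooth weighting function s(r) can be sketched directly. The piecewise polynomial below follows a DeepPot-SE-style switching consistent with the description above; the values r_cs = 2.0 and r_c = 6.0 are illustrative, not recommended settings.

```python
def s(r, rcs=2.0, rc=6.0):
    # Smooth weighting: 1/r below the inner cutoff rcs, switched by a
    # polynomial (value and derivatives continuous at both ends) to
    # exactly 0 at the outer cutoff rc.
    if r < rcs:
        return 1.0 / r
    if r < rc:
        u = (r - rcs) / (rc - rcs)
        return (1.0 / r) * (u ** 3 * (-6 * u ** 2 + 15 * u - 10) + 1.0)
    return 0.0
```

At r = r_cs the switching term equals 1/r, and at r = r_c it vanishes, so each component of R̃_i goes smoothly to zero at the boundary of the local region.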

B. Attention method for building up trainable descriptors
The attention mechanism has achieved great success and played an increasingly important role in CV [57] and NLP [58]. It has become an excellent tool for modeling the importance or relevance of visual regions or text tokens, and is thus potentially appropriate for reweighting the interactions among neighboring atoms according to both distance and angular information.
In DPA-1, we follow the standard self-attention mechanism and obtain the queries Q_{i,l}, keys K_{i,l} and values V_{i,l}:

Q_{i,l} = Q_l(G_{i,l−1}),  K_{i,l} = K_l(G_{i,l−1}),  V_{i,l} = V_l(G_{i,l−1}),

where Q_l, K_l, V_l represent three linear transformations which output the queries and keys of dimension d_k and values of dimension d_v, and l is the index of the attention layer. Here we take G_{i,0} = G_i. Then we adopt the scaled dot-product attention method [59] to mix the neighbor features after calculating the attention weights:

G̃_{i,l} = A(Q_{i,l}, K_{i,l}, V_{i,l}, R̂_i) = φ(Q_{i,l}, K_{i,l}, R̂_i) V_{i,l},

where φ(Q_{i,l}, K_{i,l}, R̂_i) denotes the attention weights. In the original attention method, one typically has φ = softmax(Q_{i,l} (K_{i,l})^T / √d_k), with √d_k being the normalization temperature. This is slightly modified to better incorporate the angular information:

φ(Q_{i,l}, K_{i,l}, R̂_i) = softmax(Q_{i,l} (K_{i,l})^T / √d_k) ⊙ R̂_i (R̂_i)^T,

where R̂_i = R_i / ∥R_i∥_2 ∈ R^(N_i×3) denotes the normalized relative coordinates and ⊙ means element-wise multiplication. Intuitively, in the neighborhood of center atom i, neighbor atom k may be highly correlated with neighbor j when both the relative-distance attention (Q_{i,l})_j (K_{i,l})_k^T and the normalized product of relative coordinates r_ji (r_ki)^T / (r_ji r_ki) have high scores.
Then we add layer normalization in a residual way to finally obtain the self-attentioned local embedding matrix Ĝ_i. In one such attention layer,

G_{i,l} = G_{i,l−1} + LayerNorm(G̃_{i,l}),

and Ĝ_i denotes the output of the last layer. We also tried other attention-related tricks such as pre-layer normalization and multi-head attention, which brought little improvement. In practice, as shown in Fig. 1(c), we repeat this procedure l (l ≥ 2) times for a more complete representation. If not stated otherwise, we use l = 2 in the following sections of this work.
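The gated attention update described above can be sketched in plain Python. This is a simplified illustration (single head, no residual or layer normalization, generic weight matrices Wq/Wk/Wv standing in for Q_l/K_l/V_l), not the DPA-1 implementation.

```python
import math

def matmul(A, B):
    # Plain-list matrix product.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    z = sum(e)
    return [x / z for x in e]

def gated_attention(G, R_hat, Wq, Wk, Wv, d_k):
    # One gated attention layer over the N_i x M_1 embedding G:
    # softmax(Q K^T / sqrt(d_k)) is multiplied element-wise by the
    # angular gate R_hat R_hat^T (cosines between neighbor directions),
    # then applied to V.
    Q, K, V = matmul(G, Wq), matmul(G, Wk), matmul(G, Wv)
    logits = matmul(Q, transpose(K))
    weights = [softmax([x / math.sqrt(d_k) for x in row]) for row in logits]
    gate = matmul(R_hat, transpose(R_hat))  # N_i x N_i cosine matrix
    gated = [[w * g for w, g in zip(wr, gr)] for wr, gr in zip(weights, gate)]
    return matmul(gated, V)
```

Because the gate is the matrix of cosines between neighbor directions, two neighbors contribute strongly to each other's mixed features only when both the learned attention score and the angular alignment are large.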
Next, we define the encoded feature matrix D_i ∈ R^(M_1×M_2) of atom i:

D_i = (Ĝ_i)^T R̃_i (R̃_i)^T Ġ_i,

where Ġ_i stands for a sub-matrix of Ĝ_i, which takes the first M_2 (< M_1) columns of Ĝ_i. The feature matrix D_i, i.e., the descriptor, preserves all the invariances mentioned above; the proof can be found in Ref. [7].
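The rotational invariance the descriptor inherits from the product R̃_i (R̃_i)^T can be checked numerically. A minimal sketch, assuming s(r) = 1/r for simplicity:

```python
import math

def gen_coords(coords, s):
    # Generalized coordinates: each neighbor row is
    # (s(r), s(r) x/r, s(r) y/r, s(r) z/r).
    out = []
    for x, y, z in coords:
        r = math.sqrt(x * x + y * y + z * z)
        w = s(r)
        out.append([w, w * x / r, w * y / r, w * z / r])
    return out

def gram(Rt):
    # The N_i x N_i matrix R~ R~^T, the rotation-invariant core of the
    # descriptor contraction.
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in Rt] for ri in Rt]
```

Rotating all neighbor coordinates by the same rotation leaves gram(gen_coords(...)) unchanged, since each entry depends only on distances and on angles between neighbor directions.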
We then pass the reshaped D_i, concatenated with the type embedding of the center atom, through the multi-layer fitting network F:

e_i = F(concat(D_i, T_i)).

The total energy of the system is then given as the summation of e_i, and the atomic force F_i can be computed via Eq. 1.

C. Model (pre-)training and finetuning
For model training or pretraining, we adopted the Adam stochastic gradient descent method [60] on all the trainable parameters w inside the model to minimize the loss

L(p_ε, p_f) = (1/|B|) Σ_{t∈B} [ (p_ε / N) |E_w^(t) − E^(t)|^2 + (p_f / 3N) Σ_i |F_{w,i}^(t) − F_i^(t)|^2 ].   (12)

Here B represents a minibatch, |B| is the batch size, and t denotes the index of a training sample. E_w and F_w denote the model outputs, and E and F are the corresponding DFT results. We also adopted a scheduler to tune the prefactors p_ε and p_f during the training process to achieve a better balance between energy and force labels. Virial errors, which are omitted here, can be added to the loss for training if available.
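The loss can be sketched as follows; a minimal version assuming a per-atom normalization of the energy term, with batch entries as (E_pred, E_ref, F_pred, F_ref, n_atoms) tuples:

```python
def loss(batch, p_e, p_f):
    # Minibatch loss in the assumed form of Eq. (12): per-atom energy
    # error plus per-component force error, weighted by prefactors
    # p_e and p_f (tuned by a scheduler during training).
    total = 0.0
    for E_pred, E_ref, F_pred, F_ref, n_atoms in batch:
        e_term = p_e / n_atoms * (E_pred - E_ref) ** 2
        f_term = p_f / (3 * n_atoms) * sum(
            (fp - fr) ** 2
            for Fp, Fr in zip(F_pred, F_ref)
            for fp, fr in zip(Fp, Fr))
        total += e_term + f_term
    return total / len(batch)
```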
To finetune the pretrained model on a new dataset, we first replace the energy bias in the last layer of the pretrained model with the statistics of the new dataset, and then fix part of the parameters in the pretrained model and train the remaining ones. In the following experiments, we obtained the best performance when only the type embedding parameters were fixed.
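The freezing strategy can be sketched as a gradient step that skips the type-embedding parameters. Parameter names ("type_embedding.*", "fitting.*") are illustrative, not the actual DPA-1 variable names.

```python
def sgd_finetune_step(params, grads, lr, frozen=("type_embedding",)):
    # One plain-SGD finetuning step that keeps the pretrained
    # type-embedding parameters fixed and updates everything else.
    new = {}
    for name, value in params.items():
        if any(name.startswith(f) for f in frozen):
            new[name] = value  # frozen: carried over unchanged
        else:
            new[name] = value - lr * grads[name]
    return new
```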

III. EXPERIMENTS
We conducted a number of experiments to evaluate the performance of DPA-1. First, to test the model's ability to transfer among different compositions, we trained it from scratch on various systems and tested it under several challenging schemes. Then, we used an AlMgCu dataset to test its ability to transfer to ternary systems upon pretraining with single-element and binary data. Finally, we pretrained DPA-1 using the OC2M subset of the OC20 dataset [47] and applied it to various downstream tasks. To illustrate the effectiveness of the type-embedding and attention schemes, we compared against the DeepPot-SE model [7] in all the experiments. In the following, we first introduce the datasets we used and then the experiments we conducted.

A. Datasets
AlMgCu alloy systems [18]. This dataset was generated using DP-GEN [26], a concurrent learning scheme. After exploring 2.73 billion alloy configurations (derived from ∼2000 bulk and surface systems), only a small portion (∼100k configurations) of them were labeled and then composed the compact dataset. The exploration ran over the whole concentration space, i.e., Al_x Cu_y Mg_z with 0 ≤ x, y, z ≤ 1 and x + y + z = 1, where x, y, z take discrete values permitted by the finite-size simulation boxes. We can divide the systems into single, binary and ternary subsets, named by the number of non-zero values among x, y, z. The configuration space covers a temperature range of about 50.0 K to 2579.8 K and a pressure range of about 1 bar to 50000 bar.
Solid-state electrolyte (SSE) systems [61]. These systems contain Li10XP2S12-type SSE materials, where X represents a single element or a combination of Ge/Si/Sn, and can be divided into three main parts: init, mix and single. The init part comes from a standard DP-GEN scheme starting from 590 structures that were generated by slightly perturbing the DFT-relaxed crystal structures Li10Ge(PS6)2, Li10SiP2S12 and Li10SnP2S12 from the Materials Project [49]. The exploration covers both ordered structures relaxed by DFT (i.e., structures downloaded from the Materials Project database, in which the positions of the Ge/Si/Sn/P atoms are fixed) and disordered structures whose 4d sites are randomly occupied by Ge/Si/Sn/P. Based on the init part, the mix part contains further exploration of binary and ternary mixtures of Ge/Si/Sn, while the single part covers only a single X in Ge/Si/Sn with other changes in the lattice and the ratio of Li.
HEA systems. The high-entropy alloy (HEA) dataset includes bulk TaNbWMoVAl alloy systems of various configurations and compositions. We employed DP-GEN to explore the composition space, starting from Ta3Nb3W3Mo3V3Al1, a 16-atom unit cell containing the first five elements as main components and Al as an additive. The dataset is divided into two subsets: interior and exterior. The interior (higher-entropy) subset includes composition variations near the starting point; it covers six-component, quinary, quaternary and ternary alloys. The exterior (lower-entropy) subset includes systems that are close to the corners and edges of the composition space: systems where one or two elements dominate, binary alloys and single-element systems. For both subsets, the temperature range is about 50.0 K to 388.1 K and the pressure range is about 1 bar to 50000 bar.
OC20 [47]. OC20 consists of single adsorbates (small molecules) physically binding to the surfaces of catalysts covering periodic bulk materials with 56 elements. Both the chemical diversity and the system sizes are much more complex than in other benchmark datasets, such as MD17 [62], ANI-1x [24] or QM9 [63]. OC2M is a subset including 2 million data points (energies and forces) randomly sampled from OC20, which is still challenging for model training and suitable for pretraining. Gasteiger et al. recently provided several baselines on OC2M, taking months to converge [40].

B. Accuracy on various datasets, trained from scratch
The majority of existing models focus on the ability to transfer among different configurations, in which case the training and validation subsets consist of similar compositions (e.g., randomly sampled from the same dataset). However, for pretraining, the upstream and downstream datasets may differ significantly. It is thus vital for models under the pretraining scheme to transfer among different compositions, or even among different datasets, which has, as far as we know, rarely been discussed before. In this work, we mainly focus on this more general but challenging scheme to comprehensively test the generalization ability of the model.
We first designed several challenging tasks to test the model's ability to transfer among different compositions. For the AlMgCu, SSE, and HEA systems, we divided them into subsets with different compositions for training and validation (see Sec. III A for details). The results of DPA-1 and DeepPot-SE are shown in Table I. With the training loss nearly the same (omitted in the table), DPA-1 drastically outperforms DeepPot-SE in validation accuracy. For example, for the AlMgCu systems, when trained only on single- and binary-element samples, the validation RMSE of DPA-1 on ternary samples outperforms that of DeepPot-SE by one order of magnitude (6.99 versus 65.1 meV/atom). This suggests that the DPA-1 model might have learned the latent interactions of the ternary combination Al-Mg-Cu from the binary pairs Al-Mg, Al-Cu and Mg-Cu and the single-element interactions, possibly thanks to the type-embedding scheme and attention mechanism. We conducted an ablation study on the HEA systems in Appendix A to demonstrate the influence of each structural component.
To test the performance of DPA-1 in predicting more physical quantities, we performed geometry relaxations on all AlMgCu ternary alloys available from the Materials Project to evaluate the accuracy in predicting formation energies and equilibrium volumes (see details in Appendix B). We also used DPA-1 to calculate the elastic moduli of AlMgCu systems, which requires accurately capturing second-order information (see details in Appendix C). Additionally, we carried out molecular dynamics simulations on LiGePS systems to assess the diffusion coefficients as functions of temperature, comparing the results with ab initio molecular dynamics (AIMD) simulations and experimental studies (see details in Appendix D). In all tests, satisfactory agreement with the DFT and/or experimental references is obtained.
As a supplement, we also trained the DPA-1 model on several simple systems to compare with other ML-based PES models. Since these tasks are much easier than the above ones and outside our main focus, we place the results in Appendix F. Note that there may be relatively little room for improvement on these simple datasets.

C. Sample efficiency of pretrained models
As shown in Fig. 2, we use learning curves to illustrate the amount of additional training data saved for downstream tasks thanks to model pretraining.
In all the experiments, the learning curves were generated by an active learning procedure, in which a pool of data labeled with energies and forces is prepared and three steps are repeated iteratively: using the samples in the training pool to train the model; testing the model on the remaining samples; and selecting the 50 samples with the largest prediction errors on per-atom energies and adding them to the training pool. We use the term sample efficiency to denote the number of training samples required by a model to achieve a given accuracy level on a certain task.
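The three-step procedure can be sketched as a loop; fit and error below are user-supplied stand-ins for model training and per-sample error evaluation.

```python
def active_learning(train_pool, candidates, fit, error, rounds, k=50):
    # Sketch of the loop described above: train on the pool, score the
    # remaining samples, move the k worst-predicted samples into the pool.
    # fit(pool) -> model; error(model, sample) -> float (per-atom energy error).
    curves = []
    for _ in range(rounds):
        model = fit(train_pool)
        ranked = sorted(candidates, key=lambda smp: error(model, smp), reverse=True)
        worst, candidates = ranked[:k], ranked[k:]
        train_pool = train_pool + worst
        curves.append((len(train_pool), sum(error(model, smp) for smp in candidates)))
    return train_pool, candidates, curves
```

Plotting the recorded (pool size, residual error) pairs gives exactly the learning curves of Fig. 2.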
We started with a relatively simple task to compare DeepPot-SE and DPA-1. In this task, both models were pretrained using the single-element and binary subsets of the AlMgCu systems, and the learning curves were obtained using the AlMgCu ternary subset. As shown in Fig. 2(a), DPA-1 exhibits a much better sample efficiency than DeepPot-SE, as expected.
Next, we used the OC2M dataset, which contains 56 elements, to pretrain DPA-1 and evaluated its performance on the HEA systems and the AlCu systems (Figs. 2(b) and (c), respectively). As shown in Fig. 3(c), the training cost of DeepPot-SE scales quadratically with the number of elements, making its pretraining computationally infeasible, while the number of elements has no effect on the training cost of DPA-1. We observe that the sample efficiency of DPA-1 pretrained on OC2M is generally better than that of DPA-1 trained from scratch, while DeepPot-SE from scratch is the worst. Moreover, compared with the AlCu systems, the improvement from pretraining is much more significant for the HEA systems, possibly because the number of elements in HEA is much larger than in AlCu and the local chemical environments are much more complicated.
Equivariant GNN models usually need thousands of GPU hours to be trained to a decent accuracy [40]. By contrast, the DPA-1 model takes less than 200 GPU hours for training. The converged energy and force MAEs on the OC2M validation set are 0.681 eV and 0.076 eV/Å, respectively. This accuracy is comparable with the best energy-conserving GNN model, DimeNet++, which achieves MAEs of 0.805 eV and 0.066 eV/Å, as reported in Ref. [40]. A better performance of 0.286 eV energy MAE and 0.026 eV/Å force MAE is achieved by GemNet-OC, at the cost of non-conservative forces and a loss of smoothness [40].

D. Interpretability of type embedding learned from pretraining
To see whether DPA-1 can learn physically meaningful information from pretraining, we investigated the 3-dimensional principal component analysis (PCA) visualization of the learned type embeddings in the OC2M-pretrained model. Interestingly, as shown in Fig. 3(a), the arrangement of the elements generally follows the shape of a downward spiral. Elements belonging to the same period are lined up along the direction of the spiral, while elements belonging to the same family are listed in the direction orthogonal to the spiral. Even though some transition-metal elements are almost bunched together, this rule still roughly holds. C, N and O are outliers, possibly because in OC2M they mostly occur in organic molecules, which serve as adsorbates and have chemical environments very different from those of the other elements.
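The PCA behind this visualization can be sketched with a power iteration for the leading component; a minimal illustration, not the actual analysis code used for Fig. 3(a).

```python
def pca_top_component(X, iters=200):
    # Minimal PCA sketch: center the type-embedding vectors, build the
    # covariance matrix, and find the leading principal axis by power
    # iteration (repeated matrix-vector products plus normalization).
    n, d = len(X), len(X[0])
    mean = [sum(col) / n for col in zip(*X)]
    Xc = [[x - m for x, m in zip(row, mean)] for row in X]
    cov = [[sum(Xc[k][i] * Xc[k][j] for k in range(n)) / n for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v  # unit vector along the direction of largest variance
```

Projecting each element's embedding onto the top three such axes yields the 3D coordinates plotted in the spiral.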
In addition, we performed interpolation experiments for the type embedding of Li, an element unseen in OC2M. As shown in Fig. 3(b), we let T_Li = λ(Na) · T_Na + (1 − λ(Na)) · T_H, since Li lies between H and Na in the same family. When testing on the SSE system, only the bias in the atomic energy is changed, since the setup of the electronic structure method used to label the SSE system differs from that of OC2M, which typically causes an energy shift. We find that the RMSE of energy and force shows a sudden drop when λ(Na) = 0.7, which matches chemical intuition and further confirms the interpretability of the pretrained DPA-1 model. Moreover, we conducted analogous interpolation experiments for Nb and Mo on the HEA systems and reached similar conclusions (see the detailed report in Appendix E).
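The interpolation itself is a one-liner; a sketch with λ(Na) written as lam and the embeddings as plain lists:

```python
def interpolate_type_embedding(t_na, t_h, lam):
    # T_Li = lam * T_Na + (1 - lam) * T_H: a linear blend of the Na and H
    # embeddings standing in for the unseen element Li.
    return [lam * a + (1.0 - lam) * b for a, b in zip(t_na, t_h)]
```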

IV. SUMMARY
In this paper, we developed DPA-1, an attention-based Deep Potential model that allows for large-scale pretraining on atomistic datasets. We tested DPA-1 from different aspects, showing its excellent accuracy on various datasets when trained from scratch, as well as its sample efficiency when pretrained with existing data. Further investigations of the type embedding parameters suggest the interpretability of DPA-1 pretrained on OC2M.
In the future, it will be of interest to extend the training dataset to cover the full periodic table and, in particular, to see a more converged "spiral" in the latent space; the embedding information of local chemical environments may be useful for characterizing different conformations. Multi-task and unsupervised training schemes would be worth exploring; and, for downstream tasks, just as has happened in the fields of CV and NLP, schemes like model compression, distillation and transfer are much needed. We leave these possibilities and more applications to future works.

VIII. COMPETING INTERESTS
The authors declare no competing financial or non-financial interests.

As shown in Table A.2 (Appendix F), we trained DPA-1 from scratch on simple bulk systems [6] and compared it with the embedded atom neural network (EANN) potential [10] and DeepPot-SE. This small dataset contains two types of systems. The first type includes general systems, such as the relatively easy Cu, Ge, Si and Al2O3 with one single solid phase, and more challenging systems like C5H5N (pyridine) and TiO2 with two and three phases, respectively. The second type contains a grand-canonical-like system of supported Pt clusters on a MoS2 slab and a CoCrFeMnNi high-entropy alloy (HEA) system.

FIG. 1. Schematic illustration of DPA-1. (a) Flowchart from A_i and R_i to the atomic energy e_i. (b) Structure of the embedding net, which maps s(r_ji) and T_i, through multiple residual layers, to G_i. (c) Self-attention mechanism on G_i through a standard scaled-dot procedure gated by the angular information R̂_i (R̂_i)^T. (d) Fitting net structure, similar to the embedding net, from the descriptor D_i and T_i to the final atomic energy e_i.
FIG. 2. Learning curves of both energy and force with DeepPot-SE and DPA-1, under different setups and on different systems. (a) Learning curves on the AlMgCu ternary subset, with DeepPot-SE and DPA-1 models pretrained on the single-element and binary subsets; (b-c) Learning curves on HEA (b) and AlCu (c), with DeepPot-SE (from scratch) and DPA-1 (both from scratch and pretrained on OC2M). The red line represents the full-data-training baseline with DPA-1.
FIG. 3. (a) 3-dimensional PCA visualization of the learned type embeddings of DPA-1 pretrained on OC2M. The 56 elements are roughly arranged on a spiral in the latent space. Elements in the fourth period are connected by the red line and elements belonging to the same family are grouped by the blue dotted lines. Colors on the names of the elements represent the height along the z-axis. We use a dashed circle to denote the hypothetical position of Li, which is not contained in OC2M. See text for discussion. (b) RMSE of energy and force for the SSE systems given by DPA-1 pretrained on OC2M, as functions of the linear interpolation coefficient λ(Na). Since Li is not contained in OC2M, we let T_Li = λ(Na) · T_Na + (1 − λ(Na)) · T_H be the interpolated type embedding of Li. The OC2M-pretrained model with this interpolation and a modified energy bias is directly tested on the SSE systems without further training. (c) Training efficiency of DPA-1 and DeepPot-SE (considering type information of both sides) with a growing number of element types in the training systems. The maximum number of neighboring atoms considered is set to 120 in all the experiments.
VII. AUTHOR CONTRIBUTIONS

D.Z., L.Z., H.W. and F.Z.D. conceived the idea of this work. D.Z., H.B. and H.W. designed the model structure. D.Z. implemented the model. D.Z., H.B. and W.J. performed the experiments on different systems. All authors contributed to the discussions and edited the manuscript.
Appendix F: Results on simple datasets

FIG. A.4. The RMSE of energy and force on the HEA system using embedded element-type interpolation for (a)(b) single-element replacements and (c)(d) two-element replacements. Nb_wo, Mo_wo and NbMo_wo denote models trained on purposefully altered OC2M datasets that excluded frames containing Nb, Mo, and either Nb or Mo, respectively.

TABLE I. Validation RMSE of DPA-1 and DeepPot-SE on energy (meV/atom) and atomic forces (meV/Å) with different settings of the training/validation sets (see Sec. III A for details). The number of attention layers l in DPA-1 is set to 2 for the AlMgCu and SSE systems, and to 3 for the HEA systems. Bold numbers correspond to lower values.

TABLE A.1. Ablation study of the DPA-1 model architecture. The models are validated by the energy (meV/atom) and atomic force (meV/Å) RMSEs calculated with different settings of the training/validation sets. Bold numbers correspond to the lowest values.