Accurate and efficient molecular dynamics based on machine learning and non von Neumann architecture

Mo, Pinghui; Li, Chang; Zhao, Dan; Zhang, Yujia; Shi, Mengchao; Li, Junhua; Liu, Jie

doi:10.1038/s41524-022-00773-z

Article
Open access
Published: 09 May 2022

Pinghui Mo¹,
Chang Li¹,
Dan Zhao¹,
Yujia Zhang¹,
Mengchao Shi¹,
Junhua Li¹ &
…
Jie Liu ORCID: orcid.org/0000-0001-7350-9850^1,2

npj Computational Materials volume 8, Article number: 107 (2022)

Abstract

Force field-based classical molecular dynamics (CMD) is efficient but its potential energy surface (PES) prediction error can be very large. Density functional theory (DFT)-based ab-initio molecular dynamics (AIMD) is accurate but computational cost limits its applications to small systems. Here, we propose a molecular dynamics (MD) methodology which can simultaneously achieve both AIMD-level high accuracy and CMD-level high efficiency. The high accuracy is achieved by exploiting deep neural network (DNN)’s arbitrarily-high precision to fit PES. The high efficiency is achieved by deploying multiplication-less DNN on a carefully-optimized special-purpose non von Neumann (NvN) computer to mitigate the performance-limiting data shuttling (i.e., ‘memory wall bottleneck’). By testing on different molecules and bulk systems, we show that the proposed MD methodology is generally-applicable to various MD tasks. The proposed MD methodology has been deployed on an in-house computing server based on reconfigurable field programmable gate array (FPGA), which is freely available at http://nvnmd.picp.vip.

Introduction

As a cornerstone of atomistic-scale analysis, molecular dynamics (MD) is widely used in many fields, such as physics^1,2, chemistry^3,4, biology⁵, materials^6,7, nanotechnology^8,9, drug design^10,11, earth science^12,13, and semiconductor integrated circuits^14,15. Despite its importance, MD simulation is well known to suffer from a long-standing dilemma between accuracy and efficiency^{16,17,18,19,20,21}. On one hand, ab-initio MD (AIMD), which is based on first-principles density functional theory (DFT) evaluation of the potential energy surface (PES), is accurate but not efficient enough to simulate large systems^16,17,18. On the other hand, classical MD (CMD), which is based on hand-crafted force-field (FF) approximations of the PES, is efficient but not accurate enough in some applications^{19,20,21,22,23,24}.

In recent years, this dilemma has been mitigated, to some extent, by machine-learning (ML) MD (MLMD)^{25,26,27,28,29,30,31}. By evaluating the PES using ML models, MLMD is significantly more efficient than AIMD while keeping AIMD-level accuracy. Unfortunately, though several orders of magnitude faster than AIMD, state-of-the-art MLMD is still about two orders of magnitude slower than CMD^27,31,32. Developing an MD simulator that simultaneously achieves AIMD-level accuracy and CMD-level efficiency therefore remains an outstanding problem.

It is worth noting that MD simulations are predominantly deployed on general-purpose von Neumann (vN) computers, where the data processing hardware (e.g., central processing unit (CPU) and graphics processing unit (GPU)) and the data storage hardware (e.g., dynamic random-access memory (DRAM)) are separate hardware components. It is well known that vN computers suffer from a severe vN bottleneck (vNB): the majority (e.g., over 90%) of computing time and energy is spent repeatedly shuttling data back and forth between the data processing hardware and the data storage hardware^33,34,35. Consequently, only a very small fraction of the calculation time and energy is used to perform useful arithmetic and logic operations, leading to the overall low efficiency of vN computers^33,34,35.

The severity of the vNB depends on the characteristics of the calculation: the more often data are shuttled, the more performance-limiting the vNB becomes. Given the typical MD duration (e.g., t_MD ≈ 10⁻⁹–10⁻³ s) and timestep (e.g., ∆t ≈ 10⁻¹⁵ s), the atomistic data (e.g., positions, velocities, forces, and atomic neighbor data) must be shuttled a large number of times (e.g., n_MD ≈ t_MD/∆t = 10⁶–10¹²). Furthermore, in each MD timestep, a large amount of additional data shuttling is required to accomplish each PES evaluation³⁶. Hindered by such nested-loop, heavy-duty data shuttling, both the time efficiency and the energy efficiency of MD calculations are extremely low on general-purpose vN computers²³.

However, since the invention of the first general-purpose electronic computer in the 1940s, the general-purpose vN architecture has been the dominant paradigm of mainstream computers such as laptops, desktops, and supercomputers for over seven decades^37,38,39. Researchers nowadays widely use vN computers to run MD, largely because they have no other choice. Though some special-purpose MD computers have been developed^22,23,40,41, they are all based on CMD and FF, whose accuracy is questionable in many important applications^{42,43,44,45,46}. Therefore, considering the scientific and technological significance of MD^{1–47}, it deserves serious effort to develop a special-purpose MD computer beyond the vN paradigm, to enable efficient and accurate MD calculations in various fields.

To approach this goal, in this paper we propose a paradigm shift from the established vN architecture to a non vN (NvN) architecture. By leveraging the technologies in MLMD algorithms^26,27,30, artificial intelligence^48,49, and NvN architecture^50,51, the proposed special-purpose MD computer can simultaneously achieve both AIMD-level high accuracy and CMD-level high efficiency. This is achieved by deploying a deeply revised MLMD algorithm, i.e., DeePMD^{26,27,28,29,30,31} (to ensure high accuracy), on carefully optimized NvN hardware (to ensure high efficiency). In the Section Results, the calculation accuracy and calculation efficiency are quantitatively analyzed. In the Section Discussion, the results are briefly discussed. In the Section Methods, the overall system design of the proposed special-purpose MD computer is introduced and the implementation details of the NvN architecture are presented.

Results

The performance of the proposed special-purpose non von Neumann molecular dynamics (NVNMD) computer (see Methods section for more design and implementation details) is quantitatively analyzed in this section. First, the analysis procedure is introduced (Section Analysis procedure). Then, the calculation accuracy (Section Calculation accuracy), the calculation time efficiency (Section Time efficiency), and the calculation energy efficiency (Section Energy efficiency) are analyzed quantitatively.

Analysis procedure

Any user can follow two consecutive steps to run MD on the proposed NVNMD computer, which has been released online⁵²: (i) to train a machine learning (ML) model that can decently reproduce the PES^{25,26,27,28,29,30,31}; and (ii) to deploy the trained ML model on the proposed NVNMD computer, then run MD there to obtain the atomistic trajectories.

ML training (i.e., step (i)) is performed on traditional vN architecture computers (e.g., CPU/GPU) by using the training codes we open-sourced online⁵³, which are purposely built on TensorFlow⁵⁴ to help users train ML models that are compatible with the unique NvN computer proposed here. To accomplish step (i), the training samples should be prepared first. This can be done by using either active learning tools^26,29,30 or brute-force (i.e., less efficient) DFT-based AIMD sampling^55,56. Then, these training samples are used as inputs of our training codes⁵³, which output the ML models. Our training procedure is somewhat more complicated than that of the established MLMD^26,27: it consists of not only the continuous neural network (CNN) training of the established MLMD, but also an additional step of quantized neural network (QNN) training (Section Quantized neural network), which uses the CNN results as inputs. Typically, the CNN training uses a large number of training steps (e.g., 1 × 10⁶) with a high learning rate (e.g., 2 × 10⁻²), and the subsequent QNN training uses a small number of training steps (e.g., 1 × 10⁴) and a low learning rate (e.g., 2 × 10⁻⁷), as it only needs to minimize the small error induced by quantization from CNN to QNN.

ML inference (i.e., step (ii)) is performed on the proposed NvN architecture computer, after uploading the QNN ML model to our online NVNMD system⁵². In the online NVNMD system, all MD settings and parameters (e.g., timestep, microcanonical/canonical/isothermal–isobaric ensemble, thermostats, etc.) are controlled by using the same input file interface of the LAMMPS package⁵⁷, except that the force field is replaced by using the uploaded QNN ML model.

Six systems are used to run test MD calculations, including three molecular systems (i.e., benzene, naphthalene, and aspirin) and three bulk systems (i.e., Sb, GeTe, and Li₁₀GeP₂S₁₂). The training data of the molecular systems are from the MD17 dataset^58,59,60, and those of the bulk systems (i.e., Sb, GeTe, and Li₁₀GeP₂S₁₂) come from Ref. ⁶¹, Ref. ⁵⁶, and Ref. ⁶², respectively. The accuracy, speed, and energy efficiency results are shown in Section Calculation accuracy, Section Time efficiency, and Section Energy efficiency, respectively.

Calculation accuracy

The high accuracy of the proposed NVNMD is evident in Table 1. The root mean square errors (RMSE) of PES fitting of the benzene, naphthalene, aspirin, Sb, GeTe, and Li₁₀GeP₂S₁₂ systems are 0.19, 0.39, 0.32, 0.14, 0.09, and 0.14 kcal mol⁻¹, respectively. These are close to the established MLMD values in the literature (Table 1), and well below the chemical accuracy threshold (1.0 kcal mol⁻¹)^63,64, indicating decent accuracy of the proposed NVNMD. As a direct comparison, in Table 1 we also collected some of the energy prediction errors of MLMD and CMD from the existing literature, after removing the obvious outliers (e.g., |∆E_i| = 2.7 kcal mol⁻¹ for methylamine, and |∆E_i| = 1.4 kcal mol⁻¹ for the aqueous LiF pair^65,66,67). It is clear that, while CMD suffers from large PES prediction errors, the proposed NVNMD has decent PES prediction accuracy, which is inherited from the highly accurate MLMD.

Table 1 Calculation accuracy comparison between the proposed NVNMD and the established MLMD/CMD.

As shown in Table 1, the μ_e of MLMD differs from that of NVNMD by about 10⁻²–10⁻¹ kcal mol⁻¹ (i.e., single-digit meV atom⁻¹). Note that the data for the different systems in Table 1 are taken from the literature, since reproducing the MD results of so many systems with various MD tools ourselves would be distracting and time-consuming. Table 1 therefore shows only roughly that MLMD and NVNMD have similar accuracy (μ_e ≈ 10⁻¹ kcal mol⁻¹) and that both are much more accurate than CMD (μ_e ≈ 10¹ kcal mol⁻¹).

Though it is clear from Table 1 that NVNMD is much more accurate than CMD, it is difficult to resolve the subtle accuracy difference between NVNMD and MLMD when they are evaluated on different datasets. Therefore, a more rigorous and refined analysis is needed. Instead of directly taking MLMD RMSE data from the literature (as we did in Table 1), hereafter we repeat the entire procedure (including training, inference, and testing) from scratch, using an identical set of training/testing data on an identical set of systems (Table 2).

Table 2 The root mean square errors (RMSE) of system energies (meV atom⁻¹) of the three bulk systems.

Since the proposed NVNMD is revised from DeePMD^{26,27,28,29,30,31} (details available in the Methods section), we use DeePMD as reference and the starting point (third row of Table 2). Then, the vN-based revisions and the NvN-based revisions are made consecutively to obtain the results of CNN (second row of Table 2) and QNN (first row of Table 2), respectively. QNN results are the final results of NVNMD in Table 1. Here, the CNN includes all quantization-free (thus more appropriate for vN) revisions, e.g., the revision of symmetry-preserving feature calculation (Eq. (11)) and the revision of nonlinear activation function calculation (Section Nonlinear activation function). The QNN further includes the quantization-based (thus more appropriate for NvN) revisions, e.g., the continuous neural network is replaced by using quantized neural network (Section Quantized neural network), the multiplication operations of floating-point numbers are replaced by using shift operations of quantized numbers (Section Multiplication-less neural network), the continuous evaluation is replaced by using discretized look-up table searching (Eq. (6)) and so on.
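The idea of replacing floating-point multiplications by shift operations on quantized numbers can be illustrated with a minimal sketch. It assumes, purely for illustration, that each weight is rounded to a single signed power of two and that activations are fixed-point integers; the constant `FRAC_BITS` and all function names are hypothetical, and the scheme actually used is the one described in Section Multiplication-less neural network.

```python
import math

FRAC_BITS = 13  # hypothetical fixed-point precision, not taken from the paper


def to_fixed(x):
    """Encode a real number as a fixed-point integer."""
    return round(x * (1 << FRAC_BITS))


def to_float(q):
    """Decode a fixed-point integer back to a real number."""
    return q / (1 << FRAC_BITS)


def quantize_weight(w):
    """Round w to sign * 2**k and return (sign, k); zero maps to sign 0."""
    if w == 0.0:
        return 0, 0
    sign = 1 if w > 0 else -1
    k = round(math.log2(abs(w)))
    return sign, k


def shift_multiply(x_fixed, sign, k):
    """Evaluate x * (sign * 2**k) using only a bit shift and a negation."""
    if sign == 0:
        return 0
    shifted = x_fixed << k if k >= 0 else x_fixed >> -k
    return shifted if sign > 0 else -shifted


x = to_fixed(1.5)
sign, k = quantize_weight(0.26)           # rounded to +2**-2 = 0.25
y = to_float(shift_multiply(x, sign, k))  # approximates 1.5 * 0.26
print(y)
```

The quantization error visible here (0.25 instead of 0.26) is the kind of small error that the QNN training step is meant to compensate.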

When applied to all systems under test, the NVNMD shows a single-digit meV atom⁻¹ accuracy difference compared to the MLMD (last row of Table 2). It is worth noting that this is about 3 orders of magnitude lower than the typical interatomic bonding energy (on the order of 10⁰ eV atom⁻¹), and 1 order of magnitude smaller than the chemical accuracy threshold, 1.0 kcal mol⁻¹ (i.e., about 43.4 meV atom⁻¹)^63,64. Furthermore, this high accuracy is maintained for elemental, binary, and quaternary systems. The test systems we chose in Table 2 include complicated crystalline-amorphous phase transitions (Sb and GeTe^56,61) and atomic diffusion through a quaternary system (Li-Ge-P-S⁶²), which involve repeated chemical bond rupture and re-forming with a very sophisticated PES. Therefore, these test calculations indicate that NVNMD is accurate enough to handle many complicated MD applications.
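The conversion of the chemical accuracy threshold from kcal mol⁻¹ to meV can be checked with a few lines of arithmetic. The sketch below assumes the thermochemical calorie (4.184 J) and the exact SI values of the Avogadro constant and the elementary charge; the function name is hypothetical, and the result is an energy per particle.

```python
AVOGADRO = 6.02214076e23             # mol^-1 (exact SI value)
ELEMENTARY_CHARGE = 1.602176634e-19  # C; 1 eV equals this many joules (exact)
JOULE_PER_KCAL = 4184.0              # thermochemical calorie, assumed here


def kcal_per_mol_to_mev(value):
    """Convert an energy in kcal/mol to meV per particle."""
    joule_per_particle = value * JOULE_PER_KCAL / AVOGADRO
    return joule_per_particle / ELEMENTARY_CHARGE * 1.0e3


chemical_accuracy_mev = kcal_per_mol_to_mev(1.0)
print(f"1.0 kcal/mol = {chemical_accuracy_mev:.1f} meV")
```

The same helper converts the Table 1 errors quoted in kcal mol⁻¹ into the meV units used in Table 2.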

To further test the accuracy of NVNMD, we also computed atomic forces as shown in Table 3, because atomic forces are vital for reliably obtaining the MD trajectories during integration of Newton's equations. Since the molecular systems in the MLMD literature in Table 3 are all evaluated based on the same dataset^58,59,60, we evaluate NVNMD based on this dataset too. When compared against the ab-initio results, the atomic force mean absolute error (MAE) of NVNMD and that of MLMD show a very small difference (i.e., on the order of 10¹ meV Å⁻¹), as shown in the last row of Table 3. In all test cases, this difference is even below the default atomic force thresholds (e.g., 40.0 meV Å⁻¹ in SIESTA⁶⁸; 25.7 meV Å⁻¹ in Quantum Espresso⁶⁹; and 23.1 meV Å⁻¹ in CP2K⁷⁰) used to determine atomic force convergence during atomic lattice relaxation or supercell geometry optimization in the mainstream ab-initio density functional theory tools.
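For reference, the two error metrics used in Tables 2 and 3 can be written down explicitly. The sketch below uses the standard definitions of RMSE and MAE; the function names and the numbers in the usage lines are made up for illustration and are not data from the paper.

```python
import math


def rmse(predicted, reference):
    """Root mean square error between two equally long sequences."""
    n = len(reference)
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)) / n)


def mae(predicted, reference):
    """Mean absolute error between two equally long sequences."""
    n = len(reference)
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / n


# Illustrative values only: per-atom energies (meV) and force components (meV/A)
energy_ref = [-120.0, -118.5, -121.2, -119.9]
energy_pred = [-119.1, -119.0, -120.8, -121.0]
force_ref = [15.0, -42.0, 7.5, 0.0, 33.0, -21.0]
force_pred = [18.0, -40.5, 5.0, 1.0, 30.0, -24.5]

print(f"energy RMSE = {rmse(energy_pred, energy_ref):.2f} meV/atom")
print(f"force MAE   = {mae(force_pred, force_ref):.2f} meV/A")
```

For forces, the average is taken here over all Cartesian components of all atoms, which is one common convention and is assumed rather than taken from the text.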

Table 3 The mean absolute error (MAE) of atomic forces (meV Å⁻¹) of three bulk systems and three molecular systems.

To visually illustrate the high accuracy, the energy and force predicted by the proposed NVNMD are plotted against those predicted by the established DFT-based AIMD, as shown in Fig. 1. The high accuracy of energy and forces lays a solid foundation for the reliable calculation of physical properties. As shown in Table 4, the bond length, bond angle, and vibration frequencies of the water molecule calculated by using the proposed NVNMD are very close to those obtained by MLMD (<1% difference). As shown in Fig. 2, the radial distribution function, angle distribution function, and coordination number of amorphous GeTe calculated by NVNMD are also very close to those obtained by MLMD.

**Fig. 1: The comparison of predicted energy and force.**

Table 4 The test error ||∆E||₂ and ||∆F||₂, bond length, bond angle, and vibration frequencies of a single water molecule.

**Fig. 2: The structure properties of amorphous GeTe.**

To further test the accuracy for the GeTe system, canonical (NVT) ensemble MD is performed, as shown in Fig. 3a. The crystalline GeTe system of 512 atoms is melted into a liquid by increasing the temperature from 300 K to 1800 K. Then it is quenched from liquid to amorphous by cooling from 1800 K to 300 K. Finally, the system is recrystallized from amorphous back into crystalline by annealing it at 600 K. The entire melt-quench-anneal phase transition process, as measured in experiments^71,72, is successfully reproduced, as shown in Fig. 3a, indicating decent accuracy of the proposed NVNMD.

**Fig. 3: The molecular dynamics simulation results of GeTe and Li₁₀GeP₂S₁₂ using the proposed NVNMD.**

For the MD test of Li₁₀GeP₂S₁₂, the system of 900 atoms is initialized and equilibrated at 500 K in the NVT ensemble for 10 ps, then simulated in the microcanonical (NVE) ensemble for 100 ps. The trajectory is used to calculate the diffusion coefficient of the system (Fig. 3b). The mean square displacement, $\mathrm{MSD}(t) = \langle |r_i(t) - r_i(0)|^2 \rangle$, is calculated from the trajectory, and the diffusion coefficient D is related to it by

$$D \approx \left( \frac{1}{6t} \right) \left\langle \left| r_i(t) - r_i(0) \right|^2 \right\rangle ,$$

(1)

The diffusion coefficient is computed by extracting the slope of the MSD in Fig. 3b. We obtain a diffusion coefficient of 2.03 × 10⁻¹⁰ m² s⁻¹, which is close to the value of Ref. ⁶² (i.e., 2.00 × 10⁻¹⁰ m² s⁻¹). Furthermore, according to the NVNMD results (Fig. 3b), the MSDs in the x and y directions are significantly smaller than that in the z direction. This is in line with the anisotropic diffusion of Li₁₀GeP₂S₁₂ well known in the literature^73,74,75.
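The slope extraction can be sketched in a few lines. The example below assumes the standard three-dimensional Einstein relation, in which the MSD is ⟨|r_i(t) − r_i(0)|²⟩ and grows as 6Dt, so that D is the fitted slope divided by 6. It uses a synthetic random walk with a known diffusion coefficient instead of the Li₁₀GeP₂S₁₂ trajectory; all function names, the frame spacing, and the number of atoms are hypothetical.

```python
import random


def mean_square_displacement(traj):
    """traj[k][i] is the (x, y, z) of atom i in frame k; returns MSD per frame."""
    r0 = traj[0]
    msd = []
    for frame in traj:
        total = 0.0
        for (x, y, z), (x0, y0, z0) in zip(frame, r0):
            total += (x - x0) ** 2 + (y - y0) ** 2 + (z - z0) ** 2
        msd.append(total / len(frame))
    return msd


def diffusion_coefficient(msd, dt):
    """Least-squares slope of MSD(t) through the origin, divided by 6 (3D)."""
    times = [k * dt for k in range(len(msd))]
    slope = sum(t * m for t, m in zip(times, msd)) / sum(t * t for t in times)
    return slope / 6.0


def random_walk(n_atoms, n_frames, d_true, dt, seed=0):
    """Synthetic Brownian trajectory with a known diffusion coefficient."""
    rng = random.Random(seed)
    sigma = (2.0 * d_true * dt) ** 0.5  # per-axis step standard deviation
    frame = [(0.0, 0.0, 0.0)] * n_atoms
    traj = [frame]
    for _ in range(n_frames - 1):
        frame = [tuple(c + rng.gauss(0.0, sigma) for c in pos) for pos in frame]
        traj.append(frame)
    return traj


D_TRUE = 2.0e-10  # m^2/s, chosen near the value quoted above
DT = 1.0e-12      # s between stored frames (assumed)
traj = random_walk(n_atoms=400, n_frames=101, d_true=D_TRUE, dt=DT)
msd = mean_square_displacement(traj)
d_est = diffusion_coefficient(msd, DT)
print(f"estimated D = {d_est:.2e} m^2/s")
```

Restricting the sum in `mean_square_displacement` to a single Cartesian component (and dividing the slope by 2 instead of 6) gives the direction-resolved quantities that reveal anisotropic diffusion.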

Time efficiency

Using the six test systems, the calculation time efficiency of the proposed NVNMD is shown in Table 5. The time efficiency of the proposed NVNMD is around two orders of magnitude better than that of the MLMD. This means that NVNMD runs at a speed comparable to CMD, even though the computational complexity of MLMD is much higher than that of CMD. As schematically shown in Fig. 4, the proposed NVNMD simultaneously achieves both CMD-level high efficiency and MLMD/AIMD-level high accuracy. It is worth noting that the same high efficiency is maintained whether elemental (e.g., Sb), binary (e.g., GeTe), or quaternary (e.g., Li–Ge–P–S) systems are simulated.

Table 5 Calculation time efficiency comparison between the proposed NVNMD and the established MLMD/CMD.

**Fig. 4: Calculation time and error of CMD, MLMD, and the proposed NVNMD.**

Energy efficiency

The energy efficiency η is calculated via the formula η = T × P, where T represents the calculation time efficiency (Section Time efficiency); and P denotes the power consumption, which is measured by using a local power tester (PUUCAI P26A-10PN). The total power of the proposed NVNMD system is measured to be only about 108 W. Therefore, η of the proposed NVNMD system is around 10⁻⁵ J step⁻¹ atom⁻¹ (Table 6).

Table 6 Calculation energy efficiency comparison between the proposed NVNMD and the established MLMD/CMD.

By using T and P of MLMD from Refs. ^26,31, it can be estimated that η of the established vN-based MLMD is around 10⁻³–10⁻² J step⁻¹ atom⁻¹. Here, P of MLMD is calculated from the number of CPUs/GPUs used in Refs. ^26,31, and we use 30 W per CPU and 250 W per GPU for the estimation^76,77,78. For instance, on the Summit supercomputer, MLMD uses 27.3 thousand CPU cores and 27.3 thousand GPUs³¹, so P ≈ 27.3 × 10³ × (250 + 30) W ≈ 7.6 MW (around 50–60% of the Summit supercomputer's total power consumption of 13 MW) is used to achieve T ≈ 2.7 × 10⁻¹⁰ s step⁻¹ atom⁻¹.

As shown in Table 6, the calculation energy efficiency of the proposed NVNMD is around 2–3 orders of magnitude better than that of the established vN-based MLMD, with similar calculation accuracy (Table 5 and Fig. 4). Such high energy efficiency is achieved because in NVNMD there is no repeated data shuttling, which consumes most of the energy in its vN-based counterpart^33,34,35. Consequently, the calculation energy efficiency of the proposed NVNMD is comparable to that of CMD, but the accuracy of the proposed NVNMD is far superior to that of CMD (Table 1).
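The energy-efficiency estimate η = T × P can be reproduced with the numbers quoted in the text. In the sketch below, the NVNMD time per step per atom is taken as the order-of-magnitude value 10⁻⁷ s quoted for Table 5 rather than an exact table entry, and the variable and function names are hypothetical.

```python
def energy_per_step_atom(time_s, power_w):
    """eta = T * P, in J step^-1 atom^-1."""
    return time_s * power_w


# Established MLMD on Summit, using the per-device powers assumed in the text
n_units = 27.3e3                    # CPUs and GPUs, 27.3 thousand each
p_mlmd = n_units * (250.0 + 30.0)   # W
t_mlmd = 2.7e-10                    # s step^-1 atom^-1
eta_mlmd = energy_per_step_atom(t_mlmd, p_mlmd)

# Proposed NVNMD: measured power, order-of-magnitude time from Table 5
p_nvnmd = 108.0                     # W
t_nvnmd = 1.0e-7                    # s step^-1 atom^-1 (order of magnitude)
eta_nvnmd = energy_per_step_atom(t_nvnmd, p_nvnmd)

print(f"P(MLMD)    = {p_mlmd / 1e6:.1f} MW")
print(f"eta(MLMD)  = {eta_mlmd:.1e} J/step/atom")
print(f"eta(NVNMD) = {eta_nvnmd:.1e} J/step/atom")
print(f"ratio      = {eta_mlmd / eta_nvnmd:.0f}")
```

For this particular data point the ratio comes out at roughly two orders of magnitude, at the lower end of the 2–3 orders quoted for Table 6.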

Discussion

As an early-stage pilot version, we implemented the NVNMD on an FPGA (details available in the Methods section). It is well known that FPGA has merits of low-cost and field-programmability (i.e., short turnaround time for design revisions and iterations), and the disadvantages of limited hardware resources and low clock frequency. In contrast, the application-specific integrated circuit (ASIC) has merits of more abundant hardware resources and much higher clock frequency, and the disadvantages of high fabrication cost and long development cycle. So, FPGA is typically used as a debugging and testing tool (research phase), before taping out the ASIC (mass-production phase).

It is worth noting that the proposed NVNMD, which simultaneously achieves high calculation accuracy (Table 1), high calculation time efficiency (Table 5), and high calculation energy efficiency (Table 6), is based on a low-end device (Xilinx xcvu9p) in the Xilinx Virtex UltraScale+ FPGA product family (Section Hardware implementation)^79,80. This has three significant technological implications for the future ASIC development of the proposed paradigm (NvN-based MD).

Firstly, the NVNMD we use here is based on a low clock frequency (i.e., 250 MHz), which is about one order of magnitude lower than ordinary ASIC like the commodity-level vN-based GPU/CPU whose clock frequency can reach several GHz⁷⁶. This implies that the time efficiency of NVNMD could be enhanced by another order of magnitude (i.e., C₁ ≈ 10¹), in a straightforward fashion by boosting the clock frequency, if we move from the research phase (FPGA) to the production phase (ASIC).

Secondly, the NVNMD we use here is implemented using a rather limited amount of hardware resources (i.e., about 10⁶ logic cells in FPGA as shown in Table 7). It is well known that a single ASIC chip can integrate around 10⁹–10¹⁰ transistor devices (e.g., 1.6 × 10¹⁰ and 2.1 × 10¹⁰ transistor devices in one Apple M1 5 nm chip and one NVIDIA Tesla V100 12 nm chip, respectively^76,81). Even if about 10¹ transistor devices are needed to realize 1 logic cell, the ASIC could be 2 to 3 orders of magnitude more resource-abundant than the FPGA we are using. By leveraging the decent parallelization scaling property of MLMD (inherited by the proposed NVNMD)⁸², we anticipate at least two orders of magnitude enhancement of time efficiency (i.e., C₂ ≈ 10²), purely by increasing the intra-ASIC parallelization.

Table 7 FPGA resource consumption of the proposed system.

Thirdly, the logic and arithmetic circuits can be deployed much more freely in ASIC than in FPGA, since FPGA uses more resources to ensure flexibility and programmability. Given far fewer constraints, the same set of functionalities can be implemented with far fewer transistor devices in ASIC than in FPGA. Therefore, by using ASIC to replace FPGA, the time efficiency can be increased by roughly ten times when the logic and arithmetic circuits are simplified (i.e., C₃ ≈ 10¹)⁸³.

To summarize, the FPGA-based results presented in this paper represent a very early development stage of the proposed NVNMD paradigm. Based on the architecture design we verified using FPGA in this paper, we are working on developing an ASIC-based NVNMD computer, which could be around 4 orders of magnitude (i.e., C = C₁ × C₂ × C₃ ≈ 10⁴) more efficient than the results shown in this paper. In other words, by moving from FPGA to ASIC, the time efficiency of NVNMD could be enhanced from 10⁻⁷ s step⁻¹ atom⁻¹ (Table 5) to around 10⁻¹¹ s step⁻¹ atom⁻¹. This means that NVNMD based on a single ASIC chip (with cm²-level size and 10¹–10² W-level power) could be faster than MLMD based on the whole Summit supercomputer (around 10⁻¹⁰ s step⁻¹ atom⁻¹, occupying an entire building and consuming 10⁶–10⁷ W-level power)^31,82. Of course, during the implementation of the ASIC-based NVNMD, we need to consider factors other than speed, e.g., the generality to run all kinds of MD and the flexibility to control/dump MD simulation results. These considerations may compromise the speed to some extent, which deserves future research attention. Finally, there are many machine learning methods other than DeePMD (e.g., SchNet⁸⁴, DimeNet⁸⁵, sGDML⁸⁶, PaiNN⁸⁷, SpookyNet⁸⁸, GemNet⁸⁹, NewtonNet⁹⁰, UNiTE⁹¹, NequIP⁹², and so on). The NvN acceleration of these methods deserves future research attention, too.

Compared to the other special-purpose MD computers already existing in the literature (e.g., Anton)^22,23,40,41, the NVNMD proposed here is different in several aspects. Anton focuses on accelerating biology-related MD simulations and, thus, mainly implements biology-oriented classical force fields. While these classical force fields offer valuable insights into biological MD problems (e.g., protein folding), they suffer from serious accuracy problems in many applications in other fields, because they only have CMD-level accuracy^67,93,94,95. The accuracy of the proposed NVNMD, however, is at the AIMD/MLMD level. Furthermore, Anton is implemented as an ASIC using advanced semiconductor technology nodes (e.g., 7 nm node), which offers much higher speed than the FPGA used in this pilot version of NVNMD.

Methods

In this Section, the overall system design of the proposed special-purpose MD computer is introduced. Section Heterogeneous parallelization describes the heterogeneous parallelization between the proposed NVNMD computer’s two major units—the master processing unit (MPU) and the slave processing unit (SPU). Section Pipeline and high-speed transmission interface discusses the high-speed transmission interface (HTI) between MPU and SPU. Section Master processing unit and Section Slave processing unit introduce the functionalities of MPU and SPU, respectively.

As the most important part of the proposed NVNMD computer, the SPU bears the predominant majority (e.g., over 99%) of the total computational load. To maximize the calculation efficiency of the SPU, we propose a paradigm shift from the established general-purpose vN architecture (e.g., CPU and GPU)^{26,27,31,56,58,62,65,66,96,97} to a special-purpose NvN architecture. The proposed NvN architecture efficiently computes the energy (Section Energy calculation) and the atomic force and virial (Section Force and virial calculation) by leveraging the processing-in-memory (PIM) technology (Section Processing in memory), based on the algorithms of DeepPot-SE MLMD^26,30,98 after three crucial modifications. These three modifications are indispensable to realize high calculation efficiency using a very limited amount of hardware resources. Firstly, the traditional continuous neural network (CNN) widely used in MLMD is replaced by the quantized neural network (QNN) (Section Quantized neural network). Secondly, the resource-consuming multiplication-based neural network is replaced by a resource-economical multiplication-less neural network deliberately designed here for the NVNMD (Section Multiplication-less neural network). Thirdly, the widely used trigonometric function-based nonlinear activation functions are replaced by lightweight nonlinear activation functions specially crafted for the NVNMD (Section Nonlinear activation function). With the help of these three significant modifications, the NvN-based SPU is implemented in a field programmable gate array (FPGA), to quantitatively test the overall performance of the proposed MD computer (Section Hardware implementation).

Heterogeneous parallelization

The MD simulation consists of a certain number of timesteps in a loop. In each timestep, there are two parts of calculations – (i) the evaluation of PES, E = E({R_i}), and atomic forces, F_i = −∇_iE({R_i}); and (ii) all other calculations, including numerical integration of the Newton equation to update {R_i} and {v_i}. Here, E is the system energy; R_i, v_i, and F_i are the Cartesian coordinate, velocity, and force of the atom i (i = 1, 2, …, N), respectively; and N is the total number of atoms in the simulation system⁹⁹.
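The two-part structure of each timestep can be made concrete with a minimal loop. The sketch below assumes a velocity-Verlet integrator, a common choice that is assumed here rather than taken from the text, and replaces the PES evaluation of part (i) with a toy harmonic force; all names are hypothetical.

```python
def harmonic_forces(positions, k=1.0):
    """Toy stand-in for the PES evaluation of part (i): F = -k * R."""
    return [-k * r for r in positions]


def run_md(positions, velocities, masses, force_fn, dt, n_steps):
    """Velocity-Verlet loop; coordinates are flattened to a list of floats."""
    forces = force_fn(positions)  # part (i), needed once before the loop
    for _ in range(n_steps):
        # part (ii): half-step velocity update, then full-step position update
        velocities = [v + 0.5 * dt * f / m
                      for v, f, m in zip(velocities, forces, masses)]
        positions = [r + dt * v for r, v in zip(positions, velocities)]
        # part (i): re-evaluate the forces at the new positions
        forces = force_fn(positions)
        # part (ii): second half-step velocity update
        velocities = [v + 0.5 * dt * f / m
                      for v, f, m in zip(velocities, forces, masses)]
    return positions, velocities


def total_energy(positions, velocities, masses, k=1.0):
    """Kinetic plus harmonic potential energy of the toy system."""
    kinetic = sum(0.5 * m * v * v for m, v in zip(masses, velocities))
    potential = sum(0.5 * k * r * r for r in positions)
    return kinetic + potential


r0, v0, m = [1.0], [0.0], [1.0]
r1, v1 = run_md(r0, v0, m, harmonic_forces, dt=0.01, n_steps=1000)
drift = abs(total_energy(r1, v1, m) - total_energy(r0, v0, m))
print(f"energy drift after 1000 steps: {drift:.2e}")
```

Only the call to `force_fn` depends on the PES model, which is why part (i) can be moved to dedicated hardware while part (ii) stays on a general-purpose processor.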

In large-scale AIMD/MLMD simulations (i.e., N > 10²), the overwhelming majority (e.g., over 99%) of computing time is spent to evaluate PES⁸². So, we focus on accelerating the calculation of part (i) by using the SPU based on the proposed special-purpose NvN architecture. In contrast, the calculation of part (ii) is much less computationally demanding, so the calculation of part (ii) is based on the traditional general-purpose vN architecture in MPU. This vN/NvN heterogeneous architecture is designed to leverage the flexibility of vN architecture. As a consequence, the proposed MD computer can efficiently run all kinds of MD simulations, e.g., canonical ensemble MD, microcanonical ensemble MD, isothermal–isobaric ensemble MD, enhanced sampling, and so on^{100,101,102,103,104}.

As illustrated in Fig. 5, the calculation of each MD timestep consists of seven consecutive steps (i.e., S1, S2, S3, S4, S5, S6, and S7). In the S1, all atoms j in the vicinity of the atom i are chosen as the neighbor atoms, whose indices are stored in a neighbor list N_c(i) = {j, |R_j − R_i| < R_c}, where R_c is a predefined cutoff. In the S2, the neighbor list {N_c(i)}, together with all atoms' chemical species {Z_i} and atomic coordinates {R_i}, is encoded into a compact data format using 16, 2, and 64 bits, respectively. Then, the compact data {N_c(i)}, {Z_i}, and {R_i} are transmitted from the MPU to the SPU. In the S3, after receiving these compact data from the MPU, the global atomic information is transformed into the i^th atom's local atomic information, which includes Z_i, {Z_j, j ∈ N_c(i)}, and {R_ji} = {R_i − R_j, j ∈ N_c(i)}. In the S4, the i^th atom's energy component E_i, the atomic force components {F_ji} = {∂E_i/∂R_ji, j ∈ N_c(i)}, and the virial components {V_ji} = {R_ji^T × F_ji, j ∈ N_c(i)} are evaluated by feeding the local atomic information into the PES computation module. In the S5, the system energy E, atomic forces {F_i}, and virial V are computed by summing the contributions of each atom. The relationship can be represented as $E = \sum_{i=1}^{N} E_i$,

$$F_i = -\nabla_i E\left( \left\{ R_i \right\} \right) = -\sum_{j \in N_c(i)} F_{ji} + \sum_{j \in N_c(i)} F_{ij}$$

(2)

and

$$V = \sum_i R_i^{\mathrm{T}} \times F_i = -\sum_{i \ne j} V_{ji}$$

(3)

**Fig. 5: Schematic figure of heterogeneous parallelization in the NVNMD system.**

Here, E, {F_i}, and V are encoded with 64 bits, 32 bits, and 64 bits, respectively, and written into a random-access memory (RAM), ready to be read by the MPU. In the S6, the MPU reads and decodes the compact data from the RAM. In the S7, the numerical integration of Newton's equation, the MD thermostat, and the material properties are computed.
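The bookkeeping of steps S1 and S3 to S5, including the sums of Eqs. (2) and (3), can be sketched as follows. The example assumes a brute-force neighbor search without periodic images and uses a hypothetical smooth per-atom energy, E_i = Σ_j ½(|R_ji| − R_c)², as a stand-in for the neural-network PES module; the encoding and transfer steps (S2, S6) are omitted, and all function names are illustrative.

```python
import math


def neighbor_list(coords, r_cut):
    """S1: N_c(i) = {j : |R_j - R_i| < R_c}, brute force, no periodic images."""
    n = len(coords)
    return [[j for j in range(n)
             if j != i and math.dist(coords[i], coords[j]) < r_cut]
            for i in range(n)]


def toy_atomic_energy(r_ji_list, r_cut):
    """S4 stand-in for the PES module: returns E_i and {F_ji = dE_i/dR_ji}."""
    e_i, f_ji = 0.0, []
    for r in r_ji_list:
        d = math.sqrt(sum(c * c for c in r))
        e_i += 0.5 * (d - r_cut) ** 2
        f_ji.append(tuple((d - r_cut) * c / d for c in r))
    return e_i, f_ji


def evaluate(coords, r_cut):
    """S3 to S5: local information, per-atom terms, sums of Eqs. (2) and (3)."""
    nbrs = neighbor_list(coords, r_cut)
    n = len(coords)
    energy = 0.0
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    virial = [[0.0, 0.0, 0.0] for _ in range(3)]
    for i in range(n):
        # S3: local information of atom i, with R_ji = R_i - R_j
        r_ji = [tuple(a - b for a, b in zip(coords[i], coords[j]))
                for j in nbrs[i]]
        e_i, f_ji = toy_atomic_energy(r_ji, r_cut)
        energy += e_i  # E = sum_i E_i
        for j, r, f in zip(nbrs[i], r_ji, f_ji):
            for a in range(3):
                forces[i][a] -= f[a]  # -F_ji term of Eq. (2) for atom i
                forces[j][a] += f[a]  # same term seen from atom j (+F_ij there)
                for b in range(3):
                    virial[a][b] -= r[a] * f[b]  # -V_ji of Eq. (3)
    return energy, forces, virial


coords = [(0.0, 0.0, 0.0), (1.0, 0.2, 0.0), (0.1, 1.1, 0.3),
          (2.5, 2.5, 2.5), (0.9, 1.0, 1.2)]
energy, forces, virial = evaluate(coords, r_cut=2.0)
print(f"E = {energy:.4f}")
```

Because every F_ji is added once with each sign, the total force sums to zero, and the accumulated virial equals Σ_i R_i^T × F_i, as stated in Eq. (3).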

While S1, S2, S6, and S7 are executed in the MPU, S3, S4, and S5 are executed in the SPU. When these seven steps are integrated into a complete, functional MD computer, high calculation efficiency is ensured by two key design ideas. Firstly, the MPU and SPU are linked by the high-speed transmission interface (HTI), and the time spent in MPU calculation and MPU-SPU communication is minimized by parallel pipeline computation (Section Pipeline and high-speed transmission interface). Secondly, the PES evaluation, which is the most time-consuming part of MD, is significantly accelerated by the processing in memory (PIM) calculations in the SPU based on the proposed NvN architecture, under the coordination of the MPU (Sections Master processing unit and Slave processing unit). The MPU is implemented by running a revised LAMMPS package⁵⁷ on a multicore CPU (Section Master processing unit), and the SPU is implemented using an FPGA (Section Slave processing unit).

Pipeline and high-speed transmission interface

To mitigate, as much as possible, the efficiency bottleneck caused by MPU-SPU communication, four high-speed technologies are used together here. Firstly, the MPU-SPU HTI is designed as a full-duplex channel to enable simultaneous data sending and data receiving, by using two separate memory hardware units. For instance, in the FPGA implementation (Section Hardware implementation), one on-chip block random access memory (BRAM)¹⁰⁵ of SPU is used to read data, and another on-chip BRAM of SPU is used to write data. Secondly, high-speed peripheral component interconnect express (PCIe) technology is used to send/receive data between MPU and SPU. For instance, here we use 16-lane PCIe 3.0, whose maximum bandwidth is 7.88 Gbit s⁻¹ per lane, so the total bandwidth is as high as 15.75 GByte s⁻¹ (Section Hardware implementation)^106,107. Thirdly, the direct memory access (DMA) technology is used to transfer data between MPU and SPU. Using DMA, MPU-SPU data communication can be achieved without MPU control, so that the data transfer latency is minimized and the burden of MPU is alleviated. Fourthly, the seven consecutive steps (Fig. 5) are organized in a carefully-designed pipeline (Fig. 6).
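As a quick consistency check of the bandwidth figures quoted above, assuming the standard PCIe 3.0 signalling rate of 8 GT s⁻¹ per lane with 128b/130b line encoding (these two values are not stated in the text and are taken as assumptions here):

$$8\ \mathrm{GT\,s^{-1}} \times \frac{128}{130} \approx 7.88\ \mathrm{Gbit\,s^{-1}}\ \text{per lane}, \qquad \frac{16 \times 8 \times 128/130}{8}\ \mathrm{GByte\,s^{-1}} \approx 15.75\ \mathrm{GByte\,s^{-1}}$$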

**Fig. 6: Schematic figure of calculation time based on pipeline design when (a) one processing core or (b) multiple processing cores are used in MPU.**

While the first three high-speed technologies (i.e., full-duplex, PCIe, DMA) minimize the SPU’s idle time (i.e., time other than t_SPU in Fig. 6a), the fourth one (i.e., pipeline) aims to eliminate it. As shown in Fig. 6b, by using a small number (typically fewer than ten) of CPU cores in the MPU, the NvN-based SPU is always busy performing heavy-duty calculations, which helps maximize the overall efficiency.

The proposed NVNMD is based on a pipeline, in which the MPU and the SPU work in a complementary manner (Figs. 5 and 6). Thus, to maximize the overall efficiency, it is desirable to keep the SPU always busy. In other words, t_SPU in Fig. 6 should be large enough (i.e., N should be large enough) to minimize T_i in Table 5. This trend can be seen in Fig. 7: the calculation efficiency drops if N is small. Since our focus is the MD simulation of large systems (e.g., N > 10⁴ atoms), this should not be a concern.

**Fig. 7: The calculation time efficiency depending on the number of atoms.**

Master processing unit

MPU performs S1, S2, S6, and S7, as illustrated in Fig. 5. When an MPU (e.g., a CPU with N_P cores) is used to process a system of N atoms, the whole system is spatially decomposed into N_P domains of equal volume, and each core processes one domain for parallel acceleration⁵⁷. To account for the interaction between atoms located within different domains, neighbor atoms of the domain Ω are copied from the neighboring domains to form a shell domain (referred to as Ωⁿ hereafter)⁵⁷. The indices of atoms inside Ωⁿ are stored in a list I_Ωⁿ = {j | R_j^Ω < R_c and j ∉ I_Ω}, where R_j^Ω denotes the minimum distance between atom j and Ω; I_Ω is the list of indices of atoms inside Ω. For notational convenience, the list of indices of all atoms inside Ω and Ωⁿ is denoted as I_Ω^a = (I_Ω, I_Ωⁿ), which is a combined list of I_Ω and I_Ωⁿ.

SPU can only store and process a limited amount of data at one time due to hardware resource restrictions, so domain Ω is further divided into N_SD = ⌈μ × N_Ω /N_SPU⌉ sub-domains (denoted as ω hereafter) of equal volume, where μ is set to 2 to account for the spatial fluctuation of atom density; N_Ω is the number of atoms within Ω; N_SPU is set to 4096 to strike a balance between communication efficiency and resource utilization; and ⌈x⌉ is the ceiling function, which rounds x up to the nearest integer. While processing one sub-domain ω, a shell of ω (referred to as ωⁿ hereafter) is additionally created to account for the interaction between atoms located within Ωⁿ and sub-domains other than ω. The indices of atoms inside ωⁿ are stored in a list I_ωⁿ = {j | R_j^ω < R_c and j ∉ I_ω}, where R_j^ω denotes the minimum distance between atom j and ω; I_ω is the list of indices of atoms inside ω.
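The sub-domain count can be evaluated directly from the formula above. The following is a small sketch with a hypothetical function name, using the values μ = 2 and N_SPU = 4096 given in the text.

```python
import math

MU = 2        # density-fluctuation factor (value from the text)
N_SPU = 4096  # SPU capacity parameter (value from the text)

def num_subdomains(n_omega, mu=MU, n_spu=N_SPU):
    """N_SD = ceil(mu * N_Omega / N_SPU) for a domain holding n_omega atoms."""
    return math.ceil(mu * n_omega / n_spu)

# A domain of 10,000 atoms: 2 * 10000 / 4096 = 4.88..., rounded up to 5.
print(num_subdomains(10_000))
```

With μ = 2, each sub-domain holds on average about half of N_SPU atoms, which leaves headroom for local density fluctuations.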

Based on the abovementioned two-level decomposition (i.e., ‘A’ and ‘B’ in Fig. 6b), the enable signal is utilized to ensure that the MPU cores call the SPU in a proper order. After running S2, the p^th core (p = 1, 2, …, N_P) doesn’t call SPU until it receives an enable signal from its previous core (i.e., the N_P^th core when p = 1 and the (p−1)^th core otherwise). After obtaining the results from SPU, the p^th core sends an enable signal to its next core (i.e., the 1^st core when p = N_P and the (p+1)^th core otherwise). It’s worth noting that the 1^st core doesn’t require an enable signal to process its 1^st sub-domain. The steps (i.e., S1, S2, S6, and S7) are discussed in detail below. In the S1, each core builds the neighbor list {N_c(i), i ∈ I_Ω} of atoms located within its domain Ω.
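The ordering enforced by the enable signal can be sketched as a simple round-robin. This is only an illustration of the call order described above, not the actual inter-core signalling; the function name is hypothetical, and every core is assumed to hold the same number of sub-domains.

```python
def spu_call_order(n_cores, n_subdomains):
    """Yield (core, sub_domain) pairs in the order the SPU is called.

    Cores and sub-domains are numbered from 1, as in the text. The enable
    signal travels around a ring: core p hands over to core p + 1, and the
    last core hands back to core 1, which then submits its next sub-domain.
    """
    for sub in range(1, n_subdomains + 1):
        for core in range(1, n_cores + 1):
            yield core, sub

print(list(spu_call_order(n_cores=3, n_subdomains=2)))
```

While one core waits for its turn, it can already run S2 for its next sub-domain, which is what lets the pipeline hide the MPU time behind the SPU time.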

In the S2, MPU processes the sub-domain ω’s data, including the neighbor list {N_c(i), i ∈ I_ω}, chemical species {Z_i, i ∈ I_ω^a}, and coordinates {R_i, i ∈ I_ω^a}, where I_ω^a = (I_ω, I_ωⁿ) is a list obtained by combining I_ω and I_ωⁿ. First, {N_c(i)} is re-encoded as a local neighbor list. For example, if one element of {N_c(i)} is 5, and the value 5 is located at the 1^st position of I_ω^a, this element of {N_c(i)} will be encoded as 1. In this step, the encoded {N_c(i)} is compressed from 32 bits to 16 bits. Second, {Z_i} is compressed from 32 bits to 2 bits by encoding each species as its position in the ordered list of chemical species. Third, {R_i} is encoded by multiplying by 2⁴⁸ and rounding the 64-bit floating-point number to a 64-bit integer. The encoded data {N_c(i)}, {Z_i}, and {R_i} are stored in the buffer until they are transmitted to SPU by HTI.
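The three encoding rules of S2 can be sketched as follows. The data layout (dictionaries keyed by global atom index) and all names are assumptions made for illustration; only the 1-based local indexing, the species-order code, the 2⁴⁸ scaling, and the bit widths come from the text.

```python
COORD_SHIFT = 48  # coordinates are scaled by 2**48 before rounding

def encode_subdomain(neighbor_lists, species, coords, atom_ids, species_order):
    """Encode one sub-domain's data as integers.

    atom_ids       : global indices in I_omega^a (sub-domain atoms, then shell)
    neighbor_lists : {global index i: [global neighbor indices]}
    species, coords: {global index: value}
    species_order  : list of chemical species; the list position is the code
    """
    # 1-based position inside I_omega^a, matching the example in the text.
    local = {g: pos for pos, g in enumerate(atom_ids, start=1)}
    enc_nlist = {local[i]: [local[j] for j in js] for i, js in neighbor_lists.items()}
    enc_species = [species_order.index(species[g]) for g in atom_ids]
    enc_coords = [tuple(round(x * 2**COORD_SHIFT) for x in coords[g]) for g in atom_ids]
    # Check that every value fits the bit widths quoted in the text.
    assert all(0 < v < 2**16 for js in enc_nlist.values() for v in js)
    assert all(0 <= z < 2**2 for z in enc_species)
    assert all(-2**63 <= c < 2**63 for xyz in enc_coords for c in xyz)
    return enc_nlist, enc_species, enc_coords

atom_ids = [5, 9, 2]
nlist = {5: [9, 2], 9: [5]}
species = {5: "O", 9: "H", 2: "H"}
coords = {5: (0.0, 0.0, 0.0), 9: (0.5, 0.0, 0.0), 2: (-0.25, 1.0, 0.0)}
enc_nlist, enc_species, enc_coords = encode_subdomain(
    nlist, species, coords, atom_ids, species_order=["O", "H"])
print(enc_nlist, enc_species)
```

With a 2⁴⁸ scale factor, a signed 64-bit integer covers coordinates of magnitude up to 2¹⁵ in the original length unit, with a resolution of 2⁻⁴⁸.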

In the S6, the cores of MPU decode the data fetched from SPU. Taking the sub-domain ω inside domain Ω as an example, the data consist of the energy E_ω, atomic forces {F_i, i ∈ I_ω^a}, and virial V_ω. E_ω is decoded from a 64-bit integer to a 64-bit floating-point number by multiplying by a factor of 2⁻¹³, and then summed up to obtain the energy E_Ω of domain Ω. {F_i} is decoded from 32-bit integers to 64-bit floating-point numbers by multiplying by a factor of 2⁻²⁵, and then summed into the corresponding atomic forces of domain Ω (i.e., {F_i, i ∈ I_Ω^a}) according to the index in I_ω^a. V_ω is decoded from a 64-bit integer to a 64-bit floating-point number by multiplying by a factor of 2⁻²⁵, and then added to the virial V_Ω of Ω.
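The decoding and accumulation in S6 amount to fixed-point-to-float conversion followed by summation. The sketch below uses a hypothetical data layout and function names; the scale factors 2⁻¹³ and 2⁻²⁵ are the ones given above, and the virial (decoded with 2⁻²⁵ in the same way) is omitted for brevity.

```python
ENERGY_SHIFT = 13  # E: integer * 2**-13
FORCE_SHIFT = 25   # F: integer * 2**-25

def decode_fixed(value, shift):
    """Convert a signed fixed-point integer to a Python float."""
    return value * 2.0 ** -shift

def accumulate_domain(subdomain_results):
    """Sum decoded sub-domain energies and forces into domain totals.

    subdomain_results: iterable of (E_int, {atom index: (Fx, Fy, Fz) as ints})
    """
    energy = 0.0
    forces = {}
    for e_int, f_ints in subdomain_results:
        energy += decode_fixed(e_int, ENERGY_SHIFT)
        for atom, f in f_ints.items():
            old = forces.get(atom, (0.0, 0.0, 0.0))
            # An atom can appear in several sub-domains (as a shell atom),
            # so its force contributions are added up.
            forces[atom] = tuple(o + decode_fixed(c, FORCE_SHIFT) for o, c in zip(old, f))
    return energy, forces

results = [(8192, {1: (2**25, 0, 0)}),
           (-4096, {1: (2**24, 0, 0), 2: (0, -2**25, 0)})]
energy, forces = accumulate_domain(results)
print(energy, forces)
```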

In the S7, the cores of MPU perform numerical integration, thermostat, and so on. After the core requests SPU to evaluate PES of its domain Ω, the energy E_Ω, atomic forces {F_i, i ∈ I_Ω^a}, and virial V_Ω are obtained, but they are incomplete. Therefore, the cores exchange the forces {F_i, i ∈ I_Ωⁿ} of atoms located within the shell I_Ωⁿ to obtain the complete atomic forces {F_i, i = 1, 2, …, N}. In addition, E_Ω and V_Ω are also exchanged to obtain the complete energy E and virial V of the whole system. Afterward, the atomic forces are used for numerical integration and other procedures in parallel. After S7 is finished, one timestep of MD is accomplished. The abovementioned steps repeat until all timesteps in the MD trajectory are accomplished.

Slave processing unit

As shown in Fig. 5, the SPU runs S3, S4, and S5 in each MD timestep. Among the three categories of MD (i.e., CMD, AIMD, and MLMD), we choose MLMD to implement the proposed special-purpose MD computer, because the FF-based PES evaluation in CMD is too inaccurate and the DFT-based PES evaluation in AIMD is too sophisticated. We modify the Deep Potential-Smooth Edition (DeepPot-SE)³⁰ and deploy it in the S4 of the SPU. The local atomic information Z_i, {Z_j}, and {R_ji} is used to compute the many-body descriptor D_i, which preserves translational, rotational, and permutational invariance. Then, D_i is used to calculate the i^th atom’s energy E_i. Finally, the atomic force components {F_ji} are obtained by computing the negative derivative of E_i. These steps are discussed in more detail below.

In order to preserve the translation invariance, the global coordinates R_j = (x_j, y_j, z_j) are transformed into the relative coordinates R_ji = R_j − R_i = (x_ji, y_ji, z_ji). To describe the smooth cutoff, a new coordinate u_ji is constructed by multiplying R_ji by a cutoff function s_ji, which describes how the contribution decays as R_ji increases up to R_c. The new coordinate is expressed as

$$\boldsymbol{u}_{ji} = \left( s_{ji},\ s_{ji}\frac{x_{ji}}{R_{ji}},\ s_{ji}\frac{y_{ji}}{R_{ji}},\ s_{ji}\frac{z_{ji}}{R_{ji}} \right)$$

(4)

Here, the cutoff function is defined as

$$s_{ji} = f_c(R_{ji}) = \begin{cases} \dfrac{1}{R_{ji}}, & 0 \le R_{ji} < R_{\mathrm{cs}} \\ \dfrac{1}{R_{ji}}\left\{ \dfrac{1}{2}\cos\left[ \pi \dfrac{R_{ji} - R_{\mathrm{cs}}}{R_c - R_{\mathrm{cs}}} \right] + \dfrac{1}{2} \right\}, & R_{\mathrm{cs}} \le R_{ji} < R_c \\ 0, & \mathrm{otherwise} \end{cases}$$

(5)

where R_cs is a predefined cutoff parameter^26,30. Next, a multilayer perceptron (MLP) neural network $G_{Z_j}$¹⁰⁸, called the Feature NN (FeaNN) hereafter, is constructed. FeaNN has one input node and M output nodes, and is written as

$$g_{ji} = G_{Z_j}(s_{ji})$$

(6)

The output of l^th layer of MLP is

$$x^{l+1} = \xi^l(x^l \times w^l + b^l)$$

(7)

where x^l, w^l, b^l, and ξ^l are input, weight, bias, and nonlinear activation function of the l^th layer, respectively. The weights of FeaNN depend on chemical species Z_j. Therefore, the output g_ji can distinguish the contribution of neighbors with different chemical species.
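The inputs to FeaNN follow from Eqs. (4) and (5). As a minimal floating-point sketch (the function names and the example values R_cs = 3 and R_c = 6 are hypothetical, and the singular point R_ji = 0 is excluded because an atom is not its own neighbor), the cutoff function and the new coordinate can be written as:

```python
import math

def cutoff(r, r_cs, r_c):
    """Smooth cutoff s = f_c(r) of Eq. (5); r = 0 is excluded (singular)."""
    if 0.0 < r < r_cs:
        return 1.0 / r
    if r_cs <= r < r_c:
        return (0.5 * math.cos(math.pi * (r - r_cs) / (r_c - r_cs)) + 0.5) / r
    return 0.0

def smooth_coordinate(r_ji, r_cs, r_c):
    """u_ji = (s, s*x/R, s*y/R, s*z/R) of Eq. (4) for a relative position r_ji."""
    r = math.sqrt(sum(c * c for c in r_ji))
    s = cutoff(r, r_cs, r_c)
    return (s,) + tuple(s * c / r for c in r_ji)

print(smooth_coordinate((2.0, 0.0, 0.0), r_cs=3.0, r_c=6.0))
```

At R_ji = R_cs both branches give 1/R_cs, and at R_ji = R_c the cosine branch reaches zero, so s_ji is continuous across both boundaries.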

In order to preserve the permutation invariance, matrix U_i of size M × 4 is written as

$$U_i = \sum_{j \in N_c(i)} U_{ji}$$

(8)

where

$$U_{ji} = g_{ji}^{\mathrm{T}} \times u_{ji}$$

(9)

is an M × 4 matrix^26,30.

In order to preserve the rotational invariance,

$$D_i^\prime = U_i \times U_i^{\mathrm{T}}$$

(10)

is defined^26,30. A subset of $D_i^\prime$ is extracted as a new M × M₂ (1 ≤ M₂ ≤ M) matrix D_i to reduce unnecessary computational cost.

$$D_i[l,k] = D_i^\prime\left[l, (k + l)\,\%\,M\right]; \quad l = 0, 1, \ldots, M - 1; \quad k = 0, 1, \ldots, M_2 - 1$$

(11)

where l and k are the row and column indices of the matrix, respectively; % is the modulo operation.
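Eqs. (8) to (11) can be checked numerically with a small pure-Python sketch. The function names and the toy values of u_ji and g_ji (M = 3, two neighbors) are hypothetical; the point is only that D_i does not change when all neighbors are rotated together.

```python
import math

def descriptor(u_list, g_list, m2):
    """Build D_i from per-neighbor u_ji (length 4) and g_ji (length M).

    U_i  = sum_j g_ji^T u_ji            (Eqs. (8)-(9), an M x 4 matrix)
    D'_i = U_i U_i^T                    (Eq. (10), an M x M matrix)
    D_i[l][k] = D'_i[l][(k + l) % M]    (Eq. (11), an M x M2 matrix)
    """
    m = len(g_list[0])
    U = [[sum(g[a] * u[b] for g, u in zip(g_list, u_list)) for b in range(4)]
         for a in range(m)]
    D_full = [[sum(U[a][c] * U[b][c] for c in range(4)) for b in range(m)]
              for a in range(m)]
    return [[D_full[l][(k + l) % m] for k in range(m2)] for l in range(m)]

def rotate_z(u, angle):
    """Rotate the Cartesian part of u = (s, x, y, z) about the z axis."""
    s, x, y, z = u
    c, sn = math.cos(angle), math.sin(angle)
    return (s, c * x - sn * y, sn * x + c * y, z)

u_list = [(0.5, 0.5, 0.0, 0.0), (0.25, 0.0, 0.15, 0.2)]
g_list = [(1.0, 0.5, -0.2), (0.3, 0.1, 0.7)]  # toy FeaNN outputs, M = 3
d_ref = descriptor(u_list, g_list, m2=2)
d_rot = descriptor([rotate_z(u, 0.7) for u in u_list], g_list, m2=2)
d_perm = descriptor(u_list[::-1], g_list[::-1], m2=2)
print(d_ref)
```

The rotation leaves s_ji, and therefore g_ji, unchanged and acts on U_i as an orthogonal matrix from the right, which cancels in U_i × U_i^T; reordering the neighbors only reorders the terms of the sum in Eq. (8).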

The total energy is written as $E = \sum_{i=1}^{N} E_i$. The energy E_i of the i^th atom is determined only by its chemical species Z_i and the symmetry-preserving feature D_i^26,27. An MLP neural network (referred to as FitNN hereafter) is used to fit the relation between the input D_i and the output E_i:

$$E_i = E_{Z_i}\left( {D_i} \right)$$

(12)

Then, the force and virial can be calculated by using Eq. (2) and Eq. (3), respectively.

Energy calculation

The evaluation of PES (S4 in Fig. 5) is realized by using six calculation modules (i.e., M1, M2, M3, M4, M5, and M6) in SPU, and the atomic energy E_i is predicted during forward propagation, as shown in Fig. 8a. In the M1, {R_ji²} is computed from the relative coordinate {R_ji}. In the M2, {R_ji²} is used to calculate the cutoff function {s_ji} and the outputs {g_ji} of FeaNN according to Eq. (5) and Eq. (6). The weights and biases of FeaNN are switched according to {Z_j}. In the M3, the new coordinate {u_ji} is obtained by using {s_ji} and {R_ji} (Eq. (4)). In the M4, {u_ji} is multiplied by {g_ji} to get {U_ji}, and then {U_ji} is summed together to get U_i (Eq. (9) and Eq. (8)). In the M5, the many-body descriptor D_i is extracted from the subset of the symmetric matrix $D_i^\prime$ which is the matrix product of U_i and U_i^T (Eq. (10) and Eq. (11)). In the M6, the FitNN is implemented to evaluate E_i from D_i, whose weights and biases are switched according to Z_i (Eq. (12)).

**Fig. 8: The implementation of PES calculation.**

To reduce the computational complexity of M2, an interpolation method is used to map from {R_ji²} to {h_ji}, where h_ji is a vector function of R_ji² and Z_j (e.g., s_ji and g_ji). At the beginning, N_T mapping tables with N_M rows are built to store the data a_k and b_k in their k^th row (k = 1, 2, …, N_M), where N_T is the number of different chemical species; a_k and b_k are the value and the derivative (with respect to R_ji²) of h_ji, respectively, at R_ji² = r_k (here r_k = (k−1)·R_c² /N_M). Then, when R_ji² (r_k ≤ R_ji² < r_{k+1}) and Z_j are input, one of the mapping tables is enabled according to Z_j and its k^th row data (i.e., a_k and b_k) are fetched. Finally, h_ji is computed via the formula h_ji = (R_ji² − r_k) · b_k + a_k. The interpolation method employs a few mapping tables and multiplications to replace the complex computation of the trigonometric function (Eq. (5)) and FeaNN (Eq. (6)), and it is utilized to compute {s_ji} and {g_ji} from {R_ji²} in the M2. N_M is set to 1024 as a compromise between accuracy loss and resource usage.
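The table lookup can be sketched for a single scalar component. The tabulated function here is a smooth stand-in, exp(−R²), chosen only for illustration (the real tables hold s_ji and g_ji per chemical species); the function names and R_c² = 36 are likewise hypothetical, and rows are indexed from 0 rather than from 1.

```python
import math

def build_table(h, dh, r_c2, n_m):
    """Tabulate (a_k, b_k) = (h(r_k), h'(r_k)) at r_k = k * r_c2 / n_m, k = 0..n_m-1."""
    step = r_c2 / n_m
    return [(h(k * step), dh(k * step)) for k in range(n_m)], step

def lookup(table, step, r2):
    """Evaluate h(r2) as (r2 - r_k) * b_k + a_k, using the row with r_k <= r2 < r_{k+1}."""
    k = min(int(r2 / step), len(table) - 1)
    a_k, b_k = table[k]
    return (r2 - k * step) * b_k + a_k

def h(q):
    """Stand-in for one component of h_ji as a function of q = R^2."""
    return math.exp(-q)

def dh(q):
    """Derivative of the stand-in with respect to q."""
    return -math.exp(-q)

table, step = build_table(h, dh, r_c2=36.0, n_m=1024)
worst = max(abs(lookup(table, step, 0.01 * i) - h(0.01 * i)) for i in range(3600))
print(worst)
```

Each lookup costs one table read, one subtraction, one multiplication, and one addition, regardless of how expensive the tabulated function is; for this stand-in the first-order error is bounded by half the squared step size.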

The digital signal processor (DSP)¹⁰⁹ resources are used to implement multiplication in M1, M2, M3, M4, and M5. The on-chip memory UltraRAM (URAM)¹⁰⁵ is used to implement the mapping table in M2. Look-Up Table (LUT)¹¹⁰ implements FitNN’s matrix multiplication in the M6, and FitNN’s weights and biases are stored in on-chip Look-Up Table RAM (LUTRAM)¹¹⁰ to avoid frequent fetching from off-chip memory. There is no need to temporarily store the intermediate results in the off-chip memory since the output of the former module is the input of the latter.

Force and virial calculation

The atomic force is defined as the negative gradient of energy, so the force component {F_ji} is calculated in the backward propagation of the modules (i.e., M1, M2, M3, M4, M5, and M6) in SPU (Fig. 8b). In order to perform both forward propagation (FP) and backward propagation (BP) calculations in these modules, each module consists of three sub-modules: FP, BP, and first-in-first-out (FIFO) (Fig. 8c). FP and BP execute the calculations of forward propagation and backward propagation, respectively. FIFO is employed to transmit the required intermediate results calculated by FP to BP. As illustrated in Fig. 8c, in the FP, the input of the k^th module is χ^k, and the output χ^{k+1} is calculated according to the expression corresponding to the module in the forward propagation; in the BP, the input is ∂E_i/∂χ^{k+1}, and the output ∂E_i/∂χ^k is calculated by using the chain rule as ∂E_i/∂χ^k = ∂E_i/∂χ^{k+1} × ∂χ^{k+1}/∂χ^k; in the FIFO, the intermediate results χ_m^k are transmitted in order to calculate ∂χ^{k+1}/∂χ^k.
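The FP/BP/FIFO structure can be mimicked in software with a chain of scalar stages. This is a conceptual sketch only: the class and function names are hypothetical, the two toy stages are stand-ins rather than the actual M1 to M6, and real modules operate on vectors and matrices.

```python
from collections import deque

class Module:
    """One pipeline stage with FP, BP, and a FIFO between them."""

    def __init__(self, forward, derivative):
        self.forward = forward        # chi^k -> chi^{k+1}
        self.derivative = derivative  # chi^k -> d chi^{k+1} / d chi^k
        self.fifo = deque()

    def fp(self, chi):
        self.fifo.append(chi)         # keep the input for the later BP pass
        return self.forward(chi)

    def bp(self, grad_out):
        chi = self.fifo.popleft()     # first in, first out
        return grad_out * self.derivative(chi)  # chain rule

def energy_and_gradient(modules, x):
    """Run FP through all modules, then BP in reverse order."""
    for m in modules:
        x = m.fp(x)
    grad = 1.0                        # dE/dE
    for m in reversed(modules):
        grad = m.bp(grad)
    return x, grad

modules = [
    Module(lambda r: r * r, lambda r: 2.0 * r),                          # r -> r^2
    Module(lambda q: 1.0 / (1.0 + q), lambda q: -1.0 / (1.0 + q) ** 2),  # toy energy
]
energy, grad = energy_and_gradient(modules, 2.0)
print(energy, grad)
```

For this toy chain, E = 1/(1 + r²), so at r = 2 the energy is 0.2 and dE/dr = −2r/(1 + r²)² = −0.16, which the backward pass reproduces without recomputing any forward quantity.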

Using the module structure shown in Fig. 8c, the gradient of each module’s input is computed in the backward propagation (Fig. 8b). More specifically, in the M6, FIFO transmits Z_i and the input of each layer’s activation function, and BP computes ∂E_i/∂D_i. In the M5, FIFO transmits U_i, and BP computes ∂E_i/∂U_i. In the M4, FIFO transmits {u_ji} and {g_ji}, and BP computes {∂E_i/∂u_ji} and {∂E_i/∂g_ji}. In the M3, FIFO transmits {s_ji} and {R_ji}, and BP computes {∂E_i/∂s_ji} and {∂E_i/∂u_ji × ∂u_ji/∂R_ji}. In the M2, FIFO transmits {R_ji²} and {Z_j}, and BP computes {∂E_i/∂R_ji²}. It’s worth noting that {∂s_ji/∂R_ji²} and {∂g_ji/∂R_ji²} are obtained by using the interpolation method proposed in Section Energy calculation. In the M1, FIFO transmits {R_ji}, and BP computes {F_ji} = {(∂E_i/∂R_ji² × ∂R_ji²/∂R_ji) + (∂E_i/∂u_ji × ∂u_ji/∂R_ji)} and {V_ji}.

In the BP, the matrix multiplication in the FitNN is implemented by LUT, and other multiplication is realized by DSP resources; on-chip LUTRAM is used to hold the FitNN’s parameters; on-chip URAM is used to implement the mapping tables in the M2. The parameters and the mapping tables only need to be initialized once at the start of NVNMD, and they don’t need to be fetched from off-chip memory on a regular basis. In the FIFO, on-chip URAM is also used to implement the function which transmits the data from FP to BP to avoid communication with off-chip memory. The entire procedure is designed to run in pipeline mode for optimal performance.

Processing in memory

If the energy, force, and virial (Sections Energy calculation and Force and virial calculation) are calculated on traditional vN computers, the efficiency is very low. For instance, calculating the l^th layer of MLP (i.e., Eq. (7)) takes 11 steps, as shown in Fig. 9. These calculations are not as efficient as they could be, because the data storage unit needs to be accessed 8 times (i.e., steps 1, 2, 4, 5, 6, 8, 9, and 11 in Fig. 9a). Due to the limited size of the on-chip memory (e.g., cache) of the vN processing unit (e.g., CPU/GPU), the system has to frequently access the off-chip data storage unit (e.g., main memory), which is typically two orders of magnitude slower than the processing unit (i.e., the well-known von Neumann bottleneck, vNB).

**Fig. 9: Schematic comparison of calculation step between vN and NvN architecture.**

To overcome the vNB, the proposed NvN-based SPU leverages the processing-in-memory (PIM) technology to avoid heavy-duty data shuttling^{111,112,113,114,115}. Specifically, the logic devices and memory cells are integrated together, eliminating the data fetching latency of its vN counterparts. In the proposed NvN-based SPU, w^l and b^l are stored in the local on-chip memory and x^{l+1} is directly used as the input of the (l+1)^th layer of MLP without accessing off-chip memory (Fig. 9), such that the repeated data shuttling from/to the off-chip memory (i.e., vNB) can be avoided. It’s worth noting that the parameters such as w^l and b^l represent the high-dimensional PES and are material-dependent. Thus, to compute a long MD trajectory of a particular material, w^l and b^l are only loaded once from off-chip memory and then kept unchanged in on-chip memory during all timesteps of the MD trajectory. The logic and arithmetic operations (e.g., multipliers, adders, and activation functions) are implemented using reconfigurable circuits that link the on-chip memory cells (e.g., w^l and b^l). Using PIM, the calculation is pipelined without interruption by data shuttling latency, such that the calculation time is spent purely on useful logic and arithmetic operations and, thus, the efficiency is maximized.

Quantized neural network

To implement NvN PIM (Fig. 9b), it would be very hardware resource-consuming if variables (e.g., x^l, w^l, b^l, p^l, etc.) were represented using floating-point numbers¹¹⁶. So, although the continuous neural network (CNN) based on floating-point numbers is adopted in nearly all existing MLMD, we use the quantized neural network (QNN), which has been proposed to replace CNN in hardware devices with limited power supply and computational resources^117,118,119. In the QNN, the weights and activations are quantized to save power consumption and computation resources. For example, we use quantization

$$\chi ^q = \sigma _\gamma (\chi ) = \left\lfloor {\chi \times 2^\gamma } \right\rfloor$$

(13)

for floating-point number χ, where χ^q is the quantized value with the precision 2^-γ; ⌊x⌋ is the floor function, which gives the greatest integer less than or equal to x. The quantization parameter γ is determined by the trade-off between accuracy and resources. We found that, by setting γ = 13, there is negligibly small accuracy loss after replacing CNN with QNN in the proposed NVNMD.
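Eq. (13) maps directly onto integer arithmetic. A minimal sketch with hypothetical function names, using γ = 13 as chosen in the text:

```python
import math

GAMMA = 13  # quantization parameter chosen in the text

def quantize(x, gamma=GAMMA):
    """sigma_gamma(x) = floor(x * 2**gamma): an integer in units of 2**-gamma."""
    return math.floor(x * 2 ** gamma)

def dequantize(xq, gamma=GAMMA):
    """Map the quantized integer back to a floating-point value."""
    return xq / 2 ** gamma

x = 0.1234
xq = quantize(x)
print(xq, dequantize(xq), x - dequantize(xq))
```

Because the floor function rounds towards minus infinity, the quantization error x − χ^q · 2^-γ always lies in [0, 2^-γ), for negative inputs as well.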

Multiplication-less neural network

To implement NvN PIM (Fig. 9b), it would also be very hardware resource-consuming if the multiplication operations were realized in the arithmetic circuit directly^48,49. So, although multiplication-based neural networks are adopted in nearly all existing MLMD, we propose a multiplication-less neural network, which is specially designed for the proposed NVNMD, in order to reduce the hardware circuit complexity and power consumption. Specifically, to evaluate the (3l-1)^th step in Fig. 9b, the multiplication operation ‘×’ is replaced by the bitwise shift operations ‘<<’ and ‘>>’, to evaluate

$$p^q = \left\lfloor (w^q \times x^q)/2^\gamma \right\rfloor = \left( \sum_{k=1}^{K} s_k \cdot x^q \cdot 2^{n_k} \right) \gg \gamma = \left( \sum_{k=1}^{K} \left( s_k \cdot x^q \right) \ll n_k \right) \gg \gamma$$

(14)

where x^q is quantized input of the layer; γ is the quantization parameter (Eq. (13));

$$w^q = \zeta_K(\sigma_\gamma(w)) = \sum_{k=1}^{K} s_k \cdot 2^{n_k}$$

(15)

is the quantized weight of the QNN; s_k = −1, 0, or 1 is the sign; n_k is a natural number;

$$\zeta_K(x) = \begin{cases} \zeta_{K-1}(x - \zeta(x)) + \zeta(x), & K > 1 \\ \zeta(x), & K = 1 \end{cases}$$

(16)

is the quantization function;

$$\zeta(x) = s \cdot 2^n = \begin{cases} 1 \cdot 2^{\left\lceil \log_2(x/1.5) \right\rceil}, & x > 0 \\ 0 \cdot 2^0, & x = 0 \\ -1 \cdot 2^{\left\lceil \log_2(-x/1.5) \right\rceil}, & x < 0 \end{cases}$$

(17)

is used to quantize a value to a signed power of 2; and ⌈x⌉ is the ceiling function, which rounds x up to the nearest integer.

Obviously, in the above multiplication-less design, the multiplication operation is replaced by bitwise shift and summation operations, which are much more resource-economical and energy-saving in digital circuit. Our test shows that if K is too small (say, K = 1 or 2), there is serious accuracy loss; if K ≥ 3, the accuracy is decent to fit high-dimensional PES. So, we use K = 3 hereafter.
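The shift-and-add scheme of Eqs. (14) to (17) can be sketched with Python integers. The function names are hypothetical, and the exponent is computed with the ceiling function that the text defines alongside Eq. (17) (taken here as an assumption), which keeps n_k a natural number for integer inputs and rounds to the nearest power of 2.

```python
import math

def zeta(x):
    """Round a non-zero integer x to a signed power of two, returned as (s, n)."""
    if x == 0:
        return 0, 0
    s = 1 if x > 0 else -1
    n = math.ceil(math.log2(abs(x) / 1.5))
    return s, n

def zeta_k(x, k):
    """Greedy K-term expansion of Eq. (16): a list of (s_k, n_k) pairs."""
    terms = []
    for _ in range(k):
        s, n = zeta(x)
        terms.append((s, n))
        x -= s * 2 ** n  # the remaining residual is quantized in the next pass
    return terms

def shift_multiply(terms, xq, gamma):
    """Eq. (14): p^q = (sum_k (s_k * x^q) << n_k) >> gamma, using no multiplier."""
    acc = 0
    for s, n in terms:
        if s > 0:
            acc += xq << n
        elif s < 0:
            acc -= xq << n
    return acc >> gamma

GAMMA = 13
w_int = math.floor(0.7 * 2 ** GAMMA)   # sigma_gamma(0.7) = 5734
terms = zeta_k(w_int, 3)               # K = 3, as chosen in the text
w_q = sum(s * 2 ** n for s, n in terms)
x_q = math.floor(0.3 * 2 ** GAMMA)     # sigma_gamma(0.3) = 2457
print(terms, w_q, shift_multiply(terms, x_q, GAMMA))
```

In this sketch each pass leaves a residual of at most one third of its input, so three terms bound the relative weight error by 1/27; for w = 0.7 the quantized weight is 4096 + 2048 − 512 = 5632, i.e., 0.6875.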

Nonlinear activation function

To implement NvN PIM (Fig. 9b), it would also be very hardware resource-consuming if transcendental function-based nonlinear activation functions (e.g., tanh(x)) were implemented directly¹²⁰. So, although such activation functions are widely used in existing MLMD, we design a nonlinear activation function (Fig. 10)

$$\begin{array}{c} \phi_{\alpha,\beta,\gamma}(x) = \dfrac{x_\gamma}{\alpha} - \dfrac{x_\gamma \cdot \left| x_\gamma \right|}{\beta}, \quad x_\gamma = \begin{cases} \gamma, & x \ge \gamma \\ x, & -\gamma < x < \gamma \\ -\gamma, & x \le -\gamma \end{cases} \\ \phi(x) = \phi_{1,4,2}(x) + \phi_{32,256,4}(x) \end{array}$$

(18)

without transcendental functions. To suit the NvN-based SPU, the activation function is redesigned to have a continuous value and first derivative, and to require fewer calculations in training and prediction. Because its parameters are powers of 2, shift operations can be used instead of the relevant multiplications and divisions. The most complex operation is just multiplication, rather than exponentiation or division. It is easy to implement ϕ(x) in the training and testing processes on both vN-based and NvN-based computers. The curves of tanh(x) and ϕ(x) are compared in Fig. 10, where ϕ(x) is normalized to the range [−1, 1] by dividing by 1.0625 (the maximum value of ϕ(x)). The two functions are similar in both value and first derivative.
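Eq. (18) is simple enough to evaluate directly. The following floating-point sketch uses hypothetical function names; a hardware version would replace the divisions by powers of 2 with shifts.

```python
import math

def phi_part(x, alpha, beta, gamma):
    """phi_{alpha,beta,gamma}(x) = x_g/alpha - x_g*|x_g|/beta, x_g = x clipped to [-gamma, gamma]."""
    xg = max(-gamma, min(gamma, x))
    return xg / alpha - xg * abs(xg) / beta

def phi(x):
    """phi(x) = phi_{1,4,2}(x) + phi_{32,256,4}(x) of Eq. (18)."""
    return phi_part(x, 1, 4, 2) + phi_part(x, 32, 256, 4)

PHI_MAX = 1.0625  # maximum value quoted in the text, reached for x >= 4

for x in (0.0, 0.5, 1.0, 2.0, 4.0):
    print(x, phi(x) / PHI_MAX, math.tanh(x))
```

Each component has zero slope at its clipping point (1/α − 2γ/β = 0 for both parameter sets), which is why the first derivative stays continuous where the function saturates.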

**Fig. 10: The comparison between two activation functions.**

Hardware implementation

To implement the heterogeneous vN/NvN architecture (Fig. 6), we use a vN-based CPU (Intel i7-10700K, 3.80 GHz, 8 cores) and an NvN-based FPGA (Xilinx xcvu9p) together. The MPU in Fig. 6 is implemented using the CPU, and the SPU is implemented using the FPGA. For the neural network model deployed in SPU (Section Slave processing unit), the maximum number of neighbor atoms is set to 128; the number of FeaNN output nodes is M = 20; the D_i dimension is set as 20 × 10; and FitNN contains three hidden layers, each having 20 nodes. The time division multiplexing (TDM) technology is adopted to reduce resource usage^121,122. By optimizing the design, resource usage is reduced, the timing is improved, and a clock frequency of 250 MHz is achieved. The resources consumed by the whole design are listed in Table 7.

Data availability

To reproduce the results in this paper, training and inference calculations are needed. The training codes and data are open-sourced at https://github.com/LiuGroupHNU/nvnmd, for generating the NVNMD-oriented inter-atomic potential models. The inference functionalities (i.e., NVNMD calculations) can be freely accessed at http://nvnmd.picp.vip.

References

Bapst, V. et al. Unveiling the predictive power of static structure in glassy systems. Nat. Phys. 16, 448–454 (2020).
Schott, S. et al. Polaron spin dynamics in high-mobility polymeric semiconductors. Nat. Phys. 15, 814–822 (2019).
Galib, M. & Limmer, D. T. Reactive uptake of N₂O₅ by atmospheric aerosol is dominated by interfacial processes. Science 371, 921–925 (2021).
Widmer, D. R. & Schwartz, B. J. Solvents can control solute molecular identity. Nat. Chem. 10, 910–916 (2018).
Karplus, M. & Petsko, G. A. Molecular dynamics simulations in biology. Nature 347, 631–639 (1990).
Chen, S. et al. Simultaneously enhancing the ultimate strength and ductility of high-entropy alloys via short-range ordering. Nat. Commun. 12, 4953 (2021).
Ding, W. et al. Prediction of intrinsic two-dimensional ferroelectrics in In₂Se₃ and other III₂-VI₃ van der Waals materials. Nat. Commun. 8, 14956 (2017).
Wang, Y. et al. Dynamic deformability of individual PbSe nanocrystals during superlattice phase transitions. Sci. Adv. 5, eaaw5623 (2019).
Lehtinen, O., Kurasch, S., Krasheninnikov, A. V. & Kaiser, U. Atomic scale study of the life cycle of a dislocation in graphene from birth to annihilation. Nat. Commun. 4, 2098 (2013).
Lu, S. et al. Activation pathway of a G protein-coupled receptor uncovers conformational intermediates as targets for allosteric drug design. Nat. Commun. 12, 4721 (2021).
Zhao, Y. et al. Augmenting drug–carrier compatibility improves tumour nanotherapy efficacy. Nat. Commun. 7, 11221 (2016).
Laio, A., Bernard, S., Chiarotti, G. L., Scandolo, S. & Tosatti, E. Physics of iron at Earth’s core conditions. Science 287, 1027–1030 (2000).
Steinle-Neumann, G., Stixrude, L., Cohen, R. E. & Gülseren, O. Elasticity of iron at the temperature of the Earth’s inner core. Nature 413, 57–60 (2001).
Hughes, M. A. et al. n-type chalcogenides by ion implantation. Nat. Commun. 5, 5346 (2014).
Wang, X.-P. et al. Time-dependent density-functional theory molecular-dynamics study on amorphization of Sc-Sb-Te alloy under optical excitation. npj Comput. Mater. 6, 31 (2020).
Kohn, W. & Sham, L. J. Self-consistent equations including exchange and correlation effects. Phys. Rev. 140, A1133–A1138 (1965).
Car, R. & Parrinello, M. Unified approach for molecular dynamics and density-functional theory. Phys. Rev. Lett. 55, 2471–2474 (1985).
Alavi, S. Ab initio molecular dynamics basic theory and advanced methods. By Dominik Marx and Jürg Hutter. Angew. Chem. Int. Ed. 48, 9404–9405 (2009).
Jorgensen, W. L., Maxwell, D. S. & Tirado-Rives, J. Development and testing of the OPLS all-atom force field on conformational energetics and properties of organic liquids. J. Am. Chem. Soc. 118, 11225–11236 (1996).
Wang, J., Wolf, R. M., Caldwell, J. W., Kollman, P. A. & Case, D. A. Development and testing of a general Amber force field. J. Comput. Chem. 25, 1157–1174 (2004).
Vanommeslaeghe, K. et al. CHARMM general force field: a force field for drug-like molecules compatible with the CHARMM all-atom additive biological force fields. J. Comput. Chem. 31, 671–690 (2010).
Shaw, D. E. et al. Anton, a special-purpose machine for molecular dynamics simulation. Commun. ACM 51, 91–97 (2008).
Shaw, D. E. et al. Anton 2: Raising the Bar for Performance and Programmability in a Special-Purpose Molecular Dynamics Supercomputer. in SC14: International Conference for High Performance Computing, Networking, Storage and Analysis 2015-January, 41–53 (IEEE, 2014).
Shaw, D. E. et al. Anton 3: twenty microseconds of molecular dynamics simulation before lunch. in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 1–11 (ACM, 2021). https://doi.org/10.1145/3458817.3487397.
Behler, J. & Parrinello, M. Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys. Rev. Lett. 98, 146401 (2007).
Wang, H., Zhang, L., Han, J. & E, W. DeePMD-kit: a deep learning package for many-body potential energy representation and molecular dynamics. Comput. Phys. Commun. 228, 178–184 (2018).
Zhang, L., Han, J., Wang, H., Car, R. & E, W. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Phys. Rev. Lett. 120, 143001 (2018).
Zhang, L., Lin, D.-Y., Wang, H., Car, R. & E, W. Active learning of uniformly accurate interatomic potentials for materials simulation. Phys. Rev. Mater. 3, 023804 (2019).
Zhang, Y. et al. DP-GEN: a concurrent learning platform for the generation of reliable deep learning based potential energy models. Comput. Phys. Commun. 253, 107206 (2020).
Zhang, L. et al. End-to-end Symmetry Preserving Inter-atomic Potential Energy Model for Finite and Extended Systems. Adv. Neural Inf. Process. Syst. 2018-December, 4436–4446 (2018).
Jia, W. et al. Pushing the Limit of Molecular Dynamics with Ab Initio Accuracy to 100 Million Atoms with Machine Learning. in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis 1–14 (IEEE, 2020). https://doi.org/10.1109/SC41405.2020.00009.
LAMMPS Benchmarks. Available at: https://www.lammps.org/bench.html.
Wulf, W. A. & McKee, S. A. Hitting the memory wall. ACM SIGARCH Comput. Archit. N. 23, 20–24 (1995).
Horowitz, M. 1.1 Computing’s energy problem (and what we can do about it). in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC) 57, 10–14 (IEEE, 2014).
Ielmini, D. & Wong, H. S. P. In-memory computing with resistive switching devices. Nat. Electron. 1, 333–343 (2018).
Stegailov, V., Smirnov, G. & Vecher, V. VASP hits the memory wall: processors efficiency comparison. Concurr. Comput. Pract. Exp. 31, e5136 (2019).
John von Neumann. First Draft of a Report on the EDVAC. (1945).
Electronic Numerical Integrator and Computer (ENIAC). Available at: https://en.wikipedia.org/wiki/ENIAC.
Beyond von Neumann. Nat. Nanotechnol. 15, 507–507 (2020).
Taiji, M. et al. Protein Explorer: A Petaflops Special-Purpose Computer System for Molecular Dynamics Simulations. in Proceedings of the 2003 ACM/IEEE conference on Supercomputing - SC ’03 15 (ACM Press, 2003). https://doi.org/10.1145/1048935.1050166.
Harvey, M. J., Giupponi, G. & De Fabritiis, G. ACEMD: Accelerating biomolecular dynamics in the microsecond time scale. J. Chem. Theory Comput. 5, 1632–1639 (2009).
Deringer, V. L. & Csányi, G. Machine learning based interatomic potential for amorphous carbon. Phys. Rev. B 95, 094203 (2017).
Rowe, P., Csányi, G., Alfè, D. & Michaelides, A. Development of a machine learning potential for graphene. Phys. Rev. B 97, 054303 (2018).
Zeng, J., Cao, L., Xu, M., Zhu, T. & Zhang, J. Z. H. Complex reaction processes in combustion unraveled by neural network-based molecular dynamics simulation. Nat. Commun. 11, 5713 (2020).
Li, R., Lee, E. & Luo, T. A unified deep neural network potential capable of predicting thermal conductivity of silicon in different phases. Mater. Today Phys. 12, 100181 (2020).
Rowe, P., Deringer, V. L., Gasparotto, P., Csányi, G. & Michaelides, A. An accurate and transferable machine learning potential for carbon. J. Chem. Phys. 153, 034702 (2020).
Bettini, J. et al. Experimental realization of suspended atomic chains composed of different atomic species. Nat. Nanotechnol. 1, 182–185 (2006).
Wu, B. et al. Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions. in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 9127–9135 (IEEE, 2018). https://doi.org/10.1109/CVPR.2018.00951
Chen, H. et al. AdderNet: Do We Really Need Multiplications in Deep Learning? in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 1465–1474 (IEEE, 2020). https://doi.org/10.1109/CVPR42600.2020.00154
Ahn, J., Yoo, S., Mutlu, O. & Choi, K. PIM-enabled instructions. in Proceedings of the 42nd Annual International Symposium on Computer Architecture 43, 336–348 (ACM, 2015).
Mutlu, O., Ghose, S., Gómez-Luna, J. & Ausavarungnirun, R. Processing data where it makes sense: Enabling in-memory computation. Microprocess. Microsyst. 67, 28–41 (2019).
Liu, J. & Mo, P. The server website of NVNMD. (2021). Available at: http://nvnmd.picp.vip/.
Liu, J. & Mo, P. The training and testing code for NVNMD. (2021). Available at: https://github.com/LiuGroupHNU/nvnmd.
Abadi, M. et al. TensorFlow: A system for large-scale machine learning. Proc. 12th USENIX Symp. Oper. Syst. Des. Implementation, OSDI 2016 265–283 (2016). https://doi.org/10.5555/3026877.3026899
Sosso, G. C., Miceli, G., Caravati, S., Behler, J. & Bernasconi, M. Neural network interatomic potential for the phase change material GeTe. Phys. Rev. B 85, 174103 (2012).
Shi, M., Mo, P. & Liu, J. Deep Neural Network for Accurate and Efficient Atomistic Modeling of Phase Change Memory. IEEE Electron Device Lett. 41, 365–368 (2020).
Plimpton, S. Fast parallel algorithms for short-range molecular dynamics. J. Comput. Phys. 117, 1–19 (1995).
Chmiela, S. et al. Machine learning of accurate energy-conserving molecular force fields. Sci. Adv. 3, e1603015 (2017).
Chmiela, S., Sauceda, H. E., Müller, K.-R. & Tkatchenko, A. Towards exact molecular dynamics simulations with machine-learned force fields. Nat. Commun. 9, 3887 (2018).
Christensen, A. S. & von Lilienfeld, O. A. On the role of gradients for machine learning of molecular energies and forces. Mach. Learn. Sci. Technol. 1, 045018 (2020).
Shi, M., Li, J., Tao, M., Zhang, X. & Liu, J. Artificial intelligence model for efficient simulation of monatomic phase change material antimony. Mater. Sci. Semicond. Process. 136, 106146 (2021).
Huang, J. et al. Deep potential generation scheme and simulation protocol for the Li₁₀GeP₂S₁₂-type superionic conductors. J. Chem. Phys. 154, 094703 (2021).
Bogojeski, M., Vogt-Maranto, L., Tuckerman, M. E., Müller, K.-R. & Burke, K. Quantum chemical accuracy from density functional approximations via machine learning. Nat. Commun. 11, 5223 (2020).
Narayanan, B., Redfern, P. C., Assary, R. S. & Curtiss, L. A. Accurate quantum chemical energies for 133000 organic molecules. Chem. Sci. 10, 7449–7455 (2019).
Morawietz, T. & Artrith, N. Machine learning-accelerated quantum mechanics-based atomistic simulations for industrial applications. J. Comput. -Aided Mol. Des. 35, 557–586 (2021).
Zhang, P., Shen, L. & Yang, W. Solvation Free Energy Calculations with Quantum Mechanics/Molecular Mechanics and Machine Learning Models. J. Phys. Chem. B 123, 901–908 (2019).
Lu, C. et al. OPLS4: Improving force field accuracy on challenging regimes of chemical space. J. Chem. Theory Comput. 17, 4291–4300 (2021).
Soler, J. M. et al. The SIESTA method for ab initio order-N materials simulation. J. Phys. Condens. Matter 14, 2745–2779 (2002).
Giannozzi, P. et al. QUANTUM ESPRESSO: a modular and open-source software project for quantum simulations of materials. J. Phys. Condens. Matter 21, 395502 (2009).
VandeVondele, J. et al. Quickstep: Fast and accurate density functional calculations using a mixed Gaussian and plane waves approach. Comput. Phys. Commun. 167, 103–128 (2005).
Ahn, S. Phase Change Memory. (Springer International Publishing, 2018). https://doi.org/10.1007/978-3-319-69053-7.
Kolobov, A. V., Krbal, M., Fons, P., Tominaga, J. & Uruga, T. Distortion-triggered loss of long-range order in solids with bonding energy hierarchy. Nat. Chem. 3, 311–316 (2011).
Mo, Y., Ong, S. P. & Ceder, G. First principles study of the Li₁₀GeP₂S₁₂ lithium super ionic conductor material. Chem. Mater. 24, 15–17 (2012).
Marcolongo, A., Binninger, T., Zipoli, F. & Laino, T. Simulating Diffusion Properties of Solid‐State Electrolytes via a Neural Network Potential: Performance and Training Scheme. ChemSystemsChem 2, e1900031 (2020).
Kamaya, N. et al. A lithium superionic conductor. Nat. Mater. 10, 682–686 (2011).
NVIDIA Corporation. Nvidia Tesla V100 GPU Volta Architecture. White Paper 53 (2017). Available at: https://images.nvidia.cn/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.
Summit. Available at: https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit/.
NVIDIA. NVIDIA V100. Available at: https://www.nvidia.com/en-us/data-center/v100/.
Xilinx. UltraScale Architecture and Product Data Sheet: Overview. Xilinx.com 1–46 (2020). Available at: https://www.xilinx.com/support/documentation/data_sheets/ds890-ultrascale-overview.pdf.
Xilinx. UltraScale+ FPGAs Product Tables and Product Selection Guide. Xilinx.com 1–11 (2021). Available at: https://www.xilinx.com/support/documentation/selection-guides/ultrascale-plus-fpga-product-selection-guide.pdf.
Ic, S. P., Dube, B., Elisabeth, S. & Scansen, D. Apple M1 System-on-Chip. systemplus.fr 1–36 (2020). Available at: https://www.systemplus.fr/wp-content/uploads/2020/12/SP20608-Apple-M1-System-on-Chip-Sample.pdf.
Lu, D. et al. 86 PFLOPS Deep Potential Molecular Dynamics simulation of 100 million atoms with ab initio accuracy. Comput. Phys. Commun. 259, 107624 (2021).
Samir, N. et al. ASIC and FPGA Comparative Study for IoT lightweight hardware security algorithms. J. Circuits, Syst. Comput. 28, (2019).
Schütt, K. T. et al. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. Adv. Neural Inf. Process. Syst. 2017-December, 992–1002 (2017).
Klicpera, J., Groß, J. & Günnemann, S. Directional Message Passing for Molecular Graphs. Preprint at http://arxiv.org/abs/2003.03123 (2020).
Chmiela, S., Sauceda, H. E., Poltavsky, I., Müller, K. R. & Tkatchenko, A. sGDML: Constructing accurate and data efficient molecular force fields using machine learning. Comput. Phys. Commun. 240, 38–45 (2019).
Schütt, K., Unke, O. & Gastegger, M. Equivariant message passing for the prediction of tensorial properties and molecular spectra. in Proceedings of the 38th International Conference on Machine Learning (Vol. 139 eds. Meila, M. & Zhang, T.) 9377–9388 (PMLR, 2021).
Unke, O. T. et al. SpookyNet: Learning force fields with electronic degrees of freedom and nonlocal effects. Nat. Commun. 12, 7273 (2021).
Klicpera, J., Becker, F. & Günnemann, S. GemNet: Universal Directional Graph Neural Networks for Molecules. Preprint at http://arxiv.org/abs/2106.08903 (2021).
Haghighatlari, M. et al. NewtonNet: A Newtonian message passing network for deep learning of interatomic potentials and forces. Preprint at http://arxiv.org/abs/2108.02913 (2021).
Qiao, Z. et al. UNiTE: Unitary N-body Tensor Equivariant Network with Applications to Quantum Chemistry. Preprint at http://arxiv.org/abs/2105.14655 (2021).
Batzner, S. et al. E(3)-Equivariant Graph Neural Networks for Data-Efficient and Accurate Interatomic Potentials. Preprint at http://arxiv.org/abs/2101.03164 (2021).
Kanal, I. Y., Keith, J. A. & Hutchison, G. R. A sobering assessment of small‐molecule force field methods for low energy conformer predictions. Int. J. Quantum Chem. 118, e25512 (2018).
Zgarbová, M., Otyepka, M., Šponer, J., Hobza, P. & Jurečka, P. Large-scale compensation of errors in pairwise-additive empirical force fields: Comparison of AMBER intermolecular terms with rigorous DFT-SAPT calculations. Phys. Chem. Chem. Phys. 12, 10476–10493 (2010).
Demir, H. et al. DFT-based force field development for noble gas adsorption in metal organic frameworks. J. Mater. Chem. A 3, 23539–23548 (2015).
Shen, L. & Yang, W. Molecular Dynamics Simulations with Quantum Mechanics/Molecular Mechanics and Adaptive Neural Networks. J. Chem. Theory Comput. 14, 1442–1455 (2018).
Jinnouchi, R., Karsai, F. & Kresse, G. Making free-energy calculations routine: combining first principles with machine learning. Phys. Rev. B 101, 060201 (2020).
Han, J., Zhang, L., Car, R. & E, W. Deep potential: a general representation of a many-body potential energy surface. Commun. Comput. Phys. 23, 629–639 (2018).
Allen, M. P. & Tildesley, D. J. Computer Simulation of Liquids. 1, (Oxford University Press, 2017).
Parrinello, M. & Rahman, A. Polymorphic transitions in single crystals: a new molecular dynamics method. J. Appl. Phys. 52, 7182–7190 (1981).
Martyna, G. J., Tobias, D. J. & Klein, M. L. Constant pressure molecular dynamics algorithms. J. Chem. Phys. 101, 4177–4189 (1994).
Dullweber, A., Leimkuhler, B. & McLachlan, R. Symplectic splitting methods for rigid body molecular dynamics. J. Chem. Phys. 107, 5840–5851 (1997).
Shinoda, W., Shiga, M. & Mikami, M. Rapid estimation of elastic constants by molecular dynamics simulation under constant stress. Phys. Rev. B 69, 134103 (2004).
Tuckerman, M. E., Alejandre, J., López-Rendón, R., Jochim, A. L. & Martyna, G. J. A Liouville-operator derived measure-preserving integrator for molecular dynamics simulations in the isothermal-isobaric ensemble. J. Phys. A. Math. Gen. 39, 5629–5651 (2006).
Xilinx. UltraScale Architecture: Memory Resources User Guide (UG573). 573, 1–136 (2018).
Goldhammer, A. & Ayer, J. Jr. Understanding performance of PCI express systems. Xilinx WP350 350, 1–18 (2008).
Xilinx. PCI Express for UltraScale architecture-based devices: integrated block for PCIe in the UltraScale architecture. 464, 1–15 (2015).
Hornik, K., Stinchcombe, M. & White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366 (1989).
Xilinx. UltraScale Architecture: DSP Slice User Guide (UG579). Xilinx.com (2020). Available at: https://www.xilinx.com/support/documentation/user_guides/ug579-ultrascale-dsp.pdf.
Xilinx. UltraScale Architecture Configurable Logic Block User Guide (UG574). Xilinx.com (2017). Available at: https://www.xilinx.com/support/documentation/user_guides/ug574-ultrascale-clb.pdf.
Chi, P. et al. PRIME: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. Proceedings of the 2016 43rd Int. Symp. Comput. Archit. ISCA 2016 27–39 (2016). https://doi.org/10.1109/ISCA.2016.13
Ghose, S., Boroumand, A., Kim, J. S., Gomez-Luna, J. & Mutlu, O. Processing-in-memory: a workload-driven perspective. IBM J. Res. Dev. 63, 3 (2019).
Sebastian, A., Le Gallo, M., Khaddam-Aljameh, R. & Eleftheriou, E. Memory devices and applications for in-memory computing. Nat. Nanotechnol. 15, 529–544 (2020).
Lu, Z., Arafin, M. T. & Qu, G. RIME: A Scalable and Energy-Efficient Processing-In-Memory Architecture for Floating-Point Operations. Proc. Asia South Pacific Des. Autom. Conf. ASP-DAC 120–125 (2021). https://doi.org/10.1145/3394885.3431524
Bavikadi, S., Sutradhar, P. R., Khasawneh, K. N., Ganguly, A. & Dinakarrao, S. M. P. A review of in-memory computing architectures for machine learning applications. Proc. ACM Gt. Lakes Symp. VLSI, GLSVLSI 89–94 (2020). https://doi.org/10.1145/3386263.3407649
Are, W., Point, F. & Layout, S. IEEE Standard 754 Floating Point Numbers. 1–7 (2011).
Gupta, S., Agrawal, A., Gopalakrishnan, K. & Narayanan, P. Deep learning with limited numerical precision. 32nd Int. Conf. Mach. Learn. ICML 2015 3, 1737–1746 (2015).
Han, S., Mao, H. & Dally, W. J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. Int. Conf. Learn. Represent. 1–14 (2016).
Alemdar, H., Leroy, V., Prost-Boucle, A. & Petrot, F. Ternary neural networks for resource-efficient AI applications. Proc. Int. Jt. Conf. Neural Networks 2017-May, 2547–2554 (2017).
Marra, S., Iachino, M. A. & Morabito, F. C. High speed, programmable implementation of a tanh-like activation function and its derivative for digital neural networks. IEEE Int. Conf. Neural Networks - Conf. Proc. 506–511 (2007). https://doi.org/10.1109/IJCNN.2007.4371008
Zheng, D., Zhang, X., Pui, C. W. & Young, E. F. Y. Multi-FPGA Co-optimization: Hybrid Routing and Competitive-based Time Division Multiplexing Assignment. Proc. Asia South Pacific Des. Autom. Conf. ASP-DAC 176–182 (2021). https://doi.org/10.1145/3394885.3431565
Zou, P. et al. Time-Division Multiplexing Based System-Level FPGA Routing for Logic Verification. in 2020 57th ACM/IEEE Design Automation Conference (DAC) 2020-July, 1–6 (IEEE, 2020).
Lee, K., Yoo, D., Jeong, W. & Han, S. SIMPLE-NN: An efficient package for training and executing neural-network interatomic potentials. Comput. Phys. Commun. 242, 95–103 (2019).
Lu, D. et al. DP Train, then DP Compress: Model Compression in Deep Potential Molecular Dynamics. Preprint at http://arxiv.org/abs/2107.02103 (2021).
Sedova, A., Eblen, J. D., Budiardja, R., Tharrington, A. & Smith, J. C. High-performance molecular dynamics simulation for biological and materials sciences: Challenges of performance portability. Proc. P3HPC 2018 Int. Work. Performance, Portability Product. HPC, Held conjunction with SC 2018 Int. Conf. High Perform. Comput. Networking, Storage Anal. 1–13 (2019). https://doi.org/10.1109/P3HPC.2018.00004

Acknowledgements

We thank Han Wang, Linfeng Zhang, Denghui Lu, Wanrun Jiang, Jun Cheng, Yongbin Zhuang, and Jianxing Huang for taking the time to try and test NVNMD and for their helpful suggestions for improving it. We thank experts from the DeePMD community for their helpful discussions and technical support. This work is supported by the National Natural Science Foundation of China (#61804049); the Fundamental Research Funds for the Central Universities of P.R. China; the Huxiang High Level Talent Gathering Project (#2019RS1023); the Key Research and Development Project of Hunan Province, P.R. China (#2019GK2071); the Technology Innovation and Entrepreneurship Funds of Hunan Province, P.R. China (#2019GK5029); and the Fund for Distinguished Young Scholars of Changsha (#kq1905012).

Author information

Authors and Affiliations

College of Electrical and Information Engineering, Hunan University, Changsha, Hunan, PR China
Pinghui Mo, Chang Li, Dan Zhao, Yujia Zhang, Mengchao Shi, Junhua Li & Jie Liu
Department of Electrical and Computer Engineering, University of Washington, Seattle, WA, USA
Jie Liu

Contributions

Pinghui Mo, Chang Li, Dan Zhao, Yujia Zhang, and Jie Liu implemented and tested the NVNMD system; Mengchao Shi and Junhua Li generated the DFT data for training and testing the NVNMD system; Jie Liu proposed the idea and led the research; Pinghui Mo and Jie Liu composed the manuscript.

Corresponding author

Correspondence to Jie Liu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Mo, P., Li, C., Zhao, D. et al. Accurate and efficient molecular dynamics based on machine learning and non von Neumann architecture. npj Comput Mater 8, 107 (2022). https://doi.org/10.1038/s41524-022-00773-z

Received: 09 December 2021
Accepted: 29 March 2022
Published: 09 May 2022
DOI: https://doi.org/10.1038/s41524-022-00773-z
