Introduction

Molecular modeling plays a crucial role in modern scientific and engineering fields, aiding the understanding of chemical reactions, facilitating new drug development, and driving scientific and technological advances1,2,3,4. One commonly used method in molecular modeling is density functional theory (DFT). DFT enables accurate calculations of energy, forces, and other chemical properties of molecules5,6. However, DFT calculations often demand significant computational resources and time, particularly for large molecular systems or high-precision calculations. Machine learning (ML) offers an alternative solution by learning from reference data with ab initio accuracy and high computational efficiency7,8. Gradient-domain machine learning (GDML)9 constructs accurate molecular force fields using conservation of energy and limited samples from ab initio molecular dynamics trajectories, enabling cost-effective simulations while maintaining accuracy. Symmetric GDML (sGDML)10 further improves force field construction by incorporating physical symmetries, achieving CCSD(T)-level accuracy for flexible molecules. An exact iterative approach (Global sGDML)11 extends sGDML to global force fields for molecules with several hundred atoms, preserving correlations between atomic degrees of freedom and accurately describing complex molecules and materials. In recent years, deep learning (DL) has demonstrated a powerful ability to learn from raw data without any hand-crafted features in many fields and has thus attracted more and more attention. However, the inherent drawback of deep learning, its need for large amounts of data, has become a bottleneck for its application to more scenarios12. To alleviate the dependency of DL potentials on data, recent works have incorporated the inductive bias of symmetry into neural network design, known as geometric deep learning (GDL). Symmetry describes the conservation of physical laws, i.e., physical properties that remain unchanged under transformations such as translations or rotations. It allows GDL to be extended to limited-data scenarios without any data augmentation.

The equivariant graph neural network (EGNN) is one of the representative approaches in GDL, with extensive capability to model molecular geometry12,13,14,15,16,17,18,19,20,21. One popular kind of EGNN derives equivariance from directional information and involves geometric features to predict molecular properties. GemNet20 extends the invariant DimeNet/DimeNet++16,17 with dihedral information. These models explicitly extract geometric information in Euclidean space with first-order geometric tensors, i.e., setting lmax = 1. PaiNN18 and the equivariant transformer19 further adopt vector embeddings and scalarize the angular representation implicitly via the inner product of the vector embedding with itself. They reduce the complexity of explicit geometry extraction while taking the angular information into consideration. Another mainstream approach to achieving equivariance is through group representation theory, which can achieve higher accuracy but comes with large computational costs. NequIP, Allegro, and MACE12,22,23 achieve state-of-the-art performance on several molecular dynamics simulation datasets by leveraging high-order geometric tensors. On the one hand, algorithms based on group representation theory have strong mathematical foundations and are able to fully utilize geometric information using high-order geometric tensors. On the other hand, these algorithms often require computationally expensive operations such as the Clebsch–Gordan product (CG-product)24, making them possibly suitable for periodic systems with elaborate model design but impractical for large molecular systems such as chemical and biological molecules without periodic boundary conditions.

In this study, we propose ViSNet (short for “Vector-Scalar interactive graph neural Network”), which alleviates the dilemma between computational cost and sufficient utilization of geometric information. By incorporating an elaborate runtime geometry calculation (RGC) strategy, ViSNet implicitly extracts various geometric features, i.e., angles, dihedral torsion angles, and improper angles in accordance with the force fields of classical MD, with linear time complexity, thus significantly accelerating model training and inference while reducing memory consumption. To extend the vector representation, we introduce spherical harmonics and simplify the computationally expensive Clebsch–Gordan product with the inner product. Furthermore, we present a well-designed vector–scalar interactive equivariant message passing (ViS-MP) mechanism, which fully utilizes the geometric features by interacting vector hidden representations with scalar ones. When comprehensively evaluated on benchmark datasets, ViSNet outperforms all state-of-the-art algorithms on all molecules in the MD17, revised MD17, and MD22 datasets and shows superior performance on the QM9 and Molecule3D datasets, indicating its powerful capability of molecular geometric representation. ViSNet also won the PCQM4Mv2 track of the OGB-LSC@NeurIPS2022 competition (https://ogb.stanford.edu/neurips2022/results/). We then performed molecular dynamics simulations for each molecule in MD17 driven by ViSNet trained with only limited data (950 samples). The highly consistent interatomic distance distributions and explored potential energy surfaces between ViSNet and quantum simulation illustrate that ViSNet is genuinely data-efficient and can perform simulations with high fidelity. To further explore the usefulness of ViSNet for real-world applications, we used an in-house dataset that consists of about 10,000 different conformations of the 166-atom mini-protein Chignolin derived from replica exchange molecular dynamics and calculated at the DFT level. When evaluated on this dataset, ViSNet also achieved significantly better performance than empirical force fields, and the simulations performed by ViSNet produced forces very close to those from DFT. In addition, ViSNet exhibits reasonable interpretability, mapping geometric representations to molecular structures. The contributions of ViSNet can be summarized as follows:

  • Proposing an RGC module that utilizes high-order geometric tensors to implicitly extract various geometric features, including angles, dihedral torsion angles, and improper angles, with linear time complexity.

  • Introducing ViS-MP mechanism to enable efficient interaction between vector hidden representations and scalar ones and fully exploit the geometric information.

  • Achieving state-of-the-art performance in six benchmarks for predicting energy, forces, HOMO-LUMO gap, and other quantum properties of molecules.

  • Performing molecular dynamics simulations driven by ViSNet on both small molecules and 166-atom Chignolin with high fidelity.

  • Demonstrating reasonable model interpretability between geometric features and molecular structures.

Results

Overview of ViSNet

ViSNet is a versatile EGNN that predicts potential energy, atomic forces as well as various quantum chemical properties by taking atomic coordinates and numbers as inputs. As shown in Fig. 1a, the model is composed of an embedding block and multiple stacked ViSNet blocks, followed by an output block. The atomic number and coordinates are fed into the embedding block followed by ViSNet blocks to extract and encode geometric representations. The geometric representations are then used to predict molecular properties through the output block. It is worth noting that ViSNet is an energy-conserving potential, i.e., the predicted atomic forces are derived from the negative gradients of the potential energy with respect to the coordinates9,10.

Fig. 1: The overall architecture of ViSNet.
figure 1

a Model sketch of ViSNet. ViSNet embeds the 3D structures of molecules, extracts the geometric information through a series of ViSNet blocks, and outputs the molecular properties such as energy, forces, and the HOMO–LUMO gap through an output block. b Flowchart of one ViSNet block. One ViSNet block consists of two modules: (i) Scalar2Vec, responsible for attaching scalar embeddings to vectors; (ii) Vec2Scalar, which updates scalar embeddings based on the RGC strategy. The inputs of Scalar2Vec are the node embedding hi, edge embedding fij, direction unit \({\overrightarrow{v}}_{i}\) and the relative positions between two atoms. The edge-fusion graph attention module (serving as \({\phi }_{{\rm {m}}}^{{\rm {s}}}\)) takes as input hi and the output of the dense layer following fij, and outputs scalar messages. Before aggregation, each scalar message is transformed through a dense layer and then fused with the unit vector of the relative position \({\overrightarrow{u}}_{ij}\) and its own direction unit \({\overrightarrow{v}}_{j}\). We further compute the vector messages and aggregate them over the neighborhood. Through a gated residual connection, the final residual \({{\Delta }}{\overrightarrow{v}}_{i}\) is produced. In the Vec2Scalar module, the final Δhi is obtained by the Hadamard product of the aggregated scalar messages and the output of the RGC-Angle calculation, together with a gated residual connection. Likewise, combining the projected fij and the output of the RGC-Dihedral calculation, the final Δfij is determined.

The success of classical force fields shows that geometric features such as interatomic distances, angles, dihedral torsion angles, and improper angles in Fig. 2 are essential to determine the total potential energy of molecules. The explicit extraction of invariant geometric representations in previous studies often suffers from a large amount of time or memory consumption during model training and inference. Given an atom, the calculation of angular information scales as \({{{{{{{\mathcal{O}}}}}}}}({{{{{{{{\mathcal{N}}}}}}}}}^{2})\) with the number of neighboring atoms, while the computational complexity is even \({{{{{{{\mathcal{O}}}}}}}}({{{{{{{{\mathcal{N}}}}}}}}}^{3})\) for dihedrals20. To alleviate this problem, inspired by Schütt et al.18, we propose runtime geometry calculation (RGC), which uses an equivariant vector representation (termed the direction unit) for each node to preserve its geometric information. RGC directly calculates the geometric information from the direction unit, which only requires summing the unit vectors from the target node to its neighbors once. Therefore, the computational complexity can be reduced to \({{{{{{{\mathcal{O}}}}}}}}({{{{{{{\mathcal{N}}}}}}}})\). Notably, beyond employing the angular information used in PaiNN18 and ET19, ViSNet further considers dihedral torsion and improper angle calculations with higher-order geometric tensors.

Fig. 2: Illustration of runtime geometry calculation (RGC) module and its relevance to the potential of bonded terms in classical molecular dynamics.
figure 2

The bonded terms consist of bond length, bond angle, dihedral torsion, and improper angle. The RGC module depicts all bonded terms of classical MD as model operations in linear time complexity. Yellow arrow \({\overrightarrow{v}}_{i}\) denotes the direction unit in Eq. (1).

Considering the sub-structure of a toy molecule with four atoms shown in Fig. 2, the angular information of the target node i can be derived from the vector \({\overrightarrow{r}}_{ij}\) as follows:

$${\overrightarrow{u}}_{ij}=\frac{{\overrightarrow{r}}_{ij}}{\left|{\overrightarrow{r}}_{ij}\right|},\quad {\overrightarrow{v}}_{i}=\mathop{\sum }\limits_{j=1}^{{N}_{i}}{\overrightarrow{u}}_{ij}$$
(1)
$${\left|{\overrightarrow{v}}_{i}\right|}^{2}=\mathop{\sum }\limits_{j=1}^{{N}_{i}}\mathop{\sum }\limits_{k=1}^{{N}_{i}}\left\langle {\overrightarrow{u}}_{ij},{\overrightarrow{u}}_{ik}\right\rangle=\mathop{\sum }\limits_{j=1}^{{N}_{i}}\mathop{\sum }\limits_{k=1}^{{N}_{i}}\cos {\theta }_{jik}$$
(2)

where \({\overrightarrow{r}}_{ij}\) is the vector from node i to its neighboring node j and \({\overrightarrow{u}}_{ij}\) is the unit vector of \({\overrightarrow{r}}_{ij}\). Here, we define the direction unit \({\overrightarrow{v}}_{i}\) as the sum of all unit vectors from node i to all of its neighboring nodes j, where node i is the intersection of all unit vectors. As shown in Eq. (2), we calculate the inner product of the direction unit \({\overrightarrow{v}}_{i}\) with itself, which equals the sum of the inner products of the unit vectors from node i to all its neighboring nodes. Combining with Eq. (1), the inner product of the direction unit \({\overrightarrow{v}}_{i}\) thus stands for the sum of cosine values of all angles formed by node i and any two of its neighboring nodes.
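As a minimal numerical illustration of Eqs. (1) and (2) (a NumPy sketch with made-up coordinates, not code from the ViSNet implementation), the direction unit is accumulated in a single pass over the neighbors, and its squared norm reproduces the pairwise sum of angle cosines:

```python
import numpy as np

# Hypothetical local geometry: node i at the origin with three neighbors.
r_i = np.zeros(3)
r_neighbors = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.2, 0.0],
                        [0.3, 0.4, 1.1]])

# Eq. (1): unit vectors from node i to each neighbor and the direction unit v_i.
u_ij = r_neighbors - r_i
u_ij /= np.linalg.norm(u_ij, axis=1, keepdims=True)
v_i = u_ij.sum(axis=0)

# Eq. (2): |v_i|^2 equals the double sum of cos(theta_jik) over neighbor pairs
# (including the trivial j = k terms), obtained here without an explicit O(N^2) loop.
lhs = np.dot(v_i, v_i)
rhs = sum(np.dot(u_ij[j], u_ij[k]) for j in range(3) for k in range(3))
assert np.isclose(lhs, rhs)
```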

Similar to the runtime angle calculation, we also calculate the vector rejection25 of the direction units \({\overrightarrow{v}}_{i}\) of node i and \({\overrightarrow{v}}_{j}\) of node j on the vectors \({\overrightarrow{u}}_{ij}\) and \({\overrightarrow{u}}_{ji}\), respectively.

$${\overrightarrow{w}}_{ij}={{{{{{{{\rm{Rej}}}}}}}}}_{{\overrightarrow{u}}_{ij}}\left({\overrightarrow{v}}_{i}\right)= {\overrightarrow{v}}_{i}-\left\langle {\overrightarrow{v}}_{i},{\overrightarrow{u}}_{ij}\right\rangle {\overrightarrow{u}}_{ij} \\= \mathop{\sum }\limits_{m=1}^{{N}_{i}}{{{{{{{{\rm{Rej}}}}}}}}}_{{\overrightarrow{u}}_{ij}}\left({\overrightarrow{u}}_{im}\right)\\ {\overrightarrow{w}}_{ji}={{{{{{{{\rm{Rej}}}}}}}}}_{{\overrightarrow{u}}_{ji}}\left({\overrightarrow{v}}_{j}\right)= {\overrightarrow{v}}_{j}-\left\langle {\overrightarrow{v}}_{j},{\overrightarrow{u}}_{ji}\right\rangle {\overrightarrow{u}}_{ji} \\= \mathop{\sum }\limits_{n=1}^{{N}_{j}}{{{{{{{{\rm{Rej}}}}}}}}}_{{\overrightarrow{u}}_{ji}}\left({\overrightarrow{u}}_{jn}\right)$$
(3)

where \({{{{{{{{\rm{Rej}}}}}}}}}_{\overrightarrow{b}}(\overrightarrow{a})\) represents the vector component of \(\overrightarrow{a}\) perpendicular to \(\overrightarrow{b}\), termed the vector rejection. \({\overrightarrow{u}}_{ij}\) and \({\overrightarrow{v}}_{i}\) are defined in Eq. (1). \({\overrightarrow{w}}_{ij}\) represents the sum of the vector rejections \({{{{{{{{\rm{Rej}}}}}}}}}_{{\overrightarrow{u}}_{ij}}({\overrightarrow{u}}_{im})\) and \({\overrightarrow{w}}_{ji}\) represents the sum of the vector rejections \({{{{{{{{\rm{Rej}}}}}}}}}_{{\overrightarrow{u}}_{ji}}({\overrightarrow{u}}_{jn})\). The inner product between \({\overrightarrow{w}}_{ij}\) and \({\overrightarrow{w}}_{ji}\) is then calculated to derive the dihedral torsion angle information of the intersecting edge eij as follows:

$$\left\langle {\overrightarrow{w}}_{ij},{\overrightarrow{w}}_{ji}\right\rangle =\mathop{\sum }\limits_{m=1}^{{N}_{i}}\mathop{\sum }\limits_{n=1}^{{N}_{j}}\left\langle {{{{{{{{\rm{Rej}}}}}}}}}_{{\overrightarrow{u}}_{ij}}\left({\overrightarrow{u}}_{im}\right),{{{{{{{{\rm{Rej}}}}}}}}}_{{\overrightarrow{u}}_{ji}}\left({\overrightarrow{u}}_{jn}\right)\right\rangle \\ =\mathop{\sum }\limits_{m=1}^{{N}_{i}}\mathop{\sum }\limits_{n=1}^{{N}_{j}}\cos {\varphi }_{mijn}$$
(4)
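The following sketch (again NumPy with placeholder coordinates, not the ViSNet code) illustrates Eqs. (3) and (4): the rejections of the direction units on the intersecting edge reproduce the double sum of rejection inner products around that edge, since the contribution of \({\overrightarrow{u}}_{ij}\) (and \({\overrightarrow{u}}_{ji}\)) to its own rejection vanishes.

```python
import numpy as np

def reject(a, b_unit):
    """Vector rejection: component of a perpendicular to the unit vector b_unit."""
    return a - np.dot(a, b_unit) * b_unit

unit = lambda v: v / np.linalg.norm(v)

# Placeholder coordinates: edge (i, j), one extra neighbor m of i and n of j.
r_i, r_j = np.array([0.0, 0.0, 0.0]), np.array([1.5, 0.0, 0.0])
r_m, r_n = np.array([-0.5, 1.0, 0.2]), np.array([2.0, 0.8, -0.9])

u_ij, u_ji = unit(r_j - r_i), unit(r_i - r_j)
u_im, u_jn = unit(r_m - r_i), unit(r_n - r_j)

# Direction units of this toy neighborhood (Eq. (1)).
v_i, v_j = u_ij + u_im, u_ji + u_jn

# Eq. (3): rejections of the direction units on the intersecting edge.
w_ij, w_ji = reject(v_i, u_ij), reject(v_j, u_ji)

# Eq. (4): their inner product equals the sum over (m, n) rejection pairs,
# which carries the dihedral information around edge (i, j).
lhs = np.dot(w_ij, w_ji)
rhs = np.dot(reject(u_im, u_ij), reject(u_jn, u_ji))  # only one (m, n) pair here
assert np.isclose(lhs, rhs)
```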

The improper angle is derived from a pyramidal structure formed by four nodes. As shown for the last toy molecule in Fig. 2, node i is the vertex of the pyramid, and the improper torsion angle is formed by two adjacent planes with an intersecting edge eij. We can also calculate the improper angle by vector rejection:

$$\begin{array}{rlr}{\overrightarrow{t}}_{ij}={{{{{{{{\rm{Rej}}}}}}}}}_{{\overrightarrow{u}}_{ij}}\left({\overrightarrow{v}}_{i}\right)&=\mathop{\sum }\limits_{m=1}^{{N}_{i}}{{{{{{{{\rm{Rej}}}}}}}}}_{{\overrightarrow{u}}_{ij}}\left({\overrightarrow{u}}_{im}\right)&\\ {\overrightarrow{t}}_{ji}={{{{{{{{\rm{Rej}}}}}}}}}_{{\overrightarrow{u}}_{ji}}\left({\overrightarrow{v}}_{i}\right)&=\mathop{\sum }\limits_{n=1}^{{N}_{i}}{{{{{{{{\rm{Rej}}}}}}}}}_{{\overrightarrow{u}}_{ji}}\left({\overrightarrow{u}}_{in}\right)\end{array}$$
(5)

In the same way, the inner product between \({\overrightarrow{t}}_{ij}\) and \({\overrightarrow{t}}_{ji}\) indicates the summation of improper angle information formed by eij:

$$\left\langle {\overrightarrow{t}}_{ij},{\overrightarrow{t}}_{ji}\right\rangle =\mathop{\sum }\limits_{m=1}^{{N}_{i}}\mathop{\sum }\limits_{n=1}^{{N}_{i}}\left\langle {{{{{{{{\rm{Rej}}}}}}}}}_{{\overrightarrow{u}}_{ij}}\left({\overrightarrow{u}}_{im}\right),{{{{{{{{\rm{Rej}}}}}}}}}_{{\overrightarrow{u}}_{ji}}\left({\overrightarrow{u}}_{in}\right)\right\rangle \\ =\mathop{\sum }\limits_{m=1}^{{N}_{i}}\mathop{\sum }\limits_{n=1}^{{N}_{i}}\cos {\psi }_{mijn}$$
(6)

Multiple works have shown the effectiveness of high-order geometric tensors for molecular modeling12,22,26,27. However, the computational overheads of these approaches are generally expensive due to the CG-product, impeding their further application to large systems. In this work, we convert the vectors to high-order representations with spherical harmonics but replace the CG-product with the inner product, following the idea of RGC. We find that the extended high-order geometric tensors can still represent the above angular information in the form of Legendre polynomials according to the addition theorem:

$${P}_{l}(\cos {\theta }_{jik})= {P}_{l}\left({\overrightarrow{u}}_{ij}\cdot {\overrightarrow{u}}_{ik}\right) \\ \propto \mathop{\sum }\limits_{m=-l}^{l}{Y}_{l,m}\left({\overrightarrow{u}}_{ij}\right){Y}_{l,m}^{*}\left({\overrightarrow{u}}_{ik}\right)$$
(7)

where Pl is the Legendre polynomial of degree l, Yl,m denotes the spherical harmonics function and \({Y}_{l,m}^{*}\) denotes its complex conjugate. We sum the products over different degrees l to obtain the scalar angular representation, which is the same operation as the inner product. It is worth noting that such an extension does not increase the model size and keeps the model architecture unchanged. We also provide proof of the rotational invariance of the RGC strategy in the section “Proofs of the rotational invariance of RGC”.
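Equation (7) can also be checked numerically. The sketch below (using SciPy, not part of the ViSNet implementation) verifies the addition theorem for two random unit vectors at l = 2, with the proportionality constant 4π/(2l + 1) written out explicitly:

```python
import numpy as np
from scipy.special import eval_legendre, sph_harm

def to_angles(u):
    """Convert a unit vector to SciPy's (azimuthal theta, polar phi) convention."""
    theta = np.arctan2(u[1], u[0]) % (2 * np.pi)
    phi = np.arccos(np.clip(u[2], -1.0, 1.0))
    return theta, phi

rng = np.random.default_rng(0)
u, v = rng.normal(size=3), rng.normal(size=3)
u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)

l = 2
tu, pu = to_angles(u)
tv, pv = to_angles(v)
lhs = eval_legendre(l, np.dot(u, v))                 # P_l(u . v)
rhs = (4 * np.pi / (2 * l + 1)) * sum(
    sph_harm(m, l, tu, pu) * np.conj(sph_harm(m, l, tv, pv))
    for m in range(-l, l + 1))
assert np.isclose(lhs, rhs.real)
```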

In order to make full use of geometric information and enhance the interaction between scalars and vectors, we designed an effective vector–scalar interactive message-passing mechanism with respect to the intersecting nodes and edges for angles and dihedrals, respectively. It is important to note that previous studies18,19 primarily focused on updating node features, whereas our approach updates both node and edge features during message passing, leading to a more comprehensive geometric representation. The key operations in ViS-MP are given as follows:

$${m}_{i}^{l}=\mathop{\sum}\limits_{j\in {{{{{{{\mathcal{N}}}}}}}}(i)}{\phi }_{m}^{s}\left({h}_{i}^{l},\, {h}_{j}^{l},\, {f}_{ij}^{l}\right)$$
(8)
$${\overrightarrow{m}}_{i}^{l}=\mathop{\sum}\limits_{j\in {{{{{{{\mathcal{N}}}}}}}}(i)}{\phi }_{m}^{v}\left({m}_{ij}^{l},{\overrightarrow{r}}_{ij},{\overrightarrow{v}}_{j}^{l}\right)$$
(9)
$${h}_{i}^{l+1}={\phi }_{un}^{s}\left({h}_{i}^{l},\, {m}_{i}^{l},\, \left\langle {\overrightarrow{v}}_{i}^{l},\, {\overrightarrow{v}}_{i}^{l}\right\rangle \right)$$
(10)
$${f}_{ij}^{l+1}={\phi }_{ue}^{s}\left({f}_{ij}^{l},\left\langle {{{{{{{{\rm{Rej}}}}}}}}}_{{\overrightarrow{r}}_{ij}}\left({\overrightarrow{v}}_{i}^{l}\right),\,{{{{{{{{\rm{Rej}}}}}}}}}_{{\overrightarrow{r}}_{ji}}\left({\overrightarrow{v}}_{j}^{l}\right)\right\rangle \right)$$
(11)
$${\overrightarrow{v}}_{i}^{l+1}={\phi }_{un}^{v}\left({\overrightarrow{v}}_{i}^{l},\, {m}_{i}^{l},\, {\overrightarrow{m}}_{i}^{l}\right)$$
(12)

where hi denotes the scalar embedding of node i and fij stands for the edge feature between node i and node j. \({\overrightarrow{v}}_{i}\) represents the embedding of the direction unit mentioned in RGC. The superscript of a variable indicates the index of the block that the variable belongs to. We omit the improper angle here for brevity; a comprehensive version is depicted in the Supplementary Information. ViS-MP extends the conventional message passing, aggregation, and update processes with vector–scalar interactions. Equations (8) and (9) depict our message-passing and aggregation processes. To be concrete, scalar messages mij incorporating the scalar embeddings hi, hj, and fij are passed and then aggregated to node i through a message function \({\phi }_{m}^{s}\) (Eq. (8)). Similar operations are applied for the vector message \({\overrightarrow{m}}_{i}^{l}\) of node i, which incorporates the scalar message mij, the vector \({\overrightarrow{r}}_{ij}\) and the vector embedding \({\overrightarrow{v}}_{j}\) (Eq. (9)). Equations (10) and (11) describe the update processes. hi is updated from the aggregated scalar message mi together with the inner product of \({\overrightarrow{v}}_{i}\) through an update function \({\phi }_{un}^{s}\). Then fij is updated by the inner product of the rejections of the vector embeddings \({\overrightarrow{v}}_{i}\) and \({\overrightarrow{v}}_{j}\) through an update function \({\phi }_{ue}^{s}\). Finally, the vector embedding \({\overrightarrow{v}}_{i}\) is updated by both scalar and vector messages through an update function \({\phi }_{un}^{v}\) (Eq. (12)). Notably, the vector update functions, i.e., ϕv, are required to be equivariant. The detailed message and update functions can be found in the Methods section. A proof of the equivariance of ViS-MP can be found in Supplementary Methods.

In summary, the geometric features are extracted by inner products in the RGC strategy, and the scalar and vector embeddings cyclically update each other in ViS-MP so as to learn a comprehensive geometric representation from molecular structures.

Accurate quantum chemical property predictions

We evaluated ViSNet on several prevailing benchmark datasets including MD179,10,28, revised MD1729, MD2230, QM931, Molecule3D32, and OGB-LSC PCQM4Mv233 for energy, force, and other molecular property prediction. MD17 consists of the MD trajectories of seven small organic molecules; the number of conformations in each molecule dataset ranges from 133,700 to 993,237. The rMD17 dataset is a reproduced version of MD17 with higher accuracy. MD22 is a recently proposed MD trajectory dataset that presents challenges with respect to larger system sizes (42–370 atoms). Large molecules such as proteins, lipids, carbohydrates, nucleic acids, and supramolecules are included in MD22. QM9 consists of 12 kinds of quantum chemical properties of 133,385 small organic molecules with up to 9 heavy atoms. Molecule3D is a recently proposed dataset including 3,899,647 molecules collected from PubChemQC with their ground-state structures and corresponding properties calculated by DFT. We focus on the prediction of the HOMO–LUMO gap following ComENet34. OGB-LSC PCQM4Mv2 is a quantum chemistry dataset originally curated from the PubChemQC project, including DFT-calculated HOMO–LUMO gaps of 3,746,619 molecules. The 3D conformations are provided for 3,378,606 training molecules but not for the validation and test sets. The training details of ViSNet on each benchmark are described in the “Methods” section.

We compared ViSNet with state-of-the-art algorithms, including DimeNet16, PaiNN18, SpookyNet21, ET19, GemNet20, UNiTE35, NequIP12, SO3KRATES36, Allegro22, MACE23 and so on. As shown in Table 1 (MD17), Table 2 (rMD17), and Table 3 (MD22), it is remarkable that ViSNet outperformed the compared algorithms for both small (MD17 and rMD17) and large molecules (MD22) with the lowest mean absolute errors (MAE) of predicted energy and forces. On the one hand, compared with PaiNN, ET, and GemNet, ViSNet incorporated more geometric information and made full use of it in ViS-MP, which contributes to the performance gains. On the other hand, compared with NequIP, Allegro, SO3KRATES, MACE, etc., ViSNet demonstrated the benefit of introducing spherical harmonics in the RGC module.

Table 1 Mean absolute errors (MAE) of energy (kcal/mol) and force (kcal/mol/Å) for 7 small organic molecules on MD17 compared with state-of-the-art algorithms
Table 2 Mean absolute errors (MAE) of energy (kcal/mol) and force (kcal/mol/Å) for 10 small organic molecules on rMD17 compared with state-of-the-art algorithms
Table 3 Mean absolute errors (MAE) of energy (kcal/mol) and force (kcal/mol/Å) for 7 large-scale molecules on MD22

As shown in Table 4, ViSNet also achieved superior performance for chemical property prediction on QM9. It outperformed the compared algorithms on 9 of 12 chemical properties and achieved comparable results on the remaining properties. Detailed evaluations on Molecule3D confirmed the high prediction accuracy of ViSNet, as shown in Table 5. ViSNet achieved 33.6% and 6.51% improvements over the second-best method for the random split and scaffold split, respectively. Furthermore, ViSNet exhibited good portability to other multimodal methods, e.g., Transformer-M37, and outperformed other approaches on OGB-LSC PCQM4Mv2 (see Supplementary Fig. S1). ViSNet also won the PCQM4Mv2 track of the OGB-LSC@NeurIPS2022 competition when tested on unseen molecules38 (https://ogb.stanford.edu/neurips2022/results/).

Table 4 Mean absolute errors (MAE) of 12 kinds of molecular properties on QM9 compared with state-of-the-art algorithms
Table 5 Mean absolute errors (MAE) of HOMO–LUMO gap (eV) on Molecule3D test set for both random and scaffold splits compared with state-of-the-art algorithms

To evaluate the computational efficiency of ViSNet, following ref. 23, we compared its time latency with that of prevailing models in Supplementary Fig. S2. The latency is defined as the time it takes to compute forces on a structure (i.e., the gradient calculation for a set of input coordinates through the whole deep neural network). As shown in Supplementary Fig. S2, ViSNet (L = 2) saved 42.8% time latency compared with MACE (L = 2). Notably, despite the use of the CG-product, Allegro showed a significant speed improvement compared to NequIP and BOTNet. However, ViSNet still saved 6.1%, 4.1%, and 61% time latency compared to Allegro with L = 1, 2, and 3, respectively.

Efficient molecular dynamics simulations

To evaluate ViSNet as a potential for MD simulations, we incorporated ViSNet trained with only 950 samples on MD17 into the ASE simulation framework39 to perform MD simulations for all seven kinds of organic molecules. All simulations were run with a time step τ = 0.5 fs under the Berendsen thermostat, with the other settings the same as those of the MD17 dataset. As shown in Fig. 3, we analyzed the interatomic distance distributions derived from both MD simulations with ViSNet as the potential and ab initio molecular dynamics simulations at the DFT level for all seven molecules. As shown in Fig. 3a, the interatomic distance distribution h(r) is defined as the ensemble average of atomic density at a radius r9. Figure 3b–h illustrates that the distributions derived from ViSNet are very close to those generated by DFT. We also compared the potential energy surfaces sampled by ViSNet and DFT for these molecules (Supplementary Fig. S3). The consistent potential energy surfaces suggest that ViSNet can recover the conformational space from the simulation trajectories. Moreover, compared to DFT, numerous groundbreaking machine learning force fields (MLFFs), including sGDML10, ANI40, DPMD41, and PhysNet42, have proven their exceptional speeds in MD simulations. Similar to such algorithms, ViSNet also exhibited significant computational cost reduction compared to DFT, as shown in Supplementary Fig. S4 and Table S2.
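For reference, the sketch below shows how a trained ML potential can be wrapped as an ASE calculator and propagated with the Berendsen thermostat. The DummyModel class, its predict method, the 500 K target temperature, and the coupling constant are illustrative placeholders rather than the exact settings or API used for ViSNet:

```python
import numpy as np
from ase import units
from ase.build import molecule
from ase.calculators.calculator import Calculator, all_changes
from ase.md.nvtberendsen import NVTBerendsen
from ase.md.velocitydistribution import MaxwellBoltzmannDistribution


class DummyModel:
    """Stand-in for a trained potential mapping (Z, positions) -> (energy, forces)."""
    def predict(self, numbers, positions):
        return 0.0, np.zeros_like(positions)


class MLCalculator(Calculator):
    """Minimal ASE calculator wrapping an ML potential."""
    implemented_properties = ["energy", "forces"]

    def __init__(self, model, **kwargs):
        super().__init__(**kwargs)
        self.model = model

    def calculate(self, atoms=None, properties=("energy",), system_changes=all_changes):
        super().calculate(atoms, properties, system_changes)
        energy, forces = self.model.predict(atoms.get_atomic_numbers(),
                                            atoms.get_positions())
        self.results = {"energy": float(energy), "forces": forces}


atoms = molecule("CH3CH2OH")                     # ethanol as an example system
atoms.calc = MLCalculator(model=DummyModel())
MaxwellBoltzmannDistribution(atoms, temperature_K=500)
dyn = NVTBerendsen(atoms, timestep=0.5 * units.fs,
                   temperature_K=500, taut=100 * units.fs)
dyn.run(steps=1000)                              # 0.5 ps of dynamics in this toy run
```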

Fig. 3: The interatomic distance distributions of MD simulations driven by ViSNet and DFT.
figure 3

a An illustration of the atomic density at a radius r with an arbitrary atom as the center. The interatomic distance distribution is defined as the ensemble average of atomic density. b–h Comparison of the interatomic distance distributions between simulations by ViSNet and DFT for all seven organic molecules in MD17. The ViSNet curve is shown as a solid blue line, while the dashed orange line is used for the DFT curve. The structures of the corresponding molecules are shown in the upper right corner. Source data are provided as a Source Data file.

To further examine the molecular properties derived from simulations driven by ViSNet, we performed 500 ps MD simulations in the constant-energy (NVE) ensemble for ethanol in the MD17 dataset with a time step of τ = 0.5 fs, and 200 ps simulations for Ac-Ala3-NHMe in the MD22 dataset with a time step of τ = 1 fs. The simulations were driven by ViSNet, sGDML, and DFT, respectively. For ethanol, we analyzed its vibrational spectra and the probability distribution of dihedral angles. For Ac-Ala3-NHMe, we investigated its vibrational spectra and potential energy surface (PES) via the Ramachandran plot. To analyze the Ramachandran plot of different simulations, the free energy value was estimated using the potential of mean force (PMF). ϕ and ψ were set as the two reaction coordinates (x, y). All three ϕ and ψ dihedrals in Ac-Ala3-NHMe were calculated and plotted. The relative free energy was calculated with reference to the minimum value. To generate the landscape, 40 bins were used in both the x and y directions. Supplementary Fig. S5a and b demonstrate that both ViSNet and sGDML generate similar vibrational spectra, with slight differences in peak intensities compared to DFT. The probability distribution of the hydroxyl dihedral angle in ethanol (Supplementary Fig. S5c) reveals three minima: gauche ± (Mg±) and trans (Mt). Furthermore, even though ViSNet showed better performance than sGDML for various conformations in the MD22 dataset, starting from the same structure of the alanine tetrapeptide, the performance difference may not have a notable impact on the sampling efficiency for such small molecules, and thus may also lead to similar dynamics on the Ramachandran plots, as shown in Supplementary Fig. S5d–f. These results demonstrate that with only a few training samples, ViSNet can serve as a potential to perform high-fidelity molecular dynamics simulations with much less computational cost and higher accuracy.
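The free energy estimation described above follows the standard Boltzmann-inversion recipe; a sketch of that post-processing step is given below (NumPy only; the dihedral arrays are placeholders for values extracted from a trajectory, and the temperature is illustrative):

```python
import numpy as np

k_B = 0.0019872041          # kcal/(mol K)
temperature = 300.0         # illustrative value; use the simulation temperature
kT = k_B * temperature

# Placeholder (phi, psi) samples in degrees, e.g., extracted from a trajectory.
rng = np.random.default_rng(0)
phi = rng.uniform(-180.0, 180.0, size=20_000)
psi = rng.uniform(-180.0, 180.0, size=20_000)

# 2D histogram with 40 bins along each reaction coordinate, as described above.
counts, _, _ = np.histogram2d(phi, psi, bins=40, range=[[-180, 180], [-180, 180]])
prob = counts / counts.sum()

# Potential of mean force F = -kT ln p, referenced to the global minimum.
with np.errstate(divide="ignore"):
    pmf = -kT * np.log(prob)
pmf -= pmf[np.isfinite(pmf)].min()
print(pmf.shape)            # (40, 40) relative free energy surface in kcal/mol
```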

Applications for real-world full-atom proteins

To examine the usefulness of ViSNet in real-world applications, we made evaluations on the 166-atom mini-protein Chignolin (Fig. 4a). Based on a Chignolin dataset consisting of about 10,000 conformations sampled by replica exchange MD43 and calculated at the DFT level by Gaussian 1644 in our other studies45,46, we split it into training, validation, and test sets at a ratio of 8:1:1. We trained ViSNet as well as other prevailing MLFFs including ET19, PaiNN18, GemNet-OC47, MACE23, NequIP12, and Allegro22, and compared them with molecular mechanics (MM)48. The DFT results were used as the ground truth. Figure 4b shows the free energy landscape of Chignolin, depicted by dD3−G7 (the distance between the carbonyl oxygen on the D3 backbone and the nitrogen on the G7 backbone) and dE5−T8 (the distance between the carbonyl oxygen on the E5 backbone and the nitrogen on the T8 backbone). The concentrated energy basin on the left shows the folded state and the scattered energy basin on the right shows the unfolded state. We randomly selected six structures from different regions of the potential energy surface for visualization. Among them, four structures were predicted by the model with errors smaller than the MAE, while the other two had larger errors. Interestingly, all models consistently performed poorly on the structures with high potential energies (low probability of sampling) and performed well on the other structures. This implies that the sampling of conformations with high potential energies could be enhanced to ensure the generalization ability of the models.

Fig. 4: Applications of ViSNet for Chignolin conformational space evaluation and MD simulations.
figure 4

a The visualization of Chignolin structure. The backbone is colored grey while the side chains of each residue in Chignolin are highlighted with a ball and stick. b The energy landscape of Chignolin sampled by REMD. The x-axis of the landscape is the distance between carbonyl oxygen on the D3 backbone and nitrogen on the G7 backbone, while the y-axis is the distance between carbonyl oxygen on the E5 backbone and nitrogen on the T8 backbone. Six structures were then selected for visualization. Each structure is shown as a cartoon and residues are depicted in sticks. The histograms show the absolute error between the energy difference predicted by MLFFs including ViSNet, ET, PaiNN, GemNet-OC, NequIP, Allegro, and MACE or calculated by MM, and the ground truth calculated by DFT on the corresponding structure. c The average root mean square deviation (RMSD) of the Chignolin trajectories simulated by ViSNet was calculated from 10 different trajectories. The shaded areas indicate the standard deviation range. d The MAE of each component of atomic forces during the simulations driven by ViSNet. The ground truth energies and forces were calculated using Gaussian 16. The shaded areas indicate the standard deviation range. Source data are provided as a Source Data file.

Supplementary Fig. S6 shows the correlations between the energies predicted by MLFFs or MM and the ground truth values calculated by DFT for all conformations in the test set. ViSNet achieved a lower MAE and a higher R2 score than the other methods. From the violin plot of the absolute errors shown in Supplementary Fig. S7, ViSNet, PaiNN, and ET exhibited smaller errors than the other MLFFs, while MM showed a much wider range of prediction errors. Similar results can be seen in the force correlations for each component shown in Supplementary Fig. S8. Detailed settings of the DFT and MM calculations are given in the Supplementary Information. Furthermore, we also made a comprehensive comparison by taking model performance, training time consumption, and model size into consideration. ViSNet and other state-of-the-art algorithms such as PaiNN, ET, GemNet-OC, MACE, NequIP, and Allegro were analyzed on the Chignolin dataset, and the results are shown in Fig. 5. Although ViSNet is marginally slower than ET and PaiNN, it introduces more geometric information, which significantly enhances its performance. Compared to GemNet, which also incorporates dihedral angles, ViSNet’s computational cost is significantly more affordable. Similarly, ViSNet proves to be computationally efficient compared to models employing the CG-product, such as MACE, Allegro, and NequIP.

Fig. 5: The comparison of model performance (y-axis), training time consumption (x-axis), and training memory consumption (volume) among ViSNet (red) and other algorithms (grey) including PaiNN, ET, MACE, GemNet-OC, Allegro, and NequIP on Chignolin.
figure 5

PaiNN and ET are faster and smaller since ViSNet further incorporates dihedral calculation. ViSNet outperforms GemNet-OC due to its runtime geometry calculation, which reduces the complexity of explicit dihedral extraction from \({{{{{{{\mathcal{O}}}}}}}}({{{{{{{{\mathcal{N}}}}}}}}}^{3})\) to \({{{{{{{\mathcal{O}}}}}}}}({{{{{{{\mathcal{N}}}}}}}})\). Additionally, ViSNet is also faster and smaller than MACE, Allegro, and NequIP because it streamlines the CG-product. ViSNet achieves the best performance owing to its elaborate design, i.e., runtime geometry calculation and vector–scalar interactive message passing. Source data are provided as a Source Data file.

In addition, we performed MD simulations for Chignolin driven by ViSNet. Ten conformations were randomly selected as initial structures, and 100 ps simulations were run for each. As shown in Fig. 4c, the RMSD of the 10 simulation trajectories is plotted against the simulation time. In Fig. 4d, we display the MAE of each component of the atomic forces between ViSNet predictions and those calculated by Gaussian 1644 at the DFT level. The simulation trajectories driven by ViSNet exhibited small force differences for each component relative to quantum mechanics, which implies that ViSNet has no bias towards any force component and thus consolidates its accuracy and potential usefulness for real-world applications.

Interpretability of ViSNet on molecular structures

Prior works have shown the effectiveness of incorporating geometric features such as angles16,20. The primary means of geometry extraction in ViSNet is the inner product used in its runtime geometry calculation. To this end, we illustrate a reasonable model interpretability of ViSNet by mapping the angle representations derived from the inner product of direction units in the model to the atoms in the molecular structure. We aim to bridge the gap between the geometric representation in ViSNet and molecular structures. We visualized the embeddings after the inner product of direction units \(\langle {\overrightarrow{v}}_{i},{\overrightarrow{v}}_{i}\rangle\) extracted from 50 aspirin samples on the validation set. The high-dimensional embeddings were reduced to 2-dimensional space using t-SNE49 and then clustered using DBSCAN50 without specifying the number of clusters in advance.
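The dimensionality-reduction and clustering step can be reproduced with standard scikit-learn tools; the snippet below is a sketch in which randomly generated embeddings stand in for the \(\langle {\overrightarrow{v}}_{i},{\overrightarrow{v}}_{i}\rangle\) features extracted from the trained model, and the DBSCAN parameters are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.manifold import TSNE

# Placeholder: per-atom features after the inner product of direction units,
# collected from 50 aspirin conformations (21 atoms each) in the validation set.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50 * 21, 256))

# Reduce to 2D with t-SNE, then cluster with DBSCAN (no preset number of clusters).
coords_2d = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
labels = DBSCAN(eps=2.0, min_samples=10).fit_predict(coords_2d)
print(np.unique(labels))    # cluster labels; -1 marks noise points
```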

Supplementary Fig. S9 shows the clustering results of the nodes’ embeddings after the inner product of their corresponding direction units. We further map the clustered nodes to the atoms of the aspirin chemical structure. Interestingly, the embeddings of these nodes are distinctly gathered into several clusters shown in different colors. For example, although carbon atom C11 and carbon atom C12 occupy different positions and connect to different atoms, their inner products \(\langle {\overrightarrow{v}}_{i},{\overrightarrow{v}}_{i}\rangle\) are clustered into the same class because they are embedded in similar substructures ({C11O2O3C6} and {C12O1O4C13}). To summarize, ViSNet can discriminate different molecular substructures in the embedding space.

Ablation study

To further explore where the performance gains of ViSNet come from, we conducted a comprehensive ablation study. Specifically, we excluded the runtime angle calculation (w/o A), the runtime dihedral calculation (w/o D), and both of them (w/o A&D) in ViSNet, in order to evaluate the contribution of each part. ViSNet-improper denotes the variant with additional improper angles, and ViSNetl=1 uses only first-order spherical harmonics.

We also designed several model variants with different message-passing mechanisms based on ViS-MP for scalar and vector interaction. ViSNet-N directly aggregates the dihedral information to the intersecting nodes, and ViSNet-T leverages another form of dihedral calculation. The details of these model variants are elaborated in the Supplementary Information. The results of the ablation study are shown in Supplementary Table S3 and Supplementary Fig. S10. Based on the results, both kinds of directional geometric information are useful, and the dihedral information contributes slightly more to the final performance. The significant performance drop observed for ViSNet-N and ViSNet-T further validates the effectiveness of the ViS-MP mechanism. ViSNet-improper achieves similar performance to ViSNet for small molecules, but the contribution of improper angles is more obvious for large molecules (see Table 3). Furthermore, ViSNet using higher-order spherical harmonics achieves better performance.

Discussion

We propose ViSNet, a geometric deep learning potential for molecular dynamics simulation. The group representation theory-based methods and the directional information-based methods are two mainstream classes of geometric deep learning potentials to enforce SE(3) equivariance20. ViSNet takes advantage of both sides in designing the RGC strategy and ViS-MP mechanism. On the one hand, the RGC strategy explicitly extracts and exploits the directional geometric information with computationally lightweight operations, making the model training and inference fast. On the other hand, ViS-MP employs a series of effective and efficient vector-scalar interactive operations, leading to the full use of geometric information. Furthermore, according to the many-body expansion theory51,52,53, the potential energy of the whole system equals the potential of each single atom plus the energy corrections from two-bodies to many-bodies. Most of the previous studies model the truncated energy correction terms hierarchically with k-hop information via stacking k message passing blocks. Different from these approaches, ViSNet encodes the angle, dihedral torsion, and improper information in a single block, which empowers the model to have a much more powerful representation ability. In addition, ViSNet’s universality or completeness is not validated by the geometric Weisfeiler–Leman (GWL) test54 due to the inner product operation, which is computationally efficient but fails to distinguish certain atom reflection structures with the same angular information. To pass counterexamples or the GWL test, incorporating the CG-product with higher-order spherical harmonics is necessary in future studies.

Besides predicting energy, force, and chemical properties with high accuracy, performing molecular dynamics simulations with ab initio accuracy at the cost of the empirical force field is a grand challenge. ViSNet proves its usefulness in real-world ab initio molecular dynamics simulations with less computational costs and the ability of scaling to large molecules such as proteins. Extending ViSNet to support larger and more complex molecular systems will be our future research direction.

Methods

Equivariance

In the context of machine learning for atomic systems, equivariance is a pervasive concept. Specifically, atomic vectors such as dipoles or forces must rotate in a manner consistent with the conformation coordinates. In molecular dynamics, such equivariance can be ensured by computing gradients based on a predicted conservative scalar energy. Formally, a function \({{{{{{{\mathcal{F}}}}}}}}:{{{{{{{\mathcal{X}}}}}}}}\to {{{{{{{\mathcal{Y}}}}}}}}\) is equivariant if it guarantees:

$${{{{{{{\mathcal{F}}}}}}}}({\rho }_{{{{{{{{\mathcal{X}}}}}}}}}(g)\circ x)={\rho }_{{{{{{{{\mathcal{Y}}}}}}}}}(g)\circ {{{{{{{\mathcal{F}}}}}}}}(x),$$
(13)

where \({\rho }_{{{{{{{{\mathcal{X}}}}}}}}}(g)\) and \({\rho }_{{{{{{{{\mathcal{Y}}}}}}}}}(g)\) are group representations in the input and output spaces, respectively. The integration of equivariance into model parameterization has been shown to be effective, as seen in the implementation of shift-equivariance in CNNs, which is critical for enhancing the generalization capacity.
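In practice, Eq. (13) can be verified numerically for a force-predicting model by rotating the input coordinates and checking that the predicted forces rotate accordingly. The sketch below uses a toy pairwise force as a stand-in for a trained network:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def predict_forces(positions):
    """Placeholder model mapping positions (N, 3) to forces (N, 3).
    A pairwise spring force, which is exactly rotation-equivariant."""
    diff = positions[:, None, :] - positions[None, :, :]
    return -diff.sum(axis=1)

pos = np.random.default_rng(0).normal(size=(5, 3))
R = Rotation.random(random_state=0).as_matrix()

lhs = predict_forces(pos @ R.T)      # F(rho_X(g) o x): rotate, then predict
rhs = predict_forces(pos) @ R.T      # rho_Y(g) o F(x): predict, then rotate
assert np.allclose(lhs, rhs)
```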

Proofs of the rotational invariance of RGC

Assume that the molecule rotates in 3D space, i.e.,

$${\overrightarrow{{r}^{{\prime} }}}_{ij}=R{\overrightarrow{r}}_{ij}$$
(14)

where R ∈ SO(3) is an arbitrary rotation matrix that satisfies:

$$\det | R|=1,\,{R}^{{\rm {T}}}R=I$$
(15)

The angular information after rotation is calculated as follows:

$${\overrightarrow{{u}^{{\prime} }}}_{ij}=\frac{{\overrightarrow{{r}^{{\prime} }}}_{ij}}{\left|{\overrightarrow{{r}^{{\prime} }}}_{ij}\right|}=\frac{R{\overrightarrow{r}}_{ij}}{\det | R| \cdot \left|{\overrightarrow{r}}_{ij}\right|}=R{\overrightarrow{u}}_{ij}$$
(16)
$${\overrightarrow{{v}^{{\prime} }}}_{i}=\mathop{\sum }\limits_{j=1}^{{N}_{i}}{\overrightarrow{{u}^{{\prime} }}}_{ij}=R\mathop{\sum }\limits_{j=1}^{{N}_{i}}{\overrightarrow{u}}_{ij}=R{\overrightarrow{v}}_{i}$$
(17)
$${\left|{\overrightarrow{{v}^{{\prime} }}}_{i}\right|}^{2}= \left\langle {\overrightarrow{{v}^{{\prime} }}}_{i},{\overrightarrow{{v}^{{\prime} }}}_{i}\right\rangle={\left({\overrightarrow{{v}^{{\prime} }}}_{i}\right)}^{{\rm {T}}}{\overrightarrow{{v}^{{\prime} }}}_{i}\\= {{\overrightarrow{v}}_{i}}^{{\rm {T}}}{R}^{{\rm {T}}}R{\overrightarrow{v}}_{i}=\left\langle {\overrightarrow{v}}_{i},{\overrightarrow{v}}_{i}\right\rangle={\left|{\overrightarrow{v}}_{i}\right|}^{2}$$
(18)

As shown in Eq. (18), the angle information does not change after rotation. The dihedral angular and improper information is also rotationally invariant since:

$${\overrightarrow{{w}^{{\prime} }}}_{ij}={\overrightarrow{{v}^{{\prime} }}}_{i}-\left\langle {\overrightarrow{{v}^{{\prime} }}}_{i},{\overrightarrow{{u}^{{\prime} }}}_{ij}\right\rangle {\overrightarrow{{u}^{{\prime} }}}_{ij}=R{\overrightarrow{v}}_{i}-\left\langle R{\overrightarrow{v}}_{i} \,,R{\overrightarrow{u}}_{ij}\right\rangle R{\overrightarrow{u}}_{ij}$$
(19)

As Eq. (18) proved, the inner product has rotational invariance. Then, Eq. (19) can be further simplified as

$${\overrightarrow{{w}^{{\prime} }}}_{ij}=R\left({\overrightarrow{v}}_{i}-\left\langle {\overrightarrow{v}}_{i},{\overrightarrow{u}}_{ij}\right\rangle {\overrightarrow{u}}_{ij}\right)=R{\overrightarrow{w}}_{ij}$$
(20)

The dihedral or improper angular information after rotation is calculated as:

$$\left\langle {\overrightarrow{{w}^{{\prime} }}}_{ij},\, {\overrightarrow{{w}^{{\prime} }}}_{ji}\right\rangle=\left\langle R{\overrightarrow{w}}_{ij} \,,R{\overrightarrow{w}}_{ji}\right\rangle=\left\langle {\overrightarrow{w}}_{ij},\, {\overrightarrow{w}}_{ji}\right\rangle$$
(21)

As a result, Eqs. (18) and (21) have proved the rotational invariance of our proposed runtime geometry calculation (RGC).

We also provide proof of the equivariance of our ViS-MP in Supplementary Methods.

Detailed operations and modules in ViSNet

ViSNet predicts molecular properties (e.g., energy \(\hat{E}\), forces \(\overrightarrow{F}\in {{\mathbb{R}}}^{N\times 3}\), dipole moment μ) from the current states of atoms, including the atomic positions \(X\in {{\mathbb{R}}}^{N\times 3}\) and atomic numbers \(Z\in {{\mathbb{N}}}^{N}\). The architecture of the proposed ViSNet is shown in Fig. 1. The overall design of ViSNet follows the vector–scalar interactive message passing illustrated in Eqs. (8)–(12). First, an embedding block encodes the atomic numbers and edge distances into the embedding space. Then, a series of ViSNet blocks update the node-wise scalar and vector representations based on their interactions. A residual connection is placed between two ViSNet blocks. Finally, stacked gated equivariant blocks proposed in ref. 18 are attached in the output block for specific molecular property prediction.

The embedding block

ViSNet expands the direct node and edge embedding with their neighbors. It first embeds the atomic chemical symbol zi, and calculates the representations of edges whose distances are within the cutoff through radial basis functions (RBF). Then the initial embedding of atom i, its 1-hop neighbors j, and the directly connected edges eij within the cutoff are fused together as the initial node embedding \({h}_{i}^{0}\) and edge embedding \({f}_{ij}^{0}\). In summary, the embedding block is given by:

$${h}_{i}^{0},{f}_{ij}^{0}={{{{{{{\rm{Embedding}}}}}}}}\,{{{{{{{\rm{Block}}}}}}}}\left({z}_{i},\, {z}_{j},\, {e}_{ij}\right),\quad j\in {{{{{{{\mathcal{N}}}}}}}}(i)$$
(22)

\({{{{{{{\mathcal{N}}}}}}}}(i)\) denotes the set of 1-hop neighboring nodes of node i, and j is one of its neighbors. The embedding process is elaborated in the Supplementary Information. The initial vector embedding \({\overrightarrow{v}}_{i}\) is set to \(\overrightarrow{0}\). The vector embeddings \(\overrightarrow{v}\) are projected into the embedding space following ref. 18; \(\overrightarrow{v}\in {{\mathbb{R}}}^{N\times 3\times F}\), where F is the hidden dimension. The advantage of such a projection is that it assigns a unique high-dimensional representation to each embedding so that they can be discriminated from each other. Further discussions on its effectiveness and interpretability are given in the Results section.
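For illustration, a minimal PyTorch sketch of the ingredients of such an embedding block is shown below: an atomic-number embedding for nodes and an RBF expansion of edge distances damped by a cosine cutoff. The Gaussian RBF form, the dimensions, and the omission of the neighbor-fusion step are simplifications, not the exact ViSNet implementation:

```python
import math
import torch
import torch.nn as nn

class SimpleEmbeddingBlock(nn.Module):
    """Illustrative node/edge embedding; a simplified stand-in for Eq. (22)."""
    def __init__(self, hidden_dim=256, num_rbf=32, cutoff=5.0, max_z=100):
        super().__init__()
        self.cutoff = cutoff
        self.atom_emb = nn.Embedding(max_z, hidden_dim)
        self.register_buffer("centers", torch.linspace(0.0, cutoff, num_rbf))
        self.edge_proj = nn.Linear(num_rbf, hidden_dim)

    def forward(self, z, pos, edge_index):
        src, dst = edge_index                        # edges within the cutoff
        dist = (pos[dst] - pos[src]).norm(dim=-1)    # (E,)
        # Gaussian RBF expansion of distances, smoothly damped by a cosine cutoff.
        rbf = torch.exp(-(dist[:, None] - self.centers) ** 2)
        damp = 0.5 * (torch.cos(math.pi * dist / self.cutoff) + 1.0)
        f0 = self.edge_proj(rbf) * damp[:, None]     # initial edge embedding f_ij^0
        h0 = self.atom_emb(z)                        # initial node embedding h_i^0
        return h0, f0

# Toy usage: three atoms (C, H, O) and two directed edges.
z = torch.tensor([6, 1, 8])
pos = torch.randn(3, 3)
edge_index = torch.tensor([[0, 1], [1, 0]])
h0, f0 = SimpleEmbeddingBlock()(z, pos, edge_index)
print(h0.shape, f0.shape)                            # (3, 256) and (2, 256)
```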

The Scalar2Vec module

In the Scalar2Vec module, the vector embedding \(\overrightarrow{v}\) is updated by both the scalar messages derived from node and edge scalar embeddings (Eq. (8)) and the vector messages with inherent geometric information (Eq. (9)). The message of each atom is calculated through an Edge-Fusion Graph Attention module, which fuses the node and edge embeddings and computes the attention scores. The fusion of the node and edge embeddings could be the concatenation operation, Hadamard product, or adding a learnable bias55. We leverage the Hadamard product and the vanilla multi-head attention mechanism borrowed from Transformer56 for edge-node fusion.

Following ref. 19, we pass the fused representations through a nonlinear activation function as shown in Eq. (23). The value (V) in the attention mechanism is also fused with edge features before being multiplied by the attention scores weighted by a cosine cutoff, as shown in Eq. (24),

$${\alpha }_{ij}^{l}=\sigma \left(\left({W}_{Q}^{l}{h}_{i}^{l}\right){\left({W}_{K}^{l}{h}_{j}^{l}\odot {{{{{{{\mathrm{Dense}}}}}}}\,}_{K}^{l}\left({f}_{ij}^{l}\right)\right)}^{\rm {{T}}}\right)$$
(23)
$$\begin{array}{rl}{m}_{ij}^{l}&={\alpha }_{ij}^{l}\cdot \phi \left(\left|{\overrightarrow{r}}_{ij}\right|\right)\cdot \left({W}_{V}^{l}{h}_{j}^{l}\odot {{{{{{{\mathrm{Dense}}}}}}}\,}_{V}^{l}\left({f}_{ij}^{l}\right)\right)\end{array}$$
(24)

where l ∈ {0, 1, 2, …, L} is the index of the block, σ denotes the activation function (SiLU in this paper), W is a learnable weight matrix, ⊙ represents the Hadamard product, ϕ( · ) denotes the cosine cutoff and Dense( · ) refers to one learnable weight matrix with an activation function. For brevity, we omit the learnable bias for linear transformations on scalar embeddings in the equations, and there is no bias for vector embeddings to ensure equivariance.

Then, the computed \({m}_{ij}^{l}\) is used to produce the geometric messages \({\overrightarrow{m}}_{ij}^{l}\) for vectors:

$${\overrightarrow{m}}_{ij}^{l}=\left({{{{{{{\mathrm{Dense}}}}}}}\,}_{u}^{l}\left({m}_{ij}^{l}\right)\odot {\overrightarrow{u}}_{ij}\right)+\left({{{{{{{\mathrm{Dense}}}}}}}\,}_{v}^{l}\left({m}_{ij}^{l}\right)\odot {\overrightarrow{v}}_{j}^{l}\right)$$
(25)

And the vector embedding \({\overrightarrow{v}}^{l}\) is updated by:

$${m}_{i}^{l}=\mathop{\sum}\limits_{j\in {{{{{{{\mathcal{N}}}}}}}}(i)}{m}_{ij}^{l},\quad {\overrightarrow{m}}_{i}^{l}=\mathop{\sum}\limits_{j\in {{{{{{{\mathcal{N}}}}}}}}(i)}{\overrightarrow{m}}_{ij}^{l}$$
(26)
$${{\Delta }}{\overrightarrow{v}}_{i}^{l+1}={\overrightarrow{m}}_{i}^{l}+{W}_{{{{{{{\mathrm{vm}}}}}}}\,}^{l}{m}_{i}^{l}\odot {W}_{v}^{l}{\overrightarrow{v}}_{i}^{l}$$
(27)
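A condensed, single-head sketch of the Scalar2Vec computations in Eqs. (23)–(27) is given below in PyTorch; the multi-head attention, exact gating, and normalization of the actual implementation are omitted, so the layer choices here are illustrative rather than the exact ViSNet module:

```python
import math
import torch
import torch.nn as nn

class SimplifiedScalar2Vec(nn.Module):
    """Single-head sketch of Eqs. (23)-(27); not the exact ViSNet module."""
    def __init__(self, dim=128, cutoff=5.0):
        super().__init__()
        self.cutoff = cutoff
        self.W_q, self.W_k, self.W_v = (nn.Linear(dim, dim) for _ in range(3))
        self.dense_k = nn.Sequential(nn.Linear(dim, dim), nn.SiLU())
        self.dense_v = nn.Sequential(nn.Linear(dim, dim), nn.SiLU())
        self.dense_u = nn.Sequential(nn.Linear(dim, dim), nn.SiLU())
        self.dense_w = nn.Sequential(nn.Linear(dim, dim), nn.SiLU())
        self.W_vm = nn.Linear(dim, dim)
        self.W_vv = nn.Linear(dim, dim, bias=False)   # bias-free to keep equivariance

    def forward(self, h, f, vec, pos, edge_index):
        # h: (N, F) scalars, f: (E, F) edges, vec: (N, 3, F) vectors, pos: (N, 3).
        src, dst = edge_index
        r = pos[dst] - pos[src]
        d = r.norm(dim=-1, keepdim=True)
        u = r / d                                                    # unit vector u_ij
        damp = 0.5 * (torch.cos(math.pi * d / self.cutoff) + 1.0)    # cosine cutoff

        # Eqs. (23)-(24): edge-fused attention scores and scalar messages m_ij.
        scores = (self.W_q(h)[src] * self.W_k(h)[dst] * self.dense_k(f)).sum(-1, keepdim=True)
        alpha = torch.nn.functional.silu(scores)
        m_ij = alpha * damp * (self.W_v(h)[dst] * self.dense_v(f))

        # Eq. (25): vector messages from u_ij and the neighbor's vector embedding.
        vm_ij = (self.dense_u(m_ij).unsqueeze(1) * u.unsqueeze(-1)
                 + self.dense_w(m_ij).unsqueeze(1) * vec[dst])

        # Eq. (26): aggregation over the neighborhood of each node i (= src).
        m_i = torch.zeros_like(h).index_add_(0, src, m_ij)
        vm_i = torch.zeros_like(vec).index_add_(0, src, vm_ij)

        # Eq. (27): gated residual update of the vector embedding.
        dvec = vm_i + self.W_vm(m_i).unsqueeze(1) * self.W_vv(vec)
        return m_i, dvec
```

Note that linear layers acting on vector channels are kept bias-free, since adding a bias would break the rotational equivariance of the vector features.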

The Vec2Scalar module

In the Vec2Scalar module, the node embedding \({h}_{i}^{l}\) and edge embedding \({f}_{ij}^{l}\) are updated by the geometric information extracted by the RGC strategy, i.e., angles (Eq. (10)) and dihedrals (Eq. (11)), respectively. The residual node embedding \({{\Delta }}{h}_{i}^{l+1}\) is calculated by a Hadamard product between the runtime angle information and the aggregated scalar messages with a gated residual connection:

$${{\Delta }}{h}_{i}^{l+1}=\left\langle {W}_{t}^{l}{\overrightarrow{v}}_{i}^{l},{W}_{s}^{l}{\overrightarrow{v}}_{i}^{l}\right\rangle \odot {W}_{{{{{{{\mathrm{Angle}}}}}}}\,}^{l}{m}_{i}^{l}+{W}_{{{{{{{\mathrm{res}}}}}}}\,}^{l}{m}_{i}^{l}$$
(28)

To compute the residual edge embedding \({{\Delta }}{f}_{ij}^{l+1}\), we perform the Hadamard product of the runtime dihedral information with the transformed edge embedding:

$${{\Delta }}{f}_{ij}^{l+1}=\left\langle {{{{{{{\mathrm{Rej}}}}}}}\,}_{{\overrightarrow{r}}_{ij}}\left({W}_{Rt}^{l}{\overrightarrow{v}}_{i}^{l}\right),{{{{{{{\mathrm{Rej}}}}}}}\,}_{{\overrightarrow{r}}_{ji}}\left({W}_{Rs}^{l}{\overrightarrow{v}}_{j}^{l}\right)\right\rangle \odot {{{{{{{\mathrm{Dense}}}}}}}\,}_{{{{{{{\mathrm{Dihedral}}}}}}}\,}^{l}\left({f}_{ij}^{l}\right)$$
(29)

After the residual hidden representations are calculated, we add them to the original input of block l and feed them to the next block.
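Analogously, the Vec2Scalar computations in Eqs. (28) and (29) reduce to channel-wise inner products and rejections; a simplified PyTorch sketch (again with illustrative layer choices rather than the exact implementation) is:

```python
import torch
import torch.nn as nn

def vector_rejection(vec, u):
    """Rej_u(vec) per feature channel; vec: (E, 3, F), u: (E, 3) unit vectors."""
    proj = (vec * u.unsqueeze(-1)).sum(dim=1, keepdim=True)   # <vec, u> per channel
    return vec - proj * u.unsqueeze(-1)

class SimplifiedVec2Scalar(nn.Module):
    """Sketch of Eqs. (28)-(29); gating details are simplified."""
    def __init__(self, dim=128):
        super().__init__()
        self.W_t = nn.Linear(dim, dim, bias=False)
        self.W_s = nn.Linear(dim, dim, bias=False)
        self.W_angle = nn.Linear(dim, dim)
        self.W_res = nn.Linear(dim, dim)
        self.W_rt = nn.Linear(dim, dim, bias=False)
        self.W_rs = nn.Linear(dim, dim, bias=False)
        self.dense_dihedral = nn.Sequential(nn.Linear(dim, dim), nn.SiLU())

    def forward(self, m_i, f, vec, pos, edge_index):
        src, dst = edge_index
        u = pos[dst] - pos[src]
        u = u / u.norm(dim=-1, keepdim=True)

        # Eq. (28): runtime angle information <W_t v_i, W_s v_i> gates the message.
        angle = (self.W_t(vec) * self.W_s(vec)).sum(dim=1)        # (N, F)
        dh = angle * self.W_angle(m_i) + self.W_res(m_i)

        # Eq. (29): runtime dihedral information from rejections on the edge direction.
        w_ij = vector_rejection(self.W_rt(vec)[src], u)
        w_ji = vector_rejection(self.W_rs(vec)[dst], -u)
        dihedral = (w_ij * w_ji).sum(dim=1)                       # (E, F)
        df = dihedral * self.dense_dihedral(f)
        return dh, df
```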

A comprehensive version that includes improper angles is depicted in Supplementary Methods.

The output block

Following PaiNN18, we update the scalar embedding and vector embedding of nodes with multiple gated equivariant blocks:

$${t}_{i}^{l}={{{{{{{\mathrm{Dense}}}}}}}\,}_{{o}_{2}}^{l}\left(\left[\left|\left| {W}_{{o}_{1}}^{l}{\overrightarrow{v}}_{i}^{l}\right|\right| \,,{h}_{i}^{l}\right]\right)$$
(30)
$${h}_{i}^{l+1}={W}_{{o}_{3}}^{l}{t}_{i}^{l}$$
(31)
$${\overrightarrow{v}}_{i}^{l+1}={W}_{{o}_{4}}^{l}{\overrightarrow{v}}_{i}^{l}\odot {W}_{{o}_{5}}^{l}{t}_{i}^{l}$$
(32)

where [ · , · ] denotes the tensor concatenation operation. The final scalar embedding \({h}_{i}^{L}\in {{\mathbb{R}}}^{N\times 1}\) and vector embedding \({\overrightarrow{v}}_{i}^{L}\in {{\mathbb{R}}}^{N\times 3\times 1}\) are used to predict various molecular properties.

On QM9, the molecular dipole is calculated as follows:

$$\mu=\left|\mathop{\sum }\limits_{i=1}^{N}{\overrightarrow{v}}_{i}^{L}+{h}_{i}^{L}\left({\overrightarrow{r}}_{i}-{\overrightarrow{r}}_{{\rm {c}}}\right)\right|$$
(33)

where \({\overrightarrow{r}}_{c}\) denotes the center of mass. Similarly, for the prediction of electronic spatial extent 〈R2〉, we use the following equation:

$$\left\langle {R}^{2}\right\rangle=\mathop{\sum }\limits_{i=1}^{N}{h}_{i}^{L}\left|{\overrightarrow{r}}_{i}-{\overrightarrow{r}}_{c} \right|^{2}$$
(34)

For the remaining 10 properties y, we simply aggregate the final scalar embedding of nodes as follows:

$$y=\mathop{\sum }\limits_{i=1}^{N}{h}_{i}^{L}$$
(35)

For models trained on the molecular dynamics datasets including MD17, revised MD17, and Chignolin, the total potential energy is obtained as the sum of the final scalar embedding of the nodes. As an energy-conserving potential, the forces are then calculated using the negative gradients of the predicted total potential energy with respect to the atomic coordinates:

$$E=\mathop{\sum }\limits_{i=1}^{N}{h}_{i}^{L}$$
(36)
$${\overrightarrow{F}}_{i}=-{\nabla }_{i}E$$
(37)
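In code, this amounts to summing the per-atom scalar outputs and differentiating through the model; a generic sketch using PyTorch autograd (with a toy quadratic stand-in for the network) is:

```python
import torch

def energy_and_forces(model, z, pos, training=False):
    """Total energy as the sum of per-atom scalars (Eq. (36)) and forces as the
    negative gradient of that energy w.r.t. coordinates (Eq. (37))."""
    pos = pos.clone().requires_grad_(True)
    energy = model(z, pos).sum()
    forces = -torch.autograd.grad(energy, pos, create_graph=training)[0]
    return energy, forces

# Toy check: for h_i = |r_i|^2 the forces must equal -2 r_i.
toy_model = lambda z, pos: (pos ** 2).sum(dim=-1, keepdim=True)
z = torch.tensor([6, 1, 1])
pos = torch.randn(3, 3)
energy, forces = energy_and_forces(toy_model, z, pos)
assert torch.allclose(forces, -2 * pos)
```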

Statistics and reproducibility

For the QM9 dataset, we randomly split it into 110,000 samples as the train set, 10,000 samples as the validation set, and the rest as the test set by following the previous studies18,19. For the Molecule3D and OGB-LSC PCQM4Mv2 datasets, the splitting has been provided in their paper32,33.

To evaluate the effectiveness of ViSNet on simulation data, ViSNet was trained on MD17 and rMD17 in a limited-data setting, which consists of only 950 uniformly sampled conformations for model training and 50 conformations for validation for each molecule. For the MD22 dataset, we used the same numbers of training and validation samples as in ref. 30, and the rest as the test set.

Furthermore, the whole Chignolin dataset was randomly split into 80%, 10%, and 10% as the training, validation, and test datasets. Six representative conformations were picked from the test set for illustration.

Experimental settings

For the QM9 dataset, we adopted a batch size of 32 and a learning rate of 1e−4 for all the properties. For the Molecule3D dataset, we adopted a larger batch size of 512 and a learning rate of 2e−4. For the OGB-LSC PCQM4Mv2 dataset, we trained our model in a mixed 2D/3D mode with a batch size of 256 and a learning rate of 2e−4. The mean squared error (MSE) loss was used for model training. For the molecular dynamics datasets including MD17, rMD17, MD22, and Chignolin, we leveraged a combined MSE loss for energy and force prediction. The weight of the energy loss was set to 0.05 and the weight of the force loss to 0.95. The batch size was chosen from {2, 4, 8} depending on GPU memory, and the learning rate was chosen from 1e−4 to 4e−4 for different molecules. The cutoff was set to 5 for small molecules in QM9, MD17, rMD17, and Molecule3D, and changed to 4 for Chignolin in order to reduce the number of edges in the molecular graphs. For the MD22 dataset, the cutoff was set to 5 for the relatively small molecules and to 4 for the larger ones. No cutoff was used for the OGB-LSC PCQM4Mv2 dataset. We used learning rate decay if the validation loss stopped decreasing. The patience was set to 5 epochs for Molecule3D, 15 epochs for QM9, and 30 epochs for MD17, rMD17, MD22, and Chignolin. The learning rate decay factor was set to 0.8 for these models. Training was stopped when a maximum number of epochs was reached or the validation loss did not improve within the early-stopping patience. The ViSNet models trained on the molecular dynamics datasets and Molecule3D had 9 hidden layers and an embedding dimension of 256. We used a larger model for the QM9 dataset, i.e., the embedding dimension was changed to 512. For the OGB-LSC PCQM4Mv2 dataset, we used the 12-layer, 768-dimension Transformer-M37 as the backbone. More details about the hyperparameters of ViSNet can be found in Supplementary Table S4. Experiments were conducted on NVIDIA 32G-V100 GPUs.
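For reference, a sketch of the combined training loss described above (with the 0.05/0.95 weights; the reduction and batching details are illustrative) is:

```python
import torch
import torch.nn.functional as F

def energy_force_loss(pred_energy, pred_forces, true_energy, true_forces,
                      w_energy=0.05, w_force=0.95):
    """Weighted combination of energy and force MSE losses used for the MD datasets."""
    loss_e = F.mse_loss(pred_energy, true_energy)
    loss_f = F.mse_loss(pred_forces, true_forces)
    return w_energy * loss_e + w_force * loss_f
```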

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.