Abstract
Effective molecular representation learning is of great importance to facilitate molecular property prediction. Recent advances in molecular representation learning have shown great promise in applying graph neural networks to model molecules. Moreover, a few recent studies design self-supervised learning methods for molecular representation to address the shortage of labelled molecules; however, these self-supervised frameworks treat the molecules as topological graphs without fully utilizing the molecular geometry information. The molecular geometry, also known as the three-dimensional spatial structure of a molecule, is critical for determining molecular properties. To this end, we propose a novel geometry-enhanced molecular representation learning method (GEM). The proposed GEM has a specially designed geometry-based graph neural network architecture as well as several dedicated geometry-level self-supervised learning strategies to learn the molecular geometry knowledge. We compare GEM with various state-of-the-art baselines on different benchmarks and show that it can considerably outperform them all, demonstrating the superiority of the proposed method.
Main
Molecular property prediction has been widely considered as one of the most critical tasks in computational drug and materials discovery, as many methods rely on predicted molecular properties to evaluate, select and generate molecules^{1,2}. With the development of deep neural networks (DNNs), molecular representation learning exhibits a great advantage over feature-engineering-based methods and has attracted increasing research attention as a way to tackle the molecular property prediction problem.
Graph neural networks (GNNs) for molecular representation learning have recently become an emerging research area; they regard the topology of atoms and bonds as a graph and propagate the messages of each element to its neighbours^{3,4,5,6}. However, one major obstacle to the successful application of GNNs (and DNNs) in molecular property prediction is the scarcity of labelled data, which is also a common research challenge in the natural language processing^{7,8} and computer vision^{9,10} communities. Inspired by the success of self-supervised learning, recent studies^{4,11} use large-scale unlabelled molecules in a self-supervised manner to pretrain the molecular representation and then use a small number of labelled molecules to fine-tune the models, achieving substantial improvements.
Existing self-supervised learning techniques for GNNs^{4,11} only consider the topology information of the molecules, neglecting the molecular geometry, that is, the three-dimensional spatial structure of a molecule. These works conduct self-supervised learning by masking and predicting nodes, edges or contexts in the topology^{4,11}. Yet these tasks only enable the model to learn the laws of molecular graphs, such as which atom or group could be connected to a double bond, and lack the ability to learn the molecular geometry knowledge, which plays an important role in determining the physical, chemical and biological activities of molecules. For example, the water solubility (a critical metric of drug-likeness) of the two molecules illustrated in Fig. 1 is different due to their differing geometries, even though they have the same topology. Cisplatin and transplatin are another example of molecules with the same topology but different geometries: cisplatin is a popular chemotherapy drug used to treat a number of cancers, whereas transplatin has no cytotoxic activity^{12}.
Although incorporating geometric information into graph architectures to benefit some molecular property estimation tasks has attracted research attention in recent years^{13,14,15,16,17}, there is still a demand to utilize the molecular geometry information to develop a self-supervised learning paradigm for property prediction. We argue that using self-supervised learning to estimate the geometry can improve the model's capacity to predict various properties. Self-supervised learning can take advantage of large-scale unlabelled molecules with coarse three-dimensional spatial structures to better learn the molecular representation, where the coarse three-dimensional spatial structures can be efficiently calculated by cheminformatics tools such as RDKit (https://www.rdkit.org/). Through geometry-level self-supervised learning, the pretrained model is capable of inferring the molecular geometry by itself.
To this end, we propose a novel geometry-enhanced molecular representation learning method (GEM). First, to make the message passing sensitive to geometries, we model the effects of atoms, bonds and bond angles simultaneously by designing a geometry-based GNN architecture (GeoGNN). The architecture consists of two graphs: the first graph regards the atoms as nodes and the bonds as edges, whereas the second graph regards the bonds as nodes and the bond angles as edges. Second, we pretrain the GeoGNN to learn the chemical laws and the geometries from large-scale molecules with coarse three-dimensional spatial structures, designing various geometry-level self-supervised learning tasks. To verify the effectiveness of the proposed GEM, we compared it with several state-of-the-art (SOTA) baselines on 15 molecular property prediction benchmarks, on 14 of which GEM achieves SOTA results.
Our contributions can be summarized as follows:

We propose a novel geometry-based GNN to encode both the topology and geometry information of molecules.

We design multiple geometry-level self-supervised learning tasks to learn the molecular spatial knowledge from large-scale molecules with coarse spatial structures.

We evaluated GEM thoroughly on various molecular property prediction datasets. Experimental results demonstrate that GEM considerably outperforms competitive baselines on multiple benchmarks.
Preliminaries
Graph-based molecular representation
A molecule consists of atoms and the neighbouring atoms are connected by chemical bonds, which can be represented by a graph \(G=({{{\mathcal{V}}}},{{{\mathcal{E}}}})\), where \({{{\mathcal{V}}}}\) is a node set and \({{{\mathcal{E}}}}\) is an edge set. An atom in the molecule is regarded as a node \(v\in {{{\mathcal{V}}}}\) and a chemical bond in the molecule is regarded as an edge \((u,v)\in {{{\mathcal{E}}}}\) connecting atoms u and v.
Graph neural networks are message-passing neural networks^{18}, making them useful for predicting molecular properties. Following the definitions of the previous GNNs^{19}, the features of a node v are represented by x_{v} and the features of an edge (u, v) are represented by x_{uv}. Taking node features, edge features and the graph structure as inputs, a GNN learns the representation vectors of the nodes, where the representation vector of a node v is denoted by h_{v}. A GNN iteratively updates a node’s representation vector by aggregating the messages from the node’s neighbours. Finally, the representation vector h_{G} of the entire graph can be obtained by pooling over the representation vectors {h_{v}} of all the nodes at the last iteration. The representation vector of the graph h_{G} is utilized to estimate the molecular properties.
Pretraining methods for GNNs
In the molecular representation learning community, several recent works^{4,11,20} have explored the power of self-supervised learning to improve the generalization ability of GNN models on downstream tasks. They mainly focus on two kinds of self-supervised learning tasks: the node-level (edge-level) tasks and the graph-level tasks.
The node-level self-supervised learning tasks are devised to capture the local domain knowledge. For example, some studies randomly mask a portion of nodes or subgraphs and then predict their properties by the node/edge representation. The graph-level self-supervised learning tasks are used to capture the global information, like predicting the graph properties by the graph representation. Usually, the graph properties are domain-specific knowledge, such as experimental results from biochemical assays or the existence of molecular functional groups.
The GEM framework
This section introduces the details of our proposed geometry-enhanced molecular representation learning method (GEM), which includes two parts: a novel geometry-based GNN and various geometry-level self-supervised learning tasks.
GeoGNN
We propose GeoGNN, which encodes molecular geometries by modelling the atom–bond–angle relations; this distinguishes it from traditional GNNs, which only consider the relationship between atoms and bonds.
For a molecule, we denote the atom set as \({{{\mathcal{V}}}}\), the bond set as \({{{\mathcal{E}}}}\), and the bond angle set as \({{{\mathcal{A}}}}\). We introduce atom–bond graph G and bond–angle graph H for each molecule, as illustrated in Fig. 2a. The atom–bond graph is defined as \(G=({{{\mathcal{V}}}},{{{\mathcal{E}}}})\), where atom \(u\in {{{\mathcal{V}}}}\) is regarded as the node of G and bond \((u,v)\in {{{\mathcal{E}}}}\) as the edge of G, connecting atoms u and v. Similarly, the bond–angle graph is defined as \(H=({{{\mathcal{E}}}},{{{\mathcal{A}}}})\), where bond \((u,v)\in {{{\mathcal{E}}}}\) is regarded as the node of H and bond angle \((u,v,w)\in {{{\mathcal{A}}}}\) as the edge of H, connecting bonds (u, v) and (v, w). We use x_{u} as the initial features of atom u, x_{uv} as the initial features of bond (u, v), and x_{uvw} as the initial features of bond angle (u, v, w). The atom–bond graph G and the bond–angle graph H—as well as atom features, bond features and bond angle features—are taken as the inputs of GeoGNN.
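The construction of the two graphs can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `build_bond_angle_graph`, the bond canonicalisation and the toy molecule are assumptions made for the example. It takes the bond list of the atom–bond graph G and derives the edges of the bond–angle graph H, one per pair of bonds that share a central atom.

```python
from itertools import combinations


def build_bond_angle_graph(bonds):
    """Derive bond-angle graph H from the bond list of atom-bond graph G.

    bonds: list of (u, v) atom-index pairs (undirected).
    Returns (bond_index, angle_edges), where angle_edges holds tuples
    (bond_i, bond_j, (u, v, w)) and v is the atom shared by both bonds.
    """
    # Canonicalise each bond so (u, v) and (v, u) map to the same node of H.
    bond_index = {tuple(sorted(b)): i for i, b in enumerate(bonds)}

    # Collect the neighbours of every atom in G.
    neighbours = {}
    for u, v in bonds:
        neighbours.setdefault(u, []).append(v)
        neighbours.setdefault(v, []).append(u)

    angle_edges = []
    for v, nbrs in sorted(neighbours.items()):
        # Every pair of bonds meeting at atom v defines one bond angle (u, v, w).
        for u, w in combinations(sorted(nbrs), 2):
            i = bond_index[tuple(sorted((u, v)))]
            j = bond_index[tuple(sorted((v, w)))]
            angle_edges.append((i, j, (u, v, w)))
    return bond_index, angle_edges


# Toy molecule (hypothetical): atom 1 is bonded to atoms 0, 2 and 3.
bonds = [(0, 1), (1, 2), (1, 3)]
bond_index, angle_edges = build_bond_angle_graph(bonds)
```

For this toy input, H has three nodes (the bonds) and three edges (the angles centred on atom 1). Angle features x_{uvw} would be attached to these edges.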
GeoGNN learns the representation vectors of atoms and bonds iteratively. For the kth iteration, the representation vectors of atom u and bond (u, v) are denoted by h_{u}^{(k)} and h_{uv}^{(k)}, respectively. To connect the atom–bond graph G and bond–angle graph H, the representation vectors of the bonds are taken as the communication links between G and H. In the first step, the bonds’ representation vectors are learned by aggregating messages from the neighbouring bonds and corresponding bond angles in the bond–angle graph H. In the second step, the atoms’ representation vectors are learned by aggregating messages from the neighbouring atoms and the corresponding bonds in the atom–bond graph G. Finally, the molecular representation h_{G} is obtained by pooling over the atoms’ representations. See the Methods for details on the GeoGNN architecture.
Geometry-level self-supervised learning tasks
To further boost the generalization ability of GeoGNN, we propose three geometry-level self-supervised learning tasks to pretrain GeoGNN: (1) bond length prediction; (2) bond angle prediction; (3) atomic distance matrix prediction. The bond lengths and bond angles describe the local spatial structures, whereas the atomic distance matrices describe the global spatial structures.
Local spatial structures
Bond lengths and bond angles are the most important molecular geometrical parameters: the former is the distance between two bonded atoms in a molecule, reflecting the bond strength between the atoms, whereas the latter is the angle between two consecutive bonds, involving three atoms, and describes the local spatial structure of a molecule.
To learn the local spatial structures, we construct self-supervised learning tasks that predict bond lengths and angles. First, for a molecule, we randomly select 15% of atoms. For each selected atom, we extract the one-hop neighbourhood of this atom, including the adjacent atoms and bonds, as well as the bond angles formed by that selected atom. Second, we mask the features of these atoms, bonds and bond angles in the one-hop neighbourhood. The representation vectors of the extracted atoms and bonds at the final iteration of GeoGNN are used to predict the extracted bond lengths and bond angles. Self-supervised learning tasks based on bond lengths and bond angles are shown on the left and in the middle of Fig. 2b. We design a regression loss function that penalizes the error between the predicted bond lengths/angles and the labels; details are given in the Methods. The task of predicting the local spatial structures can be seen as a node-level self-supervised learning task.
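The selection step described above can be sketched as follows. This is a hedged illustration, not the released code: the function name `select_masked_neighbourhood`, the fixed random seed and the rule of always selecting at least one atom are assumptions made for the example; only the 15% ratio and the one-hop neighbourhood come from the text.

```python
import random


def select_masked_neighbourhood(num_atoms, bonds, mask_ratio=0.15, seed=0):
    """Pick centre atoms and collect the one-hop items whose features get masked."""
    rng = random.Random(seed)
    # Assumption: select at least one atom so small molecules still yield a task.
    num_selected = max(1, round(num_atoms * mask_ratio))
    centres = rng.sample(range(num_atoms), num_selected)

    masked_atoms, masked_bonds, masked_angles = set(), set(), set()
    for c in centres:
        # Atoms directly bonded to the selected centre atom.
        nbrs = sorted(v if u == c else u for u, v in bonds if c in (u, v))
        masked_atoms.update([c, *nbrs])
        masked_bonds.update(tuple(sorted((c, n))) for n in nbrs)
        # Bond angles formed at the selected atom: (n1, c, n2).
        for i, n1 in enumerate(nbrs):
            for n2 in nbrs[i + 1:]:
                masked_angles.add((n1, c, n2))
    return centres, masked_atoms, masked_bonds, masked_angles
```

The returned sets identify which atom, bond and bond angle features would be replaced by a mask value before the forward pass; the bond lengths and angles of the same items then serve as prediction targets.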
Global spatial structures
In addition to the tasks for learning local spatial structures, we also design the atomic distance matrix prediction task for learning the global molecular geometry. We construct the atomic distance matrix for each molecule based on the three-dimensional coordinates of the atoms. We then predict the elements in the distance matrix, as shown on the right of Fig. 2b.
Note that for two molecules with the same topological structure, the spatial distances between the corresponding atoms could vary greatly; thus, rather than treating the prediction of the atomic distance matrix as a regression problem, we treat it as a multiclass classification problem by projecting the atomic distances into 30 bins with equal stride. Details on the designed loss function can be found in the Methods. The task of predicting the bond lengths can be seen as a special case of the task of predicting the atomic distances. The former focuses more on the local spatial structures, whereas the latter focuses more on the distribution of the global spatial structures.
To pretrain GeoGNN, we consider both the local spatial structures and global spatial structures for each molecule by summing up the corresponding loss functions.
Experiments
To thoroughly evaluate the performance of GEM, we compare it with multiple SOTA methods on multiple benchmark datasets from MoleculeNet^{21}, covering various molecular property prediction tasks that span physical, chemical and biophysical properties.
Pretraining settings
Datasets
We use 20 million unlabelled molecules sampled from ZINC15^{22}, a public-access database that contains purchasable drug-like compounds, to pretrain GeoGNN. We randomly sample 90% of the molecules for training and use the remainder for evaluation.
Self-supervised learning task settings
We utilize geometry-level and graph-level tasks to pretrain GeoGNN. For the former, we utilize the Merck molecular force field (MMFF94)^{23} function in RDKit to obtain the simulated three-dimensional coordinates of the atoms in the molecules. The geometric features of the molecule (bond lengths, bond angles and atomic distance matrices) are calculated from the simulated three-dimensional coordinates. For the graph-level tasks, we predict the molecular fingerprints. The graph-level tasks can be formulated as a set of binary classification problems, where each bit of the fingerprints corresponds to one binary classification problem. Two kinds of fingerprints are used: (1) the molecular access system (MACCS) key^{24} and (2) the extended-connectivity fingerprint (ECFP)^{25}.
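Once three-dimensional coordinates are available, the three kinds of geometric labels follow from elementary vector arithmetic. The sketch below is illustrative only: the function names and the toy coordinates are assumptions, the coordinates are taken as given (no force-field call is made here), and the angle is returned in radians as a choice made for this example.

```python
import math


def bond_length(coords, u, v):
    """Euclidean distance between bonded atoms u and v."""
    return math.dist(coords[u], coords[v])


def bond_angle(coords, u, v, w):
    """Angle in radians at central atom v between bonds (u, v) and (v, w)."""
    a = [p - q for p, q in zip(coords[u], coords[v])]
    b = [p - q for p, q in zip(coords[w], coords[v])]
    cos = sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, cos)))


def distance_matrix(coords):
    """Pairwise distances between all atoms (the global spatial structure)."""
    return [[math.dist(p, q) for q in coords] for p in coords]


# Toy coordinates (hypothetical): a right angle centred on atom 1.
coords = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
length_01 = bond_length(coords, 0, 1)
angle_012 = bond_angle(coords, 0, 1, 2)
dist = distance_matrix(coords)
```

Bond lengths and angles are defined only for bonded atoms, whereas the distance matrix covers every atom pair, which is why the text treats the former as local and the latter as global structure.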
Molecular property prediction settings
Datasets and splitting method
We conduct experiments on multiple molecular benchmarks from MoleculeNet^{21}, including both classification and regression tasks^{26,27,28,29,30,31}. Following the previous work^{11}, we split all the datasets with scaffold split^{32}, which splits molecules according to their scaffolds (molecular substructures). Scaffold split is a more challenging splitting method and can better evaluate the generalization ability of the models on out-of-distribution data samples.
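The idea behind scaffold splitting can be sketched as below. Several details are assumptions made for illustration and are not stated in the text: the 80/10/10 ratios, the largest-group-first ordering, and the input format, in which each molecule is represented by a precomputed scaffold string (for example, one produced by a cheminformatics toolkit). The key property is that all molecules sharing a scaffold land in the same subset.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split molecule indices so that no scaffold is shared across subsets.

    scaffolds: list of scaffold identifiers, one per molecule (assumed given).
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest scaffold groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        # Assign each whole group to the first subset that still has room.
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Toy input (hypothetical): eight molecules share scaffold "A".
train, valid, test = scaffold_split(["A"] * 8 + ["B", "C"])
```

Because rare scaffolds tend to end up in the validation and test subsets, the evaluation molecules are structurally different from the training molecules, which is what makes this split harder than a random one.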
GNN architecture
We use the AGGREGATE and COMBINE functions defined in the graph isomorphism network (GIN)^{19}. Residual connections^{33}, layer normalization^{34} and graph normalization^{35} are incorporated into GIN to further improve the performance. We also use the average pooling as the READOUT function to obtain the graph representation.
Evaluation metrics
As suggested by MoleculeNet^{21}, we use the average ROC-AUC^{36} as the evaluation metric for the classification datasets. ROC-AUC (area under the receiver operating characteristic curve) is used to evaluate the performance of binary classification tasks, for which higher is better. With respect to the regression datasets, we use root mean square error (RMSE) for FreeSolv^{37}, ESOL^{38} and Lipo^{39}, whereas we use mean absolute error (MAE) for QM7^{40}, QM8^{41} and QM9^{42}. We execute four independent runs for each method and report the mean and the standard deviation of the metrics.
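For reference, the three metrics can be written out in plain Python. This is a small self-contained sketch with hypothetical function names and toy values; ROC-AUC is computed through its pairwise-ranking interpretation (the fraction of positive and negative pairs that are ordered correctly, with ties counting one half), which is equivalent to the area under the ROC curve.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error (lower is better)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error (lower is better)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values (hypothetical), only to show the calling convention.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
err_rmse = rmse([0.0, 0.0], [3.0, 4.0])
err_mae = mae([0.0, 0.0], [3.0, 4.0])
```

The pairwise form is quadratic in the number of samples, so a sorting-based implementation would be preferable for large datasets.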
Baselines
We compare the proposed method with various competitive baselines. D-MPNN^{43}, AttentiveFP^{44}, SGCN^{16}, DimeNet^{17} and HMGNN^{6} are GNNs without pretraining, among which SGCN, DimeNet and HMGNN incorporate three-dimensional geometry information; N-Gram^{45}, PretrainGNN^{11} and GROVER^{4} are methods with pretraining. N-Gram assembles the node embeddings in short walks in the graph and then leverages Random Forest or XGBoost to predict the molecular properties. PretrainGNN implements several types of self-supervised learning tasks, among which we report the best result. GROVER integrates GNN into Transformer with two self-supervised tasks, and we report the results of GROVER_{base} and GROVER_{large}, which have different network capacities.
Experimental results
Overall performance
The overall performance of GEM along with other methods is summarized in Table 1. We have the following observations: (1) GEM achieves SOTA results on 14 out of 15 datasets. On the regression tasks, GEM achieves an overall relative improvement of 8.8% on average compared with the previous SOTA results in each dataset. On the classification tasks, GEM achieves an overall relative improvement of 4.7% on the average ROC-AUC compared with the previous SOTA result from D-MPNN. (2) GEM achieves more substantial improvements on the regression datasets than on the classification datasets. We conjecture that this is because the regression datasets focus on predicting the quantum chemical properties, which are highly correlated to molecular geometries.
Contribution of GeoGNN
We investigate the effect of GeoGNN without pretraining on the regression datasets, including the properties of quantum mechanics and physical chemistry, which are highly correlated to molecular geometries. GeoGNN is compared with multiple GNN architectures, including: (1) the commonly used GNN architectures, GIN^{19}, GAT^{46} and GCN^{47}; (2) recent works incorporating three-dimensional molecular geometry, SGCN^{16}, DimeNet^{17} and HMGNN^{6}; (3) the architectures specially designed for molecular representation, D-MPNN^{43}, AttentiveFP^{44} and GTransformer^{4}. From Table 2, we can conclude that GeoGNN considerably outperforms other GNN architectures on all the regression datasets because it incorporates geometrical parameters, even though the three-dimensional coordinates of the atoms are simulated. The overall relative improvement is 7.9% compared with the best results of previous methods.
Contribution of geometry-level tasks
To study the effect of the proposed geometry-level self-supervised learning tasks, we apply different types of self-supervised learning tasks to pretrain GeoGNN on the regression datasets. In Table 3, ‘Without pretrain’ denotes the GeoGNN network without pretraining, ‘Geometry’ denotes our proposed geometry-level tasks, ‘Graph’ denotes the graph-level task that predicts the molecular fingerprints and ‘Context’^{4} denotes a node-level task that predicts the atomic context. In general, the methods with geometry-level tasks perform better than those without them. Furthermore, ‘Geometry’ performs better than ‘Geometry + Graph’ in the regression tasks, which may be due to the weak connection between molecular fingerprints and the regression tasks.
Pretrained representations visualization
To intuitively observe the representations that the self-supervised tasks (without downstream fine-tuning) have learned, we visualize the representations by mapping them to the two-dimensional space using the t-SNE algorithm^{48}, whose details can be found in the Supplementary Information. The Davies–Bouldin index^{49} is calculated to measure the separation of clusters. The lower the Davies–Bouldin index, the better the separation of the clusters. Here we test whether the pretraining methods are able to distinguish molecules with valid geometry (generated from RDKit) from molecules with invalid geometry (randomly generated). We randomly select 1,000 molecules from ZINC. For each molecule, we generate the valid and invalid geometry. As shown in Fig. 3a, both the graph-level and geometry-level pretraining methods distinguish the valid geometry from the invalid geometry better than the model without pretraining. Moreover, the geometry-level pretraining further decreases the Davies–Bouldin index to 2.63, compared with 7.88 for the graph-level pretraining.
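The Davies–Bouldin index averages, over clusters, the worst-case ratio of within-cluster scatter to between-centroid distance. A minimal pure-Python sketch is given below; the function name and the toy points are assumptions for illustration, and the scatter is taken as the mean Euclidean distance to the centroid, which is the common definition.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index; lower values indicate better-separated clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)

    # Centroid and mean distance-to-centroid (scatter) of each cluster.
    centroids = {l: [sum(c) / len(ps) for c in zip(*ps)] for l, ps in clusters.items()}
    scatter = {l: sum(math.dist(p, centroids[l]) for p in ps) / len(ps)
               for l, ps in clusters.items()}

    total = 0.0
    for i in clusters:
        # Worst (largest) similarity between cluster i and any other cluster.
        total += max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                     for j in clusters if j != i)
    return total / len(clusters)


# Toy two-dimensional embeddings (hypothetical): same scatter, different gaps.
far = davies_bouldin([(0, 0), (0, 2), (10, 0), (10, 2)], [0, 0, 1, 1])
near = davies_bouldin([(0, 0), (0, 2), (4, 0), (4, 2)], [0, 0, 1, 1])
```

In the toy example the two configurations have identical within-cluster scatter, so the index rises only because the centroids move closer together, mirroring how a lower value in the text signals cleaner separation of valid and invalid geometries.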
Impact of the quality of geometry
To investigate the impact of the quality of geometry, we first compare GeoGNN, which adopts the default force field MMFF, with GeoGNN (UFF), which adopts the universal force field (UFF)^{50}, on dataset QM9. GeoGNN and GeoGNN (UFF) achieve similar performance, as shown in Fig. 3c. The impact of more precise three-dimensional coordinates provided by dataset QM9 (calculated by DFT^{51}) is also investigated. GeoGNN (precise 3D) achieves an improvement of about 12% compared with the baseline GeoGNN.
Furthermore, Fig. 3b shows the representation visuals for different qualities of molecular geometry. GeoGNN (without 3D) is a variant of GeoGNN that masks all the geometry features with zeros, GeoGNN is the baseline that utilizes coarse three-dimensional coordinates, and GeoGNN (precise 3D) utilizes precise three-dimensional coordinates generated by DFT. We equally divide 2,000 molecules from QM9 into two clusters, one with high HOMO–LUMO gaps and the other with low HOMO–LUMO gaps. We test the ability of different models to distinguish these two groups of molecules. Visually, we observe that GeoGNN separates the clusters better than GeoGNN (without 3D), whereas GeoGNN (precise 3D) works better than GeoGNN. The differences in the Davies–Bouldin index support these observations.
Contributions of atom–bond and bond–angle graphs
We evaluate the contributions of the atom–bond and bond–angle graphs in GeoGNN on dataset QM9, as shown in Fig. 3c. The atom–bond graph variant utilizes the atom–bond graph only and pools over the representations of the atoms to estimate the properties, whereas the bond–angle graph variant utilizes the bond–angle graph only and pools over the representations of the bonds. GeoGNN, which consists of both the atom–bond and bond–angle graphs, performs better than these two variants, indicating that both graphs contribute to the performance.
Related work
Molecular representation
Current molecular representations can be categorized into three types: molecular fingerprints, sequence-based representations and graph-based representations.
Molecular fingerprints
Molecular fingerprints such as ECFP^{25} and MACCS^{24} are molecular descriptors. Fingerprints are handcrafted representations—widely used by traditional machine learning methods^{3,52,53,54}—that encode a molecule into a sequence of bits according to the molecules’ topological substructures. Although fingerprints can represent the presence of the substructures in the molecules, they suffer from bit collisions and vector sparsity, limiting their representation power.
Sequence-based representations
Some studies^{3,55} take SMILES strings^{56}, which describe molecules as text, as inputs and leverage sequence-based models such as recurrent neural networks and Transformer^{57,58} to learn the molecular representations; however, it is difficult for sequence-based methods to comprehend the syntax of SMILES. For example, two adjacent atoms may be far apart in the text sequence. Moreover, a small change in a SMILES string can lead to a large change in the molecular structure.
Graph-based representations
Many works^{3,4,5,6,18} have showcased the great potential of graph neural networks for modelling molecules by taking each atom as a node and each chemical bond as an edge. For example, AttentiveFP^{44} extends the graph attention mechanism to learn aggregation weights. Meanwhile, a group of studies have tried to incorporate three-dimensional geometry information: (1) some works^{13,14,15} take partial geometry information, such as atomic distances, as features; (2) SGCN^{16} proposed a spatial graph convolution that uses relative position vectors between atoms as input features; (3) DimeNet^{17} proposed a message-passing scheme based on bonds that transforms messages according to angles.
Pretraining for GNNs
Self-supervised learning^{7,8,9,10,59} has achieved great success in natural language processing, computer vision and other domains; it trains on unlabelled samples in a supervised manner to alleviate the overfitting issue and improve data utilization efficiency. Some studies^{4,11} recently applied self-supervised learning methods to GNNs for molecular property prediction to overcome the insufficiency of the labelled samples. These works learn the molecular representation vectors by exploiting node-level and graph-level tasks, where the node-level tasks learn the local domain knowledge by predicting the node properties and the graph-level tasks learn the global domain knowledge by predicting biological activities. Although existing self-supervised learning methods can boost the generalization ability, they neglect the spatial knowledge that is strongly related to the molecular properties.
Conclusion
Efficient molecular representation learning is crucial for molecular property prediction. Existing works that apply pretraining methods for molecular property prediction fail to utilize the molecular geometries described by bonds, bond angles and other geometrical parameters. To this end, we design a geometry-based GNN and multiple geometry-level self-supervised learning methods to capture the molecular spatial knowledge. Extensive experiments were conducted to verify the effectiveness of GEM, comparing it with multiple competitive baselines. GEM considerably outperforms other methods on multiple benchmarks. In the future we will try to apply the proposed framework to more molecular tasks, especially the protein–ligand affinity prediction task, which requires a large amount of three-dimensional sampling.
Methods
Preliminary for GNNs
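Graph neural networks are message-passing neural networks. More concretely, given a node v, its representation vector \({\mathbf{h}}_{v}^{(k)}\) at the kth iteration is formalized by
\[{\mathbf{a}}_{v}^{(k)}={\mathrm{AGGREGATE}}^{(k)}\left(\left\{\left({\mathbf{h}}_{v}^{(k-1)},{\mathbf{h}}_{u}^{(k-1)},{\mathbf{x}}_{uv}\right):u\in {{{\mathcal{N}}}}(v)\right\}\right),\qquad {\mathbf{h}}_{v}^{(k)}={\mathrm{COMBINE}}^{(k)}\left({\mathbf{h}}_{v}^{(k-1)},{\mathbf{a}}_{v}^{(k)}\right),\]
where \({{{\mathcal{N}}}}(v)\) is the set of neighbours of node v, \({\mathbf{a}}_{v}^{(k)}\) is the aggregated message, AGGREGATE^{(k)} is the aggregation function for aggregating messages from a node’s neighbourhood, and COMBINE^{(k)} is the update function for updating the node representation. We initialize \({\mathbf{h}}_{v}^{(0)}\) by the feature vector of node v, that is, \({\mathbf{h}}_{v}^{(0)}={\mathbf{x}}_{v}\).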
A READOUT function is introduced to integrate the nodes’ representation vectors at the final iteration so as to obtain the graph’s representation vector h_{G}, which is formalized as
\[{\mathbf{h}}_{G}={\mathrm{READOUT}}\left(\left\{{\mathbf{h}}_{v}^{(K)}:v\in {{{\mathcal{V}}}}\right\}\right),\]
where K is the number of iterations. In most cases, READOUT is a permutation-invariant pooling function, such as summation or maximization. The graph’s representation vector h_{G} can then be used for downstream task predictions.
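The iterate-then-pool pattern described in this subsection can be illustrated with a deliberately simple sketch. It is not the architecture used in the paper: AGGREGATE is taken to be a plain sum over neighbours, COMBINE an element-wise addition, READOUT a mean, and edge features are ignored; all function names and the toy graph are assumptions. In a trained GNN these functions would contain learnable parameters.

```python
def message_passing(features, edges, num_iterations):
    """Run K rounds of sum-aggregation message passing.

    features: {node: [float, ...]} initial vectors x_v (so h^(0) = x).
    edges: list of undirected (u, v) pairs.
    """
    neighbours = {v: [] for v in features}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)

    h = {v: list(x) for v, x in features.items()}
    for _ in range(num_iterations):
        new_h = {}
        for v in h:
            # AGGREGATE (assumed: sum of the neighbours' previous vectors).
            agg = [sum(vals) for vals in zip(*(h[u] for u in neighbours[v]))]
            agg = agg or [0.0] * len(h[v])  # isolated node: empty aggregate
            # COMBINE (assumed: add the aggregate to the node's own vector).
            new_h[v] = [a + b for a, b in zip(h[v], agg)]
        h = new_h
    return h


def readout_mean(h):
    """READOUT as average pooling over all node vectors."""
    dim = len(next(iter(h.values())))
    return [sum(vec[i] for vec in h.values()) / len(h) for i in range(dim)]


# Toy path graph 0-1-2 with one-dimensional features (hypothetical values).
h_final = message_passing({0: [1.0], 1: [2.0], 2: [3.0]}, [(0, 1), (1, 2)], 1)
h_graph = readout_mean(h_final)
```

Because both the sum over neighbours and the mean over nodes ignore ordering, relabelling the atoms leaves the graph vector unchanged, which is the permutation invariance mentioned above.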
GeoGNN
The GeoGNN architecture encodes the molecular geometries by modelling two graphs: the atom–bond and bond–angle graphs, under which the representation vectors of atoms and bonds are learned iteratively. More concretely, the representation vectors of atom u and bond (u, v) for the kth iteration are denoted by \({\mathbf{h}}_{u}^{(k)}\) and \({\mathbf{h}}_{uv}^{(k)}\), respectively. We initialize \({\mathbf{h}}_{u}^{(0)}={\mathbf{x}}_{u}\) and \({\mathbf{h}}_{uv}^{(0)}={\mathbf{x}}_{uv}\).
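Given bond (u, v), its representation vector \({\mathbf{h}}_{uv}^{(k)}\) at the kth iteration is formalized by
\[\begin{aligned}{\mathbf{a}}_{uv}^{(k)}&={\mathrm{AGGREGATE}}_{\text{bond–angle}}^{(k)}\left(\left\{\left({\mathbf{h}}_{uv}^{(k-1)},{\mathbf{h}}_{uw}^{(k-1)},{\mathbf{x}}_{wuv}\right):w\in {{{\mathcal{N}}}}(u)\right\}\cup \left\{\left({\mathbf{h}}_{uv}^{(k-1)},{\mathbf{h}}_{vw}^{(k-1)},{\mathbf{x}}_{uvw}\right):w\in {{{\mathcal{N}}}}(v)\right\}\right),\\ {\mathbf{h}}_{uv}^{(k)}&={\mathrm{COMBINE}}_{\text{bond–angle}}^{(k)}\left({\mathbf{h}}_{uv}^{(k-1)},{\mathbf{a}}_{uv}^{(k)}\right).\end{aligned}\]
Here, \({{{\mathcal{N}}}}(u)\) and \({{{\mathcal{N}}}}(v)\) denote the neighbouring atoms of u and v, respectively; \(\{(u,w):w\in {{{\mathcal{N}}}}(u)\}\cup \{(v,w):w\in {{{\mathcal{N}}}}(v)\}\) are the neighbouring bonds of (u, v). AGGREGATE_{bond−angle} is the message aggregation function and COMBINE_{bond−angle} is the update function for bond–angle graph H. In this way, the information from the neighbouring bonds and the corresponding bond angles is aggregated into \({\mathbf{a}}_{uv}^{(k)}\). The representation vector of bond (u, v) is then updated according to the aggregated information. With the learned representation vectors of the bonds from bond–angle graph H, given an atom u, its representation vector \({\mathbf{h}}_{u}^{(k)}\) at the kth iteration can be formalized as
\[\begin{aligned}{\mathbf{a}}_{u}^{(k)}&={\mathrm{AGGREGATE}}_{\text{atom–bond}}^{(k)}\left(\left\{\left({\mathbf{h}}_{u}^{(k-1)},{\mathbf{h}}_{v}^{(k-1)},{\mathbf{h}}_{uv}^{(k-1)}\right):v\in {{{\mathcal{N}}}}(u)\right\}\right),\\ {\mathbf{h}}_{u}^{(k)}&={\mathrm{COMBINE}}_{\text{atom–bond}}^{(k)}\left({\mathbf{h}}_{u}^{(k-1)},{\mathbf{a}}_{u}^{(k)}\right).\end{aligned}\]
Similarly, \({{{\mathcal{N}}}}(u)\) denotes the neighbouring atoms of atom u, AGGREGATE_{atom−bond} is the message aggregation function for atom–bond graph G, and COMBINE_{atom−bond} is the update function. For atom u, messages are aggregated from the neighbouring atoms and the corresponding bonds. Note that the messages of the bonds are learned from the bond–angle graph H. The aggregated messages then update the representation vector of atom u.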
The representation vectors of the atoms at the final iteration are integrated to obtain the molecular representation vector h_{G} by the READOUT function, which is formalized as
\[{\mathbf{h}}_{G}={\mathrm{READOUT}}\left(\left\{{\mathbf{h}}_{u}^{(K)}:u\in {{{\mathcal{V}}}}\right\}\right),\]
where K is the number of iterations. The molecule’s representation vector h_{G} is used to predict the molecular properties.
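The two-step update can be made concrete with a toy sketch that uses scalar features. Everything beyond the two-step structure is an assumption made for illustration: sum aggregation, additive COMBINE, the use of previous-iteration bond representations in the atom update, the function name `geognn_iteration` and the toy values. The real model uses learned, vector-valued functions.

```python
def geognn_iteration(h_atom, h_bond, x_angle, bonds, angle_edges):
    """One GeoGNN-style iteration with scalar features and sum aggregation.

    h_atom: {atom: float}; h_bond: {bond_id: float}; x_angle: {angle_id: float}
    bonds: {bond_id: (u, v)}; angle_edges: [(bond_i, bond_j, angle_id)]
    """
    # Step 1: update bonds on the bond-angle graph H. Each bond receives the
    # representation of every neighbouring bond plus the connecting angle feature.
    agg_bond = {b: 0.0 for b in h_bond}
    for i, j, angle in angle_edges:
        agg_bond[i] += h_bond[j] + x_angle[angle]
        agg_bond[j] += h_bond[i] + x_angle[angle]
    new_bond = {b: h_bond[b] + agg_bond[b] for b in h_bond}

    # Step 2: update atoms on the atom-bond graph G. Each atom receives the
    # representation of every neighbouring atom plus the connecting bond.
    agg_atom = {u: 0.0 for u in h_atom}
    for b, (u, v) in bonds.items():
        agg_atom[u] += h_atom[v] + h_bond[b]
        agg_atom[v] += h_atom[u] + h_bond[b]
    new_atom = {u: h_atom[u] + agg_atom[u] for u in h_atom}
    return new_atom, new_bond


# Toy chain 0-1-2 (hypothetical values): two bonds joined by one angle at atom 1.
atoms, bond_repr = geognn_iteration(
    h_atom={0: 1.0, 1: 2.0, 2: 3.0},
    h_bond={0: 0.5, 1: 1.5},
    x_angle={0: 2.0},
    bonds={0: (0, 1), 1: (1, 2)},
    angle_edges=[(0, 1, 0)],
)
# READOUT as average pooling over the atom representations.
h_molecule = sum(atoms.values()) / len(atoms)
```

The bond representations are the only quantities shared by both steps, which is how angle information from H reaches the atom representations in G over successive iterations.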
Geometry-level self-supervised learning tasks
Local spatial structures
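The self-supervised tasks for local spatial information are designed to learn two important molecular geometrical parameters, the bond lengths and the bond angles. The loss functions of the self-supervised tasks are defined as follows:
\[\begin{aligned}{L}_{\mathrm{length}}({{{\mathcal{E}}}})&=\frac{1}{| {{{\mathcal{E}}}}| }\sum_{(u,v)\in {{{\mathcal{E}}}}}{\left({f}_{\mathrm{length}}\left({\mathbf{h}}_{u}^{(K)},{\mathbf{h}}_{v}^{(K)}\right)-{l}_{uv}\right)}^{2},\\ {L}_{\mathrm{angle}}({{{\mathcal{A}}}})&=\frac{1}{| {{{\mathcal{A}}}}| }\sum_{(u,v,w)\in {{{\mathcal{A}}}}}{\left({f}_{\mathrm{angle}}\left({\mathbf{h}}_{u}^{(K)},{\mathbf{h}}_{v}^{(K)},{\mathbf{h}}_{w}^{(K)}\right)-{\phi }_{uvw}\right)}^{2}.\end{aligned}\]
Here, \({L}_{\mathrm{length}}({{{\mathcal{E}}}})\) is the loss function for bond lengths, with \({{{\mathcal{E}}}}\) as the set of bonds; \({L}_{\mathrm{angle}}({{{\mathcal{A}}}})\) is the loss function for bond angles, with \({{{\mathcal{A}}}}\) as the set of angles; K is the number of iterations for GeoGNN; f_{length}(⋅) is the network predicting the bond lengths; f_{angle}(⋅) is the network predicting the bond angles; l_{uv} denotes the length of the bond connecting atoms u and v; and ϕ_{uvw} denotes the degree of the bond angle connecting bonds (u, v) and (v, w).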
Global spatial structures
The self-supervised tasks for global spatial information are designed to learn the atomic distance matrices between all atom pairs. Each element of the distance matrices is the three-dimensional distance between two atoms. We use d_{uv} to denote the distance between two atoms u and v in the molecule. For the atomic distance prediction task, we clip the distance to the range from 0 Å to 20 Å and project it into 30 bins with equal stride. The loss function of the self-supervised tasks is defined as follows:
\[{L}_{\mathrm{distance}}({{{\mathcal{V}}}})=\frac{1}{| {{{\mathcal{V}}}}{| }^{2}}\sum_{u,v\in {{{\mathcal{V}}}}}-{\mathrm{bin}}^{\top }({d}_{uv})\cdot {\mathrm{log}}\left({f}_{\mathrm{distance}}\left({\mathbf{h}}_{u}^{(K)},{\mathbf{h}}_{v}^{(K)}\right)\right),\]
where \({{{\mathcal{V}}}}\) is the set of atoms, f_{distance}(⋅) is the network predicting the distribution of atomic distances, the bin(⋅) function is used to discretize the atomic distance d_{uv} into a one-hot vector and \({\mathrm{log}}(\cdot )\) is the logarithmic function.
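The discretization and the resulting classification loss can be sketched as follows. The bin count and the 0 Å to 20 Å range come from the text; the function names, the handling of the upper edge (a distance of exactly 20 Å is placed in the last bin) and the uniform toy prediction are assumptions made for this example. With a one-hot target, the cross-entropy term reduces to the negative log-probability assigned to the target bin.

```python
import math

NUM_BINS = 30    # number of equal-stride bins, as stated in the text
MAX_DIST = 20.0  # distances are clipped to the range 0-20 angstroms


def bin_index(distance):
    """Map a distance in angstroms to the index of its bin."""
    clipped = min(max(distance, 0.0), MAX_DIST)
    # Assumption: the upper edge (exactly 20) is folded into the last bin.
    return min(int(clipped / MAX_DIST * NUM_BINS), NUM_BINS - 1)


def distance_loss(pred_probs, distances):
    """Mean cross-entropy over atom pairs.

    pred_probs[i] is a length-30 probability distribution for pair i and
    distances[i] is the corresponding true distance.
    """
    total = 0.0
    for probs, d in zip(pred_probs, distances):
        # One-hot target: only the probability of the true bin contributes.
        total -= math.log(probs[bin_index(d)])
    return total / len(distances)


# A uniform prediction (hypothetical) gives the maximum-uncertainty loss log(30).
uniform = [1.0 / NUM_BINS] * NUM_BINS
loss = distance_loss([uniform, uniform], [1.5, 25.0])
```

Treating the task as classification lets the network express a multimodal belief over distances, which suits the observation above that molecules with the same topology can have very different atomic distances.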
Data availability
The self-supervised data used in our study are publicly available in ZINC (https://zinc.docking.org/tranches/home/), whereas the downstream benchmarks can be downloaded from MoleculeNet (https://moleculenet.org/datasets-1).
Code availability
The source code of this study, which provides the geometry-based GNN and several geometry-level self-supervised learning methods, is freely available at GitHub (https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/pretrained_compound/ChemRL/GEM) to allow replication of the results. The version used for this publication is available at https://doi.org/10.5281/zenodo.5781821.
References
Shen, J. & Nicolaou, C. A. Molecular property prediction: recent trends in the era of artificial intelligence. Drug Discov. Today Technol. 32–33, 29–36 (2020).
Wieder, O. et al. A compact review of molecular property prediction with graph neural networks. Drug Discov. Today Technol. 37, 1–12 (2020).
Huang, K. et al. DeepPurpose: a deep learning library for drug-target interaction prediction. Bioinformatics 36, 5545–5547 (2020).
Rong, Y. et al. Self-supervised graph transformer on large-scale molecular data. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020 (eds Larochelle, H. et al.) 12559–12571 (NeurIPS, 2020).
Shindo, H. & Matsumoto, Y. Gated graph recursive neural networks for molecular property prediction. Preprint at https://arxiv.org/abs/1909.00259 (2019).
Shui, Z. & Karypis, G. Heterogeneous molecular graph neural networks for predicting molecule properties. In 20th IEEE International Conference on Data Mining (eds Plant, C. et al.) 492–500 (IEEE, 2020).
Devlin, J., Chang, M.W., Lee, K. & Toutanova, K. BERT: pretraining of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Burstein, J. et al.) 4171–4186 (Association for Computational Linguistics, 2019).
He, P., Liu, X., Gao, J. & Chen, W. DeBERTa: decodingenhanced BERT with disentangled attention. In 9th International Conference on Learning Representations (ICLR, 2021).
Doersch, C., Gupta, A. & Efros, A. A. Unsupervised visual representation learning by context prediction. In International Conference on Computer Vision (IEEE Computer Society, 2015).
Gidaris, S., Singh, P. & Komodakis, N. Unsupervised representation learning by predicting image rotations. In 6th International Conference on Learning Representations (ICLR, 2018).
Hu, W. et al. Strategies for pretraining graph neural networks. In 8th International Conference on Learning Representations (ICLR, 2020).
PelegShulman, T., Najajreh, Y. & Gibson, D. Interactions of cisplatin and transplatin with proteins: comparison of binding kinetics, binding sites and reactivity of the ptprotein adducts of cisplatin and transplatin towards biological nucleophiles. J. Inorg. Biochem. 91, 306–311 (2002).
Schütt, K. et al. Schnet: A continuousfilter convolutional neural network for modeling quantum interactions. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017 (eds Guyon, I. et al.) 991–1001 (NeurIPS, 2017).
Li, J., Xu, K., Chen, L., Zheng, Z. & Liu, X. GraphGallery: a platform for fast benchmarking and easy development of graph neural networks based intelligent software. In 43rd IEEE/ACM International Conference on Software Engineering: Companion Proceedings 13–16 (IEEE, 2021).
Maziarka, L. et al. Molecule attention transformer. Preprint at https://arxiv.org/abs/2002.08264 (2020).
Danel, Tomasz et al. Spatial graph convolutional networks. In Neural Information Processing—27th International Conference, ICONIP 2020 Vol. 1333 (eds Yang, H. et al.) 668–675 (Springer, 2020).
Klicpera, J., Groß, J. & Günnemann, S. Directional message passing for molecular graphs. In 8th International Conference on Learning Representations (ICLR, 2020).
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing for quantum chemistry. In Proc. 34th International Conference on Machine Learning Vol. 70 (eds Precup, D. & Teh, Y. W.) 1263–1272 (PMLR, 2017).
Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks? In 7th International Conference on Learning Representations (ICLR, 2019).
Sun, F.Y., Hoffmann, J., Verma, V. & Tang, J. Infograph: unsupervised and semisupervised graphlevel representation learning via mutual information maximization. In 8th International Conference on Learning Representations (ICLR, 2020).
Wu, Z. et al. Moleculenet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
Sterling, T. & Irwin, J. J. ZINC 15—ligand discovery for everyone. J. Chem. Inf. Model. 55, 2324–2337 (2015).
Halgren, T. A. Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. J. Comput. Chem. 17, 490–519 (1996).
Durant, J. L., Leland, B. A., Henry, D. R. & Nourse, J. G. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 42, 1273–1280 (2002).
Rogers, D. & Hahn, M. Extendedconnectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Subramanian, G., Ramsundar, B., Pande, V. & Denny, R. A. Computational modeling of βsecretase 1 (bace1) inhibitors using ligand based approaches. J. Chem. Inf. Model. 56, 1936–1949 (2016).
Martins, I. F., Teixeira, A. L., Pinheiro, L. & Falcão, A. O. A Bayesian approach to in silico blood–brain barrier penetration modeling. J. Chem. Inf. Model. 52, 1686–1697 (2012).
Richard, A. M. et al. Toxcast chemical landscape: paving the road to 21st century toxicology. Chem. Res. Toxicol. 29, 1225–1251 (2016).
Gayvert, K. M., Madhukar, N. S. & Elemento, O. A datadriven approach to predicting successes and failures of clinical trials. Cell Chem. Biol. 23, 1294–1301 (2016).
Huang, R. et al. Editorial: Tox21 challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental toxicants and drugs. Front. Environ. Sci. 3, 85 (2017).
Kuhn, M., Letunic, I., Jensen, L. J. & Bork, P. The SIDER database of drugs and side effects. Nucl. Acids Res. 44, 1075–1079 (2016).
Ramsundar, B., Eastman, P., Walters, P. & Pande, V. Deep Learning for the Life Sciences: Applying Deep Learning to Genomics, Microscopy, Drug Discovery, and More (O’Reilly Media, 2019).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
Ba, L. J., Kiros, J. R. & Hinton, G. E. Layer normalization. In NIPS 2016 Deep Learning Symposium recommendation (NIPS, 2016).
Chen, Y., Tang, X., Qi, X., Li, C.G. & Xiao, R. Learning graph normalization for graph neural networks. Preprint at https://arxiv.org/abs/2009.11746 (2020).
Bradley, A. P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30, 1145–1159 (1997).
Mobley, D. L. & Guthrie, J. P. FreeSolv: a database of experimental and calculated hydration free energies, with input files. J. Comput. Aided Mol. Des. 28, 711–720 (2014).
Delaney, J. S. ESOL: estimating aqueous solubility directly from molecular structure. J. Chem. Inf. Model. 44, 1000–1005 (2004).
Gaulton, A. et al. ChEMBL: a largescale bioactivity database for drug discovery. Nucl. Acids Res. 40, 1100–1107 (2012).
Blum, L. C. & Reymond, J.L. 970 Million druglike small molecules for virtual screening in the chemical universe database GDB13. J. Am. Chem. Soc. 131, 8732–8733 (2009).
Ramakrishnan, R., Hartmann, M., Tapavicza, E. & AnatoleVonLilienfeld, O. Electronic spectra from TDDFT and machine learning in chemical space. J. Chem. Phys. 143, 084111 (2015).
Ruddigkeit, L., van Deursen, R., Blum, L. C. & Reymond, J.L. Enumeration of 166 billion organic small molecules in the chemical universe database GDB17. J. Chem. Inf. Model. 52, 2864–2875 (2012).
Yang, K. et al. Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. 59, 3370–3388 (2019).
Xiong, Z. et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J. Med. Chem. 63, 8749–8760 (2020).
Liu, S., Demirel, M. F. & Liang, Y. Ngram graph: simple unsupervised representation for graphs, with applications to molecules. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019 (eds Wallach, H. M. et al.) 8464–8476 (NeurIPS, 2019).
Velickovic, P. et al. Graph attention networks. In 5th International Conference on Learning Representations (ICLR, 2017).
Kipf, T. N. & Welling, M. Semisupervised classification with graph convolutional networks. In 5th International Conference on Learning Representations (ICLR, 2017).
van der Maaten, L. Accelerating tSNE using treebased algorithms. J. Mach. Learn. Res. 15, 3221–3245 (2014).
Davies, D. L. & Bouldin, D. W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell 1, 224–227 (1979).
Rappé, A. K., Casewit, C. J., Colwell, K. S., Goddard, W. A. III & Skiff, W. M. UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. J. Am. Chem. Soc. 114, 10024–10035 (1992).
Gross, E.K.U. & Dreizler, R. M. Density Functional Theory Vol. 337 (Springer, 2013).
CeretoMassagué, A. et al. Molecular fingerprint similarity search in virtual screening. Methods 71, 58–63 (2015).
Coley, C. W., Barzilay, R., Green, W. H., Jaakkola, T. S. & Jensen, K. F. Convolutional embedding of attributed molecular graphs for physical property prediction. J. Chem. Inf. Model. 57, 1757–1772 (2017).
Duvenaud, D. et al. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems (eds Cortes, C. et al.) 2224–2232 (NeurIPS, 2015).
Goh, G. B., Hodas, N. O., Siegel, C. & Vishnu, A. SMILES2Vec: an interpretable generalpurpose deep neural network for predicting chemical properties. Preprint at https://arxiv.org/abs/1712.02034 (2018).
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
Zaremba, W., Sutskever, I. & Vinyals, O. Recurrent neural network regularization. Preprint at https://arxiv.org/abs/1409.2329 (2014).
Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conferenceon Neural Information Processing Systems 2017 5998–6008 (NeurIPS, 2017).
Li, P. et al. Learn molecular representations from largescale unlabeled molecules for drug discovery. Preprint at https://arxiv.org/abs/2012.11175 (2020).
Acknowledgements
This work was supported by the National Engineering Research Center of Deep Learning Technology and Applications.
Contributions
X.F., F.W., H. Wu and H. Wang led the research. L.L., X.F. and F.W. contributed technical ideas. L.L., J.L., D.H., S.Z. and X.F. developed the proposed method. X.F., L.L., S.Z. and J.Z. developed analytics. X.F., L.L., F.W., J.L., D.H., S.Z. and J.Z. wrote the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
About this article
Cite this article
Fang, X., Liu, L., Lei, J. et al. Geometry-enhanced molecular representation learning for property prediction. Nat Mach Intell 4, 127–134 (2022). https://doi.org/10.1038/s42256-021-00438-4