ChemRL-GEM: Geometry Enhanced Molecular Representation Learning for Property Prediction

Effective molecular representation learning is of great importance for molecular property prediction, a fundamental task in the drug and materials industries. Recent advances in graph neural networks (GNNs) have shown great promise for molecular representation learning. Moreover, a few recent studies have demonstrated successful applications of self-supervised learning methods to pre-train GNNs and overcome the problem of insufficient labeled molecules. However, existing GNNs and pre-training strategies usually treat molecules as topological graph data without fully utilizing molecular geometry information, even though the three-dimensional (3D) spatial structure of a molecule, a.k.a. molecular geometry, is one of the most critical factors determining molecular physical, chemical, and biological properties. To this end, we propose a novel Geometry Enhanced Molecular representation learning method (GEM) for Chemical Representation Learning (ChemRL). First, we design a geometry-based GNN architecture that simultaneously models atoms, bonds, and bond angles in a molecule. Specifically, we devise two graphs for a molecule: the first encodes the atom-bond relations; the second encodes the bond-angle relations. Moreover, on top of the devised GNN architecture, we propose several novel geometry-level self-supervised learning strategies to learn spatial knowledge by utilizing the local and global molecular 3D structures. We compare ChemRL-GEM with various state-of-the-art (SOTA) baselines on different molecular benchmarks and show that ChemRL-GEM significantly outperforms all baselines in both regression and classification tasks. For example, the experimental results show an overall improvement of 8.8% on average compared to SOTA baselines on the regression tasks, demonstrating the superiority of the proposed method.


Introduction
Molecular property prediction has been widely considered as one of the most critical tasks in computational drug and materials discovery, since many methods rely on predicted molecular properties to evaluate, select and generate molecules [43,52]. With the development of deep neural networks (DNNs), molecular representation learning exhibits a great advantage over feature
Figure 1: Comparison between two molecules (cis-1,2-DCE and trans-1,2-DCE) with the same topology but different geometries. The two chlorine atoms are on different sides in the left molecule and on the same side in the right molecule.
To solve these problems, we propose a novel Geometry Enhanced Molecular representation learning method (GEM) for Chemical Representation Learning (ChemRL). Firstly, to make the message passing sensitive to geometries, we model the atoms, bonds, and bond angles simultaneously by designing a geometry-based GNN architecture (GeoGNN), which contains two graphs: the first encodes the atom-bond relations; the second encodes the bond-angle relations. Secondly, we pre-train GeoGNN to learn both the chemical laws and the geometries by designing various geometry-level self-supervised learning tasks. To verify the effectiveness of the proposed ChemRL-GEM, we compare it with several SOTA baselines on a dozen molecular property prediction benchmarks. Our extensive experimental study demonstrates the superiority of ChemRL-GEM.
Our contributions can be summarized as follows:
• We propose a geometry-based graph neural network, GeoGNN, to encode both the topology and the geometry information of molecules.
• We introduce multiple geometry-level self-supervised learning tasks to learn the molecular 3D spatial knowledge in addition to other self-supervised learning tasks.

Graph-based Molecular Representation
A molecule consists of atoms, and the neighboring atoms are connected by the chemical bonds, which can be naturally represented by a graph G = (V, E), where V is a node set and E is an edge set. An atom in the molecule is regarded as a node v ∈ V and a chemical bond in the molecule is regarded as an edge (u, v) ∈ E connecting atoms u and v.
Graph neural networks (GNNs) can be seen as message passing neural networks [18], which are useful for predicting molecular properties. Following the definitions of previous GNNs [55], the features of a node v are represented by $x_v$ and the features of an edge (u, v) are represented by $x_{uv}$. Taking node features, edge features and the graph structure as inputs, a GNN learns the representation vectors of the nodes and the entire graph, where the representation vector of a node v is denoted by $h_v$ and the representation vector of the entire graph is denoted by $h_G$. A GNN iteratively updates a node's representation vector by aggregating the messages from the node's neighbors. Given a node v, its representation vector $h_v^{(k)}$ at the k-th iteration is formalized by
$$a_v^{(k)} = \mathrm{AGGREGATE}^{(k)}\big(\{(h_v^{(k-1)}, h_u^{(k-1)}, x_{uv}) \mid u \in \mathcal{N}(v)\}\big), \quad h_v^{(k)} = \mathrm{COMBINE}^{(k)}\big(h_v^{(k-1)}, a_v^{(k)}\big), \tag{1}$$
where $\mathcal{N}(v)$ is the set of neighbors of node v, $\mathrm{AGGREGATE}^{(k)}$ is the aggregation function for aggregating messages from a node's neighborhood, and $\mathrm{COMBINE}^{(k)}$ is the update function for updating the node representation. We initialize $h_v^{(0)} = x_v$. A READOUT function is introduced to integrate the nodes' representation vectors at the final iteration so as to obtain the graph's representation vector $h_G$, which is formalized as
$$h_G = \mathrm{READOUT}\big(h_v^{(K)} \mid v \in V\big), \tag{2}$$
where K is the number of iterations. In most cases, READOUT is a permutation-invariant pooling function, such as summation or maximization. The graph's representation vector $h_G$ can then be used for downstream task predictions.
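The message-passing scheme described above can be illustrated with a minimal pure-Python sketch. It is only an illustration under simplifying assumptions: AGGREGATE is taken to be a sum of neighbor states plus edge features, COMBINE is a plain addition (a learned network in practice), and READOUT is a sum. The function name `message_passing` and the dictionary-based graph encoding are hypothetical choices made for this example.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equally sized vectors."""
    return [x + y for x, y in zip(a, b)]


def message_passing(
    node_feats: Dict[int, Vector],
    edges: Dict[Tuple[int, int], Vector],
    num_iters: int,
) -> Vector:
    """Run K rounds of sum-aggregate / add-combine, then a sum readout."""
    # Build the undirected neighbor list N(v) for every node v.
    neighbors: Dict[int, List[Tuple[int, Vector]]] = {v: [] for v in node_feats}
    for (u, v), x_uv in edges.items():
        neighbors[u].append((v, x_uv))
        neighbors[v].append((u, x_uv))

    h = dict(node_feats)  # initial state: h_v = x_v
    for _ in range(num_iters):
        new_h = {}
        for v, h_v in h.items():
            # AGGREGATE (assumed): sum of neighbor state + edge feature.
            a_v = [0.0] * len(h_v)
            for u, x_uv in neighbors[v]:
                a_v = add(a_v, add(h[u], x_uv))
            # COMBINE (assumed): simple addition instead of a learned MLP.
            new_h[v] = add(h_v, a_v)
        h = new_h

    # READOUT: permutation-invariant sum over the final node states.
    h_G = [0.0] * len(next(iter(h.values())))
    for h_v in h.values():
        h_G = add(h_G, h_v)
    return h_G
```

Because the readout is a sum, relabeling the nodes leaves the graph vector unchanged, which is the permutation invariance mentioned in the text.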

Pre-training Methods for GNNs
In the molecular representation learning community, recently several works [48,23,40] have explored the power of self-supervised learning to improve the generalization ability of GNN models on downstream tasks. They mainly focus on two kinds of self-supervised learning tasks: the node-level (edge-level) tasks and the graph-level tasks.
The node-level self-supervised learning tasks are devised to capture the local domain knowledge. For example, some studies randomly mask a portion of nodes or sub-graphs and then predict their properties by the node/edge representation. The graph-level self-supervised learning tasks are used to capture the global information, like predicting the graph properties by the graph representation. Usually, the graph properties are domain-specific knowledge, such as experimental results from biochemical assays or the existence of molecular functional groups.

The ChemRL-GEM Framework
This section introduces the details of our proposed Geometry Enhanced Molecular representation learning method (GEM) for Chemical Representation Learning, which includes two parts: a novel geometry-based GNN and various geometry-level self-supervised learning tasks.

Geometry-based Graph Neural Network
We propose a Geometry-based Graph Neural Network (GeoGNN) that encodes the molecular geometries by modeling the atom-bond-angle relations, while traditional GNNs only consider the relations between atoms and bonds.
Figure 2: Illustration of the atom-bond graph and the bond-angle graph. The left figure shows the structure of Methanamine in the 3D space. We can easily encode its geometry information with the help of the atom-bond graph that describes the relations between atoms and bonds and the bond-angle graph that describes the relations between bonds and bond angles.
For a molecule, we denote the atom set as V, the bond set as E, and the bond angle set as A.
We introduce an atom-bond graph G and a bond-angle graph H for each molecule, as illustrated in Figure 2. The atom-bond graph is defined as G = (V, E), where atom u ∈ V is regarded as a node of G and bond (u, v) ∈ E as an edge of G, connecting atom u and atom v. Similarly, the bond-angle graph is defined as H = (E, A), where bond (u, v) ∈ E is regarded as a node of H and bond angle (u, v, w) ∈ A as an edge of H, connecting bond (u, v) and bond (v, w). We use $x_u$ as the initial features of atom u, $x_{uv}$ as the initial features of bond (u, v), and $x_{uvw}$ as the initial features of bond angle (u, v, w). The atom-bond graph G and the bond-angle graph H, as well as the atom features, bond features and bond angle features, are taken as the inputs of GeoGNN. The bond representation is initialized as $h_{uv}^{(0)} = x_{uv}$. In order to connect the atom-bond graph G and the bond-angle graph H, the representation vectors of the bonds are taken as the communication links between G and H, as shown in Figure 3. More concretely, the bonds' representation vectors are learned by aggregating the messages from the neighboring bonds and corresponding bond angles in the bond-angle graph H. Then, the learned bond representation vectors are taken as edge features of the atom-bond graph G and help to learn the atoms' representation vectors.
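To make the definition of the bond-angle graph H = (E, A) concrete, the following sketch derives its edges from a plain list of bonds: two bonds are connected whenever they share an atom, and that connection corresponds to one bond angle. The function name `build_bond_angle_graph` and the tuple layout are illustrative assumptions, not part of the described method.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Bond = Tuple[int, int]


def canonical(u: int, v: int) -> Bond:
    """Store each undirected bond with the smaller atom index first."""
    return (u, v) if u < v else (v, u)


def build_bond_angle_graph(bonds: List[Bond]) -> List[Tuple[Bond, Bond, int]]:
    """Return the edges of H as (bond_1, bond_2, shared_atom) triples."""
    # Collect the bonds incident to every atom.
    incident: Dict[int, List[Bond]] = {}
    for u, v in bonds:
        b = canonical(u, v)
        incident.setdefault(u, []).append(b)
        incident.setdefault(v, []).append(b)

    # Every pair of bonds meeting at the same atom forms one bond angle.
    angles = []
    for atom in sorted(incident):
        for b1, b2 in combinations(sorted(incident[atom]), 2):
            angles.append((b1, b2, atom))
    return angles
```

For a three-atom chain 0-1-2 this yields a single angle at atom 1, while an atom with four bonded neighbors contributes six angles (one per pair of its bonds).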
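Given bond (u, v), its representation vector $h_{uv}^{(k)}$ at the k-th iteration is formalized by
$$a_{uv}^{(k)} = \mathrm{AGGREGATE}_{\text{bond-angle}}^{(k)}\big(\{(h_{uv}^{(k-1)}, h_{uw}^{(k-1)}, x_{wuv}) : w \in \mathcal{N}(u)\} \cup \{(h_{uv}^{(k-1)}, h_{vw}^{(k-1)}, x_{uvw}) : w \in \mathcal{N}(v)\}\big), \quad h_{uv}^{(k)} = \mathrm{COMBINE}_{\text{bond-angle}}^{(k)}\big(h_{uv}^{(k-1)}, a_{uv}^{(k)}\big). \tag{3}$$
Here, $\mathcal{N}(u)$ and $\mathcal{N}(v)$ denote the neighboring atoms of atom u and atom v, respectively, and $\{(u, w) : w \in \mathcal{N}(u)\} \cup \{(v, w) : w \in \mathcal{N}(v)\}$ are the neighboring bonds of bond (u, v). $\mathrm{AGGREGATE}_{\text{bond-angle}}$ is the message aggregation function, and $\mathrm{COMBINE}_{\text{bond-angle}}$ is the update function for the bond-angle graph H. In this way, the information from the neighboring bonds and the corresponding bond angles is aggregated into $a_{uv}^{(k)}$. Then, the representation vector of bond (u, v) is updated according to the aggregated information.
With the learned representation vectors of the bonds from the bond-angle graph H, given an atom u, its representation vector $h_u^{(k)}$ at the k-th iteration can be formalized as
$$a_u^{(k)} = \mathrm{AGGREGATE}_{\text{atom-bond}}^{(k)}\big(\{(h_u^{(k-1)}, h_v^{(k-1)}, h_{uv}^{(k-1)}) \mid v \in \mathcal{N}(u)\}\big), \quad h_u^{(k)} = \mathrm{COMBINE}_{\text{atom-bond}}^{(k)}\big(h_u^{(k-1)}, a_u^{(k)}\big). \tag{4}$$
Similarly, $\mathcal{N}(u)$ denotes the neighboring atoms of atom u, $\mathrm{AGGREGATE}_{\text{atom-bond}}$ is the message aggregation function for the atom-bond graph G, and $\mathrm{COMBINE}_{\text{atom-bond}}$ is the update function. For atom u, we initialize $h_u^{(0)} = x_u$. The representation vectors of the atoms at the final iteration are integrated to obtain the molecular representation vector $h_G$ by the READOUT function, which is formalized as
$$h_G = \mathrm{READOUT}\big(h_u^{(K)} \mid u \in V\big), \tag{5}$$
where K is the number of iterations. The molecule's representation vector $h_G$ is used to predict the molecular properties.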
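The interplay of the two graphs can be sketched as a single iteration in pure Python. This is a simplified illustration, not the actual GeoGNN block: both AGGREGATE functions are assumed to be sums, both COMBINE functions are assumed to be plain additions, and the atom update is assumed to read the bond states from the previous iteration. The name `geognn_step` and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Tuple

Vec = List[float]
Bond = Tuple[int, int]


def vadd(*vs: Vec) -> Vec:
    """Element-wise sum of any number of equally sized vectors."""
    return [sum(xs) for xs in zip(*vs)]


def geognn_step(
    h_atom: Dict[int, Vec],
    h_bond: Dict[Bond, Vec],
    angle_feats: Dict[Tuple[Bond, Bond], Vec],
) -> Tuple[Dict[int, Vec], Dict[Bond, Vec]]:
    """One simplified iteration over the bond-angle and atom-bond graphs.

    h_bond is keyed by (u, v) with u < v; angle_feats is keyed by a pair
    of bonds that share an atom and holds the bond-angle features.
    """
    dim = len(next(iter(h_atom.values())))

    # 1) Bond-angle graph H: bonds receive messages from neighboring
    #    bonds together with the feature of the angle between them.
    new_bond = {}
    for b, h_b in h_bond.items():
        agg = [0.0] * dim
        for (b1, b2), x_angle in angle_feats.items():
            if b == b1:
                agg = vadd(agg, h_bond[b2], x_angle)
            elif b == b2:
                agg = vadd(agg, h_bond[b1], x_angle)
        new_bond[b] = vadd(h_b, agg)

    # 2) Atom-bond graph G: atoms receive messages from neighboring atoms,
    #    with the bond representation acting as the edge feature.
    new_atom = {}
    for u, h_u in h_atom.items():
        agg = [0.0] * dim
        for (a, c), h_ac in h_bond.items():
            if u == a:
                agg = vadd(agg, h_atom[c], h_ac)
            elif u == c:
                agg = vadd(agg, h_atom[a], h_ac)
        new_atom[u] = vadd(h_u, agg)

    return new_atom, new_bond
```

Stacking several such steps and summing or averaging the final atom vectors would give a molecule-level vector in the spirit of the READOUT step.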

Geometry-level Self-supervised Learning Tasks
To further boost the generalization ability of GeoGNN, we propose three geometry-level self-supervised learning tasks to pre-train GeoGNN: 1) bond length prediction; 2) bond angle prediction; 3) atomic distance matrix prediction. The bond lengths and bond angles describe the local spatial structures, while the atomic distance matrices describe the global spatial structures.

Local Spatial Structures
The bond lengths and the bond angles are the most important molecular geometrical parameters. The bond length is the average distance between two bonded atoms in a molecule, reflecting the bond strength between the atoms, while the bond angle is the angle between two consecutive bonds, involving three atoms, and describes the local spatial structure of a molecule.
In order to learn the local spatial structures, we construct self-supervised learning tasks that predict the bond lengths and bond angles. Firstly, for a molecule, we randomly select 15% of the atoms. For each selected atom, we extract its 1-hop neighboring atoms and bonds, as well as the bond angles formed by that selected atom. Secondly, we mask the features of these atoms, bonds, and bond angles in the 1-hop context. The representation vectors of the extracted atoms and bonds at the final iteration of GeoGNN are used to predict the extracted bond lengths and bond angles. Figure 4(a) and Figure 4(b) show the self-supervised learning tasks based on bond lengths and bond angles. More concretely, for a selected atom v, the loss functions of the self-supervised tasks of local geometry information are defined as follows, where $f_{\mathrm{length}}(\cdot)$ is the network predicting the bond lengths, and $f_{\mathrm{angle}}(\cdot)$ is the network predicting the bond angles. $l_{uv}$ denotes the length of the bond connecting atom u and atom v, and $\varphi_{uvw}$ denotes the degree of the bond angle connecting bonds (u, v) and (v, w). The task of predicting the local spatial structures can be seen as a node-level self-supervised learning task.
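The selection and masking procedure described above can be sketched as follows. The sketch only gathers the 1-hop context to be masked; it does not implement the prediction networks or the losses. The rounding rule (at least one atom is always selected), the fixed seed, and the name `extract_masked_context` are assumptions made for illustration.

```python
import random
from typing import List, Set, Tuple


def extract_masked_context(
    num_atoms: int,
    bonds: List[Tuple[int, int]],
    mask_ratio: float = 0.15,
    seed: int = 0,
):
    """Pick center atoms and collect their 1-hop atoms, bonds and angles."""
    rng = random.Random(seed)
    # Assumed rounding: truncate, but always select at least one atom.
    num_selected = max(1, int(num_atoms * mask_ratio))
    centers = rng.sample(range(num_atoms), num_selected)

    masked_atoms: Set[int] = set()
    masked_bonds: Set[Tuple[int, int]] = set()
    masked_angles: Set[Tuple[int, int, int]] = set()
    for c in centers:
        # 1-hop neighbors of the selected atom c.
        nbrs = sorted({v if u == c else u for u, v in bonds if c in (u, v)})
        masked_atoms.update([c, *nbrs])
        masked_bonds.update((min(c, n), max(c, n)) for n in nbrs)
        # Bond angles formed at c by every pair of its neighbors.
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                masked_angles.add((a, c, b))
    return centers, masked_atoms, masked_bonds, masked_angles
```

The lengths of the collected bonds and the sizes of the collected angles would then serve as the prediction targets for the selected context.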

Global Spatial Structures
In addition to the tasks for learning local spatial structures, we also design the atomic distance matrix prediction task for learning the global molecular geometry. We construct the atomic distance matrix for each molecule based on the 3D coordinates of the atoms. Then, we predict the elements of the distance matrix, as shown in Figure 4(c). We use $d_{uv}$ to denote the distance between two atoms u and v in the molecule. Note that, for two molecules with the same topological structure, the spatial distances between the corresponding atoms could vary greatly. Thus, rather than treating the prediction of the atomic distance matrix as a regression problem, we treat it as a multi-class classification problem by discretizing the atomic distances. The loss function is defined as
$$L_{\mathrm{distance}}(V) = \frac{1}{|V|^2} \sum_{u, v \in V} -\mathrm{bin}^{\top}(d_{uv}) \cdot \log\big(f_{\mathrm{distance}}(h_u^{(K)}, h_v^{(K)})\big), \tag{7}$$
where $f_{\mathrm{distance}}(\cdot)$ is the network predicting the distribution of atomic distances, the $\mathrm{bin}(\cdot)$ function is used to discretize the atomic distance $d_{uv}$ into a one-hot vector, and $\log(\cdot)$ is the logarithmic function. The task of predicting the bond lengths can be seen as a special case of the task of predicting the atomic distances. The former focuses more on the accurate local spatial structures, while the latter focuses more on the distribution of the global spatial structures. To pre-train GeoGNN, we consider both the local and the global spatial structures of each molecule by summing up the loss functions defined in Equation 6 and Equation 7.
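The discretization step and the resulting classification loss for a single atom pair can be sketched in a few lines. The bin width of 0.5 and the clipping of large distances into the last bin are assumptions for illustration only; the text specifies neither. The names `one_hot_bin` and `cross_entropy` are likewise hypothetical.

```python
import math
from typing import List


def one_hot_bin(d: float, num_bins: int = 30, bin_width: float = 0.5) -> List[float]:
    """Discretize a distance into a one-hot vector (assumed uniform bins)."""
    # Distances beyond the covered range fall into the last bin.
    idx = min(int(d / bin_width), num_bins - 1)
    return [1.0 if i == idx else 0.0 for i in range(num_bins)]


def cross_entropy(one_hot: List[float], probs: List[float]) -> float:
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0.0)
```

Averaging this per-pair term over all atom pairs of a molecule gives a molecule-level loss for the global geometry task.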

Experiments
To thoroughly evaluate the performance of ChemRL-GEM, we compare it with multiple state-of-the-art (SOTA) methods on 12 benchmark datasets from MoleculeNet [53], covering various molecular property prediction tasks in physical chemistry, quantum mechanics, biophysics, and physiology. The source code for the experiments will be released to ensure reproducibility when the paper is published.

Pre-training Settings
Datasets. We use 20 million unlabelled molecules sampled from Zinc15 [46], a public access database that contains purchasable "drug-like" compounds, to pre-train GeoGNN. We randomly sample 90% of the molecules for training and the remaining for evaluation.
Self-supervised Learning Task Settings. We utilize the geometry-level and graph-level tasks to pre-train GeoGNN. For the geometry-level tasks, we utilize the Merck molecular force field [20] function from the RDKit package, an open-source cheminformatics toolkit [29], to obtain the simulated 3D coordinates of the atoms in the molecules. The geometric features of the molecule, including bond lengths, bond angles and atomic distance matrices, are calculated from the simulated 3D coordinates. For the graph-level tasks, we predict two kinds of molecular fingerprints: 1) the Molecular ACCess System (MACCS) key [12]; 2) the extended-connectivity fingerprint (ECFP) [39].
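Once 3D coordinates are available, the geometric features mentioned above follow from elementary vector arithmetic. The sketch below assumes the coordinates are already given as plain (x, y, z) tuples; it does not call RDKit or any force field, and the function names are illustrative.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def bond_length(p: Point, q: Point) -> float:
    """Euclidean distance between two bonded atoms."""
    return math.dist(p, q)


def bond_angle(a: Point, b: Point, c: Point) -> float:
    """Angle in radians at atom b between bonds (a, b) and (b, c)."""
    v1 = [x - y for x, y in zip(a, b)]
    v2 = [x - y for x, y in zip(c, b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to [-1, 1] to guard against floating-point drift.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def distance_matrix(coords: List[Point]) -> List[List[float]]:
    """Pairwise Euclidean distances between all atoms."""
    return [[math.dist(p, q) for q in coords] for p in coords]
```

The sketch requires Python 3.8 or later for `math.dist` and the multi-argument form of `math.hypot`.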
Following the previous work [23], we split all 12 datasets with scaffold split [37], which splits molecules according to their scaffold (molecular substructure). Compared with splitting the dataset randomly, scaffold split is a more challenging splitting method, which can better evaluate the generalization ability of the models on out-of-distribution data samples. Based on scaffold split, we split the molecules in each dataset into a training set, a validation set, and a test set with a ratio of 8:1:1. We run each method for 100 epochs on the training set and then select the epoch according to the validation set. The selected epoch is evaluated on the test set.
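A scaffold-based 8:1:1 split can be sketched as below. The sketch assumes that a scaffold string has already been computed for every molecule (for example with a cheminformatics toolkit) and uses a common greedy strategy: scaffold groups are sorted from largest to smallest and assigned to the training, validation, or test set in turn. This greedy rule and the name `scaffold_split` are assumptions; the exact procedure used in the experiments may differ.

```python
from collections import defaultdict
from typing import List, Tuple


def scaffold_split(
    scaffolds: List[str],
    frac_train: float = 0.8,
    frac_valid: float = 0.1,
) -> Tuple[List[int], List[int], List[int]]:
    """Split molecule indices so that no scaffold spans two subsets."""
    # Group molecule indices by their scaffold string.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train += group
        elif len(valid) + len(group) <= frac_valid * n:
            valid += group
        else:
            test += group
    return train, valid, test
```

Because whole scaffold groups are assigned at once, the resulting ratio is only approximately 8:1:1 on real datasets.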
GNN Architecture. We use the AGGREGATE function and COMBINE function defined in the Graph Isomorphism Network (GIN) [55]. To further improve the performance, residual connections [21], layer normalization [1], and graph normalization [7] are incorporated into GIN. Also, we use average pooling as the READOUT function to obtain the graph representation.
Hyper-parameters and Evaluation Metrics. We use the Adam optimizer [26] with a learning rate of 0.001 for all our models. For each dataset, we train the model with a batch size of 32. As suggested by MoleculeNet [53], we use the average ROC-AUC [4] as the evaluation metric for the 6 binary classification datasets. With respect to the regression datasets, for FreeSolv [34], ESOL [9], and Lipo [15], we use Root Mean Square Error (RMSE), and for QM7 [3], QM8 [36], and QM9 [41], we use Mean Absolute Error (MAE). We execute 4 independent runs for each method and report the mean and the standard deviation of the metrics.
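The regression metrics and the aggregation over independent runs can be written down directly. In this sketch the standard deviation is the sample standard deviation, which is an assumption since the text does not say which estimator is reported; the function names are illustrative.

```python
import math
from statistics import mean, stdev
from typing import List, Tuple


def rmse(y_true: List[float], y_pred: List[float]) -> float:
    """Root mean square error."""
    return math.sqrt(mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


def mae(y_true: List[float], y_pred: List[float]) -> float:
    """Mean absolute error."""
    return mean([abs(t - p) for t, p in zip(y_true, y_pred)])


def summarize_runs(scores: List[float]) -> Tuple[float, float]:
    """Mean and (assumed) sample standard deviation over independent runs."""
    return mean(scores), stdev(scores)
```

With 4 independent runs, `summarize_runs` would receive a list of 4 metric values, one per run.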
Baselines. We compare the proposed method with various competitive baselines. D-MPNN [56] and AttentiveFP [54] are the GNNs without pre-training, while N-Gram [32], PretrainGNN [23], and GROVER [40] are the methods with pre-training. N-Gram assembles the node embeddings in short walks in the graph to obtain the graph representation and then leverages Random Forest or XGBoost to predict the molecular properties. PretrainGNN implements several types of self-supervised learning tasks, among which we report the best result. GROVER integrates GNN into Transformer with the context prediction task and the functional motif prediction task, and we report the results of GROVER base and GROVER large with different network capacity.
More network and experimental details can be found in the Appendices.

Overall Performance
The overall performance of ChemRL-GEM along with other methods on the molecular property prediction benchmarks is summarized in Table 1, where the SOTA results are shown in bold and the cells in gray indicate the previous SOTA results. The numbers in brackets are the standard deviations. From Table 1, we have the following observations: 1) ChemRL-GEM achieves SOTA results on 11/12 datasets. On the regression tasks, ChemRL-GEM achieves an overall relative improvement of 8.8% on average compared to the previous SOTA results in each dataset. On the classification tasks, it achieves an overall relative improvement of 3.7% on the average ROC-AUC compared to the previous SOTA result from D-MPNN. 2) Since the tasks of the regression datasets, such as the water solubility prediction in the ESOL dataset and the electronic properties prediction in the QM7 dataset, are much more correlated with the molecular geometries than the classification tasks, ChemRL-GEM achieves a more considerable improvement on the regression datasets than on the classification datasets. 3) ChemRL-GEM does not achieve SOTA results on ClinTox. A possible reason is that the ClinTox dataset is highly unbalanced, with only 9 positive samples in the test set, which may cause unstable results. We also conduct experiments with other splitting methods, and ChemRL-GEM still achieves SOTA results. Please refer to Appendix C for more details.

Ablation Studies of ChemRL-GEM
Contribution of GeoGNN. We investigate the effect of GeoGNN on regression datasets which are more related to the molecular geometries. GeoGNN is compared with multiple GNN architectures, including the commonly used GNN architectures, GIN [55], GAT [50], and GCN [27], as well as the architectures specially designed for molecular representation, D-MPNN [56], AttentiveFP [54], and GTransformer [40]. From Table 2, we can conclude that GeoGNN significantly outperforms other GNN architectures on all the regression datasets since GeoGNN utilizes a clever architecture that incorporates geometrical parameters even though the 3D coordinates of the atoms are simulated. The overall relative improvement is 7.9% compared to the best results of previous methods.
Although the 3D coordinates simulated by RDKit are not accurate, they still provide coarse information about the 3D conformation. As a further check, we replace them with the 3D coordinates of the molecules provided by QM9, which are much more accurate. GeoGNN with the accurate 3D coordinates achieves an average MAE of 0.00652, while GeoGNN with the coarse 3D coordinates achieves an average MAE of 0.00746. This improvement further demonstrates that GeoGNN learns better molecular representations when provided with more accurate 3D coordinates.

Contribution of Geometry-level Tasks.
To study the effect of the proposed geometry-level self-supervised learning tasks, we apply different types of self-supervised learning tasks to pre-train GeoGNN on the regression datasets. In Table 3, w/o pre-training denotes the GeoGNN network without pre-training, Geometry denotes our proposed geometry-level tasks, Graph denotes the graph-level task that predicts the molecular fingerprints, and Context [40] denotes a node-level task that predicts the atomic context. In general, the methods with geometry-level tasks are better than those without them. Furthermore, Geometry performs better than Geometry+Graph in the regression tasks, which may be due to the weak connection between molecular fingerprints and the regression tasks.

Molecular Representation
Current molecular representations can be categorized into three types: molecular fingerprints, sequence-based representations and graph-based representations.
Molecular Fingerprints. Molecular fingerprints, such as ECFP [39] and MACCS [12], are commonly used for molecular representations by traditional machine learning methods [6,8,13,24], which encode a molecule into a sequence of bits according to the molecules' topological substructures. However, molecular fingerprints lack the ability to represent complex global structures, since they only focus on the local substructures.
Sequence-based Representations. Some studies [19,24] take SMILES strings [51], which describe molecules as strings, as inputs, and leverage sequence-based models, such as Recurrent Neural Networks and Transformer [57,49], to learn the molecular representations. However, the same molecule can be represented by multiple SMILES strings, resulting in ambiguity of the representation. Besides, it is laborious for sequence-based representations to model some molecular topological structures, such as rings.
Graph-based Representations. Many works [24,40,44,45,18] have showcased the great potential of graph neural networks for modeling molecules by taking each atom as a node and each chemical bond as an edge. For example, Attentive FP [54] proposes to extend the graph attention mechanism in order to learn aggregation weights. Furthermore, several works [42,30] have started to incorporate the atomic distance into edge features to consider partial geometry information. However, they still lack the ability to model the full geometry information due to the limitations of traditional GNN architectures.

Pre-training for Graph Neural Networks
Self-supervised learning [10,11,17,22,31] has achieved great success in natural language processing (NLP), computer vision (CV), and other domains. It trains on unlabeled samples in a supervised manner to alleviate the over-fitting issue and improve data utilization efficiency. Recently, some studies [23,40] have applied self-supervised learning methods to GNNs for molecular property prediction to overcome the insufficiency of labeled samples. These works learn the molecular representation vectors by exploiting node-level and graph-level tasks, where the node-level tasks learn the local domain knowledge by predicting the node properties and the graph-level tasks learn the global domain knowledge by predicting biological activities. Although existing self-supervised learning methods can boost the generalization ability, they neglect the spatial knowledge that is strongly related to the molecular properties.

Conclusions and Future Work
Efficient molecular representation learning is crucial for molecular property prediction. Existing works that apply GNNs and pre-training methods to molecular property prediction fail to fully utilize the molecular geometries described by bonds, bond angles, and other geometrical parameters. To this end, we design a geometry-based GNN for learning the atom-bond-angle relations, which utilizes the information of the bond angles by introducing a bond-angle graph on top of the atom-bond graph. Moreover, multiple geometry-level self-supervised learning tasks are constructed to predict the molecular geometries and capture spatial knowledge. To verify the effectiveness of ChemRL-GEM, extensive experiments were conducted, comparing it with multiple competitive baselines. ChemRL-GEM achieves SOTA results on 11 of the 12 benchmarks. In the future, other geometric parameters, such as torsional angles, will be considered to further boost the molecular representation capacity. We will also study the performance of applying the molecular representation to other molecule-related problems, including predicting drug-target interactions and the interactions between molecules.

A.1 GNN Architecture
As introduced in the main body, GeoGNN consists of two stacks of GeoGNN blocks, one for the atom-bond graph and the other for the bond-angle graph. The architecture of GeoGNN is shown at the bottom of Figure 5, where Layer Norm [1], Graph Size Norm [7] and Residual Connection [21] are popular tricks commonly used in GNNs. GIN [55], a convolutional block for message passing, is utilized as the backbone of GeoGNN. In our experiments, the AGGREGATE function and COMBINE function in GIN are defined as where the AGGREGATE function sums up the node features and the edge features, while the COMBINE function is a 2-layer Multi-Layer Perceptron (MLP) with a hidden size of 32. We use 8 GeoGNN blocks for the atom-bond graph and the bond-angle graph, and the hidden size is set to 32.
Figure 5. More specifically, for the geometry-level tasks, the headers are applied upon the node representations h where MLP is a 2-layer MLP network with a hidden size of 256 and Concat is the concatenation operation. For the graph-level tasks and downstream tasks, the headers are applied upon the graph representations $h_G$: where the MLP is a 3-layer MLP network with a hidden size of 128, and $f_{\mathrm{down}_1}$ to $f_{\mathrm{down}_N}$ in Figure 5 represent N different downstream tasks.

A.2 Input Features
As stated in Section 3.1, the input features of GeoGNN can be categorized into three parts: the atom features, bond features, and bond angle features, shown in Table 4. All these features are extracted by RDKit [29] with the help of the Merck molecular force field function. Among them, bond lengths and bond angles are continuous features, while the others are discrete features. For the continuous features, we use Radial Basis Functions [5] to expand each continuous value x into a feature vector e of dimension M:
$$e_m = \exp\big(-\gamma (x - \mu_m)^2\big), \quad m = 1, \dots, M,$$
where γ controls the shape of the radial kernel, and we set γ = 10. $\{\mu_m\}$ is a list of centers ranging from the minimum value to the maximum value of the corresponding feature with a stride of 0.1. The discrete features are converted into one-hot vectors according to their vocabulary size.
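The expansion of a continuous feature can be sketched as follows, assuming a Gaussian radial kernel of the form exp(-γ(x - µ)²) and centers that include both the minimum and the maximum value. Both choices, as well as the names `make_centers` and `rbf_expand`, are assumptions made for this illustration; γ = 10 and the stride of 0.1 are taken from the text.

```python
import math
from typing import List


def make_centers(lo: float, hi: float, stride: float = 0.1) -> List[float]:
    """Evenly spaced centers from lo to hi (both ends assumed included)."""
    n = int(round((hi - lo) / stride))
    return [lo + i * stride for i in range(n + 1)]


def rbf_expand(x: float, centers: List[float], gamma: float = 10.0) -> List[float]:
    """Expand a scalar into one Gaussian response per center."""
    return [math.exp(-gamma * (x - mu) ** 2) for mu in centers]
```

A value that coincides with a center produces a response of 1 at that center and smoothly decaying responses at the others, so nearby values receive similar feature vectors.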

B Training and Test Processes for ChemRL-GEM
The training and test processes of ChemRL-GEM are shown in Algorithm 1.
We select 6 molecular regression datasets and 6 molecular classification datasets from MoleculeNet [53] as the benchmarks. These benchmarks can be categorized into physical chemistry, quantum mechanics, biophysics, and physiology, as shown in Table 5. The benchmarks of physical chemistry and quantum mechanics are more related to molecular geometries than those of biophysics and physiology. The descriptions of the benchmarks are listed as follows:
• ESOL [9] contains water solubility data (log solubility in mols per liter) for common organic small molecules. It is a standard dataset that is widely used to estimate solubility directly.
• FreeSolv [34] contains the experimental values of free energy of hydration of small molecules in water. The values are all obtained through molecular dynamics simulations.
• Lipophilicity [15] is collected from the ChEMBL database [15], containing the experimental results of the octanol/water partition coefficient, which reflects the solubility of the molecules.
• QM7 [3] is a subset of GDB-13, which provides information about the spatial structures of the molecules. It records various electronic properties that are stable and synthetically obtainable, such as HOMO and LUMO determined by ab-initio density functional theory (DFT), and atomization energy.
• QM8 [36] uses a variety of quantum mechanics methods to calculate the electronic spectrum and excited state energy of small molecules.
• QM9 [41] provides information on the geometric, energetic, electronic and thermodynamic properties of small molecules calculated by DFT.
• BACE [47] records molecules with 2D structures and properties, which provides the qualitative (binary label) binding results on inhibitors of human β-secretase 1.
• BBBP [33] is a dataset that contains molecules with measured permeability across the blood-brain barrier.
• ClinTox [16] includes drugs approved by the FDA and those that have failed clinical trials for toxicity reasons.
• SIDER [28] is a database of marketed drugs and adverse drug reactions (ADR), grouped into 27 system organ classes.
• Tox21 [25] measures the toxicity of compounds on 12 different targets, including nuclear receptors and stress response pathways. It was used in the 2014 Tox21 Data Challenge as a public database.
• ToxCast [38] includes toxicology results of thousands of molecules. It provides multiple toxicity labels by running high-throughput screening experiments on a large library of chemicals.

C.2 Dataset Splitting Methods
There are several popular splitting methods used in molecular property prediction datasets, including random splitting, scaffold splitting [2] and random scaffold splitting. Scaffold splitting and random scaffold splitting offer more challenging yet realistic ways of splitting by keeping the molecules with the same scaffold in only one of the training, validation, or test sets. Since random scaffold splitting introduces more randomness into the evaluation of different methods, we adopt scaffold splitting in the experiments of our main body. At the same time, we additionally conduct experiments using random scaffold splitting on the classification datasets following the same experimental settings used in GROVER [40], and again ChemRL-GEM achieves the SOTA results with an overall relative improvement of 2.0% compared to the previous SOTA results on all the datasets, as shown in Table 6. Note that the results of the baselines are directly copied from [40].
In addition, we inspected several public implementations of the baselines' scaffold calculation and found that some of them consider the chirality of molecules and some do not. This inconsistency results in a big difference in the resulting training, validation, and test sets. To make the comparison fair, we use the scaffold calculation that considers molecular chirality and apply it to all the experiments. Furthermore, to ensure data alignment, we summarize the statistics of the labels on each benchmark. For the regression tasks, we summarize the minimum, maximum, and mean values of the labels, as shown in Figure 7, while for the classification tasks, we summarize the ratios of positive and negative samples, as shown in Figure 8. Since there are NaN values in Tox21 and ToxCast, the positive ratio and the negative ratio do not sum to 1 in these two datasets.
Figure 7 (excerpt): minimum, maximum, and mean label values, three groups per row.
      -3.00e-3  7.11e-1  1.31e-1 | -4.50e-7  5.86e-1  1.28e-1 | -2.43e-5  5.41e-1  1.32e-1
QM9   -4.29e-1  6.22e-1  7.00e-3 | -3.15e-1  3.75e-1  9.00e-3 | -3.67e-1  3.88e-1  6.00e-3
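The effect of missing labels on the reported class ratios can be illustrated with a short sketch. It assumes binary labels encoded as 1 and 0 and missing labels encoded as floating-point NaN; the name `label_ratios` is hypothetical.

```python
import math
from typing import List, Tuple


def label_ratios(labels: List[float]) -> Tuple[float, float, float]:
    """Return the fractions of positive, negative, and missing labels."""
    n = len(labels)
    pos = sum(1 for y in labels if y == 1)
    neg = sum(1 for y in labels if y == 0)
    # NaN never compares equal to 0 or 1, so it is counted separately.
    missing = sum(1 for y in labels if isinstance(y, float) and math.isnan(y))
    return pos / n, neg / n, missing / n
```

Whenever the missing fraction is non-zero, the positive and negative ratios add up to less than 1, which matches the behavior noted for Tox21 and ToxCast.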

C.3 Pre-training and Downstream Fine-tuning
In the pre-training stage, we use distributed training with 8 GPUs and a batch size of 512 per GPU. We set the dropout rate to 0.2, the vocabulary size for discretizing atomic distances to 30, and the mask ratio to 0.15. The Adam optimizer [26] with a learning rate of 0.005 is used, and we train for 20 epochs for each pre-training method.
In the downstream fine-tuning stage, we use a single GPU to train on each dataset. Different batch sizes are selected for different datasets: a batch size of 256 for the large datasets QM8 and QM9; 128 for the medium-size datasets Tox21 and ToxCast; and 32 for all the other datasets. We use the Adam optimizer and train each model for 100 epochs. As the downstream tasks are sensitive to hyper-parameters, we apply a grid search over the dropout rate and the learning rate. For the dropout rate, we search {0.1, 0.2, 0.5}. For the learning rate, we consider the GeoGNN body and the downstream headers separately and search over the body-header learning rate pairs {(0.001, 0.001), (0.004, 0.004), (0.0001, 0.001)}.
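The fine-tuning search space described above is small enough to enumerate explicitly. The sketch below only builds the list of configurations; the dictionary keys, the lowercase dataset names, and the function names are illustrative assumptions, while the numeric values are those given in the text.

```python
from itertools import product
from typing import Dict, List

DROPOUT_RATES = [0.1, 0.2, 0.5]
# (body learning rate, header learning rate) pairs.
LR_PAIRS = [(0.001, 0.001), (0.004, 0.004), (0.0001, 0.001)]


def batch_size_for(dataset: str) -> int:
    """Batch size per dataset, following the sizes stated in the text."""
    name = dataset.lower()
    if name in ("qm8", "qm9"):
        return 256
    if name in ("tox21", "toxcast"):
        return 128
    return 32


def fine_tuning_grid(dataset: str) -> List[Dict[str, float]]:
    """Enumerate every dropout / learning-rate combination for a dataset."""
    return [
        {
            "dropout": dropout,
            "body_lr": body_lr,
            "head_lr": head_lr,
            "batch_size": batch_size_for(dataset),
            "epochs": 100,
        }
        for dropout, (body_lr, head_lr) in product(DROPOUT_RATES, LR_PAIRS)
    ]
```

This gives 3 × 3 = 9 configurations per dataset, each of which would be trained and compared on the validation set.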