Introduction

The goal of drug discovery is to find novel molecules with desired properties, and accurate prediction of molecular properties remains one of its critical problems. The key step in molecular property prediction is representation: mapping the information in a molecule to a feature vector. Conventional methods build these representations from expert-crafted physicochemical descriptors1, molecular fingerprints2, or quantitative structure-activity relationship (QSAR) methods3,4.

In recent decades, deep learning methods have shown strong potential to compete with or even outperform conventional approaches. Graph neural networks (GNNs) have become increasingly popular because of their capability to model graph-structured data. For association prediction on biological network data, heterogeneous graph neural network algorithms5,6,7 have achieved remarkable results. Molecules can be naturally expressed as graphs, so GNNs can effectively capture molecular structure information, including nodes (atoms) and edges (bonds)8. Compared with conventional methods, deep learning methods can take SMILES strings or molecular graphs as input, which are more informative and lead to significant improvements in downstream tasks such as molecular property prediction. Graph-based molecular property prediction models treat molecules as graphs and use GNNs to learn representations that capture topological structure information from atoms and bonds, which makes them an important research area for this task. The most representative GNNs, including GCN9,10,11,12,13,14, GAT15,16,17, and MPNN18,19,20, have been actively used for molecular-graph-based property prediction. However, these models ignore the information in fragments that contain functional groups. Recently, the works of Zhang et al.17,21 have begun to use molecular fragment information to predict molecular properties.

Although incorporating fragment information into graph architectures has attracted research attention in recent years and benefits some molecular property estimation tasks, two issues still impede the use of GNNs in this field: (1) existing models do not provide a global chemical perspective that integrates atom and fragment information, and they ignore the reaction information between fragments; (2) they generalize poorly across the different types and feature dimensions of atoms, fragments, and bonds. To address these two issues, more comprehensive information from different levels needs to be embedded, and there is still a demand for a heterogeneous GNN model for molecular property prediction.

In this study, we propose a Pharmacophoric-constrained Heterogeneous Graph Transformer model (PharmHGT) to comprehensively learn different views of heterogeneous molecular graph features and boost the performance of molecular property prediction (the code is available on GitHub: https://github.com/mindrank-ai/PharmHGT). First, we use the reaction information of BRICS to divide the molecule into fragments that contain functional groups, and we retain the reaction information between these fragments to construct a heterogeneous molecular graph containing two types of nodes and three types of edges (Fig. 1). Then, to jointly consider the multi-view and multi-scale graph representations of molecules and the reaction information connecting the fragments, we propose a novel heterogeneous graph transformer model based on message passing. Specifically, we use two transformer variants to learn the features of edges and nodes in heterogeneous graphs, respectively, and aggregate and update these features through message passing to obtain the representation of the heterogeneous molecular graph. Extensive experiments show that the model outperforms advanced baselines on multiple benchmark datasets. Further ablation experiments also show the effectiveness of learning molecular representations from different perspectives. Our contributions can be summarized as follows:

  • We obtain the pharmacophore information from the compound reaction and retain the reaction information between the fragments. On this basis, a heterogeneous molecular graph representation method is constructed.

  • We develop a heterogeneous graph transformer framework that efficiently captures the information of different node and edge types, including the reaction information between fragments, by fusing the multi-view information of heterogeneous molecular graphs.

  • We evaluate PharmHGT on nine public datasets and demonstrate its superiority over state-of-the-art methods.

Fig. 1: An example of the overview of the molecular segmentation process and the construction of the heterogeneous molecular graph.

In the heterogeneous molecular graph at the bottom, the green nodes represent fragments with pharmacophore information, and the blue nodes represent the atoms of the molecule. The green edges are the reaction information between fragments, the red dotted line edges are the related information of the atoms that connect the fragments, and the edges between atoms are bonds.

Results and discussion

In this section, we review related work in this field and present the proposed PharmHGT model. We also present the results of PharmHGT for molecular property prediction on nine datasets from Wu et al.22, comprising six classification and three regression tasks. More details on data processing can be found in the Supplementary Note.

Related work

A graph with only one kind of node and one kind of connection between nodes is called a homogeneous graph; otherwise, it is a heterogeneous graph. Currently, most molecular graphs are homogeneous, and using heterogeneous graphs to learn molecular representations remains largely unexplored. In this section, we review prior homogeneous graph-based molecular representation methods and heterogeneous graph embedding. We focus on the homogeneous graph models that are relevant to our model and that serve as baselines for comparison with PharmHGT.
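The distinction between the two graph kinds can be stated as a one-line check. The sketch below is only illustrative; the function name and the type labels ("atom", "fragment", "bond", "junction", "reaction") are hypothetical and chosen for readability.

```python
def is_homogeneous(node_types, edge_types):
    """Return True when the graph has exactly one node type and one edge type."""
    return len(set(node_types)) == 1 and len(set(edge_types)) == 1


# Atom-only molecular graph: every node is an atom, every edge is a bond.
atom_graph = is_homogeneous(["atom", "atom", "atom"], ["bond", "bond"])

# Graph mixing atoms with fragments and bonds with reaction/junction edges.
mixed_graph = is_homogeneous(
    ["atom", "atom", "fragment"], ["bond", "junction", "reaction"]
)
```

Under this definition a conventional atom-bond molecular graph is homogeneous, while a graph with two node types and three edge types is heterogeneous.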

Fragment-based homogeneous graph-based molecular representation

It has been demonstrated that many characteristics of molecules are related to molecular substructures that contain functional group information. Zhang et al.17 obtained two fragments by breaking the acyclic single bonds in a molecule and proposed a fragment-oriented multi-scale graph attention network for molecular property prediction (FraGAT), which first introduced the definition of molecular graph fragments that may contain functional groups. However, FraGAT directly adopted Attentive FP15 to obtain molecular graph embeddings, and the fragments obtained by this method are coarse because a single molecule contains multiple substructures. Zhang et al.21 proposed the Motif-based Graph Self-supervised Learning (MGSSL) model, which designed a molecule fragmentation method that leverages the retrosynthesis-based algorithm BRICS with additional rules for controlling the size of the motif vocabulary, and used GNNs to capture the rich structural and semantic information in graph motifs. However, this work considered neither the reaction information between the substructures obtained through BRICS nor an effective combination of atom and substructure information. Nevertheless, these works show that considering more information from molecular substructures with functional groups can provide a more informative representation that significantly improves the performance of downstream tasks, although how to embed this information, and which kinds to embed, needs more exploratory work.

Message passing neural networks

Gilmer et al.18 proposed Message Passing Neural Networks (MPNNs), the first general framework for supervised learning on graphs, which can effectively predict the quantum mechanical properties of small organic molecules. The MPNN framework learns representations directly from molecular graphs, and many subsequent models build on it. Yang et al.23 introduced a graph convolutional model, called Directed MPNN (D-MPNN), which uses messages associated with directed bonds to learn molecular representations. Song et al.19 proposed a directed graph-based Communicative Message Passing Neural Network (CMPNN) that jointly considers the information of atoms and bonds to improve molecular property prediction. However, these MPNNs ignore chemical reaction information, which chemical and pharmaceutical knowledge indicates is vital for molecular properties.

Transformer architecture

The Transformer architecture eschews recurrence and convolutions entirely and is instead based solely on the attention mechanism24. Ying et al.25 explored several simple structural encodings of graphs and showed mathematically that many popular GNN variants are special cases of graph transformers. In molecular representation learning, Rong et al.8 proposed Graph Representation frOm self-superVised mEssage passing tRansformer (GROVER), which can learn rich structural and semantic information of molecules from a large amount of unlabeled molecular data. Chen et al.26 proposed the Communicative Message Passing Transformer (CoMPT), which reinforces message interactions between nodes and edges based on the Transformer architecture.

Heterogeneous graph-based molecular representation

In the field of recommendation systems, heterogeneous graph models are popular for mining scenarios with nodes and/or edges of various types27,28,29. Heterogeneous graphs are notoriously difficult to mine because of the complex combination of heterogeneous contents and structures. Current representation learning for molecules is still at the level of homogeneous graphs, but in addition to the basic atom-based molecular graph representation, some fragment-based and motif-based representation schemes have been proposed. If these schemes can be combined into a comprehensive heterogeneous molecular graph representation, it should better capture the characteristics of molecules and potentially improve the performance of downstream tasks. In this paper, we propose a new method for constructing heterogeneous molecular graphs and a heterogeneous graph transformer model that can efficiently learn molecular representations.

Overview of PharmHGT architecture

The key idea of PharmHGT is to additionally capture pharmacophoric structure and chemical information features from the heterogeneous molecular graph. In general, a heterogeneous graph carries node and edge attributes, and different node and edge types have feature vectors of unequal dimensions. As shown in Fig. 2, PharmHGT consists of three major modules: multi-view molecular graph construction, aggregation of node and edge information by a heterogeneous graph transformer, and an attention mechanism that integrates multi-view molecular graph features for molecular property prediction.

Fig. 2: Illustration of Pharmacophoric-Constrained Heterogeneous Graph Transformer for molecular property prediction (PharmHGT).

First, the heterogeneous molecular graph is formalized as feature matrices. Then, message passing is performed independently on the feature matrix of each view to obtain the graph feature matrix of that view. Finally, the junction_view feature matrix is aggregated with the pharm_view feature matrix by attention, and the resulting matrix is aggregated with the atom_view feature matrix by attention to obtain the features of each node; these node features are input as a sequence into the GRU to obtain the representation vector of the entire small molecule.

Experiments

Datasets

To compare and demonstrate the effectiveness of PharmHGT, we select nine benchmark molecular datasets for experiments: blood-brain barrier permeability (BBBP), BACE, ClinTox, Tox21, SIDER, and HIV for classification tasks, and ESOL, FreeSolv, and Lipophilicity for regression tasks. Below, we briefly introduce these datasets.

  • Classification tasks. The BBBP dataset has 2035 molecules with binary permeability labels and is often used to predict the ability of molecules to penetrate the blood-brain barrier. The BACE dataset has 1513 molecules and provides quantitative (IC50) and qualitative (binary label) binding results for a set of inhibitors of human β-secretase 1 (BACE-1). The ClinTox dataset has 1468 approved drug molecules and a list of molecules that failed due to toxicity during clinical trials. The Tox21 dataset has 7821 molecules for 12 different targets relevant to drug toxicity and was originally used in the Tox21 data challenge. The SIDER dataset has 1379 approved drug molecules and their side effects, which are divided into 27 system organ classes. The HIV dataset has 41127 molecules tested for their ability to inhibit HIV replication.

  • Regression tasks. The ESOL dataset records the solubility of 1128 compounds. The FreeSolv dataset includes a total of 642 molecules selected from the Free Solvation Database. The Lipophilicity dataset provides experimental octanol/water distribution coefficients (logD at pH 7.4) for 4198 compounds.

Implementation details

Following previous works, we report the results of each experiment with 5-fold cross-validation and repeat training five times to increase the credibility of our results. All benchmark datasets are split into training, validation, and test sets with a ratio of 0.8/0.1/0.1, and all models are evaluated on random or scaffold-based splits as recommended by Yang et al.23. The node and edge features are computed with the open-source package RDKit, and the details are given in the Supplemental Experimental Procedures (Tables S1-S5).
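A scaffold-based 0.8/0.1/0.1 split can be sketched as follows. This is a minimal illustration, not the splitting code used in the experiments: it assumes the scaffold of each molecule has already been computed (the `scaffolds` list of string keys is a hypothetical input), and it uses the common greedy strategy of assigning the largest scaffold groups to the training set first.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split molecule indices so that no scaffold is shared across subsets.

    `scaffolds` holds one precomputed scaffold key per molecule (hypothetical
    input; in practice these could be Bemis-Murcko scaffold SMILES).
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Place the largest scaffold groups first; ties broken by first index.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n_total = len(scaffolds)
    train_cap = frac_train * n_total
    valid_cap = (frac_train + frac_valid) * n_total

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Toy example: ten molecules sharing four made-up scaffold keys.
scaffold_keys = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
train_idx, valid_idx, test_idx = scaffold_split(scaffold_keys)
```

Because whole scaffold groups are assigned together, the test set contains only scaffolds unseen during training, which is why scaffold splits are generally harder than random splits.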

Baselines

In this study, we compare our model with eight baseline methods of three types.

  • Fragment-based methods: AttentiveFP15 is a graph neural network architecture that uses a graph attention mechanism to learn from relevant drug discovery datasets; FraGAT17 uses a fragment-oriented multi-scale graph attention network for molecular property prediction; MGSSL21 performs motif-based graph self-supervised learning by introducing a novel self-supervised motif generation framework for GNNs.

  • MPNN baselines: MPNN18 abstracts the commonalities between several of the most promising existing neural models into a single common framework and focuses on obtaining effective vertex (atom) embeddings through a message passing module and a message updating module; DMPNN23 uses messages associated with directed bonds rather than with vertices; CMPNN19 introduces a message booster module to enrich the message generation process.

  • Graph transformer baselines: CoMPT26, built on a Transformer architecture, learns a more attentive molecular representation by reinforcing the message interactions between nodes and edges; GROVER8 stands for Graph Representation frOm self-superVised mEssage passing tRansformer and learns rich structural and semantic information of molecules from enormous unlabeled molecular data through carefully designed self-supervised tasks at the node, edge, and graph levels. In addition, the Graphormer model is also based on a graph transformer, but Graphormer is a 3D model that requires the 3D conformation of each small molecule. Our model and Graphormer therefore differ in target tasks and inputs. For better comparability, we have not added Graphormer to the benchmark, but we give its computational results in the Supplemental Experimental Procedures (Table S7 and Table S8).

Performance comparison

Performance in classification tasks

Table 1 presents the area under the receiver operating characteristic curve (ROC-AUC) results of eight baseline models on six classification datasets. ClinTox, Tox21, ToxCast, and SIDER are all multi-task learning tasks, including a total of 658 classification tasks. Compared with traditional baselines and several GNN-based models, PharmHGT achieved large increases in ROC-AUC on all datasets (the ROC curve plots are given in Fig. S1 and Fig. S2). PharmHGT is designed to be more attentive to pharmacophores, which makes the model more explainable. Notably, PharmHGT outperformed the pre-training methods at lower computational cost. We also compare computing resource usage with the state-of-the-art methods on the ESOL dataset (Table S6).
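For reference, ROC-AUC can be computed directly from its ranking interpretation: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below is a plain-Python illustration; averaging the per-task values for multi-task datasets is an assumption about the reporting convention, and `mean_task_auc` is a hypothetical helper name.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a positive outranks a negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def mean_task_auc(task_labels, task_scores):
    """Average ROC-AUC over the tasks of a multi-task dataset (assumed convention)."""
    aucs = [roc_auc(y, s) for y, s in zip(task_labels, task_scores)]
    return sum(aucs) / len(aucs)


# Toy scores: three of the four positive/negative pairs are ranked correctly.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of samples, which is acceptable for illustration; rank-based implementations are preferable for datasets as large as HIV.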

Table 1 Overall performance comparison to the state-of-the-art methods on molecular property prediction classification tasks.

Performance in regression tasks

Solubility and lipophilicity are basic physicochemical properties that are vital for explaining how molecules interact with solvents and cell membranes. Table 2 compares the results of PharmHGT with those of other state-of-the-art models. The best-case RMSEs of PharmHGT on ESOL, FreeSolv, and Lipophilicity are 0.680 ± 0.137, 1.266 ± 0.239, and 0.583 ± 0.063 with random splits, and 0.839 ± 0.049, 1.689 ± 0.516, and 0.638 ± 0.040 with scaffold splits. These results indicate that molecular graph representations containing more information can significantly improve model performance on downstream tasks.
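Values of the form "mean ± deviation" can be produced from repeated runs as in the sketch below. The per-run numbers are made up for illustration and are not results from the experiments, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root-mean-square error between targets and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def summarize(run_scores):
    """Mean and sample standard deviation over repeated runs."""
    return statistics.mean(run_scores), statistics.stdev(run_scores)


# Hypothetical per-run RMSE values, not results from the paper.
runs = [0.60, 0.70, 0.80]
mean_rmse, std_rmse = summarize(runs)
report = f"{mean_rmse:.3f} ± {std_rmse:.3f}"
```

With only a handful of runs the deviation estimate is itself noisy, so differences between models that fall within one deviation should be read with care.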

Table 2 Overall performance comparison to the state-of-the-art methods on molecular property prediction regression tasks.

Ablation study

We conducted ablation studies on PharmHGT to explore the effect of the atom-level, pharm-level, and junction-level views. Under the same experimental setup, we implemented seven simplified variants of PharmHGT on two benchmarks:

  • (1) PharmHGT_α: by only retaining the atom-level graph.

  • (2) PharmHGT_β: by only retaining the pharm-level graph with reaction information.

  • (3) PharmHGT_γ: by only retaining the junction-level graph.

  • (4) PharmHGT_βα: by aggregating features of the pharm-level graph with reaction information to the atom-level graph.

  • (5) PharmHGT_γα: by aggregating features of the junction-level graph to the atom-level graph.

  • (6) PharmHGT_βγ: by aggregating features of the pharm-level with reaction information to the junction-level graph.

  • (7) PharmHGT_γαβ: by aggregating features of the junction-level graph with the atom-level graph, then to the pharm-level graph.

As shown in Fig. 3, PharmHGT, which considers heterogeneous feature information from all views, shows the best performance among all architectures. Excluding the atom-level, pharm-level, or junction-level view decreases performance in every case, and PharmHGT_β, which retains only the pharm-level graph with reaction information, performs worst. This indicates that a molecule cannot be represented effectively without atom-level information. Among the models with one or two views, PharmHGT_γα, which aggregates the junction-level graph into the atom-level graph, performs best. This suggests that integrating feature information from molecular fragments can improve prediction performance. The results of the full PharmHGT further demonstrate that additionally integrating reaction information yields the most effective molecular representation.

Fig. 3: Ablation results on BBBP and ESOL datasets.

“X” represents PharmHGT, and “X_” denotes the PharmHGT variants that aggregate atom-level, junction-level, and pharm-level features.

Representation visualization

To investigate the representation learning ability of PharmHGT, we used t-distributed Stochastic Neighbor Embedding (t-SNE) with default hyper-parameters to visualize molecular representations of the Tox21 dataset in Fig. 4. For this analysis, we define molecules with a label of 0 as non-toxic compounds and molecules with a label of 1 as toxic compounds; molecules with similar toxicity are expected to lie close together in feature space. We therefore visualize the embeddings with t-SNE and judge whether a model learns effective molecular representations by whether toxic and non-toxic molecules are separated by a clear boundary. DMPNN, which has the second-best performance on the Tox21 task, achieves a reasonable distinction between toxic and non-toxic molecules (Fig. 4a); however, PharmHGT shows a more visible boundary between toxic and non-toxic compounds (Fig. 4c). In addition, the single-view variant (Fig. 4b) is far inferior to the multi-view PharmHGT (Fig. 4c), which supports the necessity of considering multi-view molecular information.

Fig. 4: Visualization of molecular features.

Visualization of molecular features for Tox21 from a DMPNN, b PharmHGT_α, and c PharmHGT with t-SNE. Molecules with a label of 0 are treated as non-toxic compounds and molecules with a label of 1 as toxic compounds; toxic compounds are colored red and non-toxic ones blue.

Case study

A pharmacophore is a molecular framework that defines the components responsible for specific properties. Accordingly, identifying the pharmacophore structure information associated with the target property and adding it to the model is vital for molecular representation. To illustrate the pharmacophore structure learning ability of PharmHGT, we visualize molecular features on the ClinTox dataset and select six molecules that were toxic in clinical trials, several of which have been used clinically as chemotherapeutic drugs. The toxicities of these six molecules are highly correlated with the pharmacophores they contain (i.e., specific substructures). Figure 5c shows that PharmHGT aggregates molecules with similar toxic pharmacophores together and distinguishes them from non-toxic samples. Without the pharm-level view, PharmHGT_α does not aggregate molecules with similar toxic pharmacophores well and discriminates them only weakly from negative samples (Fig. 5b). The pretrained model GROVER, which achieves the second-best performance on the ClinTox task, aggregates only a few molecules with similar toxic pharmacophores (Fig. 5a), and its discrimination of non-toxic samples is far weaker than that of PharmHGT. This indicates that the embedded representations learned by PharmHGT capture functional group structural information more effectively.

Fig. 5: Case study.

Case study by t-SNE visualization of molecular features on the ClinTox dataset from a GROVER, b PharmHGT_α, and c PharmHGT. Toxic molecules are colored gray and non-toxic molecules red; blue marks the six molecules selected for the case study, which showed toxicity in clinical trials and remained toxic after marketing.

Conclusions

In this paper, we propose PharmHGT, a pharmacophoric-constrained heterogeneous graph transformer model for molecular property prediction. We use the reaction information of BRICS to decompose molecules into several fragments and construct a heterogeneous molecular graph. Furthermore, we develop a heterogeneous graph transformer model to capture global information from the multiple views of heterogeneous molecular graphs. Extensive experiments demonstrate that PharmHGT achieves state-of-the-art performance on molecular property prediction. The ablation study and case study also demonstrate the effectiveness of using pharmacophore group information and heterogeneous molecular graph information.

Methods

Notation and problem definition

We use BRICS30 to decompose molecules into several fragments with pharmacophores and retain the reaction information between fragments to construct a heterogeneous molecular graph. The heterogeneous molecular graph is denoted \(G=\{V,E\}\) and is associated with a node type mapping function \(\varphi :V\to \mathcal{O}\) and an edge type mapping function \(\psi :E\to \mathcal{P}\), where \(\mathcal{O}\) and \(\mathcal{P}\) represent the set of all node types and the set of all edge types, respectively. We treat molecular structures as heterogeneous graphs to capture the chemical information from functional substructures and chemical reactions. We propose three views of molecular graph representation, namely the atom-level view, the pharm-level view containing pharmacophore and reaction information, and the junction-level view, to comprehensively represent a molecule (Fig. 1). The specific definitions are as follows:

Definition 1

(Atom-level view.) An atom-level view can be denoted as a graph \(G^{\alpha}=(V^{\alpha},E^{\alpha})\). For each atom \(v_{i}^{\alpha}\) we have \(v_{i}^{\alpha}\in V^{\alpha}\), where \(1\le i\le N^{\alpha}\) and \(N^{\alpha}\) is the total number of atoms, while for each bond \(e_{ij}^{\alpha}\) we have \(e_{ij}^{\alpha}\in E^{\alpha}\), where \(1\le i,j\le N^{\alpha}\). For featurization, \(V^{\alpha}\) is represented as \(X_{v}^{\alpha}\in \mathbb{R}^{N^{\alpha}\times D_{v}^{\alpha}}\), where \(D_{v}^{\alpha}\) is the dimension of atom features, and \(E^{\alpha}\) is represented as \(X_{e}^{\alpha}\in \mathbb{R}^{M^{\alpha}\times D_{e}^{\alpha}}\), where \(M^{\alpha}\) is the total number of directed bonds and \(D_{e}^{\alpha}\) is the dimension of bond features.

Definition 2

(Pharm-level view.) A pharm-level view can be denoted as a graph \(G^{\beta}=(V^{\beta},E^{\beta})\). For each pharmacophore \(v_{i}^{\beta}\) we have \(v_{i}^{\beta}\in V^{\beta}\), where \(1\le i\le N^{\beta}\) and \(N^{\beta}\) is the total number of pharmacophores, while for each BRICS reaction type \(e_{ij}^{\beta}\) we have \(e_{ij}^{\beta}\in E^{\beta}\), where \(1\le i,j\le N^{\beta}\). For featurization, \(V^{\beta}\) is represented as \(X_{v}^{\beta}\in \mathbb{R}^{N^{\beta}\times D_{v}^{\beta}}\), where \(D_{v}^{\beta}\) is the dimension of pharmacophore features, and \(E^{\beta}\) is represented as \(X_{e}^{\beta}\in \mathbb{R}^{M^{\beta}\times D_{e}^{\beta}}\), where \(M^{\beta}\) is the total number of BRICS reaction types and \(D_{e}^{\beta}\) is the dimension of BRICS reaction type features.

Definition 3

(Junction-level view.) A junction-level view can be denoted as a graph \(G^{\gamma}=(V^{\gamma},E^{\gamma})\). For each node \(v_{i}^{\gamma}\) we have \(v_{i}^{\gamma}\in V^{\gamma}\), where \(1\le i\le N^{\gamma}\) and \(N^{\gamma}\) is the total number of atoms and pharmacophores, while for each edge \(e_{ij}^{\gamma}\) we have \(e_{ij}^{\gamma}\in E^{\gamma}\), where \(1\le i,j\le N^{\gamma}\). For featurization, \(V^{\gamma}\) is represented as \(X_{v}^{\gamma}\in \mathbb{R}^{N^{\gamma}\times D_{v}^{\gamma}}\), where \(D_{v}^{\gamma}\) is the dimension of pharmacophore features, and \(E^{\gamma}\) is represented as \(X_{e}^{\gamma}\in \mathbb{R}^{M^{\gamma}\times D_{e}^{\gamma}}\), where \(M^{\gamma}\) is the total number of junction relationships between atoms and pharmacophores and \(D_{e}^{\gamma}\) is the dimension of junction relationship information.

An example of the heterogeneous molecular graph and its multi-view is illustrated in Fig. 1, which contains 2 node types and 3 edge types. Given the above definitions, our main task is to learn representations of heterogeneous molecular graphs.
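The three views in Definitions 1 to 3 can be illustrated with a small data-structure sketch. It is not the authors' construction code: the set of cleaved bonds is supplied by hand rather than by BRICS, the reaction label is a made-up placeholder, and the reading of junction edges as links between each atom and the fragment containing it is an assumption.

```python
def build_heterogeneous_graph(num_atoms, bonds, cut_bonds):
    """Build atom-, pharm-, and junction-level edge lists for one molecule.

    `cut_bonds` maps a cleaved bond (i, j) to a reaction label. In the paper
    the cleavage comes from BRICS; here it is supplied by hand (assumption).
    """
    # Union-find over atoms, merging the ends of every bond that is kept.
    parent = list(range(num_atoms))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i, j in bonds:
        if (i, j) not in cut_bonds and (j, i) not in cut_bonds:
            parent[find(i)] = find(j)

    # Number fragments in order of their lowest atom index.
    fragment_of = {}
    roots = {}
    for atom in range(num_atoms):
        fragment_of[atom] = roots.setdefault(find(atom), len(roots))

    # Edge type 1: directed atom-atom bonds (both directions).
    atom_edges = [(i, j) for i, j in bonds] + [(j, i) for i, j in bonds]
    # Edge type 2: fragment-fragment edges labelled with the reaction type.
    pharm_edges = [
        (fragment_of[i], fragment_of[j], label)
        for (i, j), label in cut_bonds.items()
    ]
    # Edge type 3: junction edges linking each atom to its fragment
    # (one plausible reading of the junction-level view, not confirmed).
    junction_edges = [(atom, fragment_of[atom]) for atom in range(num_atoms)]
    return {
        "fragment_of": fragment_of,
        "atom_edges": atom_edges,
        "pharm_edges": pharm_edges,
        "junction_edges": junction_edges,
    }


# Toy 5-atom chain 0-1-2-3-4 with the 1-2 bond cleaved; the label "L1-L2"
# is a made-up placeholder, not a real BRICS rule.
graph = build_heterogeneous_graph(
    num_atoms=5,
    bonds=[(0, 1), (1, 2), (2, 3), (3, 4)],
    cut_bonds={(1, 2): "L1-L2"},
)
```

In this toy case the chain splits into two fragments, giving two node types (atoms and fragments) and three edge types (bonds, reaction edges, and junction edges).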

Overview of PharmHGT

The key idea of PharmHGT is to additionally capture pharmacophoric structure and chemical information features from heterogeneous molecular graphs. In general, a heterogeneous graph carries node and edge attributes, and different node and edge types have feature vectors of unequal dimensions. The framework consists of three parts: multi-view molecular graph construction (Fig. 1), aggregation of node and edge information by a heterogeneous graph transformer, and an attention mechanism that integrates multi-view molecular graph features for molecular property prediction (Fig. 2).

Obtaining the embedding of nodes and edges

The inputs of PharmHGT are the node feature matrix \(X_{V}\) and the edge feature matrix \(X_{E}\). The features of all nodes are obtained according to the attention strength between each node and its related edges, and the multi-head self-attention mechanism enhances the signal of the nodes in each view. Specifically, the basic block of PharmHGT is the usual attention module:

$$[Q,K,V]=h(X)[{W}^{Q},{W}^{K},{W}^{V}]$$
(1)

where \(h(X)\) denotes the hidden features and \(W^{Q}\), \(W^{K}\), \(W^{V}\) are the projection matrices. The standard attention module is dot-product self-attention, in which Q, K, and V are assumed to lie in the same semantic vector space; this assumption does not suit heterogeneous graphs. Therefore, we build a multi-view attention function to gather information from different views, formulated as:

$$\mathrm{Attention}(Q,K,V)=\sum\limits_{p\in P}\Omega^{p}\,\sigma\left(\frac{Q^{p}K^{pT}}{\sqrt{d_{k}}}\right)V^{p}$$
(2)

where \(\sigma\) is the activation function, \(P\) is the set of view types, \(p\in P\) is a view, and \(\Omega^{p}\) is a learnable view-type weight matrix. \(K^{pT}\) is the transpose of the key matrix of view \(p\), and \(d_{k}\) is the dimension of the query and key vectors. Our model assumes that the components \(q_{i}\) and \(k_{i}\) are independent with mean 0 and variance 1. In the more general case where \(q_{i}\) and \(k_{i}\) have mean 0 and variance \(s^{2}\) (written with \(s\) to avoid confusion with the activation function \(\sigma\)), we have \(D(q_{i}k_{i}^{T})=s^{4}\) and \(D(QK^{T})=d_{k}s^{4}\), where \(D(\cdot)\) denotes the variance. Dividing by \(\sqrt{d_{k}}\) therefore ensures that \(D(QK^{T}/\sqrt{d_{k}})=D(q_{i}k_{i}^{T})\), so that the softmax is not affected by the dimension of the vectors. Furthermore, after adding multi-head attention structures, the embedding matrix can be formulated as:

$$\left\{\begin{array}{l}\mathrm{head}_{i}=\mathrm{Attention}(Q_{i},K_{i},V_{i})\\ \mathrm{Head}_{i}=\mathrm{Concat}(\mathrm{head}_{1},\mathrm{head}_{2},\ldots ,\mathrm{head}_{n})W^{o}\end{array}\right.$$
(3)

where \(W^{o}\) is the output weight matrix of the heads. We can therefore obtain the hidden feature embedding matrices of nodes and edges:

$$\left\{\begin{array}{l}H(X_{V})=\mathrm{Concat}(h_{1}(X_{V}),h_{2}(X_{V}),\ldots ,h_{n}(X_{V}))W_{V}^{o}\\ H(X_{E})=\mathrm{Concat}(h_{1}(X_{E}),h_{2}(X_{E}),\ldots ,h_{n}(X_{E}))W_{E}^{o}\end{array}\right.$$
(4)
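The scaled multi-view attention of Eq. (2) and the variance argument behind the \(1/\sqrt{d_{k}}\) factor can be sketched in plain Python. Several simplifications are assumptions of this sketch rather than details of PharmHGT: the activation \(\sigma\) is taken to be a row-wise softmax, each learnable matrix \(\Omega^{p}\) is replaced by a scalar weight, all views are assumed to produce outputs of the same shape, and the function names are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for matrices stored as lists of rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


def multi_view_attention(views, omega):
    """Sum per-view attention outputs, scaled by a scalar weight per view.

    Scalar weights stand in for the learnable matrices Omega^p (assumption).
    """
    total = None
    for name, (Q, K, V) in views.items():
        part = [[omega[name] * x for x in row] for row in attention(Q, K, V)]
        total = part if total is None else [
            [a + b for a, b in zip(r1, r2)] for r1, r2 in zip(total, part)
        ]
    return total


def dot_product_variance(d_k, sigma, scaled, samples=20000, seed=0):
    """Empirical variance of q.k (optionally divided by sqrt(d_k))."""
    rng = random.Random(seed)
    values = []
    for _ in range(samples):
        dot = sum(rng.gauss(0, sigma) * rng.gauss(0, sigma) for _ in range(d_k))
        values.append(dot / math.sqrt(d_k) if scaled else dot)
    mean = sum(values) / samples
    return sum((v - mean) ** 2 for v in values) / samples


# Components with standard deviation 1.5 have product variance 1.5**4.
raw = dot_product_variance(d_k=16, sigma=1.5, scaled=False)
fixed = dot_product_variance(d_k=16, sigma=1.5, scaled=True)
```

Under these assumptions `raw` should be close to \(d_{k}\) times the fourth power of the component standard deviation, while `fixed` should be close to that fourth power alone, which is the dimension independence that the scaling is meant to provide.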

Aggregation of node and edge information

For each molecular graph view, we use the graph transformer to obtain the features of all nodes and edges. The node features are \(X_{v_{i}}\), \(v_{i}\in V\), and the edge features are \(X_{e_{ij}}\), \(e_{ij}\in E\). PharmHGT operates interactively on the edge hidden states \(H(X_{e_{ij}})\), the node hidden states \(H(X_{v_{i}})\), and the messages \(M_{V}(X_{v_{i}})\) and \(M_{E}(X_{e_{ij}})\). To learn different knowledge from the multi-view snapshots, we build a view-attention message passing strategy based on multi-head attention structures. The node and edge features are propagated at each iteration, where \(t\) denotes the current depth of message passing. Each step proceeds as follows:

$$M_{V}^{1}(X_{v_{i}})=\sum\limits_{\Theta_{\mathcal{N}}(v_{i})}H(X_{\Theta_{\mathcal{N}}(v_{i})}),\quad t=1$$
(5)
$$M_{E}^{1}(X_{e_{ij}})=H(X_{v_{i}}),\quad t=1$$
(6)
$$M_{V}^{t}(X_{v_{i}})=\sum\limits_{\Theta_{\mathcal{N}}(v_{i})}\mathrm{Attention}\big(H^{t-1}(X_{v_{i}})W_{v_{i}}^{Q},\,M^{t-1}(X_{\Theta_{\mathcal{N}}(v_{i})})W_{m_{i}}^{K},\,H^{t-1}(X_{v_{i}})W_{v_{i}}^{V}\big),\quad t>1$$
(7)
$$M_{E}^{t}(X_{e_{ij}})=\mathrm{Linear}\big(M_{E}^{1}(X_{e_{ij}})+H^{t}(X_{v_{i}})-H^{t-1}(X_{e_{ji}})\big),\quad t>1$$
(8)

where \(\Theta_{\mathcal{N}}(v_{i})\) is the function that finds the edges directed to node \(v_{i}\). To mitigate the vanishing gradient issue, we add a simple residual block to make training more stable during multi-view message passing:

$$\left\{\begin{array}{l}H^{t}(X_{v_{i}})=H^{t-1}(X_{v_{i}})+M_{V}^{t}(X_{v_{i}})\\ H^{t}(X_{e_{ij}})=H^{t-1}(X_{e_{ij}})+M_{E}^{t}(X_{e_{ij}})\end{array}\right.$$
(9)
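The first message passing step, Eqs. (5) and (6), followed by the residual update of Eq. (9), can be traced on a toy directed graph. This sketch covers only the initial step: the attention-based node update of Eq. (7) and the linear edge update of Eq. (8) are omitted. It also assumes that node and edge hidden states share one dimension, and the function names are hypothetical.

```python
def init_messages(node_hidden, edge_hidden):
    """Initial messages of Eqs. (5) and (6) on a directed graph.

    `node_hidden[i]` is H(X_{v_i}); `edge_hidden[(i, j)]` is H(X_{e_ij}) for
    the edge i -> j. Hidden states are plain lists of floats.
    """
    dim = len(next(iter(node_hidden.values())))
    node_msg = {i: [0.0] * dim for i in node_hidden}
    for (src, dst), h in edge_hidden.items():
        # Eq. (5): a node sums the hidden states of edges directed to it.
        node_msg[dst] = [m + x for m, x in zip(node_msg[dst], h)]
    # Eq. (6): an edge i -> j starts from the hidden state of node i.
    edge_msg = {(src, dst): list(node_hidden[src]) for (src, dst) in edge_hidden}
    return node_msg, edge_msg


def residual_update(hidden, message):
    """Eq. (9): add the message to the previous hidden state."""
    return {key: [h + m for h, m in zip(hidden[key], message[key])]
            for key in hidden}


# Two nodes joined by one bond, stored as two directed edges.
node_hidden = {0: [1.0, 0.0], 1: [0.0, 2.0]}
edge_hidden = {(0, 1): [0.5, 0.5], (1, 0): [1.0, 1.0]}

node_msg, edge_msg = init_messages(node_hidden, edge_hidden)
new_node_hidden = residual_update(node_hidden, node_msg)
new_edge_hidden = residual_update(edge_hidden, edge_msg)
```

Because the update only adds the message to the previous state, each hidden state keeps a direct path back to its input, which is the property the residual block relies on for stable training.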

Fusion of multi-view information

For a given molecule, the above steps yield representations for the three views: atom-level, pharm-level, and junction-level. A Gated Recurrent Unit (GRU) is then applied as a view readout operator to obtain the feature vectors \(\{Z_{\alpha},Z_{\beta},Z_{\gamma}\}\) of the molecule, where \(Z_{\alpha}\), \(Z_{\beta}\), and \(Z_{\gamma}\) are the vectors of the atom-level, pharm-level, and junction-level views, respectively.

Then, the features of the three views are aggregated into the final features through another attention layer, yielding the final representation vector of the molecule. The readout attention function is:

$$\mathrm{ReadOutAttention}(X,Y)=\sigma\left(\frac{X\cdot Y^{T}}{\sqrt{d_{k}}}\right)X$$
(10)

Specifically, the pharm-level features contain the reaction information, and we first aggregate them with the junction-level features to capture the associations between pharmacophores and atoms together with the reaction information between pharmacophores. The formula is as follows:

$$Z_{\gamma\beta}=\mathrm{ReadOutAttention}(Z_{\gamma},Z_{\beta})$$
(11)

We then aggregate this information with the atom-level features to obtain the final global feature representation of the molecule (Fig. 2). The attention layer can distinguish the importance of features and adaptively assign more weight to the more important ones.

$$Z=\mathrm{ReadOutAttention}(Z_{\alpha},Z_{\gamma\beta})$$
(12)

Finally, we perform downstream property prediction \(\hat{y}=f(Z)\), where \(f(\cdot)\) is a fully connected layer for classification or regression.
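The two-stage fusion of Eqs. (10) to (12) can be sketched as below. This is an illustration under stated assumptions, not the released implementation: \(\sigma\) is taken to be a row-wise softmax, each view is represented by a matrix stored as a list of rows, and all views are assumed to have the same number of rows so that the product in Eq. (10) is well defined. The function names are hypothetical.

```python
import math


def readout_attention(X, Y):
    """softmax(X Y^T / sqrt(d_k)) X with matrices stored as lists of rows."""
    if len(X) != len(Y):
        raise ValueError("this sketch assumes X and Y have the same number of rows")
    d_k = len(X[0])
    out = []
    for x in X:
        scores = [sum(a * b for a, b in zip(x, y)) / math.sqrt(d_k) for y in Y]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted combination of the rows of X, as in Eq. (10).
        out.append([sum(w * row[c] for w, row in zip(weights, X))
                    for c in range(d_k)])
    return out


def fuse_views(Z_alpha, Z_beta, Z_gamma):
    """Junction and pharm views are fused first, then the atom view."""
    Z_gamma_beta = readout_attention(Z_gamma, Z_beta)   # Eq. (11)
    return readout_attention(Z_alpha, Z_gamma_beta)     # Eq. (12)


# With an all-zero second argument the weights are uniform, so every output
# row is the mean of the rows of the first argument.
uniform = readout_attention([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]])
```

The order of the two calls mirrors the text: reaction-aware pharm-level features are combined with junction-level features before the atom-level features are brought in.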