Decoding the Protein-ligand Interactions Using Parallel Graph Neural Networks

Protein-ligand interactions (PLIs) are fundamental to biochemical research and their identification is crucial for estimating biophysical and biochemical properties for rational therapeutic design. Currently, experimental characterization of these properties is the most accurate method, however, this is very time-consuming and labor-intensive. A number of computational methods have been developed in this context but most of the existing PLI prediction heavily depends on 2D protein sequence data. Here, we present a novel parallel graph neural network (GNN) to integrate knowledge representation and reasoning for PLI prediction to perform deep learning guided by expert knowledge and informed by 3D structural data. We develop two distinct GNN architectures, GNNF is the base implementation that employs distinct featurization to enhance domain-awareness, while GNNP is a novel implementation that can predict with no prior knowledge of the intermolecular interactions. The comprehensive evaluation demonstrated that GNN can successfully capture the binary interactions between ligand and proteins 3D structure with 0.979 test accuracy for GNNF and 0.958 for GNNP for predicting activity of a protein-ligand complex. These models are further adapted for regression tasks to predict experimental binding affinities and pIC50 is crucial for drugs potency and efficacy. We achieve a Pearson correlation coefficient of 0.66 and 0.65 on experimental affinity and 0.50 and 0.51 on pIC50 with GNNF and GNNP, respectively, outperforming similar 2D sequence-based models. Our method can serve as an interpretable and explainable artificial intelligence (AI) tool for predicted activity, potency, and biophysical properties of lead candidates. To this end, we show the utility of GNNP on SARS-Cov-2 protein targets by screening a large compound library and comparing our prediction with the experimentally measured data.


Introduction
Accurate prediction of protein-ligand interactions (PLI) is a critical step in therapeutic design and discovery. These interactions influence various molecular-level properties, such as substrate binding, product release, regio-selectivity, target protein function, and ability to facilitate potential hit identification, which is the first step in finding novel candidates for drug discovery. 1 With increases in computing power, code scalability, and advancement of theoretical methods, physics-based computational tools such as molecular dynamics and molecular/quantum mechanics can be used for the reliable representation of PLI and prediction of accurate binding free energies. [2][3][4] However, these methods are computationally expensive and are limited to a number of protein-ligand complexes. 5 This limits their routine use in high-throughput virtual screening 3,4 for the discovery of novel hit candidates and lead optimization 6 for a given protein target. Molecular docking has been used to predict binding affinity and estimate interactions with reasonable computational cost; 2,7-14 however, its accuracy is relatively low as it uses heuristic rules to evaluate the scoring function.
There has been significant effort to develop deep learning models that predict PLI, 15,16 and other biophysical properties that are critical for therapeutic design but cannot be predicted through physics based modeling. 17 The greater understanding of PLI enabled by deep learning can help in the estimation of properties such as activity, potency and binding affinity. 18 However, several technical challenges limit the use of deep learning for modeling protein-ligand complexes and accurate prediction of properties. The first challenge relates to the limited availability of protein-ligand 3D data and the second challenge focuses on the appropriate representation of the data (domain knowledge), specifically in terms of the comprehensive 3D geometry representation. Structure-based methods have the advantage of producing results that can be interpreted but are limited by the number of available samples. 17 Circular fingerprints, generated by encoding localized structural and geometric information, have long been a cornerstone of cheminformatics. 19 The flexibility of fingerprints has created new avenues for molecular computational research, including the increased im- Graph-based representations extend the learning of chemical data to graph neural networks (GNNs). Monti et al. designed a mixture model network (MoNet) that enables non-Euclidean data, such as graphs, to be learned by CNNs. 25 That approach has been generalized and improved through the formulation of graph attention (GAT) networks. 26 GAT architectures operate on the importance of a given node, leading to improved computational efficiency and accuracy. 27 Lim et al. showed the implementation of such an architecture to capture PLI, which provides a baseline for the development of more robust models. 28 Chen et al. proposed a bidirectional attention-driven, end-to-end GNN to predict PLI and enable biochemical insights through attention weight visualization. 1 Predicting the activity of a protein-ligand complex is a binary classification problem. Reshaping the problem to focus on affinity creates a regression problem of heightened complexity. The existing deep learning models that predict the binding affinity or other biophysical and biochemical properties such as IC 50 , K i , K d , and EC 50 18 17 . 29 However, most of the methods use sequence-based data for proteins and SMILES representations for interacting ligands. For example, DeepDTA 18 and DeepAffinity 17 use SMILES strings of the ligands and amino-acid sequences of the target proteins to predict the affinity. MONN 29 is a multi-objective sequence-based neural network model that first predicts the non-covalent interaction between the ligand and the residues of the interacting target and then the binding affinities in terms of IC 50 , K i , and K d . Such methods are accessible due to the abundant availability of sequence-based data, but do not capture 3D structural information in the PLI and predicting regression properties. Binding is best understood when the 3D pocket of the target is known, and in situ, the protein-ligand complex is formed due to changes in the conformation of the 3D structure of the protein and ligand post-translation.
In this contribution, we formulated two GNNs based on the GAT architecture by incorporating domain-specific featurization of the protein and ligand atoms (GNN F ) and by implementing parallel GAT layers such that GNN P uniquely learns the interaction with limited prior knowledge. The inclusion of different features on the protein and ligand atoms enables our models to be more physics informed. The implementation of GAT layers combined with our featurization enables the model to learn the representation and the chemical space of the training data. We further use these models to predict experimental binding affinity and pIC 50 of the protein-ligand complex. This allows us to leverage the 3D structures of the target protein, ligands, and the interaction between them which is crucial both for the activity and affinity prediction.

Network Architecture
The goal in this work is to define a GNN architecture that predicts characteristics of a protein-ligand pair by learning features of the protein and ligand that may not be obvious to the human observer. Our molecular graph structure is defined as G{V,E,A}, where V is the atomic node set, E is the corresponding edge set, and A is the adjacency matrix.
Given the diverse structural properties of protein-ligand complexes, we include additional biomolecular domain-aware features to previous GAT architectures 26,28 by defining distinct featurizations for the protein and ligand components, as shown in Table 1, denoted GNN F , and by removing the dependency on prior knowledge of the protein-ligand interaction through the implementation of parallel GAT layers, denoted GNN P .
The GNN F and GNN P models differ in the architecture of the attention head as seen in Figure 1, a schematic of the prediction logic implemented in our models. In GNN F , the protein and ligand adjacency matrices are combined into a single matrix, and edges are added between protein and ligand nodes based on the distance matrix obtained from docking simulations. The GNN F attention head uses a joined feature matrix for the ligand and target protein, which is passed into one GAT layer that learns attention based on the PLI adjacency matrix and a second GAT layer that learns attention based on the ligand adjacency matrix. The output of these two GAT layers are subtracted in the final step of each attention head.
In the absence of a co-crystal structure of a protein-ligand complex, docking simulations are typically performed to model the PLI. In GNN P , the 3D structures of the protein and ligand are initially embedded separately based on their adjacency matrices, which represent internal bonding interactions. The GNN P attention head passes separate features for the protein and ligand to individual GAT layers that learn attention based on the respective adjacency matrix. The outputs of the GAT layers are concatenated in the final step of the attention head. Separation of the ligand and protein in parallel GAT layers preformed by GNN P removes prior information about the interactions, providing a foundation to remove the need for docked structural information. This discrete representation enables us to enter the protein and ligand directly into the GNN without knowing the prior protein-ligand interaction, which otherwise needed to be computed using physics-based simulations.
In both models, each node is given a set of features F, described in Table 1, which are engineered with an emphasis toward biochemical information, using the molecular Python package RDKit. When input into a GAT layer along with the corresponding adjacency matrix, each node feature is transformed by a learned weight matrix W ∈ R F ×F , where F is the dimensions of the node features attributed to input represented by h = {ĥ 1 ,ĥ 2 ,...,ĥ N } whereĥ i ∈ R F and N is the number of atoms. The attention coefficient e ij for interacting atoms i and j is calculated as a summation of the importance of the i-th node interaction with the j-th node and vice versa. For node i given an input feature matrixĥ i at convolution layer l, the attention coefficient e ij is given as: Using the softmax activation function, the attention coefficients are normalized across neighbors and multiplied by the adjacency matrix A ij , which gives higher importance to node pairs closer in distance, reflecting the physical principle that the strength of an intermolecular bond decreases as the bond distance increases. The normalized attention coefficient a ij is given by: Each feature is then updated as a linear combination of neighboring node featuresĥ i . Finally, the gated attention mechanism is employed to give the transformed set of node features: where U ∈ R 2F ×1 is a learned vector, b is a learned scalar, and φ is the activation function.
These features are passed through a multilayer perceptron (MLP). For binary classification of the activity of the protein-ligand complex, the sigmoid activation function is applied, and the binary cross-entropy loss function is used. For regression, the ReLU activation function is applied, and the mean squared error loss function is used.  Figure 1: Schematic showing the prediction logic implemented in our GNN models. The two models differ based on the applied attention head. GNN F uses the PLI obtained from docking simulations to create a combined feature and adjacency matrix. In GNN P , the features for the ligand and target protein are coded separately alongside their corresponding adjacency matrices. The output from the attention head is passed through a series of MLP that can be tuned for activity classification through application of the sigmoid activation function and binary cross-entropy loss function or property regression through application of the linear activation function and mean squared error loss function.
x b a Bond feature; b bond feature is indirectly considered by a corresponding atom-level feature that captures the same physical property.

Classification Model Datasets
Machine learning models for training PLI prediction require data on target proteins, ligands/compounds, and interactions between them. In this work, our goal is to improve the degree to which the graph-based model can be generalized while also maintaining accuracy.
We accomplish this by enlarging the dataset used to train our model and including a variety of targets. We collected and curated protein-ligand complexes from two public datasets, DUD-E 30 and PDBbind, 31,32 which are described in further detail below. The datasets consist of protein-ligand complexes and their docking affinities, while some samples include experimentally derived binding affinities. Prediction is based on accessing ligands as active (active-interact molecules or positive) or inactive (set of decoys or negative) depending on whether the ligand is able to bind with the protein, as described in more detail below. Table   2 shows counts of the targets, ligands, and their various complexes for both datasets. To confirm that these two datasets offer diversity in terms of both target protein and different functionality of the ligand, we computed pair similarities between the targets and ligands, which can be found in the supporting information Figure S??. Target similarities were determined by computing the homology between each pair, while ligand similarities were taken as the Dice similarity coefficient of the Morgan fingerprints (diameter = 6) of each pair. In both datasets, the target similarity is centered around 40%, with DUD-E having a more diverse target set than PDBbind. The opposite is true for ligand similarity. While both datasets have mainly dissimilar ligands, the PDBbind dataset offers more diversity in ligand structure. Combining these datasets for training leads to a highly diverse training set in terms of both target proteins and ligand molecules.
DUD-E: The DUD-E dataset consists of pairs of experimentally verified active complexes and property-matched inactive pairs, called decoys. 30 The dataset was originally designed to test benchmark molecular docking programs by providing challenging decoys, but some have noted that the dataset suffers from limited chemical space and biases. 33,34 In their Ligands with a molecular weight above 500 Da were removed, and the docked structures were converted to a machine-readable format using RDKit. 35 The target proteins were pulled from the original DUD-E set, cleaned of water molecules, and matched with their designated ligands. Targets were screened with a homology test to understand the diversity of targets present in the dataset, and none had a similarity greater than 90%. Each complex was processed to crop the target protein and retain atoms within a distance of 8 Å from the ligand. Because the DUD-E dataset is biased towards inactive complexes due to the large number of decoys, equal counts of active and inactive samples were collected in regards of a target with no consideration of a specific complex. A complete 1:1 active-to-inactive ratio was not always achieved; however, the imbalance was found to be minimal and to have no affect on performance. Samples from the experimentally verified ChEMBL dataset were labeled as active, while decoy samples were denoted with the key word ZINC.
PDBbind: The PDBbind dataset contains experimentally verified protein-ligand complexes from the Protein Data Bank. 31,32 Binding poses for a refined set of the protein-ligand com-plexes were generated by docking calculations. 36 A 90% homology test was run on this set, which resulted in a small number of proteins being removed because of a high level of similarity. As with the DUD-E dataset, the docked structures were converted to a machine-readable format using RDKit. The root-mean-square distance (RMSD) was used to label ligands as positive if they maintained an RMSD less than or equal to 2Åcompared to the original crystal structure, and negative if the RMSD was greater than or equal to 4Å. Molecules with RMSDs between these thresholds were removed. The viable molecules and their target proteins were processed into a dataset of protein atoms cropped within a distance of 8Å. As PDBbind data were significantly limited, all available samples were used. This resulted in a greater inclusion of negative samples than positive samples. IBS dataset: The IBS dataset is created using 486,232 synthetic compounds from the IB screening database (www.ibscreen.com). All ligands were fed as an input to the our GNN models paired against M Pro and NSP15 target. Because the IBS dataset only contains ligands and we did not dock the IBS molecules with their corresponding receptors, we used our GNN P model to evaluate these complexes. We chose SARS-CoV-2 (M Pro ) and SARS-CoV-2 non-structural protein endoribonuclease (NSP15) as target proteins to study the performance of our model. These are exactly the same targets we have used for our creating our SARS-CoV-2 dataset. Our team has been actively workin1g on the development of covalent electrophiles and non-covalent inhibitor candidates against the viral proteases, this gave us ready access to the protein pockets and active compounds that were binding with these targets. In addition, we performed homology tests on M Pro and NSP15 against the targets in the DUD-E and

SARS-CoV
PDBbind training sets to quantify the similarity of these new targets to our larger dataset.
M Pro showed an average similarity of 40% with the DUD-E training targets and 48% with the PDBbind training targets. NSP15 showed an average similarity of 55.72% with DUD-E targets and 52.26% with PDBbind targets.

Regression Model Datasets
In this section, we discuss the datasets used for the regression models noting that part of these datasets are created from the same sources used for the classification models. Our regression models are composed of the same attention heads as in the classification models and, therefore, include the same domain-level information from the protein and ligand atoms.
The main difference in our classification and regression models is the final activation layer, as shown in Figure 1. We perform two regression experiments referred as Experimental Binding Affinity (EBA)and pIC 50 prediction. It is important to highlight that some of these properties such as pIC 50 cannot be accurately modeled through physics based modeling methods. Table 3 and 4 gives a summary of the target and ligands distribution in various regression datasets.
For Experimental Affinity experiments, we consider three data sources: 1) PDBbind2016, 2) PDBbind2018, and 3) PDBbind2019. 32 We consider just crystal poses from the PDB-bind2016 general, refined, and core datasets. For the PDBbind2018 dataset, we used both the docked and crystal structures as two independent datasets for two independent experiments. For the docked dataset, we used the same targets and split as used for classification experiments. For the crystal-only dataset, we used all the available targets without any homology-similarity screening. For the PDBbind2019 dataset, we considered just a handful of crystal structures as a part of independent test set. The PDBbind2019 dataset is a structure-based evaluation set comprises targets that have been added to PDBbind2019 after the year 2016 and thus are not a part of PDBbind2016 and have a GDC similarity less than 65% to the training targets. For the PDBbind2016 dataset, we prepared various train-test splits, which are summarized in Table S??. The experimental binding affinity is estimated in terms of K i and K d , which refer to the inhibition and dissociation constants, respectively. These properties collectively determine the binding affinity of a molecule towards a receptor. The PDBbind repository provides a database of protein-ligand complexes along with their experimentally measured data. Here, we define the experimental binding affinity as −log( K i K d ), which is used as the label in the regression model. All decoys from the PDBbind2018 dataset were labeled with the experimental affinity of the corresponding crystal structure.
We focus on the pIC 50 (the inverse log of the half maxi-mal inhibitory concentration, IC 50 ), which is an experimentally measured property that captures the potency of a therapeutic candidate towards a protein target where higher values indicate exponentially more potent inhibitors. For the pIC 50 data, we used a combination of the DUD-E and PDBbind2016 datasets. From DUD-E, we included active ligands for 65 of the DUD-E targets, as only active protein-ligand pairs have an associated experimental affinity in the ChEMBL repository. We considered the top-scoring docked pose for each protein-ligand complex in the DUD-E dataset because there were no crystal structures available. For the PDBbind2016 data, all crystal poses with an experimentally measured IC 50 were used. We retained the same 80:20 split for the PDBbind2016 dataset as used for the experimental affinity dataset. Table 4 gives details of the dataset used for pIC 50 regression.
We also considered an independent dataset associated with SARS-CoV-2 targets. To  Table 5.

Hyperparameter Optimization
Hyperparameters such as network depth, layer dimension, and learning rate can have a large effect on model training and the weights in the final realized model. Therefore, we performed a number of trainings to examine combinations of learning rate, number of attention heads, and layer dimension. The hyperparameters that performed best were a learning rate of 0.0001, two attention heads, and a dimension of 70. These parameters resulted in an average test AUROC of 0.864. The combinations of these parameters are summarized in Table S??, and all successfully completed hyperparameter experiments can be found in Table S?? along with further explanation of the trials.

GNN Classification Models
Our primary goal in this work is to develop a model that has high accuracy and that can be generally applied for predicting PLI and activity using distinct atom-and bond-level features (domain awareness) for the protein and ligand. To this end, we would like to understand the effects of the number of protein targets and the number of protein-ligand complexes per target; therefore, we train the GNN F and GNN P models on a variety of datasets. The datasets consisted of either 17, 79, or 96 targets from the DUD-E dataset and all available data from the PDBbind dataset, which are summarized in Figure S?? in the SI. In all cases, the accuracy greatly increases from 0.723 to 0.879 when the number of targets is increased from 17 to 79, and then decreases to 0.842 when the number of targets is further increased to 96. The accuracy of each experiment is shown in Table S??. In each training set, we examined the effects of having 1,000 complexes per target or 2,000 complexes per target.
When using 17 DUD-E targets, including more complexes per target did not improve the accuracy, while in both the 79 and 96 DUD-E target sets, the accuracy was improved to over 90% when the dataset consisted of 2,000 complexes per target. Notably, the improvement was greater for the 96 DUD-E target set. It is important to note that all training sets had an equal distribution of positive and negative samples and complexes were randomly divided into training and test sets with an 80:20 split.
In addition to making sure the model can be generally applied, we are interested in developing a model that does not require the docked structure to be known before inferences can be made. We performed the same experiment using varying numbers of DUD-E targets and complexes on our GNN P model, which does not require advance knowledge of the proteinligand interaction to predict activity. The same overall trends were observed for GNN P as for GNN F but with reduced accuracy, as shown in Table S??. The highest scoring GNN P model was that trained on 79 DUD-E targets with 2,000 complexes per target with an accuracy of 0.880%, which is only 3.2% lower than the GNN F model trained on the same dataset.
Though decreased accuracy was observed with GNN P , its advantage relies on knowledge of only the separated protein and ligand structures, greatly reducing required preprocessing steps and increasing the throughput of the trained model. To assess the ability of our GNN models to be generally applied, we produced three test sets of varying similarity to the training set (  GNN P , however, showed reduced accuracy for the base and distinct test sets as compared with the overlap test set, indicating decreased capability to be generally applied. Overall, each implementation shows significantly improved performance in terms of prediction compared to docking. We can see that the overlap set produces a slight increase in AUROC, decrease in specificity, and significant increase in sensitivity.

Top-N ranks
We then assessed the model's ability to identify top-scoring 3D poses in each protein-ligand complex from our PDBbind repository described in the dataset preparation section. The 'best' docked pose is quantified as having an RMSD of less than 2 Å with respect to the crystal structure. In this analysis, we measure not only the ability of the model to identify an active protein-ligand pose but also its ability to identify the best pose among multiple docked poses. Figure 2 shows the percent of complexes found in the top-N ranks. Figure 2: Comparison of the GNN P and GNN F models with docking. Each bar corresponds to the percentage of protein-ligand complexes identified in top-N ranks which have an RMSD less than 2 Å from the crystal structure.
In each rank, both the GNN F and GNN P models outperformed or matched the performance of docking when using complexes from the training data. However, on the test data, docking showed better performance for certain ranks, while both docking and GNN F identified 100% of the protein-ligand complexes. For the top rank, all three methods showed 100% identification. If we consider the percent targets with at least one pose in the top-N ranks, all three models show equivalent performance (see Fig. S??).
In addition, we tested our models on datasets that include molecules from ChEMBL for DUD-E targets and SARS-CoV-2-target-specific data. Additional ChEMBL data were collected for a small subset of DUD-E targets extracted from the initial test set and imple- Dalton (see Fig. S??). For the M Pro target, our results indicate that the binding probability is centered near zero and the majority binding probability is under 0.5. For the NSP15 target, the binding probability is centered closer to 0.1 with a small number of compounds having probabilities greater than 0.5. We observe that GNN P consistently performed better in relation to the NSP15 target. Contributing factors to this increased performance include a larger number of available active compound samples for the NSP15 targets; that is, 2,157 samples as opposed to 307 samples available for the M Pro target. The NSP15 target has a disproportionately larger binding pocket than that of M Pro , which also attributes to the improved performance.

GNN as a Regression Model
Predicting the binding affinity of a protein-ligand complex plays a critical role in identifying a lead molecule that binds with the protein target; however, the experimental measurement of protein-ligand binding affinity is laborious and time-consuming, which is one of the greatest bottlenecks in drug discovery. On the other hand, half maximal inhibitory concentration (IC 50 ) provides a quantitative measure of the potency of a candidate to inhibit a protein target and is typically estimated from experiments. If we can predict affinity and potency of a specific ligand to a target protein quickly and predict their interactions accurately, the efficiency of in silico drug discovery would be significantly improved. To this end, we modified both GNN models to perform regression in order to predict the biophysical and biochemical properties (affinity and potency) of protein-ligand complexes. Our regression models are composed of the same attention heads as in the classification models and, therefore, include the same domain-level information for the protein and ligand atoms. We evaluated these models on each of the datasets discussed in the Methods section.

Experimental Binding Affinity (EBA) regression experiments:
To predict binding affinity and assess the performance of our model, we performed three experiments using three different datasets: 1) PDBbind2018-docked, 2) PDBbind2018 crys-tal, and 3) PDBbind2016 crystal structure dataset. Since most of the deep learning models that we compare in this section are trained on complexes from the PDBbind2016 database, training our models on the same dataset helps to obtain a comparison with the previous models. [45][46][47] We also tested our models on the PDBbind2016 core set, which is a refined subset filtered on the basis of protein-sequence similarity.
First, we assessed the performance of the GNN models trained using PDBbind2018 data docked and crystal-only data as shown in Table 7. Notably, the addition of more unique protein-ligand complexes improves the performance, while the addition of multiple docking poses for fewer complexes decreases performance, as the Pearson and Spearman correlations are low on the docked dataset as compared to the crystal-only dataset. We compared our methods with Pafnucy, 46 which is a 3D CNN for protein-ligand affinity prediction that combines the 3D voxelization with atom-level features. Among all the results, the best performance is achieved by the GNN P -EBA model, and overall, both GNN P and GNN F outperform Pafnucy on both the docking and crystal-only datasets. models. Our analysis on the PDBbind2016 core set shows that the GNN P -EBA model performs similarly to the FAST model, 45 while K DEEP shows the highest performance (see Table S?? for detailed comparison). We also report the performance of our model on the dataset used for training and evaluating the FAST model (see Table S?? for details) using the same train-validation-test splits as provided by the authors over the PDBbind2016 general (G) and refined (R) sets. 45 Out of all the combinations of the general and refined sets, we achieved the best performance with the GNN P model trained on the general set.
Our results on the PDBbind2018 and PDBbind2016 datasets suggests that with our proposed GNN frameworks, we achieve top performance compared with prior deep learning methods for binding affinity prediction while preserving the spatial orientation of the protein and ligand. While K DEEP is a purely 3D CNN-based network for protein-ligand affinity predictions, the FAST model uses a graph-based network, but utilizes the 2D graph representation which does not include the non-covalent interactions. This structural information is critical for understanding PLI and its impact on estimating affinity. To tackle these issues, we devised our graph-based models to specifically include all the atom-and bond-level information while using the 3D structures of the protein and ligand. The graph representation not only encloses atom-and bond-level information, but also retains the spatial information associated with the protein-ligand complexes, thus enabling us to include all necessary information associated with the natural binding state of the protein and ligand. Our GNN F model accounts for intermolecular interactions between the protein and ligand, which are not captured in the FAST model.
To investigate the generalizability of our GNN models in predicting the binding affinity of unseen and novel targets, we compare the performance of our GNN and various models on the PDBbind2019 structure-based evaluation dataset ( Table 8). The PDBbind2019 structurebased evaluation dataset is composed of targets that are novel from the PDBbind2016 in terms of their addition to the database as well as sequence similarity. The prediction results of the GNN models are better than those of previous models, as shown in Table 8, and most importantly, our GNN F -EBA model performs as accurate as Pafnucy. Our results demonstrate that even with limited 3D structural data in terms of the size typically needed to train deep learning models, we achieved relatively more accurate generalization with the GNN model. In addition, our GNN outperforms FAST and K DEEP in terms of generalizability given the scarcity of available structural data. Finally, to expand the scope of our experiment, we also trained our model on a physicsbased docking affinity score. We refer to these models as Docking Binding Affinity (DBA) models. We compare our model's performance against the physics-based docking on the two DBA datasets. Our results suggest that, with a correlation score of 0.79, the GNN F model was able to reproduce some correlation between the actual and predicted affinity to an extent (see Tables S?? and S?? for detailed dataset description and results). This indicates that, with our GNN framework, we are not only capturing the details needed for predicting the binding affinity but also achieving the capability to differentiate between distinct docked poses of a protein-ligand complex and associate it with its docking score.
pIC 50 regression experiments: As a next step, we tailored our GNN models to predicting pIC 50 . pIC 50 provides a quantitative measure of the potency of a candidate to inhibit a protein target, which is typically estimated from experiments. A number of methods have been developed to approximate pIC 50 , so we compared our pIC 50 prediction with existing deep-learning methods, such as DeepAffinity, 17 DeepDTA 18 and MONN. 29 While these methods were trained purely on the protein sequence and 2D SMILES representation of the ligands, our model is novel in that it considers the 3D structures of the protein and ligand to predict pIC 50 , which is key to defin-ing the inhibition rate while identifying and optimizing hits in early-stage drug discovery.
To the best of our knowledge, 3D protein-ligand complexes have not been used to predict pIC 50 before.
The overall performance of the GNN model is relatively low compared to the existing methods listed in the Table 9. This could be attributed to the smaller size of the dataset containing both pIC 50 and the corresponding crystal structures. DeepAffinity, 17 DeepDTA, 18 and MONN 29 were trained on BindingDB data, which has nearly 10 times the amount of data than that available for our dataset. In addition, we observed improvement in the Pearson correlation coefficients in predicting pIC 50 from 0.45 to 0.51 when using weights from the GNN-EBA model. The improvement from the baseline to the transfer learning model suggests that our model can achieve better performance if a larger dataset is used, as the GNN-EBA models are trained on comparatively larger datasets. To assess how critical the learned protein and ligand representations are for predicting pIC 50 , we predicted pIC 50 of a few inhibitors that were recently designed for SARS-CoV-2 Mpro, where the co-crystal structures have been solved and their IC 50 have been experimentally measured. From the perspective of deep learning and 3D protein-ligand complex representation, this is a much more difficult regression problem compared to the classification problem above. Our ultimate goal was to quantify the error in IC 50 prediction relative to the experimental measurement that could used for iterative design of potential inhibitors or lead optimization. Our extensive analysis on the SARS-CoV-2 M Pro data demonstrates that both GNN models overestimate IC 50 by 0.92% as compared to experimental values, as shown in Table 10.
The pIC 50 of our recently designed MCULE-5948770040 compound with M Pro 7TLJ complex with GNN P model predicted to be 6.20, which is comparable to the measured experimental value of 5.37. 2 Interestingly, GNN P proved to be the best model with an average error of 0.42 Molar. It is important to highlight that GNN P gives an advantage in predicting pIC 50 even when the experimentally bound structure is not known. This can help rank potent candidates while screening potential libraries against a protein target or possible protein targets of a given disease, which can then be utilized for experimental testing.

Conclusion
In this work, we devised graph-based deep learning models, GNN P and GNN F , by integrat- dataset, we achieved comparable performance to existing methods that were trained on relatively larger 2D sequence datasets. With the availability of pIC 50 data and corresponding protein structure predicted from Alphafold, 48 the GNN model performance can be further improved.
Our model is unique and novel in that it considers the 3D structures of the protein and ligand to predict affinity and pIC 50 , which is key to provide a quantitative measure of the potency and selectivity of a candidate to inhibit a specific protein target. To accelerate in silico hit identification and lead optimization in the early stage of drug design, our GNN P model can be used to screen a large ligand library to predict either biophysical properties or activities against a given protein target or set of targets for specific disease.

Supporting Information
The Supporting information is available with

Data and Code availability
We collected and curated protein-ligand complexes from two public datasets, DUD-E 30 and PDBbind 31,32 which are described in further detail in Dataset Preparation subsection. SARS-CoV-2 and other dataset is generated in using high throughput docking simulations and the IBS dataset is created using known synthetic compounds from the IB screening database (www.ibscreen.com). The trained GNN P and GNN f models and the code to reproduce this study is available at https://github.com/nkkchem/pf-gnn_pli and https://github.com/PNNL-CompBio/pf-gnn_pli.     The PDBbind2016-EBA core dataset consists of the entire general and refined as the train set and just the core set as the test set. The rest of the model's test set includes a fraction of the general and refined dataset along with the core set.

Author Contributions
GNN F and GNN P 1k and 2k Comparison

Top-N Ranks Analysis of Classification Models
We compare the performance of GNN with docking in identifying protein-ligand complexes in top-n ranks. We choose the best GNN model (2k_79) to get predictions on the train and the test dataset. For the GNN, we rank the models in ascending order of predicted binding probabilities. For docking we rank the models based on the descending order of binding affinity scores obtained as a part of docking pipeline. We assess the performance of each of the scoring method (GNN and docking) on their ability to identify protein-ligand models with RMSD less than 2Å from the experimental crystal structure.  To measure the potency of GNN models. We assess the model's ability to identify docked poses per protein-ligand complex. With this, we not only measure the ability of the model to identify an active protein-ligand pose but also its ability to identify the best pose amongst multiple docked poses for a protein-ligand complex (see Fig. SS3). Figure S4: Distribution of properties of IBS molecules to measure their drug-likeliness. The properties include the synthetic accessibility (SA) score, quantitative estimation of druglikeliness (QED), and the partition coefficient (logP).

IBS molecule properties
The SA score is the synthesizability of generated molecules and ranges between 0-10, where the lower end suggests increased accessibility. QED is a measures the quantifying and ranking the drug-likeness of a compound. The values range from 0 for unfavorable to 1 for favorable. The partition coefficient, logP is a measure that determines the physical nature of a compound and its ability to reach the target in the body. A positive logP value indicates the compound is lipophilic and a negative logP value indicates a hydrophilic compound. We can see that more than 60% of compounds have a low SA score ( indicates easily synthesizable) and a high QED ( indicates high drug-likeliness) and a lopP between -0.4 and 5.6.
These properties suggest that considerably a greater number of IBS molecules could have a high potency to bind to a known receptor. The GNN F model displays increased performance to the GNN P as the density of higher pIC 50 valued molecules is concentrated at high binding probability values, as shown in Figure   S6. We can see a similar trend for GNN P model as well, except that the model identifies quite a majority of high pIC 50 valued compounds as poorly binding complexes. While looking at the performance on the training dataset, both GNN F and GNN P showed a good learning by attributing high pIC 50 compounds with a high binding probability. Overall we can say that GNN F has shown a good performance on both training and test set while learning the corellation between the experimental binding affinity and probability of binding.

IBS molecules -Binding probability vs molecular weight
GNN p test set has a bimodal distribution for probability 0, so it gave errors on a number of low IC50 samples compared to GNN f where the errors were mainly for high IC50 samples

Docking Binding affinity Data and Results
For Docking Binding Affinity (DBA) regression models, we used 1000 positive and 1000 negative poses per target in DUD-E dataset as we did for the classification dataset. In addition, for the DBA scores, we used the docking score as obtained from the docking pipeline for all the PDBbind and DUD-E decoys.