Introduction

The accurate prediction of binding interactions between chemicals and proteins is a critical step in drug discovery, necessary to identify new drugs and novel therapeutic targets, to reduce the failure rate in clinical trials, and to predict the safety of drugs1. While molecular dynamics and docking simulations2,3 are frequently employed to identify potential protein-ligand binding, the computational complexity (namely, run-times) of the simulations and the lack of 3D protein structures significantly limit the coverage and the feasibility of large-scale testing. Therefore, machine learning (ML) and artificial intelligence (AI) based models have been proposed to circumvent the computational limitations of the existing approaches4, leading to the development of models that rely either on deep learning architectures or chemical feature representations5,6,7.

Deep learning frameworks formulate the binding prediction problem as either a binary classification task or a regression task. The successful training of a binary classifier requires positive samples, pairs of proteins and ligands that are known to bind to each other, typically extracted from protein-ligand binding databases like DrugBank8, BindingDB9, Tox2110, ChEMBL11, or Drug Target Commons (DTC)12. Training also requires negative samples, i.e., pairs that do not interact or only weakly interact. However, positive and negative annotations are not evenly distributed across proteins and ligands: some proteins and ligands have disproportionately more positive annotations than negative ones, and vice-versa. ML models learn this annotation imbalance and consequently predict that some proteins and ligands bind disproportionately more often than others. In other words, the ML models learn the binding patterns from the degree of the nodes in the protein-ligand interaction network, neglecting relevant node metadata, like the chemical structures of the ligands or the amino acid sequences of the proteins5,13. This annotation imbalance leads to good performance, as quantified by the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision Recall Curve (AUPRC), for the unknown annotations associated with missing links in the protein-ligand interaction network used for training. A key signal of such shortcut learning is the degradation of the performance of an ML model when asked to predict binding between novel (i.e., never-before-seen) protein targets and ligands. This modeling limitation is in line with the findings of Geirhos et al.14, who showed that deep learning methods tend to exploit shortcuts in training data to achieve good performance. Laarhoven et al. discuss a similar bias in drug-target interaction data and its effect on cross-validation performance15. Lee et al.16 and Wang et al.17 proposed approaches that partly address shortcut learning, but fail to generalize to unexplored proteins, i.e., proteins that lack sufficient binding annotations or originate from organisms with no close relatives in current protein databases. More recently, models such as MolTrans18, MONN19, and TransDTI20 explore innovative structural representations of protein and ligand molecules. Though these models better leverage the molecular structures to predict binding, end-to-end training limits their ability to generalize beyond the molecular scaffolds present in the training data.

Here, we introduce AI-Bind, a pipeline for predicting protein-ligand binding that can successfully generalize to unseen proteins and ligands. AI-Bind combines network science methods with unsupervised pre-training to control for both overfitting and the annotation imbalance of existing libraries. We leverage the notion of shortest path distance on a network to identify distant protein-ligand pairs as negative samples. Combining these network-derived negatives with experimentally validated non-binding protein-ligand pairs, we ensure sufficient positive and negative samples for each node in the training data. Additionally, AI-Bind learns, in an unsupervised fashion, the representation of the node features, i.e., the chemical structures of ligand molecules or the amino acid sequences of protein targets, helping circumvent the model’s dependency on limited binding data. Instead of training the deep neural networks in an end-to-end fashion using binding data, we pre-train the embeddings for proteins and ligands using larger chemical libraries, allowing us to generalize the prediction task to chemical structures beyond those present in the training data.

Results

Limitations of existing ML models

ML models characterize the likelihood of each node (proteins and ligands) to bind to other nodes according to the features and the annotations in the training data. While annotations capture known protein-ligand interactions, features refer to the chemical structures of proteins and ligands, which determine their physical and chemical properties, and are expressed as amino acid sequences or 3D structures for proteins, and chemical SMILES21 for ligands. In an ideal scenario, the ML model learns the patterns characterizing the features which drive the protein-ligand interactions, capturing the physical and chemical properties of a protein and of a ligand that determine the mutual binding affinity. Yet, as we show next, multiple state-of-the-art deep learning models, such as DeepPurpose5, ignore the features and rely largely on annotations, i.e., the degree information for each protein and ligand in the drug-target interaction (DTI) network, as a shortcut to make new binding predictions. A bipartite network represents the binding information as a graph with two different types of nodes: one corresponding to proteins (also called targets, representing, for example, human or viral proteins) and the other corresponding to ligands (representing potential drugs or natural compounds). A protein-ligand annotation, i.e., evidence that a ligand binds to a protein, is represented as a link between the protein and the ligand in the bipartite network22. Experimentally validated annotations define the known DTI network. While binding depends only on the detailed chemical characteristics of the nodes (proteins and ligands), as we show here, the predictions of many ML models are primarily driven by the topology of the DTI network. We begin by noticing that the number of annotations linked to a protein or a ligand follows a fat-tailed distribution22, indicating that the vast majority of proteins and ligands have only a small number of annotations, which then coexist with a few hubs, nodes with an exceptionally large number of binding records22. For example, the number of annotations for proteins follows a power law distribution with degree exponent γp = 2.84 in the BindingDB data used for training and testing DeepPurpose, while the ligands have a degree exponent γl = 2.94 (Fig. 1a). For these degree exponents, the second moment of the distribution diverges for large sample sizes, implying that the expected uncertainty in the binding information is highly significant, limiting our ability to predict the binding between a single protein and a ligand22,23. Furthermore, positive and negative annotations are determined by applying a threshold on kinetic constants like the dissociation constant Kd. If the kinetic constant associated with a protein-ligand pair is less than a set threshold, we consider that pair as a positive or binding sample; otherwise, the pair is tagged as negative or non-binding. However, Kd values are not randomly distributed across the records: the number of annotations k and the average Kd per degree, 〈Kd〉, calculated across all links stemming from nodes of degree k, are anti-correlated (Fig. 1b), indicating a stronger binding propensity for proteins and ligands with more annotations (rSpearman(kp, 〈Kd〉) = −0.47 for proteins, rSpearman(kl, 〈Kd〉) = −0.29 for ligands in the BindingDB data used by DeepPurpose). Furthermore, we observe lower variability in Kd values across links originating from high-degree nodes, compared to lower-degree nodes (see Supplementary Note 1).
As the annotations follow fat-tailed distributions, the observed anti-correlation drives the hub proteins and ligands to have disproportionately more binding records on average, whereas proteins and ligands with fewer annotations have both binding and non-binding examples. This annotation imbalance prompts the ML models to leverage degree information (positive and negative annotations) in making binding predictions instead of learning binding patterns from the molecular structures. We term this phenomenon the emergence of topological shortcuts (see Supplementary Note 1).
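The degree–affinity anti-correlation can be reproduced directly from an annotation table. Below is a minimal sketch in Python, with synthetic data and hypothetical column names standing in for the BindingDB records:

```python
# Sketch: for each protein degree k, average Kd over all links touching a node
# of that degree, then measure the Spearman correlation between k and <Kd>.
import pandas as pd
from scipy.stats import spearmanr

# One row per protein-ligand annotation with a measured Kd (nM); toy values.
edges = pd.DataFrame({
    "protein": ["P1", "P1", "P2", "P3", "P3", "P3"],
    "ligand":  ["L1", "L2", "L1", "L2", "L3", "L4"],
    "kd_nm":   [12.0, 450.0, 8.5e5, 30.0, 2.1e6, 95.0],
})

edges["k_protein"] = edges.groupby("protein")["protein"].transform("size")
mean_kd_per_k = edges.groupby("k_protein")["kd_nm"].mean()

rho, pval = spearmanr(mean_kd_per_k.index, mean_kd_per_k.values)
print(f"Spearman r(k_p, <Kd>) = {rho:.2f} (p = {pval:.2g})")
```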

Fig. 1: Annotation bias in BindingDB training data and DeepPurpose predictions.

a Distributions of the number of annotations in the benchmark BindingDB data, shown on double logarithmic axes (log-log plot), indicate that P(kp) and P(kl) are well approximated by power laws for both proteins (pink) and ligands (green), with approximate degree exponents γp = 2.84 and γl = 2.94, respectively. b The average Kd over the links is negatively correlated with the protein degree, with rSpearman(kp, 〈Kd〉) = −0.47. For the ligands, we observe a similar anti-correlation, with rSpearman(kl, 〈Kd〉) = −0.29. c The distribution of degree ratios for the proteins {ρp} and the ligands {ρl} in the original DeepPurpose training dataset (for a selected fold from the 5-fold cross-validation). The degree ratio, defined in Equation (1), refers to the ratio of positive annotations to the total annotations for a given node in the protein-ligand interaction network. After thresholding the Kd values associated with each link to create the binary labels, the hubs on average get more positive or binding annotations, whereas the low-degree nodes get both binding and non-binding annotations. As the hubs are associated with many links in the network, learning the type of binding from the degree information helps ML models achieve good performance by leveraging shortcut learning. The Source Data File provided with the manuscript contains the number of samples per data point in the plots.

To investigate the emergence of topological shortcuts, for each node i with number of annotations ki, we quantify the balance of the available training information via the degree ratio,

$${\rho }_{i}=\frac{{k}_{i}^{+}}{{k}_{i}^{+}+{k}_{i}^{-}}=\frac{{k}_{i}^{+}}{{k}_{i}},$$
(1)

where \({k}_{i}^{+}\) is the positive degree, corresponding to the number of known binding annotations in the training data, and \({k}_{i}^{-}\) is the negative degree, or the number of known non-binding annotations in the training data (Fig. 2a, b). As most proteins and ligands lack either binding or non-binding annotations (Table 1), the resulting {ρi} are close to 1 or 0 (see Fig. 1c); these ρ values capture the annotation imbalance in the prediction problem. As many state-of-the-art deep learning models, such as DeepPurpose5, uniformly sample the available positive and negative annotations, they assign higher binding probability to proteins and ligands with higher ρ (Fig. 2c, d). Consequently, their binding predictions are driven by topological shortcuts in the protein-ligand network, which are associated with the positive and negative annotations present in the training data rather than the structural features characterizing proteins and ligands.
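Computing the degree ratio of Equation (1) requires only a count of positive and total annotations per node; the example below is a minimal sketch (node names and labels are hypothetical):

```python
# Sketch of Equation (1): rho_i = k_i^+ / (k_i^+ + k_i^-), computed from
# a list of (node, label) annotations where label 1 = binding, 0 = non-binding.
from collections import Counter

annotations = [
    ("P1", 1), ("P1", 1), ("P1", 0),
    ("L7", 1), ("L7", 0), ("L7", 0),
]

pos, total = Counter(), Counter()
for node, label in annotations:
    total[node] += 1
    pos[node] += label

degree_ratio = {node: pos[node] / total[node] for node in total}
print(degree_ratio)  # {'P1': 0.667, 'L7': 0.333}
```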

Fig. 2: Drug-Target Interaction Network.

a The drug-target interaction network used to train the DeepPurpose models consists of 10,416 ligands and 1391 protein targets. Ligands and proteins are represented by green and pink nodes, respectively. b Network neighborhood of the ligand Ripk1-IN-7. Solid links represent positive or binding annotations, while dashed links refer to negative or non-binding annotations. Ripk1-IN-7 has one positive and two negative annotations in the training data, implying a degree ratio ρ of 0.33. c Protein degree ratios {ρp} and DeepPurpose predictions are highly correlated, with rSpearman = 0.94. The top 100 false positive protein-ligand pairs involve proteins with large {ρp}, represented by red crosses, whereas the top false negative pairs involve proteins with small {ρp}, represented by blue triangles. d Examples of proteins and ligands with large degree ratios, contributing to false positive predictions. Source data are provided as a Source Data file.

Table 1 BindingDB training data for DeepPurpose

The higher binding predictions in DeepPurpose for proteins with large degree ratios (Fig. 2c) prompted us to compare the performance of DeepPurpose with network configuration models, algorithms that ignore the features of proteins and ligands and instead predict the likelihood of binding by leveraging only topological constraints derived from the network degree sequence22,24,25. In the configuration model (Fig. 3a, Methods), the probability of observing a link is determined only by the degrees of its end nodes. In a 5-fold cross-validation on the benchmark BindingDB dataset (Table 1), we find that the top-performing DeepPurpose architecture, Transformer-CNN5, achieves AUROC of 0.86 (±0.005) and AUPRC of 0.64 (±0.009). At the same time, the network configuration model on the same data achieves an AUROC of 0.86 (±0.005) and AUPRC of 0.61 (±0.009) (Fig. 3b).

Fig. 3: Comparing DeepPurpose and the duplex configuration model.

a The duplex configuration model includes two layers corresponding to binding and non-binding annotations between proteins (pink nodes) and ligands (green nodes). Positive link (solid lines) and negative link (dashed lines) probabilities are determined by entropy maximization (see Methods), and are used to estimate the conditional probability in transductive (Equation (7)), semi-inductive (Equation (8)), and inductive (Equation (9)) scenarios. b–d The configuration model achieves performance similar to DeepPurpose on the benchmark BindingDB data in a 5-fold cross-validation (dots represent the performance of each fold, bar height corresponds to the mean, n = 5). The breakdown of performance shows good predictive power in the transductive and semi-inductive scenarios; however, the same models perform poorly in the inductive setting. Source data are provided as a Source Data file.

In other words, the network configuration model, relying only on annotations, performs just as well as the deep learning model, confirming that the topology of the protein-ligand interaction network drives the prediction task. The major driving factor of the topological shortcuts is the monotone relation between k and 〈Kd〉, which associates a link type with the degree of its end nodes. Moreover, in BindingDB we observe that hubs show lower variance in Kd compared to low-degree nodes, making the degree of the hubs a stronger predictor of the link types. Since hub nodes contribute the majority of the links in the protein-ligand bipartite network, the configuration model achieves excellent test performance simply by predicting the link types associated with the hubs from their degree information. To further investigate this hypothesis, we tested three distinct scenarios: (i) unseen edges (Transductive test), when both proteins and ligands from the test dataset are present in the training data; (ii) unseen targets (Semi-inductive test), when only the ligands from the test dataset are present in the training data; (iii) unseen nodes (Inductive test), when both proteins and ligands from the test dataset are absent in the training data.
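The three scenarios can be constructed by holding out sets of proteins and ligands before splitting the edges. The following sketch, assuming an edge list with protein and ligand columns, illustrates one way to build such splits (the hold-out sizes and the helper are our own, not the paper's code):

```python
import pandas as pd

def make_splits(edges: pd.DataFrame, n_p: int = 50, n_l: int = 200, seed: int = 0):
    """Sketch of transductive / semi-inductive / inductive test sets."""
    shuffled = edges.sample(frac=1.0, random_state=seed)
    held_p = set(shuffled["protein"].drop_duplicates().iloc[:n_p])  # unseen proteins
    held_l = set(shuffled["ligand"].drop_duplicates().iloc[:n_l])   # unseen ligands

    in_p = shuffled["protein"].isin(held_p)
    in_l = shuffled["ligand"].isin(held_l)

    inductive = shuffled[in_p & in_l]    # both endpoints never seen in training
    semi      = shuffled[in_p & ~in_l]   # unseen protein, seen ligand
    pool      = shuffled[~in_p & ~in_l]  # rows touching only training nodes
    # Rows pairing a seen protein with a held-out ligand are dropped (leakage guard).
    transductive = pool.sample(frac=0.2, random_state=seed)  # unseen edges
    train = pool.drop(transductive.index)
    return train, transductive, semi, inductive
```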

We find that both DeepPurpose and the configuration model perform well in scenarios (i) and (ii) (Fig. 3c, d). However, for the inductive test scenario (iii), when confronted with new proteins and ligands, both performances drop significantly (Table 2). DeepPurpose has an AUROC of 0.61 (±0.074) and AUPRC of 0.43 (±0.071), comparable to the configuration model, for which we have AUROC of 0.50 and AUPRC of 0.30 (±0.038). To offer a final piece of evidence that DeepPurpose disregards node features, we randomly shuffled the chemical SMILES21 and amino acid sequences in the training set, while keeping the same positive and negative annotations per node, an operation that did not change the test performance (Table 3). These tests confirm that DeepPurpose leverages network topology as a learning shortcut and fails to generalize predictions to proteins and ligands beyond the training data, indicating that we must use inductive testing to evaluate the true performance of ML models.
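The feature-randomization control can be reproduced by permuting which structure each node carries while leaving the link structure untouched; a minimal sketch:

```python
# Sketch: reassign SMILES (or amino acid sequences) among nodes at random,
# keeping every node's positive and negative annotations unchanged. If test
# performance does not move, the model never used the features.
import random

def shuffle_features(node_to_feature: dict, seed: int = 0) -> dict:
    nodes = list(node_to_feature)
    feats = list(node_to_feature.values())
    random.Random(seed).shuffle(feats)
    return dict(zip(nodes, feats))

ligand_smiles = {"L1": "CCO", "L2": "c1ccccc1", "L3": "CC(=O)O"}  # toy data
print(shuffle_features(ligand_smiles))  # same SMILES, scrambled assignment
```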

Table 2 DeepPurpose and duplex configuration model performances on BindingDB dataset
Table 3 Assigning SMILES and amino acid sequences randomly

Beyond DeepPurpose, models such as MolTrans18 explore different structural representations of protein and ligand molecules. We investigated transductive, semi-inductive, and inductive performances for MolTrans, a state-of-the-art protein-ligand binding prediction model which uses a combination of a sub-structural pattern mining algorithm, an interaction modeling module, and an augmented transformer encoder to better learn the molecular structures (see Supplementary Note 8). While the innovative representation of the molecules improves upon DeepPurpose in transductive tests (AUROC of 0.952 (±0.041), AUPRC of 0.887 (±0.087)), the same representation still relies only on the training DTI data and fails to generalize to novel molecular structures, as captured by the poor performance in inductive tests (AUROC of 0.572 (±0.104), AUPRC of 0.432 (±0.105)).

AI-Bind and statistics across models

AI-Bind is a deep learning pipeline that combines network-derived learning strategies with unsupervised pre-trained node features to optimize the exploration of the binding properties of novel proteins and ligands. Our pipeline is compatible with various neural architectures, three of which we propose here: VecNet, Siamese model, and VAENet. AI-Bind uses two inputs (Fig. 4a): For ligands, it takes as input isomeric SMILES, which capture the structures of ligand molecules. AI-Bind considers a search-space consisting of all the drug molecules available in DrugBank and the naturally occurring compounds in the Natural Compounds in Food Database (NCFD) (see Supplementary Note 4), and can be extended by leveraging larger chemical libraries like PubChem26. For proteins, AI-Bind uses as input the amino acid sequences retrieved from the protein databases Protein Data Bank (PDB)27, the Universal Protein knowledgebase (UniProt)28, and GeneCards29.

Fig. 4: AI-Bind pipeline: VecNet Performance and Validation.

a The AI-Bind pipeline generates embeddings for ligands (drugs and natural compounds) and proteins using unsupervised pre-training. These embeddings are used to train the deep models. Top predictions are validated using docking simulations and are used as potential binders to test experimentally. b AI-Bind’s VecNet architecture uses Mol2vec and ProtVec for generating the node embeddings. VecNet is trained in a 5-fold cross-validation set-up. The prediction averaged over the 5 folds is used as the final output of VecNet. c–f The average performance for a 5-fold cross-validation of VecNet, DeepPurpose, and the configuration model (dots represent the performance of each fold, bar height corresponds to the mean, n = 5). All the models perform similarly when predicting binding for unseen edges (transductive tests) and unseen targets (semi-inductive tests). The advantage of using deep learning and unsupervised pre-training is observed in the case of unseen nodes (inductive tests). AI-Bind’s VecNet is the best performing model across all the scenarios. Additionally, we observe a similar performance of VecNet for both drugs and natural compounds. Source data are provided as a Source Data file.

AI-Bind benefits from several novel features compared to the state-of-the-art: (a) It relies on network-derived negatives to balance the number of positive and negative samples for each protein and ligand. To be specific, it uses protein-ligand pairs with shortest path distance ≥7 as negative samples, ensuring that the neural networks observe both binding and non-binding examples for each protein and ligand (see Fig. 5, Methods, Supplementary Note 5). (b) During unsupervised pre-training, AI-Bind uses the node embeddings trained on larger collections of chemical and protein structures, compared to the set with known binding annotations, allowing AI-Bind to learn a wider variety of structural patterns. Indeed, while models like DeepPurpose were trained on 862,337 ligands and 7504 proteins provided in BindingDB, or 7307 ligands and 4762 proteins provided in DrugBank, the unsupervised representation in AI-Bind’s VecNet is trained on 19.9 million compounds from ZINC30 and ChEMBL11 databases, and on 546,790 proteins from Swiss-Prot31.

Fig. 5: Network-Derived Negatives.

a Protein-ligand bipartite network consisting of only binding (positive) annotations for drugs and natural compounds (green) to proteins (pink). b Degree distributions of ligands and proteins are fat-tailed in nature. c Shortest path length distribution capturing all possible protein-ligand pairs. We use protein-ligand pairs with shortest path distance of 7 for training, while absolute negatives obtained from BindingDB and pairs with shortest path distances ≥11 are used for validation and test. d Average experimental kinetic constant as a function of the shortest path distance. Higher path distance corresponds to higher Ki in BindingDB. Beyond 7 hops, the expected constant exceeds the binding threshold of 10⁶ nM (dashed line). e An example of a protein-ligand pair that is 7 hops apart and is used as a negative sample in the AI-Bind training set. Source data are provided as a Source Data file.

We begin the model’s validation by systematically comparing the performance of AI-Bind to DeepPurpose and the configuration model on a 5-fold cross-validation using the network-derived dataset for transductive, semi-inductive, and inductive tests. AI-Bind’s VecNet model uses pre-trained mol2vec32 and protvec33 embeddings combined with a simple multi-layer perceptron to learn protein-ligand binding (Fig. 4b, see Methods). We observe that the configuration model performs poorly in inductive testing (AUROC 0.5, AUPRC 0.464 ± 0.017). Due to the network-derived negatives that remove the annotation imbalance, DeepPurpose shows improved performance for novel proteins and ligands (AUROC 0.646 ± 0.023, AUPRC 0.576 ± 0.009). The best performance on unseen nodes is observed for AI-Bind’s VecNet, with AUROC of 0.75 ± 0.032 and AUPRC of 0.718 ± 0.029 (see Fig. 4c and Supplementary Table 3 for a summary of the performances). The unsupervised pre-training for ligand embeddings allows us to generalize AI-Bind to naturally occurring compounds, characterized by complex chemical structures and fewer training annotations compared to drugs (see Supplementary Note 2), obtaining performances comparable to those obtained for drugs (Fig. 4d).

Beyond DeepPurpose, AI-Bind’s VecNet consistently achieves better inductive performance (AUROC 0.75 ± 0.032 and AUPRC 0.718 ± 0.029) compared to MolTrans (AUROC 0.612 ± 0.028 and AUPRC 0.478 ± 0.034). The comparison between AI-Bind and state-of-the-art models like DeepPurpose and MolTrans confirms that unsupervised pre-training of the molecular embeddings improves the generalizability of binding prediction models (see Supplementary Note 8).

Validation of AI-Bind predictions on COVID-19 proteins

For a better understanding of the reliability of the AI-Bind predictions, we move beyond standard ML cross-validation and compare our predictions with molecular docking simulations and with in vitro and clinical results on protein-ligand binding. Docking simulations offer a reliable but computationally intensive method to predict (or validate) binding between proteins and ligands34. Motivated by the need to model rapid response to sudden health crises, we chose as our validation set the 26 SARS-CoV-2 viral proteins and the 332 human proteins targeted by the SARS-CoV-2 viral proteins35,36,37. These proteins are missing from the training data of AI-Bind, hence represent novel targets and allow us to rely on recent efforts to understand the biology of COVID-19 to validate the AI-Bind predictions. We retrieved the amino acid sequences in FASTA format for 16 SARS-CoV-2 viral proteins and 330 human proteins from UniProt28, and use them as input to AI-Bind’s VecNet. Binding between viral and human proteins is necessary for the virus to synthesize its own viral proteins and to facilitate its replication. Our goal is to predict drugs in DrugBank or naturally occurring compounds that can bind to any of the 16 SARS-CoV-2 or 330 human proteins associated with COVID-19, potentially disrupting the viral infection. After sorting all protein-ligand pairs based on their binding probability predicted by AI-Bind’s VecNet (\({p}_{ij}^{VecNet}\)), we tested the predicted top 100 and bottom 100 binding interactions with blind docking simulations using AutoDock Vina34, which estimates binding affinity by considering all possible binding locations on the 3D protein structures (see Methods). Of the 54 proteins present in the top 100 and bottom 100 predicted pairs, 23 had 3D structures available in PDB27 and UniProt28, and 51 of the 59 involved ligand structures were available on PubChem26, allowing us to perform 128 docking simulations (84 involving the top and 44 involving the bottom predictions). We find that 74 out of 84 top predictions from AI-Bind are indeed validated binding pairs. Furthermore, we find that the median binding affinity for the top VecNet predictions is −7.65 kcal mol⁻¹, while for the bottom ones it is −3.0 kcal mol⁻¹ (Fig. 6a), confirming that for AI-Bind, the top predictions show significantly higher binding propensity than the bottom ones (Kruskal–Wallis H-test p-value of 2.5 × 10⁻⁵). As a second test, we obtained the binary labels (binding or non-binding) from docking and AI-Bind predictions using the threshold of −1.75 kcal mol⁻¹ for binding affinities38 and the optimal threshold on \({p}_{ij}^{VecNet}\) corresponding to the highest F1-Score on the inductive test set (see Supplementary Note 7, Supplementary Fig. 11). In the derived confusion matrix we observe sensitivity = 0.76, representing the fraction of true binding pairs correctly identified by AI-Bind, i.e., the ratio True Positives/(True Positives + False Negatives), and F1-Score = 0.82. These two numbers confirm that the ranking produced by AI-Bind is significantly closer to the ranking by docking binding affinities than a random selection would be (Fig. 6b).
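The threshold selection and confusion-matrix analysis can be sketched with scikit-learn, here on synthetic labels and probabilities standing in for the docking-derived labels and the VecNet scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score, recall_score

# Synthetic stand-ins: y_true mimics docking labels (affinity <= -1.75 kcal/mol
# counted as binding); y_prob mimics VecNet probabilities.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_prob = np.clip(y_true * 0.6 + rng.normal(0.3, 0.2, size=200), 0, 1)

# Pick the probability cut-off that maximizes F1 on a validation set.
prec, rec, thr = precision_recall_curve(y_true, y_prob)
f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
best = thr[np.argmax(f1[:-1])]  # last prec/rec point has no threshold

y_pred = (y_prob >= best).astype(int)
print("sensitivity:", recall_score(y_true, y_pred))  # TP / (TP + FN)
print("F1-Score:   ", f1_score(y_true, y_pred))
```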

Fig. 6: Validating and interpreting AI-Bind predictions.

a Distribution of binding affinities for the top and bottom 100 predictions made by AI-Bind’s VecNet over viral and human proteins associated with COVID-19. We ran docking on the top 84 predictions and the bottom 44 predictions. We observe that the top binding predictions (blue) of AI-Bind show lower binding energies (better binding) compared to the bottom predictions (orange). Considering the binding threshold of −1.75 kcal mol⁻¹, 88% of the top pairs predicted by AI-Bind are in line with the docking simulations. b We construct the confusion matrix for the top and the bottom predictions from AI-Bind. We obtain the true labels using the threshold of −1.75 kcal mol⁻¹ (gray dashed line) on the binding affinities from docking. We observe that AI-Bind predictions produce an excellent F1-Score, offering predictions significantly better than random selection. c Binding probability profile for the human protein Trim59. Multiple valleys in the profile directly map to the amino acid residues to which the ligands bind and are indicative of the active binding sites on the amino acid sequence. We identify the valleys on the binding probability profiles for three ligands, Pipecuronium, Buprenorphine, and Voclosporin, which bind at different pockets on Trim59. Valleys for these pockets have been mapped back to the amino acid sequence (valleys 1A, 1B, 1C, 1D, and 1E for pocket 1, valleys 2A and 2B for pocket 2, and valleys 3A and 3B for pocket 3). Furthermore, we highlight the secondary structure of Trim59 obtained from the amino acid sequence. Valleys containing the β-pleated sheets and the coils are more prone to binding compared to the ones with the α-helices52,53,54,55. Combining the binding probability profile and the secondary protein structure allows us to identify active binding sites, guiding the design of an optimal search grid for docking simulations. Source data are provided as a Source Data file.

We further check the stability of these performance metrics by randomly choosing 20 protein-ligand pairs in a 5-fold bootstrapping set-up and observe F1-Score = 0.90 ± 0.02. Additionally, we find that the predictions made by AI-Bind’s VecNet (\({p}_{ij}^{VecNet}\)) and the free energy of protein-ligand binding obtained from docking (ΔG) are anti-correlated with \({r}_{Spearman}({p}_{ij}^{VecNet},\, {{\Delta }}G)=-0.51\). As lower binding affinity values correspond to stronger binding, these results document the agreement between AI-Bind predictions and docking simulations.

Among the 50 ligands with the highest average binding probability, we find two FDA-approved drugs, Anidulafungin (NDA#021948) and Cyclosporine (ANDA#065017). Experimental evidence39 shows that these drugs have anti-viral activity at very low concentrations in the dose-response curves, with IC50 values of 4.64 μM and 5.82 μM, respectively, measured by immunofluorescence analysis with an antibody specific for the viral N protein of SARS-CoV-2. These low IC50 values support anti-viral activity, confirming that Anidulafungin and Cyclosporine bind to COVID-19 related proteins40, and the activity at low concentrations indicates that they are safe to use for treating COVID-19 patients1. Anidulafungin binds to the SARS-CoV-2 viral Non-structural protein 12 (Nsp12), a key therapeutic target for coronaviruses41.

AI-Bind also offers several novel predictions with potential therapeutic relevance. For example, it predicts that the naturally occurring compounds Spironolactone, Oleanolic acid, and Echinocystic acid are potential ligands for COVID-19 proteins, all three ligands binding to Tripartite motif-containing protein 59 (Trim59), a human protein to which the SARS-CoV-2 viral proteins Open reading frame 3a (Orf3a) and Non-structural protein 9 (Nsp9) bind42. AutoDock Vina supports these predictions, offering binding affinities of −7.1 kcal mol⁻¹, −8.0 kcal mol⁻¹, and −7.6 kcal mol⁻¹, respectively.

Spironolactone, found in rainbow trout43, has been suggested to reduce COVID susceptibility44,45. Oleanolic acid is present in apple, tomato, strawberry, and peach, and has been proposed as a potential anti-viral agent for COVID-1946. Oleanolic acid, which passed the drug efficacy benchmark ADME (Absorption, Distribution, Metabolism, and Excretion), plays an important role in controlling viral replication of SARS-CoV-247 and is effective in preventing virus entry at low viral loads46. Finally, Echinocystic acid, found in sunflower, basil, and gala apples, is known for its anti-inflammatory48 and anti-viral activity49, but its potential anti-viral role in COVID-19 is yet to be validated.

Identifying active binding sites

Beyond predicting binding probability, AI-Bind can also be used to identify the probable active binding sites on the amino acid sequence, even in absence of a 3D protein structure. Specifically, we can use AI-Bind to identify which amino acid trigrams in the amino acid sequence play the most significant role in binding predictions, indicative of potential protein-ligand binding locations. We perturb each amino acid trigram in the sequence and observe the changes in AI-Bind prediction (see Supplementary Note 9). Valleys in the obtained binding probability profile represent the trigrams most predictive of binding locations on the amino acid sequence. To validate the AI-Bind predicted binding sites, we focus on the human protein Trim59, a protein for which we have results from multiple docking simulations. We visualized the binding pockets on Trim59 using PyMOL50 and identified the amino acid residues binding to the ligand molecules (Fig. 6c). We find that the amino acid residues responsible for binding directly map to the valleys in the binding probability profile identified by AI-Bind. By viewing the docking results for Pipecuronium, Buprenorphine and Voclosporin, ligands that bind to three different pockets on Trim59, we mark the valleys corresponding to the respective binding sites on the binding probability profiles (Fig. 6c). For example, pocket 1, where Pipecuronium binds, corresponds to five AI-Bind predicted valleys marked by 1A, 1B, 1C, 1D and 1E.
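A minimal sketch of the trigram-perturbation procedure, assuming a hypothetical trained model exposing a predict(smiles, sequence) interface and a masking trigram of our choosing (the exact perturbation in Supplementary Note 9 may differ):

```python
import numpy as np

def binding_profile(model, smiles: str, seq: str, mask: str = "AAA"):
    """Replace each amino acid trigram with a mask and record the new
    prediction; large drops below the baseline flag candidate binding sites."""
    baseline = model.predict(smiles, seq)
    profile = []
    for i in range(len(seq) - 2):
        perturbed = seq[:i] + mask + seq[i + 3:]  # perturb one trigram
        profile.append(model.predict(smiles, perturbed))
    return baseline, np.array(profile)

# Valleys can then be located, e.g., as positions where the profile dips
# well below the baseline: np.where(profile < baseline - 2 * profile.std())[0]
```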

Since not all the valleys in the binding probability profile map to binding sites, we use the protein secondary structure to prioritize the valleys. We predict the secondary structure from the amino acid sequence using S4PRED51 and identify the regions with α-helix, β-sheet, and coil. In particular, α-helices prefer non-solvent accessible environments52, contain non-polar amino acid residues53, and consist of weaker inter-molecular interactions54. Thus, the presence of α-helices reduces the chances of binding between a ligand and a protein. In contrast, β-sheets and non-regular coil regions (unstructured regions) are preferred by ligands as active binding sites, since they provide more binding opportunities to other molecules55. Indeed, most of the ligand-binding valleys in Fig. 6c map to β-sheets and coils on Trim59, associated with pockets 1 and 2 (27 out of 34 ligands validated by docking). By combining the binding probability profile predicted by AI-Bind and the secondary structure predicted by S4PRED, we can create an optimal search grid for the subsequent docking simulations, drastically reducing their runtime.

We pursued further validation of AI-Bind predicted binding sites with a gold standard protein binding dataset56 and with P2Rank, another state-of-the-art binding site prediction model57, to extensively assess the reliability of the AI-Bind pipeline (see Supplementary Note 13).

In summary, ML models often fail in real-world settings when making predictions on data they were not explicitly trained upon, despite achieving good test performance based on traditional ML metrics. It is, therefore, necessary to validate the applicability of these models before deploying them. The documented validation of the AI-Bind predictions with molecular docking simulations and in vitro experiments gives us confidence that AI-Bind is an effective prioritization tool in diverse settings.

Discussion

The accurate prediction of drug-target interactions is an essential precondition of drug discovery. Here we showed that by taking topological shortcuts, existing deep learning models significantly limit their predictive power. Indeed, a mechanistic and quantitative understanding of the origins of these shortcuts indicates that uniform sampling in the presence of annotation imbalance drives ML models to disregard the features of proteins and ligands, limiting their ability to generalize to novel protein targets and ligand structures. To address these shortcomings, we introduced a pipeline, AI-Bind, which mitigates the annotation imbalance of the training data by introducing network-derived negative annotations inferred via shortest path distance, and improves the transferability of the ML models to novel protein and ligand structures by unsupervised pre-training. The proposed unsupervised pre-training of node features also influences the quality of false predictions, removing potential structural biases towards specific protein families (see Supplementary Note 10). Once we improved the statistical sampling of the training data and generated the node embeddings in an unsupervised fashion, we observed an increase in performance compared to DeepPurpose, with a 24% improvement in AUROC and a 74% improvement in AUPRC and, most importantly, an ability to predict beyond proteins and ligands present in the training dataset.

A major limitation of using binding predictions in drug discovery is that binding to disease-related protein targets does not always imply a therapeutic treatment. As a future work, we plan to extend our implementation by introducing an ML-based classifier to sort the list of potential ligands according to their pharmaceutical (therapeutic) effects, combining the current node features with additional metrics derived from traditional network medicine approaches58.

AI-Bind leverages ligands’ Morgan fingerprints and proteins’ amino acid sequences, which encode relevant properties of the molecules: from the presence of hydrogen donors, hydrogen acceptors, count of different atoms, chirality, and solubility for ligands, to the existence of R groups, N or C terminus in proteins. All these properties influence the mechanisms driving protein-ligand binding (see Supplementary Note 11)59. Yet, the binding phenomenon largely depends on the 3D structures of the molecules, which determine the binding pocket structures and the rotation of the bonds. We plan to embed the 3D structures of protein and ligand molecules, which will take into account the higher-order molecular properties driving protein-ligand binding and refine the predictive power of AI-Bind. To maximize generalization across 3D structures, we will use SE(3)-equivariant networks to learn embeddings. Equivariance has proven to be a powerful tool for improving generalization over molecular structures60,61. We also plan to explore the performance of AI-Bind over the entire druggable genome62, allowing us to predict, for each protein, which domains are responsible for the binding predictions. Finally, we envision enabling AI-Bind to predict the kinetic constants Kd, Ki, IC50, and EC50 by formulating a regression task over these variables.

The existing docking infrastructures allow screening for a specific protein structure against wide chemical libraries. Indeed, VirtualFlow63, an open-source drug discovery platform, offers virtual screening over more than 1.4 billion commercially available ligands. However, running docking simulations over these vast libraries incurs high data preparation and computation costs, and is often limited to proteins with available 3D structures27. For example, in our validation step, only half (23 out of 54) of the 3D structures of the proteins associated with COVID-19 were available. Since AI-Bind only requires the chemical SMILES for ligands21 and amino acid sequences for proteins, it can offer fast screening for large libraries of targets and molecules without requiring 3D structures, guiding the computationally expensive docking simulations on selected protein-ligand pairs.

Methods

Data preparation

We use InChIKeys and amino acid sequences as the unique identifiers for ligands and targets, respectively. Positive and negative samples are selected from DrugBank, BindingDB and DTC (see Supplementary Note 4). We consider samples from BindingDB and DTC to be binding or non-binding based on the kinetic constants Ki, Kd, IC50, and EC50. We use thresholds of ≤10³ nM and ≥10⁶ nM to obtain positive and (absolute) negative annotations, respectively38. We then filter out all samples outside the temperature range 20–45 °C to remove ambiguous pairs. All amino acid sequences were obtained from UniProt28.
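A minimal sketch of this labeling rule (column names and the handling of a single aggregated constant are assumptions):

```python
import pandas as pd

def label_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Assign binary labels from a kinetic constant in nM; keep pairs
    measured within 20-45 C and drop everything in the ambiguous range."""
    df = df[(df["temp_c"] >= 20) & (df["temp_c"] <= 45)].copy()
    df.loc[df["constant_nm"] <= 1e3, "label"] = 1   # positive (binding)
    df.loc[df["constant_nm"] >= 1e6, "label"] = 0   # absolute negative
    return df.dropna(subset=["label"])              # discard ambiguous pairs

pairs = pd.DataFrame({
    "inchikey": ["AAA", "BBB", "CCC"],          # toy identifiers
    "constant_nm": [12.0, 5e4, 3e6],
    "temp_c": [25, 37, 25],
})
print(label_pairs(pairs))  # keeps AAA (label 1) and CCC (label 0)
```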

Positive samples

We consider the binding information from DrugBank as positive samples. From these annotations, we removed 53 pairs that are available in BindingDB and have kinetic constants ≥10⁶ nM. To obtain additional positive samples for drugs, we searched in BindingDB using their InChIKeys. We obtained 4330 binding annotations from BindingDB related to the drugs in DrugBank. Overall, we gathered a total of 28,188 positive samples for drugs. We also identified naturally occurring/food-borne compounds, small molecules generally lacking target annotations, by leveraging the Natural Compounds in Food Database (NCFD) (see Supplementary Note 4)64,65,66. We queried BindingDB and DTC with the associated InChIKeys, obtaining a total of 1555 positive samples.

Network-derived negative samples

To generate annotation-balanced training data for AI-Bind, we merged the positive annotations derived from DrugBank, BindingDB, and DTC, for a total of 5104 targets and 8111 ligands, of which 485 are naturally occurring, and calculated the shortest path distribution. All odd shortest path lengths in the bipartite network correspond to protein-ligand pairs (Fig. 5c). Overall, the longer the shortest path distance separating a protein and a ligand, the higher the kinetic constant observed in BindingDB (Fig. 5d). In particular, pairs more than 7 hops apart have, on average, kinetic constants Ki ≥ 10⁶ nM, which is generally considered above the protein-ligand binding threshold38 (see Supplementary Note 5). We randomly selected a subset of protein-ligand pairs which are 7 hops apart as negative samples, to create an overall class balance between positive and negative samples in the training data. Finally, we removed all nodes with only positive or only negative samples and obtained the network-derived negative instances.

We performed testing and validation on ≥11-hop distant pairs. Additionally, we included in testing and validation the absolute non-binding pairs derived from BindingDB by thresholding the kinetic constants (Ki, Kd, IC50, and EC50).
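A minimal sketch of the negative sampling with networkx, building the bipartite graph from positive annotations only (the exact sampling and balancing details of AI-Bind may differ):

```python
import networkx as nx
import random

def sample_negatives(positive_pairs, distance=7, n_samples=1000, seed=0):
    """Return protein-ligand pairs exactly `distance` hops apart in the
    positive-only bipartite graph; odd distances always end on a protein."""
    G = nx.Graph(positive_pairs)                 # edges are (ligand, protein)
    ligands = {u for u, _ in positive_pairs}
    negatives = []
    for u in ligands:
        dists = nx.single_source_shortest_path_length(G, u, cutoff=distance)
        negatives += [(u, v) for v, d in dists.items()
                      if d == distance and v not in ligands]
    random.Random(seed).shuffle(negatives)
    return negatives[:n_samples]

pos = [("L1", "P1"), ("L2", "P1"), ("L2", "P2"), ("L3", "P2")]
print(sample_negatives(pos, distance=3))  # e.g. [('L1', 'P2'), ('L3', 'P1')]
```

The same helper with distance ≥11 yields the distant pairs reserved for validation and test.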

Network configuration model

Overview

Protein-ligand annotations are naturally embedded in a bipartite duplex network, consisting of a set of nodes, comprising all proteins and ligands, interacting in two layers, each reflecting a distinct type of interaction linking the same pair of nodes24. More specifically, one layer (Layer 1) captures the positive or binding annotations, while the second layer (Layer 2) collects the negative or non-binding annotations (Fig. 3a). A multilink m between two nodes encodes the pattern of links connecting these nodes in different layers. In particular, m = (1, 0) indicates positive interactions, m = (0, 1) refers to negative interactions, m = (0, 0) represents the absence of any type of annotations, and m = (1, 1) is mathematically forbidden, as binding and non-binding cannot coexist for the same pair of protein and ligand.

We developed a canonical bipartite duplex null model that conserves on average the number of positive and negative annotations of each node, while correctly rewiring positive and negative links and avoiding forbidden configurations. By means of entropy maximization with constraints, we derive the analytical formulation of each multilink probability and the conditional probability of observing positive binding once an annotation is reported.

Mathematical formulation

Let \({A}_{ij}^{{{{{{{{\bf{m}}}}}}}}}\) be the multi-adjacency matrix representing the bipartite duplex of ligands ({i}) and proteins ({j}), with elements equal to 1 if there is a multilink m between i and j and zero otherwise. We define the multidegree of ligand i and target j as

$${k}_{i}^{{{{{{{{\bf{m}}}}}}}}}=\mathop{\sum }\limits_{j=1}^{{N}_{T}}{A}_{ij}^{{{{{{{{\bf{m}}}}}}}}},\,\,\,\,\,\,{t}_{j}^{{{{{{{{\bf{m}}}}}}}}}=\mathop{\sum }\limits_{i=1}^{{N}_{L}}{A}_{ij}^{{{{{{{{\bf{m}}}}}}}}},$$
(2)

where NT is the number of targets and NL is the number of ligands.

A bipartite duplex network ensemble can be defined as the set of all duplexes satisfying a given set of constraints, such as the expected multidegree sequences defined in Equation (2). We determine the probability of observing a bipartite duplex network \(P(\overrightarrow{G})\) by entropy maximization with multidegree constraints \(\{{k}_{i}^{(1,0)}\}\), \(\{{k}_{i}^{(0,1)}\}\), \(\{{t}_{j}^{(1,0)}\}\), and \(\{{t}_{j}^{(0,1)}\}\), and corresponding Lagrangian multipliers \(\{{\lambda }_{i}^{(1,0)}\}\), \(\{{\lambda }_{i}^{(0,1)}\}\), \(\{{\mu }_{j}^{(1,0)}\}\), and \(\{{\mu }_{j}^{(0,1)}\}\)24,25. The probability \(P(\overrightarrow{G})\) factorizes as

$$P(\overrightarrow{G})=\frac{1}{Z}\mathop{\prod}\limits_{ij}\exp \left[-\mathop{\sum}\limits_{{{{{{{{\bf{m}}}}}}}}\ne (0,0),(1,1)}({\lambda }_{i}^{{{{{{{{\bf{m}}}}}}}}}+{\mu }_{j}^{{{{{{{{\bf{m}}}}}}}}}){A}_{ij}^{{{{{{{{\bf{m}}}}}}}}}\right],$$
(3)

with

$$Z=\mathop{\prod}\limits_{ij}\left[1+\mathop{\sum}\limits_{{{{{{{{\bf{m}}}}}}}}\ne (0,0),(1,1)}{e}^{-({\lambda }_{i}^{{{{{{{{\bf{m}}}}}}}}}+{\mu }_{j}^{{{{{{{{\bf{m}}}}}}}}})}\right].$$
(4)

Multilink probabilities \({p}_{ij}^{{{{{{{{\bf{m}}}}}}}}}\) are determined by the derivatives of log (Z) according to \(({\lambda }_{i}^{{{{{{{{\bf{m}}}}}}}}}+{\mu }_{j}^{{{{{{{{\bf{m}}}}}}}}})\). For instance, the probability of observing a positive annotation is

$${p}_{ij}^{(1,0)}=\frac{{e}^{-({\lambda }_{i}^{(1,0)}+{\mu }_{j}^{(1,0)})}}{1+{e}^{-({\lambda }_{i}^{(1,0)}+{\mu }_{j}^{(1,0)})}+{e}^{-({\lambda }_{i}^{(0,1)}+{\mu }_{j}^{(0,1)})}},$$
(5)

while the probability of observing a negative annotation follows

$${p}_{ij}^{(0,1)}=\frac{{e}^{-({\lambda }_{i}^{(0,1)}+{\mu }_{j}^{(0,1)})}}{1+{e}^{-({\lambda }_{i}^{(1,0)}+{\mu }_{j}^{(1,0)})}+{e}^{-({\lambda }_{i}^{(0,1)}+{\mu }_{j}^{(0,1)})}},$$
(6)

with \({p}_{ij}^{(1,0)}+{p}_{ij}^{(0,1)}+{p}_{ij}^{(0,0)}=1\).

In this theoretical framework, binding prediction is inherently conditional, as for each ligand i and protein j, we test only the presence of positive and negative annotations. Consequently, \({p}_{ij}^{(1,0)}\) and \({p}_{ij}^{(0,1)}\) are normalized by the probability of observing a generic annotation \({p}_{ij}^{(1,0)}+{p}_{ij}^{(0,1)}\). In case of unseen edges, binding prediction is determined by

$${p}_{ij}^{{{{{{{{\rm{conditional}}}}}}}}}=\frac{{p}_{ij}^{(1,0)}}{{p}_{ij}^{(1,0)}+{p}_{ij}^{(0,1)}},$$
(7)

while in case of unseen target j*, the binding probability towards a known compound i follows

$${p}_{i{j}^{*}}^{{{{{{{{\rm{conditional}}}}}}}}}=\frac{{\left\langle {p}_{ij}^{(1,0)}\right\rangle }_{j}}{{\left\langle {p}_{ij}^{(1,0)}\right\rangle }_{j}+{\left\langle {p}_{ij}^{(0,1)}\right\rangle }_{j}}={\rho }_{i},$$
(8)

where 〈⋯〉j denotes the average over all known targets, and ρi follows from Equation (1). In case of unseen ligand i* and target j*, the binding probability is determined by the overall number of positive (L(1, 0)) and negative (L(0, 1)) annotations, i.e.,

$${p}_{{i}^{*}{j}^{*}}^{{{{{{{{\rm{conditional}}}}}}}}}=\frac{{\left\langle {p}_{ij}^{(1,0)}\right\rangle }_{ij}}{{\left\langle {p}_{ij}^{(1,0)}\right\rangle }_{ij}+{\left\langle {p}_{ij}^{(0,1)}\right\rangle }_{ij}}=\frac{{L}^{(1,0)}}{{L}^{(1,0)}+{L}^{(0,1)}},$$
(9)

where 〈⋯〉ij indicates the average over all known pairs of ligands and targets.
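The model can be fitted numerically: writing x = e^(−λ) and y = e^(−μ), the multidegree constraints yield fixed-point equations that can be iterated. The sketch below (the iteration scheme is our own choice, not necessarily the paper's solver) fits a toy duplex and evaluates Equations (7) and (8):

```python
import numpy as np

def fit_duplex(k_pos, k_neg, t_pos, t_neg, n_iter=2000):
    """Fixed-point fit of the bipartite duplex configuration model.
    k_*: ligand multidegrees; t_*: target multidegrees (toy, consistent sums).
    Convergence of this simple iteration is assumed for the sketch."""
    xp = np.ones(len(k_pos)); xm = np.ones(len(k_pos))  # e^{-lambda_i^m}
    yp = np.ones(len(t_pos)); ym = np.ones(len(t_pos))  # e^{-mu_j^m}
    for _ in range(n_iter):
        D = 1.0 + np.outer(xp, yp) + np.outer(xm, ym)
        xp = k_pos / (yp[None, :] / D).sum(axis=1)
        xm = k_neg / (ym[None, :] / D).sum(axis=1)
        D = 1.0 + np.outer(xp, yp) + np.outer(xm, ym)
        yp = t_pos / (xp[:, None] / D).sum(axis=0)
        ym = t_neg / (xm[:, None] / D).sum(axis=0)
    D = 1.0 + np.outer(xp, yp) + np.outer(xm, ym)
    return np.outer(xp, yp) / D, np.outer(xm, ym) / D   # p^(1,0), p^(0,1)

# Toy duplex: 4 ligands x 3 targets with matching degree sums.
p_pos, p_neg = fit_duplex(np.array([2., 1., 1., 0.]), np.array([0., 1., 1., 2.]),
                          np.array([2., 1., 1.]),      np.array([1., 1., 2.]))
cond = p_pos / (p_pos + p_neg)                      # Equation (7), unseen edges
rho = p_pos.sum(1) / (p_pos.sum(1) + p_neg.sum(1))  # Equation (8), unseen target
print(cond.round(2), rho.round(2))
```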

Novel deep learning architectures

VecNet

VecNet uses the pre-trained mol2vec32 and protvec33 models (Fig. 4b). These models create 300- and 100-dimensional embeddings for ligands and proteins, respectively. Based on word2vec67, these methods treat the Morgan fingerprint68 and the amino acid sequences as sentences, where words are fingerprint fragments or amino acid trigrams. The training is unsupervised and independent from the following binding prediction task.
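A minimal PyTorch sketch of a VecNet-style head (layer sizes and dropout are our assumptions; the paper's exact architecture may differ), which consumes the frozen 300-dimensional mol2vec and 100-dimensional ProtVec embeddings:

```python
import torch
import torch.nn as nn

class VecNet(nn.Module):
    def __init__(self, lig_dim=300, prot_dim=100, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(lig_dim + prot_dim, hidden), nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # binding probability
        )

    def forward(self, lig_emb, prot_emb):
        # Concatenate the pre-trained ligand and protein embeddings.
        return self.mlp(torch.cat([lig_emb, prot_emb], dim=-1))

model = VecNet()
p = model(torch.randn(8, 300), torch.randn(8, 100))  # batch of 8 pairs
print(p.shape)  # torch.Size([8, 1])
```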

VAENet

VAENet uses a Variational Auto-Encoder69, an unsupervised learning technique, to embed ligands onto a latent space. The Morgan fingerprint is directly fed to convolutional layers. The auto-encoder creates latent space embeddings by minimizing the loss of information while reconstructing the molecule from the latent representation. We train the Variational Auto-Encoder on 9.5 million chemicals from ZINC database30, and all drugs and natural compounds in our binding dataset. Similar to VecNet, we use ProtVec for target embeddings.

Siamese model

The Siamese model embeds ligands and proteins into the same space using a one-shot learning approach70. We construct triplets of the form 〈protein target, non-binding ligand, binding ligand〉 and train the model to find an embedding space that maximizes the Euclidean distance between non-binding pairs, while minimizing it for the binding ones.
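A minimal sketch of the triplet objective using PyTorch's built-in margin loss, with a single linear layer standing in for the shared embedding networks (dimensions and margin are assumptions):

```python
import torch
import torch.nn as nn

embed = nn.Linear(100, 64)             # stand-in for the embedding networks
triplet = nn.TripletMarginLoss(margin=1.0)  # Euclidean distance by default

anchor   = embed(torch.randn(8, 100))  # protein targets
positive = embed(torch.randn(8, 100))  # binding ligands
negative = embed(torch.randn(8, 100))  # non-binding ligands
loss = triplet(anchor, positive, negative)  # pull binders close, push non-binders away
loss.backward()
```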

File preparation for docking simulations

We performed docking simulations for 128 protein-ligand interactions found within the top 100 and bottom 100 predictions of AI-Bind. The PDB accession codes for the 3D structures of the proteins are listed in Supplementary Table 8. The steps to implement docking simulations in AutoDock Vina34 include:

  1. Obtain the 3D ligand structures in SDF format from PubChem and save them in .pdb format with PyMOL for use in AutoDockTools.

  2. Download the 3D protein structures in .pdb format and load them into AutoDockTools to remove water molecules from the protein structure, add all hydrogen atoms, and add the Kollman charges to the protein.

  3. Save both the protein and the ligand structures in .pdbqt format using AutoDockTools.

  4. Create the grid for docking that encompasses the whole protein structure. This grid selection ensures a blind docking set-up, so that all locations on the protein are considered for determining the binding affinities. The selected grid sizes are available in gridsizes.txt (see Data availability).

  5. Create the configuration files with the grid details for each protein and launch the docking simulation (a minimal scripted example follows this list). We consider the protein molecules to be rigid, whereas the ligand molecules are flexible, i.e., we allow rotatable bonds for the ligands.
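As a sketch of step 5, a configuration file for AutoDock Vina can be written and the run launched from Python (grid values are placeholders; the .pdbqt files come from steps 1–3, and the vina binary is assumed to be on the PATH):

```python
import subprocess

# Placeholder grid: in practice, the center and sizes come from gridsizes.txt.
config = """\
receptor = protein.pdbqt
ligand   = ligand.pdbqt
center_x = 10.0
center_y = -4.5
center_z = 22.3
size_x   = 60
size_y   = 60
size_z   = 60
exhaustiveness = 8
"""
with open("conf.txt", "w") as fh:
    fh.write(config)

# Launch the blind docking run; poses are written to docked.pdbqt.
subprocess.run(["vina", "--config", "conf.txt", "--out", "docked.pdbqt"],
               check=True)
```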

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.