Abstract
In drug discovery, determining the binding affinity and functional effects of small-molecule ligands on proteins is critical. Current computational methods can predict these protein–ligand interaction properties but often lose accuracy without high-resolution protein structures and falter in predicting functional effects. Here we introduce PSICHIC (PhySIcoCHemICal graph neural network), a framework incorporating physicochemical constraints to decode interaction fingerprints directly from sequence data alone. This enables PSICHIC to attain capabilities in decoding mechanisms underlying protein–ligand interactions, achieving state-of-the-art accuracy and interpretability. Trained on identical protein–ligand pairs without structural data, PSICHIC matched and even surpassed leading structure-based methods in binding-affinity prediction. In an experimental library screening for adenosine A1 receptor agonists, PSICHIC discerned functional effects effectively, ranking the sole novel agonist within the top three. PSICHIC’s interpretable fingerprints identified protein residues and ligand atoms involved in interactions, and helped in unveiling selectivity determinants of protein–ligand interaction. We foresee PSICHIC reshaping virtual screening and deepening our understanding of protein–ligand interactions.
Data availability
All raw and benchmark data resources used are publicly available. The raw data were obtained from the following databases: Protein Data Bank (https://www.rcsb.org; ref. 62), UniProt (https://www.uniprot.org; ref. 63), PDBbind (http://www.pdbbind.org.cn/; ref. 30), ExCAPE-ML (https://solr.ideaconsult.net/search/excape/; ref. 36) and Papyrus (refs. 37,64). All datasets used in this study for training and testing the models, including the manually curated protein–ligand functional effect dataset and the large-scale interaction dataset, are made publicly available (ref. 65). Source data are provided with this paper.
Code availability
A GitHub repository containing the source code and data files for retraining and evaluating PSICHIC is available at https://github.com/huankoh/PSICHIC (ref. 66). The repository contains a user-friendly, open-source online platform for PSICHIC’s virtual screening application, integrated with Google Colaboratory for easy web-based interaction. The weights of the trained PSICHIC model are also available in the repository.
References
Kitchen, D. B., Decornez, H., Furr, J. R. & Bajorath, J. Docking and scoring in virtual screening for drug discovery: methods and applications. Nat. Rev. Drug Discov. 3, 935–949 (2004).
Hopkins, A. L. Predicting promiscuity. Nature 462, 167–168 (2009).
Chen, L. et al. TransformerCPI: improving compound–protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments. Bioinformatics 36, 4406–4414 (2020).
Jiang, M. et al. Drug–target affinity prediction using graph neural network and contact maps. RSC Adv. 10, 20701–20712 (2020).
Bagherian, M. et al. Machine learning approaches and databases for prediction of drug–target interaction: a survey paper. Brief. Bioinform. 22, 247–269 (2021).
Li, S. et al. Structure-aware interactive graph neural networks for the prediction of protein–ligand binding affinity. In Proc. 27th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 975–985 (Association for Computing Machinery, 2021).
Dhakal, A., McKay, C., Tanner, J. J. & Cheng, J. Artificial intelligence in the prediction of protein–ligand interactions: recent advances and future directions. Brief. Bioinform. 23, bbab476 (2022).
Lu, W. et al. TANKBind: trigonometry-aware neural networks for drug–protein binding structure prediction. Adv. Neural Inf. Process. Syst. 35, 7236–7249 (2022).
Bai, P., Miljković, F., John, B. & Lu, H. Interpretable bilinear attention network with domain adaptation improves drug–target prediction. Nat. Mach. Intell. 5, 126–136 (2023).
Ng, H. W. et al. Competitive molecular docking approach for predicting estrogen receptor subtype α agonists and antagonists. BMC Bioinf. 15, S4 (2014).
Rodríguez, D., Gao, Z.-G., Moss, S. M., Jacobson, K. A. & Carlsson, J. Molecular docking screening using agonist-bound GPCR structures: probing the A2A adenosine receptor. J. Chem. Inf. Model. 55, 550–563 (2015).
Kooistra, A. J., Leurs, R., de Esch, I. J. P. & de Graaf, C. Structure-based prediction of G-protein-coupled receptor ligand function: a β-adrenoceptor case study. J. Chem. Inf. Model. 55, 1045–1061 (2015).
Cai, T., Abbu, K. A., Liu, Y. & Xie, L. DeepREAL: a deep learning powered multi-scale modeling framework for predicting out-of-distribution ligand-induced GPCR activity. Bioinformatics 38, 2561–2570 (2022).
Michel, M., Menéndez Hurtado, D. & Elofsson, A. PconsC4: fast, accurate and hassle-free contact predictions. Bioinformatics 35, 2677–2679 (2018).
Rao, R., Meier, J., Sercu, T., Ovchinnikov, S. & Rives, A. Transformer protein language models are unsupervised structure learners. In Proc. 8th International Conference on Learning Representations (ICLR, 2020).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Jiang, M. et al. Sequence-based drug-target affinity prediction using weighted graph neural networks. BMC Genomics 23, 449 (2022).
Wang, P. et al. Structure-aware multimodal deep learning for drug–protein interaction prediction. J. Chem. Inf. Model. 62, 1308–1317 (2022).
Gainza, P. et al. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat. Methods 17, 184–192 (2020).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Wong, F. et al. Benchmarking AlphaFold‐enabled molecular docking predictions for antibiotic discovery. Mol. Syst. Biol. 18, e11081 (2022).
He, X. et al. AlphaFold2 versus experimental structures: evaluation on G protein-coupled receptors. Acta Pharmacol. Sin. 44, 1–7 (2023).
Nguyen, T. et al. GraphDTA: predicting drug–target binding affinity with graph neural networks. Bioinformatics 37, 1140–1147 (2021).
Corso, G., Stärk, H., Jing, B., Barzilay, R. & Jaakkola, T. S. DiffDock: diffusion steps, twists, and turns for molecular docking. In Proc. 11th International Conference on Learning Representations (ICLR, 2023).
Somnath, V. R., Bunne, C. & Krause, A. Multi-scale representation learning on proteins. Adv. Neural Inf. Process. Syst. 34, 25244–25255 (2021).
Corso, G., Cavalleri, L., Beaini, D., Liò, P. & Veličković, P. Principal neighbourhood aggregation for graph nets. Adv. Neural Inf. Process. Syst. 33, 13260–13271 (2020).
Rarey, M. & Dixon, J. S. Feature trees: a new molecular similarity measure based on tree matching. J. Comput. Aided Mol. Des. 12, 471–490 (1998).
Jin, W., Barzilay, R. & Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. In Proc. 35th International Conference on Machine Learning 2323–2332 (PMLR, 2018).
Bianchi, F. M., Grattarola, D. & Alippi, C. Spectral clustering with graph neural networks for graph pooling. In Proc. 37th International Conference on Machine Learning 874–883 (PMLR, 2020).
Su, M. et al. Comparative assessment of scoring functions: the CASF-2016 update. J. Chem. Inf. Model. 59, 895–913 (2019).
Stärk, H., Ganea, O., Pattanaik, L., Barzilay, D. R. & Jaakkola, T. EquiBind: geometric deep learning for drug binding structure prediction. In Proc. 39th International Conference on Machine Learning 20503–20521 (PMLR, 2022).
Huang, K., Xiao, C., Glass, L. M. & Sun, J. MolTrans: molecular interaction transformer for drug–target interaction prediction. Bioinformatics 37, 830–836 (2021).
Zitnik, M., Sosič, R., Maheshwari, S. & Leskovec, J. BioSNAP Datasets: Stanford Biomedical Network Dataset Collection (Stanford Univ., 2018); https://snap.stanford.edu/biodata
Liu, T., Lin, Y., Wen, X., Jorissen, R. N. & Gilson, M. K. BindingDB: a web-accessible database of experimentally determined protein–ligand binding affinities. Nucleic Acids Res. 35, D198–D201 (2007).
Liu, Z. et al. PDB-wide collection of binding data: current status of the PDBbind database. Bioinformatics 31, 405–412 (2015).
Sun, J. et al. ExCAPE-DB: an integrated large scale dataset facilitating big data analysis in chemogenomics. J. Cheminform. 9, 17 (2017).
Béquignon, O. J. M. et al. Papyrus: a large-scale curated dataset aimed at bioactivity predictions. J. Cheminform. 15, 3 (2023).
Cortellis Drug Discovery Intelligence (Clarivate, 2023); https://www.cortellis.com/drugdiscovery/
Lin, H. et al. Discovery of potent and selective covalent protein arginine methyltransferase 5 (PRMT5) inhibitors. ACS Med. Chem. Lett. 10, 1033–1038 (2019).
Rusere, L. N. et al. HIV-1 protease inhibitors incorporating stereochemically defined P2′ ligands to optimize hydrogen bonding in the substrate envelope. J. Med. Chem. 62, 8062–8079 (2019).
Yilmaz, N. K., Swanstrom, R. & Schiffer, C. A. Improving viral protease inhibitors to counter drug resistance. Trends Microbiol. 24, 547–557 (2016).
Draper-Joyce, C. J. et al. Structure of the adenosine-bound human adenosine A1 receptor–Gi complex. Nature 558, 559–563 (2018).
Mendez, D. et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 47, D930–D940 (2019).
Bento, A. P. et al. An open source chemical structure curation pipeline using RDKit. J. Cheminform. 12, 51 (2020).
Nguyen, A. T. N. et al. Extracellular loop 2 of the adenosine A1 receptor has a key role in orthosteric ligand affinity and agonist efficacy. Mol. Pharmacol. 90, 703–714 (2016).
Roth, B. L., Sheffler, D. J. & Kroeze, W. K. Magic shotguns versus magic bullets: selectively non-selective drugs for mood disorders and schizophrenia. Nat. Rev. Drug Discov. 3, 353–359 (2004).
Harding, S. D. et al. The IUPHAR/BPS Guide to PHARMACOLOGY in 2024. Nucleic Acids Res. 52, D1438–D1449 (2024).
Jacobson, K. A. & Gao, Z.-G. Adenosine receptors as therapeutic targets. Nat. Rev. Drug Discov. 5, 247–264 (2006).
Perreira, M. et al. “Reversine” and its 2-substituted adenine derivatives as potent and selective A3 adenosine receptor antagonists. J. Med. Chem. 48, 4910–4918 (2005).
Glukhova, A. et al. Structure of the adenosine A1 receptor reveals the basis for subtype selectivity. Cell 168, 867–877.e13 (2017).
Deng, Z., Chuaqui, C. & Singh, J. Structural Interaction Fingerprint (SIFt): a novel method for analyzing three-dimensional protein−ligand binding interactions. J. Med. Chem. 47, 337–344 (2004).
Thal, D. M. et al. Recent advances in the determination of G protein-coupled receptor structures. Curr. Opin. Struct. Biol. 51, 28–34 (2018).
Draper-Joyce, C. J. et al. Positive allosteric mechanisms of adenosine A1 receptor-mediated analgesia. Nature 597, 571–576 (2021).
Jeffrey Conn, P., Christopoulos, A. & Lindsley, C. W. Allosteric modulators of GPCRs: a novel approach for the treatment of CNS disorders. Nat. Rev. Drug Discov. 8, 41–54 (2009).
de Freitas, R. F. & Schapira, M. A systematic analysis of atomic protein–ligand interactions in the PDB. MedChemComm 8, 1970–1981 (2017).
Krivák, R. & Hoksza, D. P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. J. Cheminform. 10, 39 (2018).
Cai, T. et al. GraphNorm: a principled approach to accelerating graph neural network training. In Proc. 38th International Conference on Machine Learning 1204–1215 (PMLR, 2021).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In Proc. 3rd International Conference on Learning Representations (ICLR, 2015).
Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. In Proc. 6th International Conference on Learning Representations (ICLR, 2018).
Khazanov, N. A. & Carlson, H. A. Exploring the composition of protein–ligand binding sites on a large scale. PLoS Comput. Biol. 9, e1003321 (2013).
Baltos, J.-A. et al. Quantification of adenosine A1 receptor biased agonism: implications for drug discovery. Biochem. Pharmacol. 99, 101–112 (2016).
Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).
The UniProt Consortium. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531 (2023).
Béquignon, O. J. M. et al. Accompanying data - Papyrus - a large scale curated dataset aimed at bioactivity predictions. Zenodo https://doi.org/10.5281/zenodo.10943207 (2024).
Koh, H. Y., Nguyen, A. T. N., Pan, S., May, L. T. & Webb, G. I. Datasets for “Physicochemical graph neural network for learning protein–ligand interaction fingerprints from sequence data”. Zenodo https://doi.org/10.5281/zenodo.10901712 (2024).
Koh, H. Y. huankoh/PSICHIC: v1.0.0. Zenodo https://doi.org/10.5281/zenodo.10901685 (2024).
Stepniewska-Dziubinska, M. M., Zielenkiewicz, P. & Siedlecki, P. Development and evaluation of a deep learning model for protein–ligand binding affinity prediction. Bioinformatics 34, 3666–3674 (2018).
Zheng, L., Fan, J. & Mu, Y. OnionNet: a multiple-layer intermolecular-contact-based convolutional neural network for protein–ligand binding affinity prediction. ACS Omega 4, 15956–15965 (2019).
Jiang, D. et al. InteractionGraphNet: a novel and efficient deep graph representation learning framework for accurate protein–ligand interaction predictions. J. Med. Chem. 64, 18209–18232 (2021).
Koes, D. R., Baumgartner, M. P. & Camacho, C. J. Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. J. Chem. Inf. Model. 53, 1893–1904 (2013).
McNutt, A. T. et al. GNINA 1.0: molecular docking with deep learning. J. Cheminform. 13, 43 (2021).
Sverrisson, F., Feydy, J., Correia, B. E. & Bronstein, M. M. Fast end-to-end learning on protein surfaces. In Proc. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 15272–15281 (IEEE, 2021).
Roy, K. et al. Some case studies on application of “rm2” metrics for judging quality of quantitative structure–activity relationship predictions: emphasis on scaling of response data. J. Comput. Chem. 34, 1071–1082 (2013).
McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: uniform manifold approximation and projection. J. Open Source Softw. 3, 861 (2018).
Adasme, M. F. et al. PLIP 2021: expanding the scope of the protein–ligand interaction profiler to DNA and RNA. Nucleic Acids Res. 49, W530–W534 (2021).
Acknowledgements
Research on adenosine receptor signalling was supported by a National Heart Foundation Future Leader Fellowship (101857 to L.T.M.), National Health and Medical Research Council (NHMRC) of Australia Ideas grant (APP2013629 to L.T.M., G.I.W. and A.T.N.N.) and a Department of Health and Aged Care (MRFF) Stem Cell Therapies Mission grant (MRF2015957 to L.T.M. and A.T.N.N.). H.Y.K.’s scholarship is supported by the Australian Government Research Training Program (RTP) Scholarship and the Australian Research Council under grant ARC DP210100072. High-throughput screening was performed at the National Drug Discovery Centre, WEHI, Parkville, Australia, with support from the Australian Government Medical Research Future Fund (MRFF). Our acknowledgement extends to Cortellis Drug Discovery Intelligence for granting public access to the curated functional effect dataset, and to BioRender for the display elements used in our figures, which were created using BioRender.com. Special thanks to Monash Institute of Pharmaceutical Sciences (MIPS) Monash University for access to the MIPS library, in particular P. Sexton and A. Christopoulos for purchase of the MIPS library and to J. Baell for the design of the library. We thank C. S. Lu for assistance with pharmacological evaluation. Computational resources were generously provided by the Nectar Research Cloud, a collaborative Australian research platform supported by the NCRIS-funded Australian Research Data Commons (ARDC) and the MASSIVE HPC facility. We extend our sincere gratitude to B. K. Koh, H. J. W. Koh and Y. Li for their invaluable feedback on paper writing and figures.
Author information
Contributions
H.Y.K. designed and developed the PSICHIC method, evaluated PSICHIC against leading methods, applied it to virtually screen the MIPS library for novel A1R agonists and applied it for selectivity profiling of AR subtypes. A.T.N.N. performed the pharmacological validation of novel A1R agonists and conducted data analysis. H.Y.K. and A.T.N.N. prepared figures and wrote the paper. A.T.N.N., S.P., L.T.M. and G.I.W. supervised the project. Project design, data interpretation and paper preparation were performed by all authors.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Haiping Lu, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Schematic Diagram of PSICHIC Architecture.
The figure should be read from top to bottom. Initially, ligand atom and protein residue graphs pass through a single-hidden-layer network before entering three physicochemical graph convolutional layers. Within each layer, intramolecular interactions (represented by yellow blocks) are modeled using two independent PNA-GNNs (ref. 26). PSICHIC then imposes physicochemical constraints (light red blocks) by pooling ligand atoms and protein residues using junction tree decomposition (ref. 28) and minCUT clustering (ref. 29), respectively. PSICHIC models intermolecular interactions (white blocks) in three steps: first, it aggregates ligand functional groups into a “ligand ball” using attentional aggregation; second, it models the interaction strengths between this ligand ball and protein regions through cross-attention message passing, where features are weighted and transferred between the ligand and protein; third, PSICHIC unpools the functional groups and clustered regions back into updated ligand atoms and protein residues. Finally, PSICHIC generates an interaction fingerprint that weighs atoms and residues based on importance scores from intermolecular interactions. This fingerprint feeds into a single-hidden-layer network for predicting interaction properties. The figure can be read in conjunction with the Methods section.
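The attentional aggregation step in the caption can be illustrated with a minimal sketch, assuming softmax-normalized scores over functional-group feature vectors. The function name `attentional_aggregate`, the toy features and the fixed scores are hypothetical; in a trained model the scores would come from a learned gating network, and this is not taken from the PSICHIC source code.

```python
import math

def attentional_aggregate(features, scores):
    """Pool feature vectors into one vector using softmax attention weights.

    features: list of equal-length lists (one per functional group, hypothetical).
    scores:   list of raw attention scores (one per functional group).
    """
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]    # softmax: non-negative, sums to 1
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights

# Three toy functional groups with equal scores, so each gets weight 1/3.
groups = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attentional_aggregate(groups, [0.0, 0.0, 0.0])
```

With equal scores the pooled vector is simply the mean of the group features; unequal scores shift the "ligand ball" toward the higher-scoring groups.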
Extended Data Fig. 2 Robustness and Ablation Study of PSICHIC.
a, b, Line plots with best-fit linear regression lines and 95% confidence error bands from bootstrapping (points represent data points). a, PSICHIC Absolute Error against ESM2 contact map prediction performance (ROC-AUC) lacks significant correlation relative to the 0.05 p-value threshold (\(P=0.452\), two-sided Pearson’s correlation, \(n=653\)). b, PSICHIC Absolute Error vs. Lipophilic Ligand Efficiency (LLE) also lacks significance relative to the 0.05 p-value threshold (\(P=0.275\), two-sided Pearson’s correlation, \(n=653\)). c, Bar graph comparing average PSICHIC Absolute Error for ligands that violate versus obey Lipinski’s rule of five (RO5). No significant difference between the two ligand types relative to the 0.05 p-value threshold (\(P=0.871\), two-sided independent t-test; \(n=653\); bar, mean; error bars, one SEM; points, data points). d, Ablation study of PSICHIC Architecture (\(n=5\); bar, mean; error bars, one SEM; points, data points). Discarding PSICHIC’s physicochemical constraints (w/o Constraints) reduced performance, mirroring the effects seen without interaction modeling (w/o Interaction), highlighting the importance of injecting physicochemical constraints. Omitting the Importance Score mechanism (w/o Importance Score) reduced performance. Without ESM2 embeddings (w/o ESM2 embedding), performance slightly lags behind TankBind (ref. 8), which utilizes its own protein model embeddings. The two results are not directly comparable, and the comparison favors TankBind. e, Line plots for the relationship between PSICHIC’s predicted affinity and confidence in PSICHIC residue importance scores (left), and the correlation of PSICHIC residue importance scores with residues’ binding site proximity (right), with points indicating mean and shaded error band indicating one SEM. PSICHIC with physicochemical constraints (red) and without (grey) showed constraints were central to learning interaction patterns that adhere to physicochemical principles. Without constraints, PSICHIC exhibits stronger confidence in residue importance scores (left). This confidence is misplaced as the scores did not correlate with residues’ proximity to the binding site (right), suggesting that sequence-based methods could overfit the data if constraints are not incorporated.
Extended Data Fig. 3 Multi-Comparison Matrix for Benchmarking Sequence-based Methods.
The matrix displays pairwise comparisons among PSICHIC, DrugBAN, STAMP-DPI, TransCPI, GraphDTA, WGNN-DTA, MolTrans, and DGraphDTA across 9 drug discovery settings. The settings are created by applying a random split, an unseen ligand scaffold split, and an unseen protein target split to each of three sequence-only benchmark datasets: Human (ref. 32), BioSNAP (ref. 33), and BindingDB (ref. 34). Performance is assessed using the area under the receiver operating characteristic curve (ROC-AUC). The colors on the heat map signify the mean differences in ROC-AUC performance. A positive difference, indicated in red, signifies that the method in the row outperforms the method in the column on average. Each cell contains three lines detailing performance metrics: the first line shows the average difference in ROC-AUC between the method in the row and the method in the column; the second line shows the number of wins/ties/losses across the drug discovery settings; and the third line indicates the exact p-value, determined through the two-sided Wilcoxon signed-rank test (\(n=9\) for all models). Text within each cell is presented in bold if the p-value is lower than 0.05.
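As an illustration of how the first two lines of a matrix cell could be derived, the following minimal sketch computes the mean ROC-AUC difference and the wins/ties/losses count for one row method against one column method. The function name `compare_methods` and all ROC-AUC values are made up for illustration and are not results from the paper; the p-value on the third line is not computed here (a paired test such as `scipy.stats.wilcoxon` is one way to obtain it).

```python
def compare_methods(row_scores, col_scores, tol=1e-12):
    """Return (mean difference, (wins, ties, losses)) for paired scores.

    Each position holds the ROC-AUC of both methods on the same setting.
    """
    diffs = [r - c for r, c in zip(row_scores, col_scores)]
    mean_diff = sum(diffs) / len(diffs)
    wins = sum(d > tol for d in diffs)      # row method better
    losses = sum(d < -tol for d in diffs)   # column method better
    ties = len(diffs) - wins - losses
    return mean_diff, (wins, ties, losses)

# Hypothetical ROC-AUC values over 9 settings (3 datasets x 3 split types).
row = [0.98, 0.95, 0.90, 0.91, 0.88, 0.70, 0.93, 0.90, 0.65]
col = [0.97, 0.95, 0.88, 0.90, 0.89, 0.68, 0.92, 0.90, 0.66]
mean_diff, record = compare_methods(row, col)
```

A positive `mean_diff` corresponds to a red cell in the heat map, meaning the row method is ahead on average.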
Extended Data Fig. 4 High-Quality Protein-Ligand Functional Effect Dataset Curation.
PSICHIC, a data-driven framework, can predict various interaction properties after training on labeled sequence datasets. To enable PSICHIC to predict protein-ligand functional effects, we curated data from reputable databases, specifically Cortellis Drug Discovery (ref. 38), ExCAPE-ML (ref. 36), and Papyrus (ref. 37). We gathered samples from Cortellis on 2 February 2023, focusing on the proteins categorized as “Receptor” in the database with over 20 samples (Source: Cortellis Drug Discovery Intelligence, 2 February 2023, https://www.cortellis.com/drugdiscovery/, ® 2023 Clarivate. All rights reserved.). This resulted in 22,085 agonists and 17,211 antagonists. Due to the absence of negative data, we adopted a rigorous approach to include decoys. From ExCAPE-ML (ref. 36) and Papyrus (ref. 37), we selected protein-ligand pairs with pXC50 or pKD/pKI values below 5, focusing on high-quality data in the database. After standardizing the molecular data using the ChEMBL pipeline (ref. 44), we generated a dataset comprising 160,910 unique protein-ligand pairs: 22,085 agonists, 17,211 antagonists, and 121,614 non-binders, which includes 131 unique protein receptors and 128,122 unique ligands. The dataset also contains assay-dependent potency values (EC50 for agonists and IC50 for antagonists), which were omitted for model training but used for the UMAP plot in Fig. 2. The curated dataset is publicly available, with detailed methodology in Supplementary Material 14.
Extended Data Fig. 5 PSICHIC Pharmacophore: A Case Study with Ligands of the Galectin-3 Protein Target.
a, The line plot illustrates relationships based on the predicted binding affinity by PSICHIC: (1) an orange line shows the relationship between affinity (x-axis) and confidence in residue importance scores (y-axis), the latter being measured by the Gini coefficient, (2) a red line depicts the relationship between affinity (x-axis) and the correlation of PSICHIC residue importance scores with residues’ proximity to the binding site (y-axis). The error band (shaded region) of line plots indicates one standard error of the mean (SEM). b, Complex structures of Galectin-3 with low (PDB ID: 6QLS) and high (PDB ID: 6I76) predicted binding affinities are displayed. A darker red hue on the structures signifies higher PSICHIC importance scores. For low-affinity 6QLS, scores are dispersed—sometimes extending 180 degrees away from the binding site—yet key binding residues are highlighted. Conversely, scores for high-affinity 6I76 are primarily focused on the binding sites. c, PSICHIC’s ligand atom importance scores spotlight key functional groups, especially fluorine atoms, in their interaction with Galectin-3. The 14 binding ligands, 11 of which share the same scaffold (*), are ordered based on PSICHIC-predicted affinities. The emphasis of PSICHIC on the fluorine functional groups aligns with the original studies, where fluorine plays a critical role in forming fluorine–amide interactions with Galectin-3’s binding site. d, As a control, PSICHIC did not universally prioritize fluorine atoms when the fluorine functional groups do not form important interactions with the target protein. Further details are provided in Supplementary Methods 17.
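The caption states that confidence in residue importance scores is measured by the Gini coefficient. A minimal sketch of the standard Gini coefficient for non-negative scores is given below; the function name `gini` and the toy score vectors are illustrative assumptions, and the exact computation used in the paper may differ in detail.

```python
def gini(values):
    """Standard Gini coefficient of non-negative values (0 = perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n

# Toy importance scores over four residues (hypothetical values).
diffuse = gini([0.25, 0.25, 0.25, 0.25])  # importance spread evenly
focused = gini([0.0, 0.0, 0.0, 1.0])      # importance on a single residue
```

Under this definition, evenly dispersed scores give a coefficient near 0, while scores concentrated on a few residues give a value approaching 1, which matches the reading of a higher coefficient as higher confidence.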
Extended Data Fig. 6 PSICHICXL Development using Large-scale (XL) Interaction Dataset.
a, Pipeline for constructing Large-scale (XL) Interaction Dataset. a-I., From the Protein-Ligand Functional Effect dataset we curated, samples comprising a protein, ligand, and functional label (agonist, antagonist, or non-binder) were extracted. a-II., From ExCAPE-ML (ref. 36), protein-ligand pairs with pXC50 values below 5 were labeled as non-binders and above 7 as binders. a-III., From Papyrus (ref. 37), we selected pairs with high-quality binding affinity, labeling those with pKD/pKI below 5 as non-binders and above 7 as binders. a-IV., These databases were standardized and normalized using the ChEMBL pipeline (ref. 44) and combined to form a large-scale dataset of approximately 3 million unique protein-ligand pairs, labeled with either binding affinity or functional effect. The final dataset contains 618,247 fully labeled and 2,341,057 partially labeled pairs, encompassing 5,107 unique proteins and 1,084,834 unique ligands. b, Multi-task optimization of PSICHICXL on Large-scale (XL) Interaction Dataset. Schematic illustrating the multi-objective loss function for handling partially labeled protein-ligand pairs in the Large-scale (XL) Interaction Dataset. Two scenarios of partial annotation were addressed: (i) when only the functional effect class (agonist, antagonist, or non-binder) is known, only the cross-entropy loss is calculated; (ii) when only binding affinity is known and ligand is a binder (that is, the functional effect of the ligand on its protein target is not known), the cross-entropy loss function is minimized as \(-\log\left(p\ne \text{non-binder}\right)\) or \(-\left(\log\left(p_{\text{agonist}}\right)+\log\left(p_{\text{antagonist}}\right)\right)\), together with the mean-squared-error loss applied to the binding affinity predictions. Refer to Methods and Supplementary Methods 19 for full details on training PSICHICXL on the Large-scale (XL) Interaction Dataset.
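The two partial-annotation scenarios in panel b can be sketched per sample as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `partial_label_loss`, the dictionary layout and the equal weighting of the two loss terms are hypothetical, the fully labeled case is assumed to sum both terms, and scenario (ii) uses the first form given in the caption, written as the negative log of one minus the non-binder probability.

```python
import math

def partial_label_loss(probs, pred_affinity, label=None, affinity=None):
    """Per-sample loss for a possibly partially labeled protein-ligand pair.

    probs: dict of class probabilities with keys 'non-binder', 'agonist',
           'antagonist' (assumed to sum to 1).
    label: functional effect class if known, else None.
    affinity: measured binding affinity if known, else None.
    """
    loss = 0.0
    if label is not None:
        # Scenario (i): class known, ordinary cross-entropy on that class.
        loss += -math.log(probs[label])
    elif affinity is not None:
        # Scenario (ii): assumed to be a binder of unknown function, so only
        # penalize probability mass placed on the non-binder class.
        loss += -math.log(1.0 - probs["non-binder"])
    if affinity is not None:
        # Squared error on the affinity prediction (equal weighting assumed).
        loss += (pred_affinity - affinity) ** 2
    return loss

probs = {"non-binder": 0.2, "agonist": 0.5, "antagonist": 0.3}
class_only = partial_label_loss(probs, pred_affinity=7.0, label="agonist")
affinity_only = partial_label_loss(probs, pred_affinity=7.0, affinity=7.5)
```

In a training loop these per-sample values would be averaged over a batch; how the paper balances the two terms is described in its Methods and is not reproduced here.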
Extended Data Fig. 7 PSICHICXL: Extensive Exposure to Broad Spectrum of Protein Families and Types.
a, Challenges in Training PSICHICXL on the Large-scale Interaction Dataset: Training with random sampling presented an overfitting risk due to uneven data distribution. 50 out of 5,107 proteins (about 1%) accounted for over half of the dataset (a-I.), with many being non-binders (a-II.). A few proteins comprised a large portion of the data, increasing the likelihood of being sampled (a-III.; red line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; points, outliers; \(n=5{,}107\)). b, Protein Representation (pie chart - outer: family; inner: proteins within families): The data show diversity across protein families (ChEMBL level 2). However, some proteins are disproportionately represented within the families; for example, 2 proteins (UniProt IDs: Q03431 and P43220) comprise 86% of ‘GPCR: Others’, causing an imbalance. c, Optimizing Sampling for PSICHICXL: A good training data landscape for PSICHICXL should balance distribution both across and within protein families, as depicted in the pie chart. d, This was established through a three-step process. First, we categorized each protein-ligand interaction as either a binder or non-binder (for example, ProteinA_Binder to ProteinZ_Non-Binder). Second, we weighted groups by the square root of the unique scaffold number, with a 90th percentile cap to prevent overrepresentation. Third, we probabilistically selected a group, then a unique scaffold within it, followed by a ligand with that scaffold, ensuring diversely represented protein-ligand interactions. The training distribution is depicted in pie charts (c, d-I). Agonists and antagonists were slightly underrepresented (d-II). Hence, PSICHICXL can be further fine-tuned using a subset of the large-scale interaction dataset. Nonetheless, given PSICHICXL’s training on a wide range of proteins and ligands (c, d), PSICHICXL should be highly effective for most proteins without additional fine-tuning, and is recommended for general use (d-III; red line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; points, outliers; \(n=5{,}107\)).
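The three-step sampling procedure in panel d can be sketched as below. This is a minimal illustration under assumptions: the function names, the nested-dictionary layout, the nearest-rank percentile and the uniform choice of scaffold and ligand within a group are all hypothetical, and the scaffold counts are made up. Because the square root is monotonic, capping the square-root weights at their 90th percentile gives the same result as capping the counts first under the same percentile rule.

```python
import math
import random

def group_weights(scaffold_counts, cap_percentile=90):
    """Weight each group by sqrt(unique scaffolds), capped at a percentile."""
    roots = {g: math.sqrt(n) for g, n in scaffold_counts.items()}
    ordered = sorted(roots.values())
    # Nearest-rank percentile (an assumed convention).
    rank = max(1, math.ceil(cap_percentile * len(ordered) / 100))
    cap = ordered[rank - 1]
    return {g: min(r, cap) for g, r in roots.items()}

def sample_pair(data, rng):
    """data maps group -> scaffold -> list of ligands; returns one sampled triple."""
    weights = group_weights({g: len(scaffolds) for g, scaffolds in data.items()})
    groups = list(data)
    group = rng.choices(groups, weights=[weights[g] for g in groups], k=1)[0]
    scaffold = rng.choice(list(data[group]))   # uniform over unique scaffolds
    ligand = rng.choice(data[group][scaffold]) # uniform over ligands with it
    return group, scaffold, ligand

# Hypothetical unique-scaffold counts for ten groups; the last one dominates.
counts = {f"g{i}": n for i, n in enumerate([1, 1, 4, 4, 9, 9, 16, 16, 25, 10000])}
weights = group_weights(counts)

# Tiny hypothetical dataset in the ProteinX_Binder / ProteinX_Non-Binder style.
data = {
    "ProteinA_Binder": {"scaffold1": ["lig1", "lig2"], "scaffold2": ["lig3"]},
    "ProteinA_Non-Binder": {"scaffold3": ["lig4"]},
}
group, scaffold, ligand = sample_pair(data, random.Random(0))
```

In this toy example the dominant group's weight drops from 100 to the cap of 5, so it is still sampled more often than small groups but no longer overwhelms them.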
Supplementary information
Supplementary Information
Supplementary Methods 1–24, Figs. 1–12 and Tables 1–12.
Source data
Source Data Fig. 2
Source data for plotting the figure.
Source Data Fig. 3
Source data for plotting the figure.
Source Data Fig. 4
Source data for plotting the figure.
Source Data Fig. 5
Source data for plotting the figure.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Koh, H.Y., Nguyen, A.T.N., Pan, S. et al. Physicochemical graph neural network for learning protein–ligand interaction fingerprints from sequence data. Nat Mach Intell 6, 673–687 (2024). https://doi.org/10.1038/s42256-024-00847-1
DOI: https://doi.org/10.1038/s42256-024-00847-1