
  • Perspective

Guiding questions to avoid data leakage in biological machine learning applications

Abstract

Machine learning methods for extracting patterns from high-dimensional data have become essential in the biological sciences. In certain cases, however, the reported prediction performance cannot be confirmed in real-world applications. One of the main reasons for this is data leakage: the illicit sharing of information between the training data and the test data, which leads to performance estimates that are far better than the performance observed in the intended application scenario. Data leakage can be difficult to detect in biological datasets because of their complex dependencies. With this in mind, we present seven questions that should be asked to prevent data leakage when constructing machine learning models in biological domains. We illustrate the usefulness of these questions by applying them to nontrivial examples. Our goal is to raise awareness of potential data leakage problems and to promote robust and reproducible machine learning-based research in biology.
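To make the notion of leakage concrete, here is a minimal, hypothetical sketch (not taken from the article) of two common patterns, written in Python with scikit-learn: fitting a preprocessing step on the full dataset before splitting, which lets test-set statistics leak into training, versus fitting it inside a pipeline on the training fold only; and a group-aware split that keeps dependent samples (for example, homologous proteins or repeated measurements from one patient) from straddling the train/test boundary. The group labels below are random stand-ins, and on synthetic data the numerical gap between the estimates may be small; the point is the structure of the workflow, not the numbers.

```python
# Hypothetical sketch (not from the article): two common sources of data
# leakage and how to avoid them with scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

# Leaky: the scaler is fit on ALL samples, so test-set statistics
# influence the features the model is trained on.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)
leaky_score = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

# Leakage-free: split first, then fit preprocessing and model together
# inside a pipeline, so only training-fold statistics are used.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clean_score = model.fit(X_tr, y_tr).score(X_te, y_te)

# Dependency-aware evaluation: related samples (e.g. homologous proteins,
# samples from the same patient) must end up in the same fold. The group
# labels here are random placeholders for such identifiers.
groups = np.random.default_rng(0).integers(0, 100, size=len(y))
grouped_scores = cross_val_score(model, X, y, groups=groups, cv=GroupKFold(n_splits=5))

print(f"leaky estimate:  {leaky_score:.3f}")
print(f"clean estimate:  {clean_score:.3f}")
print(f"group-aware CV:  {grouped_scores.mean():.3f}")
```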


Fig. 1: Visualization of the lifecycle of an ML model f.
Fig. 2: Schematic overview of the seven questions designed to reveal data leakage.


Data availability

No data were generated or analyzed for this work, and no source code was developed.


Acknowledgements

J.B. and M.L. were supported by the German Federal Ministry of Education and Research (BMBF) within the framework of the CompLS funding concept (031L0305A, DROP2AI). M.L. was additionally funded by the Deutsche Forschungsgemeinschaft (German Research Foundation; 422216132). D.B.B. was supported by the BMBF within the framework of the CompLS funding concept (031L0309A, NetMap). M.L. and D.B.B. were funded by the Deutsche Forschungsgemeinschaft (German Research Foundation; 516188180). R.J. was supported by the HelmholtzAI grant XAI-Graph, the Knut and Alice Wallenberg Foundation and the University of Gothenburg. O.V.K. acknowledges financial support from the Klaus Faber Foundation. R.J. and O.V.K. thank A. Gress and I. Senatorov for fruitful discussions.

Author information


Contributions

J.B., D.B.B., D.G.G., F.H., R.J., O.V.K. and M.L. jointly initiated this work, conceptualized the understanding of data leakage used here, phrased the seven guiding questions, and wrote and revised the manuscript. D.G.G. and F.H. provided the details for the deleteriousness prediction and thermostability prediction use cases. D.B.B., J.B. and M.L. provided the details for the protein–protein interaction prediction use case. R.J. and O.V.K. provided the details for the drug–target interaction prediction use case. R.J. designed Fig. 1. J.B. designed Fig. 2. D.B.B. provided the details for the link between data leakage and lack of reproducibility.

Corresponding authors

Correspondence to David B. Blumenthal, Dominik G. Grimm, Olga V. Kalinina or Markus List.

Ethics declarations

Competing interests

M.L. consults for mbiomics. D.B.B. consults for BioVariance. All other authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Mark Craven, Arunima Singh and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Bernett, J., Blumenthal, D.B., Grimm, D.G. et al. Guiding questions to avoid data leakage in biological machine learning applications. Nat Methods 21, 1444–1453 (2024). https://doi.org/10.1038/s41592-024-02362-y

