Simple nearest-neighbour analysis meets the accuracy of compound potency predictions using complex machine learning models

Abstract

Compound potency prediction is a popular application of machine learning in drug discovery, for which increasingly complex models are employed. The general aim is the identification of new chemical entities that are highly potent against a given target. The relative performance of potency prediction models and their accuracy limitations continue to be debated in the field, and it remains unclear whether deep learning can further advance potency prediction. We have analysed and compared approaches of varying computational complexity for potency prediction and shown that simple nearest-neighbour analysis consistently meets or exceeds the accuracy of machine learning methods regarded as the state of the art in the field. Moreover, completely random predictions using different models were shown to reproduce experimental values within an order of magnitude, resulting from the potency value distributions in commonly used compound data sets. Taken together, these findings have important implications for typical benchmark calculations to evaluate machine learning performance. Simple controls such as nearest-neighbour analysis should generally be included in model evaluation. Furthermore, the narrow margin separating the best and completely random potency predictions is unrealistic and requires the consideration of alternative benchmark criteria, as discussed herein.
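To make these controls concrete, here is a minimal Python sketch, not the authors' exact protocol, that pairs a 1-nearest-neighbour potency prediction with a label-shuffled (y-randomized) control. It assumes RDKit and scikit-learn are available; smiles_list and pki_values are hypothetical placeholders for the compounds and logarithmic potency values of one activity class.

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def ecfp4(smiles, n_bits=2048):
    # Morgan fingerprint with radius 2 (ECFP4-like), as a binary array.
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    return np.array(fp, dtype=bool)

X = np.vstack([ecfp4(s) for s in smiles_list])
y = np.asarray(pki_values, dtype=float)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Control 1: 1-NN regression. For binary fingerprints, Jaccard distance
# equals 1 - Tanimoto similarity, the similarity measure used in the paper.
knn = KNeighborsRegressor(n_neighbors=1, metric="jaccard")
knn.fit(X_train, y_train)
rmse_nn = np.sqrt(mean_squared_error(y_test, knn.predict(X_test)))

# Control 2: y-randomization. Shuffling the training potencies destroys any
# structure-potency relationship, so the remaining accuracy reflects only
# the potency value distribution of the data set.
rng = np.random.default_rng(0)
knn_rand = KNeighborsRegressor(n_neighbors=1, metric="jaccard")
knn_rand.fit(X_train, rng.permutation(y_train))
rmse_rand = np.sqrt(mean_squared_error(y_test, knn_rand.predict(X_test)))

print(f"1-NN RMSE: {rmse_nn:.2f} | shuffled-label RMSE: {rmse_rand:.2f}")

A model that does not clearly outperform both of these baselines on the same data splits adds little beyond the controls.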

Fig. 1: Potency value distributions of activity classes.
Fig. 2: Prediction accuracy.
Fig. 3: Prediction accuracy for unique hold-out sets.
Fig. 4: Prediction accuracy for most potent compounds.
Fig. 5: Performance of random prediction models.

Data availability

Publicly available compounds and activity data, including compound activity classes and sets of analogue series extracted from these classes, were obtained from ChEMBL using the data selection and calculation protocols described in the Methods sections 'Compounds and activity data' and 'Molecular representations, similarity calculations and analogue series'. In addition, all data sets used for the calculations reported herein are freely available via the following link: https://github.com/TiagoJanela/ML-for-compound-potency-prediction. Source data are provided with this paper.

Code availability

All calculations were carried out using public domain programs and computational tools.

Additional code used for our calculations is freely available via the following link: https://github.com/TiagoJanela/ML-for-compound-potency-prediction. The code is also available at https://doi.org/10.5281/zenodo.7238586 (ref. 39).

References

  1. Gleeson, M. P. & Gleeson, D. QM/MM calculations in drug discovery: a useful method for studying binding phenomena? J. Chem. Inf. Model. 49, 670–677 (2009).

  2. Mobley, D. L. & Gilson, M. K. Predicting binding free energies: frontiers and benchmarks. Annu. Rev. Biophys. 46, 531–558 (2017).

  3. Li, H., Sze, K. H., Lu, G. & Ballester, P. J. Machine‐learning scoring functions for structure‐based virtual screening. WIREs Comput. Mol. Sci. 11, e1478 (2021).

  4. Lewis, R. A. & Wood, D. Modern 2D QSAR for drug discovery. WIREs Comput. Mol. Sci. 4, 505–522 (2014).

  5. Vamathevan, J. et al. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. 18, 463–477 (2019).

  6. Lavecchia, A. Deep learning in drug discovery: opportunities, challenges and future prospects. Drug Discov. Today 24, 2017–2032 (2019).

  7. Walters, W. P. & Barzilay, R. Applications of deep learning in molecule generation and molecular property prediction. Acc. Chem. Res. 54, 263–270 (2020).

  8. Torng, W. & Altman, R. B. Graph convolutional neural networks for predicting drug–target interactions. J. Chem. Inf. Model. 59, 4131–4149 (2019).

  9. Son, J. & Kim, D. Development of a graph convolutional neural network model for efficient prediction of protein–ligand binding affinities. PLoS ONE 16, e0249404 (2021).

  10. Li, Y. et al. An adaptive graph learning method for automated molecular interactions and properties predictions. Nat. Mach. Intell. 4, 645–651 (2022).

  11. Fang, X. et al. Geometry-enhanced molecular representation learning for property prediction. Nat. Mach. Intell. 4, 127–134 (2022).

  12. Sakai, M. et al. Prediction of pharmacological activities from chemical structures with graph convolutional neural networks. Sci. Rep. 11, 525 (2021).

  13. Chen, L. et al. Hidden bias in the DUD-E dataset leads to misleading performance of deep learning in structure-based virtual screening. PLoS ONE 14, e0220113 (2019).

  14. Yang, J., Shen, C. & Huang, N. Predicting or pretending: artificial intelligence for protein–ligand interactions lack of sufficiently large and unbiased datasets. Front. Pharmacol. 11, e69 (2020).

  15. Volkov, M. et al. On the frustration to predict binding affinities from protein–ligand structures with deep neural networks. J. Med. Chem. 65, 7946–7958 (2022).

  16. Bento, A. P. et al. The ChEMBL bioactivity database: an update. Nucleic Acids Res. 42, D1083–D1090 (2014).

  17. Stumpfe, D., Hu, Y., Dimova, D. & Bajorath, J. Recent progress in understanding activity cliffs and their utility in medicinal chemistry. J. Med. Chem. 57, 18–28 (2014).

  18. Baell, J. B. & Holloway, G. A. New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays. J. Med. Chem. 53, 2719–2740 (2010).

  19. Bruns, R. F. & Watson, I. A. Rules for identifying potentially reactive or promiscuous compounds. J. Med. Chem. 55, 9763–9772 (2012).

  20. Irwin, J. J. et al. An aggregation advisor for ligand discovery. J. Med. Chem. 58, 7076–7087 (2015).

  21. Ashton, M. et al. Identification of diverse database subsets using property-based and fragment-based molecular descriptions. Quant. Struct.-Act. Relat. 21, 598–604 (2002).

  22. Willett, P., Barnard, J. M. & Downs, G. M. Chemical similarity searching. J. Chem. Inf. Comput. Sci. 38, 983–996 (1998).

  23. Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A. & Vapnik, V. Support vector regression machines. In Proc. Ninth International Conference on Neural Information Processing Systems (eds Jordan, M. I. & Petsche, T.) 155–161 (MIT Press, 1997).

  24. Smola, A. J. & Schölkopf, B. A tutorial on support vector regression. Stat. Comput. 14, 199–222 (2004).

  25. Ralaivola, L., Swamidass, S. J., Saigo, H. & Baldi, P. Graph kernels for chemical informatics. Neural Netw. 18, 1093–1110 (2005).

  26. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  27. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).

  28. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).

  29. Nielsen, M. A. Neural Networks and Deep Learning (Determination Press, 2015).

  30. Kingma, D. P. & Ba, J. L. Adam: a method for stochastic optimization. In Proc. 3rd International Conference on Learning Representations (ICLR) (eds Bengio, Y. & LeCun, Y.) (2015).

  31. Abadi, M. et al. TensorFlow: a system for large-scale machine learning. In OSDI’16: Proc. 12th USENIX Conf. Operating Systems Design and Implementation (chairs Keeton, K. & Roscoe, T.) 265–283 (USENIX Association, 2016).

  32. Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M. & Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 20, 61–80 (2009).

  33. Duvenaud, D. K. et al. Convolutional networks on graphs for learning molecular fingerprints. Adv. Neural Inf. Process. Syst. 28, 2224–2232 (2015).

  34. Altman, N. S. An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat. 46, 175–185 (1992).

  35. Rücker, C., Rücker, G. & Meringer, M. y-Randomization and its variants in QSPR/QSAR. J. Chem. Inf. Model. 47, 2345–2357 (2007).

  36. Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).

  37. Naveja, J. J. et al. Systematic extraction of analogue series from large compound collections using a new computational compound–core relationship method. ACS Omega 4, 1027–1032 (2019).

  38. Conover, W. J. On methods of handling ties in the Wilcoxon signed-rank test. J. Am. Stat. Assoc. 68, 985–988 (1973).

  39. Janela, T. ML-for-compound-potency-prediction. Zenodo https://doi.org/10.5281/zenodo.7238586 (2022).

Download references

Acknowledgements

We thank C. Feldmann, A. Lamens, F. Siemers and M. Vogt for helpful discussions.

Author information

Authors and Affiliations

Authors

Contributions

Conceptualization, J.B.; methodology, T.J. and J.B.; data and code, T.J.; investigation, T.J.; analysis, T.J. and J.B.; writing—original draft, J.B.; writing—review and editing, T.J. and J.B.

Corresponding author

Correspondence to Jürgen Bajorath.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Alexander Tropsha and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Structural similarity versus potency differences.

For all activity classes, structural similarity versus (logarithmic) potency difference plots are shown. Each data point represents a pairwise compound comparison. Tanimoto similarity was calculated using ECFP4 (Methods). In addition, similarity and potency difference value distributions are displayed in each plot.
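
A minimal sketch of the pairwise comparison underlying these plots, assuming RDKit is available; mols (RDKit molecules) and pki (matching logarithmic potency values) are hypothetical inputs for one activity class.

from itertools import combinations
from rdkit import DataStructs
from rdkit.Chem import AllChem

# ECFP4-like Morgan fingerprints (radius 2), as in the Methods.
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

# One (similarity, potency difference) point per compound pair.
points = [(DataStructs.TanimotoSimilarity(fps[i], fps[j]),
           abs(pki[i] - pki[j]))
          for i, j in combinations(range(len(fps)), 2)]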

Extended Data Fig. 2 Prediction accuracy.

Boxplots report the distribution of RMSE values for 10 independent potency prediction trials on different activity classes using different models (kNN, SVR, RFR, DNN, GCN, and MR). Results of predictions are reported for complete training sets (complete set) and size-reduced training sets (random and diverse sets, respectively). The boxplot elements are defined according to Fig. 2.

Source data
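
As an illustration of the repeated-trial protocol behind such boxplots, here is a short sketch; model (any scikit-learn regressor), X and y are placeholders, and the paper's exact splitting and training-set reduction steps may differ.

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rmses = []
for seed in range(10):  # 10 independent trials, one RMSE value each
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    fitted = clone(model).fit(X_tr, y_tr)
    rmses.append(np.sqrt(mean_squared_error(y_te, fitted.predict(X_te))))

# The 10 RMSE values form the distribution summarized by one boxplot.
print(np.median(rmses), np.percentile(rmses, [25, 75]))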

Extended Data Fig. 3 Prediction accuracy for unique hold-out sets.

RMSE values are reported for a prediction trial on a structurally unique hold-out set (cluster set) from each activity class using different models (kNN, SVR, RFR, DNN, GCN, and MR) derived from complete training sets.

Source data

Extended Data Fig. 4 Prediction accuracy for most potent compounds.

RMSE values are reported for a prediction trial on the most potent compounds held out from each activity class (potent set) using different models (kNN, SVR, RFR, DNN, GCN, and MR) derived from complete training sets.

Source data

Supplementary information

Supplementary Information

Supplementary Table 1.

Supplementary Data 1

Source data for Supplementary Table 1.

Source data

Source Data Fig. 1

Statistical source data.

Source Data Fig. 2

Statistical source data.

Source Data Fig. 3

Statistical source data.

Source Data Fig. 4

Statistical source data.

Source Data Fig. 5

Statistical source data.

Source Data Extended Data Fig. 2

Statistical source data.

Source Data Extended Data Fig. 3

Statistical source data.

Source Data Extended Data Fig. 4

Statistical source data.

Source Data Table 1

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Janela, T. & Bajorath, J. Simple nearest-neighbour analysis meets the accuracy of compound potency predictions using complex machine learning models. Nat. Mach. Intell. 4, 1246–1255 (2022). https://doi.org/10.1038/s42256-022-00581-6
