Simple nearest-neighbour analysis meets the accuracy of compound potency predictions using complex machine learning models

Abstract

Compound potency prediction is a popular application of machine learning in drug discovery, for which increasingly complex models are employed. The general aim is the identification of new chemical entities that are highly potent against a given target. The relative performance of potency prediction models and their accuracy limitations continue to be debated in the field, and it remains unclear whether deep learning can further advance potency prediction. We have analysed and compared approaches of varying computational complexity for potency prediction and shown that simple nearest-neighbour analysis consistently meets or exceeds the accuracy of machine learning methods regarded as the state of the art in the field. Moreover, completely random predictions using different models were shown to reproduce experimental values within an order of magnitude, resulting from the potency value distributions in commonly used compound data sets. Taken together, these findings have important implications for typical benchmark calculations to evaluate machine learning performance. Simple controls such as nearest-neighbour analysis should generally be included in model evaluation. Furthermore, the narrow margin separating the best and completely random potency predictions is unrealistic and requires the consideration of alternative benchmark criteria, as discussed herein.
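To make these controls concrete, here is a minimal Python sketch, not the authors' exact protocol, that pairs a 1-nearest-neighbour potency prediction with a label-shuffled (y-randomized) control. It assumes RDKit and scikit-learn are available; smiles_list and pki_values are hypothetical placeholders for the compounds and logarithmic potency values of one activity class.

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def ecfp4(smiles, n_bits=2048):
    # Morgan fingerprint with radius 2 (ECFP4-like), as a binary array.
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    return np.array(fp, dtype=bool)

X = np.vstack([ecfp4(s) for s in smiles_list])
y = np.asarray(pki_values, dtype=float)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Control 1: 1-NN regression. For binary fingerprints, Jaccard distance
# equals 1 - Tanimoto similarity, the similarity measure used in the paper.
knn = KNeighborsRegressor(n_neighbors=1, metric="jaccard")
knn.fit(X_train, y_train)
rmse_nn = np.sqrt(mean_squared_error(y_test, knn.predict(X_test)))

# Control 2: y-randomization. Shuffling the training potencies destroys any
# structure-potency relationship, so the remaining accuracy reflects only
# the potency value distribution of the data set.
rng = np.random.default_rng(0)
knn_rand = KNeighborsRegressor(n_neighbors=1, metric="jaccard")
knn_rand.fit(X_train, rng.permutation(y_train))
rmse_rand = np.sqrt(mean_squared_error(y_test, knn_rand.predict(X_test)))

print(f"1-NN RMSE: {rmse_nn:.2f} | shuffled-label RMSE: {rmse_rand:.2f}")

A model that does not clearly outperform both of these baselines on the same data splits adds little beyond the controls.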

Fig. 1: Potency value distributions of activity classes.
Fig. 2: Prediction accuracy.
Fig. 3: Prediction accuracy for unique hold-out sets.
Fig. 4: Prediction accuracy for most potent compounds.
Fig. 5: Performance of random prediction models.

Data availability

Publicly available compounds and activity data, including compound activity classes and sets of analogue series extracted from these classes, were obtained from ChEMBL using the data selection and calculation protocols described in the Methods sections 'Compounds and activity data' and 'Molecular representations, similarity calculations and analogue series'. In addition, all data sets used for the calculations reported herein are freely available via the following link: https://github.com/TiagoJanela/ML-for-compound-potency-prediction. Source data are provided with this paper.

Code availability

All calculations were carried out using public domain programs and computational tools.

Additional code used for our calculations is freely available via the following link: https://github.com/TiagoJanela/ML-for-compound-potency-prediction. The code is also available at https://doi.org/10.5281/zenodo.7238586 (ref. 39).

References

  1. Gleeson, M. P. & Gleeson, D. QM/MM calculations in drug discovery: a useful method for studying binding phenomena? J. Chem. Inf. Model. 49, 670–677 (2009).

  2. Mobley, D. L. & Gilson, M. K. Predicting binding free energies: frontiers and benchmarks. Annu. Rev. Biophys. 46, 531–558 (2017).

  3. Li, H., Sze, K. H., Lu, G. & Ballester, P. J. Machine‐learning scoring functions for structure‐based virtual screening. WIREs Comput. Mol. Sci. 11, e1478 (2021).

  4. Lewis, R. A. & Wood, D. Modern 2D QSAR for drug discovery. WIREs Comput. Mol. Sci. 4, 505–522 (2014).

  5. Vamathevan, J. et al. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. 18, 463–477 (2019).

  6. Lavecchia, A. Deep learning in drug discovery: opportunities, challenges and future prospects. Drug Discov. Today 24, 2017–2032 (2019).

  7. Walters, W. P. & Barzilay, R. Applications of deep learning in molecule generation and molecular property prediction. Acc. Chem. Res. 54, 263–270 (2020).

  8. Torng, W. & Altman, R. B. Graph convolutional neural networks for predicting drug–target interactions. J. Chem. Inf. Model. 59, 4131–4149 (2019).

  9. Son, J. & Kim, D. Development of a graph convolutional neural network model for efficient prediction of protein–ligand binding affinities. PLoS ONE 16, e0249404 (2021).

  10. Li, Y. et al. An adaptive graph learning method for automated molecular interactions and properties predictions. Nat. Mach. Intell. 4, 645–651 (2022).

  11. Fang, X. et al. Geometry-enhanced molecular representation learning for property prediction. Nat. Mach. Intell. 4, 127–134 (2022).

  12. Sakai, M. et al. Prediction of pharmacological activities from chemical structures with graph convolutional neural networks. Sci. Rep. 11, 525 (2021).

  13. Chen, L. et al. Hidden bias in the DUD-E dataset leads to misleading performance of deep learning in structure-based virtual screening. PLoS ONE 14, e0220113 (2019).

  14. Yang, J., Shen, C. & Huang, N. Predicting or pretending: artificial intelligence for protein–ligand interactions lack of sufficiently large and unbiased datasets. Front. Pharmacol. 11, e69 (2020).

  15. Volkov, M. et al. On the frustration to predict binding affinities from protein–ligand structures with deep neural networks. J. Med. Chem. 65, 7946–7958 (2022).

  16. Bento, A. P. et al. The ChEMBL bioactivity database: an update. Nucleic Acids Res. 42, D1083–D1090 (2014).

  17. Stumpfe, D., Hu, Y., Dimova, D. & Bajorath, J. Recent progress in understanding activity cliffs and their utility in medicinal chemistry. J. Med. Chem. 57, 18–28 (2014).

  18. Baell, J. B. & Holloway, G. A. New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays. J. Med. Chem. 53, 2719–2740 (2010).

  19. Bruns, R. F. & Watson, I. A. Rules for identifying potentially reactive or promiscuous compounds. J. Med. Chem. 55, 9763–9772 (2012).

  20. Irwin, J. J. et al. An aggregation advisor for ligand discovery. J. Med. Chem. 58, 7076–7087 (2015).

  21. Ashton, M. et al. Identification of diverse database subsets using property-based and fragment-based molecular descriptions. Quant. Struct.-Act. Relat. 21, 598–604 (2002).

  22. Willett, P., Barnard, J. M. & Downs, G. M. Chemical similarity searching. J. Chem. Inf. Comput. Sci. 38, 983–996 (1998).

  23. Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A. & Vapnik, V. Support vector regression machines. In Proc. Ninth International Conference on Neural Information Processing Systems (eds Jordan, M. I. & Petsche, T.) 155–161 (MIT Press, 1997).

  24. Smola, A. J. & Schölkopf, B. A tutorial on support vector regression. Stat. Comput. 14, 199–222 (2004).

  25. Ralaivola, L., Swamidass, S. J., Saigo, H. & Baldi, P. Graph kernels for chemical informatics. Neural Netw. 18, 1093–1110 (2005).

  26. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  27. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).

  28. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).

  29. Nielsen, M. A. Neural Networks and Deep Learning (Determination Press, 2015).

  30. Kingma, D. P. & Ba, J. L. Adam: a method for stochastic optimization. In Proc. 3rd International Conference on Learning Representations (ICLR) (eds Bengio, Y. & LeCun, Y.) (2015).

  31. Abadi, M. et al. TensorFlow: a system for large-scale machine learning. In OSDI’16: Proc. 12th USENIX Conf. Operating Systems Design and Implementation (chairs Keeton, K. & Roscoe, T.) 265–283 (USENIX Association, 2016).

  32. Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M. & Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 20, 61–80 (2009).

  33. Duvenaud, D. K. et al. Convolutional networks on graphs for learning molecular fingerprints. Adv. Neural Inf. Process. Syst. 28, 2224–2232 (2015).

  34. Altman, N. S. An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat. 46, 175–185 (1992).

  35. Rücker, C., Rücker, G. & Meringer, M. y-Randomization and its variants in QSPR/QSAR. J. Chem. Inf. Model. 47, 2345–2357 (2007).

  36. Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).

  37. Naveja, J. J. et al. Systematic extraction of analogue series from large compound collections using a new computational compound–core relationship method. ACS Omega 4, 1027–1032 (2019).

  38. Conover, W. J. On methods of handling ties in the Wilcoxon signed-rank test. J. Am. Stat. Assoc. 68, 985–988 (1973).

  39. Janela, T. ML-for-compound-potency-prediction. Zenodo https://doi.org/10.5281/zenodo.7238586 (2022).

Download references

Acknowledgements

We thank C. Feldmann, A. Lamens, F. Siemers and M. Vogt for helpful discussions.

Author information

Authors and Affiliations

Authors

Contributions

Conceptualization, J.B.; methodology, T.J. and J.B.; data and code, T.J.; investigation, T.J.; analysis, T.J. and J.B.; writing—original draft, J.B.; writing—review and editing, T.J. and J.B.

Corresponding author

Correspondence to Jürgen Bajorath.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Alexander Tropsha and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Structural similarity versus potency differences.

For all activity classes, structural similarity versus (logarithmic) potency difference plots are shown. Each data point represents a pairwise compound comparison. Tanimoto similarity was calculated using ECFP4 (Methods). In addition, similarity and potency difference value distributions are displayed in each plot.
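
A minimal sketch of the pairwise comparison underlying these plots, assuming RDKit is available; mols (RDKit molecules) and pki (matching logarithmic potency values) are hypothetical inputs for one activity class.

from itertools import combinations
from rdkit import DataStructs
from rdkit.Chem import AllChem

# ECFP4-like Morgan fingerprints (radius 2), as in the Methods.
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

# One (similarity, potency difference) point per compound pair.
points = [(DataStructs.TanimotoSimilarity(fps[i], fps[j]),
           abs(pki[i] - pki[j]))
          for i, j in combinations(range(len(fps)), 2)]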

Extended Data Fig. 2 Prediction accuracy.

Boxplots report the distribution of RMSE values for 10 independent potency prediction trials on different activity classes using different models (kNN, SVR, RFR, DNN, GCN, and MR). Results of predictions are reported for complete training sets (complete set) and size-reduced training sets (random and diverse sets, respectively). The boxplot elements are defined according to Fig. 2.

Source data
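
As an illustration of the repeated-trial protocol behind such boxplots, here is a short sketch; model (any scikit-learn regressor), X and y are placeholders, and the paper's exact splitting and training-set reduction steps may differ.

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rmses = []
for seed in range(10):  # 10 independent trials, one RMSE value each
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    fitted = clone(model).fit(X_tr, y_tr)
    rmses.append(np.sqrt(mean_squared_error(y_te, fitted.predict(X_te))))

# The 10 RMSE values form the distribution summarized by one boxplot.
print(np.median(rmses), np.percentile(rmses, [25, 75]))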

Extended Data Fig. 3 Prediction accuracy for unique hold-out sets.

RMSE values are reported for a prediction trial on a structurally unique hold-out set (cluster set) from each activity class using different models (kNN, SVR, RFR, DNN, GCN, and MR) derived from complete training sets.

Source data

Extended Data Fig. 4 Prediction accuracy for most potent compounds.

RMSE values are reported for a prediction trial on the most potent compounds held out from each activity class (potent set) using different models (kNN, SVR, RFR, DNN, GCN, and MR) derived from complete training sets.

Source data

Supplementary information

Supplementary Information

Supplementary Table 1.

Supplementary Data 1

Source data for Supplementary Table 1.

Source data

Source Data Fig. 1

Statistical source data.

Source Data Fig. 2

Statistical source data.

Source Data Fig. 3

Statistical source data.

Source Data Fig. 4

Statistical source data.

Source Data Fig. 5

Statistical source data.

Source Data Extended Data Fig. 2

Statistical source data.

Source Data Extended Data Fig. 3

Statistical source data.

Source Data Extended Data Fig. 4

Statistical source data.

Source Data Table 1

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Janela, T. & Bajorath, J. Simple nearest-neighbour analysis meets the accuracy of compound potency predictions using complex machine learning models. Nat. Mach. Intell. 4, 1246–1255 (2022). https://doi.org/10.1038/s42256-022-00581-6
