Limitations of representation learning in small molecule property prediction

Dias, Ana Laura; Bustillo, Latimah; Rodrigues, Tiago

doi:10.1038/s41467-023-41967-3

Comment
Open access
Published: 13 October 2023

Nature Communications volume 14, Article number: 6394 (2023)

Representation learning is making inroads into drug discovery. A study in Nature Communications emphasizes multiple limitations in property prediction. The results suggest that continued research and improvements are required for this specific area that coalesces machine learning and molecular medicine.

Biological and medicinal chemistry are experiencing an unprecedented (r)evolution with the emergence of machine learning (ML) algorithms, improved hardware, and data storage capabilities. The prime goal of ML in drug discovery is to accelerate research by prioritizing the most relevant experiments, mitigating attrition from an early stage, and thus expediting development pipelines^1,2. While this goal is of utmost relevance, the more engaged members of the cheminformatics community are now realizing that advanced deep learning algorithms rarely display desirable performance in multiple molecular design tasks involving the prediction of physicochemical and biological endpoints^3,4,5. In fact, traditional ML algorithms and molecular representations are still the state of the art, performance-wise, and may remain so as long as training data are scarce⁴. This is because deep learning algorithms are typically data hungry, i.e., they require large amounts of high-quality data to train the thousands to millions of parameters (weights) that lead to an optimal model fit.

Drug discovery is a peculiar and challenging use case for discriminative ML models for several reasons: (1) high-throughput experimentation is available⁶, yet data scarcity is still the norm for real-world problems^7,8; (2) search spaces are sparsely covered⁵, which imposes data distribution shifts over the project timeline and raises concerns over the models’ domains of applicability; and (3) experimental uncertainty is largely unaccounted for in ML models, and a clear solution to this limitation is currently unavailable⁹. The last point is particularly concerning since it directly impacts the quality of the available training datasets and benchmarks, and the attainability of robust decision-making processes. Moreover, a persistent lack of standardized reporting practices in ML studies is also apparent, which makes method comparison nontrivial and potentially misleading^9,10. While we⁹ and others^11,12 have suggested solutions and guidelines to overcome those issues, said guidelines are rooted in hands-on experience and are still not widely adopted. Building on that, Wang and co-workers¹³ go one step further and exemplify good ML practice with the widely used MoleculeNet data. MoleculeNet¹⁴ is not free of limitations of its own, as the dynamic range of some endpoints is irrelevant in a drug discovery setting, which suggests that better benchmarks are required. Still, the team exposed shortcomings of deep learning algorithms that should dampen unfounded hype around ML with molecular featurization based on graphs or natural language.

In a thorough methodology survey, the research team studied different factors that might bias method comparison and performance, such as input data, train/test splits, molecular representations, performance metrics and the random seed. More specifically, random forests (RF), extreme gradient boosting (XGBoost) and support vector machines (SVMs) were employed with circular fingerprints to obtain relevant baseline models. Those models were pitted against a recurrent neural network, different flavors of transformers (e.g., MolBERT, GROVER), and generative and graph-based methods that sieved directly through SMILES strings to learn a chemical language or graph descriptors. Despite pre-training routines, it became apparent that baseline models performed competitively or seemingly better in select bioactivity and physicochemical property datasets. In particular, RFs displayed the best performance on the BACE, BBBP, ESOL and Lipop use cases, which can be ascribed not only to the fingerprint descriptors, but also to the superior performance of this type of algorithm in low-data regimes. Conversely, deep learning algorithms only became competitive in the HIV dataset, and in the prediction of molecular weight and number of atoms when datasets contained >1000 training examples. Although previously reported, the result further reinforces deficiencies in representation learning as a generally applicable solution to accelerate molecular medicine. An identical pattern of low performance was observed when using scaffold splits to assess model generalization on both unseen scaffolds and activity cliff molecules. In this case, the result was not entirely unexpected. One can speculate that the reason lies not only in the customary low abundance of training data, but also in a data shift issue. In fact, the application of learning algorithms to previously unseen scaffolds likely imposes a distribution mismatch and a higher likelihood of mispredictions. This mismatch is often encountered in real-world drug discovery programs, as molecular designs can change dramatically over a project timeline. Experimentally, testing chemical entities that differ significantly from prior knowledge can increase attrition, akin to using models outside their domain of applicability. It is thus understandable that learning algorithms underperform with scaffold splits in comparison to random splits, where no development timelines are taken into account in the splitting routine.
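The difference between the two splitting routines can be made concrete with a minimal sketch. It is not the procedure used in the study under discussion; it assumes that each molecule already carries a scaffold key (for instance a Bemis-Murcko scaffold SMILES computed with a cheminformatics toolkit), and the function names and toy data below are hypothetical. Whole scaffold groups are assigned to one side of the split, so the test set only contains scaffolds the model has never seen, whereas a random split gives no such guarantee.

```python
import random
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2):
    """Split indices so that no scaffold appears in both train and test.

    `scaffolds` holds one scaffold key per molecule (assumed precomputed).
    """
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)

    # Largest scaffold groups are placed in train first, so rarer
    # scaffolds tend to end up in the test set.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n_train_target = len(scaffolds) - int(round(test_fraction * len(scaffolds)))

    train, test = [], []
    for group in ordered:
        if len(train) + len(group) <= n_train_target:
            train.extend(group)
        else:
            test.extend(group)
    return sorted(train), sorted(test)


def random_split(n_items, test_fraction=0.2, seed=0):
    """Shuffle indices and cut; scaffolds may appear on both sides."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = int(round(test_fraction * n_items))
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def shared_scaffolds(scaffolds, train, test):
    """Scaffold keys present on both sides of a split."""
    return {scaffolds[i] for i in train} & {scaffolds[i] for i in test}


# Hypothetical toy data: ten molecules spread over four scaffold classes.
toy_scaffolds = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]
train_idx, test_idx = scaffold_split(toy_scaffolds, test_fraction=0.3)
```

On the toy data, the scaffold split keeps classes A and B for training and holds out C and D entirely, so `shared_scaffolds` returns an empty set. Applying `shared_scaffolds` to the output of `random_split` shows whether a given random seed leaks scaffolds between the two sets.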

When analyzing RFs, it was also found that no descriptor set works satisfactorily on all predictive tasks, indicating that feature engineering and the development of molecular representation toolkits are, and will remain, an active topic in computational medicinal chemistry. Another particularly interesting issue discussed by Wang and colleagues is the empirical binning of continuous bioactivity readouts, with enormous loss of information, to obtain classifiers rather than regressors. Arguably, regressors need more training data, which is sometimes incompatible with experimentation costs. In the case of classifiers, the uneven (or so-called imbalanced) label distribution and the most appropriate metrics for model assessment are also discussed, to avoid erroneous or skewed comparisons¹⁵. As noted, the area under the receiver operating characteristic curve is commonly used to gauge classifier performance. However, it can be optimistic for imbalanced label distributions. In such scenarios, the precision–recall curve is advisable as it focuses on the minority class.
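To illustrate why the two metrics can diverge on imbalanced labels, the sketch below computes both on a hypothetical screen with 2 actives among 100 compounds. The data and function names are invented for illustration and do not come from the study. The area under the ROC curve is computed as the fraction of active/inactive pairs ranked correctly, and the area under the precision-recall curve is approximated by average precision.

```python
def roc_auc(labels, scores):
    """AUROC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / n_pos


# Hypothetical imbalanced screen: 98 inactives and 2 actives.
# Inactives receive scores 0.00 ... 0.97; the actives score 0.925 and 0.955,
# so a handful of inactives are ranked above each active.
labels = [0] * 98 + [1, 1]
scores = [i / 100 for i in range(98)] + [0.925, 0.955]

auc = roc_auc(labels, scores)           # 189/196, about 0.96
ap = average_precision(labels, scores)  # (1/3 + 2/7) / 2, about 0.31
```

Only seven inactive/active pairs are ranked incorrectly out of 196, so the AUROC looks excellent, yet the two actives sit at ranks 3 and 7, and the average precision of roughly 0.31 reflects how many inactives would be tested before both actives are found. For reference, a random ranking would be expected to give an average precision close to the fraction of actives, here 0.02.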

Overall, the team highlights numerous methodological shortcomings in ML toolkits and practices that the community as a whole must strive to change. Further, they speculate that self-supervised learning can bypass the need for human annotations and expensive experimentation, and hint that the contrastive type of self-supervised learning might be applicable to small datasets in drug discovery. Indeed, the presented data partly run counter to the current enthusiasm for deep learning by showing that tree-based methods with fixed representations are likely still the best option for property prediction. Although surprising to some, the report by Wang and team should further spur investigations in a quest to make representation learning more competitive and suited to real-world molecular medicine.

References

1. Vamathevan, J. et al. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. 18, 463–477 (2019).
2. de Almeida, A. F., Moreira, R. & Rodrigues, T. Synthetic organic chemistry driven by artificial intelligence. Nat. Rev. Chem. 3, 589–604 (2019).
3. Van Tilborg, D., Alenicheva, A. & Grisoni, F. Exposing the limitations of molecular machine learning with activity cliffs. J. Chem. Inf. Model. 62, 5938–5951 (2022).
4. Janela, T. & Bajorath, J. Simple nearest-neighbour analysis meets the accuracy of compound potency predictions using complex machine learning models. Nat. Mach. Intell. 4, 1246–1255 (2022).
5. Saebi, M. et al. On the use of real-world datasets for reaction yield prediction. Chem. Sci. 14, 4997–5005 (2023).
6. Ahneman, D. T., Estrada, J. G., Lin, S., Dreher, S. D. & Doyle, A. G. Predicting reaction performance in C–N cross-coupling using machine learning. Science 360, 186–190 (2018).
7. Shields, B. J. et al. Bayesian reaction optimization as a tool for chemical synthesis. Nature 590, 89–96 (2021).
8. Reker, D., Hoyt, E. A., Bernardes, G. J. L. & Rodrigues, T. Adaptive optimization of chemical reactions with minimal experimental information. Cell Rep. Phys. Sci. 1, 100247 (2020).
9. Bender, A. et al. Evaluation guidelines for machine learning tools in the chemical sciences. Nat. Rev. Chem. 6, 428–442 (2022).
10. Rodrigues, T. The good, the bad, and the ugly in chemical and biological data for machine learning. Drug Discov. Today Technol. 32–33, 3–8 (2019).
11. Artrith, N. et al. Best practices in machine learning for chemistry. Nat. Chem. 13, 505–508 (2021).
12. Keeping checks on machine learning. Nat. Methods 18, 1119 (2021).
13. Deng, J. et al. A systematic study of key elements underlying molecular property prediction. Nat. Commun. https://doi.org/10.1038/s41467-023-41948-6 (2023).
14. Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
15. Lee, K. et al. Combating small-molecule aggregation with machine learning. Cell Rep. Phys. Sci. 2, 100573 (2021).

Author information

These authors contributed equally: Ana Laura Dias, Latimah Bustillo.

Authors and Affiliations

Research Institute for Medicines (iMed), Faculdade de Farmácia, Universidade de Lisboa, Lisbon, Portugal
Ana Laura Dias, Latimah Bustillo & Tiago Rodrigues

Contributions

All authors contributed to the writing of the manuscript.

Corresponding author

Correspondence to Tiago Rodrigues.

Ethics declarations

Competing interests

T. R. is a co-founder and shareholder of TargTex S.A. and a consultant to the pharmaceutical industry. The remaining authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Dias, A.L., Bustillo, L. & Rodrigues, T. Limitations of representation learning in small molecule property prediction. Nat Commun 14, 6394 (2023). https://doi.org/10.1038/s41467-023-41967-3

Received: 13 July 2023
Accepted: 18 September 2023
Published: 13 October 2023
DOI: https://doi.org/10.1038/s41467-023-41967-3
