Biological and medicinal chemistry are experiencing an unprecedented (r)evolution with the emergence of machine learning (ML) algorithms, improved hardware, and greater data storage capabilities. The prime goal of ML in drug discovery is to accelerate research by prioritizing the most relevant experiments, mitigating attrition from an early stage, and thus expediting development pipelines1,2. Yet, while ML is of utmost relevance, the cheminformatics community is now realizing that advanced deep learning algorithms rarely display the desired performance across molecular design tasks involving the prediction of physicochemical and biological endpoints3,4,5. In fact, traditional ML algorithms and molecular representations are still the state of the art, performance-wise, and may remain so as long as training data are scarce4. This is because deep learning algorithms are typically data hungry, i.e., they require large amounts of high-quality data to train the thousands to millions of parameters (weights) that lead to an optimal model fit.

Drug discovery is a peculiar and challenging use case for discriminative ML models for several reasons: (1) high-throughput experimentation is available6, yet data scarcity is still the norm for real-world problems7,8; (2) search spaces are sparsely covered5, which imposes data distribution shifts over the project timeline and raises concerns over the models’ domains of applicability; (3) experimental uncertainty is largely unaccounted for in ML models, and a clear solution to this limitation is currently unavailable9. The latter is particularly concerning since it directly impacts the quality of the available training datasets and benchmarks, and the attainability of robust decision-making processes. Moreover, there is a persisting lack of standardized reporting practices in ML studies, which makes method comparison nontrivial and potentially misleading9,10. While we9 and others11,12 have suggested solutions and guidelines to overcome those issues, said guidelines are rooted in hands-on experience and are still not widely adopted. Building on that, Wang and co-workers13 go one step further and exemplify good ML practice with the widely used MoleculeNet data. MoleculeNet14 is not free of limitations of its own, as the dynamic range of some endpoints is irrelevant in a drug discovery setting, which suggests that better benchmarks are required. Still, the team exposed shortcomings of deep learning algorithms that should dampen unfounded hype around ML with molecular featurization based on graphs or natural language.

In a thorough methodology survey, the research team studied different factors that might bias method comparison and performance, such as input data, train/test splits, molecular representations, performance metrics and the random seed. More specifically, random forests (RF), extreme gradient boosting (XGBoost) and support vector machines (SVMs) were employed with circular fingerprints to obtain relevant baseline models. Those models were pitted against a recurrent neural network, different flavors of transformers (e.g., MolBERT, GROVER), and generative and graph-based methods that learn a chemical language directly from SMILES strings or learn representations from molecular graphs. Despite pre-training routines, it became apparent that the baseline models performed competitively or seemingly better on select bioactivity and physicochemical property datasets. In particular, RFs displayed the best performance on the BACE, BBBP, ESOL and Lipop use cases, which can be ascribed not only to the fingerprint descriptors, but also to the superiority of this type of algorithm in low-data regimes. Conversely, deep learning algorithms only became competitive on the HIV dataset, and in the prediction of molecular weight and number of atoms when datasets contained >1000 training examples. Albeit previously reported, this result further reinforces the deficiencies of representation learning as a generally applicable solution to accelerate molecular medicine. A similar pattern of low performance was observed when using scaffold splits to assess model generalization to both unseen scaffolds and activity-cliff molecules. In this case, the result was not entirely unexpected. One can speculate that the reason lies not only in the customary low abundance of training data, but also in a data-shift issue. In fact, applying learning algorithms to previously unseen scaffolds likely imposes a distribution mismatch and a higher likelihood of mispredictions. This mismatch is often encountered in real-world drug discovery programs, as molecular design can change dramatically over a project timeline. Experimentally, testing chemical entities that differ significantly from prior knowledge can increase attrition, akin to using models outside their domain of applicability. It is thus understandable that learning algorithms underperform with scaffold splits in comparison to random splits, where no development timelines are taken into account in the splitting routine.
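To make the baseline and splitting choices concrete, the sketch below illustrates, under stated assumptions, how such a comparison could be set up: a random forest trained on circular (Morgan/ECFP-like) fingerprints and evaluated under both a random split and a Bemis–Murcko scaffold split. It is a minimal illustration rather than the authors’ actual protocol; the file name bioactivity.csv, the column names and the hyperparameters are hypothetical, and RDKit and scikit-learn are assumed to be installed.

```python
# Minimal sketch (not the authors' code): RF on circular fingerprints,
# compared under a random split and a scaffold split.
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Scaffolds import MurckoScaffold
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def featurize(smiles_list, radius=2, n_bits=2048):
    """Circular (Morgan) bit fingerprints for a list of SMILES strings."""
    fps = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        fps.append(np.array(AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)))
    return np.vstack(fps)

def scaffold_split(df, test_frac=0.2):
    """Hold out whole Bemis-Murcko scaffold families so test scaffolds are unseen."""
    scaffolds = df["smiles"].apply(lambda s: MurckoScaffold.MurckoScaffoldSmiles(smiles=s))
    groups = sorted(scaffolds.groupby(scaffolds).groups.values(), key=len, reverse=True)
    n_train = int((1 - test_frac) * len(df))
    train_idx, test_idx = [], []
    for idx in groups:  # large scaffold families fill the training set; rare ones are held out
        (train_idx if len(train_idx) < n_train else test_idx).extend(idx)
    return df.loc[train_idx], df.loc[test_idx]

df = pd.read_csv("bioactivity.csv")  # hypothetical file with "smiles" and "activity" columns
X, y = featurize(df["smiles"]), df["activity"].values

# Random split: test molecules resemble the training distribution.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)
print("random split RMSE:", mean_squared_error(y_te, rf.predict(X_te), squared=False))

# Scaffold split: test molecules carry unseen scaffolds, mimicking the distribution
# shift encountered as molecular design drifts over a project timeline.
train_df, test_df = scaffold_split(df)
rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(featurize(train_df["smiles"]), train_df["activity"].values)
pred = rf.predict(featurize(test_df["smiles"]))
print("scaffold split RMSE:", mean_squared_error(test_df["activity"].values, pred, squared=False))
```

Because whole scaffold families are held out, the scaffold-split test set probes extrapolation to unseen chemotypes, which is why performance under this protocol is typically lower than under a random split.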

When analyzing RFs, it was also found that no descriptor set works satisfactorily on all predictive tasks, indicating that feature engineering and the development of molecular representation toolkits remain, and will continue to be, an active topic in computational medicinal chemistry. Another particularly interesting issue discussed by Wang and colleagues is the empirical binning of continuous bioactivity readouts – with enormous loss of information – to obtain classifiers rather than regressors. Arguably, the latter require more training data, which is sometimes incompatible with experimentation costs. In the case of classifiers, the authors also discuss uneven (so-called imbalanced) label distributions and the most appropriate metrics for model assessment to avoid erroneous or skewed comparisons15. As noted, the area under the receiver operating characteristic curve is commonly used to gauge the performance of classifiers. However, it can be overly optimistic under imbalanced label distributions. In such scenarios, the precision–recall curve is advisable as it focuses on the minority class.
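As a brief, self-contained illustration of that point (synthetic data only; the pIC50 simulation, the activity cutoff and all parameters are assumptions made for this example), the snippet below bins a continuous readout into a heavily imbalanced active/inactive label and contrasts ROC-AUC with the area under the precision–recall curve (average precision) from scikit-learn.

```python
# Illustrative sketch on synthetic data: binning a continuous readout and
# comparing ROC-AUC with the precision-recall view under class imbalance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 50))                      # stand-in molecular descriptors
pic50 = 5.0 + X[:, 0] + 0.5 * rng.normal(size=n)  # simulated continuous bioactivity readout

# Empirical binning discards the assay's dynamic range; here a hypothetical
# cutoff of pIC50 >= 7 (IC50 <= 100 nM) defines the "active" minority class.
y = (pic50 >= 7.0).astype(int)
print(f"active fraction: {y.mean():.1%}")

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# ROC-AUC is dominated by the abundant inactives and can look flattering;
# average precision (area under the precision-recall curve) tracks the minority class.
print("ROC-AUC:", round(roc_auc_score(y_te, scores), 3))
print("PR-AUC :", round(average_precision_score(y_te, scores), 3))
```

When actives are rare, the two metrics typically diverge because the baseline of the precision–recall curve equals the prevalence of the minority class, whereas the ROC baseline does not depend on class balance.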

Overall, the team highlights numerous methodological shortcomings in ML toolkits and practices that the community as a whole must strive to change. Further, they speculate that self-supervised learning can bypass the need for human annotations and expensive experimentation, and hint that contrastive self-supervised learning might be applicable to small datasets in drug discovery. Indeed, the presented data partly counter the current enthusiasm around deep learning by showing that tree-based methods with fixed representations are likely still the best option for property prediction. Albeit surprising to some, the report by Wang and team should further spur investigations in a quest to make representation learning more competitive and better suited to real-world molecular medicine.