Recent estimates indicate that the median cost of researching and developing a drug is over US$1 billion, with research and development taking an average of ten years. Using machine learning (ML) to accelerate research timelines and reduce development costs is therefore an appealing avenue, particularly for drug discovery centers in low- and middle-income countries (LMICs). To date, limited access to computational resources has slowed the implementation of ML-based approaches at research institutions in LMICs, and lowering these barriers is essential for enabling important discoveries in countries that are disproportionately impacted by infectious diseases. Recent work by Gemma Turon and colleagues proposed ZairaChem, an automated ML pipeline that is designed to be easily implemented on conventional computers in low-resource settings and that uses a range of chemical descriptors along with an ensemble of ML algorithms to train quantitative structure-activity/property relationship (QSAR/QSPR) models.
In ZairaChem, molecules are represented numerically using a combination of physicochemical parameters, 2D structural fingerprints, bioactivity profiles, graph-based embeddings, and chemical language models. Using such a wide variety of descriptors helps to improve the applicability of the approach over a broad range of tasks. Subsequently, a host of ML algorithms is applied using automated ML techniques that aim to yield accurate QSAR/QSPR models without human intervention, reducing the need for data science expertise. ZairaChem was applied to key assays in antimalarial and antitubercular drug discovery programs conducted at the Holistic Drug Discovery and Development (H3D) Center at the University of Cape Town in South Africa. In addition to selecting candidates with high activities against the intended targets, the high accuracy and hit enrichment of the proposed models reduced attrition rates of the existing experimental pipeline at H3D, suggesting that the approach could accelerate the bench-to-bedside turnaround time. The amount of required training data was also reduced: a small set of approximately 30 local molecules within a chemical space was shown to be sufficient for producing models that successfully prioritize compound designs for further synthesis. This work represents a step towards enabling more universal access to ML-based tools in LMICs, and towards drug discovery in low-resource settings more generally.
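The general idea of combining several molecular representations with an ensemble of models can be illustrated with a minimal sketch. This is not ZairaChem's code or API: the two descriptor functions are toy, string-level stand-ins for real physicochemical parameters and structural fingerprints, the nearest-centroid classifier is a placeholder for the automated ML step, and every name below (`hashed_ngram_fingerprint`, `simple_counts`, `NearestCentroid`, `train_ensemble`, `predict_active`) as well as the example molecules and labels are hypothetical.

```python
import hashlib
import math

def hashed_ngram_fingerprint(smiles, n=2, size=64):
    """Toy stand-in for a 2D fingerprint: hashed character n-grams of a SMILES string."""
    bits = [0.0] * size
    for i in range(len(smiles) - n + 1):
        digest = hashlib.md5(smiles[i:i + n].encode()).digest()
        bits[int.from_bytes(digest[:4], "big") % size] = 1.0
    return bits

def simple_counts(smiles):
    """Toy stand-in for physicochemical parameters: crude string-level counts."""
    return [
        float(len(smiles)),                             # string length as a size proxy
        float(sum(smiles.count(ch) for ch in "NnOo")),  # N and O characters
        float(sum(ch.isdigit() for ch in smiles)),      # ring-closure digits
    ]

class NearestCentroid:
    """Placeholder model (assumption): scores a molecule by its distance to class centroids."""

    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_proba(self, x):
        d0 = math.dist(x, self.centroids[0])  # distance to the inactive centroid
        d1 = math.dist(x, self.centroids[1])  # distance to the active centroid
        if d0 + d1 == 0:
            return 0.5
        return d0 / (d0 + d1)  # closer to the active centroid gives a higher score

DESCRIPTORS = [hashed_ngram_fingerprint, simple_counts]

def train_ensemble(smiles_list, labels):
    """Fit one model per descriptor type."""
    return [
        (describe, NearestCentroid().fit([describe(s) for s in smiles_list], labels))
        for describe in DESCRIPTORS
    ]

def predict_active(ensemble, smiles):
    """Average the per-descriptor scores into a single ensemble score in [0, 1]."""
    scores = [model.predict_proba(describe(smiles)) for describe, model in ensemble]
    return sum(scores) / len(scores)

# Hypothetical toy data: 1 = active, 0 = inactive (labels invented for illustration).
train_smiles = ["c1ccccc1N", "c1ccncc1O", "CCCC", "CCCCCC"]
train_labels = [1, 1, 0, 0]

ensemble = train_ensemble(train_smiles, train_labels)
print(predict_active(ensemble, "c1ccccc1N"))
```

Averaging the scores of models trained on different representations is only the simplest way to combine them; a real pipeline would use chemistry-aware descriptors, properly validated learners, and held-out data to assess accuracy and hit enrichment.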