Active machine learning helps drug hunters tackle biology

A growing cadre of startups is pursuing iterative cycles of machine learning, wet-lab experimentation and human feedback to accelerate drug-target discovery.

Over the past year, a new brand of drug discovery venture has been gaining momentum. December saw the formal launch of Cellarity, a drug discovery startup that uses machine learning to analyze integrated single-cell sequencing data. But computational biology is just part of Cellarity’s strategy. The company, powered by $50 million in backing from Flagship Pioneering, is also building out extensive wet-lab facilities, which will churn out biological data for their algorithms to analyze and interpret — generating insights that can in turn lead to better models and smarter experiments. Cellarity is one of several new startups working to apply what some are calling ‘active learning’ to elucidate new drug-target biology.

Human–machine iterations are enabling a new way of doing biology. Credit: Sean Prior / Alamy Stock Photo

Active learning is a subfield of machine learning that focuses on algorithms that select the data they need to improve their own models. In contrast to conventional machine learning, which linearly creates models from a sample of data and then uses those models for purposes such as classification, clustering or prediction, active machine learning is iterative. This means that it incorporates a component that characterizes what sorts of experiments or data would best improve the model further. Using this approach, a drug discovery team can use hypotheses generated from the machine-learning models to plan and execute subsequent experiments, yielding new data that are in turn used to update the model.
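
As a rough illustration of that loop, the Python sketch below runs a toy version of one common active-learning strategy, uncertainty sampling. Everything in it is an assumption made for illustration rather than any company's method: the one-dimensional "compound" feature, the hidden activity threshold of 0.62, and the function names `run_experiment`, `fit_threshold` and `most_informative` are all hypothetical.

```python
import random

def run_experiment(x, true_threshold=0.62):
    """Hypothetical stand-in for a wet-lab assay: True means 'active'."""
    return x > true_threshold

def fit_threshold(labeled):
    """Toy 1-D model: midpoint between the highest inactive and lowest active point."""
    inactive = [x for x, y in labeled if not y]
    active = [x for x, y in labeled if y]
    lo = max(inactive) if inactive else 0.0
    hi = min(active) if active else 1.0
    return (lo + hi) / 2.0

def most_informative(pool, threshold):
    """Uncertainty sampling: pick the untested candidate nearest the decision boundary."""
    return min(pool, key=lambda x: abs(x - threshold))

def active_learning(pool, n_rounds):
    """Alternate between choosing an experiment, running it and refitting the model."""
    pool = list(pool)
    labeled = []
    threshold = fit_threshold(labeled)
    for _ in range(n_rounds):
        if not pool:
            break
        x = most_informative(pool, threshold)      # model proposes the next experiment
        pool.remove(x)
        labeled.append((x, run_experiment(x)))     # the "bench" returns a result
        threshold = fit_threshold(labeled)         # model is updated with the new data
    return threshold, labeled

rng = random.Random(0)
candidates = [rng.random() for _ in range(1000)]   # 1,000 untested candidates
estimate, tested = active_learning(candidates, n_rounds=12)
print(f"estimated threshold after {len(tested)} experiments: {estimate:.3f}")
```

In this toy setting, a dozen well-chosen experiments are enough to locate the boundary closely, whereas a conventional one-pass approach would typically need labels for a much larger sample of the 1,000 candidates to do as well. Real discovery problems are far higher-dimensional and noisier, so the sketch shows only the shape of the loop.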

Cellarity is the latest in a growing array of companies that use such an approach to accelerate drug development. These include Exscientia, LabGenius, Insitro and Recursion Pharmaceuticals. “There are certain areas of scientific discovery that are fundamentally too complex for the human brain to understand, or even for groups of humans to collectively understand,” says James Field, founder and CEO of LabGenius. Like Cellarity, his company is betting on the notion that having humans work together with algorithms to interpret data from well-designed experiments could break through these conceptual logjams to identify new drug targets, pharmaceuticals and disease-relevant biological pathways.

This strategy has already generated buy-in from pharma. For example, Exscientia — one of the first companies to enter this space, having spun off from the University of Dundee in 2012 — has active collaborations with Celgene, Sanofi and GlaxoSmithKline. Peter Crane, corporate strategy manager at lab automation software company Synthace, has noted the surge in interest and believes that technological capabilities have finally reached a point where such hybrid human–machine efforts are not only practical, but also sufficiently powerful to begin tackling tough biomedical problems. “A variety of methods has now emerged that, when connected together, can enable this new way of doing biology,” says Crane.

Gisbert Schneider, who studies computer-assisted drug design at ETH Zurich, notes that medicinal chemists have been employing the statistical analysis techniques underlying machine learning since the late 1970s. And today, such algorithms have matured as an effective tool for accelerating key steps in the process such as compound design or identification of promising lead candidates. “Quite frankly, I would trust a machine-learning tool here more than a group of medicinal chemists, even being a medicinal chemist myself,” says Schneider. However, it has proven far more challenging to leverage similar algorithmic approaches to characterize disease biology and to assess drug efficacy.

One of the critical factors determining an algorithm’s success or failure is the data that users feed into it. When Insitro founder Daphne Koller began working in computational biology some 20 years ago, this was a near-insurmountable problem. “The datasets were just not very good — they were small, ridden with artifacts, and barely fit for purpose for the problem that you wanted to solve,” she says. Even within a single drug company, high-quality data from different programs might be locked into separate silos, says Cellarity’s chief digital and data officer, Milind Kamkolkar. He adds that, in many cases, computational experts operate in relative isolation from the clinicians, chemists and biologists with the deepest grasp of the disease being studied. “The informatics team is often the last team to understand the scientific hypotheses or why they’re looking at this particular data,” says Kamkolkar.

This new generation of machine-learning-assisted drug discovery startups now aims to overcome these challenges through tight integration of computation and benchwork, with close collaboration between both sets of experts. On-site data generation is a critical component of this process. “We built our own biology lab, where we do protein production, structural genomics, biophysical screening and pharmacology,” says Andrew Hopkins, founder and CEO of Exscientia. “For many projects, we are the first people to screen against a given target.”

The data being generated vary from company to company. LabGenius is focused on protein engineering efforts, with experimental processes that can identify new protein therapeutics and optimize the biochemical and pharmacological properties of existing proteins. Recursion’s strategy emphasizes cell and tissue imaging, and the company claims to have assembled the world’s largest collection of such data. Other companies, like Insitro and Celsius Therapeutics, are relying on single-cell transcriptome analysis to understand how different cell types behave in disease and health, or in response to specific perturbations. “We have a high-throughput biology lab that is essentially a factory for data, using human-derived induced pluripotent stem cells that span a spectrum of human genetic diversity,” says Insitro’s Koller, noting that they make extensive use of CRISPR technology to test the effects of known disease-related mutations and also work with relevant donor-derived specimens.

These results are then fed into analytical pipelines. Some of the goals here are relatively conventional in the computational biology world — for example, classifying distinct cell types or states on the basis of gene expression profiles. But these algorithms can also be used to assemble more extensive descriptions of cellular states that incorporate multiple data modalities such as imaging, genomics, transcriptomics and proteomics into a single, detailed representation. Cellarity calls such representations ‘maps’, which can in turn be used to identify signposts of healthy or abnormal function. “We can’t possibly understand every little detail of what’s going on everywhere,” says chief technology officer Chad Nusbaum. “We’re looking for biomarkers of the behavior of that cell in a healthful and an unhealthful state, and to predict what’s going to encourage the cell to go to the healthful state.”

Although potentially informative, such models are just educated guesses until experimentally validated. But rather than testing their models with massive high-throughput screens that may ultimately prove fruitless, these startups again leverage machine learning to plan next steps.

“Active learning is about asking which compounds I should make and test that would be most informative to move my project forward,” says Hopkins. “It’s all about data acquisition.” Rather than thousands of compounds or more, these algorithms might propose a far smaller number of experiments, including drug candidates predicted to fail as well as those expected to succeed. This makes active learning a potent hypothesis-generation and testing engine that can help drug developers focus their efforts. “Eliminating the ‘bad apples’ early is where I see the main advantage of active learning,” says Schneider. The proposed experimental plan then goes back to the experimentalists at the bench, kicking off a new cycle of testing and analysis.
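
One way to picture why a proposed batch might include likely failures is the sketch below, which splits a small batch between the top predicted candidate and the candidates on which an ensemble of models disagrees most. This is an illustrative assumption, not a description of Exscientia's or anyone else's algorithm: the compound names, the prediction scores and the `propose_batch` function are all hypothetical.

```python
from statistics import mean, pstdev

def propose_batch(predictions, n_exploit, n_explore):
    """Pick a small batch from ensemble predictions (hypothetical scoring scheme).

    predictions maps a candidate name to the activity scores predicted by
    several models. 'Exploit' slots go to the highest mean prediction;
    'explore' slots go to the candidates the models disagree on most,
    whatever their mean, because testing them should teach the model the most.
    """
    stats = {name: (mean(p), pstdev(p)) for name, p in predictions.items()}
    by_mean = sorted(stats, key=lambda n: stats[n][0], reverse=True)
    exploit = by_mean[:n_exploit]
    remaining = [n for n in stats if n not in exploit]
    by_disagreement = sorted(remaining, key=lambda n: stats[n][1], reverse=True)
    explore = by_disagreement[:n_explore]
    return exploit, explore

# Made-up scores from a three-model ensemble (1.0 = confidently active).
ensemble_predictions = {
    "cmpd_A": [0.91, 0.88, 0.93],   # models agree: likely active
    "cmpd_B": [0.15, 0.80, 0.30],   # models disagree strongly
    "cmpd_C": [0.05, 0.07, 0.04],   # models agree: likely inactive
    "cmpd_D": [0.60, 0.62, 0.58],   # models agree: borderline
    "cmpd_E": [0.10, 0.55, 0.20],   # models disagree
}
exploit, explore = propose_batch(ensemble_predictions, n_exploit=1, n_explore=2)
print("expected to succeed:", exploit)
print("most informative:", explore)
```

Here the two "explore" picks both have mean predicted scores below 0.5, so on balance they are expected to fail, yet they are the experiments whose results would most reduce the models' disagreement.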

This iterative development is an essential component of the process. “AI-based drug design is not going to just give you a drug straight off; that’s still science fiction,” says Hopkins. At Celsius, for example, researchers carefully analyze the distinct cell types and gene expression profiles identified in initial single-cell transcriptomic experiments and then test those models by using tools like CRISPR to introduce targeted genomic manipulations that might confirm or invalidate potential drug targets in disease-affected cells. “It helps generate these hypotheses faster and then test them; it’s a lot faster to push electrons around than cells,” says chief data officer Gregory Ryslik. And as the algorithms and workflows improve, the process can achieve greater efficiency; Hopkins notes that since Exscientia’s launch in 2012, the company has cut the number of drug leads typically needed to obtain a candidate suitable for clinical development from around 2,500 to 400 or fewer.

These gains can be notably accelerated if automation is brought into play. In this scenario, the actual design of new experiments still falls to human scientists, but once the workflow is formalized and validated, it can easily be automated. At Insitro, Koller notes that they can now perform experiments within several weeks that would take years for most academic labs, while also eliminating the inherent variability that arises even with skilled and experienced bench scientists. LabGenius’s Field says that “these platforms change what it means to be a scientist … with the object of maximizing time spent doing creative thinking and minimizing time spent executing in the lab.”

This process can help researchers optimize multiple aspects of the drug development process in parallel, for example by simultaneously selecting compounds with optimal pharmacokinetic and toxicological properties as well as selective and strong on-target therapeutic effects. Many companies in this space are initially going after difficult clinical targets, including complex multifactorial diseases like inflammatory bowel disease or cancer, where the root causes may be poorly understood, and which may require carefully calibrated treatment with multiple drugs. “I believe that with our approach, we can really tackle many different networks that affect the behavior of the cell,” says Cellarity president Cristina Rondinone. She also notes that this approach could put drug companies in a position to engage in discovery science in the realm of disease research, an area that is still chiefly the domain of academia.

It is neither cheap nor straightforward to build strong capabilities in both wet lab experimentation and computational biology in parallel. On the one hand, the core techniques for machine learning are well established, and the broad availability of cloud computing has largely eliminated the need for massive investments in computational infrastructure. But, on the other hand, Koller points out that the scale and complexity of these data can quickly overwhelm conventional analytical pipelines. “We had to rewrite the machine learning algorithms because they’re built for microscope images of 256 by 256 pixels, and ours are 20,000 by 80,000,” she says. “And when you think about the multimodal nature of the models that we have to build, very few machine-learning tasks involve that level of challenge.”

Team building can be an even tougher task. Some of the laboratory techniques being employed, such as single-cell RNA-seq, are sufficiently cutting-edge that it can be hard to find researchers with deep experience. And, on the computational side, companies are not just looking for data scientists, coders and engineers with the skills needed to push the limits of machine learning capabilities, but also competing with the deep pockets of tech titans like Google. “Are pharma companies really prepared to have a head of deep learning who is earning more than their head of drug discovery?” asks Crane. A culture of collaboration is also essential, lest these ambitious startups lapse into the siloed organizational structures that have limited the impact of machine learning in drug development in the past.

But this work and investment could ultimately pay off massively if it succeeds in improving the efficiency — and thus reducing the cost — of drug development. Koller thinks the benefit is clearly worth the risk. “The front end is the wrong place to save,” she says. “If a drug discovery program costs well over $1 billion, you don’t want to save $10–15 million on a front-end pipeline that would create the right dataset and make the right predictions.” The first products of this approach are now starting to trickle into the clinical pipeline, most notably a drug candidate for obsessive–compulsive disorder developed by Exscientia for Sumitomo Dainippon Pharma, which entered phase 1 testing earlier this year. However, it remains unclear what impact this development process will have on success rates in the clinic — particularly in the later phases of efficacy testing, where seemingly promising drug candidates routinely falter. “We are working at the edge of chaos, where we’re interfering with living, adaptive systems,” cautions Schneider. “It would be wrong to hope for perfect predictions.”

Nevertheless, a growing number of major industry players are proving eager to explore this approach. For example, Celsius is now working with Janssen on identifying biomarkers of response to a combination therapy for ulcerative colitis, and Insitro is gearing up to present preliminary data from a collaboration with Gilead to devise new drugs for the notoriously difficult-to-treat liver disorder nonalcoholic steatohepatitis. But these startups are not merely looking to act as collaborative service providers. “At our core, we are a precision medicine company, and our long-term goal is to develop and commercialize therapies independently,” says Celsius CEO Tariq Kassum.

Indeed, Hopkins sees this approach as the natural next evolutionary step for the industry. He cites the example of chess grandmaster Garry Kasparov, who, after being bested by IBM’s Deep Blue computer, teamed up with the AI to achieve a gameplay style more formidable than either man or machine alone. “This is not about replacing the drug designer,” he says. “This is about enhancing their abilities and increasing their probability of getting a drug into clinical trials.”

Eisenstein, M. Active machine learning helps drug hunters tackle biology. Nat Biotechnol 38, 512–514 (2020).
