Scientists are training ever more sophisticated AI models on vast, unlabeled datasets to ‘interpret’ biological data and help guide biomolecule design.
AI methods similar to those that produced GPT-4 are now yielding ‘foundation models’ that can make sophisticated predictions about the structure and function of DNA, RNA and protein, and enable the generation of novel biomolecules based on real-world biological and evolutionary principles.
This past July, for example, techbio startup EvolutionaryScale posted a preprint describing ESM3, an AI algorithm that leverages vast quantities of structural, sequence and functional data to guide protein design.
As a demonstration, the company produced an ESM3-generated green fluorescent protein (GFP) that shares just 58% of its amino acid sequence with any known GFP while still delivering brightness equivalent to that of naturally occurring fluorescent proteins. “It can really solve these very complex combinations of prompts,” says EvolutionaryScale cofounder and chief scientist Alexander Rives. “You can give it these atomic-level instructions and you can also give it kind of high-level keywords, and the model figures out how to generate something coherent that respects those prompts.” This release coincided with the announcement that the company had raised $142 million in seed funding from backers including Nvidia and Amazon.
ESM3 is just one of a steadily growing number of industry and academic efforts to build AI models that can derive fundamental biological principles from protein, RNA, DNA or even cell imaging data, and then use those principles to guide diverse analytical or design tasks. The players range from relatively young AI-focused drug discovery companies such as Deep Genomics, Recursion and BioMap to established pharma giants like GlaxoSmithKline and tech titans like IBM, Microsoft and Nvidia.
Many of these models are described as foundation models. This term, coined in 2021 by researchers at the Stanford Institute for Human-Centered Artificial Intelligence, broadly describes AI models that acquire the capacity to perform well across a range of different problems through a process of ‘pretraining’ on enormous unannotated datasets. This is in contrast to most machine learning and deep learning algorithms in the biology world, which are typically focused on very specific tasks.
“A year and a half ago in our pitch, we would proudly talk about how we had 40 different machine learning models,” says Brendan Frey, founder and chief innovation officer of RNA drug discovery company Deep Genomics. Maintaining, scaling and integrating disparate models for predicting distinct yet intersecting biological processes such as RNA transcription, splicing or polyadenylation proved too challenging, however. As a solution, the company developed BigRNA, a foundation model that can make robust predictions about all these aspects of RNA biology — and more.
Many non-biologists already have firsthand experience with foundation models — GPT-4 is probably the best-known mainstream example. As a large language model (LLM), GPT-4 was pretrained on vast swathes of internet text, enabling it to identify complex patterns within the training data. The resulting model uses these patterns to forge statistical associations that make it possible to predict appropriate responses to user-generated prompts — essentially, a far more sophisticated version of the autocomplete function used by smartphones and search engines. This predictive capability enables GPT-4 to tackle diverse writing tasks for which it was not explicitly designed, ranging from writing poetry to generating computer code. Foundation models for biology follow the same premise, but instead of human language, the ‘text’ is based on protein, DNA or RNA. Most — but not all — current biological foundation models are built on an LLM framework, analogous to that of GPT-4.
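To make the ‘autocomplete’ analogy concrete, the sketch below shows, in deliberately toy form, what a next-token pretraining objective looks like when the ‘text’ is a protein sequence. The tiny transformer, the three example sequences and all hyperparameters are illustrative stand-ins, not the design of GPT-4, ESM3 or any other model discussed here.

```python
# Minimal sketch of next-token ("autocomplete") pretraining on protein sequences.
# Everything here is toy-scale and illustrative only.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
stoi = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode(seq: str) -> torch.Tensor:
    return torch.tensor([stoi[aa] for aa in seq], dtype=torch.long)

class TinyProteinLM(nn.Module):
    def __init__(self, vocab=20, d_model=64, nhead=4, nlayers=2, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(d_model, vocab)  # scores for the next residue

    def forward(self, tokens):  # tokens: (batch, length)
        L = tokens.size(1)
        x = self.embed(tokens) + self.pos(torch.arange(L, device=tokens.device))
        # Causal mask: each position only "sees" the residues before it.
        mask = nn.Transformer.generate_square_subsequent_mask(L).to(tokens.device)
        return self.head(self.encoder(x, mask=mask))

# Toy "training set"; real pretraining uses millions of unannotated sequences.
seqs = ["MKTAYIAKQR", "MVLSPADKTN", "MGDVEKGKKI"]
batch = torch.stack([encode(s) for s in seqs])

model = TinyProteinLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
logits = model(batch[:, :-1])  # predict residue i+1 from residues 1..i
loss = nn.functional.cross_entropy(logits.reshape(-1, 20), batch[:, 1:].reshape(-1))
loss.backward()
opt.step()
print(f"toy pretraining loss: {loss.item():.3f}")
```

Repeating this loop over millions or billions of unannotated sequences is where the data and compute demands described below come from.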
The pretraining process associated with building these models is laborious and expensive. “Size absolutely does matter — the more data you can throw at these things, the more capable they become,” says Kimberly Powell, vice president of healthcare at Nvidia. She also highlights the importance of diversity — for example, using data from a broad range of species or cell types to minimize potential bias. In most cases, these training data are unstructured and lack manual annotation, simply because carefully curated data are generally not available at the required scale — encompassing many millions or billions of proteins, gene sequences or other biological data points. In many cases, these data are harvested from publicly accessible repositories like the Protein Data Bank, although some companies have opted to generate their own internal datasets for pretraining purposes.
But there are other important considerations. LLM-based AI models use certain variables, known as ‘parameters’, to classify and process data, and the number of parameters informs the quality of the resulting analyses. GPT-4, for example, uses upwards of 1 trillion parameters, and several biological foundation models are now approaching that number. Data processing at this scale requires substantial computing power. To develop ESM3, Rives and colleagues benefited from earlier models developed while working at Meta — a company with massive computing resources. Le Song, chief technology officer at biotech company BioMap, estimates that his company needed three to four months of continuous training with 800 graphical processing units (GPUs) to develop their 100-billion-parameter xTrimoPGLM protein foundation model. GPUs are powerful computing components, produced by companies including Nvidia, that have proven essential to meet the rigorous analytical demands associated with the development of AI models.
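As a rough illustration of what a parameter count actually measures, and nothing more than that, the snippet below tallies the trainable weights of two toy transformer encoders; real foundation models reach billions of parameters by scaling up the same basic recipe of wider layers and deeper stacks.

```python
# Rough sketch: counting trainable parameters, the figure quoted when a model is
# described as having, say, 100 billion parameters. Sizes here are toy-scale.
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def toy_encoder(d_model: int, nlayers: int, nhead: int = 8) -> nn.Module:
    layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=4 * d_model,
                                       batch_first=True)
    return nn.TransformerEncoder(layer, nlayers)

print(f"small encoder : {count_params(toy_encoder(256, 4)):>12,} parameters")
print(f"larger encoder: {count_params(toy_encoder(1024, 24)):>12,} parameters")
# Widening and deepening the stack (plus larger vocabularies and context windows)
# is what pushes counts into the billions, and why training runs on GPU clusters.
```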
Such colossal training needs pose a significant barrier to entry, making foundation model development a costly and technologically demanding process that may be out of reach for many research groups or organizations. But by operating at this scale, one can generate models that exhibit emergent properties that start to resemble scientific reasoning. “It gets better and better at making predictions for things it’s never seen before,” says Frey of his experiences with his company’s BigRNA foundation model. “It’s a weird new era of machine learning.”
In some cases, the resulting pretrained model can itself be used out of the box. But more often, the models are subjected to further, specialized training with much smaller, targeted datasets to hone specific ‘skills’. Song compares foundation model pretraining to the education received in primary school, which instills fundamental knowledge that will eventually be essential for college and postgraduate training. “Those foundations you have are accelerating you becoming an expert in particular areas,” says Song. For foundation models, these areas can include predictive tasks, in which the model can be used to extrapolate insights into biological properties like protein structure or gene expression levels, as well as generative tasks, like the design of new protein or oligonucleotide therapeutics.
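A minimal sketch of the pretraining-then-specialization workflow Song describes: a pretrained backbone (represented below by a randomly initialized stand-in) is frozen, and only a small task-specific head is trained on a modest labeled dataset. The task, dimensions and data are all hypothetical.

```python
# Sketch of specialization ("fine-tuning"): freeze a pretrained backbone and
# train only a small task head on a modest labeled dataset. The backbone here
# is a random stand-in for a real pretrained foundation model.
import torch
import torch.nn as nn

embedding_dim, n_classes = 64, 2

pretrained_backbone = nn.Sequential(  # stand-in for a real pretrained encoder
    nn.Embedding(20, embedding_dim),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(embedding_dim, 4, batch_first=True), 2),
)
for p in pretrained_backbone.parameters():
    p.requires_grad = False  # keep the "primary school" knowledge fixed

task_head = nn.Linear(embedding_dim, n_classes)  # e.g. stable vs. unstable protein
opt = torch.optim.Adam(task_head.parameters(), lr=1e-3)

# Tiny labeled batch: token-encoded sequences plus class labels (both synthetic).
tokens = torch.randint(0, 20, (8, 50))
labels = torch.randint(0, n_classes, (8,))

features = pretrained_backbone(tokens).mean(dim=1)  # one embedding per sequence
loss = nn.functional.cross_entropy(task_head(features), labels)
loss.backward()
opt.step()
print(f"fine-tuning loss on the toy task: {loss.item():.3f}")
```

The appeal of this split is that the expensive ‘primary school’ stage is paid for once, while each new specialization needs only a comparatively small amount of labeled data and compute.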
Early work in biological foundation models has centered on the generative design of proteins. For example, the BioMap team has shown that their protein language model xTrimoPGLM can predict a protein’s structure, stability and capacity to interact with other molecular targets. This in turn enables scientists to design customized molecules with specific functions and properties. BioMap has multiple external collaborations to exploit their model for AI-guided protein design. These include an antibody engineering partnership with Sanofi that kicked off in October 2023, which could ultimately prove worth up to $1 billion.
ESM3 is unusual in that it makes use of annotated rather than unlabeled data and threads together multiple types of data — structural, functional and sequence — from nearly 2.8 billion proteins to extrapolate and recapitulate real-world evolutionary processes. “You can imagine at that scale just this tremendous number of experiments being run in parallel by nature,” says Rives. “We observe the outcome of each of those experiments, and it’s reflected in the patterns in the sequences.” This informed the choice of GFP as a test case: although numerous GFP relatives exist in nature, efforts to develop new family members in the lab have mainly involved small-scale tweaks and refinements. In contrast, the esmGFP protein generated with ESM3 is sufficiently distinct from GFP that it would have required roughly 500 million years to evolve naturally.
Similar models can also be aimed at other classes of biomolecules. For example, some companies are making headway in using foundation models to predict properties of RNA, as well as factors that influence healthy versus pathological gene expression. The Deep Genomics team trained their BigRNA model on unannotated genome and transcriptome data from dozens of tissues from 70 human donors. In a September 2023 bioRxiv preprint, they demonstrate that BigRNA can accurately predict the intron–exon structure and expression levels for genes that it has never encountered before and is even able to identify oligonucleotide sequences that could potentially help normalize gene expression in various genetic disorders if delivered as antisense therapies. “That shocked me … because BigRNA was never trained on oligo data,” says Frey.
The design of RNA therapeutics is the central goal for Atomic AI’s ATOM-1 foundation model, which was described in a December 2023 bioRxiv preprint. Founding scientist and head of machine learning Stephan Eismann says that the research community has struggled to predict the structure and physicochemical properties of RNA-based drugs and their targets. “The amount of 3D structures deposited in the Protein Data Bank for RNA is orders of magnitude smaller than for proteins,” he says. To address this, Atomic AI scientists performed extensive chemical mapping experiments, using reagents that introduce structure-dependent patterns of RNA nucleotide modification that can subsequently be deciphered via sequencing. They were able to use the resulting model to accurately predict RNA folding, as well as other characteristics like stability in solution, and Eismann believes it should be reasonably straightforward to apply the same model for de novo design of RNAs based on detailed functional specifications.
Foundation models can also reveal broader elements of cellular-scale biology. For example, Arc Institute researchers Patrick Hsu and Brian Hie recently developed a powerful new model known as Evo, which was pretrained on unannotated genomic data from millions of different prokaryotic organisms and phages. “DNA is a very fundamental layer in biology,” says Hie. “Evo can be useful for tasks in RNA or tasks in proteins … but also eventually higher-order systems or pathways or operons and even at the organismal scale.” A preprint from this past February demonstrated that Evo is not only suitable for predictive work, such as modeling gene expression on the basis of patterns of regulatory sequences, but can also perform ambitious generative AI tasks. In one experiment, Evo produced a novel ‘genome’ spanning 650,000 base pairs that contained many fundamental features of naturally occurring genomes. “The proteins embedded within these genome-scale generations match to fundamental biochemical or biological functions,” says Hsu. “We also find tRNAs embedded inside of the sequences, and other kind of key sequence motifs that would make it possible for this to potentially be ‘alive’, if you will.”
London-based AI startup InstaDeep is also leveraging DNA’s power as a fundamental cellular ‘blueprint’ in their Nucleotide Transformer foundation model. In contrast to Evo, this model was trained on both prokaryotic and eukaryotic genomes. The genomes spanned 850 species, including 3,202 different human genomes. Thomas Pierrot, a research scientist at InstaDeep who has led the Nucleotide Transformer development effort, says that their algorithm has proven remarkably effective at detecting and predicting the activity of diverse genomic features, including splice sites, enhancers and other regulatory elements, across a range of different genomes with single-nucleotide resolution. In January 2023, the company was acquired by Mainz, Germany-based RNA vaccine pioneer BioNTech, and Pierrot says that Nucleotide Transformer is now being used for internal BioNTech projects as well as external collaborations in fields including agricultural science. “With the same model, you can screen mutations for cancer patients and also understand how to modify corn so that it’s not devastated by insects or things like that,” says Pierrot. “It’s very versatile.”
Even if DNA sequences contain the full recipe for life, there are limits to how much cellular- and tissue-scale biology can currently be extrapolated from genome-based models alone. Single-cell transcriptomic data are a powerful asset here, providing a real-time snapshot of many of the biological activities taking place in a cell at any given moment. A June publication from the BioMap team described scFoundation, a newly developed model based on transcriptomic data from 50 million human cells, which they used to tackle tasks ranging from cell type classification to predicting how different tissues respond to various drugs.
Recursion is pursuing a radically different approach with its Phenom family of foundation models, which are trained directly on image data rather than using the LLM framework favored by other groups. For their first-generation model, Phenom-beta, Recursion scientists used multichannel fluorescence imaging to measure how human vascular endothelial cells respond to CRISPR-based knockout of virtually every protein-coding gene, as well as treatment with varying doses of more than 1,600 different chemical agents. According to Imran Haque, senior vice president of AI and digital sciences, the resulting model can interpret how cells are responding to treatment with new drug candidates from both a mechanistic and a molecular perspective. “You can say that these two samples are very similar to each other, or that this particular compound has a similar effect on the cellular phenotype as this gene knockout,” says Haque. He notes that the model has proven robust in terms of accurately interpreting image formats that differ from those used in the training data — for example, brightfield microscopy — and his team is currently focused on extending Phenom to live-cell imaging data, as well as other cell types such as neurons.
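The comparisons Haque describes typically boil down to measuring similarity between image-derived embedding vectors; the sketch below illustrates the idea generically, with random vectors standing in for real Phenom-beta embeddings.

```python
# Sketch: comparing cellular phenotypes via embedding similarity. Random vectors
# stand in for image-derived embeddings from a phenomics model.
import torch
import torch.nn.functional as F

dim = 128
embed_compound_a = torch.randn(dim)  # wells treated with a candidate compound
embed_knockout_x = torch.randn(dim)  # wells with a CRISPR knockout of gene X
embed_untreated = torch.randn(dim)   # control wells

def similarity(a: torch.Tensor, b: torch.Tensor) -> float:
    """Cosine similarity: near 1.0 means near-identical phenotypes, near 0 unrelated."""
    return F.cosine_similarity(a, b, dim=0).item()

print(f"compound A vs knockout X: {similarity(embed_compound_a, embed_knockout_x):+.2f}")
print(f"compound A vs untreated : {similarity(embed_compound_a, embed_untreated):+.2f}")
# A high compound-vs-knockout similarity would suggest that the compound
# phenocopies loss of gene X, one hypothesis about its mechanism of action.
```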
The foundation model concept is very new, and the field is still struggling with challenges such as how to objectively benchmark these models’ performance. “If people don’t publish benchmarking results, then I don’t believe it,” says Frey. In the case of BigRNA, his team compared how their foundation model performed relative to Deep Genomics’ internally developed single-purpose machine learning models, as well as external models, and many other foundation model preprints also include similar assessments. However, there are still no standardized tests for making apples-to-apples comparisons, and Pierrot says that a proper battery of well-designed challenges will most likely be necessary to cover the full breadth of tasks that foundation models can perform. “Our goal is really to have something that can handle variability, that’s very flexible,” he says.
Fortunately, many of the early pioneers in this space have shown a commitment to open-source sharing, allowing other researchers to experiment with different foundation models in their own labs. “We’ve released the ESM3 open model for non-commercial use,” says Rives. And Nvidia has invested considerable resources into broadening access to a curated and optimized collection of foundation models for biological applications through their BioNeMo platform, including Recursion’s Phenom-beta. “We optimize it so that the runtimes of these models can be as cheap as they can be,” says Powell. “You can deploy them anywhere there’s a GPU — on a desk, in a data center, on any cloud.”
As more researchers begin to tinker with the models, Frey envisions opportunities ranging from accelerated discovery of new drugs for ultra-rare diseases to ‘multimodal’ models that integrate data from DNA, RNA, and protein to produce a more holistic perspective on the cellular interior. “Eventually, I would hope we’ll start peeling back the cover on systems biology,” he says.