“I didn’t think we would get to this point in my lifetime.” That’s how one research leader in structural biology responded to last week’s publication of research in which artificial intelligence (AI) was used to predict the structure of more than 20,000 human proteins, as well as that of nearly all the known proteins produced by 20 model organisms, including Escherichia coli, fruit flies, yeast, soya bean and Asian rice. That is a combined total of around 365,000 predictions [1].
The data, publicly accessible for the first time (see https://alphafold.ebi.ac.uk), were released online on 22 July by researchers at DeepMind, a London-based AI company owned by Google’s parent company, Alphabet, and the European Bioinformatics Institute, based at the European Molecular Biology Laboratory (EBI-EMBL) near Cambridge, UK.
The DeepMind team developed a machine-learning tool called AlphaFold. The team trained this program on DNA sequences, including their evolutionary history, and the already-known shapes of tens of thousands of proteins contained in a public-access database of proteins hosted by the EBI-EMBL researchers. A week earlier, DeepMind also released the source code for AlphaFold and detailed how it was constructed [2]. At the same time, researchers from the University of Washington, Seattle, published details of another protein-structure prediction program, inspired by AlphaFold, called RoseTTAFold [3].
The unveiling of this catalogue of predicted structures would not be nearly such good news were the data and the methodology not open and freely available. Structural biologists and other researchers are already starting to use AlphaFold to obtain more-accurate models for proteins that have been difficult or impossible to characterize by current experimental methods.
Speeding up structure prediction
Predicting the 3D shape that proteins fold into has been one of biology’s unsolved ‘grand challenges’ since the discovery in 1953 of the structure of DNA itself. Before AI, structure prediction from sequence was an intensely time-consuming, not to say labour-intensive, process with little guarantee of getting an accurate result. The new data will still need to be validated and experimentally verified. But the AI tools can accurately predict protein structures in minutes to hours — compared with the months, or years, that it used to take to determine the structure of just one or two proteins. And that opens up possibilities for applications, for example in the engineering of enzymes to break down environmental pollutants such as microplastics.
Last week’s breakthrough depended not just on the sharing of open data, but on advances in fundamental science and technology. Since the 1960s, structural biologists have worked on parallel approaches to understanding the science of protein folding. One involves piecing together the structures of proteins by understanding the underlying physical forces. Another attempts to predict the shapes by making comparisons with closely related proteins, using an organism’s evolutionary history. And then there’s been the all-important role of imaging technologies, starting with X-ray crystallography and now cryo-electron microscopy.
In the basic science of structural biology, key problems remain to be solved. Although AI in science and technology is good at producing accurate results, it doesn’t (at least for now) explain how, or why, those results happened. The teams at DeepMind, EBI-EMBL, the University of Washington and elsewhere should be congratulated for crucial breakthroughs. But there is still work to be done to unlock the science — the essential biology, chemistry and physics — of how and why proteins fold.
Public and private
In terms of significance, some are comparing the latest advances to the first draft human genome sequence 20 years ago. And it’s true that there are comparisons to be made. Both the Human Genome Project and DeepMind’s catalogue of human protein-structure predictions equip their fields with a tool that is set to markedly accelerate discovery.
The human genome’s first draft was the result of a race. Solving protein folding has also benefited from a kind of competition — an annual event called the Critical Assessment of Protein Structure Prediction (or CASP), which has been essential to getting a result.
Today’s research teams — just like those involved in early genome sequencing — needed open access to data. In making the data and the methodology openly available to all, DeepMind now sets a benchmark that will make it harder for other corporations in this space, such as Facebook and Microsoft, to continue arguing for proprietary data.
And so, what of the future? Over the past week, Nature interviewed nearly a dozen researchers in the field. The consensus is that it’s too early to predict exactly what impact the application of AI in the life sciences will have, except that any impact will be transformative.
Accurately predicting how AI will change biology needs good training data, which we don’t yet have. But in AI, the structural-biology research community — and its collaborators in other fields — have a vast trove of fresh data. In addition to its research and data, AI provides a window into models for research organization and management that universities should study. For today’s researchers, and those in future generations, there is much work to follow up on.