The race to crack one of biology’s grandest challenges — predicting the 3D structures of proteins from their amino-acid sequences — is intensifying, thanks to new artificial-intelligence (AI) approaches.
At the end of last year, Google’s AI firm DeepMind debuted an algorithm called AlphaFold, which combined two techniques that were emerging in the field and beat established contenders in a competition on protein-structure prediction by a surprising margin. And in April this year, a US researcher revealed an algorithm that uses a totally different approach. He claims his AI is up to one million times faster at predicting structures than DeepMind’s, although probably not as accurate in all situations.
More broadly, biologists are wondering how else deep learning — the AI technique used by both approaches — might be applied to the prediction of protein arrangements, which ultimately dictate a protein’s function. These approaches are cheaper and faster than existing lab techniques such as X-ray crystallography, and the knowledge could help researchers to better understand diseases and design drugs. “There’s a lot of excitement about where things might go now,” says John Moult, a biologist at the University of Maryland in College Park and the founder of the biennial competition, called Critical Assessment of protein Structure Prediction (CASP), where teams are challenged to design computer programs that predict protein structures from sequences.
The latest algorithm’s creator, Mohammed AlQuraishi, a biologist at Harvard Medical School in Boston, Massachusetts, hasn’t yet directly compared the accuracy of his method with that of AlphaFold — and he suspects that AlphaFold would beat his technique in accuracy when proteins with sequences similar to the one being analysed are available for reference. But he says that because his algorithm uses a mathematical function to calculate protein structures in a single step — rather than in two steps like AlphaFold, which uses the similar structures as groundwork in the first step — it can predict structures in milliseconds rather than hours or days.
“AlQuraishi’s approach is very promising. It builds on advances in deep learning as well as some new tricks AlQuraishi has invented,” says Ian Holmes, a computational biologist at the University of California, Berkeley. “It might be possible that, in the future, his idea can be combined with others to advance the field,” says Jinbo Xu, a computer scientist at the Toyota Technological Institute at Chicago, Illinois, who competed at CASP13.
At the core of AlQuraishi’s system is a neural network, a type of algorithm inspired by the brain’s wiring that learns from examples. It’s fed with known data on how amino-acid sequences map to protein structures and then learns to produce new structures from unfamiliar sequences. The novel part of his network lies in its ability to create such mappings end-to-end; other systems use a neural network to predict certain features of a structure, then another type of algorithm to laboriously search for a plausible structure that incorporates those features. AlQuraishi’s network takes months to train, but once trained, it can transform a sequence to a structure almost immediately.
His approach, which he dubs a recurrent geometric network, predicts the structure of one segment of a protein partly on the basis of what comes before and after it. This is similar to how people’s interpretation of a word in a sentence can be influenced by surrounding words; these interpretations are in turn influenced by the focal word.
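The idea of reading a sequence in both directions can be sketched in a few lines of Python. This is only an illustration of two-way context, not AlQuraishi's network: the plain tanh recurrence, the fixed weights `w_in` and `w_state`, and the one-number-per-position input are all assumptions made for the example, and nothing here is trained.

```python
import math

def recurrent_pass(inputs, w_in=0.8, w_state=0.5):
    """Run a simple recurrence left to right; each state summarises what came before."""
    state = 0.0
    states = []
    for x in inputs:
        # Hypothetical fixed weights; a real network would learn these.
        state = math.tanh(w_in * x + w_state * state)
        states.append(state)
    return states

def bidirectional_context(inputs):
    """Pair each position with a summary of its left context and its right context."""
    forward = recurrent_pass(inputs)
    # Run the same recurrence over the reversed sequence, then flip the result back.
    backward = recurrent_pass(inputs[::-1])[::-1]
    return list(zip(forward, backward))

# Toy per-position features standing in for an amino-acid sequence (assumed values).
segment_features = [0.2, -0.5, 0.9, 0.1]
context = bidirectional_context(segment_features)

# Change only the last position and recompute.
changed = bidirectional_context([0.2, -0.5, 0.9, -1.0])
```

Changing only the last position leaves the forward summary at the first position untouched but alters its backward summary, which is the sense in which a segment is read partly on the basis of what comes after it.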
Technical difficulties meant AlQuraishi’s algorithm did not perform well at CASP13. He published details of the AI in Cell Systems in April (ref. 1) and made his code publicly available on GitHub, hoping others will build on the work. (The structures for most of the proteins tested in CASP13 have not been made public yet, so he still hasn’t been able to directly compare his method with AlphaFold.)
AlphaFold competed successfully at CASP13 and created a stir when it outperformed all other algorithms on hard targets by nearly 15%, according to one measure.
AlphaFold works in two steps. Like other approaches used in the competition, it starts with something called multiple sequence alignments. It compares a protein’s sequence with similar ones in a database to reveal pairs of amino acids that don’t lie next to each other in a chain, but that tend to appear in tandem. This suggests that these two amino acids are located near each other in the folded protein. DeepMind trained a neural network to take such pairings and predict the distance between two paired amino acids in the folded protein.
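A rough way to see how an alignment can reveal amino acids that change in tandem is to score every pair of alignment columns for co-variation. The sketch below uses mutual information as that score; this is a simple stand-in chosen for illustration, not the statistic DeepMind used, and the toy alignment and the function names are invented for the example.

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    """Mutual information (in nats) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        mi += p_ab * math.log(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi

def covarying_pairs(alignment, min_separation=2):
    """Score all column pairs that are not adjacent in the chain, best first."""
    length = len(alignment[0])
    columns = [[seq[i] for seq in alignment] for i in range(length)]
    scores = []
    for i in range(length):
        for j in range(i + min_separation, length):
            scores.append((mutual_information(columns[i], columns[j]), i, j))
    return sorted(scores, reverse=True)

# Toy alignment (assumed data): positions 1 and 3 always change together.
toy_alignment = [
    "AKLEG",
    "AKLEG",
    "ARLDG",
    "ARLDG",
    "AKIEG",
    "ARIDG",
]
best_score, best_i, best_j = covarying_pairs(toy_alignment)[0]
```

In this toy case the top-scoring pair is positions 1 and 3, where K always accompanies E and R always accompanies D. Real alignments are far noisier, which is one reason a trained neural network is used to interpret such signals.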
By comparing its predictions with precisely measured distances in proteins, the network learnt to make better guesses about how proteins would fold up. A parallel neural network predicted the angles of the joints between consecutive amino acids in the folded protein chain.
But these steps can’t predict a structure by themselves, because the exact set of distances and angles predicted might not be physically possible. So in a second step, AlphaFold created a physically possible — but nearly random — folding arrangement for a sequence. Instead of another neural network, it used an optimization method called gradient descent to iteratively refine the structure so it came close to the (not-quite-possible) predictions from the first step.
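The refinement step can be illustrated with a minimal sketch: start from random coordinates and use gradient descent to nudge them until their pairwise distances approach a set of target distances. Everything specific here is an assumption for illustration, including the squared-error objective, the step size, the five-point toy structure and the noise added to make the targets slightly inconsistent. The sketch ignores the predicted angles and any physical constraints, and it is not AlphaFold's actual objective.

```python
import math
import random

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def loss(coords, target):
    """Sum of squared differences between current and target pairwise distances."""
    n = len(coords)
    return sum((distance(coords[i], coords[j]) - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def refine(coords, target, step=0.01, iterations=2000):
    """Plain gradient descent on the coordinates (hypothetical step size and count)."""
    n = len(coords)
    coords = [list(p) for p in coords]
    for _ in range(iterations):
        grads = [[0.0, 0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                d = distance(coords[i], coords[j])
                if d == 0.0:
                    continue  # gradient undefined for coincident points; skip
                scale = 2.0 * (d - target[i][j]) / d
                for k in range(3):
                    g = scale * (coords[i][k] - coords[j][k])
                    grads[i][k] += g
                    grads[j][k] -= g
        for i in range(n):
            for k in range(3):
                coords[i][k] -= step * grads[i][k]
    return coords

rng = random.Random(0)
# Toy "true" structure (assumed data) used only to generate target distances.
true_coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0),
               (1.0, 2.0, 1.0), (0.0, 1.5, 2.0)]
n = len(true_coords)
target = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        # Small noise mimics predictions that are not exactly realisable.
        noisy = distance(true_coords[i], true_coords[j]) + rng.uniform(-0.1, 0.1)
        target[i][j] = target[j][i] = noisy

start = [[rng.uniform(-2.0, 2.0) for _ in range(3)] for _ in range(n)]
initial_loss = loss(start, target)
refined = refine(start, target)
final_loss = loss(refined, target)
```

Because the noisy targets may not correspond to any real arrangement of points, the loss need not reach zero. The refined coordinates are simply the closest arrangement that gradient descent finds from this starting point.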
A few other teams used one of these two techniques, distance prediction or gradient descent, but none used both. In the first step, most teams predicted only whether pairs of amino acids were in contact, not the distance between them. In the second step, most used complex optimization rules instead of gradient descent, which is almost automatic.
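The difference between predicting contacts and predicting distances can be shown with a short sketch. A contact map is a yes-or-no version of a distance matrix, so it discards how close two amino acids actually are. The 8-ångström cut-off below is a commonly used convention rather than something stated in this article, and the distance values are invented for the example.

```python
def contact_map(distances, threshold=8.0):
    """Reduce a distance matrix (in angstroms) to True/False contacts."""
    return [[d < threshold for d in row] for row in distances]

# Toy symmetric distance matrix for four residues (assumed values, in angstroms).
toy_distances = [
    [0.0, 4.0, 7.5, 15.0],
    [4.0, 0.0, 5.0, 12.0],
    [7.5, 5.0, 0.0, 6.0],
    [15.0, 12.0, 6.0, 0.0],
]
contacts = contact_map(toy_distances)
```

Residue pairs separated by 4.0 and by 7.5 ångströms both come out simply as "in contact", so a method that predicts distances passes more information to the second step than one that predicts contacts alone.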
“They did a great job. They’re about one year ahead of the other groups,” says Xu.
DeepMind has yet to release all the details about AlphaFold, but other groups have since started adopting tactics demonstrated by DeepMind and other leading teams at CASP13. Jianlin Cheng, a computer scientist at the University of Missouri in Columbia, says he’ll modify his deep neural networks to have some features of AlphaFold’s, for instance by adding more layers to the neural network in the distance-predicting stage. Having more layers (a deeper network) often allows networks to process information more deeply, hence the name deep learning.
“We look forward to seeing similar systems put to use,” says Andrew Senior, the computer scientist at DeepMind who led the AlphaFold team.
Moult says there was a lot of discussion at CASP13 about how else deep learning might be applied to protein folding. Possibilities include refining approximate structure predictions, reporting how confident an algorithm is in a folding prediction, and modelling interactions between proteins.
And although computational predictions aren’t yet accurate enough to be widely used in drug design, the increasing accuracy allows for other applications, such as understanding how a mutated protein contributes to disease or knowing which part of a protein to turn into a vaccine for immunotherapy. “These models are starting to be useful,” Moult says.
1. AlQuraishi, M. Cell Syst. 8, 292–301 (2019).