Proteins perform or catalyse nearly all chemical and mechanical processes in cells. Synthesized as linear chains of amino-acid residues, most proteins spontaneously fold into one or a small number of favoured three-dimensional structures. The sequence of amino acids specifies a protein’s structure and range of motion, which in turn determine its function. Over decades, structural biologists have experimentally determined thousands of protein structures, but the difficulty of these studies has made the promise of a computational approach for predicting protein structure from sequence alluring. Writing in Nature, Senior et al.1 describe an algorithm, AlphaFold, that takes a leap forward in solving this classic problem by bringing to bear modern machine-learning techniques.
The diversity of protein structures precludes the possibility of obtaining simple folding rules, making structure prediction difficult. Protein folding is ultimately driven by quantum mechanics. Were it possible to compute the exact energy of protein molecules from quantum theory, and to do so for every possible conformation, then predicting a protein’s most energetically favoured structure would be easy. Unfortunately, a quantum treatment of proteins is computationally intractable (quantum computers might change this), and the total set of possible conformations that any protein can take is astronomical, prohibiting such a brute-force approach.
This has not stopped scientists from attempting a direct attack on the problem. Physical chemists have devised tractable, but approximate, energy models for proteins2, and computer scientists have developed ways to explore protein conformations3. Much progress has been made on the first problem, but the second has proved more recalcitrant.
The set of shapes that a protein might take can be likened to a landscape: different locations in the landscape correspond to different shapes, with nearby locations having similar shapes. The height of a location corresponds to how energetically favourable the associated shape is, with the lowest point being the most favoured. Natural proteins evolved to have funnel-shaped landscapes that enable newly synthesized proteins, jostled by the thermal fluctuations of the cell, to cross the landscape and find their way to a favoured conformation in physiologically relevant timescales (milliseconds to minutes)4. Algorithms can search the landscape to find favoured conformations by following the landscape’s inclination, but the ruggedness of the terrain causes them to get stuck in troughs and valleys far from the lowest basin.
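The way a rugged terrain traps downhill searches can be sketched with a toy one-dimensional energy function (entirely invented; real protein landscapes have thousands of dimensions):

```python
import numpy as np

# Toy 1-D "energy landscape": a smooth funnel plus rugged ripples.
# (Illustrative only; real protein landscapes are vastly higher-dimensional.)
def energy(x):
    return 0.05 * x**2 + np.sin(3 * x)   # funnel term + ruggedness term

def gradient(x):
    return 0.1 * x + 3 * np.cos(3 * x)

def descend(x, step=0.01, iters=5000):
    """Follow the landscape's downhill inclination from a starting point."""
    for _ in range(iters):
        x -= step * gradient(x)
    return x

# Searches launched from different places end in different troughs: most
# get stuck in a local valley rather than reaching the lowest basin.
minima = sorted({round(descend(x0), 2) for x0 in np.linspace(-6, 6, 13)})
print(minima)  # several distinct local minima, not one global answer
```

The point of the sketch is that plain gradient-following finds *a* trough, not *the* trough, which is why structure-prediction methods have needed elaborate search strategies.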
The course of the structure-prediction field changed nearly a decade ago with the publication of a series of seminal papers5–7 exploring the idea that the evolutionary record contains clues about how proteins fold. The idea is predicated on the following premise: if two amino-acid residues in a protein are close together in 3D space, then a mutation that replaces one of them with a different residue (for example, large for small) will probably induce, at a later time, a mutation that alters the other residue in a compensatory direction (in our example, swapping small for large). The set of co-evolving residues therefore encodes valuable spatial information, and can be found by analysing the sequences of evolutionarily related proteins.
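A minimal sketch of how co-varying positions can be detected, using mutual information over a toy alignment (the sequences below are invented for illustration; real analyses use alignments of thousands of sequences and more sophisticated statistics):

```python
from collections import Counter
from math import log2

# Toy multiple-sequence alignment (hypothetical sequences). Columns 0 and 3
# co-vary (large L and small S are swapped in tandem, the compensatory
# mutation pattern); column 1 varies independently.
alignment = [
    "LAGS",
    "LCGS",
    "SAGL",  # the compensatory swap: S at position 0, L at position 3
    "SCGL",
    "LAGS",
    "SAGL",
]

def mutual_information(col_i, col_j):
    """Mutual information between two alignment columns; high values
    suggest the positions co-evolve and are therefore likely proximal."""
    n = len(alignment)
    pi = Counter(s[col_i] for s in alignment)
    pj = Counter(s[col_j] for s in alignment)
    pij = Counter((s[col_i], s[col_j]) for s in alignment)
    return sum(
        (c / n) * log2((c / n) / ((pi[a] / n) * (pj[b] / n)))
        for (a, b), c in pij.items()
    )

print(mutual_information(0, 3))  # high: the columns co-vary
print(mutual_information(0, 1))  # ~0: the columns are independent
```

In practice, direct-coupling methods that separate direct from indirect correlations have superseded raw mutual information, but the underlying signal is the same.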
By transforming this co-evolutionary information into a matrix known as a binary contact map, which encodes which residues are proximal, the set of conformations that merit consideration by algorithmic searches can be restricted. This in turn makes it possible to accurately predict the most favourable protein conformation, especially for proteins for which many evolutionarily related sequences are known. The idea was not new8, but the rapid growth in available sequence data in the early 2010s, coupled with crucial algorithmic breakthroughs, meant that its time had finally come.
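A binary contact map is simple to construct once pairwise distances (or distance predictions) are in hand. This sketch uses an invented distance matrix and the common 8 Å cutoff convention:

```python
import numpy as np

# Hypothetical pairwise residue distances (in angstroms) for a 5-residue
# toy chain. In structure prediction the map comes from co-evolutionary
# predictions, not from a known structure.
distances = np.array([
    [0.0,  3.8,  6.5,  9.1, 12.0],
    [3.8,  0.0,  3.8,  6.7,  9.3],
    [6.5,  3.8,  0.0,  3.8,  6.6],
    [9.1,  6.7,  3.8,  0.0,  3.8],
    [12.0, 9.3,  6.6,  3.8,  0.0],
])

# Binary contact map: 1 where residues fall within the cutoff, 0 otherwise.
# It records WHICH residues are proximal, but not HOW close they are, so
# many 3D conformations remain compatible with a single map.
contact_map = (distances < 8.0).astype(int)
np.fill_diagonal(contact_map, 0)  # self-contacts carry no information
print(contact_map)
```

The loss of information in that thresholding step is exactly what leaves a residual search problem, a point the discussion below returns to.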
Co-evolutionary analysis has been responsible for most progress in protein-structure prediction in the past few years, but it has not obviated the need for algorithms to search the energy landscapes of proteins: binary contact maps constrain the search space, but do not pin down a single 3D structure. Furthermore, the mathematics underpinning the conversion of co-evolutionary data into contact maps is restricted by the types of input used and the output generated. The initial injection of deep learning (a type of machine learning) into co-evolutionary analyses improved matters by incorporating richer inputs9. AlphaFold takes things a step further by changing the outputs.
In lieu of binary contact data, AlphaFold predicts the probabilities of residues being separated by different distances. Because probabilities and energies are interconvertible, AlphaFold predicts an energy landscape — one that overlaps in its lowest basin with the true landscape, but is much smoother. In fact, AlphaFold’s landscape is so smooth that it nearly eliminates the need for searching. This makes it possible to use a simple procedure to find the most favourable conformation, rather than the complex search algorithms employed by other methods.
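The probability-to-energy conversion is the inverse Boltzmann relation, E(d) ∝ −log p(d). A sketch with an invented, single-peaked distance distribution shows why a smooth landscape needs no elaborate search:

```python
import numpy as np

# Hypothetical predicted probabilities for the distance between one residue
# pair, over 1-A bins from 2 to 19 A (the kind of histogram an
# AlphaFold-style model outputs; these particular numbers are invented).
bins = np.arange(2.0, 20.0, 1.0)
probs = np.exp(-0.5 * ((bins - 7.0) / 1.5) ** 2)
probs /= probs.sum()

# Inverse-Boltzmann conversion: probabilities and energies are
# interconvertible via E(d) = -log p(d), up to temperature and an
# additive constant.
energies = -np.log(probs)

# Because this landscape is smooth and single-basin, the most favoured
# distance can be read off by simple minimisation; no complex search.
best = bins[np.argmin(energies)]
print(best)  # -> 7.0
```

The full method sums such pairwise terms over all residue pairs and minimises the total with standard gradient-based optimisation; the sketch shows the single-pair building block.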
The idea that a complex search could be unnecessary for structure prediction is, in hindsight, unsurprising. Mathematically, the distances between points determine their relative locations. Predictions of distances can therefore predict structure. Moreover, relatively simple models of protein energy landscapes known as Gō potentials, in which experimentally determined distances between residues are favoured, can lead to protein-folding pathways that resemble ones experienced by real proteins10. This suggests that proteins fold more like simple origami than like an intricate knot — all parts can come together at once. My own work has shown that folding can be predicted implicitly using a deep-learning model without searching11, and minimal search procedures have also been embedded within another deep-learning model to predict protein structures12.
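The claim that distances determine relative locations can be demonstrated with classical multidimensional scaling, which recovers coordinates (up to rotation, reflection and translation) from a matrix of pairwise distances; the toy points below are invented:

```python
import numpy as np

# Points whose relative locations we will recover purely from their
# pairwise distances.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
D = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)

# Classical multidimensional scaling: double-centre the squared distance
# matrix, then eigendecompose to obtain coordinates (up to rigid motions
# and reflection).
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J          # centred Gram matrix
w, V = np.linalg.eigh(B)             # eigenvalues in ascending order
X = V[:, -2:] * np.sqrt(np.maximum(w[-2:], 0.0))  # top-2 eigenpairs -> 2D

# The recovered points reproduce the original distances exactly.
D_rec = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print(np.allclose(D, D_rec))  # -> True
```

With exact distances the recovery is exact; with noisy predicted distances it becomes an optimisation problem, but one over a far smoother landscape than raw conformational search.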
What is notable about AlphaFold is that it predicts distances with sufficient accuracy to outperform state-of-the-art search methods (Fig. 1). Senior et al. used advances in deep learning to extract as much structural information as possible from protein sequences. The resulting algorithm outperformed all entrants at the most recent blind assessment of methods used to predict protein structures (the CASP13 event), generating the best structure for 25 out of 43 proteins, compared with 3 out of 43 for the next-best method. AlphaFold’s predictions had a median accuracy of 6.6 ångströms on this set of proteins — that is, for the middle-ranked protein in this set, the atoms in the proposed structures were on average 6.6 Å away from their actual positions.
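The accuracy figure quoted is a root-mean-square deviation (RMSD). A sketch of the computation on invented coordinates, omitting the optimal-superposition step that real assessments perform first:

```python
import numpy as np

# Hypothetical predicted vs. experimental atom positions (in angstroms)
# for a tiny three-atom fragment; CASP accuracy figures summarise
# deviations like these over whole structures.
predicted = np.array([[0.0, 0.0, 0.0], [3.8, 0.2, -0.1], [7.5, 0.4, 0.3]])
actual    = np.array([[0.1, -0.2, 0.0], [3.9, 0.0, 0.2], [7.3, 0.5, 0.1]])

# Root-mean-square deviation: the typical distance between corresponding
# atoms in the two structures.
rmsd = np.sqrt(np.mean(np.sum((predicted - actual) ** 2, axis=1)))
print(round(rmsd, 2))  # -> 0.31
```

A 6.6 Å figure of this kind means the topology is broadly right but atomic detail is not resolved, which motivates the resolution caveats that follow.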
Challenges remain. AlphaFold is not yet accurate enough for most applications, such as working out the catalytic mechanisms of enzymes or how drugs bind to proteins (which both typically require 2–3 Å resolution). And although AlphaFold’s search procedure is much simpler than most modern methods, it can still be slow, taking tens to hundreds of hours to make a single prediction. For applications such as protein design, which require the structures of many different protein sequences to be modelled, the lack of speed is an impediment.
Nevertheless, this is a watershed moment for the field. Given continued growth in the number of available protein sequences, it is possible that the coarse structures (about 4 Å resolution) of most proteins that consist of a single folded domain will become available in the next five years from structure predictions. Such broad availability of structural information might transform the life sciences, just as sequence information did in the preceding decades. This could mean that, combined with the rapid advances in protein-structure determination enabled by cryo-electron microscopy, we are entering a golden age of structural biology — one that makes possible a quantitative and mechanistic basis for the life sciences, broadly grounded in firm structural hypotheses.
Nature 577, 627-628 (2020)