A paradigm shift in structural biology

The release of protein structure predictions from AlphaFold will increase the number of protein structural models by almost three orders of magnitude. Structural biology and bioinformatics will never be the same, and the need for incisive experimental approaches will be greater than ever. Combining these advances in structure prediction with recent advances in cryo-electron microscopy suggests a new paradigm for structural biology.


Sriram Subramaniam and Gerard J. Kleywegt
It has long been a goal of computational biology to bypass the need to determine structures experimentally by predicting accurate 3D structures directly from amino acid sequences. These efforts have steadily gathered momentum over the years, fueled by the biennial CASP community challenge 1 to correctly predict protein folds and structures with accuracy measured against structures independently determined by X-ray crystallography, NMR spectroscopy or cryo-electron microscopy (cryo-EM) methods. The astonishing accuracy with which protein folds can now be predicted by the programs AlphaFold (developed at DeepMind 2 , a subsidiary of Alphabet Inc.) and RoseTTAFold (developed at the University of Washington in Seattle 3 ) represents a dramatic advance in structural biology. DeepMind's collaboration with the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) to make structure predictions available at proteome scale has resulted in the AlphaFold Protein Structure Database resource (AlphaFold DB; https://alphafold.ebi.ac.uk), which in its initial release contained a set of 365,000 predicted structures for unique UniProt entries, covering most of the human proteome and those of 20 other model organisms, and which now contains 800,000 predicted structures, covering most entries in SwissProt 4,5 . This database is expected to grow to >100 million models in 2022.
What is the scale of this addition? As of 10 November 2021, there were 183,954 released entries for structures of biological macromolecules in the Protein Data Bank (PDB) 6 , collected over the past five decades. Of these, 161,086 were at least in part determined from X-ray crystallography, 13,448 from solution NMR and 8,971 from cryo-EM, with the remaining 837 coming from miscellaneous methods such as solid-state NMR and neutron and electron crystallography. The vast majority of these entries (~75%) report structures that contain a single kind of polypeptide, including multimers, with the rest corresponding to heteromeric complexes of varying kinds or to RNA or DNA structures. Collectively, these structures represent 56,194 unique proteins (UniProt entries). By sometime in 2022, the AlphaFold DB will thus feature a >2000-fold increase in structural coverage of the known protein sequences and a >700-fold increase in the number of structures.
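The arithmetic behind these figures can be checked with a short sketch. The counts are those quoted above; the projected database size of exactly 100 million is an assumption (the text gives only the lower bound ">100 million"), and the variable names are illustrative. At that lower bound the ratios come out somewhat below the quoted >2000-fold and >700-fold figures, which would be reached at roughly 112 million and 129 million models, respectively.

```python
# Figures quoted in the text (PDB holdings as of 10 November 2021).
pdb_entries = 183_954
entries_by_method = {
    "X-ray crystallography": 161_086,
    "solution NMR": 13_448,
    "cryo-EM": 8_971,
    "other methods": 837,
}
unique_proteins = 56_194

# An entry determined "at least in part" by two methods is counted under
# both, so the per-method counts can sum to more than the entry total.
method_total = sum(entries_by_method.values())
multi_method_surplus = method_total - pdb_entries

# Assumption: use the ">100 million" lower bound as the projected size.
projected_models = 100_000_000
coverage_fold = projected_models / unique_proteins
entry_fold = projected_models / pdb_entries

# Database sizes at which the quoted fold increases would be reached.
models_for_2000_fold_coverage = 2000 * unique_proteins
models_for_700_fold_entries = 700 * pdb_entries

print(f"Per-method counts sum to {method_total:,} "
      f"({multi_method_surplus} more than the number of entries)")
print(f"Coverage increase at 100 million models: ~{coverage_fold:,.0f}-fold")
print(f"Entry-count increase at 100 million models: ~{entry_fold:,.0f}-fold")
print(f"Models needed for 2000-fold coverage: {models_for_2000_fold_coverage:,}")
print(f"Models needed for 700-fold entries: {models_for_700_fold_entries:,}")
```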
The arrival of AlphaFold and RoseTTAFold heralds a paradigm shift 7 in structural biology. Structure determination itself is only the first step in the science of structural biology, which seeks to use structure as a basis to derive insights and hypotheses regarding biological functions and mechanisms; but the barriers to entry have now been lowered. An important point to appreciate is that AlphaFold and RoseTTAFold predict structures of only the polypeptide components. These predictions are thus not yet complete structural models in the classical sense of an atomic model obtained by X-ray crystallography: that is, they do not include a description of atomic positions for cofactors, metal ions, water molecules and any other ordered, bound ligands. Nevertheless, the availability of the predictions means that across all biological disciplines, studies involving proteins can begin with a structural model, with the focus of experiments being the testing of a series of predictions that can validate, repudiate or refine the model and the structural hypothesis. Iteration of prediction and experimental validation will now become the process that defines the discipline of structural biology. AlphaFold predictions have already demonstrated sustained accuracy over a range of targets, with insights into where the predictions can be improved 2–4,8 .
AlphaFold's structure prediction method also independently generates a quantitative estimate of reliability for every residue of a predicted structure, as well as of the reliability of the relative position and orientation of different parts of it. Models can now be generated for proteins that have been intractable to current methods for experimental structure determination. Although there have been methods to predict structure for a long time, the jump in accuracy achieved by AlphaFold makes model generation from sequence a more credible proposition than before, opening the door to the use of protein structural models (instead of just sequences) by biologists, geneticists, medicinal chemists and physiologists and thus redefining boundaries between disciplines in biology. AlphaFold models can function as 'hypothesis generators' for all functional biologists, providing ideas for the design of experiments that can test the effects of mutations or changes in binding partners. Needless to say, a solid knowledge of structural biology and protein structure will be an essential prerequisite for users to correctly interpret the predicted models.
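As a minimal sketch of how the per-residue reliability estimate can be read in practice: AlphaFold DB coordinate files store the per-residue confidence score (pLDDT, on a 0 to 100 scale) in the B-factor column of each atom record, and the database groups scores into four bands (above 90, 70 to 90, 50 to 70, below 50). The records below are synthetic, generated by a hypothetical helper purely to make the example self-contained; the function names are illustrative, and the sketch does not cover the separate estimate of relative domain position and orientation.

```python
def make_ca_record(serial, res_name, res_seq, plddt):
    """Build a synthetic PDB-format C-alpha record (coordinates are dummies)."""
    return (
        f"ATOM  {serial:>5d}  CA  {res_name:>3s} A{res_seq:>4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}           C"
    )


def per_residue_plddt(pdb_text):
    """Map residue number -> pLDDT, read from the B-factor field of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        # PDB fixed-width fields: atom name in columns 13-16, residue
        # number in columns 23-26, B-factor in columns 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def confidence_band(plddt):
    """Assign a score to one of the four AlphaFold DB confidence bands."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


# Synthetic four-residue example; the scores are made up for illustration.
sample_pdb = "\n".join([
    make_ca_record(1, "MET", 1, 95.2),
    make_ca_record(2, "GLY", 2, 82.0),
    make_ca_record(3, "SER", 3, 61.5),
    make_ca_record(4, "PRO", 4, 33.1),
])

sample_scores = per_residue_plddt(sample_pdb)
sample_bands = {res: confidence_band(s) for res, s in sample_scores.items()}
for res, score in sample_scores.items():
    print(f"residue {res}: pLDDT {score:.1f} ({sample_bands[res]})")
```

Reading only the C-alpha atom is sufficient here because the score is assigned per residue, so every atom of a residue carries the same value.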
It is useful to explore where the current gaps lie in this new landscape. The developers of AlphaFold noted that their dataset of human proteome structures "covers 58% of residues with a confident prediction, of which a subset (36% of all residues) [has] very high confidence" 4 . The 42% that AlphaFold cannot predict with a high degree of accuracy is likely to include proteins and regions of proteins that are intrinsically disordered, including regions that may be engaged in interactions with other proteins, RNA, DNA or other molecular partners. The human bromodomain protein BRD4, which plays a critical role in transcription and in cell cycle regulation, is a good example of this category of proteins: three of the four regions of BRD4 that are predicted with high confidence by AlphaFold have been previously amenable to analysis by X-ray crystallography, but for a majority of the polypeptide, the predictions are at low confidence (Fig. 1). An unexpected feature of AlphaFold predictions appears to be that the regions where the structure predictions are of low confidence also have a high probability of being disordered in isolation 9,10 . For such regions, structural insights will need to be derived through extensive biochemical analysis and through structural exploration of complexes formed with other partner proteins or nucleic acids that stabilize the interfaces.
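One way to locate such gaps in a given model is to scan the per-residue confidence scores for contiguous low-confidence stretches. The following is a sketch under stated assumptions rather than an established protocol: the threshold of 70 and the minimum segment length of 3 are illustrative choices, the function names are hypothetical, and the scores are invented. A flagged segment is only a candidate for disorder and would still need experimental follow-up.

```python
def low_confidence_segments(scores, threshold=70.0, min_length=3):
    """Return (start, end) residue ranges whose scores all fall below threshold.

    `scores` maps residue number -> per-residue confidence (0-100).
    Runs shorter than `min_length` residues are ignored.
    """
    segments = []
    start = end = None
    for res_num in sorted(scores):
        is_low = scores[res_num] < threshold
        if is_low and start is not None and res_num == end + 1:
            end = res_num  # extend the current low-confidence run
            continue
        # The current run (if any) has ended; keep it if long enough.
        if start is not None and end - start + 1 >= min_length:
            segments.append((start, end))
        start = end = res_num if is_low else None
    if start is not None and end - start + 1 >= min_length:
        segments.append((start, end))
    return segments


def confident_fraction(scores, threshold=70.0):
    """Fraction of residues with a score at or above the threshold."""
    return sum(1 for s in scores.values() if s >= threshold) / len(scores)


# Invented scores for a 12-residue toy protein.
values = [92, 88, 45, 40, 38, 35, 91, 93, 60, 95, 30, 31]
toy_scores = {i + 1: float(v) for i, v in enumerate(values)}

segments = low_confidence_segments(toy_scores)
fraction = confident_fraction(toy_scores)
print("candidate disordered segments:", segments)
print(f"confidently predicted fraction: {fraction:.2f}")
```

Applied across a whole proteome, the same fraction is the kind of residue-level coverage statistic quoted above.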
Another important limitation is that the predicted structural models do not provide insight into conformational dynamics. For membrane proteins such as transporters and ion channels, AlphaFold predictions typically generate one of the states of the protein, but understanding the mechanism of action will require knowledge of the broader conformational landscape. This is also true for multidomain and allosteric proteins, for which function is intimately connected to changes in tertiary and quaternary structure spanning multiple conformational states. The development of computational methods to address this gap can yield great dividends, but for the foreseeable future, we will need experimental approaches to discover and validate the trajectory of protein conformational changes.
Discovery of more effective drugs is an area where the use of machine-learning methods will undoubtedly have great impact, and accurate prediction of drug-binding modes can be critically useful as starting points for experiments 11 . Similarly, predictions that account for the effects on structure of post-translational protein modifications, incorporation of cofactors, metal ions and nonstandard amino acids, and ligand binding will also be extremely valuable. At present, AlphaFold predictions do not cover this level of detail, nor does the program provide clues to the locations of water molecules, which may have critical functional roles in reaction mechanisms. Nevertheless, the availability of the new AlphaFold models significantly increases the number of targets that can be explored for drug design and will catalyze many more efforts at structure-guided drug discovery.
Unraveling cell biological mechanisms requires an understanding of the diverse set of protein-protein interactions that are at the heart of cell function. One notable gap evident in the initial reports of AlphaFold and RoseTTAFold was that the predictions covered only single-chain proteins. Since the original publications in July 2021, there have been exciting advances reported on this front 12,13 , with spectacular success in predicting certain classes of protein-protein interfaces. Despite these developments, it is worth noting that the scale of the problem is immense because of the massive network of interactions that take place in and outside the milieu of the cell. Here again, BRD4, a key cellular protein whose function involves interactions with numerous other proteins and nucleic acids, and that is also a target for cancer drug discovery, serves as a good example. An interaction map of BRD4 from the String database 14 illustrates the enormous complexity of the protein's interactions in the cell (Fig. 2). Some of these interactions are mediated by the folded regions, but many more are likely to involve the unstructured regions indicated in Fig. 1. The use of effective computational and experimental tools to derive and validate these interaction maps is certain to remain an active research area for a long time to come.
Fortunately, several of the gaps in the predictions of these powerful new methods correspond almost exactly to the areas where cryo-EM can provide useful information. Cryo-EM methods based on 'single-particle' imaging enable the determination of 3D structures of macromolecular assemblies that are frozen by rapid vitrification in cryogens such as liquid ethane. The rapid freezing process allows the preservation of protein structures under near-native conditions. The rate of deposition of cryo-EM structures in the Electron Microscopy Data Bank (EMDB) has been increasing each year (Fig. 3), and remarkable insights into the structures of both small and large molecular assemblies are being generated. Regions located at protein surfaces that are only partially ordered or can be ordered with the binding of stabilizing antibodies have now been resolved in many instances. Visualizing multiple conformational and functional states that are populated in the samples by 3D classification is a unique strength of the cryo-EM toolbox.
Because cryo-EM structure determination does not require crystallization, heterogeneous post-translational modifications such as glycosylation (for example, in the SARS-CoV-2 spike glycoprotein 15,16 ) can be studied without needing to be engineered away by mutagenesis, enzymatic degradation or chemical modification. The success of cryo-EM methods in resolving bound ligands and delineating structured water molecules now makes it possible to obtain detailed information relevant to molecular mechanisms at a level that until recently was attainable only with X-ray crystallography. Nevertheless, hurdles are often encountered on the road to successful cryo-EM structural analysis, especially at the stage of specimen preparation. Further, because cryo-EM methods rely on averaging tens or hundreds of thousands of images containing objects with similar molecular composition and conformation, there are still major challenges in addressing structures of intrinsically disordered proteins and transient interactions. However, these challenges also represent opportunities to capitalize on conditions under which disordered regions become ordered by interaction with other proteins, nucleic acids or ligands.
Whereas single-particle cryo-EM is used to obtain 3D structures from proteins and related complexes after biochemical purification, cryo-electron tomography (cryo-ET) adds another powerful dimension to the use of cryo-EM in understanding how proteins and other macromolecules assemble and interact in the context of an intact cell 17,18 . Structural information in cryo-ET is obtained by collecting a 'tilt series' of images of the same region of a sample at different orientations relative to the electron beam. This set of images can be converted computationally into volumes (also referred to as 'tomograms') using principles similar to those used in computed axial tomography (CAT). When this approach is used to image whole cells, meaningful information is only recovered from the cell peripheries, where the sample is thin enough for imaging by transmission electron microscopy. To image the cell interior, thin sections can be prepared using focused ion beam (FIB) milling, followed by lift-out methods that allow the extraction of specific regions of cells by sectioning with a gallium ion beam 19 . Tomograms obtained from these thin sections are beginning to provide unprecedented insights into subcellular organization 20 . Yet another exciting area on the horizon is the use of focused ion beams combined with scanning electron microscopy (FIB-SEM) to rapidly image whole cells and tissue segments in 3D 21 . Although FIB-SEM imaging generates images at lower resolution than cryo-ET, application of artificial-intelligence-driven annotation of the resulting volumes can provide unique insights into subcellular organization and macromolecular localization 22 .
The intersection of protein-structure prediction and cryo-ET may well turn out to be one of the most fertile areas at the interface between structural and cell biology in the coming decade. Although membrane organization and cytoskeletal elements, as well as large complexes such as ribosomes, can be visualized directly in tomograms, the structural maps obtained by cryo-ET of complexes and subcellular structures are generally at much lower resolution than what is typically achieved with cryo-EM of the corresponding purified complexes. Averaging the superimposed subvolumes of thousands of copies of specific complexes can improve the resolution, but only rarely do such averages approach atomic resolution. However, by combining the information obtained by tomography with AlphaFold and RoseTTAFold predictions, it should be possible to obtain near-atomic-resolution models of progressively larger complexes in their physiological contexts inside the cell. Mass spectrometric and proteomics methods that help characterize the identity of interacting proteins also provide invaluable information to understand the organization of multiprotein subcellular assemblies. Hybrid approaches of this kind that integrate different structural and biochemical modalities are harbingers of a great future for structural cell biology 23 .
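Why averaging many copies improves the signal can be illustrated with a toy calculation. This is a deliberately simplified sketch, not a subtomogram-averaging implementation: it assumes one-dimensional signals, copies that are already perfectly aligned, and independent Gaussian noise, and it ignores the missing-wedge and alignment problems of real tomographic data. Under those assumptions the residual noise falls roughly as the square root of the number of copies. All names and parameter values are illustrative.

```python
import math
import random


def noisy_copy(signal, noise_sd, rng):
    """Return the signal with independent Gaussian noise added to each sample."""
    return [s + rng.gauss(0.0, noise_sd) for s in signal]


def average(copies):
    """Average a list of equally long, pre-aligned copies sample by sample."""
    n = len(copies)
    return [sum(values) / n for values in zip(*copies)]


def rms_error(estimate, truth):
    """Root-mean-square difference between an estimate and the true signal."""
    return math.sqrt(
        sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth)
    )


rng = random.Random(0)  # fixed seed so the run is reproducible
true_signal = [math.sin(2 * math.pi * i / 50) for i in range(200)]
noise_sd = 5.0  # noise is five times the signal amplitude

single = noisy_copy(true_signal, noise_sd, rng)
averaged = average([noisy_copy(true_signal, noise_sd, rng) for _ in range(400)])

err_single = rms_error(single, true_signal)
err_averaged = rms_error(averaged, true_signal)
print(f"RMS error, 1 copy:     {err_single:.2f}")
print(f"RMS error, 400 copies: {err_averaged:.2f}")
print(f"expected improvement:  ~{math.sqrt(400):.0f}-fold")
```

In real data the copies differ in orientation, conformation and composition, which is one reason such averages only rarely approach atomic resolution.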
In the decades leading up to the current era of machine-learning approaches to structure prediction, NMR spectroscopy, X-ray crystallography and cryo-EM have each demonstrated their strengths and limitations in structural biology. Solution NMR studies have helped resolve structures of small proteins and provided insights into ligand binding and dynamics. X-ray crystallographic methods have helped resolve the structures of a wide range of biomolecules and their complexes ranging from peptides to viruses, provided that they could be coaxed into forming well-ordered 3D crystals. As noted above, cryo-EM methods are emerging as powerful tools to further expand this repertoire to large, dynamic and heterogeneous assemblies. With the arrival of AlphaFold, the toolbox of structural biology has been augmented with a new and formidable set of computational tools. There is little doubt that the growth of structural biology in the next decade will rely profoundly on the synergy between experiment and prediction, fueled by the remarkable complementarity between machine-learning-based predictions and especially cryo-EM technology, which arguably represent the yin and the yang of the new alphabet of structural biology.
Note added in proof: On 9 December 2021, version 2 of the AlphaFold DB was released, which now contains 800,000 predicted structural models and covers most of the entries in SwissProt.