It has been one year since the release of the first SARS-CoV-2 genome (ref. 1), which provided scientists with critical knowledge about its proteins. Thanks to the unprecedented experimental efforts by scientists worldwide, we have now obtained structural knowledge about most SARS-CoV-2 proteins, determining their three-dimensional (3D) shapes. Perhaps even more critical is the structural knowledge of the protein complexes that underlie the basics of viral functioning. Months before the experimental protein structures were solved, computational efforts by several groups provided researchers with accurate 3D models of the viral proteins and their physical interactions with each other and with host proteins. This 3D molecular information is instrumental in basic research, to understand mechanisms behind the viral entry and replication, as well as in structure-based drug design, to determine new antiviral targets, or in vaccine development, to study effects of novel mutations on antigen–antibody binding. Given that it is not ‘if’, but ‘when’ a new viral pandemic will emerge (ref. 2), it is crucial to know whether computational modeling methods can facilitate structural characterization of viral proteins and their essential complexes. After one year of intensive research by the structural biology community, we have accumulated enough data to evaluate the impact of computational modeling efforts toward understanding the structural nature of the virus.

Structural genomics efforts to characterize the protein repertoire of a virus are usually carried out by comparative (or template-based) modeling (ref. 3). A newer technique, de novo protein modeling (ref. 4), does not require a template structure and may complement existing methods. Template-based models are often more accurate than de novo ones; however, the former technique depends on previously solved structures of homologous proteins or protein complexes, while the latter can be applied to novel proteins. The latest success in protein modeling has been primarily due to recent technological innovations in the development of novel protein structure prediction algorithms, which use deep learning and are empowered by advances in graphics processing unit (GPU)-accelerated computing. We surveyed accurate template-based and de novo models of SARS-CoV-2 proteins and protein complexes that were also experimentally solved to determine (i) model accuracy when compared with the experimental structure and (ii) how far ahead of the experimental structures they were obtained (Fig. 1). We considered comparative models generated by our group (ref. 5) and de novo models reported by AlphaFold (ref. 6) and C-I-TASSER (ref. 7), which have also contributed to structural characterization of SARS-CoV-2 proteins (Fig. 1a and Supplementary Table 1). Of the 29 putative proteins, 17 were at least partially experimentally and computationally resolved, while 5, including key structural protein M, were characterized only computationally. Six putative proteins have not been structurally characterized at all. The computational methods were fairly accurate, producing an average root mean squared deviation (r.m.s.d.) error of 4.1 Å for all 17 proteins (Supplementary Note). On average, computational models covered roughly 80% of the viral protein sequence, while experimental structures covered 82%. Most importantly, 3D models of viral proteins were released on average 86 days earlier than the corresponding experimental structures.
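To make the two accuracy measures concrete, the following is a minimal Python sketch of how an r.m.s.d. and a sequence coverage value could be computed for one model. It is an illustration under stated assumptions, not the procedure described in the Supplementary Note: the function names `rmsd` and `coverage` are hypothetical, the two structures are assumed to be already superposed with matched atoms (for example, Cα atoms of aligned residues), and the coordinates are toy values rather than real structural data.

```python
import math


def rmsd(model, experiment):
    """Root mean squared deviation between two lists of (x, y, z) points.

    Assumption: both structures are already superposed and the i-th atom
    in `model` corresponds to the i-th atom in `experiment`.
    """
    if len(model) != len(experiment) or not model:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(model, experiment):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(model))


def coverage(modeled_residues, sequence_length):
    """Fraction of the full protein sequence that the model covers."""
    return len(set(modeled_residues)) / sequence_length


# Toy coordinates (in angstroms), NOT taken from any SARS-CoV-2 structure:
# every atom of the "experiment" is shifted by 1 along the z axis.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
experiment = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]

print(rmsd(model, experiment))       # 1.0
print(coverage(range(1, 81), 100))   # 0.8 (80 of 100 residues modeled)
```

In practice the superposition step matters as much as the formula: an r.m.s.d. is only meaningful after the model has been optimally fitted onto the experimental structure, which this sketch deliberately leaves out.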

Fig. 1: Evaluation of computational approaches for modeling 3D structures of SARS-CoV-2 proteins and related protein complexes.

a, Analysis of 17 individual proteins that were both experimentally characterized and computationally modeled, using comparative (circles) and de novo (squares) methods. b, Analysis of 8 protein complexes; each complex consists of two (circle), three (triangle) or four (square) protein subunits. For each modeled protein or protein complex, three quantities are calculated: the r.m.s.d. error between the model and the experimental structure, the number of days between the releases of the experimental and computational structures, and the model’s coverage of the protein sequence (color).

Even if we had structural knowledge of all SARS-CoV-2 proteins, our understanding of the virus’s functional units would be far from complete: most, if not all, viral proteins carry out their functions by forming macromolecular complexes. Recent efforts to map all protein complexes formed by SARS-CoV-2 proteins have identified hundreds of putative interactions (ref. 8). Unfortunately, only a small fraction of these complexes have been structurally characterized (Fig. 1b and Supplementary Table 2): 18 protein complexes have been characterized experimentally and 16 computationally. Overall, for 13 protein complexes, the structure was both modeled and resolved experimentally. For 5 of these, an incorrect oligomer conformation was derived from homologous complexes; for the remaining 8, the computational models yielded accurate protein complexes in correct conformations, with an average r.m.s.d. of 2.6 Å over the entire multimeric structure (Supplementary Information). The models were available on average 53 days earlier than experimental structures, covering on average 77% of all protein sequences involved in the complex. Lastly, for 4 modeled complexes, no experimental structures have yet been obtained.
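The lead time quoted above is a difference between release dates, averaged over the modeled structures. A minimal Python sketch of that arithmetic follows. The function names `lead_time_days` and `mean_lead_time` are hypothetical, and the dates are placeholders chosen only to show the calculation; they are not the release dates of any structure in this survey.

```python
from datetime import date


def lead_time_days(model_release, experimental_release):
    """Days by which a model preceded the experimental structure.

    A negative value would mean the experimental structure came first.
    """
    return (experimental_release - model_release).days


def mean_lead_time(pairs):
    """Average lead time over (model_release, experimental_release) pairs."""
    leads = [lead_time_days(m, e) for m, e in pairs]
    return sum(leads) / len(leads)


# Placeholder dates for illustration only (not real release dates).
pairs = [
    (date(2020, 3, 1), date(2020, 4, 30)),  # model 60 days ahead
    (date(2020, 3, 1), date(2020, 4, 10)),  # model 40 days ahead
]

print(mean_lead_time(pairs))  # 50.0
```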

In the 2011 science fiction movie Contagion, which went viral [sic] in 2020, scientists were shown looking at a structure of a viral surface protein bound to the host receptor just a couple of days after the viral genome was sequenced. That speed is not yet possible experimentally, but can already be achieved using computational modeling. Modeling 3D shapes of the viral proteins and their key complexes brings structural knowledge of the virus several critical months earlier than experiments can. We expect that computational models will be increasingly helpful in designing experiments to test neutralizing antibodies, studying the role of emerging mutations, and understanding the molecular mechanisms behind viral infections. Furthermore, we envision a new generation of artificial intelligence (AI)-driven protein modeling tools, such as AlphaFold 2 (ref. 9), providing even greater improvement in protein models for novel viruses. Still, de novo modeling should be used with caution and backed up by experiments when characterizing viral proteins, because their remarkably diverse structural repertoire might not be captured during training of an AI method. In addition, structural characterization of the macromolecular complexes formed by viral proteins presents a major challenge. Thus, development of new methods for accurate de novo characterization of protein complexes, akin to AI-driven protein structure prediction methods, is the next frontier.