Flexibility and mobility of SARS-CoV-2-related protein structures

The worldwide CoVid-19 pandemic has led to an unprecedented push across the whole of the scientific community to develop a potent antiviral drug and vaccine as soon as possible. Existing academic, governmental and industrial institutions and companies have engaged in large-scale screening of existing drugs, in vitro, in vivo and in silico. Here, we are using in silico modelling of possible SARS-CoV-2 drug targets, as deposited on the Protein Databank (PDB), and ascertain their dynamics, flexibility and rigidity. For example, for the SARS-CoV-2 spike protein—using its complete homo-trimer configuration with 2905 residues—our method identifies a large-scale opening and closing of the S1 subunit through movement of the S\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${}^\text{B}$$\end{document}B domain. We compute the full structural information of this process, allowing for docking studies with possible drug structures. In a dedicated database, we present similarly detailed results for the further, nearly 300, thus far resolved SARS-CoV-2-related protein structures in the PDB.

At the end of 2019 a cluster of pneumonia cases was discovered in Wuhan city in China, which turned out to be caused by a novel coronavirus, SARS-CoV-2 1 . Since then the virus has spread around the world and currently has caused over 75 million infections with nearly 1.7 million deaths worldwide (Dec 21, 2020) 2 . SARS-CoV-2 is the seventh identified coronavirus that causes human disease. Four of these viruses cause infections similar to the common cold and three, SARS-CoV, MERS-CoV and SARS-CoV-2 cause infections with high mortality rates 3 . As well as affecting humans, coronaviruses also cause a number of infections in farm animals and pets 4 . The risk of future cross-species transmission, especially from bats, has the potential to lead to future pandemics 5 . Thus, there is an urgent need to develop drugs to treat infections and a vaccine to prevent this disease. Some success has already been achieved for SARS-CoV-2 with dexamethasone reducing mortality in hospitalised patients 6 and a number of vaccine trials are currently ongoing.
The viral spike protein is of particular interest from a drug-and vaccine-development perspective due to its involvement in recognition and fusion of the virus with its host cell. The spike protein is a heavily glycosylated homotrimer, anchored in the viral membrane. It projects from the membrane, giving the virus its characteristic crown-like shape 3 . The ectodomain of each monomer consists of an N-terminal subunit, S1, comprising two domains, S A and S B , followed by an S2 subunit forming a stalk-like structure. Each monomer has a single membrane-spanning segment and a short C-terminal cytoplasmic tail 7 . The S1 is involved in recognition of the human receptor, angiotensin-converting enzyme 2 (ACE2). This subunit has a closed or down configuration where all the domains pack together with their partners from adjacent polypeptides 8,9 . However, in order for recognition and binding to ACE2 to take place, one of the three S B domains dissociates from its partners and tilts upwards into the open or up configuration 8,9 . Binding of ACE2 to the open conformation leads to proteolytic cleavage of the spike polypeptide between S1 and S2 7 . S2 then promotes fusion with the host cell membrane leading to viral entry 10 . In the following we compute the full, structurally-resolved motion pathway from the open to the closed structure of the full homo-trimer. We will show that in the closed states, the vibrational motion of the trimer closely follows the behavior of an elastic cylinder. In contrast, the most important motion of open structure is overall much more flexible and a particularly high mobility in the S B domains can lead to a full closure of the trimer. The full structural details is this motion pathway are available via our approach. Drugs that target this motion of the spike protein therefore have the potential to prevent infection of host cells.
The main protease of the virus (M pro ) is another important drug target. M pro is responsible for much of the proteolytic cleavage required to obtain the functional proteins needed for viral replication and transcription.
These proteins are synthesised in the form of polyproteins which are cleaved to obtain the mature proteins 11 . M pro is active in its dimeric form but the SARS-CoV M pro is found as a mixture of monomer and dimers in solution 12 . SARS-CoV M pro has been crystallised in different, pH-dependent conformations suggesting flexibility, particularly around the active site. MD simulations support this flexibility 13 . The protease is highly conserved in all coronaviruses, including SARS-CoV-2, so targeting either dimerization or enzymatic activity may give rise to drugs that can target multiple coronaviruses, known and yet unknown 14,15 .
Since the discovery of SARS-CoV-2, a plethora of structures have been determined including M pro16 and the ectodomain of the spike protein 8,9 as well as other potential drug and vaccine targets. These structures provide the opportunity for rational drug design using computational biology to identify candidates and optimise lead compounds. However, crystal structures only provide a static picture of proteins, whereas proteins are dynamic. This property is often important in drug development. For example, agonists and antagonists often bind different conformations of G coupled-protein receptors 17 . Flexibility also affects the thermodynamic properties of drug binding 18 , yet the ability to assess flexibility is often hampered by the long computational times needed for MD simulations.
We use a recent protein flexibility modeling approach 19 , combining methods for deconstructing a protein structure into a network of rigid and flexible units (First) 20 with a method that explores the elastic modes of motion of this network [21][22][23][24][25] , and a geometric modeling of flexible motion (Froda) 26,27 . The usefulness of this approach has recently been shown in a variety of systems [28][29][30][31][32] . Methods similar in spirit, exploring flexible motion using geometric simulations biased along "easy"directions, have also been implemented using Frodan 33 and NMSim 34 . We have performed our analysis through multiple conformational steps starting from the crystal structures of SARS-CoV-2-related proteins as currently deposited in the PDB. This results in a comprehensive overview of rigidity and the possible flexible motion trajectories of each protein. We emphasize that these trajectories do not represent actual stochastic motion in a thermal bath as a function of time, but rather the possibility of motion along a set of most relevant low-frequency elastic modes. Each trajectory leads to a gradual shift of the protein from the starting structure and this shift may reach an asymptote, where no further motion is possible along the initial vector, as a result of steric constraints 31 . Energies associated with such a trajectory for bonds, angles, electrostatics, and so forth, have be estimated in previous studies of other proteins and shown to be consistent and physically plausible. Computing times for the method vary with the size of the proteins, but range from minutes to a few hours and are certainly much faster than thermodynamically equilibrated MD simulations. Hence, the approach offers the possibility of large-scale screening for protein mobilities.

Results
Protein selection. We have downloaded protein structure files as deposited on the Protein Data Bank 35 , including all PDB codes that came up when using "SARS-CoV-2" and "Covid-19" as search terms, as well as minor variations in spelling. In our first search of April 18th, 2020, this resulted in 133 protein structures. Further searches on May 20th and 29th, and August 25th, 2020 gave a total of 287 structures. Many of the structures found, as outlined above, have been deposited on the PDB in dimer or trimer forms. Hence one has the choice to study the rigidity and motion of the monomer or the dimer/trimer. Clearly, the computational effort for a dimer/ trimer structure is much larger than for a monomer since, in addition to the intra-monomer bond network, inter-monomer bonds also need to be taken into account. Furthermore, it is not necessarily clear whether the possible motion of a monomer or dimer should be computed to have the most biological relevance. We have calculated the motion of full dimer and trimer structures only for a few selected and biologically most relevant such structures while the default results concentrated on the monomers. Nevertheless, we wish to emphasize that when results for a certain monomer exist, it should nearly always be possible to also obtain their motion in dimer/trimer configuration. For some protein structures, we have found that steric clashes were present in the PDB structures that made a flexibility and sometimes even just the rigidity analysis impossible. Usually, this is due to a low crystal resolution. A list of all current protein PDB IDs included in this work is given in Table S1. Figs. 1 and 2 we show examples of different rigidity patterns that emerge from the First analysis. In line with previous studies comparing these rigidity patterns across various protein families 19 , we find that they can be classified into about four types. In Fig. 1a we see that for the crystal structure of SARS-CoV-2 nucleocapsid protein N-terminal RNA binding domain (PDB: 6m3m), the largest rigid cluster in the pristine structure, i.e. at E cut = 0 , largely remains rigid through the dilution process of consecutively lowering E cut values. When bonds are opened the rigid cluster shrinks but the newly independent parts are flexible and not part of any new large rigid cluster themselves. Only very small parts of the protein chain break to form their own independent rigid structures before at a certain E cut the whole protein is essentially flexible. We shall denote such a behaviour as brick-like. In contrast, in Fig. 1b we observe that for chain A of the co-factor complex of NSP7 and the C-terminal domain of NSP8, which forms part of the RNA synthesis machinery from SARS CoV-2 (PDB: 6wiq), already the crystal structure has fallen into three independent rigid structures. Opening bonds, we find that the largest rigid cluster breaks into 2 rigid structures. Then these now four domains retain their character upon opening more and more bonds until they simply dissolve into full flexibility (We use the term domain here to denote large regions of a protein connected in the same rigidity cluster. Usually, domains in a biological sense are also detected as domains in a rigidity sense as used here 32 . However, this relationship is not necessarily always the case.).

Rigidity and flexibility. In
While brick-and domain-like behaviour is only found occasionally, two further types of rigidity dissolution are more prevalent. In Fig. 2a we see that for, e.g. the monomer of the crystal structure of the COVID-19 main protease (M pro ) in apo form (PDB: 6m03), the rigid cluster dominating the crystal structure quickly falls apart upon change in E cut with five newly formed independent rigid clusters emerging towards the C-terminal of the www.nature.com/scientificreports/ gives chain A of the co-factor complex of NSP7 and the C-terminal domain of NSP8 from SARS CoV-2 (PDB: 6wiq). Different rigid clusters of the polypeptide chain appear as coloured blocks along the protein chain with each C α labelled from its N-terminal at 1 to its C-terminal. The three columns at the left of each cluster decomposition denote the index of the hydrogen bond to have been opened last (with 117 and 124 to total number of H bonds in (a) and (b), respectively), the energy cutoff E cut (E) and the mean number of bonded neighbours r (mean coordination) at that value of E cut . The label A indicates that protein chain A is being considered from the structure. We emphasize that the E cut scale is not linear since new rows are only added when the rigidity of a structure has changed. When the value of E cut decreases (the downward direction points towards larger, negative E cut magnitudes), rigid clusters break up and flexibility increases. The colour coding shows which atoms belong to which rigid cluster. Flexible regions appear as thin black lines. www.nature.com/scientificreports/ gives results for chain A of a complex resulting from the reaction between the SARS-CoV main protease and a carbamate (PDB: 6y7m). Distinct rigid clusters of the polypeptide chain appear as coloured blocks along the protein chain with each C α labelled from its N-terminal at 1 (left edge) to its C-terminal (right edge). The coloring and choice of blocks follows the same logic as detailed in Fig. 1  www.nature.com/scientificreports/ protein chain. These new clusters remain stable to the opening of further bonds, even when the remnants of the original rigid cluster has become fully flexible. Such a behaviour relies on a certain critical E cut value to dominate the rigidity dissolution and is reminiscent of so-called first-order phase transitions in statistical physics. Hence this rigidity-type is usually denoted as 1st order 19 . The behaviour seen in 2b for the crystal structure of the complex resulting from the reaction between the SARS-CoV main protease and tert-butyl Here, there are many values of E cut where large parts of the original rigid structure break off one after another so that towards the end of the bond-opening process, the original cluster is still present, but no longer dominates the rigidity pattern. This behaviour is called 2nd order 19 .
In Table S1, we have indicated the classification for each protein structure into the four classes. Obviously, this classification is not perfect and there are also sometimes intermediate rigidity patterns. Nevertheless, this rigidity classification already provides a first insight into the possible flexibility and range of motion for each structure. It should be clear, that a brick-like rigidity shows the least flexibility until it dissolves completely. On the other hand, for domain-like structures, one can expect possible intra-domain motion while inter-domain motion might be harder to spot. Similarly, for a 1st-order rigidity, one would expect little dynamic mobility until the "transition" value for E cut has been reached, although high levels of flexibility should be possible afterwards. Lastly, a protein with 2nd-order rigidity should have the most complex behaviour in terms of flexibility since new possible mobility can be expected throughout the range of E cut values.
Protein mobility. For each value of E cut , the First analysis has provided us with a map of rigid and flexible regions in a given crystal structure. We can now translate this into propensity of motion by allowing the flexible parts to move perturbatively, subject to full bonding and steric constraints (see "Methods" section). Moving along directions proposed by an elastic normal model analysis of the crystal structure, we can therefore construct possible motion trajectories that are fully consistent with the bond network and steric constraints. Each trajectory corresponds to one such normal mode, denoted m 7 up to m 12 for the first low-frequency non-trivial such modes, as well as the chosen E cut value. Generally, a larger value of |E cut | implies less rigidity and results in larger scale flexible motion. In Figs. 3 and 4 we give examples of such motion trajectories. Figure 3 shows a monomer of the SARS-CoV-2 spike glycoprotein (closed state) (PDB code: 6vxx 9 ). We can see that there is a good range of motion from the crystal structure when following the normal mode modes into either positive or negative changes along the mode vector. Figure 4 shows the motion for the dimer structure of the SARS-CoV-2 main protease (PDB code: 6lu7, in complex with inhibitor N3. We have removed the inhibitor in the motion analysis here) 16 . Again, one can see considerable motion, although due to the complexity of the structure, it is www.nature.com/scientificreports/ difficult to distinguish individual movement patterns from such a frozen image. Much better insight can be gained when watching for full range of motion as a movie. Direct access to the movie for the spike ecto domain monomer (6vxx) is provided via Ref. 36 , while the motion of the main protease dimer can be viewed at Ref. 37 . We have made a dedicated web download page 38 where movies for the complete set of PDB codes as given in Table S1 are publicly available. The movies are being offered for various modes m 7 to m 12 as well as E cut values, at least containing results for E cut = −1 kcal/mol , −2 kcal/mol and −3 kcal/mol , respectively. In addition, the site includes the rigidity resolutions discussed above and all intermediate structures needed to make the movies. This allows for the calculation of relative distances and other such structural measures along each motion trajectory as desired. For a detailed analysis of the SARS-CoV-2 main protease with similar methods, we refer the reader to Ref. 39 .

SARS-CoV-2 spike ecto domain structures.
The trans-membrane spike glycoprotein mediates entry into host cells 8,9 . As such, it is a "main target for neutralizing antibodies upon infection and the focus of therapeutic and vaccine design" 9 . It forms homotrimers protruding from the viral surface 3 . Up to the end of May 2020, 3 structures of the trans-membrane spike glycoprotein have been deposited in the PDB. These transmission electron cryomicroscopy (cryo-EM) studies have led to structures with PDB codes 6vsb 8 , 6vxx 9 and 6vyb 9 now being available. With RMS resolution of 3.5 Å, 6vsb has a slightly lower resolution than 6vxx at 2.8 Å and 6byv at 3.2 Å. In the following, we shall discuss the resulting rigidity and flexibility properties of these three structures in their full trimer form. Results for individual monomer rigidities, as given in Fig. 3, are also available at Ref. 38 .

Prefusion SARS-CoV-2 spike glycoprotein with a single receptor-binding domain open.
The rigidity pattern of the homotrimer for this structure (PDB: 6vsb, 440.69 kDa , 22854 atoms, 2905 residues) is dominated by a large rigid cluster encompassing most of the trimer structure except for a region from roughly residue 330 to residue 525 (With nearly 3000 residues in the structure of the trimer, is it impossible to capture the behaviour in a www.nature.com/scientificreports/ single figure that at the same time would allow the reader to see enough detail. The reader is referred to Ref. 38 for access to the electronic images.). This corresponds to the S B domain of S1 in each monomer. In the trimer configuration, it is known that one of the S B domains can change from the closed to an open configuration 8 . At |E cut | = 0.016 kcal/mol , the 6vsb structure of the trimer breaks into many different rigid parts, but the original large cluster remains a dominating feature across the rigidity dilution plot. A motion analysis does not compute but breaks with bad sterics, apparently due to the comparably low resolution of the crystal structure.

Mobility of the SARS-CoV-2 spike glycoprotein (closed state).
For the closed state structure (6vxx, 438.53 kDa , Atom Count: 23694 atoms, 2916 residues) of the SARS-CoV-2 spike glycoprotein, the rigidity pattern is again "brick"-like and the whole of the crystal structure is part of a single cluster. At |E cut | = 0.022 kcal/mol , there is a first order break of the large rigid cluster into dozens of smaller rigid units. Nevertheless, the original cluster retains a good presence in each chain of the trimer. We are now able to produce motion studies for the full set of modes m 7 to m 12 and various E cut 's. In Figs. 5 and 6 we show the results, again using the structures for the extreme ranges of the motion as in Fig. 3. Figure 6 shows a view from the top while Fig. 5 gives a view from the side of the spike protein. Both views are chosen similar to those given in Ref. 9 . The movies corresponding to Figs. 5 and 6 can be viewed at Refs. 40,41 , respectively. Looking at the whole range of modes and E cut 's computed, we find that the motion is very reminiscent of the vibrational excitations of a rigid cone or cylinder. There is twist motion around the central axis, bending of the trimer along the central axis, relative twist motion of 2 chains relative to the remaining chain (c.p. Fig. 5), etc. At m 12 (with |E cut | = 2 kcal/mol) , large scale motion has already stopped and one only observes smaller scale motion in flexible parts of the trimer chains. Overall, this behaviour is very consistent with the elastic behaviour of a "closed" structure 9 such as a cone or cylinder.

Mobility of the SARS-CoV-2 spike glycoprotein (open state).
We now turn to the last SARS-CoV-2 spike ecto domain structure (6vyb, 437.65 kDa , 22365 atoms, 2875 residues) 9 . This structure is in the open configura- www.nature.com/scientificreports/ tion similar to the conformation seen in 6vsb 8 . The structure is shown in top view in Fig. 7 and in side view in Fig. 8. From the rigidity plots we find, in addition to the largest rigid cluster, already for the crystal structure at |E cut | = 0 kcal/mol a second rigid cluster, also roughly spanning from residue 330 to 525. Again, this region identifies the S B domain as in the 6vsb structure. Compared to the closed state (6vxx), we see that this cluster has more internal structure, i.e. consists of more flexible parts in the S B region from 330 to 525 and also seems to fall apart upon further increasing the |E cut | value. This suggests that it is more flexible as a whole. Upon motion simulation, see 42,43 for side and top views, we observe a very high mobility in that prominent S B subdomain. Starting from the crystal structure, we find for |E cut | = 1 kcal/mol a clear further opening towards a negative Froda mode, m 7 , while in the other, positive direction of m 7 , the structure can again close the trimer. The distance range of the motion can be expressed as follows: the distance from residue 501 in the middle of the central β-sheet of the S B to the most opened conformation is 57 Å while distance to the same residue in the most closed conformation is 28 Å. The distance from open to closed is 69 Å. Hence the motion simulation adds additional insight into the distinction between the open (6vyb) and the closed (6vxx) structures while also showing that a transition from open to closed is indeed possible. As stated in Ref. 9 , this interplay of closing and opening is expected to be central to the viral entry into the human cell.
Other SARS-CoV-2 structures of interest. Similar results have been obtained 38 for the other about 300 Covid-19 structures thus far deposited in the PDB. For example, from those structures deposited from late May to August 2020, the nsp7-nsp8c complex (PDB code: 6m5i) 44 is potentially interesting as this complex is part of the RNA transcription and replication machinery. The motion of the hetero dimer shows an intriguing movement restricted to just one of the total seven α helices. The motion of ORF3a (PDB code: 6xdc) 45 , a putative ion channel, might be important in understanding the inflammatory response resulting from Covid-19 infection, i.e. as deletion has been shown to reduce viral titer. We find considerable flexibility in the homo dimer structure between the α-helix-rich and the β-sheet regions. On the other hand, SARS-CoV-2 Orf9b, which potentially suppresses the interferon response upon infection leading to a less effective immune response 46 , shows very little flexibility (PDB code: 6z4u) 47 . The range of movement from the most opened to the most closed conformation can be measured to be quite large. The flexibility within the S B domain itself during the hinge movement is also seen to be considerable. All these findings suggest that the S B domain of the spike protein has the necessary flexibility to attach itself readily to ACE2. However, when starting from the closed structure (6vxx), we do not see an opening. This suggests possibly stronger bonds and steric constraints which need to be overcome before the structure is able to open up. Nevertheless, to our knowledge this is the first time the hinge motion of the S B domain has been predicted solely based on the dynamics of a structure. In principle, the full structural information provided in our download site 38 can be used to perform possible docking studies with ligands of different sizes. These intermediate conformations can also be used as convenient starting points for further MD-based dynamics studies to evaluate thermodynamic properties of the structures. This shows the power of FIRST and FRODA to complement structure determinations and MD simulations to make valid predictions about the dynamics of proteins. Similarly detailed structural analyses and further MD work is also possible for the other structures included in this work. The classification of the protein structures into four types of dynamics provides information about the level of flexibility of a protein which is relevant to drug discovery. For more detailed information rigidity dilution plots and motion movies are available for download at Ref. 38 . In addition, the code underlying our study is accessible at Ref. 48 and implemented in a free web server at Ref. 49 .

Methods
Rigidity analysis. In each case we start the rigidity, flexibility and mobility modelling with a given protein crystal structure file in PDB .pdb format. Hydrogen atoms absent from the PDB crystal structures are added using the software Reduce 50 . Alternate conformations that might be present in the protein structure file are removed if needed and the hydrogen atoms are renumbered in PyMol 51 . We find that for some protein structures the addition of hydrogen atoms is not possible without steric clashes. Consequently, identification of a viable structure and its continued analysis is not possible. These proteins are labelled in Table S1. We observe five structures which exhibit such steric clashes. For the remaining proteins we produce the "rigidity dilution" or rigid cluster decomposition (RCD) plot using First 20 . The plots show the dependence of the protein rigidity on an energy cutoff parameter, E cut < 0 . It parametrizes a bonding range cutoff based on a Mayo potential 52,53 such that larger (negative) values of E cut correspond to more open bonds, i.e. a smaller set of hydrogen bonds to be included in the rigidity analysis.
Elastic network modes. We obtain the normal modes of motion using elastic network modelling (ENM) 54 implemented in the ElNemo software 23,24 . This generates a set of elastic eigenmodes and associated eigenfrequencies for each protein. The low-frequency modes are expected to have the largest motion amplitudes and thus be most significant for large conformational changes. The six lowest-frequency modes (modes 1-6) are just trivial combinations of rigid-body translations and rotations of the entire protein. Here we consider the six lowest-frequency non-trivial modes, that is, modes 7-12 for each protein. We denote these modes as m 7 , m 8 , . . . , m 12 .

Mobility simulations.
The modes are next used as starting directions for a geometric simulation, implemented in the Froda module 26 within First. This explores the flexible motion available to a protein within a given pattern of rigidity and flexibility. Froda then reapplies bonding and steric constraints to produce an acceptable new conformation. Since the displacement from one conformation to the next is typically small, we www.nature.com/scientificreports/ only record every 50th conformation. Typically the computation continues for several thousand conformations. A mode run is considered complete when no further projection along the mode eigenvector is possible (due to steric clashes or bonding constraints). Such a situation manifests itself in slow generation of new conformations. We have performed a Froda mobility simulation for each protein at several selected values of E cut . This allows us to study each protein at varying stages of its bond network, roughly corresponding to changing environmental conditions, such as different temperatures as well as changes in the solution environments. In a previous publication, we discussed the criteria for a robust selection of E cut 53 . Ideally, for each protein structure, a bespoke set of E cut values should be found, with the RCD plots providing good guidance on which E cut values to select. Clearly, for a large-scale study as presented here, this is not readily possible due to time constraints. Instead, we have chosen E cut = −1 kcal/mol, −2 kcal/mol and −3 kcal/mol for each protein. These values have been used before in a multi-domain protein with 22 kDa and shown to reproduce well the behaviour of (i) a mostly rigid protein at E cut = −1 kcal/mol , (ii) a protein with large flexible substructures/domains at E cut = −2 kcal/mol and (iii) a protein with mostly flexible parts connecting smaller sized rigid subunits at E cut = −3 kcal/mol 31,53 . In addition, we have also performed the analysis at other values of E cut when upon inspection of the RCD plots it was seen that the standard values (i)-(iii) would not be sufficient. The exact values used are given in Table S1. We emphasize that these trajectories do not represent actual stochastic motion in a thermal bath as a function of time, but rather the possibility of motion along the most relevant elastic modes. Each trajectory leads to a gradual shift of the protein from the starting structure. This shift eventually reaches an asymptote, where no further motion is possible along the initial vector, as a result of steric constraints. Energies associated with such a trajectory for bonds, angles, electrostatics, and so forth, can be estimated and shown to be consistent and physically plausible 31 .