Introduction

The outbreak of a respiratory illness in Wuhan, China on December 19, 2019 has created a global public emergency and spread a new coronavirus disease (COVID-19)1,2. The virus causing COVID-19 was termed SARS-CoV-23, and shares sequence and structural similarity with the Middle East respiratory syndrome coronavirus (MERS-CoV)4 and with the severe acute respiratory syndrome coronavirus (SARS-CoV)5,6,7. As of March 30, 2021, the COVID-19 pandemic had infected 128,006,406 people and killed 2,799,201 worldwide (https://www.worldometers.info/coronavirus/). The pandemic has caused economic shutdown in many countries, fears of a global recession, restricted travel, closure of educational institutions, and mental distress on a global level8,9,10. Given the rapid spread and severe impact of SARS-CoV-2, many countries are still fighting the second and third waves of the virus and fear the devastation of future waves11.

SARS-CoV-2 contains a lipid envelope bilayer, with attached spike and membrane proteins, surrounding the single-stranded RNA genome of the virus12. Similar to MERS-CoV and SARS-CoV, the spike proteins in SARS-CoV-2 mediate the attachment and binding of the virus to cell receptors and facilitate the release and entry of the viral genome into host cells13,14,15. The binding to host cells occurs at the receptor-binding domain (RBD)16 in the S1 subunit of the spike protein. The mechanical stability of the RBD in SARS-CoV-2 is 50 pN greater than in SARS-CoV, which could explain the rapid spread of COVID-1917. The SARS-CoV-2 RBD recognizes the angiotensin-converting enzyme 2 (ACE2), located in the lungs, heart, kidney, and intestines, as its host receptor2,18. Impeding the function of spike proteins has been the target of several antibodies, vaccines, and inhibitors19,20,21,22.

Recent findings have reported the occurrence of sequence mutations in the SARS-CoV-2 spike protein, including some in the RBD23. The impact of the mutations on disease progression and biochemical phenotypes in COVID-19 remains unknown. However, mutations in the spike protein and RBD of SARS-CoV and MERS-CoV are believed to have been a significant evolutionary factor in the transmission of the virus from bats to humans and to have influenced binding potentials to host receptors24,25,26. Mutations increasing affinity to human receptors are ubiquitous27,28, and exploring how mutations in the RBD region of SARS-CoV-2 impact the spike protein would aid in the design of better inhibitors and potential vaccines29.

Vaccines targeting the RBD region will have to account for possible natural mutations that could influence the spike protein’s stability and alter its dynamics with the ACE2 receptor. Mutations altering the RBD conformation have recently been shown to allow SARS-CoV-2 to elude antibody treatments and resist therapy30. It is hence imperative to study the possible mutations that could occur in the RBD and the impact they might have on COVID-19 progression and vaccine interactions. In this study, we computationally explore the mutation landscape of the RBD region and pinpoint mutations expressing strong binding potentials. We developed a tool, called SpikeMutator, to generate all possible single-point spike mutant trimer structures and map their free energies31,32 to assess the effect of mutations on structure stability. We analyzed the current isolated spike sequences in the GISAID database23 against the energy landscape and found evidence of accumulated mutations increasing the spike’s structural stability. Vaccine efforts targeting spikes will have to account for such mutations as more spike variants start to appear across the globe33. To the best of our knowledge, this is the first work that aims to study the mutation landscape of the SARS-CoV-2 RBD region.

Results and discussion

To analyze the stability of the spike protein against possible mutations that could occur in the RBD region, we developed a tool called SpikeMutator to generate a PDB structure of the spike with mutations applied to it. The spike structures we build are based on the cryo-EM SARS-CoV-2 spike glycoprotein reported by Walls et al.34. The amino acid sequence of this structure is presented in Table S1. In this structure, the RBD is located between amino acids 331 and 524, inclusive. The spike complex involved in COVID-19 is a trimer structure made up of three spike protomers. Figure 1 presents a schematic image of the spike protein as a single and trimer structure. Three single protomer structures aggregate to form a trimer conformation in Fig. 1a that binds to the ACE2 enzyme at the RBD interface. Each spike protomer contains two functional subunits, S1 and S2. The S1 subunit contains an N-terminal domain (NTD) and a receptor-binding domain (RBD), as highlighted in Fig. 1b. S1 binds to a host receptor and S2 contains the protein’s fusion machinery35,36.

Figure 1

3D Structure of a Spike Protein. (a) Cryo-EM structure of the spike protein trimer (PDB ID: 6VXX) composed of three protomers, colored red, blue, and yellow, all in a closed conformation. (b) A single spike protomer in a closed conformation containing a receptor-binding domain highlighted in red, an N-terminal domain highlighted in blue, a connector domain highlighted in green, and position 614 highlighted in yellow.

To generate a spike protein with a set of desired n-point mutations, SpikeMutator applies each mutation to all three protomers of the spike complex and runs an all-atom molecular simulation to compute the free energy of a mutant complex using energy terms defined in Eq. (1). Figure 2 presents a flowchart outlining the steps of this tool. The output of the tool is a PDB structure with the desired n-point mutations applied to all three aggregate protomers in the spike complex. The tool supports the construction of the spike complex in both receptor-accessible (open) and receptor-inaccessible (closed) states36 and can be used to explore the energetics of the 1up2down and 2up1down spike conformations37.

Figure 2

SpikeMutator pipeline. A flowchart describing the methods involved in mutating a SARS-CoV-2 spike protein. Mutations are applied to each protomer in the trimer complex, and the resulting atomic structure file is generated along with its free energy.

To study the landscape of potential mutations that can appear in the RBD region, we used SpikeMutator to exhaustively mutate each amino acid in the RBD region to the 19 other canonical amino acids and generated a database of the 3D conformations of all possible spike trimer mutants. Every trimer structure contained one mutation that was simultaneously applied to each of the three aggregated spike proteins. The free energies generated by the all-atom simulation runs are reported in Figs. 3 and 4. Figure 3 plots a 3D mutation energy landscape of the receptor-binding domain. One axis in the landscape represents the amino acids positioned at residues 331 through 524, the second axis plots the 20 canonical amino acids, and the third axis captures the free energy of the mutant structure defined by the two other axes. Lower energy values correlate with favorable mutations that stabilize the RBD and can potentially improve binding with the ACE2 enzyme. Higher energy values suggest mutations that increase instability in the domain and potentially alter the binding dynamics with the ACE2 enzyme.

Figure 3

Mutation Landscape of the receptor-binding domain in SARS-CoV-2. Energy values are in kcal/mol and are computed by Eq. (1). Each x,y coordinate represents a mutation x at a position y. The z-values are the free energies of the mutated structures. Lower energies (blue) correspond to increased stability.

The 3D landscape maps the possible outcomes of single-point mutations and exposes conserved regions and unstable ones. For some amino acid positions, the appearance of a mutation does not impact stability. At other positions, most mutations will cause an increase in destabilizing forces, illustrated in the 3D plot by red curves that create a wall-like barrier. Interestingly, the introduction of some amino acid mutations such as a C or D throughout the spike generally improves structural stability.

Figure 4 provides a 2D projection of the landscape and makes it easier to visualize the energies across different areas of the RBD. We observe that a mutation introducing a negatively charged, polar, and hydrophilic amino acid such as aspartic acid (D) or glutamic acid (E) would increase the stability of the receptor-binding domain. A positively charged, polar, and hydrophilic amino acid such as arginine (R) introduces some of the most unfavorable mutations in the region 419–434, which is located toward the interior of the spike trimer and far from the solvent-accessible surface. The spike trimer contains an arginine (R) at position 355. The landscape plot suggests that this is an unstable residue position and that any other amino acid at this position would increase the stability of the domain region. Hence, a mutation at this position is likely to be observed, and one has in fact been reported in a sequence from the UK on GISAID23. In addition, we found that 15 out of 19 mutations at position 331 and another 15 out of 19 mutations at position 343 increase the instability of the spike complex. Experimentally, it has been shown that disrupting the glycosylation at both of these positions greatly reduces the infectivity of SARS-CoV-238.

Figure 4

2D Projection of the RBD Mutation Landscape. The introduction of a C or D mutation throughout most positions of the spike consistently suggests an improvement in the overall stability of the protein. Mutations in positions 355, 357, and 408 strikingly result in stable conformations. Positions 364 and 389 appear to be unstable spike regions. Positions 420, 427, 428, 444, and 465 also appear to be unstable spike regions. On the other hand, positions 454, 462, 466, 501, and others colored in blue can exhibit mutations that stabilize the spike protein.

Using the 65,000 spike protein sequences collected internationally and made available at GISAID23, we found 3405 spike sequences that diverged from the Walls et al.34 reference spike sequence. Of those, 2491 sequences contained missing readings in the receptor-binding domain and were excluded from further analysis. Twenty-one sequences presented indels (insertions and deletions) that might have affected the progression or severity of the disease. A multiple alignment diagram of these sequences is presented in Table S2 of the supplementary material. Countries reporting these sequences with indels are Australia, China, India, Israel, Qatar, UK, and USA. Among these countries, Qatar was the only one to report a spike protein with an insertion, resulting in the longest known spike mutant with 196 amino acids. It is not known whether this longer sequence contributed to the country’s low COVID-19 mortality rate (0.15%).
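The triage described above (missing readings, indels, point mutations) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the study: the function name `classify`, the use of `X` as the placeholder for a missing reading, and the eight-residue toy reference are all hypothetical; only the RBD start position 331 comes from the text.

```python
RBD_START = 331  # first RBD position in the reference numbering (from the text)


def classify(reference, sample, missing="X"):
    """Classify one RBD sequence against the reference RBD sequence.

    Returns (category, mutations), where category is one of
    "missing", "indel", "point", or "identical".
    The `missing` placeholder character is an assumption.
    """
    if missing in sample:
        return "missing", []            # excluded from further analysis
    if len(sample) != len(reference):
        return "indel", []              # insertion or deletion; needs alignment
    # Equal length: report substitutions in the usual <ref><position><alt> form.
    mutations = [
        f"{ref}{RBD_START + i}{alt}"
        for i, (ref, alt) in enumerate(zip(reference, sample))
        if ref != alt
    ]
    return ("point" if mutations else "identical"), mutations


# Toy eight-residue reference fragment, for illustration only.
toy_reference = "NITNLCPF"
print(classify(toy_reference, "NITKLCPF"))
```

Sequences in the "indel" category would still need a proper alignment before individual changes can be named, which this sketch does not attempt.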

The remaining 894 sequences had equal length to the reference spike sequence but exhibited one or more point mutations. We report in Table 1 the top country origins of these sequences and the \(\Delta E\) values generated using our mutation landscape and Eq. (2). \(\Delta E\) measures the change in free energy between the mutant and non-mutant spike structures. Positive values indicate an increase in instability and negative values indicate an increase in stability within the RBD. We observe that all the countries in Table 1 report mutations that both stabilize and destabilize the receptor-binding domain, which could explain the varying severity of disease symptoms and health conditions reported in those countries39. Table S6 of the supplementary material is an extension of Table 1 and contains the complete list of countries. We report in Table S10 the \(\Delta E\) values of the top 10 stabilizing and top 10 destabilizing single-point mutations calculated using three different molecular dynamics force fields. The results show that the force fields agree on the reported stabilizing mutations, that is, on which predicted RBD mutations increase the stability of the spike. The last column of the table presents an error estimate of the AMBER force field we used in comparison with the other force fields. The error bounds are low, suggesting that the stability outcome of the reported mutations is independent of force-field uncertainties.

Table 1 Top reporting countries for spike sequences on GISAID.

Table 2 lists the spike mutations that appear worldwide ranked in order of the number of reporting countries. Mutation V367F was reported in 12 countries and appeared in 51 sequences. Our energy landscape suggests that this mutation is favorable and further stabilizes the receptor-binding domain (\(\Delta E < 0\)). Other mutations exhibit positive \(\Delta E\) values and could lead to different binding potentials with the ACE2 enzyme, affecting the rate of disease, incubation periods and patient symptoms. Table S5 of the supplementary material reports the data for all 894 sequences. Table S7 of the supplementary material presents the point mutations reported in each country as of July 2020.

Table 2 Occurrences of spike sequence mutations reported on GISAID.

Out of the 894 sequences with mutations, 26 sequences exhibited 2 or more simultaneous point mutations. Table 3 reports these multiple-point mutations. It is interesting to note that all recent 2, 3, and 4-point mutations have a \(\Delta E < 0\), suggesting a mutation drive to further stabilize the receptor-binding domain and potentially increase infectivity. The 2-point mutation (V367F G413V) that was reported in Spain appears to have evolved from the widely spread mutated sequence (V367F) reported in 12 countries. No other sequence reported a G413V mutation. Although we do not know the order in which the other mutations have appeared, it is theoretically possible for a sequence to undergo a destabilizing mutation first and then after some time experience a strong stabilizing one that brings the structure to a more overall stable conformation. The sequence with the 3-point mutation (Q506H P507S Y508N) could have undergone its first destabilizing mutation (Q506H \(\Delta E > 0\)) and then experienced two stabilizing single point mutations (P507S \(\Delta E < 0\)) and (Y508N \(\Delta E < 0\)). If this is the case, then other spike sequences that have become mild or less dangerous because of destabilizing mutations could potentially experience future mutations that cause a regain in toxicity potential, and cause a periodic increase and decrease in COVID-19 symptoms.

Table 3 Multiple-point mutations in the spike protein reported on GISAID.
Table 4 Prediction scores of top stabilizing 2-point mutations in the receptor-binding domain.

We report in Table S8 of the supplementary material two sequences from China with 6 and 7-point mutations reported in 2019. We have not included these in our analysis, as it was unusual for that number of mutations to occur so early in the pandemic.

A potential vaccine or molecular therapeutic for SARS-CoV-2 that can inhibit the binding between the receptor-binding domain and the ACE2 enzyme would need to work on the different spike mutants that have started to appear and spread throughout different countries. It is not feasible to create a vaccine tailored to work on each mutant. However, if a vaccine can show good results on one of the most stable mutant structures, then it is possible that it will also show good results on many less stable ones. In light of this, we utilized the energy values of the single-point landscape to generate estimated \(\Delta E\) values (\(\Delta {{\tilde{E}}}\)) for 2-point mutations. Using Eq. (4), we generated \(\Delta {{\tilde{E}}}\) values for all 7.5 million possible 2-point mutations. Table 4 reports the top ten most stable 2-point mutations and Table S9 of the supplementary material reports the top 1000. Although not exact, the \(\Delta {{\tilde{E}}}\) values given by Eq. (4) are close approximations to the \(\Delta E\) values generated from full atomic simulation runs. The margin of error of \(\Delta {{\tilde{E}}}\) values was on average 6.5%. This method predicted that the 2-point mutation (R355D K424E) contributes strong structural stability to the spike protein and should be tested against potential vaccines and inhibitors. Other tools in the literature, such as DynaMut240 and PROVEAN41, have also supported these findings. In Table 4, we report the change in binding affinity induced by mutation (\(\Delta \Delta G\)) of the top 2-point mutations run on DynaMut2. Positive and negative signs correspond to destabilizing and stabilizing mutations, predicted to decrease and increase binding affinity, respectively. The PROVEAN tool predicted that the top 2-point mutation (R355D K424E) might have an impact on the biological structure and function of the spike protein.

To further explore the effect of the (R355D K424E) mutation on the spike’s structure and stability compared to the non-mutant native type, we ran 50 ns molecular dynamics simulations on both the full non-mutant native type and the mutant structure. We report in Fig. 5a the RMSD graph of the simulation run. The RMSD graph shows that the mutant structure exhibits a more stable and energetically favorable conformation compared to the non-mutant native type. When we superimposed the resulting structures at time = 50 ns onto their initial conformations, we found that the N-terminal domain (NTD) of the mutant structure was more conserved than the NTD in the non-mutant native type. Figure 5c,d shows a schematic of the two structures superimposed on their initial starting conformations, plotted in gray. The increase in RMSD values in the mutant graph of Fig. 5a at around time = 20 ns was due to a slight translational shift of the entire RBD region. Figure 5b shows the RMSF plot of the simulation run and captures this change in RBD position. Although the mutation has induced an overall increase in stability, the RBD region’s conformation has been altered. We plot in Figs. S1 and S2 the residue contact maps of both the native and mutant structures at time t = 0 ns and t = 50 ns. The results show that the mutant structure is more stable than the native protein. Compared to the native structure, the mutant lost 15 fewer contacts, made 4 extra new contacts, and preserved 10 additional contacts. This gain in stability can affect binding to the ACE2 enzyme, and might impose new identification challenges for antibodies. For these reasons, the (R355D K424E) mutant should be tested against potential vaccine candidates.
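The lost, new, and preserved contact counts quoted above can be read as set operations on the contact maps at t = 0 ns and t = 50 ns. The following is a minimal sketch of that bookkeeping, assuming each contact is represented as a pair of residue indices; the function name `compare_contacts` and the toy contact sets are hypothetical and are not taken from the contact-map server output.

```python
def compare_contacts(start, end):
    """Compare two residue contact sets taken at the start and end of a run.

    Each contact is assumed to be a (residue_i, residue_j) pair with i < j.
    """
    start, end = set(start), set(end)
    return {
        "lost": start - end,        # present at t = 0, absent at the end
        "new": end - start,         # absent at t = 0, present at the end
        "preserved": start & end,   # present at both time points
    }


# Toy contact sets, for illustration only.
t0 = {(331, 335), (340, 350), (355, 398)}
t50 = {(331, 335), (355, 398), (360, 370)}
summary = compare_contacts(t0, t50)
print({name: len(pairs) for name, pairs in summary.items()})
```

Running the same comparison for the native and mutant trajectories and subtracting the counts gives differences of the kind reported in the text.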

Figure 5

Full 50 ns MD simulation of the non-mutant native spike (blue) and the (R355D K424E) mutant structure (orange). (a, b) RMSD and RMSF plots, respectively, of the MD simulation run. (c) The non-mutant structure at time 50 ns (blue) superimposed on the native structure at time 0 ns (gray). (d) The 2-point mutant structure at time 50 ns superimposed on its starting conformation at time 0 ns (gray).

The 2-point and 3-point mutations that countries have reported are still far from the most stable conformations possible. It appears that the virus still has millions of possible candidate mutations to select from to increase the receptor-binding domain’s stability and potentially become more toxic.

Methods

The cryo-EM SARS-CoV-2 spike glycoprotein trimer structure (PDB ID 6VXX) was used as the 3D blueprint model on which mutations were performed for “closed” spike conformations, and the structure with PDB ID 6VYB was used as the model for “open” spike conformations34.

Generating the mutation landscape

Algorithm 1 outlines the process of generating 3D atomic models for all possible single-point mutations in the receptor-binding domain (positions 331–524) and calculating the free energy of each mutant structure. Each of the 194 amino acids in this region was mutated into the 19 other canonical amino acids using SCWRL442, producing 3,686 single-point mutants; together with the native residue at each position, this gives the 194 × 20 = 3,880 (position, amino acid) entries of the landscape. Every mutation was applied to each of the three chains in the trimer structure separately, and the three structures were subsequently joined and run through an energy minimization step to relax any steric clashes. Running Algorithm 1 produced the energy values presented in Figs. 3 and 4.

Algorithm 1
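The loop structure described for Algorithm 1 can be sketched as follows. This is a schematic under stated assumptions rather than the SpikeMutator source: the callables `mutate_chain`, `join_chains`, `minimize`, and `energy` are hypothetical placeholders for the side-chain modeling, trimer assembly, energy minimization, and Eq. (1) evaluation steps, and the chain labels A, B, C are assumed. The toy stand-ins at the bottom exist only so that the sketch runs.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 canonical amino acids
RBD_POSITIONS = range(331, 525)        # positions 331..524 inclusive
CHAINS = ("A", "B", "C")               # assumed chain labels of the trimer


def build_landscape(native, mutate_chain, join_chains, minimize, energy):
    """Return {(position, amino_acid): free energy} for every landscape entry."""
    landscape = {}
    for position in RBD_POSITIONS:
        for aa in AMINO_ACIDS:
            # Apply the same substitution to each of the three chains separately.
            chains = [mutate_chain(native, chain, position, aa) for chain in CHAINS]
            # Rejoin the chains and relax steric clashes before scoring.
            trimer = minimize(join_chains(chains))
            landscape[(position, aa)] = energy(trimer)
    return landscape


# Toy stand-ins (hypothetical) so the loop can be exercised without any
# structure files or simulation software.
def _toy_mutate(native, chain, position, aa):
    return (chain, position, aa)

def _toy_join(chains):
    return tuple(chains)

def _toy_minimize(trimer):
    return trimer

def _toy_energy(trimer):
    return float(len(trimer))


toy_landscape = build_landscape(None, _toy_mutate, _toy_join, _toy_minimize, _toy_energy)
print(len(toy_landscape))  # one entry per (position, amino acid) pair
```

Because every (position, amino acid) entry is independent of the others, the two nested loops can be distributed across machines without coordination.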

Computing spike structure energies

The free energy of a protein molecule in a solvent medium is correlated with its structural stability. In general, increased stability promotes better binding potential with other molecules. Lower stability can indicate weaker binding potential. We calculate the free energy \(E^K_m\) of a mutant spike structure with a mutation m at position K by computing the LJ, Coulomb, and solvation energy values of the mutant trimer using Eq. (1),

$$\begin{aligned} E^K_m = LJ^K_m + Coul^K_m + S^K_m \end{aligned}$$
(1)

where LJ\(^K_m\) is the Lennard-Jones potential, Coul\(^K_m\) is the Coulomb energy, and \(S^K_m\) measures the solvation energy resulting from the contact of the trimer surface with water molecules, all for the spike structure with mutation m at position K. The LJ and Coul terms measure the van der Waals and electrostatic interactions, respectively, between the atoms of the trimer structure and are computed after an energy minimization step that reduces any steric clashes introduced in the mutation phase.

Low free energy values E indicate increased stability in the receptor-binding domain, and improved overall stability in the spike protein. Conversely, high E values can indicate increased instability in the receptor-binding domain, and lower overall stability in the spike trimer structure.

The E values generated in this study are displayed in Fig. 3. The figure describes an energy landscape of all possible single-point mutations in the receptor-binding domain. The amino acid positions of the domain are plotted on the x-axis, the 20 possible mutations on the y-axis, and the E value produced by a (position, mutation) pair makes up the z-axis.

Each mutation can be characterized by the difference in energy between its final mutated state and its initial non-mutated state. This change of energy at a position K to some amino acid m can be captured by Eq. (2),

$$\begin{aligned} \Delta E^K_{m} = E^K_m - E^{(0)} \end{aligned}$$
(2)

where \(E^K_m\) is the free energy of the spike protein with amino acid m at position K and \(E^{(0)}\) is the free energy of the initial non-mutated spike. Negative \(\Delta E\) values suggest mutations that increase stability and positive \(\Delta E\) values suggest mutations that are potentially destabilizing.
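As a small worked illustration of Eqs. (1) and (2), the sketch below sums the three energy terms and subtracts the native energy. The function names and all numeric values are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the study.

```python
def free_energy(lj, coulomb, solvation):
    """Eq. (1): E = LJ + Coul + S, all in the same energy unit."""
    return lj + coulomb + solvation


def delta_e(mutant_energy, native_energy):
    """Eq. (2): change in free energy relative to the non-mutated spike."""
    return mutant_energy - native_energy


def interpret(value):
    """Sign convention from the text: negative stabilizes, positive destabilizes."""
    if value < 0:
        return "stabilizing"
    if value > 0:
        return "destabilizing"
    return "neutral"


# Toy numbers (hypothetical), in arbitrary but consistent energy units.
e_native = free_energy(lj=-60.0, coulomb=-30.0, solvation=-10.0)   # -100.0
e_mutant = free_energy(lj=-62.0, coulomb=-32.0, solvation=-11.0)   # -105.0
change = delta_e(e_mutant, e_native)
print(change, interpret(change))
```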

Solvation energy using dipolar water solvent

The Solvation term is computed using a fast and detailed dipolar water model that solves the dipolar nonlinear Poisson–Boltzmann–Langevin equation using the AQUASOL subroutine43. More precisely, the solvation energy in SpikeMutator is computed by Eq. (3). To lighten the notation, we omit the indices for the solvation energy term, S.

$$\begin{aligned} S &= F_{(p_0, C_{dip})} - F_{(0, 0)} \nonumber \\&\quad-\left( k_B T \frac{\ln (1-N_A C_{dip} a ^3)}{N_A C_{dip} a ^3} \right) \int _{solvent} d {\mathbf {r}} \rho _{dip} ({\mathbf {r}}) \end{aligned}$$
(3)

where \(F_{(p_0, C_{dip})}\) defines the free energy of the system defined at dipoles of moment values \(p_0\) and concentration \(C_{dip}\), \(F_{(0, 0)}\) the free energy of the system with solvent concentration set to zero, \(a^3\) is the lattice grid size volume of the solvent, \(k_B\) is the Boltzmann constant, T temperature in Kelvin, and r is the surface definition of the solvent-accessible surface probe. Detailed parameters of the dipolar model setup can be found in the Supplementary Material.

Energy minimization and molecular dynamics

The process of mutating an amino acid of a structure into another amino acid can often introduce steric clashes with neighbouring residues in its spatial vicinity. After performing a point mutation in each of the three spike chains, SpikeMutator joins the three structures back into their initial conformation. Prior to calculating the energy of this new structure, we perform a short energy minimization (EM) run using the GROMACS 2019.3 molecular dynamics package44 to relax the structure and remove any severe steric clashes. The setup of the run is as follows: molecules were prepared in a cubic box (with a minimum distance of 35 \(\AA \) from any edge of the box to any atom), neutralized with chloride ions, and modeled using the AMBER99SB-ILDN45 force field along with the TIP3P water model. We used a cutoff of 10 \(\AA \) for van der Waals and short-range electrostatic interactions, and calculated long-range electrostatic interactions using a particle mesh Ewald sum46,47. For every structure, 2000 EM steps were performed using a steepest descent algorithm and the Verlet cut-off scheme. Simulations to generate the RMSD and RMSF plots in Fig. 5 were prepared for a full MD run with equilibration in both the isothermal-isobaric and canonical ensembles. Twenty-five million time steps were used with an integration time step of 2 fs to simulate 50 ns. The residue contact maps of structures undergoing MD were generated using the GoContactMap server48 with the following GO parameters: Sequence distance = 4, Cutoff long = 1.1 nm, and Cutoff short = 0.3 nm.
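To make the stated settings concrete, the sketch below collects the minimization parameters named above into a Python dictionary, renders them in GROMACS `.mdp` key/value syntax, and checks the step arithmetic for the production run. The `.mdp` option names are standard GROMACS options, but the actual parameter file used in the study is not given in the text, so this fragment is an assumption that covers only the values quoted above; the helper names `render_mdp` and `simulated_ns` are hypothetical.

```python
# Energy-minimization settings quoted in the text, using GROMACS .mdp option names.
EM_SETTINGS = {
    "integrator": "steep",       # steepest descent minimization
    "nsteps": 2000,              # 2000 EM steps per structure
    "cutoff-scheme": "Verlet",   # Verlet cut-off scheme
    "coulombtype": "PME",        # particle mesh Ewald for long-range electrostatics
    "rcoulomb": 1.0,             # short-range electrostatic cutoff in nm (10 angstrom)
    "rvdw": 1.0,                 # van der Waals cutoff in nm (10 angstrom)
}


def render_mdp(settings):
    """Render a settings dictionary as aligned 'key = value' .mdp lines."""
    width = max(len(key) for key in settings)
    return "\n".join(f"{key:<{width}} = {value}" for key, value in settings.items())


def simulated_ns(n_steps, dt_fs):
    """Total simulated time in nanoseconds (1 ns = 1,000,000 fs)."""
    return n_steps * dt_fs / 1_000_000


print(render_mdp(EM_SETTINGS))
print(simulated_ns(25_000_000, 2))  # 25 million steps of 2 fs each
```

The second helper confirms that 25 million steps at 2 fs per step correspond to the 50 ns reported for the full MD runs.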

Multiple-point mutations

The number of possible mutations grows exponentially as we consider higher-order point mutations. In the receptor-binding domain, there are 7.5 million possible combinations of 2-point mutations, 9.6 billion possible combinations of 3-point mutations, and 9.2 trillion possible 4-point mutations. Since it is not feasible to run a full simulation for each of these multiple-point mutations, we estimate the change in energy caused by a multiple-point mutation by summing the individual \(\Delta E\) values of its constituent single-point mutations. The estimated change in energy is given by Eq. (4),

$$\begin{aligned} \Delta {{\tilde{E}}} = \sum _{K~\in ~dom({{\mathcal {M}}})} \Delta E^K_{{{\mathcal {M}}}(K)} \end{aligned}$$
(4)

where \({{\mathcal {M}}}\) is a map between spike amino acid positions and mutations, dom(\({{\mathcal {M}}}\)) returns the positions of the requested mutations, and \({{\mathcal {M}}}{(K)}\) returns the desired amino acid mutation at position K.
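The counting argument and the additive estimate of Eq. (4) can be sketched as below. Two points are assumptions rather than statements from the text: the combination counts quoted above are reproduced when each of the 194 sites is allowed any of the 20 canonical residues (native included), and the single-point \(\Delta E\) values in the toy table are invented purely for illustration. The function names are hypothetical.

```python
import heapq
from itertools import combinations
from math import comb


def count_n_point(n_positions, n_mutations, residues_per_site=20):
    """Number of n-point combinations: choose the sites, then a residue per site."""
    return comb(n_positions, n_mutations) * residues_per_site ** n_mutations


def estimate_delta_e(single_delta_e, mutation_map):
    """Eq. (4): sum the single-point Delta E values over the mutated positions.

    `mutation_map` plays the role of M: {position: amino_acid}.
    """
    return sum(single_delta_e[(pos, aa)] for pos, aa in mutation_map.items())


def top_two_point(single_delta_e, k=10):
    """Rank 2-point mutations at distinct positions by estimated Delta E."""
    keys = sorted(single_delta_e)
    pairs = (
        (single_delta_e[a] + single_delta_e[b], a, b)
        for a, b in combinations(keys, 2)
        if a[0] != b[0]   # two substitutions cannot share a position
    )
    return heapq.nsmallest(k, pairs)


print(count_n_point(194, 2))   # about 7.5 million
print(count_n_point(194, 3))   # about 9.6 billion
print(count_n_point(194, 4))   # about 9.2 trillion

# Toy single-point table (hypothetical values).
toy = {(355, "D"): -3.0, (424, "E"): -2.0, (355, "A"): -1.0}
print(estimate_delta_e(toy, {355: "D", 424: "E"}))
print(top_two_point(toy, k=1))
```

Using `heapq.nsmallest` over a generator keeps only the k best candidates in memory, which matters when the full set of pairs runs into the millions.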

Supercomputer resources

To perform our simulations, we used the cloud-computing system provided by Amazon Web Services (AWS). Constructing the energy landscape required 24 EC2 machines with 16 CPUs each, running in parallel for several weeks. The full molecular dynamics simulations of the native and the (R355D K424E) mutant structures required 2 machines with 36 CPUs each. The SpikeMutator algorithm and a copy of the GROMACS 2019.3 molecular dynamics package44 were installed on each machine. Generating the 7.5 million possible combinations of 2-point mutations and ranking their energies was also carried out on these AWS machines.

Conclusion

We studied in this work the stability of the spike protein under all possible single-point mutations in the receptor-binding domain and explored mutations that can influence structural stability and affect binding with the ACE2 enzyme. We devised a tool, called SpikeMutator, to construct full atomic protein structures of the mutant spike proteins and generated a database of all possible single-point mutant trimer structures. We observed that the sequences isolated from COVID-19 patients exhibited some mutations that both increased and decreased the spike’s structural stability. Out of the 7.5 million possible 2-point mutation combinations, we found that the (R355D K424E) mutation produces one of the most stable spike proteins and should be included in the testing of possible vaccines and molecular inhibitors of SARS-CoV-2.

Our future work will be dedicated to the elaboration of a mutation model that captures the transitions between different spike mutation states in order to detect multiple-point mutations that can cycle between low and high energy states. This would potentially provide some evidence on how mutations can manifest different clinical symptoms. We also aim to explore mutations that appear outside the RBD region, such as the D614G mutation at position 614, highlighted in Fig. 1b, which has become one of the dominant spreading mutations worldwide49.