Natural variants in SARS-CoV-2 Spike protein pinpoint structural and functional hotspots with implications for prophylaxis and therapeutic strategies

In December 2019, a novel coronavirus, termed severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was identified as the cause of pneumonia with severe respiratory distress and outbreaks in Wuhan, China. The rapid and global spread of SARS-CoV-2 resulted in the coronavirus 2019 (COVID-19) pandemic. Earlier during the pandemic, there were limited genetic viral variations. As millions of people became infected, multiple single amino acid substitutions emerged. Many of these substitutions have no consequences. However, some of the new variants show a greater infection rate, more severe disease, and reduced sensitivity to current prophylaxes and treatments. Of particular importance in SARS-CoV-2 transmission are mutations that occur in the Spike (S) protein, the protein on the viral outer envelope that binds to the human angiotensin-converting enzyme receptor (hACE2). Here, we conducted a comprehensive analysis of 441,168 individual virus sequences isolated from humans throughout the world. From the individual sequences, we identified 3540 unique amino acid substitutions in the S protein. Analysis of these different variants in the S protein pinpointed important functional and structural sites in the protein. This information may guide the development of effective vaccines and therapeutics to help arrest the spread of the COVID-19 pandemic.


SARS-CoV-2 Spike protein.
The SARS-CoV-2 S protein is 1273 amino acids long; it contains a signal peptide (amino acids [1][2][3][4][5][6][7][8][9][10][11][12][13], the S1 subunit (14-685 residues) that mediates receptor binding, and the S2 subunit (686-1273 residues) that mediates membrane fusion 30 . To identify areas in the S protein that are the least divergent as the virus evolves in humans, we obtained viral sequences from GISAID (Supplementary Table S1) that as of March 1, 2021, included 633,137 individual virus sequences isolated from humans throughout the world. As compared with the index WIV04 (MN996528.1, also known as the Wuhan variant or index virus) sequence of February 2020 31 , the 1273 amino acid S protein 8 had 3540 variants. This number of variants only includes filtered sequences (441,168) that are complete and do not contain an abnormal number of mutations (see "Methods"). As there are 3540 variants, on average, each position in the 1273 amino acid protein sequence has approximately three variants (Fig. 1d). However, some regions harbor 9 variants in a single amino acid position whereas others have no variants ( Fig. 1d; Supplementary Table S3). Regions in S protein with 2 or fewer variants/position (marked in light blue, Fig. 1d,e) are more prevalent in the structurally critical trimer interface (46% of the amino acids; Fig. 1d, Supplementary Fig. S1b,c, see Supplementary Table S4) Table S3). Of those amino acid substitutions in the RBD, only 3% are predicted by PROVEAN software 32 to be structurally or functionally damaging (Supplementary Table S2). Using PROVEAN, we also examined the predicted impact of the amino acid substitutions in the common more infective variants (B. Furin proteolysis sites. We next examined other regions in the S protein for which functions have been assigned. Furin proteolysis at the S1-S2 boundary (681-685) and in S2 (811-815) exposes the RBD to enable hACE2 binding, and the S2 domain to initiate membrane fusion 5 . Recent studies show that these cleavage sites are not necessarily specific for furin-mediated proteolysis and that S protein may be processed by multiple proteases to open the RBD into the active conformation [2][3][4]33 . Consistent with these observations, both the furin proteolysis consensus sites and the arginines that are critical for proteolysis are not conserved in the S protein (Fig. 2a), in agreement with a prior analysis of furin cleavage site 1 34 .
Glycosylation sites. The S protein also has 66 glycosylation sites in each trimer, which facilitate protein folding and may lead to host immune system evasion 18 , as 40% of the S protein's surface is shielded by glycans 17 . Surprisingly, with one exception, none of these glycosylation sites were invariable, suggesting that not all the www.nature.com/scientificreports/ glycosylation sites are essential for the S protein's functions (Fig. 2b,c). The only asparagine serving as an invariable glycosylation site is N343 in the RBD, located more than 25 Å away from hACE2-binding site, and therefore unlikely to mediate receptor binding.
Neuropilin-1 interaction site. Neuropilin-1 (NRP-1) is a transmembrane receptor that regulates angiogenesis 35 and immune response 36 and is expressed in many cell types 36 such as the endothelium 37 , immune cells 38 , and neurons 39 . Interaction between NRP-1 and S protein was proposed to regulate SARS-CoV-2 transmission [20][21] . Proteolysis of furin cleavage site 1 in the S protein of the index variant by furin was found to expose a C terminal motif, RXXR (where R is arginine and X is any amino acid), known to be the binding motif in NRP-1 19,21 . For example, a monoclonal antibody against the RXXR-binding site on NRP-1 reduced SARS-CoV-2 infectivity in culture 21 . Nevertheless, we found that the NRP-1 interaction-site in S protein is not conserved (Fig. 2d). Although the variants are predicted to have a neutral effect on the S protein structure (using Linoleic acid-binding site. A fatty acid-binding pocket has been identified in the inactive conformation of S protein 9 (Fig. 3a,b). The amino acids that make this pocket are conserved in other coronaviruses 9 and are unchanged (less than 2 variants) in 75% of the positions (Fig. 3a,b). Furthermore, among the 20 amino acids that line this pocket, 71% of the identified variants are predicted to have a neutral effect using PROVEAN (Supplementary Table S2). Analysis of the LA-bonding site identifies a potential pharmacophore that may fit small molecules (Fig. 3c), perhaps by mimicking ω-3 fatty acids 22 . . This less variable region is relatively hydrophobic, yet a substantial number of residues remain exposed in the open and closed conformations (Fig. 4c). Six residues, V551, T553, C590, V595, V608, Y612, in this relatively invariable region form a part of the largest hydrophobic patch in the protein measuring 370 Å 2 (Fig. 4d,e). Five of these residues (excluding T553) www.nature.com/scientificreports/ along with other residues that make this hydrophobic patch tolerate very few mutations and almost all the mutations that are tolerated change to other hydrophobic amino acids (Fig. 4d). We examined this region using Site Finder in Molecular Operating Environment (MOE) 40 and found that there is a binding site with a positive score for the propensity of ligand binding 41 , which encompasses several residues from this region (i.e. Cys590, Ser591, Phe592, Gly593) ( Supplementary Fig. S1e). This hydrophobic region is also 81% identical between SARS-CoV and SARS-CoV-2, but less than 15% identical when comparing the SARS-CoV-2 sequence with that of MERS-CoV (Fig. 4f).

Discussion
While SARS-CoV-2 has a lower mutation rate than other viruses due to proof-reading mechanisms 23 , aspects such as a relatively high R 0 of 1.9 to 2.6 42 , comparatively long asymptomatic incubation and infection periods, and zoonotic origins, leads to high variability in mutations in specific regions compared to the original reference sequence. This has been illustrated with the divergence of 6 major lineages in the past few months (Table 1). Our analysis of the frequency of variants throughout the S protein of SARS-CoV-2 identified regions of high and low divergence, which may aid in developing effective prophylactic and therapeutic treatments. In this analysis of mutations in the S protein, we did not consider the frequency of a particular mutation or in how many countries the mutation was found. Such analysis, as was done for D614G 43 , may further aid in determining the potential improved viral fitness acquired by a particular mutation. Protein glycosylation is essential for viral infection 44 . In SARS-CoV-2 S protein, there are 22 known N-glycosylation sites per monomer (Fig. 2b,c), but only one, asparagine 343, appears to be conserved. Furthermore, we found 156 positions in S protein that mutate to an asparagine residue in the existing 3540 variants that we analyzed, and many of them are exposed on the S protein ( Supplementary Fig. S1d). We propose that some of these new asparagine residues may create new glycosylation sites on the S protein that can contribute to immune evasion. Such an impact on the immune evasion by changes in the positions of glycosylation sites of viral envelope proteins have been described for influenza viruses; e.g., H3N2 has numerous new N-linked glycans on the viral hemagglutinin that enabled the virus to escape antibody neutralization and evade the host's immune system 45 . The formation of new glycosylation positions may also affect viral susceptibility to existing antibodies and to the immune response of infected individuals. A cryo-electron microscopy study has already suggested that coronaviruses mask important immunogenic sites on their surface by glycosylation 46 . Furthermore, recent work suggests that changes in glycosylation sites on the S protein of the virus may affect recognition of the S protein by other potential human proteins and receptors, inducing the toll-like receptors, calcitonin-like receptors, and heat shock protein GRP78, thus leading to a more severe inflammation that characterizes a more severe form of COVID-19 47 .
Additional sites on the S protein have been suggested to be critical for viral infectivity, including the trimer interface, the furin proteolysis sites and the NRP-1 binding site. However, our analysis suggests that not all these sites will be effective targets for prophylaxis and therapeutics. Specifically, the trimer interface is less accessible and therefore unlikely to be druggable. Another issue relates to the furin cleavage sites. As the viral S protein activation appears to require furin proteolysis 2-4 , protease-specific inhibitors are tested as a means to protect from infection 48 . However, our analysis suggests that this may not be an effective strategy, given the high variability of furin cleavage sites. This suggestion is consistent with previous data showing that other proteinases expressed throughout the body may work synergistically to activate the S protein 2,33 . Therefore, drugs that focus on inhibiting any single protease may not be effective preventative treatment against all SARS-CoV-2 variants. Similarly, the NRP1-binding site that is generated by proteolysis and the exposure of a C-terminal RXXR motif 19,21 may not be a good target for treatment against all SARS-CoV-2 variants, unless such a motif is also created by other proteases.
Are there additional sites on the S protein that can be explored to identify new treatments of COVID-19 or prevention of infections by SARS-CoV-2? There might be a benefit in focusing on the LA-binding site that help stabilize the S protein in the inactive closed conformer. Small molecules that mimic LA and bind into the LA pocket may stabilize the S protein in the closed/inactive conformation, thus reducing infectivity (Fig. 3a-c). Therefore, exploring the LA pharmacophore (Fig. 3c) with small molecules that can hold the S-protein in closed conformation, thus preventing the presentation of RBD to hACE2, could be of great interest as this may reduce viral infectivity. Our data also suggest that it may be beneficial to develop passive and active vaccines that target the RBD, instead of the entire glycosylated S protein; the RBD is less variable relative to the whole S protein (compare Fig. 1e,d). However, similar to some of the common viral isolates, such as the South African, B.1.351, new amino acid substitutions in the RBD may evade such therapeutics; e.g., loss of immunoreactivity to monoclonal antibodies 24 .
Finally, our study suggests that drugs and antibodies targeting region 541-612, a relatively conserved and exposed region on the protein's surface that we identified (Fig. 4a-d), warrant further study. Determining how druggable the pocket encompassing this region is (residues Cys590, Ser591, Phe592, Gly593), provided its solvent exposure, and whether modulating S protein by engaging this site will have a biological consequence is a challenge (Supplementary Fig. S1e). Very recently, Q564 within this region (star in Fig. 4a) has been proposed to act as a 'latch' , stabilizing the closed/inactive conformation of the S protein 49 . The high degree of conservation of hydrophobicity in this region potentially indicates its role in membrane fusion and/or maintaining structural integrity. The sequence similarity between SARS-CoV-2 and SARS-CoV (Fig. 4f) further supports the importance of this region, especially as both viruses have a similar route of infection. Determining the role of this invariable region warrants a further study, as it may be another Achilles heel to target for anti-SARS-CoV-2 treatment.

Materials and methods
Database of S protein amino acid variants, the world regions from where the virus was obtained, and whether the sequence is predicted to be deleterious. A FASTA formatted file containing 633,137 S protein sequences was retrieved on 03/01 from the GISAID database. This file had previously been preprocessed by the database with the individual alignment of genomes to the WIV04 (MN996528.1 31 ) reference sequence, using mafft 50 , via the command "mafft -thread 1 -quiet input.fasta > output.fasta" with subsequent translation into protein from the S protein-coding region at 21,563 to 25,384. For the analysis in this paper, only sequences sampled from humans were retrieved with the S protein sequences realigned via mafft 50 against the WIV04 (MN996528.1, 31 ) reference utilizing parameters ideal for a large number of highly similar protein sequences as well as using the option to maintain position numbering against the reference.
"grep -i "|Human|" input.fasta -A1 > output.fasta" "mafft --6merpair --thread -1 --keeplength --addfragments input.fasta reference.fasta > output.fasta" A python script (Supplementary Table S5) was generated to filter sequences based on set quality thresholds that included (1) Table S1), were chosen by the strict quality thresholds to remove low quality and potentially error prone sequences based on those that were incomplete, contain uncommon deletions, insertions, and have an unusually high number of mutations.
Calculating number of variants. The raw data for variants in the S protein was read into R studio 51 (v. 1.3.1093) and analyzed using the Tidyverse package 52 (Supplementary Table S4). The number of unique variants was calculated for each position, excluding insertions. Graphs were created for specific regions and each position was color-coded according to the number of variants present in that position (i.e., 0 -no color, 1-2 is blue, 3-4 is yellow, > 5 is red). See sample code below: Calculating variants: Calculating predicted effect of variants in PROVEAN. The amino acid sequence of S protein from the reference EPI_ISL_402124 (WIV04; Wuhan 31 ) sequence was uploaded to PROVEAN (http:// prove an. jcvi. org/ index. php) 32 . Every variant observed in S protein was also uploaded to compare against the reference sequence. Each variant was either predicted to be 'deleterious' or 'neutral' . The PROVEAN predictions were also read into R studio 51 Supplementary Fig. S1a (left), d, e, and PDB ID: 6ZB5 9 was used to prepare Supplementary Fig. S1a (right), Fig. 3a, c, 4c (right).
Sequence alignment. The Spike protein sequences from SARS-CoV-2, SARS-CoV, and MERS-CoV were uploaded to Jalview 53 . The Mafft alignment was then performed to align each amino acid sequence.
Pharmacophore generation. PDB ID: 6ZB5 9 was opened and prepared using the QuickPrep functionality at the default settings in MOE. Dummy atoms were created at the LA-binding site formed by chains 6ZB5.A Proportion with regions with 2 or fewer = # of Positions with 2 or fewer variants Total # of Positions * 100%