Hotspots for mutations in the SARS-CoV-2 spike glycoprotein: a correspondence analysis

Spike glycoprotein (Sgp) is liable for binding of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) to the host receptors. Since Sgp is the main target for vaccine and drug designing, elucidating its mutation pattern could help in this regard. This study is aimed at investigating the correspondence of specific residues to the SgpSARS-CoV-2 functionality by explorative interpretation of sequence alignments. Centrality analysis of the Sgp dissects the importance of these residues in the interaction network of the RBD-ACE2 (receptor-binding domain) complex and furin cleavage site. Correspondence of RBD to threonine500 and asparagine501 and furin cleavage site to glutamine675, glutamine677, threonine678, and alanine684 was observed; all residues are exactly located at the interaction interfaces. The harmonious location of residues dictates the RBD binding property and the flexibility, hydrophobicity, and accessibility of the furin cleavage site. These species-specific residues can be assumed as real targets of evolution, while other substitutions tend to support them. Moreover, all these residues are parts of experimentally identified epitopes. Therefore, their substitution may affect vaccine efficacy. Higher rate of RBD maintenance than furin cleavage site was predicted. The accumulation of substitutions reinforces the probability of the multi-host circulation of the virus and emphasizes the enduring evolutionary events.


Results
Sequence data. Sgp SARS-CoV-2 , a 1273 amino acid long sequence, is divided into five distinct domains as shown in Supplementary Table S1. The available SARS-CoV-2 Sgp homologous sequences were collected in the libraries of non-redundant sequences (proteins of similar length) based on the hidden Markov model profiling to cluster the complete sequences of spike proteins.
To better focus on domains of the protein, the sequences of the divided domains were searched against the databases separately. The search results were used to build the non-redundant libraries of sequences. Each library included sequences of similar length and e-value lower than 10 -4 . A preliminary review of the libraries showed that the libraries of RBD and N-terminal domain (NTD) were mostly occupied by beta coronaviruses, while other libraries contain more divergent members. Clustering experiments-in the following section-will better assess this issue.
The disparity index test showed a homogenous pattern of substitution for all datasets (data not shown); therefore, all sequences were retained for further evaluations and considered suitable for alignment approaches.
Clustering the sequences. To define the relationship between sequences, each library was clustered based on the strength of their all-against-all pairwise sequence similarities. The network-based clustering approach also identified the closely related sequences and divided them into separate groups. Members of the sequence libraries in this section belong to coronaviruses excluding the SARS-CoV-2 (the limitation strategy of BLAST). The sequence collections are uniform in length and are the result of HMM profiling by querying the Sgp SARS-CoV-2 .
As illustrated in Fig. 1, when the dataset of the whole sequence of the Spgs was clustered, three groups were assigned. The results showed the true separation of beta-coronaviruses from other genera (cluster 1); as alphacoronaviruses were collected in cluster 2, and gamma-coronaviruses were collected in cluster 3. In contrast to the complete sequence, when some small segments of the protein were administered, the clustering approach yielded more specialized groups. The datasets derived from HMM profiling were clustered in more tangled sections when going through the C-terminal of the protein. The clustering results clearly showed that NTD and RBD segments are divided into distinct groups. The distinct groups are affiliated to beta-coronaviruses, reflecting the specificity of these domains even in one genus.  Table 2.
Sequence alignments and correspondence analysis. A comparative analysis of the sequence libraries was conducted to find corresponding residues in each alignment set. Therefore, the minimal requirement was multiple sequence alignment, which was done for each library separately. The datasets were purged for duplicated sequences before the alignment process. The alignments were represented by sequence bundles as a visualization technique to view the one-to-one relationship between the sequences. This visualization technique in combination with correspondence analysis allows for saliently exploring physical properties and location of specific amino acids in respective positions. To identify distant covariant sites in multiple sequence alignments (MSAs), a correspondence analysis was performed. This analysis provides a lower-dimensional representation of the alignment data in a scatterplot. The most striking observation that emerged from correspondence analysis was the dependencies of major domains (RBD, NTD, and furin cleavage motif) to a few residues ( Table 3). The majority of the corresponding residues are structurally part of coils. Some residues occurred only once in our dataset, suggesting the existence of unique and specific mutations in the Sgp SARS-CoV-2 (total number of aligned sequences are mentioned in Table 3; the details of each sequence library on which alignments were built, is provided as Supplementary Data 2).

RBD domain significantly corresponds to Thr500 and Asn501. A couple of corresponding sites
were identified in the RBD domain viz. Thr500 and Asn501 and occurred in 45.9% and 7% of the MSA, respectively (Fig. 2). These two residues are directly involved in the interaction of RBD and ACE2 (Fig. 3). The interface residues in RBD and ACE2 complex are defined and labeled in Fig. 3. Moreover, when coupled with centrality evaluations, a significant Z-score endorsed on Asn501 (Z-Score: 3.008), reflecting a likely important role for this residue (Supplementary Table S2). The replacement of previously defined residues in these positions by Thr and Asn is a relatively radical substitution based on the Grantham distance matrix (Supplementary Table S3).
Herein, as well as two aforesaid positions, two other positions, namely 486 (Gln) and 493 (Phe), were evaluated, because they are all involved in receptor-ligand interaction. These are relatively variable sites (Fig. 4) and were introduced as major determinants for host range determination and tissue tropism in the earlier studies 38 . Sequence bundle visualization of MSAs allowed us to extrapolate the harmonious location of the residues in these sites.
The position of Asn501 in Sgp SARS-CoV-2 is mostly occupied by Thr in other sequences (for example 6ACG; SARS-CoV 39 ). In our dataset, the sequences containing Asn at the same position include A0A023PTS3, A0A023PUW9, A0A2D1PXC0, U5WHZ7, and U5WLJ7. The striking observation is that all these sequences belong to the viruses that are hosted by Rhinolophus affinis (intermediate horseshoe bat).
We speculate that these three positions work in balanced harmony. To account for their relation and coordination, these positions were named from N-terminal to C-terminal as position 1 (F486 Sgp SARS-CoV-2 ), position 8 (493 Sgp SARS-CoV-2 ), and position 16 (501 Sgp SARS-CoV-2 ) (Fig. 4). As is evident in Fig. 4, position 1 is mostly occupied by polar amino acids, position 2 is always occupied by polar amino acids, and position 3 is always occupied by hydrophobic amino acids. These results show the existence of a striking harmony in the respective positions.

Conservancy rate of receptor-binding motif (RBM) and the remaining section of RBD.
The existence of corresponding residues in critical positions mentioned in the preceding paragraphs has prompted us to answer a critical question: why these substitutions were singled out through the correspondence analysis. We attempted to examine the variation in the evolutionary rate of different sites of RBD. This section sketches the physicochemical properties of the RBD, derived from sequence data to estimate the evolutionary rate. The data presented here is derived from the sequence data; structural data were also included (for comparison) (Supplementary Data 3).
To explore the differences between ten variables of surface accessibility, flexibility, buried area, as well as CX, DPX, CN, Bfactor, accessible surface area, similarity scores, and identity scores, the non-parametric Mann-Whitney U test was performed for each variable. The purpose of this assessment was to investigate whether the variables are significantly different between subpopulations of receptor-binding motif (RBM) and the remaining part of RBD. The results showed that the distribution of flexibility, ASA, CX, Average Bfactor, identity scores, and similarity scores were not the same in the two populations (details are provided in Supplementary Data 3). Additionally, the test was performed for comparing the values of three focused residues of RBM [(F486 Sgp SARS-CoV-2 ), (493 Sgp SARS-CoV-2 ), and (501 Sgp SARS-CoV-2 )] with the remaining residues. Additionally, the Mann-Whitney U test was used to investigate the differences between those three focused residues of RBM [(F486 Sgp SARS-CoV-2 ), (493 Sgp SARS-CoV-2 ), and (501 Sgp SARS-CoV-2 )] and the remaining parts of RBM or the remaining parts of RBD. The results    Collectively, data in this section revealed the existence of evolutionary rate variations among RBM in comparison with the whole RBD.
The furin binding motif is modified in favor of furin activity. The sequence alignment section of the furin binding pocket of Spgs suggests a high level of the conservancy of this motif among the dataset. Surprisingly, the exception was the sequence of Sgp SARS-CoV-2 (red line in the alignment bundle in Fig. 6). In the evaluation of the furin cleavage site, the pattern introduced by the seminal work of Tian et al. and similar nomenclature was followed, because we found it plausible for explaining the properties of the furin binding motif of Sgp SARS-CoV-2 . The authors explained the furin binding motif as a core region surrounded by two flanking boxes (see Refs. 44,45 ). The core region is occupied by positively charged residues, and the flanking regions are more flexible and surface-accessible residues. The furin binding site significantly corresponds to Gln675, Gln677, Thr678, and Ala684 ( Fig. 6), while these positions in the alignment are occupied by other amino acids. In the other words, these residues have occurred only once in the MSA. Therefore, it can be concluded that these non-conserved residues are likely species-specific. Notably, based on the Grantham replacement matrix, these substitutions are relatively conservative (Supplementary Table S3). These residues are also involved in some experimentally validated epitopes (Supplementary Tables S5, S6).
The modification of the aforesaid residues resulted in vast modifications in the biochemical properties of the furin binding site (Fig. 7) of Sgp SARS-CoV-2 . Figure 7 shows how corresponding residues, which are in positions 2, 8, 9, and 11 of the cleavage motif, alter the physicochemical properties of the furin binding motif. The core domain of the furin cleavage site is more positively charged and occupied by residues with high isoelectric points. The P1′, which is the exact cleavage site, is resided by alanine, a small hydrophobic amino acid.
As illustrated in Fig. 7, polarity, flexibility, hydrophilicity, and surface accessibility of the flanking regions were increased upon mutations. It is evident in Figs. 6 and 7 that the corresponding residues are, in part, responsible for these modifications.
In comparison with other coronaviruses, the furin binding site is more charged, more flexible, more hydrophilic, and more accessible.
Besides, the interaction of furin binding motif with each other and with other residues generates a relatively complicated network (Fig. 8, right panel); central residues and their corresponding Z-scores are presented in (Supplementary Table S4). The Z-scores were calculated based on the free molecule. As shown in Supplementary  Table S4, none of the corresponding residues from alignment analysis achieved a significant Z-score.
The substrate (furin binding pocket of Sgp SARS-CoV-2 ) and the proprotein convertase (furin) were docked. The resulted complex was used for centrality analysis. This complementary approach examined whether or not the corresponding residues attain significant centrality Z-score in the complex form. The results (Supplementary Table S4) showed that the Z-scores of centrality are significantly different in the two states (these are furin binding motif, free and in complex with furin).
Tracking the substitutions in Sgp SARS-CoV-2 . True SARS-CoV-2 sequences were collected by filtering the BLAST result of the RBD nucleotide sequence against all the available SARS-CoV-2 sequences to match records with expected values between 0 and 6e−26 and a query coverage between 80 and 100. The best model for describing this dataset was defined as Tamura three parameters. The probability of rejecting the dN = dS in favor of dN > dS or dN < dS was not significant, therefore no sign of positive or purifying selection was observed in sequence variants of Sgp SARS-CoV-2 . www.nature.com/scientificreports/ Tracking the substitution frequency of RBD, RBM, and furin cleavage site, after one year of in-host evolution, suggests no significant differences between these domains and other parts of the sequence (all defined substitutions are provided as supplementary data 5; the data was obtained from GISAID database).

Discussion
Amino acid changes can range from biochemically similar (conservative substitutions), to dramatically dissimilar amino acids (radical substitutions) 10,46-48 . Among many substitutions that happen in a protein, few mutations are critical and are actual targets of evolution. A simple and fast method to find these sorts of substitutions might be helpful to understand the nature of emerging diseases, developing novel vaccines, and explaining the behavior of progressive virulence factors. In emerging viruses, the amino acid changes would drastically affect the sensitivity of protein to certain neutralizing antibodies or cause a vaccine or therapeutic failure 49 . Understanding the rate and positions of the mutations and even predicting the stability of certain parts of the protein is important for vaccine design, planning any therapeutic approach, and studying the nature of the emerging sequence. Furthermore, it is critical to define how these substitutions shape the novel traits of the emerging pathogen. Among pathogens, the genome adaptability of RNA viruses makes them more susceptible to jump to new hosts 50 . The spillover and transmissibility in these cases usually depend on the existence of virus receptor(s) on the host cells 51 ; the more the receptor is conserved among different target species, the higher the virus is anticipated to spread. The recent examples within the last two decades are members of coronaviruses causing the SARS and the Middle East Respiratory Syndrome (MERS), and more recently SARS-CoV-2 14,52 .
The interactions of residues in a protein and protein-protein interface rule the maintenance of new substitutions, highlighting the existence of great harmony in the whole protein. In a MSA, the single sites include relatively less information than the entire MSA. Therefore, large and diverse datasets are required for detecting    53 . This paper appraised the alignment method along with principal component analyses (correspondence analysis) to describe the dependencies of small segments of Sgp SARS-CoV-2 to specific residues. The work also attempted to predict whether these substitutions would be stable or tend to be modified. The multi-domain Spg is likely the most important determinant of coronaviruses, because it is responsible for the multi-step process of host recognition and tissue tropism 54 . This prevailing role has propelled the Spg to the forefront of the coronavirus infection investigations, vaccine design, and arrangement of therapeutic plans. While small modifications on RBD may significantly alter the host selection theme of the virus 36,55 , it has dramatically shaped queries on these little differences. Moreover, maintenance or changeable prophecy of the protein segments would allow us to select stable and effective epitopes for vaccine design or drug targets 56 .
Two adjacent Thr500 and Asn501 amino acids are the corresponding sites of RBD. Asn501 contains a significant Z-score of centrality, when the molecule is in complex with its ligand (whereas a free molecule does not obtain a significant Z-score), emphasizing the important role of this residue and context dependencies. These two amino acids are in the C-terminal proximity of RBM and are involved in the receptor-binding interface. Ascribing a central role for Asn501 corroborates the Li et al. 36 argument in which the authors attributed the important role of human-human transition or host range determination to this residue. The authors especially highlighted the role of the side chain 36 . Additionally, significant differences in the Z-scores of centrality between free RBD and RBD-ACE2 complex confirmed the importance of this residue and the context wherein this site is involved. The more obvious evidence involves the alpha variant of SARS-CoV-2, in which the substitution of Asn501 with Tyr501 made the virus dominant 14 .
RBM is the receptor-binding region, isolated from the edge of RBD, with specific sites entangled in ACE2. Presenting the alignments as sequence bundles, which is a sequence-oriented technique, unveiled some hidden properties of the sequences. Our primary assumption for interpreting the conservancy features of RBD and especially RBM was that the less conserved residues would be more effectively involved in the host range determination of the virus, because conserved residues are present in all strains and may not centrally affect the jumping or the specific nature of the viral infection. Together with the role of mutation at a specific site, the effect of the amino acid coalition should be considered when discussing the protein properties. Fourteen positions in RBM are the key residues for binding of SARS-CoV to human ACE2 14 . In Sgp SARS-CoV-2 , six out of fourteen residues are semiconservative compared to SARS-CoV: N439 SARS-CoV-2 (R426 SARS-CoV ), L455 SARS-CoV-2 (Y442 SARS-CoV ), F486 SARS-CoV-2 (L472 SARS-CoV ), Q493 SARS-CoV-2 (N479 SARS-CoV ), Q498 SARS-CoV-2 (Y484 SARS-CoV ), and N501 SARS-CoV-2 (T487 SARS-CoV ) 57 . The sequence data herein suggests that the RBD (and also NTD) is species-specific. Moreover, it appears that the presence of semi-conservative residues, which are the differed parts among previous beta coronaviruses, could have important roles and most likely has an influence on host range determination, tissue tropism, and the current rapid SARS-CoV-2 transmission in humans. Our focus on three hotspots of RBM, namely F486 SARS-CoV-2 , Q493 SARS-CoV-2 , and N501 SARS-CoV-2 , surmises the overall conserved physicochemical properties of RBM, caused by the collaboration of specific residues, leading to a successful viral attachment and cell entry.
The viral entry into susceptible host cells is a complex process 51 and demands maintaining harmony between certain residues of the Spg; it is an indication of a complex tangled bank of amino acid interactions. The full functionality of a protein requires maintaining a balance between the physicochemical properties of major amino acids.
Viruses extracted from the sporadic SARS cases, during 2003-2004, all had asparagine at position 479 and serine at position 487; each virus was an independent cross-species event without the human-to-human transmission 36,58 . Based on these observations, Li et al. 36 concluded that the side chain of the residue at 487 is a key factor for shaping severity (and likely human-to-human transmission) 59 . These positions in Sgp SARS-CoV-2 are replaced by Q493 and N501, respectively. The coexistence of amino acids with specific physicochemical properties could be a marker of the harmonious interaction of residues in this specific region. Therefore, the binding properties of RBM could be more complicated than has been thought earlier.
Similarity and identity scoring strategies reveal the existence of many substitutions in RBD (mostly conservative). The corresponding residues of RBD were found as parts of RBM in the alignment set. Evolutionary variation among different sites depends on various physicochemical properties of the amino acids including surface accessibility 60,61 , packing density, and flexibility 29,[62][63][64] . Surprisingly, regarding the increased levels of surface accessibility and flexibility in association with a decreased level of contact density of RBM in comparison with the remaining parts of RBD, it can be concluded that RBM has a greater evolutionary rate. Therefore, it is evident that the harmonious interaction of residues goes far beyond a small motif. While the evolutionary rate of RBM is higher than the remaining part of RBD, it can be finally concluded that the residues in RBM are targeted by evolution, and other parts tend to preserve these substitutions.
Most VOCs are carrying substitutions in the 501 positions. For example, an emerging UK variant: B.1.1.7 harbors an N501Y mutation which increases the interaction of spike with ACE2 receptor 27,65,66 . The modifications along with increase the affinity of Sgp SARS-CoV-2 to the ACE2 receptor, cause failure of S gene targeting by molecular diagnostics; an example includes Thermo Fisher TaqPath COVID-19 assay 15 .
It is worth mentioning that this position is focused in our study and was defined by correspondence analysis including previous coronavirus sequences, which strongly highlights the usefulness and efficacy of our method.
Not only the amino acids of a protein but also hosts and viruses are in a tangled bank of interactions 67 . The successful completion of viral life spam highly depends on the host elements. In the case of coronaviruses and SARS-CoV-2, the cleavage of the spike by host proteases is important in the infectivity and host range modulation 68 . For instance, a study on MERS-CoV strengthened the concept that along with the virus receptor, the repertoire of proteases expressed by a given cell type, could significantly affect the infectivity 69 . Activation of the spike is a crucial step of the infection and depends on the host's furin activity 70 15 . Furthermore, the presence of glycan near the S1/S2 boundary may completely abolish the proteolytic priming of the virus 71 . Cleavage at different sites can occur in a different lifestyle of the virus during biosynthesis or virus entry; whenever it happens, it can critically affect the cell and tissue tropism as well as host range determination 14,39 . Sgp SARS-CoV-2 harbors a furin cleavage site at the S1/S2 boundary, which is treated during biosynthesis 14 .
Furin cleavage site is known as a consensus pattern of R-X-[K/R]-R⇓ (where X is any amino acid). However, all furin cleavage sites do not follow this pattern 38 . Exploring the first release of Sgp SARS-CoV-2 sequence data, at the first stages of the COVID-19 pandemic, evidenced a four residue insertion at the S1 and S2 boundary in comparison with other SARS coronaviruses 39 . Indeed, we examined this region as a broader motif of 20 amino acids.
An evolutionary conserved 20 amino acid motif could better describe the furin cleavage site as explained by Tian et al. 38 . Their seminal work also mentioned the conservancy of the physical property of this motif among mammals, bacteria, and viruses 14 . The motif was defined as a core region (P6-P2′) that fits into the catalytic pocket of furin and two flanking flexible solvent-accessible regions (P7-P14 and P3′-P6′). The core region determines the binding strength of the enzyme and its substrate; while the flanking regions provide the core region accessibility to furin. The alteration of residues in this motif would drastically affect the efficiency of furin cleavage 39 . It also may affect viral expansion, cell and tissue tropism, transmissibility, and pathogenicity 72 .
In the furin cleavage pocket, the balance maintenance between hydrophobicity and hydrophilicity is a fascinating characteristic of viral fusion proteins 47 . Our data showed that all modified properties are in favor of furin cleavage activity. It is worth mentioning that these differences are derived from exchanging the conserved residues (mostly radical substitutions) in the Sgp SARS-CoV-2 sequence along with the insertion of a short peptide. Similar to RBD, it could be hypothesized that these radical substitutions are targets of evolution, and other sets of substitutions are present for retaining these sites.
Radical substitutions are more probable to be chosen against conservative substitutions 73 . Additionally, organisms with a small effective population size tend to accumulate more radical substitutions than those with larger effective population size and more efficient natural selection 74 . Although the currently available database did not provide adequate information to trace a positive or a negative selection, it is not possible to predict the fate of the spike protein. Nevertheless, regarding the huge effective population size of the virus, the accumulation of conservative substitutions is expectable. Therefore, the modification of the furin cleavage site is more likely to happen, and the maintenance of RBD in the current composition is presumable. Moreover, different sites of the protein may face diverse environmental contexts, thus might have a dissimilar evolutionary fate. This assumption is consistent with our findings on the sequence diversity of Sgp SARS-CoV-2 . Since the C-terminal of Sgp SARS-CoV-2 is located in a relatively constant microenvironment (viral envelope), a low diversity level can be observed in these segments.
Due to the proof-reading properties of RNA polymerases of coronaviruses 10,75 , the mutations are reduced in this family including SARS-CoV-2, relative to other RNA viruses. However, as many research groups are continuously monitoring the genomic diversity of SARS-CoV-2 10 , many mutations are indeed reported. The emergence of mutations in the Spg as the most antigenic determinant of coronaviruses, would cause antigenic drift and subsequently vaccine or drug stagnation. Previous research has demonstrated the altered capacity of some neutralizing antibodies against Sgp SARS-CoV-2 due to the recent mutations. None of the discussed residues in the present study was included in the set of evaluated mutations 76 . Furthermore, surveying the genomic database (https:// www. gisaid. org/ epiflu-appli catio ns/ phylo dynam ics/) of SARS-CoV-2 revealed that more mutations are accumulated in the furin cleavage site (and its vicinity) than RBD, which confirms our assumption of maintenance and modification probability of RBD and furin cleavage site, respectively. This paper shows how sequence-based computational approaches could be applied solely to extrapolate important features of an emerging sequence prior to availability of more complex costly structural information. The most prominent feature of this study is the data presentation, especially the visualization of the MSAs as sequence bundles. Dissemination of sequence data and coupling these observations with structural information manifests the usefulness of in silico tools to delve into important features of emerging virulence agents. Moreover, in silico tools hold great potential of screening bioactive components 77 for inhibiting critical enzymes of the virus 78 or other non-structural components through molecular docking or molecular dynamic simulations 79,80 .
A wide research ground is provided here for future studies to describe the dynamic and energetic features of sequence modifications and manifest the role of other nearby residues and their implications in the protein architecture as a whole. In this regard, the accumulation of various substitutions that occurred in Sgp SARS-CoV-2 could be a signature of long-lasting evolution. Given these enduring events, it could be hypothesized that the coronavirus has been confronted with different environmental contexts and thus, faced different evolutionary pressures. It is plausible for further studies to be focused on this assumption.

Conclusion
The slight differences of SARS-CoV-2 with its close relatives, shape its distinguished characteristics that are responsible for the easy spread of the virus and its spillover. Within many residue substitutions, a few belonging to RBM and furin cleavage motif, were shown to be correlated with the corresponding domains. Our results implicated that singled-out residues may be the real targets of evolution and other substitutions tend to maintain these resident amino acids at certain positions. Residues in the consortium are responsible for explicit features of RBD and furin cleavage motif. The location of amino acids in certain positions revealed a tangled bank of amino acid interaction web. The compensatory role of amino acids may explain this harmonic localization. While the identified residues are parts of experimentally identified epitopes, it should be pinpointed that antibodies or vaccines that target the mentioned residues would remain effective. www.nature.com/scientificreports/ While the initial molecular information on emerging pathogens mostly includes the sequence data, the methods that rely on sequences could be the most helpful approaches. This paper illustrates how sequence-oriented techniques and visualization approaches together can be drastically helpful for the interpretation of existing facts prior to the release of structural information obtained through more complicated and costly experiments. Many human pandemics have been rooted in host shifting. Introducing a fast and reliable approach to describe the emerging sequences will help us to tackle them and to discover effective medications and vaccinations.

Methods
Data sources. All sequences including the Sgp SARS-CoV-2 and its homologous sequences were obtained from the Uniprot Knowledge Base (UniprotKB) 79 at www. unipr ot. org (the accession number of Sgp SARS-CoV-2 : P0DTC2; this reference sequence is one of the first sequenced Spgs of SARS-CoV-2).
The Immune Epitope Database (IEDB) 81 was surveyed to extract the experimentally defined and validated linear and conformational epitopes of Sgp SARS-CoV-2 .
Hidden Markov model profiling. Similar sequences were collected by hidden Markov model profiling by HMMER software tool as provided by www. ebi. ac. uk. HMMER profiling simultaneously defines the domains on the protein sequence and collects homologous sequences from several optional databases. The database used for building the profile was the UniprotKB 82 . The data on the domains of Sgp SARS-CoV-2 were retrospectively collected from the available literature 82 and automatic annotation of UniProtKB (www. unipr ot. org). Following this procedure, major domains were defined and were separately searched against UniprotKB (www. unipr ot. org) for collecting sequences similar to each domain.
The disparity index test 83 was performed on all datasets to estimate the probability of rejecting the null hypothesis of substitution pattern heterogeneity. The judgment was stemmed from the extent of composition biases between the sequences. A Monte Carlo test with 500 replicates was employed for estimating the p-values 84 . The p-values lower than 0.05 were considered significant. The disparity index test was performed by the MEGAX software tool 83,85 for each dataset separately.
Clustering the sequence. Sequence clusters were built for all datasets by an all-against-all BLAST approach at MPI Bioinformatics toolkit by CLANS (CLuster ANalysis of Sequences) (https:// toolk it. tuebi ngen. mpg. de/ tools/ clans) 84 , at a p-value of 10 -3 and at least 1000 repulsions to avoid collapsing the nodes. The pairwise similarities were visualized in a graph by the CLANS stand-alone java application. The resulting CLANS files were further clustered by the network-based clustering function of the CLANS application. The network-based similarity clustering put similar sequences in separate groups, thereby making it easier to differentiate similar and dissimilar sequences in a complicated network of similarities.
Additionally, the overall mean distances in subpopulations and entire populations were estimated by the MEGAX software tool (ver. 10.1.7) 86,87 . The method allowed us to estimate the diversity of various groups. These groups were assigned in the datasets based on their viral genome origins.
Alignments, analysis, and visualization. The MSAs were generated for all datasets by the Tcoffee algorithm 34 , as provided by the MPI bioinformatics toolkit at https:// toolk it. tuebi ngen. mpg. de 88 . The alignments were then visualized and dissected by the Alvis alignment visualizer tool 89 . This alignment visualization as a sequence bundle by Alvis, has several useful features, including the precise definition of each position in the alignment, probing the harmonious location of certain amino acids in certain positions of any sequence in the MSA, and correspondence analysis, which are explained in the next section. The arrangement of lettercoded amino acids by physiochemical properties in the Y-axis of MSA vision makes the MSA presentations more informative. This physicochemical arrangement facilitates the sequence comparison and observation of the residual substitution effect(s).
Correspondence analysis. The explorative interpretation of MSAs was done in a series of numerical experiments. The alignment kernels 90 were computed for each MSA. The selected substitution matrix was BLA-SUM62. As numerical embedding, the Fisher scores of the emission probabilities 35 were calculated by Alvis (ver. 0.1) after training a hidden Markov model 35 on the MSAs. Then, the correspondence test 91 was performed by Alvis (ver. 0.1). The correspondence test is an unsupervised (versus supervised) ordination method to detect dependencies between the sequences, sequence groups, and sites responsible for grouping in the alignment (for details see Ref. 92 ).
Structures. In addition to the characterization of homologous domains by HMMER, the sequence of Sgp SARS-CoV-2 was analyzed for locating the secondary structure elements and disordered regions. The sequence was analyzed by the RaptorX server (http:// rapto rx. uchic ago. edu) 91 and PSIpred (http:// bioinf. cs. ucl. ac. uk/ psipr ed) 93 to reach a consensus position of the structural elements. SARS-CoV-2 related structures were obtained from the Protein Data Bank at www. rcsb. org, including 6VW1: 2019-nCoV chimeric receptor-binding domain complexed with its receptor, human ACE2, and 6VXX: Spg at its closed state 94 .
A homology modeling approach was also included to achieve a complete structure to avoid missing residues. The homotrimeric structure of Sgp SARS-CoV-2 was built by Galaxyhomomer 95 at http:// galaxy. seokl ab. org/. The built structure was automatically refined based on the Cryo-electron microscopy structure of a coronavirus Spg trimer 96 99 . The hydrogen bonds, contacts, and distances (distance threshold < 5 Angstrom) were considered in the RINAnalyzer setting for extracting the network. Before network mining, the residues of interest were selected in Chimera (ver. 1.13). The network of interaction between the selected residues and neighboring residues was then extracted and visualized by Cytoscape (ver. 3.7.2).

Centrality analysis.
The key residues in the interaction networks were determined by the centrality analysis approach in the RINSpector software (ver. 1.1.0) 98 . The centrality measurement is based on the modification of the average shortest path length under the removal of individual residues 100 . This shortest path within two nodes (residues in the structure) is the path in the network that is required for connecting the first node to the second one with the minimum number of edges. This minimum number of edges is known as the shortest path length and the average shortest path length of all possible pairs of nodes identifies the average shortest path. The specific central residues were determined by calculating the Z-score; based on the alteration of an average shortest path length compared to the primary one. By increasing the average shortest path length upon removing a node, a Z-score would be increased. The Z-scores of greater than 2 were considered relevant 101 . The centrality analysis was done on the structures of RBD both as a free molecule and in complex with ACE2 (PDB entry: 6W41 101 ). Similarly, centrality analysis was performed on the structure of the furin cleavage motif both in the free state and in complex with furin. The structure of the furin cleavage motif nestled in the furin active cleft was obtained by docking the predicted structure of the furin cleavage motif to unbounded furin (PDB ID: 5JXG 102,103 ). The docking approach was based on the ZDOCK algorithm 104 as provided by http:// zdock. umass med. edu.
The evolutionary rate of RBM versus RBD. The physicochemical properties of the RBD sequence were based on amino acid scales of flexibility, surface accessibility, and buried area as calculated by the Protscale at www. expasy. ch 105 . The contact map of RBD was predicted using its sequence by RaptorX contact predict 100,106 as provided by http:// rapto rx. uchic ago. edu/. The contact map is indicative of interaction density; this interaction density here is inferred from the sequence data. The identity and similarity scores to the RBD of Sgp SARS-CoV-2 , from MSA, were mapped onto the structure of RBD (PDB entry: 6W41 106 ; the structure was selected based on validation criteria). The mapping approach was based on the ProtSkin algorithm 106 at http:// www. mcgnmr. mcgill. ca/ ProtS kin/. The conservation property of each site in RBD alignment was calculated even as the percentage of identity to the query sequence, or the average similarity score to the query sequence. The scores were calculated using the BLOSUM62 Block Substitution Matrix by the ProtSkin algorithm 100,107 . The obtained scores from the MSA file then were mapped onto the structure by a color gradient.
Protrusion (or convexity) index (CX), the depth of each atom in a protein structure (DPX), and the contact number of each residue were calculated by protein core/surface visualization workbench (PCW) 108 at http:// pongor. itk. ppke. hu/. These data along with B-factor were extracted from the PDB structure of RBD: 6W41 109 . Tracking the mutations in the Sgp SARS-CoV-2 amino acid sequences. To evaluate the positive or negative selections in RBD SARS-CoV-2 sequences, a collection of all available RBD SARS-CoV-2 sequences was built by a BLAST search against the available SARS-CoV-2 sequences. The best model with the lowest Bayesian Information Criterion scores was identified to describe the substitution pattern of this dataset. All positions containing gaps and missing data were completely deleted. The null hypothesis of d N = d S in favor of d N > d S or d N < d S was tested for tracing the positive or negative selections, respectively. These analyzes were performed in the MEGAX software (ver. 10.1.7).
Additionally, to monitor the mutation in each certain position (focal positions of this study), the mutation record of Sgp SARS-CoV-2 was obtained from the GISAID database 97,110 at https:// www. gisaid. org/. The replacement frequency of each position was examined to find any significant differences.
Distance difference for each pair of amino acids was also evaluated based on Grantham's distances 97 .
Statistical analysis. The nonparametric Mann-Whitney U test was performed to investigate the significant differences between certain positions and others. The selected positions were those that have been focused on in previous sections.
Graphical visualization. Images were prepared by the CLANS Java application and graphical reporting tools of Chimera (1.13) 96,111 and Cytoscape (3.7.2) 106 . The conservancy of amino acids of RBD was visualized by the PyMol graphic system (ver. 2.3.4) 106 using the coloring macro generated by the ProtSkin 38,100 based on similarity or identity scores.

Data availability
All data associated with this study are present in the paper or the Supplementary Information.