Analysis of SARS-CoV-2 mutations in the United States suggests presence of four substrains and novel variants

SARS-CoV-2 has been mutating since it was first sequenced in early January 2020. Here, we analyze 45,494 complete SARS-CoV-2 geneome sequences in the world to understand their mutations. Among them, 12,754 sequences are from the United States. Our analysis suggests the presence of four substrains and eleven top mutations in the United States. These eleven top mutations belong to 3 disconnected groups. The first and second groups consisting of 5 and 8 concurrent mutations are prevailing, while the other group with three concurrent mutations gradually fades out. Moreover, we reveal that female immune systems are more active than those of males in responding to SARS-CoV-2 infections. One of the top mutations, 27964C > T-(S24L) on ORF8, has an unusually strong gender dependence. Based on the analysis of all mutations on the spike protein, we uncover that two of four SARS-CoV-2 substrains in the United States become potentially more infectious. Rui Wang et al. report a comprehensive analysis of nearly 13,000 SARS-CoV-2 genome sequences isolated from patients in the United States, comprising more than 7000 single mutations. They show that SARS-CoV-2 genomes cluster into four distinct groups and that two of these groups are potentially more infectious, underlining the urgent need for viral control strategies in the US.

T he ongoing global pandemic of coronavirus disease 2019  caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has led to near a million deaths. The United States (US) has over 6,304,181 confirmed cases and 189,790 deceased cases as of 11 September 2020 1 . The rapid spread of COVID-19 gives rise to a question of whether SARS-CoV-2 has become more transmissible or infectious in the US. We analyze 12,754 complete US SARS-CoV-2 geneome sequences deposited in GISAID 2 to investigate the characteristics of US SARS-CoV-2 strains and understand their ramifications to the US public health.
SARS-CoV-2 has a genetic proofreading mechanism achieved by non-structure protein (NSP) 14 in synergy with NSP10 and NSP12 3,4 . Therefore, SARS-CoV-2 has a higher fidelity in its transcription and replication process than that of other singlestranded RNA viruses. Nonetheless, 7123 single mutations have been detected in the 12,754 US SARS-CoV-2 strains in the past few months with respect to the first SARS-CoV-2 genome collected on 24 December 2019 5,6 . Genome sequencing, SNP calling, and phenotyping provide an efficient means to study the epidemiology of COVID-19 7 and infer the relationship between SARS-CoV-2 protein structures and COVID-19 pathogenicity [8][9][10] . Analyzing genome sequencing and single-nucleotide polymorphism (SNP) calling has been a hotspot for a wide variety of epidemiological, clinical, experimental, biophysical, mathematical, and computational studies 7,[9][10][11][12][13] . Viral virulence and infectivity are some of the most important SARS-CoV-2 characteristics and are determined by its molecular structure and function [14][15][16] . The intrinsic viral infectivity can be measured by virus quantification that counts the number of viruses in a specific volume over a unit time by using either traditional or modern methods 17,18 , including enzyme-linked immunosorbent assay (ELISA) that is based on protein-protein interactions (PPIs), such as antibody-antigen binding events, being counted by chromogenic or fluorescence reporters. Epidemiological and biochemical studies show that the infectivity of different SARS-CoV strains in host cells is proportional to the binding free energy between the spike (S) protein receptor-binding domain (RBD) and angiotensinconverting enzyme 2 (ACE2) expressed by the host cell [18][19][20][21][22] . Mutation-induced protein-protein binding free energy changes (ΔΔG) of S protein and ACE2 complex provides a means to estimate SARS-CoV-2 infectivity. Alternatively, viral virulence can be quantitatively measured by illness, mortality, or pathological lesions 23,24 . However, these measurements cannot be objective due to interference from disease prevention and treatment.
Results and discussion SNP calling analysis Cluster analysis. Complete genome sequence data provide us with a wide variety of opportunities to decode the mutation-induced transmission and infection behavior of  In this work, we downloaded 45,494 complete SARS-CoV-2 genome sequences from GISAID (https://www.gisaid.org/) up to 11 September 2020. Figure S1 illustrates the distribution of mutations on the SARS-CoV-2 coding genome. A website, called Mutation Tracker, has been created to report the unique single mutations as well as associated frequencies and download related information. Among 45,494 sequences, 12,754 sequences are decoded from the genome isolates submitted by the United States, and 7123 unique single mutations are detected. We track the geographical distributions of the 12,754 US isolates with the k-means clustering method (see Fig. S2), showing that the United States are clustered into four distinct clusters, as shown in Fig. 1. The blue, red, yellow, and green represent Cluster A, B, C, and D, respectively. The base color of each state is determined by its dominated cluster. Most of the states are dominated by Cluster A and Cluster D. Table S1  Top mutations in the United States. Here, we focus on the highfrequency or top mutations that represent the most common characteristics of SARS-CoV-2 in the United States. A total of 14 mutations in the United States has a frequency >1000. Among them, 3 mutations are synonymous ones (i.e., 3037C>T-(F106F), 8782C>T-(S76S), and 18060C>T-(L7L)) and 11 mutations are the Fig. 1 Pie chart plot of four clusters in the United States as of 11 September 2020. The blue, red, yellow, and green colors represent clusters A, B, C, and D, respectively. The base color of each state is decided by its dominant cluster. No color is assigned to a state if we cannot find its complete genome sequences at the GISAID database.
28881G>A-(R203K), 28882G>A-(R203K), and 28883G>C-(G204R)). Since synonymous mutations do not change SARS-CoV-2 proteins, we only focus on the other 11 missense mutations. Table 1 lists the frequencies of the top 11 missense mutations in the United States. The dates that these mutations were first detected in the world and in the United States are also included in the Table 1. The first missense mutation with the highest frequency, 23403A>G-(D614G), occurred in China on 24 January 2020. The missense mutation with the second-highest frequency, 14408C>T-(P323L), occurred in Spain on 25 January 2020. Both mutations were first detected in the US on 28 February 2020. The first top mutation recorded in the US was 28144T>C-(L84S), on 19 January. This mutation was originally detected in China on 5 January 2020.
Three of the top 11 missense mutations, i.e., 17858A>G-(Y541C), 17747C>T-(P504L), and 27964C>T-(S24L), appeared in the United States first. In fact, over 87% of these mutations are kept in the United States. Although 28881G>A-(R203K), 28882G>A-(R203K), and 28883G>C-(G204R) have their frequencies being higher than 1000 in the United States, no more than 12% of these three mutations are prevalent in the US. The rest of 88% of 28881G>A-(R203K), 28882G>A(R203K), and 28883G>C-(G204R) are dominated in the European countries. The top mutations in the United States are also the top 7 mutations in the world. However, only one of the next 4 top mutations in the US is ranked within the top 11 globally.
Co-mutation analysis. The statistical values of pairwise comutations from the top 11 high-frequency mutations in Table 2. The upper triangular reveals the total number of comutations for each pair of mutations, the diagonal presents the frequency of every single mutation, and the lower triangular shows the ratios of pairwise co-mutations over single mutations.
Evolutionary analysis. Figure 2 plots time evolution trajectories of top 11 missense mutations. The red curves are the total numbers of genome samples over time, which become very insufficient after mid May 2020. First, as shown in Table 2, mutations 14408C>T-(P323L) and 23403A>G-(D614G) appear concurrently and thus have an identical trajectory as shown in Fig. 2. This pair of mutations exists in essentially all of the US infections. Additionally, mutation 1059C>T-(T85I) always occurs together with mutation 25563G>T-(Q57H). Therefore, its time evolution trajectory is extremely similar to that of 25563G>T-(Q57H). Both mutations were first detected in Singapore on 16 February 2020. This pair of mutations occur in~70% of US COVID-19 cases. The third pair of mutations, 17747C>T-(P504L) and 17858A>G-(Y541C), first detected and occurred mostly in the US, have an identical evolution trajectory. Suggested by genome samples, this pair of US-based mutations on the helicase protein appears to have essentially died out in the US. Unfortunately, because there are very insufficient sequencing in the US now as shown by the red curve in Fig. 2, one cannot rule out the existence of these mutations. Mutation 28144T>C-(L84S), the first known mutation globally, has had a very unsteady trajectory. However, its trajectory became identical to those of its co-mutations 17747C>T-(P504L) and 17858A>G-(Y541C) after 20 February 2020. Finally, mutation 27964C>T-(S24L) has an unusual behavior. The ranking of these 11 mutations in the US and the world are included in the table. NCU.S. and NCWorld represent for the total number of sequences with a specific mutation in the United States and in the world, respectively. The last column records the date that these eight missense mutations were detected for the first time in the world and in the United States. The second-last column records their corresponding countries, i.e., the country lists at the top shows where the mutations first detected, and the country lists at the bottom will always be the United States. Here, AU, US, CN, ES, and SG represent the Australia, the United States, China, Spain, and Singapore, respectively. We use ISO 8601 format YYYY-MM-DD as the date format.
Its peak ratios occurred in early June when there were insufficient sequence samples in the US.
Gender analysis. The last chart in Fig. 2  Therefore, we can say that mutation 27964C>T-(S24L) has a female-dominance pattern. Actually, C>T mutations also have a female preference 6 , which may be due to the strong host cell mRNA editing known as APOBEC cytidine deaminase 32 .
Mutations on 5′UTR. The 5′untranslated region (5′UTR) is a leader RNA fragment, which plays critical roles in the regulation of translation and the gene expression of virus 33,34 . Among 12,754 complete genome sequences in the US, 9673 sequences have mutations on 5′UTR, and there are eight mutations on 5′ UTR with frequency >10 in the US. Mutation 241C>T is the most common mutation that has 9628 and 36,786 frequencies respectively in the US and the World, indicating the 241C>T mutation is important for the genomic replication process 33,34 . Mutation 187A>G has the second-highest frequencies in the US and the World, which are 36 and 207, respectively. The frequencies other six mutations (i.e., 199G>T, 222C>T, 208G>T, 218C>T, 242G>T, and 169A>G) are <50. The detailed information can be found in the Supplementary information Table S2.
Protein-specific analysis. In this section, we discuss the properties of the top 11 missense mutations associated with seven proteins (i.e., NSP2, NSP12, NSP13, Spike, ORF3a, ORF8, and Nucleocapsid). Figure S3 illustrates the proteoforms of SARS-CoV-2 proteins. Moreover, to understand the impact on the protein's structures induced by mutations, we employ artificial intelligence 27 , FRI 26 , and subgraph centrality models 11 as summarized in Table 3.  38 . The red rectangle marks the mutant residue with its neighborhoods. SARS-CoV-2 NSP12 is highly conservative among the other four species. Although P323L mutates the residue of proline (P) to leucine (L), these two residues are both non-polar and aliphatic, indicating P323L may not affect the functionality of NSP12. Figure 3a shows the three-dimensional (3D) structure of SARS-CoV-2 NSP12 created by PyMol 39 . NSP12 is of paramount importance to the SARS-CoV-2 replication and transcription machinery 13,40 . The increasing ratio of P323L in Fig. 2 indicates that this type of mutations may favor SARS-CoV-2 and enhance the transmission capacity of SARS-CoV-2. However, a more likely reason is that P323L is a comutation of D614G, suggesting that mutation P323L may be enhanced by mutation D614G. The negative folding stability Table 2 The statistical values of pairwise co-mutations from the top 11 high-frequency mutations. The values in the diagonal reveal the total number of a specific single mutation in the United States, the values in the upper triangular represent the total number of the co-mutations, and the values in the lower triangular present the ratios of pairwise co-mutations over single mutations. Table 3 suggest that P323L destabilizes the NSP12. Figure 3b, c show the differences of FRI rigidity index and subgraph centrality between the network with wild type and the network with mutant type. The atoms on the mutant residue are mark with big colored balls. We deduce that the slight increase in the rigidity means the mutation makes the protein less flexible or less cooperative in synergistic interactions. From Table 2, we can see that 14408C>T-(P323L) always shows up with 1059C>T-(T85I), 23403A>G-(D614G), 25563G> T-(Q57H), and 27964C>T-(S24L) simultaneously. Therefore, we can deduce that the increasing tendency of P323L ratios per 7days is due to its co-mutation with other infectivity-strengthening mutations, such as 23403A>G-(D614G).

changes in
Mutation on the spike protein. Mutation 23403A>G-(D614G) located on the spike protein has the second-highest frequency in the United States, which has been considered as the key mutation that makes SARS-CoV-2 more infectious worldwide 12,41,42 . From Table 1, one can see that mutation D614G was initially detected in China on 24 January 2020. The first case with the D614G mutation in the United States was reported on 28 February 2020 in Florida (2 sequences) and Rhode Island (1 sequence). A higher prevalence of D614G in the east coast of the US was reported 43 . Since then, SARS-CoV-2 with the mutation D614G has become a major variant, and 82.3% of patients carry D614G in the United States as of 11 September 2020. Figure S5 depicts the sequence alignment for the S protein in different species. The S protein of bat coronavirus RaTG13 has the highest similarity of 97.47% with the S protein of SARS-CoV-2. Amino acids near position 614 are very conservative, indicating that D614G mutation will play an important role in the functions of the S protein of SARS-CoV-2. Figure 4a depicts the 3D structure of the SARS-CoV-2 spike protein that interacts with the host ACE2. The D614G mutation is one of the most prevalent mutations of SARS-CoV-2, which changes the amino acid aspartate (D) with the polar negative charged side changes to the amino acid glycine (G) with a non-polar side chain. Moreover, the D614G mutation ratio in Fig. 2 keeps climbing, and the ratio is approaching the unity after 16 June 2020, which also proves that SARS-CoV-2 becomes more contagious as time goes on. Table 3 shows a positive folding free energy change, indicating the stabilization effect of the mutation. Figure 4b, c illustrate the difference of FRI rigidity index and the subgraph centrality between the network with wild type and the network with mutant type. The FRI rigidity decreases following the mutation, endowing the S protein higher flexibility to interact with ACE2. The same is confirmed by average subgraph centrality.  Figure 5e shows the sequence alignment results in different species. The ORF3a of SARS-CoV-2 and SARS-CoV share a 71.53% sequence similarity. The amino acids nearby position 57 are all conservative in all five species. Moreover, Fig. 5b visualizes the SARS-CoV-2 ORF3a proteoform, where we use red to mark the wild-type amino acid glutamine (Q) and yellow to address the mutant amino acid histidine (H). Mutation Q57H locates at the intramolecular interface and in touch with the membrane, which indicates the special functionality changes that Q57H can induce. Figure 5b is the visualization of ORF3a, which is generated by an online server Protter 44 .
SARS-CoV-2 ORF3a protein is widely expressed in intracellular and plasma membranes, which induces apoptosis and inflammatory responses in the infected cells and transfected cells 45,46 . Figure 5c, d depict the differences of the FRI rigidity index and the subgraph centrality between the network with wild type and the network with mutant type. The ORF3a becomes more rigidity after the mutation, which may result in a less flexible mutant for ORF3a to involve in the apoptosis and inflammatory response.
As illustrated in Fig. 2, the ratio of the 25563G>T-(Q57H) mutation on ORF3a in each 7-day period kept increasing once it was introduced to the United States. This tendency indicates that mutation Q57H becomes prevalent in the viral patients of the United States, which may make the SARS-CoV-2 more infectious. From Table 2, we can see that 7106 sequences have 25563G>T-(Q57H) mutation on ORF3a. Among them, 7079 variants have the [25563G>T-(Q57H), 23403A>G-(D614G)] co-mutations. The negative folding stability changes of mutation Q57H in Table 3 reveals that ORF3a becomes unstable following the Q57H mutation, which may harm the function of ORF3a in apoptosis and increase the viral load in the host cell. However, the Q57H mutation locates near TNF receptor-associated factors (TRAFs), ion channel, and caveolin binding domain 46 , which may affect the NLRP3 inflammasome activation. ORF3a activates the innate immune signaling receptor NLRP3 inflammasome and causes tissue inflammation and cytokine production 47 .
Mutation on the NSP2 protein. As of 11 September 2020, more than half of mutation 1059C>T-(T85I) counts found worldwide are from the United States. The residue T85 on the NSP2 is polar and non-charged, and it changes to a non-polar residue I85 after the mutation. Figure S6a in the Supplementary Information shows the 3D structure of SARS-CoV-2 NSP2. From Fig. S7, we can see that coronaviral NSP2 is relatively conservative for the first 91 residues. Moreover, the T85 residue with its neighbors is all conservative in the other four SARS-like sequences, indicating this type of mutations may be substantial to coronaviral structures and properties.
NSP2 is also a viral protein that does not attract much research attention. In the SARS-CoV genome, the deletion of NSP2 may only result in a modest reduction in viral titers, which is considered to be a dispensable protein 48 . Table 3 shows that the folding stability change of T85I is −0.05 kcal/mol. Although the negative value reveals that T85I may destabilize the structure of NSP2, this small change is negligible. The FRI rigidity change is also minor, as is shown in Fig. 3c, indicating the mutation of T85I on the NSP2 does not change the flexibility of NSP2 too much. However, the growing trend in Fig. 2 still indicates that 1059C>T-(T85I) is an infectivitystrengthening mutation, which may mainly benefit from the comutation with other infectivity-strengthening mutations, such as 23403A>G-(D614G) and 25563G>T-(Q57H). Table 3 The protein folding stability changes of 11 missense mutations.

Mutation
Protein  Mutations on the NSP13 protein. NSP13 of SARS-CoV-2 is a superfamily 1 helicase, which can unwind a double-stranded RNA (dsRNA) or DNA (dsDNA) in the 5′ to 3′ direction into two single-stranded nucleic acids [49][50][51] . As illustrated in Fig. S8, NSP13 of SARS-CoV-2 shares the most homology with the other four species and is one of the most conservative proteins in SARS-CoV-2 genome. Therefore, the existence of two high-frequency mutations on the NSP13 is very unusual. Similar to 27964C>T-(S24L), although 17858A>G-(Y541C), 17747C>T-(P504L) are in the final list in Table 1, more than 87% of them were detected in the United States. Mutation Y541C changes the amino acid tyrosine (Y) to cysteine (C). Figure S9a in the Supplementary Information shows the 3D structure of SARS-CoV-2 NSP13. Notably, tyrosine has an aromatic side chain, while cysteine only has a polar non-charged side chain, indicating that the 3D structure of NSP13 will be incredibly affected.
From Fig. 2, both mutations on the NSP13 have the same trajectory of the mutation ratios over time. Once mutations 17858A>G-(Y541C) and 17747C>T-(P504L) were first found in the United States, they had a rapid increase in the first 2 weeks. However, these two mutations do not benefit SARS-CoV-2. In early March, the ratio of both mutations start to decrease and approach zero after 19 May 2020, suggesting that mutations 17858A>G-(Y541C) and 17747C>T-(P504L) may hinder the transmission of SARS-CoV-2. In Table 2, 17858A>G-(Y541C) and 17747C>T-(P504L) do not show up with 23403A>G-(D614G) in more than a thousand SNP profiles. As reported in Deng et al. 52 , 17858A>G-(Y541C) and 17747C>T-(P504L) occurred among Grand princess strains and some WA state strains (aka WA1 lineage). One possible reason for the declined ratios for 17858A>G-(Y541C) and 17747C>T-(P504L) may be due to the efficient lockdown measures enforced in WA and CA at the early stage of the outbreak in the US. Moreover, another possible reason is that mutations 17858A>G-(Y541C) and 17747C>T-(P504L) may weaken the transmission capacity of SARS-CoV-2. However, we hope that interested labs could verify our assumption through culture/animal studies. Table 3 shows that both high-frequency mutations Y541C and P504L have negative folding stability changes, which will destabilize the structure of NSP13. Mutations 17858A>G-(Y541C) and 17747C>T-(P504L) happen simultaneously after analyzing 45,494 genome sequences, indicating the folding stability changes on the NSP13 are superimposed by two simultaneously occurred mutations. This also explains the same decreasing tendency in Fig. 2 after early March. Based on the protein-specific analysis mentioned above, we deduce that mutations Y541C and P504L prevent SARS-CoV-2 from efficiently interacting with host interferon signaling molecules and impede the NSP13 from efficacious participation in the replication/transcription process. Figure 3b shows the difference of the FRI rigidity index between the network with wild type and the network with the mutant type. One mutation (17747C>T-(P504L)) does not affect the rigidity much, whereas the other mutation (17858A>G-(Y541C)) leads to a decrease in the NSP12 rigidity, which may make NSP13 not as robust as before to involve in the viral infection and replication process.
Mutations on the ORF8 protein. The ORF8 protein has two highfrequency mutations, 28144T>C-(L84S) and 27964C>T-(S24L). More than 94.2% mutation 27964C>T-(S24L) worldwide were found in the United States. The first confirmed case with 27964C>T-(S24L) was discovered on 9 March 2020, in the United States, suggesting that S24L initially happened in the US. Additionally, these two high-frequency mutations S24L and L84S mutate reversibly. The amino acid serine (S) has a non-charged polar side chain, while the leucine (L) has a non-polar aliphatic  Figure S10 illustrates the sequence alignment of SARS-CoV-2 ORF8 with the other four species. The SARS-CoV-2 ORF8 shares a really low similarity among all the other four SARS-like species. SARS-CoV, Bat coronavirus RaTG13, Bat coronavirus CoVZC45, and Bat coronavirus BM48-31 have the same residues at positions 24 and 84. Nonetheless, SARS-CoV-2 ORF8 owns different types of residues. Here, we would like to address that although the ORF8 of SARS-CoV-2 at position 84 has a different residue compared to the other four species, it mutates back to S in 1727 isolates in the United States. Figure S11a in the Supplementary Information shows the 3D structure of SARS-CoV-2 ORF8. ORF8 protein of SARS-CoV-2 shares the least homology with SARS-CoV among all viral proteins, which mediates the immune evasion by interacting with major histocompatibility complex molecules class I (MCH-I) and down-regulating the surface expression of MHC-I on various cells 53,54 . The overall upward trend of the S24L ratio over time reveals that S24L may enhance SARS-CoV-2's ability to spread. In contrast, the time evolution plot shows that the ratio of mutation 28144T>C-(L84S) goes up before the beginning of March, and then the ratio goes down and approach zero after 23 May 2020. Due to the small number of sequence data, we can say that the ratio of L84S has a decreasing tendency. The female patients with S24L mutation on ORF8 account for a large proportion, which indicates that the S24L is most likely to happen in the female population in the United States. Table 3 shows that the folding stability change of 28144T>C-(L84S) is −0.99 kcal/mol, indicating that ORF8 becomes unstable. Figure S10b in the Supplementary Information shows that the ORF8 becomes slightly less rigidity after both L84S and S24L mutations. The rigidity changes induced by S24L is less than the L84S. Based on the function of ORF8 that involved in the immune response, we deduce that L84S may be one of the factors that disfavor SARS-CoV-2 and favor the host immune surveillance to decrease the viral load in the human cells, which provides an explanation that the ratio of L84S in Fig. 2 keeps decreasing. Meanwhile, the positive folding stability change of 27964C>T-(S24L) lists in Table 3 reveals that this type of mutation may enhance the function of ORF8. Therefore, the MHC-I will be inhibited more, and the eradication of SARS-CoV-2 in vivo will be hindered. This could be one possible reason why the ratio of S24L is on the rise. Notably, after analyzing 28,726 complete genome sequences, none of them have mutations 28144T>C-(L84S) and 27964C>T-(S24L) happened simultaneously.
Mutations on the nucleocapsid protein. Nucleocapsid (N) protein is a structural protein for the SARS-CoV-2, which participates in the RNA packaging, virus particle releasing, and the ribonucleoprotein core forming process 55 . Figure S12 in the Supplementary Information illustrates the 3D structure of SARS-CoV-2 N protein. The high-frequency mutations that detected on the Nucleocapsid (N) protein are 28881G>A, 28881G>A, and 28883G>C. From Fig. 2, we can see that the ratio of these three mutations increase rapidly after early May, and in the middle of June, their ratios start to decline. Only 14.3% sequences in the US have mutations 28881G>A, 28881G>A, and 28883G>C.
R203K is caused by both 28881G>A and 28882G>A. On the protein level, mutation 28883G>C leads to G204R. Figure S13 is the sequence alignments for the N protein of SARS-CoV-2, SARS-CoV, bat coronavirus RaTG13, bat coronavirus CoVZC45, Fig. 4 The 3D structure and network analysis plot of SARS-CoV-2 S protein. a Illustration of S-protein and ACE2 interaction. The RBD is displayed in green, the ACE2 is given in red, and mutation D614G is highlighted in red. b The difference of FRI rigidity index between the network with wild type and the network with mutant type. c The difference of the subgraph centrality between the network with wild type and the network with mutant type. and bat coronavirus BM48-31. We can see that positions 203 and 204 are both conservative. Residues Arginine (R) and Lysine (K) are both positively charged. Therefore, the mutation R203K may not affect N protein function. To be noted, Glycine (G) is a nonpolar residue. Mutation 28883G>C-(G204R) may affect the hydrophilicity of N protein, which is in consistent with the predicted negative folding stability changes of R203K and G204 shown in Table 3.
Analysis of binding free energy changes. SARS-CoV-2 enters the host cell by the interaction of the S protein and ACE2. The viral S protein is primed by TMPRSS2 to entail its cleavage at two potential sites, Arg685/Ser686 and Arg815/Ser816 18 . Epidemiological and biochemical studies show that the infectivity of different SARS-CoV strains in host cells is proportional to the binding free energy between the S protein RBD and host ACE2 [18][19][20][21][22] . Therefore, the infectivity of SARS-CoV-2 can be theoretically estimated by the predicted RBD-ACE2 binding free energy. Additionally, since the natural selection favors infectivitystrengthening mutations, those mutations that are predicted to have positive binding free energy changes should be observed more frequently.
We found 434 non-degenerate single mutations on the spike protein. Among them, 46 single mutations are detected on the receptor-binding domain (RBD), and 19 single mutations occurred on the receptor-binding motif (RBM). We separate 12,754 complete SARS-CoV-2 genome sequences in the US into four clusters and calculate the mutation-induced binding free energy changes of S protein RBD and ACE2 in each cluster, which will help us understand the potential transmission tendency induced by the mutations on the S protein RBD. The binding free energy change induced by single mutation The visualization of SARS-CoV-2 ORF3a proteoform. The high-frequency mutation 25563G>T-(Q57H) on ORF3a is marked in color. The red color represents the wild type and the yellow represents the wild type. c The difference of FRI rigidity index between the network with wild type and the network with mutant type. d The difference of the subgraph centrality between the network with wild type and the network with mutant type. e Sequence alignments for the ORF3a protein of SARS-CoV-2, SARS-CoV, bat coronavirus RaTG13, bat coronavirus CoVZC45, and bat coronavirus BM48-31. Detailed numbering is given according to SARS-CoV-2. One high-frequency mutation 25563G>T-(Q57H) locates on the ORF3a protein. Here, the red rectangle marks the Q57H position with its neighborhoods. ΔΔG = ΔG W − ΔG M is defined as the subtraction of the binding free energy of the mutant type (ΔG M ) from the binding free energy of the wild type (ΔG W ). Furthermore, the positive binding free energy change of a single mutation means that the mutation can enhance the binding free energy of the S protein RBD and ACE2 and make SARS-CoV-2 more infectious. Figure 6 illustrates the time evolution trajectories of 434 single mutations on SARS-CoV-2 S protein RBD. The red line shows the mutations that have positive predicted binding free energy changes and the blue lines represent the mutations that have negative predicted binding free energy changes. The red lines gradually outpace the blue lines as time progresses, suggesting that these mutations are favored by natural selection that may enhance the infectivity of SARS-CoV-2. Additionally, green lines indicate the evolutions of mutations that locate away from the RBD. The mutation that has the highest frequency is D614G, which was reported to enhance SARS-CoV-2 infectivity 56,57 . The trajectories of the other two high-frequency S protein mutations (Q675R and E583D) indicate that they are co-mutations with infectivity-enhancing S protein mutations, such as D614G. We found that the other high-frequency S protein mutation L5F is independent of mutation D614G. Figure 7 illustrates overall predicted binding free energy changes ΔΔG (kcal/mol) induced by 46 single mutations on SARS-CoV-2 S protein RBD. The color bar on the left-hand side of the figure represents the mutation frequency. Here, 41% single mutations have positive binding changes (19 out of 46). Moreover, the frequency of mutations with positive predicted binding free energy changes is higher than those with negative predicted binding free energy changes, suggesting that SARS-CoV-2 is more likely to be infectious.
Mutation 23010T>C-(V483A) has the highest frequency (31) localized on the RBM has the positive binding free energy change, which indicates that V483A is prevalent in COVID-19 patients' in the United States has a potential capacity to enhance the infectivity of SARS-CoV-2. However, mutations that locate away from the RBM will also have a crucial impact on the infectivity 58 . Although away from the RBM, the relatively high-frequency and positive binding free energy changes of V367F, R403K, and A411S indicate that more attention should be paid to them in the future. Additionally, an interesting finding is that the mutations occurred at the same residue position such as A348S and A348T, P384L and P384S have similar binding free energy changes. Based on the SNP calling results, we separate 12,754 SNP variants from the United States into four clusters.
Cluster A infectivity. Figure 8a depicts the binding free energy changes of mutations in Cluster A. A total of 12 single mutations are found in Cluster A. Five of them have the positive binding free energy changes, while the other seven mutations induced the negative binding free energy changes. The L455F mutation localized on the RBM has low frequency but the highest absolute Fig. 6 The time evolution of 264 SARS-CoV-2 S protein mutations. The red lines represent the RBD mutations that strengthen the infectivity of SARS-CoV-2 (i.e., ΔΔG is positive), the blue lines represent the RBD mutations that weaken the infectivity of SARS-CoV-2 (i.e., ΔΔG is negative), and the green lines are for S protein mutations that away from the RBD. The mutation with the highest frequency is D614G. binding free energy changes, while the A411S localized outside the RBM with low positive binding free energy change has the highest frequency. Although N354K has relatively high binding free energy change, the frequency is low. Mutations P384S and Q414E have negative binding free energy changes, with the total frequency equals to 4. In general, we can say mutations in Cluster A lead to a less contagious SARS-CoV-2 substrain. It is worth noting that from Table S1, mutations in Cluster A are involved in all of the 20 states with a relatively large number of sequences.
Cluster B infectivity. Figure 8b describes the binding free energy changes of mutations in Cluster B. There are 12 single mutations on the S protein RBD and five single mutations on the RBM. Although the number of single mutations with positive binding free energy changes is less than that with negative binding free energy changes, the high frequency of K444N on RBM enhances the infectivity of SARS-CoV-2. Therefore, the mutations in Cluster B may strengthen the infectivity of SARS-CoV-2. We notice that all of the states in Table S1 are associated with Cluster Cluster C infectivity. Figure 8c describes the binding free energy changes in Cluster C. Although mutatin A475V on the RBM has a negative binding free energy change with a relatively high frequency (5) on the RBM, the much higher frequencies of two mutations G476S (7) and V483A (31) with positive binding free energy change suggests that the mutations in Cluster C may strengthen the infectivity of SARS-CoV-2 in general. Notably, the V483A mutation is localized on the RBM with the highest frequency, indicating that V483A may favor SARS-CoV-2 by natural selection and cause SARS-CoV-2 more infectious. Although A348T has relatively low frequencies due to the limited number of genome samples, their high binding free energy changes may lead to a more contagious SARS-CoV-2 substrain. It is worth noting that from Table S1, mutations in Cluster C are involved in all of the 20 states except for NM. However, AK, CT, DC, and LA each only have fewer than 5 SARS-CoV-2 isolates related to Cluster C.
Cluster D infectivity. The binding free energy changes of RDB mutations in Cluster D are illustrated in Fig. 8d. Twenty-five different single mutations are classified in Cluster D. Among them, 13 mutations have negative binding free energy changes and relatively high frequencies, showing that overall the mutations in Cluster D may reduce the transmission capacity of SARS-CoV-2. Moreover, on the RBM, we can see that mutation Q493L has a relatively high binding free energy change and frequency. In addition, mutation A344S has the highest frequency among the 13 infectivity-weaken mutations. As shown in Table S1, all of these 20 states have mutations in Cluster D, especially in CA, MN, NY, WA, and WI.
Based on the clusters' infectivity analysis, mutations in Clusters A and D may make the SARS-CoV-2 more contagious, while Clusters B and C may lead to less contagious SARS-CoV-2 strains. However, since the infectivity-strengthening D614G mutation is associated with all clusters and essentially all the US genome isolates, it may be reasonable to say all of the US SARS-CoV-2 substrains become more infectious compared with the original genome collected on 24 December 2019 in China.

Methods
Data collection and pre-processing. On 5 January 2020, the complete genome sequence of SARS-CoV-2 was first released on the GenBank (Access number: NC_045512.2) 5 . Since then, there has been a rapid accumulation of SARS-CoV-2 genome sequences. In this work, 45,494 complete genome sequences with high coverage of SARS-CoV-2 strains from the infected individuals in the world were downloaded from the GISAID database 2 (https://www.gisaid.org/) as of 11 September 2020. All the incomplete records and those without the exact submission date in GISAID were not considered. To rearrange the complete genome sequences according to the reference SARS-CoV-2 genome, multiple sequence alignment (MSA) is carried out by using Clustal Omega 59 with default parameters.
Single-nucleotide polymorphism calling. Single-nucleotide polymorphism (SNP) calling measures the genetic variations between different members of a species. Establishing the SNP calling method to the investigation of the genotype changes during the transmission and evolution of SARS-CoV-2 is of great importance 9,10 . By analyzing the rearranged genome sequences, SNP profiles, which record all of the SNP positions in teams of the nucleotide changes and their corresponding positions, can be constructed. The SNP profiles of a given SARS-CoV-2 genome isolated from a COVID-19 patient capture all the differences from a complete reference genome sequence and can be considered as the genotype of the individual SARS-CoV-2.
Distance of SNP variants. In this work, we use the Jaccard distance to measure the similarity between SNP variants and compare the difference between the SNP variant profiles of SARS-CoV-2 genomes.
The Jaccard similarity coefficient is defined as the intersection size divided by the union of two sets A and B 62 : The Jaccard distance of two sets A and B is scored as the difference between one and the Jaccard similarity coefficient and is a metric on the collection of all finite sets: Therefore, the genetic distance of two genomes corresponds to the Jaccard distance of their SNP variants. In principle, the Jaccard distance of SNP variants takes account of the ordering of SNP positions, i.e., transmission trajectory, when an appropriate reference sample is selected. However, one may fail to identify the infection pathways from the mutual Jaccard distances of multiple samples. In this case, the dates of the sample collection provide key information. Additionally, clustering techniques, such as k-means, UMAP, and t-distributed stochastic neighbor embedding (t-SNE), enable us to characterize the spread of COVID-19 onto the communities.
K-means clustering. K-means clustering aims at partitioning a given data set X ¼ fx 1 ; x 2 ; Á Á Á ; x n ; Á Á Á ; x N g; x n 2 R d into k clusters {C 1 , C 2 , ⋯ , C k }, k ≤ N such that the specific clustering criteria are optimized. The standard K-means clustering algorithm picks k points as cluster centers randomly at beginning and separates each data to its nearest cluster. Here, k cluster centers will be updated subsequently by minimizing the within-cluster sum of squares (WCSS): where μ k is the mean of points locating in the kth cluster C k and n k is the number of points in C k . Here, ∥⋅∥ 2 denotes the L 2 distance. The aforementioned algorithm offers an optimal partition of k clusters. However, it is more important to find the best number of clusters for the given set of SNP variants. Therefore, the Elbow method is employed. A set of WCSS can be calculated in the k-means clustering process by varying the number of clusters k, and then plot WCSS according to the number of clusters. The optimal number of clusters will be the elbow in this plot. The WCSS measures the variability of the points within each cluster which is influenced by the number of points N. Therefore, as the number of total points of N increases, the value of WCSS becomes larger. Additionally, the performance of k-means clustering depends on the selection of the specific distance metric.
In this work, we implement k-means clustering with the Elbow method for analyzing the optimal number of the subtypes of SARS-CoV-2 SNP variants. The Jaccard distance-based representation is considered as the input features for the kmeans clustering method. If we have a total of N SNP variants concerning a reference genome in a SARS-CoV-2 sample, the location of the mutation sites for each SNP variant will be saved in the set S i , i = 1, 2, ⋯ , N. The Jaccard distance between two different sets (or samples) S i , S j is denoted as d J (S i , S j ). Therefore, the N × N Jaccard distance-based representation is This representation is used in our k-means clustering.
Topology-based machine learning prediction of protein-protein binding free energy changes following mutations. The topology-based network tree (Top-NetTree) model was developed by an innovative integration between the topological representation and network tree (NetTree) to predict the binding free energy changes of protein-protein interaction (PPI) following mutation ΔΔG 28 . The TopNetTree is applied to predict the binding free energy changes upon mutations that occurred on the RBD of SARS-CoV-2. Algebraic topology 30 is utilized to simplify the structural complexity of protein-protein complexes and embed vital biological information into topological invariants. NetTree integrates the advantages of convolutional neural networks (CNN) and gradient-boosting trees (GBT), such that CNN is treated as an intermediate model that converts vectorized element-and site-specific persistent homology features into a higher-level abstract feature, and GBT uses the upstream features and other biochemistry features for prediction. The performance test of tenfold cross-validation on the dataset (SKEMPI 2.0 63 ) carried out using gradient boosted regression trees (GBRTs). The errors with the SKEMPI2.0 dataset are 0.85 in terms of Pearson correlations coefficient (R p ) and 1.11 kcal/mol in terms of the root mean square error (RMSE) 28 .
Topology-based machine learning prediction of protein folding stability changes following mutation. In this work, the prediction of protein folding stability changes upon mutation is carried out using a topology-based mutation predictor (TML-MP) (https://weilab.math.msu.edu/TML/TML-MP/) which was introduced in literature 27 . The folding stability change following mutation ΔΔG = ΔG w −ΔG m measures the difference between the folding free energies of the wild type ΔG w and the mutant type ΔG w . More specifically, a positive folding stability change ΔΔG indicates that the mutation will stabilize the structure of the protein and vice versa. The essential biological information is revealed by persistent homology 30 . The machine learning features are generated by the element-specific persistent homology and biochemistry techniques. The dataset includes 2648 mutations cases in 131 proteins provided by Dehouck et al. 64 and is trained by a gradient boosted regression trees (GBRTs). The error with the corresponding dataset is given as Pearson correlations coefficient (R p ) of 0.79 and root mean square error (RMSE) of 0.91 kcal/mol from previous work 27 . The persistent homology is widely applied in a variety of practical feature generation problems 30 . It is also successful in the implementation of predictions of protein folding stability changes upon mutation 27 . The key idea in TML-MP is using the element-specific persistent homology (ESPH) which distinguishes different element types of biomolecules when building persistent homology barcodes. Commonly occurring protein element types include C, N, O, S, and H, where hydrogen and sulfur are excluded according to that hydrogen atoms are often absent from PDB data and sulfur atoms are too few in most proteins to be statistically important. Thus, C, N, and O elements are considered on the ESPH in protein characterization. Features are extracted from the different dimensions of persistent homology barcodes by dividing barcodes into several equally spaced bins which is called binned barcode representation. The auxiliary features, such as geometry, electrostatics, amino acid type composition, and amino acid sequence, are included for machine learning training as well. In TML-MP, gradient boosted regression trees (GBRTs) 29 are employed to train the dataset according to the size of the training dataset, absence of model overfitting, non-normalization of features, and ability of nonlinear properties 27 .
Graph network models. Graph networks can model interactions and their strength between pairs of units in molecules. These approaches are employed to understand mutation-induced structural changes. The biological and chemical properties are measured by comparing descriptors on different networks. In this work, the network consists of a set S of C α atoms from every residue of protein structure except the target mutation residue such that a C α atom is included if it is within 16 Å to any atom of the target mutation. The total atom set T is defined as the atoms (C, N, and O) of the target residue and C α atoms of the network set S. Moreover, two vertices are connected in the network if their distance is <8 Å. Thus the adjacency matrix A can be defined as well where A is a matrix containing 0 and 1 such that A(i, j) = 0 if ith and jth atoms are disconnected and A(i, j) = 1 if ith and jth atoms are connected. Two graph network models employed in this work are described below.
Flexibility-rigidity index. FRI was introduced to study the flexibility of protein molecules 25,26 . The single residue molecular rigidity index measures its influence on the set S which is given as where α = w or m stands for the wild type w or the mutant type m, N S is the number of C α atoms of the set S, and N T is the number of atoms in total atom set T.
Here, ∥r i − r j ∥ is the distance between atoms at r i and r j . The molecular FRI rigidity R η measures the topological connectivity and the geometric compactness of the network consisting of C α at each residue and the heavy atoms involved in the mutant.
Average subgraph centrality. Average subgraph centrality is built on the exponential of the adjacency matrix, E = e A , where A is the aforementioned adjacency matrix. The subgraph centrality is the summation of weighted closed walks of all lengths starting and ending at the same node 11,31 . Thus the average subgraph centrality reveals the average of participating rate of each vertex in all subgraph and the network motif, which is given as where I T is the index set of the mutation residue.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
In this work, 45,494 complete genome sequences with high coverage of SARS-CoV-2 strains from the infected individuals in the world were downloaded from the GISAID database 2 (https://www.gisaid.org/) as of 11 September 2020. The reference genome of SARS-CoV-2 was first released on the GenBank (Access number: NC_045512.2) 5 . All of the complete genome sequences we used in this paper are listed in the The amino acid sequence of NSP2, NSP12, NPS13, Spike protein, ORF3a, ORF8, and Nucleocapsid were downloaded from the GenBan. The three-dimensional (3D) structures of NSP12, spike protein, and ORF3a used in this work were extracted from the Protein Data Bank (https://www.rcsb.org/), denoted as 7BTF, 6VYB, and 6XDC, respectively. The 3D structures of NSP2, ORF8, NSP13, and Nucleocapsid were generated by I-TASSER model 61 . The 3D structure graph is created by using PyMOL 39 .