## Introduction

Since the outbreak in December 2019, COVID-19 has been pandemic in over 200 countries. Cases of infection and mortalities have been surging and are an ongoing threat to public health1,2. COVID-19 is caused by infection with the novel coronavirus SARS-CoV-23,4,5. Although as a coronavirus, SARS-CoV-2 has genetic proofreading mechanisms6,7,8, the persistent natural selection pressure in the population drives the virus to gradually accumulate favorable mutations6,9,10,11. Much attention has been paid to the mutations and evolution of SARS-CoV-212,13,14,15,16,17, since mutations are related to the infectivity and pathogenicity of viruses14,18,19,20,21,22. Beneficial mutants of the virus can better evolve and adapt to the host9, either strengthening or weakening the infectivity and pathogenicity. In addition, certain variants may generate drug resistance and reduce the efficacy of vaccines and therapeutics23,24,25,26,27,28. In short, studying mutations and evolution in detail is vital to understand the transformations of viral properties and to control the pandemic.

A new variant of SARS-CoV-2 named VOC-202012/01 (Variant of Concern 202012/01) or lineage B.1.1.7 was first detected in the United Kingdom last December29. It appears to be substantially more transmissible than other variants30. The variant has been growing exponentially in the United Kingdom and rapidly spreading to other countries31,32. However, it is not yet clear whether it evolved spontaneously in the United Kingdom or was imported from other countries. Studying how the variant B.1.1.7 mutates will enable researchers to track its spread over time and to understand the evolution of SARS-CoV-2.

In this study, large-scale SARS-CoV-2 sequences, consisting of more than 454,000 spike genes/proteins and 14,000 whole-genome sequences were analyzed. Our extensive sequence analysis showed that many mutations always co-occur not only in the spike protein of B.1.1.7, but in the whole genome of SARS-CoV-2. The mutation trajectories of the spike protein indicate that the early variant B.1.1.7 did not evolve spontaneously in the United Kingdom or even within human populations. We also investigated possible SARS-CoV-2 transmission chains of the variant B.1.1.7 based on the mutation analysis of large-scale spike proteins and the cluster analysis of spike genes. Over the whole genome, the top 25 high-frequency mutations of SARS-CoV-2 converged into several potential co-mutation patterns, each of which showed a strong correlation. The potential co-mutations depicted the evolutionary trajectory of SARS-CoV-2 virus in the population, shaping variable replication of SARS-CoV-2. In addition, we further explored the effect of the dominant (co-)mutations 5′UTR_c241t, NSP3_c3037t, and NSP12_c14408t on viral replication using a SARS-CoV-2 replicon based on a four-plasmid in vitro ligation system. The results suggest that such mutations significantly attenuate the replication of SARS-CoV-2.

## Results

### Evolutionary trajectories of variant B.1.1.7

The variant B.1.1.7 was generally defined by multiple amino acid changes including three deletions (69–70 del and 145 del) and seven mutations (N501Y, A570D, D614G, P681H, T716I, S982A, and D1118H) in the spike protein31. The number of non-adjacent co-occurrent changes indicates that they resulted from accumulated mutations. We therefore explored the evolutionary trajectories of B.1.1.7 by tracing the incremental mutations (Fig. 1a). All routes along the directions of the arrows are possible evolutionary trajectories of lineage B.1.1.7. Among all the mutation routes, the green one was the most probable mutation trajectory based on the number of variant strains. However, it was unlikely that the earliest variant B.1.1.7 (GISAID: EPI_ISL_601443, 2020-09-20, England) with nine mutations evolved from the existing variants with 3–8 mutations, because the former arose much earlier than the latter. More than 454,000 SARS-CoV-2 strains have been collected and extensive sequenced from infected humans without finding intermediate variants with 3–9 mutations. It is therefore unlikely that the intermediate variants with 3–8 mutations have infected humans. Thus, the early variant B.1.1.7 might not have arisen spontaneously in the United Kingdom or within human populations. An alternative hypothesis is that spillover likely occurred from susceptible animals.

The coappearance rates (see Materials and methods) of all nine mutations are shown in Fig. 1b. We found that at least five mutations (145 del, A570D, T716I, S982A, and D1118H) of variant B.1.1.7 significantly co-occurred (rate > 95%), which indicates a potential co-mutation pattern in the spike protein, causing us to wonder what selection pressure drove such co-occurrences of mutations and rapid evolution in the population of SARS-CoV-2. Note that coronaviruses generally tend to exhibit rapid evolution when they jump to a different species33. We therefore analyzed the key spike genes and proteins of existing SARS-CoV-2 strains collected from animals to find a possible direct progenitor of variant B.1.1.7. The variant with mutations “56” (labeled by “*” in Fig. 1a, termed star variant) had the minimum phylogenetic distance with EPI_ISL_699508, which was collected from a dog on 2020-07-28 (Fig. 2) using MEGA34,35 (see Materials and methods). The strains collected from tigers, minks, and cats were also close to the star variant. Our extensive analyses including mutations, phylogeny (Fig. 2), collection date/location and the number of sequences (Supplementary Tables S1–S3) suggested that Canidae, Mustelidae or Felidae, especially the Canidae family (e.g., dog) could be a possible host of the direct progenitor of variant B.1.1.7. The possible transmission chains of variant B.1.1.7 are shown in Fig. 1c. This star variant strains in humans could not have evolved into the early variant B.1.1.7, but they might have infected high-density yet susceptible animals (such as dogs) and adapted to these species through rapid mutation. Such progenitor variants comprised most or all of the mutations of the early variant B.1.1.7 within the Canidae family populations, and they may have spilled back to humans after the rapid mutation period.

### High-frequency mutations converge into potential co-mutations

Based on sequence alignment and mutation analysis, we found that 7441 nucleotide alterations in the viral 29,903-letter RNA code occurred at least once in the samples from COVID-19 patients. These mutations were dispersed in the 14,427 SARS-CoV-2 strains collected from all around the world. As shown in the heatmap of the top 1% high-frequency mutations (Supplementary Table S4), some sites show very similar mutation rates on most days in samples isolated globally (Supplementary Fig. S1), including 8898 and 815 samples isolated from the U.S. (Supplementary Fig. S2) and Australia (Supplementary Fig. S3). Therefore, these mutations shown in Supplementary Fig. S4a were selected and clustered into co-occurrences, which we called potential co-mutation patterns. From the landscape of the mutation rates (Supplementary Fig. S4a), 25 nucleotide sites were clearly clustered into several potential co-mutation patterns. Among these patterns, there was one consisting of the top 4 high-frequency mutations (i.e., 5′UTR_c241t, NSP3_c3037t, NSP12_c14408t, and S_a23403g), which converged into a dominant potential co-mutation pattern. Such co-occurrence lineage has been found in almost all sequenced samples of SARS-CoV-2. Within this co-occurrence pattern, mutation S_ a23403g resulted in the amino acid change (D614G) that apparently enhances viral infectivity6,20, albeit debate exists18. Notably, there were three successive sites at the 28881st to 28883rd positions of the virus (N_g28881a, N_g28882a, and N_g28883c) that strictly co-occurred. Comparing Supplementary Fig. S4a–c and Table S4, we found that the top 14 high-frequency mutations formed five common co-occurrence patterns.

To assess the above co-occurrence patterns, we analyzed the correlations and statistical significance levels of the high-frequency co-occurrence mutations. The heatmap of the paired Pearson-correlation-coefficients (PCC, Fig. 3a) shows that the top 25 high-frequency mutations clearly cluster into several potential co-mutation groups/patterns with very strong correlation (≥0.8). By regression analyses, the above co-occurrence patterns have statistical significance levels with P-values < 10−44 (Fig. 3b). The detailed mutation transitions (Fig. 3c–f, g–i, j–k, l–n and Supplementary Figs. S5–S7) provide further evidence that the above mutations form co-mutation patterns.

### Dominant mutations attenuate viral replication

We further explored the effect of the dominant mutations 5′UTR_c241t, NSP3_c3037t, and NSP12_c14408t on viral replication using a SARS-CoV-2 replicon based on a four-plasmid in vitro ligation system. This replicon is devoid of the viral structural proteins while undergoing viral replication, and the viral replication is sensitive to the antiviral agent remdesivir36. The 5′UTR_c241t mutation resides in a highly conserved region in the 5′UTR (Fig. 4a). The NSP3_c3037t mutation is synonymous. The NSP12_c14408t mutation is nonsynonymous with an amino acid change of a conserved amino acid P323 in the viral RNA-dependent RNA polymerase (Fig. 4b). We introduced the NSP12_c14408t mutation or the NSP12_c14408t mutation with the other two mutations 5′UTR_c241t and NSP3_c3037t into the replicon plasmids. The fragments were released from the plasmids by BsaI digestion, and then assembled by in vitro ligation with T4 ligase (Fig. 4c). Replicon RNA transcribed from the ligation products was co-transfected with nucleocapsid (N) mRNA into Huh7 cells. RNA replication was monitored by measuring the secreted Gaussia luciferase activity in the supernatants. Enzymatic dead mutants (759-SAA-761) of the RNA-dependent RNA polymerase (RdRp) NSP12 were introduced, and the mutated replicon served as a non-replication control. As shown in Fig. 4d, transfection of wild type replicon RNA resulted in an obvious increase of luciferase activity, and dead mutant SAA RNA did not replicate. Introduction of NSP12_c14408t mutation resulted in a significant reduction of viral replication. The combination of NSP12_c14408t mutation with the other two mutations further reduced viral replication. These results demonstrate that the P323L mutation in the viral RdRp reduces viral replication, and the synonymous mutations may further attenuate viral replication.

## Discussion

A well-resolved phylogeny of variant B.1.1.7 spike genes provides an opportunity to understand the evolutionary process and transmission chains of variant B.1.1.7. Our incremental mutation and phylogenetic analyses on large-scale SARS-CoV-2 spike proteins/genes revealed that the early variant B.1.1.7 might not have evolved spontaneously in the United Kingdom or within human populations. In this case, the spillover likely occurred from susceptible animals. Current evidence37,38,39 indicates that SARS-CoV-2 can effectively infect both domestic animals (for example, dog, cat, pig, and bovine) and wild animals (e.g., mink, rabbit, and fox) by binding host angiotensin converting enzyme 2 (ACE2). Our further analyses including mutations, phylogeny, collection date/location and the number of sequences suggested that the earliest variant B.1.1.7 possibly originated from Canidae, Mustelidae or Felidae, especially the Canidae family (e.g., dog). The cases40 that the variant B.1.1.7 can easily infect dogs and cats indicated that both are susceptible to B.1.1.7. Still, due to the limited information available to date, an alternative hypothesis is that the direct progenitor of variant B.1.1.7 is yet to be sampled. In addition to variant B.1.1.7, as a future topic we will work on the analysis of other lineages such as P.1, B.1.351, B.1.427, and B.1.42, when sufficient numbers of their sequences are available.

By tracing the mutation trajectories, we found that at least five mutations of the spike proteins always co-occurred, and a large number of potential co-mutations appeared in the top 1% high-frequency mutations of SARS-CoV-2 whole genome. It has been documented that the mutation S_a23403g results in the amino acid change of the spike protein D614G and enhances viral infectivity14,41,42,43,44. Here, by using a SARS-CoV-2 reporter replicon system, we demonstrated that one of the dominant co-mutations NSP12_c14408t significantly reduced viral replication and combination of NSP12_c14408t mutation with the other two synonymous mutations 5′UTR_c241t and NSP3_c3037t further reduced viral replication further. As the 5′UTR plays an important role in regulating viral replication, the synonymous mutations 5′UTR_c241t may attenuate viral replication by change RNA secondary structure45. These findings imply that SARS-CoV-2 undergoes an evolution toward enhancing viral infectivity while attenuating viral replication. SARS-CoV-2 has exhibited significant mutations and co-mutations. We evaluated the replication of a co-mutation pattern including three dominant mutations. Whether other mutations act similarly on the viral replication needs to be verified. These results can be further explored for efficient vaccine design by combining the information of the associated network biomarkers46,47, dynamic network biomarkers48,49,50 or conserved k-mers51,52,53 in our future work. In summary, this study provides insights into the transmission chains of variant B.1.1.7 and the effect of viral dominant mutations on viral evolution.

## Materials and methods

### Data selection and pre-processing

The 454,443 spike gene/protein sequences of SARS-CoV-2 were obtained at https://www.gisaid.org/. The NCBI website at https://www.ncbi.nlm.nih.gov/sars-cov-2/ has released >1.7 thousand sequences of SARS-CoV-2 viruses before July 31, 2020. We selected 14,427 sequences that satisfied two criteria: (1) having specific collection dates; (2) sequence-lengths being no less than 29,305 nt (29903*0.98). It is inevitable that some sites of sequences are equivocal owing to the limitation of sequencing depth. For instance, many sites were labeled as letter N in genome sequences. The noise of indeterminate nucleic acids was taken into consideration in our experiments so as to boost accuracy. The co-mutation rate of multi-site co-mutations was calculated by $${\mathrm{co}} {\hbox{-}} {\mathrm{mutation}}\;{\mathrm{rate}} = \frac{{{\mathrm{number}}\;{\mathrm{of}}\;{\mathrm{sequences}}\;{\mathrm{containing}}\;{\mathrm{co}}{\hbox{-}} {\mathrm{mutations}}}}{{{\mathrm{number}}\;{\mathrm{of}}\;{\mathrm{all}}\;{\mathrm{sequences}}}}$$. Moreover, the co-appearance rate of a mutation in B.1.1.7 variant was defined by $${\mathrm{co}} {\hbox{-}} {\mathrm{appearance}}\;{\mathrm{rate}} = \frac{{{\mathrm{number}}\;{\mathrm{of}}\;{\mathrm{B}}.{\mathrm{1}}.{\mathrm{1}}.{\mathrm{7}}\;{\mathrm{sequences}}}}{{{\mathrm{number}}\;{\mathrm{of}}\;{\mathrm{sequences}}\;{\mathrm{containing}}\;{\mathrm{a}}\;{\mathrm{mutation}}}}$$.

### Analyses on the possible animal hosts

In addition to the phylogenetic analysis, we further explored the possible animal hosts of the direct progenitor of variant B.1.1.7 by mutations, collection time/space of strains, the number of sequences and the edit distance54,55 of mutations (Supplementary Tables S1, S2). Owing to the late lockdown policies of some governmental agencies, the spread of SARS-CoV-2 has not been prevented well in Europe, America, and Australia. We could ignore the impact of policies for studying the origin of variant B.1.1.7. We quantified the multiple impact factors of viral transmission as shown in Supplementary Table S3 based on the criterion that the smaller the value, the more similar to the star variant. The results still supported that the Canidae family is a possible host of the direct progenitor of variant B.1.1.7.

### MEGA version and parameter settings

Version: MEGA-X

Statistical method: maximum likelihood

Test of phylogeny: none

Model/method: Jones-Taylor-Thornton (JTT) model

Rates among sites: uniform rates

Gaps/missing data treatment: use all sites

ML Heuristic method: nearest-neighbor-interchange (NNT)

Initial tree for ML: make initial tree automatically (Default-NJ/BioNJ)

Branch swap filter: none

### Statistical analysis

PCC is a classic statistic that measures linear correlation between two variables. Its value ranges from –1.0 to 1.0. Normally, the two variables meet a strong correlation or a very strong correlation when the absolutes value of PCC is between 0.6 and 0.8 or between 0.8 and 1.0. Linear regression is a linear approach to model the relationship between a scalar response and one or more variables. We used PCC and significance level (P-value) of regression analysis to evaluate the relationships of the co-occurrence mutations in large-scale SARS-CoV-2 examples.

### Plasmids

Four plasmids encompassing the viral genome (pLC-nCoV-A-BsaI, pLC-nCoV-B-BsaI, pLC-nCoV-C-BsaI, and pnCoV-D-sGluc-BsaI) were described previously36. The 5′UTR_c-241-t and NSP3_c-3037-t mutations were introduced into the pLC-nCoV-A-BsaI by fusion Polymerase Chain Reaction (PCR). The NSP12_c-14408-t mutation was introduced into the pLC-nCoV-B-BsaI by fusion PCR.

### Cell lines

The human hepatoma cells Huh 7 were purchased from the Cell Bank of the Chinese Academy of Sciences (www.cellbank.org.cn) and routinely maintained in Dulbecco’s modified medium supplemented with 10% FBS (Gibco) and 25 mM HEPES (Gibco).

### In vitro ligation

BsaI digested fragments were gel purified using Gel Extraction Kit (OMEGA) and ligated with T4 ligase (New England Biolabs) at room temperature for 1 h. The ligation products were phenol/chloroform extracted, precipitated by absolute ethanol, and resuspended in nuclease-free water, quantified by determining the A260 absorbance.

### In vitro transcription

Purified in vitro ligated product was used as template for the in vitro transcription by mMESSAGE mMACHINE T7 Transcription Kit (Ambion) according to the manufacturer’s protocol. For N mRNA production, we amplified the N protein-coding region by PCR (sense: GGC ACA CCC CTT TGG CTC T; antisense: TTT TTT TTT TTT TTT TTT TTT TTT TTT TTT TTT TTT TTT TCT AGG CCT GAG TTG AGT CAG CAC) with phCMV-N as template. Then the purified PCR product was used as a template for in vitro transcription by mMESSAGE mMACHINE T7 Transcription Kit as described above. RNA was purified by RNeasy mini Elute (Qiagen), eluted in nuclease-free water, and quantified by UV absorbance (260 nm).

### Transfection

Cells were seeded onto 48-well plates at a density of 7.5 × 104 per well and then transfected with 0.3 µg in vitro transcribed RNA using a TransIT-mRNA transfection kit (Mirus) according to the manufacturer’s protocol.

### Luciferase activity

Supernatants were taken from cell medium and mixed with equal volumes of 2× lysis buffer (Promega). Luciferase activity was measured with Renilla luciferase substrate (Promega) according to the manufacturer’s protocol.