Potential transmission chains of variant B.1.1.7 and co-mutations of SARS-CoV-2

The presence of SARS-CoV-2 mutants, including the emerging variant B.1.1.7, has raised great concerns in terms of pathogenesis, transmission, and immune escape. Characterizing SARS-CoV-2 mutations, evolution, and effects on infectivity and pathogenicity is crucial to the design of antibody therapies and surveillance strategies. Here, we analyzed 454,443 SARS-CoV-2 spike genes/proteins and 14,427 whole-genome sequences. We demonstrated that the early variant B.1.1.7 may not have evolved spontaneously in the United Kingdom or within human populations. Our extensive analyses suggested that Canidae, Mustelidae or Felidae, especially the Canidae family (for example, dog) could be a possible host of the direct progenitor of variant B.1.1.7. An alternative hypothesis is that the variant was simply yet to be sampled. Notably, the SARS-CoV-2 whole-genome represents a large number of potential co-mutations. In addition, we used an experimental SARS-CoV-2 reporter replicon system to introduce the dominant co-mutations NSP12_c14408t, 5′UTR_c241t, and NSP3_c3037t into the viral genome, and to monitor the effect of the mutations on viral replication. Our experimental results demonstrated that the co-mutations significantly attenuated the viral replication. The study provides valuable clues for discovering the transmission chains of variant B.1.1.7 and understanding the evolutionary process of SARS-CoV-2.


Introduction
Since the outbreak in December 2019, COVID-19 has been pandemic in over 200 countries. Cases of infection and mortalities have been surging and are an ongoing threat to public health 1,2 . COVID-19 is caused by infection with the novel coronavirus SARS-CoV-2 [3][4][5] . Although as a coronavirus, SARS-CoV-2 has genetic proofreading mechanisms [6][7][8] , the persistent natural selection pressure in the population drives the virus to gradually accumulate favorable mutations 6,[9][10][11] . Much attention has been paid to the mutations and evolution of SARS-CoV-2 [12][13][14][15][16][17] , since mutations are related to the infectivity and pathogenicity of viruses 14,[18][19][20][21][22] . Beneficial mutants of the virus can better evolve and adapt to the host 9 , either strengthening or weakening the infectivity and pathogenicity. In addition, certain variants may generate drug resistance and reduce the efficacy of vaccines and therapeutics [23][24][25][26][27][28] . In short, studying mutations and evolution in detail is vital to understand the transformations of viral properties and to control the pandemic.
A new variant of SARS-CoV-2 named VOC-202012/ 01 (Variant of Concern 202012/01) or lineage B.1.1.7 was first detected in the United Kingdom last December 29 . It appears to be substantially more transmissible than other variants 30 . The variant has been growing exponentially in the United Kingdom and rapidly spreading to other countries 31,32 . However, it is not yet clear whether it evolved spontaneously in the United Kingdom or was imported from other countries.
Studying how the variant B.1.1.7 mutates will enable researchers to track its spread over time and to understand the evolution of SARS-CoV-2.
In this study, large-scale SARS-CoV-2 sequences, consisting of more than 454,000 spike genes/proteins and 14,000 whole-genome sequences were analyzed. Our extensive sequence analysis showed that many mutations always co-occur not only in the spike protein of B.1.1.7, but in the whole genome of SARS-CoV-2. The mutation trajectories of the spike protein indicate that the early variant B.1.1.7 did not evolve spontaneously in the United Kingdom or even within human populations. We also investigated possible SARS-CoV-2 transmission chains of the variant B.1.1.7 based on the mutation analysis of largescale spike proteins and the cluster analysis of spike genes. Over the whole genome, the top 25 high-frequency mutations of SARS-CoV-2 converged into several potential co-mutation patterns, each of which showed a strong correlation. The potential co-mutations depicted the evolutionary trajectory of SARS-CoV-2 virus in the population, shaping variable replication of SARS-CoV-2. In addition, we further explored the effect of the dominant (co-)mutations 5′UTR_c241t, NSP3_c3037t, and NSP12_c14408t on viral replication using a SARS-CoV-2 replicon based on a four-plasmid in vitro ligation system. The results suggest that such mutations significantly attenuate the replication of SARS-CoV-2.

Results
Evolutionary trajectories of variant B.1.1.7 The variant B.1.1.7 was generally defined by multiple amino acid changes including three deletions (69-70 del and 145 del) and seven mutations (N501Y, A570D, D614G, P681H, T716I, S982A, and D1118H) in the spike protein 31 . The number of non-adjacent co-occurrent changes indicates that they resulted from accumulated mutations. We therefore explored the evolutionary trajectories of B.1.1.7 by tracing the incremental mutations (Fig. 1a). All routes along the directions of the arrows are possible evolutionary trajectories of lineage B.1.1.7. Among all the mutation routes, the green one was the most probable mutation trajectory based on the number of variant strains. However, it was unlikely that the earliest variant B.1.1.7 (GISAID: EPI_ISL_601443, 2020-09-20, England) with nine mutations evolved from the existing variants with 3-8 mutations, because the former arose much earlier than the latter. More than 454,000 SARS-CoV-2 strains have been collected and extensive sequenced from infected humans without finding intermediate variants with 3-9 mutations. It is therefore unlikely that the intermediate variants with 3-8 mutations have infected humans. Thus, the early variant B.1.1.7 might not have arisen spontaneously in the United Kingdom or within human populations. An alternative hypothesis is that spillover likely occurred from susceptible animals.
The coappearance rates (see Materials and methods) of all nine mutations are shown in Fig. 1b. We found that at least five mutations (145 del, A570D, T716I, S982A, and D1118H) of variant B.1.1.7 significantly co-occurred (rate > 95%), which indicates a potential co-mutation pattern in the spike protein, causing us to wonder what selection pressure drove such co-occurrences of mutations and rapid evolution in the population of SARS-CoV-2. Note that coronaviruses generally tend to exhibit rapid evolution when they jump to a different species 33 . We therefore analyzed the key spike genes and proteins of existing SARS-CoV-2 strains collected from animals to find a possible direct progenitor of variant B.1.1.7. The variant with mutations "56" (labeled by "*" in Fig. 1a, termed star variant) had the minimum phylogenetic distance with EPI_ISL_699508, which was collected from a dog on 2020-07-28 ( Fig. 2) using MEGA 34,35 (see Materials and methods). The strains collected from tigers, minks, and cats were also close to the star variant. Our extensive analyses including mutations, phylogeny (Fig. 2), collection date/location and the number of sequences (Supplementary Tables S1-S3) suggested that Canidae, Mustelidae or Felidae, especially the Canidae family (e.g., dog) could be a possible host of the direct progenitor of variant B.1.1.7. The possible transmission chains of variant B.1.1.7 are shown in Fig. 1c. This star variant strains in humans could not have evolved into the early variant B.1.1.7, but they might have infected high-density yet susceptible animals (such as dogs) and adapted to these species through rapid mutation. Such progenitor variants comprised most or all of the mutations of the early variant B.1.1.7 within the Canidae family populations, and they may have spilled back to humans after the rapid mutation period.

High-frequency mutations converge into potential co-mutations
Based on sequence alignment and mutation analysis, we found that 7441 nucleotide alterations in the viral 29,903letter RNA code occurred at least once in the samples from COVID-19 patients. These mutations were dispersed in the 14,427 SARS-CoV-2 strains collected from all around the world. As shown in the heatmap of the top 1% high-frequency mutations (Supplementary Table S4), some sites show very similar mutation rates on most days in samples isolated globally ( Supplementary Fig. S1), including 8898 and 815 samples isolated from the U.S. (Supplementary Fig. S2) and Australia ( Supplementary  Fig. S3). Therefore, these mutations shown in Supplementary Fig. S4a were selected and clustered into cooccurrences, which we called potential co-mutation patterns. From the landscape of the mutation rates  ( Supplementary Fig. S4a), 25 nucleotide sites were clearly clustered into several potential co-mutation patterns. Among these patterns, there was one consisting of the top 4 high-frequency mutations (i.e., 5′UTR_c241t, NSP3_c3037t, NSP12_c14408t, and S_a23403g), which converged into a dominant potential co-mutation pattern. Such co-occurrence lineage has been found in almost all sequenced samples of SARS-CoV-2. Within this cooccurrence pattern, mutation S_ a23403g resulted in the amino acid change (D614G) that apparently enhances viral infectivity 6,20 , albeit debate exists 18 . Notably, there were three successive sites at the 28881st to 28883rd positions of the virus (N_g28881a, N_g28882a, and N_g28883c) that strictly co-occurred. Comparing Supplementary Fig. S4a-c and Table S4, we found that the top 14 high-frequency mutations formed five common cooccurrence patterns.
To assess the above co-occurrence patterns, we analyzed the correlations and statistical significance levels of the high-frequency co-occurrence mutations. The heatmap of the paired Pearson-correlation-coefficients (PCC, Fig. 3a) shows that the top 25 high-frequency mutations clearly cluster into several potential co-mutation groups/ patterns with very strong correlation (≥0.8). By regression analyses, the above co-occurrence patterns have statistical significance levels with P-values < 10 −44 (Fig. 3b). The detailed mutation transitions (Fig. 3c-f, g-i, j-k, l-n and Supplementary Figs. S5-S7) provide further evidence that the above mutations form co-mutation patterns.

Dominant mutations attenuate viral replication
We further explored the effect of the dominant mutations 5′UTR_c241t, NSP3_c3037t, and NSP12_c14408t on viral replication using a SARS-CoV-2 replicon based on a four-plasmid in vitro ligation system. This replicon is devoid of the viral structural proteins while undergoing viral replication, and the viral replication is sensitive to the antiviral agent remdesivir 36 . The 5′UTR_c241t mutation resides in a highly conserved region in the 5′UTR (Fig. 4a). The NSP3_c3037t mutation is synonymous. The NSP12_c14408t mutation is nonsynonymous with an amino acid change of a conserved amino acid P323 in the viral RNA-dependent RNA polymerase (Fig. 4b). We introduced the NSP12_c14408t mutation or the NSP12_c14408t mutation with the other two mutations 5′ UTR_c241t and NSP3_c3037t into the replicon plasmids. The fragments were released from the plasmids by BsaI digestion, and then assembled by in vitro ligation with T4 ligase (Fig. 4c). Replicon RNA transcribed from the ligation products was co-transfected with nucleocapsid (N) mRNA into Huh7 cells. RNA replication was monitored by measuring the secreted Gaussia luciferase activity in the supernatants. Enzymatic dead mutants (759-SAA-761) of the RNA-dependent RNA polymerase (RdRp) NSP12 were introduced, and the mutated replicon served as a non-replication control. As shown in Fig. 4d, transfection of wild type replicon RNA resulted in an obvious increase of luciferase activity, and dead mutant SAA RNA did not replicate. Introduction of NSP12_c14408t mutation resulted in a significant reduction of viral replication. The combination of NSP12_c14408t mutation with the other two mutations further reduced viral replication. These results demonstrate that the P323L mutation in the viral RdRp reduces viral replication, and the synonymous mutations may further attenuate viral replication.

Discussion
A well-resolved phylogeny of variant B.1.1.7 spike genes provides an opportunity to understand the evolutionary process and transmission chains of variant B.1.1.7. Our incremental mutation and phylogenetic analyses on largescale SARS-CoV-2 spike proteins/genes revealed that the early variant B.1.1.7 might not have evolved spontaneously in the United Kingdom or within human populations. In this case, the spillover likely occurred from susceptible animals. Current evidence [37][38][39] indicates that SARS-CoV-2 can effectively infect both domestic animals (for example, dog, cat, pig, and bovine) and wild animals (e.g., mink, rabbit, and fox) by binding host angiotensin converting enzyme 2 (ACE2). Our further analyses including mutations, phylogeny, collection date/location and the number of sequences suggested that the earliest variant B.1.1.7 possibly originated from Canidae, Mustelidae or Felidae, especially the Canidae family (e.g., dog). The cases 40  By tracing the mutation trajectories, we found that at least five mutations of the spike proteins always cooccurred, and a large number of potential co-mutations appeared in the top 1% high-frequency mutations of SARS-CoV-2 whole genome. It has been documented that the mutation S_a23403g results in the amino acid change of the spike protein D614G and enhances viral infectivity 14,[41][42][43][44] . Here, by using a SARS-CoV-2 reporter replicon system, we demonstrated that one of the dominant co-mutations NSP12_c14408t significantly reduced viral replication and combination of NSP12_c14408t mutation with the other two synonymous mutations 5′UTR_c241t and NSP3_c3037t further reduced viral replication further. As the 5′UTR plays an important role in regulating viral replication, the synonymous mutations 5′UTR_c241t may attenuate viral replication by change RNA secondary structure 45 . These findings imply that SARS-CoV-2 undergoes an evolution toward enhancing viral infectivity while attenuating viral replication. SARS-CoV-2 has exhibited significant mutations and co-mutations. We evaluated the replication of a co-mutation pattern including three dominant Fig. 3 The strong correlations suggest that the top 25 mutations form eight potential co-mutation patterns. a The correlation heatmap of the top 25 mutations. These mutations could be grouped into several clusters with high Pearson-correlation-coefficient (PCC). b Regression analysis of mutations shows that eight clusters all denote the statistical significance level: ***P-value < E-44. c-f, g-i, j-k, and l-n show the transitions of the high-frequency mutations. The sky-blue represents the rate per day of initial residue in population and the golden represents the rate per day of substitution/mutant. These mutation transitions provide further evidence that the above mutations potentially form co-mutation patterns.
mutations. Whether other mutations act similarly on the viral replication needs to be verified. These results can be further explored for efficient vaccine design by combining the information of the associated network biomarkers 46,47 , dynamic network biomarkers [48][49][50] or conserved k-mers 51-53 in our future work. In summary, this study provides insights into the transmission chains of variant B.1.1.7 and the effect of viral dominant mutations on viral evolution.

Data selection and pre-processing
The 454,443 spike gene/protein sequences of SARS-CoV-2 were obtained at https://www.gisaid.org/. The NCBI website at https://www.ncbi.nlm.nih.gov/sars-cov-2/ has released >1.7 thousand sequences of SARS-CoV-2 viruses before July 31, 2020. We selected 14,427 sequences that satisfied two criteria: (1) having specific collection dates; (2) sequence-lengths being no less than 29,305 nt Fig. 4 Dominant co-mutation attenuates viral replication. a Predicted RNA structure of the SARS-CoV-2 5′UTR. RNA structure of the 400-nt 5′UTR was predicted by "RNAstructure" (http://rna.urmc.rochester.edu/RNAstructureWeb). The start codon for nsp1 is gray, the TRS-L is orange, and the mutated nucleotides are red. The bottom panel shows the alignment of the 5′UTR of SARS-CoV-2 with 5′UTRs of related viruses, with c241 highlighted. b Structure of SARS-CoV-2 RNA-dependent RNA polymerase (RdRp) or RNA complex. The structure of SARS-CoV-2 RdRp/RNA complex (PDB, 6X2G) was visualized by Chimera (UCSF). The P323 mutation is highlighted in red, with the alignment of the amino acid sequences of SARS-CoV-2 and related viruses near the P323 position. c Schematic of the in vitro ligation system for SARS-CoV-2 replicon. Four plasmids encompassing the viral genome were digested by BsaI to release the four fragments. After gel purification, the fragments were ligated by T4 ligase. The ligation products were purified and used as template for RNA in vitro transcription. sGluc, secreted Gaussia luciferase; 2A, foot-and-mouth disease virus (FMDV) 2A peptide; BSD, blasticidin. d Huh7 cells were co-transfected with in vitro transcribed replicon RNA (wild type (WT) or the indicated mutants) and an mRNA encoding the SARS-CoV-2 N protein. The luciferase activity in the supernatants was measured at the time points indicated. Medium was changed at 8 h post-transfection. Data are shown as mean ± SEM (n = 8). SAA, the NSP12 polymerase active-site mutant. Unpaired Student's ttest was performed between the mutants and WT and between the mutants as indicated. *P < 0.05, **P < 0.01, ***P < 0.001.
(29903*0.98). It is inevitable that some sites of sequences are equivocal owing to the limitation of sequencing depth. For instance, many sites were labeled as letter N in genome sequences. The noise of indeterminate nucleic acids was taken into consideration in our experiments so as to boost accuracy. The co-mutation rate of multi-site comutations was calculated by co-mutation rate ¼ number of sequences containing co-mutations number of all sequences . Moreover, the co-appearance rate of a mutation in B.1.1.7 variant was defined by co-appearance rate ¼  (Supplementary Tables S1, S2). Owing to the late lockdown policies of some governmental agencies, the spread of SARS-CoV-2 has not been prevented well in Europe, America, and Australia. We could ignore the impact of policies for studying the origin of variant B.1.1.7. We quantified the multiple impact factors of viral transmission as shown in Supplementary Table S3 based on the criterion that the smaller the value, the more similar to the star variant. The results still supported that the Canidae family is a possible host of the direct progenitor of variant B.1.1.7.

MEGA version and parameter settings
Version: MEGA-X Statistical method: maximum likelihood Test of phylogeny: none Model/method: Jones-Taylor-Thornton (JTT) model Rates among sites: uniform rates Gaps/missing data treatment: use all sites ML Heuristic method: nearest-neighbor-interchange (NNT) Initial tree for ML: make initial tree automatically (Default-NJ/BioNJ) Branch swap filter: none Number of threads: 7.

Statistical analysis
PCC is a classic statistic that measures linear correlation between two variables. Its value ranges from -1.0 to 1.0. Normally, the two variables meet a strong correlation or a very strong correlation when the absolutes value of PCC is between 0.6 and 0.8 or between 0.8 and 1.0. Linear regression is a linear approach to model the relationship between a scalar response and one or more variables. We used PCC and significance level (P-value) of regression analysis to evaluate the relationships of the co-occurrence mutations in large-scale SARS-CoV-2 examples.

Cell lines
The human hepatoma cells Huh 7 were purchased from the Cell Bank of the Chinese Academy of Sciences (www. cellbank.org.cn) and routinely maintained in Dulbecco's modified medium supplemented with 10% FBS (Gibco) and 25 mM HEPES (Gibco).

In vitro ligation
BsaI digested fragments were gel purified using Gel Extraction Kit (OMEGA) and ligated with T4 ligase (New England Biolabs) at room temperature for 1 h. The ligation products were phenol/chloroform extracted, precipitated by absolute ethanol, and resuspended in nuclease-free water, quantified by determining the A260 absorbance.

In vitro transcription
Purified in vitro ligated product was used as template for the in vitro transcription by mMESSAGE mMA-CHINE T7 Transcription Kit (Ambion) according to the manufacturer's protocol. For N mRNA production, we amplified the N protein-coding region by PCR (sense:  GGC ACA CCC CTT TGG CTC T; antisense: TTT TTT  TTT TTT TTT TTT TTT TTT TTT TTT TTT TTT  TTT TCT AGG CCT GAG TTG AGT CAG CAC) with phCMV-N as template. Then the purified PCR product was used as a template for in vitro transcription by mMESSAGE mMACHINE T7 Transcription Kit as described above. RNA was purified by RNeasy mini Elute (Qiagen), eluted in nuclease-free water, and quantified by UV absorbance (260 nm).

Transfection
Cells were seeded onto 48-well plates at a density of 7.5 × 10 4 per well and then transfected with 0.3 µg in vitro transcribed RNA using a TransIT-mRNA transfection kit (Mirus) according to the manufacturer's protocol.

Luciferase activity
Supernatants were taken from cell medium and mixed with equal volumes of 2× lysis buffer (Promega). Luciferase activity was measured with Renilla luciferase substrate (Promega) according to the manufacturer's protocol.