Human SARS-CoV-2 has evolved to reduce CG dinucleotide in its open reading frames

The outbreak of COVID-19 has brought great threat to human health. Its causative agent is a severe acute respiratory syndrome-related coronavirus which has been officially named SARS-CoV-2. Here we report the discovery of extremely low CG abundance in its open reading frames. We found that CG reduction in SARS-CoV-2 is achieved mainly through mutating C/G into A/T, and CG is the best target for mutation. Meanwhile, 5′-untranslated region of SARS-CoV-2 has high CG content and is capable of forming an internal ribosome entry site (IRES) to recruit host ribosome for translating its RNA. These features allow SARS-CoV-2 to reproduce efficiently in host cells, because less energy is consumed in disrupting the stem-loops formed by its genomic RNA. Notably, genomes of cellular organisms also have very low CG abundance, suggesting that mutating C/G into A/T occurs universally in all life forms. Moreover, CG is the dinucleotide related to CpG island, mutational hotspot and single nucleotide polymorphism in cellular organisms. The relationship between these features is worthy of further investigations.


Scientific Reports
| (2020) 10:12331 | https://doi.org/10.1038/s41598-020-69342-y www.nature.com/scientificreports/ observed to expected frequency of a dinucleotide 10 . The genome of SARS-CoV-2 (29,903 nucleotides 2 , sequence number NC_045512) has 29.94% of A, 32.08% of T (T is used here instead of U for simplicity), 19.61% of G and 18.37% of C. Thus, the expected frequency of CG dinucleotide in viral genome is 3.60% (i.e. 19.61% × 18.37%). However, only 439 CGs are observed, which means the observed frequency of CG dinucleotide is 1.47% (i.e. 439/29,902). Therefore, odds ratio of CG in SARS-CoV-2 is 0.41 (i.e. 1.47%/3.60%). Furthermore, odds ratio of CG in open reading frames (ORFs) of the virus is 0.39, being the lowest among 24 coronaviruses under survey ( Fig. 1a and Table S1). Because a codon is composed of three nucleotides, a dinucleotide (e.g. CG) has three possible locations. Herewith, they are designated as (CG) 12 , (CG) 23 and (CG) 31 respectively. We found that the odds ratio of (CG) 23 in ORFs of SARS-CoV-2 is as low as 0.25, while that of (CA) 23 and (CT) 23 is as high as 1.54 and 1.92 respectively (Fig. 1c). Moreover, odds ratio of (CG) 31 in ORFs of SARS-CoV-2 is 0.50, while that of (AG) 31 and (TG) 31 is 1.52 and 2.64 respectively (Fig. 1d). These data strongly suggest that (CG) 23 has been mutated into (CA) 23 and (CT) 23 , and (CG) 31 has been mutated into (AG) 31 and (TG) 31 . The above-stated mutations are possible because very few of these mutations lead to changes in amino acids. To be specific, there are four codons containing (CG) 23 . They are TCG, CCG, ACG and GCG which code for serine, proline, threonine and alanine, respectively. Mutation of G at codon position 3 into T, C or A in all of them does not change the amino acid they encode. As for (CG) 31 , there are 16 codons having C at position 3. If this C is mutated into T, all 16 codons have the same meanings. And if it is mutated into A, 9 out of 16 codons still have the same meanings. Therefore, it is concluded that SARS-CoV-2 has evolved to reduce CG in ORFs mainly through mutating its G of (CG) 23 and C of (CG) 31 into A and T. Among them, C-to-T (i.e. C-to-U in RNA) occurs at a very high frequency probably because it is the simplest way to change a nucleotide (C becomes U after deamination). Besides, odds ratio of (CC) 23 is much lower than that of (CA) 23 and (CT) 23 (Fig. 1c). This does not mean that (CG) 23 has not been mutated into (CC) 23 . In fact, low odds ratio of (CC) 23 is due to high frequency of C-to-T mutation at position 3, i.e. from (CG) 31 into (TG) 31 (Fig. 1d). The above views are also supported by codon usage bias in SARS-CoV-2 ( Fig. 2), which shows that A/T-ended codons are much more frequently used than their synonymous G/C-ended codons. Besides, all four codons containing (CG) 23 have the lowest percentages of usage among their correspondent synonymous codons.
Low CG content in other coronaviruses. Odds ratios of CG in ORFs of other coronaviruses are also very low (mean value = 0.50, Fig. 3 and Table S1). This could have profound effect on viral replication, because ORFs of coronaviruses are immediately translated by host ribosomes after being released into the cytoplasm of host cells 9 . The translation of viral RNA is affected by two factors. One is that host ribosomes must be recruited to the 5′-UTR (untranslated region) of viral RNA for initiation of translation. The other is that stem-loops formed by ORFs of viral RNA must be disrupted to expose coding information during translation. In contrast to ORFs, 5′-UTR of coronaviruses have quite high odds ratios of CG (mean value = 0.84, Table S2). This would facilitate formation of stable secondary structure that could serve as the internal ribosome entry site (IRES) [11][12][13] for host ribosome (Fig. 4). Meanwhile, the viral RNA beginning at the translation start site (TSS) forms relatively unstable secondary structure, because its stem-loops are maintained by less hydrogen bonds (A-T and C-G base pairs have two and three hydrogen bonds respectively).
Stability variations of viral genomes at 5′-UTR and TSS-to-end regions could probably determine virulence of different viruses, because high stability of IRES structure means high efficiency in initiating translation, and high stability of TSS-to-end region means high energy consumption during translation. After high, medium and low stability of both 5′-UTR and TSS-to-end regions is given 3, 2 and 1 points respectively, virulence of coronaviruses can be classified into five grades, i.e. very high, high, medium, low and very low (Table 1). For example, human MERS (Middle East respiratory syndrome) coronavirus has very high virulence, because both its 5′-UTR and TSS-to-end regions are highly stable. High stability of 5′-UTR means that host ribosomes can be recruited to translate viral RNA at high rate. And, high stability of ORFs means that more energy is consumed to disrupt stem-loops in viral RNA during translation. Thus, normal translation of host cell mRNAs is greatly affected, suggesting that MERS coronavirus is highly virulent. SARS (severe acute respiratory syndrome) coronavirus has high virulence, because its 5′-UTRs is less stable than MERS coronavirus. SARS-CoV-2 has medium virulence, because it has medium stability in both 5′-UTR and TSS-to-end regions. This classification is consistent with estimations on case fatality ratio of MERS, SARS and COVID-19, which is 35%, 9% and 2.4% respectively 14 and with our observations on odds ratio of CG in their ORFS, which is 0.56, 0.44 and 0.39 respectively (Table S1). Moreover, compared to SARS coronavirus, SARS-CoV-2 could infect and replicate more efficiently in human lung tissues but induce expression of less inflammatory cytokines/chemokines and mediators 15 . In our opinion, it is the lower C/G content in genomic RNA that allows SARS-CoV-2 to reproduce higher number of virus particles before triggering the immunoreaction of host cells, because less energy is consumed in replicating each virus particle.
Two other human coronaviruses have medium virulence as well. Among them, NL63 has medium stability in both 5′-UTR and TSS-to-end regions, whereas 229E has low stability in 5′-UTR but high stability in TSS-toend region. Another human coronavirus (i.e. HKU1) has very low virulence, because it has low stability in both 5′-UTR and TSS-to-end regions ( Table 1). The worldwide transmission of SARS-CoV-2 probably means that a coronavirus with medium virulence is more likely to spread rapidly. In comparison, a coronavirus with high or very high virulence could kill its host before causing severe epidemic, whereas a coronavirus with low or very low virulence is not able to replicate itself efficiently for further transmission.

Discussion
Our present study provides a novel insight into the evolution of human SARS-CoV-2. It is evident that this virus has evolved to reduce CG intensely in its ORFs. Such reduction is achieved mainly through mutating G of (CG) 23 and C of (CG) 31 into A or T (Fig. 1). Meanwhile, C or G not of CG may also be mutated. For example, TCA in Scientific Reports | (2020) 10:12331 | https://doi.org/10.1038/s41598-020-69342-y www.nature.com/scientificreports/   Table S1, which are shown in blue background. Those of cellular organisms are from our previous work 15 . Filled triangle or filled inverter triangle indicates that odds ratio of a dinucleotide in coronavirus is significantly higher or lower than that in cellular organisms at p = 0.05 level. Open triangle or open inverter triangle indicates that odds ratio of a dinucleotide in coronavirus is insignificantly higher or lower than that in cellular organisms.  www.nature.com/scientificreports/ SARS-CoV-2 of S-type has been mutated into TTA 16 . GTC and GGT in SARS-CoV-2 isolated from France have been mutated into TTC and GTT respectively 17 . Although the mutated C or G is not of CG and not at codon position 3, they do reduce C or G in viral RNA. C/G reduction is favourable for increasing efficiency of viral RNA translation, because stem-loops formed by less C/G-containing segments can be disrupted more easily. In fact, genomic RNA stability is closely related to nucleotide composition in coronaviruses (Fig. 5). First, RNA stability is positively correlated to content of C, G and C + G but negatively correlated to content of T, A and T + A (Fig. 5a). Second, RNA stability is also positively correlated to content of GC, GG, CG and CC but negatively correlated to content of AT, AA, TA and TT (Fig. 5b). Third, RNA stability is only positively correlated with odds ratio of dinucleotide GC and CG (Fig. 5c). As odds ratio measures the relative abundance of a specific dinucleotide, the extremely significant correlation between CG odds ratio and RNA stability strongly suggests that CG has been selected as the major target for mutation in coronaviruses. Then, if reducing hydrogen bonds is the goal of base mutation, why is CG but not GC, GG or CC taken as the target for mutation? An examination on number of silent mutations of each dinucleotide at various codon positions reveals that CG has the highest number (47) of silent mutations among these four dinucleotides (Table 2  and Table S3). This explains why CG is the best target for mutation. Although CT has the same highest number like CG, it is not taken as the target for mutation because a T-to-C or T-to-G mutation would increase number of hydrogen bonds between potential base pairs, which is contradictory to the goal of mutation.
It seems that the strategy of "reducing CG content to increase gene expression efficiency" has also been adopted by cellular organisms. As we have observed, CG in both ORFs and inter-genic regions of bacteria, archaea, fungi, plants and animals has an average odds ratio of 0.81, and that in introns of fungi, plants and Table 1. Stability of secondary structure formed by genome of coronavirus. *Fee energy of 5′-UTR (untranslated region) was obtained by using 200 nucleotides immediately upstream of TSS (translation start site) for secondary structure prediction. Free energy of TSS-to-end region is normalized using the average genome size (28,085 nt) of all surveyed coronaviruses based on actual accumulated free energy of a specific genome (Table S2). 5′-UTR region of human MERS-CoV and TSS-to-end region of bat CoV HKU9 have the lowest free energy respectively, which are thus given the highest stability index (100). H (high), M (medium) and L (low) indicate stability index of ≥ 90, 80 to 89, and < 79, respectively. Virulence grade is based on stability of both 5′-UTR and TSS-to-end regions, in which H, M and L stability is given 3, 2 and 1 points respectively. For example, human SARS-CoV has M and H stability in 5′-UTR and TSS-to-end regions. Thus, its virulence is of grade 5 (i.e. 2 + 3). Various grades of virulence are interpreted as follows: 6-very high, 5-high, 4medium, 3-low and 2-very low. MERS: Middle East respiratory syndrome. SARS: severe acute respiratory syndrome. PEDV: Porcine epidemic diarrhea virus. The viruses listed in the table were selected to represent different subgenera of coronaviruses.  18 , we did not know why CG has such a low odds ratio in surveyed organisms. Now, after analysing cases in coronaviruses, we realize that low CG content in cellular organisms should also be the evolutionary consequence of increasing gene expression efficiency, because lowered CG content means reduced number of hydrogen bonds between DNA double strands (of the same length). Expression of a gene with low CG content saves energy not only in separating DNA double strands during transcription but also in disrupting stem-loops formed by mRNA during translation. Coincidently, CG is the very dinucleotide related to existence of CpG island, mutational hotspot, and single nucleotide polymorphism (SNP) in DNA sequences of cellular organisms. A CpG island is defined as a region of DNA with less methylated C, and this region generally contains actively expressed genes [19][20][21] . A mutational hotspot is defined as CG with methylated C, in which the methylated C is frequently mutated into T through deamination [22][23][24] . SNP refers to single nucleotide difference in genome sequences among individual organisms, which is observed most frequently at CG dinucleotide 25,26 . The relationship between CG reduction and these three important features of cellular DNA sequences is worthy of further investigations.

Methods
Genome sequences of coronaviruses were retrieved from GenBank (www. ncbi. nlm. nih. gov). Odds ratios of dinucleotides were calculated using formulae developed by Karlin and Mrázek 10 and by Wang et al. 18 with selfcompiled computer programs (C++ scripts are available upon request). Secondary structure and free energy of viral RNA is predicted using RNAstructure (version 5.7) 27 . SPSS software (version 17.0) was used to conduct independent-sample t-test for comparing difference in odds ratio of nucleotide between coronaviruses and cellular organisms, and to conduct correlation analysis between RNA stability and nucleotide composition in viral genomes. Only TSS-to-end region of viral genome is included for analysis (TSS: translation-start-site). * and **above data bar indicate that the correlation reaches significant (0.01 < p < 0.05) and extremely significant (p < 0.01) level, respectively. Detailed data for correlation analysis are listed in rows 67 to 103 of Table S2. Table 2. Number of silent mutations of each dinucleotide at various codon positions. When a dinucleotide is located at codon positions 1 and 2 or at codon positions 2 and 3, there are four codons that contain this dinucleotide. Theoretically, they can be mutated into any of the rest 60 codons. When a dinucleotide is located at codon positions 3 and 1, only the nucleotide at position 3 is considered to mutate. There are 16 codons containing this nucleotide. Theoretically, they can be mutated into any of the rest 48 codons. Therefore, values in the table are number of silent mutations out of 60, 60 and 48 mutations for a dinucleotide at codon positions 1 and 2, 2 and 3, or 3 and 1, respectively.