Coronaviruses (CoVs) are widely found in humans as well as in mammalian and avian species, causing asymptomatic infections or respiratory tract disorders and gastroenteritis of varying severity1. CoVs belong to the genus Coronavirus of family Coronaviridae and are classified into four genera, Alphacoronavirus (α-CoV), Betacoronavirus (ß-CoV), Gammacoronavirus and Deltacoronavirus based on phylogenetics and serology1. To date, α-CoV [including human CoV (HCoV)-229E and NL63] and ß-CoV [including HCoV-OC43, HKU1, Severe Acute Respiratory Syndrome CoV (SARS-CoV) and Middle East Respiratory Syndrome CoV (MERS-CoV)] are known to infect human beings1.

Interspecies transmission is a common phenomenon for CoVs that may be responsible for generation of new CoV epidemics during viral evolution2,3,4,5. For instance, the feline CoV (FCoV) type II, a member of α-CoV, might be generated by a double recombination between FCoV type I and canine CoV4. Interestingly, porcine hemagglutinating encephalomyelitis virus, bovine CoV (BCoV) and HCoV-OC43 of ß-CoV are thought to have evolved from the same common ancestor5. During evolution, high frequencies of homologous RNA recombination and gene mutations are considered the main forces that push CoVs to adapt to specific hosts. Such events can lead to emergence of new strains or genotypes within a certain species and even to new species, causing epidemic or zoonotic outbreaks that continuously threaten human health2,3. This phenomenon is exemplified by the recent emergence of SARS-CoV and MERS-CoV6,7. However, the detailed evolutionary mechanism of interspecies transmission and the persistence of CoVs in specific hosts have yet to be fully elucidated.

CoVs have a positive-sense, single-stranded RNA genome, with a length of ~27–31 Kbs1. The spike (S) protein of CoVs protruding on the surface of virions is the major antigenic protein for inducing neutralizing antibodies. However, it is in turn under the highest selection pressure among the viral proteins1,8. S proteins are often cleaved into S1 and S2 subunits to achieve receptor binding and membrane fusion, respectively1. The S1 subunit is composed of two distinct domains, the N-terminal domain (NTD) and the C-terminal domain (CTD), which play important roles in receptor binding1. The NTD is responsible for sugar receptor binding in some CoVs, such as BCoV and HCoV-OC43, or for protein receptor binding in murine hepatitis virus1,9,10,11. The CTD functions as the protein receptor binding domains (RBD) for most of the CoVs1. The S1 subunits of all CoV genera have similar topological structure, preserved sugar-binding functions, but different receptors-binding functions, which suggest that subtly adaptive mutations occur in functional domains during evolution of CoVs12,13. Similar adaptive amino acids mutations around the receptor-binding region have also been found in norovirus, contributing to its epidemic in humans14. Investigation on the evolutionary insights of the S gene, particularly the functional domains, is imperative for understanding the evolution of CoVs and for tracing spillover events and ecological niches.

HCoV-OC43 is the most prevalent CoV in humans and the relatively abundant number of clinical cases and corresponding epidemiological data make it a good model for HCoV adaption evolution1,5,9,15,16,17. Although five genotypes (A to E) have been identified, genotype D has been the dominant OC43 genotype from 2004 to 201215,17. Previous studies by our group and others have demonstrated that recombination contributes to the generation of new OC43 genotypes15,17, but little is known about how HCoV-OC43 genotypes persist in human populations. It is assumed that the continuous adaption of viral antigenic gene is required for the persistence of OC43 genotypes18. However, this hypothesis has not been carefully examined by precise evolutionary pattern analysis. In the present study, we characterized HCoV-OC43 evolution based on the phylodynamic and phylogenetic analysis of full-length S genes to provide insights into its transmission and the adaption.


Relative effective population sizes of OC43 genotypes

To verify the epidemic history of OC43 genotypes over the study period, the relative effective viral population size over time was inferred by analyzing the genetic diversity of the S gene using the Bayesian skyline model. Only the strains obtained from clinical samples containing the full length S, RdRp and N genes retrieved from PubMed ( and GenBank ( for the years 2003 to 2012 were used (see Supplementary Table S1 on line). A varying population size profile was observed (Fig. 1). The overall plot of OC43 (including genotypes B, C, D and E) showed two large bottlenecks. The first was found between 2006 and 2007, showing a decrease in relative genetic diversity in 2006, followed by an increase in 2007. The second was found between 2008 and 2009, with a transient decrease before 2008, followed by an increase in relative genetic diversity coinciding with a temporal epidemic in population size. Genotype B exhibited a transient increase in 2004, then followed a steady but slowly increased relative genetic diversity, corresponding to global increases in its detection17. Genotype D showed a decreased genetic diversity before 2007, then an increase after 2009, consistent with its dominant epidemic from 2007 to 201217. Genotype C and E both exhibited a short and steady relative genetic diversity during the study period. The history of the relative genetic diversity obtained here corresponds to global epidemic data, indicating that the relative genetic diversity measured by the S gene generally reflects the HCoV-OC43 dynamic in population size. The relative genetic diversity of genotype D corresponded to the two bottlenecks of OC43, indicating that genotype D accounts for much of the genetic diversity of OC43 and is the predominant genotype.

Figure 1
figure 1

Relative genetic diversity dynamics of OC43 and each genotype.

Population size was determined using sequences of 96 OC43 S genes obtained from 2003 to 2012. The median estimates (g) are represented by the black lines and 95% high posterior densities are shown in the color regions.

Evolutionary rate of OC43 genotypes

To explore why genotype D became dominant after 2007, we calculated the evolutionary rate of OC43 genotypes. Using the constant population size under a relaxed-clock method, the mean evolutionary rate of the S gene was estimated to be 8.48 × 10−4 substitutions/site/year for OC43, consistent with previous reports8. For genotypes B, C, D and E, the mean evolutionary rate of S gene was 9.85, 4.85, 8.83 and 6.01 × 10−4 substitutions/site/year, respectively (Table 1). Similar results were obtained using the exponential growth model (Table 1). The highest evolutionary rates were observed in genotype B and D, suggesting that the two genotypes evolved faster than others.

Table 1 Evolution rate of the OC43 S gene in different genotypes.

Natural selection of the S gene

To determine whether positive selection took place during the evolution of OC43, the dN/dS values and positive selection amino acids (aa) were calculated. The mean dN/dS ratio was observed in genotype D (0.31), followed by genotypes B (0.29), C (0.20) and E (0.15) (Table 2). Calculations for positive selection sites with a probability (Pr) of > 0.5 in the S gene identified 25 residues in genotype D, six sites in genotype B, one in genotype C, but none in genotype E (Table 2). Seven positive selection sites with Pr of > 0.9 were identified in genotype D. Further analysis showed that 12 of 16 positive selection sites of genotype D and 3 of 5 genotype B in S1 were located at the NTD (aa 15–312, reference strain 5240/07 KF572844), while four positive selection sites in genotype D and one site in genotype B were found in the predicted RBD (aa 339-549, reference strain 5240/07 KF572844)15.

Table 2 Selection analysis of the S gene of OC43 genotypes.

The relatively long epidemic time and more available sequences of genotype D allowed us to analyze in detail the nonsynonymous changes throughout the evolutionary history of OC43. The ancestral sequences were reconstructed and the occurrence time of each nonsynonymous substitution was estimated in the temporal evolution routes using the maximum clade credibility (MCC) tree. The residues at positions 33, 90, 93, 120, 184, 195 and 521 were positive selection sites, with six (33, 90, 93, 120, 184 and 195) located in NTD and one (521) in the predicted RBD. All but one strain (5445/07) identified after 2007 contained the mutation of Y521H. All the strains identified after 2008 contained the mutation K90L. The mutations L195S and Y521H were found in strains identified between 2008 and 2011, but not in those of 2012. Other mutations, including N33D, T93K, D120H and K184N, were found in several strains. The chronologically traced positive sites might be associated with the ladder-like MCC phylogeny of the virus, indicating the key role of aa substitutions in the population dynamics of the virus (Fig. 2).

Figure 2
figure 2

Molecular clock analysis of OC43 genotype D.

The complete S gene sequences sampled from 2007 to 2012 were used to reconstruct the phylogeny. The calculated positive selection sites in each node are drawn on the MCC tree.

To confirm whether aa substitutions exist in NTD of other genotypes over time, the aa sequences of S from the 12 genotype B and 17 genotype C strains were also aligned. Eleven aa site mutations were observed in genotype B relating to the strain’s isolated over time and seven sites were located in NTD (see Supplementary Fig. S1 online). No aa mutations were found in genotype C over time.


Our analysis on the global epidemic of OC43 in recent years showed the temporal transition of genotypes17. It is striking that the evolution of genotype D of OC43 is epochal among the four epidemic genotypes. First identified in 2004, it was generated by recombination15; after a stasis period in 2005 and 2006, a dominant epidemic was found over a longer timescale. As no new recombinant events were detected in the subsequently identified genotype D strains17, the mutations of the S gene−the major antigenic gene, likely plays an important role in driving viral epidemics18.

It is interesting that the positive selection sites calculated in S gene from different isolates over time are corresponding to the genetic variability of genotypes in this study. Genotypes B and D, which have high genetic variability contained more selections sites than that of C and E genotypes. The possible link between the positive selection sites and the genetic variability of genotypes should be evaluated in the future studies. The evaluation of relative genetic diversity, substitution rate, dN/dS value and positive selection sites further showed that genotype D had a significant influence on the relative genetic diversity of OC43 and that its S gene evolved towards heterogeneity. More positive selection sites were also found in the S1 subunit of genotype D. Amino acid substitutions in the surface proteins are considered an important adaption strategy for the persistence of a virus to evade host immune pressure8,18. Although the neutralizing epitopes have not been identified, the substitutions identified in the S1 subunit of OC43 in this study are predicted to be in the antigenicity region (data not shown), which may allow viruses to escape host neutralizing antibodies19,20,21,22. Collectively, these findings suggest that genetic drift may play an important role in maintaining the spread of genotype D in the human population. Whether the genetic variations affect the related antigenic phenotype will need to be confirmed by antigen analysis using S genes with such mutants.

Most of the positive selection sites in S1 subunit are mapped in NTD. Molecular clock tree of genotype D also showed that 6/7 of the predicted substitution were located in NTD. The aa substitutions in NTD seem to be a common evolutionary strategy for CoVs, as the high variability in NTD has also been observed in BCoV and HCoV-NL6321,22. However, this observation warrants further investigation.

BCoV shares a high nucleotide and antigenic similarity with OC4323. The conserved sugar-binding sites identified in BCoV_NTD were also conserved in that of OC43 (see Supplementary Fig. S2 online), indicating the conservation of the core motif in NTD during the species-cross transmission and evolution in human hosts. It is interesting to point out that the evolutionary dynamic pattern of the conserved core and variable outer-region in NTD is similar to that of RBD observed in ß-CoV, suggesting that these functional domains retain some of the ancient records during viral evolution12.

The sugar moieties near the CoV receptor are considered critical co-factors to CoV infection. Antigenic analysis has predicted that there are some epitopes in this domain. Whether the aa mutations around the functional region of NTD are relevant to subtle remodeling of the binding process or antigenic evolution, like that observed in norovirus and influenza virus need to be investigated further20,24,25.

The RBD of OC43 has been predicted15. We found that the positive selection sites in RBD are less than those present in NTD. Notably, a Y521H substitution was found in the genotype D strains identified from 2008 to 2011. The significance of this mutation is unclear. It has been reported that a single aa substitution in RBD can cause marked antigenic differences and enable the virus to escape host immunity in influenza virus and norovirus24,25. Whether this single aa mutation can influence the host receptor binding activity needs to be investigated further. It is interesting that after 2012, the genotype D strains contained less mutations than those identified before 2012. The impact of these changes on the viral prevalence needs continuous surveillance of the OC43 genotypes in the future.

In summary, we report a model for the persistence of OC43 genotypes in human populations based on the first intensive evolutionary analysis of the S gene. We infer that the genetic drift of the S gene is likely to be one of the mechanisms of the adaptation evolution of HCoV-OC43. These findings provide insights into the evolution of CoVs and may have implications in the surveillance of HCoV infections.


Sequences and phylogenetic analysis

The full length sequences of OC43 S gene available in GenBank ( were retrieved on 30 May 2013 and analyzed together with the sequences identified previously by our group17. A total of 96 full-length S gene sequences obtained from clinical samples were used for analysis. The sequences were aligned using Clustal W program implement in MEGA 5.126. The genotypes of these sequences were determined as reported15,17, including 12 genotype B, 18 genotype C, five genotype E and 61 genotype D. The background information of the sequences including accession numbers, collection dates, isolation areas and genotypes can be found as Supplementary Table S1 online.

Evolutionary analysis

The demographic histories and evolution rates of different OC43 genotypes were determined based on S gene sequence data by the Bayesian Markov Chain Monte Carlo (Bayesian MCMC) method implemented in BEAST (v1.8.1), using a relaxed molecular clock (uncorrelated lognormal-distributed model)27. The best substitution models were selected using Modeltest (version3.7) according to Akaike information criterion (AIC)28. The constant size and exponential growth tree models were used for the inference. Each Bayesian MCMC analysis was run for 100 million states and sampled every 2,000 states. Posterior probabilities were calculated using Tracer (version 1.5). The trees were annotated by the Tree Annotator program implemented in the BEAST package and the MCC tree was visualized using Figtree software (version 1.3.1). Bayesian skyline plots for OC43 genotypes were estimated to depict the relative viral genetic diversity over time.

Positive selection analysis

To infer the positive selection sites of S gene at aa level, the deduced aa sequence entropy was determined using BioEdit (version 7.2.5)29. The ratios of nonsynonymous (dN) to synonymous (dS) substitution were estimated to evaluate the selection pressures on the OC43 S genes, using the codon-based phylogenetic method in CODEML (distributed in PAML, version 4)30. Posterior probabilities of the inferred positive selection sites were calculated using the Bayes empirical Bayes (BEB) approach which accounts for sampling errors31. The chronological evolution of dN changes throughout the evolutionary history of OC43 genotype D was traced using HyPhy software (version 2.2) and dN substitution was estimated using the maximum clade credibility (MCC) tree generated from Bayesian MCMC molecular clock analysis27,32. The aa sites were positioned according to the HCoV-OC43_D 5240/07 (KF572844).

Additional Information

How to cite this article: Ren, L. et al. Genetic drift of human coronavirus OC43 spike gene during adaptive evolution. Sci. Rep. 5, 11451; doi: 10.1038/srep11451 (2015).