Comparative phylogenetic analyses of recombinant noroviruses based on different protein-encoding regions show the recombination-associated evolution pattern

Noroviruses are the major cause of acute gastroenteritis worldwide, and recombination is recognized as the important mechanism for its continuous emergence. In this study, for the common GII.P12 and GII.3 recombinants, phylogenetic relationships based on different proteins in three ORFs were comparatively analyzed, focusing on the influence of intergenic recombination. By using newly designed primers, genomes of two GII.P12/GII.3 Guangzhou recombinants were firstly amplified. Combined with other reported sequences of GII.P12_ORF1 (n = 20), GII.3_ORF2 (n = 131), GII.3_ORF3 (n = 36), all GII.P12 and GII.3 strains could be divided into 6, 8, and 7 clusters based on different ORFs, which showed an obvious recombination-associated and temporally sequential evolution pattern (with the exception of GII.P12/GII.13 recombinants). Based on multiple alignments, 126 informative sites were identified in three ORFs (44, 54, and 28), and four proteins (p48, p22, VP1, and VP2) were found under positive selection. Furthermore, by using homology modeling, predicted epitopes were mapped on the P proteins of seven GII.3 representative strains, without one (Epi: 353–361) specific to the GII.4 VA387 strain. In summary, via the genome analyses, phylogenetic relationships of GII.P12 and GII.3 recombinants based on the different proteins presented a special temporally sequential evolution process associated with their recombinant types.

motifs that are responsible for the host-binding function and the viral antigenicity 5 . ORF3 encodes a minor structural protein VP2, and it may contribute to the expression and stability of the viral capsid 6,7 . Different viral proteins do not show consistent homology, and four proteins, including p48, p22, VP1 (P2), and VP2, exhibit the highest degree of sequence variability in the genome 8 .
NoV has a great genetic diversity. It could be classified into six genogroups (GI to GVI) based on the full-length sequences of capsid protein VP1, and a tentative GVII has been proposed recently 9,10 . GI, GII and GIV could infect human. Each genogroup can be subdivided into different genotypes, including at least 9 genotypes in GI and 23 genotypes in GII. GII.4 is the most studied one compared with others, since it is the most prevalent human NoV variant detected and most frequently associated with epidemics (>80%) worldwide during the last two decades 11,12 . Every two or three years, new GII.4 epidemic variants would appear and cause new infection peaks globally. Some other NoV genotypes could also be classified into different variants during their constant evolutionary process. For example, GII.6 and GII.17 variants could be divided into three GII.6 clusters (GII.6-a to GII.6-c) and four GII.17 clusters (GII.17-a to GII.17-d), respectively [13][14][15] . The continuous evolution of the non-epidemic NoV variants is also an interesting issue for their potential pandemics 16 .
Antigenic drift and recombination are regarded as the main mechanisms for NoV evolution, and it could cause NoVs immune evasion and the long-term persistence in human populations 5,17 . In addition to accumulating point mutations associated with error-prone RNA replication, NoVs undergo frequent recombination and generate variants with different genotypes in ORF1 and ORF2. The templates of the RNA polymerase may be switched during the co-infection with different NoV strains, which could generate new viral genomes with both parental sequences 17,18 . With the exchange of ORF1 and ORF2/ORF3 genes between different NoV lineages, recombination contributes to the genetic diversity and the emergence of new epidemic variants. Kinds of recombinant NoV variants were often detected during the epidemiological surveillance in most countries, including ORF1-based GII.P12 and ORF2-based GII.3, which are prone to recombination 19,20 . Therefore, it is still essential to monitor the prevalence and emergence of different recombinant variants for public health.
NoV recombination variants were also often detected in most cities in China. Recently, a GII.P12/GII.3 recombinant was once identified as a predominant genotype that compared to the GII.4 epidemic variants in Chongqing, Western China 21 . The same recombinant was also detected in South China in two consecutive cold seasons (2013-2014) 22 . The rising circulation of the NoV recombinant variants illustrated their significance in the viral evolution and epidemic in China. Therefore, it is necessary to collect more comprehensive information on the persistence and characteristics of recombinant variants. In this study, focusing on the characteristics of intergenic recombination, genomes of two NoV recombinant strains in South China were sequenced and comprehensively analyzed with other emerged GII.P12 and GII.3 strains based on the different protein-encoding regions.
To verify their recombination and identify the location of the putative recombination breakpoint, SimPlot and RDP analyses were performed with the reference strains GII.12/Gifu-96/JP (AB045603) and GII.3/ Amsterdam-1994/NL (KJ194504). Two Guangzhou strains were both confirmed to be GII.P12/GII.3 recombinants with predicted recombination points corresponding to nt position 5068 for GZ2013-L20 and 5075 for GZ2014-L304, slightly upstream of the ORF1/ORF2 overlap region (Fig. 1). In addition, the genotypes of two Guangzhou strains were also confirmed by using the NoV online genotyping tool.
Comparative analyses of Guangzhou strains with reference strains of the same genotype based on different viral proteins. Complete GII.P12 ORF1 sequences (n = 18), GII.3 ORF2 sequences (n = 129), and GII.3 ORF3 sequences (n = 34) were extracted from GenBank database. The positions of six non-structural protein-encoding regions in ORF1 and subdomains in ORF2 were verified by comparison with representative sequences (GenBank AB045603 for GII.P12 ORF1 and KJ194504 for GII.3 ORF2). An attempt was made to clarify the RdRp genotype for all collected ORF2-based GII.3 strains, but there were still 27/129 strains with no related information.
Based on comparison with above reference sequences, a total of 57 and 2 nucleotide substitutions were introduced in the ORF1 and ORF3 regions by the Guangzhou stains, which resulted in 13 and 2 amino acid changes, respectively. Besides, a deletion mutation was also found in the p22 region of GZ2013-L20 (Supplementary Table  S2). However, no nucleotide substitutions or amino acid changes were found in ORF2 of Guangzhou strains.
To understand the variability differences of viral proteins, pairwise identities of six non-structural proteins (p48, NTPase, p22, VPg, 3CL, and RdRp) in GII.P12 ORF1, different domains in GII.3 ORF2, and GII.3 ORF3 were calculated, respectively. Different viral proteins or domains do not show consistent homology. In ORF1, the percentages of variable amino acid sites in six non-structural proteins (p48, NTPase, p22, VPg, 3CL, and RdRp) in ORF1 were identified as 4.55%, 1.91%, 10.61%, 2.26%, 2.21% and 1.18%, respectively; p22 was the most variable protein at the nucleotide (92.44%) and amino acid (94.47%) levels. In ORF2, the P domain was more variable than the S domain, and the P2 subdomain exhibited the highest degree of sequence variability (88.88% at nucleotide level, 91.06% at amino acid level); 59 variable amino acid sites were identified in the capsid protein VP1, and most (40/59) were located in P2 subdomain. No deletion/insertion mutations occurred during the evolution of GII.3 NoVs. Most variable sites were introduced by the Sweden strains (n = 31), which were detected in patients with chronic infection. For ORF3, 36 variable amino acid sites were found by multiple alignments, and the pairwise identity of the nucleotide sequences was almost equal to that of the amino acid sequences (92.35% at the nucleotide level versus 92.78% at the amino acid level), which meant a high ratio of the non-synonymous rate to the synonymous rate (Supplementary Table S3).

Phylogenetic analyses of recombinant strains based on different protein-encoding regions and subregions.
Phylogenetic analyses based on the nucleotide sequences of six non-structural proteins (p48, NTPase, p22, VPg, 3CL, RdRp) in ORF1, VP1 and its subregions (S, P, and P2), and VP2 were conducted. One GII.1 NoV strain GII.1|HCU07611|Hawaii virus was used as the root of the phylogenetic tree, and the most appropriate models were selected using the general time-reversible (GTR) substitution models by jmodeltest 2.
In ORF1, based on the nucleotide sequences of six non-structural proteins, all ORF1-based GII.P12 strains (n = 20) strains could be almost subdivided into five clusters, which were associated with their recombination types and switches between ORF2-genotypes, including GII.4-2003, GII.3, GII.10, GII.12 and GII.13 (Fig. 2). However, due to the different homologies of six non-structural proteins, the topologies of six phylogenetic trees were not completely consistent, especially for the conserved VPg protein. With the exception of two strains of the ORF2 GII.13 cluster, the remaining 16 strains of other four clusters exhibited clearly epochal characteristics that were clustered in a temporally sequential pattern. Six GII.P12/GII.12 strains could be divided into subclusters, and two GII.P12/GII.10 strains showed high homology with the ORF2_GII.12 subcluster, which was consisted of AB044366|1999|JP|hiroshima, KJ196282|2001|JP|Saitama/T15, and KJ196299|2001|JP|Saitama/T18.

Identification of informative amino acid sites and positive selection analyses.
To understand the genetic diversity and key mutations during the evolutionary process of different GII.P12 and GII.3 recombinant variants, multiple amino acid sequence alignments of six non-structural proteins and two capsid proteins (VP1 and VP2) were performed. For six GII.P12 ORF1clusters, a total of 44 concrete differences were identified. P48 and p22 were two regions that had the most informative sites (13 and 15); rather, only one informative site was located in the VPg-encoding region (Fig. 5a). For GII.3 capsid protein VP1, a total of 54 informative sites were found, and most were located in the variable P2 subdomain (n = 38) (Fig. 5b). For GII.3 minor capsid protein VP2, there was also 28 informative sites identified (Fig. 5c).
Positive selection analyses were also performed for six non-structural proteins and two capsid proteins (VP1 and VP2) using three different methods SLAC, FEL and REL (Table 1). In ORF1, only p48 and p22 were the proteins which have the sites under positive selection; in ORF2, four residues (385, 406, 404, and 389) on viral capsid of GII.3 NoVs were identified under the positive selection, which were all located in the P2 domain; in ORF3, Scientific RepoRts | 7: 4976 | DOI:10.1038/s41598-017-01640-4 three residues (140, 226, and 27) were also identified. All above amino acid sites predicted to be under positive selection were also informative sites during the substitution of different recombinant clusters.
Homology modeling and structure alignment of GII.3 capsid P proteins. Representative capsid protein sequences of nine GII.3 strains (including one Sweden strain) were then selected and compared the sequential and structural differences with the epidemic GII.4 strain VA387. Two insertions (Q300 to Y312, R296 of GZ2013-L10) and three deletions (G392 to N394, E340, N373 of VA387) were found, which all occurred in the P2 subdomain. By using homology modeling, the structural differences of GII.3 strains and VA387 mainly embodied in 5 motifs, which were caused by the former insertions and deletions. The insertion resulted in one expanding loop in the P2 subdomain, whereas the deletion between 392 and 394 truncated the original loop of VA387 (Fig. 6a). The predicted structures were also submitted to SEPPA to calculate possible spatial epitopes (Fig. 6b). In general, no obvious differences in potential conformational epitope regions between the different GII.3 variants and GII.4 NoV were observed, positions of potential conformational epitope regions on sequences are similar or adjacent to each other for GII.4 and different GII.3 strains, but one predicted epitope region (Epi: 353-361) was also found specific to the GII.4 genotype 14,15 .

Discussion
NoVs are recognized as the common cause of outbreaks and sporadic cases of nonbacterial acute gastroenteritis worldwide, and recombinant strains are often detected during their epidemic surveillances. However, few studies on NoV recombination based on the full-length viral genome were conducted, and only a few completed recombinant NoV genome sequences are available all over the world. During our ongoing local NoV epidemic surveillance in Guangzhou, China, the recombinant GII.P12/GII.3 was detected in two consecutive winter seasons. In this study, the genomes of new Guangzhou strains were sequenced, and the evolutionary processes of two common recombinants ORF1_GII.P12 and ORF2_GII.3 were both performed based on different viral protein-encoding regions and subregions.
As RNA viruses, the rich genetic diversity is an important factor for their ubiquity, adaptability and uncontrollability. In addition to the fallibility of RNA polymerase, NoV recombination was also reported promoting the   mutation rate of viral capsid protein and/or change the viral immunogenicity 23,24 . Due to the lack of a suitable in vitro validation system, genetic information of different NoV strains is necessary to identify their genetic diversity and evolution characteristics. In our previous studies, a "4 + 1 + 1" amplification strategy was used to obtain GII.4 and GII.17 genome sequences, which could be completed in one working day 25,26 . The next-generation sequencing was employed for assembling of NoV genome recently 8,27 . Nevertheless, it was time consuming and high cost when compared to our method. Therefore, a set of primers was designed for amplifying six overlapped fragments of GII.P12/GII.3 recombinant genomes based on the same strategy, and RNA genomes of two Guangzhou strains were successfully sequenced. The accumulation of this recombinant genome information is necessary to understand the intact phylogenetic relationships based on different viral proteins.
The basis for epidemiological fitness of different NoV variants, as determined by the incidence and ability to cause pandemics, is partially dependent on their evolution capability. Recombination is a major driving force of viral evolution that allows a substantial exchange of genetic material. GII.P12 and GII.3 are two genotypes that    The GII.3 capsid protein VP1 sequences from the immunocompromised patient in Sweden were included in this study 30 . Because the mutations emerging during the chronic infection were the major source of the new epidemic strains [31][32][33] , and it was reported that novel alterations in the virus capsid are not required to infect naive pediatric hosts. Therefore, comparison between the Sweden strains and other epidemic strains could point out the informative sites for GII.3 evolution. The results showed that there existed 13 new amino acid substitutions special to the Sweden strains, 9 of which were conserved during the evolution of other GII.3 clusters. Besides, the usage of amino acid on the other sites was also different from that of other GII.3 clusters but could be employed in some strains (Y311, H312, and G355). The verification of these amino acid substitutions in the mutants could improve the understanding of its evolution requirements.
Throughout the evolution of the GII.P12 ORF1, at least four recombination events led to the acquisition of ORF2 genes of differing genotypes (Capsid protein switching). Thus, the GII.P12 ORF1 has evolved under the influence of several different capsid proteins. All GII.P12 recombinant strains appeared to be restricted to a specific geographic region, including Japan, China, and South Korea, where it was a major cause of pediatric norovirus disease in 2003 in Asia. A series of different sites within the GII.P12 ORF1 exhibited conserved amino acid substitutions which defined individual lineages and/or different capsid groups (Fig. 5a). The fixation of these substitutions within a lineage indicates an evolutionary advantage in variation at these sites. The conserved amino acids constituted a mutational profile for each lineage, defining mutations that allowed divergence into a new lineage. This suggests that the conserved substitutions within each lineage have been important evolutionary changes that led to fitter versions of previous viruses. In addition, variable amino acid sites could also be identified by their different clusters, but no sites showed continuous variation. It has been reported that RdRp switching may have been a mechanism that allowed increased rates of mutation through acquisition of an RdRp with lower fidelity and/or increased replicative ability. This can lead to the generation of large reservoirs of mutants that enable the virus to adapt to changing environments, improving viral fitness under certain selective pressures. However, the function of capsid protein switching for the virus is still unclear, which needs further investigations.
For most viruses, positively selected sites were mainly located in the exposed regions of viral capsid presumably as a result of selective pressure from neutralizing antibodies. Four sites under positive selection (sites 304, 385, 389, and 406) had been firstly identified in GII.3 capsid protein sequences spanning the years 1975 to 2006, thus they are likely to be major sites of selection at the population level 23 , and another one site 297 was also found when adding new sequences before 2010 24 , which were all located in the P2 domain. A similar result was obtained in this study. Four sites of 406, 385, 389, 404 were identified by three methods, and another seven sites of 304, 297, 547, 510, 6, 9, 392, 372 were also identified by SLAC (P < 0.25). However, these sites did not belong to the informative sites during GII.3 capsid evolution. The relationships of the location of sites under positive selection between GII.3 and other NoVs indicated that the evolution patterns of these genotypes might be different and should be investigated in further studies. Figure 6. Homology modeling and prediction of potential spatial epitopes of the GII.3 capsid P domain. (a) Homology modeling of the P domain of GZ2013-L20. The structural superposition of the predicted structure for GZ2013-L20 P domain (purple) and the crystal structure of GII.4 NoV strain VA387 (green) was performed, which are displayed in cartoon mode. The regions with the major structural differences are highlighted in blue and red for GZ2013-L20 and VA387, respectively. (b) Sequence differences between the P domains of representative GII.3 strains and the GII.4 VA387 strain. A black dot indicates the same usage of amino acid as that of the GII.4 strain, and a dash indicates a deletion-mutant. The potential spatial epitopes are highlighted in yellow, and the positions with different amino acids among GII.3 subclusters are colored in red. The predicted epitope region (Epi: 353-361) that is specific to the GII.4 genotype is highlighted with a bold red line below. The sites above the amino acids of the viral capsid protein are highlighted as in Fig. 5b.
Scientific RepoRts | 7: 4976 | DOI:10.1038/s41598-017-01640-4 VP2 is one of the variable proteins that could promote the expression and stability of virus particles, and co-evolution of VP1 and VP2 of GII.4 variants was also proved previously 34 . All variable sites of VP2 were also in the predicted interaction regions. It could be known that GII.3 VP2 was consistent with their VP1 genotypes, and all VP2 could be subdivided into different clusters based on their recombinant types. However, the effect of VP2 variation on the epidemic or survival is still unclear. If suitable variation of three different ORFs is the driving force for the emergence of new NoV epidemic strains, the roles of different NoV proteins during viral evolution need to be further explored. Moreover, the recombination could also occur at the junction of ORF2 and ORF3, which would be a complement of the NoV recombination mechanism.
In this study, the evolutionary process of the GII.P12/GII.3 recombinants was analyzed using phylogenetics, with genome sequences of two Guangzhou strains, and an obvious temporally sequential evolution process was found which was associated with their recombinant types. Moreover, no obvious difference was observed between the capsid proteins of different GII.3 clusters, but the lack of a GII.4-specific epitope was identified when compared with the GII.4 VA387 strain. The findings of this study could be utilized for further genetic and evolutionary studies of NoV recombinant variants, which might be also important for effective vaccine design.

Methods
Clinical samples. During a continuous NoV surveillance study in Guangzhou, China, clinical specimens were collected and stored at −80 °C for further analyses. Viral RNA was extracted from 10% (w/v) fecal suspension using a QIAamp Viral RNA Mini Kit and then detected by one-step reverse transcription polymerase chain reaction (RT-PCR) 22 . Two NoV strains GZ2013-L10 and GZ2014-L304 were identified as the GII.P12/GII.3 inter-genotype recombinants based on the sequences of regions A and C. The study was approved by the Research Ethics Committee at the Third Affiliated Hospital of Sun Yat-sen University and the Institutional Review Board (IRB) at the Chinese CDC for the protection of human subjects (project approval number 2013(2)-76). Informed patient/guardian consent was obtained from all patients or their parents prior to sample collection. All experiments were performed in accordance with relevant guidelines and regulations.
Primer designing and genome sequencing. The previous reported "4 + 1 + 1" strategy was employed for primer designing. The full-length and nearly full-length GII.P12 ORF1 and GII.3 ORF2/ORF3 sequences were collected from GenBank. Based on multiple alignments of these reference sequences, six amplifying primer pairs and two sequencing primers for GII.P12/GII.3 NoV genomes were designed by PRIMER PREMIER 5.0 software (Premier Biosoft International, Palo Alto, CA, USA) (Supplementary Table S1). In order to rule out co-infection by two strains of different genotypes, the junction of ORF1/ORF2 and flanking RdRp and capsid regions were contained in one amplicon, which was amplified with the primer pair II.12-4 F/II-4R.
Using viral RNA extracted from the fecal supernatant, six overlapping fragments encompassing the viral genome were amplified by one-step RT-PCR and then sequenced directly (more than three times) using an automated sequencer (ABI 3730XL, Applied Biosystems, Foster City, CA, USA). Viral genomes could be constructed by connecting overlapping sections as a single sequence.

Multiple alignments and phylogenetic analyses.
Positions of different protein-encoding regions in new genomes were verified by compared with those in the representative sequences (GenBank AB045603 for GII.P12 ORF1 and KJ194504 for GII.3 ORF2/ORF3), and all protein-encoding regions of Guangzhou strains and other NoV reference strains of the same genotype were comparatively analyzed to investigate their differences using BioEdit (version 7.0.1) and DNASTAR Lasergene MegAlign program (DNASTAR Inc., WI, USA).
Multiple alignments of different protein sequences were performed using ClustalX for Windows (version 1.83) with the default parameters 35 . Model selection was performed using jmodeltest 2 for nucleotide sequences 36 . The best-fit models for phylogenetic analyses were selected based upon the Akaike information criterion (corrected) (AICc). The maximum likelihood (ML) phylogenetic trees were then conducted in PhyML version 3.0 using the appropriate substitution models 37 . The reliability of clustering results was evaluated using a bootstrap test (1000 replications). The resulting ML trees were visualized using MEGA version 6.0 38 . All nucleotide sequences were also genotyped using the NoV automated online genotyping tool (www.rivm.nl/mpf/norovirus/typingtool) 39 . Information amino acid substitutions during the viral evolutionary process were analyzed by comparison within and between different clusters of the same genotype.

Positive selection analyses. Using the HyPhy package implemented in the Data Monkey Web
Server (http://www.datamonkey.org/), signatures of site-specific positive selection are inferred if the ratio of non-synonymous substitutions per non-synonymous substitutions site (dN) over synonymous substitutions per synonymous sites (dS) among all sequences present in different proteins alignments is statistically higher than the value observed under neutrality. The best-fitting nucleotide substitution model was first determined using the automated tool available. Three codon-based maximum likelihood methods were then used for analyses, including fixed effects likelihood (FEL), single-likelihood ancestor counting (SLAC), and random effects likelihood (REL). The SLAC model is based on the reconstruction of ancestral sequences and counts the number of dS and dN changes at each codon position of the phylogeny. FEL estimates ratios of dN to dS changes for each site in an alignment. REL uses a flexible distribution and allows dS and dN to vary across sites independently. The limit of significance level for SLAC was 0.25, FEL was 0.15, and the minimum Bayes factor value was set at 50.

Recombination analyses.
For recombination analyses, full-length genome sequences of two Guangzhou strains were analyzed using the SimPlot program (version 3.5.1) to investigate the break-point position 40 . The SimPlot analysis was conducted with the reference genome sequences (GenBank AB045603 and KJ194504) by using a window size of 200 bp and a step size of 20 bp. The identification of recombinants was also verified by using the Recombination Detection Program (RDP) 41 .
Scientific RepoRts | 7: 4976 | DOI:10.1038/s41598-017-01640-4 Homology modeling. Homology modeling of the viral capsid was performed by generating a Protein Data Bank (PDB) file from the amino acid sequence by using the Swiss Model server of the Swiss Institute of Bioinformatics 42 . Modeling was performed using the published crystal structure of GII.4 NoV strain VA387 (PDB: 2OBT) as a template. Protein structures were visualized and manipulated using PyMOL v1.4.1 (DeLano Scientific, LLC, San Francisco, CA). Possible spatial epitopes were then predicted by submitting modeled 3D structures of capsid proteins to the protein antigen spatial epitope prediction web server (SEPPA) with the default threshold 43 . Reference sequences. A total of 18 GII.P12 ORF1 sequences (from 1996 to 2014), 129 GII.3 ORF2 sequences (from 1975 to 2014), and 34 GII.3 ORF3 sequences (from 1978 to 2014) available in the GenBank database were collected for analyzed in this study. All collected sequences comprised the complete ORF sequences (5100 bp for ORF1, 1647bp for ORF2, and 783 bp for ORF3) and their collection information (sampling dates and sites), which were provided in Table S4-S6. An attempt was made to establish the RdRp genotype for all collected strains used in the GII.3 VP1 and GII.3 VP2 evolutionary analyses. For the strains obtained from GenBank, the RdRp genotype was confirmed using the norovirus genotyping tool as above. A literature search was also conducted to determine the assigned RdRp genotype for strains lacking published RdRp sequence data.