Sequence Analysis of the Fusion Protein Gene of Human Respiratory Syncytial Virus Circulating in China from 2003 to 2014

The human respiratory syncytial virus (HRSV) fusion (F) protein is important for HRSV infection, but few studies have examined the genetic diversity of the F gene from Chinese samples. In this study, a total of 330 HRSV F sequences collected from different regions of China between 2003 and 2014 were analyzed to understand their genetic characteristics. In addition, these sequences were compared with 1150 HRSV F sequences in Genbank from 18 other countries. In phylogenetic analysis, Chinese HRSV F sequences sorted into a number of clusters containing sequences from China as well as other countries. F sequences from different genotypes (as determined based on the G gene sequences) within a HRSV subgroup could be found in the same clusters in phylogenetic trees generated based on F gene sequences. Amino acid analysis showed that HRSV F sequences from China and other countries were highly conserved. Of interest, F protein sequences from all Chinese samples were completely conserved at the palivizumab binding site, thus predicting the susceptibility of these strains to this neutralizing antibody. In conclusion, HRSV F sequences from China between 2003 and 2014, similar to those from other countries, were highly conserved.

after amino acid residues 109 and 136 to generate three polypeptides: F1 (aa 137-574), F2 (aa 1-109) subunits and an intervening 27 amino acid peptide, pep27, (aa 110-136) 16,17 . The mature F protein is a homotrimer of the F1 and F2 subunits, and the F1 subunit is essential for the protein to cause membrane fusion. The F0 precursor contains 5 or 6 predicted N-linked glycosylation sites depending on the HRSV strain. After activation, 2 predicted N-linked glycosylation sites in F2, 1 predicted N-linked glycosylation site in F1 and 2-3 predicted N-linked glycosylation sites in in the pep27 are left 18,19 .
The F protein has been identified as having at least two dominant conformations: the prefusion and postfusion forms 20 . The functional F protein trimer in the virion membrane is in a metastable, prefusion form. This prefusion F protein had a 'lollipop' shape by electron microscopy 21,22 . In the prefusion form of the F1 protein, the fusion peptide at the N terminus of F1 is followed by 4 short α-helices connected by 3 non-helical peptides 5 . The structure of the postfusion F protein revealed a cone-shaped molecule, with a globular head and an extended stalk 21 . Three F2/F1 subunits that make up the trimeric molecule are tightly intertwined, with 3-fold symmetry that runs the length of the molecule. The globular head contains both the F2 and F1 subunits, as well as the cysteine-rich region. The stalk region is almost entirely helical, composed of the 6-helix bundle that is characteristic of the postfusion state of many type I viral fusion proteins 5,21,23 . The F protein is a target of virus-specific cytotoxic T lymphocytes (CTLs). Three related human HLA class I-restricted epitopes, HLA-A*01, HLA-B*57 and HLA-Cw*12, have been identified [24][25][26] , and 4 peptides of HRSVB were found to bind to HLA-A*0201 in HLA-A2 transgenic mouse 27 . In addition, the F protein is a target of neutralizing antibody and vaccine development due to its high sequence conservation. To date, 6 antigenic sites have been identified in F protein: Ø, I, II, IV, V, and VI. Antigenic sites I, II and IV are present in the postfusion F conformations 23,28 , whereas antigenic site Ø is found only in the prefusion F conformation 21 . Antigen site Ø is at the apex of the prefusion trimer composed of an α-helix from F1 and a strand from F2; the other 5 antigenic sites are all within the F1 subunit 5,21 . Antigenic site I is located in the middle of cysteine rich region 29 , and antigenic sites IV, V, and VI overlap with the C terminus of the cysteine rich region 30 . Antigenic site II is the binding site of the neutralizing antibodies palivizumab and motavizumab, especially the domain spanning aa 262-275 [31][32][33] . It has been shown that mutations at some of the residues in this domain resulted in resistance to palivizumab and/or motavizumab 31,34,35 .
The molecular epidemiology of HRSV in China has been studied quite extensively based on the G gene 12,13 , but information regarding the genetic diversity and sequence characteristics of the F gene from Chinese HRSV samples is limited. In the present study, the phylogenetic relationship and the sequence diversity of the full-length coding DNA sequences (CDS) of the F genes from HRSV samples collected from different regions in China were compared to those from other countries.  Table S1). These samples were selected on the basis of their representation of the circulating strains in China [based on HRSV subgroup, genotype, genetic diversity (different clustering within a genotype in phylogenetic analysis using the sequences of the G gene), geographical region, and year of collection] 7,10 . In addition, all Chinese HRSV F gene sequences with full-length CDS that were available in GenBank as of October, 2016 (n = 149) (Supplementary Table S2) were downloaded and analyzed together with the 181 F sequences.

Samples information.
Therefore, a total of 330 Chinese HRSV F gene sequences collected from 2003 to 2014 were analyzed to determine their genetic diversity and sequence characteristics (Table 1 and Fig. 1). Moreover, 1150 HRSV F gene sequences (770 of HRSVA and 380 of HRSVB) with full-length CDS collected from 18 other countries between 1956 and 2014 were downloaded from GenBank for comparison study (Supplementary Table S2). More details of the sequences are available in Supplementary Tables S1,S2 and S3.  A total of 494 full-length CDS of the HRSVB F gene, including 114 sequences from China, sorted to 9 clusters in phylogenetic trees generated by neighbor joining method, B1-B9 (Fig. 3). Chinese HRSVB sequences distributed into clusters B1, B7, and B9 (Fig. 3a The phylogenetic analyses were also conducted using the maximum likelihood method. Phylogenetic trees generated by maximum likelihood method (Supplementary Figures S1 and S2) had the same cluster designations.

Pairwise-distance calculations of the F sequences.
To determine the sequence variability of the F genes from samples collected in China or other countries, the p-distances of the sequences in the nucleotide and deduced amino acid levels were calculated. Overall, nucleotide and amino acid p-distances of the F sequences within the HRSVA or HRSVB subgroup from different countries were much smaller than those between the 2 subgroups within the same country ( Table 2), indicating that the sequences were more similar within a subgroup than between subgroups. The nucleotide and amino acid p-distances of F sequences from China within or between the HRSVA or HRSVB subgroups were similar to those values for other countries. The nucleotide p-distances were usually ≤0.03 or ≤0.02 within the HRSVA or HRSVB subgroup, respectively, whereas the amino acid p-distances were usually ≤0.01 or ≤0.007 within the HRSVA or HRSVB subgroup, respectively. In contrast, the nucleotide and amino acid p-distances between the HRSVA and HRSVB F sequences from each of the source countries were in the range of 0.20-0.211 and 0.093-0.104, respectively. Furthermore, the nucleotide p-distance between each cluster in a HRSVA or HRSVB phylogenetic tree was in the range of 0.02-0.06 or 0.01-0.03, respectively (Supplementary Tables S5a and S5b). Of interest, F sequences of HRSVA appeared to be slightly more variable than those of HRSVB at both the nucleotide and amino acid levels ( Table 2). When the amino acid sequences of the F protein from a country were compared to those from another country, there was usually over 98% or 99% identity for HRSVA or HRSVB sequences, respectively (Supplementary Table S5c and d). HRSV F sequences from China and other countries were highly conserved at both the nucleotide and amino acid levels.   In this study, no substitutions were detected at aa 106, 108-118 and 551-559 in HRSVA sequences from China. Two Chinese sequences had a substitution at aa 107 (A107T) (Supplementary   Overall, CTL-specific epitopes in HRSV sequences from all countries, including China, were well conserved. The substitutions, if any, in the sequence of the epitopes appeared to vary by positions depending on the countries of origin of the samples, but some substitutions (e.g. N276S in HRSVA) were found across different countries.

Sequence analysis of the N/O-glycosylation sites of Chinese HRSV F sequences. N-or
O-glycosylation can modify the biological activity of a protein. In this study, 6 N-glycosylation sites (aa 27, 70, 116, 120, 126 and 500) were predicted in the F protein of the HRSV reference strains of Long and CH18537 (Table 3). All Chinese F protein sequences had the 6 predicted N-glycosylation sites, with the exception of a single HRSVA sequence from Guangdong which had a mutation at aa 122 (T122P) leading to the change of the amino acid sequence from NNTK to NNPK and the loss of the predicted N-glycosylation site at aa position 120. Selection pressure site prediction. The selection pressure on Chinese HRSV was estimated using the dN/ dS ratio. The mean dN/dS ratios for Chinese HRSVA and HRSVB were 0.092 and 0.098, respectively, while the mean dN/dS ratios for HRSVA and HRSVB from other countries were 0.106 and 0.120, respectively (Table 4). In this analysis, dN/dS ratios > 1 were considered as evidence of positive selection. More negative selection sites than positive selection sites were predicted in HRSV sequences from China and other countries. No positive selective

Discussion
Little is known about the sequence characteristics and genetic diversity of Chinese HRSV F gene. In this study, 330 Chinese HRSV F gene sequences from different regions in China were analyzed together with 1150 HRSV F gene sequences from 18 other countries by phylogenetic analysis and other sequence analyses. The present study provides important information regarding the genetic diversity and sequence characteristics of Chinese HRSV F sequences from 2003-2014. Results presented in this study provide information for the development of vaccines, neutralizing antibodies and other therapies for HRSV infection. The sequence of the HRSV G gene is highly variable. Due to the genetic diversity in the G gene, sequence encoding the second hypervariable region in the C-terminal of the G protein has been used for genotyping of HRSV. In contrast, the sequence of the HRSV F gene is highly conserved. Most of the molecular epidemiology studies of HRSV were conducted using the sequences of the G gene. However, there is increased interest in the molecular epidemiology of the F protein in recent years as F protein is a target of neutralizing antibodies and vaccine development. In this study, phylogenetic analysis of ~1500 HRSV F sequences from samples collected from the extended period of 1956-2014 worldwide, including 330 sequences from China from 2003-2014, identified 11 and 9 clusters with the sequences of the HRSVA and HRSVB subgroups, respectively. Of interest, F sequences from different genotypes (as determined based on the G gene sequences) within a HRSV subgroup could be found in the same clusters in phylogenetic trees generated based on F gene sequences, e.g. cluster A11 of HRSVA contained sequences from HRSVA genotypes NA1, NA3 and ON1.
Our p-distance calculations showed that there was a high level of sequence identity between the F sequences from China and other countries. In addition, the variability of the F sequences from HRSVA was slightly higher than that from HRSVB samples collected from China as well as many other countries, which was consistent with observations reported previously 26 .
All Chinese HRSVA sequences were highly conserved at antigenic site Ø (aa 62-69 and 196-210), but 3 Chinese HRSVB sequences classified as the SAB4 genotype from Beijing had the K65Q substitution and sorted to a separate group in the B7 cluster of the phylogenetic tree, suggesting that K65Q could be a genotype specific amino acid change. Other HRSVB substitutions such as R202Q and Q209K/R were also detected at antigenic site Ø. Antigenic site I is located in the middle cluster of the cysteine-rich region of the F1 chain. The F1 subunit is essential for membrane fusion 19 . Antibodies binding to antigenic site I have marginal effect in virus neutralization. Similar to antigenic site I, antigenic sites IV, V, and VI are also located near the C-terminal end of the cysteine-rich region and not far from the heptad repeat adjacent to the membrane of the F1 chain. During the fusion process, neutralizing antibodies binding to both sides of the cysteine cluster would inhibit conformational changes of the F1 chain 30 . Previous studies have shown that substitutions at aa 262, 268, 272 and 275 of the F protein conferred resistance to palivizumab in vitro or in vivo 34,35 . Only 1 Chinese HRSVB sequence has the substitution of S259A at antigenic site II. The impact of the S259A substitution on viral pathogenesis remains to be determined.
The CTL epitopes were well conserved in HRSV sequences from China and other countries. However, N276S substitution was commonly detected in the CTL epitopes of HRSVA F sequences from China as well as many other countries. In our study, 84.3% (182/216) Chinese HRSVA sequences carried the N276S substitution. The percentage of Chinese HRSVA strains with the N276S substitution was found to be increasing after 2014 36,37 . Although aa N276 is next to the palivizumab binding site (aa 262-275), the N276S substitution has been proved to be a natural polymorphism and does not impact the susceptibility of HRSVA to palivizumab 35,37 .
N-glycans on viral glycoproteins are important structural components that could affect the folding, transport, activity, stability and immunological properties of the viral glycoproteins 19 . It has been shown that the loss of the N-glycan in HIV gp120 could affect the sensitivity of HIV to neutralizing antibodies and modulate the structure, stability or accessibility of viral epitopes in the CD4 binding site and coreceptor binding region of HIV 38 . There are generally 5 or 6 potential N-linked glycosylation sites in HRSV F protein depending on the viral strains 5 . In   Table 4. Predicted selection pressure sites in HRSV F sequences. SLAC = single likelihood ancestor counting method, FEL = fixed effects likelihood method, IFEL = internal fixed effects likelihood method.
this study, only 1 out of 330 Chinese HRSV F sequences was found to have lost one of the N-glycosylation sites (aa 120), which is located in pep27 (aa 110-136); previous study has shown that pep27 has no impact for virus maturation and infectivity 39 . Substitutions at aa N500 resulted in pronounced reduction of the fusion activity of F protein by about 90~100%, indicating that the single N-glycan of the F1 subunit at aa N500 was essential for efficient syncytium formation. N-glycosylation of the F2 subunit at aa N27 and N70 have been shown to have minor impact on the fusion activity of the F protein 19 . In this study, a number of O-glycosylation sites were predicted in Chinese HRSV F protein sequences. This predication needs to be investigated further, as there has been no report of O-linked glycan on a HRSV F protein 19 . Previous study showed that the dN/dS ratios for the hemagglutinin genes of wild-type measles virus H1 genotype were <1 and demonstrated the hemagglutinin amino acid substitutions were the result of random genetic drift rather than accumulated mutations 40 . In this study, we found that the mean dN/dS ratios of the F genes for both HRSVA and HRSVB strains in China and other countries had a value < 1, suggesting that the amino acid substitutions in F protein of HRSV were also the results from random genetic drift. In addition, it was reported that most HRSV genes, with the exception of the G gene, have negative selection or neutrally evolving sites 41 . Consistent with this finding, we identified only 3 positive selection sites for HRSVA (aa 152, 384 and 574) and 4 positive selection sites for HRSVB (aa 15, 125, 201 and 573) in ~1500 HRSV F sequences collected worldwide. Positive selection sites at aa 15 and aa 125 are located in the F2 subunit and pep27, respectively, whereas the other 5 positive selection sites are all located in the F1 subunit. Amino acid 384 is located in antigenic site I, and we found that the substitutions V384I/T were present in 96% HRSVA sequences collected all over the world. Amino acid 125 is located in pep27. Recently, it has been shown that the binding to a pep27-containing peptide (F 101-121) was higher with sera from HRSV-infected infants with neutralizing antibodies than infants without neutralizing antibodies, indicating the presence of novel antigenic site(s) in the unprocessed F0 conformation 42 . Consistent with the results from previous studies 20, 36 , substitutions L125S/P were found in 8% HRSVB sequences in our study. The potential effect of substitutions at these positive selection sites on the pathogenesis of HRSV remains to be determined.
There are currently no specific antiviral treatments or vaccines approved for HRSV infection. Palivizumab is a humanized monoclonal antibody indicated for the prevention of serious lower respiratory tract disease caused by HRSV infection in high-risk infants in the USA and other countries. Palivizumab binds to the highly conserved antigenic site II of the HRSV F protein, and is cross-reactive with the F protein of HRSVA and HRSVB due to the sequence conservation at its binding site 32,43 . Only a small number of substitutions associated with resistance to palivizumab have been identified in vitro or in vivo; these substitutions were at aa 262, 268, 272 and 275 of the F protein 34,35 . In this study, we found that the palivizumab binding site in the F protein sequences from all of the 330 Chinese HRSV samples was 100% conserved, and 100% identical to the corresponding site found in the F protein sequences of prototypic HRSVA and HRSVB strains, both which are susceptible to the neutralization by palivizumab. No mutations were found at the amino acid positions which are known to confer resistance to palivizumab. Taken together, these results predict that the endemic Chinese HRSV strains would be susceptible to the neutralizing effect of palivizumab in the clinical setting. The susceptibility of Chinese HRSV strains to neutralization by palivizumab should be confirmed in vitro and in vivo.
There are some limitations in the present study. This study is a retrospective study and the source of available sequences was limited by the number of samples collected previously. Most of the sequences were downloaded from GenBank without the information of DNA sequencing methods, which might have different detection limits. In addition, while sequences of most samples were amplified directly from clinical specimens, some were amplified from samples previously isolated, which might have some impact on the sequencing results. Different countries had different number of sequences and the collection dates of the samples were not successive. Furthermore, samples were from 6 representative regions of China instead of from all over China, and the number of samples from each region was different.
In conclusion, this study investigated the sequence diversity of the HRSV F gene from 330 samples collected in China from 2003-2014. The F sequence was highly conserved in HRSV strains from China as well as those from 18 other countries. The high level of sequence conservation in the HRSV F sequences worldwide supports the F protein as a target for vaccine development and antiviral therapy. The evolution of the F gene of HRSV strains from China and the rest of the world should be monitored continuously for genetic diversity and changes in functional and antigenic properties.

Methods
Ethics statement. This study did not involve human experimentation and only nasopharyngeal precipitates (aspirates) were collected from patients with respiratory infections. Written informed consent for the use of their nasopharyngeal precipitates was obtained from adult patients and parents or guardians of pediatric patients. This study was approved by the second session of the Ethics Review Committee of the National Institute for Viral Disease Control and Prevention (NIVDC) of the Center for Disease Control and Prevention (CDC) in China and the methods were performed in accordance with the approved guidelines.
Sample collection. 181 representative Chinese HRSV samples were selected from over 700 HRSV-positive samples for the determination of the full-length CDS of the F genes based on HRSV subgroup, genotype (designation based on the sequence of the G gene), genetic diversity (different clustering within a genotype in phylogenetic analysis using the sequences of the G gene), geographical region, and year of collection 7 Nucleotide amplification and sequencing. The second hypervariable region of the G gene of each HRSV sample was first sequenced and then analyzed in phylogenetic analysis to determine its genotype as described previously 7,10 . Depending on the designated genotypes and other criteria as described in "Sample collection" above, 181 HRSV samples were selected for F gene sequencing. PCR amplifications of the full-length CDS of the F gene (in 3 separate fragments) were performed using the One Step RT-PCR kit (TaKaRa Biotechnology, Dalian, China). The PCR and sequencing primers are shown in Supplementary Table S8. Reaction mix contained 5 ul RNA, 12.5 ul reaction buffer, 1 ul One Step Enzyme Mix and 0.4 uM of a forward and reverse primer. The amplification was conducted at 50 ° for 30 min for reverse transcription, 94 °C for 2 min for denaturation, and with 40 cycles of 94 °C for 30 sec, 50 °C for 30 sec, 72 °C for 60 sec (105 sec for HRSV F2 fragment) for amplification, followed by a final extension at 72 °C for 10 min. The PCR products were all purified using a QIAquickGel Extraction Kit (Qiagen) and sequenced using an ABI Prism 3710 × 1 DNA Analyzer. The sequences were edited using Sequencher software vision 5.0 (Gene Codes, Ann Arbor, MI, USA).
Phylogenetic, p-distance and amino acid variation analyses. Phylogenetic trees were generated using the software MEGA 5.0 with the neighbor joining method with Kimura 2-parameter model or maximum likelihood method with GTR + G + I model. Evaluation of the reliability of phylogenetic inference was estimated using the bootstrap method with 1000 replicates with a cut-off value of usually >70 44 . In this study, a cluster was defined as a group of sequences within a distinct branch in the phylogenetic tree, and the higher bootstrap, the higher reliability. Sequences were aligned using ClustalW in the MEGA 5.0 software and the pairwise distance (p-distance) of nucleotide and deduced amino acid among sequences were also calculated using MEGA 5.0 45 . The N-and O-glycosylation sites were predicted using the NetNGlyc 1.0 and NetOGlyc 4.0 server 46,47 , respectively. Selection pressure was determined on the Datamonkey website (http://www.datamonkey.org/) by estimating the ratio of non-synonymous (dN) and synonymous (dS) substitution per site based on the SLAC, FEL, and IFEL methods with a significance level of 0.05. The reference strains used in this study were Long (GenBank accession number JX198112) and CH18537 (GenBank accession number JX198143).

Nucleotide sequence accession numbers and HRSV sequences downloaded from
GenBank. All full-length HRSV F gene sequences that were available in GenBank as of October 2016, including 149 sequences from Xinan, Zhongnan, Huabei and Huadong regions of China and 1150 sequences from 18 other countries from Asia, Europe, North America, South America, Australia, and Africa (details shown in Supplementary Tables S2 and S3) were downloaded for analysis together with the 181 full-length F gene sequences that were generated from samples collected in this study.