Genetic diversity and molecular evolution of human respiratory syncytial virus A and B

Human respiratory syncytial viruses (RSVs) are classified into two major groups (A and B) based on antigenic differences in the G glycoprotein. To investigate circulating characteristics and phylodynamic history of RSV, we analyzed the genetic variability and evolutionary pattern of RSVs from 1977 to 2019 in this study. The results revealed that there was no recombination event of intergroup. Single nucleotide polymorphisms (SNPs) were observed through the genome with the highest occurrence rate in the G gene. Five and six sites in G protein of RSV-A and RSV-B, respectively, were further identified with a strong positive selection. The mean evolutionary rates for RSV-A and -B were estimated to be 1.48 × 10–3 and 1.92 × 10–3 nucleotide substitutions/site/year, respectively. The Bayesian skyline plot showed a constant population size of RSV-A and a sharp expansion of population size of RSV-B since 2005, and an obvious decrease 5 years later, then became stable again. The total population size of RSVs showed a similar tendency to that of RSV-B. Time-scaled phylogeny suggested a temporal specificity of the RSV-genotypes. Monitoring nucleotide changes and analyzing evolution pattern for RSVs could give valuable insights for vaccine and therapy strategies against RSV infection.

Site-specific selection analysis of the RSV G gene. As were observed above, the average ratios of nonsynonymous to synonymous nt substitutions in the G genes of RSV A and B were indicative for positive selection, we further examined the coding sequences of different isolates to detect the codons that provide stronger evidence of purifying or diversifying selection in the gene. It was shown that both negative and positive selection happened in the G gene, with 63 and 65 sites were respectively defined as negative selection in RSV-A and RSV-B by FUBAR method, while 11 sites in RSV-A and 9 sites in RSV-B were defined as positive selection (by at least one of the two methods), among which five sites in RSV-A and six sites in RSV-B had the strongest support, by both method MEME and FUBAR methods ( Table 2). Two and three positively selected sites respectively in RSV-A and RSV-B were located in the first hyper-variable region (amino acid positions 94, 115 in RSV-A and 77, 115, 133 in RSV-B), while seven and six positively selected sites were located in the carboxy-terminal third of the G ectodomain.

Evolutionary rate estimation.
To determine the evolutionary relationship between the different RSV strains, a Bayesian MCMC estimation of the time of the most recent common ancestor (MRCA) was performed. Results suggested that the uncorrelated lognormal relaxed molecular clock fit the data best. Under the best-fit model, the totally evolutionary rate for RSVs was calculated as 2.03 × 10 -3 nucleotide substitutions/site/year, with 95% HPD interval as 1.46 × 10 -3 to 3.06 × 10 -3 . The evolutionary rate for RSV-A was calculated as 1.48 × 10 -3 nucleotide substitutions/site/year (95% HPD interval: 1.27 × 10 -3 to 1.71 × 10 -3 ), while the evolutionary rate for RSV-B was 1.92 × 10 -3 (95% HPD interval: 1.69 × 10 -3 to 2.16 × 10 -3 ). The MRCA of RSV and BRSV was estimated to be around 921 years ago, while the MRCA of RSV-A and RSV-B dates back to 338 years ago. The earliest genotypic differentiation for RSV-A and RSV-B was 76 and 54 years ago, i.e. in the year of 1943 and 1965, respectively (Fig. 2).

Population dynamic analyses.
We estimated the effective population sizes of the prevalent strains of RSV using Bayesian skyline plot analyses. It was shown that the total RSVs witnessed a constant population size until around 15 years ago, when a sharp increase in population size lasted for about 5 years, followed by a short plateau period of 1-2 years before going down and became stable again (Fig. 3a). The fluctuation of the effective population size of RSV-B witnessed the same pattern with the total RSV population (Fig. 3c). The effective population size of RSV-A witnessed a persistent population size through the time axis (Fig. 3b).

Time-scaled phylogeny reconstruction.
To trace the spreading history of RSV, an unrooted maximum likelihood tree was built using the aligned complete G gene. The results suggested that the prevalence of the genotypes for both RSV-A and RSV-B was temporal specific. For RSV-A, the genotype with the longest epidemic duration was GA5, but it disappeared since 2015, SAA1, GA7 and GA6 genotypes were only prevalent in the 1980s and before, GA1 and GA4 genotypes were prevalent before 2000, GA3 prevalent before 2010, while GA2 was prevalent between the year of 2001 and 2014, since 2015, the only prevalent genotype was NA1 (Fig. 4a). For RSV-B, the mainly prevalent genotypes in the early stage (1970s and 1980s) were GB1, GB2, GB4 and NZB2, SAB1-4 genotypes were prevalent in 1990s to 2000s, BA genotypes (BA1-BA14) was most predominant in the last 20 years, and since 2015, the only prevalent genotype was BA9 (Fig. 4b).

Discussion
RSV is a large family containing two different antigenic groups, further clustered into distinct genotypes. The variability and evolutionary dynamics of the circulating strains of RSV-A and -B in the past 50 years worldwide has been clarified in this study. Generally, RNA viruses maintain extensive genetic variability through which rapid evolution can occur. Recombination is an important way for viral evolution both in positive-and negative-sense   [32][33][34] , it involves the generation of chimeric nucleic acid molecules consisting of components from different parental viruses. However, recombination events yielding different progeny have not been reported in RSV except one study involving the potential recombination during mixed infection in vitro 35 . Our recombination analysis on representatives of RSV-A and -B genotypes in this study also showed there was no recombination event occurred between the groups, but potential small fragment recombination events were observed in the intra-groups of both RSV-A and RSV-B. However, since continuous changes in temporal distribution of RSV subtypes/genotypes have been recorded 9 , co-infection with different subtypes/genotypes in the same host was rarely reported 20,36,37 . Therefore, we are tend to believe that the potential intragenic recombination events in this study were not true, as current recombination detection programs have not enabled a clear distinction between recombination or genetic drift in the reported breakpoints of closely related strains. Mutation is another important factor in the evolution of RNA virus. The mutations occurred by chance for lacking of proof-reading capabilities of RNA-dependent RNA polymerase of the virus, which shape quasispecies to allow rapid adaptation of the virus to the changed environment such as host's immune pressure. In this study, comparison of mutation rates for different genes of RSV showed that SNP occurrence rate of the G gene in both RSV-A and -B was much higher than that of other genes, and positive selection was only found in the G gene compared with negative selections in all the other genes. This phenomenon suggested that the pretty high genetic diversity G gene reflects the impact of host immune responses in the epidemic cycle, whereas purifying selection in the remaining genes reflects avoiding deleteriously mutations and favoring the optimization of viral replication and transmission. Substitutions, insertions, deletions, duplications, stop codon usage change and frame shift mutation all involved in the variability of G gene and may be related to its escape from the protection of the host neutralizing antibodies. Our study showed that except for G gene, the mutation frequencies of other genes were higher in RSV-A than in RSV-B, which may be attributed by a wider prevalence of RSV-A 38,39 .  Table 2. Positive selection sites in G gene of RSV-A and -B. Numbers in bold are positive selection sites by both the MEME and FUBAR methods. "*": p value ≤ 0.05; " # ": poster probability of 0.9. www.nature.com/scientificreports/ The mutation frequency of the G gene of RSV-B was higher than that of RSV-A, which may suggested a broad antigenic diversity in RSV-B, and this result was consistent with the phenomenon that RSV-A reacted with all the antibodies, whereas RSV-B showed different epitope characteristics in G gene in previous study in the very early time 3 . Although G gene exclusively harbors the most diversified mutations selected partially as response to host immunological pressures 29 , adaptive evolution is able to affect only certain sites of the gene because of strong functional constraints. In this study, the number of positive selection sites in G gene was less than those in the previous studies 40,41 , which may be caused by strict standards of P value (0.05 versus 0.1) and poster probability (0.9 versus 1.0). However, a similar distribution of the positive sites was observed: a small part of sites located in the first hypervariable region and a large part located in the C-terminal third of the ectodomain of the G, an important determinant of RSV evolution, but whether the immune pressure on G protein drives the selection completely needs to be further explored, since the random genetic drift following potential bottleneck effect rather than immune selection is still a challenge confronted or a point to be excluded: on one hand, the C-terminal strain specific epitopes are poor or secondary as neutralizing antigens and contribute much less in inducing the potent neutralizing antibody compared with those highly conserved neutralizing epitopes in F and G proteins; on the other hand, a recent advances in the vaccine development showed a robust and durable neutralizing antibody responses, induced by recombinant adenovirus encoding prefusion F (pre-F), can confer cross-protection against RSV-A and -B infection in old healthy people 42 , suggesting the short-lived and low level of neutralizing antibody response against F protein following RSV natural infection is responsible for the transmission and evolution of diversified RSV. In other words, the highly diverse and positively selected domains in G protein are not the main contributor to the evolution of RSV. It may be more easily accepted to notice the fact that different from the positively selected and constantly changed neutralizing epitopes in hemagglutinin (HA) of influenza virus 43 , the strong neutralizing epitopes of RSV exist either in the conserved F protein or in the highly conserved central domain of G protein. In sharp contrast, the positively selected antigens in G protein is much poor in inducing neutralizing antibody. Therefore, it was speculated that if the mutations from the positive selection sites on the G gene do have some contributions biologically and immunologically to the escape and evolution, it should be the secondary to be taken into account. www.nature.com/scientificreports/ The evolutionary rates estimated for the genes of RSV-A and -B in the present study were 1.48 × 10 -3 and 1.92 × 10 -3 nucleotide substitutions/site/year, respectively, which were close to the previously estimated rates that approximately ranged from 1.83 × 10 -3 to 4.68 × 10 -3 substitutions/site/year in RSV-A and 1.95 × 10 -3 to 5.89 × 10 -3 substitutions/site/year in RSV B 44,45 . Notably, it has been suggested that the evolutionary rate of G genes were about 10 times faster than that of the complete genomes and F genes of RSV 40,46,47 . The possible explanation for the faster mutation rate of G genes was that this gene may be under a strong selective pressure from host immune defense which was stronger than the F and other genes of the RSVs, and can lead to variants that may alter the pathogenicity and fitness 48 . Based on the evolutionary rates, it was estimated that the MRCA for the tree root can date back to the year of 1943 for RSV-A and 1965 for RSV-B, which was comparable to the results in the previous studies 41,44,45 . The calculations in our study suggested that the divergence of RSV-A and -B viruses occurred approximately 338 years ago, indicating the MRCA of the two groups can date back to the year of 1681.

RSV group
The population dynamics showed that RSV-A witnessed a constant population size through the time axis, while an obvious increase in population size about 15 years ago (in 2004) was observed in RSV-B, which may be caused by the application of the next generation sequencing technology at the time that had detected a large number of RSV sequences. However, the obvious decrease in RSV-B detected about 6-7 years ago (in 2012-2013) might result from dying out of some genotypic lineages, as the temporal dynamic analysis in this study also showed that different BA types were detected in 2011-2014, while since 2015, only BA9 was observed. Because of RSV-A being more prevalent than RSV-B 17,38,39,49 , the factors affecting RSV-B population size have little influence on RSV-A. The change trend of the population size of total RSV was consistent with that of RSV-B, which suggested that changes in effective population size were group specific and the recent change in RSV-B may be the source of the observed change of the whole RSV family.
Our estimation of the temporal dynamics suggested that different new genotypes of both RSV-A and -B appear periodically and tend to be the predominated circulating strains. For RSV-B, GB1 was the earliest prevalent genotype, while the BA was the only prevalent genotype in the last 20 years, and since 2015, the only prevalent genotype was BA9; while for RSV-A, the longest duration of the epidemic was GA5 genotype, and since 2011, the only prevalent genotype was NA1. It is worth noting that the most prevalent genotypes for RSV-A and-B in the last 10 years was NA1 (especially ON1 lineage) and BA (BA1-14), respectively. A 72-nt and a 60-nt nucleotide duplication was respectively found in the C-terminal region of the G protein of ON1 and BA genotypes, which made the virus had a fitness advantage to the host, and the corresponding genotypes soon became the predominant strains in the world 15,22,50 .  www.nature.com/scientificreports/ In summary, this study provides data for genetic diversity and molecular evolution with RSVs circulated in the world. Since the genetic diversity of viral genome and the variability of G glycoprotein play a significant role in RSV pathogenicity by allowing immune evasion, monitoring changes and analyzing evolution pattern in the sequences of both groups could give useful insights for future vaccine and therapy strategies of RSV.  www.nature.com/scientificreports/ nk). All the genomic sequences were aligned by MAFFT 7 (online version: https:// mafft. cbrc. jp/ align ment/ softw are/).

Recombination analysis.
To detect whether any potential recombination occurred in the inter-or intragroups of RSVs, the whole-genome sequences from their representatives of different genotypes of RSV A and B were aligned and analyzed by Simplot software (version 3.5.1) using the boot scanning method, with neighbor joining algorithm as 100 pseudoreplicates.
Calculation of genetic diversity. SNP calling of the genes of RSVs was done on the ViPR (https:// www. viprb rc. org/). Diversities of the different genes that encode the RSV proteins were quantified as the mean genetic distance calculated for the pairs of nucleotide sequences using MEGA software (version 7). The ratio of nonsynonymous substitutions per nonsynonymous site (dN) and synonymous substitutions per synonymous site (dS) were calculated using the method of Nei and Gojobori with the Jukes-Cantor correction for multiple substitutions, conducted using MEGA software (version 7). The dN/dS ratio is an indicator of the strength of positive (> 1) or negative (< 1) or neutral (= 1) selection pressure on a viral variant.
Site selection pressure on the RSV G gene. Estimation

Data availability
The datasets analyzed during this study are included in this published article in Supplementary Information files.