Evolution of respiratory syncytial virus genotype BA in Kilifi, Kenya, 15 years on

Respiratory syncytial virus (RSV) is recognised as a leading cause of severe acute respiratory disease and deaths among infants and vulnerable adults. Clinical RSV isolates can be divided into several known genotypes. RSV genotype BA, characterised by a 60-nucleotide duplication in the G glycoprotein gene, emerged in 1999 and quickly disseminated globally replacing other RSV group B genotypes. Continual molecular epidemiology is critical to understand the evolutionary processes maintaining the success of the BA viruses. We analysed 735 G gene sequences from samples collected from paediatric patients in Kilifi, Kenya, between 2003 and 2017. The virus population comprised of several genetically distinct variants (n = 56) co-circulating within and between epidemics. In addition, there was consistent seasonal fluctuations in relative genetic diversity. Amino acid changes increasingly accumulated over the surveillance period including two residues (N178S and Q180R) that mapped to monoclonal antibody 2D10 epitopes, as well as addition of putative N-glycosylation sequons. Further, switching and toggling of amino acids within and between epidemics was observed. On a global phylogeny, the BA viruses from different countries form geographically isolated clusters suggesting substantial localized variants. This study offers insights into longitudinal population dynamics of a globally endemic RSV genotype within a discrete location.


Results
Over 15 RSV epidemics (2002 to 2017), a total of 903/12203 (7.4%) respiratory samples tested positive for RSVB. G gene sequencing was attempted for all RSVB positive samples and was successful for 857 (95%) RSVB positive samples. The proportion of RSVA and RSVB samples in each epidemic ranged from 3.8 to 77.0% (Table 1) (Table 1). The annual distribution of RSV cases by age from 2002 to 2017 are shown in Supplementary Table 1. Although the hospital-based RSV surveillance in Kilifi was established in 2002, the genotype BA was not detected until January 2003. From the RSVB positive samples, RSVB BA viruses were confirmed during sequence assembly and analysis by the presence of a 60-nucleotide duplication in the C-terminal of the G gene. Overall, a total of 735 genotype BA G gene sequences spanning nucleotides 214 to 981 of the RSVB strain B1 (DQ227363) were obtained.
Genetic diversity of the genotype BA. The sequenced G gene region showed 3% nucleotide divergence (overall mean P distance) over the entire period and varied by a maximum of 38 nucleotides within an epidemic. Genetic divergence increased proportionally with time, Fig. 1A, indicating clock-like evolution. Molecular clock phylogenies showed a well-ordered diversification of the genotype BA since its introduction in Kilifi (Fig. 1B). Viruses from the same epidemic formed multiple phylogenetic clusters and often, sequences from the preceding epidemic were often positioned at the basal nodes of those in the successive epidemic, suggesting sequential virus circulation and persistence between epidemics. A total of 56 distinct variants were identified, some were singletons suggesting either low level transmission or unsampled genetic diversity. The 2007/8 and 2009/10 epidemics were the most heterogeneous, with 12 and 11 variants, respectively. Up to four variants circulated over multiple epidemics (Fig. 1C), and some were observed in non-consecutive epidemics likely indicating reintroduction of variants that had been previously observed locally.
Evolutionary history of the genotype BA in Kilifi. Our previous study reported the number of virus introductions into Kilifi from 2002/3 to 2011/12 epidemics 24 . To place the Kilifi genotype BA viruses in the context of global RSVB diversity, we analyzed the 2012/13 to 2016/17 Kilifi sequences (n = 233) and 583 global sequences sampled from 16 countries between 2012 and 2017. In the current study, distinct geographical clusters were evident with viruses clustering primarily by country of origin. The 2012/13 to 2016/17 RSVB epidemics in Kilifi were seeded by at least 15 virus introductions (Fig. 2). For many countries, viruses circulating during the same year were not placed into single monophyletic groups but in multiple clusters of assorted sizes (Fig. 2), indicative of multiple virus entries followed by local spread and genetic expansion. Between 2012 and 2017,      Figure S1).

Discussion
This study documents the natural history of the RSVB genotype BA viruses in Kilifi, Kenya, over 15 epidemics. Our analysis found substantial genetic diversity exhibited by multiple variants co-circulating within and between epidemics. Shifts in the predominating strains have been suggested to confer a selective advantage by variations   240  241  242  243  244  245  246  247  248  249  250  251  252  253  254  255  256  257  258  259   2002/3  T  E  R  D  T  S  T  P  Q  S  T  V  L  D  T  T  T  S  The 60-nucleotide duplication region showed further accumulation of nucleotide substitutions further than previously reported 15 , which likely points to enhanced fitness and molecular adaptation providing evolutionary advantage of the BA genotype 41 compared to other RSVB genotypes. Accumulation of nucleotide changes in RSV G gene was also attributed to antibody selection to some extent, as well as selective constraints other than immune selection, for instance, high mutation rate, protein malleability and bottleneck effects 42 .
We observed amino acid reversion patterns in the Kilifi genotype BA viruses. Evolutionary reversals provide a means of escape by generating antigenic novelty and reflect fluctuating dynamics in immune responses, in which cross-immunity wanes, thereby controlling pathogen replication and facilitating virus transmission 30 . Escape mutations within or flanking functionally conserved epitopes come under selective pressure to revert to the wild type in hosts that do not mount an immune response against the epitope 43 . Amino acid reversions could also be a necessary consequence of a limited set of possible replacements at epitopes, a constraint on the repertoire of functionally viable amino acids 30 .
The Arg-180 substitution may have altered the positive charge of the G protein cysteine noose. The role of electrostatic charge in pathogenesis has been appreciated for HIV rapid progression to AIDS by the presence of charged residues at two amino acid positions on the gp120 subunit 44 . In Influenza A, several immune escape mutants showed an increased positive charge in HA and virions with a higher net positive charge in HA may promote favourable electrostatic interactions with target cells since host cell receptors possess a strong negative charge 45 . A change in surface charge (Glu to Lys) of the envelope glycoprotein was also implicated in the emergence of enzootic and epizootic viral phenotypes 46 . In addition, a potentially significant element of the immune escape repertoire is the selection of 'adsorptive mutants' , showing enhanced binding to target cells 47 . An important aspect of receptor-binding avidity is that of electrostatic charge 45 . The Arginine replacement at position 180 may have resulted in increased net positive surface charge at the cysteine noose consequently promoting electrostatic interactions with target host cells. Future therapeutic design efforts will need to assess for possible loss of human G-directed mAbs cross-reactivity to epitopes containing the N178S and Q180R mutations.
Both pervasive positive selection and episodic diversifying selection were detected in amino acid positions within the sequenced G gene segment, six of which (227, 270, 287, 305, 306 and 313) were in agreement with previous studies 30,48 . Similarly, four putative N-glycan binding sites reported in this study (86, 144, 296, and 310) were previously reported for RSV G 49 . Amino acid position 144 was also identified in our study as being under positive selection by at least two methods.
In summary, our analyses show that the genotype BA dynamics in Kilifi have been marked by multiple cocirculating variants even within an epidemic, which are serially replaced over time perhaps by virus importations or evolution in situ. There has been gradual diversification specially in the C-terminal of the G protein possibly due to antibody-driven selection or functional constraints. In later epidemics (from 2007/8) the genotype was characterized by marked evolutionary reversions that reflect fluctuating immunological dynamics-either loss of pre-existing immunity or positive selection in a newly susceptible populations 30 -in the local communities. www.nature.com/scientificreports/ Contemporary sampling and sequencing of RSVB particularly in Africa is still insufficient, compared to HIV-1 viruses, for which the regional spatial ecology has been extensively characterized 50,51 . Paucity of sequence data from neighboring countries limited our inferences on virus importations and movement within the sub-Saharan region. Future studies on local epidemics, transmission and spatial dynamics of the genotype BA could greatly benefit from improved sampling and sequencing in the region. Nonetheless, this study highlights the value of longitudinal sampling within a discrete location and of sequencing sufficiently over a long time to characterize RSV population dynamics. Further studies are required to determine the importance of variant-specific genetic differences in immune response and virus transmission. While the G gene is the fastest evolving RSV gene and has been extensively useful in RSV molecular epidemiology, studying the diversity of the rest of the RSV genome could allow for a deeper understanding of the long-term evolutionary dynamics of the RSV genotype BA.

Methods
Study samples and epidemic patterns. This study was part of longitudinal surveillance study established to understand the epidemiology and disease burden of RSV associated pneumonia 21 . The respiratory samples were collected into the universal transport media (UTM; Copan Diagnostics, USA). We used sequence data from RSV positive samples (nasal washes, aspirates and swabs) collected between January 2003 and August 2017. A majority of the samples were from children < 5 years admitted to Kilifi County Hospital (KCH) (2004-2017) with lower respiratory tract illness 20,21 . The 2003 samples were from the Kilifi RSV birth cohort acute respiratory infection cases 52 . All sample sets arise from the same catchment population and were processed similarly in the laboratory as detailed below.
Informed consent was sought from parents or guardians, and the study protocols were reviewed and approved by the Scientific and Ethical Review Unit, KEMRI, Kenya. All methods were carried out in accordance with relevant guidelines and regulations.
RSV epidemics in Kilifi typically start in November and run through to May with sporadic (inter-epidemic) infections occurring in June to October 21 . However, there is year-to-year variation and therefore we took a pragmatic approach and defined an epidemic as running from October of one year through to September of the following year. The dominant RSV group within a given epidemic was associated with ≥ 60% of the cases; otherwise, the groups were considered co-dominant. Viral detection, amplification and sequencing. RSV detection was performed by immunofluorescence antibody test (IFAT; RSV DFA kit, Light Diagnostics, UK). Beginning 2007, a real-time RT-PCR assay with primer/probe sets targeting the highly conserved nucleoprotein (N) gene was used to discriminate RSVA and RSVB for all respiratory samples collected during the surveillance period as previously described 24,53,54 . In 2015, the real-time RT-PCR assay for RSVB was updated following failed detection of drifted variants 55 . For genotyping purposes, a 900 base pair (bp) fragment of the G gene was amplified from purified viral nucleic acids of RSVB positive samples and sequenced in a nested PCR reaction as previously described 18,53 . Sequence assembly was done using Sequencher v5.0 (Gene Codes Corp., USA). The RSVB genotype BA viruses were identified following sequence alignments and phylogenetic analysis of the assembled sequences.
Global data set. In addition to the Kilifi sequences, we retrieved RSVB G gene sequences collected between 2012 and 2017 from GenBank. We excluded sequences were shorter than 700 nucleotides, non-BA genotype sequences and those without sampling date or location information. Sequences were aligned using MAFFT 7.222 and alignments manually edited in Aliview 56 .
Phylogenetic analyses. Model selection (ModelFinder) and Maximum likelihood (ML) phylogenetic trees were estimated with IQTREE 1.6 57,58 . Node support values were estimated using bootstrap resampling (1000 replicates). Temporal signal in the data was examined using TempEst 59 . Phylogenetic relationships and viral demographic histories were inferred using BEAST 1.10 60 . We used a GTR substitution model with discrete gamma distributed rate variation among sites and uncorrelated relaxed molecular clock with branch rates drawn form a lognormal distribution to account for evolutionary rate variation among lineages. A Skyride demographic prior with time-aware smoothing 61 was selected and a CTMC rate reference prior was specified for the mean clock rate. Chain length of MCMC sampling was 300 million generations, sampling every 10,000. Stationarity and mixing were examined using Tracer 1.7 and maximum clade credibility (MCC) trees summarized using TreeAnnotator.
A variant was defined as a virus or group of viruses with 4 nucleotide differences compared to other viruses and/or falling into a distinct cluster with at least 60% bootstrap support 24 . Selection analysis. We tested the G gene data for evidence of natural selection using methods implemented in the Datamonkey 2.0 webserver 62 . Mean non-synonymous to synonymous rate ratio (d N /d S ) was estimated using the SLAC method. Pervasive selection analyses were performed using SLAC and FEL methods. Episodic (frequently transient) selection was evaluated using MEME, and the FUBAR method was used to evaluate the difference between nonsynonymous and synonymous rates per codon site. Significance of SLAC, FEL and MEME results used a P-value cutoff of 0.05, and FUBAR results were assessed posterior probability of 0.9. N-linked glycosylation sites. The N-Glycosite web tool (http://www.hiv.lanl.gov/conte nt/seque nce/ GLYCO SITE/glyco site.html) was used to identify putative N-glycosylation sites (amino acid configuration N-x-S/T, where x is not Proline (P)).