Evolutionary and network analysis of virus sequences from infants infected with an Australian recombinant strain of human parechovirus type 3

We present the near complete virus genome sequences with phylogenetic and network analyses of potential transmission networks of a total of 18 Australian cases of human parechovirus type 3 (HPeV3) infection in infants in the period from 2012–2015. Overall the results support our previous finding that the Australian outbreak strain/lineage is a result of a major recombination event that took place between March 2012 and November 2013 followed by further virus evolution and possibly recombination. While the nonstructural coding region of unknown provenance appears to evolve significantly both at the nucleotide and amino acid level, the capsid encoding region derived from the Yamagata 2011 lineage of HPeV3 appears to be very stable, particularly at the amino acid level. The phylogenetic and network analyses performed support a temporal evolution from the first Australian recombinant virus sequence from November 2013 to March/April 2014, onto the 2015 outbreak. The 2015 outbreak samples fall into two separate clusters with a possible common ancestor between March/April 2014 and September 2015, with each cluster further evolving in the period from September to November/December 2015.

The picornaviruses are small, single-stranded positive sense RNA viruses causing disease in animals and humans and include the human parechoviruses (HPeV). Like other picornaviruses, HPeV evolve rapidly due to nucleotide substitutions and recombination events, generating strains associated with changed infectivity, virulence or host range. The first detected outbreak of HPeV type 3 (HPeV3) in infants in Australia occurred in and around Sydney in New South Wales in late 2013, with subsequent infection in other parts of Australia in the first part of 2014. A second outbreak occurred across Victoria and Southern Australia in the second half of 2015. As described previously 1 this latter outbreak included the first HPeV3 cases in infants seen at University Hospital Geelong and was shown to be caused by a novel recombinant strain of HPeV type 3 1 . We now extend our previous study to include the near complete virus genome sequences and the phylogenetic and transmission network analyses of 12 of the Geelong 2015 cases and another 6 Australian cases from 2012-2015 for a total of 18 HPeV3 infant cases in Australia. Thus, our study provides valuable, near complete genome HPeV3 sequences and explores the potential value of phylogenetic and network analyses for better understanding HPeV3 transmission and evolution over time.
HPeV type 1 (HPeV1) and 2 were initially classified as echoviruses 22 and 23 in the genus Enterovirus, but reclassified in 1998/99 into their own genus based on nucleotide and biological features including a lack of host cell protein synthesis shut-off during replication [2][3][4][5][6] . The RNA genome of HPeV is approximately 7300-7400 nucleotides with 5′ and 3′ untranslated regions (UTR) at each end of a single long open reading frame (ORF) [7][8][9] . The single ORF encodes a polyprotein including the P1 structural/capsid and the P2 and P3 nonstructural proteins. The capsid proteins of HPeV consist of only 3 proteins, VP0, VP3 and VP1, as VP0, in contrast to some other picornaviruses, is not proteolytically cleaved into VP4 and VP2 7,10 . Some of the HPeV types have an arginine-glycine-aspartic acid (RGD) motif near the carboxy-terminus of VP1, thought to be the receptor-binding site likely binding to host cell integrins 11,12 . However, this RGD motif is not present in HPeV3, HPeV7-16, the related rodent/zoonotic virus Ljungan virus or in some strains within HPeV types normally encoding the RGD motif, for example for HPeV4 7,[13][14][15][16] . Like other picornaviruses, recombination of HPeV often involves breakpoints at the very end of VP1 and thus may remove or add the RGD motif, or alternatively occur downstream in the nonstructural coding region 7,[17][18][19][20][21] .
HPeV infections are prevalent in humans and most often cause only asymptomatic/subclinical or mild gastrointestinal and/or respiratory disease. More severe disease including central nervous system (CNS) dysfunction may be seen in particular in infants and one relatively large study reported mortality among young infants as high as 6% 22,23 . HPeV3 appears to cause more severe disease in young infants compared to the other HPeV types and is the most common type identified in cerebrospinal fluid (CSF) samples from young infants with CNS infection and/or sepsis-like presentation 22 . Evaluation by cranial magnetic resonance imaging (MRI) or ultrasound of a group of 10 infants with HPeV3 CNS infection showed abnormal periventricular white matter in all infants and for a number of the infants, clinical follow-up revealed sequelae including cerebral palsy, epilepsy, learning disabilities, and visual impairment 24,25 . Further cases like this including reports of autopsy cases of two infants that died with active HPeV3 infection and severe CNS disease are referenced in ref. 22. In regard to pathogenesis, it appears that HPeV3 may target vascular smooth muscle cells in the leptomeninges, and in the lung, and thus that clinical symptoms and severe CNS tissue damage may be caused by vascular changes 22 .
HPeV3 was first isolated in 1999 and reported in the literature in 2004 and subsequent surveillance suggests a two to three-year cycle between outbreaks 11,23,[26][27][28][29][30][31][32][33] . The reason for this is not known and as these viruses are easily transmitted by the oral-faecal and respiratory routes 30 , it may be related to exposure and immunity within families e.g. outbreaks occurring when there is both an older sibling lacking prior HPeV3 exposure plus waning or no maternally derived HPeV3 passive immunity in the young infant 30,34 . Studies from Japan have indicated that antibodies to HPeV3 are only present in 15% of children from 7 months to 1 year of age, but that seropositivity increases to 45% in 2-3 year old and to 85% in 4-6 year old children 11,35 . Due to these factors, virus testing results and virus sequences available for analyses are likely to only represent a minority of HPeV3 infections. Therefore, transmission network analyses will help elucidate part of a much larger transmission network, i.e. will represent sequences sampled from severe disease in young infants among a much larger population of infected older children and adults with mild or no clinical disease and thus not sampled and tested.
Molecular clock analyses have estimated that the parechoviruses have existed for approximately 400 years with HPeV3 and HPeV7 diverging approximately 150 years ago 36 . However, it appears that more recent strains have spread worldwide only within the last 20-30 years 28,37 , and during this time, may have evolved into being more virulent. One such example may be the Yamagata 2011 lineage of HPeV3 shown to cause severe disease in young children and myalgia in adults [38][39][40] . We have previously shown that the first reported outbreaks of HPeV3 in infants in Australia in 2013/14 and 2015, respectively, were caused by a novel recombinant HPeV3 with the capsid encoding region closely related to the Yamagata 2011 lineage, while the nonstructural encoding region derived from an as yet unknown/unpublished HPeV(s) 1 .
We now present the near complete virus genome sequences with phylogenetic and network analyses of potential transmission from a total of 12 of the Geelong 2015 cases as well as from another 6 positive Australian cases. These include an early Australian sample from 2012 obtained by retrospective testing, samples from Sydney from 2013 and 2015 in addition to two samples from Adelaide and one from Darwin from 2014 making a total of 18 cases of HPeV3 in infants in Australia. Overall the results support our previous finding 1 that the major recombination event took place between March 2012 and November 2013 and we now show evidence of further temporal virus evolution and possibly recombination. While the nonstructural coding region appears to be evolving both at the nucleotide and amino acid level, the capsid encoding region appears to be very stable, particularly at the amino acid level. This apparent stability of the virus capsid may indicate that development of a vaccine or antibody treatment to avoid severe disease in infants may be feasible. It should be noted that the ancestor of the nonstructural coding region of the Australian recombinant HPeV3 is still not known and further sequencing and publishing/depositing of results from cases worldwide should be a priority to further the understanding of generation, evolution, spread and possibly even host range of new/recombinant strains of HPeV3.

Results and Discussion
Next generation sequencing (NGS) was performed on RNA extracted from the clinical samples (Table 1) as well as on available virus isolates. The coverage and quality of the sequences obtained varied somewhat between samples and those with a lower coverage or quality were included in multiple runs. A total of 5 NGS runs were performed to obtain sufficient data to assemble high quality consensus sequences for all samples included. Sample coverage varied from approximately 83000 mapped reads with an average coverage of 2400 for sample CS-HP-16018 (which is a sample from 2012 and not the novel recombinant Australian outbreak virus 1 , thus not fully covered by our Ampliseq panels), to 2.3 million mapped reads with an average coverage of 67000 for sample CS-HP-16012. Apart from samples CS-HP-16018 mentioned above and CS-HP-16003 for which we only had very limited sample material left from our previous study, samples averaged 0.5-1.5 million reads and an average coverage of 14000-50000. This allowed us to assemble high quality near complete virus genome consensus sequences from each sample apart from sample CS-HP-16018, where high quality sequences were obtained for the first approximately 5000 nucleotides only. For that sequence we have chosen to only report and include in our analyses the capsid encoding region from nucleotide 701-3013. All sequences generated have been deposited in GenBank under accession numbers KY556659-KY556676.
We first compared sequences obtained from different sample types from the same patient (CSF and faecal samples obtained from CS-HP-16004 and CSF, faecal and nasal samples from CS-HP-16006; Table 1). The sequences obtained from the different samples collected from the individual patient were in 100% agreement. These results support the robustness and accuracy of the NGS sequences and moreover, allowed us to focus on a single sequence from each case in the phylogenetic analyses. As a second step, we compared the sequences obtained from virus isolates (see Materials and Methods) to the sequences obtained from the corresponding clinical sample. For 4 out of the 6 virus isolates available, the obtained sequence of 7334 nucleotides were in 100% agreement with that obtained from the clinical sample, while sample CS-HP-16018 and CS-HP-16019 each had a single non-coding change between the clinical sample and the corresponding virus isolate. For sample CS-HP-16018 the single non-coding nucleotide change was located in the capsid coding region and for CS-HP-16019 it was located in the 5′-UTR. The comparison between clinical samples and corresponding virus isolates served two purposes. First, it further supports the very high accuracy of the NGS generated sequences indicated by comparing different sample types mentioned above. Second, it suggests that the obtained virus isolates do not appear to have been selected/adapted during the cell culture process and may accurately reflect in vivo virus, which is useful information to have if a virus isolate is used for animal trials or in vitro studies.
As a first step in our analyses of the sequences obtained, we screened for recombinants among the sequenced viruses using GARD (genetic algorithm recombination detection) analysis. We discovered in our previous study that the Australian HPeV3 outbreak virus is a recombinant, with the sequence up to approximately nucleotide 3115 derived from a Yamagata 2011-like virus, whereas the downstream sequence is derived from recombination(s) with an as yet unknown/unpublished HPeV/s 1 . The analysis done here did not include CS-HP-16018 as we did not have the near full-length sequence and moreover, have already demonstrated that the virus in this sample is not the novel recombinant virus as in the other Australian HPeV3 samples sequenced, but rather consistent/ similar to a full, i.e. not further recombined, Yamagata 2011 lineage virus 1 . Interestingly, although only indirectly related to our analyses presented here and to our knowledge not reported in the literature, the Yamagata 2011 lineage viruses reported [38][39][40] and deposited in GenBank contain what appear to be two different clusters most likely derived by recombination at approximately nucleotide 5000 (equivalent to nucleotide 4300 in the deposited Yamagata 2011 lineage viruses as they only included the coding regions). These two clusters are represented by GenBank Accession number AB759207 in one of the two clusters and AB759204 and AB759205 in another. We inadvertently based our initial NGS panel on AB759207 1 and this particular panel coincidentally resulted in high quality sequence for sample CS-HP-16018 up to approximately nucleotide 5000, consistent with this sample, from March 2012, being of full Yamagata 2011 lineage, but really matched most closely to the AB759204/AB759205   Table 2. The range of the calculated transition/transversion bias in the datasets was from 7.85-9.68 using the maximum likelihood (ML) method and from 9.1-18.2 using the maximum composite likelihood (MCL) method depending on the dataset. This shows that the variation among the sequences is heavily biased towards transitions, in particular in the capsid encoding region that also had the lowest dN/dS ratio of 0.06 compared to the nonstructural coding region and the full ORF with a ratio of 0.16 and 0.13, respectively. This is consistent with a strong indication of negative/purifying selection in the capsid coding region and some indication of positive selection at a few sites in the nonstructural coding region ( Table 2) Fig. 1A and B). It should be noted that this clustering may have been caused by another recombination event as the samples in cluster 1, with the exception of CS-HP-16004 and 16005, all were identified by the GARD analysis as having potential recombination sites or other cause of incongruent evolution as compared to the other samples included here. Interestingly, the analysis indicated that the Australian 2013/14/15 recombinant outbreak strain/lineage was more closely related to AB759204 Yamagata 2011 than to the "non-recombinant" 2012 Australian virus (nasal sample CS-HP-16018). The midpoint rooted tree obtained for the 17 Australian recombinant virus sequences for the capsid region ( Fig. 2A) is very similar to that for the nonstructural coding region (Fig. 2B) although the variability for the nonstructural coding region is somewhat higher (Fig. 2 and Table 2). A similar picture and relationship is also evident in the trees for the full ORF (Fig. 3A) and for the full sequence from nucleotide 1-7334 (Fig. 3B). Thus, although the different regions appeared to be under different evolutionary pressures, inferences as shown by the clustering and topology of the trees generated, were relatively robust. Overall, this analysis indicated that the 17 Australian recombinant outbreak virus sequences had the 2013/14 virus sequences clustering together while the 2015 virus sequences fell into two separate discrete clusters as mentioned above. Phylogenetic transmission network construction was then done using the program Network v5. It should be emphasized that our virus sequences all came from severely ill infants and likely only represent a fraction of infected individuals. Consequently, the aim was not to look at direct transmission between these cases, but rather to look for an indication of a potential transmission network likely containing a large number of unsampled infections among older children and adults (likely to only show mild or no clinical symptoms and therefore not tested for HPeV). Similar to the ML phylogenetic analysis shown above, we did this for the individual regions, for the full ORF and for the full sequence. Similar to what is mentioned above, we found the potential transmission networks generated to be very similar and again suggesting that the closest ancestor for the capsid region of the recombinant Australian 2013/14/15 virus sequences was AB759204 of the Yamagata 2011 virus lineage. Interestingly, while the Australian 2012 "non-recombinant" virus sequence (sample NAS18/CS-HP-16018) clearly was closely related to the Yamagata 2011 virus sequences, it may not be the direct ancestor of the subsequent Australian outbreaks that may be caused by separate introduction(s) (Fig. 4A) [21 sequences, capsid only]. As the apparent transmission networks for the regions, the full ORF and the full sequence were very similar, and moreover the non-coding regions only had a total of 8 variable sites (Table 2)   However, as HPeV is very stable in the environment, we anticipate that this infection would include many unsampled infected individuals. In this outbreak, this discrepancy could most likely be explained by case CS-HP-16017 being infected relatively early in September 2015, possibly by someone at the end of a long unsampled network, or alternatively, may be explained by some cases in that cluster (i.e. CSF samples CS-HP-16006 and CS-HP-16007, both sampled in October) derived from cases that have been infected from virus excreted into the environment already in September. Another difference between the ML tree in Fig. 3A and the transmission network in Fig. 4B involves sample CSF19 (CS-HP-16019) which although sampled in November 2013 is not consistently basal to the Australian samples from early 2014 (FEC21, 22 and 23). The reason for this is unknown, but may at least in part be explained by the fact that CSF19 appears to be more closely related to sample FEC23 than to the other samples from 2014 and moreover, that the trees get somewhat skewed by the fact that FEC22 is more closely related to the included Yamagata 2011 lineage viruses than to CSF19. However, considering that these particular samples (CSF19 and FEC21, 22 and 23) represent a very few samples from affected infants separated by many hundreds of kilometres (Table 1), speculation on exact causes is not possible apart from noting the fact that although both collected from cases in Adelaide and only separated in time by one month, samples FEC21 and FEC23 appear to represent two slightly different clusters somewhat similar to what is observed for the Geelong 2015 cases (Fig. 4A  and B).
In addition to the temporal aspect, we also attempted to look at the apparent transmission network from a spatial perspective using the approximate home location of each case (Table 2). Not unexpectedly, considering the anticipated large number of unsampled infected individuals and the likely wide travel network involved, in particular in relation to the holiday season up to December, we could not detect any spatial connection between either the 2015 clusters, or the earlier Australian virus sequences. However, for the 2015 virus sequences it is interesting to note that those with the highest number of nucleotide differences and located most distant in the Hospital Geelong which, except for CSF sample CS-HP-16008, have home addresses located outside or even at some distance from Geelong (Table 1 and Fig. 4B).
We have here reported near complete virus genome sequences from 17 samples and a partial sequence of one sample essentially doubling the availability of near complete HPeV3 genome sequences in the NCBI databases. Furthermore, we show that consensus sequences obtained from different sample types or directly from clinical sample as compared to virus isolate are essentially identical. Overall our results support our previous finding that the major recombination event leading to the generation of the Australian 2013-2015 outbreak virus took place between March 2012 and November 2013 possibly followed by additional recombination events contributing to the evolution of seperate clusters. While the nonstructural coding region derived from an as yet unknown HPeV  Table 1.
appears to be evolving both at the nucleotide and amino acid level and may still be evolving/adapting to its full replicative potential in this recombinant virus setting, the capsid encoding region appears to be very stable, in particular at the amino acid level. As we have noted previously, the stability of the virus capsid indicates that development of a vaccine or antibody treatment to avoid severe disease in infants may be feasible. However, the ancestor of the nonstructural coding region of the Australian recombinant HPeV3 is still not known and we encourage further sequencing and publishing/depositing of results from cases worldwide in order to further the understanding of generation, evolution, spread and possibly even host range of such new/recombinant strains of HPeV3.

Materials and Methods
Clinical specimens and virus isolates. An outbreak of HPeV3 was identified in infants in the Geelong area of Victoria, Australia during August-December 2015 and a total of 16 cases were identified at the University Hospital Geelong as described previously 1 . For the studies described here we had available a total of 15 samples from 12 of these cases and also 3 stored clinical samples from the Victorian Infectious Disease Reference Laboratory (VIDRL) collection described previously 1 . An additional 3 positive samples selected from VIDRL's stored collection of HPeV3 positive samples from Australia provided a total of 21 clinical samples from 18 cases of HPeV3 in infants in Australia; see Table 1 for details.
Virus isolates from the following 6 samples were also available for analysis: CS-HP-16006 faecal and nasal swab samples, CS-HP-16004 faecal sample, CS-HP-16018 nasal sample, CS-HP-16019 CSF sample and CS-HP-16020 faecal sample. These virus samples were cultured on monolayers of Vero cells as described previously 1 .
All samples were stored at −80C and RNA extracted and converted into cDNA as described previously 1 .
The studies described here were performed in accordance with relevant guidelines and regulations and was provided with ethical exemption by the Barwon Health Human Research Ethics Committee.
Next generation sequencing (NGS) of HPeV-positive samples. As described previously 1 , we initially constructed a composite reference genome from related HPeV3 sequences available in NCBI (National Center for Biotechnology Information) and used this initial reference (here called composite reference 1, see ref. 1 to design a custom Ion AmpliSeq Panel (here referred to as panel 1) for use with Ion Torrent S5 System (Thermo Fisher Scientific, Vic. Australia). Based on initial NGS results and additional Sanger sequencing (described previously 1 ), we subsequently updated this initial reference sequence (updated reference available as GenBank accession number KY020128) and used that to design an updated Ion Ampliseq Panel, referred to as panel 2. Based on additional NGS runs, we further updated the reference sequence and designed a third Ion Ampliseq Panel, here referred to as panel 3. Details of panel 2 and 3 can be seen online (Supplementary Table S1).
Extracted RNA converted to cDNA from all samples were PCR-amplified with the Ion AmpliSeq Library Kit 2.0 and Ion AmpliSeq Panel 1, 2 or 3. The panels comprise 37 or 38 overlapping primer sets split into 2 pools per sample and amplified with the 5X Ion AmpliSeq HiFi Master Mix. The details of our NGS protocol has been described in detail previously 1 and for the results described here we only made minor adjustments including raising the number of initial PCR cycles from 30 to 35 cycles and then leaving out the post-amplification step and finally, we used the Ion Library TaqMan ™ Quantitation Kit to quantify the amplified libraries. Libraries were then pooled prior to loading onto Ion 530 Chips and loading into the Ion Chef Instrument. Following template preparation, the chips were run on the Ion Torrent S5 System following company protocols. NGS and associated reactions were performed at the Geelong Centre for Emerging Infectious Diseases (GCEID), Geelong, Victoria, Australia. For the NGS results described here we did a total of 5 Ion Torrent S5 runs with a total of 6 Ion 530 Chips generating a total of approximately 25 billion nucleotides.
Next generation sequence analyses. Sequences obtained from the Ion Torrent S5 were analysed based on each sample having a unique barcode. Raw sequences were analysed using the Ion Reporter Software contained within the Ion S5 Instrument using either of the 3 reference genomes or various HPeV reference genomes from NCBI. Obtained reads were then visualised and analysed using the Integrative Genomics Viewer (IGV) 41 . IGV calculates the consensus sequence for the NGS results for each sample as described by Cavener 42 ; if the frequency of a single nucleotide at a specific position is greater than 50% and greater than twice the number of the second most frequent nucleotide it is assigned as the consensus nucleotide or if the sum of the frequencies of two nucleotides is greater than 75% (but neither meet the criteria for a single nucleotide assignment) they are assigned as co-consensus nucleotides. If no single nucleotide or pair of nucleotides meets the criteria, IGV assign an 'N' . In the few occasions where our consensus sequences contained co-consensus nucleotides at a given position, we manually inspected the results and assigned a single nucleotide when a given nucleotide was present in 60% or more of the reads. The final assembled virus genome sequences were based on the NGS results indicated above, but also included nucleotides 1-29 and 7292-7334 (the very 5′-and 3′end sequences). These did not derive directly from the NGS sequencing, but are assumed based on: (i) the high efficiency of the end-primers used; (ii) comparison to other closely related sequences, and (iii) the fact that the rest of the 5′-and 3′-UTR sequences obtained were highly conserved as indicated in the results.

Phylogenetic analyses and transmission network construction.
Screening for recombinants among the sequenced viruses were done using GARD (genetic algorithm for recombination detection) analyses 43,44 and the HyPhy package 45,46 available on the Datamonkey webserver and using the HKY85 substitution model as suggested by using the automatic Model Selection Tool available on the site. The GARD technique identifies evidence for recombination breakpoints by searching multiple sequence alignments for phylogenetic incongruences.
To understand phylogenetic relationships among the HPeV3 sequences, they were compared to the NCBI database using nucleotide BLAST (Basic Local Alignment Search Tool) 47,48 and aligned using Clustal-W 49 . Maximum Likelihood phylogenetic analyses were conducted using the MEGA 6 software 50 and the Tamura-Nei model 51 (identical results were obtained using the HKY85 and the HKY + G models, data not shown). The robustness of different nodes was assessed by bootstrap analysis using 1000 replicates. To understand the chains of transmission between obtained samples, sequences were used to construct phylogenetic (transmission) networks using the Median Joining method implemented in the program Network v5 (http//www.fluxus-engineering.com) developed to reconstruct all possible shortest, least complex phylogenetic trees (all maximum parsimony) and depict this as a network. The parameter epsilon was set at the same value as the weight of characters used to calculate the genetic distances (weight value = 10) as described previously 52 . As the virus sequences available for this analysis most likely only represent a minority proportion of HPeV3 infections occurring in the total transmission network, i.e. only represent sequences sampled from severe disease in young infants among a much larger population of infected older children and adults with mild or no clinical disease and thus not sampled, connections in the transmission network are not an indication of direct contact transmission between individual cases. Rather, it should be viewed as cases linked by chains of transmission that include an unknown number of mild or undiagnosed infections, yet still able to indicate temporal evolution and chains of transmission.
Analysis of selection pressure. Transition/transversion bias was estimated using Maximum Likelihood (ML) and Maximum Composite Likelihood (MCL) analysis available in MEGA 6. Site-specific selection pressures were measured as nonsynonymous (dN) -synonymous (dS) nucleotide substitutions per site and estimated using the single-likelihood ancestor counting (SLAC), fixed-effects likelihood (FEL), internal fixed-effects likelihood (IFEL), and random effects likelihood (REL) methods available at the Datamonkey online version of the HyPhy package. All analyses utilized the HKY85 nucleotide substitution model, which was tested as the best fitting model for the data sets, and employed input Neighbor Joining phylogenetic trees. A cut-off p-value to classify a site as positively or negatively selected was set at 0.01 or 0.05 for SLAC, FEL, and IFEL methods and the cut-off value for the Bayes factor in the REL method was set at 100 to reflect a positive or negative selection at a given site as described previously 52 .
Data Availability. All sequences generated have been deposited in GenBank under accession numbers KY556659-KY556676. Other datasets generated or analysed during the current study are available from the corresponding author on reasonable request.