A universal primer-independent next-generation sequencing approach for investigations of norovirus outbreaks and novel variants

Norovirus (NoV) is the most common cause of non-bacterial gastroenteritis and is a major agent associated with outbreaks of gastroenteritis. Conventional molecular genotyping analysis of NoV, used for the identification of transmission routes, relies on standard typing methods (STM) by Sanger-sequencing of only a limited part of the NoV genome, which could lead to wrong conclusions. Here, we combined a NoV capture method with next generation sequencing (NGS), which increased the proportion of norovirus reads by ~40 fold compared to NGS without prior capture. Of 15 NoV samples from 6 single-genotype outbreaks, near full-genome coverage (>90%) was obtained from 9 samples. Fourteen polymerase (RdRp) and 15 capsid (cap) genotypes were identified compared to 12 and 13 for the STM, respectively. Analysis of 9 samples from two mixed-genotype outbreaks identified 6 RdRp and 6 cap genotypes (two at >90% NoV genome coverage) compared to 4 and 2 for the STM, respectively. Furthermore, complete or partial sequences from the P2 hypervariable region were obtained from 7 of 8 outbreaks and a new NoV recombinant was identified. This approach could therefore strengthen outbreak investigations and could be applied to other important viruses in stool samples such as hepatitis A and enterovirus.


Results
Using NGS directly on samples. Despite a large sequencing depth allocated to each sample (1.5 to 5.5 million reads), only a relatively small proportion of the obtained reads were of NoV origin (on average: 0.25%; corresponding to ~700 to ~22,000 reads).

Evaluation of the poly(A)-capture technique.
To specifically enrich for NoV RNA and reduce the amount of non-polyadenylated bacterial RNA, a poly(A)-capture method was employed after nucleic acid extraction. To evaluate this enrichment strategy, NoV viral load was measured in 6 GGI and 3 GGII quantitated survey samples were (called QS1 to QS9, See Materials and methods and Table 1) along with five non-quantified survey samples (called S1 to S5, See Table 1 and Materials and Methods). All samples were split after RNA extraction with only one part subjected to poly(A)-capture. SMARTer libraries were constructed from both extracted parts and subjected to MiSeq sequencing simultaneously. The efficiency (Table 1) was evaluated by measuring the proportion of reads mapping to full genome sequences from the common human gut bacterial species Bacterioides uniformis and Ruminococcus bromii L2 + 63 31 or from a set of 16 sRNA sequences identified in human microbiome studies 32,33 . Poly(A)-capture increased the proportion of obtained NoV reads over the entire range of NoV input RNA copies (Log 10 1,89 to 6,82; see Table 1 and Fig. 1), despite some variation for especially samples with low numbers of input NoV RNA copies. While the proportion of bacterial reads was reduced by 0.28 to 0.41 fold, the number of NoV reads increased by on average 45.1 ± 27.77 -fold. Although the average Ct value decreased by 0.96 ± 0.07-fold after poly(A)-capture, the poly(A)-captured NoV was also eluted in only one fifth of the suspension volume used before poly(A)-capture. The average percentages of reads from the non poly(A)-captured survey samples mapping to three approximately equally-sized parts of the NoV reference genome sequences were: 1 st part (genome-position: 1-2499): 32.2% (±17.7%), 2 nd part (genome-position: 2500-4997): 53.3% (±12.8%) and 3 rd part (genome-position: 4998-7496): 14.5% (±8.9%), while the average percentages of NoV reads from the poly(A)-captured survey samples mapping to these regions were: 1 st part: 10.1% (±3.2%), 2 nd part: 43.5% (±8%) and 3 rd part: 46.4% (±8.9%).
Outbreak analysis. Samples from all eight outbreaks were subjected to the poly(A)-capture method and SMARTer library construction. A general linear trend was observed between the Ct values measured after poly(A)-capture and the number of reads obtained (Fig. 2), although a few samples deviated from this trend by containing a higher than expected number of NoV reads per million reads. Although full-genome coverage (>99%) was observed at ~4,800 NoV reads in total, equivalent to an average coverage of ~80 per sample (Fig. 3), sufficient sequence quality along the entire genome was only observed above ~11,000 reads with an average coverage of ~260.
Assigning genotypes to outbreak samples. The first level of sequence comparison in an outbreak is the comparison of genotypes obtained from different persons in the outbreak. Complete NoV genotyping relies on sufficient sequence coverage in two regions: ORF1 (RdBp/pol) and ORF2 (Cap) for complete genotyping. Using the NGS approach, 14 complete and one partial genotype were detected in 15 samples from 6 of the 8 outbreaks (see Table 2) containing a single NoV genotype compared with 10 complete and five partial genotypes detected with the STM approach.
Since it had been demonstrated by Real time PCR and STM that two NoV genogroups and several genotypes were involved in Ob-4 and Ob-6, HMM searches for additional genotypes was performed on de novo assembled contigs (See Materials and Methods). From sample Ob-4-1, 10,765 and from sample Ob-4-2 15,391 de novo assemblies were generated, of which 11 and 107 were identified as norovirus assemblies by the HMM search respectively. The candidate NoV contigs were further investigated by BLASTN and genotyping of the contigs, with subsequent reference based mapping which confirmed the presence of the following genotypes in the two samples: Ob-4-1: GII.Pg_GII.1 and GII.4_Sydney, Ob-4-2: GI.Pb_GI.6 and GII.7P_GII.6 ( Table 2). Due to insufficient reads mapping to the GI.Pb_GI.6 reference in sample Ob-4-1, a valid phylogenetic comparison could not be performed, although a BLASTN of the consensus sequence generated from the 22 mapped reads indicated this Scientific RepoRts | 7: 813 | DOI:10.1038/s41598-017-00926-x to be GI.Pb_GI.6 as well. The HMM analysis also detected genotype GII.Pg_GII.1 in the Ob-4-1 sample which was not identified by STM.
From Outbreak 6, the following numbers of HMM hits were obtained out of the total number of de novo assembled contigs: Ob-6-1: 0 of 18,361, Ob-6-2: 2 of 1,928, Ob-6-3: 11 of 18,973, Ob-6-4: 6 of 34,959, Ob-6-5: 5 of 37,507, Ob-6-6: 4 of 14,250 and Ob-6-7: 1 of 43,308. Following the same procedure as described for Ob-4 lead to the identification of GII.P7_GII.6 and GI.P3_GI.3 in all seven samples. The GII.P7_GII.6 genotype was supported by a large number of reads in all samples except Ob-6-3 and could be compared phylogenetically, whereas the GI.P3_GI.3 genotype was only supported by a low number of reads in all samples. This also suggested that the two NoV genotypes were present in all the samples at variable concentrations.
Overall, support for the hypothesis of a common infection source by shared pol and cap genotypes for at least one genotype and in at least two different persons from the outbreak was obtained for seven of eight outbreaks with the NGS method and for five of eight outbreaks with the STM method. Similar genotypes were observed in  Table 1. Summary of quantification, sequencing, mapping and genotyping results of outbreak samples before and after poly(A)-capture. Legend: Column 1: Sample name (S: Survey sample, QS: Quantitative Survey sample, data for the sample both before and after poly(A) capture is shown), Column 2: Genotype identified in the sample by STM and used here as a reference sequence, Column 3: Total number of input NoV RNA copies (log 10 ) used as input to poly(A) capture and library construction, Column 4: Ct value, Column 5: Total number of reads obtained before QC and trimming, Column 6: Total number of reads after quality trimming and filtering, Column 7: Number of NoV reads mapped to reference sequence indicated in Column 2, Column 8: NoV reads expressed per million reads in total, Column 9: NoV reads expressed per million reads in total, Column 10: B.uniformis reads expressed per million reads in total, Column 11: R. bromii uniformis reads expressed per million reads in total, Column 12: 16 sRNA reads expressed per million reads in total. The two rows at the bottom of the figure shows the calculated average fold change and standard deviation for all relevant measurements before and after poly(A) capture.  Fig. 4 for the individual outbreaks. The comparison included either the complete hypervariable P2 region (P2 region is 456 to 483 nt depending on genotype) for Ob-1, Ob-6, Ob-7 and Ob-8, and partial P2 region comparisons for Ob-2 (471 nt), Ob-3 (243 nt) and Ob-5 (291 nt). The phylogenetic analysis revealed that the NoV involved in Ob-3, Ob-5 and Ob-7 were 100% identical (Fig. 4f,h and l), whereas Relationship between the number of NoV RNA copies used as input and the obtained number of NoV reads before and after poly(A) capture NoV input was quantified with real time PCR and GGI and GGI standards and the total number of NoV genome copies used as input was calculated as NoV RNA copies (log 10 ) and shown on the X-axis. The number of NoV reads obtained per million reads is shown on the Y axis (log 10 scale). For each sample, the obtained NoV reads per million reads are shown both with poly(A) capture (filled squares) and without poly(A) capture (filled circles).  differences in the NoV genomes (Fig. 4b,d,j and o) were observed for: Ob-1 (1 nt difference; two samples had an A-residue at reference sequence position 3321 while three samples had a G-residue), Ob-2 (3 nt differences in the P2 region), Ob-6 (1 nt difference in ORF3 and several differences in the 3′ non-coding A rich part of the genome), Ob-8 (1 nt difference in the P2 region).
Identifying a new recombinant. When reads from the two samples from Ob-8 were mapped to the two reference sequences known to be present from the initial partial genotyping, a mutually exclusive distribution of reads was observed ( Fig. 5a and b). In addition, reads that spanned the ORF1/ORF2 junction of an in silico  generated reference sequence were observed (Fig. 5c), confirming that both these samples harbored a novel GII. P16_GII.4_Sydney recombinant.

Discussion
The use of poly(A)-capture significantly enhanced the number of norovirus reads obtained from stool samples, allowing comparisons of full or near full (>85%) genome sequences from 4 outbreaks and partial genome comparisons in 3 outbreaks. In total, 14 complete and one partial genotype were detected in the 15 samples from the 6 outbreaks containing a single NoV genotype compared with 10 complete and five partial genotypes detected with the STM approach. In addition, additional genotypes (partial or complete) were identified with the NGS approach in the two mixed-genotype outbreaks samples (Ob4 and Ob6). STM generated more genotype information than NGS in four cases. In these cases, either none or a low number of NoV reads of mapped to the specific genotype, however none of the reads mapped to the ORF1 (pol) or ORF2 (cap) genotyping-regions. This showed that although the NGS method overall improved the genotyping results, some samples might be challenging due to low amounts of available virus RNA in combination with the random distribution of reads obtained. One way to reduce this problem would be to allocate a larger sequencing depth for especially samples with low amounts of virus. The NGS derived consensus sequences used for phylogenetic comparison ranged from 1226 to 7692 nt (average 4800 nt) and included either the complete or a substantial proportion of the hypervariable P2 region. In comparison, STM only covers ~9% of the genome and does not include the P2 region. Therefore, even in the three outbreaks, in which only partial genomes were recovered, the data were found to significantly improve the molecular resolution of outbreaks.
Interestingly, minor nucleotide variations between sequences from different samples from three of the outbreaks were observed. Two of these differences were mapped to the P2 region, known to be highly variable 12, 34-36 and a single nucleotide difference was observed between two groups of samples from a single epidemiologically linked outbreak (Ob-2). This challenge the 100% identity-paradigm used in general NoV outbreak investigations 12, 36 that normally distinguishes only between identical and non-identical strains. Other studies have also questioned if these strict criteria should be maintained 37 , when comparing larger parts of the NoV genome.
NoV bioaccumulation in or adhesion to food items such as oysters and lettuce generates complex outbreak profiles including several genotypes 38,39 , which require separate RT-PCR amplification steps if STM are used 39 . In this study, six NoV genotypes were identified in samples from two mixed outbreaks, three of which was supported by high genomic coverages (66% to 99% of the entire NoV genome). HMM improved the detection of genotypes by identifying a genotype (GII.Pg_GII.1) missed using STM. Although phylogenetic comparisons could not be performed for all genotypes due to varied sequence coverage of some genotypes in the samples, greater sequencing depth may circumvent this problem in future analysis. Interestingly, a mutually exclusive presence of genotypes was observed for three of the four genotypes identified in the two samples from Ob-4 and different relative abundances of the two genotypes found in Ob-6 was found for sample Ob-6-3 compared with the other to 5239 of JX459908 (the junction is marked with a "J"); reads were subsequently mapped to this reference sequence to identify junction-spanning reads.
Scientific RepoRts | 7: 813 | DOI:10.1038/s41598-017-00926-x samples. This could indicate differences in host exposure and/or susceptibility to different NoV genotypes in complex outbreaks.
A near-complete genome sequence of (>90%) a new GII.P16_GII.4_Sydney recombinant NoV was directly confirmed from the NGS data by using reads spanning the ORF1/ORF2 junction of the two different genotypes, showing that NGS can be used to distinguish between co-infection with different genotypes and new emergent recombinants.
This study was performed retrospectively on samples stored at −20 °C and previously analyzed by STM where samples had all been freeze-thawed at least twice, which may have resulted in some degree of degradation of the NoV. Five samples were excluded after poly(A)-capture, as a large increase (>5) in Ct values were observed, indicating fragmentation of NoV RNA. Therefore, for future applications of the present method, it will be of great importance to retain NoV RNA integrity until library preparation.
We have introduced a novel NoV enrichment NGS-based approach to investigate foodborne outbreaks without discriminating between genotypes. This method can be used directly to enrich other clinically important viruses in stool such as enteroviruses, or other positive-sense RNA viruses with a polyadenylated 3′ tail. Although the poly (A)-capture lead to a 3′ bias in sequencing depth, it allowed for a significant enrichment of NoV reads obtained from the samples. Future studies are required to test the efficiency of enrichment from other specimen types. Although the likelihood of obtaining complete NoV genomes is strongly dependent on NoV concentration in the sample, deeper sequencing would likely allow for retrieval of more NoV reads even in more scarce NoV samples. With common access to benchtop sequencers, we anticipate that NGS will soon become a definitive, non-discriminatory tool for viral infection control and serve to monitor both the evolution and spread of genotypes and enhance viral outbreak investigations.

Materials and Methods
Ethics statement. According to the "Danish Act on Research Ethics Review of Health Research Projects" this study does not require approval by the ethics committees, as it is considered a quality development/control project and does not analyze human sequences. This was confirmed by the Committees on Health Research Ethics for the Capital Region of Denmark in a specific waiver of approval (H-16019654).
Sample material. Twenty-four NoV positive samples from eight different foodborne outbreaks (termed Ob1 to Ob8) were analyzed (Table 3). Five survey samples (termed S1 to S5) and 9 quantitative survey samples (termed QS1 to QS9) were analyzed with or without poly(A)-capture to assess the efficiency of this method. Five samples where Ct values increased >5 after poly(A)-capture vs. before were excluded from NGS analysis as they were considered to be too degraded.
Extraction of nucleic acids, poly(A) capture, real-time RT-PCR and norovirus typing. Nucleic acids were extracted from 10% stool suspensions (kept at −20 °C) using the MagNA Pure LC (Roche Diagnostics); poly(A)-capture was performed using a Dynabeads mRNA Purification Kit (Ambion Cat. No. 61006) according to the manufacturer's instructions with modifications to use 100 µL input material and 26 µL Dynabeads. The concentration of nucleic acids was measured using 1 µL extract on a NanoDrop 1000 Spectrophotometer (NanoDrop Technologies). The presence of NoV Genogroup I and II was assessed using real-time multiplex PCR 40 and genotyping was performed as described previously 40, 41 . Quantification of NoV RNA. A quantitative NoV GGI standard was obtained from ATCC (Quantitative Synthetic Norovirus G1 (I) RNA (ATCC ® VR3234SD ™ ; specification range (log 10 ) 5-6 RNA copies/µL, of which the lower end range was used for the calculations. In addition, a previously published NoV GGII standard 42 was obtained from collaborators at the Danish Technical University at a confirmed concentration of 5.19 (log 10 ) ± 4.80 (log 10 ) RNA copies/µL. Both standards were diluted in a fivefold 1:10 dilution series and analyzed in triplicates in the real time multiplex PCR (described above) alongside 9 NoV Quantiative Survey samples (QS1 to QS9; all both with and without poly(A)-capture). Analysis of real time data was performed in MxPro Mx3005 P v4.10,

Outbreak (Ob)
Month and year   resulting in the following standard curves for GGI and GGII respectively: Y = −3.047xLOG(X) + 41.74; R 2 : 0.994 and Y = −3.090xLOG(X) + 41.18; R 2 : 0.971. Calculations of the amount of NoV genomes used as input in the extraction/capture and NGS analyses were also performed in MxPro.

Preparation of samples for Illumina MiSeq sequencing.
Single-indexed cDNA libraries were generated using the SMARTer Stranded RNA-Seq Kit (Clontech Inc.) in accordance with the manufacturer's instructions. Fluorescent measurement of DNA concentrations in each library was performed using Qubit dsDNA BR and ssDNA assay kit (Thermo Fischer Scientific).
Quality trimming and filtering. Sequences were imported into CLCbio's Genomics Workbench (v. 8.5) with the removal of failed reads. Quality trimming within the workbench was performed using both a modified Mott trimming algorithm implemented (limit = 0.5) and by trimming reads containing more than two ambiguous nucleotides. Human sequence reads were removed by alignment to the homo sapiens hg19 reference genome (similarity fraction = 0.8).