Contamination-controlled high-throughput whole genome sequencing for influenza A viruses using the MiSeq sequencer

Accurate full-length genomic sequences are important for viral phylogenetic studies. We developed a targeted high-throughput whole genome sequencing (HT-WGS) method for influenza A viruses, which utilized an enzymatic cleavage-based approach, the Nextera XT DNA library preparation kit, for library preparation. The entire library preparation workflow was adapted for the Sentosa SX101, a liquid handling platform, to automate this labor-intensive step. As the enzymatic cleavage-based approach generates low coverage reads at both ends of the cleaved products, we corrected this loss of sequencing coverage at the termini by introducing modified primers during the targeted amplification step to generate full-length influenza A sequences with even coverage across the whole genome. Another challenge of targeted HTS is the risk of specimen-to-specimen cross-contamination during the library preparation step that results in the calling of false-positive minority variants. We included an in-run, negative system control to capture contamination reads that may be generated during the liquid handling procedures. The upper limits of 99.99% prediction intervals of the contamination rate were adopted as cut-off values of contamination reads. Here, 148 influenza A/H3N2 samples were sequenced using the HTS protocol and were compared against a Sanger-based sequencing method. Our data showed that the rate of specimen-to-specimen cross-contamination was highly significant in HTS.

have 12 nucleotides at the 3′ terminus and 13 nucleotides at the 5′ terminus that are conserved and complementary, namely Uni-12 and Uni-13, respectively 11 . The inclusion of Tuni-12 (i.e. Sequence complementary to  and Tuni-13 (i.e. Uni-13) sequences allows genome-wide amplification and sequencing for most of the influenza A subtypes 11 , including: A/H1N1/2009, A/H3N2, seasonal A/H1N1, as well as avian influenza A/H7N9 and A/H5N1 subtypes. To achieve even coverage across the entire genome sequence, complete Nextera transposon sequences (Illumina) were added as overhanging regions adjacent to the 5′ -end of Tuni-12 and Tuni-13 sequences to form the first set of PCR primers, i.e. HFAdaptor and HRAdaptor (Fig. 1b). Additionally, to increase The direct integration of the transposon sequences to the Tuni-12 and Tuni-13 primer sequences and inclusion of decoy primers during the initial genome-wide PCR before the Nextera XT library preparation. This increases the read depths at both ends of the amplicons. the assay sensitivity, partial Nextera transposon sequences were added to the 5′ -end of Tuni-12 and Tuni-13 sequences to form decoy PCR primers, i.e. HF and HR (Fig. 1b). The lack of i5 and i7 index PCR priming sites in the partial Nextera transposon sequences prohibit index PCR amplification of the 3′ -and 5′ termini of amplicons generated by the decoy PCR primers. All the primer sequences used in this study are provided in Fig. 1b. A primer titration experiment, performed during the initial experimental set-up, produced an even coverage of reads in each of the gene segments. The shorter 5′ -overhang regions also helped to increase the PCR sensitivity in amplifying the genome sequences, from 10 8 RNA copies/PCR to 10 4 RNA copies/PCR ( Supplementary Fig. 1).
Selection of the best-matching reference sequence. Accuracy and efficiency of HTS read alignments are dependent on the degree of matching of the target sequence to the reference sequence. Selecting the most appropriate reference sequence for read mapping is crucial in virology work, especially for fast-evolving RNA viruses, as the genomic differences can be significant even within similar strain populations 12,13 . This study first performed a de novo contig assembly of the sequencing reads for a sample, followed by mapping of the contigs to a local database of all influenza A virus sequences downloaded from the NCBI Influenza Virus Resource (Last accessed: 25 November 2015). This approach ensured the best-matching reference sequences for subsequent read mapping and variant calling. Also, it provided immediate strain/subtype identification for the virus causing an outbreak. The best-matching reference strain identity can be obtained from the FASTA defline assigned to the FASTA sequence of the best-matching reference.
Another novel feature of our experimental design was the replacement of all degenerate sequences of the best-matching reference gene segments derived above with non-degenerate sequences (i.e. A, T, C, or G), using an in-house developed algorithm. Such replacement ensured the reference nucleotide in the.vcf file generated by the GATK variant caller software [14][15][16] will always be non-degenerate nucleotides. The in-house developed algorithm for generating final consensus sequence for a sample required non-degenerate reference nucleotides from the.vcf file.
Artificial nucleotide replacement of the best-matching reference sequence. The read depths of the entire sequence region were critical for contamination filtering developed in this study. The GATK variant caller used in this study will not produce a variant call when there is no mismatch found in a region against the reference sequence, regardless of the read depth of the sequence. To ensure regional filtering of contamination across the entire gene segment, a nucleotide replacement was performed every 150 th nucleotide starting from the first nucleotide of the best-matching reference sequence, using the in-house developed algorithm. As such, the read depth of the entire gene length will be estimated in regional manner, according to read depths called for the variants introduced artificially. The final consensus sequences for the amplicons were generated based on the best-matching reference sequence, according to the noise-filtered variant calls mentioned below.
Analysis of contamination reads. The HTS protocol was applied initially on 15 influenza A/H3N2 samples, using the Nextera XT DNA Library Prep Kit. Amplicons of the RNA replicase gene of the MS2 bacteriophage (1130 bp) were included as a Negative System Control (NSC) in a separate reaction well in each run, to collect cross-contaminating reads. The full influenza A/H3N2 genomes were successfully obtained for these 15 samples. The NSC should contain only amplicons of the RNA replicase gene of the MS2 bacteriophage. In the event of any cross-contamination by influenza A/H3N2 amplicons from any surrounding well, they would be identifiable by mapping all the NSC reads against the influenza A/H3N2 reference sequence (Fig. 2).
Subsequently, the genome sequencing of 148 influenza A/H3N2 clinical respiratory samples were subjected to the above protocol using a semi-automated library preparation protocol optimized on the Sentosa SX101 liquid handling machine (Vela Diagnostics, Singapore). All the 148 samples were sequenced in a total of 10 runs. The NSCs were placed in different well positions of the 10 library preparation plates (Fig. 2). The viral loads of the 148 samples ranged from 3.5 × 10 2 to 6.6 × 10 7 viral copies/μ L of RNA extract (median = 1.7 × 10 5 viral copies/μ L), quantitated using a clinically validated real-time influenza A/B screening assay 17 (Supplementary Dataset 1).
The highest amount of contaminating reads found for gene segments 1 to 8 in the NSCs included in the 10 separate runs were 159, 76, 188, 88, 139, 74, 706, and 178, respectively. The depth of coverage for the contaminating influenza genome found in the 10 NSCs were surveyed at a fixed interval of 150 nucleotides, starting from the first nucleotide of the reference sequence onwards, for all the 8 different gene segments (Fig. 2). The number of contamination reads (representing the depth of coverage) were plotted against the number of total pass-filter reads in the NSC. The upper limits of 99.99% prediction intervals derived from each regression model of the surveyed depth of contaminating coverage were adopted as cut-off values of contamination reads (Fig. 2). This had the effect of assigning the appropriate cut-off value that is dependent on the number of total pass-filter reads, instead of a universal cut-off. This optimizes the filtering process, where a high total pass-filter reads will be assigned a higher cut-off value, and vice-versa. Of note, a 99.99% prediction interval is an estimate of an interval in which future observations will fall, with a probability of 99.99%, given what has already been observed 18 , i.e. previous sequencing runs in this case.
All base calls with read depths below the cut-off values were considered as contamination reads and excluded from further analysis. The median R 2 values derived from least squares regressions for every 150 th nucleotide position for gene segments 1 to 8 were 0.48, (range: 0. 36 (Table 1). All the sequences were compared against those derived from a Sanger sequencing method 19 , except for segment 1 of the influenza A/Singapore/H2013.901/2013(H3N2) strain and segment 3 of the influenza A/Singapore/H2012.779/2012(H3N2) strain, where the Sanger sequencing failed. The GenBank accession numbers of the genomes and their respective sample viral loads are provided in Supplementary Dataset 1. Notably, for HTS using the MiSeq instrument, gene segments with sequence lengths > 2000 nucleotides (segments 1 to 3) produced significantly fewer numbers of fully concordant sequences compared to gene segments with sequence lengths < 2000 nucleotides (segments 4 to 8; p value < 0.0001) ( Table 1).
This HTS protocol successfully obtained high read-depths covering the entire 5′ -and 3′ -ends of the 8 gene segments of influenza A/H3N2 viruses. However, the read depths per base were significantly different between gene segments. By including 16 samples in a run, the read depths per base reached as high as 119889 reads for the relatively short gene segment 7 (MP) and as low as 3183, 2027, and 2990 reads for the longer gene segments 1 (PB2), 2 (PB1), and 3 (PA), respectively (Fig. 3a). Depth of coverage for all the 148 influenza A/H3N2 viruses included in this study is shown in Supplementary Fig. S3.

Sequence comparison between NGS and Sanger sequencing.
In all, 34 base calls arising from 30 genes were identified as homogeneous (pure) base calls by HTS but were identified as either a different homogeneous (n = 18) or mixed (n = 16) base call by Sanger sequencing (Table 1). Re-examination of the Sanger chromatograms revealed that 15 discordant homogeneous base calls were incorrectly identified by the ATF contig assembler software version 1.0.2.41 (Connexio Genomics, Fremantle, Australia). The remaining 19 discordant base calls (representing 1.7% of all gene segment sequences) from HTS were found mainly near the 5′ -and 3′ -ends of the amplicons. That these discordant calls could also have been due to Sanger sequencing errors cannot be excluded.

Figure 2. Contamination noise analysis using in-run Negative System Controls (NSCs).
Only nonstructural (NS) gene (segment 8) of the influenza A/H3N2 virus is used for illustration purpose. The depth of coverage for the contaminating NS gene found in the NSCs included in the 10 separate runs were recorded at every 150 th nucleotide position starting from the first nucleotide of the NS gene onwards. As shown here, the number of contamination reads at the 451 th nucleotide position was plotted against the number of total pass-filter reads in the NSC. The linear regression model at this nucleotide position of NS gene had the lowest R 2 value (0.24) among all from all 8 genes (ranging from 0.24 to 0.70). The upper limits of 99.99% prediction intervals derived from the regression model of current 451 th nucleotide position were adopted as cut-off value of contamination reads for nucleotide sequence located in the grey-highlighted region. All base calls with read depths below the cut-off values were considered as contamination reads and excluded from further analysis.  In contrast, the HTS found an additional 1,025 mixed base calls that were absent in the results of the Sanger sequencing. Most of these 1,025 discordant mixed base calls belonged to the 3 longer gene sequences: segments 1/PB2 (270 detected from 75 sequences; 26%), 2/PB1 (247 detected from 65 sequences; 24%), and 3/PA (374 detected from 83 sequences; 36%). The majority of the additional minor variants detected in these 3 segments were found at the 5′ -and 3′ -ends of the segments (Fig. 4a).
These variants were mainly found in poor sequencing results that had extremely high reads clustered at the 5′ -and 3′ ends of the 3 segments (Fig. 3b). They accounted for 83% of the 270 PB2 minor variants (224 detected from 43 sequences), 82% of the 247 PB1 minor variants (202 detected from 39 sequences), and 76% of the 347 PA minor variants (283 detected from 38 sequences). The remaining 134 additional mixed calls were found in segments 4 to 8. Overall, most of the additional mixed variants detected by HTS in all gene segments represented minor populations of 5-20% frequencies (Fig. 4b).
Variant quality. All the 15245 variants that passed the contamination filtering (i.e. upper limit of 99.99% prediction interval) had mapping quality of > 40 20,21 , calculated by the GATK UnifiedGenotyper. Of the total variants, 13555 variants (89%) had strand bias FisherStrand (FS, an indicator of strand bias in sequencing) of < 0.001. Of the remaining 1690 variants with FS > 0.001, 99 (0.6%) had FS > 60, which is the recommended threshold from GATK to determine if there is a significant strand bias between forward and reverse strands for the reference or alternate allele 20,21 . The higher the FS value, the more likely there is to be bias. All the 99 variants with FS > 60 were obtained from the 5′ -and 3′ -end regions except 12 (12%) from the middle regions of PB2, PB1, and PA genes. Most of the variants located at 5′ -and 3′ -ends of the PB2, PB1, and PA genes had uneven clustered reads mentioned above (Fig. 3b). Nine of the 12 variants (75%) located at the middle region of the 3 genes had total read depth of < 419, while the remaining 3 (25%) had read depths of 814, 1206, and 1246, respectively.
Native sequencing error rate. To assess the native sequencing error rate, the sequences of the RNA replicase gene of the MS2 virus, amplified from the same sample in ten separate runs, were analyzed. Ten minor variants were consistently detected in the ten runs, with population frequencies ranging between 0.1% and 15.1%. The 10 variant nucleotide positions had total read depths ranging between 53032 and 442361 (median = 202124). They likely represent the true minority variants that are present in the sample. Another 4 minor variants were found only in 1-5 of the 10 runs, with total read depth of < 40. Such read depths may be far too low to represent the native sequencing error. The 4 variants (i.e. nucleotide positions 2, 5, 1124, and 1125) were located < 8 bp away from the 5′ -and 3′ -ends of the amplicons, and were part of the PCR primer sequences. The low read depths at the 5′ -and 3′ -ends of the amplicons were expected as they were sequenced using the original protocol, without the primer modifications for PCR amplification prior to Nextera XT library preparation.

Discussion
Even coverage amplicon sequencing. One of the key improvements of the amplicon-based HTS method described in this study was the successful sequencing of amplicons with even coverage across 5′ -and 3′ -ends of the amplicons, hitherto problematic regions of unacceptable low coverage prior to the modifications introduced. The direct integration of the transposon sequences to the Tuni-12 and Tuni-13 primer sequences during the initial genome-wide PCR negates the need for enzymatic addition of the transposon sequences by the transposase (Fig. 1b). This increases the read depths at both ends of the amplicons.

Statistical exclusion of contamination reads.
During the two-step PCR HTS library preparation of the Nextera XT DNA Library Prep kit, there is a high risk of cross-contamination during the set-up of the second PCR, especially by carry-over of the amplicons from the first PCR due to the high number of amplicons 10 . The amount of cross-contamination may vary due to different number of samples pooled during the library preparation step, and inter-operator variability. A semi-automated HTS library preparation protocol was optimized on the Sentosa SX101 liquid handling machine to control and minimize these errors.
To statistically exclude possible artefactual sequences caused by cross-contamination, an in-run NSC was included in each of the 10 library preparations to collect cross-contamination reads that may occur in different well positions. An upper limit of 99.99% prediction interval was determined based on the contaminating read depths observed in the NSCs of the 10 runs. Upon the initial amplicon dilution performed according to the library preparation protocol, tagmentation by the kit enzyme cleaves the diluted amplicons into fragment lengths of approximately 300 bp. Next, these fragments are subjected to the indexing PCR. All these steps may introduce between-sample contamination. The longest contaminating fragments or full-length amplicons (introduced during sample dilution) should fall within approximately 300 bp and above. Therefore, the artificial nucleotide replacements performed at every 150 th nucleotide starting from the first nucleotide of the reference sequence and subsequently called as a variant by the GATK variant caller should ensure that the contaminating noise was monitored across the whole segment, even though the base calls were initially perfectly matched with those of the reference sequence and may contain unacceptable read depths below the contamination cut-offs. Sequences with a base call (at any position surveyed) containing read depths below the background noise cut-off were excluded from further analysis. As such, the validity of variant calling for all nucleotide positions across the whole genome can be assured, rather than focusing only on certain mutational hotspots.
On a separate note, MS2 contaminating reads appeared to be more frequent in influenza samples that were placed right next to the NSC compared to ones that were placed at the far end from the NSC (data not shown). The entire HTS sequencing and bioinformatics protocol were tested on 8 RNA segments of influenza A/H3N2 viruses with different lengths (ranging between 890 and 2341 nucleotides) and genetic variability (ranging between 1.88 × 10 −3 and 4.84 × 10 −3 nucleotide substitutions/site/year) 22,23 . The substitution rate of 1 × 10 −3 substitutions/site/year is found in many RNA viruses 24 . This study has successfully produced full-length sequences for segments 4 to 8 of influenza A/H3N2 viruses that were highly concordant with the sequences generated by Sanger sequencing method (Table 1). Clustered reads at termini sequences of the PB2, PB1, and PA genes. As with a previous study conducted on primary clinical samples of influenza A/H1N1/2009 pandemic virus 25 , a significant number of primary clinical samples of influenza A/H3N2 virus in this study may have a higher number of defective interfering particles (DIPs), producing clustered reads at 5′ -and 3′ -ends of the PB2, PB1, and PB2 genes (i.e. segments 1 to 3). Previous studies found that the influenza defective interfering RNAs typically retain the termini sequences of the PB2, PB1, and PA genes 25,26 . The relatively much higher HTS reads found for the MP gene (i.e. segment 7, Fig. 3a,b) in all H3N2 samples in this study may also be due to its importance in packaging full-length or truncated genome into viral-like particles, i.e. viable standard viral particles and/or DIPs [27][28][29] .
The clustered reads found at the 5′ and 3′ termini may also be due to sample degradation of the archived extracted RNA samples. All the RNA samples used in this study had been subjected to multiple freeze-and-thaw cycles previously, to carry out 19 PCRs for genome-wide amplification for Sanger sequencing 19 . The long PB2, PB1, and PA gene segments might have broken apart before further degradation. Re-extraction and re-sequencing from the archived raw clinical samples (nasal/nasopharyngeal/throat swabs in universal transport medium) for some of the problematic samples were able to resolve this problem (An example is shown in Supplementary Fig. 4). It is notable that the PA gene of such case did not have clustered reads at the 5′ and 3′ ends of the gene segment, suggesting a higher resistance to degradation in this gene. Lastly, the higher background noise detected at both the 5′ -and 3′ -ends may also be due to higher cross-contamination or higher biases/errors arising from PCR duplicates 30 or errors generated during PCR extensions 31,32 . Strand bias. Similar to a previous study utilizing Nextera XT DNA Library Preparation kit, minimal strand bias was observed in this study 21 . The strand bias generally was limited to low read depth 21 or 3′ -and 5′ -end regions with clustered high read depth of the PB2, PB1, and PA genes.
Briese et al. 33 have recently showed potential genome-wide sequencing using virome capturing probes, without the initial genome-wide PCR prior to the HTS library preparation 33 . This could have efficiently reduced the sequencing artefacts found in this study. However, their work was conducted on a much higher sequencing throughput platform, the HiSeq 2500 (Illumina) instrument. Alternatively, mechanical shearing of the genome-wide amplicons followed by enzymatic ligation of sequencing adaptors should also result in even coverage throughout the whole genome. However, the latter approach would be more labor-intensive and would involve additional open-tube processing steps that may introduce a higher risk of between-samples contamination.
It is critical to understand the reliability of genome sequences generated by HTS methods, and under which circumstances the HTS-derived sequences may be acceptable without independent confirmation by Sanger sequencing 34 . Multiple studies have been performed to compare variant calling between HTS and Sanger sequencing methods [35][36][37][38] . However, few studies have compared whole-genome sequences generated by HTS to those produced by Sanger sequencing 39 . The pairwise comparison of genome sequences derived from current HTS and existing Sanger sequencing methods is complementary in phylogenetic analyses for viral sequences derived separately from both platforms. For instance, the current HTS data resolution for the HA or NA genes can be statistically reduced to match the Sanger sequencing data, i.e. to regenerate final consensus sequences by excluding minor variant lower than 20%, based on the statistics of difference found in Fig. 4b. It is well known that Sanger sequencing has a lower minor variant detection limit of 20% 40,41 .
Finally, it should be noted that this study quantified the overall contamination rate based on the sample processing/handling system of the current laboratory. The statistics of contamination is likely laboratory-dependent. Other laboratories performing HTS of influenza should establish their own thresholds for contaminant reads.
In conclusion, this study provides an estimate of the rate of specimen-to-specimen cross-contamination in clinical HTS testing. Clinical bioinformatics pipelines should incorporate robust contamination detection systems when interpreting HTS sequence data, particularly when dealing with low-frequency variants generated by amplicon-based HTS. This report also represents an attempt to address the requirements of the recent College of American Pathologists' Laboratory Standard for Next-Generation Sequencing Clinical Tests, which stated the preference of giving laboratories performing HTS-based assays the flexibility to determine how a clinical test should be validated, but still addressing common problems that arise in the course of testing 34 .

Ethics statement.
All research studies involving the use of these clinical samples were reviewed and approved by the local institutional ethics review board (National Healthcare Group: B/09/360 and E/09/341). All experiments were performed in accordance with the approved guidelines and regulations. This study only involved the use of archived, residual samples that were collected for routine diagnostic purposes. These samples were anonymized prior to analysis. The ethics committee exempted the requirement for written or verbal consent.

Clinical samples. A total of 148 archived influenza A/H3N2 clinical samples collected from different patients
between May 2009 and November 2013 were selected randomly for this study. All samples were received for diagnostic testing at the National University Hospital in Singapore and were confirmed positive using two clinically validated, in-house, real-time influenza A/B screening 17 and subtyping assays 42,43 . The samples included nasal/ nasopharyngeal/throat swabs collected in universal transport medium, endotracheal tube aspirates, or sputum samples. Virus mini kit v2.0 using the proprietary Bio Robot EZ1 automated platform (Qiagen), according to the manufacturer's instructions. All extracted RNAs were eluted into a final volume of 60 μ L of elution buffer. A similar extraction protocol was performed for the 10 4 -fold diluted MS2 viruses. To confirm the absence of MS2 RNA in the cRNA and/or the extraction kit, an extracted RNA that was not spiked with the MS2 virus was tested using a laboratory-developed real-time RT-PCR assay (Protocol will be provided upon request). The sample tested negative by the MS2 target-specific PCR.

Preparation of MS2 bacteriophage dilution.
Reverse transcription polymerase chain reaction. The genome-wide RT-PCR amplification of the influenza A/H3N2 virus was performed using the Superscript III one-step RT-PCR system with Platinum Taq  DNA library preparation and high-throughput sequencing. The DNA library preparation was performed using the Nextera XT DNA Library Prep Kit (Illumina), according to the manufacturer's protocol. The protocol was later adapted onto the Sentosa SX101 liquid handling machine for automation. Briefly, all PCR products were subjected to tagmentation, a step that enzymatically cleaved the DNA to fragments of approximately 300 bp in length and tagged the cleaved DNA with Illumina adapters. After the tagmentation, the cleaved DNA was subjected to a 12-cycle second PCR for indexing/barcoding with a unique combination of i5 and i7 index primers. After the indexing PCR, the PCR products were purified using Agencourt AMPure XP beads (Beckman Coulter Inc., CA, USA) and normalized using the Nextera XT Library Normalization Beads.
High-throughput sequencing was performed with MiSeq Reagent Kit v2 (300-cycle, Illumina) to generate 10-25 million 150-nucleotide paired-end reads per sequencing run, using the proprietary Illumina MiSeq System. Resequencing application under the Small Genome Sequencing category was selected to generate a sample sheet of the sequencing run, using the proprietary Illumina Experiment Manager. The default resequencing analysis settings were selected, i.e. flagging PCR duplicates and variant quality filter threshold of 30. A chloride wash was performed on the MiSeq sequencer before starting a new run, according to the manufacturer's guide, to ensure minimal carry-over read contamination from the previous run.
Bioinformatics. The bioinformatics analytic workflow was assembled using Python 3.4 and performed in CentOS 6.6 Linux operating system, using a workstation equipped with Dual Intel Xeon E5-2687W v2 3.4Ghz CPUs (16 cores) and 32-GB DDR3-1600 RAM. Briefly, the analytic workflow started with de novo assembly using VICUNA_v1.3 software 44 on paired-end fastq files generated by MiSeq Reporter v2.6 (Illumina) or bcl-2fastq2 Conversion Software v2.17 (Illumina). The sequence contigs were subjected to mapping against a local database consisted of all influenza A virus sequences available in the NCBI Influenza Virus Resource (Last accessed: 25 November 2015) to obtain the most matching reference gene segments, using blastn megablast task of BLAST 2.2.31+ software. Next, all degenerate sequences of the reference gene segments were replaced with non-degenerate sequences (i.e. "R" to "A", "Y" to "C", "S" to "G", "W" to "A", "K" to "G", "M" to "A", "B" to "C", "D" to "A", "H" to "A", "V" to "A", and "N" to "A"). Subsequently, each gene segment was annotated with a different nucleotide at every 150 th nucleotide position, starting from the first nucleotide of the sequence (i.e. "A" to "T" and "T/G/C" to "A").
Paired-end alignments were performed using the aln and sampe modules of the BWA-0.6.2 software 45 , against the influenza A/H3N2 reference genome, followed by variant annotation using the UnifiedGenotyper module of GATK-3.4-46 [14][15][16] . Reads or fragment of reads with quality score below 15 were trimmed during bwa alignment. A minimal phred-scaled confidence threshold of 10 at which variants should be called by the GATK was used. Instead of using the quality score calculated by the GATK, the upper limit of 99.99% prediction interval of contamination reads was used as the cut-off for calling an absolute or mixture base. A detailed description of the workflow and program scripts can be downloaded from http://hkailee.github.io/FluSeq. Exclusion of erroneous variant calling that may be due to contamination. In-house developed python scripts were used to filter the variant calls by considering an upper limit of 99.99% prediction interval Scientific RepoRts | 6:33318 | DOI: 10.1038/srep33318 of background/contaminating reads. The upper limit for 99.99% prediction intervals for the average value of Y 0 (expected contamination/background reads) for a given X 0 (total amount of pass filter reads for a sample), is Pairwise comparison with Sanger sequencing results. The generated consensus sequences were compared against Sanger-generated sequences. We have previously published the genome sequences of the 148 H3N2 primary clinical samples derived from our Sanger sequencing method 19 . Most of the influenza genome was covered bi-directionally by Sanger sequencing reads with high quality values to ensure accurate base-calling 19 . Forward and/or reverse sequencing were repeated for ambiguous mixed base results (i.e. the second highest chromatography signal peak height (minor variant) lower than 25% of the highest peak height (major variant)), which were either present in only forward or reverse sequencing read or the ambiguous region was only covered uni-directionally.