Efficient and unbiased metagenomic recovery of RNA virus genomes from human plasma samples

RNA viruses cause significant human pathology and are responsible for the majority of emerging zoonoses. Mainstream diagnostic assays are challenged by their intrinsic diversity, leading to false negatives and incomplete characterisation. New sequencing techniques are expanding our ability to agnostically interrogate nucleic acids within diverse sample types, but in the clinical setting are limited by overwhelming host material and ultra-low target frequency. Through selective host RNA depletion and compensatory protocol adjustments for ultra-low RNA inputs, we are able to detect three major blood-borne RNA viruses – HIV, HCV and HEV. We recovered complete genomes and up to 43% of the genome from samples with viral loads of 10^4 and 10^3 IU/ml respectively. Additionally, we demonstrated the utility of this method in detecting and characterising members of diverse RNA virus families within a human plasma background, some present at very low levels. By applying this method to a patient sample series, we have simultaneously determined the full genome of both a novel subtype of HCV genotype 6, and a co-infecting human pegivirus. This method builds upon earlier RNA metagenomic techniques and can play an important role in the surveillance and diagnostics of blood-borne viruses.


Sample sets. Blood-Borne Virus (BBV) Panel.
A complex host-enriched sample was prepared by diluting stored plasma from four samples, previously characterized by routine diagnostic testing as containing HCV (x2), HIV and HEV (see Table 1 for details), in negative human plasma (NHP; negative for each of HIV, HCV and HEV). NHP was obtained by centrifuging negative human blood for 10 minutes at 500 × g to remove cell debris. The final concentration of each virus in the primary panel sample was 10^6 IU/ml (copies/ml for HIV; henceforth implied by IU for convenience), and three serial tenfold dilutions in NHP were prepared from this stock.
Virus Multiplex Reference (VMR) Panel. A reagent comprising a suspension in PBS of 18 RNA viruses with different genomic and structural characteristics was provided by the National Institute for Biological Standards and Control (NIBSC, Potters Bar, UK). The viral components and their approximate relative concentrations are given in Mee et al. 42 . Prior to extraction, the panel was mixed 1:1 with NHP. Duplicate 400 μl extractions were performed  44 , as was genotyping of the HCV 43 and HEV 68
Clinical samples of indeterminate HCV genotype. Four plasma samples collected from a patient between 2014 and 2016 were submitted to Public Health England (PHE) for metagenomic analysis as previous genotyping results had been inconsistent. The most recent such test employed NS5b sequencing 43 , and reported the presence of a virus belonging to genotype 6 but was unable to resolve the subtype with any further precision.
RNA extraction and quantification. Before extraction, all samples were centrifuged for 10 min at 2,500 × g to remove cell debris. Triplicate, duplicate and single extractions were performed on the diluted VMR Panel samples (referred to as 'VMR Panel A/B/PBS'), the BBV Panel samples ('10^6-A/B' to '10^3-A/B'), and the patient sample series, respectively. A negative control comprising 200 μl of the same plasma used to dilute the panels was also extracted.
The SPLIT RNA extraction kit (Lexogen) was used to extract 200 μl of each sample input, according to the manufacturer's instructions. Acidic phenol was used to preferentially recover the large RNA fraction, which was eluted in 12 μl of nuclease-free water. RNA eluates were quantified using the Qubit RNA HS Assay Kit (Thermo Fisher Scientific), which is accurate for concentrations between 250 pg/μl and 100 ng/μl.
Depletion of ribosomal RNA and DNA digestion. Ribosomal RNA depletion and DNA digestion were achieved using the RiboErase kit (KAPA Biosystems). As all sample extracts were below the detection limit of the Qubit quantification system, the total RNA input was less than the recommended 100 ng. The manufacturer's specifications were followed with two exceptions: the entire 10 μl of the extract was used, and, after the DNA digestion reaction clean-up, the residual RNA was eluted in 10 μl of nuclease-free water.
In the case of the BBV Panel, two of the three extracts of each dilution were treated with the RiboErase kit before RNA library preparation. The third set of extracts remained untreated and was used to monitor the effect of the rRNA depletion and DNA digestion upon the subsequent library preparation and sequence analysis. In the case of the VMR Panel (the two duplicates) and the negative control, rRNA and DNA depletion was performed on all extracts. In the case of the uncharacterized HCV strain, extracts from all four samples were treated with RiboErase. An additional, untreated, extract of sample 4 was included, again to monitor the process.
RNA library preparation with ultra-low RNA input. Libraries were constructed from 10 μl of extracted RNA or 10 μl of rRNA-depleted DNAse-digested RNA, using the NEBNext Ultra Directional RNA Library Prep Kit (New England Biolabs). As the protocol is designed to use a minimum RNA input of 10 ng, several modifications were made to adapt it to an ultralow RNA input. These are listed in Table 2. Libraries were analysed for size distribution using the High Sensitivity DNA Kit (Agilent) on a 2100 Bioanalyser Instrument, and were quantified using the KAPA SYBR FAST Universal qPCR Kit for Illumina libraries (KAPA Biosystems) on a 7500 Real-Time PCR System (Applied Biosystems).

qPCR.
To determine the relative abundances of viral inserts, libraries constructed from the BBV Panel were analysed by qPCR with primers and probes targeting each of the three viral components (Refs 44-46 and Supplementary Table S1). Reactions were performed using the Quantitect Virus Kit (Qiagen) according to the manufacturer's instructions.

Sequencing.
Libraries labelled with different indexes were diluted to 2 nM and pooled. Sequencing was performed on an Illumina MiSeq instrument using the MiSeq Reagent Kit V2 (300 cycles) (Illumina) according to the manufacturer's guidelines, with the following minor modifications. The library pools were denatured with 0.2 N sodium hydroxide for 2 minutes rather than 5, diluted in kit reagent HT1 to produce a 20 pM solution and then these were further diluted to 11 pM. Of this library pool dilution, 600 μl were loaded onto the MiSeq cartridge.
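The final loading dilution described above follows the usual C1 × V1 = C2 × V2 relationship. The short sketch below works through the 20 pM to 11 pM step; the function name and the assumption that exactly 600 μl of the 11 pM dilution is prepared are illustrative only and are not taken from the protocol.

```python
def dilution_volumes(stock_conc, target_conc, final_volume):
    """Return (stock_volume, diluent_volume) for a C1*V1 = C2*V2 dilution.

    Both concentrations must be in the same unit; the volumes are returned
    in the unit of final_volume.
    """
    if target_conc > stock_conc:
        raise ValueError("target concentration exceeds stock concentration")
    stock_volume = final_volume * target_conc / stock_conc
    return stock_volume, final_volume - stock_volume


# Assumed scenario: a 20 pM denatured pool diluted to 11 pM in a 600 ul volume.
pool_ul, ht1_ul = dilution_volumes(20.0, 11.0, 600.0)
print(f"{pool_ul:.0f} ul of 20 pM pool + {ht1_ul:.0f} ul of HT1")
```

In practice a slightly larger volume would normally be prepared to allow for pipetting losses.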
Data analysis. All paired end FASTQ files were processed with Trimmomatic v0.30, removing the Illumina adaptor sequences, then trimming leading and trailing bases with phred scores below 20. Reads were discarded where the length of either trimmed end was below 50 bases.
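The trimming and pair-filtering rules stated above can be illustrated with a minimal sketch. It is not Trimmomatic's implementation and omits adaptor clipping; the function names are hypothetical, and reads are represented simply as lists of phred scores.

```python
def trim_ends(quals, threshold=20):
    """Return (start, end) slice bounds after removing leading and trailing
    bases whose phred score is below the threshold."""
    start, end = 0, len(quals)
    while start < end and quals[start] < threshold:
        start += 1
    while end > start and quals[end - 1] < threshold:
        end -= 1
    return start, end


def keep_pair(quals_r1, quals_r2, threshold=20, min_len=50):
    """Keep a read pair only if both trimmed mates are at least min_len long."""
    for quals in (quals_r1, quals_r2):
        start, end = trim_ends(quals, threshold)
        if end - start < min_len:
            return False
    return True
```

Only the ends are trimmed here: a low-quality base flanked by high-quality bases inside the read is left in place, matching the leading/trailing rule described in the text.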
For the determination of genome sequences of blood-borne viruses, trimmed FASTQ sets were normalised using the normalize-by-median.py script in the Khmer package (k = 31, C = 5) 47 and submitted to the SPAdes de novo assembler 48 without error-correction, applying the default kmer sizes of 21, 33, and 55. Output contigs that matched each virus were identified with the nhmmer function of the HMMER v3.1b2 package 49 using hidden Markov models (HMMs) built from alignments of each virus (detailed in Supplementary Table S2). Where necessary, the ends of contigs were trimmed to the whole genome alignment. BWA MEM (v0.7.5a, default parameters) 50 was used to map the original trimmed FASTQs to the genome sequence, and the SAM files were converted to BAM files using samtools v0.1.19 51 while discarding reads with either or both of the 0x04 and 0x08 flags set (i.e. retaining only fully-mapped paired-end reads). Base frequencies at each nucleotide position within each component virus sequence were obtained from BAM files using QuasiBAM v2.2, an in-house C++ program that also generates consensus sequences based upon user-defined depth and variant percentages 52 .
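The flag-based filter described above relies on two bits of the SAM FLAG field: 0x4 (read unmapped) and 0x8 (mate unmapped). The sketch below shows the bitwise test on plain SAM text lines; the function names are illustrative, and the study itself applied the filter with samtools rather than with code of this kind.

```python
READ_UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped
MATE_UNMAPPED = 0x8  # SAM FLAG bit: the mate of this read is unmapped


def is_fully_mapped_pair(flag):
    """True when neither the read nor its mate is flagged as unmapped."""
    return (flag & (READ_UNMAPPED | MATE_UNMAPPED)) == 0


def filter_sam_lines(lines):
    """Yield header lines unchanged and only those alignment lines whose
    FLAG (second tab-separated column) passes is_fully_mapped_pair."""
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        flag = int(line.split("\t", 2)[1])
        if is_fully_mapped_pair(flag):
            yield line
```

Combining the two bits gives the mask value 12, so a record is retained only when its FLAG shares no bits with 12.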
Mapping of trimmed paired-end FASTQ to one or more virus reference genomes was also performed using BWA MEM 0.7.5a. In each case, two independent mappings were performed, using as a reference the viral sequences, supplemented firstly by the March 2009 'GRCh37' release of the human genome, and secondly by a set of human rRNA sequences (NR_003286.1, NR_003287.1, V00589.1, NR_003285.2, gi|251831106:648-1601, and gi|251831106:1671-3229, as per Malboeuf et al. 38 ). The second file was used solely to derive counts for reads mapping to rRNA, which would otherwise be subsumed into the human genome mapping results. From the filtered SAM files, the numbers of reads mapping to each reference sequence were counted. Counts for each of the constituent sequences of the human genome and rRNA were pooled into a "human" count and an "rRNA" count. QuasiBAM was used to derive nucleotide frequencies from which depth and coverage data were calculated. A minimum depth of 10 was required for inclusion in a derived consensus sequence for the BBV Panel (1 for the VMR Panel).
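As an illustration of how genome coverage and median depth can be derived from per-position depths, consider the sketch below. It is not QuasiBAM; the function name is hypothetical, and taking the median over all reference positions (rather than covered positions only) is an assumption.

```python
from statistics import median


def coverage_and_median_depth(depths, min_depth=10):
    """Summarise per-position read depths across one reference genome.

    Returns (percentage of positions with depth >= min_depth,
             median depth over all positions).
    """
    if not depths:
        raise ValueError("depths must contain at least one position")
    covered = sum(1 for d in depths if d >= min_depth)
    return 100.0 * covered / len(depths), median(depths)
```

Calling the function with min_depth=10 or min_depth=1 corresponds to the thresholds quoted for the BBV Panel and the VMR Panel respectively.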
BBV and VMR Panel sequences. The members of the multi-FASTA reference file for the BBV Panel were obtained by submitting FASTQ sets from the rRNA-depleted sample with the highest virus concentrations to the SPAdes-HMMER-mapping approach described in the previous paragraph. VMR Panel references were derived from sequences obtained from GenBank using accession numbers from Mee et al. 42 . Additionally, the complete genome sequence of a human pegivirus (HPgV) present in the plasma diluent was discovered in the SPAdes contigs file. An HMM profile was constructed from an alignment of GenBank sequences (Supplementary Table S2).
Sample with uncharacterised HCV. To obtain full-length HCV genomes, each FASTQ set was submitted to the SPAdes-HMMER-mapping process. Where a complete genome was not obtained, HCV-matching contigs were aligned to the full-length genomes using MEGA5 53 . In addition, contigs with length > 5 kb that did not align to the HMM profile were submitted to BLAST 54 for identification. Following this analysis, an additional pegivirus genome was derived in similar fashion to the HCV genomes, using the same HMM profile as above for the NHP pegivirus. When calculating the read percentages and coverage plots, both sample-derived full-length genome sequences (HCV and HPgV) were used as the reference sequence when mapping that sample's corresponding trimmed paired-end FASTQs, as well as where only incomplete HCV genomes were obtained.

Determination of blood-borne virus genomes from complete human plasma. A Blood-Borne Virus (BBV) Panel was prepared, comprising two strains of HCV (genotypes 1a and 1b), and one each of HIV and HEV diluted in NHP to 10^6 IU/ml. Three tenfold serial dilutions in plasma were made from this original Panel.
Ribosomal RNA depletion was performed on two of each set of triplicate extractions prior to all three being subjected to the modified library preparation protocol. Data from the most concentrated rRNA-depleted samples were used to generate individual virus genome sequences for use in reference mapping. During this data analysis, an unexpected human pegivirus (HPgV) was found and traced to the NHP diluent. The full genome sequence of this HPgV was determined from the 10^3-B data and included in the mapping references. Table 3 gives the read counts, genome coverages and median depths for each virus-dilution combination, across each of the three samples per dilution (10^6 to 10^3, untreated/A/B). Each test sample yielded over 800,000 reads with the exception of 10^3-A, which gave just over 140,000 reads. With the exception of the two 10^6 samples, in which only a very small volume of NHP was added to the clinical samples, the percentage of reads mapping to the HPgV remained relatively constant at 29-39%. A further exception is 10^5-B, in which the overall viral read percentage was lower than expected, with a corresponding elevation in reads mapping to the human genome suggesting possible incomplete DNAse digestion during the rRNA depletion step (Supplementary Table S3).
With increasing dilution, the total viral read percentages (excluding HPgV) declined from over 60% to 0.23%. Complete and near-complete genome coverages with depths greater than 10 were achieved at 10^6 and 10^5 IU/ml for all four viruses. A few short regions in HIV had low depth (<10) in 10^5-B, reflecting the reduced overall viral reads in this dataset, but at a minimum depth of 1, 99.6% coverage was achieved with this sample, with only a 32-base sequence in Pol having no coverage. At 10^4 IU/ml, HCV 1a and HEV continued to give 98.1-99.7% coverage with median depths over 120. HCV 1b and HIV gave 91.2-93.5% coverage (82-95 median depth) and 55.0-90.2% coverage (12-83 median depth) respectively, and in 10^3-B, despite only 0.23% of all reads mapping to the four viruses collectively, genome coverages of 18.1-72.5% were achieved, with median depths up to 29. Figure 1 illustrates coverages and depths across each target genome at each dilution, showing even distributions of reads across all four target genomes and HPgV. Pooling duplicates consistently improved coverages (final column, Table 3). This is most clearly seen at the lower viral loads: at 10^4 IU/ml, three of the four viruses achieved combined coverages of >99.4% each, and HIV achieved 93.5%. At 10^3 IU/ml, the combined coverages for the four viruses are effectively what would be expected were the individual coverages independent, i.e. $\mathrm{cov}_{A \cup B} = \mathrm{cov}_A + \mathrm{cov}_B - \mathrm{cov}_A \times \mathrm{cov}_B$.
Depletion of rRNA substantially enhances the recovery of blood-borne virus sequences. The percentage of reads mapping to RNA virus genomes in the rRNA-depleted BBV Panel samples was between 40- and 150-fold higher than in corresponding untreated controls. Individual target virus ratios decreased as they became more dilute, from over 100-fold for HCV in 10^5-A to 2.9-fold for HIV at the lowest dilution.
Concomitantly, the ratio for HPgV rose markedly, from 4.8-fold in 10^6-A to 175-fold in 10^3-B, reflecting an effectively constant viral load against decreasing quantities of Panel viruses (Table 3 and Fig. 1). Genome coverage and median depth values were also much higher in the treated samples than untreated comparators. At the two highest virus concentrations, median depths were between 47- and 274-fold higher in the treated versus the untreated samples. Only short fragments of HEV were recovered from the untreated 10^4 dilution, and almost no HIV or HCV sequences. By contrast, near complete genomes from all four target viruses were recovered from the treated comparators, with median depths of between 83 and 457 (as noted above, HIV in 10^4-A was an exception at 54.0% coverage and a median depth of 13).
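The fold differences quoted in this section are ratios of read percentages between a treated sample and its untreated comparator. A minimal sketch of that calculation follows; the function names and the read counts in the example are hypothetical and are not taken from Table 3.

```python
def percent_mapped(mapped_reads, total_reads):
    """Percentage of all reads in a library that map to a given reference."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * mapped_reads / total_reads


def fold_enrichment(treated_mapped, treated_total,
                    untreated_mapped, untreated_total):
    """Ratio of the mapped-read percentage in the rRNA-depleted library to
    that in the untreated library for the same reference."""
    untreated_pct = percent_mapped(untreated_mapped, untreated_total)
    if untreated_pct == 0:
        raise ValueError("no reads mapped in the untreated library")
    return percent_mapped(treated_mapped, treated_total) / untreated_pct


# Hypothetical counts: 5% of reads mapped after depletion versus 0.05% without.
ratio = fold_enrichment(50_000, 1_000_000, 500, 1_000_000)
print(f"{ratio:.0f}-fold")
```

Comparing percentages rather than raw counts keeps the ratio meaningful when the two libraries yield different total read numbers.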
Recovery of partial and complete genomes of diverse virus types from human plasma. The ability of our method to recover genome sequences from a range of RNA viruses in the context of human plasma was evaluated using a Virus Multiplex Reference (VMR) Panel, putatively containing 25 genomically and physicochemically diverse viruses. Two plasma-diluted panels and one PBS-diluted panel were tested (Table 4). No reads from any of the three samples mapped to either of the two norovirus genomes, coronavirus 229E or influenza B virus. By the panel distributor's qPCR 42 , the Threshold Cycle (Ct) of the coronavirus was >36 and the other three were not detected, hence these four targets were excluded from further analysis. Although influenza virus A H3N2 and parainfluenza virus type 3 were also not detected by the qPCR, we recovered reads from both, with genome coverages ranging from 2.7% to 21.6%. Almost no reads belonging to the panel's DNA viruses were found.
Sixty-nine percent of all reads obtained from the PBS-diluted panel mapped to VMR Panel genomes, dropping to 41-44% for the plasma-diluted samples, although the distribution of reads between targets was very uneven. Parechovirus and rotavirus accounted for 78.8-87.6% and 10.6-19.5% of all viral reads respectively, with the other viruses collectively accounting for 1.7-1.9%. Depths and genome coverages showed some inverse correlation with the given Cq values (Fig. 2).
As with the BBV Panel data, coverage plots of the samples diluted in plasma were largely unbiased, giving pooled genome coverages close to those expected by independent distributions of reads between replicates (Table 4, final column). Rotavirus and coxsackievirus were exceptions, where despite large numbers of mapped reads, almost identical patterns of read coverages and gaps were observed between their replicates, with minimal additive effect. The PBS-diluted sample gave larger read numbers, but their distribution was less even throughout the genomes, resulting in relatively lower coverages.
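The pooled coverage expected from two replicates whose covered positions are independent follows from elementary probability: the chance that a position is covered in at least one replicate is the sum of the individual coverages minus their product. The sketch below is illustrative only; the function name and the example values are hypothetical and do not come from Table 3 or Table 4.

```python
def expected_pooled_coverage(cov_a, cov_b):
    """Expected fraction of a genome covered by at least one of two replicates,
    assuming covered positions are distributed independently between them.

    cov_a and cov_b are fractions between 0 and 1.
    """
    for cov in (cov_a, cov_b):
        if not 0.0 <= cov <= 1.0:
            raise ValueError("coverage must be a fraction between 0 and 1")
    return cov_a + cov_b - cov_a * cov_b


# Hypothetical replicates covering 40% and 50% of a genome.
print(f"{expected_pooled_coverage(0.40, 0.50):.2f}")
```

An observed pooled coverage close to this value suggests that gaps fall at different positions in each replicate, whereas a pooled value close to the larger of the two individual coverages, as described for rotavirus and coxsackievirus, suggests that the replicates share the same gaps.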
Characterisation of a new subtype belonging to HCV genotype 6 and discovery of a second virus in a patient sample series. Four plasma samples from a patient with HCV were used as starting material. All extracts were subjected to RiboErase treatment; a second extract of sample 4 remained untreated for comparison. De novo assembly analysis of FASTQ sets from samples 1, 3 and 4 each gave a full-length HCV genome sequence as a single contig. For sample 2, six partially-overlapping contigs were obtained, covering 66% of the HCV sequence. Additionally, in all four samples, a single contig was obtained that was determined by BLAST and subsequent HMMER analysis to comprise an HPgV genome.
Scientific Reports | 7: 4173 | DOI:10.1038/s41598-017-02239-5
The HCV and HPgV full genome sequences were combined in a single file to carry out reference mapping and nucleotide frequency determination on the four sample FASTQ sets (Table 5). Samples 1, 3 and 4 had HCV read percentages ranging from 1.0 to 24.3%, and gave complete genomes with median depths greater than 700. Sample 2, which had the lowest viral load (2,000 IU/ml), had 0.3% of reads mapping to HCV, giving a genome coverage of 87% at a minimum depth of 10 (96.5% at depth ≥1) and a median depth of 43. Full coverage of the HPgV genome was obtained from all samples, with median depths over 8,700, and read percentages ranging from 34.2 to 63.3%. The depth plots in Fig. 3 again show unbiased and even coverages across both genomes, and the percentages of reads mapping to viral targets were again much higher in the rRNA-depleted sample than in the untreated comparator (61-fold and 85-fold for HCV and HPgV respectively). Analysis of the HCV sequence showed it to belong to a new subtype within genotype 6, the details of which are presented in a separate manuscript (in preparation). The HPgV clustered with genotype 1 strains, and is distinct from the NHP strain.
Table 3. Detailed sequencing data from the BBV Panel. For each of the three samples (Untreated, A and B) at each dilution (10^6 to 10^3), the number and percentage of reads mapping to each virus are given, together with the genome coverages (depth ≥10) and median depths. The final column gives these last two metrics from the combined data sets of both the A and B samples. Included in the analysis are data for the HPgV discovered in the sample diluent.

Analysis of human origin reads and negative control. Libraries from the BBV Panel extractions including the NHP negative control were subjected to virus-specific qPCR for the detection and quantification of HCV, HIV and HEV. All were detectable in the sample libraries, but were undetectable in the RiboErase-treated negative control library (Supplementary Table S4).
All samples were mapped against reference sequences that included human genome and human rRNA sequences to evaluate the efficiency of RiboErase treatment. The average ratio of the percentages of reads mapping to rRNA in the untreated versus the treated samples was 32-fold with an approximate halving of the number of reads mapping to the human genome, across all panels (Fig. 4).
With the exception of the expected human pegivirus, mapping of the negative control FASTQ set against the reference sequences of the four BBV Panel viruses, the two pegiviruses, the VMR Panel and the patient HCV gave very low numbers of reads mapped to viral genomes and no consensus sequences could be derived. Further data for this section are found in Supplementary Tables S3 and S5.
Table 5. Detailed sequencing data from the patient sample series. For each of the four samples 1-4, the number and percentage of reads mapping to both the HCV and HPgV genomes are given, together with genome coverages (depth ≥10) and median depths. The analysis of sample 4 extracted without host rRNA depletion is in the Untreated column.

Discussion
In light of the large and ever-increasing number of human RNA virus pathogens, it is perhaps unsurprising that standard serological assays and nucleic acid tests suffer from a lack of sensitivity to diverse variants of target viruses, overlook the presence of new or unexpected viruses, and provide only limited information about those targets they do successfully detect. Hence the three main aims of metagenomic virology are to detect and identify known agents irrespective of their diversity, to discover novel agents of disease, and to obtain complete sequence information of detected viruses. Most existing protocols achieve a maximum of two of these aims, but difficulties in selectively isolating viral RNA species and short read sequences from those of the super-abundant host nucleic acid have limited the utility of metagenomic approaches in diagnostic virology. This study has addressed these limitations by establishing a novel methodology suitable for the agnostic detection and characterization of blood-borne RNA viruses in plasma samples. By depleting host-derived nucleic acids and making modifications to an existing library preparation protocol to account for ultra-low RNA input quantities, we have been able to reconstruct effectively full-length genomes of HCV, HEV and HIV from plasma samples with viral loads of 10^4 IU/ml (copies/ml for HIV) and substantial fractions of complete genomes at 10^3 IU/ml. When applied to a series of clinical samples, we could elucidate simultaneously the full genome sequences of both a novel subtype belonging to HCV genotype 6 and a hitherto-undetected human pegivirus. Additionally, our system was able to recover viral sequences from a panel of diverse RNA viruses diluted in human plasma, with a broad correlation between the genomic coverage and depth metrics and approximate concentration. Although full genomes were not assembled in many cases, the independence of read distribution gave sufficient genome coverage for identification.
The vast majority of RNA molecules in a human plasma sample are host-derived, of which up to 80% comprises the six species of human rRNA. Their presence in our libraries was minimised by two key steps in our modified protocol. Firstly, we selected an extraction method that combined a phenol/chloroform step with a column format (Lexogen SPLIT RNA) which increased the amount of extracted viral RNA by up to one log when compared to other extraction methods (data not shown). Perhaps more importantly, by controlling the final precipitation step, small RNA molecules below 150 nt such as 5S rRNA and tRNA are excluded from the eluates, as are the majority of molecules of human genomic DNA.
Secondly, we employed DNA probes complementary to human rRNA such that hybridisation and subsequent digestion by RNase H dramatically reduced their frequency in the finished libraries. Whilst this methodology has been successfully used in the detection and characterisation of two haemorrhagic fever viruses, the frequency of viral reads was often below 1% and an additional hybrid-capture step was employed to elevate read numbers 36 . Methods that do not deplete rRNA generally give poor recovery of viral reads, yielding viral genome fragments that necessitate further work 27,32,55,56 , low read numbers even at viral loads over 10^4 IU/ml 20-22,33 , or at best, requiring dilution of both host and virus in PBS in order to recover full HIV genomes at low copy numbers 38 .
The resultant rRNA-depleted sample extracts typically contain quantities of nucleic acid in the low picogram range. Library preparation through hexamer-mediated reverse transcription followed by Multiple Displacement Amplification constitutes an easy and effective means of amplifying very low amounts of DNA 27,38,57 , but in several studies (and in the authors' laboratory), significant amplification biases have been observed, leading to gaps in target genome coverage 39,58-60 . Consequently, we adopted an approach using a standard RNA library preparation kit, but with substantial modification to compensate for its minimum RNA input requirement of at least 10 ng and optimally 100 ng-1 µg.
We made key changes to the RNA fragmentation and adaptor-ligation steps of the NEBNext Ultra Directional RNA Library Prep Kit protocol. While prior RNA fragmentation with heat and divalent cations improves sequence coverage, over-fragmentation of target genomes leads to the loss of material during the library preparation process 37 . Lower amounts of RNA thus require shorter optimum fragmentation times and we found that 1 minute at 94 °C was optimal in terms of breadth of genome coverage.
Under standard kit conditions, our ultra-low RNA inputs dramatically skewed the ratio of cDNA to adaptor. The resulting adaptor excess led to the preferential amplification of adaptor dimers during the PCR step, and despite increasing cycle number to amplify low RNA inputs, we were generally unable to generate sufficient quantities of target-specific material. Accurate quantification and consequent equimolar pooling of libraries was compromised, as was the MiSeq clustering efficiency. We found that a reduced final adaptor concentration of 1.4 nM was crucial in reducing the amount of adaptor dimers in libraries from rRNA-depleted samples whilst simultaneously extending the PCR cycle number.
In the present study, serial dilutions of the Blood Borne Virus Panel were prepared in negative human plasma, reducing both the absolute quantity and relative frequency of the viral RNA targets while maintaining the complexity of the sample in terms of host nucleic acid, thus mimicking that of a clinical sample. With rRNA depletion, the number and diversity of viral reads was consistently high, with over 35% of all reads mapping to constituent virus genomes. Throughout the three sample series, we obtained relatively high genome coverages of low-frequency viral targets. Co-infections with multiple blood-borne viruses are common 61 , so whilst we speculate that the depths and coverages of target viruses would be greater yet in these samples had it not been for the confounding effect of the unexpected human pegiviruses in both the plasma diluent and the patient sample series, it was reassuring to see the method performed well under such conditions.
In our experiments using negative human plasma as sample diluent, we were able to recover levels of viral genomes comparable to previous work using PBS, both for BBV Panel viruses 38 and for the VMR Panel 41 , and we were able to recover from a patient sample a large percentage of the genome of a previously uncharacterised subtype of HCV genotype 6 when present at 2 × 10^3 IU/ml, a diagnosis not possible using existing genotyping assays. The presence of an undiagnosed pegivirus in this sample further demonstrated the utility of the method in metagenomic analysis of blood-borne virus co-infections where the relative abundances of each virus can be highly variable 22 . Furthermore, in three of the four samples, depths greater than 1,000 were routinely obtained, which are likely to be sufficient to call minority variants for clinical resistance 62 . A full description of the patient series and the new HCV strain is provided in a separate manuscript (in preparation).
Our approach can therefore not only accurately characterise rare or novel variants of existing viruses, but also generates the same level of information regarding unexpected viruses present in the sample. By comparison, VIDISCA 32, 63, 64 and other random amplification-NGS techniques 30,31 have detected novel viruses in diverse clinical samples, but all have required further techniques to achieve full genome sequences.
Together with the VMR Panel results, we were able to recover identifying sequence from both enveloped viruses (HCV, HIV, HEV, influenza, and several paramyxoviruses), and non-enveloped viruses (several enteroviruses, astrovirus, rotavirus, and sapovirus). For the majority of viruses in the VMR Panel, whilst dilution in plasma reduced the total percentage of reads recovered when compared to the panel diluted in PBS, a greater breadth of genome coverage was achieved. In the absence of any host nucleic acid background, it is possible that the PBS extracts had such ultra-low quantities of RNA that despite the adjustments made to the library preparation protocol, the RNA was over-fragmented, leading to a smaller number of genome fragments that were individually amplified to a greater extent than the larger array of fragments surviving the plasma extraction.
In developing a similar approach, Kohl et al. were only able to recover a percentage of reads exceeding 6% at a viral load over 10^7 copies/ml. At an influenza A virus concentration of over 10^5 copies/ml, this dropped to just 0.5%, and at a reovirus concentration of 10^3-10^4 copies/ml, no viral reads were detected 24 . With our method, whole genomes were obtained for those with the highest viral loads, and for minority viral targets, there was a correlation between ostensible quantity and coverage, including for two viruses undetectable by the panel distributor 42 , a result superior to that recently obtained from influenza in clinical respiratory samples 65 . Again, the presence of high quantities of one or more target is likely to have inhibited the representation of the minority species such that if tested individually, superior depths and coverages would seem likely. With further reduction to the fragmentation time, or even its abolition, it may be possible to use this method to reconstruct genomes from old, partially degraded samples such as those recently used to re-evaluate the early HIV epidemic in the Americas 66 .
Our negative control data suggest that the level of contamination is low, with most viral reads therein belonging to the most abundant VMR Panel member. Nelson et al. 67 identified a second source of contamination consisting of incorrect reads from other libraries that were sequenced during the same sequencing run due to TruSeq index misassignment (~0.06% of reads, 0.02% here). Although cross-contamination between samples during the library preparation can be another source of contamination, the qPCR results suggest no BBV Panel genomes were present after library preparation in the negative control sample.
To conclude, by applying the three adaptations of selective large RNA extraction, rRNA depletion-DNAse treatment, and the extensively modified library preparation in combination, NGS data sets can be produced from plasma samples that are rich in RNA virus sequence data. Complex bioinformatic processing has been employed to identify viruses within a metagenomic dataset 7,25,26,32,64,65 , but here, only simple bioinformatic processing is needed for detection and identification of known viruses, and by applying only moderately more advanced tools, an agnostic approach to virus detection can be taken, together with characterisation of the full genome even at low viral loads.