Enrichment of Viral Nucleic Acids by Solution Hybrid Selection with Genus Specific Oligonucleotides

Despite recent advances, our knowledge of potential and rare human pathogens is far from exhaustive. Current molecular diagnostic tools mainly rely on the specific amplification of marker sequences and may overlook infections caused by unknown and rare pathogens. High-throughput sequencing (HTS) can solve this problem, but, owing to the extremely low fraction of pathogen genetic material in clinical samples, its application is cost-effective only in special, rather than routine, cases. In this study, we present a method for the semi-specific enrichment of conserved viral sequences in an HTS library by hybridization in solution with genus-specific degenerate biotinylated oligonucleotides. Nucleic acids of the test viruses (yellow fever virus and Japanese encephalitis virus) were enriched by solution hybrid selection using pan-flavivirus oligonucleotides. Moreover, enterovirus (family: Picornaviridae, genus: Enterovirus) sequences were successfully enriched using a foot-and-mouth disease virus (family: Picornaviridae, genus: Aphthovirus) oligonucleotide. The enrichment factor relative to the background nucleic acid was about 1,000-fold. As hybridization has less stringent oligonucleotide match requirements than PCR, a few oligonucleotides are sufficient to cover the potential sequence variation in the whole genus and may even enrich nucleic acids of viruses of other related genera. Efficient enrichment of viral sequences makes the use of HTS in diagnostics cost-efficient.

pathogenic sequence (PATHseq) is based on using a set of 88 specific 8-mer oligonucleotides (which do not match the sequences of the 2,000 most abundant human transcripts) as primers for the synthesis of secondary cDNA strands 13 . Alternatively, the VIDISCA method based on the presence of virus-specific endonuclease restriction sites was suggested 14 . Advanced VIDISCA uses non-rRNA binding hexamers in a reverse transcription (RT) reaction 15 . It is also possible to enrich viral sequences by a broad-range RT-PCR using a set of degenerate oligonucleotides 16,17 .
The recently developed virome capture sequencing platform for vertebrate viruses (VirCapSeq-VERT) technique increases the sensitivity of sequence-based virus detection by HTS 18 . The authors described target selection of genome sequences of known viruses by hybridization in a solution using 1,993,200 biotinylated oligonucleotides representing 342,438 viral coding sequences. However, as this approach used specific (non-degenerate) oligonucleotides, theoretically, it can only detect viruses that have regions with > 80% sequence identity to known viruses.
In this study, we describe a cost-effective method for the significant enrichment of conserved fragments of viral genomes by hybridization with degenerate genus-specific biotinylated oligonucleotides. We analyze the utility of the technique for nucleic acid (NA) library enrichment in order to increase HTS efficiency and decrease the cost per sample.

Materials and Methods
Selection of probe sequences. All full-genome flavivirus sequences available in GenBank in October 2014 were extracted. After alignment with MAFFT 19 , sequences sharing more than 90% identity were discarded. The data set was refined manually to exclude pegiviruses and incomplete sequences. Foot-and-mouth disease virus alignment was prepared similarly. Hybridization oligonucleotides matching the most conserved genome fragments were designed according to the following principles:
- length between 50 and 90 nt;
- 90% match with all target sequences (degenerate nucleotides were used where necessary);
- whenever possible, degenerate positions were simplified to G or K (T/G), given that mismatches involving T and G result in a minimal penalty to hybridization efficiency 20 .
The potential ability of oligonucleotides to hybridize with selected target sequences was checked using the Mfold web server 21 .
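The length and match criteria above can be checked programmatically. The following is a minimal sketch, not the procedure used in this study: it assumes that the oligonucleotide and each target window are given in the same sense and are already aligned without gaps, and the function names `match_fraction` and `passes_design_rules` are hypothetical. Only the standard IUPAC nucleotide codes are relied upon.

```python
# IUPAC nucleotide codes mapped to the bases each code represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def match_fraction(oligo, target):
    """Fraction of positions where the target base is allowed by the oligo code.

    Assumes an ungapped, same-sense, equal-length comparison (illustrative only).
    """
    if len(oligo) != len(target):
        raise ValueError("oligo and target window must have equal length")
    hits = sum(
        1
        for code, base in zip(oligo.upper(), target.upper())
        if base in IUPAC[code]
    )
    return hits / len(oligo)


def passes_design_rules(oligo, targets, min_len=50, max_len=90, min_match=0.90):
    """Check the length window and the per-target match threshold."""
    if not (min_len <= len(oligo) <= max_len):
        return False
    return all(match_fraction(oligo, t) >= min_match for t in targets)
```

A degenerate position such as K counts as a match for either G or T, so replacing a fixed base with a degenerate code can only raise the match fraction. This check does not model hybridization thermodynamics, which in this study was assessed separately with Mfold.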
Samples, nucleic acid extraction and preparation of artificial samples containing human RNA and viral RNA. Donor whole blood samples were collected with written informed consent from volunteers in the CRIE (Central Research Institute for Epidemiology, Moscow). All experiments were performed in accordance with relevant guidelines and regulations and were approved by the CRIE ethics committee. Total human RNA was extracted from the human mononuclear cell fraction of the samples using the RNeasy Mini Kit (Qiagen).
The Gagar Japanese encephalitis virus (JEV) strain was cultured in pig embryo kidney cells. The 17D yellow fever virus (YFV) strain was passaged in Vero cells. The Gregory Echovirus 11 (E11) strain, which was originally obtained from ATCC, was cultured in human rhabdomyosarcoma cells. Viral RNA was extracted from cell culture medium using the Viral RNA Mini Kit (Qiagen).
The relative concentrations of isolated viral RNA and host RNA were estimated using in-house qPCR assays. Then, total human RNA (100 ng) was mixed with either 20 pg of YFV, 30 pg of JEV or 100 pg of E11. The total volume of each sample was adjusted to 10 µL with water.
Preparation of cDNA libraries. First-strand cDNA was synthesized from RNA using Reverta-L RT kit (AmpliSens) according to the manufacturer's instructions. The kit uses MMLV reverse transcriptase and random hexamer primers.
Second-strand cDNA synthesis was performed by adding 10 µL of RT mixture, 8 µL of 5x buffer C (AmpliSens), 5 µL of dNTP (2.5 mM) (AmpliSens), 1 µL of random hexamer oligonucleotide (168 pmol/µL) (in-house) and 1 µL of RNAse H (New England Biolabs), made up to 50 µL with milli-Q water. The mixture was incubated at 37 °C for 10 min. The enzymes were inactivated by heating at 95 °C for 2 min. After cooling on wet ice, 1 µL of Bst polymerase (New England Biolabs) and 0.5 µL of Klenow exo- (New England Biolabs) were added. Incubation conditions were as follows: 4 °C for 45 s, 20 °C for 5 min, 37 °C for 5 min, 45 °C for 5 min, 50 °C for 5 min and 68 °C for 5 min. The reaction product was purified with Agencourt AMPure XP beads (Beckman Coulter) using the manufacturer's protocol, except that a 1.5:1 bead-to-DNA solution ratio was used instead of 1.8:1. DNA was eluted in 17 µL of milli-Q water.
To incorporate sequencing adapters, purified cDNA libraries (15 µL) were mixed with 5 µL of 5x transposase buffer (in-house) and 0.5 µL of the transposase and transposon complex, termed transposome (in-house) 22 . Total reaction volume was adjusted with milli-Q water up to 25 µL. The mixture was incubated at 55 °C for 5 min. The reaction was stopped by adding 1 µL of 200 mM EDTA. The reaction product was purified with Agencourt AMPure XP beads (Beckman Coulter) using the manufacturer's protocol. DNA was eluted to 22 µL of milli-Q water.
The oligonucleotides (total volume: 5 µL) were separated from other parts of the PCR mixture (total volume: 45 µL) by a wax layer. Amplification conditions were as follows: 30 °C for 1 min; 37 °C for 1 min; 72 °C for 1 min; denaturation at 95 °C for 5 min; 18 cycles at 95 °C for 20 s, 65 °C for 30 s and 72 °C for 60 s. Real-time PCR amplification was performed using a Rotor Gene 6000 instrument (Qiagen). The amplification was monitored to stop the reaction during the exponential phase and before the reaction reached the plateau 23 .
Aliquots of PCR product were stored at −20 °C for further analysis. After 15 min at 37 °C, the beads were pulled down by the magnetic rack and washed once at 47 °C for 5 min with 1 mL of prewarmed 1xSSC (saline and sodium citrate)/0.1% SDS (sodium dodecyl sulfate). Next, the beads were washed twice at 47 °C for 5 min with 1 mL of prewarmed 0.2xSSC/0.1%SDS, once at room temperature for 5 min with 1 mL 0.2xSSC/0.1%SDS, and once at room temperature for 5 min with 1 mL of 0.2xSSC/0.1% Tween 20. The beads were resuspended once after each washing step and collected by the magnetic rack. Hybrid-selected DNA, which was attached to the beads, was resuspended with 20 µL of 0.1% Tween 20.
Enrichment evaluation by specific qPCR. The relative concentration of viral cDNA before and after hybridization was measured by a specific qPCR using GAPDH (glyceraldehyde-3-phosphate dehydrogenase) as the housekeeping gene. The cycle threshold (Ct value) was set at the middle of the exponential phase of the amplification curve. The ratio between the host and viral NAs was assessed in terms of the Ct value difference. Oligonucleotide primers and probes for viral cDNA quantification (Table 1) were chosen in close proximity to the location of hybridization oligonucleotide probes. The efficiency of PCR amplification was assumed to be 100% (twofold amplification on each cycle) for further calculations.
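Under the assumption of twofold amplification per cycle, a Ct difference translates into an abundance ratio as sketched below. The notation is introduced here for illustration only and does not appear in the original protocol: the sign convention for the Ct difference (virus minus GAPDH), the host:virus ratio R and the enrichment factor F are all assumptions.

```latex
\begin{aligned}
\Delta Ct &= Ct_{\mathrm{virus}} - Ct_{\mathrm{GAPDH}}, \\
R &= \frac{N_{\mathrm{host}}}{N_{\mathrm{virus}}} \approx 2^{\Delta Ct}, \\
F &= \frac{R_{\mathrm{before}}}{R_{\mathrm{after}}}
   = 2^{\,\Delta Ct_{\mathrm{before}} - \Delta Ct_{\mathrm{after}}}.
\end{aligned}
```

With this convention, a positive value of the Ct difference means that host cDNA is more abundant than viral cDNA, and each cycle by which the difference decreases after hybridization corresponds to a twofold enrichment.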
For HTS, the initial and enriched cDNA libraries were indexed by PCR using Nextera-compatible oligonucleotide primers. The libraries were pooled at equal ratios. Paired-end 250-base sequencing by synthesis was performed on the MiSeq System (Illumina) using protocols provided by the manufacturer. Samples were de-multiplexed using the Illumina software, with FASTQ files generated.
Bioinformatic analysis. Reads in the FASTQ format were filtered by quality (Q30), and the adapter sequences were removed by Trimmomatic 25 . The Bowtie 2 26 tool was used for aligning sequencing reads to references. Alignments were processed using SAMtools 27 and pysamstats software, and figures were generated in the R environment.

Results
First, enrichment efficiency was estimated by comparison of the host:viral cDNA ratio in libraries before and after hybridization (Table 2).
In unprocessed samples, delta Ct, indicating the host:pathogen ratio, was 6.2, 5.0 and 9.3 for human/YFV, human/JEV and human/E11 libraries, respectively. As a result of hybridization, delta Ct changed to −5.1, −7.5 and 2.3 for human/YFV, human/JEV and human/E11 libraries, respectively. According to qPCR data, the ratio of target sequences to host sequences increased by approximately 2^11, or 2,000 times, in the case of the human/YFV library and approximately 2^12, or 4,000 times, in the case of the human/JEV library after hybridization with three pan-flavivirus oligonucleotides, and by about 2^7, or 100 times, for the human/E11 library after hybridization with the FMDV-specific probe. Because the qPCR is semi-quantitative, the enrichment efficiency estimates should be considered approximate; however, they at least indicate the order of magnitude of enrichment efficiency.
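The arithmetic behind these estimates can be reproduced with a few lines of Python. This is a minimal sketch that assumes 100% PCR efficiency, as stated in Materials and Methods; the function name `fold_enrichment` and its `efficiency` parameter are hypothetical and are used here only for illustration.

```python
def fold_enrichment(delta_ct_before, delta_ct_after, efficiency=1.0):
    """Fold change in the virus:host ratio implied by a shift in delta Ct.

    efficiency=1.0 corresponds to twofold amplification per cycle
    (the assumption made in this study); other values are illustrative.
    """
    return (1.0 + efficiency) ** (delta_ct_before - delta_ct_after)


# Delta Ct values before and after hybridization, as reported in the text.
reported = {
    "human/YFV": (6.2, -5.1),
    "human/JEV": (5.0, -7.5),
    "human/E11": (9.3, 2.3),
}

for library, (before, after) in reported.items():
    shift = before - after
    print(f"{library}: shift of {shift:.1f} cycles, "
          f"about {fold_enrichment(before, after):.0f}-fold")
```

The delta Ct shifts are 11.3, 12.5 and 7.0 cycles, which the text rounds down to 2^11, 2^12 and 2^7. Without that rounding the computed values for the two flavivirus libraries come out somewhat higher than the quoted 2,000 and 4,000 times, which is consistent with treating these figures as order-of-magnitude estimates.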
HTS was also performed to validate the hybridization-induced change in the host:virus cDNA ratio. Significant enrichment of target viral sequences was observed in samples after hybridization (Table 3). The number of reads mapped to viral genomes increased from single accidental reads to a significant fraction of the total. The increase in the number of viral reads corresponded well to the enrichment ratio suggested by qPCR (Table 2).
The pattern of HTS read coverage along the YFV, JEV and E11 genomes corresponded to the expected enrichment in the genome regions complementary to the biotinylated oligonucleotides (Figs 1, S1). Interestingly, the read coverage in the E11 library showed a bimodal distribution. After omitting reads shorter than 200 nucleotides from the analysis (Fig. S2), only one peak close to the probe binding site was left. Therefore, the second sequence density peak in Fig. 1 was likely due to short reads from longer cDNA library fragments that were hybridized to the oligonucleotide.
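The effect of the read-length filter on the coverage profile can be illustrated with a short sketch. It is not the pipeline used in this study, which relied on SAMtools and pysamstats: here each aligned read is reduced to a hypothetical `(start, length)` pair with a 0-based start, and soft clipping, gaps and paired-end structure are ignored.

```python
def coverage(alignments, genome_length, min_read_length=0):
    """Per-position read depth from simplified (start, length) alignments.

    Reads shorter than min_read_length are skipped, mimicking the
    200-nt filter described in the text (illustrative only).
    """
    depth = [0] * genome_length
    for start, length in alignments:
        if length < min_read_length:
            continue
        # Clip the read to the genome boundaries before counting.
        first = max(start, 0)
        last = min(start + length, genome_length)
        for pos in range(first, last):
            depth[pos] += 1
    return depth


# Toy example: two short reads and one long read on a 5-nt "genome".
reads = [(0, 3), (2, 2), (1, 250)]
all_reads = coverage(reads, 5)
long_only = coverage(reads, 5, min_read_length=200)
```

Comparing the two profiles shows which coverage peaks are supported by long reads only. In the toy example the depth at the first position drops to zero once reads shorter than 200 nt are removed.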

Discussion
Rapidly evolving methods in molecular diagnostics face limitations associated with the extremely high genetic diversity of viruses. Furthermore, most of the global virome remains undiscovered, resulting in a lack of reference sequences for novel pathogen identification. One way to overcome this situation is to develop a system that allows specific enrichment of unknown virus content in processed biological material based on known homologous sequences. We were able to increase viral cDNA content for two distinct flaviviruses (YFV and JEV) by hybridization of HTS-ready libraries with pan-flavivirus oligonucleotides. Comparison of the genomes of these viruses (KF907504 vs KF297915, respectively) using BLASTn shows a 66% identity level at 27% genome coverage. The hybridization oligonucleotides showed no preference for either YFV or JEV in the enrichment experiment. The data set for the oligonucleotide design included over 100 highly diversified flavivirus sequences; therefore, comparable enrichment efficiency may be expected for any known flavivirus. To further test the specificity range of the method, we tested whether it was possible to enrich the cDNA of one virus genus with hybridization probes designed for another genus of the same family. Indeed, it was possible to enrich the content of enterovirus cDNA using an oligonucleotide designed for FMDV (79.2% sequence identity between the FMDV oligonucleotide and the enterovirus). The Enterovirus and Aphthovirus genera are very distantly related within the Picornaviridae family. Therefore, the specificity breadth of the method goes beyond a genus and could cover distantly related genera within a highly diversified virus family. Importantly, cDNA enrichment by hybridization lacks the theoretical specificity limitations typical of PCR, such as the strict requirement for a precise match of the three 3′-end nucleotides of the PCR primer.
A small number of additional mismatches anywhere in the hybridization oligonucleotide will only reduce hybridization efficiency, not prevent it completely. Therefore, this approach is theoretically more robust in terms of sequence variation than genus-specific PCR with degenerate primers. Moreover, the degeneracy of the probes (the number of possible unique oligonucleotides within a degenerate oligonucleotide preparation) was between 2^12 and 2^46, thereby greatly exceeding the maximum degeneracy of 2^8 that is acceptable in a PCR primer.
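Degeneracy, as defined above, is the product of the number of bases allowed at each position of the oligonucleotide. The sketch below computes it from the standard IUPAC codes; the function name `degeneracy` and the example sequences are hypothetical and are not the probes used in this study.

```python
from math import log2, prod

# Number of bases represented by each IUPAC nucleotide code.
BASES_PER_CODE = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def degeneracy(oligo):
    """Number of unique sequences encoded by a degenerate oligonucleotide."""
    return prod(BASES_PER_CODE[code] for code in oligo.upper())


# Hypothetical examples: twelve twofold-degenerate positions give 2^12.
example = "ACGT" + "K" * 12
print(degeneracy(example), log2(degeneracy(example)))
```

Because every twofold-degenerate position doubles the count, an oligonucleotide with 12 such positions already reaches the lower bound of the range quoted above, whereas a primer limited to 2^8 can carry at most 8 of them.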
Enrichment of cDNA libraries by hybridization involves four steps of NA copying and amplification: reverse transcription, second-strand synthesis and two rounds of library amplification before and after hybridization. Even a standard library preparation procedure results in the uneven amplification of genome fragments 28 . As expected, this bias was further deepened in our protocol. Therefore, it is advisable to use several hybridization oligonucleotides per taxon. On the other hand, using specifically designed degenerate probes, which are complementary to conserved genome regions, is much more affordable than synthesizing multiple specific oligonucleotides to the whole genome, as suggested earlier 18 .
At present, molecular diagnostic tools mostly rely on the specific amplification of marker sequences. In other words, while a standard molecular test answers the question of whether pathogen X is present in the sample, it does not help investigators to answer the more important question, "which pathogen is present in the sample?" 29 . Moreover, the correctness of an assumption about a pathogen's presence in a sample depends entirely on the qualifications of the medical specialist.
To date, many tools for unbiased HTS-based pathogen detection exist. However, due to the extremely low fraction of a pathogen's genetic material in clinical samples, their use is only cost-effective in very special cases. The proposed technique can enrich the share of pathogen genome fragments in a library, even if the NA sequence of a pathogen is currently unknown. It can also significantly enhance the utility of HTS for diagnostics.