Metagenomic sequencing has the potential to transform microbial detection and characterization, but new tools are needed to improve its sensitivity. Here we present CATCH, a computational method to enhance nucleic acid capture for enrichment of diverse microbial taxa. CATCH designs optimal probe sets, with a specified number of oligonucleotides, that achieve full coverage of, and scale well with, known sequence diversity. We focus on applying CATCH to capture viral genomes in complex metagenomic samples. We design, synthesize, and validate multiple probe sets, including one that targets the whole genomes of the 356 viral species known to infect humans. Capture with these probe sets enriches unique viral content on average 18-fold, allowing us to assemble genomes that could not be recovered without enrichment, and accurately preserves within-sample diversity. We also use these probe sets to recover genomes from the 2018 Lassa fever outbreak in Nigeria and to improve detection of uncharacterized viral infections in human and mosquito samples. The results demonstrate that CATCH enables more sensitive and cost-effective metagenomic sequencing.
The latest version of CATCH and its full source code are available at https://github.com/broadinstitute/catch under the terms of the MIT license. For designing the VALL probe set, we used CATCH v0.5.0 (available in the repository on GitHub).
Sequences used as input for probe design are available in the repository at https://github.com/broadinstitute/catch (see Supplementary Table 1 for links to specific versions used). Sequences of the probe designs (with 20-nt adaptors where applicable) developed here are available at https://github.com/broadinstitute/catch/tree/cf500c6/probe-designs. Sequencing data from this study, as well as viral genomes generated as part of this work, have been deposited in NCBI databases under BioProject accession PRJNA431306 (PRJNA436552 for the 2018 Lassa virus genomes).
We thank S. Ye, C. Myhrvold, S. Weingarten-Gabbay, C. Freije, S. Schaffner, and other members of the Sabeti laboratory for useful discussions and feedback on the manuscript; B. Chak for assistance with ethical approvals and compliance; and Boca Biolistics, the Florida Department of Health, Miami-Dade County Mosquito Control, Research Blood Components, the Ragon Institute Cellular Immunology Database, and Brigham and Women’s Hospital’s Crimson Core for support with samples. This project has been funded in whole or in part with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under grant number U19AI110818 to the Broad Institute. This project was also funded in part by NIH NIAID contract HHSN272200900049C, a Broadnext10 gift from the Broad Institute, Henry M. Jackson Foundation award W81XWH-11-2-0174, and the Bill & Melinda Gates Foundation. IAV samples were funded by NIH NIAID contract HHSN272201400008C to J.A.R. K.J.S. is supported by a fellowship from the Human Frontiers in Science Program (LT000553/2016). S.I. and S.F.M. are supported by NIH NIAID R01AI099210. C.T.H. is supported by NIH NHGRI U01HG007480 and U54HG007480 and by World Bank project ACE019.
Integrated supplementary information
CATCH models hybridization between each candidate probe and the target sequences. Doing so allows CATCH to decide whether a candidate probe captures (or ‘covers’) a region of the target sequence, and thus find a probe set that achieves a desired coverage of the target sequences under this model. For whole genome enrichment, the desired coverage would typically be 100% of each target sequence. (a) Relatively conserved regions (for example, a particular gene) in the input sequences can be captured with few probes because it is likely that any given probe, under a model of hybridization, will capture observed variation across many or all of the input sequences. Highly variable regions may require many probes to be captured because each given probe may capture the observed variation across only a small fraction of the input sequences. (b) By default, CATCH decides whether a probe hybridizes to a region of a target sequence according to the following parameters: a number m of mismatches to tolerate and a length lcf of a longest common substring. CATCH computes the longest common substring with at most m mismatches between the probe and target subsequence, and decides that the probe hybridizes to the target if and only if the length of this substring is at least lcf. If the parameter i is provided, CATCH additionally requires that the probe and target subsequence share an exact (0-mismatch) match of length at least i. If CATCH decides that the probe hybridizes to the subsequence of the target with which it shares a substring, then it determines that the probe captures the region equal in length to the probe, as well as e nt on each side of this region. e, termed a cover extension, is a parameter whose value can be specified to CATCH, along with m, lcf, and i. Lower values of m, higher values of lcf, higher values of i, and lower values of e are more conservative and lead to more probe sequences. (For details, see the description of fmap in Online Methods.)
(c) Number of probes required to fully capture 300 genomes of HCV, HIV-1, EBOV, and ZIKV, for varying values of the mismatches and cover extension parameters, with other parameters fixed. Shaded regions are 95% pointwise confidence bands calculated across randomly sampled input genomes.
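The hybridization decision described in panel (b) can be illustrated with a minimal Python sketch. It is not CATCH's implementation: the function names are hypothetical, the probe is assumed to be compared against an equal-length, ungapped window of the target, and the exact-match requirement i is interpreted as the longest run of identical aligned bases.

```python
from typing import Optional, Tuple


def longest_run_with_mismatches(probe: str, target_sub: str, m: int) -> int:
    """Length of the longest aligned stretch containing at most m mismatches.

    Assumes an ungapped, position-by-position comparison of equal-length strings.
    """
    if len(probe) != len(target_sub):
        raise ValueError("probe and target subsequence must have equal length")
    best = 0
    left = 0
    mismatches = 0
    for right in range(len(probe)):
        if probe[right] != target_sub[right]:
            mismatches += 1
        # Shrink the window from the left until it holds at most m mismatches.
        while mismatches > m:
            if probe[left] != target_sub[left]:
                mismatches -= 1
            left += 1
        best = max(best, right - left + 1)
    return best


def hybridizes(probe: str, target_sub: str, m: int, lcf: int,
               i: Optional[int] = None) -> bool:
    """Apply the m / lcf rule and, if given, the exact-match length i."""
    if longest_run_with_mismatches(probe, target_sub, m) < lcf:
        return False
    if i is not None and longest_run_with_mismatches(probe, target_sub, 0) < i:
        return False
    return True


def covered_range(target: str, probe: str, start: int, m: int, lcf: int,
                  e: int, i: Optional[int] = None) -> Optional[Tuple[int, int]]:
    """Half-open range of the target captured by a probe placed at `start`.

    Returns None if the probe does not hybridize there. The captured range is
    the probe-length window plus e nt on each side, clipped to the target.
    """
    target_sub = target[start:start + len(probe)]
    if len(target_sub) < len(probe):
        return None
    if not hybridizes(probe, target_sub, m, lcf, i):
        return None
    return (max(0, start - e), min(len(target), start + len(probe) + e))
```

In this sketch, tolerating more mismatches (larger m) or extending coverage further (larger e) lets each probe cover more of the target, which matches the caption's statement that the opposite settings are more conservative and require more probes.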
Number of probes required to fully capture increasing numbers of HIV-1, EBOV, and ZIKV genomes. Approaches shown are simple tiling (gray), a clustering-based approach at two levels of stringency (red; see Supplementary Note 2 for details), and CATCH at three choices of parameters (blue). Shaded regions are 95% pointwise confidence bands calculated across randomly sampled input genomes.
(a) Number of probes designed by CATCH for each dataset among all 89,990 probes in the VWAFR probe set. The total includes reverse complement probes, which were added to the design of VWAFR for synthesis. (b) Values of two parameters selected by CATCH for each dataset in the design of VWAFR: number of mismatches to tolerate in hybridization and length of the target fragment (in nt) on each side of the hybridized region assumed to be captured along with the hybridized region (cover extension). The label within each bubble is the number of datasets that were assigned a particular combination of values. Species included in our sample testing are labeled; for the full list of parameter values, see Supplementary Table 1.
Supplementary Figure 4 Depth of coverage observed across viral genomes from samples with known viral infections.
Depth of coverage across 31 viral genomes from the analysis of 30 patient and environmental samples with known viral infections (one sample contained two known viruses). Shown on (a) linear and (b) logarithmic scales. The logarithmic scale helps compare variance in depth across each genome between pre- and post-capture data.
Fraction of all downsampled pre-capture reads that mapped to the reference genome (shown on the horizontal axis) for 24 viral genomes reflects a wide range of initial viral concentrations in these samples. Enrichment (shown on the vertical axis) was calculated by dividing the total number of post-capture reads mapping to a reference genome by the number of mapped pre-capture reads. Those with the highest viral content showed lower enrichment following capture with VALL. Seven of the 31 viral genomes included in the analysis are excluded from this plot because they yielded fewer than 200,000 total reads (Supplementary Table 3). Two IAV samples with a high fraction of viral reads pre-capture (bottom right) overlap on the plot. One sample (ZIKV-SM3, top left) showed no viral reads pre-capture, so its fold-change is undefined.
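The two quantities plotted in this figure follow directly from read counts. The short Python sketch below shows the arithmetic under the definitions in the caption; the function names and the example counts are hypothetical and are not taken from the study's data.

```python
from typing import Optional


def fraction_mapped(mapped_reads: int, total_reads: int) -> float:
    """Fraction of (downsampled) reads that map to the reference genome."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mapped_reads / total_reads


def fold_enrichment(post_mapped: int, pre_mapped: int) -> Optional[float]:
    """Post-capture mapped reads divided by pre-capture mapped reads.

    Returns None when there are no pre-capture mapped reads, because the
    fold-change is undefined in that case.
    """
    if pre_mapped == 0:
        return None
    return post_mapped / pre_mapped


# Hypothetical counts from a sample downsampled to 200,000 reads.
pre_fraction = fraction_mapped(100, 200_000)
enrichment = fold_enrichment(1_800, 100)
```

Returning None rather than infinity for a sample with no pre-capture viral reads keeps such samples out of any average or log-scale plot unless they are handled explicitly.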
(a) Number of species detected (with at least 1 assigned read) in samples with known viral infections. Counts are shown before capture (orange), after capture with VWAFR (light blue), and after capture with VALL (dark blue). (b) Left: Number of reads detected for each species across samples with known viral infections, before and after capture with VWAFR. Right: Abundance of each species before capture and fold-change upon capture with VWAFR. For each sample, the virus known to be present in the sample is colored, and Homo sapiens matches in samples from humans are shown in black. (c) Number of reads detected for each species across uncharacterized sample pools, before and after capture with VALL. Viral species present in each sample (Fig. 4b) are colored, and Homo sapiens matches in human plasma samples are shown in black. Asterisks on species indicate ones that are not targeted by VALL. (d) Same as (b) but for VWAFR in the uncharacterized sample pools. Asterisks on species indicate ones that are not targeted by VWAFR. In all panels, abundance was calculated by dividing species counts pre-capture by counts in pooled water controls.
Supplementary Figure 7 Genome assembly in EBOV dilution series and effect of sequencing depth on amount of viral material sequenced.
(a) Percent of viral genome assembled in a dilution series of viral input in two amounts of human RNA background. There are n = 2 technical replicates for each choice of input copies, background amount, and use of capture (n = 1 replicate for the negative control with 0 copies). Each dot indicates percent of genome assembled, from 200,000 reads, in a replicate; line is through the mean of the replicates. Label to the right of each line indicates amount of background material. Assemblies are from read data presented in Fig. 3a. (b) Number of unique viral reads sequenced at increasing sequencing depth, from an input of 10³ viral copies in different amounts of background. Horizontal axis gives the number of total reads to which a sample was subsampled. Each line is a technical replicate (n = 2) and shaded regions are 95% pointwise confidence bands calculated across random subsamplings. Dashed vertical line at 200,000 reads denotes the amount of total reads used in (a) and in Fig. 3a. Viral sequencing data generated after capture with VALL saturates more quickly than without capture. (c) Same as (b), but from an input of 10⁴ viral copies.
(a) Distribution of the enrichment in read depth, across viral genomes, provided by capture with VWAFR. Each curve represents a viral genome. At each position across a genome, the post-capture read depth is divided by the pre-capture depth, and the plotted curve is the empirical cumulative distribution of the log of these fold-change values. (b) Distribution of the enrichment in read depth, across viral genomes, provided by VWAFR over VALL. At each position across a genome, the read depth following capture with VWAFR is divided by the depth following capture with VALL, and the plotted curve is the empirical cumulative distribution of the log of these fold-change values. (c) Same as (a), but for the two-virus probe sets VMM and VZC. The mumps curves (green) show enrichment provided by VMM against pre-capture, and the Zika curves (purple) show enrichment provided by VZC against pre-capture. (d) Same as (b), but for the two-virus probe sets VMM and VZC. The mumps curves (green) show enrichment provided by VMM against VALL, and the Zika curves (purple) show enrichment provided by VZC against VALL.
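The per-position enrichment curves described above can be computed as in the following Python sketch. Several details are assumptions rather than the authors' stated procedure: the logarithm is taken base 10, positions with zero depth in either dataset are skipped because their log fold-change is undefined, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def log_fold_changes(numer_depth: Sequence[float],
                     denom_depth: Sequence[float]) -> List[float]:
    """Per-position log10 of (numerator depth / denominator depth).

    For panel (a) the numerator is post-capture depth and the denominator is
    pre-capture depth. Positions with zero depth in either input are skipped
    (an assumption made here for illustration).
    """
    if len(numer_depth) != len(denom_depth):
        raise ValueError("depth vectors must cover the same positions")
    return [
        math.log10(num / den)
        for num, den in zip(numer_depth, denom_depth)
        if num > 0 and den > 0
    ]


def empirical_cdf(values: Sequence[float]) -> List[Tuple[float, float]]:
    """Sorted (value, cumulative fraction) pairs for plotting an ECDF."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (k + 1) / n) for k, v in enumerate(ordered)]


# Hypothetical depths at four positions of one genome.
post = [100, 10, 0, 50]
pre = [10, 10, 5, 0]
curve = empirical_cdf(log_fold_changes(post, pre))
```

The same two functions apply to the probe-set comparisons in panels (b) and (d) by passing the depth after capture with one probe set as the numerator and the depth after capture with the other as the denominator.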
Variable enrichment across segments of an influenza A virus sample of subtype H4N4 (IAV-SM5). Segments 4 and 6 contain the most genetic diversity and divergence from probe sequences. No sequences of the N4 subtype were included in the design of VALL or VWAFR. (a) Depth of coverage across the sample’s genome. Each of the eight segments in IAV is labeled. (b, c) Distribution of the enrichment in read depth provided by capture with VALL (b) and VWAFR (c). Each curve represents one of the eight segments. At each position across a genome, the post-capture read depth is divided by the pre-capture depth, and the plotted curve is the empirical cumulative distribution of the log of these fold-change values.
Supplementary Figure 10 Sequencing results of Lassa virus from the 2018 Lassa fever outbreak in Nigeria.
(a) Number of unique LASV reads, among 200,000 reads in total, sequenced following capture with VALL compared to pre-capture in 23 samples from the 2018 Lassa fever outbreak. Points are colored by the state in Nigeria that the sample is from (black is NTC). (b) Percent of LASV genome assembled, after use of VALL, against the fraction of pre-capture reads that are LASV. Points to the left of the horizontal break correspond to samples with no LASV reads pre-capture. As in Fig. 4a, reads were downsampled to 200,000 before assembly. Points are colored as in (a). (c) Percent of LASV genome assembled, after use of VALL. Here, reads were not downsampled before assembly. Bars are ordered as in Fig. 4a and colored by the state in Nigeria that the sample is from.
Supplementary Figure 11 Depth of coverage observed for viral species detected in uncharacterized samples.
Depth of coverage plots for 25 viral genomes detected by metagenomic analysis of uncharacterized samples following capture with VALL (see Fig. 4b). Read depths are shown on a linear scale.
Supplementary Figures 1–11 and Supplementary Notes 1–3
Input taxa, input data, parameters selected, and other details about the four probe sets presented here
Origins, source materials, and GenBank accessions for samples
Sequencing summary metrics for patient and environmental samples with known viral infections
Metagenomic species counts for samples
Sequencing summary metrics for EBOV dilution series
Data on within-host variants in DENV samples that were used in the analysis of preservation of within-host variation
Sequencing summary metrics and metadata for LASV samples from 2018 Lassa fever outbreak in Nigeria
Sequencing summary metrics for uncharacterized samples
Cost estimates for sequencing with and without capture
GenBank accessions used for taxonomic filtering before viral genome assembly