Metagenomic sequencing has the potential to transform microbial detection and characterization, but new tools are needed to improve its sensitivity. Here we present CATCH, a computational method to enhance nucleic acid capture for enrichment of diverse microbial taxa. CATCH designs optimal probe sets, with a specified number of oligonucleotides, that achieve full coverage of, and scale well with, known sequence diversity. We focus on applying CATCH to capture viral genomes in complex metagenomic samples. We design, synthesize, and validate multiple probe sets, including one that targets the whole genomes of the 356 viral species known to infect humans. Capture with these probe sets enriches unique viral content on average 18-fold, allowing us to assemble genomes that could not be recovered without enrichment, and accurately preserves within-sample diversity. We also use these probe sets to recover genomes from the 2018 Lassa fever outbreak in Nigeria and to improve detection of uncharacterized viral infections in human and mosquito samples. The results demonstrate that CATCH enables more sensitive and cost-effective metagenomic sequencing.


Code availability

The latest version of CATCH and its full source code are available at https://github.com/broadinstitute/catch under the terms of the MIT license. For designing the VALL probe set, we used CATCH v0.5.0 (available in the repository on GitHub).

Data availability

Sequences used as input for probe design are available in the repository at https://github.com/broadinstitute/catch (see Supplementary Table 1 for links to specific versions used). Sequences of the probe designs (with 20-nt adaptors where applicable) developed here are available at https://github.com/broadinstitute/catch/tree/cf500c6/probe-designs. Sequencing data from this study, as well as viral genomes generated as part of this work, have been deposited in NCBI databases under BioProject accession PRJNA431306 (PRJNA436552 for the 2018 Lassa virus genomes).

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.




Acknowledgements

We thank S. Ye, C. Myhrvold, S. Weingarten-Gabbay, C. Freije, S. Schaffner, and other members of the Sabeti laboratory for useful discussions and feedback on the manuscript; B. Chak for assistance with ethical approvals and compliance; and Boca Biolistics, the Florida Department of Health, Miami-Dade County Mosquito Control, Research Blood Components, the Ragon Institute Cellular Immunology Database, and Brigham and Women’s Hospital’s Crimson Core for support with samples. This project has been funded in whole or in part with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under grant number U19AI110818 to the Broad Institute. This project was also funded in part by NIH NIAID contract HHSN272200900049C, a Broadnext10 gift from the Broad Institute, Henry M. Jackson Foundation award W81XWH-11-2-0174, and the Bill & Melinda Gates Foundation. IAV samples were funded by NIH NIAID contract HHSN272201400008C to J.A.R. K.J.S. is supported by a fellowship from the Human Frontiers in Science Program (LT000553/2016). S.I. and S.F.M. are supported by NIH NIAID R01AI099210. C.T.H. is supported by NIH NHGRI U01HG007480 and U54HG007480 and by World Bank project ACE019.

Author information

Author notes

  1. These authors contributed equally: Hayden C. Metsky, Katherine J. Siddle.

  2. These authors jointly supervised this work: Pardis C. Sabeti, Christian B. Matranga.

  3. A list of members and affiliations appears in Supplementary Note 3.


  1. Broad Institute of MIT and Harvard, Cambridge, MA, USA

    • Hayden C. Metsky
    • Katherine J. Siddle
    • Adrianne Gladden-Young
    • James Qu
    • David K. Yang
    • Patrick Brehio
    • Anne Piantadosi
    • Shirlee Wohl
    • Amber Carter
    • Aaron E. Lin
    • Kayla G. Barnes
    • Daniel J. Park
    • Andreas Gnirke
    • Pardis C. Sabeti
    • Christian B. Matranga
  2. Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA

    • Hayden C. Metsky
  3. Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA

    • Katherine J. Siddle
    • David K. Yang
    • Shirlee Wohl
    • Aaron E. Lin
    • Kayla G. Barnes
    • Pardis C. Sabeti
  4. Faculty of Arts and Sciences, Harvard University, Cambridge, MA, USA

    • Andrew Goldfarb
  5. Division of Infectious Diseases, Massachusetts General Hospital, Boston, MA, USA

    • Anne Piantadosi
    • Douglas S. Kwon
  6. Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA, USA

    • Kayla G. Barnes
    • Christian T. Happi
    • Pardis C. Sabeti
  7. The Ragon Institute of MGH, MIT and Harvard, Cambridge, MA, USA

    • Damien C. Tully
    • Björn Corleis
    • Douglas S. Kwon
    • Todd M. Allen
  8. Massachusetts Department of Public Health, Boston, MA, USA

    • Scott Hennigan
    • Sandra Smole
  9. Fundação Oswaldo Cruz (FIOCRUZ), Rio de Janeiro, Rio de Janeiro, Brazil

    • Giselle Barbosa-Lima
    • Yasmine R. Vieira
    • Fernando A. Bozza
    • Thiago M. L. Souza
  10. Department of Biological Sciences, College of Arts and Sciences, Florida Gulf Coast University, Fort Myers, FL, USA

    • Lauren M. Paul
    • Amanda L. Tan
    • Sharon Isern
    • Scott F. Michael
  11. Instituto de Investigacion en Microbiologia, Universidad Nacional Autónoma de Honduras, Tegucigalpa, Honduras

    • Kimberly F. Garcia
    • Leda A. Parham
    • Ivette Lorenzana
  12. Institute of Lassa Fever Research and Control, Irrua Specialist Teaching Hospital, Irrua, Nigeria

    • Ikponmwosa Odia
    • Christian T. Happi
  13. African Center of Excellence for Genomics of Infectious Disease (ACEGID), Redeemer’s University, Ede, Nigeria

    • Philomena Eromon
    • Onikepe A. Folarin
    • Christian T. Happi
  14. Department of Biological Sciences, College of Natural Sciences, Redeemer’s University, Ede, Nigeria

    • Onikepe A. Folarin
    • Christian T. Happi
  15. Lassa Fever Laboratory, Kenema Government Hospital, Kenema, Sierra Leone

    • Augustine Goba
    • Donald S. Grant
  16. Evolutionary Genomics of RNA Viruses, Virology Department, Institut Pasteur, Paris, France

    • Etienne Simon-Lorière
  17. Integrated Research Facility, Division of Clinical Research, National Institute of Allergy and Infectious Diseases, US National Institutes of Health, Frederick, MD, USA

    • Lisa Hensley
  18. Laboratorio Nacional de Virología, Centro Nacional de Diagnóstico y Referencia, Ministry of Health, Managua, Nicaragua

    • Angel Balmaseda
  19. Division of Infectious Diseases and Vaccinology, School of Public Health, University of California, Berkeley, Berkeley, CA, USA

    • Eva Harris
  20. Department of Infectious Disease and Global Health, Cummings School of Veterinary Medicine, Tufts University, North Grafton, MA, USA

    • Jonathan A. Runstadler
  21. Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA, USA

    • Lee Gehrke
    • Irene Bosch
  22. Department of Microbiology and Immunobiology, Harvard Medical School, Boston, MA, USA

    • Lee Gehrke
  23. Department of Microbiology, Immunology and Pathology, Colorado State University, Fort Collins, CO, USA

    • Gregory Ebel
  24. College of Medicine and Allied Health Sciences, University of Sierra Leone, Freetown, Sierra Leone

    • Donald S. Grant
  25. Howard Hughes Medical Institute, Chevy Chase, MD, USA

    • Pardis C. Sabeti




  1. Viral Hemorrhagic Fever Consortium


    H.C.M., D.J.P., A. Gnirke, P.C.S., and C.B.M. initiated the study of improved design and application of comprehensive probe sets. H.C.M. conceived of CATCH and implemented it with advice from D.J.P., A. Gnirke, and C.B.M. K.J.S. and C.B.M. conceived of experimental design for evaluating probe sets. C.B.M., J.Q., A.G.-Y., and K.J.S. developed enrichment protocols with help from A. Goldfarb. K.J.S., A.G.-Y., J.Q., and P.B. prepared samples, performed enrichment, and sequenced samples. A.P., S.W., A.C., A.E.L., and K.G.B. helped with sample preparation and enrichment. D.C.T., B.C., S.H., G.B.-L., Y.R.V., L.M.P., A.L.T., K.F.G., L.A.P., A.B., E.H., D.S.K., T.M.A., J.A.R., S.S., F.A.B., T.M.L.S., S.I., S.F.M., I.L., L.G., and I.B. collected and shared samples with known viral content. E.S.-L. and L.H. shared viral seed stocks. G.E. shared uncharacterized mosquito pools. I.O., P.E., O.A.F., A. Goba, D.S.G., and C.T.H. collected human plasma samples from Nigeria and Sierra Leone. H.C.M. and K.J.S. formulated and performed data analyses with help from D.K.Y. H.C.M., K.J.S., and C.B.M. wrote the manuscript with input from other authors.

    Competing interests

    H.C.M., D.J.P., A. Gnirke, P.C.S. and C.B.M. are co-inventors on a patent application filed by the Broad Institute related to work in this manuscript (US 15/756546).

    Corresponding authors

    Correspondence to Hayden C. Metsky or Katherine J. Siddle.

    Integrated supplementary information

    1. Supplementary Figure 1 Parameters used by CATCH in default model of hybridization.

      CATCH models hybridization between each candidate probe and the target sequences. Doing so allows CATCH to decide whether a candidate probe captures (or ‘covers’) a region of the target sequence, and thus find a probe set that achieves a desired coverage of the target sequences under this model. For whole genome enrichment, the desired coverage would typically be 100% of each target sequence. (a) Relatively conserved regions (for example, a particular gene) in the input sequences can be captured with few probes because it is likely that any given probe, under a model of hybridization, will capture observed variation across many or all of the input sequences. Highly variable regions may require many probes to be captured because each given probe may capture the observed variation across only a small fraction of the input sequences. (b) By default, CATCH decides whether a probe hybridizes to a region of a target sequence according to the following parameters: a number m of mismatches to tolerate and a length lcf of a longest common substring. CATCH computes the longest common substring with at most m mismatches between the probe and target subsequence, and decides that the probe hybridizes to the target if and only if the length of this is at least lcf. If the parameter i is provided, CATCH additionally requires that the probe and target subsequence share an exact (0-mismatch) match of length at least i. If CATCH decides that the probe hybridizes to the subsequence of the target with which it shares a substring, then it determines that the probe captures the region equal to the length of the probe as well as e nt on each side of this region. e, termed a cover extension, is a parameter whose value can be specified to CATCH, along with m, lcf, and i. Lower values of m, higher values of lcf, higher values of i, and lower values of e are more conservative and lead to more probe sequences. (For details, see the description of fmap in Online Methods.) 
      (c) Number of probes required to fully capture 300 genomes of HCV, HIV-1, EBOV, and ZIKV, for varying values of the mismatches and cover extension parameters, with other parameters fixed. Shaded regions are 95% pointwise confidence bands calculated across randomly sampled input genomes.

    2. Supplementary Figure 2 Scaling probe count with diversity of viral genomes.

      Number of probes required to fully capture increasing numbers of HIV-1, EBOV, and ZIKV genomes. Approaches shown are simple tiling (gray), a clustering-based approach at two levels of stringency (red; see Supplementary Note 2 for details), and CATCH at three choices of parameters (blue). Shaded regions are 95% pointwise confidence bands calculated across randomly sampled input genomes.

    3. Supplementary Figure 3 Design of the VWAFR probe set.

      (a) Number of probes designed by CATCH for each dataset among all 89,990 probes in the VWAFR probe set. The total includes reverse complement probes, which were added to the design of VWAFR for synthesis. (b) Values of two parameters selected by CATCH for each dataset in the design of VWAFR: number of mismatches to tolerate in hybridization and length of the target fragment (in nt) on each side of the hybridized region assumed to be captured along with the hybridized region (cover extension). The label within each bubble is the number of datasets that were assigned a particular combination of values. Species included in our sample testing are labeled; for full list of parameter values, see Supplementary Table 1.

    4. Supplementary Figure 4 Depth of coverage observed across viral genomes from samples with known viral infections.

      Depth of coverage across 31 viral genomes from the analysis of 30 patient and environmental samples with known viral infections (one sample contained two known viruses). Shown on (a) linear and (b) logarithmic scales. The logarithmic scale helps compare variance in depth across each genome between pre- and post-captured data.

    5. Supplementary Figure 5 Relation between enrichment of viral content and viral titer.

      Fraction of all downsampled pre-capture reads that mapped to the reference genome (shown on the horizontal axis) for 24 viral genomes reflects a wide range of initial viral concentrations in these samples. Enrichment (shown on the vertical axis) was calculated by dividing the total number of post-capture reads mapping to a reference genome by the number of mapped pre-capture reads. Those with the highest viral content showed lower enrichment following capture with VALL. Seven of the 31 viral genomes included in the analysis are excluded from this plot because they yielded fewer than 200,000 total reads (Supplementary Table 3). Two IAV samples with a high fraction of viral reads pre-capture (bottom right) overlap on the plot. One sample (ZIKV-SM3, top left) showed no viral reads pre-capture, so its fold-change is undefined.

    6. Supplementary Figure 6 Metagenomic sequencing results for pre- and post-capture samples.

      (a) Number of species detected (with at least 1 assigned read) in samples with known viral infections. Counts are shown before capture (orange), after capture with VWAFR (light blue), and after capture with VALL (dark blue). (b) Left: Number of reads detected for each species across samples with known viral infections, before and after capture with VWAFR. Right: Abundance of each species before capture and fold-change upon capture with VWAFR. For each sample, the virus known to be present in the sample is colored, and Homo sapiens matches in samples from humans are shown in black. (c) Number of reads detected for each species across uncharacterized sample pools, before and after capture with VALL. Viral species present in each sample (Fig. 4b) are colored, and Homo sapiens matches in human plasma samples are shown in black. Asterisks on species indicate ones that are not targeted by VALL. (d) Same as (b) but for VWAFR in the uncharacterized sample pools. Asterisks on species indicate ones that are not targeted by VWAFR. In all panels, abundance was calculated by dividing species counts pre-capture by counts in pooled water controls.

    7. Supplementary Figure 7 Genome assembly in EBOV dilution series and effect of sequencing depth on amount of viral material sequenced.

      (a) Percent of viral genome assembled in a dilution series of viral input in two amounts of human RNA background. There are n = 2 technical replicates for each choice of input copies, background amount, and use of capture (n = 1 replicate for the negative control with 0 copies). Each dot indicates percent of genome assembled, from 200,000 reads, in a replicate; line is through the mean of the replicates. Label to the right of each line indicates amount of background material. Assemblies are from read data presented in Fig. 3a. (b) Number of unique viral reads sequenced at increasing sequencing depth, from an input of 10^3 viral copies in different amounts of background. Horizontal axis gives the number of total reads to which a sample was subsampled. Each line is a technical replicate (n = 2) and shaded regions are 95% pointwise confidence bands calculated across random subsamplings. Dashed vertical line at 200,000 reads denotes the amount of total reads used in (a) and in Fig. 3a. Viral sequencing data generated after capture with VALL saturates more quickly than without capture. (c) Same as (b), but from an input of 10^4 viral copies.

    8. Supplementary Figure 8 Enrichment in read depth with focused probe sets.

      (a) Distribution of the enrichment in read depth, across viral genomes, provided by capture with VWAFR. Each curve represents a viral genome. At each position across a genome, the post-capture read depth is divided by the pre-capture depth, and the plotted curve is the empirical cumulative distribution of the log of these fold-change values. (b) Distribution of the enrichment in read depth, across viral genomes, provided by VWAFR over VALL. At each position across a genome, the read depth following capture with VWAFR is divided by the depth following capture with VALL, and the plotted curve is the empirical cumulative distribution of the log of these fold-change values. (c) Same as (a), but for the two-virus probe sets VMM and VZC. The mumps curves (green) show enrichment provided by VMM against pre-capture, and the Zika curves (purple) show enrichment provided by VZC against pre-capture. (d) Same as (b), but for the two-virus probe sets VMM and VZC. The mumps curves (green) show enrichment provided by VMM against VALL, and the Zika curves (purple) show enrichment provided by VZC against VALL.

    9. Supplementary Figure 9 Enrichment across segments of influenza A virus (H4N4).

      Variable enrichment across segments of an influenza A virus sample of subtype H4N4 (IAV-SM5). Segments 4 and 6 contain the most genetic diversity and divergence from probe sequences. No sequences of the N4 subtype were included in the design of VALL or VWAFR. (a) Depth of coverage across the sample’s genome. Each of the eight segments in IAV is labeled. (b, c) Distribution of the enrichment in read depth provided by capture with VALL (b) and VWAFR (c). Each curve represents one of the eight segments. At each position across a genome, the post-capture read depth is divided by the pre-capture depth, and the plotted curve is the empirical cumulative distribution of the log of these fold-change values.

    10. Supplementary Figure 10 Sequencing results of Lassa virus from the 2018 Lassa fever outbreak in Nigeria.

      (a) Number of unique LASV reads, among 200,000 reads in total, sequenced following capture with VALL compared to pre-capture in 23 samples from the 2018 Lassa fever outbreak. Points are colored by the state in Nigeria that the sample is from (black is NTC). (b) Percent of LASV genome assembled, after use of VALL, against the fraction of pre-capture reads that are LASV. Points to the left of the horizontal break correspond to samples with no LASV reads pre-capture. As in Fig. 4a, reads were downsampled to 200,000 before assembly. Points are colored as in (a). (c) Percent of LASV genome assembled, after use of VALL. Here, reads were not downsampled before assembly. Bars are ordered as in Fig. 4a and colored by the state in Nigeria that the sample is from.

    11. Supplementary Figure 11 Depth of coverage observed for viral species detected in uncharacterized samples.

      Depth of coverage plots for 25 viral genomes detected by metagenomic analysis of uncharacterized samples following capture with VALL (see Fig. 4b). Read depths are shown on a linear scale.
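The default hybridization decision described in the caption of Supplementary Figure 1b can be illustrated with a minimal Python sketch. It is not taken from the CATCH source code: the function names `longest_run_with_mismatches` and `covered_range` are hypothetical, and the sketch assumes the probe is compared against a single gap-free window of the target that starts at a given position. The parameters m, lcf, i, and e follow the caption.

```python
from typing import Optional, Tuple


def longest_run_with_mismatches(a: str, b: str, m: int) -> int:
    """Length of the longest aligned stretch of a and b with at most m mismatches.

    Assumes a and b have equal length and are compared position by position
    (no gaps), which is a simplification made for this sketch.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    best = 0
    left = 0
    mismatches = 0
    for right in range(len(a)):
        if a[right] != b[right]:
            mismatches += 1
        # Shrink the window from the left until it holds at most m mismatches.
        while mismatches > m:
            if a[left] != b[left]:
                mismatches -= 1
            left += 1
        best = max(best, right - left + 1)
    return best


def covered_range(probe: str, target: str, start: int, m: int, lcf: int,
                  e: int, i: Optional[int] = None) -> Optional[Tuple[int, int]]:
    """Return the half-open target range the probe is deemed to capture, or None.

    The probe is placed at target[start:start + len(probe)]. It hybridizes if
    the longest stretch with at most m mismatches is at least lcf long and, when
    i is given, if an exact match of length at least i also exists. The captured
    range is the probe's footprint extended by e nt on each side, clipped to
    the ends of the target.
    """
    window = target[start:start + len(probe)]
    if len(window) < len(probe):
        return None  # probe would run off the end of the target
    if longest_run_with_mismatches(probe, window, m) < lcf:
        return None
    if i is not None and longest_run_with_mismatches(probe, window, 0) < i:
        return None
    return (max(0, start - e), min(len(target), start + len(probe) + e))
```

For example, a 10-nt probe that differs from its target window at one position is accepted with m = 1 and lcf = 10 but rejected with m = 0. This matches the caption's point that lower values of m are more conservative and therefore lead to more probes.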
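The enrichment quantities used in the captions of Supplementary Figures 5, 8, and 9 can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the log base is assumed to be 10 because the captions do not state it, and skipping positions with zero depth is an assumption made here to keep the ratio and its logarithm defined.

```python
import math
from typing import List, Optional, Sequence, Tuple


def fold_enrichment(post_mapped: int, pre_mapped: int) -> Optional[float]:
    """Post-capture mapped reads divided by pre-capture mapped reads.

    Returns None when no reads mapped before capture, in which case the
    fold-change is undefined (as noted for one sample in Supplementary Figure 5).
    """
    if pre_mapped == 0:
        return None
    return post_mapped / pre_mapped


def log_fold_changes(post_depth: Sequence[int],
                     pre_depth: Sequence[int]) -> List[float]:
    """Per-position log10 of post-capture depth over pre-capture depth.

    Positions where either depth is zero are skipped (an assumption of this
    sketch, not a rule stated in the captions).
    """
    values = []
    for post, pre in zip(post_depth, pre_depth):
        if post > 0 and pre > 0:
            values.append(math.log10(post / pre))
    return values


def empirical_cdf(values: Sequence[float]) -> List[Tuple[float, float]]:
    """Sorted (value, cumulative fraction) pairs of the empirical distribution."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (k + 1) / n) for k, v in enumerate(ordered)]
```

Plotting the pairs returned by `empirical_cdf(log_fold_changes(post, pre))` for one genome or segment would give a curve of the kind described in those captions, with curves shifted further right indicating greater enrichment in read depth.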

    Supplementary information

    1. Supplementary Text and Figures

      Supplementary Figures 1–11 and Supplementary Notes 1–3

    2. Reporting Summary

    3. Supplementary Table 1

      Input taxa, input data, parameters selected, and other details about the four probe sets presented here

    4. Supplementary Table 2

      Origins, source materials, and GenBank accessions for samples

    5. Supplementary Table 3

      Sequencing summary metrics for patient and environmental samples with known viral infections

    6. Supplementary Table 4

      Metagenomic species counts for samples

    7. Supplementary Table 5

      Sequencing summary metrics for EBOV dilution series

    8. Supplementary Table 6

      Data on within-host variants in DENV samples that were used in the analysis of preservation of within-host variation

    9. Supplementary Table 7

      Sequencing summary metrics and metadata for LASV samples from 2018 Lassa fever outbreak in Nigeria

    10. Supplementary Table 8

      Sequencing summary metrics for uncharacterized samples

    11. Supplementary Table 9

      Cost estimates for sequencing with and without capture

    12. Supplementary Table 10

      GenBank accessions used for taxonomic filtering before viral genome assembly
