Protocol

Nontargeted virus sequence discovery pipeline and virus clustering for metagenomic data

Nature Protocols volume 12, pages 1673–1682 (2017)

Abstract

The analysis of large microbiome data sets holds great promise for the delineation of the biological and metabolic functioning of living organisms and their role in the environment. In the midst of this genomic puzzle, viruses, especially those that infect microbial communities, represent a major reservoir of genetic diversity with great impact on biogeochemical cycles and organismal health. Overcoming the limitations associated with virus detection directly from microbiomes can provide key insights into how ecosystem dynamics are modulated. Here, we present a computational protocol for accurate detection and grouping of viral sequences from microbiome samples. Our approach relies on an expanded and curated set of viral protein families used as bait to identify viral sequences directly from metagenomic assemblies. This protocol describes how to use the viral protein families catalog (7 h) and recommended filters for the detection of viral contigs in metagenomic samples (6 h), and it describes the specific parameters for a nucleotide-sequence-identity-based method of organizing the viral sequences into quasi-species taxonomic-level groups (10 min).
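The final grouping step mentioned above (organizing viral sequences by nucleotide sequence identity) can be sketched as follows. This is a minimal illustration only: the abstract does not give the thresholds or the linkage rule, so the single-linkage strategy, the function name `cluster_by_identity`, the tuple layout of `hits`, and the default cutoffs (`min_identity`, `min_coverage`) are all assumptions made for the example, not the protocol's stated parameters. The pairwise hits are assumed to have been computed beforehand by a nucleotide aligner.

```python
from collections import defaultdict


def cluster_by_identity(seq_ids, hits, min_identity=90.0, min_coverage=0.75):
    """Group sequences by single-linkage over pairwise nucleotide hits.

    seq_ids: iterable of sequence identifiers (e.g. viral contig names).
    hits: iterable of (query_id, subject_id, percent_identity, coverage)
          tuples, where coverage is the aligned fraction (0-1) of the
          shorter sequence. Every id in hits must also be in seq_ids.
    The default thresholds are placeholders chosen for illustration.
    """
    seq_ids = list(seq_ids)
    parent = {s: s for s in seq_ids}

    def find(x):
        # Walk up to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Any pair that passes both thresholds is merged into one group.
    for query, subject, identity, coverage in hits:
        if query != subject and identity >= min_identity and coverage >= min_coverage:
            union(query, subject)

    groups = defaultdict(list)
    for s in seq_ids:
        groups[find(s)].append(s)
    return sorted(sorted(members) for members in groups.values())


if __name__ == "__main__":
    contigs = ["c1", "c2", "c3", "c4"]
    pairwise = [
        ("c1", "c2", 95.0, 0.90),  # passes both thresholds
        ("c2", "c3", 91.0, 0.80),  # passes, so c3 joins c1 and c2
        ("c3", "c4", 80.0, 0.90),  # identity too low, c4 stays alone
    ]
    print(cluster_by_identity(contigs, pairwise))
```

With single linkage, two sequences end up in the same group whenever a chain of qualifying pairwise hits connects them, even if they do not match each other directly. Stricter linkage rules would produce smaller, tighter groups.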

Acknowledgements

This work was supported by the US Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, under contract no. DE-AC02-05CH11231, and used resources of the National Energy Research Scientific Computing Center, supported by the Office of Science of the US Department of Energy.

Author information

Affiliations

  1. Joint Genome Institute, Department of Energy, Walnut Creek, California, USA.

    • David Paez-Espino
    • Georgios A Pavlopoulos
    • Natalia N Ivanova
    • Nikos C Kyrpides

Contributions

D.P.-E., N.N.I., and N.C.K. conceived and led the protocol. G.A.P. provided computational and scripting support. All authors wrote and edited the manuscript.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to David Paez-Espino.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Figure 1 and Supplementary Tables 1 and 2.

About this article

DOI

https://doi.org/10.1038/nprot.2017.063
