Nontargeted virus sequence discovery pipeline and virus clustering for metagenomic data

Paez-Espino, David; Pavlopoulos, Georgios A; Ivanova, Natalia N; Kyrpides, Nikos C

doi:10.1038/nprot.2017.063

Protocol
Published: 27 July 2017

David Paez-Espino (ORCID: orcid.org/0000-0002-2939-5398), Georgios A Pavlopoulos, Natalia N Ivanova & Nikos C Kyrpides

Nature Protocols volume 12, pages 1673–1682 (2017)

Abstract

The analysis of large microbiome data sets holds great promise for the delineation of the biological and metabolic functioning of living organisms and their role in the environment. In the midst of this genomic puzzle, viruses, especially those that infect microbial communities, represent a major reservoir of genetic diversity with great impact on biogeochemical cycles and organismal health. Overcoming the limitations associated with virus detection directly from microbiomes can provide key insights into how ecosystem dynamics are modulated. Here, we present a computational protocol for accurate detection and grouping of viral sequences from microbiome samples. Our approach relies on an expanded and curated set of viral protein families used as bait to identify viral sequences directly from metagenomic assemblies. This protocol describes how to use the viral protein families catalog (∼7 h) and recommended filters for the detection of viral contigs in metagenomic samples (∼6 h), and it describes the specific parameters for a nucleotide-sequence-identity-based method of organizing the viral sequences into quasi-species taxonomic-level groups (∼10 min).
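As an illustration of the final grouping step, the following is a minimal sketch of how sequences could be organized into groups from pairwise nucleotide-identity hits. It assumes single-linkage grouping (any passing hit joins two sequences into the same group), which the abstract does not specify. The function name, the input tuple layout, and the two thresholds (`min_identity`, `min_aligned_fraction`) are hypothetical placeholders, not the protocol's recommended parameters.

```python
from collections import defaultdict


def single_linkage_groups(names, hits, min_identity=90.0, min_aligned_fraction=0.75):
    """Group sequences linked by any pairwise hit passing both thresholds.

    names: iterable of sequence identifiers.
    hits:  iterable of (seq_a, seq_b, percent_identity, aligned_fraction).
    The default thresholds are illustrative assumptions only.
    """
    parent = {name: name for name in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, aligned_fraction in hits:
        if identity >= min_identity and aligned_fraction >= min_aligned_fraction:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(list)
    for name in names:
        groups[find(name)].append(name)
    # Sort members and groups so the output is deterministic.
    return sorted(sorted(members) for members in groups.values())


# Hypothetical contigs and hits, for illustration only.
example_names = ["c1", "c2", "c3", "c4"]
example_hits = [
    ("c1", "c2", 95.0, 0.90),
    ("c2", "c3", 92.0, 0.80),
    ("c3", "c4", 85.0, 0.95),  # fails the identity threshold
]
example_groups = single_linkage_groups(example_names, example_hits)
```

Here `c1`, `c2`, and `c3` end up in one group because they are chained by passing hits, while `c4` stays a singleton. In practice the pairwise hits would come from a nucleotide aligner, and the thresholds should be taken from the full protocol.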

**Figure 1: Overview of the computational workflow.**

**Figure 2: Metagenome sample used as an example.**

Acknowledgements

This work was supported by the US Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, under contract no. DE-AC02-05CH11231, and used resources of the National Energy Research Scientific Computing Center, supported by the Office of Science of the US Department of Energy.

Author information

Authors and Affiliations

Joint Genome Institute, Department of Energy, Walnut Creek, California, USA
David Paez-Espino, Georgios A Pavlopoulos, Natalia N Ivanova & Nikos C Kyrpides

Contributions

D.P.-E., N.N.I., and N.C.K. conceived and led the protocol. G.A.P. provided computational and scripting support. All authors wrote and edited the manuscript.

Corresponding author

Correspondence to David Paez-Espino.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Details of the protocol for the given example.

Pipeline of the workflow including the name of all the files generated during the virus detection for the sample identified as 3300001348 in IMG/M (in blue), as well as the approximate time necessary for each of the steps of the protocol (in red), and required scripts (bold black). The three yellow boxes indicate the three final outputs of this exercise: (i) 640 unique metagenomic viral contigs (mVCs) detected; (ii) 246 viral groups that include 268 mVCs (out of the 640) from the given example as well as 457 metagenomic viral contigs from 32 other different metagenomes; and (iii) a list of 12,963 viral sequences of low abundance (from 8,436 unique viral groups) with at least 10% of their length covered by unassembled reads (>90% sequence identity) from the targeted metagenome.
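The coverage criterion in output (iii) can be sketched as follows. This is a minimal illustration, assuming read alignments are available as 0-based, half-open `(start, end, percent_identity)` tuples on a viral sequence; the function names and input layout are hypothetical and are not the protocol's scripts. Only the two thresholds (at least 10% of the length covered, reads with >90% identity) come from the caption above.

```python
def covered_fraction(seq_length, alignments, min_identity=90.0):
    """Fraction of a sequence covered by reads with identity > min_identity.

    alignments: iterable of (start, end, percent_identity), 0-based half-open.
    Overlapping alignments are merged so no position is counted twice.
    """
    intervals = sorted(
        (max(0, start), min(seq_length, end))
        for start, end, identity in alignments
        if identity > min_identity
    )
    covered = 0
    current_start = current_end = None
    for start, end in intervals:
        if start >= end:
            continue  # empty after clipping to the sequence bounds
        if current_end is None or start > current_end:
            # Close the previous merged interval and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / seq_length


def passes_coverage_filter(seq_length, alignments, min_fraction=0.10):
    """True if at least min_fraction of the sequence is covered."""
    return covered_fraction(seq_length, alignments) >= min_fraction


# Hypothetical 1,000-nt sequence: two overlapping high-identity reads
# cover positions 0-120; the 80%-identity read is ignored.
example_alignments = [(0, 60, 95.0), (50, 120, 99.0), (500, 700, 80.0)]
example_fraction = covered_fraction(1000, example_alignments)
example_pass = passes_coverage_filter(1000, example_alignments)
```

In this example 120 of 1,000 positions are covered, so the sequence passes the 10% threshold. Merging overlapping intervals before summing avoids overestimating coverage when many reads pile up on the same region.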

Supplementary information

Supplementary Text and Figures

Supplementary Figure 1 and Supplementary Tables 1 and 2. (PDF 13452 kb)

About this article

Cite this article

Paez-Espino, D., Pavlopoulos, G., Ivanova, N. et al. Nontargeted virus sequence discovery pipeline and virus clustering for metagenomic data. Nat Protoc 12, 1673–1682 (2017). https://doi.org/10.1038/nprot.2017.063

Published: 27 July 2017
Issue Date: August 2017
DOI: https://doi.org/10.1038/nprot.2017.063
