Abstract

Metagenomic sequence analysis is rapidly becoming the primary source of virus discovery [1,2,3]. A substantial majority of the currently available virus genomes come from metagenomics, and some of these represent extremely abundant viruses, even though they have never been grown in the laboratory. A particularly striking case of a virus discovered via metagenomics is crAssphage, which is by far the most abundant human-associated virus known, comprising up to 90% of sequences in the gut virome [4]. Over 80% of the predicted proteins encoded in the approximately 100 kilobase crAssphage genome showed no significant similarity to available protein sequences, precluding classification of this virus and hampering further study. Here we combine a comprehensive search of genomic and metagenomic databases with sensitive methods for protein sequence analysis to identify an expansive, diverse group of bacteriophages related to crAssphage and predict the functions of the majority of phage proteins, in particular those that comprise the structural, replication and expression modules. Most, if not all, of the crAss-like phages appear to be associated with diverse bacteria from the phylum Bacteroidetes, which includes some of the most abundant bacteria in the human gut microbiome and is also common in various other habitats. These findings provide a basis for experimental characterization of the most abundant but poorly understood members of the human-associated virome.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Rohwer, F. Global phage diversity. Cell 113, 141 (2003).
  2. Suttle, C. A. Marine viruses – major players in the global ecosystem. Nat. Rev. Microbiol. 5, 801–812 (2007).
  3. Simmonds, P. et al. Consensus statement: virus taxonomy in the age of metagenomics. Nat. Rev. Microbiol. 15, 161–168 (2017).
  4. Dutilh, B. E. et al. A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes. Nat. Commun. 5, 4498 (2014).
  5. Dutilh, B. E. Metagenomic ventures into outer sequence space. Bacteriophage 4, e979664 (2014).
  6. Ogilvie, L. A. & Jones, B. V. The human gut virome: a multifaceted majority. Front. Microbiol. 6, 918 (2015).
  7. Hurwitz, B. L., U’Ren, J. M. & Youens-Clark, K. Computational prospecting the great viral unknown. FEMS Microbiol. Lett. 363, fnw077 (2016).
  8. Yarygin, K. et al. Abundance profiling of specific gene groups using precomputed gut metagenomes yields novel biological hypotheses. PLoS ONE 12, e0176154 (2017).
  9. Manrique, P. et al. Healthy human gut phageome. Proc. Natl Acad. Sci. USA 113, 10400–10405 (2016).
  10. Ahlgren, N. A., Ren, J., Lu, Y. Y., Fuhrman, J. A. & Sun, F. Alignment-free d2* oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences. Nucleic Acids Res. 45, 39–53 (2017).
  11. Wexler, A. G. & Goodman, A. L. An insider’s perspective: Bacteroides as a window into the microbiome. Nat. Microbiol. 2, 17026 (2017).
  12. Whitaker, W. R., Shepherd, E. S. & Sonnenburg, J. L. Tunable expression tools enable single-cell strain distinction in the gut microbiome. Cell 169, 538–546 (2017).
  13. Pramono, A. K. et al. Discovery and complete genome sequence of a bacteriophage from an obligate intracellular symbiont of a cellulolytic protist in the termite gut. Microbes Environ. 32, 112–117 (2017).
  14. Holmfeldt, K. et al. Twelve previously unknown phage genera are ubiquitous in global oceans. Proc. Natl Acad. Sci. USA 110, 12798–12803 (2013).
  15. Oude Munnink, B. B. et al. Unexplained diarrhoea in HIV-1 infected individuals. BMC Infect. Dis. 14, 22 (2014).
  16. Brown, C. T. et al. Unusual biology across a group comprising more than 15% of domain Bacteria. Nature 523, 208–211 (2015).
  17. Burroughs, A. M., Kaur, G., Zhang, D. & Aravind, L. Novel clades of the HU/IHF superfamily point to unexpected roles in the eukaryotic centrosome, chromosome partitioning, and biologic conflicts. Cell Cycle 16, 1093–1103 (2017).
  18. Lander, G. C. et al. The P22 tail machine at subnanometer resolution reveals the architecture of an infection conduit. Structure 17, 789–799 (2009).
  19. Casjens, S. R. & Molineux, I. J. Short noncontractile tail machines: adsorption and DNA delivery by podoviruses. Adv. Exp. Med. Biol. 726, 143–179 (2012).
  20. Bhardwaj, A., Molineux, I. J., Casjens, S. R. & Cingolani, G. Atomic structure of bacteriophage Sf6 tail needle knob. J. Biol. Chem. 286, 30867–30877 (2011).
  21. Xiang, Y. et al. Crystal and cryoEM structural studies of a cell wall degrading enzyme in the bacteriophage phi29 tail. Proc. Natl Acad. Sci. USA 105, 9552–9557 (2008).
  22. Casjens, S. R. & Thuman-Commike, P. A. Evolution of mosaically related tailed bacteriophage genomes seen through the lens of phage P22 virion assembly. Virology 411, 393–415 (2011).
  23. Lane, W. J. & Darst, S. A. Molecular evolution of multisubunit RNA polymerases: sequence analysis. J. Mol. Biol. 395, 671–685 (2010).
  24. Iyer, L. M., Burroughs, A. M., Anand, S., de Souza, R. F. & Aravind, L. Polyvalent proteins, a pervasive theme in the intergenomic biological conflicts of bacteriophages and conjugative elements. J. Bacteriol. 199, e00245-17 (2017).
  25. Berdygulova, Z. et al. Temporal regulation of gene expression of the Thermus thermophilus bacteriophage P23-45. J. Mol. Biol. 405, 125–142 (2011).
  26. Iyer, L. M. & Aravind, L. Insights from the architecture of the bacterial transcription apparatus. J. Struct. Biol. 179, 299–319 (2012).
  27. Yakunina, M. et al. A non-canonical multisubunit RNA polymerase encoded by a giant bacteriophage. Nucleic Acids Res. 43, 10411–10420 (2015).
  28. Lavysh, D. et al. The genome of AR9, a giant transducing Bacillus phage encoding two multisubunit RNA polymerases. Virology 495, 185–196 (2016).
  29. Ruprich-Robert, G. & Thuriaux, P. Non-canonical DNA transcription enzymes and the conservation of two-barrel RNA polymerases. Nucleic Acids Res. 38, 4559–4569 (2010).
  30. Krupovic, M. & Koonin, E. V. Multiple origins of viral capsid proteins from cellular ancestors. Proc. Natl Acad. Sci. USA 114, E2401–E2410 (2017).
  31. Barr, J. J. et al. Subdiffusive motion of bacteriophage in mucosal surfaces increases the frequency of bacterial encounters. Proc. Natl Acad. Sci. USA 112, 13675–13680 (2015).
  32. Krupovic, M. et al. Taxonomy of prokaryotic viruses: update from the ICTV bacterial and archaeal viruses subcommittee. Arch. Virol. 161, 1095–1099 (2016).
  33. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
  34. Söding, J. Protein homology detection by HMM-HMM comparison. Bioinformatics 21, 951–960 (2005).
  35. Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
  36. Besemer, J. & Borodovsky, M. GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Res. 33, W451–454 (2005).
  37. Kelley, D. R., Liu, B., Delcher, A. L., Pop, M. & Salzberg, S. L. Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering. Nucleic Acids Res. 40, e9 (2012).
  38. Yutin, N., Makarova, K. S., Mekhedov, S. L., Wolf, Y. I. & Koonin, E. V. The deep archaeal roots of eukaryotes. Mol. Biol. Evol. 25, 1619–1630 (2008).
  39. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
  40. Guindon, S. et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59, 307–321 (2010).
  41. Bailey, T. L., Williams, N., Misleh, C. & Li, W. W. MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res. 34, W369–373 (2006).
  42. Thompson, W. A., Newberg, L. A., Conlan, S., McCue, L. A. & Lawrence, C. E. The Gibbs Centroid Sampler. Nucleic Acids Res. 35, W232–237 (2007).

Acknowledgements

The authors thank Y.I. Wolf and S. Shmakov for technical help and Koonin group members for discussion. N.Y., K.S.M., A.B.G. and E.V.K. are supported by intramural funds of the US Department of Health and Human Services (to the National Library of Medicine).

Author information

Affiliations

  1. National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD, USA

    • Natalya Yutin
    • Kira S. Makarova
    • Ayal B. Gussow
    • Anca Segall
    • Eugene V. Koonin
  2. Institut Pasteur, Unité Biologie Moléculaire du Gène chez les Extrêmophiles, Paris, France

    • Mart Krupovic
  3. Viral Information Institute, Department of Biology, San Diego State University, San Diego, CA, USA

    • Anca Segall
    • Robert A. Edwards

Authors

  1. Natalya Yutin
  2. Kira S. Makarova
  3. Ayal B. Gussow
  4. Mart Krupovic
  5. Anca Segall
  6. Robert A. Edwards
  7. Eugene V. Koonin

Contributions

E.V.K. conceived of the study. N.Y., K.S.M. and M.K. performed research. N.Y., K.S.M., A.B.G., M.K., A.S., R.A.E. and E.V.K. analysed the data. E.V.K. wrote the manuscript, which was read, edited and approved by all authors.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Eugene V. Koonin.

Supplementary information

  1. Supplementary information

    Supplementary Figures 1–3 and Supplementary Notes 1 and 2.

  2. Life Sciences Reporting Summary

  3. Supplementary Dataset 1

    The selected representative set of crAss-like family members; conserved genes in the crAss-like phage family (an extended version of Table 1); BLAST scores of conserved crAss-like family proteins.

  4. Supplementary Dataset 2

    Annotation of the crAssphage and IAS phage genes and comparison of the MetaGeneMark and the current RefSeq (Glimmer) crAssphage annotations.

About this article

DOI

https://doi.org/10.1038/s41564-017-0053-y