Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes

Journal name: Nature Biotechnology
Volume: 31
Pages: 533–538
DOI: 10.1038/nbt.2579

Abstract

Reference genomes are required to understand the diverse roles of microorganisms in ecology, evolution, human and animal health, but most species remain uncultured. Here we present a sequence composition–independent approach to recover high-quality microbial genomes from deeply sequenced metagenomes. Multiple metagenomes of the same community, which differ in relative population abundances, were used to assemble 31 bacterial genomes, including rare (<1% relative abundance) species, from an activated sludge bioreactor. Twelve genomes were assembled into complete or near-complete chromosomes. Four belong to the candidate bacterial phylum TM7 and represent the most complete genomes for this phylum to date (relative abundances, 0.06–1.58%). Reanalysis of published metagenomes reveals that differential coverage binning facilitates recovery of more complete and higher fidelity genome bins than other currently used methods, which are primarily based on sequence composition. This approach will be an important addition to the standard metagenome toolbox and greatly improve access to genomes of uncultured microorganisms.
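
The core idea of differential coverage binning can be illustrated with a minimal Python sketch. It is not the authors' implementation (their scripts are distributed with the supplementary data); the `Scaffold` class, the `extract_bin` helper, the rectangular coverage window, the 5 kbp length cutoff and the toy coverage values are all assumptions made for illustration. Scaffolds from one population should have similar coverage within each metagenome, so a bin can be approximated by selecting scaffolds whose coverage pair falls inside a window.

```python
from dataclasses import dataclass


@dataclass
class Scaffold:
    name: str
    length: int    # scaffold length in bp
    cov_a: float   # mean read coverage in metagenome A
    cov_b: float   # mean read coverage in metagenome B


def extract_bin(scaffolds, a_range, b_range, min_length=5000):
    """Return scaffolds whose (cov_a, cov_b) pair lies inside a rectangular window.

    min_length is an illustrative cutoff (assumption), not a prescribed value.
    """
    (a_lo, a_hi), (b_lo, b_hi) = a_range, b_range
    return [
        s for s in scaffolds
        if s.length >= min_length
        and a_lo <= s.cov_a <= a_hi
        and b_lo <= s.cov_b <= b_hi
    ]


# Toy data, invented for this example.
scaffolds = [
    Scaffold("s1", 12000, 40.0, 8.0),
    Scaffold("s2", 9000, 42.0, 7.5),
    Scaffold("s3", 15000, 41.0, 30.0),  # similar coverage in A, different in B
    Scaffold("s4", 3000, 39.0, 8.2),    # shorter than the length cutoff
]

picked = extract_bin(scaffolds, a_range=(35, 45), b_range=(6, 10))
print([s.name for s in picked])
```

In this toy case `s3` cannot be separated from `s1` and `s2` by its coverage in metagenome A alone. Only the second coverage estimate places it outside the window, which is why two metagenomes with different relative abundances are needed.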

Figures

  1. Figure 1: Sequence composition–independent binning of metagenome scaffolds from the lab-scale bioreactor using differential coverage (HP+, HP−).

    Circles represent scaffolds, scaled by the square root of their length and colored by GC content. Only scaffolds ≥5 kbp are shown. Clusters of similarly colored circles represent potential genome bins, the centroids of which are indicated by numbered circles and colored according to phylum-level taxonomic affiliation (Table 1 and Supplementary Table 2). This differential coverage plot provides the starting point for secondary refinement and finishing of genome assemblies (Fig. 2).

  2. Figure 2: Overview of the pipeline to obtain high-quality population genomes from multiple deep metagenomes using differential coverage as the primary binning method, illustrated using the population genome TM7-AAU-ii.

    Numbers refer to subsections in Online Methods and in the detailed step-by-step guide on GitHub. Steps 1–4: DNA was extracted using two different methods (HP+, HP−), which produced different population abundances. Each sample (HP+, HP−) was then shotgun-sequenced (150 bp paired-end, average 124 bp after trimming), followed by independent scaffold assembly. Only the HP− scaffolds were used to extract population genome bins. Steps 5–8: preparation of data for the subsequent binning steps. Differential coverage was estimated by independently mapping the reads from each metagenome to the scaffolds from the HP− assembly, producing two abundance estimates (coverage) per scaffold. In addition, the GC content and tetranucleotide frequencies were calculated for each scaffold, and conserved essential single-copy marker genes were identified. Step 9: scaffolds were binned (clustered) into population genomes by plotting the two coverage estimates (one from each metagenome) against each other for all HP− scaffolds (Fig. 1). Scaffold subsets that cluster together represent putative population genomes and were extracted as initial bins. Because multiple species could be present in the same coverage-defined subset, the selected scaffold subset was further refined using principal component analysis of tetranucleotide frequencies. Step 10: genes present in multiple copies (for example, 16S rRNA genes or transposases) are not included in the initial coverage-defined subset. Instead, paired-end read information is used to associate multiple-copy genes with the appropriate genome bin (Supplementary Fig. 2). Steps 11, 12: all reads associated with a genome bin of interest are extracted and reassembled using parameters optimized for each genome, as the bins can now be treated as standard single genomes. Population genome assemblies were validated using conserved single-copy gene analysis and through Circos (a visualization tool), in which all relevant assembly metrics, including FRCbam statistics (ref. 23), are integrated to identify misassemblies and other structural problems. All data generation and integration are automated and can be carried out using a FASTA file of the assembled scaffolds and SAM files of the read mappings to the scaffolds.

  3. Figure 3: Reanalysis of published metagenomes (ref. 17) using the differential coverage approach.

    (a) Sequence composition–independent binning using metagenome coverage of two samples, A and C. All circles represent scaffolds, scaled by the square root of their length and colored by GC content. Only scaffolds >2 kbp are shown for consistency with the original study. Clusters of scaffolds represent putative genome bins. (b) Coverage analysis of the scaffolds in the genome bin ACD7, for which ESOM was used for primary binning (ref. 17). (c) Primary binning by differential coverage improved genome completeness (101 versus 89 essential genes) and removed non-target scaffolds from closely related populations (0 versus 11 duplicated genes and no low-coverage contamination). Ess., essential.

  4. Figure 4: Overview of the metabolism, cell wall characteristics and morphology of TM7.

    (a) Metabolic reconstruction of the four TM7 genomes highlighting the presence and absence of pathways. See Supplementary Table 5 for details. (b) Genome tree of the bacterial domain constructed using a concatenated alignment of 38 phylogenetically conserved proteins and associated phylum-level cell envelope classification: Monoderm (M), Diderm (D), Diderm-LPS (DL), Diderm-Atypical (DA). Only some Spirochaetes have LPS (ref. 28). The associated heat map shows protein families substantially enriched (black) or depleted in archetypal monoderm lineages (Actinobacteria and Firmicutes) relative to an archetypal diderm lineage (Proteobacteria), most of which have known roles in cell envelope biosynthesis. Black dots in the genome tree represent branches with ≥75% bootstrap support. (c) FISH micrographs of TM7 cells (red) showing coccus morphology with a diameter of ~0.7 μm. The images show that the cells are embedded in flocs and confirm that they are in low abundance.
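
The per-scaffold features named in the Figure 2 caption (GC content and tetranucleotide frequencies) and the single-copy marker gene check used in Figure 3c can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' scripts: the function names are hypothetical, the tetranucleotide counts are not merged with their reverse complements, and the marker gene names in the usage example are placeholders rather than the essential gene set used in the study.

```python
from collections import Counter
from itertools import product

# All 256 tetranucleotides in a fixed order, so every scaffold yields a
# feature vector with the same layout (suitable as input to a PCA).
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def tetranucleotide_freq(seq):
    """Relative frequency of each tetranucleotide, from overlapping windows.

    Windows containing ambiguous bases (for example N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)
    return [counts[k] / total if total else 0.0 for k in KMERS]


def marker_summary(hits, n_expected):
    """Summarize marker gene hits for one bin.

    hits: one entry per detected gene copy (placeholder names in the example).
    n_expected: size of the marker set being searched for (an input assumption).
    """
    counts = Counter(hits)
    found = len(counts)
    duplicated = sum(1 for c in counts.values() if c > 1)
    return {"found": found, "duplicated": duplicated,
            "completeness": found / n_expected}


seq = "ACGTACGTAC"
print(gc_content(seq))
print(len(tetranucleotide_freq(seq)))
print(marker_summary(["geneA", "geneB", "geneB", "geneC"], n_expected=4))
```

A bin with many duplicated markers probably contains scaffolds from more than one closely related population, whereas a low count of distinct markers suggests an incomplete genome. These are the two quantities compared in Figure 3c.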

References

  1. Wu, D. et al. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature 462, 1056–1060 (2009).
  2. Hugenholtz, P. Exploring prokaryotic diversity in the genomic era. Genome Biol. 3, S0003 (2002).
  3. Achtman, M. & Wagner, M. Microbial diversity and the genetic nature of microbial species. Nat. Rev. Microbiol. 6, 431–440 (2008).
  4. Knittel, K. & Boetius, A. Anaerobic oxidation of methane: progress with an unknown process. Annu. Rev. Microbiol. 63, 311–334 (2009).
  5. Elinav, E. et al. NLRP6 inflammasome regulates colonic microbial ecology and risk for colitis. Cell 145, 745–757 (2011).
  6. Brinig, M.M., Lepp, P.W., Ouverney, C.C., Armitage, G.C. & Relman, D.A. Prevalence of bacteria of division TM7 in human subgingival plaque and their association with disease. Appl. Environ. Microbiol. 69, 1687–1694 (2003).
  7. Marcy, Y. et al. Dissecting biological “dark matter” with single-cell genetic analysis of rare and uncultivated TM7 microbes from the human mouth. Proc. Natl. Acad. Sci. USA 104, 11889–11894 (2007).
  8. Kuehbacher, T. et al. Intestinal TM7 bacterial phylogenies in active inflammatory bowel disease. J. Med. Microbiol. 57, 1569–1576 (2008).
  9. Kunin, V., Copeland, A., Lapidus, A., Mavromatis, K. & Hugenholtz, P. A bioinformatician's guide to metagenomics. Microbiol. Mol. Biol. Rev. 72, 557–578 (2008).
  10. Tyson, G.W. et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428, 37–43 (2004).
  11. García Martín, H. et al. Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities. Nat. Biotechnol. 24, 1263–1269 (2006).
  12. Tringe, S.G. et al. Comparative metagenomics of microbial communities. Science 308, 554–557 (2005).
  13. Rodrigue, S. et al. Whole genome amplification and de novo assembly of single bacterial cells. PLoS ONE 4, e6864 (2009).
  14. Lasken, R.S. Genomic sequencing of uncultured microorganisms from single cells. Nat. Rev. Microbiol. 10, 631–640 (2012).
  15. Luo, C., Tsementzi, D., Kyrpides, N.C. & Konstantinidis, K.T. Individual genome assembly from complex community short-read metagenomic datasets. ISME J. 6, 898–901 (2012).
  16. Mande, S.S., Mohammed, M.H. & Ghosh, T.S. Classification of metagenomic sequences: methods and challenges. Brief. Bioinform. 13, 669–681 (2012).
  17. Wrighton, K.C. et al. Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla. Science 337, 1661–1665 (2012).
  18. Zerbino, D.R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).
  19. Smoot, M.E., Ono, K., Ruscheinski, J., Wang, P.-L. & Ideker, T. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 27, 431–432 (2011).
  20. Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics. Genome Res. 19, 1639–1645 (2009).
  21. Hess, M. et al. Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science 331, 463–467 (2011).
  22. Dupont, C.L. et al. Genomic insights to SAR86, an abundant and uncultivated marine bacterial lineage. ISME J. 6, 1186–1199 (2012).
  23. Vezzi, F., Narzisi, G. & Mishra, B. Reevaluating assembly evaluations with feature response curves: GAGE and assemblathons. PLoS ONE 7, e52210 (2012).
  24. Podar, M. et al. Targeted access to the genomes of low-abundance organisms in complex microbial communities. Appl. Environ. Microbiol. 73, 3205–3214 (2007).
  25. Nielsen, P.H., Saunders, A.M., Hansen, A.A., Larsen, P. & Nielsen, J.L. Microbial communities involved in enhanced biological phosphorus removal from wastewater – a model system in environmental biotechnology. Curr. Opin. Biotechnol. 23, 452–459 (2012).
  26. Luo, C., Xie, S., Sun, W., Li, X. & Cupples, A.M. Identification of a novel toluene-degrading bacterium from the candidate phylum TM7, as determined by DNA stable isotope probing. Appl. Environ. Microbiol. 75, 4644–4647 (2009).
  27. Hugenholtz, P., Tyson, G.W., Webb, R.I., Wagner, A.M. & Blackall, L.L. Investigation of candidate division TM7, a recently recognized major lineage of the domain Bacteria with no known pure-culture representatives. Appl. Environ. Microbiol. 67, 411–419 (2001).
  28. Sutcliffe, I.C. A phylum level perspective on bacterial cell envelope architecture. Trends Microbiol. 18, 464–470 (2010).
  29. McCutcheon, J.P. & Moran, N.A. Extreme genome reduction in symbiotic bacteria. Nat. Rev. Microbiol. 10, 13–26 (2012).
  30. Baker, B.J. et al. Enigmatic, ultrasmall, uncultivated Archaea. Proc. Natl. Acad. Sci. USA 107, 8806–8811 (2010).
  31. Thomsen, T.R., Kjellerup, B.V., Nielsen, J.L., Hugenholtz, P. & Nielsen, P.H. In situ studies of the phylogeny and physiology of filamentous bacteria with attached growth. Environ. Microbiol. 4, 383–391 (2002).
  32. Mandlik, A., Swierczynski, A., Das, A. & Ton-That, H. Pili in Gram-positive bacteria: assembly, involvement in colonization and biofilm development. Trends Microbiol. 16, 33–40 (2008).
  33. Sutcliffe, I.C. Cell envelope architecture in the Chloroflexi: a shifting frontline in a phylogenetic turf war. Environ. Microbiol. 13, 279–282 (2011).
  34. Schneewind, O. & Missiakas, D.M. Protein secretion and surface display in Gram-positive bacteria. Phil. Trans. R. Soc. Lond. B 367, 1123–1139 (2012).
  35. Weidenmaier, C. & Peschel, A. Teichoic acids and related cell-wall glycopolymers in Gram-positive physiology and host interactions. Nat. Rev. Microbiol. 6, 276–287 (2008).
  36. Hoiczyk, E. & Hansel, A. Cyanobacterial cell walls: news from an unusual prokaryotic envelope. J. Bacteriol. 182, 1191–1199 (2000).
  37. Battistuzzi, F.U. & Hedges, S.B. A major clade of prokaryotes with ancient adaptations to life on land. Mol. Biol. Evol. 26, 335–343 (2009).
  38. Gupta, R.S. Origin of diderm (Gram-negative) bacteria: antibiotic selection pressure rather than endosymbiosis likely led to the evolution of bacterial cells with two membranes. Antonie van Leeuwenhoek 100, 171–182 (2011).
  39. Ludwig, W. et al. ARB: a software environment for sequence data. Nucleic Acids Res. 32, 1363–1371 (2004).
  40. Pruesse, E. et al. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 35, 7188–7196 (2007).
  41. Nielsen, P.H. et al. A conceptual ecosystem model of microbial communities in enhanced biological phosphorus removal plants. Water Res. 44, 5070–5088 (2010).
  42. Caporaso, J. et al. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc. Natl. Acad. Sci. USA 108, 4516–4522 (2011).
  43. Berry, D., Ben Mahfoudh, K., Wagner, M. & Loy, A. Barcoded primers used in multiplex amplicon pyrosequencing bias amplification. Appl. Environ. Microbiol. 77, 7846–7849 (2011).
  44. Masella, A.P., Bartram, A.K., Truszkowski, J.M., Brown, D.G. & Neufeld, J.D. PANDAseq: paired-end assembler for Illumina sequences. BMC Bioinformatics 13, 31 (2012).
  45. Caporaso, J.G. et al. QIIME allows analysis of high-throughput community sequencing data. Nat. Methods 7, 335–336 (2010).
  46. Oksanen, J. et al. Vegan: Community Ecology Package. R package version 2.0–5 (2011).
  47. Hyatt, D., LoCascio, P.F., Hauser, L.J. & Uberbacher, E.C. Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics 28, 2223–2230 (2012).
  48. Huson, D.H., Mitra, S., Ruscheweyh, H.-J., Weber, N. & Schuster, S.C. Integrative analysis of environmental sequences using MEGAN4. Genome Res. 21, 1552–1560 (2011).
  49. McDonald, D. et al. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J. 6, 610–618 (2012).
  50. Edgar, R.C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).
  51. Pruesse, E., Peplies, J. & Glöckner, F.O. SINA: accurate high-throughput multiple sequence alignment of ribosomal RNA genes. Bioinformatics 28, 1823–1829 (2012).
  52. Markowitz, V.M. et al. IMG ER: a system for microbial genome annotation expert review and curation. Bioinformatics 25, 2271–2278 (2009).
  53. Aziz, R.K. et al. The RAST Server: rapid annotations using subsystems technology. BMC Genomics 9, 75 (2008).
  54. Castresana, J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17, 540–552 (2000).
  55. Price, M.N., Dehal, P.S. & Arkin, A.P. FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
  56. Felsenstein, J. PHYLIP - Phylogeny inference package (version 3.2). Cladistics 5, 164–166 (1989).
  57. Punta, M. et al. The Pfam protein families database. Nucleic Acids Res. 40, D290–D301 (2012).
  58. Tatusov, R.L. et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4, 41 (2003).
  59. Szklarczyk, D. et al. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 39, D561–D568 (2011).
  60. Letunic, I. & Bork, P. Interactive Tree Of Life v2: online annotation and display of phylogenetic trees made easy. Nucleic Acids Res. 39, W475–W478 (2011).

Author information

Affiliations

  1. Department of Biotechnology, Chemistry and Environmental Engineering, Aalborg University, Aalborg, Denmark.

    • Mads Albertsen,
    • Kåre L Nielsen &
    • Per H Nielsen
  2. Australian Centre for Ecogenomics, School of Chemistry & Molecular Biosciences, The University of Queensland, St. Lucia, Queensland, Australia.

    • Philip Hugenholtz,
    • Adam Skarshewski &
    • Gene W Tyson
  3. Institute for Molecular Bioscience, The University of Queensland, St. Lucia, Queensland, Australia.

    • Philip Hugenholtz
  4. Advanced Water Management Centre, The University of Queensland, St. Lucia, Queensland, Australia.

    • Gene W Tyson

Contributions

M.A., experimental design, data analysis and manuscript; P.H., data analysis and manuscript; A.S., data analysis; K.L.N., sequencing; G.W.T., data analysis and manuscript; P.H.N., experimental design and manuscript.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Text and Figures (4.48 MB)

    Supplementary Notes, Supplementary Figures 1–13 and Supplementary Tables 1–10

Zip files

  1. Data Set 1 (7.63 MB)

    All scripts used in the manuscript, including a detailed step-by-step guide and example datasets.