Most current approaches for analyzing metagenomic data rely on comparisons to reference genomes, but the microbial diversity of many environments extends far beyond what is covered by reference databases. De novo segregation of complex metagenomic data into specific biological entities, such as particular bacterial strains or viruses, remains a largely unsolved problem. Here we present a method, based on binning co-abundant genes across a series of metagenomic samples, that enables comprehensive discovery of new microbial organisms, viruses and co-inherited genetic entities and aids assembly of microbial genomes without the need for reference sequences. We demonstrate the method on data from 396 human gut microbiome samples and identify 7,381 co-abundance gene groups (CAGs), including 741 metagenomic species (MGS). We use these to assemble 238 high-quality microbial genomes and identify affiliations between MGS and hundreds of viruses or genetic entities. Our method provides the means for comprehensive profiling of the diversity within complex metagenomic samples.
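As a rough illustration of what binning co-abundant genes across samples can look like, the sketch below groups genes whose abundance profiles are highly correlated. It is a simplified greedy, seed-based grouping and not the authors' canopy-clustering implementation; the function names, the Pearson threshold of 0.9 and the toy abundance values are all assumptions made for this example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no co-abundance signal
    return sxy / math.sqrt(sxx * syy)


def bin_coabundant_genes(profiles, min_corr=0.9, seed=0):
    """Greedily group genes by correlation to a randomly chosen seed gene.

    profiles: dict mapping gene id -> list of abundances, one per sample.
    min_corr: hypothetical threshold, not a value taken from the paper.
    """
    rng = random.Random(seed)
    unassigned = sorted(profiles)
    rng.shuffle(unassigned)
    groups = []
    while unassigned:
        seed_gene = unassigned.pop()
        members = [seed_gene]
        remaining = []
        for gene in unassigned:
            if pearson(profiles[seed_gene], profiles[gene]) >= min_corr:
                members.append(gene)
            else:
                remaining.append(gene)
        unassigned = remaining
        groups.append(sorted(members))
    return sorted(groups)


# Toy data: four genes measured in five samples (made-up values).
profiles = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 10],   # tracks geneA
    "geneC": [5, 1, 4, 2, 3],
    "geneD": [10, 2, 8, 4, 6],   # tracks geneC
}
groups = bin_coabundant_genes(profiles)
print(groups)
```

In this toy case, genes whose abundances rise and fall together across samples end up in the same group, which is the intuition behind a co-abundance gene group. A real analysis would operate on a gene catalogue with millions of entries and would need a far more scalable procedure than this pairwise loop.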
- Supplementary Text and Figures (4,314 KB): Supplementary Figures 1–17 and Supplementary Notes 1–9
- Supplementary Data 1 (265 KB)
- Supplementary Data 2 (270 KB): MGS taxonomical statistics
- Supplementary Data 3 (272 KB): MGS augmented assembly statistics
- Supplementary Data 4 (31 KB): MGS augmented assemblies comparison to reference genomes
- Supplementary Data 5 (1,183 KB): Summary information on the 6,640 small CAGs
- Supplementary Data 6 (257 KB)
- Supplementary Data 7 (37 KB): MGS:4 + dependency-associated CAG assembly statistics
- Supplementary Data 8 (37 KB): eggNOG prevalent in frequently observed MGS
- Supplementary Data 9 (10 KB): Gene catalogue comparison
- Supplementary Data 10 (62 KB): Bacillus subtilis essential COG list
- Supplementary Data 11 (37 KB): Dependency-associations with or without companion species
- Supplementary Software (22 KB): Source code for canopy-clustering algorithm