
Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes


Most current approaches for analyzing metagenomic data rely on comparisons to reference genomes, but the microbial diversity of many environments extends far beyond what is covered by reference databases. De novo segregation of complex metagenomic data into specific biological entities, such as particular bacterial strains or viruses, remains a largely unsolved problem. Here we present a method, based on binning co-abundant genes across a series of metagenomic samples, that enables comprehensive discovery of new microbial organisms, viruses and co-inherited genetic entities and aids assembly of microbial genomes without the need for reference sequences. We demonstrate the method on data from 396 human gut microbiome samples and identify 7,381 co-abundance gene groups (CAGs), including 741 metagenomic species (MGS). We use these to assemble 238 high-quality microbial genomes and identify affiliations between MGS and hundreds of viruses or genetic entities. Our method provides the means for comprehensive profiling of the diversity within complex metagenomic samples.
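The core idea, grouping genes whose abundance rises and falls together across samples, can be illustrated with a small sketch. This is not the authors' canopy-clustering implementation (that code is distributed as Supplementary Software); it is a simplified greedy, seed-based grouping written for illustration. The function names, the toy abundance profiles and the `min_corr=0.9` Pearson threshold are all assumptions made for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    # A constant profile has no variance; treat it as uncorrelated.
    return num / den if den else 0.0


def co_abundance_groups(profiles, min_corr=0.9):
    """Greedily group genes whose profiles correlate with a seed gene.

    profiles: dict mapping gene name -> list of abundances, one per sample.
    min_corr: hypothetical correlation threshold (an assumption, not a
              value taken from the text above).
    """
    unassigned = list(profiles)  # dicts preserve insertion order
    groups = []
    while unassigned:
        seed = unassigned[0]
        # The seed always joins its own group, so the loop terminates
        # even when the seed profile is constant.
        members = [seed] + [
            gene
            for gene in unassigned[1:]
            if pearson(profiles[seed], profiles[gene]) >= min_corr
        ]
        groups.append(members)
        unassigned = [gene for gene in unassigned if gene not in members]
    return groups


# Toy data: four genes measured in four samples (values are made up).
profiles = {
    "geneA": [10, 0, 5, 20],
    "geneB": [20, 0, 10, 40],  # proportional to geneA
    "geneC": [0, 8, 1, 0],
    "geneD": [0, 16, 2, 1],    # tracks geneC closely
}
groups = co_abundance_groups(profiles)
print(groups)
```

On the toy data, geneA and geneB fall into one group and geneC and geneD into another, because each pair varies together across the four samples while the two pairs vary in opposition. A real analysis would operate on millions of genes and hundreds of samples, where the choice of seed, threshold and merging strategy matters far more than it does here.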


Figure 1: Overview of co-abundance clustering and the MGS-augmented assembly.
Figure 2: Size distributions of co-abundance gene groups (CAGs).
Figure 3: Benchmarking sensitivity and specificity of the co-abundance clustering across a range of sequencing depths or sample numbers.
Figure 4: Comparison of the MGS:337 augmented assembly and the B. animalis reference genome.
Figure 5: Dependency associations among MGS and CAGs.
Figure 6: Gut persistence probability for B. adolescentis.

Accession codes

Primary accessions


European Nucleotide Archive




The research leading to these results has received funding from the European Community's Seventh Framework Programme FP7-HEALTH-F4-2007-201052: Metagenomics of the Human Intestinal Tract (MetaHIT) and FP7-HEALTH-2010-261376: International Human Microbiome Standards, as well as the Novo Nordisk Foundation Center for Biosustainability. Work on the clustering concept has been supported by the OpenGPU FUI collaborative research projects, with funding from DGCIS. Researchers on the project were granted access to the HPC resources of CCRT under the allocation 2011-036707 made by GENCI (Grand Equipement National de Calcul Intensif). The company Alliance Services Plus (AS+) helped to scale up the process, especially V. Arslan, D. Tello, V. Ducrot, T. Saidani and S. Monot. The authors affiliated with MGP are funded, in part, by the Metagenopolis ANR-11-DPBS-0001 grant. Ciberehd is funded by the Instituto de Salud Carlos III (Spain). M.A. was supported by a grant from the Ministère de la Recherche et de l'Education Nationale (France).

Author information

All authors are members of the Metagenomics of the Human Intestinal Tract (MetaHIT) Consortium. S.D.E. and S.B. managed the project. F.C., N.B., F.G., T.H., K.S.B. and T.N. performed clinical sampling. F.L. and C.M. performed DNA extraction. J.L., E.P. and D.L.P. performed sequencing. S.D.E., H.B.N., M.A., A.S.J., S.R., P.R. and P.B. designed the analyses. H.B.N., A.S.J., S.R., M.A., A.G.P., D.R.P., L.G., I.B., M.B., M.B.Q.d.S., M.A., J.L., J.T., S.S., T.Y., E.P., D.L.P. and R.S.K. performed the data analyses. H.B.N., S.B., A.S.J., S.R., A.G.P. and M.A. wrote the manuscript. H.B.N., S.B., S.D.E., D.R.P., I.B., P.B., E.P., O.P. and D.W.U. revised the manuscript. The MetaHIT Consortium members contributed to the design and execution of the study.

Corresponding authors

Correspondence to Søren Brunak or S Dusko Ehrlich.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Additional information

A full list of members and affiliations appears at the end of the paper.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–17 and Supplementary Notes 1–9 (PDF 4213 kb)

Supplementary Data 1

Sample description (XLS 259 kb)

Supplementary Data 2

MGS taxonomical statistics (XLS 264 kb)

Supplementary Data 3

MGS augmented assembly statistics (XLS 266 kb)

Supplementary Data 4

MGS augmented assemblies comparison to reference genomes (XLS 31 kb)

Supplementary Data 5

Summary information on the 6640 small CAGs (XLS 1156 kb)

Supplementary Data 6

Dependency-association network (XLS 251 kb)

Supplementary Data 7

MGS:4 + dependency-associated CAG assembly statistics (XLS 37 kb)

Supplementary Data 8

eggNOG prevalent in frequently observed MGS (XLS 36 kb)

Supplementary Data 9

Gene catalogue comparison (XLS 10 kb)

Supplementary Data 10

Bacillus subtilis essential COG list (XLS 61 kb)

Supplementary Data 11

Dependency-associations with or without companion species (XLS 37 kb)

Supplementary Software

Source code for canopy-clustering algorithm (ZIP 40 kb)


Cite this article

Nielsen, H., Almeida, M., Juncker, A. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat Biotechnol 32, 822–828 (2014).
