Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes


Most current approaches for analyzing metagenomic data rely on comparisons to reference genomes, but the microbial diversity of many environments extends far beyond what is covered by reference databases. De novo segregation of complex metagenomic data into specific biological entities, such as particular bacterial strains or viruses, remains a largely unsolved problem. Here we present a method, based on binning co-abundant genes across a series of metagenomic samples, that enables comprehensive discovery of new microbial organisms, viruses and co-inherited genetic entities and aids assembly of microbial genomes without the need for reference sequences. We demonstrate the method on data from 396 human gut microbiome samples and identify 7,381 co-abundance gene groups (CAGs), including 741 metagenomic species (MGS). We use these to assemble 238 high-quality microbial genomes and identify affiliations between MGS and hundreds of viruses or genetic entities. Our method provides the means for comprehensive profiling of the diversity within complex metagenomic samples.
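The core idea above, grouping genes whose abundances rise and fall together across samples, can be illustrated with a minimal sketch. This is not the published canopy-clustering implementation (that is provided as Supplementary Software); it is a simplified greedy stand-in under stated assumptions. The function names, the Pearson threshold of 0.9, and the toy abundance values are all hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-abundance signal
    return cov / math.sqrt(var_x * var_y)


def bin_coabundant_genes(
    profiles: Dict[str, Sequence[float]], min_corr: float = 0.9
) -> List[List[str]]:
    """Greedy single-pass binning (illustrative, not the published algorithm).

    Each still-unassigned gene seeds a group and collects every other
    unassigned gene whose profile correlates with the seed at or above
    min_corr (an assumed threshold).
    """
    unassigned = list(profiles)
    groups: List[List[str]] = []
    while unassigned:
        seed = unassigned.pop(0)
        group = [seed]
        remaining = []
        for gene in unassigned:
            if pearson(profiles[seed], profiles[gene]) >= min_corr:
                group.append(gene)
            else:
                remaining.append(gene)
        unassigned = remaining
        groups.append(group)
    return groups


# Toy abundance matrix: 5 genes across 4 samples (made-up values).
toy_profiles = {
    "gene_a": [10.0, 0.0, 5.0, 20.0],
    "gene_b": [11.0, 1.0, 6.0, 19.0],
    "gene_c": [0.0, 8.0, 9.0, 1.0],
    "gene_d": [1.0, 9.0, 8.0, 0.0],
    "gene_e": [21.0, 0.0, 9.0, 41.0],
}
groups = bin_coabundant_genes(toy_profiles)
print(groups)
```

In this toy input, genes a, b and e track one another across the four samples and land in one group, while c and d form a second group. A real analysis would need many more samples for the correlations to be informative, which is consistent with the benchmarking across sample numbers summarized in Figure 3.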

Figure 1: Overview of co-abundance clustering and the MGS-augmented assembly.
Figure 2: Size distributions of co-abundance gene groups (CAGs).
Figure 3: Benchmarking sensitivity and specificity of the co-abundance clustering across a range of sequencing depths or sample numbers.
Figure 4: Comparison of the MGS:337 augmented assembly and the B. animalis reference genome.
Figure 5: Dependency associations among MGS and CAGs.
Figure 6: Gut persistence probability for B. adolescentis.

Accession codes

Primary accessions


European Nucleotide Archive

Acknowledgements


The research leading to these results has received funding from the European Community's Seventh Framework Programme FP7-HEALTH-F4-2007-201052: Metagenomics of the Human Intestinal Tract (MetaHIT) and FP7-HEALTH-2010-261376: International Human Microbiome Standards, as well as the Novo Nordisk Foundation Center for Biosustainability. Work on the clustering concept has been supported by the OpenGPU FUI collaborative research projects, with funding from DGCIS. Researchers on the project were granted access to the HPC resources of CCRT under the allocation 2011-036707 made by GENCI (Grand Equipement National de Calcul Intensif). The company Alliance Services Plus (AS+) helped to scale up the process, in particular V. Arslan, D. Tello, V. Ducrot, T. Saidani and S. Monot. The authors affiliated with MGP are funded, in part, by the Metagenopolis ANR-11-DPBS-0001 grant. Ciberehd is funded by the Instituto de Salud Carlos III (Spain). M.A. was supported by a grant from the Ministère de la Recherche et de l'Education Nationale (France).

Author information

All authors are members of the Metagenomics of the Human Intestinal Tract (MetaHIT) Consortium. S.D.E. and S.B. managed the project. F.C., N.B., F.G., T.H., K.S.B. and T.N. performed clinical sampling. F.L. and C.M. performed DNA extraction. J.L., E.P. and D.L.P. performed sequencing. S.D.E., H.B.N., M.A., A.S.J., S.R., P.R. and P.B. designed the analyses. H.B.N., A.S.J., S.R., M.A., A.G.P., D.R.P., L.G., I.B., M.B., M.B.Q.d.S., M.A., J.L., J.T., S.S., T.Y., E.P., D.L.P. and R.S.K. performed the data analyses. H.B.N., S.B., A.S.J., S.R., A.G.P. and M.A. wrote the manuscript. H.B.N., S.B., S.D.E., D.R.P., I.B., P.B., E.P., O.P. and D.W.U. revised the manuscript. The MetaHIT Consortium members contributed to the design and execution of the study.

Corresponding authors

Correspondence to Søren Brunak or S Dusko Ehrlich.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Additional information

A full list of members and affiliations appears at the end of the paper.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–17 and Supplementary Notes 1–9 (PDF 4213 kb)

Supplementary Data 1

Sample description (XLS 259 kb)

Supplementary Data 2

MGS taxonomical statistics (XLS 264 kb)

Supplementary Data 3

MGS augmented assembly statistics (XLS 266 kb)

Supplementary Data 4

MGS augmented assemblies comparison to reference genomes (XLS 31 kb)

Supplementary Data 5

Summary information on the 6640 small CAGs (XLS 1156 kb)

Supplementary Data 6

Dependency-association network (XLS 251 kb)

Supplementary Data 7

MGS:4 + dependency-associated CAG assembly statistics (XLS 37 kb)

Supplementary Data 8

eggNOG prevalent in frequently observed MGS (XLS 36 kb)

Supplementary Data 9

Gene catalogue comparison (XLS 10 kb)

Supplementary Data 10

Bacillus subtilis essential COG list (XLS 61 kb)

Supplementary Data 11

Dependency-associations with or without companion species (XLS 37 kb)

Supplementary Software

Source code for canopy-clustering algorithm (ZIP 40 kb)

Cite this article

Nielsen, H., Almeida, M., Juncker, A. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat Biotechnol 32, 822–828 (2014).

