Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes

Journal name: Nature Biotechnology
Volume: 32
Pages: 822–828
DOI: 10.1038/nbt.2939

Abstract

Most current approaches for analyzing metagenomic data rely on comparisons to reference genomes, but the microbial diversity of many environments extends far beyond what is covered by reference databases. De novo segregation of complex metagenomic data into specific biological entities, such as particular bacterial strains or viruses, remains a largely unsolved problem. Here we present a method, based on binning co-abundant genes across a series of metagenomic samples, that enables comprehensive discovery of new microbial organisms, viruses and co-inherited genetic entities and aids assembly of microbial genomes without the need for reference sequences. We demonstrate the method on data from 396 human gut microbiome samples and identify 7,381 co-abundance gene groups (CAGs), including 741 metagenomic species (MGS). We use these to assemble 238 high-quality microbial genomes and identify affiliations between MGS and hundreds of viruses or genetic entities. Our method provides the means for comprehensive profiling of the diversity within complex metagenomic samples.

Figures

  1. Overview of co-abundance clustering and the MGS-augmented assembly.

    DNA from a series of independent biological samples from microbial communities, here originating from the human gut microbiome, is extracted and shotgun sequenced. Genes assembled and identified in individual samples are then integrated to form a cross-sample, nonredundant gene catalog. The abundance profile of each gene in the catalog is assessed by counting the matching sequence reads in each sample. To facilitate co-abundance clustering of large gene catalogs, we used random seed genes as 'baits' for identifying groups of genes that correlate (Pearson correlation coefficient (PCC) > 0.9, gray dashed circle) to the abundance profile of the bait genes. The fixed PCC distance threshold is called a canopy (dashed circles). To center the canopy on a co-abundance gene group (CAG), the median gene abundance profile of the genes within the original seed canopy (or subsequent canopies, symbolized as +) is used iteratively to recapture a new canopy until it settles on a particular profile (off-set circles). The gene content of a settled canopy (black dashed circles) is named a metagenomic species (MGS) if it contains 700 or more genes. Smaller groups are still referred to as CAGs. Sequence reads from individual samples that map to the MGS genes and their contigs are then extracted and used to assemble a draft genome sequence for an MGS; we refer to this process as MGS-augmented genome assembly. The use of sample-specific sequence reads in the assemblies helps discriminate between closely related strains.

  2. Size distributions of co-abundance gene groups (CAGs).

    (a) Histogram showing the CAG size distribution in terms of gene content. The scale is logarithmic as indicated by the bar widths. (b) Bee swarm plot showing CAGs that are significantly enriched (Fisher's exact test, P < 0.001) for the indicated gene annotation, as well as phage-like CAGs and dependency-associated CAGs, plotted against the number of genes contained in the CAGs. Here every point represents an enriched CAG or MGS and the width of the swarms shows the distribution. The dashed line marks the 700-gene threshold separating small CAGs from MGS.

  3. Benchmarking sensitivity and specificity of the co-abundance clustering across a range of sequencing depths or sample numbers.

    B. animalis subsp. lactis CNCM I-2494 was used as a benchmark species because 19 samples originated from individuals who had consumed a defined fermented milk product (DFMP) containing this strain. For each clustering, the size (number of genes captured) is shown as bars (left axis), and the specificity (percentage of genes matching the B. animalis reference genome with >95% sequence identity over 100 bp or better captured in the MGS that is most similar to B. animalis) is shown as a line (right axis). (a) Co-abundance clustering using reduced data sets to simulate the sequencing depths indicated (x axis). At a sequence depth of 700 K reads, 97% of the B. animalis genes were captured, and at a depth of 200 K reads 98.6% of the captured genes were from B. animalis. (b) Co-abundance clustering of random sample subsets containing the indicated number of samples (x axis) from individuals who consumed the DFMP. Here the total sample size was kept constant at 375 samples. (c) Co-abundance clusterings on a series of random sample subsets of the indicated size (x axis). These sample subsets included 19 samples from individuals who had consumed the DFMP, except when they contained <19 samples (i.e., 19–8 DFMP individuals per subset). In b and c, samples were downsized to 11 million sequence reads per sample. Error bars, ±1 s.d. from the mean (n = 5).

  4. Comparison of the MGS:337 augmented assembly and the B. animalis reference genome.

    BLAST dot-plot comparing the MGS:337 augmented assembly (y axis) to the B. animalis subsp. lactis CNCM I-2494 reference genome (x axis). The dot-plot shows the relative chromosomal positions of matching sequence on the MGS-augmented assembly and the B. animalis reference genome. The MGS-augmented assembly covers 95% of the reference genome with 99.9% identity. The plot shows an inversion in the assembly relative to the reference genome around position 1,300 K.

  5. Dependency associations among MGS and CAGs.

    (a) A typical example of a significant dependency association. The abundances of MGS:135 (S. wadsworthensis) and the small CAG:2350 across 318 fecal samples are shown as blue and red curves, respectively (upper panel, logarithmic scale). Below, the sample-wise presence of the two CAGs is shown as bars. CAG:2350 is significantly co-occurring with MGS:135 and never detected independently (Fisher's exact test, P = 9 × 10^−74). The samples were sorted according to the abundance of MGS:135. (b) The dependency-association subnetwork of CAGs associated with S. wadsworthensis (MGS:135). Arrows show dependency associations and solid arrows indicate that co-assembly of the MGS and the CAG in one or more samples supported the association. Blue coloring indicates CAGs dominated by genes with species-level similarity to S. wadsworthensis. CAG:2543 and CAG:3731 are enriched for phage genes, and CAG:4011 contains a series of CRISPR-associated genes and a CRISPR cluster. The CRISPR-containing CAG:4011 and one of the phage-like CAGs (CAG:3731) anti-correlate (Matthews correlation coefficient = −0.7) and spacers of the CRISPR show sequence complementarity to the phage. (c) E. coli (MGS:4) and its nine dependency-associated CAGs were co-assembled to high-quality draft genomes in each of 11 samples. The outer black circle represents the consensus assembly of the E. coli–centered agglomerate and each of the gray circles represents alignment of the assembly from a particular sample. The positions and sequence coverage of CAG:427 are marked in red, across the assemblies.

  6. Gut persistence probability for B. adolescentis.

    The gut persistence of B. adolescentis (MGS:119) populations stratified by the presence (red curves) or absence (black curves) of the dependency-associated CAG:2298, observed across 54 human individuals who had the bacterium in the first of two fecal samples. B. adolescentis had substantially higher persistence probability with CAG:2298 present. (a) Interval-censored Kaplan-Meier curves showing the cumulative loss of populations of B. adolescentis over time across the cohort of human individuals. Points (+) indicate time (in days) of the second of two samplings from a human individual. The curve shows the “losses” when they are registered at the second time point and not when the loss actually happened (i.e., the data are interval-censored). (b) Model-based estimates of annual gut persistence probability for B. adolescentis with or without the dependency-associated CAG. Note that annual persistence probability with the CAG (mean estimate = 88%) is much larger than without (mean estimate = 18%). In the Bayesian logistic regression framework used here, estimates are expressed as probability distributions over the possible values for parameters of interest. We therefore obtain both an estimate of a parameter and a quantification of how certain we are of the estimate. This figure shows the posterior probability distribution over possible values for the annual persistence probabilities, with the shaded areas indicating the 95% highest-density intervals (i.e., the parameter values with the most support).
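
The canopy procedure described in the Figure 1 legend can be illustrated with a minimal Python sketch. This is not the Supplementary Software: the function names, the `max_iter` cap, the fixed random seed and the choice to remove settled genes from the pool before drawing the next seed are assumptions made for illustration. Only the PCC > 0.9 canopy threshold, the iterative re-centering on the median abundance profile and the 700-gene MGS cutoff are taken from the legend.

```python
import random
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length abundance profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile correlates with nothing
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(sxx * syy)


def median_profile(profiles):
    """Per-sample median across a list of gene abundance profiles."""
    return [median(column) for column in zip(*profiles)]


def settle_canopy(seed, profiles, min_pcc=0.9, max_iter=20):
    """Re-center a canopy on the median profile of its members until stable.

    `profiles` maps gene id -> list of abundances (one value per sample).
    `max_iter` is a hypothetical safety cap, not a value from the paper.
    """
    center = profiles[seed]
    members = frozenset()
    for _ in range(max_iter):
        captured = frozenset(
            g for g, p in profiles.items() if pearson(center, p) > min_pcc
        )
        if captured == members or not captured:
            break  # canopy has settled (or nothing correlates with the center)
        members = captured
        center = median_profile([profiles[g] for g in members])
    return members


def canopy_cluster(profiles, min_pcc=0.9, rng=None):
    """Draw random seed genes and settle one canopy per seed.

    Assumption: genes captured by a settled canopy are removed from the pool.
    """
    rng = rng or random.Random(0)
    unassigned = set(profiles)
    clusters = []
    while unassigned:
        seed = rng.choice(sorted(unassigned))
        pool = {g: profiles[g] for g in unassigned}
        members = settle_canopy(seed, pool, min_pcc) or {seed}
        clusters.append(sorted(members))
        unassigned -= set(members)
    return clusters


def label(cluster, mgs_min_genes=700):
    """A settled canopy with 700 or more genes is an MGS, otherwise a CAG."""
    return "MGS" if len(cluster) >= mgs_min_genes else "CAG"
```

Calling `canopy_cluster` on a dictionary of per-gene abundance vectors returns a list of gene groups, each of which can be passed to `label`. A pure-Python version like this is only practical for toy inputs; a catalog with millions of genes would need a vectorized or compiled implementation.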

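The two statistics quoted in the Figure 5 legend, Fisher's exact test for co-occurrence and the Matthews correlation coefficient for anti-correlation, can both be computed from a 2 × 2 table of sample-wise presence and absence. The sketch below is a minimal illustration under stated assumptions: the one-sided form of the test, the `alpha` threshold and all function names are hypothetical and are not taken from the paper.

```python
from math import comb, sqrt


def contingency(host, cag):
    """Count samples by presence (truthy) or absence of a host MGS and a CAG."""
    both = sum(1 for h, c in zip(host, cag) if h and c)
    host_only = sum(1 for h, c in zip(host, cag) if h and not c)
    cag_only = sum(1 for h, c in zip(host, cag) if c and not h)
    neither = len(host) - both - host_only - cag_only
    return both, host_only, cag_only, neither


def fisher_one_sided(both, host_only, cag_only, neither):
    """P(overlap >= observed) under the hypergeometric null (one-sided test)."""
    n = both + host_only + cag_only + neither
    n_host = both + host_only
    n_cag = both + cag_only
    upper = min(n_host, n_cag)
    favorable = sum(
        comb(n_host, k) * comb(n - n_host, n_cag - k)
        for k in range(both, upper + 1)
    )
    return favorable / comb(n, n_cag)


def matthews(both, host_only, cag_only, neither):
    """Matthews correlation coefficient of two presence/absence vectors."""
    numerator = both * neither - host_only * cag_only
    denominator = sqrt(
        (both + host_only) * (both + cag_only)
        * (neither + host_only) * (neither + cag_only)
    )
    return numerator / denominator if denominator else 0.0


def is_dependency_associated(host, cag, alpha=0.001):
    """Hypothetical rule: the CAG is never seen without the host and the
    co-occurrence is significant at `alpha` (an illustrative threshold)."""
    table = contingency(host, cag)
    return table[2] == 0 and fisher_one_sided(*table) < alpha
```

With only a handful of samples the test cannot reach small P values, which is consistent with the legend's example relying on 318 fecal samples.
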
Accession codes

Primary accessions

BioProject

European Nucleotide Archive

Author information

  1. These authors contributed equally to this work.

    • H Bjørn Nielsen &
    • Mathieu Almeida

Affiliations

  1. Center for Biological Sequence Analysis, Technical University of Denmark, Kongens Lyngby, Denmark.

    • H Bjørn Nielsen,
    • Agnieszka Sierakowska Juncker,
    • Simon Rasmussen,
    • Damian R Plichta,
    • Laurent Gautier,
    • Anders G Pedersen,
    • Ida Bonde,
    • Marcelo B Quintanilha dos Santos,
    • Piotr Dworzynski,
    • Ole Lund,
    • David W Ussery,
    • Thomas Sicheritz-Ponten &
    • Søren Brunak
  2. Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark.

    • H Bjørn Nielsen,
    • Agnieszka Sierakowska Juncker,
    • Ida Bonde,
    • Nikolaj Blom,
    • Thomas Sicheritz-Ponten &
    • Søren Brunak
  3. INRA, Institut National de la Recherche Agronomique, UMR 14121 MICALIS, Jouy en Josas, France.

    • Mathieu Almeida,
    • Emmanuelle Le Chatelier,
    • Jean-Michel Batto,
    • Fouad Boumezbeur,
    • Joël Doré,
    • Sean Kennedy,
    • Pierre Léonard,
    • Florence Levenez,
    • Bouziane Moumen,
    • Nicolas Pons,
    • Edi Prifti,
    • Pierre Renault,
    • S Dusko Ehrlich,
    • Alexandre Jamet,
    • Antonella Cultrone,
    • Christine Delorme,
    • Emmanuelle Maguin,
    • Eric Guedon,
    • Gaetana Vandemeulebrouck,
    • Ghalia Khaci,
    • Maarten van de Guchte,
    • Nicolas Sanchez,
    • Rozenn Dervyn,
    • Séverine Layec &
    • Yohanan Winogradski
  4. INRA, Institut National de la Recherche Agronomique, US 1367 Metagenopolis, Jouy en Josas, France.

    • Mathieu Almeida,
    • Emmanuelle Le Chatelier,
    • Jean-Michel Batto,
    • Fouad Boumezbeur,
    • Joël Doré,
    • Sean Kennedy,
    • Pierre Léonard,
    • Florence Levenez,
    • Bouziane Moumen,
    • Nicolas Pons,
    • Edi Prifti,
    • S Dusko Ehrlich,
    • Benoit Quinquis,
    • Florence Haimet,
    • Hervé Blottière &
    • Nathalie Galleron
  5. Department of Computer Science, Center for Bioinformatics and Computational Biology, University of Maryland, USA.

    • Mathieu Almeida
  6. BGI Hong Kong Research Institute, Hong Kong, China.

    • Junhua Li &
    • Junjie Qin
  7. BGI-Shenzhen, Shenzhen, China.

    • Junhua Li,
    • Manimozhiyan Arumugam,
    • Karsten Kristiansen,
    • Junjie Qin &
    • Jun Wang
  8. School of Bioscience and Biotechnology, South China University of Technology, Guangzhou, China.

    • Junhua Li
  9. European Molecular Biology Laboratory, Heidelberg, Germany.

    • Shinichi Sunagawa,
    • Manimozhiyan Arumugam,
    • Jens Roat Kultima,
    • Julien Tap,
    • Takuji Yamada &
    • Peer Bork
  10. Commissariat à l'Énergie Atomique et aux Énergies Alternatives, Institut de Génomique, Évry, France.

    • Eric Pelletier,
    • Denis Le Paslier,
    • François Artiguenave,
    • Jean Weissenbach &
    • Thomas Bruls
  11. Centre National de la Recherche Scientifique, Évry, France.

    • Eric Pelletier &
    • Denis Le Paslier
  12. Université d'Évry Val d'Essonne, Évry, France.

    • Eric Pelletier &
    • Denis Le Paslier
  13. The Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Copenhagen, Denmark.

    • Trine Nielsen,
    • Manimozhiyan Arumugam,
    • Kristoffer S Burgdorf,
    • Torben Hansen,
    • Oluf Pedersen &
    • Jun Wang
  14. Digestive System Research Unit, University Hospital Vall d'Hebron, Ciberehd, Barcelona, Spain.

    • Chaysavanh Manichanh,
    • Natalia Borruel,
    • Francesc Casellas,
    • Francisco Guarner,
    • Antonio Torrejon,
    • Encarna Varela &
    • Maria Antolin
  15. Faculty of Health Sciences, University of Southern Denmark, Odense, Denmark.

    • Torben Hansen
  16. Department of Structural Biology, VIB, Brussels, Belgium.

    • Falk Hildebrand &
    • Gwen Falony
  17. Department of Bioscience Engineering, Vrije Universiteit, Brussels, Belgium.

    • Falk Hildebrand &
    • Jeroen Raes
  18. National Food Institute, Division for Epidemiology and Microbial Genomics, Technical University of Denmark, Kongens Lyngby, Denmark.

    • Rolf S Kaas
  19. Department of Biology, University of Copenhagen, Copenhagen, Denmark.

    • Karsten Kristiansen &
    • Jun Wang
  20. Hagedorn Research Institute, Gentofte, Denmark.

    • Oluf Pedersen
  21. Institute of Biomedical Science, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark.

    • Oluf Pedersen &
    • Niels Grarup
  22. Faculty of Health, Aarhus University, Aarhus, Denmark.

    • Oluf Pedersen
  23. Department of Microbiology and Immunology, Rega Institute, KU Leuven, Belgium.

    • Jeroen Raes
  24. VIB Center for the Biology of Disease, Leuven, Belgium.

    • Jeroen Raes
  25. Section of Microbiology, Department of Biology, University of Copenhagen, Copenhagen, Denmark.

    • Søren Sørensen
  26. Laboratory of Microbiology, Wageningen University, Wageningen, The Netherlands.

    • Sebastian Tims,
    • Willem M de Vos,
    • Torben Jørgensen,
    • Michiel Kleerebezem &
    • Erwin G Zoetendal
  27. Department of Biological Information, Tokyo Institute of Technology, Yokohama, Japan.

    • Takuji Yamada
  28. A full list of members and affiliations appears at the end of the paper.

    • MetaHIT Consortium
  29. Max Delbrück Centre for Molecular Medicine, Berlin, Germany.

    • Peer Bork
  30. Princess Al Jawhara Center of Excellence in the Research of Hereditary Disorders, King Abdulaziz University, Jeddah, Saudi Arabia.

    • Jun Wang
  31. King's College London, Centre for Host-Microbiome Interactions, Dental Institute Central Office, Guy's Hospital, United Kingdom.

    • S Dusko Ehrlich
  32. Institut Mérieux, Lyon, France.

    • Alexandre Mérieux,
    • Christian Brechot &
    • Christine M'Rini
  33. Danone Research, Palaiseau, France.

    • Gérard Denariaz,
    • Johan E T van Hylckama Vlieg,
    • Muriel Derrien &
    • Patrick Veiga
  34. Gut Biology & Microbiology, Danone Research, Center for Specialized Nutrition, Wageningen, the Netherlands.

    • Jan Knol &
    • Raish Oozeer
  35. The Wellcome Trust Sanger Institute, Hinxton, Cambridge, U.K.

    • Julian Parkhill &
    • Keith Turner
  36. Istituto Europeo di Oncologia, Milan, Italy.

    • Maria Rescigno

Consortia

  1. MetaHIT Consortium

    • H Bjørn Nielsen,
    • Mathieu Almeida,
    • Agnieszka S Juncker,
    • Simon Rasmussen,
    • Junhua Li,
    • Shinichi Sunagawa,
    • Damian R Plichta,
    • Laurent Gautier,
    • Anders G Pedersen,
    • Emmanuelle Le Chatelier,
    • Eric Pelletier,
    • Ida Bonde,
    • Trine Nielsen,
    • Chaysavanh Manichanh,
    • Manimozhiyan Arumugam,
    • Jean-Michel Batto,
    • Marcelo B Quintanilha dos Santos,
    • Nikolaj Blom,
    • Natalia Borruel,
    • Kristoffer S Burgdorf,
    • Fouad Boumezbeur,
    • Francesc Casellas,
    • Joël Doré,
    • Piotr Dworzynski,
    • Francisco Guarner,
    • Torben Hansen,
    • Falk Hildebrand,
    • Rolf S Kaas,
    • Sean Kennedy,
    • Karsten Kristiansen,
    • Jens Roat Kultima,
    • Pierre Léonard,
    • Florence Levenez,
    • Ole Lund,
    • Bouziane Moumen,
    • Denis Le Paslier,
    • Nicolas Pons,
    • Oluf Pedersen,
    • Edi Prifti,
    • Junjie Qin,
    • Jeroen Raes,
    • Søren Sørensen,
    • Julien Tap,
    • Sebastian Tims,
    • David W Ussery,
    • Takuji Yamada,
    • Pierre Renault,
    • Thomas Sicheritz-Ponten,
    • Peer Bork,
    • Jun Wang,
    • Søren Brunak,
    • S Dusko Ehrlich,
    • Alexandre Jamet,
    • Alexandre Mérieux,
    • Antonella Cultrone,
    • Antonio Torrejon,
    • Benoit Quinquis,
    • Christian Brechot,
    • Christine Delorme,
    • Christine M'Rini,
    • Willem M de Vos,
    • Emmanuelle Maguin,
    • Encarna Varela,
    • Eric Guedon,
    • Gwen Falony,
    • Florence Haimet,
    • François Artiguenave,
    • Gaetana Vandemeulebrouck,
    • Gérard Denariaz,
    • Ghalia Khaci,
    • Hervé Blottière,
    • Jan Knol,
    • Jean Weissenbach,
    • Johan E T van Hylckama Vlieg,
    • Torben Jørgensen,
    • Julian Parkhill,
    • Keith Turner,
    • Maarten van de Guchte,
    • Maria Antolin,
    • Maria Rescigno,
    • Michiel Kleerebezem,
    • Muriel Derrien,
    • Nathalie Galleron,
    • Nicolas Sanchez,
    • Niels Grarup,
    • Patrick Veiga,
    • Raish Oozeer,
    • Rozenn Dervyn,
    • Séverine Layec,
    • Thomas Bruls,
    • Yohanan Winogradski &
    • Erwin G Zoetendal

Contributions

All authors are members of the Metagenomics of the Human Intestinal Tract (MetaHIT) Consortium. S.D.E. and S.B. managed the project. F.C., N.B., F.G., T.H., K.S.B. and T.N. performed clinical sampling. F.L. and C.M. performed DNA extraction. J.L., E.P. and D.L.P. performed sequencing. S.D.E., H.B.N., M.A., A.S.J., S.R., P.R. and P.B. designed the analyses. H.B.N., A.S.J., S.R., M.A., A.G.P., D.R.P., L.G., I.B., M.B., M.B.Q.d.S., M.A., J.L., J.T., S.S., T.Y., E.P., D.L.P. and R.S.K. performed the data analyses. H.B.N., S.B., A.S.J., S.R., A.G.P. and M.A. wrote the manuscript. H.B.N., S.B., S.D.E., D.R.P., I.B., P.B., E.P., O.P. and D.W.U. revised the manuscript. The MetaHIT Consortium members contributed to the design and execution of the study.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Text and Figures (4,314 KB)

    Supplementary Figures 1–17 and Supplementary Notes 1–9

Excel files

  1. Supplementary Data 1 (265 KB)

    Sample description

  2. Supplementary Data 2 (270 KB)

    MGS taxonomical statistics

  3. Supplementary Data 3 (272 KB)

    MGS augmented assembly statistics

  4. Supplementary Data 4 (31 KB)

    MGS augmented assemblies comparison to reference genomes

  5. Supplementary Data 5 (1,183 KB)

    Summary information on the 6,640 small CAGs

  6. Supplementary Data 6 (257 KB)

    Dependency-association network

  7. Supplementary Data 7 (37 KB)

    MGS:4 + dependency-associated CAG assembly statistics

  8. Supplementary Data 8 (37 KB)

    eggNOG prevalent in frequently observed MGS

  9. Supplementary Data 9 (10 KB)

    Gene catalogue comparison

  10. Supplementary Data 10 (62 KB)

    Bacillus subtilis essential COG list

  11. Supplementary Data 11 (37 KB)

    Dependency-associations with or without companion species

Zip files

  1. Supplementary Software (22 KB)

    Source code for canopy-clustering algorithm
