Metagenomic microbial community profiling using unique clade-specific marker genes

Journal name: Nature Methods

Metagenomic shotgun sequencing data can identify microbes populating a microbial community and their proportions, but existing taxonomic profiling methods are inefficient for increasingly large data sets. We present an approach that uses clade-specific marker genes to unambiguously assign reads to microbial clades more accurately and >50× faster than current approaches. We validated our metagenomic phylogenetic analysis tool, MetaPhlAn, on terabases of short reads and provide the largest metagenomic profiling to date of the human gut. It can be accessed at

At a glance


  1. Figure 1: Comparison of MetaPhlAn to existing methods.

    We used a total of ten synthetic metagenomes to compare MetaPhlAn to PhymmBL (ref. 12), BLAST (ref. 13), the Rapid Identification of Taxonomic Assignments (RITA) pipeline (ref. 14) and the naive Bayes classifier (NBC; ref. 15). (a,b) Absolute and r.m.s. errors with respect to a total of 100 organisms in one synthetic metagenome at the species (a) and class (b) levels. (c) Correlations of inferred and true species abundances for eight non–evenly distributed synthetic metagenomes. (d) Read rates for the tested methods on single CPUs.

  2. Figure 2: Composition of healthy vaginal microbiota.

    MetaPhlAn species and genus abundances and 16S phylotype abundances are shown for 51 healthy vaginal microbiomes from the Human Microbiome Project. Samples were naively grouped by assigning each sample according to its dominant (>50%) Lactobacillus species or to the absence (<2%) of any Lactobacillus. For each cluster (named I to V; ref. 5) we report averages across samples for all genera and species as inferred by MetaPhlAn and, for genera, as estimated by the combination of the mothur 16S rRNA processing pipeline and the Ribosomal Database Project (RDP) classifier (see Online Methods) applied to 16S rRNA gene sequences from the same specimen.

  3. Figure 3: The gut microbiota in asymptomatic Western populations as inferred by MetaPhlAn on 224 samples combining the HMP and MetaHIT cohorts.

    (a) Taxonomic cladogram reporting all clades present in one or both cohorts (≥0.5% abundance in ≥1 sample). Circle size is proportional to the log of average abundance; color represents relative enrichment of the most abundant taxa (≥1% average in ≥1 cohort) between the HMP (n = 139) and MetaHIT (n = 85, healthy only) populations. (b,c) Genus- (b) and species-level (c) taxonomic profiles of the most abundant clades, hierarchically clustered (average linkage) with the Bray-Curtis similarity, reveal sets of samples with similar microbial community compositions. With the exception of the cluster dominated by genus Bacteroides (B. vulgatus and B. ovatus in particular), samples from both studies are present in all groups, thus confirming substantial consistency of the gut microbiota characterized by independent and geographically distant Western-diet asymptomatic cohorts. Only species and genera with at least 7.5% abundances at the 95th percentile of their distribution are reported.
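The error measures named in the Figure 1 caption can be illustrated with a short sketch. The function names, the example abundances and the choice of Pearson correlation are assumptions made for illustration; the caption does not state which correlation coefficient was used or how missing taxa were treated.

```python
import math


def abundance_errors(true_abund, inferred_abund):
    """Per-taxon absolute errors and their r.m.s., in percentage points.

    Assumption: a taxon missing from either profile counts as 0% abundance.
    """
    taxa = sorted(set(true_abund) | set(inferred_abund))
    abs_err = {
        t: abs(inferred_abund.get(t, 0.0) - true_abund.get(t, 0.0)) for t in taxa
    }
    rms = math.sqrt(sum(e * e for e in abs_err.values()) / len(taxa))
    return abs_err, rms


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (assumed metric)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical three-species community (percent relative abundance).
true_profile = {"species_A": 50.0, "species_B": 30.0, "species_C": 20.0}
inferred_profile = {"species_A": 47.0, "species_B": 34.0, "species_C": 19.0}

abs_err, rms = abundance_errors(true_profile, inferred_profile)
taxa = sorted(true_profile)
r = pearson([true_profile[t] for t in taxa], [inferred_profile[t] for t in taxa])
print(abs_err, round(rms, 3), round(r, 3))
```

Keeping the absolute errors per taxon, rather than only the aggregate, makes it possible to see which organisms drive the r.m.s. value.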
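The naive grouping rule described in the Figure 2 caption might be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name, the sample profiles and the reading of "absence" as a summed Lactobacillus abundance below 2% are all hypothetical, and the mapping from groups to cluster names I to V is not reproduced because the caption does not give it.

```python
def group_sample(species_abund, dominant=50.0, absent=2.0):
    """Assign one sample to a group from its species abundances (percent).

    Returns the dominant Lactobacillus species (>50%), the label
    "no Lactobacillus" when the genus is effectively absent (<2% in total,
    an assumed reading of the caption), or None when neither rule applies.
    """
    lacto = {
        s: a for s, a in species_abund.items() if s.startswith("Lactobacillus ")
    }
    for species, abundance in lacto.items():
        if abundance > dominant:
            return species  # at most one species can exceed 50%
    if sum(lacto.values()) < absent:
        return "no Lactobacillus"
    return None


# Hypothetical profiles, for illustration only.
samples = {
    "s1": {"Lactobacillus crispatus": 92.0, "Gardnerella vaginalis": 8.0},
    "s2": {
        "Gardnerella vaginalis": 60.0,
        "Prevotella bivia": 39.5,
        "Lactobacillus iners": 0.5,
    },
    "s3": {
        "Lactobacillus iners": 40.0,
        "Lactobacillus crispatus": 35.0,
        "Gardnerella vaginalis": 25.0,
    },
}
groups = {name: group_sample(profile) for name, profile in samples.items()}
print(groups)
```

Samples that satisfy neither threshold are returned as `None` here; how such samples were handled in the original analysis is not stated in the caption.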
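Two steps from the Figure 3 caption, the Bray-Curtis similarity between samples and the 95th-percentile abundance filter, can be sketched in plain Python. The function names and example profiles are hypothetical, and the nearest-rank percentile definition is an assumption, since the caption does not say how the percentile was computed.

```python
import math


def bray_curtis_similarity(a, b):
    """Bray-Curtis similarity: 1 - sum|a_i - b_i| / sum(a_i + b_i)."""
    taxa = set(a) | set(b)
    total = sum(a.get(t, 0.0) + b.get(t, 0.0) for t in taxa)
    if total == 0:
        return 1.0  # convention chosen here for two empty profiles
    diff = sum(abs(a.get(t, 0.0) - b.get(t, 0.0)) for t in taxa)
    return 1.0 - diff / total


def percentile_nearest_rank(values, q):
    """q-th percentile by the nearest-rank method (assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def abundant_taxa(profiles, q=95.0, min_abundance=7.5):
    """Taxa whose q-th percentile abundance across samples is >= min_abundance."""
    taxa = set().union(*profiles)
    return sorted(
        t
        for t in taxa
        if percentile_nearest_rank([p.get(t, 0.0) for p in profiles], q)
        >= min_abundance
    )


# Hypothetical genus-level profiles (percent relative abundance).
s1 = {"Bacteroides": 60.0, "Prevotella": 10.0, "Faecalibacterium": 30.0}
s2 = {"Bacteroides": 20.0, "Prevotella": 50.0, "Faecalibacterium": 30.0}
s3 = {"Bacteroides": 45.0, "Faecalibacterium": 50.0, "Akkermansia": 5.0}

similarity = bray_curtis_similarity(s1, s2)
kept = abundant_taxa([s1, s2, s3])
print(round(similarity, 3), kept)
```

Pairwise dissimilarities (1 minus the similarity) computed this way could then be passed to an average-linkage routine such as `scipy.cluster.hierarchy.linkage` with `method="average"`; that step is left out to keep the sketch free of dependencies.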


  1. DeLong, E.F. Nat. Rev. Microbiol. 3, 459–469 (2005).
  2. Daniel, R. Nat. Rev. Microbiol. 3, 470–478 (2005).
  3. The Human Microbiome Project Consortium. Nature advance online publication, doi:10.1038/nature11209 (14 June 2012).
  4. Qin, J. et al. Nature 464, 59–65 (2010).
  5. Ravel, J. et al. Proc. Natl. Acad. Sci. USA 108, 4680–4687 (2011).
  6. Veiga, P. et al. Proc. Natl. Acad. Sci. USA 107, 18132–18137 (2010).
  7. Turnbaugh, P.J. et al. Nature 457, 480–484 (2009).
  8. Markowitz, V.M. et al. Nucleic Acids Res. 38, D382–D390 (2010).
  9. Fredricks, D.N., Fiedler, T.L. & Marrazzo, J.M. N. Engl. J. Med. 353, 1899–1911 (2005).
  10. Stewart, F.J., Ulloa, O. & DeLong, E.F. Environ. Microbiol. 14, 23–40 (2012).
  11. Arumugam, M. et al. Nature 473, 174–180 (2011).
  12. Brady, A. & Salzberg, S. Nat. Methods 8, 367 (2011).
  13. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. J. Mol. Biol. 215, 403–410 (1990).
  14. Parks, D.H., MacDonald, N. & Beiko, R. BMC Bioinformatics 12, 328 (2011).
  15. Rosen, G.L., Reichenberger, E.R. & Rosenfeld, A.M. Bioinformatics 27, 127–129 (2011).
  16. Segata, N. & Huttenhower, C. PLoS ONE 6, e24704 (2011).
  17. Bohlin, J. et al. BMC Evol. Biol. 10, 249 (2010).
  18. Edgar, R.C. Bioinformatics 26, 2460–2461 (2010).
  19. Wu, M. & Eisen, J.A. Genome Biol. 9, R151 (2008).
  20. Ciccarelli, F.D. et al. Science 311, 1283–1287 (2006).
  21. Mavromatis, K. et al. Nat. Methods 4, 495–500 (2007).
  22. Kanehisa, M., Goto, S., Furumichi, M., Tanabe, M. & Hirakawa, M. Nucleic Acids Res. 38, D355–D360 (2010).
  23. Li, H., Ruan, J. & Durbin, R. Genome Res. 18, 1851–1858 (2008).
  24. Pruitt, K.D., Tatusova, T., Klimke, W. & Maglott, D.R. Nucleic Acids Res. 37, D32–D36 (2009).
  25. Huson, D.H., Auch, A.F., Qi, J. & Schuster, S.C. Genome Res. 17, 377–386 (2007).
  26. Huson, D.H., Mitra, S., Ruscheweyh, H.J., Weber, N. & Schuster, S.C. Genome Res. 21, 1552–1560 (2011).
  27. Gori, F., Folino, G., Jetten, M.S.M. & Marchiori, E. Bioinformatics 27, 196–203 (2011).
  28. Berger, S.A. & Stamatakis, A. Bioinformatics 27, 2068–2075 (2011).
  29. Gerlach, W. & Stoye, J. Nucleic Acids Res. 39, e91 (2011).
  30. McHardy, A.C., Rigoutsos, I., Hugenholtz, P., Tsirigos, A. & Martin, H.G. Nat. Methods 4, 63–72 (2007).
  31. Patil, K.R. et al. Nat. Methods 8, 191–192 (2011).
  32. Brady, A. & Salzberg, S.L. Nat. Methods 6, 673–676 (2009).
  33. Rosen, G., Garbarine, E., Caseiro, D., Polikar, R. & Sokhansanj, B. Adv. Bioinformatics 2008, 205969 (2008).
  34. Nalbantoglu, O.U., Way, S.F., Hinrichs, S.H. & Sayood, K. BMC Bioinformatics 12, 41 (2011).
  35. Leung, H.C. et al. Bioinformatics 27, 1489–1495 (2011).
  36. Schloss, P.D. et al. Appl. Environ. Microbiol. 75, 7537–7541 (2009).
  37. Cole, J.R. et al. Nucleic Acids Res. 37, D141–D145 (2009).


Author information


  1. Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, USA.

    • Nicola Segata,
    • Levi Waldron,
    • Vagheesh Narasimhan &
    • Curtis Huttenhower
  2. Centre for Integrative Biology, University of Trento, Trento, Italy.

    • Annalisa Ballarini &
    • Olivier Jousson


N.S., A.B., O.J. and C.H. conceived the method; N.S. implemented the software; N.S. and C.H. performed the experiments; N.S., L.W., V.N. and C.H. analyzed the data; and N.S. and C.H. wrote the manuscript.

Competing financial interests

The authors declare no competing financial interests.


Supplementary information

PDF files

  1. Supplementary Text and Figures (2M)

    Supplementary Figures 1–7, Supplementary Tables 1 and 2 and Supplementary Notes 1–3
