Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences

Journal name: Nature Biotechnology


Profiling phylogenetic marker genes, such as the 16S rRNA gene, is a key tool for studies of microbial communities but does not provide direct evidence of a community's functional capabilities. Here we describe PICRUSt (phylogenetic investigation of communities by reconstruction of unobserved states), a computational approach to predict the functional composition of a metagenome using marker gene data and a database of reference genomes. PICRUSt uses an extended ancestral-state reconstruction algorithm to predict which gene families are present and then combines gene families to estimate the composite metagenome. Using 16S information, PICRUSt recaptures key findings from the Human Microbiome Project and accurately predicts the abundance of gene families in host-associated and environmental communities, with quantifiable uncertainty. Our results demonstrate that phylogeny and function are sufficiently linked that this 'predictive metagenomic' approach should provide useful insights into the thousands of uncultivated microbial communities for which only marker gene surveys are currently available.

At a glance


    Figure 1: The PICRUSt workflow.

    PICRUSt is composed of two high-level workflows: gene content inference (top box) and metagenome inference (bottom box). Beginning with a reference OTU tree and a gene content table (i.e., counts of genes for reference OTUs with known gene content), the gene content inference workflow predicts gene content for each OTU with unknown gene content, including predictions of marker gene copy number. This information is precomputed for 16S based on Greengenes (ref. 29) and IMG (ref. 26), but all functionality is accessible in PICRUSt for use with other marker genes and reference genomes. The metagenome inference workflow takes an OTU table (i.e., counts of OTUs on a per-sample basis), where OTU identifiers correspond to tips in the reference OTU tree, as well as the copy number of the marker gene in each OTU and the gene content of each OTU (as generated by the gene content inference workflow), and outputs a metagenome table (i.e., counts of gene families on a per-sample basis).
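The metagenome inference step described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the PICRUSt implementation. It assumes that each OTU count is divided by that OTU's predicted marker gene copy number and then multiplied by the OTU's predicted gene content, and that the products are summed per sample. The function name `predict_metagenome`, the dictionary layout and all identifiers in the example are hypothetical.

```python
def predict_metagenome(otu_table, copy_number, gene_content):
    """Sketch of metagenome inference (assumed arithmetic, not PICRUSt's code).

    otu_table:    {sample: {otu_id: observed 16S count}}
    copy_number:  {otu_id: predicted 16S copies per genome}
    gene_content: {otu_id: {gene_family: predicted copies per genome}}
    Returns       {sample: {gene_family: predicted abundance}}
    """
    metagenome = {}
    for sample, otu_counts in otu_table.items():
        gene_totals = {}
        for otu_id, count in otu_counts.items():
            # Assumption: dividing by marker gene copy number converts
            # 16S read counts into an estimate of organism abundance.
            organisms = count / copy_number[otu_id]
            for family, copies in gene_content[otu_id].items():
                gene_totals[family] = gene_totals.get(family, 0.0) + organisms * copies
        metagenome[sample] = gene_totals
    return metagenome


# Hypothetical toy inputs: one sample, two OTUs, two gene families.
example_otu_table = {"sample_1": {"otu_a": 10, "otu_b": 6}}
example_copy_number = {"otu_a": 5, "otu_b": 2}
example_gene_content = {
    "otu_a": {"family_1": 1, "family_2": 2},
    "otu_b": {"family_1": 3},
}
example_metagenome = predict_metagenome(
    example_otu_table, example_copy_number, example_gene_content
)
```

In this toy case otu_a contributes 10 / 5 = 2 organisms and otu_b contributes 6 / 2 = 3, so family_1 receives 2 × 1 + 3 × 3 = 11 and family_2 receives 2 × 2 = 4.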

    Figure 2: PICRUSt recapitulates biological findings from the Human Microbiome Project.

    (a) Principal component analysis (PCA) plot comparing KEGG module predictions made from 16S data with PICRUSt (lighter colored triangles) against sequenced shotgun metagenomes (darker colored circles). (b–f) Relative abundances across human body sites, estimated from 16S data with PICRUSt (P) and from whole genome sequencing (W), for five specific KEGG modules: (b) M00061: Uronic acid metabolism. (c) M00076: Dermatan sulfate degradation. (d) M00077: Chondroitin sulfate degradation. (e) M00078: Heparan sulfate degradation. (f) M00079: Keratan sulfate degradation. All five modules are involved in glycosaminoglycan degradation (KEGG pathway ko00531). Color key is the same as in a.

    Figure 3: PICRUSt accuracy across various environmental microbiomes.

    Prediction accuracy for paired 16S rRNA marker gene surveys and shotgun metagenomes is plotted against the availability of reference genomes, as summarized by the nearest sequenced taxon index (NSTI). Accuracy is summarized using the Spearman correlation between the relative abundance of gene copy number predicted from 16S data using PICRUSt and the relative abundance observed in the sequenced shotgun metagenome. In the absence of large differences in metagenomic sequencing depth, relatively well-characterized environments, such as the human gut, had low NSTI values and could be predicted accurately from 16S surveys. Conversely, environments containing much unexplored diversity (e.g., phyla with few or no sequenced genomes), such as the Guerrero Negro hypersaline microbial mats, tended to have high NSTI values.
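The accuracy measure named in the Figure 3 caption, a Spearman correlation between predicted and observed relative abundances, can be sketched in plain Python as follows. This is an illustrative calculation, not the authors' benchmarking code. The helper names and the gene family counts are hypothetical, and ties are assumed to receive average ranks.

```python
import math


def relative_abundance(counts):
    """Scale raw counts so that they sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical counts for four gene families in one sample.
predicted = relative_abundance([120, 30, 0, 50])   # e.g. predicted from 16S data
observed = relative_abundance([200, 80, 10, 40])   # e.g. shotgun metagenome
rho = spearman(predicted, observed)
```

Because the statistic depends only on rank order, converting counts to relative abundances does not change it. For these toy vectors two of the four families swap ranks, giving a correlation of 0.8.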

    Figure 4: Accuracy of PICRUSt prediction compared with shotgun metagenomic sequencing at shallow sequencing depths.

    Spearman correlation between either PICRUSt-predicted metagenomes (blue lines) or shotgun metagenomes (dashed red lines) using 14 soil microbial communities subsampled to the specified number of annotated sequences. This rarefaction reflects random subsets of either the full 16S OTU table (blue) or the corresponding gene table for the sequenced metagenome (red). Ten randomly chosen rarefactions were performed at each depth to indicate the expected correlation obtained when assessing an underlying true metagenome using either shallow 16S rRNA gene sequencing with PICRUSt prediction or shallow shotgun metagenomic sequencing. The data label describes the number of annotated reads below which PICRUSt-prediction accuracy exceeds metagenome sequencing accuracy. Note that the plotted rarefaction depth reflects the number of 16S or metagenomic sequences remaining after standard quality control, dereplication and annotation (or OTU picking in the case of 16S sequences), not the raw number returned from the sequencing facility. The number of total metagenomic reads below which PICRUSt outperforms metagenomic sequencing (72,650) for this data set was calculated by adjusting the crossover point in annotated reads (above) using annotation rates for the soil data set (17.3%) and closed-reference OTU picking rates for the 16S rRNA data set (68.9%). The inset figure illustrates rapid convergence of PICRUSt predictions given low numbers of annotated reads (blue line).
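The rarefaction described in the Figure 4 caption, drawing random subsets of a count table at a fixed depth, can be sketched as below. This is a minimal illustration that assumes subsampling without replacement from a single sample's counts. The function name `rarefy`, the seed and the toy counts are hypothetical and are not taken from the study.

```python
import random


def rarefy(counts, depth, rng):
    """Randomly subsample a {feature: count} mapping to `depth` observations.

    Sampling is without replacement, so no feature can exceed its
    original count in the subsample.
    """
    # Expand the table into one entry per observed sequence.
    pool = [feature for feature, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of available sequences")
    subsample = {feature: 0 for feature in counts}
    for feature in rng.sample(pool, depth):
        subsample[feature] += 1
    return subsample


# Hypothetical OTU counts for one sample, rarefied ten times at each depth,
# mirroring the ten rarefactions per depth mentioned in the caption.
toy_counts = {"otu_a": 40, "otu_b": 25, "otu_c": 5}
rng = random.Random(0)  # fixed seed so the sketch is repeatable
rarefactions = {
    depth: [rarefy(toy_counts, depth, rng) for _ in range(10)]
    for depth in (10, 30, 60)
}
```

Each rarefied table could then be passed through a prediction step and compared with the full-depth metagenome to trace how accuracy changes with depth.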

    Figure 5: PICRUSt prediction accuracy across the tree of bacterial and archaeal genomes.

    Phylogenetic tree produced by pruning the Greengenes 16S reference tree down to those tips representing sequenced genomes. Height of the bars in the outermost circle indicates the accuracy of PICRUSt for each genome (accuracy: 0.5–1.0) colored by phylum, with text labels for each genus with at least 15 strains. PICRUSt predictions were as accurate for archaeal (mean = 0.94 ± 0.04 s.d., n = 103) as for bacterial genomes (mean = 0.95 ± 0.05 s.d., n = 2,487).

    Figure 6: Variation in inference accuracy across functional modules within single genomes.

    Results are colored by functional category and sorted in decreasing order of accuracy within each category (indicated by triangular bars, right margin). Note that accuracy was >0.80 for all, and therefore the region 0.80–1.0 is displayed for clearer visualization of differences between modules.


  1. Cho, I. & Blaser, M.J. The human microbiome: at the interface of health and disease. Nat. Rev. Genet. 13, 260–270 (2012).
  2. Suen, G. et al. An insect herbivore microbiome with high plant biomass-degrading capacity. PLoS Genet. 6, e1001129 (2010).
  3. Kuczynski, J. et al. Direct sequencing of the human microbiome readily reveals community differences. Genome Biol. 11, 210 (2010).
  4. Parks, D.H. & Beiko, R.G. Measures of phylogenetic differentiation provide robust and complementary insights into microbial communities. ISME J. 7, 173–183 (2013).
  5. Knight, R. et al. Unlocking the potential of metagenomics through replicated experimental design. Nat. Biotechnol. 30, 513–520 (2012).
  6. Segata, N. & Huttenhower, C. Toward an efficient method of identifying core genes for evolutionary and functional microbial phylogenies. PLoS ONE 6, e24704 (2011).
  7. Snel, B., Bork, P. & Huynen, M.A. Genome phylogeny based on gene content. Nat. Genet. 21, 108–110 (1999).
  8. Konstantinidis, K.T. & Tiedje, J.M. Genomic insights that advance the species definition for prokaryotes. Proc. Natl. Acad. Sci. USA 102, 2567–2572 (2005).
  9. Zaneveld, J.R., Lozupone, C., Gordon, J.I. & Knight, R. Ribosomal RNA diversity predicts genome diversity in gut bacteria and their relatives. Nucleic Acids Res. 38, 3869–3879 (2010).
  10. Xu, J. et al. Evolution of symbiotic bacteria in the distal human intestine. PLoS Biol. 5, e156 (2007).
  11. Collins, R.E. & Higgs, P.G. Testing the infinitely many genes model for the evolution of the bacterial core genome and pangenome. Mol. Biol. Evol. 29, 3413–3425 (2012).
  12. Martiny, A.C., Treseder, K. & Pusch, G. Phylogenetic conservatism of functional traits in microorganisms. ISME J. 7, 830–838 (2013).
  13. Morgan, X.C. et al. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 13, R79 (2012).
  14. Muegge, B.D. et al. Diet drives convergence in gut microbiome functions across mammalian phylogeny and within humans. Science 332, 970–974 (2011).
  15. Barott, K.L. et al. Microbial to reef scale interactions between the reef-building coral Montastraea annularis and benthic algae. Proc. Biol. Sci. 279, 1655–1664 (2012).
  16. Chaffron, S., Rehrauer, H., Pernthaler, J. & von Mering, C. A global network of coexisting microbes from environmental and whole-genome sequence data. Genome Res. 20, 947–959 (2010).
  17. Kembel, S.W., Wu, M., Eisen, J.A. & Green, J.L. Incorporating 16S gene copy number information improves estimates of microbial diversity and abundance. PLoS Comput. Biol. 8, e1002743 (2012).
  18. Smillie, C.S. et al. Ecology drives a global network of gene exchange connecting the human microbiome. Nature 480, 241–244 (2011).
  19. Meehan, C.J. & Beiko, R.G. Lateral gene transfer of an ABC transporter complex between major constituents of the human gut microbiome. BMC Microbiol. 12, 248 (2012).
  20. Boucher, Y. et al. Lateral gene transfer and the origins of prokaryotic groups. Annu. Rev. Genet. 37, 283–328 (2003).
  21. Hemme, C.L. et al. Metagenomic insights into evolution of a heavy metal-contaminated groundwater microbial community. ISME J. 4, 660–672 (2010).
  22. The Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214 (2012).
  23. Fierer, N. et al. Cross-biome metagenomic analyses of soil microbial communities and their functional attributes. Proc. Natl. Acad. Sci. USA 109, 21390–21395 (2012).
  24. Harris, J.K. et al. Phylogenetic stratigraphy in the Guerrero Negro hypersaline microbial mat. ISME J. 7, 50–60 (2013).
  25. Kunin, V. et al. Millimeter-scale genetic gradients and community-level molecular convergence in a hypersaline microbial mat. Mol. Syst. Biol. 4, 198 (2008).
  26. Markowitz, V.M. et al. IMG: the Integrated Microbial Genomes database and comparative analysis system. Nucleic Acids Res. 40, D115–D122 (2012).
  27. Kanehisa, M., Goto, S., Sato, Y., Furumichi, M. & Tanabe, M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 40, D109–D114 (2012).
  28. Tatusov, R.L., Koonin, E.V. & Lipman, D.J. A genomic perspective on protein families. Science 278, 631–637 (1997).
  29. DeSantis, T.Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069–5072 (2006).
  30. Caporaso, J.G. et al. QIIME allows analysis of high-throughput community sequencing data. Nat. Methods 7, 335–336 (2010).
  31. Abubucker, S. et al. Metabolic reconstruction for metagenomic data and its application to the human microbiome. PLoS Comput. Biol. 8, e1002358 (2012).
  32. Meyer, F. et al. The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 9, 386 (2008).
  33. McHardy, A.C. & Rigoutsos, I. What's in the mix: phylogenetic classification of metagenome sequence samples. Curr. Opin. Microbiol. 10, 499–503 (2007).
  34. Haas, B.J. et al. Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Res. 21, 494–504 (2011).
  35. Patel, P.V. et al. Analysis of membrane proteins in metagenomics: networks of correlated environmental features and protein families. Genome Res. 20, 960–971 (2010).
  36. Parks, D.H. & Beiko, R.G. Identifying biologically relevant differences between metagenomic communities. Bioinformatics 26, 715–721 (2010).
  37. Zuniga, M. et al. Horizontal gene transfer in the molecular evolution of mannose PTS transporters. Mol. Biol. Evol. 22, 1673–1685 (2005).
  38. Daniluk, T. et al. Aerobic and anaerobic bacteria in subgingival and supragingival plaques of adult patients with periodontal disease. Adv. Med. Sci. 51 (suppl. 1), 81–85 (2006).
  39. Segata, N. et al. Composition of the adult digestive tract bacterial microbiome based on seven mouth surfaces, tonsils, throat and stool samples. Genome Biol. 13, R42 (2012).
  40. Knowlton, N. & Jackson, J.B. Shifting baselines, local impacts, and global change on coral reefs. PLoS Biol. 6, e54 (2008).
  41. Smith, J.E. et al. Indirect effects of algae on coral: algae-mediated, microbe-induced coral mortality. Ecol. Lett. 9, 835–845 (2006).
  42. Rasher, D.B., Stout, E.P., Engel, S., Kubanek, J. & Hay, M.E. Macroalgal terpenes function as allelopathic agents against reef corals. Proc. Natl. Acad. Sci. USA 108, 17726–17731 (2011).
  43. Gajer, P. et al. Temporal dynamics of the human vaginal microbiota. Sci. Transl. Med. 4, 132ra52 (2012).
  44. Costello, E.K. et al. Bacterial community variation in human body habitats across space and time. Science 326, 1694–1697 (2009).
  45. McDonald, D. et al. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J. 6, 610–618 (2012).
  46. Csuros, M. Count: evolutionary analysis of phylogenetic profiles with parsimony and likelihood. Bioinformatics 26, 1910–1912 (2010).
  47. Paradis, E., Claude, J. & Strimmer, K. APE: Analyses of phylogenetics and evolution in R language. Bioinformatics 20, 289–290 (2004).
  48. Edgar, R.C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).
  49. Federhen, S. The NCBI Taxonomy database. Nucleic Acids Res. 40, D136–D143 (2012).


Author information

  1. These authors contributed equally to this work.

    • Morgan G I Langille &
    • Jesse Zaneveld


  1. Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada.

    • Morgan G I Langille &
    • Robert G Beiko
  2. Department of Microbiology, Oregon State University, Corvallis, Oregon, USA.

    • Jesse Zaneveld &
    • Rebecca L Vega Thurber
  3. Department of Biological Sciences, Northern Arizona University, Flagstaff, Arizona, USA.

    • J Gregory Caporaso
  4. Institute for Genomics and Systems Biology, Argonne National Laboratory, Lemont, Illinois, USA.

    • J Gregory Caporaso
  5. BioFrontiers Institute, University of Colorado, Boulder, Colorado, USA.

    • Daniel McDonald
  6. Department of Computer Science, University of Colorado, Boulder, Colorado, USA.

    • Daniel McDonald
  7. Department of Computer Science and Engineering, University of Minnesota, Minneapolis, Minnesota, USA.

    • Dan Knights
  8. Biotechnology Institute, University of Minnesota, Saint Paul, Minnesota, USA.

    • Dan Knights
  9. Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, USA.

    • Joshua A Reyes &
    • Curtis Huttenhower
  10. Department of Chemistry and Biochemistry, University of Colorado, Boulder, Colorado, USA.

    • Jose C Clemente &
    • Rob Knight
  11. Department of Biological Sciences, Florida International University, Miami Beach, Florida, USA.

    • Deron E Burkepile
  12. Howard Hughes Medical Institute, Boulder, Colorado, USA.

    • Rob Knight
  13. Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA.

    • Curtis Huttenhower


The teams of M.G.I.L. and R.G.B.; J.A.R. and C.H.; and J.Z., D.K. and R.K. each conceived versions of the gene content prediction algorithm and implemented prototype software. J.Z., M.G.I.L., J.G.C., D.M., D.K., J.C.C., R.K., R.G.B. and C.H. designed the final PICRUSt algorithm and software. J.Z., M.G.I.L., J.G.C. and D.M. wrote the PICRUSt software package. M.G.I.L., J.G.C., D.M. and J.C.C. generated precalculated PICRUSt gene content predictions. D.M. and J.G.C. added functionality to the BIOM software package and the Greengenes resource in support of PICRUSt. M.G.I.L., J.Z., J.G.C., D.M., D.K., J.C.C., J.A.R., R.K., R.G.B. and C.H. applied PICRUSt to control datasets and analyzed the benchmarking data. M.G.I.L., J.Z., J.G.C., D.M., R.K., R.G.B. and C.H. wrote the manuscript. D.E.B. and R.L.V.T. collected and analyzed coral-algal data. All authors edited the manuscript.

Competing financial interests

The authors declare no competing financial interests.


Supplementary information

PDF files

  1. Supplementary Text and Figures (2 MB)

    Supplementary Results and Supplementary Figures 1–17

Zip files

  1. Supplementary Data (93 MB)
