Abstract

Functional profiles of microbial communities are typically generated using comprehensive metagenomic or metatranscriptomic sequence read searches, which are time-consuming, prone to spurious mapping, and often limited to community-level quantification. We developed HUMAnN2, a tiered search strategy that enables fast, accurate, and species-resolved functional profiling of host-associated and environmental communities. HUMAnN2 identifies a community’s known species, aligns reads to their pangenomes, performs translated search on unclassified reads, and finally quantifies gene families and pathways. Relative to pure translated search, HUMAnN2 is faster and produces more accurate gene family profiles. We applied HUMAnN2 to study clinal variation in marine metabolism, ecological contribution patterns among human microbiome pathways, variation in species’ genomic versus transcriptional contributions, and strain profiling. Further, we introduce ‘contributional diversity’ to explain patterns of ecological assembly across different microbial community types.
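
The control flow of the tiered strategy described above can be illustrated with a minimal Python sketch. This is not HUMAnN2's implementation: the function name `tiered_search`, its parameters, and the toy data are hypothetical, and real sequence alignment is replaced by dictionary lookups. The sketch shows only the ordering of the tiers: reads are first matched against the pangenomes of species detected in a prescreen, and only the reads left unexplained are passed to the (in practice far more expensive) translated search.

```python
from collections import Counter


def tiered_search(reads, detected_species, pangenomes, protein_db):
    """Toy tiered search (hypothetical helper, not the HUMAnN2 API).

    reads            : list of read strings
    detected_species : species names reported by a taxonomic prescreen
    pangenomes       : {species: {read: gene_family}}, stand-in for nucleotide alignment
    protein_db       : {read: gene_family}, stand-in for translated search
    """
    abundance = Counter()  # keys are (gene_family, species or "unclassified")
    unmapped = []

    # Pangenome tier: search only the pangenomes of detected species.
    for read in reads:
        for species in detected_species:
            family = pangenomes.get(species, {}).get(read)
            if family is not None:
                abundance[(family, species)] += 1
                break
        else:
            unmapped.append(read)

    # Translated-search tier: applied only to reads the first tier left unexplained.
    unexplained = 0
    for read in unmapped:
        family = protein_db.get(read)
        if family is None:
            unexplained += 1
        else:
            abundance[(family, "unclassified")] += 1

    return abundance, unexplained


# Toy inputs (assumed for illustration only).
pangenomes = {
    "species_A": {"ACGT": "fam1", "GGCC": "fam2"},
    "species_B": {"TTAA": "fam1"},
}
protein_db = {"CCCC": "fam3"}
reads = ["ACGT", "ACGT", "TTAA", "CCCC", "NNNN"]

# Only species_A passes the prescreen, so species_B's pangenome is never searched.
abundance, unexplained = tiered_search(reads, ["species_A"], pangenomes, protein_db)
```

In this toy run the read `TTAA` stays unexplained because its source species was missed by the prescreen and it has no match in the protein stand-in. It mirrors, in miniature, why missed species shift work to the later tier.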

Data availability

The Human Microbiome Project (HMP) metagenomes analyzed in this work are available via http://hmpdacc.org. The IBDMDB metagenomes and metatranscriptomes analyzed in this work are available via http://ibdmdb.org. The Red Sea metagenomes analyzed in this work were previously deposited as NCBI BioProject PRJNA289734. The synthetic metagenomes and metatranscriptomes used in the evaluation of HUMAnN2 and other methods are available from the authors and at http://huttenhower.sph.harvard.edu/humann2.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Acknowledgements

The authors thank M. Wong, T. Sharpton, and the members of the HUMAnN user group for their feedback on the development and evaluation of HUMAnN2. Funding for this work was provided by NSF 1565100 (to J.G.C.); People Programme (Marie Curie Actions) of the European Union’s Seventh Framework Programme (FP7/2007–2013) under REA grant agreement PCIG13-GA-2013-618833 and by MIUR “Futuro in Ricerca” RBFR13EWWI_001 (to N.S.); and NIH NIDDK U54DE023798, NSF MCB-1453942, NIH NIDDK P30DK043351, and NSF DBI-1053486 (to C.H.).

Author information

Author notes

  1. These authors contributed equally: Eric A. Franzosa and Lauren J. McIver.

Affiliations

  1. Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA

    • Eric A. Franzosa
    • Lauren J. McIver
    • Gholamali Rahnavard
    • Melanie Schirmer
    • George Weingart
    • Curtis Huttenhower
  2. The Broad Institute of MIT and Harvard, Cambridge, MA, USA

    • Eric A. Franzosa
    • Lauren J. McIver
    • Gholamali Rahnavard
    • Melanie Schirmer
    • Curtis Huttenhower
  3. Department of Pediatrics, University of California San Diego, San Diego, CA, USA

    • Luke R. Thompson
    • Rob Knight
  4. Pathogen and Microbiome Institute, Northern Arizona University, Flagstaff, AZ, USA

    • Karen Schwarzberg Lipson
    • J. Gregory Caporaso
  5. Department of Computer Science & Engineering, University of California San Diego, San Diego, CA, USA

    • Rob Knight
  6. Centre for Integrative Biology, University of Trento, Trento, Italy

    • Nicola Segata

Contributions

E.A.F., L.J.M., and C.H. designed the methods. L.J.M. developed the software implementation. G.R., G.W., and N.S. produced datasets to support the software. E.A.F., L.J.M., G.R., L.R.T., M.S., and K.S.L. designed and carried out the evaluations and applications; R.K., J.G.C., and all other authors participated in interpretation of the resulting data. E.A.F., L.J.M., L.R.T., M.S., K.S.L., and C.H. wrote the paper with feedback from the other authors.

Competing interests

The authors declare no competing interests.

Corresponding author

Correspondence to Curtis Huttenhower.

Integrated supplementary information

  1. Supplementary Figure 1 Expanded overview of the HUMAnN2 method.

    (a) HUMAnN2 implements a tiered meta’omic search that aims to explain the origin of microbial community DNA or RNA reads based on the pangenomes of detected microbes before falling back to more computationally expensive translated search. (b) The tiered search produces alignments of reads to coding sequences of known or ambiguous taxonomy. These alignments are processed in a species-specific manner to calculate gene family abundance and reconstruct community metabolic pathways. (c) HUMAnN2 thus provides, for each community meta’ome: per-gene abundances, pathway presence/absence calls and abundances, and downstream visualization and statistical tests.

  2. Supplementary Figure 2 Reference hold-out analysis of a complex synthetic metagenome.

    We constructed a 100-member mock-even synthetic metagenome containing only non-human-associated species (~2× coverage per species) and analyzed it with HUMAnN2. (a) Variation in the number of reads sampled per gene (compared with a genome’s average fold-coverage) makes a non-trivial contribution to the error in per-species gene abundance estimation in HUMAnN2 (roughly 0.1 Bray-Curtis dissimilarity units). (b) Accuracy of community-level gene family abundance estimation decreases linearly with the number of community species missed by HUMAnN2’s taxonomic prescreen (simulated here by excluding sets of species from the underlying pangenome reference collection). (c) HUMAnN2’s overall runtime increases linearly as more species are excluded from the taxonomic prescreen (which results in more work being done during translated search). Runtimes reflect execution using 8 CPU cores.

  3. Supplementary Figure 3 HUMAnN2 tiered search performance on human metagenomes.

    We applied HUMAnN2’s tiered search to profile 397 first-visit HMP metagenomes on Harvard University’s Odyssey Research Computing Cluster (8 CPU cores per job). Sample counts per body site were as follows: 54 for anterior nares, 65 for buccal mucosa, 68 for supragingival plaque, 73 for tongue dorsum, 76 for stool, and 34 for posterior fornix. (a) At most body sites, ~60% of reads were explained by detected pangenomes, with (b) an additional ~20% explained by downstream translated search (~80% total). Pangenome search performance (c) consistently exceeded translated search performance (d) by 1–2 orders of magnitude. From smallest to largest, box plot elements in panels a–d represent the lower inner fence, first quartile, median, third quartile, and upper inner fence. Horizontal red lines indicate the median value over all samples. (e) Total runtime is largely dictated by the number of reads passed to translated search, and (for HMP samples with <100 million reads) was approximately linear in the number of input reads (~1 h per 5 million input reads). (f) Peak memory use was sublinear in the number of input reads and very predictable. The cluster of outliers in f results from large samples that were requeued during their runs: these samples resumed later in the HUMAnN2 workflow and hence display smaller peak memory use.

  4. Supplementary Figure 4 HUMAnN2 compared with other methods (details).

    We profiled a 10-million-read synthetic gut metagenome using HUMAnN2 (tiered and pure translated search modes), HUMAnN1, COGNIZER, MEGAN, and ShotMAP to produce profiles of COG abundance. Here, expected (gold standard) and observed COG abundances are compared in units of copies per million (CPMs; that is, raw abundance normalized by gene length and number of mapped reads). HUMAnN2’s tiered search was considerably more accurate than the other methods based on pure translated search. HUMAnN2’s pure translated search showed better agreement than other translated search methods, with its largest source of error being underreporting of low-abundance COGs (false negatives). This behavior is expected from the translated search coverage filters used in HUMAnN2, which we use to limit false positive detection events (that is, COGs with zero expected abundance and non-zero observed abundance). Ticks in the x- and y-axis margins represent zero values; x-axis ticks are false negatives and y-axis ticks are false positives.

  5. Supplementary Figure 5 Protein coverage thresholds in translated search.

    If two largely unrelated proteins share local sequence homology, reads drawn from the homologous region will map to both proteins, potentially resulting in false positive detection events. To limit such events, we require a threshold fraction of sites in a protein to recruit reads during translated search before considering the protein ‘detected’. We evaluated potential thresholds by analyzing the results of pure translated search of synthetic metagenomes versus the UniRef90 database. Trade-offs between sensitivity and precision are shown for the 100-member, even, non-human-associated metagenome in a, and the 20-member, staggered, human-gut-associated metagenome in b. When all community genomes are well covered, a 50% coverage threshold (HUMAnN2’s default) yields a marked increase in precision with only minor loss of sensitivity (a). Loss of sensitivity is higher at this threshold when rare (low-coverage) genomes are included, as genes in low-coverage genomes often fail to meet the coverage threshold due to insufficient read sampling (b). These evaluations do not reflect any additional post-processing of translated search results (for example, weighting by alignment quality), which provides additional accuracy improvements.

  6. Supplementary Figure 6 HUMAnN2 compared with other methods: synthetic metatranscriptome evaluation.

    We profiled a 10-million-read synthetic gut metatranscriptome using HUMAnN2 (tiered and pure translated search modes), HUMAnN1, COGNIZER, MEGAN, and ShotMAP to produce profiles of community-level COG transcript abundance. Twenty species’ genomic abundance values were geometrically staggered (as in the gut metagenome evaluation), while genes (transcripts) were sampled within-species following a log-normal distribution [ln N(0, 1)]. (a) Measures of methods’ accuracy and performance in this evaluation. All methods were allowed to use 8 CPU cores and up to 30 GB of memory. This panel is analogous to Fig. 1e (which focuses on metagenomic COG abundance in the same synthetic community). (b) Observed versus expected COG transcript abundance across the six methods. This panel is analogous to Supplementary Fig. 4. CPM refers to “copies per million.” Ticks in the x- and y-axis margins represent zero values; x-axis ticks are false negatives and y-axis ticks are false positives.

  7. Supplementary Figure 7 HUMAnN2 compared with other methods: novel isolates of known species, UniRef90-based COG gold standard.

    We profiled a 10-million-read synthetic metagenome using HUMAnN2 (tiered and pure translated search modes), HUMAnN1, COGNIZER, MEGAN, and ShotMAP to produce profiles of community-level COG abundance. Twenty recent, new isolates of known species (that is, species present in HUMAnN2’s pangenome database) were sampled at staggered relative abundance. (a) Measures of methods’ accuracy and performance in this evaluation. All methods were allowed to use 8 CPU cores and up to 30 GB of memory. This panel and analysis are analogous to those in Fig. 1e. (b) Observed versus expected COG transcript abundance across the six methods. This panel is analogous to Supplementary Fig. 4. CPM refers to “copies per million.” Ticks in the x- and y-axis margins represent zero values; x-axis ticks are false negatives and y-axis ticks are false positives.

  8. Supplementary Figure 8 HUMAnN2 compared with other methods: novel isolates of known species, UniRef50-based COG gold standard.

    This figure mirrors Supplementary Fig. 7, except that COG annotations are defined based on co-clustering with UniRef50 families (rather than UniRef90). Similarly, HUMAnN2 was run in UniRef50 mode. These changes tend to favor sensitivity over specificity during both isolate genome annotation and profiling. (a) Accuracy and performance of the six functional profiling methods. (b) Observed versus expected COG abundance.

  9. Supplementary Figure 9 HUMAnN2 compared with other methods: isolates of novel species, UniRef90-based COG gold standard.

    We profiled a 10-million-read synthetic metagenome using HUMAnN2 (tiered and pure translated search modes), HUMAnN1, COGNIZER, MEGAN, and ShotMAP to produce profiles of community-level COG abundance. Twenty recent, new isolates of novel species (that is, species not present in HUMAnN2’s pangenome database) were sampled at staggered relative abundance. Note that, in this context, HUMAnN2’s tiered search relies entirely on the translated search phase to explain sample reads. (a) Measures of methods’ accuracy and performance in this evaluation. All methods were allowed to use 8 CPU cores and up to 30 GB of memory. This panel and analysis are analogous to those in Fig. 1e. (b) Observed versus expected COG transcript abundance across the six methods. This panel is analogous to Supplementary Fig. 4. CPM refers to “copies per million.” Ticks in the x- and y-axis margins represent zero values; x-axis ticks are false negatives and y-axis ticks are false positives. Vertical striping of “expected COG abundance” results from single-copy COGs that were only assigned to one genome (and hence all have the same expected coverage).

  10. Supplementary Figure 10 HUMAnN2 compared with other methods: isolates of novel species, UniRef50-based COG gold standard.

    This figure mirrors Supplementary Fig. 9, except that COG annotations are defined based on co-clustering with UniRef50 families (rather than UniRef90). Similarly, HUMAnN2 was run in UniRef50 mode. These changes tend to favor sensitivity over specificity during both isolate genome annotation and profiling. (a) Accuracy and performance of the six functional profiling methods. (b) Observed versus expected COG abundance.

  11. Supplementary Figure 11 Contributional diversity at additional oral sites.

    This figure follows the format of Fig. 2 from the main text and includes data for two additional oral body sites: buccal mucosa and supragingival plaque. Stars indicate background species-level community diversity.

  12. Supplementary Figure 12 Additional examples of core human microbiome pathways with low within-subject and low between-subject contributional diversity.

    Bar heights represent the total relative abundance of the pathway and are log-scaled. Contributions of individual species/other/unclassified are linearly scaled within the total bar height.

  13. Supplementary Figure 13 Non-vaginal examples of human microbiome pathways with simple but varied contributional diversity.

    Bar heights represent the total relative abundance of the pathway and are log-scaled. Contributions of individual species/other/unclassified are linearly scaled within the total bar height.

  14. Supplementary Figure 14 Examples of subspecies-level functional variation (gene level).

    (a) Strains of Lactobacillus jensenii were well represented in 21 HMP posterior fornix samples. At least two subspecies-level clades appear to be present, defined by the presence of gene block a1 or a2 (highlighted). (b) Strains of Eubacterium eligens were well represented in 51 HMP stool samples. At least three subspecies-level clades appear to be present, defined by the presence/absence of gene blocks b1, b2, and b3 (highlighted).

  15. Supplementary Figure 15 Example of potential niche-adapted subspecies of Haemophilus haemolyticus.

    Metagenomic ‘strains’ (UniRef90 gene family presence/absence profiles) of this species differ across the three oral sites where it was detected. Right-side plots illustrate the coreness, variability, and site-specific enrichment of individual genes. Variability peaks at 1.0 for genes detected in exactly 50% of samples. Site-specific enrichment peaks at 1.0 when the gene is 100% prevalent in a focal site and 0% prevalent in all other sites (with –1 corresponding to the exact opposite scenario).
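
The caption for Supplementary Figure 15 gives only the boundary behavior of the variability and site-specific enrichment scores, not their formulas. The Python sketch below uses one pair of definitions that satisfies those stated boundaries. Both formulas are assumptions made for illustration, not the authors' definitions: variability is taken as 1 − |2p − 1| for prevalence p, and enrichment as the prevalence in the focal site minus the pooled prevalence in all other sites. The site names and detection calls are hypothetical.

```python
def prevalence(calls):
    """Fraction of samples in which a gene was detected (calls are booleans)."""
    return sum(calls) / len(calls)


def variability(calls):
    """Assumed form: 1.0 at 50% prevalence, 0.0 at 0% or 100% prevalence."""
    p = prevalence(calls)
    return 1.0 - abs(2.0 * p - 1.0)


def site_enrichment(calls_by_site, focal_site):
    """Assumed form: focal-site prevalence minus pooled prevalence elsewhere.

    Equals +1.0 when the gene is in every focal-site sample and no others,
    and -1.0 in the exact opposite scenario.
    """
    focal = prevalence(calls_by_site[focal_site])
    others = [
        call
        for site, calls in calls_by_site.items()
        if site != focal_site
        for call in calls
    ]
    return focal - prevalence(others)


# Hypothetical presence/absence calls for one gene across three sites.
gene_calls = {
    "site_1": [True, True],
    "site_2": [False, False],
    "site_3": [False, False],
}

enrichment_site_1 = site_enrichment(gene_calls, "site_1")
enrichment_site_2 = site_enrichment(gene_calls, "site_2")
half_prevalent = variability([True, False, True, False])
always_present = variability([True, True, True])
```

Other functional forms (for example 4p(1 − p) for variability) meet the same boundary conditions, so the values between the extremes should not be read as reproducing the published plots.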

Supplementary information

  1. Supplementary Text and Figures

    Supplementary Figures 1–15 and Supplementary Notes 1–7

  2. Reporting Summary

  3. Supplementary Software

    The PyPI installation package for HUMAnN2 v0.11.0 (used in the evaluations from the manuscript)

About this article

DOI

https://doi.org/10.1038/s41592-018-0176-y