Abstract
Accurate microbial identification and abundance estimation are crucial for metagenomics analysis. Various methods for classification of metagenomic data and estimation of taxonomic profiles, broadly referred to as metagenomic profilers, have been developed. Nevertheless, benchmarking of metagenomic profilers remains challenging because some tools are designed to report relative sequence abundance while others report relative taxonomic abundance. Here we show how misleading conclusions can be drawn by neglecting this distinction between relative abundance types when benchmarking metagenomic profilers. Moreover, we show compelling evidence that interchanging sequence abundance and taxonomic abundance will influence both per-sample summary statistics and cross-sample comparisons. We suggest that the microbiome research community pay attention to potentially misleading biological conclusions arising from this issue when benchmarking metagenomic profilers, by carefully considering the type of abundance data that were analyzed and interpreted and clearly stating the strategy used for metagenomic profiling.
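The distinction between the two abundance types can be illustrated with a small numerical sketch. It is not taken from the paper's scripts. It assumes a simplified model in which the fraction of reads assigned to a taxon is proportional to its genome length times its number of genome copies, so dividing sequence abundance by genome length and renormalizing approximates taxonomic abundance. The function name, the taxon labels and the genome lengths are hypothetical, and ploidy and marker-gene copy number are ignored.

```python
def sequence_to_taxonomic(seq_abundance, genome_length):
    """Approximate relative taxonomic abundance from relative sequence abundance.

    Assumption (not from the paper): reads per taxon scale with
    genome length x genome copies, so reads / length ~ genome copies.
    """
    # Reads per base of genome is used as a proxy for genome copy number.
    per_genome = {
        taxon: seq_abundance[taxon] / genome_length[taxon]
        for taxon in seq_abundance
    }
    total = sum(per_genome.values())
    # Renormalize so the proportions sum to 1 again.
    return {taxon: value / total for taxon, value in per_genome.items()}


# Hypothetical two-taxon sample: equal shares of the reads,
# but taxon_B has a genome three times longer than taxon_A.
seq = {"taxon_A": 0.5, "taxon_B": 0.5}
length = {"taxon_A": 2_000_000, "taxon_B": 6_000_000}

tax = sequence_to_taxonomic(seq, length)
for taxon, value in tax.items():
    print(f"{taxon}: sequence={seq[taxon]:.2f} taxonomic={value:.2f}")
```

Under these assumptions a sample that looks evenly split by sequence abundance is dominated three to one by the small-genome taxon in taxonomic abundance, so scoring a profiler against a ground truth of the other abundance type would penalize it for a difference in units rather than an error.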
Data availability
All simulated datasets can be downloaded from https://figshare.com/projects/Challenges_in_Benchmarking_Metagenomic_Profilers/79916. Source data are provided with this paper.
Code availability
R scripts used in this paper are available at https://github.com/shihuang047/re-benchmarking
Acknowledgements
Research reported in this publication was supported by grant nos. R01AI141529, R01HD093761, R01AG067744, UH3OD023268, U19AI095219 and U01HL089856 from the National Institutes of Health. This work was also supported by IBM Research through the AI Horizons Network, UC San Diego AI for Healthy Living program in partnership with the UC San Diego Center for Microbiome Innovation.
Author information
Contributions
Y.-Y.L. and R.K. conceived and designed the analysis. Z.S. and S.H. led the analysis. M.Z., Q.Z., N.H., A.P.C., Y.V.-B., L.P. and H.-C.K. contributed evaluation strategies. All authors analyzed the results. Z.S., S.H., Y.-Y.L. and R.K. wrote the paper. All authors edited the paper.
Ethics declarations
Competing interests
This work received support from IBM Research through the AI Horizons Network. Coauthors N.H., A.P.C., L.P. and H.-C.K. are employees of IBM. The authors declare no other competing interests.
Additional information
Peer review information Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Lin Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Note, Figs. 1–6 and references.
Supplementary Data
An Excel file that includes the source data used in each of the Supplementary Figures.
Source data
Source Data Fig. 1
Source data used in Fig. 1.
Source Data Fig. 2
Source data used in Fig. 2.
Source Data Fig. 3
Source data used in Fig. 3.
Source Data Fig. 4
Source data used in Fig. 4.
Source Data Fig. 5
Source data used in Fig. 5.
About this article
Cite this article
Sun, Z., Huang, S., Zhang, M. et al. Challenges in benchmarking metagenomic profilers. Nat Methods 18, 618–626 (2021). https://doi.org/10.1038/s41592-021-01141-3
DOI: https://doi.org/10.1038/s41592-021-01141-3