Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA

Kim, Jaebeom; Steinegger, Martin

doi:10.1038/s41592-024-02273-y

Brief Communication
Published: 20 May 2024

Nature Methods (2024)

Abstract

Metagenomic taxonomic classifiers analyze either DNA or amino acid (AA) sequences. Metabuli (https://metabuli.steineggerlab.com), however, jointly analyzes both DNA and AA to leverage AA conservation for sensitive homology detection and DNA mutations for specific differentiation of closely related taxa. In the Critical Assessment of Metagenome Interpretation 2 plant-associated dataset, Metabuli covered 99% and 98% of classifications of state-of-the-art DNA- and AA-based classifiers, respectively.

Data availability

The raw data for Fig. 2 and Extended Data Figs. 1–4 are provided as Source data. Performance measures at different ranks for the benchmarks of Fig. 2a,b and Extended Data Fig. 1 are available in Supplementary Tables 1–3. The assemblies used for read simulation and database creation in the synthetic benchmarks are listed in Supplementary Table 4, and the simulated reads are available via Zenodo at https://doi.org/10.5281/zenodo.10250585 (ref. ²⁸). More detailed results and the accessions used for Fig. 2c,d are provided in Supplementary Tables 5 and 6. The databases used in Fig. 2c,d were built using viral genomes (release 212) and a human genome (GCF_009914755.1) downloaded from NCBI RefSeq, and the accessions of the genomes of the analyzed SARS-CoV-2 variants are listed in the ‘Pathogen detection tests’ section of the Methods. Performance measures at different ranks for Fig. 2e and Extended Data Fig. 2 are provided in Supplementary Tables 7–9. Precision and recall for Extended Data Fig. 4 are available in Supplementary Table 10. The accessions of the real data analyzed in Fig. 2g,h and Extended Data Fig. 3 are listed in the ‘Benchmarks with real metagenomes’ section of the Methods. The CAMI2-provided datasets and taxonomy used in Fig. 2e,f and Extended Data Fig. 2 can be downloaded from https://data.cami-challenge.org/participate. Source data are provided with this paper.

Code availability

Metabuli is free, open-source software licensed under GPLv3. The source code, ready-to-use binaries and precomputed databases (Supplementary Table 11) can be downloaded at https://metabuli.steineggerlab.com. The scripts used for benchmarks and plots are available at https://github.com/jaebeom-kim/metabuli-analysis and https://github.com/jaebeom-kim/metabuli-plots.

References

Ye, S. H., Siddle, K. J., Park, D. J. & Sabeti, P. C. Benchmarking metagenomics tools for taxonomic classification. Cell 178, 779–794 (2019).
Nooij, S., Schmitz, D., Vennema, H., Kroneman, A. & Koopmans, M. P. Overview of virus metagenomic classification methods and their biological applications. Front. Microbiol. 9, 749 (2018).
Mirdita, M., Steinegger, M., Breitwieser, F., Söding, J. & Levy Karin, E. Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics 37, 3029–3031 (2021).
Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 1–13 (2019).
Breitwieser, F. P., Baker, D. N. & Salzberg, S. L. KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biol. 19, 198 (2018).
Menzel, P., Ng, K. L. & Krogh, A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7, 11257 (2016).
Kim, D., Song, L., Breitwieser, F. P. & Salzberg, S. L. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 26, 1721–1729 (2016).
Dilthey, A. T., Jain, C., Koren, S. & Phillippy, A. M. Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps. Nat. Commun. 10, 3066 (2019).
Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinform. 11, 119 (2010).
Parks, D. H. et al. GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res. 50, D785–D794 (2022).
Watson, M. & Warr, A. Errors in long-read assemblies can critically affect protein prediction. Nat. Biotechnol. 37, 124–126 (2019).
Meyer, F. et al. Critical assessment of metagenome interpretation: the second round of challenges. Nat. Methods 19, 429–440 (2022).
Nasko, D. J., Koren, S., Phillippy, A. M. & Treangen, T. J. RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification. Genome Biol. 19, 1–10 (2018).
Lu, J. et al. Metagenome analysis using the Kraken software suite. Nat. Protoc. 17, 2815–2839 (2022).
Holtgrewe, M. Mason - A Read Simulator for Second Generation Sequencing Data. Technical Report (FU Berlin, 2010).
Ono, Y., Hamada, M. & Asai, K. PBSIM3: a simulator for all types of PacBio and ONT long reads. NAR Genom. Bioinform. 4, lqac092 (2022).
de la Cuesta-Zuluaga, J., Ley, R. E. & Youngblut, N. D. Struo: a pipeline for building custom databases for common metagenome profilers. Bioinformatics 36, 2314–2315 (2020).
Youngblut, N. & Shen, W. nick-youngblut/gtdb_to_taxdump: Zenodo release. Zenodo https://doi.org/10.5281/zenodo.3696964 (2020).
Frith, M. C. A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Res. 39, e23 (2011).
Rahaman, M. M. et al. Genomic characterization of the dominating Beta, V2 variant carrying vaccinated (Oxford-AstraZeneca) and nonvaccinated COVID-19 patient samples in Bangladesh: a metagenomics and whole-genome approach. J. Med. Virol. 94, 1670–1688 (2022).
Lentini, A., Pereira, A., Winqvist, O. & Reinius, B. Monitoring of the SARS-CoV-2 Omicron BA.1/BA.2 lineage transition in the Swedish population reveals increased viral RNA levels in BA.2 cases. Med 3, 636–643 (2022).
Desai, N. et al. Temporal and spatial heterogeneity of host response to SARS-CoV-2 pulmonary infection. Nat. Commun. 11, 6319 (2020).
Gehrig, J. L. et al. Finding the right fit: evaluation of short-read and long-read sequencing approaches to maximize the utility of clinical microbiome data. Microb. Genom. 8, 000794 (2022).
Liu, L., Yang, Y., Deng, Y. & Zhang, T. Nanopore long-read-only metagenomics enables complete and high-quality genome reconstruction from mock and complex metagenomes. Microbiome 10, 209 (2022).
Barnes, S. J. et al. Metagenome-assembled genomes from photo-oxidized and nonoxidized oil-degrading marine microcosms. Microbiol. Resour. Announc. 12, 6 (2023).
Priest, T., Orellana, L. H., Huettel, B., Fuchs, B. M. & Amann, R. Microbial metagenome-assembled genomes of the Fram Strait from short and long read sequencing platforms. PeerJ 9, e11721 (2021).
Huang, R. et al. Long-read metagenomics of marine microbes reveals diversely expressed secondary metabolites. Microbiol. Spectr. 11, e0150123 (2023).
Kim, J. Simulated query reads used for benchmarks in Metabuli publication. Zenodo https://doi.org/10.5281/zenodo.10250585 (2023).

Acknowledgements

The authors thank E. Levy Karin for the valuable scientific feedback and the careful review and revision of the paper; J. Söding for the discussions on metamer encoding; M. Mirdita for the usability improvements of the software; H. Kim for the improvement of figures; S. Jaenicke for the voluntary examination of the software; and M. Kim for the feedback on the paper. M.S. acknowledges support by the National Research Foundation of Korea grants (2020M3-A9G7-103933, 2021-R1C1-C102065 and 2021-M3A9-I4021220), the Samsung DS research fund, and the Creative-Pioneering Researchers Program and AI-Bio Research Grant through Seoul National University.

Author information

Authors and Affiliations

Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
Jaebeom Kim & Martin Steinegger
School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
Martin Steinegger
Institute of Molecular Biology and Genetics, Seoul National University, Seoul, Republic of Korea
Martin Steinegger
Artificial Intelligence Institute, Seoul National University, Seoul, Republic of Korea
Martin Steinegger

Authors

Jaebeom Kim
Martin Steinegger

Contributions

J.K. and M.S. designed the research, developed the software, performed analysis and wrote the paper.

Corresponding author

Correspondence to Martin Steinegger.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks André Soares and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lei Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Synthetic benchmark results.

Simulated short (Illumina) and long (PacBio HiFi, ONT, and PacBio Sequel II) reads were used for performance evaluation based on GTDB genomes and taxonomy. Hybrid = (x, y) is the result of applying the DNA-based tool x, followed by the AA-based tool y, where x and y are the best-performing tools of each type. a–d, Subspecies-level classification tests. Reads were simulated from subspecies present in the databases, and precision and recall were measured at subspecies rank. a, Hybrid = (KrakenUniq, Kraken2X). b–d, Hybrid = (MetaMaps, Kraken2X). Raw data for performance measurements at subspecies, species, genus, and family ranks are available in Supplementary Table 1. e–h, Species-level classification tests. The databases contained not the queried subspecies but their sibling subspecies, to measure species-level classification. Hybrid = (KrakenUniq, Kraken2X). Raw data for performance measurements at species, genus, family, and order ranks are available in Supplementary Table 2. i–l, Genus-level classification tests. The databases contained not the queried species but their sibling species, to measure how well each tool detects homology within the same genus. i, Hybrid = (Kraken2, MMseqs2). j–l, Hybrid = (Kraken2, Kraken2X). Raw data for performance measurements at genus, family, order, and class ranks are available in Supplementary Table 3.
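The Hybrid = (x, y) baseline can be illustrated with a minimal Python sketch. It assumes that "followed by" means the AA-based tool is consulted only for reads that the DNA-based tool left unclassified; the caption does not spell this out, so treat it as one plausible reading. The function name, read identifiers and taxon labels are hypothetical and are not part of Metabuli or of any benchmarked tool.

```python
from typing import Dict, Optional

# Per-read taxon labels; None means the tool left the read unclassified.
Calls = Dict[str, Optional[str]]


def hybrid_classify(dna_calls: Calls, aa_calls: Calls) -> Calls:
    """Merge two sets of per-read calls (assumed fallback scheme).

    The DNA-based call is kept whenever it exists; the AA-based call is
    used only for reads the DNA-based tool did not classify.
    """
    merged: Calls = {}
    for read_id in sorted(set(dna_calls) | set(aa_calls)):
        dna_taxon = dna_calls.get(read_id)
        merged[read_id] = dna_taxon if dna_taxon is not None else aa_calls.get(read_id)
    return merged


# Hypothetical example: three reads, made-up taxon labels.
dna = {"read1": "s__A", "read2": None, "read3": None}
aa = {"read1": "s__B", "read2": "g__C", "read3": None}
result = hybrid_classify(dna, aa)
```

In this sketch read1 keeps its DNA-based label, read2 falls back to the AA-based label, and read3 stays unclassified because neither tool assigned it.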

Extended Data Fig. 2 Benchmarks using CAMI2’s strain-madness, marine, and plant-associated datasets.

GTDB genomes and the CAMI2-provided taxonomy were used for the database creation. CAMI2-provided short reads of strain-madness (a), marine (b), and plant-associated (c) datasets were classified by each tool, and the average values of the metrics that were measured at the species and genus rank for each sample were plotted. Raw data and metrics for each sample are available in Supplementary Tables 7–9.

Extended Data Fig. 3 Comparison of Metabuli to best performing AA- and DNA-based tools on real long-read metagenomic samples.

Unlike in Fig. 2g,h, Kraken2X is used instead of Kaiju because it performs better on long reads. The databases were built using GTDB genomes and a human genome (T2T-CHM13v2.0), based on the GTDB taxonomy edited to include a human taxon. Real nanopore sequencing data from human gut (a) and marine (b) environments, as well as PacBio HiFi reads from human gut (c) and marine (d) environments, were classified by each tool. The area is proportional to the number of reads within each panel. The proportion of reads classified by each tool is denoted in parentheses.

Extended Data Fig. 4 Subspecies-level classification performance by clade size.

All 2,382 query subspecies used in Extended Data Fig. 1a were divided into groups according to the number of subspecies siblings they had in the reference database, that is, by their species clade size. The average F1 score for queries in each group decreases as the clade’s size increases, indicating that more sibling subspecies pose a harder classification challenge to all tools. Precision and recall are available in Supplementary Table 10.
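The grouping described above can be sketched in Python as follows. This is a minimal illustration, not the authors' analysis script: it uses the standard definition of F1 as the harmonic mean of precision and recall, groups queries by exact clade size (the figure may bin sizes into ranges), and the record layout and sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_f1_by_clade_size(
    records: Iterable[Tuple[int, float, float]]
) -> Dict[int, float]:
    """Average F1 per clade size.

    Each record is (clade_size, precision, recall) for one query
    subspecies, where clade_size is its number of sibling subspecies
    in the reference database.
    """
    groups: Dict[int, List[float]] = defaultdict(list)
    for clade_size, precision, recall in records:
        groups[clade_size].append(f1_score(precision, recall))
    return {size: sum(vals) / len(vals) for size, vals in sorted(groups.items())}


# Hypothetical values for illustration only; not taken from the paper.
example = [(1, 1.0, 1.0), (1, 0.5, 0.5), (2, 0.0, 0.0)]
summary = mean_f1_by_clade_size(example)
```

Plotting the resulting averages against clade size would show the kind of trend the caption describes, given real per-query precision and recall values such as those in Supplementary Table 10.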

Extended Data Table 1 Resource measurements in subspecies inclusion test

Supplementary information

Supplementary Information

Supplementary Figs. 1–7.

Reporting Summary

Peer Review File

Supplementary Tables

Supplementary Tables 1–10. Raw data and utilized accessions of Fig. 2a–e and Extended Data Figs. 1, 2 and 4. Supplementary Table 11. A list of provided prebuilt databases.

Source data

Source Data Fig. 2

Statistical source data.

Source Data Extended Data Fig. 1

Statistical source data.

Source Data Extended Data Fig. 2

Statistical source data.

Source Data Extended Data Fig. 3

Statistical source data.

Source Data Extended Data Fig. 4

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kim, J., Steinegger, M. Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02273-y

Received: 14 June 2023
Accepted: 11 April 2024
Published: 20 May 2024
DOI: https://doi.org/10.1038/s41592-024-02273-y
