  • Brief Communication

Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA

Abstract

Metagenomic taxonomic classifiers analyze either DNA or amino acid (AA) sequences. Metabuli (https://metabuli.steineggerlab.com), however, jointly analyzes both DNA and AA to leverage AA conservation for sensitive homology detection and DNA mutations for specific differentiation of closely related taxa. In the Critical Assessment of Metagenome Interpretation 2 plant-associated dataset, Metabuli covered 99% and 98% of classifications of state-of-the-art DNA- and AA-based classifiers, respectively.
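The intuition behind joint analysis can be illustrated with a toy example. The sketch below is not Metabuli's encoding or scoring; the sequences and the helper names `translate` and `hamming` are hypothetical. It only shows that two closely related sequences can be identical at the AA level (useful for sensitive homology detection) while still differing at the DNA level through synonymous mutations (useful for telling close relatives apart).

```python
# Toy illustration only: NOT Metabuli's actual encoding or scoring scheme.
from itertools import product

# Standard genetic code (NCBI translation table 1) in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate an in-frame DNA string (length divisible by 3) into amino acids."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical sequences from two closely related taxa.
taxon_a = "ATGGCTAAAGGT"
taxon_b = "ATGGCCAAGGGA"  # three synonymous third-position changes

same_protein = translate(taxon_a) == translate(taxon_b)
dna_mismatches = hamming(taxon_a, taxon_b)
```

Here the AA sequences match exactly, so an AA-only comparison cannot separate the two taxa, whereas the DNA mismatches can.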

Fig. 1: Metabuli’s workflow.
Fig. 2: Benchmark results.

Data availability

The raw data for Fig. 2 and Extended Data Figs. 1–4 are provided as Source data. Performance measures at different ranks for the benchmarks of Fig. 2a,b and Extended Data Fig. 1 are available in Supplementary Tables 1–3. The assemblies used for read simulation and database creation in synthetic benchmarks are listed in Supplementary Table 4, and the simulated reads are available via Zenodo at https://doi.org/10.5281/zenodo.10250585 (ref. 28). More detailed results and the accessions used for Fig. 2c,d are provided in Supplementary Tables 5 and 6. The databases used in Fig. 2c,d were built using viral genomes (release 212) and a human genome (GCF_009914755.1) downloaded from NCBI RefSeq, and the accessions of the genomes of the analyzed SARS-CoV-2 variants are listed in the ‘Pathogen detection tests’ section in Methods. Performance measures at different ranks for Fig. 2e and Extended Data Fig. 2 are provided in Supplementary Tables 7–9. Precision and recall for Extended Data Fig. 4 are available in Supplementary Table 10. The accessions of the real data analyzed in Fig. 2g,h and Extended Data Fig. 3 are listed in the ‘Benchmarks with real metagenomes’ section in Methods. CAMI2-provided datasets and taxonomy used in Fig. 2e,f and Extended Data Fig. 2 can be downloaded from https://data.cami-challenge.org/participate. Source data are provided with this paper.

Code availability

Metabuli is GPLv3-licensed free open-source software. The source code and ready-to-use binaries, as well as precomputed databases (Supplementary Table 11), can be downloaded at metabuli.steineggerlab.com. The scripts used for benchmarks and plots are available at https://github.com/jaebeom-kim/metabuli-analysis and https://github.com/jaebeom-kim/metabuli-plots.

References

  1. Ye, S. H., Siddle, K. J., Park, D. J. & Sabeti, P. C. Benchmarking metagenomics tools for taxonomic classification. Cell 178, 779–794 (2019).

  2. Nooij, S., Schmitz, D., Vennema, H., Kroneman, A. & Koopmans, M. P. Overview of virus metagenomic classification methods and their biological applications. Front. Microbiol. 9, 749 (2018).

  3. Mirdita, M., Steinegger, M., Breitwieser, F., Söding, J. & Levy Karin, E. Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics 37, 3029–3031 (2021).

  4. Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 1–13 (2019).

  5. Breitwieser, F. P., Baker, D. N. & Salzberg, S. L. KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biol. 19, 198 (2018).

  6. Menzel, P., Ng, K. L. & Krogh, A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7, 11257 (2016).

  7. Kim, D., Song, L., Breitwieser, F. P. & Salzberg, S. L. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 26, 1721–1729 (2016).

  8. Dilthey, A. T., Jain, C., Koren, S. & Phillippy, A. M. Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps. Nat. Commun. 10, 3066 (2019).

  9. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinform. 11, 119 (2010).

  10. Parks, D. H. et al. GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res. 50, D785–D794 (2022).

  11. Watson, M. & Warr, A. Errors in long-read assemblies can critically affect protein prediction. Nat. Biotechnol. 37, 124–126 (2019).

  12. Meyer, F. et al. Critical assessment of metagenome interpretation: the second round of challenges. Nat. Methods 19, 429–440 (2022).

  13. Nasko, D. J., Koren, S., Phillippy, A. M. & Treangen, T. J. RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification. Genome Biol. 19, 1–10 (2018).

  14. Lu, J. et al. Metagenome analysis using the Kraken software suite. Nat. Protoc. 17, 2815–2839 (2022).

  15. Holtgrewe, M. Mason - A Read Simulator for Second Generation Sequencing Data. Technical Report (FU Berlin, 2010).

  16. Ono, Y., Hamada, M. & Asai, K. PBSIM3: a simulator for all types of PacBio and ONT long reads. NAR Genom. Bioinform. 4, lqac092 (2022).

  17. de la Cuesta-Zuluaga, J., Ley, R. E. & Youngblut, N. D. Struo: a pipeline for building custom databases for common metagenome profilers. Bioinformatics 36, 2314–2315 (2020).

  18. Youngblut, N. & Shen, W. nick-youngblut/gtdb_to_taxdump: Zenodo release. Zenodo https://doi.org/10.5281/zenodo.3696964 (2020).

  19. Frith, M. C. A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Res. 39, e23 (2011).

  20. Rahaman, M. M. et al. Genomic characterization of the dominating Beta, V2 variant carrying vaccinated (Oxford-AstraZeneca) and nonvaccinated COVID-19 patient samples in Bangladesh: a metagenomics and whole-genome approach. J. Med. Virol. 94, 1670–1688 (2022).

  21. Lentini, A., Pereira, A., Winqvist, O. & Reinius, B. Monitoring of the SARS-CoV-2 Omicron BA.1/BA.2 lineage transition in the Swedish population reveals increased viral RNA levels in BA.2 cases. Med 3, 636–643 (2022).

  22. Desai, N. et al. Temporal and spatial heterogeneity of host response to SARS-CoV-2 pulmonary infection. Nat. Commun. 11, 6319 (2020).

  23. Gehrig, J. L. et al. Finding the right fit: evaluation of short-read and long-read sequencing approaches to maximize the utility of clinical microbiome data. Microb. Genom. 8, 000794 (2022).

  24. Liu, L., Yang, Y., Deng, Y. & Zhang, T. Nanopore long-read-only metagenomics enables complete and high-quality genome reconstruction from mock and complex metagenomes. Microbiome 10, 209 (2022).

  25. Barnes, S. J. et al. Metagenome-assembled genomes from photo-oxidized and nonoxidized oil-degrading marine microcosms. Microbiol. Resour. Announc. 12, 6 (2023).

  26. Priest, T., Orellana, L. H., Huettel, B., Fuchs, B. M. & Amann, R. Microbial metagenome-assembled genomes of the Fram Strait from short and long read sequencing platforms. PeerJ 9, e11721 (2021).

  27. Huang, R. et al. Long-read metagenomics of marine microbes reveals diversely expressed secondary metabolites. Microbiol. Spectr. 11, e0150123 (2023).

  28. Kim, J. Simulated query reads used for benchmarks in Metabuli publication. Zenodo https://doi.org/10.5281/zenodo.10250585 (2023).

Acknowledgements

The authors thank E. Levy Karin for the valuable scientific feedback and the careful review and revision of the paper; J. Söding for the discussions on metamer encoding; M. Mirdita for the usability improvements of the software; H. Kim for the improvement of figures; S. Jaenicke for the voluntary examination of the software; and M. Kim for the feedback on the paper. M.S. acknowledges support by the National Research Foundation of Korea grants (2020M3-A9G7-103933, 2021-R1C1-C102065 and 2021-M3A9-I4021220), the Samsung DS research fund, and the Creative-Pioneering Researchers Program and AI-Bio Research Grant through Seoul National University.

Author information

Contributions

J.K. and M.S. designed the research, developed the software, performed analysis and wrote the paper.

Corresponding author

Correspondence to Martin Steinegger.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks André Soares and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lei Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Synthetic benchmark results.

Simulated short (Illumina) and long (PacBio HiFi, ONT, and PacBio Sequel II) reads were used for performance evaluation based on GTDB genomes and taxonomy. Hybrid = (x, y) is the result of applying the DNA-based tool x, followed by the AA-based tool y, where each is the best-performing tool of its type. a–d Subspecies-level classification tests. Reads were simulated from subspecies present in databases, and precision and recall were measured at subspecies rank. a) Hybrid = (KrakenUniq, Kraken2X). b-d) Hybrid = (MetaMaps, Kraken2X). Raw data for performance measurements at subspecies, species, genus, and family ranks are available in Supplementary Table 1. e–h Species-level classification tests. The databases contained not the queried subspecies but their sibling subspecies, so that species-level classification could be measured. Hybrid = (KrakenUniq, Kraken2X). Raw data for performance measurements at species, genus, family, and order ranks are available in Supplementary Table 2. i–l Genus-level classification tests. The databases contained not the queried species but their sibling species, to measure how well each tool detects homology within the same genus. i) Hybrid = (Kraken2, MMseqs2). j-l) Hybrid = (Kraken2, Kraken2X). Raw data for performance measurements at genus, family, order, and class ranks are available in Supplementary Table 3.
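One plausible reading of Hybrid = (x, y) can be sketched as follows. The fallback rule (keep the DNA-based call when there is one, otherwise use the AA-based call) and the precision and recall definitions used here are assumptions for illustration, not definitions taken from the paper; the function names and the read and taxon labels are hypothetical.

```python
from typing import Dict, Optional, Tuple

Calls = Dict[str, Optional[str]]  # read ID -> assigned taxon, or None if unclassified


def hybrid_calls(dna_calls: Calls, aa_calls: Calls) -> Calls:
    """Assumed hybrid rule: use the DNA-based call, fall back to the AA-based call."""
    merged: Calls = {}
    for read_id in dna_calls.keys() | aa_calls.keys():
        dna = dna_calls.get(read_id)
        merged[read_id] = dna if dna is not None else aa_calls.get(read_id)
    return merged


def precision_recall(calls: Calls, truth: Dict[str, str]) -> Tuple[float, float]:
    """Assumed definitions: precision = correct / classified, recall = correct / all reads."""
    classified = [r for r in truth if calls.get(r) is not None]
    correct = sum(calls[r] == truth[r] for r in classified)
    precision = correct / len(classified) if classified else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical example with three reads.
dna = {"r1": "taxA", "r2": None, "r3": None}
aa = {"r1": "taxB", "r2": "taxB", "r3": None}
truth = {"r1": "taxA", "r2": "taxB", "r3": "taxC"}

merged = hybrid_calls(dna, aa)
precision, recall = precision_recall(merged, truth)
```

In this example the DNA-based call wins for r1, the AA-based call fills in r2, and r3 stays unclassified, so it lowers recall but not precision.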

Extended Data Fig. 2 Benchmarks using CAMI2’s strain-madness, marine, and plant-associated datasets.

GTDB genomes and the CAMI2-provided taxonomy were used for the database creation. CAMI2-provided short reads of strain-madness (a), marine (b), and plant-associated (c) datasets were classified by each tool, and the average values of the metrics that were measured at the species and genus rank for each sample were plotted. Raw data and metrics for each sample are available in Supplementary Tables 7–9.

Extended Data Fig. 3 Comparison of Metabuli to best performing AA- and DNA-based tools on real long-read metagenomic samples.

In contrast to Fig. 2g–h, Kraken2X instead of Kaiju is utilized due to its superior performance on long reads. The databases were built using GTDB genomes and a human genome (T2T-CHM13v2.0) based on GTDB taxonomy edited to include a human taxon. Real nanopore sequencing data from human gut (a) and marine (b) environments, as well as PacBio HiFi reads from human gut (c) and marine (d) environments, were classified by each tool. The area is proportional to the number of reads within each panel. The proportion of reads classified by each tool is denoted in parentheses.

Extended Data Fig. 4 Subspecies-level classification performance by clade size.

All 2,382 query subspecies used in Extended Data Fig. 1a were divided into groups according to the number of subspecies siblings they had in the reference database, that is, by their species clade size. The average F1 score for queries in each group decreases as the clade’s size increases, indicating that more sibling subspecies pose a harder classification challenge to all tools. Precision and recall are available in Supplementary Table 10.
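The grouping described above can be sketched as follows. This is a minimal illustration that groups by exact clade size and takes an unweighted mean; the actual binning used for the figure is not stated here, and the function names and input values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_f1_by_clade_size(
    queries: Iterable[Tuple[int, float, float]]
) -> Dict[int, float]:
    """Average per-query F1 within each clade size.

    Each query is (clade_size, precision, recall), where clade_size is the
    number of sibling subspecies under the query's species in the database.
    """
    groups = defaultdict(list)
    for clade_size, precision, recall in queries:
        groups[clade_size].append(f1_score(precision, recall))
    return {size: sum(vals) / len(vals) for size, vals in sorted(groups.items())}


# Hypothetical per-query measurements, not values from the paper.
example = [(1, 1.0, 1.0), (1, 0.5, 0.5), (4, 0.5, 1.0)]
result = mean_f1_by_clade_size(example)
```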

Extended Data Table 1 Resource measurements in subspecies inclusion test

Supplementary information

Supplementary Information

Supplementary Figs. 1–7.

Reporting Summary

Peer Review File

Supplementary Tables

Supplementary Tables 1–10. Raw data and utilized accessions of Fig. 2a–e and Extended Data Figs. 1, 2 and 4. Supplementary Table 11. A list of provided prebuilt databases.

Source data

Source Data Fig. 2

Statistical source data.

Source Data Extended Data Fig. 1

Statistical source data.

Source Data Extended Data Fig. 2

Statistical source data.

Source Data Extended Data Fig. 3

Statistical source data.

Source Data Extended Data Fig. 4

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kim, J., Steinegger, M. Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02273-y

  • DOI: https://doi.org/10.1038/s41592-024-02273-y
