16S ribosomal RNA-based analysis is the established standard for elucidating the composition of microbial communities. While short-read 16S rRNA analyses are largely confined to genus-level resolution at best, given that only a portion of the gene is sequenced, full-length 16S rRNA gene amplicon sequences have the potential to provide species-level accuracy. However, existing taxonomic identification algorithms are not optimized for the increased read length and error rate often observed in long-read data. Here we present Emu, an approach that uses an expectation–maximization algorithm to generate taxonomic abundance profiles from full-length 16S rRNA reads. Results produced from simulated datasets and mock communities show that Emu is capable of accurate microbial community profiling while obtaining fewer false positives and false negatives than alternative methods. Additionally, we illustrate a real-world application of Emu by comparing clinical sample composition estimates generated by an established whole-genome shotgun sequencing workflow with those returned by full-length 16S rRNA gene sequences processed with Emu.
All sequenced samples used in this study are publicly available on the Sequence Read Archive (SRA). Both ZymoBIOMICS datasets are under BioProject ID PRJNA587452 with SRA accessions SRR10391201 for ONT and SRR10391187 for Illumina (ref. 33). Our gut mock community is under BioProject ID PRJNA725207. The 12 vaginal samples used for our real-world application demonstration are uploaded under BioProject ID PRJNA723982. Our simulated sequences are publicly available on OSF under project 56UF7. Databases used in this paper include 16S RefSeq nucleotide sequence records (https://www.ncbi.nlm.nih.gov/refseq/targetedloci/16S_process/), Ribosomal Database Project (RDP) v11.5 (https://rdp.cme.msu.edu/) and rrnDB v5.7 (https://rrndb.umms.med.umich.edu/). Study of vaginal microbiomes was approved by the ethics committee of the Medical Faculty of Heinrich Heine University. All patient samples were collected with informed consent from individuals in the context of an exploratory clinical microbiome study approved by the Ethics Committee of the Medical Faculty of Heinrich Heine University Düsseldorf (institutional review board study identification ‘2019–600-andere Forschung erstvotierend’).
Emu and all associated code are available on GitLab (https://gitlab.com/treangenlab/emu). Emu can be installed via Bioconda (https://anaconda.org/bioconda/emu). A Code Ocean capsule of the package is provided (https://doi.org/10.24433/CO.7761675.v1). All scripts and data used to compile quantitative comparison results can be found on GitLab (https://gitlab.com/treangenlab/emu-benchmark).
Woese, C. R. & Fox, G. E. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc. Natl Acad. Sci. USA 74, 5088–5090 (1977).
Martínez-Porchas, M., Villalpando-Canchola, E. & Vargas-Albores, F. Significant loss of sensitivity and specificity in the taxonomic classification occurs when short 16S rRNA gene sequences are used. Heliyon 2, e00170 (2016).
Callahan, B. J., Grinevich, D., Thakur, S., Balamotis, M. A. & Yehezkel, T. B. Ultra-accurate microbial amplicon sequencing with synthetic long reads. Microbiome 9, 130 (2021).
Miller, C. S. et al. Short-read assembly of full-length 16S amplicons reveals bacterial diversity in subsurface sediments. PLoS ONE 8, e56018 (2013).
Workman, R. E. et al. Nanopore native RNA sequencing of a human poly(A) transcriptome. Nat. Methods 16, 1297–1305 (2019).
Callahan, B. J. et al. High-throughput amplicon sequencing of the full-length 16S rRNA gene with single-nucleotide resolution. Nucleic Acids Res. 47, e103 (2019).
Karst, S. M. et al. High-accuracy long-read amplicon sequences using unique molecular identifiers with Nanopore or PacBio sequencing. Nat. Methods 18, 165–169 (2021).
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).
Nearing, J. T., Douglas, G. M., Comeau, A. M. & Langille, M. G. I. Denoising the denoisers: an independent evaluation of microbiome sequence error-correction approaches. PeerJ 6, e5364 (2018).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Santos, A., van Aerle, R., Barrientos, L. & Martinez-Urtaza, J. Computational methods for 16S metabarcoding studies using Nanopore sequencing data. Comput. Struct. Biotechnol. J. 18, 296–305 (2020).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://doi.org/10.48550/arxiv.1303.3997 (2013).
Kiełbasa, S. M., Wan, R., Sato, K., Horton, P. & Frith, M. C. Adaptive seeds tame genomic sequence comparison. Genome Res. 21, 487–493 (2011).
Benítez-Páez, A., Portune, K. J. & Sanz, Y. Species-level resolution of 16S rRNA gene amplicons sequenced through the MinION™ portable nanopore sequencer. GigaScience 5, 4 (2016).
Fujiyoshi, S., Muto-Fujita, A. & Maruyama, F. Evaluation of PCR conditions for characterizing bacterial communities with full-length 16S rRNA genes using a portable nanopore sequencer. Sci. Rep. 10, 12580 (2020).
Shin, J. et al. Analysis of the mouse gut microbiome using full-length 16S rRNA amplicon sequencing. Sci. Rep. 6, 29681 (2016).
Kim, D., Song, L., Breitwieser, F. P. & Salzberg, S. L. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 26, 1721–1729 (2016).
Juul, S. et al. What’s in my pot? Real-time species identification on the MinION™. Preprint at bioRxiv https://doi.org/10.1101/030742 (2015).
Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014).
Valenzuela-González, F., Martínez-Porchas, M., Villalpando-Canchola, E. & Vargas-Albores, F. Studying long 16S rDNA sequences with ultrafast-metagenomic sequence classification using exact alignments (Kraken). J. Microbiol. Methods 122, 38–42 (2016).
Bolyen, E. et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat. Biotechnol. 37, 852–857 (2019).
Lu, J., Breitwieser, F. P., Thielen, P. & Salzberg, S. L. Bracken: estimating species abundance in metagenomics data. PeerJ Comput. Sci. 2017, e104 (2017).
Lu, J. & Salzberg, S. L. Ultrafast and accurate 16S rRNA microbial community analysis using Kraken 2. Microbiome 8, 124 (2020).
Rodríguez-Pérez, H., Ciuffreda, L. & Flores, C. NanoCLUST: a species-level analysis of 16S rRNA nanopore sequencing data. Bioinformatics 37, 1600–1601 (2021).
Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017).
Dilthey, A. T., Jain, C., Koren, S. & Phillippy, A. M. Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps. Nat. Commun. 10, 3066 (2019).
Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).
Roberts, A. & Pachter, L. Streaming fragment assignment for real-time analysis of sequencing experiments. Nat. Methods 10, 71–73 (2013).
Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 257 (2019).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Singer, E. et al. Next generation sequencing data of a defined microbial mock community. Sci. Data 3, 160081 (2016).
Meyer, F. et al. Critical Assessment of Metagenome Interpretation: the second round of challenges. Nat. Methods 19, 429–440 (2022).
Winand, R. et al. Targeting the 16S rRNA gene for bacterial identification in complex mixed samples: comparative evaluation of second (Illumina) and third (Oxford Nanopore Technologies) generation sequencing technologies. Int. J. Mol. Sci. 21, 298 (2020).
Edgar, R. Taxonomy annotation and guide tree errors in 16S rRNA databases. PeerJ 6, e5030 (2018).
Cole, J. R. et al. Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucleic Acids Res. 42, D633–D642 (2014).
Smith, S. B. & Ravel, J. The vaginal microbiota, host defence and reproductive physiology. J. Physiol. 595, 451–463 (2017).
Pybus, V. & Onderdonk, A. B. Microbial interactions in the vaginal ecosystem, with emphasis on the pathogenesis of bacterial vaginosis. Microbes Infect. 1, 285–292 (1999).
Petrova, M. I., van den Broek, M., Balzarini, J., Vanderleyden, J. & Lebeer, S. Vaginal microbiota and its role in HIV transmission and infection. FEMS Microbiol. Rev. 37, 762–792 (2013).
Mendling, W. Vaginal microbiota. Adv. Exp. Med. Biol. 902, 83–93 (2016).
Gajer, P. et al. Temporal dynamics of the human vaginal microbiota. Sci. Transl. Med. 4, 132ra52 (2012).
Ravel, J. et al. Vaginal microbiome of reproductive-age women. Proc. Natl Acad. Sci. USA 108, 4680–4687 (2011).
Brooks, J. P. et al. The truth about metagenomics: quantifying and counteracting bias in 16S rRNA studies. BMC Microbiol. 15, 66 (2015).
Onderdonk, A. B., Delaney, M. L. & Fichorova, R. N. The human microbiome during bacterial vaginosis. Clin. Microbiol. Rev. 29, 223–238 (2016).
Li, Y. et al. DeepSimulator: a deep simulator for Nanopore sequencing. Bioinformatics 34, 2899–2908 (2018).
O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).
Yang, C., Chu, J., Warren, R. L. & Birol, I. NanoSim: nanopore sequence read simulator based on statistical characterization. GigaScience 6, 1–6 (2017).
Stoddard, S. F., Smith, B. J., Hein, R., Roller, B. R. & Schmidt, T. M. rrnDB: improved tools for interpreting rRNA gene abundance in bacteria and archaea and a new foundation for future development. Nucleic Acids Res. 43, D593–D598 (2015).
DeSantis, T. Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069–5072 (2006).
Quast, C. et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 41, D590–D596 (2013).
Schoch, C. L. et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database 2020, baaa062 (2020).
Wick, R. R., Judd, L. M. & Holt, K. E. Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol. 20, 129 (2019).
Clark, K., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J. & Sayers, E. W. GenBank. Nucleic Acids Res. 44, D67–D72 (2016).
We thank two additional members of the Treangen Laboratory, B. Kille for technical support and N. Sapoval for algorithm development. Computational support and infrastructure were provided by the Centre for Information and Media Technology (ZIM) at the University of Düsseldorf (Germany). This work has been supported by Jürgen Manchot Foundation and Deutsche Forschungsgemeinschaft (DFG) award 428994620 (A.D., A.T., W.M., P.F. and E.G.). This work has also been supported by NIH grants from NIDDK P30-DK56338, NIAID R01-AI10091401, U01-AI24290 and P01-AI152999, and NINR R01-NR013497 (T.S. and Q. Wu). Q. Wang and S.V. were supported in part by NIH grant R21NS106640 from the National Institute for Neurological Disorders and Stroke (NINDS). K.D.C. was supported in part by a Ken Kennedy Institute Computational Science and Engineering Graduate Recruiting Fellowship. K.D.C., M.G.N. and T.J.T. were supported in part by NIH grant P01-AI152999 from the National Institute of Allergy and Infectious Diseases (NIAID). K.D.C. and T.J.T. were supported in part by NSF EF-2126387. M.G.N. was funded by a fellowship from the National Library of Medicine Training Program in Biomedical Informatics and Data Science (T15LM007093, PI: Kavraki).
The authors declare no competing interests.
Peer review information
Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Follow the gray arrows until the expectation–maximization (EM) iterations are complete, then follow the pink arrows to the final composition estimate. The method starts by establishing probabilities for each alignment type C = [mismatch (X), insertion (I), deletion (D), softclip (S)] from occurrence counts in the primary alignments. Next, the alignment probability P(r|t) is calculated for each read-taxonomy pair (r, t) by taking the maximum alignment probability between r and t. Meanwhile, an evenly distributed composition vector F is initialized. The EM phase begins by determining P(t|r), the probability that r emanated from t, for every (r, t) pair with an alignment probability P(r|t). F is updated accordingly, and the total log likelihood of the estimate is calculated. If the total log likelihood increases by more than 0.01 over the previous iteration, EM iterations continue. Otherwise, the loop exits, and F is trimmed to remove all entries below the set threshold. Following the pink arrows, one final round of estimation is then completed with the trimmed F to produce the final sample composition estimate.
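The EM loop described in this legend can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Emu implementation: the alignment probabilities P(r|t) are taken as given (one dictionary per read), the function names `em_round` and `estimate_composition` and the parameters `min_abundance`, `tol` and `max_iter` are hypothetical, and the trimming threshold default is chosen only for the toy example.

```python
import math


def em_round(read_probs, f):
    """Run one E-step and M-step.

    read_probs: one dict per read, mapping taxon -> P(r|t).
    f: dict mapping taxon -> current abundance estimate F(t).
    Returns the updated F and the log likelihood of the reads under the input f.
    """
    totals = {t: 0.0 for t in f}
    log_lik = 0.0
    n_reads = 0
    for probs in read_probs:
        # Joint weight P(r|t) * F(t), restricted to taxa still present in F.
        joint = {t: p * f[t] for t, p in probs.items() if t in f}
        evidence = sum(joint.values())
        if evidence <= 0.0:
            continue  # read has no surviving candidate taxon
        log_lik += math.log(evidence)
        n_reads += 1
        for t, w in joint.items():
            totals[t] += w / evidence  # posterior P(t|r)
    if n_reads == 0:
        raise ValueError("no read has a candidate taxon in F")
    return {t: v / n_reads for t, v in totals.items()}, log_lik


def estimate_composition(read_probs, min_abundance=0.01, tol=0.01, max_iter=1000):
    """Iterate EM until the log likelihood gain is <= tol, trim, then re-estimate."""
    taxa = sorted({t for probs in read_probs for t in probs})
    f = {t: 1.0 / len(taxa) for t in taxa}  # evenly distributed starting F
    prev = -math.inf
    for _ in range(max_iter):
        new_f, log_lik = em_round(read_probs, f)
        if log_lik - prev <= tol:
            break  # no significant improvement: leave the loop
        f, prev = new_f, log_lik
    # Trim entries below the threshold and renormalize what remains.
    kept = {t: v for t, v in f.items() if v >= min_abundance}
    if not kept:
        raise ValueError("threshold removed every taxon")
    total = sum(kept.values())
    trimmed = {t: v / total for t, v in kept.items()}
    # One final round of estimation with the trimmed F.
    final_f, _ = em_round(read_probs, trimmed)
    return final_f
```

In this sketch the log likelihood is evaluated for the composition entering each round, which is a simplification of the ordering in the legend. With a toy input of ten reads, six aligning best to taxon A and four to taxon B, plus a negligible probability for a third taxon C, the third taxon falls below the threshold and is trimmed before the final round.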
The theoretical values are taken from the ZymoBIOMICS standard report of relative abundance estimates based on 16S rRNA gene copy numbers (https://files.zymoresearch.com/protocols/_d6305_d6306_zymobiomics_microbial_community_dna_standard.pdf). Truth_ONT and truth_illumina represent the ground truth relative abundances calculated for our ONT and Illumina datasets, respectively, as described in the Establishing Ground Truth subsection under Methods.
Heatmap of species-level error between calculated ground truth and estimated relative abundances, where darker blue denotes an underestimate by the software, darker red denotes an overestimate, and white represents no error. All Oxford Nanopore Technologies (ONT) errors are measured in relation to the ground truth of the ONT dataset, while Illumina errors are measured in relation to the ground truth for the Illumina dataset. The color scheme is capped at ±10%, so errors beyond ±10% are shown in the maximum-error colors. Displayed are the 20 species with the largest abundance in any of the ONT or Illumina sample results. ‘Other’ represents the sum of all species not shown in the figure for the respective column. Species-level L1-norm, L2-norm, precision, recall, and F-score are also plotted for the methods evaluated.
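For readers unfamiliar with the summary metrics named in this legend, the following Python sketch shows one common way to compute them from two abundance profiles. The definitions here are assumptions for illustration: the norms are taken over the per-species differences in relative abundance, and precision and recall treat any species with non-zero abundance as present. The function name `profile_errors` is hypothetical, and the paper's own evaluation may apply different thresholds.

```python
import math


def profile_errors(truth, estimate):
    """Compare two profiles given as dicts of species -> relative abundance."""
    species = set(truth) | set(estimate)
    # Signed error per species: positive means overestimate, negative underestimate.
    diffs = [estimate.get(s, 0.0) - truth.get(s, 0.0) for s in species]
    l1 = sum(abs(d) for d in diffs)
    l2 = math.sqrt(sum(d * d for d in diffs))

    # Presence/absence calls (assumed rule: abundance > 0 means present).
    true_set = {s for s, v in truth.items() if v > 0}
    est_set = {s for s, v in estimate.items() if v > 0}
    true_pos = len(true_set & est_set)
    precision = true_pos / len(est_set) if est_set else 0.0
    recall = true_pos / len(true_set) if true_set else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0

    return {"l1": l1, "l2": l2, "precision": precision,
            "recall": recall, "f_score": f_score}
```

A false positive lowers precision and adds its full estimated abundance to both norms, while a false negative lowers recall and adds its full true abundance.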
Heatmap of family-level error between ground truth and estimated relative abundances for both the Emu and RDP incomplete databases (missing 35 of the 345 CAMI2 simulated species) with our CAMI2 dataset. Here, darker blue denotes an underestimate by the software, darker red denotes an overestimate, and white represents no error. The color scheme is capped at ±3%, so errors beyond ±3% are shown in the maximum-error colors. Displayed are the families of the 35 species that were removed from each of the databases.
Species with estimated abundance of over 1% in at least one sample with either Emu or Bracken are shown. Data is grouped by condition: healthy control or vaginosis.
Cite this article
Curry, K.D., Wang, Q., Nute, M.G. et al. Emu: species-level microbial community profiling of full-length 16S rRNA Oxford Nanopore sequencing data. Nat Methods 19, 845–853 (2022). https://doi.org/10.1038/s41592-022-01520-4