Emu: species-level microbial community profiling of full-length 16S rRNA Oxford Nanopore sequencing data

Curry, Kristen D.; Wang, Qi; Nute, Michael G.; Tyshaieva, Alona; Reeves, Elizabeth; Soriano, Sirena; Wu, Qinglong; Graeber, Enid; Finzer, Patrick; Mendling, Werner; Savidge, Tor; Villapol, Sonia; Dilthey, Alexander; Treangen, Todd J.

doi:10.1038/s41592-022-01520-4

Article
Published: 30 June 2022

Nature Methods volume 19, pages 845–853 (2022)

Abstract

16S ribosomal RNA-based analysis is the established standard for elucidating the composition of microbial communities. While short-read 16S rRNA analyses are largely confined to genus-level resolution at best, given that only a portion of the gene is sequenced, full-length 16S rRNA gene amplicon sequences have the potential to provide species-level accuracy. However, existing taxonomic identification algorithms are not optimized for the increased read length and error rate often observed in long-read data. Here we present Emu, an approach that uses an expectation–maximization algorithm to generate taxonomic abundance profiles from full-length 16S rRNA reads. Results produced from simulated datasets and mock communities show that Emu is capable of accurate microbial community profiling while obtaining fewer false positives and false negatives than alternative methods. Additionally, we illustrate a real-world application of Emu by comparing clinical sample composition estimates generated by an established whole-genome shotgun sequencing workflow with those returned by full-length 16S rRNA gene sequences processed with Emu.

**Fig. 2: Performance on simulated ONT reads.**

**Fig. 3: Performance on our ZymoBIOMICS community standard dataset.**

**Fig. 4: Relative error after consecutive EM iterations within Emu on ZymoBIOMICS ONT reads.**

Data availability

All sequenced samples used in this study are publicly available on the Sequence Read Archive (SRA). Both ZymoBIOMICS datasets are under BioProject ID PRJNA587452 with SRA accessions SRR10391201 for ONT and SRR10391187 for Illumina³³. Our gut mock community is under BioProject ID PRJNA725207. The 12 vaginal samples used for our real-world application demonstration are uploaded under BioProject ID PRJNA723982. Our simulated sequences are publicly available on OSF under project 56UF7. Databases used in this paper include 16S RefSeq nucleotide sequence records (https://www.ncbi.nlm.nih.gov/refseq/targetedloci/16S_process/), Ribosomal Database Project (RDP) v11.5 (https://rdp.cme.msu.edu/) and rrnDB v5.7 (https://rrndb.umms.med.umich.edu/). Study of vaginal microbiomes was approved by the ethics committee of the Medical Faculty of Heinrich Heine University. All patient samples were collected with informed consent from individuals in the context of an exploratory clinical microbiome study approved by the Ethics Committee of the Medical Faculty of Heinrich Heine University Düsseldorf (institutional review board study identification ‘2019–600-andere Forschung erstvotierend’).

Code availability

Emu and all associated code are available on GitLab (https://gitlab.com/treangenlab/emu). Emu can be installed via Bioconda (https://anaconda.org/bioconda/emu). A Code Ocean capsule of the package is provided (https://doi.org/10.24433/CO.7761675.v1). All scripts and data used to compile quantitative comparison results can be found on GitLab (https://gitlab.com/treangenlab/emu-benchmark).

References

Woese, C. R. & Fox, G. E. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc. Natl Acad. Sci. USA 74, 5088–5090 (1977).
Martínez-Porchas, M., Villalpando-Canchola, E. & Vargas-Albores, F. Significant loss of sensitivity and specificity in the taxonomic classification occurs when short 16S rRNA gene sequences are used. Heliyon 2, e00170 (2016).
Callahan, B. J., Grinevich, D., Thakur, S., Balamotis, M. A. & Yehezkel, T. B. Ultra-accurate microbial amplicon sequencing with synthetic long reads. Microbiome 9, 130 (2021).
Miller, C. S. et al. Short-read assembly of full-length 16S amplicons reveals bacterial diversity in subsurface sediments. PLoS ONE 8, e56018 (2013).
Workman, R. E. et al. Nanopore native RNA sequencing of a human poly(A) transcriptome. Nat. Methods 16, 1297–1305 (2019).
Callahan, B. J. et al. High-throughput amplicon sequencing of the full-length 16S rRNA gene with single-nucleotide resolution. Nucleic Acids Res. 47, e103 (2019).
Karst, S. M. et al. High-accuracy long-read amplicon sequences using unique molecular identifiers with Nanopore or PacBio sequencing. Nat. Methods 18, 165–169 (2021).
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).
Nearing, J. T., Douglas, G. M., Comeau, A. M. & Langille, M. G. I. Denoising the denoisers: an independent evaluation of microbiome sequence error-correction approaches. PeerJ 6, e5364 (2018).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Santos, A., van Aerle, R., Barrientos, L. & Martinez-Urtaza, J. Computational methods for 16S metabarcoding studies using Nanopore sequencing data. Comput. Struct. Biotechnol. J. 18, 296–305 (2020).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://doi.org/10.48550/arxiv.1303.3997 (2013).
Kiełbasa, S. M., Wan, R., Sato, K., Horton, P. & Frith, M. C. Adaptive seeds tame genomic sequence comparison. Genome Res. 21, 487–493 (2011).
Benítez-Páez, A., Portune, K. J. & Sanz, Y. Species-level resolution of 16S rRNA gene amplicons sequenced through the MinION™ portable nanopore sequencer. GigaScience 5, 4 (2016).
Fujiyoshi, S., Muto-Fujita, A. & Maruyama, F. Evaluation of PCR conditions for characterizing bacterial communities with full-length 16S rRNA genes using a portable nanopore sequencer. Sci. Rep. 10, 12580 (2020).
Shin, J. et al. Analysis of the mouse gut microbiome using full-length 16S rRNA amplicon sequencing. Sci. Rep. 6, 29681 (2016).
Kim, D., Song, L., Breitwieser, F. P. & Salzberg, S. L. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 26, 1721–1729 (2016).
Juul, S. et al. What’s in my pot? Real-time species identification on the MinION™. Preprint at bioRxiv https://doi.org/10.1101/030742 (2015).
Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014).
Valenzuela-González, F., Martínez-Porchas, M., Villalpando-Canchola, E. & Vargas-Albores, F. Studying long 16S rDNA sequences with ultrafast-metagenomic sequence classification using exact alignments (Kraken). J. Microbiol. Methods 122, 38–42 (2016).
Bolyen, E. et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat. Biotechnol. 37, 852–857 (2019).
Lu, J., Breitwieser, F. P., Thielen, P. & Salzberg, S. L. Bracken: estimating species abundance in metagenomics data. PeerJ Comput. Sci. 2017, e104 (2017).
Lu, J. & Salzberg, S. L. Ultrafast and accurate 16S rRNA microbial community analysis using Kraken 2. Microbiome 8, 124 (2020).
Rodríguez-Pérez, H., Ciuffreda, L. & Flores, C. NanoCLUST: a species-level analysis of 16S rRNA nanopore sequencing data. Bioinformatics 37, 1600–1601 (2021).
Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017).
Dilthey, A. T., Jain, C., Koren, S. & Phillippy, A. M. Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps. Nat. Commun. 10, 3066 (2019).
Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).
Roberts, A. & Pachter, L. Streaming fragment assignment for real-time analysis of sequencing experiments. Nat. Methods 10, 71–73 (2013).
Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 257 (2019).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Singer, E. et al. Next generation sequencing data of a defined microbial mock community. Sci. Data 3, 160081 (2016).
Meyer, F. et al. Critical Assessment of Metagenome Interpretation: the second round of challenges. Nat. Methods 19, 429–440 (2022).
Winand, R. et al. Targeting the 16S rRNA gene for bacterial identification in complex mixed samples: comparative evaluation of second (Illumina) and third (Oxford Nanopore Technologies) generation sequencing technologies. Int. J. Mol. Sci. 21, 298 (2020).
Edgar, R. Taxonomy annotation and guide tree errors in 16S rRNA databases. PeerJ 6, e5030 (2018).
Cole, J. R. et al. Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucleic Acids Res. 42, D633–D642 (2014).
Smith, S. B. & Ravel, J. The vaginal microbiota, host defence and reproductive physiology. J. Physiol. 595, 451–463 (2017).
Pybus, V. & Onderdonk, A. B. Microbial interactions in the vaginal ecosystem, with emphasis on the pathogenesis of bacterial vaginosis. Microbes Infect. 1, 285–292 (1999).
Petrova, M. I., van den Broek, M., Balzarini, J., Vanderleyden, J. & Lebeer, S. Vaginal microbiota and its role in HIV transmission and infection. FEMS Microbiol. Rev. 37, 762–792 (2013).
Mendling, W. Vaginal microbiota. Adv. Exp. Med. Biol. 902, 83–93 (2016).
Gajer, P. et al. Temporal dynamics of the human vaginal microbiota. Sci. Transl. Med. 4, 132ra52 (2012).
Ravel, J. et al. Vaginal microbiome of reproductive-age women. Proc. Natl Acad. Sci. USA 108, 4680–4687 (2011).
Brooks, J. P. et al. The truth about metagenomics: quantifying and counteracting bias in 16S rRNA studies. BMC Microbiol. 15, 66 (2015).
Onderdonk, A. B., Delaney, M. L. & Fichorova, R. N. The human microbiome during bacterial vaginosis. Clin. Microbiol. Rev. 29, 223–238 (2016).
Li, Y. et al. DeepSimulator: a deep simulator for Nanopore sequencing. Bioinformatics 34, 2899–2908 (2018).
O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).
Yang, C., Chu, J., Warren, R. L. & Birol, I. NanoSim: nanopore sequence read simulator based on statistical characterization. GigaScience 6, 1–6 (2017).
Stoddard, S. F., Smith, B. J., Hein, R., Roller, B. R. & Schmidt, T. M. rrnDB: improved tools for interpreting rRNA gene abundance in bacteria and archaea and a new foundation for future development. Nucleic Acids Res. 43, D593–D598 (2015).
DeSantis, T. Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069–5072 (2006).
Quast, C. et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 41, D590–D596 (2013).
Schoch, C. L. et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database 2020, baaa062 (2020).
Wick, R. R., Judd, L. M. & Holt, K. E. Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol. 20, 129 (2019).
Clark, K., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J. & Sayers, E. W. GenBank. Nucleic Acids Res. 44, D67–D72 (2016).

Acknowledgements

We thank two additional members of the Treangen Laboratory, B. Kille for technical support and N. Sapoval for algorithm development. Computational support and infrastructure were provided by the Centre for Information and Media Technology (ZIM) at the University of Düsseldorf (Germany). This work has been supported by Jürgen Manchot Foundation and Deutsche Forschungsgemeinschaft (DFG) award 428994620 (A.D., A.T., W.M., P.F. and E.G.). This work has also been supported by NIH grants from NIDDK P30-DK56338, NIAID R01-AI10091401, U01-AI24290 and P01-AI152999, and NINR R01-NR013497 (T.S. and Q. Wu). Q. Wang and S.V. were supported in part by NIH grant R21NS106640 from the National Institute for Neurological Disorders and Stroke (NINDS). K.D.C. was supported in part by a Ken Kennedy Institute Computational Science and Engineering Graduate Recruiting Fellowship. K.D.C., M.G.N. and T.J.T. were supported in part by NIH grant P01-AI152999 from the National Institute of Allergy and Infectious Diseases (NIAID). K.D.C. and T.J.T. were supported in part by NSF EF-2126387. M.G.N. was funded by a fellowship from the National Library of Medicine Training Program in Biomedical Informatics and Data Science (T15LM007093, PI: Kavraki).

Author information

These authors contributed equally: Alexander Dilthey, Todd J. Treangen.

Authors and Affiliations

Department of Computer Science, Rice University, Houston, TX, USA
Kristen D. Curry, Michael G. Nute, Elizabeth Reeves & Todd J. Treangen
Department of Systems, Synthetic and Physical Biology Science, Rice University, Houston, TX, USA
Qi Wang
Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
Alona Tyshaieva, Enid Graeber, Patrick Finzer & Alexander Dilthey
Houston Methodist Research Institute, Center for Neuroregeneration, Houston, TX, USA
Sirena Soriano & Sonia Villapol
Department of Pathology and Immunology, Baylor College of Medicine, Houston, TX, USA
Qinglong Wu & Tor Savidge
Texas Children’s Microbiome Center, Department of Pathology, Texas Children’s Hospital, Houston, Texas, USA
Qinglong Wu & Tor Savidge
German Centre for Infections in Gynaecology and Obstetrics at Helios University Clinic Wuppertal, Wuppertal, Germany
Werner Mendling

Contributions

A.D. and T.J.T. derived the Emu concept and supervised the project. K.D.C., Q. Wang and M.G.N. developed the software. K.D.C., Q. Wang, A.T. and E.R. produced results for benchmarking. P.F., E.G., W.M., S.S., Q. Wu, T.S. and S.V. generated sequencing data for analysis and contributed to the interpretation of results. K.D.C., Q. Wang, M.G.N., A.T., Q. Wu, E.R., A.D. and T.J.T. contributed to writing the original draft of the manuscript. All authors read, revised and approved the manuscript.

Corresponding authors

Correspondence to Kristen D. Curry, Alexander Dilthey or Todd J. Treangen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Pictorial representation of the complete Emu algorithm.

Follow the gray-arrowed path until the expectation–maximization (EM) iterations are complete, then follow the pink arrows to the final composition estimate. The method starts by establishing probabilities for each alignment type C = [mismatch (X), insertion (I), deletion (D), softclip (S)] from occurrence counts in the primary alignments. Next, the alignment probability P(r|t) is calculated for each (read, taxonomy) pair (r,t), taken as the maximum alignment probability between r and t. Meanwhile, an evenly distributed composition vector F is initialized. The EM phase begins by determining P(t|r), the probability that r emanated from t, from each P(r|t). F is updated accordingly, and the total log likelihood of the estimate is calculated. If the total log likelihood increases by more than 0.01 over the previous iteration, EM iterations continue. Otherwise, the loop is exited, and F is trimmed to remove all entries below the set threshold. Following the pink arrows, one final round of estimation is then completed with the trimmed F to produce the final sample composition estimate.
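
The loop described in this caption can be illustrated with a minimal Python sketch. It is not Emu's actual implementation: the function names, the input layout (one dictionary of P(r|t) values per read), the `min_abundance` value and the `max_iter` safeguard are all assumptions made for illustration, and the alignment-probability step is assumed to have been done already.

```python
import math


def em_step(read_likelihoods, f):
    """One EM round: returns (updated F, total log likelihood under the input F)."""
    counts = [0.0] * len(f)
    log_likelihood = 0.0
    for probs in read_likelihoods:
        # E-step: P(t|r) is proportional to P(r|t) * F[t]
        weighted = {t: p * f[t] for t, p in probs.items() if f[t] > 0.0}
        total = sum(weighted.values())
        if total == 0.0:
            continue  # no candidate taxon left for this read
        log_likelihood += math.log(total)
        for t, w in weighted.items():
            counts[t] += w / total
    # M-step: F[t] becomes the normalized expected read count
    norm = sum(counts)
    if norm == 0.0:
        raise ValueError("no read has a candidate taxon with nonzero abundance")
    return [c / norm for c in counts], log_likelihood


def estimate_composition(read_likelihoods, n_taxa, tol=0.01,
                         min_abundance=0.01, max_iter=1000):
    """Hypothetical helper: EM until the log-likelihood gain is <= tol, then trim."""
    f = [1.0 / n_taxa] * n_taxa  # evenly distributed starting composition
    prev_ll = -math.inf
    for _ in range(max_iter):  # max_iter is a safeguard added for this sketch
        f, ll = em_step(read_likelihoods, f)
        if ll - prev_ll <= tol:
            break
        prev_ll = ll
    # Trim entries below the threshold (assumes at least one taxon survives)
    f = [x if x >= min_abundance else 0.0 for x in f]
    norm = sum(f)
    f = [x / norm for x in f]
    # One final round of estimation with the trimmed F
    f, _ = em_step(read_likelihoods, f)
    return f


# Made-up P(r|t) values for four reads over three taxa (indices 0, 1, 2)
reads = [
    {0: 0.9, 1: 0.1},
    {0: 0.8, 1: 0.2},
    {1: 0.7, 2: 0.05},
    {0: 0.6},
]
composition = estimate_composition(reads, n_taxa=3)
print(composition)
```

In this toy input, taxon 2 is only weakly supported by a single read that taxon 1 explains better, so its abundance shrinks across iterations and is removed by the trimming step.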

Extended Data Fig. 2 ZymoBIOMICS theoretical and imputed ground truth community profiles.

The theoretical values are taken from ZymoBIOMICS standard report of relative abundance estimates based on 16S rRNA gene copy numbers (https://files.zymoresearch.com/protocols/_d6305_d6306_zymobiomics_microbial_community_dna_standard.pdf). Truth_ONT and truth_illumina represent the ground truth relative abundances calculated for our ONT and Illumina datasets respectively, as described in the Establishing Ground Truth subsection under Methods.

Extended Data Fig. 3 Performance on our synthetic gut microbiome mock community.

Heatmap of species-level error between calculated ground truth and estimated relative abundances, where darker blue denotes an underestimate by the software, darker red denotes an overestimate, and white represents no error. All Oxford Nanopore Technologies (ONT) errors are measured in relation to the ground truth of the ONT dataset, while Illumina errors are measured in relation to the ground truth for the Illumina dataset. The color scheme is capped at ±10%, so errors beyond ±10% are shown in the maximum error colors. Displayed are the 20 species with the largest abundance in any of the ONT or Illumina sample results. ‘Other’ represents the sum of all species not shown in the figure for the respective column. Species-level L1-norm, L2-norm, precision, recall, and F-score are also plotted for the methods evaluated.
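
As a rough illustration of the summary statistics named in this caption, the sketch below computes the L1-norm, L2-norm, precision, recall and F-score between a ground-truth profile and an estimated profile. It assumes that a species counts as present whenever its abundance is nonzero; the paper may define presence differently, and the function name and example values are hypothetical.

```python
import math


def profile_metrics(truth, estimate):
    """Compare two {species: relative abundance in %} profiles."""
    species = set(truth) | set(estimate)
    diffs = [estimate.get(s, 0.0) - truth.get(s, 0.0) for s in species]
    l1 = sum(abs(d) for d in diffs)
    l2 = math.sqrt(sum(d * d for d in diffs))

    # Presence/absence: assumed here to mean any nonzero abundance
    true_set = {s for s, a in truth.items() if a > 0}
    est_set = {s for s, a in estimate.items() if a > 0}
    true_positives = len(true_set & est_set)
    precision = true_positives / len(est_set) if est_set else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {"l1": l1, "l2": l2, "precision": precision,
            "recall": recall, "f_score": f_score}


# Made-up profiles: species C is missed (false negative), D is spurious (false positive)
truth = {"A": 50.0, "B": 30.0, "C": 20.0}
estimate = {"A": 55.0, "B": 25.0, "D": 20.0}
metrics = profile_metrics(truth, estimate)
print(metrics)
```

Here the L1-norm is 5 + 5 + 20 + 20 = 50 percentage points, and precision, recall and F-score are all 2/3 because two of the three reported species are true positives and two of the three true species are recovered.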

Extended Data Fig. 4 Family-level relative abundance error heatmap of novel species simulation.

Heatmap of family-level error between ground truth and estimated relative abundances for both the Emu and RDP incomplete databases (missing 35 of the 345 CAMI2 simulated species) with our CAMI2 dataset. Here, darker blue denotes an underestimate by the software, darker red denotes an overestimate, and white represents no error. The color scheme is capped at ±3%, so errors beyond ±3% are shown in the maximum error colors. Displayed are the families of the 35 species that were removed from each of the databases.
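
The family-level error with a capped color scale could be computed along the following lines. This is only a sketch under stated assumptions: the species-to-family mapping, the function name and the abundance values are hypothetical, and the paper's own aggregation may differ in detail.

```python
from collections import defaultdict


def family_level_error(truth, estimate, species_to_family, cap=3.0):
    """Return {family: estimate - truth}, in percentage points, clipped to [-cap, cap]."""
    totals = defaultdict(lambda: [0.0, 0.0])  # family -> [truth sum, estimate sum]
    for species, abundance in truth.items():
        totals[species_to_family[species]][0] += abundance
    for species, abundance in estimate.items():
        totals[species_to_family[species]][1] += abundance
    # Clip so that errors beyond the cap map to the extreme heatmap colors
    return {family: max(-cap, min(cap, est - tru))
            for family, (tru, est) in totals.items()}


# Made-up abundances (%); the family assignments are supplied by the caller
species_to_family = {
    "Lactobacillus iners": "Lactobacillaceae",
    "Lactobacillus crispatus": "Lactobacillaceae",
    "Gardnerella vaginalis": "Bifidobacteriaceae",
}
truth = {"Lactobacillus iners": 40.0, "Lactobacillus crispatus": 35.0,
         "Gardnerella vaginalis": 25.0}
estimate = {"Lactobacillus iners": 46.0, "Lactobacillus crispatus": 30.0,
            "Gardnerella vaginalis": 20.0}
errors = family_level_error(truth, estimate, species_to_family)
print(errors)
```

Species-level errors within a family can cancel when summed, which is why the Lactobacillaceae error in this example is only +1 even though each species is off by 5 or more; the Bifidobacteriaceae error of -5 is clipped to -3.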

Extended Data Fig. 5 Bacterial community of 12 vaginal samples.

Species with estimated abundance of over 1% in at least one sample with either Emu or Bracken are shown. Data is grouped by condition: healthy control or vaginosis.

Supplementary information

Supplementary Information

Supplementary Tables 5, 6, 13, 14, 19 and 22–24 and Note 1.

Reporting Summary

Supplementary Tables 1–4, 7–12, 15–18, 20 and 21

Complete abundance results from all analyses in the paper.

About this article

Cite this article

Curry, K.D., Wang, Q., Nute, M.G. et al. Emu: species-level microbial community profiling of full-length 16S rRNA Oxford Nanopore sequencing data. Nat Methods 19, 845–853 (2022). https://doi.org/10.1038/s41592-022-01520-4

Received: 30 April 2021
Accepted: 10 May 2022
Published: 30 June 2022
Issue Date: July 2022
DOI: https://doi.org/10.1038/s41592-022-01520-4
