Introduction

High-throughput sequencing has greatly accelerated the study of microbiomics, that is, the scientific field focused on studying the composition, diversity, and function of microbial communities and their interactions with their hosts or environments1. Characterizing the composition of microbial samples commonly relies on the amplification of 16S ribosomal subunit sequences, a ubiquitous gene with highly conserved regions. The subunit simplifies efforts to isolate and amplify 16S rRNA with established PCR primers and hypervariable regions to establish identity and phylogeny. 16S rRNA and rDNA sequencing can be used to identify known prokaryotic species and act as a proxy to quantify the relative abundances of operational taxonomic units (OTUs) within microbiome samples.

Methods for taxonomic profiling of ribosomal RNA gene sequences enable sample OTU identification by classifying rRNA sequences into taxonomic groups. While considerable accuracy in species-level identification is attainable with available tools2, current profiling software for 16S amplicon sequencing data hesitates to identify down to the species level. Instead, they cluster reads based on sequence similarity to assign genus or higher-level identifications to increase specificity and sensitivity, or they directly use error-filtered sequences for taxonomic classification3,4. As the capabilities of modern sequencing platforms increase, and as bacterial reference genome databases expand and improve, more potential arises for achieving enhanced 16S analysis performance with alternative methods more commonly applied in whole genome metagenomics.

The most common software packages currently employed in analysis of 16S amplicon sequencing data are DADA24, QIIME 25, its predecessor, QIIME 26, and Mothur7. QIIME 2 and Mothur were both originally developed shortly after the invention of next-generation sequencing and, along with QIIME 2, essentially follow the same workflow: reads are typically clustered de novo based on sequence similarity into operational taxonomic units (OTUs) or denoised OTUs (many refer to these as amplicon sequence variants or ASVs) depending on whether complete sequence identity is desired for clustering. The initial clustering step serves to 1) improve computational efficiency by limiting the number of sequences needing alignment to a large set of reference genomes and 2) accommodate the low levels of genetic variation present within a given bacterial strain, thereby mitigating sequencing errors. For nearly a decade, the cutoff for OTU inclusion was 97% sequence identity8,9, but current cutoff recommendations are now around 99–100% sequence identity3,10, typically after some form of denoising or other correction for sequencing errors4,11.

An alternative to OTU clustering includes directly aligning reads against a reference genome library, as is done by PathoScope 2.012. PathoScope employs a Bayesian mixed modeling framework to reassign ambiguously aligned reads, dampening potential sequencing errors and minor genetic variation13,14. As another alternative, Kraken 2 performs alignment-free k-mer searches against a reference genome library15 and makes taxonomic assignments to each read based on the cumulative number of k-mer matches across an entire read against each taxonomic node in its reference library. By bypassing a sequence clustering step, PathoScope and Kraken 2 individually avoid the potential pitfalls inherent in OTU generation and denoising errors16,17, although remain susceptible to sequencing errors. While DADA2, QIIME 2, Mothur, Greengenes, and SILVA are all tools designed to address the specific needs of 16S amplicon sequencing, improvements in sequencing technologies, expanding bacterial reference genome databases, and increased availability and affordability of computational resources have collectively made many of the specific issues addressed by these tools irrelevant. Meanwhile, the increased flexibility and power of a tool such as PathoScope may yield augmented results despite being computationally intensive and designed to fulfill a more general metagenomics purpose18,19.

These profiling methods all rely heavily on the quality of the reference library used, as has been shown in previous benchmarking studies20,21,22,23. The most commonly used reference databases for 16S amplicon analyses are Greengenes24, SILVA25, and the Ribosomal Database Project (RDP)26. Each database exclusively contains 16S rRNA gene sequences and offers taxonomic information for each reference sequence. SILVA is well maintained and releases updates regularly, although as of this writing, the most recent update is SILVA 138.1 (released on August 27, 2020). Meanwhile, Greengenes has been stagnant for years; its most recent update at the time of submission was Greengenes 13_8, released in August 2013. As a result, Greengenes lacks several essential bacteria, including Dolosigranulum species27, implicated as playing a protective role in preventing disease in human airways28,29. It is worth noting that a new version, Greengenes230, was made available in 2022 amidst the review process for this paper, and was subsequently not included in this paper’s analyses. Although QIIME 2 and Mothur are compatible with any reference genome library, QIIME 2 uses Greengenes by default, and Mothur’s documentation (as accessed on May 17, 2022) recommends SILVA. DADA2 maintains reference databases for SILVA, RDP, and Greengenes, with the flexibility to create custom databases. Kraken 2 has its own curated “Standard” bacterial library, with a taxonomic tree based on NCBI’s taxonomy database by default31, and has also released Kraken 2-compatible formatted versions of Greengenes, SILVA, and RDP. The current PathoScope reference library recommendation is to download the complete RefSeq representative genome database32, a collection of curated high-quality bacterial genomes and assemblies. RefSeq is constantly updated, and as such, results of any analysis using RefSeq as a reference library may vary according to the date of the library download.

Given these considerations, we systematically benchmarked several current community profiling tools and reference libraries created for both metagenomic and 16S analysis. We evaluated the tools QIIME 2, Mothur, PathoScope 2, and Kraken 2 in conjunction with reference libraries from Greengenes, SILVA, Kraken 2, and RefSeq. Using several publicly available 16S sequencing datasets of synthetic mock communities, we specifically analyzed genus- and species-level performance across pairs of profilers and libraries. We tested 136 samples comprising varying species richness and evenness, several different amplified regions within the 16S rRNA gene, and both DNA spike-ins and cDNA from collections of plated cells. Our evaluative comparisons utilized a combination of diversity and accuracy-based measures to determine what methods and tools provided the best performance in profiling 16S amplicon sequencing datasets.

Methods

Publicly available mock community sequencing datasets

136 mock community sequencing samples were collected in total from four publicly available sequencing datasets and analyzed in our evaluation. 69 samples are from Lluch et al.33; 33 samples are from Kozich et al.34; 29 samples are from Fouhy et al.35; and 5 samples are from Karstens36. These datasets are hereafter referred to as the Lluch, Kozich, Fouhy, and Karstens samples. The species compositions for each set are delineated in Table S1. The Lluch samples include a variety of community compositions, ranging from monoculture samples composed of only a single species to others with 20 species at staggered concentrations. Collectively, 34 species appear across the aggregate of Lluch samples. While the Lluch samples’ taxonomic profiles are diverse, all 69 samples were produced using a single unified DNA extraction, amplification, and sequencing protocol that yielded Illumina MiSeq paired-end reads of the V4–V5 region of the 16S rRNA gene. The Kozich samples each comprise three sequencing replicates of 11 different preparations of BEI’s mock community B (HM-278D), encompassing 21 species. For the Kozich samples, three PCR primer pairs were used to amplify three distinct portions of the 16S rRNA gene (the V3, V4, and V4-V5 ranges), making the sequencing data for these samples more complex than for those samples from the other datasets. The Fouhy samples are each a unique combination of either BEI mock community B (16S DNA spike-ins) or BEI mock community C (cultured cells), prepared using one of three library prep protocols, amplified with PCR primers for either the V1-V2 or the V4-V5 region of the 16S rRNA gene, and sequenced either on an Illumina MiSeq machine or a Thermo Fisher Ion Torrent. Finally, the 5 Karstens samples originate from a single custom mock DNA library of 8 species, with the V4 region amplified and sequenced on an Illumina MiSeq device.

16S amplicon sequencing analysis pipelines

We evaluated five analysis pipelines applied to the 136 mock community samples: DADA2, QIIME 2, Mothur, PathoScope, and Kraken 2.

For the standalone implementation of DADA2, all samples were filtered and trimmed, with errors learned on forward and reverse reads. Learned errors were used to conduct inference on predicted error presence across all reads as a denoising measure. Paired reads were merged and chimeras were removed, with taxonomy assigned down to the species level. When filtering and trimming, for most samples we used the parameters maxN = 0, maxEE = c(3, 3), truncQ = 2, rm.phix = TRUE, and tlength = 0. For unpaired reads, we set maxEE = 3. Several Kozich samples suffered from extreme quality degradation at read ends; to correct for this, we set tlength = c(240, 200). Finally, when running the Fouhy Ion Torrent samples, we set the DADA2 function parameters HOMOPOLYMER_GAP_PENALTY = − 1 and BAND_SIZE = 32. For the filter and trim step, we also set trimLeft = 15. These settings were based on recommendations for processing Ion Torrent data in the DADA2 FAQ.

For all QIIME 2 analyses, we used the DADA2 plugin to cluster sequences and construct feature tables. We decided to test the DADA2 plugin alongside its standalone package due to the broad user base of QIIME 2. However, the standalone implementation uses 100% sequence identity over Mothur and QIIME 2 (with 97%) and uses exact sequence matching rather than a k-mer based method (as in the QIIME 2 q2-feature-classifier). All mock datasets could be run with paired-end reads besides the Fouhy datasets. In most cases, DADA2 did not require truncation of paired-end sequences and only the initial 6 bp were trimmed from each read. However, quality scores at the end of nine samples from the Kozich dataset were universally low enough (cutoff of median quality score < 20) to require truncation to 240 bp for forward reads and 200 bp for reverse reads for Kozich samples. Taxonomy was assigned using custom naïve Bayes classifiers constructed for each set of mock community samples based on their amplified 16S region. The QIIME 2 artifact output files were converted into BIOM format and subsequently into tab-delimited text format for downstream analyses and pipeline comparisons.

For Mothur analyses, all recommended procedures were followed according to Mothur documentation where possible. For paired-end sequences, the native make.contigs() function was used to join reads. In the pre.cluster() step of Mothur analysis, the “diffs” parameter (the number of mismatches allowed between a cluster’s representative sequence and each member sequence) was set to 2 for joined sequencing reads shorter than 250 bp, 3 for joined reads of length 250–349 bp, and 4 for longer joined reads. For cluster.split(), we set the “taxlevel” parameter to 4, with a “cutoff” of 0.03.

For PathoScope 2.0 analyses, Bowtie2 alignment parameters were set to “–local -R 2 -N 0 -L 25 -i S,1,0.75 -k 10 –score-min L,100,1.28.” These values were optimized for 16S sequencing reads, requiring higher similarity to a reference genome to be considered a hit than the default settings due to the highly conserved nature of portions of the 16S rRNA gene. Phylogeny for each taxon was inferred from the NCBI taxon id (ti) for each reference genome using the entrez_fetch() function from the R package rentrez.

For Kraken 2 analyses, Kraken 2 taxonomic reports were created for each sample. These were parsed into a taxon/feature counts matrix that included the complete phylogeny for each identified taxon as reported by Kraken 2.

Bacterial genomic and 16S reference libraries

We used five bacterial sequence reference databases in conjunction with the aforementioned pipelines: Greengenes 13_8, SILVA 138, two versions of RefSeq’s representative genomes, and the Kraken 2 Standard library (downloaded on August 20, 2020). According to the Kraken 2 manual, the Kraken 2 Standard library is compiled using the RefSeq database, so it could be considered analogous to the RefSeq2020 library. The RefSeq libraries were downloaded on November 2, 2018, and June 23, 2020; these are denoted as “RefSeq2018” and “RefSeq2020.” Greengenes and SILVA are specifically 16S reference databases as they include only sequences for the bacterial 16S rRNA gene. RefSeq2018, RefSeq2020, and the Kraken 2 Standard database are all whole-genome libraries, with no special modifications for use with 16S amplicon sequencing data.

Analysis pipeline and reference library pairings

We analyzed 136 mock community samples using a total of 11 distinct pairings of analysis tools and reference libraries: DADA2 only with SILVA, QIIME 2 with Greengenes and SILVA, Mothur only with SILVA (the default reference library), PathoScope using Greengenes, SILVA, RefSeq2018, and RefSeq2020, and Kraken 2 with its Standard library, SILVA, and Greengenes. While the SILVA database includes species-level taxonomic information for most of its representative 16S sequences, note that Mothur collapses feature counts into genus-level clades and thus does not make species-level calls. The adaptations of SILVA used for Kraken 2 and QIIME 2 did not provide species-level calls. Thus, only eight of the 11 pairings make species-level calls. The pairings and pipeline parameters are summarized in Table 1.

Table 1 Summary table of all pipeline and reference database pairings, with information on non-default parameters chosen.

Tracking available taxonomic information for each OTU

A counts matrix was created from the results of each of the 11 pipeline/reference pairs for each operational taxonomic unit (OTU), denoised OTU, and feature. Each feature was assigned phylum, class, order, family, genus, species, and subspecies level information when available. For a given database, whenever a taxonomic label was missing, the lowest level taxonomy available for a feature was propagated using that database’s taxonomic path, taking note of what granularity was available (taxonomic best hit). For example, a feature assigned only as a member of the Bacillales order would be given the metadata: “phylum: Firmicutes, class: Bacilli, order: Bacillales, family: o_Bacillales, genus: o_Bacillales, species: o_Bacillales.

Assessing taxon sensitivity, read specificity, error, and diversity

Several metrics were used in assessing the overall quality and power of each 16S analysis pipeline and reference library at each taxonomic level. Results were independently evaluated at each taxonomic level. Any reads or features not assigned to a taxon at a given phylogenetic level were excluded from analysis, except where otherwise specified.

Taxon detection sensitivity

The metric of taxon detection sensitivity is defined here as the portion of expected taxa in a mock community sample detected by a given pipeline, at a minimum of 0.1% relative abundance. It essentially examines how often a given method can correctly determine an organism’s presence in the mock community.

Read assignment specificity

Read assignment specificity is defined here as the portion of reads from a given sample assigned to taxa that are actually present in that sample’s mock community. This is equivalent to 1 minus the portion of reads assigned to spurious taxa. This metric identifies the frequency of read assignment to incorrect organisms for a given method.

Error rate

The Normalized Root Mean Squared Error (NRMSE) was calculated as the root mean square error normalized with the assumption that the variance could be increasing given higher read counts. For each sample’s results, given by the equation

$$\mathrm{NRMSE}=\sqrt{\frac{1}{K}\sum_{i=1}^{K}\frac{{|{w}_{i}-{t}_{i}|}^{2}}{\left(\frac{{w}_{i}+{t}_{i}}{2}\right)}}$$

where, for \(K\) taxa, wi and ti are respectively the measured and true read counts of taxon i. We evaluated the union of the expected and detected taxa for each sample, using \({t}_{i}=0\) for theoretical taxa counts that were not actually measured in the mock community. All taxa that were absent from both the measured results and the true mock community were excluded (i.e., taxa which had relative abundance values of 0, both theoretical and measured).

Alpha diversity estimations

To assess each pipeline’s ability to estimate the true alpha diversity within a sample regardless of accurate species identification, we calculated the log-fold change between the expected and the measured alpha diversity as measured by the Shannon index, the Simpson index, and the breakaway_nof15 index. The R package vegan37 was used to calculate the Shannon and Simpson indices. The R package breakaway was used to calculate the breakaway_nof138 index, which predicts both the number of unobserved taxa and the number of true singletons based on the non-singleton frequency counts. Due to the alpha diversity metrics' sensitivity to library and count size differences39, we converted the relative abundances of the mock community samples' ground truths to virtual sequencing libraries of 1,000,000 reads again using the vegan37 package. A rarefaction depth of 10,000 reads per sample was used to normalize all samples and ground truth libraries.

Statistical methods for significance testing

A series of linear mixed-effects models (LMMs), coupled with post hoc least-square means tests and a Tukey multiple comparison correction, were used to determine which pipelines outperformed each other in sensitivity, specificity, error rates, and alpha diversity estimates. LMMs were estimated using the lmer() function from the R package lme440, and post hoc comparisons were performed with the lsmeans() function from the R package lsmeans41. These LMMs examine the relevant performance metric as the measured variable, using the 136 mock community samples as a random effect and the pipeline/reference library pair as a fixed effect.

Results

Visual evaluation of species-level detection and abundance estimates

Figure 1 shows stacked bar charts of the results from the Kozich dataset for the ground truth versus all methods at the species level. Overall, pipelines using the Greengenes database (Kraken 2, QIIME 2, and Pathoscope) performed the worst in classifying species, followed by DADA2 paired with SILVA. PathoScope made the best use of the Greengenes database with the fewest misclassified reads and most correct species-level detection. Kraken 2 (paired with its Standard library) and PathoScope (paired with the RefSeq and SILVA libraries) performed best on these datasets. A more quantitative evaluation of these methods in the context of all samples follows.

Figure 1
figure 1

Expected versus measured relative abundances of mock bacteria. A stacked bar plot of the measured relative abundances of bacterial species in 33 samples from Kozich et al. These samples were all equimolar concentrations of 16S rDNA from 21 species, as shown in the ‘Ground Truth’ bar on the left. All reads assigned to bacterial species other than the 21 expected in the mock community are colored gray and are labeled “Incorrect ID”. Mothur calls were not included as the pipeline does not make species-level calls, and the same is true of QIIME 2 and Kraken 2 paired with the SILVA database.

Taxon detection sensitivity

At the genus level (Fig. 2A), DADA2 paired with SILVA was the least sensitive (mean = 0.67, SD = 0.35), followed collectively by methods which utilized the Greengenes (QIIME 2: mean = 0.73, SD = 0.16; Kraken 2: mean = 0.73, SD = 0.17; PathoScope: mean = 0.78, SD = 0.24; see Table S2 for p-values). When paired with the SILVA, RefSeq2018, or RefSeq2020 reference libraries, PathoScope was more sensitive at detecting genera than any other method, peaking when paired with the RefSeq 2018 reference library (mean = 0.88, SD = 0.14).

Figure 2
figure 2

Taxon detection sensitivity of 16S analysis pipelines. Violin plots of the sensitivity, specificity, and log NRMSE of each analysis pipeline and reference library pair used to analyze 16S samples, calculated at the genus (A,C,E) and species (B,D,F) levels. Sensitivity is calculated as the portion of expected taxa in each mock community sample that was detected with least 0.1% relative abundance. Specificity is calculated as the portion of reads assigned to taxa that are expected to exist within each mock community.

Generally, taxon detection sensitivity was lower at the species level than at the genus level (Fig. 2B). Methods using Greengenes had extremely low species-level sensitivities (QIIME 2: mean = 0.16, SD = 0.18; Kraken 2: mean = 0.19, SD = 0.13; PathoScope: mean = 0.28, SD = 0.21), as did DADA2 with SILVA (mean = 0.24, SD = 0.19). These were all significantly lower than all other methods (see Table S3 for pairwise p-values). Among those methods that used Greengenes, PathoScope was significantly more sensitive than either QIIME 2 (p < 0.001) or Kraken 2 (p < 0.001). The most sensitive method at the species level was PathoScope using the SILVA reference library (mean = 0.86, SD = 0.15), followed by PathoScope using RefSeq2018 (mean = 0.67, SD = 0.16). Only three species were not detected by PathoScope at a minimum of 0.1% relative abundance in any samples when using SILVA as a reference library; these were Bifidobacterium adolescentis, Prosthecobacter fusiformis, and Clostridium beijerinckii.

Read assignment specificity

At the genus level, the average read assignment specificity was generally lower for Kraken 2 with its Standard library (mean = 0.719, SD = 0.26); PathoScope and QIIME 2 with Greengenes (PathoScope: mean = 0.72, SD = 0.26; QIIME 2: mean = 0.73, SD = 0.28); and DADA2, Kraken 2, and Mothur with SILVA (DADA2: mean = 0.75, SD = 0.37; Kraken 2: mean = 0.75, SD = 0.2; Mothur: mean = 0.76, SD = 0.22). (Fig. 2C). However, no overall trends arose in pairwise tests among pipelines and database pairings (see Table S4 for pairwise p-values). PathoScope with the RefSeq2018 library (mean = 0.91, SD = 0.15) and Kraken 2 with Greengenes (mean = 0.89, SD = 0.18) had the overall highest read assignment specificity.

At the species level, both Kraken 2 and QIIME 2 paired with Greengenes had the lowest read assignment specificity (Kraken 2: mean = 0.21, SD = 0.17; QIIME 2: mean = 0.23, SD = 0.2), which were significantly lower than all methods (see Table S5 for pairwise p-values). PathoScope, when paired with either the SILVA library (mean = 0.75, SD = 0.18), RefSeq2020 (mean = 0.75, SD = 0.24), or RefSeq2018 (mean = 0.86, SD = 0.18) was significantly more specific than QIIME 2 and Kraken 2 (Fig. 2D).

Normalized root mean-square error

Kraken 2 had the lowest error rates, measured as the log NRMSE of the raw reads, of all methods evaluated at the genus level regardless of the reference library used (SILVA: mean = 3.78, SD = 0.58; Standard: mean = 4.02, SD = 0.52, Greengenes: mean = 4.12, SD = 0.66). These were significantly lower than all other error rates (see Table S6 for pairwise comparison p-values). QIIME 2 had the highest genus-level NRMSE for both the SILVA and Greengenes libraries of all methods (SILVA: mean = 5.36, SD = 0.54; Greengenes: mean = 5.36, SD = 0.54), alongside DADA2 with SILVA (mean = 0.35, SD = 0.55; Fig. 2E).

At the species level, Kraken 2 also had the lowest log NRMSE for its Standard database and SILVA, which were better than all other methods (Standard: mean = 3.77; SD = 0.5; SILVA: mean = 3.95, SD = 0.55; see Table S7 for pairwise comparison p-values). This was followed by PathoScope for the SILVA database (mean = 4.28, SD = 0.48) and Kraken 2 using Greengenes (mean = 4.38, SD = 0.8). The worst NRMSE was again held by QIIME 2 with Greengenes and DADA2 using SILVA (QIIME 2: mean = 5.3, SD = 0.46; DADA2: mean = 5.29, SD = 0.45), which were significantly worse than all other methods (Fig. 2F).

Alpha diversity estimations

Out of all methods evaluated at the species level, Kraken 2 paired with Greengenes showed the greatest deviations from expected Shannon (deviation mean = 1.05, SD = 1.06) and Simpson (deviation mean = 0.25, SD = 0.27) alpha diversity indices, with significantly higher deviations than all other methods (Tukey-adjusted p < 0.001 in all pairwise comparisons). PathoScope generally matched the true Shannon indices more closely than all other methods (RefSeq2020: deviation mean = 0.21, SD = 0.23; RefSeq2018: deviation mean = 0.27, SD = 0.28; Fig. 3A). The same trend held for the Simpson indices.

Figure 3
figure 3

Deviation from true alpha diversity metrics at the species level. (A) The absolute difference between the measured Shannon alpha diversity index and the Shannon index value for the true mock community composition, and B) the log of the absolute difference between Breakaway_nof1 richness estimates and the true number of species present in each mock community. In both cases, values closer to 0 indicate more accurate estimation of the alpha diversity within a sample.

DADA2 reported the most closely matching log breakaway_nof138 indices, averaging significantly less deviation from the true number of species present than other methods (mean = 1.37, SD = 3.07; Tukey-adjusted p < 0.001 in all pairwise comparisons). On the other hand, Kraken 2 using its Standard library and SILVA frequently overestimated the number of species present by several orders of magnitude (Standard: mean = 6.17, SD = 1.82; SILVA: mean = 5.87, SD = 1.85), performing worse than all other methods (Fig. 3B).

Overall, no single pipeline or reference library performed the best in all evaluative metrics, but some holistic trends are present, especially at the species level. Figure 4B shows that sensitivity and specificity are correlated traits at the species level (Spearman’s r = 0.85) and that PathoScope (regardless of reference library) and Kraken 2 (with its Standard library) dominate the upper right quadrant, where sensitivity and specificity are both high. In particular, PathoScope excels in both sensitivity and specificity when used with either SILVA or RefSeq2018. Similarly, Fig. 4C shows that error and alpha diversity estimated deviation are inversely correlated (Spearman’s r = − 0.57) and that no single method yields the lowest alpha diversity deviation and error rates. Trends are not well-defined at the genus level (Fig. 4A).

Figure 4
figure 4

Combined quality of 16S analysis methods. Scatterplots of relative metrics for each 16S analysis pipeline, at the genus (A) and species (B,C) levels. Each point represents a single method’s results when analyzing a single mock community sample. Points are colored by the analysis pipeline/reference library used. Centroids representing the mean values for each pipeline/reference library pair are marked with bolded diamonds.

Discussion

Mock bacterial communities, either derived from spike-in DNA sequences or extracted from mixtures of bacterial cell monocultures, provide a semblance of a “ground truth” to assess 16S amplicon sequencing analysis methods. Ideally, knowing which species at what amounts should be present in any genuine microbiome sample would allow for accurate identification in every analysis. There are, of course, complications in sequencing experiments: technical bias and errors are introduced into samples at every step of the experiment until being safely sealed as bits in a FASTQ file on a server. Mock species' relative abundances may be affected by subtle variations in pipetting technique as spike-in DNA is aliquoted from individual sources. Spike-in DNA might be cloned from mutated DNA, or an early PCR error may have propagated through an entire commercial stock of nucleic acids. Different species of bacteria vary in lysing difficulty42, causing some species to be underrepresented or even absent in the collected cDNA libraries from a plate43. While 16S amplification primers are designed to bind to universal conserved regions of the 16S rRNA gene, there is still clearly some amplification bias during PCR44. Contamination from reagents45, local bacteria in the air, on gloves, or in a pipette tip box can further complicate matters. Thus, limitations of different experimental conditions and methods can dramatically affect the quality of results obtained from mock communities. Present sequencing errors and contamination suggest that amplicon reads will not identify with taxa as precisely as might a tidy, evenly distributed set of sequences drawn from a closed set of well-characterized species. It should then be evident that no analysis pipeline could theoretically exist to perfectly measure a mock community. Such a feat would require identifying only the expected species in their exact proportions, with no extraneous observations. As such, the proximate best method to analyze 16S amplicon sequencing data is one that identifies the makeup of the microbiome as truthfully as possible. Mock microbial communities can provide a level testing ground for existing tools to find their relative strengths and weaknesses in performance.

Of the pipelines tested, both QIIME 2 and Mothur were designed and built specifically for 16S amplicon sequencing analysis. Each has a suite of utility functions built to assist researchers in processing their data from the sequencer to differential abundance analysis and visualizations. Both are typically bundled with a dedicated bacterial 16S rRNA gene sequence database's reference library for alignment (i.e., Greengenes for QIIME 2, SILVA for Mothur). However, our results present strong evidence that PathoScope and Kraken 2 outperform QIIME 2, Mothur, and DADA2, even when comparing reads to identical reference databases. This phenomenon interestingly occurs despite Kraken 2 and PathoScope's status as more general whole genome sequencing and metagenomics data tools. In pairwise comparisons, PathoScope is more sensitive and specific at taxon detection and has a lower error score than either DADA2, QIIME 2, or Mothur, and has comparable alpha diversity index estimates at both the genus and species levels. In general, SILVA’s outperformance of Greengenes confirmed results found in previous benchmarks of 16S amplicon sequencing analysis methods20,21,22,23,46. This is likely due to several factors, including the small size of the Greengenes database in comparison to SILVA (99,000 versus 190,000), and the fact that this version of Greengenes had not been updated since 201347.

Kraken 2, when used with its Standard library, was rarely the top-performing analytic method in terms of sensitivity or specificity, although it was generally less error-prone than QIIME 2, Mothur, or any tool using Greengenes as a reference library. Kraken 2 has the added practical utility of being extremely fast and easy to use. Yet one constraint in analyzing Kraken 2 results is that they cannot be upsampled from a given taxon level, whereas PathoScope, QIIME 2, and Mothur all allow the tracing back of the taxonomic hierarchy of a given microbe. Both QIIME 2 and Mothur take advantage of naïve Bayes classifiers, which work most efficiently when trained on the specific region of the 16S rRNA gene amplified by PCR primers. Overall, PathoScope was the most sensitive in detecting taxa and specific in assigning reads, and the least error-prone tool when paired with either SILVA or RefSeq2018. However, it was not without limitations, as its computational expenses seemed to be an additional order of magnitude above those of other methods. This was evident from large interim SAM files (> 128 GB) and runtimes on the order of several hours, whereas Kraken 2 in particular took mere minutes. Issues aside, PathoScope is likely to outperform QIIME 2, DADA2, and Mothur in identification regardless of the database used. This finding partly results from PathoScope’s Bayesian mixed modeling identification algorithm, which accounts for the possibility that multiple species can be present in the sample or that the target strain is not present in the reference database. PathoScope consistently outperformed Kraken 2 in most cases, although the difference was often slight and not statistically significantly better. Overall, these comparisons show that methods designed for general metagenomics analyses consistently outperform methods specifically designed for analyzing 16S data.

While many species are identifiable from their 16S rRNA gene sequence or a single hypervariable region, it is important to note that imperfect accuracy at this level is not solely a computational issue. For example, although the 16S rRNA gene is approximately 1550 bp long, the short sequencing reads obtained in most next-generation sequencing (NGS) only span about 250–500 bases and lack ideal resolution at the species level48. Compared to NGS, long-read sequencing technologies have been shown to perform better in classifying at the genus and species level49,50. We also observed differences in our study between the results of 15 samples sequenced with Ion Torrent sequencing, as compared to samples from the same mock communities that were sequenced using Illumina Miseq. Furthermore, a major limitation to 16S amplicon studies is that some clades of bacteria exist with identical 16S DNA in the commonly sequenced V4 region. These clades of difficult-to-identify bacteria make up the bulk of Kraken 2’s and PathoScope’s incorrect calls. For example, Bifidobacterium adolescentis was nearly universally misclassified by all methods as other Bifidobacterium species, and Prosthecobacter fusiformis was frequently misidentified as Prosthecobacter dejongeii, a species with which it shares over 99% of its 16S DNA sequence51. Even further complications arise from many bacteria having several copies of the 16S rRNA gene, which may not be identical between operons within a genome52. This latter point may be in part why metagenomic methods such as Kraken 2 and PathoScope outperform specific methods such as QIIME 2 and Mothur, especially at the species level. The metagenomic methods are better designed to account for multiple 16S rRNA genes if present.

One of PathoScope’s largest sources of error and lost taxon detection sensitivity calls when using the RefSeq2020 library comes from an apparent erroneous reference genome scaffold in the RefSeq representative genomes. In all mock community samples containing Escherichia coli, PathoScope with RefSeq2020 reported the presence of Tumebacillus flagellates at relative abundances tightly correlated with the expected values of E. coli (Pearson’s r = 0.959). The circumstances strongly imply that reads actually originating from E. coli were incorrectly assigned to T. flagellates. T. flagellates is not even in the same phylum as E. coli, so the casual misassignment of reads between the species would be extremely unlikely based on 16S sequence similarity. Instead, a pairwise BLAST comparing E. coli’s 16S rRNA gene sequence to the T. flagellates scaffolds using the exact RefSeq entry that PathoScope had assigned those reads to (accession: NZ_JMIR01000093)53 revealed that one T. flagellates scaffold had a 100% identity alignment over 911 bp. The finding possibly represents a case of horizontal gene transfer of the 16S rRNA gene, but it appears far more likely that E. coli contamination existed in the DNA library, which then was sequenced and assembled into T. flagellates scaffolds. On further study, it became apparent that this is merely one example of pervasive sequence contamination, meaning the accidental inclusion of sequences from other organisms or the misclassification of sequences, in public genome databases. This phenomenon has been recently explored in the NCBI RefSeq database54,55,56. The recent prevalence of high throughput and the accelerating low cost of next-generation sequencing (NGS) technologies has led to a rapid increase in published genomes available in the RefSeq libraries, although imperfect methods and protocols for sequencing data are contributing to high contamination rates. Human contamination in published genomes, while not a problem in 16S analyses, is a particularly frustrating problem when analyzing shotgun metagenomics data. Clearly, metagenomic read-mapping approaches such as Kraken 2 and PathoScope afford the potential for the development of novel quality control pipelines for RefSeq and other genome sequence databases.

The increasing prevalence of poor sequencing quality control helps to explain why the RefSeq 2018 libraries often performed better than the 2020 libraries. Many tools have been developed to identify and correct contamination errors in sequences and public databases56,57,58,59,60, but this is an ongoing problem that demands additional filtering and correction efforts after directly retrieving libraries from the public repository. Given the higher specificity and sensitivity of PathoScope when using the 2018 RefSeq library over the 2020 library, we recommend using older RefSeq libraries until newer versions have been processed to remove contamination. Also of interest is the high accuracy of SILVA in its species calls when using PathoScope, even though it cannot be used to make such calls when used with QIIME 2, Mothur, or Kraken 2. SILVA also presents a viable alternative to the RefSeq libraries in avoiding contamination.

We note that in performing this benchmark, we sought to evaluate several common 16S amplicon sequencing analysis pipelines alongside metagenomics analysis pipelines. 16S pipelines were chosen based on performance as well as prevalence in previously published benchmarks. To identify metagenomics pipelines, we used a previously published metagenomics benchmarking paper18. Miossec et al. found that among the tested pipelines, PathoScope 2.0 and Kraken represented high sensitivity and specificity in the benchmark results. However, we emphasize that further comparison of other metagenomics analysis pipelines such as MetaMix61, Centrifuge62, and Metaxa263 should be conducted to analyze their differential performance, especially as new methods are developed and published.

Conclusion

DADA2, QIIME 2 and Mothur struggle to maintain accuracy at the genus level or granular species level in taxonomic analyses. Kraken 2, despite its primary purpose for whole genome sequencing metagenomics analyses, offers more power in analyzing 16S data without any increase in computational costs. PathoScope, while computationally more expensive, produces the most sensitive and accurate results of all evaluated pipelines when used on a diverse set of mock bacterial community samples. Analysis pipelines using SILVA as a reference library outperformed those using Greengenes significantly, and PathoScope using SILVA yielded the highest accuracies and sensitivities. While whole-genome reference libraries, such as Kraken 2’s Standard or RefSeq’s representative genomes, may provide some benefits over SILVA in terms of sensitivity, they may yield more spurious species-level calls. Based on the research conducted here with mock microbial communities, we recommend SILVA and RefSeq above other databases and strongly discourage usage of the Greengenes reference library for future analysis. While not included in our analysis due to release date, we do encourage users to try the Greengenes230 reference library as an phylogeny-based improvement over Greengenes. We also recommend PathoScope and Kraken 2 as fully capable, competitive options for conducting genus- and species-level 16S amplicon sequencing data analysis, in addition to outperforming other tools when using shotgun metagenomics data

18.