Metagenomic profiling pipelines improve taxonomic classification for 16S amplicon sequencing data

Most experiments studying bacterial microbiomes rely on PCR amplification of all or part of the 16S rRNA gene, which serves as a biomarker for identifying and quantifying the various taxa present in a microbiome sample. Several computational methods exist for analyzing 16S amplicon sequencing data. However, the most widely used bioinformatics tools cannot produce high-quality genus-level or species-level taxonomic calls and may underestimate the potential accuracy of these calls. We used 16S sequencing data from mock bacterial communities to evaluate the sensitivity and specificity of several bioinformatics pipelines and genomic reference libraries used for microbiome analyses, concentrating on measuring the accuracy of species-level taxonomic assignments of 16S amplicon reads. We evaluated the tools DADA2, QIIME 2, Mothur, PathoScope 2, and Kraken 2 in conjunction with reference libraries from Greengenes, SILVA, Kraken 2, and RefSeq. Profiling tools were compared using publicly available mock community data from several sources, comprising 136 samples with varied species richness and evenness, several different amplified regions within the 16S rRNA gene, and both DNA spike-ins and cDNA from collections of plated cells. PathoScope 2 and Kraken 2, both tools designed for whole-genome metagenomics, outperformed DADA2, QIIME 2 using the DADA2 plugin, and Mothur, which are theoretically specialized for 16S analyses. Evaluations of reference libraries identified the SILVA and RefSeq/Kraken 2 Standard libraries as superior in accuracy to Greengenes. These findings support PathoScope and Kraken 2 as fully capable, competitive options for genus- and species-level analysis of 16S amplicon sequencing data, in addition to their established roles as whole-genome sequencing and metagenomics tools.

For the standalone implementation of DADA2, all samples were filtered and trimmed, with errors learned on forward and reverse reads. Learned errors were used to conduct inference on predicted error presence across all reads as a denoising measure. Paired reads were merged and chimeras were removed, with taxonomy assigned down to the species level. When filtering and trimming, for most samples we used the parameters maxN = 0, maxEE = c(3, 3), truncQ = 2, rm.phix = TRUE, and tlength = 0. For unpaired reads, we set maxEE = 3. Several Kozich samples suffered from extreme quality degradation at read ends; to correct for this, we set tlength = c(240, 200). Finally, when running the Fouhy Ion Torrent samples, we set the DADA2 function parameters HOMOPOLYMER_GAP_PENALTY = -1 and BAND_SIZE = 32. For the filter and trim step, we also set trimLeft = 15. These settings were based on recommendations for processing Ion Torrent data in the DADA2 FAQ.
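The maxEE thresholds above act on a read's expected errors, computed from its per-base Phred quality scores. A minimal Python sketch of the criterion (illustrative only; the actual filtering is performed by DADA2 in R, and the function names here are our own):

```python
def expected_errors(quals):
    """Sum of per-base error probabilities implied by Phred quality scores."""
    return sum(10 ** (-q / 10) for q in quals)

def passes_maxee(quals, max_ee=3.0):
    """Mimic the maxEE criterion: keep a read only if its total expected
    errors do not exceed the threshold (3 per read here, matching the
    maxEE = 3 setting used above for unpaired reads)."""
    return expected_errors(quals) <= max_ee

# A read with uniformly high quality (Q30 -> error probability 0.001/base)
good = [30] * 200                 # 200 bp at Q30: EE = 0.2, passes
# A read whose 3' end has degraded (Q10 -> error probability 0.1/base)
bad = [30] * 150 + [10] * 50      # EE = 0.15 + 5.0 = 5.15, fails
```

This is why truncating low-quality read tails (as done for the Kozich samples) rescues reads that would otherwise fail the filter.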
For all QIIME 2 analyses, we used the DADA2 plugin to cluster sequences and construct feature tables. We decided to test the DADA2 plugin alongside its standalone package due to the broad user base of QIIME 2. However, the standalone implementation uses 100% sequence identity, versus the 97% used by Mothur and QIIME 2, and uses exact sequence matching rather than a k-mer based method (as in the QIIME 2 q2-feature-classifier). All mock datasets could be run with paired-end reads except the Fouhy datasets. In most cases, DADA2 did not require truncation of paired-end sequences and only the initial 6 bp were trimmed from each read. However, quality scores at the ends of reads in nine samples from the Kozich dataset were universally low enough (cutoff of median quality score < 20) to require truncation to 240 bp for forward reads and 200 bp for reverse reads. Taxonomy was assigned using custom naïve Bayes classifiers constructed for each set of mock community samples based on their amplified 16S region. The QIIME 2 artifact output files were converted into BIOM format and subsequently into tab-delimited text format for downstream analyses and pipeline comparisons.
For Mothur analyses, all recommended procedures were followed according to Mothur documentation where possible. For paired-end sequences, the native make.contigs() function was used to join reads. In the pre.cluster() step of Mothur analysis, the "diffs" parameter (the number of mismatches allowed between a cluster's representative sequence and each member sequence) was set to 2 for joined sequencing reads shorter than 250 bp, 3 for joined reads of length 250-349 bp, and 4 for longer joined reads. For cluster.split(), we set the "taxlevel" parameter to 4, with a "cutoff" of 0.03.
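The length-dependent rule for "diffs" can be stated compactly (this is a restatement of the parameter choice above, not Mothur code; the function name is ours):

```python
def precluster_diffs(joined_read_length):
    """Mismatches allowed in Mothur's pre.cluster(), per the rule used here:
    2 for joined reads shorter than 250 bp, 3 for 250-349 bp, 4 otherwise."""
    if joined_read_length < 250:
        return 2
    elif joined_read_length < 350:
        return 3
    return 4
```

The scaling keeps the allowed mismatch rate roughly constant (about 1 mismatch per 100 bp) as joined read length grows.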
For PathoScope 2.0 analyses, Bowtie2 alignment parameters were set to "--local -R 2 -N 0 -L 25 -i S,1,0.75 -k 10 --score-min L,100,1.28". These values were optimized for 16S sequencing reads, requiring higher similarity to a reference genome for a hit than the default settings, due to the highly conserved nature of portions of the 16S rRNA gene. Phylogeny for each taxon was inferred from the NCBI taxon id (ti) for each reference genome using the entrez_fetch() function from the R package rentrez.
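For readers unfamiliar with Bowtie2's --score-min syntax, L,100,1.28 specifies a linear minimum-score threshold f(x) = 100 + 1.28x in the read length x; since a perfect local alignment scores about 2 points per base under Bowtie2's default match bonus, this demands considerably closer matches than the defaults. A small illustrative sketch (the function name is ours):

```python
def min_alignment_score(read_length, a=100.0, b=1.28):
    """Bowtie2 '--score-min L,100,1.28': the minimum accepted local
    alignment score is a linear function of read length, f(x) = a + b*x."""
    return a + b * read_length

# e.g. a 250 bp 16S read must score at least 100 + 1.28 * 250 = 420,
# out of a maximum of roughly 2 * 250 = 500 for a perfect local match
threshold = min_alignment_score(250)
```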
For Kraken 2 analyses, Kraken 2 taxonomic reports were created for each sample. These were parsed into a taxon/feature counts matrix that included the complete phylogeny for each identified taxon as reported by Kraken 2.
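As an illustration of that parsing step, the standard Kraken 2 report is tab-delimited with the fields percentage of reads, clade read count, direct read count, rank code, NCBI taxonomy ID, and indented taxon name. A simplified Python sketch (the actual parser also recorded the full phylogeny; the function name and example rows here are ours):

```python
def parse_kraken2_report(lines, rank="S"):
    """Collect read counts for taxa at a given rank code from a Kraken 2
    report (tab-separated fields: pct, clade_reads, direct_reads, rank,
    taxid, indented name). Returns {taxon_name: clade_read_count}."""
    counts = {}
    for line in lines:
        pct, clade_reads, direct_reads, rank_code, taxid, name = \
            line.rstrip("\n").split("\t")
        if rank_code == rank:
            counts[name.strip()] = int(clade_reads)
    return counts

# Hypothetical report rows for illustration
report = [
    "90.00\t9000\t0\tG\t561\t      Escherichia",
    "85.00\t8500\t8500\tS\t562\t        Escherichia coli",
    "10.00\t1000\t1000\tS\t1613\t        Limosilactobacillus fermentum",
]
species_counts = parse_kraken2_report(report)
```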
Bacterial genomic and 16S reference libraries. We used five bacterial sequence reference databases in conjunction with the aforementioned pipelines: Greengenes 13_8, SILVA 138, two versions of RefSeq's representative genomes, and the Kraken 2 Standard library (downloaded on August 20, 2020). According to the Kraken 2 manual, the Kraken 2 Standard library is compiled using the RefSeq database, so it could be considered analogous to the RefSeq2020 library. The RefSeq libraries were downloaded on November 2, 2018, and June 23, 2020; these are denoted as "RefSeq2018" and "RefSeq2020." Greengenes and SILVA are specifically 16S reference databases, as they include only sequences for the bacterial 16S rRNA gene. RefSeq2018, RefSeq2020, and the Kraken 2 Standard database are all whole-genome libraries, with no special modifications for use with 16S amplicon sequencing data.
Analysis pipeline and reference library pairings. We analyzed 136 mock community samples using a total of 11 distinct pairings of analysis tools and reference libraries: DADA2 only with SILVA, QIIME 2 with Greengenes and SILVA, Mothur only with SILVA (the default reference library), PathoScope with Greengenes, SILVA, RefSeq2018, and RefSeq2020, and Kraken 2 with its Standard library, SILVA, and Greengenes. While the SILVA database includes species-level taxonomic information for most of its representative 16S sequences, note that Mothur collapses feature counts into genus-level clades and thus does not make species-level calls. The adaptations of SILVA used for Kraken 2 and QIIME 2 did not provide species-level calls. Thus, only eight of the 11 pairings make species-level calls. The pairings and pipeline parameters are summarized in Table 1.

Tracking available taxonomic information for each OTU.
A counts matrix was created from the results of each of the 11 pipeline/reference pairs for each operational taxonomic unit (OTU), denoised OTU, and feature. Each feature was assigned phylum-, class-, order-, family-, genus-, species-, and subspecies-level information when available. For a given database, whenever a taxonomic label was missing, the lowest-level taxonomy available for a feature was propagated using that database's taxonomic path, taking note of what granularity was available.
Assessing taxon sensitivity, read specificity, error, and diversity. Several metrics were used in assessing the overall quality and power of each 16S analysis pipeline and reference library; results were independently evaluated at each taxonomic level. Any reads or features not assigned to a taxon at a given phylogenetic level were excluded from analysis, except where otherwise specified.
Taxon detection sensitivity. The metric of taxon detection sensitivity is defined here as the portion of expected taxa in a mock community sample detected by a given pipeline, at a minimum of 0.1% relative abundance. It essentially examines how often a given method can correctly determine an organism's presence in the mock community.
Read assignment specificity. Read assignment specificity is defined here as the portion of reads from a given sample assigned to taxa that are actually present in that sample's mock community. This is equivalent to 1 minus the portion of reads assigned to spurious taxa. This metric identifies the frequency of read assignment to incorrect organisms for a given method.
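The two definitions above translate directly into code. A pure-Python sketch with illustrative counts (the function names and toy data are ours):

```python
def taxon_detection_sensitivity(measured, expected, min_rel_abundance=0.001):
    """Fraction of expected taxa detected at >= 0.1% relative abundance.
    `measured` maps taxon -> assigned read count; `expected` is the set
    of taxa truly present in the mock community."""
    total = sum(measured.values())
    detected = {t for t, n in measured.items() if n / total >= min_rel_abundance}
    return len(detected & expected) / len(expected)

def read_assignment_specificity(measured, expected):
    """Fraction of assigned reads that went to taxa actually present,
    i.e. 1 minus the fraction of reads assigned to spurious taxa."""
    total = sum(measured.values())
    correct = sum(n for t, n in measured.items() if t in expected)
    return correct / total

measured = {"A": 1600, "B": 300, "C": 99, "D": 1}  # D falls below 0.1%
expected = {"A", "B", "D"}                          # C is spurious
```

With these toy counts, sensitivity is 2/3 (D is truly present but detected below the abundance floor) while specificity stays high, since reads assigned to the truly present D still count as correct.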
Error rate. The Normalized Root Mean Squared Error (NRMSE) was calculated as the root mean square error, normalized under the assumption that the variance could increase with higher read counts. For each sample's results, the NRMSE was computed over K taxa, where w_i and t_i are respectively the measured and true read counts of taxon i. We evaluated the union of the expected and detected taxa for each sample, using t_i = 0 for detected taxa that were not truly part of the mock community. All taxa that were absent from both the measured results and the true mock community were excluded (i.e., taxa which had relative abundance values of 0, both theoretical and measured).
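With the notation above, a standard NRMSE form consistent with this description normalizes the RMSE by the mean of the true counts, so that the tolerated absolute error scales with read depth (the exact normalizer is an assumption on our part and should be checked against the original formula):

```latex
\mathrm{NRMSE} \;=\; \frac{\sqrt{\frac{1}{K}\sum_{i=1}^{K}\left(w_i - t_i\right)^{2}}}{\frac{1}{K}\sum_{i=1}^{K} t_i}
```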
Alpha diversity estimations. To assess each pipeline's ability to estimate the true alpha diversity within a sample regardless of accurate species identification, we calculated the log-fold change between the expected and the measured alpha diversity as measured by the Shannon index, the Simpson index, and the breakaway_nof1 5 index. The R package vegan 37 was used to calculate the Shannon and Simpson indices. The R package breakaway was used to calculate the breakaway_nof1 38 index, which predicts both the number of unobserved taxa and the number of true singletons based on the non-singleton frequency counts. Due to the alpha diversity metrics' sensitivity to library and count size differences 39, we converted the relative abundances of the mock community samples' ground truths to virtual sequencing libraries of 1,000,000 reads, again using the vegan 37 package. A rarefaction depth of 10,000 reads per sample was used to normalize all samples and ground truth libraries.
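The Shannon and Simpson indices themselves are short formulas. A pure-Python sketch of the index calculations and a simple absolute-deviation measure (illustrative only; the published analysis used the R packages vegan and breakaway, and the function names here are ours):

```python
import math

def shannon(counts):
    """Shannon index H = -sum(p_i * ln p_i) over nonzero proportions."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in props)

def simpson(counts):
    """Simpson diversity index, 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def diversity_deviation(measured_counts, true_counts, index=shannon):
    """Absolute deviation between measured and expected index values;
    values closer to 0 indicate a more accurate estimate."""
    return abs(index(measured_counts) - index(true_counts))

# A perfectly even 4-species community: H = ln(4), Simpson = 0.75
even = [10, 10, 10, 10]
```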

Statistical methods for significance testing.
A series of linear mixed-effects models (LMMs), coupled with post hoc least-squares means tests and a Tukey multiple comparison correction, were used to determine which pipelines outperformed the others in sensitivity, specificity, error rates, and alpha diversity estimates.
Table 1. Summary table of all pipeline and reference database pairings, with information on non-default parameters chosen. There are 11 distinct pairings in all. For each pairing, a note was included as to the lowest taxonomic level call available. In sum, eight of 11 pairings were capable of making species-level calls. Any non-default parameters used for a given pipeline were also specified.
LMMs were estimated using the lmer() function from the R package lme4 40, and post hoc comparisons were performed with the lsmeans() function from the R package lsmeans 41. These LMMs examine the relevant performance metric as the measured variable, using the 136 mock community samples as a random effect and the pipeline/reference library pair as a fixed effect.

Results
Visual evaluation of species-level detection and abundance estimates. Figure 1 shows stacked bar charts of the results from the Kozich dataset for the ground truth versus all methods at the species level. Overall, pipelines using the Greengenes database (Kraken 2, QIIME 2, and PathoScope) performed the worst in classifying species, followed by DADA2 paired with SILVA. PathoScope made the best use of the Greengenes database, with the fewest misclassified reads and the most correct species-level detections. Kraken 2 (paired with its Standard library) and PathoScope (paired with the RefSeq and SILVA libraries) performed best on these datasets.
A more quantitative evaluation of these methods in the context of all samples follows.

Alpha diversity estimations.
Out of all methods evaluated at the species level, Kraken 2 paired with Greengenes showed the greatest deviations from expected Shannon (deviation mean = 1.05, SD = 1.06) and Simpson (deviation mean = 0.25, SD = 0.27) alpha diversity indices, with significantly higher deviations than all other methods (Tukey-adjusted p < 0.001 in all pairwise comparisons). PathoScope generally matched the expected alpha diversity indices most closely.
DADA2 reported the most closely matching log breakaway_nof1 38 indices, averaging significantly less deviation from the true number of species present than other methods (mean = 1.37, SD = 3.07; Tukey-adjusted p < 0.001 in all pairwise comparisons). On the other hand, Kraken 2 using its Standard library and SILVA frequently overestimated the number of species present by several orders of magnitude (Standard: mean = 6.17, SD = 1.82; SILVA: mean = 5.87, SD = 1.85), performing worse than all other methods (Fig. 3B).
Overall, no single pipeline or reference library performed the best in all evaluative metrics, but some holistic trends are present, especially at the species level. Figure 4B shows that sensitivity and specificity are correlated traits at the species level (Spearman's r = 0.85) and that PathoScope (regardless of reference library) and Kraken 2 (with its Standard library) dominate the upper right quadrant, where sensitivity and specificity are both high. In particular, PathoScope excels in both sensitivity and specificity when used with either SILVA or RefSeq2018. Similarly, Fig. 4C shows that error and alpha diversity estimate deviation are inversely correlated (Spearman's r = -0.57) and that no single method yields both the lowest alpha diversity deviation and the lowest error rates. Trends are not well-defined at the genus level (Fig. 4A).

Discussion
Mock bacterial communities, either derived from spike-in DNA sequences or extracted from mixtures of bacterial cell monocultures, provide a semblance of a "ground truth" to assess 16S amplicon sequencing analysis methods. Ideally, knowing which species, and in what amounts, should be present in any genuine microbiome sample would allow for accurate identification in every analysis. There are, of course, complications in sequencing experiments: technical bias and errors are introduced into samples at every step of the experiment until the reads are safely sealed as bits in a FASTQ file on a server. Mock species' relative abundances may be affected by subtle variations in pipetting technique as spike-in DNA is aliquoted from individual sources. Spike-in DNA might be cloned from mutated DNA, or an early PCR error may have propagated through an entire commercial stock of nucleic acids. Different species of bacteria vary in lysing difficulty 42, causing some species to be underrepresented or even absent in the collected cDNA libraries from a plate 43. While 16S amplification primers are designed to bind to universally conserved regions of the 16S rRNA gene, there is still clearly some amplification bias during PCR 44. Contamination from reagents 45, or from local bacteria in the air, on gloves, or in a pipette tip box, can further complicate matters. Thus, limitations of different experimental conditions and methods can dramatically affect the quality of results obtained from mock communities. Given these sequencing errors and contamination, amplicon reads will not map to taxa as precisely as would a tidy, evenly distributed set of sequences drawn from a closed set of well-characterized species. It should then be evident that no analysis pipeline could theoretically exist to perfectly measure a mock community. Such a feat would require identifying only the expected species in their exact proportions, with no extraneous observations. As such, the proximate best method to analyze 16S amplicon sequencing data is one that identifies the makeup of the microbiome as truthfully as possible. Mock microbial communities can provide a level testing ground for existing tools to find their relative strengths and weaknesses in performance.
Of the pipelines tested, both QIIME 2 and Mothur were designed and built specifically for 16S amplicon sequencing analysis. Each has a suite of utility functions built to assist researchers in processing their data from the sequencer through differential abundance analysis and visualizations. Both are typically bundled with a dedicated bacterial 16S rRNA gene sequence database's reference library for alignment (i.e., Greengenes for QIIME 2, SILVA for Mothur). However, our results present strong evidence that PathoScope and Kraken 2 outperform QIIME 2, Mothur, and DADA2, even when comparing reads to identical reference databases. Interestingly, this occurs despite Kraken 2 and PathoScope's status as more general whole-genome sequencing and metagenomics data tools. In pairwise comparisons, PathoScope is more sensitive and specific at taxon detection and has a lower error score than DADA2, QIIME 2, or Mothur, and has comparable alpha diversity index estimates at both the genus and species levels. In general, SILVA's outperformance of Greengenes confirmed results found in previous benchmarks of 16S amplicon sequencing analysis methods [20][21][22][23]46. This is likely due to several factors, including the small size of the Greengenes database in comparison to SILVA (99,000 versus 190,000 sequences), and the fact that this version of Greengenes had not been updated since 2013 47.
Kraken 2, when used with its Standard library, was rarely the top-performing analytic method in terms of sensitivity or specificity, although it was generally less error-prone than QIIME 2, Mothur, or any tool using Greengenes as a reference library. Kraken 2 has the added practical utility of being extremely fast and easy to use. Yet one constraint in analyzing Kraken 2 results is that they cannot be upsampled from a given taxon level, whereas PathoScope, QIIME 2, and Mothur all allow tracing back the taxonomic hierarchy of a given microbe. Both QIIME 2 and Mothur take advantage of naïve Bayes classifiers, which work most efficiently when trained on the specific region of the 16S rRNA gene amplified by PCR primers. Overall, PathoScope was the most sensitive in detecting taxa, the most specific in assigning reads, and the least error-prone tool when paired with either SILVA or RefSeq2018. However, it was not without limitations, as its computational expenses seemed to be roughly an order of magnitude above those of other methods. This was evident from large interim SAM files (> 128 GB) and runtimes on the order of several hours, whereas Kraken 2 in particular took mere minutes. Issues aside, PathoScope is likely to outperform QIIME 2, DADA2, and Mothur in identification regardless of the database used. This finding partly results from PathoScope's Bayesian mixture-modeling identification algorithm, which accounts for the possibility that multiple species may be present in the sample or that the target strain is not present in the reference database. PathoScope outperformed Kraken 2 in most cases, although the difference was often slight and not statistically significant. Overall, these comparisons show that methods designed for general metagenomics analyses consistently outperform methods specifically designed for analyzing 16S data.
While many species are identifiable from their 16S rRNA gene sequence or a single hypervariable region, it is important to note that imperfect accuracy at this level is not solely a computational issue. For example, although the 16S rRNA gene is approximately 1550 bp long, the short sequencing reads obtained in most next-generation sequencing (NGS) experiments only span about 250-500 bases and lack ideal resolution at the species level 48. Compared to NGS, long-read sequencing technologies have been shown to perform better in classifying at the genus and species levels 49,50. We also observed differences in our study between the results of 15 samples sequenced with Ion Torrent sequencing, as compared to samples from the same mock communities that were sequenced using Illumina MiSeq. Furthermore, a major limitation of 16S amplicon studies is that some clades of bacteria have identical 16S DNA in the commonly sequenced V4 region. These clades of difficult-to-identify bacteria make up the bulk of Kraken 2's and PathoScope's incorrect calls. For example, Bifidobacterium adolescentis was nearly universally misclassified by all methods as other Bifidobacterium species, and Prosthecobacter fusiformis was frequently misidentified as Prosthecobacter dejongeii, a species with which it shares over 99% of its 16S DNA sequence 51. Even further complications arise from many bacteria having several copies of the 16S rRNA gene, which may not be identical between operons within a genome 52. This latter point may be part of why metagenomic methods such as Kraken 2 and PathoScope outperform specialized methods such as QIIME 2 and Mothur, especially at the species level: the metagenomic methods are better designed to account for multiple 16S rRNA gene copies when present.
One of PathoScope's largest sources of error and lost taxon detection sensitivity when using the RefSeq2020 library comes from an apparently erroneous reference genome scaffold in the RefSeq representative genomes. In all mock community samples containing Escherichia coli, PathoScope with RefSeq2020 reported the presence of Tumebacillus flagellates at relative abundances tightly correlated with the expected values of E. coli (Pearson's r = 0.959). The circumstances strongly imply that reads actually originating from E. coli were incorrectly assigned to T. flagellates. T. flagellates is not even in the same phylum as E. coli, so chance misassignment of reads between the species would be extremely unlikely based on 16S sequence similarity. Instead, a pairwise BLAST comparing E. coli's 16S rRNA gene sequence to the T. flagellates scaffolds, using the exact RefSeq entry to which PathoScope had assigned those reads (accession: NZ_JMIR01000093) 53, revealed that one T. flagellates scaffold had a 100% identity alignment over 911 bp. The finding possibly represents a case of horizontal gene transfer of the 16S rRNA gene, but it appears far more likely that E. coli contamination existed in the DNA library, which was then sequenced and assembled into T. flagellates scaffolds. On further study, it became apparent that this is merely one example of pervasive sequence contamination, meaning the accidental inclusion of sequences from other organisms or the misclassification of sequences, in public genome databases. This phenomenon has been recently explored in the NCBI RefSeq database [54][55][56]. The high throughput and accelerating low cost of next-generation sequencing (NGS) technologies have led to a rapid increase in published genomes available in the RefSeq libraries, although imperfect methods and protocols for processing sequencing data are contributing to high contamination rates. Human contamination in published genomes, while not a problem in 16S analyses, is a particularly frustrating problem when analyzing shotgun metagenomics data. Clearly, metagenomic read-mapping approaches such as Kraken 2 and PathoScope afford the potential for the development of novel quality control pipelines for RefSeq and other genome sequence databases.
The increasing prevalence of poor sequencing quality control helps to explain why the RefSeq2018 libraries often performed better than the 2020 libraries. Many tools have been developed to identify and correct contamination errors in sequences and public databases [56][57][58][59][60], but this is an ongoing problem that demands additional filtering and correction efforts after directly retrieving libraries from the public repository. Given the higher specificity and sensitivity of PathoScope when using the 2018 RefSeq library over the 2020 library, we recommend using older RefSeq libraries until newer versions have been processed to remove contamination. Also of interest is the high accuracy of SILVA in its species calls when used with PathoScope, even though it cannot be used to make such calls with QIIME 2, Mothur, or Kraken 2. SILVA thus also presents a viable alternative to the RefSeq libraries for avoiding contamination.
We note that in performing this benchmark, we sought to evaluate several common 16S amplicon sequencing analysis pipelines alongside metagenomics analysis pipelines. 16S pipelines were chosen based on performance as well as prevalence in previously published benchmarks. To identify metagenomics pipelines, we used a previously published metagenomics benchmarking paper 18. Miossec et al. found that, among the tested pipelines, PathoScope 2.0 and Kraken showed high sensitivity and specificity in the benchmark results. However, we emphasize that further comparison of other metagenomics analysis pipelines such as MetaMix 61, Centrifuge 62, and Metaxa2 63 should be conducted to analyze their differential performance, especially as new methods are developed and published.

Conclusion
DADA2, QIIME 2, and Mothur struggle to maintain accuracy at the genus level or granular species level in taxonomic analyses. Kraken 2, despite being designed primarily for whole-genome sequencing metagenomics analyses, offers more power in analyzing 16S data without any increase in computational costs. PathoScope, while computationally more expensive, produces the most sensitive and accurate results of all evaluated pipelines when used on a diverse set of mock bacterial community samples. Analysis pipelines using SILVA as a reference library significantly outperformed those using Greengenes, and PathoScope using SILVA yielded the highest accuracies and sensitivities. While whole-genome reference libraries, such as Kraken 2's Standard or RefSeq's representative genomes, may provide some benefits over SILVA in terms of sensitivity, they may yield more spurious species-level calls. Based on the research conducted here with mock microbial communities, we recommend SILVA and RefSeq above other databases and strongly discourage usage of the Greengenes reference library for future analyses. While not included in our analysis due to its release date, we do encourage users to try the Greengenes2 30 reference library as a phylogeny-based improvement over Greengenes. We also recommend PathoScope and Kraken 2 as fully capable, competitive options for conducting genus- and species-level 16S amplicon sequencing data analysis, in addition to outperforming other tools when using shotgun metagenomics data 18.

Figure 1. Expected versus measured relative abundances of mock bacteria. A stacked bar plot of the measured relative abundances of bacterial species in 33 samples from Kozich et al. These samples were all equimolar concentrations of 16S rDNA from 21 species, as shown in the 'Ground Truth' bar on the left. All reads assigned to bacterial species other than the 21 expected in the mock community are colored gray and labeled "Incorrect ID". Mothur calls were not included because the pipeline does not make species-level calls, and the same is true of QIIME 2 and Kraken 2 paired with the SILVA database.

Figure 2. Taxon detection sensitivity of 16S analysis pipelines. Violin plots of the sensitivity, specificity, and log NRMSE of each analysis pipeline and reference library pair used to analyze 16S samples, calculated at the genus (A,C,E) and species (B,D,F) levels. Sensitivity is calculated as the portion of expected taxa in each mock community sample detected with at least 0.1% relative abundance. Specificity is calculated as the portion of reads assigned to taxa that are expected to exist within each mock community.

Figure 3. Deviation from true alpha diversity metrics at the species level. (A) The absolute difference between the measured Shannon alpha diversity index and the Shannon index value for the true mock community composition, and (B) the log of the absolute difference between breakaway_nof1 richness estimates and the true number of species present in each mock community. In both cases, values closer to 0 indicate more accurate estimation of the alpha diversity within a sample.

Figure 4. Combined quality of 16S analysis methods. Scatterplots of relative metrics for each 16S analysis pipeline, at the genus (A) and species (B,C) levels. Each point represents a single method's results when analyzing a single mock community sample. Points are colored by the analysis pipeline/reference library used. Centroids representing the mean values for each pipeline/reference library pair are marked with bolded diamonds.