Species abundance information improves sequence taxonomy classification accuracy

Popular naive Bayes taxonomic classifiers for amplicon sequences assume that all species in the reference database are equally likely to be observed. We demonstrate that classification accuracy degrades linearly with the degree to which that assumption is violated, and in practice it is always violated. By incorporating environment-specific taxonomic abundance information, we demonstrate a significant increase in the species-level classification accuracy across common sample types. At the species level, overall average error rates decline from 25% to 14%, which is favourably comparable to the error rates that existing classifiers achieve at the genus level (16%). Our findings indicate that for most practical purposes, the assumption that reference species are equally likely to be observed is untenable. q2-clawback provides a straightforward alternative for samples from common environments.


Supplementary
. Cumulative weights for different habitat types display similar diversity trends. Lines show the cumulative taxonomic weights for 14 EMPO 3 habitat types as a function of species count, coloured by habitat. Taxa are ordered separately for each habitat from most to least abundant. The most peaked distribution is Plant corpus, where 73% of reads were mapped to a single taxonomy in the Cyanobacteria phylum, Chloroplast class. Source data are provided as a Source Data file.
Supplementary Figure 5 . Habitat-specific taxonomic weights improve species-level classification accuracy across phyla. The use of habitat-specific weights is more important for species classification within some phyla, but is more important for more abundant phyla. Columns show percentage of reads correctly classified averaged across 14 EMPO 3 habitats and 21,513 empirical samples. Black lines show average abundances for each phylum. The phyla were truncated to only show those with an average abundance of > 0.05%. Error bars show standard error. Source data are provided as a Source Data file.
Supplementary Figure 6 . Taxonomic weights from shotgun sequencing improve agreement between amplicon and shotgun sequencing taxonomic compositions. Taxonomic weights derived from shotgun sequencing experiments make the taxa discovered in out-of-sample 16S sequencing samples agree more closely with shotgun sequencing experiments on the same samples. Taxon detection rate (TDR) is the fraction of taxa detected using shotgun sequencing that were also found using 16S sequencing. Points show mean TDR across 71 stool samples for which paired 16S and shotgun sequencing exists. Error bars show standard errors across folds for 5-fold cross validation. Source data are provided as a Source Data file. Supplementary Figure 7 . On average, bespoke weights outperformed average weights across EMPO 3 habitat types, and average weights outperformed uniform weights based on Taxon Accuracy Rate (but neither significantly, minimum paired t-test P = 0.17) . Taxon Accuracy Rates are from 5-fold cross validation where classifiers were trained using a variety of taxonomic weighting strategies. Tests were based on 21,513 empirical samples across the 14 habitat types. Box bounds and centre lines show first and third quartiles and median. Whiskers extend to measurements no further than 1.5 times the interquartile range from the nearest quartiles. Outliers are plotted individually. Source data are provided as a Source Data file.
Supplementary Figure 8. Bespoke weights always outperformed average weights across EMPO 3 habitat types, and average weights always outperformed uniform weights based on Taxon Detection Rate (maximum paired t-test P = 4.7x10-7). Taxon Detection Rates are from 5-fold cross validation where classifiers were trained using a variety of taxonomic weighting strategies. Tests were based on 21,513 empirical samples across the 14 habitat types. Box bounds and centre lines show first and third quartiles and median. Whiskers extend to measurements no further than 1.5 times the interquartile range from the nearest quartiles. Outliers are plotted individually. Source data are provided as a Source Data file.
Supplementary Figure 9 . Error rates improve less for less abundant species than for more abundant species under bespoke weights. Hex bin plot shows difference in error rates between bespoke and uniform weighting strategies, averaged over each species, then across all the samples for each of the 14 EMPO 3 habitat types, then across the habitat types. Histograms show abundances averaged in the same way then marginalised. Whereas the hex bin plot indicates the presence of rare ASVs that are more accurately classified under uniform weights, the histograms indicate that bespoke weights provide at least a slight improvement for the vast majority of ASVs detected across all sample types. Source data are provided as a Source Data file.

Supplementary Tables
Supplementary Table 1 . Using habitat-specific taxonomic weights, researchers can now classify sequences at species level with the same confidence that they previously classified sequences at genus level . Table shows error rates at genus and species levels for habitat-specific (bespoke) and standard (uniform) taxonomic weights. Bolded rows indicate EMPO 3 habitats where species-level error rate with the bespoke classifier is less than genus-level accuracy with the uniform classifier. Lower error rate indicates superior accuracy. Source data are provided as a Source Data file.

Supplementary Notes
Using cross validation (see Methods), we determined the effect of several different options for obtaining taxonomic weights on taxonomic classification accuracy. We labelled those options as: -Uniform weights: every taxonomic class is assumed to be equally likely.
-Bespoke weights: weights drawn from the same EMPO 3 habitat as the test samples.
-Cross-habitat weights: weights from any of the 13 EMPO 3 habitats other than a test sample's source EMPO 3 habitat.
Uniform weights is the current default assumption for q2-feature-classifier and the only available option for the RDP Classifier. Average weights were used to determine how important it is to closely match the taxonomic weights with the expected weights for a given sample, and to investigate a classification approach for uncharacterized and unknown sample types.
Cross-habitat weights were used to determine the effect of classifying reads using misspecified weights. For example, if one were to take a sample that would properly be labelled as Animal distal gut, but erroneously undertake taxonomic classification using a classifier trained using Plant corpus weights.
We used three measures of classification accuracy: error rate, Bray-Curtis dissimilarity, and F-measure, made possible because for each test sample there is a known taxonomy for each read. Please refer to the Methods for details of how these measures were calculated and averaged across samples.
As is typical for existing taxonomy classification methods, classification accuracy was excellent at class level, but decreased at finer levels of taxonomic resolution (Supplementary Figure 1).
Classification accuracy decreased for both bespoke and uniform weighted classifiers, but bespoke classifiers were much less prone to this effect (Supplementary Figure 1)   Across the EMPO 3 types, paired t-test differences between bespoke and average weights and average and uniform weights were significant in all cases; maximum P was 4.2x10 -4 .
Cross-habitat weights outcomes occupied a spread but rarely outperformed average weights (8 of 182 comparisons); however, cross-habitat weights frequently outperformed uniform weights (117 out of 182 comparisons) ( Figure 3). Thus, it appears that uniform weights gravely misrepresent natural species distributions, marring classification accuracy. By comparison, any type of naturally derived taxonomic weight usually improved classification accuracy, even if those weights were derived from a dissimilar habitat type.
Using the cross-habitat weights it was possible to quantify the relationship between classification accuracy and taxonomic weight misspecification over 182 comparisons. We first calculated the differences between error rates using cross-habitat weights and the error rates using bespoke weights. We then calculated the corresponding Kullback-Leibler divergence between the bespoke and cross-habitat weights for each difference and discovered a significant correlation (Pearson r 2 = 0.57, P < 2.2x10 -16 , see Figure 4). We performed the same test with F-measure and discovered a negative correlation of roughly the same magnitude (Pearson r 2 = 0.58, P < 2.2x10 -16 ). In our tests, using the bespoke weights yielded the best classification accuracy at every level, regardless of how we measured it. This result refines that finding to show that the amount by which performance degrades for taxonomic weights other than the default weights is proportional to how different they are to the bespoke weights. The implication is that any improvement of uniform weights in the direction of bespoke weights is worthwhile, and that if bespoke weights are not available, then taxonomic weights from a similar habitat or the average weights should be used for classification.
We investigated what factors lead the classification accuracy for bespoke weights to vary While it is not shown in Supplementary Figure 4, it is also worth noting that the most common taxa for each habitat type are similar: the union of the sets of taxa that account for the first 95% of weights for each habitat type contains only 1,571 taxa.
We first examined whether the diversity of an EMPO 3 habitat was related to classification accuracy under bespoke weights. Regression of error rate against the entropy of the taxonomic weights for each of the 14 EMPO 3 habitats showed no significant relationship (Pearson r 2 = 0.12, P = 0.23).
The classification accuracy for a given EMPO 3 habitat type was instead found to be largely explained by specific interaction between taxonomic weights and the similarity among reference sequences. We discovered a significant correlation between the confusion index (see Methods) and the error rate using bespoke weights for each of the 14 EMPO 3 types (Pearson r 2 = 0.72, P = 1.3x10 -4 ) ( Figure 5). A negative correlation of a similar magnitude was found for F-measure (Pearson r 2 = 0.63, P = 7.1x10 -4 ). While much has been written about the difficulty of establishing species-level identity from short read marker-gene sequences [1][2][3][4][5] , to our knowledge this is the first instance of a systematic quantitative analysis of how difficult this problem is in the light of which species co-occur. By using typical weights (as estimated from the data) we have shown that while it is possible to find many examples where a taxonomic classifier can be confused at species level if presented with isolated genetic sequences and the entire reference database, that in practice the problem does not have to be that hard. More than explaining variation between habitat types, these discoveries give a clear indication of the basis for the improvements that we see when using bespoke weights. where error rate and abundance was averaged over the habitat types. We observed that reads from some phyla are significantly more difficult to classify than others. For instance, using uniform weights, the error rate was 44% (0.7% standard error) for Firmicutes but 4% (0.2% standard error) for Acidobacteria. For the two most abundant phyla, Proteobacteria and Firmicutes, the decrease in incorrect classifications from uniform to bespoke weights was substantial, from 35% (0.7% standard error) to 22% (0.4% standard error) and from 44% (0.7% standard error) to 24% (0.3% standard error), respectively (maximum t-test P = 8.4x10 -6 ). For Firmicutes that is an almost two-fold reduction in the number of incorrectly classified reads.
These increases in accuracy underline the consistent increases in accuracy from uniform to bespoke weights that we have observed throughout this study.
Shotgun sequencing data, which has the potential to be less biased and higher-resolution than short amplicon sequences 6 Supplementary   Figures 7 and 8. TAR is the fraction of the observed taxa that were expected to be observed for a given sample. TDR is the fraction of taxa that were expected to be observed that were observed for a given sample. As they rely purely on presence and absence of taxa in a sample, they are less affected by species abundance. Across the 14 EMPO 3 habitats considered, the average TDR and TAR was better for bespoke weights than for uniform weights. This difference was significant at 5% significance for TDR (paired t-test P = 2.2x10 -13 ) but not for TAR (paired t-test P = 0.5). TDR followed the same trend shown in every other test that we performed and TAR was no worse for bespoke weights than it was for uniform weights.
We went further to search for whether rare species were being misclassified under the use of bespoke weights. In Supplementary Figure 9, several histograms represent the relationship between average species abundance and the difference in error rate between uniform and bespoke weights. A positive difference indicates that uniform weights outperformed bespoke weights for a given species on average across the 14 EMPO 3 habitats. Graphically, there is some evidence that the error rate degrades for bespoke weighted classifiers for species with average abundance of less than 10 -4 . As the distribution is strongly peaked for no change in error rate for abundances of less than around 10 -3 , we did not perform a linear regression. For a sample of 10,000 reads, that represents roughly one read attributable to that species. We concede that if rare species that one would expect to exist as singletons at usual sequencing depths are important to an experimental technique, then classification should be performed with uniform and bespoke classifiers.