Strain level microbial detection and quantification with applications to single cell metagenomics

Zhu, Kaiyuan; Schäffer, Alejandro A.; Robinson, Welles; Xu, Junyan; Ruppin, Eytan; Ergun, A. Funda; Ye, Yuzhen; Sahinalp, S. Cenk

doi:10.1038/s41467-022-33869-7

Published: 28 October 2022

Nature Communications volume 13, Article number: 6430 (2022)

Abstract

Computational identification and quantification of distinct microbes from high throughput sequencing data is crucial for our understanding of human health. Existing methods either use accurate but computationally expensive alignment-based approaches or less accurate but computationally fast alignment-free approaches, which often fail to correctly assign reads to genomes. Here we introduce CAMMiQ, a combinatorial optimization framework to identify and quantify distinct genomes (specified by a database) in a metagenomic dataset. As a key methodological innovation, CAMMiQ uses substrings of variable length and those that appear in two genomes in the database, as opposed to the commonly used fixed-length, unique substrings. These substrings allow CAMMiQ to accurately decouple mixtures of highly similar genomes, resulting in higher accuracy than the leading alternatives, without requiring additional computational resources, as demonstrated on commonly used benchmarking datasets. Importantly, we show that CAMMiQ can distinguish closely related bacterial strains in simulated metagenomic and real single-cell metatranscriptomic data.

Introduction

Recent appreciation for the importance of microbes in human health and disease has prompted the generation of many metagenomic HTS (high throughput sequencing) datasets¹. The increase in available HTS data from human tissues also represents an enormous resource because many of these datasets include reads from tissue-resident microbes, which have been shown to play important roles in human disease, including tumorigenesis and the tumor response to therapy^{2,3,4,5,6,7,8}.

The increase in available metagenomic HTS datasets prompted the development of many taxonomic classification and abundance estimation methods. A recent benchmarking study⁹ involving a dataset established by the Critical Assessment of Metagenome Interpretation (CAMI) challenge and the International Microbiome and Multiomics Standards Alliance (IMMSA) provides a comprehensive review of these methods. The study covers 20 taxonomic classifiers, including both alignment-based approaches (such as GATK PathSeq, blastn and MetaPhlAn2^10,11,12) and alignment-free approaches (such as Kraken, CLARK, KrakenUniq, Centrifuge, and Bracken^{13,14,15,16,17}). Below, we provide an overview of the general approaches employed by metagenomic classification methods.

Early approaches for analyzing metagenomic sequencing data were alignment-based and used a reference database. Reads were primarily searched in GenBank¹⁸ through blastn¹¹ or custom built aligners such as GATK PathSeq¹⁰. Unfortunately, the growth of HTS data and reference databases has made read search and alignment using blastn or GATK PathSeq computationally infeasible on the largest datasets. For example, a recent study showing that microbial reads from tumors sequenced by The Cancer Genome Atlas (TCGA) can be used to build a classifier for cancer type¹⁹ used the alignment-free approach Kraken¹³ due to the large number of samples analyzed. Even though Kraken and other alignment-free tools are faster than the alignment-based tools²⁰, these alignment-free tools are not as accurate. For example, another recent paper, which used microbial reads from single cell RNA-seq (scRNA-seq) datasets to distinguish cell type specific intracellular microbes from extracellular and contaminating microbes²¹, had to use GATK PathSeq because the relatively small number of microbial reads per cell was inadequate for available alignment-free methods to give accurate results. The distinct approaches taken by these two studies exemplify the tradeoffs inherent in the above methodologies.

Alignment-based methods can be sped up substantially by aligning reads to a compressed reference database or to a reference collection of sequences from marker genes, which are usually clade-specific, single-copy genes^22,23. Since marker-gene based methods identify and use only a handful of marker genes on each genome, much of the data goes unused, making taxonomic quantification less accurate. Species with low abundance within the sample may be difficult to identify through marker gene methods because the data may contain few reads originating from the marker genes.

Alignment-free methods typically rely on exact string matching^16,24, or k-mer (substrings of length k) “matches” to obtain a taxonomic assignment for every read. These methods either assign a read to the lowest taxonomic rank possible (determined by the specificity of the read’s substrings, or k-mers)^13,25,26,27, or to a pre-determined taxonomic level, i.e., genus, species, or strain^14,28. Unlike marker-gene based methods, k-mer based applications can use all the input reads²⁹. The large memory footprint to maintain the entire k-mer profile of each genome, for large values of k, can be reduced through hashing or subsampling the k-mers^30,31,32,33. In addition to methods based on exact k-mer matches, it is also possible to assign metagenomic reads to bacterial genomes by employing sequence-specific features (e.g., short k-mer distribution or GC content)^{34,35,36,37,38}, although methods that employ this approach are typically not very accurate at species level or strain level assignment. These methods, as a result, are typically insufficient for strain-level applications³⁹, e.g., to identify mixed infections caused by multiple strains of a bacterial species^40,41,42, to distinguish pathogenic strains from non-pathogenic strains⁴³, or to track food-borne pathogens⁴⁴.

Most of the methods described above and covered in the aforementioned benchmarking study⁹ analyze each read without consideration of how the reads are sampled. Provided that the sequence data to be analyzed are genomic DNA, the distribution of HTS reads from a given species or strain should be roughly uniform. This principle is used in several methods for isoform abundance estimation^45,46,47, which are effective even though the distribution of reads across an isoform may not be uniform in practice. In the context of metagenomic abundance estimation, however, the uniform coverage principle is under-utilized. One exception is the network flow based approach, utilized, for example, by ref. 48, which does take into account the uniform coverage; however, it is relatively slow due to the hardness of the underlying algorithmic problem. Another method that utilizes the near uniformity across k-mers within a genome is ref. 15, which runs faster but is also less accurate.

In addition to the metagenomic species identification and quantification methods summarized above, there are also tools to determine the likely presence of a long genomic sequence (e.g., the complete or partial genome of a bacterial species) in a given metagenomic sample^{49,50,51,52,53}. Even though these tools solve an entirely different problem, methodologically they are similar to the k-mer based metagenomic identification and quantification tools such as refs. 13,14, in the sense that they build a succinct index on the database, which is comprised of the metagenomic read collection, and they query this index without explicit alignment. However, because of their design parameters, these tools cannot perform abundance estimation.

In this paper, we describe CAMMiQ (Combinatorial Algorithms for Metagenomic Microbial Quantification), a computational approach to maintain/manage a collection of m (bacterial) genomes $\mathcal{S}=\{s_1,\ldots,s_m\}$, each assembled into one or more strings/contigs, representing a species, a particular strain of a species, or any other taxonomic rank. CAMMiQ constructs a data structure that can answer queries of the following form: given a set $\mathcal{Q}$ of HTS reads obtained from a mixture of genomes or transcriptomes, each from $\mathcal{S}$, identify the genomes in $\mathcal{Q}$, and, in case the reads are genomic, compute their relative abundances. Our data structure is very efficient in terms of its empirical querying time and is shown to be very accurate on simulations for which the ground truth answers are known. The distinctive feature of our data structure is its utilization of substrings that are present in at most c genomes (c > 1) in $\mathcal{S}$; in this paper, we focus on c = 2, which we call doubly-unique substrings. CAMMiQ is thus different from available methods which set c = 1 to compare genomes via their shortest unique substrings^54,55, or perform metagenomic analysis by employing k-mers unique to each genome^13,14,15. By considering substrings that are present in c = 2 (or possibly more) genomes, CAMMiQ utilizes a higher proportion of reads and can accurately identify genomes at subspecies/strain level. The choice of c = 2 is sufficiently powerful for the datasets we considered. However, our approach can be generalized for any fixed value of c ≥ 2. Another distinctive feature of our data structure is its use of variable-length substrings rather than fixed-length k-mers. Because any extension of a shortest unique substring is also unique, CAMMiQ only maintains the shortest of these overlapping unique substrings to maximize utility.

By being flexible about substring length, CAMMiQ potentially has a larger selection of substrings from which to choose; because it utilizes the shortest unique substrings, it maximizes possible coverage. To assign each read in $\mathcal{Q}$ that includes an almost-unique substring (i.e., a string present in at most c genomes) to a genome, our data structure solves an integer linear program (ILP) that simultaneously infers which genomes are present in $\mathcal{Q}$ and, if the reads are genomic, the relative abundances of the identified genomes. Specifically, the objective of the ILP is to identify a set of genomes in which the coverage of the almost-unique substrings in each genome is (approximately) uniform. Our final contribution is a set of conditions sufficient to identify and quantify genomes in a query correctly, through the use of unique substrings/k-mers, provided the reads are error-free. Although this is a purely theoretical result, to the best of our knowledge it has not been applied to metagenomic data analysis, and is valid for CAMMiQ for the case c = 1 and other unique substring based methods such as CLARK and KrakenUniq. Setting c = 2 for CAMMiQ is advised for cases where these conditions are not met. On the experimental side, we show that CAMMiQ is not only much faster but also more accurate than the mapping based GATK PathSeq, which, as mentioned earlier, was used on scRNA-seq data obtained from monocyte-derived dendritic cells (moDCs) infected with distinct Salmonella strains²¹, where accuracy was the top priority. The application to single-cell data is important because in studies of the human microbiome, it is of interest to know which cells are infected with which microbial strains, especially to distinguish between benign commensals and pathogenic variants of bacteria such as E. coli.

Using current sequencing technologies, single-cell nucleotide data are primarily RNAseq rather than DNAseq, which is why we focus on an RNAseq case study. Returning to the established problem of analyzing bulk DNAseq data, we demonstrate the comparative advantage of CAMMiQ against the top performing alignment based and alignment free metagenomic classification methods according to the above-mentioned benchmarking study⁹ on the very same (CAMI and IMMSA) dataset. We additionally show that CAMMiQ is uniquely capable of handling particularly challenging microbial strains we derived from the NCBI RefSeq database.
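The notion of unique (c = 1) and doubly-unique (c = 2) variable-length substrings can be illustrated with a minimal brute-force sketch on toy genomes. This is not CAMMiQ's index construction: the function name `minimal_rare_substrings`, the toy sequences, and the exhaustive enumeration are assumptions made for illustration, and reverse complements and sparsification are ignored. A substring is kept only if it occurs in at most c genomes and no shorter substring of it (within the allowed length range) already does.

```python
from collections import defaultdict


def minimal_rare_substrings(genomes, c, l_min, l_max):
    """Return {substring: frozenset(genome names)} for every substring of
    length in [l_min, l_max] that occurs in at most c genomes and contains
    no shorter substring (of length >= l_min) with the same property.

    Hypothetical helper for illustration; brute force, toy inputs only.
    """
    # Record which genomes contain each candidate substring.
    owners = defaultdict(set)
    for name, seq in genomes.items():
        for length in range(l_min, l_max + 1):
            for start in range(len(seq) - length + 1):
                owners[seq[start:start + length]].add(name)
    owners = dict(owners)  # freeze: no keys are added during the lookups below

    def rare(sub):
        return len(owners[sub]) <= c

    selected = {}
    for sub, names in owners.items():
        if len(names) > c:
            continue
        # Occurrence sets only shrink under extension, so checking the two
        # substrings that are one character shorter is enough for minimality.
        if len(sub) > l_min and (rare(sub[:-1]) or rare(sub[1:])):
            continue
        selected[sub] = frozenset(names)
    return selected


if __name__ == "__main__":
    toy = {"g1": "ACGTAC", "g2": "ACGTTT", "g3": "GGGTTT"}
    unique = minimal_rare_substrings(toy, c=1, l_min=3, l_max=4)
    almost_unique = minimal_rare_substrings(toy, c=2, l_min=3, l_max=4)
    print(sorted(unique))
    print(sorted(s for s, o in almost_unique.items() if len(o) == 2))
```

In this toy example genome `g2` has a single unique substring for c = 1 but several additional doubly-unique substrings for c = 2, which mirrors the argument that allowing c = 2 lets a higher proportion of reads be used.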

Results

Below, we first give a brief overview of the CAMMiQ algorithm. Then we describe the index data sets, simulated and real query sets, as well as the alternative computational methods we used to benchmark CAMMiQ’s performance. We next demonstrate CAMMiQ’s comparative accuracy performance against alternative metagenomic analysis methods on the two species level data sets we have: the first is the CAMI and IMMSA benchmark (i.e., species-level-all) index dataset and the second is the species-level-bacteria index dataset. For these two datasets, we not only provide accuracy figures for the tools benchmarked but also the computational resources they use. Additionally, we demonstrate the maximum potential advantage that could be offered by CAMMiQ through its use of doubly-unique, variable length substrings on our species-level-bacteria index dataset. We then demonstrate CAMMiQ’s performance on our strain-level index dataset. Finally, we demonstrate CAMMiQ’s performance on real metatranscriptomic query sets through its use of our subspecies-level index dataset. The results of CAMMiQ in this setup were compared against those of the GATK PathSeq tool¹⁰, which was utilized by the original study on this data set²¹, as well as the blastn method¹¹, which possibly offers the most accurate (albeit slow) approach for the relevant purpose.

Overview of CAMMiQ indexing and querying procedure

As per a typical metagenomic classification or profiling tool, CAMMiQ involves two steps, namely, index construction and query. In the index construction step, CAMMiQ is given a set $\mathcal{S}=\{s_i\}_{i=1}^{m}$ of m genomes or contigs, each labeled with an ID representing the taxonomy of that genome. We call $\mathcal{S}$ an index dataset below. By the end of this step, CAMMiQ returns the collection of sparsified shortest unique substrings and shortest doubly-unique substrings on each genome $s_i$ in $\mathcal{S}$ in a compressed binary format, and other meta information involving the input index dataset, which jointly compose its index on the dataset. CAMMiQ reuses its index in the next query step.

In the query step, CAMMiQ is given a collection of reads $\mathcal{Q}=\{r_j\}_{j=1}^{n}$ of varying length, and efficiently identifies a set of genomes $\mathcal{A}=\{s_1,\cdots,s_a\}\subset\mathcal{S}$ and their respective abundances $p_1,\cdots,p_a$ that “best explain” $\mathcal{Q}$. We call $\mathcal{Q}$ a query or query set below. Depending on specific applications, a user can select to return (i) $\mathcal{A}_1\subseteq\mathcal{S}$, the set of genomes such that each includes at least one shortest unique substring that also occurs in some read $r_j$ in the query $\mathcal{Q}$; (ii) $\mathcal{A}_2\subseteq\mathcal{S}$, the smallest subset of genomes in $\mathcal{S}$ which include all shortest unique and doubly-unique substrings that also occur in some read $r_j\in\mathcal{Q}$; or (iii) $\mathcal{A}_3\subseteq\mathcal{S}$, the smallest subset of $\mathcal{S}$ which again include all shortest unique and doubly-unique substrings that also occur in some read $r_j\in\mathcal{Q}$, with the additional constraint that the “coverage” of these substrings in each genome $s_i\in\mathcal{A}_3$ is roughly uniform. In the last case CAMMiQ also computes the relative abundance of each genome $s_i$ in $\mathcal{A}_3$.

For all three query types, CAMMiQ first identifies for each read $r_j$ all unique and doubly-unique substrings it includes; it then assigns $r_j$ to the one or two genomes from which these substrings possibly originate. To compute $\mathcal{A}_1$, CAMMiQ simply returns the collection of genomes receiving at least one read assignment. To compute $\mathcal{A}_2$, CAMMiQ solves a hitting set problem through an ILP, where genomes form the universe of items, and indexed strings that appear in query reads form the sets of items to be hit. To compute $\mathcal{A}_3$, CAMMiQ solves the combinatorial optimization problem that asks to minimize the variance among the number of reads assigned to each indexed substring of each genome, again through an ILP. The solution indicates the set of genomes in $\mathcal{A}_3$ along with their respective abundances.
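The first two query types can be sketched on a toy index as follows. This is an illustrative stand-in, not CAMMiQ's implementation: the hitting set is found by exhaustive search over genome subsets rather than by an ILP, substring lookup is a naive scan, and the names `query_a1`, `query_a2`, and the toy index are hypothetical.

```python
from itertools import combinations


def matched_owner_sets(index, reads):
    """Owner sets of all indexed substrings that occur in at least one read."""
    return [owners for sub, owners in index.items()
            if any(sub in read for read in reads)]


def query_a1(index, reads):
    """Genomes owning at least one *unique* indexed substring seen in the reads."""
    found = set()
    for owners in matched_owner_sets(index, reads):
        if len(owners) == 1:
            found |= owners
    return found


def query_a2(index, reads):
    """A smallest set of genomes that hits every matched substring.

    Exhaustive search by increasing subset size; only feasible for tiny
    inputs. Ties between equally small solutions are broken by sort order.
    """
    to_hit = matched_owner_sets(index, reads)
    universe = sorted(set().union(*to_hit)) if to_hit else []
    for size in range(len(universe) + 1):
        for candidate in combinations(universe, size):
            chosen = set(candidate)
            if all(chosen & owners for owners in to_hit):
                return chosen
    return set()


if __name__ == "__main__":
    # Toy index: substring -> genomes containing it (one or two per entry).
    toy_index = {
        "GTA": frozenset({"g1"}),
        "ACG": frozenset({"g1", "g2"}),
        "GTT": frozenset({"g2", "g3"}),
        "TTT": frozenset({"g2", "g3"}),
        "GGG": frozenset({"g3"}),
    }
    toy_reads = ["ACGTA", "CGTTT"]
    print(query_a1(toy_index, toy_reads))
    print(query_a2(toy_index, toy_reads))
```

Here the second read carries only doubly-unique substrings, so it contributes nothing to the first query type but forces the hitting set to include a second genome. An exact ILP solver would replace the exhaustive loop on realistic inputs.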

Datasets

To evaluate the overall performance of CAMMiQ, we have performed four sets of experiments, each with a distinct index dataset (all based on NCBI’s RefSeq database⁵⁶) and a distinct collection of queries.

(i) The first, species-level-all dataset is the most comprehensive index dataset, which includes one complete genome from each bacterial, viral and archaeal species from NCBI’s RefSeq database, resulting in a total of m = 16,418 genomes. This dataset is established for the CAMI and IMMSA repository used in recent benchmarking studies of metagenomics classification and profiling tools^9,57. There are 16 query sets from this repository used in these two studies, 8 from CAMI and 8 from IMMSA. Notably, both CAMI and IMMSA query sets include genomes that are not present in the species-level-all index dataset. In fact, the CAMI query sets include only a small proportion of genomes from the index dataset; the majority of the reads in these queries represent unknown species or simulated strains “evolved” from known species that are not in the species-level-all index dataset. See Supplementary Notes 5.4.1 and 5.4.2 for a detailed description of these queries. We used these query sets to demonstrate the comparative performance of CAMMiQ against the best performing methods according to ref. 9, namely Kraken2⁵⁸, KrakenUniq¹⁵, CLARK¹⁴, Centrifuge¹⁶, and Bracken¹⁷; please see Supplementary Note 6 for the specific parameters and setup used for each of these tools. Since genomes in the query sets may not be all included in the index dataset, we employed query type $\mathcal{A}_2$ to evaluate CAMMiQ’s performance against the aforementioned tools.
(ii) We compiled our next, species-level-bacteria index dataset to evaluate the species level performance of CAMMiQ, this time across one representative complete genome from each of the m = 4122 bacterial species from (an earlier version of) NCBI’s RefSeq. This index dataset enabled us to measure the performance of CAMMiQ’s type $\mathcal{A}_3$ queries against the tools mentioned above plus MetaPhlAn2¹², a marker-gene based profiling tool. We simulated 14 query sets for this experiment with varying levels of “difficulty” across the genomes. These include 10 challenging (marked Least) and 4 easier queries (marked Random). See Supplementary Note 5.4.3 and Supplementary Fig. 2 for a detailed description of these queries.
(iii) Our next strain-level index dataset is smaller: it includes the complete set of m = 614 human gut related bacterial strains from ref. 59 for the purpose of evaluating CAMMiQ’s strain level performance. We again employed type $\mathcal{A}_3$ queries of CAMMiQ to compare it against the above-mentioned tools. We simulated 4 queries for this index dataset with varying levels of “difficulty”. See Supplementary Note 5.5 for details.
(iv) We finally evaluated CAMMiQ on a dataset from another study⁶⁰ which involved metatranscriptomic reads from 262 single human immune cells (monocyte-derived dendritic cells, moDCs) deliberately infected with two distinct strains of the intracellular bacterium Salmonella enterica and 80 uninfected cells used as negative controls. A recent study²¹ applied the GATK PathSeq tool¹⁰ to these metatranscriptomic read sets to validate the presence of the genus Salmonella in each cell. To demonstrate CAMMiQ’s ability to distinguish cells infected with specific strains of Salmonella in time much faster than GATK PathSeq, we applied its query types $\mathcal{A}_1$ and $\mathcal{A}_2$ to these metatranscriptomic read sets. Since these are not genomic reads, our query type $\mathcal{A}_3$ could not be used. The index dataset we used for these queries is at the subspecies level; it consists of m = 3395 complete bacterial genomes, where each species is represented by a handful of strains. This index dataset was generated to reduce the sampling bias observed in the RefSeq database, which, e.g., includes more than 300 strains from the genus Salmonella. CAMMiQ’s accuracy was compared mainly against PathSeq (a mapping based, thus relatively slow method) for this experiment since PathSeq was the preferred method of the original study due to its high levels of accuracy. Further details on the real query sets can be found in Supplementary Note 5.6.

A summary of data sets used in our experiments can be found in Table 1. Additional details on the four index datasets can be found in Supplementary Notes 5.1–5.3. As will be demonstrated, CAMMiQ’s performance on these query sets is superior to all alternatives in almost all scenarios we tested.

Table 1 Synthetic and real bacterial read sets used to benchmark CAMMiQ’s performance against the best performing metagenomic classification and abundance estimation tools

Precision and recall in read classification across all species level queries

We tested CAMMiQ’s species level performance on both CAMI and IMMSA (i.e., species-level-all) and species-level-bacteria data sets, and compared it against the best performing alternatives according to ref. 9. Results based on CAMI and IMMSA are summarized in Table 2; results based on the species-level-bacteria data set are summarized in Table 3.

Table 2 Performance evaluation of CAMMiQ, Kraken2, KrakenUniq, CLARK, Centrifuge, and Bracken on CAMI and IMMSA benchmark queries against the species-level-all index dataset

Table 3 Performance evaluation of CAMMiQ, Kraken2, KrakenUniq, CLARK, Centrifuge, Bracken and MetaPhlAn2 on the 14 species-level-bacteria queries

Perhaps the most widely-used performance measures to benchmark metagenomic classifiers are the proportion of reads correctly assigned to a genome among (i) the set of reads assigned to some genome, i.e., precision, and (ii) the full set of reads in the query, i.e., recall¹⁴. In Table 2, panel A, as well as Table 3, panel A, we report the selected tools’ precision in read classification. Then, in Table 2, panel B, and Table 3, panel B, we report these tools’ recall in read classification.

Note that the above tables do not report the read classification precision and recall values for MetaPhlAn2. This is partially due to MetaPhlAn2’s use of an index based on a very different (predetermined) and much smaller database of marker genes. As a consequence, MetaPhlAn2 assigns very few reads to the marker genes in its database and thus appears to have very low recall (and possibly higher precision). This would not accurately reflect MetaPhlAn2’s performance since, unlike the other tools we benchmarked, MetaPhlAn2 does not aim to assign as many reads to genomes correctly but rather aims to identify distinct genomes in a metagenomic sample; see Supplementary Note 6 for details. Additionally note that for our bookkeeping purposes, any read assigned to a taxonomic level strictly higher than the species level by Kraken2, KrakenUniq, and Centrifuge is considered to be not assigned. This likely increases their reported precision but may decrease their recall.
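The two read classification measures defined above can be written down directly. The following is a minimal sketch under stated assumptions: the function name and the dictionary-based input format are hypothetical, and a read assigned above the species level is represented as `None`, following the bookkeeping rule that such reads count as unassigned.

```python
def read_classification_scores(truth, predicted):
    """Precision and recall in read classification.

    truth:     read id -> true species label, for every read in the query
    predicted: read id -> predicted species label, or None if the read was
               left unassigned (or assigned above the species level)
    """
    assigned = {r: s for r, s in predicted.items() if s is not None}
    correct = sum(1 for r, s in assigned.items() if truth.get(r) == s)
    # Precision: correct reads among the reads assigned to some genome.
    precision = correct / len(assigned) if assigned else 0.0
    # Recall: correct reads among all reads in the query.
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    truth = {"r1": "A", "r2": "A", "r3": "B", "r4": "B"}
    predicted = {"r1": "A", "r2": "B", "r3": None}  # r4 never reported
    print(read_classification_scores(truth, predicted))
```

Treating higher-rank assignments as `None` removes them from the precision denominator while leaving the recall denominator unchanged, which is why this convention can raise precision and lower recall.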

In all our species level tests, we used CAMMiQ’s default parameter settings of $L_{\min}=26$ and $L_{\max}=50$ to compare it against Kraken2, KrakenUniq, Bracken, CLARK, and Centrifuge, all using k-mer length of 26; see Supplementary Note 6 for details on parameter settings. Results based on alternative parameter settings can also be found in Supplementary Note 8 and in particular Supplementary Table 6. In all of these experiments, we used the same collection of genomes for establishing the index for each of the five tools (with the exception of MetaPhlAn2, which uses its own predetermined index): the results in Table 3 are based on our species-level-bacteria index dataset and the results in Table 2 are based on our species-level-all index dataset.

Compared with the species-level-bacteria queries, which are composed of highly similar genomes, the CAMI and IMMSA queries are, in principle, less challenging since reads that did not get mapped to a unique genome were excluded from these queries at the time they were compiled⁵⁷. Even though the RefSeq database has been significantly updated since these queries were compiled, almost all reads in these queries still map to a unique genome. Having said that, reads in these queries may originate from genomes outside of the species-level-all index dataset, including plasmids from these species that have not been indexed. It is entirely possible that such reads may include one or more unique or doubly-unique substring(s) indexed by CAMMiQ, and thus be assigned to the wrong genome.

As can be seen in Table 2, panel A, CAMMiQ offered the best precision in read classification for all IMMSA queries; interestingly, the precision values for Centrifuge were much lower than those of the alternatives. CAMMiQ was arguably the best on the recall in read classification on the IMMSA queries as well, as can be seen in Table 2, panel B. However, reads that originate from genomes outside of the index database were likely not utilized by CAMMiQ, reducing its comparative advantage against KrakenUniq and CLARK, which may still assign such reads to a genome; this would increase their recall, while possibly reducing their precision.

As can be seen in Table 3 and Supplementary Table 5, CAMMiQ achieved the best recall and F1 score (see Supplementary Note 7 for a definition), and the second best precision for the 11 species-level-bacteria query sets (the three queries with uneven coverage were excluded). Its precision and recall were particularly impressive for the first 7 challenging queries (labeled with prefix “Least”), where CAMMiQ was an order of magnitude better than the alternatives in terms of both measures. On these queries, tools other than CAMMiQ assigned only a small proportion of reads to genomes at the species level. This is because none of them employ doubly-unique substrings to differentiate species in the index dataset from the same genus. The only exception is Centrifuge, which achieved the best classification precision and the second best recall. For example, on the 3 hypothetically error-free queries (labels ending with -1 in Table 3), Centrifuge (in addition to CAMMiQ and CLARK) achieved 100% precision. However, Centrifuge’s classification performance deteriorated when genomes in queries were likely not present in the corresponding index dataset (Table 2, panels A and B). Note that in principle Kraken2, KrakenUniq and Centrifuge could assign reads to the correct taxonomy higher than the species level. However, as mentioned above, only reads that were assigned to the correct species were considered to be true positives for this benchmark.

Precision and recall in genome identification on IMMSA and CAMI queries

On the CAMI and IMMSA queries, CAMMiQ correctly identified more genomes than the alternative tools with the same abundance cutoff of 0.01% (we consider a genome to have been identified by a tool only if the tool reports its abundance to be ≥0.01% of the total abundance of all genomes), resulting in superior recall values for genome identification (Table 2, panel C; note that recall in genome identification represents the fraction of correctly identified genomes among all genomes in a query set). This is primarily due to CAMMiQ’s use of doubly-unique substrings in its query type $\mathcal{A}_2$. Compared to its recall performance, CAMMiQ achieved even better precision figures than the alternatives (Table 2, panel D; note that precision in genome identification represents the fraction of correctly identified genomes among the set of genomes identified by a given tool), due to fewer false positive identifications. The fact that CAMMiQ performs best particularly with respect to the precision values indicates that genomes not present in the index dataset would have the least impact on CAMMiQ in comparison to other tools. Note that by postprocessing the output of Kraken2, Bracken manages to improve on the number of identified genomes, and achieves comparable figures to CAMMiQ. However, it does not reduce the large number of false positive genome identifications produced by Kraken2 when unknown genomes or genomes outside the index dataset are present in the query sample.
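The genome identification measures with the 0.01% abundance cutoff can be sketched as below. The function name and input format are assumptions for illustration; the cutoff is applied to each genome's share of the total reported abundance, as described in the text.

```python
def genome_identification_scores(true_genomes, reported_abundance, cutoff=1e-4):
    """Precision and recall in genome identification.

    true_genomes:       set of genomes actually present in the query
    reported_abundance: genome -> abundance reported by a tool (any scale)
    cutoff:             minimum share of total abundance (1e-4 = 0.01%)
    """
    total = sum(reported_abundance.values())
    identified = set()
    if total > 0:
        identified = {g for g, a in reported_abundance.items()
                      if a / total >= cutoff}
    true_positives = len(identified & set(true_genomes))
    # Precision: correct genomes among those the tool identified.
    precision = true_positives / len(identified) if identified else 0.0
    # Recall: correct genomes among all genomes in the query.
    recall = true_positives / len(true_genomes) if true_genomes else 0.0
    return identified, precision, recall


if __name__ == "__main__":
    truth = {"A", "B", "C"}
    # Hypothetical report: D falls below the 0.01% cutoff and is ignored.
    report = {"A": 60000, "B": 39995, "D": 5}
    print(genome_identification_scores(truth, report))
```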

Genome identification and quantification performance on species-level-bacteria queries

Next, we evaluated the number of correctly identified genomes by each tool (specific to MetaPhlAn2, the genus corresponding to each genome), as well as the L1 and L2 distances between the true abundance profile and the predicted abundance profile, on the 14 queries involving our species-level-bacteria dataset, including the 3 queries with GC bias. As can be seen in Table 3, panels C–E, and Supplementary Table 5, CAMMiQ clearly offered the best performance in both identification and quantification. It correctly identified all genomes present in each one of the 14 queries and was not impacted by uneven read coverage or by the non-indexed genome we added to the query Random-20-lognormal-a.g. Importantly, CAMMiQ consistently returned very few false positive genomes for the most challenging queries, and at most one false positive genome for the remaining 4 queries.

Compared to CAMMiQ, other tools reported a larger number of false negatives in these 14 queries (again we consider a genome to be a “negative” if its reported abundance level is below 0.01% of the total abundance of all genomes), in particular in the 10 challenging queries (labeled with the prefix “Least”) with minimal unique substrings (i.e., L-mers). Among them, CLARK and Centrifuge offered the best false negative performance, especially on error free queries. As can be expected, MetaPhlAn2 had the worst performance with respect to false negatives, very likely due to the incompleteness of its marker gene list (we used the latest set of marker genes mpa_v20_m200 in MetaPhlAn2). This also led to relatively larger L1/L2 distances than the other tools, even for the remaining 4 (easier) queries. Kraken2 and KrakenUniq were also prone to having false negatives, though fewer than MetaPhlAn2. Bracken, in general, could correctly identify a few more genomes than Kraken2, and this improvement in its identification performance also leads to better quantification results (see below).

CAMMiQ performs even better with respect to the number of false positive genomes, as demonstrated by its F1 score distribution (see Supplementary Table 5). The alternative tools all returned a large number of false positives in species-level-bacteria queries, especially in the first 10 challenging queries, even though all reads in these queries were sampled from (some genome in) the index dataset (see Table 3, panel C). Among them, Centrifuge and Bracken usually performed better on the 10 challenging queries with fewer ‘unique’ genomes, while KrakenUniq and CLARK performed better on the remaining 4 (easier) queries. Kraken2 showed the worst performance with respect to false positives: it output more than a third of the genomes from the index dataset even for the three error free queries. In many of the datasets, these false positives were eliminated by Bracken’s postprocessing of Kraken2’s output; unfortunately, in other query datasets, e.g., Random-20-uniform and Random-100-uniform, Bracken introduced additional false positives. MetaPhlAn2 identified only a limited number of genomes (and few true positive genomes) in all queries in general, so it had a comparable performance to CAMMiQ with respect to false positives. However, its F1 scores were not as good as CAMMiQ’s (see Supplementary Table 5).

Note that CAMMiQ not only correctly identified all genomes, but also predicted their abundances reasonably close to the true values. As can be seen in Table 3, CAMMiQ outperformed all other tools on both L1 and L2 errors, typically offering a 3 to 4× improvement over the second best alternative. Interestingly, even when the coverage across each genome was non-uniform, CAMMiQ’s $\mathcal{A}_{3}$ type of query was only mildly impacted. As noted earlier, on the 10 challenging queries (especially those with sequencing errors), all alternative tools except MetaPhlAn2 output hundreds of false positive genomes. As a consequence, their predictions for the abundances of the true positive genomes were smaller than the true abundance values. This is particularly the case for Kraken2 and KrakenUniq: even though they identified the majority of the true positive genomes correctly, their reported abundance values were all close to 0, which results in L1 distances very close to 1.
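The evaluation quantities used in this comparison can be illustrated with a minimal sketch. It assumes that the L1 and L2 distances are the sum of absolute differences and the Euclidean distance between the true and predicted abundance vectors, and that a genome is called present when its reported abundance exceeds 0.01% of the total reported abundance; the function names and the toy abundances are hypothetical and are not taken from the CAMMiQ code base.

```python
import math

def l1_l2_distance(true_ab, pred_ab):
    """L1 and L2 distances between two abundance dictionaries (genome -> abundance)."""
    genomes = set(true_ab) | set(pred_ab)
    diffs = [true_ab.get(g, 0.0) - pred_ab.get(g, 0.0) for g in genomes]
    l1 = sum(abs(d) for d in diffs)
    l2 = math.sqrt(sum(d * d for d in diffs))
    return l1, l2

def called_genomes(pred_ab, threshold=1e-4):
    """Genomes whose abundance exceeds `threshold` (0.01%) of the total reported abundance."""
    total = sum(pred_ab.values())
    if total <= 0:
        return set()
    return {g for g, a in pred_ab.items() if a / total > threshold}

def f1_score(true_set, called):
    """Harmonic mean of precision and recall over genome calls."""
    tp = len(true_set & called)
    if tp == 0:
        return 0.0
    precision = tp / len(called)
    recall = tp / len(true_set)
    return 2 * precision * recall / (precision + recall)

# Toy example: genome B is reported far below the 0.01% threshold,
# and genome C is a false positive.
true_ab = {"A": 0.5, "B": 0.5}
pred_ab = {"A": 0.5, "B": 0.00001, "C": 0.49999}
called = called_genomes(pred_ab)
l1, l2 = l1_l2_distance(true_ab, pred_ab)
f1 = f1_score(set(true_ab), called)
```

In this toy case B counts as a false negative and C as a false positive, so both precision and recall are 1/2.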

Evaluation of computational resources on species level queries

We compared the running time and memory usage of CAMMiQ, Kraken2/Bracken, KrakenUniq, CLARK, MetaPhlAn2, and Centrifuge in building the index and responding to the queries; see Table 4. As can be seen, CAMMiQ performs better than all alternatives in running time, including those tools that aim to index all unique k-mers (KrakenUniq and CLARK) and all substrings (Centrifuge), with respect to both query time and index construction time. The only exception is Kraken2 (MetaPhlAn2 uses a pre-built index and so cannot be compared against the others with respect to index construction time); however, Kraken2’s overall accuracy is worse than the others across the species-level queries. Since MetaPhlAn2 uses a pre-built index (see Supplementary Note 6), it avoids the expensive index construction process. This, however, results in many false negatives (see Subsections Genome identification and quantification performance on species-level-bacteria queries and Performance of CAMMiQ at the strain level). CAMMiQ also supports pre-built indices. Compared to the other tools and methods, the sizes of these pre-built indices are much smaller (Table 4, Panel B), due to the sparsification of unique and doubly-unique substrings, allowing convenient transfer and fast downloading. Note that we do not report the time for loading the index into memory for any of the tools, since this is performed only once.

Table 4 Comparison of the running times required by CAMMiQ, Kraken2, KrakenUniq, CLARK, Centrifuge, Bracken, and MetaPhlAn2


All of our experiments were run on a Linux server equipped with 40 Intel Xeon E7-8891 2.80 GHz processors, with 2.5 TB of physical memory and 30 TB of disk space. The ILP solver used by CAMMiQ in the initial implementation is IBM ILOG CPLEX 12.9.0. We have also ported the code to use the ILP solver Gurobi 9.1.0.

Assessing the use of variable-length and doubly-unique substrings in species-level-bacteria queries

Due to its unique algorithmic features, CAMMiQ outperforms available alternatives on the CAMI, IMMSA and the species-level-bacteria query sets. A key question is: what is the maximum potential improvement in performance one can expect through the use of (i) variable-length substrings as opposed to fixed-length k-mers, and (ii) doubly-unique substrings in addition to unique substrings? Here, we evaluate both of these algorithmic features in the context of the species-level-bacteria dataset we constructed (see Table 1). For that, we compare the proportion of L-mers (for read length L = 100) from each genome s_i in our species-level-bacteria index dataset that are unique or doubly-unique (and thus are utilized by CAMMiQ) with the proportion of L-mers that include a unique k-mer (and thus can be utilized by CLARK and others) for k = 30.

Figure 1a summarizes our findings: on the horizontal axis, the genomes are sorted with respect to the proportion of unique and doubly-unique L-mers they have; the vertical axis depicts this proportion (from 0.0 to 1.0). The figure shows the proportion of unique L-mers, doubly-unique L-mers, the combination of unique and doubly-unique L-mers (all utilized by CAMMiQ), as well as the L-mers that include a unique k-mer (utilized by, e.g., CLARK) for each genome depicted on the horizontal axis. As can be seen, roughly three quarters of all genomes in this dataset are easily distinguishable since a large fraction of their L-mers include a unique k-mer. However, about a quarter of the genomes in this dataset can benefit from the consideration of doubly-unique substrings, especially when their abundances are low. In particular, 66 of these 4122 genomes/species have extremely low proportions (each ≤1%) of unique 100-mers. At the extreme, the species Francisella sp. MA06-7296 does not have a single unique 100-mer and the species Rhizobium sp. N6212 does not have any 100-mer that includes a unique 30-mer (or, in fact, a unique substring of any length $\le L_{\max}=50$). These two species cannot be identified by, e.g., CLARK in any microbial mixture, regardless of their abundance values.

**Fig. 1: The advantage of using variable-length and doubly-unique substrings.**

Figure 1b depicts the inverse proportionality of doubly-unique L-mers in comparison to unique L-mers among 50 genomes that have the lowest proportion of unique L-mers - for L = 100. The inverse-proportionality of unique or doubly-unique L-mers for a genome corresponds to the number of reads to be sampled (on average) from that genome to guarantee that the sample includes one read that would be assigned to the correct genome. In the absence of read errors, this guarantees correct identification of the corresponding genome in the query. Note that, in half of these 50 genomes, almost all L-mers are doubly-unique. This implies that any query involving one or more of these genomes could only be resolved by CAMMiQ and no other tool.
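The relation between the proportion of informative L-mers and the number of reads needed can be made concrete with a small calculation. The sketch below assumes error-free reads drawn independently and uniformly at random from the genome, so that the number of reads until the first informative one follows a geometric distribution; the function names are hypothetical.

```python
import math

def expected_reads_until_hit(informative_fraction):
    """Mean number of reads sampled until one contains a unique or doubly-unique L-mer."""
    if not 0.0 < informative_fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / informative_fraction

def reads_for_confidence(informative_fraction, confidence=0.95):
    """Smallest n such that at least one of n reads is informative with the given probability."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    if informative_fraction >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - informative_fraction))

# A genome in which only 1% of the L-mers are unique or doubly-unique.
mean_reads = expected_reads_until_hit(0.01)
n95 = reads_for_confidence(0.01, 0.95)
```

Under these assumptions, a genome with 1% informative L-mers needs 100 reads on average before one of them is informative, and noticeably more if a high probability of success is required.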

We further assessed whether the usage of unique and doubly-unique substrings can lead to robust genome identification and quantification performance in practice, by evaluating the distribution of these substrings across the genome. In principle, the more evenly these substrings are distributed across a genome, the less likely CAMMiQ’s quantification performance can be impacted by queries composed of genomes with small alterations to the corresponding index genomes. As can be seen in Fig. 1c, d, unique and doubly unique substrings span the entire genome on most of the species in our species-level-bacteria index dataset, not significantly biased towards any functionally annotated region by NCBI (i.e., gene, CDS, ncRNA, rRNA, tRNA, tmRNA or plasmid). Even when the numbers of unique or doubly-unique substrings are relatively small in a genome (for example, the last 3 genomes in Fig. 1d), they are still well distributed, helping CAMMiQ with that genome’s identification as well as quantification. We would like to note here that even though some genomes have very few unique substrings, implying that they would be difficult to identify through the use of alternative methods, because of their (well distributed) doubly-unique substrings, CAMMiQ can identify and quantify them accurately. Consider, for example, the last genome in Fig. 1d, Rhizobium sp. N1341 in which the only unique substrings are located on the plasmids. However, since there are sufficiently many doubly unique substrings on the chromosome, this species could still be identified and quantified by CAMMiQ, through the ${{{{{{{{\mathcal{A}}}}}}}}}_{2}$ or ${{{{{{{{\mathcal{A}}}}}}}}}_{3}$ type of query.

Performance of CAMMiQ at the strain level

In the next experiment, we evaluated CAMMiQ’s performance (with default parameters) on queries composed from our strain-level dataset that consists of 614 Human Gut related genomes of bacterial strains from 409 species⁵⁹ as described in Supplementary Note 5.2. As can be seen in Table 5, CAMMiQ managed to identify and accurately quantify all strains in the queries HumanGut-random-100-1 and HumanGut-random-100-2, and > 96% of the strains in the other two queries, with almost no false positives. The other tools benchmarked against CAMMiQ led to either more false negative genomes (KrakenUniq, CLARK, MetaPhlAn2) or more false positive identifications (Kraken2, Centrifuge). Furthermore, their quantification performance (Table 5, panel B) is worse than CAMMiQ’s.

Table 5 CAMMiQ’s strain level performance compared to Kraken2, KrakenUniq, CLARK, Centrifuge, and MetaPhlAn2, on the four strain-level queries


Performance of CAMMiQ on real single-cell metatranscriptomic queries

Our final set of experiments involves “real” metatranscriptomic reads from human monocyte-derived dendritic cells (moDCs)⁶⁰. Because CAMMiQ’s most powerful type $\mathcal{A}_{3}$ query is not suitable for RNA-seq data (due to high variance in read coverage), we employed $\mathcal{A}_{1}$ and $\mathcal{A}_{2}$ queries. We remind the reader that $\mathcal{A}_{1}$ only uses unique substrings in query reads and returns the genomes in the index for which there is at least one such substring. On the other hand, $\mathcal{A}_{2}$ computes the smallest set of genomes in the index that include all unique or doubly-unique substrings across the query reads.

Each query was composed of all high quality, non-human scRNA-seq reads from the corresponding single cell⁶⁰. To guarantee this, we filtered out all scRNA-seq reads that (i) possibly originate from the human genome, or (ii) have low sequence quality and “complexity”, or (iii) map to 16S or 23S ribosomal RNAs on the two Salmonella genomes (to avoid incorrect assignment of reads due to “barcode hopping”).

Following the original study⁶⁰, we categorized each cell into one of the 5 groups: infected cells that were confirmed to contain (1) STM-LT2 or (2) STM-D23580 strain of intracellular Salmonella; bystander cells that were exposed to (3) STM-LT2 or (4) STM-D23580 strains, but confirmed to not contain intracellular Salmonella; and (5) cells that were mock-infected and sequenced as controls. For each query, we compared the number of reads CAMMiQ assigned uniquely to STM-LT2 or STM-D23580 genomes against those aligned and assigned either by the GATK PathSeq¹⁰ tool or blastn¹¹ (see Supplementary Note 9).

Figure 2 summarizes our results on this data set. In Fig. 2a, we demonstrate that compared to the GATK PathSeq approach, CAMMiQ’s $\mathcal{A}_{1}$ type queries were more sensitive with respect to read assignment. On average, CAMMiQ identified (roughly) an order of magnitude more unique STM-LT2 or STM-D23580 reads in each cell, demonstrating its potential to better identify intracellular organisms at subspecies or strain level. Note that CAMMiQ’s performance is comparable to or slightly better than that of blastn. However, CAMMiQ is several orders of magnitude faster than blastn or GATK PathSeq. (CAMMiQ only took a total of 65.3 s for computing $\mathcal{A}_{1}$ type queries and an additional 2.5 s for computing $\mathcal{A}_{2}$ type queries on the entire query set, outperforming GATK PathSeq, which required 29628.1 s, or blastn, which is typically slower.)

**Fig. 2: CAMMiQ performance on the filtered-scRNA-seq queries.**

The abundances reported by each of the three tools (measured by unique read counts) of Salmonella were substantially higher in the infected cells compared to the mock-infected controls. More importantly, cells known to be infected with or exposed to a particular strain indeed include significantly more reads from that strain. Interestingly, CAMMiQ as well as blastn reported that cells infected with or exposed to a particular strain also contain reads unique to the other strain. This is possibly due to sequencing errors or incorrect cell assignments for these reads.

In Fig. 2b, we compare CAMMiQ’s $\mathcal{A}_{2}$ type queries with its $\mathcal{A}_{1}$ type queries (as well as GATK PathSeq and blastn) with respect to the number of cells they correctly identify to include STM-LT2 or STM-D23580 strains. For that, we vary the minimum number of reads that need to be identified by each tool to report a given strain, and for each such value we indicate how many cells are reported to include the STM-LT2 strain (on the vertical axis) vs the STM-D23580 strain (on the horizontal axis). With the exception of the third subpanel, a method with a plot closer to the diagonal is less sensitive. As can be seen, CAMMiQ’s $\mathcal{A}_{2}$ type queries are more sensitive than not only its $\mathcal{A}_{1}$ type queries but also GATK PathSeq and blastn. However, they also introduce some potential false positive calls (e.g., in the third subpanel, corresponding to the controls). This could be due to additional reads utilized by $\mathcal{A}_{2}$ queries being impacted by read errors or by incorrect assignments of these reads to cells.

Discussion

We have introduced CAMMiQ, a new computational tool to identify microbes in an HTS sample and to estimate the abundance of each species or strain. CAMMiQ is based on a principled approach that starts by formally defining the following algorithmic problem, which has not been fully addressed by any available method: given a set $\mathcal{S}$ of distinct genomic sequences of any taxonomic rank, build a data structure so as to identify and quantify genomes in any query composed of a mixture of reads from a subset of $\mathcal{S}$. CAMMiQ is particularly designed to handle genomes that lack unique features; for that, it reduces the aforementioned identification and quantification problems to a combinatorial optimization problem that assigns substrings with limited ambiguity (i.e., doubly-unique substrings) to genomes so that, in its most general $\mathcal{A}_{3}$ type query, each genome is “uniformly covered”. Uniform coverage is a simplifying assumption we employ in our theoretical analysis since which genomes are represented in a query is not known in advance. In practice, the coverage for genomic sequences might be biased by GC content^61,62. We do not employ this assumption in the CAMMiQ implementation for $\mathcal{A}_{1}$ and $\mathcal{A}_{2}$ type queries, which are more suitable for transcriptomic sequences. Our experiments on the Salmonella scRNAseq dataset indeed show that CAMMiQ delivers good results on scRNAseq queries even though the reads are skewed by variable expression and the selection biases of single-cell technology. Because each such substring has limited ambiguity, the resulting combinatorial optimization problem can be efficiently solved through the existing integer program solvers IBM CPLEX and Gurobi.

One potential limitation of CAMMiQ is that it relies on a database of reference genomes. In the context of medical microbiology this is a reasonable assumption, since virtually all clinically relevant microbes detected in new patients are known and have some similar genome sequenced and in RefSeq. The reliance on a reference database is more problematic in the context of studying environmental samples, in which new and rare taxa might be found by methods that do not rely on reference genomes. Our results on the CAMI benchmark data set provide reassurance that CAMMiQ performs well even when many genomes and plasmids are absent from the reference database. Another potential limitation is that the memory required by CAMMiQ index construction is relatively high. However, CAMMiQ supports pre-built indices on commonly used databases for metagenomic studies, e.g., (the latest version of) the RefSeq bacteria, viruses and archaea database. Compared to the other tools and methods, the sizes of these pre-built indices are much smaller, due to the sparsification of unique and doubly-unique substrings, allowing convenient transfer and fast downloading. The pre-built CAMMiQ indices for all index datasets are available via the GitHub link provided in the Code Availability statement. In addition, as shown for the experiments summarized in Table 4, the memory requirements for CAMMiQ queries are comparable to those of other widely used packages and within the capabilities of currently available computers.

Provided that the doubly-unique substrings of a given genome are not all shared with one other genome, the use of doubly-unique substrings increases CAMMiQ’s ability to identify and quantify this genome within a query. In case the dataset to be indexed involves several genomes with high levels of similarity, CAMMiQ’s data structure and its combinatorial optimization formulation could be generalized to include “triply” or “quadruply” unique substrings, but this is not yet implemented. In summary, using principled methods from combinatorial optimization and string algorithms, CAMMiQ delivers better sensitivity and specificity than widely used existing methods on practical genome classification and quantification problems.

Methods

The input to CAMMiQ is a set of m genomes ${{{{{{{\mathcal{S}}}}}}}}={\{{s}_{i}\}}_{i=1}^{m}$, possibly but not necessarily all from the same taxonomic level (each genome here may be associated with a genus, species, subspecies, or strain), to be indexed. Although we describe CAMMiQ for the case where each ${s}_{i}\in {{{{{{{\mathcal{S}}}}}}}}$ is a single string, we do not assume that the genomes are fully assembled into a single contig. The string representing a genome could simply be a concatenation of all contigs from genome s_i and their reverse complements, with a special symbol $_i between consecutive contigs. We call ${{{{{{{\mathcal{S}}}}}}}}$ the input database or synonymously index dataset, and we call i ∈ {1, ⋯ , m} the genome ID of string s_i.
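The string representation of a genome described above can be sketched as follows. This is an illustration only: the function names are hypothetical, a plain "$" stands in for the genome-specific separator $_i, and placing the separator between a contig and its reverse complement (as well as between consecutive contigs) is an assumption made here for simplicity.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def genome_string(contigs, sep="$"):
    """Concatenate each contig and its reverse complement, separated by `sep`."""
    parts = []
    for contig in contigs:
        parts.append(contig)
        parts.append(reverse_complement(contig))
    return sep.join(parts)

# Two toy contigs of one genome.
s_i = genome_string(["ACG", "TT"])
```

Because the separator does not occur in any read, no indexed substring can span two contigs.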

A query or query set for CAMMiQ contains a set of reads ${{{{{{{\mathcal{Q}}}}}}}}={\{{r}_{j}\}}_{j=1}^{n}$ representing a metagenomic mixture. For simplicity, we describe CAMMiQ for reads of homogeneous length L; however, our data structure can handle reads of varying length. Given ${{{{{{{\mathcal{Q}}}}}}}}$, the goal of CAMMiQ is to identify a set of genomes ${{{{{{{\mathcal{A}}}}}}}}=\{{s}_{1},\cdots \,,{s}_{a}\}\subset {{{{{{{\mathcal{S}}}}}}}}$ and their respective abundances p₁, ⋯ , p_a that “best explain” ${{{{{{{\mathcal{Q}}}}}}}}$. This is achieved by assigning (selected) reads r_j to genomes s_i such that the implied coverage of each genome ${s}_{i}\in {{{{{{{\mathcal{A}}}}}}}}$ is (roughly) uniform across s_i, with p_i as the mean.

CAMMiQ’s index data structure involves the collection of shortest unique substrings and shortest doubly-unique substrings on each genome s_i in ${{{{{{{\mathcal{S}}}}}}}}$. We call a substring of s_i unique if it does not occur on any other genome s_j ≠ s_i in ${{{{{{{\mathcal{S}}}}}}}}$; a shortest unique substring is a unique substring that does not include another unique substring. Similarly, we call a substring of s_i doubly-unique if it occurs on exactly one other genome ${s}_{j}\ne {s}_{i}\in {{{{{{{\mathcal{S}}}}}}}}$; a shortest doubly-unique substring is a doubly-unique substring that does not include another doubly-unique substring. See Supplementary Note 1 for a formal definition for the uniqueness of a substring and Supplementary Fig. 1 for a graphical illustration. CAMMiQ does not maintain the entire collection of shortest unique and doubly-unique substrings of genomes in ${{{{{{{\mathcal{S}}}}}}}}$; instead, its index contains only a sparsified set of shortest unique and doubly-unique substrings of each ${s}_{i}\in {{{{{{{\mathcal{S}}}}}}}}$ so that no unique and doubly-unique substring is in close proximity (i.e., within a read length) of another in s_i. See Section CAMMiQ Index and Supplementary Note 2 for how exactly CAMMiQ sparsifies the collection of shortest unique and doubly-unique substrings.
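The definitions of shortest unique and shortest doubly-unique substrings can be checked on toy genomes with a brute-force sketch. It is not the linear-time algorithm of Supplementary Note 1: it ignores reverse complements and the length bounds, and the function names and toy genomes are hypothetical. It relies on the observation that the set of genomes containing a string can only grow when the string is shortened, so minimality only needs to be tested against the two substrings obtained by dropping the first or the last character.

```python
def genomes_containing(sub, genomes):
    """IDs of all genomes in which `sub` occurs."""
    return {i for i, g in enumerate(genomes) if sub in g}

def shortest_substrings(genomes, i, n_genomes):
    """Substrings of genomes[i] that occur in exactly `n_genomes` genomes
    (1 = unique, 2 = doubly-unique) and contain no shorter substring with
    the same property."""
    s = genomes[i]

    def qualifies(sub):
        return len(genomes_containing(sub, genomes)) == n_genomes

    result = set()
    for start in range(len(s)):
        for end in range(start + 1, len(s) + 1):
            sub = s[start:end]
            if not qualifies(sub):
                continue
            # Minimality: neither the suffix nor the prefix of length |sub| - 1 qualifies.
            if len(sub) > 1 and (qualifies(sub[1:]) or qualifies(sub[:-1])):
                continue
            result.add(sub)
    return result

genomes = ["ACGTA", "ACGGA", "TCGGA"]
unique_0 = shortest_substrings(genomes, 0, 1)         # occur only in genome 0
doubly_unique_0 = shortest_substrings(genomes, 0, 2)  # occur in genome 0 and one other
```

For genome 0, "GT" and "TA" occur nowhere else, while "T" is shared only with genome 2 and "AC" only with genome 1.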

With the (sparsified) collection of shortest unique and doubly-unique substrings, CAMMiQ is sufficiently powerful to answer the following three types of queries. The simplest type of query only involves unique substrings: given a query set $\mathcal{Q}$, it asks for the set of genomes $\mathcal{A}_{1}\subseteq \mathcal{S}$ so that each includes at least one (shortest) unique substring that also occurs in some read r_j in the query $\mathcal{Q}$. The second, more general query type involves both unique and doubly-unique substrings. It asks to compute $\mathcal{A}_{2}\subseteq \mathcal{S}$, the smallest subset of genomes in $\mathcal{S}$ that includes all (shortest) unique and doubly-unique substrings that also occur in some read $r_{j}\in \mathcal{Q}$. Finally, the third and most general type of query asks to compute the smallest subset $\mathcal{A}_{3}$ of $\mathcal{S}$ that again includes all (shortest) unique and doubly-unique substrings that also occur in some read $r_{j}\in \mathcal{Q}$, with the additional constraint that the “coverage” of these substrings in each genome $s_{i}\in \mathcal{A}_{3}$ is roughly uniform. In addition to the set of genomes $\mathcal{A}_{3}$, the query also asks to compute the relative abundance of each genome s_i in $\mathcal{A}_{3}$.

CAMMiQ with its ability to efficiently answer all three queries described above has several advantages over existing methods that rely on fixed-length unique substrings (i.e., unique k-mers). (i) Notice that the shorter a unique substring is the more likely it will be sampled (i.e., present in a read sampled from the relevant genome). This is because a substring of length $L^{\prime} < L$ is included in $L-L^{\prime}+1$ potential reads of length L that could be sampled from a genome. Unfortunately, the shorter a substring is, the less likely that it is unique or doubly-unique. A method that uses fixed length k-mers needs to have a compromise between the number of unique substrings and the likelihood of sampling each. CAMMiQ gets around this limitation by utilizing unique substrings of any length. CAMMiQ features a lower bound ${L}_{\min }$ and upper bound ${L}_{\max }$ on the lengths of unique and doubly-unique substrings as explained below. (ii) Unique substrings are relatively rare, at least for certain genomes and taxa, but substrings that appear in many genomes provide very limited information about the composition of a query ${{{{{{{\mathcal{Q}}}}}}}}$. By involving doubly-unique substrings in a query ${{{{{{{\mathcal{Q}}}}}}}}$, the subset of genomes that could be identified through query ${{{{{{{{\mathcal{A}}}}}}}}}_{2}$ would be larger and more accurate than those that could be identified through query ${{{{{{{{\mathcal{A}}}}}}}}}_{1}$, especially in the extreme case where ${{{{{{{\mathcal{Q}}}}}}}}$ includes highly similar genomes that do not include any unique substring. (iii) Finally, by introducing the “uniform coverage” constraint, CAMMiQ’s ${{{{{{{{\mathcal{A}}}}}}}}}_{3}$ type of query can identify more accurately the genome(s) where a doubly-unique substring originates. This is because a query of type ${{{{{{{{\mathcal{A}}}}}}}}}_{2}$ may result in significant differences in coverage between unique and doubly-unique substrings of a given genome.

As mentioned above, CAMMiQ builds an index for the sparsified sets of shortest unique and doubly-unique substrings to compute efficiently the sets $\mathcal{A}_{1}$, $\mathcal{A}_{2}$ and $\mathcal{A}_{3}$. For all three query types, CAMMiQ first identifies for each read r_j all unique and doubly-unique substrings it includes; it then assigns r_j to the one or two genomes from which these substrings can originate. To compute $\mathcal{A}_{1}$, CAMMiQ can simply return the collection of genomes receiving at least one read assignment. To compute $\mathcal{A}_{2}$, CAMMiQ needs to solve instances of the NP-hard set cover problem, or more precisely, its dual, the hitting set problem, where genomes form the universe of items, and indexed strings that appear in query reads form the sets of items to be hit. Even though this is a restricted version of the hitting set problem where each set to be hit contains at most two items, it is still NP-hard by a reduction from the vertex cover problem. To compute $\mathcal{A}_{3}$, CAMMiQ solves the combinatorial optimization problem that asks to minimize the variance among the number of reads assigned to each indexed substring of each genome; the solution indicates the set of genomes in $\mathcal{A}_{3}$ along with their respective abundances.

Details on the composition as well as the construction process for CAMMiQ’s index are discussed in Section CAMMiQ Index, as well as Supplementary Notes 1 and 2. The two stages in query processing of CAMMiQ are discussed in Subsections Query processing stage 1: Preprocessing the Reads and Query processing stage 2: ILP formulation. The first stage assigns reads to specific genomes, which is sufficient for computing sets $\mathcal{A}_{1}$ and $\mathcal{A}_{2}$. See Section Query processing stage 1: Preprocessing the Reads and Supplementary Note 3 for the criteria we use to assign a read to a genome, based on the indexed substrings that the read includes. The second stage introduces the combinatorial optimization formulation to compute $\mathcal{A}_{3}$ as a response to the most general query type. See Section Query processing stage 2: ILP formulation for details.

CAMMiQ Index

To respond to all three types of queries described above, CAMMiQ identifies all unique and doubly-unique substrings of the genomes in $\mathcal{S}$ and organizes them in a simple but efficient data structure. Specifically, CAMMiQ computes the complete set of shortest unique substrings, $\mathcal{U}=\cup_{i=1}^{m}\mathcal{U}_{i}$, and the set of shortest doubly-unique substrings, $\mathcal{D}=\cup_{i=1}^{m}\mathcal{D}_{i}$, where $\mathcal{U}_{i}$ and $\mathcal{D}_{i}$ respectively denote the complete set of shortest unique and doubly-unique substrings from genome s_i, whose lengths are within the range $[L_{\min}, L_{\max}]$ with $L_{\max}\le L$. See Supplementary Note 1 for a linear time algorithm to build both $\mathcal{U}$ and $\mathcal{D}$. CAMMiQ then sparsifies $\mathcal{U}$ and $\mathcal{D}$ by selecting only one representative substring among those that are in close proximity in each genome, and discarding the rest; this sparsification step is described in detail below and in Supplementary Note 2. Finally, it builds a collection of tries (trees where the root node represents a substring of length $L_{\min}$ and every other internal node represents a single character) to compactly represent and efficiently search for substrings in $\mathcal{U}$ and $\mathcal{D}$.

Determining ${L}_{\max }$ and ${L}_{\min }$

In general, as the value of $L_{\max}$ increases, so do the numbers of unique and doubly-unique substrings to be considered by CAMMiQ, potentially increasing its sensitivity. However, query type $\mathcal{A}_{3}$ relies on the read coverage for each unique and doubly-unique substring of each genome; the higher the coverage the better. The read coverage for a unique substring of length ⌊L − L/Δ⌋, for some constant Δ > 1, would roughly be 1/Δ-th of the read coverage (of a single nucleotide) of the respective genome. The best tradeoff between these two objectives, i.e., substring length, ~ (1 − 1/Δ), and coverage, ~ 1/Δ, can be achieved by maximizing their product, i.e., (1 − 1/Δ)/Δ, which is achieved at Δ = 2. This suggests choosing $L_{\max}=L/2$.
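The choice Δ = 2 can be verified with a short calculation, treating Δ as a continuous variable (a simplifying assumption made here for illustration). A substring of length $L'$ is contained in $L - L' + 1$ of the reads of length L, which for $L' = \lfloor L - L/\Delta \rfloor$ is roughly $L/\Delta$, i.e., a fraction 1/Δ of the L reads covering a single nucleotide. Writing f for the product of relative substring length and relative coverage:

$$
f(\Delta) = \left(1 - \frac{1}{\Delta}\right)\frac{1}{\Delta} = \frac{1}{\Delta} - \frac{1}{\Delta^{2}}, \qquad
f'(\Delta) = -\frac{1}{\Delta^{2}} + \frac{2}{\Delta^{3}} = 0 \iff \Delta = 2, \qquad
f''(2) = \frac{2}{2^{3}} - \frac{6}{2^{4}} = -\frac{1}{8} < 0 .
$$

Hence f attains its maximum value f(2) = 1/4 at Δ = 2. For reads of length L = 100 this gives $L_{\max} = 50$.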

A shortest unique substring u, by definition, differs from (at least) one other substring $u^{\prime}$ by just one nucleotide. The shorter u gets, the more likely a read error impacting $u^{\prime}$ would modify it to u, leading to false positives. We have experimentally observed that unique substrings of length ≤25 could lead to false positives that impact the performance of CAMMiQ; as a consequence, we set the default value of ${L}_{\min }$ to 26.

Sparsifying unique substrings

Let ${{{{{{{{\mathcal{U}}}}}}}}}_{i}$ be the collection of all unique substrings on genome s_i. To reduce the index size, CAMMiQ aims to compute a subset ${{{{{{{\mathcal{U}}}}_i}}}}^{\prime}$ of ${{{{{{{{\mathcal{U}}}}}}}}}_{i}$, consisting of the minimum number of shortest unique substrings such that every unique substring of length L (i.e., unique L-mer) on s_i includes one substring from ${{{{{{{\mathcal{U}}}}_i}}}}^{\prime}$. Independently, CAMMiQ also aims to compute a subset ${{{{{{{\mathcal{D}}}}_i}}}}^{\prime}$ of ${{{{{{{{\mathcal{D}}}}}}}}}_{i}$, consisting of the minimum number of shortest doubly-unique substrings such that every doubly-unique substring of length L (i.e., doubly-unique L-mer) on s_i includes one substring from ${{{{{{{\mathcal{D}}}}_i}}}}^{\prime}$. This is all done by greedily maintaining only the rightmost shortest unique or doubly-unique substring in a sliding window of length L on a genome in ${{{{{{{\mathcal{S}}}}}}}}$. In the remainder of the paper, we denote the number of unique substrings in subset ${{{{{{{\mathcal{U}}}}_i}}}}^{\prime}$ by nu_i ($=|{{{{{{{\mathcal{U}}}}_i}}}}^{\prime}|$) and the number of doubly-unique substrings in subset ${{{{{{{\mathcal{D}}}}_i}}}}^{\prime}$ by nd_i ($=|{{{{{{{\mathcal{D}}}}_i}}}}^{\prime}|$); we denote the number of unique L-mers on s_i by $n{u}_{i}^{L}$ and respectively the number of doubly-unique L-mers on s_i by $n{d}_{i}^{L}$. As we prove in Supplementary Note 2, the greedy strategy we employ can indeed obtain the minimum number of shortest unique substrings to cover each unique L-mer, provided that each substring in ${{{{{{{{\mathcal{U}}}}}}}}}_{i}$ occurs only once in s_i.
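The greedy selection described above can be illustrated by a small, quadratic-time sketch; it is not the procedure of Supplementary Note 2. Substrings are given as hypothetical (start, end) positions on one genome (0-based, inclusive), each occurring once. The window start is advanced until the window contains at least one substring; the rightmost such substring is kept, and the scan resumes just past its start, since every window in between also contains it.

```python
def sparsify(substrings, L, genome_len):
    """Keep a minimum subset of `substrings` such that every length-L window
    containing at least one substring also contains a kept one."""
    kept = []
    w = 0                              # start of the current window [w, w + L - 1]
    last_window = genome_len - L
    while w <= last_window:
        inside = [(a, b) for (a, b) in substrings if w <= a and b <= w + L - 1]
        if not inside:
            w += 1                     # this window contains no substring at all
            continue
        a, b = max(inside)             # rightmost substring inside the window
        kept.append((a, b))
        w = a + 1                      # windows starting in [w, a] all contain (a, b)
    return kept

# Toy genome of length 12 with read length L = 5.
subs = [(0, 1), (2, 3), (3, 5), (9, 10)]
kept = sparsify(subs, L=5, genome_len=12)
```

Here the substring at positions 0 to 1 is dropped because every window that contains it also contains the substring at positions 2 to 3.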

Index organization

We demonstrate the index structure and query processing for the set of unique substrings $\mathcal{U}$; the processing for doubly-unique substrings is essentially identical to that for unique substrings. Let $h=\min_{u_{i}\in \mathcal{U}}|u_{i}|$ be the minimum length of all shortest unique substrings (h is automatically set to $L_{\min}$ if the minimum length constraint is imposed). CAMMiQ maintains a hash table that maps a distinct h-mer w to a bucket containing all unique substrings u_i that have w as a prefix. Within each bucket, the remaining suffixes of all unique substrings u_i, i.e., u_i[h + 1: ∣u_i∣], are maintained in a trie (rooted at u_i[1: h]) so that (i) each internal node represents a single character; and (ii) each leaf represents the corresponding genome ID. For each read r_j in the query, CAMMiQ considers each substring of length h and its reverse complement and computes its hash value in time linear in L through Karp–Rabin fingerprinting⁶³. If the substring has a match in the hash table, then CAMMiQ tries to extend the match until a matching unique substring is found, or until an extension by one character leads to no match. See Fig. 3 for a schematic of the index structure. See Subsection Query processing stage 1: Preprocessing the Reads below for the use of unique and doubly-unique substrings identified for each read to answer the query.
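The hash-table-plus-trie organization can be sketched in a few lines. This is a simplified illustration and not CAMMiQ's implementation: it uses Python's built-in dictionary hashing in place of Karp–Rabin fingerprinting, it does not consider reverse complements, and the names `build_index`, `scan_read` and `LEAF` are hypothetical.

```python
LEAF = "id"  # two-character key marking a leaf; cannot clash with single-character edges

def build_index(unique_substrings, h):
    """Map each h-mer prefix to a trie (nested dicts) over the remaining characters.
    `unique_substrings` maps a substring (length >= h) to its genome ID."""
    index = {}
    for sub, genome_id in unique_substrings.items():
        node = index.setdefault(sub[:h], {})
        for ch in sub[h:]:
            node = node.setdefault(ch, {})
        node[LEAF] = genome_id
    return index

def scan_read(read, index, h):
    """Return (position, genome ID) for every indexed substring found in `read`."""
    hits = []
    for pos in range(len(read) - h + 1):
        node = index.get(read[pos:pos + h])   # hash lookup of the h-mer
        nxt = pos + h
        while node is not None:
            if LEAF in node:                  # a complete indexed substring matched
                hits.append((pos, node[LEAF]))
                break
            if nxt >= len(read):              # read ended before a leaf was reached
                break
            node = node.get(read[nxt])        # extend the match by one character
            nxt += 1
    return hits

index = build_index({"ACGT": 0, "ACGAA": 1, "TTGC": 2}, h=4)
hits = scan_read("GACGAATTGC", index, h=4)
```

Stopping at the first leaf is sufficient in this sketch because no shortest unique substring is a prefix of another.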

**Fig. 3: Overview of CAMMiQ’s index structure.**

Query processing stage 1: preprocessing the reads

Given the index structure on the sparsified set of shortest unique and doubly-unique substrings of genomes in ${{{{{{{\mathcal{S}}}}}}}}$, we handle each query ${{{{{{{\mathcal{Q}}}}}}}}$ in two stages. The first stage counts the number of reads that include each unique and doubly-unique substring with the following provision. We call two or more (unique or doubly-unique) substrings in a read “conflict-free” if there is at least one genome that includes all of these substrings. See Supplementary Note 3 for a detailed discussion on conflicting substrings; the conflicts arise due to either sequencing errors or the query including genomes that are not in the database and thus should be avoided. Reads that include more than one unique or doubly-unique substring that is conflict-free contribute to the counting process; all other reads are discarded.
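The notion of conflict-free substrings amounts to a set intersection, as the following sketch shows. It covers only the definition given above, not the full case analysis of Supplementary Note 3, and the function names are hypothetical. Each substring found in a read is represented by the set of genomes that contain it (one genome for a unique substring, two for a doubly-unique one).

```python
def candidate_genomes(hit_genome_sets):
    """Genomes that contain every indexed substring found in a read."""
    if not hit_genome_sets:
        return set()
    first, *rest = hit_genome_sets
    return set(first).intersection(*rest)

def is_conflict_free(hit_genome_sets):
    """True if at least one genome includes all of the substrings."""
    return bool(candidate_genomes(hit_genome_sets))

# A unique substring of genome 1 together with a doubly-unique one shared by genomes 1 and 2.
consistent = candidate_genomes([{1}, {1, 2}])
# A unique substring of genome 1 together with a doubly-unique one shared by genomes 2 and 3.
conflicting = candidate_genomes([{1}, {2, 3}])
```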

We denote by c(u_i) the counter for the conflict-free reads that include the unique substring u_i, and by c(d_i) the corresponding counter for the doubly-unique substring d_i. These counters are sufficient to compute the set $\mathcal{A}_1$ as well as $\mathcal{A}_3$, the answer to our most general query type. For computing $\mathcal{A}_2$, CAMMiQ additionally maintains a counter $d(s_k,s_{k'})$ for each pair of genomes $s_k,s_{k'}$, indicating the number of reads in $\mathcal{Q}$ that can originate from both s_k and $s_{k'}$ (i.e., case (e-iii) in the procedure described in Supplementary Note 3).

The first stage thus produces two count vectors $\mathbf{c}_i^u=(c(u_{i,1}),\cdots,c(u_{i,nu_i}))$ and $\mathbf{c}_i^d=(c(d_{i,1}),\cdots,c(d_{i,nd_i}))$ that indicate the number of (conflict-free) reads that include each unique and doubly-unique substring on each genome s_i. Using these vectors, CAMMiQ answers the first type of query by computing $\mathcal{A}_1=\{s_i:\sum_{l=1}^{nu_i}c(u_{i,l}) > 0\}$. Additionally, through the use of the counters $d(s_k,s_{k'})$, CAMMiQ answers the second type of query by computing $\mathcal{A}_2=\arg\min|\mathcal{A}'\subset\mathcal{S}|$ such that (i) $s_i\in\mathcal{A}'$ if $\sum_{l=1}^{nu_i}c(u_{i,l}) > 0$ and (ii) if $d(s_k,s_{k'}) > 0$, then there exists $s_i\in\mathcal{A}'$ with either i = k or $i=k'$. This is the solution to the hitting set problem we mentioned earlier, whose formulation as an integer linear program (ILP) is well known⁶⁴. The genomes returned in $\mathcal{A}_1$ are ranked in decreasing order by the aggregated counter values on unique substrings (i.e., $|\mathbf{c}_i^u|$); and the genomes returned in $\mathcal{A}_2$ are ranked by the aggregated counter values on unique substrings plus the counter values on doubly-unique substrings (i.e., $|\mathbf{c}_i^u|+|\mathbf{c}_i^d|$).
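As a rough illustration of the two query types above, the sketch below computes $\mathcal{A}_1$ by thresholding the summed unique-substring counters and computes $\mathcal{A}_2$ by exhaustive search for a minimum hitting set. The exhaustive search is a stand-in chosen for brevity (it is exponential in the worst case); the paper formulates this step as an ILP. Function names and the dictionary-based inputs are hypothetical.

```python
from itertools import combinations

def answer_a1(unique_counts):
    """A_1: genomes with a positive total unique-substring count, ranked by that total.

    unique_counts: dict genome_id -> list of counters c(u_{i,l}).
    """
    present = [g for g, counts in unique_counts.items() if sum(counts) > 0]
    return sorted(present, key=lambda g: sum(unique_counts[g]), reverse=True)

def answer_a2(unique_counts, pair_counts):
    """A_2: smallest genome set that contains A_1 and hits every pair with d > 0.

    pair_counts: dict (genome_k, genome_k') -> counter d(s_k, s_k').
    """
    required = set(answer_a1(unique_counts))
    # Pairs already hit by a required genome impose no further constraint.
    pairs = [p for p, d in pair_counts.items() if d > 0 and not (set(p) & required)]
    candidates = sorted({g for p in pairs for g in p})
    for size in range(len(candidates) + 1):
        for extra in combinations(candidates, size):
            chosen = required | set(extra)
            if all(set(p) & chosen for p in pairs):
                return chosen
    return required
```

When several minimum-size sets exist, this sketch returns the first one in lexicographic order of genome IDs; how ties are broken in CAMMiQ is not specified in this section.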

From this point on, our main focus will be how CAMMiQ answers the third type of query by computing ${{{{{{{{\mathcal{A}}}}}}}}}_{3}$ through an ILP formulation described below.

Query processing stage 2: ILP formulation

In its second stage, CAMMiQ computes the list of genomes in the query as well as their abundances through an ILP. Let δ_i ∈ {0, 1} be the indicator for the absence or presence of the genome s_i in $\mathcal{Q}$. The ILP formulation assigns a value to each δ_i and also computes for each s_i its abundance p_i, upper bounded by $p_{\max}$, a user-defined maximum abundance (default setting 100) that is introduced to avoid potential anomalies due to sequence contamination.

$$\mathbf{Minimize}\quad \sum_{i}\left(\frac{1}{nu_{i}}\sum_{l=1}^{nu_{i}}|c(u_{i,l})-e(u_{i,l})|+\frac{1}{nd_{i}}\sum_{l=1}^{nd_{i}}|c(d_{i,l})-e(d_{i,l})|\right)$$

$$\mathbf{s.t.}\quad e(u_{i,l})=(L-|u_{i,l}|+1)\cdot p_{i}\cdot \frac{1}{L}\cdot (1-\hat{\mathrm{err}})^{|u_{i,l}|}\quad \forall i,\,l\ \ \mathrm{s.t.}\ 1\le l\le nu_{i} \tag{1}$$

$$e(d_{i,l})=(L-|d_{i,l}|+1)\cdot (p_{i}+p_{j})\cdot \frac{1}{L}\cdot (1-\hat{\mathrm{err}})^{|d_{i,l}|}\quad \forall i,\,l\ \ \mathrm{s.t.}\ 1\le l\le nd_{i} \tag{2}$$

$$p_{i}\le \delta_{i}\cdot p_{\max}\quad \forall i \tag{3}$$

$$\delta_{i}=0\quad \forall i\ \ \mathrm{s.t.}\ s_{i}\in M(\mathcal{Q}) \tag{4}$$

$$p_{i}\ge \delta_{i}\cdot \min \left\{L\sum_{l=1}^{nu_{i}}c(u_{i,l})\cdot \frac{1}{nu_{i}^{L}},\ L\sum_{l=1}^{nd_{i}}c(d_{i,l})\cdot \frac{1}{nd_{i}^{L}}\right\}\cdot (1-\epsilon)\quad \forall i\ \ \mathrm{s.t.}\ s_{i}\notin M(\mathcal{Q}) \tag{5}$$

$$\sum_{i}|s_{i}|\cdot p_{i}\le n\cdot L \tag{6}$$

The objective of the ILP is to minimize the sum of absolute differences between the expected and the actual number of reads to cover a unique or doubly-unique substring. Since each genome may have different numbers of unique and doubly-unique substrings, the sums of differences are normalized w.r.t. nu_i or nd_i.
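To make the normalization concrete, the objective can be evaluated for a fixed candidate solution as in the sketch below. This only scores given (observed, expected) counts; it does not solve the ILP. The function name and input layout are assumptions, as is the choice to skip genomes that have no substrings of a given kind.

```python
def objective(unique, doubly):
    """Sum over genomes of the mean absolute difference between observed and expected counts.

    unique, doubly: dict genome_id -> list of (observed c, expected e) pairs,
    for unique and doubly-unique substrings respectively.
    """
    total = 0.0
    for per_genome in (unique, doubly):
        for pairs in per_genome.values():
            if pairs:  # assumption: genomes without such substrings contribute nothing
                # Dividing by len(pairs) is the normalization by nu_i or nd_i.
                total += sum(abs(c - e) for c, e in pairs) / len(pairs)
    return total
```

Because each genome's term is an average rather than a sum, a genome with many unique substrings does not dominate one with few.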

Constraint (1) defines the expected number of reads to cover a particular unique substring u_i,l, given the abundance p_i of the corresponding genome s_i. Similarly, constraint (2) defines the expected number of reads to cover a particular doubly-unique substring d_i,l; in this constraint, p_i and p_j denote the respective abundances of the two genomes s_i and s_j that include the doubly-unique substring d_i,l. Specifically, the expected coverage of u_i,l is $\frac{L-|u_{i,l}|+1}{L}\cdot p_{i}$ and the expected coverage of d_i,l is $\frac{L-|d_{i,l}|+1}{L}\cdot (p_{i}+p_{j})$, provided that the coverage is uniform across a given genome and there are no read errors. To account for read errors, we scale these coverage estimates by $(1-\hat{\mathrm{err}})^{|u_{i,l}|}$ and $(1-\hat{\mathrm{err}})^{|d_{i,l}|}$ respectively; these values represent the probability that a substring u_i,l or d_i,l is error free within a read that has been subject to uniform i.i.d. substitution errors. Here $\hat{\mathrm{err}}$ denotes the estimated substitution error rate per nucleotide, and ∣w∣ denotes the length of a substring w. The CAMMiQ formulation also allows updates to the expected coverage according to any given unique or doubly-unique substring’s sequence composition (e.g., GC content) to address sequencing biases.
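The right-hand sides of constraints (1) and (2) translate directly into a small function. The sketch below is an illustration under the stated model (uniform coverage, i.i.d. substitution errors); the function and parameter names are hypothetical, and the guard for substrings longer than a read is an added assumption.

```python
def expected_count(sub_len, read_len, abundance, err_rate):
    """Expected number of reads covering a substring, as in constraints (1) and (2).

    e = (L - |u| + 1) * p * (1 / L) * (1 - err)^|u|

    sub_len:   |u|, length of the unique or doubly-unique substring
    read_len:  L, the read length
    abundance: p_i for a unique substring, or p_i + p_j for a doubly-unique one
    err_rate:  estimated per-nucleotide substitution error rate
    """
    if sub_len > read_len:
        return 0.0  # assumption: a read cannot contain a substring longer than itself
    placements = read_len - sub_len + 1  # start offsets at which a read contains the substring
    return placements * abundance / read_len * (1.0 - err_rate) ** sub_len
```

With L = 100, a substring of length 50, abundance 10 and no errors, the expected count is 51 · 10 / 100 = 5.1; a 1% error rate scales this by 0.99⁵⁰.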

Constraint (3) ensures that the abundance p_i of a genome is 0 if δ_i = 0. Constraint (4) excludes from the solution those genomes whose counters for unique and doubly-unique substrings add up to values below a threshold, which reduces the size of the solution space. More specifically, given a threshold value α (introduced to avoid potential false positives due to read errors and genomes that are not in the database; its default value is 0.0001), the constraint excludes the genomes s_i in the set $M(\mathcal{Q})$, i.e., those whose counters for unique substrings add up to a value below $\alpha \cdot nu_{i}^{L}$ and whose counters for doubly-unique substrings add up to a value below $\alpha \cdot nd_{i}^{L}$. Formally, $M(\mathcal{Q})=\{s_{i}\in \mathcal{S}\,|\,\sum_{l=1}^{nu_{i}}c(u_{i,l}) < \alpha \cdot nu_{i}^{L}\}\cap \{s_{i}\in \mathcal{S}\,|\,\sum_{l=1}^{nd_{i}}c(d_{i,l}) < \alpha \cdot nd_{i}^{L}\}$. Constraint (5) enforces a lower bound on the coverage of each genome s_i in the solution to the above ILP (namely, with δ_i = 1): for a user-defined ϵ, it must be at least (1 − ϵ) times the smaller of the two coverage values ($L\cdot \sum_{l=1}^{nu_{i}}c(u_{i,l})\cdot \frac{1}{nu_{i}^{L}}$ and $L\cdot \sum_{l=1}^{nd_{i}}c(d_{i,l})\cdot \frac{1}{nd_{i}^{L}}$) implied by the number of reads in $\mathcal{Q}$ that include a unique and a doubly-unique substring respectively. Constraint (6) enforces an upper bound on the coverage of each genome s_i by requiring that the number of reads generated from each s_i at abundance p_i, summed over all genomes, does not exceed the total number of reads n. Collectively, the last two constraints ensure that the abundance p_i computed by the ILP matches the coverage implied by the read counts in $\mathcal{Q}$.

As written above, the formulation does not strictly conform to the rules for ILPs because of the use of the absolute value function. We use a standard technique to replace the absolute values in the objective by introducing a new variable $\gamma(u_{i,l})\ge \max \{c(u_{i,l})-e(u_{i,l}),\,e(u_{i,l})-c(u_{i,l})\}$; since the objective is minimized, each such variable takes the value of the corresponding absolute difference at the optimum.
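The filtering set $M(\mathcal{Q})$ of constraint (4) and the lower bound of constraint (5) can be computed from the stage-1 counters before the ILP is solved. The sketch below is illustrative only: the function names are hypothetical, and the quantities $nu_i^L$ and $nd_i^L$ (which appear in constraints (4) and (5)) are taken as given, strictly positive inputs.

```python
def excluded_genomes(unique_counts, doubly_counts, nu_L, nd_L, alpha=0.0001):
    """M(Q): genomes whose unique AND doubly-unique counter sums both fall below threshold.

    unique_counts, doubly_counts: dict genome_id -> list of counters.
    nu_L, nd_L: dict genome_id -> nu_i^L and nd_i^L (taken as given inputs here).
    """
    return {
        g for g in unique_counts
        if sum(unique_counts[g]) < alpha * nu_L[g]
        and sum(doubly_counts[g]) < alpha * nd_L[g]
    }

def abundance_lower_bound(g, unique_counts, doubly_counts, nu_L, nd_L, read_len, eps):
    """Right-hand side of constraint (5) for a genome g with delta_g = 1."""
    cov_unique = read_len * sum(unique_counts[g]) / nu_L[g]
    cov_doubly = read_len * sum(doubly_counts[g]) / nd_L[g]
    return min(cov_unique, cov_doubly) * (1.0 - eps)
```

Genomes returned by `excluded_genomes` would have their indicator fixed to 0; for every other genome, `abundance_lower_bound` gives the smallest abundance the ILP may assign if that genome is selected.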

When to use unique substrings—the error free case

We now provide a set of sufficient conditions to guarantee the approximate performance that can be obtained with high probability in metagenomic identification and quantification by the use of unique substrings only. These conditions apply to CAMMiQ when c = 1, as well as CLARK, KrakenUniq, and other similar approaches. In case these conditions are not met, it is advisable to use CAMMiQ with c ≥ 2.

Suppose that we are given a query ${{{{{{{\mathcal{Q}}}}}}}}$ composed of n error-free reads of length L, sampled independently and uniformly at random from a collection of genomes ${{{{{{{\mathcal{A}}}}}}}}=\{{s}_{1},\cdots \,,{s}_{a}\}$ according to their abundances p₁, ⋯ , p_a. More specifically, suppose that our goal is to answer query ${{{{{{{\mathcal{Q}}}}}}}}$ by computing ${{{{{{{{\mathcal{A}}}}}}}}}_{1}$, along with an estimate for the abundance value p_i for each ${s}_{i}\in {{{{{{{{\mathcal{A}}}}}}}}}_{1}$, calculated as the weighted number of reads assigned to s_i according to the procedure described in Section Query processing stage 1: Preprocessing the Reads. Then, the L1 distance between the true abundance values and this estimate will not exceed a value determined by n (number of reads), a, and ${q}_{\min }$, the minimum normalized proportion of unique L-mers among these genomes. For a given failure probability ζ and an upper bound on L1 distance ϵ, this translates into sufficient conditions on the values of n, a and ${q}_{\min }$ to ensure acceptable performance by the computational method in use.

Theorem 1

Let $\mathcal{Q}=\{r_{1},\cdots,r_{n}\}$ be a set of n error-free reads of length L, each sampled independently and uniformly at random from all positions on a genome $s_{i}\in \mathcal{A}=\{s_{1},\cdots,s_{a}\}$, where s₁, ⋯ , s_a are distributed according to their abundances p₁, ⋯ , p_a > 0. Let $p_{i}^{\prime}=\frac{p_{i}\cdot n_{i}^{L}}{\sum_{i^{\prime}=1}^{a}p_{i^{\prime}}\cdot n_{i^{\prime}}^{L}}$ be the corresponding “unnormalized” abundance of p_i for i = 1, ⋯ , a, where $n_{i}^{L}$ denotes the total number of L-mers on s_i. Let q₁, ⋯ , q_a > 0 be the proportion of unique L-mers on s₁, ⋯ , s_a respectively; $p_{\min}=\min \{p_{i}\}_{i=1}^{a}$; $q_{\min}=\min \{q_{i}\}_{i=1}^{a}$. Then,

(i) With probability at least 1 − ζ, each s_i can be identified through querying $\mathcal{Q}$ if $n\ge \frac{2(a+1)+\ln (1/\zeta )}{(p_{\min}q_{\min})^{2}}$.
(ii) With probability at least 1 − ζ, the L1 distance between the predicted abundances ${\hat{p}}_{1},\cdots \,,{\hat{p}}_{a}$ by setting $\hat{{p}_{i}}=\frac{{c}_{i}/{q}_{i}}{n}$ and the true (unnormalized) abundances ${p}_{1}^{\prime},\cdots \,,{p}_{a}^{\prime}$ is at most ϵ if $n\ge \frac{2(a+1)+\ln (1/\zeta )}{{(\epsilon {q}_{\min })}^{2}}$.
(iii) Given n such reads in a query, with probability at least 1 − ζ, the L1 distance between the predicted abundances ${\hat{p}}_{1},\cdots \,,{\hat{p}}_{a}$ by setting $\hat{{p}_{i}}=\frac{{c}_{i}/{q}_{i}}{n}$ and the true (unnormalized) abundances ${p}_{1}^{\prime},\cdots \,,{p}_{a}^{\prime}$ is bounded by $\sqrt{\frac{2[\ln (1/\zeta )+(a+1)]}{n{q}_{\min }^{2}}}$.

Here, c_i denotes the number of reads assigned to s_i.

See Supplementary Note 4 for a proof.
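The three bounds of Theorem 1 are simple enough to evaluate numerically when deciding whether unique substrings alone are likely to suffice for a given query size. The sketch below transcribes each expression as stated in the theorem; the function names are hypothetical, and rounding the read counts up to integers is an added convenience.

```python
import math

def reads_for_identification(a, p_min, q_min, zeta):
    """Part (i): sufficient number of reads to identify every genome w.p. >= 1 - zeta."""
    return math.ceil((2 * (a + 1) + math.log(1 / zeta)) / (p_min * q_min) ** 2)

def reads_for_l1_error(a, q_min, eps, zeta):
    """Part (ii): sufficient number of reads for L1 error <= eps w.p. >= 1 - zeta."""
    return math.ceil((2 * (a + 1) + math.log(1 / zeta)) / (eps * q_min) ** 2)

def l1_bound(a, q_min, n, zeta):
    """Part (iii): bound on the L1 error for n reads, holding w.p. >= 1 - zeta."""
    return math.sqrt(2 * (math.log(1 / zeta) + (a + 1)) / (n * q_min ** 2))
```

For instance, with a = 10 genomes, p_min = 0.05, q_min = 0.5 and ζ = 0.01, part (i) asks for roughly 4.3 × 10⁴ reads; the bound in part (iii) shrinks proportionally to 1/√n, so quadrupling the number of reads halves it.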

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

There are four index datasets (species-level-all, species-level-bacteria, strain-level and subspecies-level) and associated query sets used in this paper. All four index datasets include a subset of all (complete) bacterial, viral and archaeal genomes from NCBI’s RefSeq database, which is available at ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq.

For the species-level-all index dataset, we use the release version 205 of RefSeq, which can be found at https://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/archive/RefSeq-release205.txt. The complete list of 16418 genomes can be found in https://github.com/algo-cancer/CAMMiQ/blob/master/README.md. The corresponding IMMSA queries can be found privately at http://ftp-private.ncbi.nlm.nih.gov/nist-immsa/IMMSA. A publicly available copy of the above directory is available at https://ftp.ncbi.nlm.nih.gov/pub/catSMA/for_Kaiyuan. The CAMI queries as well as the ground truth files can be found at http://gigadb.org/dataset/100344.

For the species-level-bacteria index dataset, we use the release version 93 of RefSeq, which can be found at https://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/archive/RefSeq-release93.txt. The complete list of 4122 genomes can be found in https://github.com/algo-cancer/CAMMiQ/blob/master/README.md. The corresponding queries were generated by a Python script, CAMMiQ-simulate, which is available along with the software repository https://github.com/algo-cancer/CAMMiQ; these queries are available upon request.

For the strain-level index dataset, we use the release version 93 of RefSeq, which can be found at https://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/archive/RefSeq-release93.txt. The list of human gut associated bacteria was obtained from Supplementary Table 1 of https://www.nature.com/articles/s41587-018-0009-7. The complete list of 614 genomes can be found in https://github.com/algo-cancer/CAMMiQ/blob/master/README.md. The corresponding queries were also generated by running CAMMiQ-simulate, which is available along with the software repository https://github.com/algo-cancer/CAMMiQ; these queries are available upon request.

For our subspecies-level index dataset, we use the release version 93 of RefSeq, which can be found at https://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/archive/RefSeq-release93.txt. The complete list of 3395 genomes can be found in https://github.com/algo-cancer/CAMMiQ/blob/master/README.md. The corresponding scRNA-seq queries can be obtained from https://www.ncbi.nlm.nih.gov/bioproject/PRJNA437328.

Code availability

The source code of CAMMiQ, under the MIT license, is publicly available on GitHub at https://github.com/algo-cancer/CAMMiQ (https://doi.org/10.5281/zenodo.7102588).

References

Huttenhower, C. et al. Structure, function and diversity of the healthy human microbiome. Nature 486, 207 (2012).
Nejman, D. et al. The human tumor microbiome is composed of tumor type-specific intracellular bacteria. Science 368, 973–980 (2020).
Bullman, S. et al. Analysis of Fusobacterium persistence and antibiotic response in colorectal cancer. Science 358, 1443–1448 (2017).
Castellarin, M. et al. Fusobacterium nucleatum infection is prevalent in human colorectal carcinoma. Genome Res. 22, 299–306 (2012).
Gur, C. et al. Binding of the Fap2 protein of Fusobacterium nucleatum to human inhibitory receptor tigit protects tumors from immune cell attack. Immunity 42, 344–355 (2015).
Gur, C. et al. Fusobacterium nucleatum suppresses anti-tumor immunity by activating CEACAM1. Oncoimmunology 8, e1581531 (2019).
Kostic, A. D. et al. Genomic analysis identifies association of Fusobacterium with colorectal carcinoma. Genome Res. 22, 292–298 (2012).
Yu, T. et al. Fusobacterium nucleatum promotes chemoresistance to colorectal cancer by modulating autophagy. Cell 170, 548–563 (2017).
Simon, H. Y., Siddle, K. J., Park, D. J. & Sabeti, P. C. Benchmarking metagenomics tools for taxonomic classification. Cell 178, 779–794 (2019).
Walker, M. A. et al. GATK PathSeq: a customizable computational tool for the discovery and identification of microbial sequences in libraries from eukaryotic hosts. Bioinformatics 34, 4287–4289 (2018).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Truong, D. T. et al. Metaphlan2 for enhanced metagenomic taxonomic profiling. Nat. Methods 12, 902 (2015).
Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014).
Ounit, R., Wanamaker, S., Close, T. J. & Lonardi, S. Clark: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics 16, 236 (2015).
Breitwieser, F., Baker, D. & Salzberg, S. L. KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biol. 19, 198 (2018).
Kim, D., Song, L., Breitwieser, F. P. & Salzberg, S. L. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 26, 1721–1729 (2016).
Lu, J., Breitwieser, F. P., Thielen, P. & Salzberg, S. L. Bracken: estimating species abundance in metagenomics data. PeerJ Computer Sci. 3, e104 (2017).
Huson, D. H., Auch, A. F., Qi, J. & Schuster, S. C. Megan analysis of metagenomic data. Genome Res. 17, 377–386 (2007).
Poore, G. D. et al. Microbiome analyses of blood and tissues suggest cancer diagnostic approach. Nature 579, 567–574 (2020).
Elworth, R. et al. To petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics. Nucleic Acids Res. 48, 5217–5234 (2020).
Robinson, W., Schischlik, F., Gertz, E. M., Schaffer, A. A. & Ruppin, E. Identifying the landscape of intratumoral microbes via a single cell transcriptomic analysis. bioRxiv (2020).
Liu, B., Gibbons, T., Ghodsi, M., Treangen, T. & Pop, M. Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. Genome Biol. 12, S4 (2011).
Segata, N. et al. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat. Methods 9, 811 (2012).
Menzel, P., Ng, K. L. & Krogh, A. Fast and sensitive taxonomic classification for metagenomics with kaiju. Nat. Commun. 7, 11257 (2016).
Ames, S. K. et al. Scalable metagenomic taxonomy classification using a reference genome database. Bioinformatics 29, 2253–2260 (2013).
Brinda, K., Sykulski, M. & Kucherov, G. Spaced seeds improve k-mer-based metagenomic classification. Bioinformatics 31, 3584–3592 (2015).
Kawulok, J. & Deorowicz, S. Cometa: classification of metagenomes using k-mers. PLoS ONE 10, e0121453 (2015).
Tu, Q., He, Z. & Zhou, J. Strain/species identification in metagenomes using genome-specific markers. Nucleic Acids Res. 42, e67–e67 (2014).
Koslicki, D. & Falush, D. Metapalette: a k-mer painting approach for metagenomic taxonomic profiling and quantification of novel strain variation. MSystems 1, e00020–16 (2016).
Luo, Y., Zeng, J., Berger, B. & Peng, J. Low-density locality-sensitive hashing boosts metagenomic binning. In International Conference on Research in Computational Molecular Biology, LNCS volume 9649, 255–257 (Springer, 2016).
Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 132 (2016).
Piro, V. C., Dadi, T. H., Seiler, E., Reinert, K. & Renard, B. Y. ganon: precise metagenomics classification against large and up-to-date sets of reference sequences. Bioinformatics 36, i12–i20 (2020).
Nazeen, S., Yu, Y. W. & Berger, B. Carnelian uncovers hidden functional patterns across diverse study populations from whole metagenome sequencing reads. Genome Biol. 21, 1–18 (2020).
McHardy, A. C., Martín, H. G., Tsirigos, A., Hugenholtz, P. & Rigoutsos, I. Accurate phylogenetic classification of variable-length dna fragments. Nat. Methods 4, 63 (2007).
Rosen, G., Garbarine, E., Caseiro, D., Polikar, R. & Sokhansanj, B. Metagenome fragment classification using n-mer frequency profiles. Adv. Bioinform. 2008, 205969 (2008).
Brady, A. & Salzberg, S. L. Phymm and phymmbl: metagenomic phylogenetic classification with interpolated markov models. Nat. Methods 6, 673 (2009).
Rosen, G. L., Reichenberger, E. R. & Rosenfeld, A. M. NBC: the naive bayes classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics 27, 127–129 (2010).
Vervier, K., Mahe, P., Tournoud, M., Veyrieras, J.-B. & Vert, J.-P. Large-scale machine learning for metagenomics sequence classification. Bioinformatics 32, 1023–1032 (2015).
Anyansi, C., Straub, T. J., Manson, A. L., Earl, A. M. & Abeel, T. Computational methods for strain-level microbial detection in colony and metagenome sequencing data. Front. Microbiol. 11, 1925 (2020).
Marshall, J. A. Mixed infections of intestinal viruses and bacteria in humans. In Polymicrobial Diseases (ASM Press, 2002).
Balmer, O. & Tanner, M. Prevalence and implications of multiple-strain infections. Lancet Infectious Dis. 11, 868–878 (2011).
Cohen, T. et al. Mixed-strain Mycobacterium tuberculosis infections and the implications for tuberculosis treatment and control. Clin. Microbiol. Rev. 25, 708–719 (2012).
Secher, T., Brehin, C. & Oswald, E. Early settlers: which e. coli strains do you not want at birth? Am. J. Physiol. Gastroint. Liv. Physiol. 311, G123–G129 (2016).
Gerner-Smidt, P. et al. Whole genome sequencing: Bridging one-health surveillance of foodborne diseases. Front. Public Health 7, 172 (2019).
Lin, Y.-Y. et al. Cliiq: Accurate comparative detection and quantification of expressed isoforms in a population. In International Workshop on Algorithms in Bioinformatics, 178–189 (Springer, 2012).
Li, W., Feng, J. & Jiang, T. Isolasso: a LASSO regression approach to RNA-Seq based transcriptome assembly. J. Computational Biol. 18, 1693–1707 (2011).
Dao, P. et al. Orman: optimal resolution of ambiguous rna-seq multimappings in the presence of novel isoforms. Bioinformatics 30, 644–651 (2014).
Sobih, A., Tomescu, A. I. & Makinen, V. Metaflow: Metagenomic profiling based on whole-genome coverage analysis with min-cost flows. In RECOMB, Int. Conf. on Research in Computational Molecular Biology, LNCS Volume 9649, 111–121 (Springer, 2016).
Solomon, B. & Kingsford, C. Fast search of thousands of short-read sequencing experiments. Nat. Biotechnol. 34, 300 (2016).
Solomon, B. & Kingsford, C. Improved search of large transcriptomic sequencing databases using split sequence bloom trees. In International Conference on Research in Computational Molecular Biology, 257–271 (Springer, 2017).
Sun, C., Harris, R. S., Chikhi, R. & Medvedev, P. Allsome sequence bloom trees. In International Conference on Research in Computational Molecular Biology, 272–286 (Springer, 2017).
Pandey, P. et al. Mantis: A fast, small, and exact large-scale sequence-search index. Cell Systems 7, 201–207 (2018).
Ondov, B. D. et al. Mash screen: high-throughput sequence containment estimation for genome discovery. Genome Biol. 20, 1–13 (2019).
Haubold, B., Pierstorff, N., Moller, F. & Wiehe, T. Genome comparison without alignment using shortest unique substrings. BMC Bioinform. 6, 1–11 (2005).
Leimeister, C.-A. & Morgenstern, B. Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics 30, 2000–2008 (2014).
Pruitt, K. D., Tatusova, T. & Maglott, D. R. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 35, D61–D65 (2007).
McIntyre, A. B. R. et al. Comprehensive benchmarking and ensemble approaches for metagenomic classifiers. Genome Biol. 18, 72 (2017).
Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken2. Genome Biol. 20, 257 (2019).
Forster, S. C. et al. A human gut bacterial genome and culture collection for improved metagenomic analyses. Nat. Biotechnol. 37, 186 (2019).
Aulicino, A. et al. Invasive Salmonella exploits divergent immune evasion strategies in infected and bystander dendritic cell subsets. Nat. Commun. 9, 4883 (2018).
Emiola, A. & Oh, J. High throughput in situ metagenomic measurement of bacterial replication at ultra-low sequencing coverage. Nat. Commun. 9, 4956 (2018).
Emiola, A., Zhou, W. & Oh, J. Metagenomic growth rate inferences of strains in situ. Sci. Adv. 6, eaaz2299 (2020).
Karp, R. M. & Rabin, M. O. Efficient randomized pattern-matching algorithms. IBM J. Res. Development 31, 249–260 (1987).
Vazirani, V. V. Approximation Algorithms (Springer Science & Business Media, 2013).


Acknowledgements

This research was supported in part by the Intramural Research Program of the National Institutes of Health, National Cancer Institute. This work utilized the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov). Y.Y. acknowledges support from the NIH grant 5R01AI143254. We thank Dr. Moses Stamboulian for the details of the strain level index dataset.

Funding

Open Access funding provided by the National Institutes of Health (NIH).

Author information

Authors and Affiliations

Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
Kaiyuan Zhu, Alejandro A. Schäffer, Welles Robinson, Junyan Xu, Eytan Ruppin & S. Cenk Sahinalp
Department of Computer Science & Engineering, UC San Diego, La Jolla, CA, USA
Kaiyuan Zhu
Department of Computer Science, Indiana University, Bloomington, IN, USA
Kaiyuan Zhu, A. Funda Ergun, Yuzhen Ye & S. Cenk Sahinalp
Surgery Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
Welles Robinson

Contributions

K.Z., A.F.E., and S.C.S. initially formulated the problem of metagenomic abundance estimation. K.Z., A.F.E., and S.C.S. developed the index structure and query process. K.Z. implemented the proposed solution. K.Z. performed the comparison with other software tools. Y.Y. and her lab provided the strain-level index dataset. W.R., A.A.S., and E.R. provided the analysis of the scRNA-seq dataset. A.A.S. provided support for testing the code on the NIH HPC Biowulf cluster. A.A.S. performed the blastn analysis of Salmonella strains. K.Z., A.A.S., W.R., J.X., and S.C.S. co-wrote the manuscript. A.A.S. provided further proofreading of the manuscript. E.R., A.F.E., Y.Y., and S.C.S. supervised the study.

Corresponding author

Correspondence to S. Cenk Sahinalp.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Zhu, K., Schäffer, A.A., Robinson, W. et al. Strain level microbial detection and quantification with applications to single cell metagenomics. Nat Commun 13, 6430 (2022). https://doi.org/10.1038/s41467-022-33869-7


Received: 23 September 2021
Accepted: 04 October 2022
Published: 28 October 2022
DOI: https://doi.org/10.1038/s41467-022-33869-7

