## Introduction

Recent appreciation for the importance of microbes in human health and disease has prompted the generation of many metagenomic HTS (high throughput sequencing) datasets1. The increase in available HTS data from human tissues also represents an enormous resource because many of these datasets include reads from tissue-resident microbes, which have been shown to play important roles in human disease, including tumorigenesis and the tumor response to therapy2,3,4,5,6,7,8.

The increase in available metagenomic HTS datasets prompted the development of many taxonomic classification and abundance estimation methods. A recent benchmarking study9 involving a dataset established by the Critical Assessment of Metagenome Interpretation (CAMI) challenge and the International Microbiome and Multiomics Standards Alliance (IMMSA) provides a comprehensive review of these methods. The study covers 20 taxonomic classifiers, including both alignment-based approaches (such as GATK PathSeq, blastn and MetaPhlAn210,11,12) and alignment-free approaches (such as Kraken, CLARK, KrakenUniq, Centrifuge, and Bracken13,14,15,16,17). Below, we provide an overview of the general approaches employed by metagenomic classification methods.

Early approaches for analyzing metagenomic sequencing data were alignment-based and used a reference database. Reads were primarily searched in GenBank18 through blastn11 or custom built aligners such as GATK PathSeq10. Unfortunately, the growth of HTS data and reference databases has made read search and alignment using blastn or GATK PathSeq computationally infeasible on the largest datasets. For example, a recent study showing that microbial reads from tumors sequenced by The Cancer Genome Atlas (TCGA) can be used to build a classifier for cancer type19 used the alignment-free approach Kraken13 due to the large number of samples analyzed. Even though Kraken and other alignment-free tools are faster than the alignment-based tools20, these alignment-free tools are not as accurate. For example, another recent paper on microbial reads from single cell RNA-seq (scRNA-seq) datasets to distinguish cell type specific intracellular microbes from extracellular and contaminating microbes21 had to use GATK PathSeq because the relatively small number of microbial reads per cell was inadequate for available alignment-free methods to give accurate results. The distinct approaches taken by these two studies exemplify the tradeoffs inherent in the above methodologies.

Alignment-based methods can be sped up substantially by aligning reads to a compressed reference database or to a reference collection of sequences from marker genes, which are usually clade-specific, single-copy genes22,23. Since marker-gene based methods identify and use only a handful of marker genes on each genome, much of the data goes unused, making taxonomic quantification less accurate. Species with low abundance within the sample may be difficult to identify through marker gene methods because the data may contain few reads originating from the marker genes.

Alignment-free methods typically rely on exact string matching16,24, or k-mer (substrings of length k) “matches” to obtain a taxonomic assignment for every read. These methods either assign a read to the lowest taxonomic rank possible (determined by the specificity of the read’s substrings, or k-mers)13,25,26,27, or to a pre-determined taxonomic level, i.e., genus, species, or strain14,28. Unlike marker-gene based methods, k-mer based applications can use all the input reads29. The large memory footprint to maintain the entire k-mer profile of each genome, for large values of k, can be reduced through hashing or subsampling the k-mers30,31,32,33. In addition to methods based on exact k-mer matches, it is also possible to assign metagenomic reads to bacterial genomes by employing sequence-specific features (e.g., short k-mer distribution or GC content)34,35,36,37,38, although methods that employ this approach are typically not very accurate at species level or strain level assignment. These methods, as a result, are typically insufficient for strain-level applications39, e.g., to identify mixed infections caused by multiple strains of a bacterial species40,41,42, to distinguish pathogenic strains from non-pathogenic strains43, or to track food-borne pathogens44.

Most of the methods described above and covered in the aforementioned benchmarking study9 analyze each read without consideration of how the reads are sampled. Provided that the sequence data to be analyzed are genomic DNA, the distribution of HTS reads from a given species or strain should be roughly uniform. This principle is used in several methods for isoform abundance estimation45,46,47, which are effective even though the distribution of reads across an isoform may not be uniform in practice. In the context of metagenomic abundance estimation, however, the uniform coverage principle is under-utilized. One exception is the network flow based approach, utilized, for example, by ref. 48, which does take into account the uniform coverage; however, it is relatively slow due to the hardness of the underlying algorithmic problem. Another method that utilizes the near uniformity across k-mers within a genome is ref. 15, which runs faster but is also less accurate.

In addition to the metagenomic species identification and quantification methods summarized above, there are also tools to determine the likely presence of a long genomic sequence (e.g., the complete or partial genome of a bacterial species) in a given metagenomic sample49,50,51,52,53. Even though these tools solve an entirely different problem, methodologically they are similar to the k-mer based metagenomic identification and quantification tools such as refs. 13,14, in the sense that they build a succinct index on the database, which is composed of the metagenomic read collection, and they query this index without explicit alignment. However, because of their design parameters, these tools cannot perform abundance estimation.

In this paper, we describe CAMMiQ (Combinatorial Algorithms for Metagenomic Microbial Quantification), a computational approach to maintain/manage a collection of m (bacterial) genomes $$\mathcal{S}=\{s_{1},\ldots,s_{m}\}$$, each assembled into one or more strings/contigs, representing a species, a particular strain of a species, or any other taxonomic rank. CAMMiQ constructs a data structure, which can answer queries of the following form: given a set $$\mathcal{Q}$$ of HTS reads obtained from a mixture of genomes or transcriptomes, each from $$\mathcal{S}$$, identify the genomes in $$\mathcal{Q}$$, and, in case the reads are genomic, compute their relative abundances. Our data structure is very efficient in terms of its empirical querying time and is shown to be very accurate on simulations for which the ground truth answers are known. The distinctive feature of our data structure is its utilization of substrings that are present in at most c genomes (c > 1) in $$\mathcal{S}$$; in this paper, we focus on c = 2, which we call doubly-unique substrings. CAMMiQ is thus different from available methods which set c = 1 to compare genomes via their shortest unique substrings54,55, or perform metagenomic analysis by employing k-mers unique to each genome13,14,15. By considering substrings that are present in c = 2 (or possibly more) genomes, CAMMiQ utilizes a higher proportion of reads and can accurately identify genomes at subspecies/strain level. The choice of c = 2 is sufficiently powerful for the datasets we considered. However, our approach can be generalized for any fixed value of c ≥ 2. Another distinctive feature of our data structure is its use of variable-length substrings rather than fixed-length k-mers. Because any extension of a shortest unique substring is also unique, CAMMiQ only maintains the shortest of these overlapping unique substrings to maximize utility.
By being flexible about substring length, CAMMiQ potentially has a larger selection of substrings from which to choose; because it utilizes the shortest unique substrings, it maximizes possible coverage. To assign each read in $$\mathcal{Q}$$ that includes an almost-unique substring (i.e., a string present in at most c genomes) to a genome, our data structure solves an integer linear program (ILP) that simultaneously infers which genomes are present in $$\mathcal{Q}$$ and, if the reads are genomic, the relative abundances of the identified genomes. Specifically, the objective of the ILP is to identify a set of genomes in which the coverage of the almost-unique substrings in each genome is (approximately) uniform. Our final contribution is a set of conditions sufficient to identify and quantify genomes in a query correctly, through the use of unique substrings/k-mers, provided the reads are error-free. Although this is a purely theoretical result, to the best of our knowledge it has not been applied to metagenomic data analysis, and is valid for CAMMiQ for the case c = 1 and other unique substring based methods such as CLARK and KrakenUniq. Setting c = 2 for CAMMiQ is advised for cases where these conditions are not met. On the experimental side, we show that CAMMiQ is not only much faster but also more accurate than the mapping based GATK PathSeq, which, as mentioned earlier, was used on scRNA-seq data obtained from monocyte-derived dendritic cells (moDCs) infected with distinct Salmonella strains21, where accuracy was the top priority. The application to single-cell data is important because in studies of the human microbiome, it is of interest to know which cells are infected with which microbial strains, especially to distinguish between benign commensals and pathogenic variants of bacteria such as E. coli.
Using current sequencing technologies, single-cell nucleotide data are primarily RNAseq rather than DNAseq, which is why we focus on an RNAseq case study. Returning to the established problem of analyzing bulk DNAseq data, we demonstrate the comparative advantage of CAMMiQ against the top performing alignment based and alignment free metagenomic classification methods according to the above-mentioned benchmarking study9 on the very same (CAMI and IMMSA) dataset. We additionally show that CAMMiQ is uniquely capable of handling particularly challenging microbial strains we derived from the NCBI RefSeq database.
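The notion of unique (c = 1) and doubly-unique (c = 2) substrings described above can be illustrated with a minimal sketch. It is not CAMMiQ's implementation: it assumes fixed-length k-mers instead of variable-length shortest substrings, ignores reverse complements, and treats "doubly-unique" as "present in exactly two genomes". The function names `substring_owners` and `almost_unique` and the toy genomes are hypothetical.

```python
from collections import defaultdict


def substring_owners(genomes, k):
    """Map every k-mer to the set of genome IDs that contain it."""
    owners = defaultdict(set)
    for gid, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(gid)
    return owners


def almost_unique(genomes, k, c=2):
    """Return {k-mer: sorted owner tuple} for k-mers found in at most c genomes."""
    return {
        kmer: tuple(sorted(gids))
        for kmer, gids in substring_owners(genomes, k).items()
        if len(gids) <= c
    }


# Toy index dataset (assumption: three short "genomes").
genomes = {"s1": "ACGTAC", "s2": "CGTACG", "s3": "TTTACG"}
index = almost_unique(genomes, k=4, c=2)

unique = {kmer for kmer, gids in index.items() if len(gids) == 1}
doubly_unique = {kmer for kmer, gids in index.items() if len(gids) == 2}

# Genomes that own at least one unique 4-mer, and genomes covered once
# doubly-unique 4-mers are also indexed.
covered_c1 = {gids[0] for gids in index.values() if len(gids) == 1}
covered_c2 = {g for gids in index.values() for g in gids}
```

In this toy example `s2` shares every one of its 4-mers with another genome, so it is invisible to an index restricted to c = 1 but becomes covered when c = 2 substrings are indexed as well.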

## Results

Below, we first give a brief overview of the CAMMiQ algorithm. Then we describe the index data sets, simulated and real query sets, as well as the alternative computational methods we used to benchmark CAMMiQ’s performance. We next demonstrate CAMMiQ’s comparative accuracy performance against alternative metagenomic analysis methods on the two species level data sets we have: the first is the CAMI and IMMSA benchmark (i.e., species-level-all) index dataset and the second is the species-level-bacteria index dataset. For these two datasets, we not only provide accuracy figures for the tools benchmarked but also the computational resources they use. Additionally, we demonstrate the maximum potential advantage that could be offered by CAMMiQ through its use of doubly-unique, variable length substrings on our species-level-bacteria index dataset. We then demonstrate CAMMiQ’s performance on our strain-level index dataset. Finally, we demonstrate CAMMiQ’s performance on real metatranscriptomic query sets through its use of our subspecies-level index dataset. The results of CAMMiQ in this setup were compared against those of the GATK PathSeq tool10, which was utilized by the original study on this data set21, as well as the blastn method11, which possibly offers the most accurate (albeit slow) approach for the relevant purpose.

### Overview of CAMMiQ indexing and querying procedure

As per a typical metagenomic classification or profiling tool, CAMMiQ involves two steps, namely, index construction and query. In the index construction step, CAMMiQ is given a set $$\mathcal{S}=\{s_{i}\}_{i=1}^{m}$$ of m genomes or contigs, each labeled with an ID representing the taxonomy of that genome. We call $$\mathcal{S}$$ an index dataset below. By the end of this step, CAMMiQ returns the collection of sparsified shortest unique substrings and shortest doubly-unique substrings on each genome $$s_{i}$$ in $$\mathcal{S}$$ in a compressed binary format, and other meta information involving the input index dataset, which jointly compose its index on the dataset. CAMMiQ reuses its index in the next query step.

In the query step, CAMMiQ is given a collection of reads $$\mathcal{Q}=\{r_{j}\}_{j=1}^{n}$$ of varying length, and efficiently identifies a set of genomes $$\mathcal{A}=\{s_{1},\cdots,s_{a}\}\subset \mathcal{S}$$ and their respective abundances $$p_{1},\ldots,p_{a}$$ that “best explain” $$\mathcal{Q}$$. We call $$\mathcal{Q}$$ a query or query set below. Depending on the specific application, a user can select to return (i) $$\mathcal{A}_{1}\subseteq \mathcal{S}$$, the set of genomes such that each includes at least one shortest unique substring that also occurs in some read $$r_{j}$$ in the query $$\mathcal{Q}$$; (ii) $$\mathcal{A}_{2}\subseteq \mathcal{S}$$, the smallest subset of genomes in $$\mathcal{S}$$ that include all shortest unique and doubly-unique substrings that also occur in some read $$r_{j}\in \mathcal{Q}$$; or (iii) $$\mathcal{A}_{3}\subseteq \mathcal{S}$$, the smallest subset of $$\mathcal{S}$$ that again include all shortest unique and doubly-unique substrings that also occur in some read $$r_{j}\in \mathcal{Q}$$, with the additional constraint that the “coverage” of these substrings in each genome $$s_{i}\in \mathcal{A}_{3}$$ is roughly uniform. In the last case CAMMiQ also computes the relative abundance of each genome $$s_{i}$$ in $$\mathcal{A}_{3}$$.

For all three query types, CAMMiQ first identifies for each read $$r_{j}$$ all unique and doubly-unique substrings it includes; it then assigns $$r_{j}$$ to the one or two genomes from which these substrings possibly originate. To compute $$\mathcal{A}_{1}$$, CAMMiQ simply returns the collection of genomes receiving at least one read assignment. To compute $$\mathcal{A}_{2}$$, CAMMiQ solves a hitting set problem through an ILP, where genomes form the universe of items, and indexed strings that appear in query reads form the sets of items to be hit. To compute $$\mathcal{A}_{3}$$, CAMMiQ solves the combinatorial optimization problem that asks to minimize the variance among the number of reads assigned to each indexed substring of each genome, again through an ILP. The solution indicates the set of genomes in $$\mathcal{A}_{3}$$ along with their respective abundances.
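The first two query types can be sketched on a toy input as follows. This is an illustration under stated assumptions, not CAMMiQ's code: each element of `hits` is the (hypothetical) owner tuple of one indexed substring observed in the reads, the $$\mathcal{A}_{1}$$ sketch follows the definition via unique substrings, and the minimum hitting set for $$\mathcal{A}_{2}$$ is found by exhaustive search, which stands in for the ILP and is only practical for very small instances.

```python
from itertools import combinations


def query_a1(hits):
    """A1 sketch: genomes owning at least one unique substring seen in a read."""
    return {owners[0] for owners in hits if len(owners) == 1}


def query_a2(hits):
    """A2 sketch: a smallest genome set intersecting every observed owner set.

    Exhaustive search over subsets of increasing size (exponential time);
    used here in place of an ILP solver purely for illustration.
    """
    universe = sorted({g for owners in hits for g in owners})
    for size in range(len(universe) + 1):
        for candidate in combinations(universe, size):
            chosen = set(candidate)
            if all(chosen & set(owners) for owners in hits):
                return chosen
    return set(universe)


# Owner tuples of the indexed substrings found in a toy query:
# one unique substring of s1 and three doubly-unique substrings.
hits = [("s1",), ("s1", "s2"), ("s2", "s3"), ("s3", "s4")]

a1 = query_a1(hits)
a2 = query_a2(hits)
```

Here `s1` must be selected because of its unique substring, and `s3` is the single genome that explains both remaining doubly-unique substrings, so the smallest explanation contains two genomes.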

### Datasets

To evaluate the overall performance of CAMMiQ, we have performed four sets of experiments, each with a distinct index dataset (all based on NCBI’s RefSeq database56) and a distinct collection of queries.

1. The first, species-level-all dataset is the most comprehensive index dataset, which includes one complete genome from each bacterial, viral and archaeal species from NCBI’s RefSeq database, resulting in a total of m = 16,418 genomes. This dataset is established for the CAMI and IMMSA repository used in recent benchmarking studies of metagenomics classification and profiling tools9,57. There are 16 query sets from this repository used in these two studies, 8 from CAMI and 8 from IMMSA. Notably, both CAMI and IMMSA query sets include genomes that are not present in the species-level-all index dataset. In fact, the CAMI query sets include only a small proportion of genomes from the index dataset: the majority of the reads in these queries represent unknown species or simulated strains “evolved” from known species that are not in the species-level-all index dataset. See Supplementary Notes 5.4.1 and 5.4.2 for a detailed description of these queries. We used these query sets to demonstrate the comparative performance of CAMMiQ against the best performing methods according to ref. 9, namely Kraken258, KrakenUniq15, CLARK14, Centrifuge16, and Bracken17; please see Supplementary Note 6 for the specific parameters and setup used for each of these tools. Since genomes in the query sets may not be all included in the index dataset, we employed query type $$\mathcal{A}_{2}$$ to evaluate CAMMiQ’s performance against the aforementioned tools.

2. We compiled our next, species-level-bacteria index dataset to evaluate the species level performance of CAMMiQ, this time across one representative complete genome from each of the m = 4122 bacterial species from (an earlier version of) NCBI’s RefSeq. This index dataset enabled us to measure the performance of CAMMiQ’s type $$\mathcal{A}_{3}$$ queries against the tools mentioned above plus MetaPhlAn212, a marker-gene based profiling tool. We simulated 14 query sets for this experiment with varying levels of “difficulty” across the genomes. These include 10 challenging (marked Least) and 4 easier queries (marked Random). See Supplementary Note 5.4.3 and Supplementary Fig. 2 for a detailed description of these queries.

3. Our next strain-level index dataset is smaller: it includes the complete set of m = 614 human gut related bacterial strains from ref. 59 for the purpose of evaluating CAMMiQ’s strain level performance. We again employed type $$\mathcal{A}_{3}$$ queries of CAMMiQ to compare it against the above-mentioned tools. We simulated 4 queries for this index dataset with varying levels of “difficulty”. See Supplementary Note 5.5 for details.

4. We finally evaluated CAMMiQ on a dataset from another study60 which involved metatranscriptomic reads from 262 single human immune cells (monocyte-derived dendritic cells, moDCs) deliberately infected with two distinct strains of the intracellular bacterium Salmonella enterica and 80 uninfected cells used as negative controls. A recent study21 applied the GATK PathSeq tool10 to these metatranscriptomic read sets to validate the presence of the Salmonella genus in each cell. To demonstrate CAMMiQ’s ability to distinguish cells infected with specific strains of Salmonella in much less time than GATK PathSeq, we applied its query types $$\mathcal{A}_{1}$$ and $$\mathcal{A}_{2}$$ to these metatranscriptomic read sets. Since these are not genomic reads, our query type $$\mathcal{A}_{3}$$ could not be used. The index dataset we used for these queries is at the subspecies level; it consists of m = 3395 complete bacterial genomes, where each species is represented by a handful of strains. This index dataset was generated to reduce the sampling bias observed in the RefSeq database, which, e.g., includes more than 300 strains from the genus Salmonella. CAMMiQ’s accuracy was compared mainly against PathSeq (a mapping based, thus relatively slow method) for this experiment since PathSeq was the preferred method of the original study due to its high levels of accuracy. Further details on the real query sets can be found in Supplementary Note 5.6.

A summary of data sets used in our experiments can be found in Table 1. Additional details on the four index datasets can be found in Supplementary Notes 5.1–5.3. As will be demonstrated, CAMMiQ’s performance on these query sets is superior to all alternatives in almost all scenarios we tested.

### Precision and recall in read classification across all species level queries

We tested CAMMiQ’s species level performance on both CAMI and IMMSA (i.e., species-level-all) and species-level-bacteria data sets, and compared it against the best performing alternatives according to ref. 9. Results based on CAMI and IMMSA are summarized in Table 2; results based on species-level-bacteria data set are summarized in Table 3.

Perhaps the most widely-used performance measures to benchmark metagenomic classifiers are the proportion of reads correctly assigned to a genome among (i) the set of reads assigned to some genome, i.e., precision, and (ii) the full set of reads in the query, i.e., recall14. In Table 2, panel A, as well as Table 3, panel A, we report the selected tools’ precision in read classification. Then, in Table 2, panel B, and Table 3, panel B, we report these tools’ recall in read classification.
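The two read-level measures defined above can be computed as in the following minimal sketch. The function name `read_precision_recall` and the dictionary-based inputs are illustrative assumptions; an unassigned read is represented by `None`.

```python
def read_precision_recall(truth, predicted):
    """Read-classification precision and recall.

    truth:     dict read ID -> true genome ID
    predicted: dict read ID -> assigned genome ID, or None if unassigned
    """
    assigned = {r: g for r, g in predicted.items() if g is not None}
    correct = sum(1 for r, g in assigned.items() if truth.get(r) == g)
    # Precision: correct reads among the reads assigned to some genome.
    precision = correct / len(assigned) if assigned else 0.0
    # Recall: correct reads among all reads in the query.
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Toy query of four reads: two classified correctly, one misclassified,
# one left unassigned.
truth = {"r1": "A", "r2": "A", "r3": "B", "r4": "B"}
predicted = {"r1": "A", "r2": "B", "r3": "B", "r4": None}
precision, recall = read_precision_recall(truth, predicted)
```

In this example the unassigned read lowers recall but not precision, which is why treating higher-rank assignments as unassigned tends to raise a tool's reported precision while lowering its recall.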

Note that the above tables do not report the read classification precision and recall values for MetaPhlAn2. This is partially due to MetaPhlAn2’s use of an index based on a very different (predetermined) and much smaller database of marker genes. As a consequence, MetaPhlAn2 assigns very few reads to the marker genes in its database and thus appears to have very low recall (and possibly higher precision). This would not accurately reflect MetaPhlAn2’s performance since, unlike the other tools we benchmarked, MetaPhlAn2 does not aim to assign as many reads to genomes correctly as possible but rather aims to identify distinct genomes in a metagenomic sample; see Supplementary Note 6 for details. Additionally note that for our bookkeeping purposes, any read assigned to a taxonomic level strictly higher than the species level by Kraken2, KrakenUniq, and Centrifuge is considered to be not assigned. This likely increases their reported precision but may decrease their recall.

In all our species level tests, we used CAMMiQ’s default parameter settings of $${L}_{\min }=26$$ and $${L}_{\max }=50$$ to compare it against Kraken2, KrakenUniq, Bracken, CLARK, and Centrifuge, all using k-mer length of 26; see Supplementary Note 6 for details on parameter settings. Results based on alternative parameter settings can also be found in Supplementary Note 8 and in particular Supplementary Table 6. In all of these experiments, we used the same collection of genomes for establishing the index for each of the five tools (with the exception of MetaPhlAn2, which uses its own predetermined index): the results in Table 3 are based on our species-level-bacteria index dataset and the results in Table 2 are based on our species-level-all index dataset.

Compared with the species-level-bacteria queries, which are composed of highly similar genomes, the CAMI and IMMSA queries are, in principle, less challenging since reads that did not get mapped to a unique genome were excluded from these queries at the time they were compiled57. Even though the RefSeq database has been significantly updated since these queries were compiled, almost all reads in these queries still map to a unique genome. Having said that, reads in these queries may originate from genomes outside of the species-level-all index dataset, including plasmids from these species that have not been indexed. It is entirely possible that such reads may include one or more unique or doubly-unique substring(s) indexed by CAMMiQ, and thus be assigned to the wrong genome.

As can be seen in Table 2, panel A, CAMMiQ offered the best precision in read classification for all IMMSA queries; interestingly, the precision values for Centrifuge were much lower than those of the alternatives. CAMMiQ was arguably the best on the recall in read classification on the IMMSA queries as well, as can be seen in Table 2, panel B. However, reads that originate from genomes outside of the index database were likely not utilized by CAMMiQ, reducing its comparative advantage against KrakenUniq and CLARK, which may still assign such reads to a genome; this would increase their recall, while possibly reducing their precision.

As can be seen in Table 3 and Supplementary Table 5, CAMMiQ achieved the best recall and F1 score (see Supplementary Note 7 for a definition), and the second best precision for the 11 species-level-bacteria query sets (the three queries with uneven coverage were excluded). Its precision and recall were particularly impressive for the first 7 challenging queries (labeled with prefix “Least”), where CAMMiQ was an order of magnitude better than the alternatives in terms of both measures. On these queries, tools other than CAMMiQ assigned only a small proportion of reads to genomes at the species level. This is because none of them employ doubly-unique substrings to differentiate species in the index dataset from the same genus. The only exception is Centrifuge, which achieved the best classification precision and the second best recall. For example, on the 3 hypothetically error-free queries (labels ending with -1 in Table 3), Centrifuge (in addition to CAMMiQ and CLARK) achieved 100% precision. However, Centrifuge’s classification performance deteriorated when genomes in queries were likely not present in the corresponding index dataset (Table 2, panels A and B). Note that in principle Kraken2, KrakenUniq and Centrifuge could assign reads to the correct taxonomy higher than the species level. However, as mentioned above, only reads that were assigned to the correct species were considered to be true positives for this benchmark.

### Precision and recall in genome identification on IMMSA and CAMI queries

On the CAMI and IMMSA queries, CAMMiQ correctly identified more genomes than the alternative tools with the same abundance cutoff of 0.01% (we consider a genome to have been identified by a tool only if the tool reports its abundance to be ≥0.01% of the total abundance of all genomes), resulting in superior recall values for genome identification (Table 2, panel C; note that recall in genome identification represents the fraction of correctly identified genomes among all genomes in a query set). This is primarily due to CAMMiQ’s use of doubly-unique substrings in its query type $$\mathcal{A}_{2}$$. Compared to its recall performance, CAMMiQ achieved even better precision figures than the alternatives (Table 2, panel D; note that precision in genome identification represents the fraction of correctly identified genomes among the set of genomes identified by a given tool), due to fewer false positive identifications. The fact that CAMMiQ performs best with respect to the precision values in particular indicates that genomes not present in the index dataset would have the least impact on CAMMiQ in comparison to other tools. Note that by postprocessing the output of Kraken2, Bracken manages to improve on the number of identified genomes, and achieves comparable figures to CAMMiQ. However, it does not reduce the large number of false positive genome identifications produced by Kraken2 when unknown genomes or genomes outside the index dataset are present in the query sample.
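The genome-identification measures with the 0.01% abundance cutoff can be made concrete with a short sketch. The function names and the toy abundance profile are hypothetical; only the cutoff and the two definitions are taken from the text.

```python
def identified_genomes(abundances, cutoff=1e-4):
    """Genomes whose reported abundance is >= cutoff (0.01%) of the total."""
    total = sum(abundances.values())
    if total == 0:
        return set()
    return {g for g, a in abundances.items() if a / total >= cutoff}


def genome_precision_recall(true_genomes, abundances, cutoff=1e-4):
    """Precision and recall in genome identification for one tool."""
    called = identified_genomes(abundances, cutoff)
    true_positives = len(called & set(true_genomes))
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(true_genomes) if true_genomes else 0.0
    return precision, recall


# Toy profile reported by a tool: genome C falls below the 0.01% cutoff.
reported = {"A": 0.6, "B": 0.39995, "C": 0.00005}
true_genomes = {"A", "B", "D"}

called = identified_genomes(reported)
precision, recall = genome_precision_recall(true_genomes, reported)
```

Genome C is discarded by the cutoff and therefore does not count as a false positive, while the missed genome D lowers recall to two thirds.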

### Genome identification and quantification performance on species-level-bacteria queries

Next, we evaluated the number of correctly identified genomes by each tool (specific to MetaPhlAn2, the genus corresponding to each genome), as well as the L1 and L2 distances between the true abundance profile and the predicted abundance profile, on the 14 queries involving our species-level-bacteria dataset, including the 3 queries with GC bias. As can be seen in Table 3, panels C–E, and Supplementary Table 5, CAMMiQ clearly offered the best performance in both identification and quantification. It correctly identified all genomes present in each one of the 14 queries and was not impacted by uneven read coverage or the genome we added to the query Random-20-lognormal-a.g. which was not indexed. Importantly, CAMMiQ consistently returned very few false positive genomes for the most challenging queries, and at most one false positive genome for the remaining 4 queries.

Compared to CAMMiQ, other tools reported a larger number of false negatives in these 14 queries (again we consider a genome to be a “negative” if its reported abundance level is ≤0.01% of the total abundance of all genomes), in particular in the 10 challenging queries (labeled with the prefix “Least”) with minimal unique substrings (i.e., L-mers). Among them, CLARK and Centrifuge offered the best false negative performance, especially on error free queries. As can be expected, MetaPhlAn2 had the worst performance with respect to false negatives, very likely due to the incompleteness of its marker gene list (we used the latest set of marker genes mpa_v20_m200 in MetaPhlAn2). This also led to relatively larger L1/L2 distances than the other tools, even for the remaining 4 (easier) queries. Kraken2 and KrakenUniq were also prone to having false negatives, though fewer than MetaPhlAn2. Bracken, in general, could correctly identify a few more genomes than Kraken2, and this improvement in its identification performance also leads to better quantification results (see below).

CAMMiQ performs even better with respect to the number of false positive genomes, as demonstrated by its F1 score distribution (see Supplementary Table 5). The alternative tools all returned a large number of false positives in species-level-bacteria queries, especially in the first 10 challenging queries, even though all reads in these queries were sampled from (some genome in) the index dataset (see Table 3, panel C). Among them, Centrifuge and Bracken usually performed better on the 10 challenging queries with fewer ‘unique’ genomes, while KrakenUniq and CLARK performed better on the remaining 4 (easier) queries. Kraken2 showed the worst performance with respect to false positives: it output more than a third of the genomes from the index dataset even for the three error free queries. In many of the datasets, these false positives were eliminated by Bracken’s postprocessing of Kraken2’s output; unfortunately, in other query datasets, e.g., Random-20-uniform and Random-100-uniform, Bracken introduced additional false positives. MetaPhlAn2 identified only a limited number of genomes (and few true positive genomes) in all queries in general, so it had a comparable performance to CAMMiQ with respect to false positives. However, its F1 scores were not as good as CAMMiQ’s (see Supplementary Table 5).

Note that CAMMiQ not only correctly identified all genomes, but also predicted their abundances reasonably close to the true values. As can be seen in Table 3, CAMMiQ outperformed all other tools on both L1 and L2 errors, typically offering a 3–4× improvement over the second best alternative. Interestingly, even when the coverage across each genome was non-uniform, CAMMiQ’s $$\mathcal{A}_{3}$$ type of query was only mildly impacted. As noted earlier, on the 10 challenging queries (especially those with sequencing errors), all alternative tools except MetaPhlAn2 output hundreds of false positive genomes. As a consequence, their predictions for the abundances of the true positive genomes were smaller than the true abundance values. This is particularly the case for Kraken2 and KrakenUniq: even though they identified the majority of the true positive genomes correctly, their reported abundance values were all close to 0; this results in their L1 distances being very close to 1.
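For reference, the L1 and L2 distances between a true and a predicted abundance profile can be computed as in the sketch below. It assumes that both profiles are given as relative abundances, that a genome missing from a profile has abundance 0, and that no further normalization is applied; the exact convention used for the tables may differ, and the function name `profile_distances` is hypothetical.

```python
import math


def profile_distances(true_profile, predicted_profile):
    """L1 and L2 distances between two abundance profiles (dict genome -> abundance).

    Genomes absent from a profile are treated as having abundance 0.
    """
    genomes = set(true_profile) | set(predicted_profile)
    diffs = [
        true_profile.get(g, 0.0) - predicted_profile.get(g, 0.0) for g in genomes
    ]
    l1 = sum(abs(d) for d in diffs)
    l2 = math.sqrt(sum(d * d for d in diffs))
    return l1, l2


# Toy example: a false positive genome C takes abundance away from genome A.
true_profile = {"A": 0.5, "B": 0.5}
predicted_profile = {"A": 0.25, "B": 0.5, "C": 0.25}
l1, l2 = profile_distances(true_profile, predicted_profile)
```

The example shows how false positive genomes inflate both distances twice: once through their own spurious abundance and once through the abundance they remove from the true positive genomes.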

### Evaluation of computational resources on species level queries

We compared the running time and memory usage of CAMMiQ, Kraken2/Bracken, KrakenUniq, CLARK, MetaPhlAn2, and Centrifuge in building the index and responding to the queries; see Table 4. As can be seen, CAMMiQ performs better than all alternatives in running time, including those tools that aim to index all unique k-mers (KrakenUniq and CLARK) and all substrings (Centrifuge), with respect to both query time and index construction time. The only exception is Kraken2 (MetaPhlAn2 uses a pre-built index and so it cannot be compared against others with respect to index construction time); however, Kraken2’s overall accuracy is worse than the others across the species-level queries. Since MetaPhlAn2 uses a pre-built index (see Supplementary Note 6), it avoids the expensive index construction process. This, however, results in many false negatives (see Subsections Genome identification and quantification performance on species-level-bacteria queries and Performance of CAMMiQ at the strain level). CAMMiQ also supports pre-built indices. Compared to the other tools and methods, the sizes of these pre-built indices are much smaller (Table 4, panel B), due to the sparsification of unique and doubly-unique substrings, allowing convenient transfer and fast downloading. Note that we do not report the time for loading the index into memory for any of the tools, since this is performed only once.

All of our experiments were run on a Linux server equipped with 40 Intel Xeon E7-8891 2.80 GHz processors, with 2.5 TB of physical memory and 30 TB of disk space. The ILP solver used by CAMMiQ in the initial implementation is IBM ILOG CPLEX 12.9.0. We have also ported the code to use the ILP solver Gurobi 9.1.0.

### Assessing the use of variable-length and doubly-unique substrings in species-level-bacteria queries

Due to its unique algorithmic features, CAMMiQ outperforms available alternatives on the CAMI, IMMSA and the species-level-bacteria query sets. A key question is: what is the maximum potential improvement in performance one can expect through the use of (i) variable-length substrings as opposed to fixed-length k-mers, and (ii) doubly-unique substrings in addition to unique substrings? Here, we evaluate both of these algorithmic features in the context of the species-level-bacteria dataset we constructed (see Table 1). For that, we compare the proportion of L-mers (for read length L = 100) from each genome si in our species-level-bacteria index dataset that are unique or doubly-unique (and thus are utilized by CAMMiQ) with the proportion of L-mers that include a unique k-mer (and thus can be utilized by CLARK and others) for k = 30.

Figure 1a summarizes our findings: on the horizontal axis, the genomes are sorted with respect to the proportion of unique and doubly-unique L-mers they have; the vertical axis depicts this proportionality (from 0.0 to 1.0). The figure shows the proportion of unique L-mers, doubly-unique L-mers, the combination of unique and doubly-unique L-mers (all utilized by CAMMiQ), as well as the L-mers that include a unique k-mer (utilized by, e.g., CLARK) for each genome depicted on the horizontal axis. As can be seen, roughly three quarters of all genomes in this dataset are easily distinguishable since a large fraction of their L-mers include a unique k-mer. However, about a quarter of the genomes in this dataset can benefit from the consideration of doubly-unique substrings, especially when their abundances are low. In particular, 66 of these 4122 genomes/species have extremely low proportions (each ≤1%) of unique 100-mers. At the extreme, the species Francisella sp. MA06-7296 does not have a single unique 100-mer and the species Rhizobium sp. N6212 does not have any 100-mer that includes a unique 30-mer (in fact any substring of length $$\le {L}_{\max }=50$$). These two species cannot be identified by, e.g., CLARK in any microbial mixture, regardless of their abundance values.

Figure 1b depicts the inverse proportionality of doubly-unique L-mers in comparison to unique L-mers among 50 genomes that have the lowest proportion of unique L-mers - for L = 100. The inverse-proportionality of unique or doubly-unique L-mers for a genome corresponds to the number of reads to be sampled (on average) from that genome to guarantee that the sample includes one read that would be assigned to the correct genome. In the absence of read errors, this guarantees correct identification of the corresponding genome in the query. Note that, in half of these 50 genomes, almost all L-mers are doubly-unique. This implies that any query involving one or more of these genomes could only be resolved by CAMMiQ and no other tool.
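The relationship between the proportion of informative L-mers and the expected number of reads can be illustrated with a short calculation. This is a minimal sketch, assuming error-free reads drawn uniformly and independently from the genome, so that the number of reads until the first informative one follows a geometric distribution; the function name is hypothetical and not part of CAMMiQ.

```python
def expected_reads_until_informative(proportion: float) -> float:
    """Expected number of reads to sample before one read is a unique or
    doubly-unique L-mer, assuming independent uniform sampling.

    `proportion` is the fraction of L-mers of the genome that are
    unique or doubly-unique (the quantity plotted in Fig. 1a).
    """
    if not 0.0 < proportion <= 1.0:
        raise ValueError("proportion must be in (0, 1]")
    # Mean of a geometric distribution with success probability `proportion`.
    return 1.0 / proportion
```

For instance, a genome in which only a quarter of the L-mers are unique or doubly-unique would need four reads on average under these assumptions.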

We further assessed whether the usage of unique and doubly-unique substrings can lead to robust genome identification and quantification performance in practice, by evaluating the distribution of these substrings across the genome. In principle, the more evenly these substrings are distributed across a genome, the less likely CAMMiQ’s quantification performance can be impacted by queries composed of genomes with small alterations to the corresponding index genomes. As can be seen in Fig. 1c, d, unique and doubly unique substrings span the entire genome on most of the species in our species-level-bacteria index dataset, not significantly biased towards any functionally annotated region by NCBI (i.e., gene, CDS, ncRNA, rRNA, tRNA, tmRNA or plasmid). Even when the numbers of unique or doubly-unique substrings are relatively small in a genome (for example, the last 3 genomes in Fig. 1d), they are still well distributed, helping CAMMiQ with that genome’s identification as well as quantification. We would like to note here that even though some genomes have very few unique substrings, implying that they would be difficult to identify through the use of alternative methods, because of their (well distributed) doubly-unique substrings, CAMMiQ can identify and quantify them accurately. Consider, for example, the last genome in Fig. 1d, Rhizobium sp. N1341 in which the only unique substrings are located on the plasmids. However, since there are sufficiently many doubly unique substrings on the chromosome, this species could still be identified and quantified by CAMMiQ, through the $${{{{{{{{\mathcal{A}}}}}}}}}_{2}$$ or $${{{{{{{{\mathcal{A}}}}}}}}}_{3}$$ type of query.

### Performance of CAMMiQ at the strain level

In the next experiment, we evaluated CAMMiQ’s performance (with default parameters) on queries composed from our strain-level dataset that consists of 614 Human Gut related genomes of bacterial strains from 409 species59 as described in Supplementary Note 5.2. As can be seen in Table 5, CAMMiQ managed to identify and accurately quantify all strains in the queries HumanGut-random-100-1 and HumanGut-random-100-2, and > 96% of the strains in the other two queries, with almost no false positives. Other tools benchmarked against CAMMiQ lead to either more false negative genomes (KrakenUniq, CLARK, MetaPhlAn2) or more false positive identifications (Kraken2, Centrifuge). Furthermore, their quantification performance (Table 5, panel B) is worse than that of CAMMiQ.

### Performance of CAMMiQ on real single-cell metatranscriptomic queries

Our final set of experiments involves “real” metatranscriptomic reads from human monocyte-derived dendritic cells (moDCs)60. Because CAMMiQ’s most powerful type $${{{{{{{{\mathcal{A}}}}}}}}}_{3}$$ query is not suitable for RNA-seq data (due to high variance in read coverage), we employed $${{{{{{{{\mathcal{A}}}}}}}}}_{1}$$ and $${{{{{{{{\mathcal{A}}}}}}}}}_{2}$$ queries. We remind the reader that $${{{{{{{{\mathcal{A}}}}}}}}}_{1}$$ only uses unique substrings in query reads and returns the genomes in the index for which there is at least one such substring. On the other hand, $${{{{{{{{\mathcal{A}}}}}}}}}_{2}$$ computes the smallest set of genomes in the index that include all unique or doubly-unique substrings across the query reads.

Each query was composed of all high quality, non-human scRNA-seq reads from the corresponding single cell60. To guarantee this, we filtered out all scRNA-seq reads that (i) possibly originate from the human genome, or (ii) have low sequence quality and “complexity”, or (iii) map to 16S or 23S ribosomal RNAs on the two Salmonella genomes (to avoid incorrect assignment of reads due to “barcode hopping”).

Following the original study60, we categorized each cell into one of the 5 groups: infected cells that were confirmed to contain (1) STM-LT2 or (2) STM-D23580 strain of intracellular Salmonella; bystander cells that were exposed to (3) STM-LT2 or (4) STM-D23580 strains, but confirmed to not contain intracellular Salmonella; and (5) cells that were mock-infected and sequenced as controls. For each query, we compared the number of reads CAMMiQ assigned uniquely to STM-LT2 or STM-D23580 genomes against those aligned and assigned either by the GATK PathSeq10 tool or blastn11 (see Supplementary Note 9).

Figure 2 summarizes our results on this data set. In Fig. 2a, we demonstrate that compared to the GATK PathSeq approach, CAMMiQ’s $${{{{{{{{\mathcal{A}}}}}}}}}_{1}$$ type queries were more sensitive with respect to read assignment. On average, CAMMiQ identified (roughly) an order of magnitude more unique STM-LT2 or STM-D23580 reads in each cell, demonstrating its potential to better identify intracellular organisms at subspecies or strain level. Note that CAMMiQ’s performance is comparable or slightly better than that of blastn. However CAMMiQ is several orders of magnitude faster than blastn or GATK PathSeq. (CAMMiQ only took a total of 65.3s for computing $${{{{{{{{\mathcal{A}}}}}}}}}_{1}$$ type queries and an additional 2.5s for computing $${{{{{{{{\mathcal{A}}}}}}}}}_{2}$$ type queries on the entire query set, outperforming GATK PathSeq, which required 29628.1s, or blastn, which is typically slower).

The abundances reported by each of the three tools (measured by unique read counts) of Salmonella were substantially higher in the infected cells compared to the mock-infected controls. More importantly, cells known to be infected with or exposed to a particular strain indeed include significantly more reads from that strain. Interestingly, CAMMiQ as well as blastn reported that cells infected with or exposed to a particular strain also contain reads unique to the other strain. This is possibly due to sequencing errors or incorrect cell assignments for these reads.

In Fig. 2b, we compare CAMMiQ’s $${{{{{{{{\mathcal{A}}}}}}}}}_{2}$$ type queries with its $${{{{{{{{\mathcal{A}}}}}}}}}_{1}$$ type queries (as well as GATK PathSeq and blastn) with respect to the number of cells they correctly identify to include STM-LT2 or STM-D23580 strains. For that we vary the minimum number of reads that need to be identified by each tool to report a given strain, and for each such value we indicate how many cells are reported to include the STM-LT2 strain (on the vertical axis) vs the STM-D23580 strain (on the horizontal axis). With the exception of the third subpanel, a method with a plot closer to the diagonal is less sensitive. As can be seen, CAMMiQ’s $${{{{{{{{\mathcal{A}}}}}}}}}_{2}$$ type queries are more sensitive than not only its $${{{{{{{{\mathcal{A}}}}}}}}}_{1}$$ type queries but also GATK PathSeq and blastn. However, they also introduce some potential false positive calls (e.g., in the third subpanel, corresponding to the controls). This could be due to additional reads utilized by $${{{{{{{{\mathcal{A}}}}}}}}}_{2}$$ queries impacted by read errors or incorrect assignments of these reads to cells.

## Discussion

We have introduced CAMMiQ, a new computational tool to identify microbes in an HTS sample and to estimate abundance of each species or strain. CAMMiQ is based on a principled approach that starts by defining formally the following algorithmic problem that has not been fully addressed by any available method. Given a set $${{{{{{{\mathcal{S}}}}}}}}$$ of distinct genomic sequences of any taxonomic rank, build a data structure so as to identify and quantify genomes in any query, composed of a mixture of reads from a subset of $${{{{{{{\mathcal{S}}}}}}}}$$. CAMMiQ is particularly designed to handle genomes that lack unique features; for that, it reduces the aforementioned identification and quantification problems to a combinatorial optimization problem that assigns substrings with limited ambiguity (i.e., doubly-unique substrings) to genomes so that, in its most general $${{{{{{{{\mathcal{A}}}}}}}}}_{3}$$ type query, each genome is “uniformly covered”. Uniform coverage is a simplifying assumption we employ in our theoretical analysis, since which genomes are represented in a query is not known in advance. In practice, the coverage for genomic sequences might be biased by GC content61,62. We do not employ this assumption in the CAMMiQ implementation for $${{{{{{{{\mathcal{A}}}}}}}}}_{1}$$ and $${{{{{{{{\mathcal{A}}}}}}}}}_{2}$$ type queries, which are more suitable for transcriptomic sequences. Our experiments on the Salmonella scRNAseq dataset indeed show that CAMMiQ delivers good results on scRNAseq queries even though the reads are skewed by variable expression and the selection biases of single-cell technology. Because each such substring has limited ambiguity, the resulting combinatorial optimization problem can be efficiently solved through the existing integer program solvers IBM CPLEX and Gurobi.

One potential limitation of CAMMiQ is that it relies on a database of reference genomes. In the context of medical microbiology this is a reasonable assumption since virtually all clinically-relevant microbes detected in new patients are known and have some similar genome sequenced and in RefSeq. The reliance on a reference database is more problematic in the context of studying environmental samples, in which new and rare taxa might be found by methods that do not rely on reference genomes. Our results on the CAMI benchmark data set provide reassurance that CAMMiQ performs well even when many genomes and plasmids are absent from the reference database. Another potential limitation is that the memory required by CAMMiQ index construction is relatively high. However, CAMMiQ supports pre-built indices on commonly used databases for metagenomic studies, e.g., (the latest version of) the RefSeq bacteria, viruses and archaea database. Compared to the other tools and methods, the sizes of these pre-built indices are much smaller, due to the sparsification of unique and doubly unique substrings, allowing convenient transfer and fast downloading. The pre-built CAMMiQ indices for all index datasets are available via the GitHub link provided in the Code Availability statement. In addition, as shown for the experiments summarized in Table 4, the memory requirements for CAMMiQ queries are comparable to those of other widely used packages and within the capabilities of currently available computers.

Provided that the doubly-unique substrings of a given genome are not all shared with one other genome, the use of doubly-unique substrings increases CAMMiQ’s ability to identify and quantify this genome within a query. In case the dataset to be indexed involves several genomes with high levels of similarity, CAMMiQ’s data structure and its combinatorial optimization formulation could be generalized to include “triply” or “quadruply” unique substrings, but this is not yet implemented. In summary, using principled methods from combinatorial optimization and string algorithms, CAMMiQ delivers better sensitivity and specificity than widely-used existing methods on practical genome classification and quantification problems.

## Methods

The input to CAMMiQ is a set of m genomes $${{{{{{{\mathcal{S}}}}}}}}={\{{s}_{i}\}}_{i=1}^{m}$$, possibly but not necessarily all from the same taxonomic level (each genome here may be associated with a genus, species, subspecies, or strain), to be indexed. Although we describe CAMMiQ for the case where each $${s}_{i}\in {{{{{{{\mathcal{S}}}}}}}}$$ is a single string, we do not assume that the genomes are fully assembled into a single contig. The string representing a genome could simply be a concatenation of all contigs from genome si and their reverse complements, with a special symbol \$i between consecutive contigs. We call $${{{{{{{\mathcal{S}}}}}}}}$$ the input database or synonymously index dataset, and we call $$i\in \{1,\cdots \,,m\}$$ the genome ID of string si.
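As an illustration of this representation, the following minimal sketch builds the string for one genome from its contigs. The function names are hypothetical, a single `$` stands in for the genome-specific separator symbol, and placing a separator between a contig and its own reverse complement is an assumption made here for simplicity.

```python
# Translation table for complementing DNA bases (upper-case ACGT only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def genome_string(contigs: list[str], separator: str = "$") -> str:
    """Concatenate each contig and its reverse complement, separated by a
    special symbol that does not occur in the DNA alphabet."""
    parts = []
    for contig in contigs:
        parts.append(contig)
        parts.append(reverse_complement(contig))
    return separator.join(parts)
```

Because the separator is not a nucleotide, no substring spanning two contigs can match a read.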

A query or query set for CAMMiQ contains a set of reads $${{{{{{{\mathcal{Q}}}}}}}}={\{{r}_{j}\}}_{j=1}^{n}$$ representing a metagenomic mixture. For simplicity, we describe CAMMiQ for reads of homogeneous length L; however, our data structure can handle reads of varying length. Given $${{{{{{{\mathcal{Q}}}}}}}}$$, the goal of CAMMiQ is to identify a set of genomes $${{{{{{{\mathcal{A}}}}}}}}=\{{s}_{1},\cdots \,,{s}_{a}\}\subset {{{{{{{\mathcal{S}}}}}}}}$$ and their respective abundances $${p}_{1},\cdots \,,{p}_{a}$$ that “best explain” $${{{{{{{\mathcal{Q}}}}}}}}$$. This is achieved by assigning (selected) reads rj to genomes si such that the implied coverage of each genome $${s}_{i}\in {{{{{{{\mathcal{A}}}}}}}}$$ is (roughly) uniform across si, with pi as the mean.

CAMMiQ’s index data structure involves the collection of shortest unique substrings and shortest doubly-unique substrings on each genome si in $${{{{{{{\mathcal{S}}}}}}}}$$. We call a substring of si unique if it does not occur on any other genome sj ≠ si in $${{{{{{{\mathcal{S}}}}}}}}$$; a shortest unique substring is a unique substring that does not include another unique substring. Similarly, we call a substring of si doubly-unique if it occurs on exactly one other genome $${s}_{j}\ne {s}_{i}\in {{{{{{{\mathcal{S}}}}}}}}$$; a shortest doubly-unique substring is a doubly-unique substring that does not include another doubly-unique substring. See Supplementary Note 1 for a formal definition for the uniqueness of a substring and Supplementary Fig. 1 for a graphical illustration. CAMMiQ does not maintain the entire collection of shortest unique and doubly-unique substrings of genomes in $${{{{{{{\mathcal{S}}}}}}}}$$; instead, its index contains only a sparsified set of shortest unique and doubly-unique substrings of each $${s}_{i}\in {{{{{{{\mathcal{S}}}}}}}}$$ so that no unique and doubly-unique substring is in close proximity (i.e., within a read length) of another in si. See Section CAMMiQ Index and Supplementary Note 2 for how exactly CAMMiQ sparsifies the collection of shortest unique and doubly-unique substrings.
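The definitions above can be made concrete on a toy index with a brute-force enumeration. This is only an illustrative sketch: the function name is hypothetical, the enumeration is quadratic per genome (the linear-time construction is the one referred to in Supplementary Note 1), and reverse complements, length bounds and sparsification are all ignored.

```python
def shortest_substrings(genomes: list[str], i: int, target: int) -> set[str]:
    """Shortest substrings of genomes[i] that occur in exactly `target`
    genomes of the index (target=1: unique, target=2: doubly-unique)."""

    def occurrences(sub: str) -> int:
        # Number of genomes in the index that contain `sub`.
        return sum(sub in g for g in genomes)

    g = genomes[i]
    result = set()
    for start in range(len(g)):
        for end in range(start + 1, len(g) + 1):
            w = g[start:end]
            if occurrences(w) != target:
                continue
            # Every proper substring of w lies inside w[1:] or w[:-1], so it
            # suffices to check these two to decide whether w is "shortest".
            if len(w) > 1 and (occurrences(w[1:]) == target
                               or occurrences(w[:-1]) == target):
                continue
            result.add(w)
    return result
```

On the toy index `["AAC", "AAG", "TAC"]`, the first genome has the single shortest unique substring `AAC`, while `AA` (shared with the second genome) and `C` (shared with the third) are its shortest doubly-unique substrings.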

With the (sparsified) collection of shortest unique and doubly-unique substrings, CAMMiQ is sufficiently powerful to answer the following three types of queries. The simplest type of query only involves unique substrings: given a query set $${{{{{{{\mathcal{Q}}}}}}}}$$, it asks for the set of genomes $${{{{{{{{\mathcal{A}}}}}}}}}_{1}\subseteq {{{{{{{\mathcal{S}}}}}}}}$$ so that each includes at least one (shortest) unique substring that also occurs in some read rj in the query $${{{{{{{\mathcal{Q}}}}}}}}$$. The second, more general query type involves both unique and doubly-unique substrings. It asks to compute $${{{{{{{{\mathcal{A}}}}}}}}}_{2}\subseteq {{{{{{{\mathcal{S}}}}}}}}$$, the smallest subset of genomes in $${{{{{{{\mathcal{S}}}}}}}}$$ that includes all (shortest) unique and doubly-unique substrings that also occur in some read $${r}_{j}\in {{{{{{{\mathcal{Q}}}}}}}}$$. Finally, the third and the most general type of query asks to compute the smallest subset $${{{{{{{{\mathcal{A}}}}}}}}}_{3}$$ of $${{{{{{{\mathcal{S}}}}}}}}$$ that again includes all (shortest) unique and doubly-unique substrings that also occur in some read $${r}_{j}\in {{{{{{{\mathcal{Q}}}}}}}}$$, with the additional constraint that the “coverage” of these substrings in each genome $${s}_{i}\in {{{{{{{{\mathcal{A}}}}}}}}}_{3}$$ is roughly uniform. In addition to the set of genomes $${{{{{{{{\mathcal{A}}}}}}}}}_{3}$$, the query also asks to compute the relative abundance of each genome si in $${{{{{{{{\mathcal{A}}}}}}}}}_{3}$$.

CAMMiQ with its ability to efficiently answer all three queries described above has several advantages over existing methods that rely on fixed-length unique substrings (i.e., unique k-mers). (i) Notice that the shorter a unique substring is the more likely it will be sampled (i.e., present in a read sampled from the relevant genome). This is because a substring of length $$L^{\prime} < L$$ is included in $$L-L^{\prime}+1$$ potential reads of length L that could be sampled from a genome. Unfortunately, the shorter a substring is, the less likely that it is unique or doubly-unique. A method that uses fixed length k-mers needs to have a compromise between the number of unique substrings and the likelihood of sampling each. CAMMiQ gets around this limitation by utilizing unique substrings of any length. CAMMiQ features a lower bound $${L}_{\min }$$ and upper bound $${L}_{\max }$$ on the lengths of unique and doubly-unique substrings as explained below. (ii) Unique substrings are relatively rare, at least for certain genomes and taxa, but substrings that appear in many genomes provide very limited information about the composition of a query $${{{{{{{\mathcal{Q}}}}}}}}$$. By involving doubly-unique substrings in a query $${{{{{{{\mathcal{Q}}}}}}}}$$, the subset of genomes that could be identified through query $${{{{{{{{\mathcal{A}}}}}}}}}_{2}$$ would be larger and more accurate than those that could be identified through query $${{{{{{{{\mathcal{A}}}}}}}}}_{1}$$, especially in the extreme case where $${{{{{{{\mathcal{Q}}}}}}}}$$ includes highly similar genomes that do not include any unique substring. (iii) Finally, by introducing the “uniform coverage” constraint, CAMMiQ’s $${{{{{{{{\mathcal{A}}}}}}}}}_{3}$$ type of query can identify more accurately the genome(s) where a doubly-unique substring originates. 
This is because a query of type $${{{{{{{{\mathcal{A}}}}}}}}}_{2}$$ may result in significant differences in coverage between unique and doubly-unique substrings of a given genome.

As mentioned above, CAMMiQ builds an index for the sparsified sets of shortest unique and doubly-unique substrings to compute efficiently the sets $${{{{{{{{\mathcal{A}}}}}}}}}_{1}$$, $${{{{{{{{\mathcal{A}}}}}}}}}_{2}$$ and $${{{{{{{{\mathcal{A}}}}}}}}}_{3}$$. For all three query types, CAMMiQ first identifies for each read rj all unique and doubly-unique substrings it includes; it then assigns rj to the one or two genomes from which these substrings can originate. To compute $${{{{{{{{\mathcal{A}}}}}}}}}_{1}$$, CAMMiQ can simply return the collection of genomes receiving at least one read assignment. To compute $${{{{{{{{\mathcal{A}}}}}}}}}_{2}$$, CAMMiQ needs to solve instances of the NP-hard set cover problem, or more precisely, its dual, the hitting set problem where genomes form the universe of items, and indexed strings that appear in query reads form the sets of items to be hit. Even though this is a restricted version of the hitting set problem where each set to be hit contains at most two items, it is still NP-hard, by a reduction from the vertex cover problem. To compute $${{{{{{{{\mathcal{A}}}}}}}}}_{3}$$, CAMMiQ solves the combinatorial optimization problem that asks to minimize the variance among the number of reads assigned to each indexed substring of each genome; the solution indicates the set of genomes in $${{{{{{{{\mathcal{A}}}}}}}}}_{3}$$ along with their respective abundances.

Details on the composition as well as the construction process for CAMMiQ’s index are discussed in Section CAMMiQ Index, as well as Supplementary Notes 1 and 2. The two stages in query processing of CAMMiQ are discussed in Subsections Query processing stage 1: Preprocessing the Reads and Query processing stage 2: ILP formulation. The first stage assigns reads to specific genomes, which is sufficient for computing sets $${{{{{{{{\mathcal{A}}}}}}}}}_{1}$$ and $${{{{{{{{\mathcal{A}}}}}}}}}_{2}$$. See Section Query processing stage 1: Preprocessing the Reads and Supplementary Note 3 for the criteria we use to assign a read to a genome, based on the indexed substrings that the read includes. The second stage introduces the combinatorial optimization formulation to compute $${{{{{{{{\mathcal{A}}}}}}}}}_{3}$$ as a response to the most general query type. See Section Query processing stage 2: ILP formulation for details.

### CAMMiQ Index

To respond to all three types of queries described above, CAMMiQ identifies all unique and doubly-unique substrings of the genomes in $${{{{{{{\mathcal{S}}}}}}}}$$ and organizes them in a simple but efficient data structure. Specifically, CAMMiQ computes the complete set of shortest unique substrings, $${{{{{{{\mathcal{U}}}}}}}}={\cup }_{i=1}^{m}{{{{{{{{\mathcal{U}}}}}}}}}_{i}$$, and the set of shortest doubly-unique substrings, $${{{{{{{\mathcal{D}}}}}}}}={\cup }_{i=1}^{m}{{{{{{{{\mathcal{D}}}}}}}}}_{i}$$, where $${{{{{{{{\mathcal{U}}}}}}}}}_{i}$$ and $${{{{{{{{\mathcal{D}}}}}}}}}_{i}$$ respectively denote the complete set of shortest unique and doubly-unique substrings from genome si, whose lengths are within the range $$[{L}_{\min },{L}_{\max }\le L]$$. See Supplementary Note 1 for a linear time algorithm to build both $${{{{{{{\mathcal{U}}}}}}}}$$ and $${{{{{{{\mathcal{D}}}}}}}}$$. CAMMiQ then sparsifies $${{{{{{{\mathcal{U}}}}}}}}$$ and $${{{{{{{\mathcal{D}}}}}}}}$$ by selecting only one representative substring among those that are in close proximity in each genome, and discarding the rest; this sparsification step is described in detail below and in Supplementary Note 2. Finally, it builds a collection of tries (trees where the root node represents a substring of length $${L}_{\min }$$ and every other internal node represents a single character) to compactly represent and efficiently search for substrings in $${{{{{{{\mathcal{U}}}}}}}}$$ and $${{{{{{{\mathcal{D}}}}}}}}$$.

#### Determining $${L}_{\max }$$ and $${L}_{\min }$$

In general, as the value of $${L}_{\max }$$ increases, so do the numbers of unique and doubly-unique substrings to be considered by CAMMiQ, potentially increasing its sensitivity. However, query type $${{{{{{{{\mathcal{A}}}}}}}}}_{3}$$ relies on the read coverage for each unique and doubly-unique substring of each genome; the higher the coverage the better. The read coverage for a unique substring of length L − L/Δ, for some constant Δ > 1, would roughly be 1/Δ-th of the read coverage (of a single nucleotide) of the respective genome. The best tradeoff between these two objectives, i.e., substring length, ~ (1 − 1/Δ) and coverage, ~ 1/Δ, can be achieved by maximizing their product, i.e., (1 − 1/Δ)/Δ, which is achieved at Δ = 2. This suggests choosing $${L}_{\max }=L/2$$.
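The maximizer of the product can be verified with elementary calculus. Writing the product as a function of Δ (the symbol f is introduced here only for this check):

$$f(\Delta )=\frac{1}{\Delta }\left(1-\frac{1}{\Delta }\right)=\frac{1}{\Delta }-\frac{1}{{\Delta }^{2}},\qquad f^{\prime} (\Delta )=-\frac{1}{{\Delta }^{2}}+\frac{2}{{\Delta }^{3}}=\frac{2-\Delta }{{\Delta }^{3}}$$

The derivative is positive for Δ < 2 and negative for Δ > 2, so the product attains its maximum value f(2) = 1/4 at Δ = 2.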

A shortest unique substring u, by definition, differs from (at least) one other substring $$u^{\prime}$$ by just one nucleotide. The shorter u gets, the more likely a read error impacting $$u^{\prime}$$ would modify it to u, leading to false positives. We have experimentally observed that unique substrings of length ≤25 could lead to false positives that impact the performance of CAMMiQ; as a consequence, we set the default value of $${L}_{\min }$$ to 26.

#### Sparsifying unique substrings

Let $${{{{{{{{\mathcal{U}}}}}}}}}_{i}$$ be the collection of all unique substrings on genome si. To reduce the index size, CAMMiQ aims to compute a subset $${{{{{{{\mathcal{U}}}}_i}}}}^{\prime}$$ of $${{{{{{{{\mathcal{U}}}}}}}}}_{i}$$, consisting of the minimum number of shortest unique substrings such that every unique substring of length L (i.e., unique L-mer) on si includes one substring from $${{{{{{{\mathcal{U}}}}_i}}}}^{\prime}$$. Independently, CAMMiQ also aims to compute a subset $${{{{{{{\mathcal{D}}}}_i}}}}^{\prime}$$ of $${{{{{{{{\mathcal{D}}}}}}}}}_{i}$$, consisting of the minimum number of shortest doubly-unique substrings such that every doubly-unique substring of length L (i.e., doubly-unique L-mer) on si includes one substring from $${{{{{{{\mathcal{D}}}}_i}}}}^{\prime}$$. This is all done by greedily maintaining only the rightmost shortest unique or doubly-unique substring in a sliding window of length L on a genome in $${{{{{{{\mathcal{S}}}}}}}}$$. In the remainder of the paper, we denote the number of unique substrings in subset $${{{{{{{\mathcal{U}}}}_i}}}}^{\prime}$$ by nui ($$=|{{{{{{{\mathcal{U}}}}_i}}}}^{\prime}|$$) and the number of doubly-unique substrings in subset $${{{{{{{\mathcal{D}}}}_i}}}}^{\prime}$$ by ndi ($$=|{{{{{{{\mathcal{D}}}}_i}}}}^{\prime}|$$); we denote the number of unique L-mers on si by $$n{u}_{i}^{L}$$ and respectively the number of doubly-unique L-mers on si by $$n{d}_{i}^{L}$$. As we prove in Supplementary Note 2, the greedy strategy we employ can indeed obtain the minimum number of shortest unique substrings to cover each unique L-mer, provided that each substring in $${{{{{{{{\mathcal{U}}}}}}}}}_{i}$$ occurs only once in si.
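One way to realize this greedy rule is sketched below. It is an illustration under stated assumptions rather than the CAMMiQ implementation: the function name is hypothetical, substrings are given as half-open `(start, end)` positions sorted by start, none contains another and each is at most `read_len` long, and the genome is at least `read_len` long.

```python
def sparsify(substrings: list[tuple[int, int]], read_len: int,
             genome_len: int) -> list[tuple[int, int]]:
    """Keep a minimum subset of substrings such that every length-`read_len`
    window that contains some input substring contains a kept one."""
    kept = []
    covered = -1                          # last window start already covered
    last_window = genome_len - read_len   # largest valid window start
    k = 0
    while k < len(substrings):
        start, end = substrings[k]
        # Substring k lies in windows starting at end-read_len .. start.
        if min(start, last_window) <= covered:
            k += 1                        # all of its windows are covered
            continue
        # Leftmost uncovered window that contains substring k.
        p = max(covered + 1, end - read_len, 0)
        # Greedy choice: the rightmost substring inside window [p, p+read_len).
        j = k
        while j + 1 < len(substrings) and substrings[j + 1][1] <= p + read_len:
            j += 1
        kept.append(substrings[j])
        covered = min(substrings[j][0], last_window)
        k = j + 1
    return kept
```

With `read_len = 10` on a genome of length 30, the three overlapping substrings near the start of the genome collapse into one representative, while the two isolated substrings are retained.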

#### Index organization

We demonstrate the index structure and query processing for the set of unique substrings $${{{{{{{\mathcal{U}}}}}}}}$$; the processing for doubly-unique substrings is essentially identical to that for unique substrings. Let $$h=\mathop{\min }\limits_{{u}_{i}\in {{{{{{{\mathcal{U}}}}}}}}}|{u}_{i}|$$ be the minimum length of all shortest unique substrings (h is automatically set to $${L}_{\min }$$ if the minimum length constraint is imposed). CAMMiQ maintains a hash table that maps a distinct h-mer w to a bucket containing all unique substrings ui that have w as a prefix. Within each bucket, the remaining suffixes of all unique substrings ui, i.e., ui[h + 1: |ui|], are maintained in a trie (rooted at ui[1: h]) so that (i) each internal node represents a single character; and (ii) each leaf represents the corresponding genome ID. For each read rj in the query, CAMMiQ considers each substring of length h and its reverse complement and computes its hash value in time linear with L through Karp–Rabin fingerprinting63. If the substring has a match in the hash table, then CAMMiQ tries to extend the match until a matching unique substring is found, or until an extension by one character leads to no match. See Fig. 3 for a schematic of the index structure. See Subsection Query processing stage 1: Preprocessing the Reads below for the use of unique and doubly-unique substrings identified for each read to answer the query.
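The hash-table-plus-trie organization can be sketched in a few lines. This is a simplified illustration, not the CAMMiQ data structure: the names `build_index`, `find_matches` and `LEAF` are hypothetical, Python's built-in dictionary hashing replaces Karp–Rabin fingerprinting, and reverse complements are not searched.

```python
LEAF = "#"  # hypothetical key marking the end of an indexed substring


def build_index(substrings: dict[str, int], h: int) -> dict:
    """Map each h-mer prefix to a trie (nested dicts) over the remaining
    suffixes; the leaf stores the genome ID. Every substring has length >= h."""
    index: dict = {}
    for sub, genome_id in substrings.items():
        node = index.setdefault(sub[:h], {})
        for ch in sub[h:]:
            node = node.setdefault(ch, {})
        node[LEAF] = genome_id
    return index


def find_matches(read: str, index: dict, h: int) -> list[tuple[int, str, int]]:
    """Return (position, substring, genome ID) for indexed substrings in read."""
    hits = []
    for pos in range(len(read) - h + 1):
        node = index.get(read[pos:pos + h])   # hash-table lookup of the h-mer
        end = pos + h
        while node is not None:
            if LEAF in node:                  # a complete indexed substring
                hits.append((pos, read[pos:end], node[LEAF]))
                break
            if end >= len(read):              # read exhausted before a leaf
                break
            node = node.get(read[end])        # extend the match by one character
            end += 1
    return hits
```

Stopping at the first leaf is sufficient here because no shortest unique substring is a prefix of another.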

### Query processing stage 1: preprocessing the reads

Given the index structure on the sparsified set of shortest unique and doubly-unique substrings of genomes in $${{{{{{{\mathcal{S}}}}}}}}$$, we handle each query $${{{{{{{\mathcal{Q}}}}}}}}$$ in two stages. The first stage counts the number of reads that include each unique and doubly-unique substring with the following provision. We call two or more (unique or doubly-unique) substrings in a read “conflict-free” if there is at least one genome that includes all of these substrings. See Supplementary Note 3 for a detailed discussion on conflicting substrings; the conflicts arise due to either sequencing errors or the query including genomes that are not in the database and thus should be avoided. Reads that include more than one unique or doubly-unique substring that is conflict-free contribute to the counting process; all other reads are discarded.

We denote by c(ui), the counter for the conflict-free reads that include the unique substring ui and by c(di) that for the doubly-unique substring di. These counters are sufficient to compute the set $${{{{{{{{\mathcal{A}}}}}}}}}_{1}$$ as well as $${{{{{{{{\mathcal{A}}}}}}}}}_{3}$$, the answer to our most general query type. For computing $${{{{{{{{\mathcal{A}}}}}}}}}_{2}$$, CAMMiQ additionally maintains a counter $$d({s}_{k},{s}_{k^{\prime} })$$ for each pair of genomes $${s}_{k},{s}_{k^{\prime} }$$, indicating the number of reads in $${{{{{{{\mathcal{Q}}}}}}}}$$ that can originate both from sk and $${s}_{k^{\prime} }$$ (i.e., the case (e - iii) in the procedure described in Supplementary Note 3).

The first stage thus produces two count vectors $${{{{{{{{\bf{c}}}}}}}}}_{i}^{u}=(c({u}_{i,1}),\cdots \,,c({u}_{i,n{u}_{i}}))$$ and $${{{{{{{{\bf{c}}}}}}}}}_{i}^{d}=(c({d}_{i,1}),\cdots \,,c({d}_{i,n{d}_{i}}))$$ that indicate the number of (conflict-free) reads that include each unique and doubly-unique substring on each genome si. Using these vectors, CAMMiQ answers the first type of query by computing $${{{{{{{{\mathcal{A}}}}}}}}}_{1}=\{{s}_{i}:\mathop{\sum }\nolimits_{l=1}^{n{u}_{i}}c({u}_{i,l}) > 0\}$$. Additionally, through the use of the counters $$d({s}_{k},{s}_{k^{\prime} })$$, CAMMiQ answers the second type of query by computing $${{{{{{{{\mathcal{A}}}}}}}}}_{2}=\arg \min|{{{{{{{\mathcal{A}}}}}}}}^{\prime} \subset {{{{{{{\mathcal{S}}}}}}}}|$$ such that (i) $${s}_{i}\in {{{{{{{\mathcal{A}}}}}}}}^{\prime}$$ if $$\mathop{\sum }\nolimits_{l=1}^{n{u}_{i}}c({u}_{i,l}) > 0$$ and (ii) $$\exists {s}_{i}\in {{{{{{{\mathcal{A}}}}}}}}^{\prime}$$, if $$d({s}_{k},{s}_{k^{\prime} }) > 0$$ then either i = k or $$i=k^{\prime}$$. This is basically the solution to the hitting set problem we mentioned earlier, whose formulation as an integer linear program (ILP) is well known64. The genomes returned in $${{{{{{{{\mathcal{A}}}}}}}}}_{1}$$ are ranked in decreasing order by the aggregated counter values on unique substrings (i.e., $$|{{{{{{{{\bf{c}}}}}}}}}_{i}^{u}|$$); and the genomes returned in $${{{{{{{{\mathcal{A}}}}}}}}}_{2}$$ are ranked by the aggregated counter values on unique substrings plus the counter values on doubly-unique substrings (i.e., $$|{{{{{{{{\bf{c}}}}}}}}}_{i}^{u} |+|{{{{{{{{\bf{c}}}}}}}}}_{i}^{d}|$$).
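The two definitions can be illustrated on toy counters. The sketch below is hypothetical in its names and data layout, and it replaces the ILP with an exhaustive search over subsets, which is exponential in the number of candidate genomes and therefore only suitable for very small examples.

```python
from itertools import combinations


def query_a1(unique_counts: dict[int, list[int]]) -> set[int]:
    """Genomes with at least one read hitting one of their unique substrings.
    `unique_counts` maps a genome ID to its count vector on unique substrings."""
    return {g for g, counts in unique_counts.items() if sum(counts) > 0}


def query_a2(unique_counts: dict[int, list[int]],
             pair_counts: dict[tuple[int, int], int]) -> set[int]:
    """Smallest genome set that contains A1 and hits every genome pair
    (k, k') whose shared-read counter is positive."""
    required = query_a1(unique_counts)
    pairs = [frozenset(p) for p, c in pair_counts.items() if c > 0]
    # Pairs already hit by a required genome impose no further constraint.
    pairs = [p for p in pairs if not (p & required)]
    candidates = sorted(set().union(*pairs)) if pairs else []
    # Try subsets in order of increasing size; the first feasible one is minimum.
    for size in range(len(candidates) + 1):
        for extra in combinations(candidates, size):
            chosen = set(extra)
            if all(p & chosen for p in pairs):
                return required | chosen
    return required | set(candidates)  # unreachable: the full set always hits
```

In a toy query where only genome 0 has unique hits and reads are shared by the pairs (0, 1), (1, 2) and (2, 3), the answer is {0, 2}: genome 0 is required and already explains the first pair, and genome 2 alone explains the other two.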

From this point on, our main focus will be how CAMMiQ answers the third type of query by computing $${{{{{{{{\mathcal{A}}}}}}}}}_{3}$$ through an ILP formulation described below.

### Query processing stage 2: ILP formulation

In its second stage, CAMMiQ computes the list of genomes in the query as well as their abundances through an ILP. Let δi ∈ {0, 1} be the indicator for the absence or presence of the genome si in $$\mathcal{Q}$$. The ILP formulation assigns a value to each δi and also computes for each si its abundance pi, upper bounded by $$p_{\max}$$, a user-defined maximum abundance with a default setting of 100, which is introduced to avoid potential anomalies due to sequence contamination.

$$\mathbf{Minimize}\quad \sum_{i}\left(\frac{1}{n{u}_{i}}\sum_{l=1}^{n{u}_{i}}|c(u_{i,l})-e(u_{i,l})|+\frac{1}{n{d}_{i}}\sum_{l=1}^{n{d}_{i}}|c(d_{i,l})-e(d_{i,l})|\right)$$
$$\mathbf{s.t.}\quad e(u_{i,l})=(L-|u_{i,l}|+1)\cdot p_{i}\cdot \frac{1}{L}\cdot (1-\hat{\mathrm{err}})^{|u_{i,l}|}\quad \forall i,\,l\ \ \mathrm{s.t.}\ 1\le l\le n{u}_{i} \tag{1}$$
$$e(d_{i,l})=(L-|d_{i,l}|+1)\cdot (p_{i}+p_{j})\cdot \frac{1}{L}\cdot (1-\hat{\mathrm{err}})^{|d_{i,l}|}\quad \forall i,\,l\ \ \mathrm{s.t.}\ 1\le l\le n{d}_{i} \tag{2}$$
$$p_{i}\le \delta_{i}\cdot p_{\max}\quad \forall i \tag{3}$$
$$\delta_{i}=0\quad \forall i\ \ \mathrm{s.t.}\ s_{i}\in M(\mathcal{Q}) \tag{4}$$
$$p_{i}\ge \delta_{i}\cdot \min \left\{L\sum_{l=1}^{n{u}_{i}}c(u_{i,l})\cdot \frac{1}{n{u}_{i}^{L}},\ L\sum_{l=1}^{n{d}_{i}}c(d_{i,l})\cdot \frac{1}{n{d}_{i}^{L}}\right\}\cdot (1-\epsilon )\quad \forall i\ \ \mathrm{s.t.}\ s_{i}\notin M(\mathcal{Q}) \tag{5}$$
$$\sum_{i}|s_{i}|\cdot p_{i}\le n\cdot L \tag{6}$$

The objective of the ILP is to minimize the sum of absolute differences between the expected and the actual number of reads to cover a unique or doubly-unique substring. Since each genome may have different numbers of unique and doubly-unique substrings, the sums of differences are normalized w.r.t. nui or ndi.

Constraint (1) defines the expected number of reads to cover a particular unique substring ui,l, given abundance pi of the corresponding genome si. Similarly, constraint (2) defines the expected number of reads to cover a particular doubly-unique substring di,l; in this constraint, pi and pj denote the respective abundances of the two genomes si and sj that include (the doubly-unique substring) di,l. Specifically, the expected coverage of ui,l is $$\frac{L-|u_{i,l}|+1}{L}\cdot p_{i}$$ and the expected coverage of di,l is $$\frac{L-|d_{i,l}|+1}{L}\cdot (p_{i}+p_{j})$$, provided that the coverage is uniform across a given genome and there are no read errors. To account for read errors, we scale these coverage estimates by $$(1-\hat{\mathrm{err}})^{|u_{i,l}|}$$ and $$(1-\hat{\mathrm{err}})^{|d_{i,l}|}$$, respectively; these values represent the probability that a substring ui,l or di,l is error free within a read that has been subject to uniform i.i.d. substitution errors. Here $$\hat{\mathrm{err}}$$ denotes the estimated substitution error rate per nucleotide, and $$|w|$$ denotes the length of a substring w. The CAMMiQ formulation also allows updates to the expected coverage according to any given unique or doubly-unique substring’s sequence composition (e.g., GC content) to address sequencing biases.
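The expected counts in constraints (1) and (2) can be evaluated directly. The short Python sketch below does so; the function and parameter names are illustrative and not part of CAMMiQ, and it assumes the substring is no longer than the read. For a doubly-unique substring, the `abundance` argument would be the sum pi + pj.

```python
def expected_reads(read_len, sub_len, abundance, err_rate=0.0):
    """Expected number of reads covering a substring of length sub_len.

    Implements (L - |w| + 1) * p * (1 / L) * (1 - err)^|w|, where
    read_len is L, sub_len is |w|, abundance is p (or p_i + p_j for a
    doubly-unique substring) and err_rate is the estimated per-base
    substitution error rate. Assumes sub_len <= read_len.
    """
    positions = read_len - sub_len + 1          # read offsets that contain w
    error_free = (1.0 - err_rate) ** sub_len    # P(no substitution inside w)
    return positions / read_len * abundance * error_free


if __name__ == "__main__":
    # Error-free case, then the same substring with a 1% error rate.
    print(expected_reads(100, 51, 10.0))
    print(expected_reads(100, 51, 10.0, err_rate=0.01))
```

With L = 100, a substring of length 51 and abundance 10, half of the read offsets contain the substring, so the error-free expectation is 5 reads; any positive error rate lowers it, and longer substrings are penalized more.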

Constraint (3) ensures that the abundance pi of a genome is 0 if δi = 0. Constraint (4) ensures that the solution to the above ILP excludes those genomes whose counters for unique and doubly-unique substrings add up to a value below a threshold, so as to reduce the size of the solution space. More specifically, given a threshold value α (introduced to avoid potential false positives due to read errors and genomes that are not in the database; its default value is 0.0001), the constraint excludes each genome si in the set $$M(\mathcal{Q})$$, that is, each genome whose counters for unique substrings add up to a value below $$\alpha \cdot n{u}_{i}^{L}$$ and whose counters for doubly-unique substrings add up to a value below $$\alpha \cdot n{d}_{i}^{L}$$. Formally, $$M(\mathcal{Q})=\{s_{i}\in \mathcal{S}\,|\,\sum\nolimits_{l=1}^{n{u}_{i}}c(u_{i,l}) < \alpha \cdot n{u}_{i}^{L}\}\cap \{s_{i}\in \mathcal{S}\,|\,\sum\nolimits_{l=1}^{n{d}_{i}}c(d_{i,l}) < \alpha \cdot n{d}_{i}^{L}\}$$. Constraint (5) enforces a lower bound on the coverage of each genome si in the solution to the above ILP (namely, with δi = 1): it must be at least (1 − ϵ) times the smaller of the two coverage estimates $$L\cdot \sum\nolimits_{l=1}^{n{u}_{i}}c(u_{i,l})\cdot \frac{1}{n{u}_{i}^{L}}$$ and $$L\cdot \sum\nolimits_{l=1}^{n{d}_{i}}c(d_{i,l})\cdot \frac{1}{n{d}_{i}^{L}}$$, which result from the number of reads in $$\mathcal{Q}$$ that include a unique and a doubly-unique substring, respectively, for a user-defined ϵ. Constraint (6) enforces an upper bound on the coverage of each genome si in the solution to the above ILP, by requiring that the sum over all si of the number of reads produced on si based on pi not exceed the total number of reads n.
Collectively, the last two constraints ensure that the abundance pi computed from the ILP matches the coverage implied by the read counts in $$\mathcal{Q}$$. As written above, the formulation does not strictly conform to the rules for ILPs because of the use of the absolute value function. We use a standard technique to replace the absolute values in the objective by introducing a new variable $$\gamma (u_{i,l})\ge \max \{c(u_{i,l})-e(u_{i,l}),\, e(u_{i,l})-c(u_{i,l})\}$$ (and analogously for each doubly-unique substring di,l).
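To make the roles of constraints (4) and (5) and of the objective concrete, the following Python sketch computes the excluded set $$M(\mathcal{Q})$$, the lower bound from constraint (5), and one genome's contribution to the objective for a fixed set of expected counts. It is an illustration under assumptions, not the solver itself: all names and the dictionary layout are hypothetical, the quantities $$n{u}_{i}^{L}$$ and $$n{d}_{i}^{L}$$ are taken as given positive inputs, and every genome is assumed to have at least one unique and one doubly-unique substring.

```python
def excluded_genomes(sum_u, sum_d, nu_L, nd_L, alpha=1e-4):
    """Return M(Q): genomes whose unique AND doubly-unique counter
    totals both fall below alpha times nu_i^L and nd_i^L respectively.

    sum_u / sum_d: genome id -> summed counters on unique /
    doubly-unique substrings. nu_L / nd_L: genome id -> nu_i^L / nd_i^L.
    """
    return {
        g for g in sum_u
        if sum_u[g] < alpha * nu_L[g] and sum_d[g] < alpha * nd_L[g]
    }


def abundance_lower_bound(read_len, sum_u, sum_d, nu_L, nd_L, eps):
    """Right-hand side of constraint (5) for one genome with delta_i = 1."""
    cov_u = read_len * sum_u / nu_L   # coverage implied by unique substrings
    cov_d = read_len * sum_d / nd_L   # coverage implied by doubly-unique ones
    return min(cov_u, cov_d) * (1.0 - eps)


def genome_objective(c_u, e_u, c_d, e_d):
    """One genome's term of the objective: mean absolute deviation between
    observed (c) and expected (e) counts, for unique plus doubly-unique
    substrings. Lists must be non-empty and pairwise equal in length."""
    term_u = sum(abs(c - e) for c, e in zip(c_u, e_u)) / len(c_u)
    term_d = sum(abs(c - e) for c, e in zip(c_d, e_d)) / len(c_d)
    return term_u + term_d


if __name__ == "__main__":
    sum_u = {"s1": 50, "s2": 0}
    sum_d = {"s1": 20, "s2": 1}
    nu_L = {"s1": 1000, "s2": 100000}
    nd_L = {"s1": 500, "s2": 100000}
    print(excluded_genomes(sum_u, sum_d, nu_L, nd_L))
    print(abundance_lower_bound(100, 50, 20, 1000, 500, eps=0.1))
    print(genome_objective([3, 5], [4.0, 4.0], [2], [2.5]))
```

In an actual ILP, the absolute deviations inside `genome_objective` would be replaced by the auxiliary γ variables, which equal the absolute deviation at any optimal solution because the objective minimizes them.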

### When to use unique substrings—the error free case

We now provide a set of sufficient conditions to guarantee the approximate performance that can be obtained with high probability in metagenomic identification and quantification by the use of unique substrings only. These conditions apply to CAMMiQ when c = 1, as well as CLARK, KrakenUniq, and other similar approaches. In case these conditions are not met, it is advisable to use CAMMiQ with c ≥ 2.

Suppose that we are given a query $$\mathcal{Q}$$ composed of n error-free reads of length L, sampled independently and uniformly at random from a collection of genomes $$\mathcal{A}=\{s_{1},\cdots ,s_{a}\}$$ according to their abundances p1, …, pa. More specifically, suppose that our goal is to answer query $$\mathcal{Q}$$ by computing $$\mathcal{A}_{1}$$, along with an estimate for the abundance value pi for each $$s_{i}\in \mathcal{A}_{1}$$, calculated as the weighted number of reads assigned to si according to the procedure described in Section Query processing stage 1: Preprocessing the Reads. Then, the L1 distance between the true abundance values and this estimate will not exceed a value determined by n (number of reads), a, and $$q_{\min}$$, the minimum normalized proportion of unique L-mers among these genomes. For a given failure probability ζ and an upper bound on L1 distance ϵ, this translates into sufficient conditions on the values of n, a and $$q_{\min}$$ to ensure acceptable performance by the computational method in use.

### Theorem 1

Let $$\mathcal{Q}=\{r_{1},\cdots ,r_{n}\}$$ be a set of n error-free reads of length L, each sampled independently and uniformly at random from all positions on a genome $$s_{i}\in \mathcal{A}=\{s_{1},\cdots ,s_{a}\}$$, where s1, …, sa are distributed according to their abundances p1, …, pa > 0. Let $$p_{i}^{\prime}=\frac{p_{i}\cdot n_{i}^{L}}{\sum\nolimits_{i^{\prime}=1}^{a}p_{i^{\prime}}\cdot n_{i^{\prime}}^{L}}$$ be the corresponding “unnormalized” abundance of pi for i = 1, …, a, where $$n_{i}^{L}$$ denotes the total number of L-mers on si. Let q1, …, qa > 0 be the proportion of unique L-mers on s1, …, sa respectively; $$p_{\min}=\min \{p_{i}\}_{i=1}^{a}$$; $$q_{\min}=\min \{q_{i}\}_{i=1}^{a}$$. Then,

• (i) With probability at least 1 − ζ, each si can be identified through querying $$\mathcal{Q}$$ if $$n\ge \frac{2(a+1)+\ln (1/\zeta )}{(p_{\min}q_{\min})^{2}}$$.

• (ii) With probability at least 1 − ζ, the L1 distance between the predicted abundances $$\hat{p}_{1},\cdots ,\hat{p}_{a}$$, obtained by setting $$\hat{p}_{i}=\frac{c_{i}/q_{i}}{n}$$, and the true (unnormalized) abundances $$p_{1}^{\prime},\cdots ,p_{a}^{\prime}$$ is at most ϵ if $$n\ge \frac{2(a+1)+\ln (1/\zeta )}{(\epsilon q_{\min})^{2}}$$.

• (iii) Given n such reads in a query, with probability at least 1 − ζ, the L1 distance between the predicted abundances $$\hat{p}_{1},\cdots ,\hat{p}_{a}$$, obtained by setting $$\hat{p}_{i}=\frac{c_{i}/q_{i}}{n}$$, and the true (unnormalized) abundances $$p_{1}^{\prime},\cdots ,p_{a}^{\prime}$$ is bounded by $$\sqrt{\frac{2[\ln (1/\zeta )+(a+1)]}{nq_{\min}^{2}}}$$.

Here, ci denotes the number of reads assigned to si.

See Supplementary Note 4 for a proof.
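The bounds in Theorem 1 are simple to evaluate numerically when deciding whether unique substrings alone are likely to suffice for a given sequencing depth. The Python sketch below transcribes the three expressions exactly as stated; the function names are illustrative only, and rounding the read counts up to an integer is a convenience of this sketch.

```python
import math


def min_reads_to_identify(a, p_min, q_min, zeta):
    """Part (i): reads sufficient to identify every genome with
    probability at least 1 - zeta."""
    return math.ceil((2 * (a + 1) + math.log(1 / zeta)) / (p_min * q_min) ** 2)


def min_reads_for_l1(a, q_min, eps, zeta):
    """Part (ii): reads sufficient for an L1 error of at most eps with
    probability at least 1 - zeta."""
    return math.ceil((2 * (a + 1) + math.log(1 / zeta)) / (eps * q_min) ** 2)


def l1_error_bound(n, a, q_min, zeta):
    """Part (iii): L1 error bound that holds with probability at least
    1 - zeta for a query of n reads."""
    return math.sqrt(2 * (math.log(1 / zeta) + (a + 1)) / (n * q_min ** 2))


if __name__ == "__main__":
    # Hypothetical setting: 10 genomes, rarest at 5% abundance,
    # at least half of all L-mers unique, 1% failure probability.
    print(min_reads_to_identify(a=10, p_min=0.05, q_min=0.5, zeta=0.01))
    print(min_reads_for_l1(a=10, q_min=0.5, eps=0.05, zeta=0.01))
    print(l1_error_bound(n=1_000_000, a=10, q_min=0.5, zeta=0.01))
```

Because the required read count scales with the inverse square of $$q_{\min}$$, a database in which some genome has very few unique L-mers drives the requirement up quickly, which is the situation in which the text recommends using CAMMiQ with c ≥ 2.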

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.