MetaMaps – Strain-level metagenomic assignment and compositional estimation for long reads

Metagenomic sequence classification should be fast, accurate and information-rich. Emerging long-read sequencing technologies promise to improve the balance between these factors but most existing methods were designed for short reads. MetaMaps is a new method, specifically developed for long reads, that combines the accuracy of slower alignment-based methods with the scalability of faster k-mer-based methods. Using an approximate mapping algorithm, it is capable of mapping a long-read metagenome to a comprehensive RefSeq database with >12,000 genomes in <30 GB or RAM on a laptop computer. Integrating these mappings with a probabilistic scoring scheme and EM-based estimation of sample composition, MetaMaps achieves >95% accuracy for species-level read assignment and r2 > 0.98 for the estimation of sample composition on both simulated and real data. Uniquely, MetaMaps outputs mapping locations and qualities for all classified reads, enabling functional studies (e.g. gene presence/absence) and the detection of novel species not present in the current database. Availability and Implementation MetaMaps is implemented in C++/Perl and freely available from https://github.com/DiltheyLab/MetaMaps (GPL v3).


28
Metagenomics, the study of microbial communities with the methods of genomics, has become an 29 important tool for microbiology [1]. One key step in metagenomics is to determine the source genomes 30 that a metagenomic sequencing dataset is derived from. This can be done either at the level of individual 31 reads (read assignment or binning) or at the level of the complete dataset (compositional analysis and Pathoscope [13,14]. There are also approaches based on linear models or linear mixed models, for 38 example PhyloPythia [15,16], DiTASiC [17] and MetaPalette [18]; and methods that combine Markov 39 models with alignment, for example Phymm/PhymmBL [19,20]. The large majority of these methods 40 have been designed for the analysis of short-read data; a small number of long-read-specific methods 41 exist, but they are limited to specific data types (PacBio CCS [21]) or protein analysis (MEGAN-LR 42 [22]). 43 The dominance of short-read sequencing in the field of metagenomics has traditionally been driven by 44 cost efficiency. However, long-read sequencing (defined here as reads >2,000 bases) has recently become 45 more cost-effective and has two intrinsic advantages over short-read sequencing for the interrogation of 46 metagenomes. First, long reads preserve more long-range genomic information such as operon structures 47 and gene-genome associations. The availability of this information can be key to functional and 48 evolutionary studies, concerning, for example, the organization of metabolic pathways and horizontal 49 gene transfer across metagenomes. Second, some long-read sequencers (the Oxford Nanopore MinION in 50 particular) support rapid, portable and robust sequencing workflows, enabling "in-field" metagenomics. 51 This is expanding the types of applications and scenarios that DNA sequencing and metagenomics can be 52 applied to, such as the in-situ characterization of soil metagenomes in remote locations [23] or real-time 53 pathogen sequencing during outbreaks [24]. For these reasons, the applicability and importance of long-54 read sequencing to metagenomics is growing rapidly. 55 This development of sequencing technology, however, has not yet been matched with the development of 56 long-read-specific metagenomic analysis algorithms. Whereas tools that were designed for short reads can 57 usually be applied to long-read sequencing datasets in principle, they often do not fully capitalize on the 58 specific properties of the data. In short-read metagenomics, there are pronounced trade-offs between 59 speed, accuracy and information richness. For example, methods like Kraken are very fast, but they do 60 not attempt to determine the genomic positions of individual reads; alignment-based methods, on the 61 other hand, can determine the genomic locations and alignment qualities of individual reads, but they are 62 typically slow. 63 In the space of long-read metagenomics, desirable algorithms are both fast (to deal with large data 64 volumes of incoming sequencing data on acceptable time scales, e.g. in the field) and produce highly 65 informative output that includes per-read positional and quality information (because the availability of 66 long-range spatial information is one of the key advantages of long-read sequencing). Here we show that 67 this is indeed possible by leveraging the specific properties of long reads in a new approach called 68 MetaMaps. 69 MetaMaps implements a two-stage analysis procedure. First, a list of possible mapping locations for each 70 long read is generated using a minimizer-based approximate mapping strategy [25]. Second, each 71 mapping location is scored probabilistically using a model developed here, and total sample composition 72 is estimated using the EM algorithm. This step also enables the disambiguation of alternative read 73 mapping locations. 74 Our approach has three main advantages. First, utilizing a mapping approach enables MetaMaps to 75 determine individual read mapping locations, estimated alignment identities, and mapping qualities. 76 These can be used, for example, to determine the presence of individual genes, or to assess the evidence 77 for the presence of novel strains or species (which will exhibit systematically decreased alignment 78 identities). Second, our approach is robust against the presence of large "contaminant" genomes, 79 introduced during sample collection and processing or part of the environmental DNA, which often lead 80 to false-positive classifications in methods that rely purely on individual k-mers. Third, reliance on 81 approximate mapping makes the algorithm much faster than alignment-based methods, and our mapping 82 algorithm can be tuned to different read lengths and qualities. 83 MetaMaps is also well-equipped to handle the continuous growth of reference database size [26]. First, 84 MetaMaps implements a "limited memory" mode that, while leading to slightly increased runtimes, 85 reduces memory usage while maintaining the same level of accuracy. This enables, for example, complete 86 mapping of a long-read metagenomic sample to a comprehensive NCBI RefSeq database on a laptop 87 computer. Second, by using the EM algorithm for borrowing information across reads [12,14], 88 MetaMaps can distinguish between closely related database genomes, a challenge that becomes more 89 common as reference databases grow. The source package also includes support for Krona [27] and a set 90 of lightweight R scripts for quick visualization of the sample-to-database mapping results. 91

92
Strain-level metagenomics assignment 93 "Strain-level" accuracy is defined as source-genome resolution; that is, a read or a compositional estimate 94 is counted as correct at strain-level resolution if and only if it is assigned to its correct genome of origin 95 (known a priori for simulated data and determined via alignment for real data; see Methods). There are 96 some instances in which a reference genome is directly attached to a species node in the NCBI taxonomy 97 (typically when there is only one reference genome for the species); for these genomes, strain-and 98 species-level resolution are synonymous. 99 Reference database and strain-specific taxonomy 100 A comprehensive reference database, comprising 12,058 complete RefSeq genomes and 25 gigabases of 101 sequence, is used for all experiments presented here. It includes 215 archaeal, 5,774 bacterial, 6,059 102 viral/viroidal, and 10 eukaryotic genomes, 7 of which are fungal and one of which is the human genome. 103 The database also includes an extended, strain-specific version of the NCBI taxonomy, in which each 104 input genome maps to exactly one node (see Supplementary Figure S1). MetaMaps supports the use of 105 custom databases, and scripts for downloading and processing genomes from NCBI are part of the 106 distribution. 107 Initial read mapping and identity estimation 108 MetaMaps employs a fast approximate long-read mapping algorithm to generate an initial set of read 109 mappings, fully described elsewhere [25] and visualized in Supplementary Figure S2. Briefly, a 110 minimizer [28] index is constructed from the reference database; intersections between the minimizer sets 111 selected from a sequencing read and the reference define an initial set of candidate mapping locations. 112 Low-identity candidate mapping locations are eliminated using a fast linear-time algorithm. For all 113 surviving candidate mapping locations of a read , alignment identity is estimated using a winnowed-114 minhash statistical model [25]. Briefly, let be the MinHash sketch [29] of the read-selected 115 minimizers; let be the MinHash sketch the set of reference-selected minimizers for mapping location 116 1 ≤ ≤ ; let ∪ be the MinHash sketch of the minimizer set union. We define ̂≔ | ∪ ∩ ∩ | | ∪ | as an 117 estimate of the Jaccard similarity between the k-mer sets of the mapping location and the read, and 118 further ( 21 +̂) 1 as an estimator of the corresponding alignment identity for k-mer size . Minimizer 119 density is auto-tuned [25] based on a user-defined minimum read length and minimum mapping identity 120 (by default and for all experiments presented here: 2,000 bases and 80% identity). 121

122
Using the MinHash sketch, we define a probabilistic mapping quality model to quantify mapping 123 uncertainty for reads with multiple mapping locations. Under the assumptions of the model and 124 conditional on a known sequencing error rate , we model P(| ∪ ∩ ∩ |) as binomial with 125 parameters = | ∪ | and k-mer survival rate = (1 − ) . The posterior probability (mapping quality) 126 of mapping location for read is defined as . is unknown and read-127 specific; for simplicity we define (1 − ) as the estimated identity of the highest-scoring mapping for 128 each read. 129

130
Let be the set of genomes in the database and let be the (unknown) probability that a sequencing 131 read in the sample emanates from database genome ∈ . is a vector that describes the sample 132 composition and is to be estimated.
We define the likelihood of the mapped read set as L C ( ; ) ≔ ∏ L C ( ; ) ∈ and the likelihood of an 134 individual read as 135 where is a read mapping location. To account for genomic duplications, map(r,g) is the set of all 137 mapping locations of read in genome , ( ) is the posterior probability of mapping location , and 138 ; is the count of effective start positions for read in genome . ; is defined as contig length minus 139 read length, summed over all contigs of genome . This implies a uniform distribution over possible 140 within-genome start positions of reads; for simplicity, we don't distinguish between circular and non-141 circular contigs. 142 We define the composition-dependent posterior probability of mapping location as 143 where g( ) is the genome associated with mapping location . Summing over all reads, we obtain an 145 updated sample composition estimate as 146 The frequency vector is initialized with 1 | | for each element and we update until convergence of the 148 likelihood L C ( ; ). Unmapped reads exceeding the minimum read length are assigned to the 149 'unassigned' category (special taxon ID 0), followed by a final renormalization. Each mapped read is 150 individually assigned to its maximum likelihood genome location. 151

152
To enable classification against large reference databases with limited resources, MetaMaps supports the 153 specification of a maximum memory target amount ("memory-efficient mapping"). When run in memory-154 efficient mode, the order of contigs in the reference database is randomized and processed in a sequential 155 manner. Starting from the first contig, index construction is started and continues until internally 156 estimated memory consumption is just below the user-specified target amount or until the end of the 157 reference database has been reached. The input data are then mapped against this index representing a 158 subset of the reference database and stored on disk. The index is cleared, and construction of a new index 159 begins at the position at which the process was previously aborted. Suboptimal mapping locations will 160 later be assigned low mapping qualities during the EM step. 161 Evaluation metrics: Read assignment accuracy is assessed by PPV (the proportion of correct read 200 assignments) and recall (the proportion of reads having been given a correct assignment). Specifically, 201

Kraken and Bracken
there is a true taxonomic assignment ℎ( ) ∈ for all reads in the validation set and an inferred 202 taxonomic assignment ( ) ∈ {0 ∪ } for some or all reads in the validation set, where is 203 the set of nodes in the database taxonomy and 0 is a special taxon ID indicating unassignedness. The 204 function _ ( , ), where ∈ {0 ∪ } is a taxonomic assignment and a taxonomic level, enables 205 the conversion of these assignments to specific taxonomic levels and is defined as 206 , where ( ) is a function that returns the taxonomic level of an assignment and where 208 ( , ) is a function that returns the -level ancestral node of . An assignment is counted as 209 correct at level iff _ ( ( ), ) = _ ( ℎ( ), ). For MetaMaps, strain-level 210 correctness is defined as ( ) = ℎ( ). PPV at level is defined as the proportion of correct 211 assignments at level , and recall at level is defined as the number of correct assignments at level 212 divided by the total number of reads in the validation set. MetaMaps produces assignments for all reads in 213 the validation set longer than 2000bp, some of which might be 0 (see above). Kraken produces 214 assignments for all reads in the validation set, some of which might be 0 or, because they correspond to 215 non-leaf nodes of the taxonomy, will be converted to 0 when evaluating at lower taxonomic levels. 216 Note that the presence of 0s is not confined to the inference set; _ ( ℎ( ), ) is 0 if the 217 taxonomic level of ℎ( ) is higher than . This happens in experiments e2 and HMP7 for reads that 218 emanate from out-of-database genomes; for these, ℎ( ) is set to the node that represents the most 219 recent common ancestor between the source and database genomes (see below and Supplementary Figure  220  S6). Under the model of Kraken, reads assigned to higher nodes in the taxonomy could also be considered 221 as entirely uncalled at lower levels; we therefore also report metric PPV2, which, for level , is defined as 222 the proportion of correct non-0 assignments at level . 223 Sample composition estimation accuracy is assessed by the two metrics L1 (the distance between the true 224 and inferred sample composition vectors using the L1 norm) and r2 ( 17982), the next-closest in-database relative belongs to a different species (Actinomyces meyeri, mash 234 distance 0.14). For the truth set that read assignments are evaluated against, out-of-database genomes are 235 assigned to non-leaf nodes of the taxonomy (see above); specifically, the true strain-level taxon IDs for all 236 Acinetobacter and Rhodobacter reads, and the true strain-and species-level taxon IDs for all Actinomyces 237 reads, are set to 0 (Supplementary Figure S6). Note that MetaMaps will always assign all mapped reads at 238 the strain level, whereas Kraken can place reads at higher taxonomic levels. 239

240
Simulated data 241 We first evaluate the performance of MetaMaps in two simulation experiments. Experiment i100 242 represents a "medium-complexity" metagenomic analysis scenario with approximately 100 species; 243 experiment p25 a "pathogenic" metagenomic scenario with 15 strains of 5 potentially pathogenic bacteria 244 and 10 other bacterial strains. At the strain level, MetaMaps assignments achieve a PPV of 92% (i100) 245 and 86% (p25). At the species level, PPV increases to 98% (i100) and 100% (p25). MetaMaps 246 consistently outperforms Kraken in terms of PPV by a margin of 4 -10% (species, genus); at the family 247 level, MetaMaps still outperforms Kraken, but by <1%. MetaMaps also consistently outperforms Kraken 248 in terms of PPV2, apart from the species level in i100, where its PPV2 is 0.2% lower than that of Kraken 249 (though at a 6% higher recall). Recall for MetaMaps is consistently about 3% lower than PPV, driven by 250 the minimum length requirement. At the species level, MetaMaps outperforms Kraken by 1 -6% in terms 251 of recall; at higher taxonomic levels, Kraken recall is 1 -3% above that of MetaMaps. For reads longer 252 than 2000 bp, MetaMaps achieves a recall of 99.9% or higher from the genus (i100) and species (p25) 253 level onwards, consistently higher than that of Kraken. Read classification results are visualized in Figure  254 1 and full results are given in Supplementary Table S7

258
Bar plots show PPV, recall for all reads, and recall for reads longer than 2000 bases at different evaluation levels. Note that

259
Kraken was not designed to achieve strain-level resolution; it is therefore not validated at this level.

260
MetaMaps can also accurately estimate sample composition (Figure 2). At the strain level, MetaMaps 261 achieves a Pearson's 2 between estimated and true abundances of 0.87 (i100) and 0.77 (p25). These 262 increase to >0.99 at higher levels. The performance of Bracken and MetaMaps is similar; MetaMaps, 263 however, exhibits slightly smaller distances (L1-norm) between estimated and true compositions. Full 264 compositional estimation accuracy results are shown in Supplementary Table S8. 265 We use the i100 experiment to assess the effect of read length on the ability to accurately classify a read. 266 All methods show a trend towards higher classification accuracy for longer reads. This effect is most 267 pronounced for Kraken (Figure 3), whereas MetaMaps exhibits relatively constant classification accuracy, 268 once the minimum read length has been reached (see Discussion). 269 274 reads classified as belonging to these genera map to other genomes of the same species / genus (see "Evaluating HMP7" in 275 "Methods").

277
Real HMP7 data 278 To evaluate performance on real data, we apply MetaMaps and Kraken/Bracken to PacBio data from the 279 Microbial Mock Community B of the Human Microbiome Project (HMP Set 7). Note that not all 280 genomes represented in HMP7 are part of the utilized reference database (see Methods), and that sample-281 database mismatches are a recurrent concern in metagenomics. 282 First, we evaluate read assignment accuracy (Figure 1 and Supplementary Table S7). At the strain level, 283 MetaMaps achieves a PPV of 56%; but we note that strain-level differences might exist between the 284 sequenced HMP7 sample and the reference genomes deposited in NCBI. Consistent with this, PPV 285 increases to 95% at the species level. MetaMaps consistently outperforms Kraken in terms of PPV, by a 286 margin between 9% (species) and 5% (family). It also outperforms Kraken in terms of PPV2, though by 287 smaller margins (1 -2%). In the HMP7 dataset, a higher proportion (23%) of reads remain unassigned 288 due to the minimum length requirement of MetaMaps, and Kraken achieves higher recall values than 289 MetaMaps (margin 13% -16%). We note, however, that MetaMaps recall for reads longer than 2000bp is 290 >92% across all evaluated levels and consistently higher than that of Kraken (on the same set of reads). 291 Second, we consider the accuracy of sample composition estimation (Supplementary Table S8). As 292 before, estimating sample composition at the strain level is most challenging (MetaMaps 2 = 0.3); the 293 accuracy of the estimation is much higher at the species ( 2 = 0.98) and genus/family ( 2 = 0.91) levels. 294 On the HMP7 data, MetaMaps has a consistent advantage over Bracken (Figure 3), which has a species-295 level 2 of 0.85. Of note, accuracy for the Actinomyces genus is low for both MetaMaps and Kraken, 296 because the specific strain present in HMP7 is not part of the reference database (see Methods and next 297 section). 298 299 300 Figure 3 PPV of called reads in simulation experiment i100, stratified by read length. Note that MetaMaps results start at a 301 minimum read length of 2,000, corresponding to the "minimum read length" parameter the algorithm was run with. For bins 302 above the MetaMaps minimum read length, recall equals PPV.

303
Database-sample mismatches 304 Mismatches between the sequencing sample and the utilized database are an important concern in 305 metagenomics (i.e. sequencing reads originating from genomes not in the database). To evaluate the 306 effect of large out-of-database genomes, we assess classification accuracy in experiment e2. Experiment 307 e2 contains simulated reads from two eukaryotic genomes, neither of which is present in the reference 308 database (the yellow fever mosquito and Toxoplasma gondii, representing plausible contamination 309 scenarios; see Methods). For both read sets, MetaMaps has a low false-positive rate and correctly leaves 310 the large majority of reads unclassified (99% PPV/recall at the species, genus and family levels for 311 mosquito reads and 98% of toxoplasma reads); of note, the minimum length requirement of MetaMaps 312 does not contribute to this result, as all simulated reads in this experiment are long enough (see Methods). 313 Kraken and Bracken, on the other hand, falsely classify four times as many mosquito reads and eightfold 314 more toxoplasma reads than MetaMaps (96% PPV/recall for mosquito reads and 83% for toxoplasma 315 reads, at the same levels). Actual microbial contamination of these eukaryotic assemblies is possible, but 316 given that MetaMaps shows similar sensitivity to Kraken on the other datasets, it is likely that the 317 majority of Kraken/Bracken e2 calls are false. 318 We use the HMP7 data to evaluate the effect of subtler mismatches and whether the availability of read 319 mapping locations and estimated alignment identities enable the detection of database-sample 320 mismatches. MetaMaps provides an R tool to visualize spatial coverage and identities of the read 321 mappings. Examination of these plots for the Actinomyces genome (Figure 4), which is diverged from the 322 strain in the sequencing dataset (mash distance 0.14, see Methods), reveals both a highly uneven coverage 323 pattern as well as a stark shift of read identities away from the expected average of around 0.88 324 (approximately equal to 1 minus the sequencing error rate, Supplementary File S9). It is clear from these 325 results that any Actinomyces-related result from this experiment would have to be interpreted with 326 caution, consistent with the high evolutionary distance between the sample Actinomyces genome and its 327 next-closest relative in the database. 328 329 330 331 Figure 4 Estimated alignment identity and spatial genome coverage for Helicobacter pylori 26695-1 (panel A) and Actinomyces 332 meyeri (panel B) in the HMP7 experiment. Actinomyces meyeri is the next-closest database relative for the Actinomyces 333 odontolyticus genome present in the HMP7 sequencing data (mash distance 0.14). Helicobacter pylori 26695-1 is present in 334 HMP7 and in the reference database. Uneven spatial coverage and estimated mapping identities shifted away from the 335 expected mode around 0.88 for Actinomyces meyeri are indicative of a mismatch between the sequencing sample and the 336 reference database. Complete plots for HMP7 are contained in Supplementary File S9.

337
Runtime and memory-efficient mode 338 At the expense of slightly increased runtimes, MetaMaps can be run in memory-efficient mode, with an 339 upper memory consumption target to be specified by the user. We evaluate this mode by repeating the 340 i100 experiment with target memory set to 20 GB. First, the complete MetaMaps classification process in 341 standard mode takes 9 CPU hours and memory consumption peaks at 139 GB, well above the capacity of 342 standard workstation computers. With a target memory set to 20 GB, the classification process takes 15 343 CPU hours and memory peaks at 26 GB, a requirement that can be satisfied by medium-ranged 344 workstations (Supplementary Table S10). Note that effectively consumed memory can exceed the 345 specified target maximum amount (Methods). The accuracy of both read assignment and sample 346 composition is virtually unaffected by limiting memory (Supplementary Table S7 and Supplementary  347  Table S8). Kraken/Bracken are 1-2 orders of magnitude faster than MetaMaps (Supplementary Table  348 S10) but require more memory (154 GB) and do not report read mapping information. 349

350
We have presented MetaMaps, an algorithm specifically developed for the analysis of long-read 351 metagenomic datasets that enables simultaneous read assignment and sample composition analysis. The 352 key novelty of MetaMaps is the combination of an approximate mapping algorithm with a model for 353 mapping qualities and the application of the EM algorithm for estimation of overall sample composition. 354 As discussed in the Introduction, this design was motivated by the aim to develop an algorithm tailored 355 for long reads that is both fast and preserves per-read spatial and quality information. 356 Our evaluations show that MetaMaps outperforms Bracken in terms of sample composition estimation, 357 that the read assignments of MetaMaps are more accurate than those of Kraken, and that recall for long 358 reads (above defined as > 2000 bp) is consistently higher than that of Kraken. However, a proportion of 359 reads remain unassigned under the MetaMaps model because they do not meet the minimum length 360 requirement. This is a direct consequence of the approach we chose for approximate mapping, which 361 determines minimizer density based on expected read lengths and alignment identities. Reads that fall 362 below the chosen minimum length end up with minimizer sets that are too small to reliably determine 363 their mapping locations. It is worth noting, however, that minimum read length and expected alignment 364 identities are user-defined parameters that can be set empirically (for example based on the distribution of 365 read lengths) and according to user preferences (e.g. with respect to runtime and the proportion of reads 366 that remain unclassified). In addition, read lengths can be optimized with specific protocols for the 367 extraction of high-molecular-weight DNA; the applicability of these, however, depends on sample and 368 experimental conditions. 369 MetaMaps computes a maximum likelihood approximate mapping location, an estimated identity and 370 mapping qualities for all candidate mapping locations. Its output is nearly as rich as alignment-based 371 methods and enables a very similar set of applications, while being many times faster. 372 We have demonstrated the advantages of this approach. First, MetaMaps is robust against the presence of 373 large out-of-database genomes, for example eukaryotic genomes. Contamination and environmental DNA 374 are important concerns in many metagenomic studies, and the MetaMaps model is more robust against 375 these than methods based purely on individual k-mers. Second, estimated alignment identities can be 376 informative about the presence of novel species or strains that have a related in-database genome. This 377 too is an important concern, as microbial reference databases comprise but a fraction of total microbial 378 genome diversity. Third, because it reports mapping information, MetaMaps can be used to ascertain the 379 presence of particular genes or loci of interest, for example antibiotic resistance genes or virulence 380 factors. 381 In many plausible metagenomic analysis scenarios, computing resources are limitedfor example when 382 sequencing metagenomes using a portable nanopore device during a field trip without reliable internet 383 connection. We therefore developed a feature to limit memory consumption during the approximate 384 mapping step. As we showed, reducing memory consumption comes at a runtime cost, but accuracy 385 remains unaffected, and, in contrast to many other similar approaches, classification is still carried out 386 against the complete reference database. 387 There are two important directions for future work. First, building support for streaming data into 388 MetaMaps would be an important feature for many clinical applications. It could also be used to control 389 the "read until" {Loose, 2016 #190} feature of the Nanopore technology. Such an extension would be 390 relatively straightforward to implement in terms of the algorithms' architecture by dynamically re-391 computing mapping qualities, to the extent that they are influenced by changes in the global sample 392 composition frequency vector. Second, it would be desirable to integrate an explicit term for genomic 393 divergence directly into the statistical models of MetaMaps; this would enable the explicit detection of 394 and testing for novel strains and species in the sequencing sample. K-mer painting approaches [18] have 395 been suggested as a solution to this problem in the short-read space. How to best implement the detection