Special thanks to M.H. Baym and P.-R. Loh for launching compressive genomics with B.B. Thanks to M.H. Baym, P.-R. Loh, N.M. Daniels, R. Daniels, G. Tucker, H. Cho and P. Palmedo for most helpful comments. D.Y., J.P., Y.W.Y. and B.B. are partially supported by the US National Institutes of Health GM108348 (to B.B.). D.Y. is also partially supported by the Howard Hughes Medical Institute and IBM. Y.W.Y. gratefully acknowledges support from the Fannie and John Hertz Foundation.
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Comparison between conventional read mapping methods and compressively-accelerated read mapping framework (CORA).
Comparison between conventional read mapping methods and compressively-accelerated read mapping framework (CORA).
(a) To generate read-mapping results, existing read mappers compare each read to the reference or to a previously constructed index of the reference. (i) Unique-read case: Some reads have a unique match to the reference. (ii) Multi-read case: Due to the repetitive structure of DNA sequences, a single read can often be aligned with high similarity to more than one location in the reference. Most existing approaches run costly seed-extension or suffix-array traversal stages for each of these locations, so reporting multiple mappings for each read (or comparing them to report a best mapping) requires additional computational time. Furthermore, in high depth-coverage datasets (especially when multiple individuals are mapped together), reads are often fully or partially similar to other reads in the dataset. (iii) Redundancy within reads: Because existing aligners cannot utilize inexact redundancy within reads, they process each read individually, potentially duplicating computations already performed for earlier reads in the dataset. This inefficiency is a particular issue for multi-reads sequenced from highly repetitive regions of the genome, since they require sequence comparisons with a large number of loci in the reference. Compressively accelerated read mapping addresses these inefficiencies of mapping high-throughput NGS reads by capitalizing on redundancy within both the read dataset and the reference.
(b) CORA capitalizes on redundancy in both reads and reference. (1) As a preprocessing step, a high-resolution homology table is created for the reference sequence by mapping the reference to itself. The homology table contains all homologous pairs of loci in the reference above a similarity threshold, allowing fast direct access to similar locations in the reference during mapping. (2) The first step in compressive read mapping is to compress the reads in order to eliminate full or partial redundancies across reads in the dataset. Compression is achieved through self-mapping of the read dataset. (3) Next, an off-the-shelf aligner can be used to perform a coarse mapping from the compressed read data (clusters of similar substrings) to the reference. (4) Each read link represents a cluster of substrings from one or more reads in the dataset and stores their differences from a locus in the reference. (5) Read links are further expanded through traversal of the pre-computed homology table, and the final mapping results are reported. (6) Compressive read mapping requires far fewer comparisons because it efficiently utilizes redundancy within the read sequences as well as the reference.
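The flow in panel (b) can be illustrated with a toy sketch. It is not the CORA implementation: it assumes exact matching only, reads of exactly length k, forward strand only, and `str.find` as a stand-in for the off-the-shelf coarse mapper. All function names are hypothetical.

```python
from collections import defaultdict


def build_exact_homology_table(reference, k):
    """Step 1 (toy): group every position of the reference by its k-mer."""
    table = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        table[reference[pos:pos + k]].append(pos)
    return dict(table)


def collapse_reads(reads):
    """Step 2 (toy): identical reads collapse into one cluster of read indices."""
    clusters = defaultdict(list)
    for idx, read in enumerate(reads):
        clusters[read].append(idx)
    return dict(clusters)


def coarse_map(seq, reference):
    """Step 3 (stand-in): return one matching locus, or None if there is none."""
    pos = reference.find(seq)
    return pos if pos >= 0 else None


def map_reads(reads, reference, k):
    """Steps 4-5 (toy): expand each coarse mapping through the homology table."""
    table = build_exact_homology_table(reference, k)
    mappings = {}
    for seq, read_ids in collapse_reads(reads).items():
        anchor = coarse_map(seq, reference)  # one comparison per unique read
        if anchor is None:
            continue
        homologous_loci = table[reference[anchor:anchor + k]]
        for read_id in read_ids:
            mappings[read_id] = list(homologous_loci)
    return mappings


reference = "ACGTACGTTTGACGTA"
reads = ["ACGT", "TTGA", "ACGT"]
result = map_reads(reads, reference, k=4)
```

In this sketch the two identical reads trigger a single coarse-mapping call, and all three of their loci come from one table lookup rather than from three separate alignments.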
Supplementary Figure 2 Runtime and sensitivity comparison results for whole genome ungapped (substitution-only) and gapped (with indels) best-mapping of 1000 Genomes Phase 1 Illumina 2×108bp paired-end read datasets.
Runtime and sensitivity comparison results for whole genome ungapped (substitution-only) and gapped (with indels) best-mapping of 1000 Genomes Phase 1 Illumina 2×108bp paired-end read datasets of 4 Finnish individuals with ∼16× read depth-coverage. The similarity threshold is defined as a Hamming distance of 4 for each end in the ungapped mapping benchmark and a Levenshtein (edit) distance of 4 for each end in the gapped mapping benchmark. The paired-end insert size interval is defined to be between 150 and 650 base pairs. We compared best-mapping runtimes of Bowtie2, BWA aln, BWA mem, mrsFAST-Ultra (only for ungapped), GEM (only for gapped) and Masai against a compressively accelerated version of BWA in two different modes: CORA-BWA and CORA-BWA-fast, which sacrifices some sensitivity to allow faster best-mapping. The bars in the upper panel represent the mappers' runtime performance, whereas the bars in the lower two panels indicate sensitivity performance: percentage sensitivity for ungapped mapping and number of mappings for gapped mapping. As BWA mem does not accept insert size intervals, we show two different mapping sensitivity measurements for it: sensitivity within the defined insert size interval and the increased sensitivity when mappings outside of the interval are included as well (the difference is indicated by the hatch pattern). Some of the results are estimated from a down-sampled set of reads; detailed benchmark criteria as well as mapping parameters can be found in Further details on experimental setup.
The plots indicate that, compared to the fastest best-mappers we tested against, CORA-BWA mappers are at least ∼2× faster with superior or comparable sensitivity. The only mapper that approached CORA-BWA in terms of best-mapping runtime was Masai in the ungapped mapping experiment, albeit with drastically lower sensitivity; even then, CORA-BWA-fast was >1.4× faster than Masai while using less memory. Moreover, compared to the original BWA aln, CORA-BWA generated best-mapping results with near-identical sensitivity, but >3.2× faster for gapped and >3.1× faster for ungapped mapping. Furthermore, comparisons with CORA's all-mapping runtime results in Figure 1a reveal that CORA can perform near-perfect sensitivity all-mapping faster than BWA, Bowtie2 and mrsFAST-Ultra can report best-mapping results.
The peak memory usage of Bowtie2 was 3.2GB; BWA used 4.7GB and 6.2GB for aln and mem, respectively; mrsFAST-Ultra used 4.7GB; and GEM and Masai used 4.1GB and 23.2GB, respectively. CORA-BWA, at the maximum over the collapsing, coarse-mapping and homology table traversal stages, used 19.7GB of memory for the runs that only utilized the exact homology table, whereas it used 64.1GB for the runs that also loaded the inexact homology table into memory (e.g. ungapped mapping with CORA-BWA default mode).
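The two similarity thresholds used in this benchmark measure different things, which a short sketch can make concrete. The functions below are textbook definitions written for illustration, not code from any of the benchmarked mappers; the example sequences are made up.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Edit distance (substitutions, insertions, deletions), row-by-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion from a
                            curr[j - 1] + 1,            # insertion into a
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


read = "ACGTACGT"
shifted = "CGTACGTA"  # same sequence with one base deleted and one appended
ungapped_distance = hamming(read, shifted)
gapped_distance = levenshtein(read, shifted)
```

Here a single one-base shift makes every position mismatch under the Hamming measure, while the edit distance stays at 2, which is why the gapped benchmark can report mappings that the ungapped one cannot.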
Supplementary Figure 3 Compact representation of homology table with exact and inexact homologies.
Compact representation of homology table with exact and inexact homologies. Identical homology blocks of a certain length are collected under the same equivalence class. Each equivalence class has a representative locus, as well as other loci that are oriented with respect to the representative locus. The inexact homology table is a graph with equivalence classes as nodes and partial similarities between equivalence classes as edges. Therefore, inexact homologies can only be defined between equivalence class representative loci. Each inexact homology edge identifies the offset from the beginning of the first equivalence class, the offset from the beginning of the second equivalence class, the length of the inexact homology block, the forward or reverse-complement direction of homology, and the positions of differences or base substitutions (text above bold bi-directional arrow). The size of an inexact homology block is the final length of the inexact homology after two or more consecutive and concordant inexact k-mer homologies are merged together. Two consecutive inexact k-mers are defined as concordant if their target positions in the reference are also consecutive and their edit positions in the k-mer are concordant, i.e., they contain the same (k-1)-mers.
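One way to picture the representation described in this legend is as a pair of record types. The sketch below is a hypothetical in-memory layout, not CORA's compressed on-disk format; every class name, field name and coordinate is an assumption made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Locus:
    chrom: str
    pos: int
    reverse: bool = False  # orientation relative to the class representative


@dataclass
class EquivalenceClass:
    representative: Locus
    length: int                              # length of the identical block
    members: list[Locus] = field(default_factory=list)

    def size(self) -> int:
        """Number of loci in the class, including the representative."""
        return 1 + len(self.members)


@dataclass(frozen=True)
class InexactEdge:
    first_class: int                   # index of the first equivalence class
    second_class: int                  # index of the second equivalence class
    first_offset: int                  # offset from the start of the first class
    second_offset: int                 # offset from the start of the second class
    length: int                        # length of the inexact homology block
    reverse: bool                      # forward or reverse-complement homology
    mismatch_positions: tuple[int, ...]  # positions of base substitutions


def neighbors(class_id, edges):
    """Classes linked to class_id; edges are treated as bi-directional."""
    linked = []
    for edge in edges:
        if edge.first_class == class_id:
            linked.append(edge.second_class)
        elif edge.second_class == class_id:
            linked.append(edge.first_class)
    return linked


classes = [
    EquivalenceClass(Locus("chr20", 1000), 60,
                     [Locus("chr20", 5000), Locus("chr20", 9000, reverse=True)]),
    EquivalenceClass(Locus("chr20", 20000), 60),
]
edges = [InexactEdge(0, 1, first_offset=5, second_offset=0, length=48,
                     reverse=False, mismatch_positions=(7, 30))]
```

Storing edges only between representatives, as the legend describes, means one edge stands in for every pairwise similarity between the members of the two classes.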
Supplementary Figure 4 Seed position selection scheme for inexact homology table construction.
Seed position selection scheme for inexact homology table construction. This seed selection scheme allows detection of all inexact homologies of a given k-mer length within a Hamming distance of 2 (by the pigeonhole principle). To detect higher numbers of mismatches, the number of seeds sampled can be increased. Rather than selecting consecutive positions for each seed, spacing out seed positions throughout the k-mer allows for more evenly sized bins within the hash table. The positions sampled for each seed are designed to be closed under reverse complementation, so that the hash tables created for each seed are disjoint. This allows for both space savings and easy parallelization of inexact homology table construction.
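The pigeonhole argument can be checked with a small sketch. It assumes three interleaved seeds (positions taken modulo 3), which is one simple way to space seed positions out; the exact layout in the figure may differ, and this sketch does not reproduce the closure under reverse complementation that the legend describes. Function names are hypothetical.

```python
from itertools import combinations


def spaced_seeds(k, n_seeds=3):
    """Partition positions 0..k-1 into n_seeds interleaved position sets."""
    return [tuple(range(i, k, n_seeds)) for i in range(n_seeds)]


def seed_keys(kmer, seeds):
    """Hash key of each seed: the bases of the k-mer at that seed's positions."""
    return ["".join(kmer[p] for p in seed) for seed in seeds]


def shares_a_seed(a, b, seeds):
    """True if the two k-mers agree exactly on at least one seed."""
    return any(x == y for x, y in zip(seed_keys(a, seeds), seed_keys(b, seeds)))


def check_pigeonhole(kmer, seeds):
    """Try every pair of mismatch positions; some seed must stay intact."""
    for i, j in combinations(range(len(kmer)), 2):
        mutated = list(kmer)
        for p in (i, j):
            mutated[p] = "A" if kmer[p] != "A" else "C"
        if not shares_a_seed(kmer, "".join(mutated), seeds):
            return False
    return True


seeds = spaced_seeds(12)
kmer = "ACGTTGCAAGTC"
two_mismatches_always_found = check_pigeonhole(kmer, seeds)
# Three mismatches, one in each seed, can escape detection with three seeds.
three_mismatches_found = shares_a_seed(kmer, "CAATTGCAAGTC", seeds)
```

Two mismatches can touch at most two of the three disjoint seeds, so at least one seed key is unchanged; with three mismatches spread across all seeds that guarantee no longer holds, which is why more seeds are needed for larger distances.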
Supplementary Figure 5 Homology table traversal scheme of CORA framework.
Homology table traversal scheme of CORA framework. Each read link, representing one or more reads, points to the genomic location determined by the coarse mapping stage in either the forward or reverse-complement direction, indicating homology within a small number of edits. This locus, the anchor, is either unique in the reference genome or associated with an equivalence class in the exact homology table. In the latter case, the anchor is linked to the equivalence class in the forward or reverse-complement direction, with a block offset value indicating the starting location of a substring of the equivalence class representative that is identical to the coarse genome target. The representative of the equivalence class points to all members of the equivalence class for each valid offset (until the end of the block) in the forward or reverse-complement direction. Furthermore, the equivalence class representative can be linked to other equivalence classes through the inexact homology table. Each of these pointers contains the direction of homology (forward or reverse-complement), the block offset in the compressed inexact homology block representation, and the edit script to convert one class representative to another. Provided that the juxtaposition of the edits from a read link to the anchor and from the anchor's class representative to a neighboring equivalence class still contains no more than the user-specified number of errors, read mappings associated with all members of these neighbor equivalence classes will also be reported.
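The final check in this legend, combining the edits on a read link with the edits on an inexact homology edge, might look like the following sketch. It assumes substitution-only edits, a shared orientation, and positions already expressed in read coordinates; the dictionary representation and function names are hypothetical rather than CORA's edit-script format.

```python
def combined_mismatches(read_edits, edge_edits):
    """Count mismatches between a read and a neighboring locus.

    read_edits: position -> base in the read where it differs from the anchor.
    edge_edits: position -> base in the neighbor where it differs from the anchor.
    """
    count = 0
    for pos in read_edits.keys() | edge_edits.keys():
        if pos in read_edits and pos in edge_edits:
            # Both differ from the anchor here; they may agree with each other.
            count += read_edits[pos] != edge_edits[pos]
        else:
            # Only one side differs from the anchor, so read and neighbor differ.
            count += 1
    return count


def reportable(read_edits, edge_edits, max_errors):
    """True if the neighbor mapping stays within the user-specified threshold."""
    return combined_mismatches(read_edits, edge_edits) <= max_errors


read_edits = {3: "A", 10: "T"}
agreeing_edge = {10: "T", 20: "G"}     # shares the substitution at position 10
conflicting_edge = {10: "C", 20: "G"}  # different base at position 10
```

With a threshold of 2, the first edge yields a reportable mapping (positions 3 and 20 mismatch, position 10 cancels out) and the second does not, all without comparing the read to the neighboring locus base by base.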
Supplementary Figure 6 Estimation of redundancy within read datasets in absence of sequencing error.
Estimation of redundancy within read datasets in absence of sequencing error. The plot above demonstrates how the number of k-mers processed by the coarse mapping stage scales with respect to total number of reads in the input dataset, for a high depth-coverage simulation of 100bp paired-end reads on hg19 chromosome 20, with 0.1% mutation rate and 0% sequencing error. After roughly 100 million reads in the input dataset, additional reads in the dataset do not affect the size of the coarse mapping stage, thus resulting in sublinear mapping scalability with the CORA framework.
Supplementary Figure 7 Sublinearity analysis of CORA framework on simulated paired-end read data.
Sublinearity analysis of CORA framework on simulated paired-end read data from hg19 chromosome 20 with 0.1% mutation rate and 2%, 0.5%, and 0.125% sequencing errors (E2, E0.5, and E0.125), respectively. Rate of redundancy is calculated as the total number of k-mers in the read dataset divided by the number of k-mers processed by CORA during the coarse mapping stage. Results indicate that rate of redundancy monotonically increases even in the presence of sequencing errors; therefore, for all three datasets the CORA framework spends less coarse mapping time per additional read, indicating sublinear scalability regardless of sequencing errors in practice.
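The rate of redundancy defined above is a simple ratio, sketched here on made-up reads. The sketch assumes reads are split into non-overlapping k-mers and that the k-mers processed by coarse mapping are exactly the distinct ones, ignoring strand; both are simplifying assumptions, and the function name is hypothetical.

```python
def redundancy_rate(reads, k):
    """Total k-mers in the dataset divided by the number of distinct k-mers."""
    total = 0
    unique = set()
    for read in reads:
        for offset in range(0, len(read) - k + 1, k):  # non-overlapping k-mers
            unique.add(read[offset:offset + k])
            total += 1
    return total / len(unique)


reads = ["ACGTACGT", "ACGTTTTT", "GGGGACGT"]
rate = redundancy_rate(reads, k=4)
```

In this toy dataset six k-mers collapse to three distinct ones, giving a rate of 2.0; a rate that keeps rising as reads are added means each additional read contributes fewer new k-mers to the coarse mapping stage.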
Supplementary Figure 8 An overview of CORA framework and input/output relations between different stages of the CORA software.
An overview of CORA framework and input/output relations between different stages of the CORA software. Green, purple, yellow and red panels describe pre-processing, k-mer collapsing, coarse mapping and homology table traversal stages of the CORA pipeline respectively. The grey panel on the lower right provides a legend for different types of boxes and arrows used in the diagram.
Green panel: The inputs for the preprocessing stage are a reference sequence in multi-FASTA format, k-mer length value K, distance threshold value S, and an executable binary for an off-the-shelf read mapper to be used in coarse-mapping. The off-the-shelf mapper's indexing algorithm is called in order to generate a reference index. Exact and inexact homology tables are constructed and compressed to be used in the homology table traversal stage.
Purple panel: The k-mer collapsing stage takes a set of FASTQ files together with the k-mer length value as input and generates a list of unique k-mers (with compact IDs that encode the reads that contain the extracted k-mers). An auxiliary lookup table is generated for some of the k-mer IDs that are known to cause problems during coarse mapping (e.g. very long IDs). Optionally, the k-mer collapsing stage uses the reference genome to identify k-mers that are identical to a k-mer in the reference, in which case these k-mers are reported separately as perfect k-mer links to the reference genome, skipping the coarse mapping stage.
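A minimal sketch of the collapsing idea follows. It assumes non-overlapping k-mers, treats a k-mer and its reverse complement as the same entry, and records occurrences as plain tuples instead of CORA's compact IDs; the optional reference check is approximated with a set of reference k-mers. All names are hypothetical.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def collapse_kmers(reads, k):
    """Map each distinct k-mer to the (read index, offset, reversed) triples
    that record where it occurs in the read dataset."""
    table = defaultdict(list)
    for read_idx, read in enumerate(reads):
        for offset in range(0, len(read) - k + 1, k):
            kmer = read[offset:offset + k]
            rc = reverse_complement(kmer)
            if rc < kmer:  # store under the lexicographically smaller strand
                table[rc].append((read_idx, offset, True))
            else:
                table[kmer].append((read_idx, offset, False))
    return dict(table)


def split_perfect_links(collapsed, reference, k):
    """Separate k-mers found verbatim in the reference (on either strand)."""
    ref_kmers = set()
    for pos in range(len(reference) - k + 1):
        kmer = reference[pos:pos + k]
        ref_kmers.add(min(kmer, reverse_complement(kmer)))
    perfect = {km: occ for km, occ in collapsed.items() if km in ref_kmers}
    remaining = {km: occ for km, occ in collapsed.items() if km not in ref_kmers}
    return perfect, remaining


reads = ["ACGTAAAA", "TTTTACGT"]
collapsed = collapse_kmers(reads, k=4)
perfect, remaining = split_perfect_links(collapsed, "GGACGTGG", k=4)
```

Only the entries in `remaining` would need to go through a coarse mapper in this sketch; the entries in `perfect` already have an exact match in the reference.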
Yellow panel: The coarse mapping stage requires an off-the-shelf mapper executable (either one of the built-in tools such as BWA-aln/BWA-mem, Bowtie/Bowtie2 or mrsFAST/mrsFAST-Ultra, or another mapper described through manual mode) and its reference index in order to place each of the unique k-mers at a locus in the reference. Afterwards, the coarse mappings are converted to k-mer links in the link construction step of CORA. If k-mer collapsing was performed with the reference sequence, the perfect k-mer links from the previous stage are merged with the k-mer links generated by the link construction step.
Red panel: The homology table traversal stage takes the k-mer links from the previous stage, the reference sequence and the exact/inexact homology tables as input, and generates the final set of mappings (in SAM format) by traversing the homology table(s) following the k-mer links. While some mapping modes can be performed using only the exact homology table (e.g. fast best-mapping mode), other mapping modes require the inexact homology table as well. Optionally, the original read dataset is used in this stage in order to print the SAM file with the original read names (as opposed to just their order in the input file) and/or the quality scores.
About this article
Cite this article
Yorukoglu, D., Yu, Y., Peng, J. et al. Compressive mapping for next-generation sequencing. Nat Biotechnol 34, 374–376 (2016). https://doi.org/10.1038/nbt.3511