Compressive mapping for next-generation sequencing

Yorukoglu, Deniz; Yu, Yun William; Peng, Jian; Berger, Bonnie

doi:10.1038/nbt.3511

Correspondence
Published: 07 April 2016


Deniz Yorukoglu^1, Yun William Yu^1,2 (ORCID: orcid.org/0000-0002-8275-9576), Jian Peng^1,2,3 & Bonnie Berger^1,2

Nature Biotechnology volume 34, pages 374–376 (2016)


To the Editor:

The analysis and storage of ever-increasing amounts of sequencing data present a huge computational challenge for the genomics community¹. However, only a small proportion of this sequence information varies between individuals, and it is this variation that we hope to identify and understand. Recently, compressive genomics has been introduced as a way of improving the accuracy and efficiency of searching large sequencing databases². Compressive genomics removes redundancies in genomic sequences and enables compressed data to be analyzed directly—facilitating parsimonious storage and fast access. This approach has been shown to accelerate the performance of standard search tools, such as BLAST and BLAT². Although some read-mapping tools also remove redundancies in the reference genome to facilitate mapping individual reads^3,4,5, they do not take full advantage of the redundancy across reads present in large sequencing data sets, which are often much larger and more redundant than the reference genome itself⁶ (Supplementary Fig. 1a).


**Figure 1: All-mapping performance of CORA framework.**

References

Berger, B., Peng, J. & Singh, M. Nat. Rev. Genet. 14, 333–346 (2013).
Loh, P.R., Baym, M. & Berger, B. Nat. Biotechnol. 30, 627–630 (2012).
Li, H. & Durbin, R. Bioinformatics 25, 1754–1760 (2009).
Langmead, B. & Salzberg, S.L. Nat. Methods 9, 357–359 (2012).
Huang, L., Popic, V. & Batzoglou, S. Bioinformatics 29, i361–i370 (2013).
Hsi-Yang Fritz, M., Leinonen, R., Cochrane, G. & Birney, E. Genome Res. 21, 734–740 (2011).
Hach, F. et al. Nat. Methods 7, 576–577 (2010).
DePristo, M.A. et al. Nat. Genet. 43, 491–498 (2011).
Sboner, A., Mu, X.J., Greenbaum, D., Auerbach, R.K. & Gerstein, M.B. Genome Biol. 12, 125 (2011).
Hach, F. et al. Nucleic Acids Res. 42, W494–W500 (2014).
Siragusa, E., Weese, D. & Reinert, K. Nucleic Acids Res. 41, e78 (2013).
Li, H. & Durbin, R. Bioinformatics 26, 589–595 (2010).
Veeneman, B.A., Iyer, M.K. & Chinnaiyan, A.M. BMC Bioinformatics 13, 297 (2012).
Treangen, T.J. & Salzberg, S.L. Nat. Rev. Genet. 13, 36–46 (2012).
Hormozdiari, F. et al. Genome Res. 21, 840–849 (2011).
Hormozdiari, F. et al. Bioinformatics 26, i350–i357 (2010).
Simola, D.F. & Kim, J. Genome Biol. 12, R55 (2011).
Jubin, C., Serero, A., Loeillet, S., Barillot, E. & Nicolas, A. G3 (Bethesda) 4, 707–715 (2014).
Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Nat. Methods 5, 621–628 (2008).
Dao, P. et al. Bioinformatics 30, 644–651 (2014).
Li, B. & Dewey, C.N. BMC Bioinformatics 12, 323 (2011).
Anders, S., Pyl, P.T. & Huber, W. Bioinformatics 31, 166–169 (2015).
Kelley, D.R. et al. Genome Res. 15, 537 (2014).

Acknowledgements

Special thanks to M.H. Baym and P.-R. Loh for launching compressive genomics with B.B. Thanks to M.H. Baym, P.-R. Loh, N.M. Daniels, R. Daniels, G. Tucker, H. Cho and P. Palmedo for most helpful comments. D.Y., J.P., Y.W.Y. and B.B. are partially supported by the US National Institutes of Health GM108348 (to B.B.). D.Y. is also partially supported by the Howard Hughes Medical Institute and IBM. Y.W.Y. gratefully acknowledges support from the Fannie and John Hertz Foundation.

Author information

Authors and Affiliations

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
Deniz Yorukoglu, Yun William Yu, Jian Peng & Bonnie Berger
Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
Yun William Yu, Jian Peng & Bonnie Berger
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
Jian Peng

Corresponding author

Correspondence to Bonnie Berger.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Comparison between conventional read mapping methods and compressively-accelerated read mapping framework (CORA).

(a) To generate read-mapping results, existing read mappers compare each read to the reference or to a previously constructed index of the reference. (i) Unique-read case: Some reads have a unique match to the reference. (ii) Multi-read case: Because DNA sequences are repetitive, a single read can often be aligned with high similarity to more than one location in the reference. Most existing approaches run a costly seed-extension or suffix-array traversal stage for each of these locations, which adds computational time when reporting multiple mappings for each read (or when comparing them to report a best mapping). Furthermore, in high depth-coverage datasets (especially when multiple individuals are mapped together), reads are often fully or partially similar to other reads in the dataset. (iii) Redundancy within reads: Because existing aligners cannot exploit inexact redundancy within reads, they process each read individually, potentially duplicating computations already performed for earlier reads in the dataset. This inefficiency is a particular problem for multi-reads sequenced from highly repetitive regions of the genome, since they require sequence comparisons with a large number of loci in the reference. Compressively accelerated read mapping addresses the inefficiencies of mapping high-throughput NGS reads by capitalizing on redundancy within both the read datasets and the reference.

(b) CORA capitalizes on redundancy in both reads and reference. (1) As a preprocessing step, a high-resolution homology table is created for the reference sequence by mapping the reference to itself. The homology table contains all homologous pairs of loci in the reference above a similarity threshold, allowing fast direct access to similar locations in the reference during mapping. (2) The first step in compressive read mapping is to compress the reads in order to eliminate full or partial redundancies across reads in the dataset. Compression is achieved through self-mapping of the read dataset. (3) Next, an off-the-shelf aligner can be used to perform a coarse mapping from the compressed read data (clusters of similar substrings) to the reference. (4) Each read link represents a cluster of substrings from one or more reads in the dataset and stores their differences from a locus in the reference. (5) Read links are then expanded through traversal of the precomputed homology table, and the final mapping results are reported. (6) Compressive read mapping requires far fewer comparisons because it exploits redundancy within the read sequences as well as within the reference.
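As a rough illustration of steps (1) to (6), the following Python sketch runs a toy version of the flow. It is not CORA's implementation: it assumes every read has length k, matches exactly only (no mismatches, no reverse complements), uses `str.find` as a stand-in for the off-the-shelf aligner, and all function names are hypothetical.

```python
from collections import defaultdict

def build_exact_homology(reference, k):
    """Toy exact homology table: each locus maps to all loci with the same k-mer."""
    loci_by_kmer = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        loci_by_kmer[reference[pos:pos + k]].append(pos)
    return {pos: loci for loci in loci_by_kmer.values() for pos in loci}

def collapse_reads(reads):
    """Group identical reads so that each distinct sequence is mapped only once."""
    groups = defaultdict(list)
    for read_id, seq in enumerate(reads):
        groups[seq].append(read_id)
    return dict(groups)

def coarse_map(seq, reference):
    """Stand-in for an off-the-shelf aligner: first exact occurrence, or -1."""
    return reference.find(seq)

def compressive_map(reads, reference, k):
    """Return (read id -> list of loci, number of aligner calls)."""
    homology = build_exact_homology(reference, k)   # step 1 (preprocessing)
    mappings = defaultdict(list)
    aligner_calls = 0
    for seq, read_ids in collapse_reads(reads).items():  # step 2
        aligner_calls += 1
        anchor = coarse_map(seq, reference)              # step 3
        if anchor < 0:
            continue
        # Steps 4-5: one link per distinct sequence, expanded via the table.
        for locus in homology[anchor]:
            for read_id in read_ids:
                mappings[read_id].append(locus)
    return dict(mappings), aligner_calls
```

With four reads of which three are identical, the stand-in aligner is called twice rather than four times, and repeated loci in the reference are recovered from the table rather than by further alignment.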

Supplementary Figure 2 Runtime and sensitivity comparison results for whole genome ungapped (substitution-only) and gapped (with indels) best-mapping of 1000 Genomes Phase 1 Illumina 2×108bp paired-end read datasets.

Runtime and sensitivity comparison for whole-genome ungapped (substitution-only) and gapped (with indels) best-mapping of 1000 Genomes Phase 1 Illumina 2×108bp paired-end read datasets of 4 Finnish individuals with ∼16× read depth-coverage. The similarity threshold is a Hamming distance of 4 for each end in the ungapped mapping benchmark and a Levenshtein (edit) distance of 4 for each end in the gapped mapping benchmark. The paired-end insert size interval is between 150 and 650 base pairs. We compared the best-mapping runtimes of Bowtie2, BWA aln, BWA mem, mrsFAST-Ultra (ungapped only), GEM (gapped only) and Masai against a compressively accelerated version of BWA in two modes: CORA-BWA and CORA-BWA-fast, which sacrifices some sensitivity for faster best-mapping. The bars in the upper panel show the mappers' runtime performance, whereas the bars in the lower two panels show sensitivity: percentage sensitivity for ungapped mapping and number of mappings for gapped mapping. Because BWA mem does not accept insert size intervals, we show two mapping sensitivity measurements for it: sensitivity within the defined insert size interval and the increased sensitivity when mappings outside the interval are also included (the difference is indicated by the hatch pattern). Some of the results are estimated from a down-sampled set of reads; detailed benchmark criteria and mapping parameters can be found in "Further details on experimental setup".

The plots indicate that, compared with the fastest best-mappers we tested, the CORA-BWA mappers are at least ∼2× faster with superior or comparable sensitivity. The only mapper that approached CORA-BWA in best-mapping runtime was Masai in the ungapped mapping experiment, albeit with drastically lower sensitivity; even then, CORA-BWA-fast was >1.4× faster than Masai while using less memory. Moreover, compared with the original BWA aln, CORA-BWA generated best-mapping results with near-identical sensitivity, but >3.2× faster for gapped and >3.1× faster for ungapped mapping. Furthermore, comparison with CORA's all-mapping runtime results in Figure 1a reveals that CORA can perform all-mapping with near-perfect sensitivity faster than BWA, Bowtie2 and mrsFAST-Ultra can report best-mapping results.

The peak memory usage of Bowtie2 was 3.2 GB; BWA used 4.7 GB for aln and 6.2 GB for mem; mrsFAST-Ultra used 4.7 GB; and GEM and Masai used 4.1 GB and 23.2 GB, respectively. CORA-BWA, taking the maximum over the collapsing, coarse-mapping and homology table traversal stages, used 19.7 GB of memory for the runs that used only the exact homology table, and 64.1 GB for the runs that also loaded the inexact homology table into memory (e.g. ungapped mapping with CORA-BWA default mode).
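To make the two similarity thresholds concrete, here is a minimal Python sketch of Hamming distance (substitutions only) and Levenshtein distance (substitutions plus indels), together with a hypothetical `accept_pair` check that applies the per-end threshold of 4 and the 150 to 650 bp insert size interval from the benchmark. Treating the interval as inclusive is an assumption, and the helper names are illustrative rather than taken from any of the mappers.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Edit distance (substitutions, insertions, deletions), row-by-row DP."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # match or substitute
        prev = curr
    return prev[-1]

def accept_pair(dist_end1, dist_end2, insert_size,
                max_dist=4, min_insert=150, max_insert=650):
    """Both ends within the distance threshold and insert size in range
    (interval bounds treated as inclusive, which is an assumption)."""
    return (dist_end1 <= max_dist and dist_end2 <= max_dist
            and min_insert <= insert_size <= max_insert)
```

An ungapped benchmark would pass `hamming` distances to `accept_pair`, and a gapped benchmark would pass `levenshtein` distances.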

Supplementary Figure 3 Compact representation of homology table with exact and inexact homologies.

Identical homology blocks of a certain length are collected under the same equivalence class. Each equivalence class has a representative locus, as well as other loci that are oriented with respect to the representative locus. The inexact homology table is a graph with equivalence classes as nodes and partial similarities between equivalence classes as edges. Therefore, inexact homologies can be defined only between equivalence class representative loci. Each inexact homology edge records the offset from the beginning of the first equivalence class, the offset from the beginning of the second equivalence class, the length of the inexact homology block, the forward or reverse-complement direction of the homology, and the positions of the differences or base substitutions (text above the bold bi-directional arrow). The inexact homology block size is the final length of the inexact homology after two or more consecutive and concordant inexact k-mer homologies are merged together. Two consecutive inexact k-mers are defined as concordant if their target positions in the reference are also consecutive and their edit positions in the k-mer are concordant, i.e., they contain the same (k-1)-mers.
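One way to picture this representation is as a small set of record types. The Python sketch below is only an illustration under assumed names (`Locus`, `EquivalenceClass`, `InexactEdge`, `HomologyTable` are hypothetical); it mirrors the fields listed in the caption and says nothing about CORA's actual compressed on-disk layout.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Locus:
    chrom: str
    pos: int                # 0-based start of the block in the reference
    reverse: bool = False   # orientation relative to the class representative

@dataclass
class EquivalenceClass:
    representative: Locus
    block_length: int
    members: list = field(default_factory=list)  # other identical loci

    def all_loci(self):
        return [self.representative] + list(self.members)

@dataclass(frozen=True)
class InexactEdge:
    source_offset: int      # offset from the start of the first class
    target_class: int       # id of the second equivalence class
    target_offset: int      # offset from the start of the second class
    length: int             # length of the merged inexact homology block
    reverse: bool           # forward or reverse-complement homology
    substitutions: tuple    # positions of differing bases within the block

@dataclass
class HomologyTable:
    classes: dict = field(default_factory=dict)  # class id -> EquivalenceClass
    edges: dict = field(default_factory=dict)    # class id -> list of InexactEdge

    def add_edge(self, source_class, edge):
        self.edges.setdefault(source_class, []).append(edge)

    def neighbors(self, class_id):
        return self.edges.get(class_id, [])
```

Because edges attach only to class representatives, a single edge between two classes stands in for every pairwise inexact homology between their members.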

Supplementary Figure 4 Seed position selection scheme for inexact homology table construction.

This seed selection scheme allows detection of all inexact homologies of a given k-mer length within a Hamming distance of 2 (by the pigeonhole principle). To detect higher numbers of mismatches, the number of seeds sampled can be increased. Rather than selecting consecutive positions for each seed, spacing out seed positions throughout the k-mer allows for more evenly sized bins within the hash table. The positions sampled for each seed are designed to be closed under reverse complementation, so that the hash tables created for each seed are disjoint. This allows for both space savings and easy parallelization of inexact homology table construction.
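The pigeonhole argument can be sketched in Python as follows. The position assignment used here (seed index = min(i, k-1-i) mod number of seeds) is an assumption chosen because it is both spaced and symmetric under position reversal; it is not necessarily the layout shown in the figure. Reverse-complement hashing itself is not modelled, and all names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def seed_masks(k, n_seeds=3):
    """Partition positions 0..k-1 into spaced seeds that are symmetric
    under i -> k-1-i (the position map of reverse complementation)."""
    masks = [[] for _ in range(n_seeds)]
    for i in range(k):
        masks[min(i, k - 1 - i) % n_seeds].append(i)
    return masks

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def candidate_pairs(kmers, masks):
    """Pairs of k-mer indices that agree exactly on at least one seed."""
    pairs = set()
    for mask in masks:
        buckets = defaultdict(list)
        for idx, kmer in enumerate(kmers):
            buckets["".join(kmer[i] for i in mask)].append(idx)
        for members in buckets.values():
            pairs.update(combinations(members, 2))
    return pairs

def inexact_homologies(kmers, max_mismatches=2):
    """All pairs within the Hamming threshold (excluding identical pairs).
    With max_mismatches + 1 disjoint seeds, at most max_mismatches seeds can
    contain a mismatch, so at least one seed matches exactly."""
    masks = seed_masks(len(kmers[0]), max_mismatches + 1)
    return sorted((i, j) for i, j in candidate_pairs(kmers, masks)
                  if 0 < hamming(kmers[i], kmers[j]) <= max_mismatches)
```

Each seed's hash table can be built independently, which is what makes the construction straightforward to parallelize.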

Supplementary Figure 5 Homology table traversal scheme of CORA framework.

Each read link, representing one or more reads, points to the genomic location determined by the coarse mapping stage in either the forward or the reverse-complement direction, indicating homology within a small number of edits. This locus (the anchor) is either unique in the reference genome or associated with an equivalence class in the exact homology table. In the latter case, the anchor is linked to the equivalence class in the forward or reverse-complement direction, with a block offset value indicating the starting location of a substring of the equivalence class representative that is identical to the coarse genome target. The representative of the equivalence class points to all members of the equivalence class for each valid offset (until the end of the block) in the forward or reverse-complement direction. Furthermore, the equivalence class representative can be linked to other equivalence classes through the inexact homology table. Each of these pointers contains the direction of homology (forward or reverse complement), the block offset in the compressed inexact homology block representation, and the edit script that converts one class representative to another. Provided that the juxtaposition of the edits from a read link to the anchor and from the anchor's class representative to a neighboring equivalence class still contains no more than the user-specified number of errors, read mappings associated with all members of these neighboring equivalence classes are also reported.
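The final condition, combining the read-to-anchor edits with the anchor-to-neighbor edit script, can be sketched for the substitution-only case. The Python below is a simplified illustration under stated assumptions: both edit sets are dictionaries from block offset to the base that replaces the anchor's base, offsets and strand are ignored, and `expand_read_link` and `combined_mismatches` are hypothetical names.

```python
def combined_mismatches(read_edits, homology_edits):
    """Mismatches between the read and a neighboring class.

    A position counts unless both edit sets replace the anchor base with
    the same base, in which case the read and the neighbor agree there."""
    count = 0
    for pos in read_edits.keys() | homology_edits.keys():
        if read_edits.get(pos) != homology_edits.get(pos):
            count += 1
    return count

def expand_read_link(anchor_class, read_edits, classes, inexact_edges, max_errors):
    """Return (locus, mismatches) for every reportable mapping of a read link.

    classes: class id -> list of member loci (exact homologies)
    inexact_edges: class id -> list of (neighbor class id, homology_edits)"""
    # All members of the anchor's class share the read's edits to the anchor.
    hits = [(locus, len(read_edits)) for locus in classes[anchor_class]]
    for neighbor, homology_edits in inexact_edges.get(anchor_class, []):
        errors = combined_mismatches(read_edits, homology_edits)
        if errors <= max_errors:
            hits.extend((locus, errors) for locus in classes[neighbor])
    return hits
```

Note that edits can cancel: a read that differs from its anchor at a position where the neighboring class carries the same base ends up closer to the neighbor than to the anchor.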

Supplementary Figure 6 Estimation of redundancy within read datasets in absence of sequencing error.

The plot shows how the number of k-mers processed by the coarse mapping stage scales with the total number of reads in the input dataset, for a high depth-coverage simulation of 100bp paired-end reads on hg19 chromosome 20 with a 0.1% mutation rate and 0% sequencing error. Beyond roughly 100 million reads in the input dataset, additional reads do not affect the size of the coarse mapping stage, resulting in sublinear mapping scalability with the CORA framework.

Supplementary Figure 7 Sublinearity analysis of CORA framework on simulated paired-end read data.

Sublinearity analysis of the CORA framework on simulated paired-end read data from hg19 chromosome 20 with a 0.1% mutation rate and sequencing error rates of 2%, 0.5% and 0.125% (E2, E0.5 and E0.125, respectively). The rate of redundancy is calculated as the total number of k-mers in the read dataset divided by the number of k-mers processed by CORA during the coarse mapping stage. Results indicate that the rate of redundancy increases monotonically even in the presence of sequencing errors; therefore, for all three datasets the CORA framework spends less coarse mapping time per additional read, indicating sublinear scalability in practice regardless of sequencing errors.
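The rate of redundancy defined above can be explored on a toy scale with the Python sketch below. It makes several assumptions that are not from the source: reads are single-end, sampled uniformly from a short random genome, every overlapping k-mer window is counted, and the number of distinct k-mers is used as a proxy for the number of k-mers processed during coarse mapping. Function names are hypothetical.

```python
import random

def simulate_reads(genome, n_reads, read_len, error_rate, rng):
    """Sample reads uniformly and apply independent substitution errors."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(bases))
    return reads

def redundancy_rate(reads, k):
    """Total k-mers in the reads divided by the number of distinct k-mers."""
    total = 0
    unique = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            total += 1
            unique.add(read[i:i + k])
    return total / len(unique) if unique else 0.0
```

In the error-free case the set of distinct k-mers is bounded by the genome, so the rate keeps growing as reads are added; with sequencing errors, novel k-mers keep appearing and the rate is lower for the same number of reads.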

Supplementary Figure 8 An overview of CORA framework and input/output relations between different stages of the CORA software.

The green, purple, yellow and red panels describe the preprocessing, k-mer collapsing, coarse mapping and homology table traversal stages of the CORA pipeline, respectively. The grey panel on the lower right provides a legend for the different types of boxes and arrows used in the diagram.

Green panel: The inputs for the preprocessing stage are a reference sequence in multi-FASTA format, a k-mer length value K, a distance threshold value S, and an executable binary for an off-the-shelf read mapper to be used in coarse mapping. The off-the-shelf mapper's indexing algorithm is called to generate a reference index. Exact and inexact homology tables are constructed and compressed for use in the homology table traversal stage.

Purple panel: The k-mer collapsing stage takes a set of FASTQ files together with the k-mer length value as input and generates a list of unique k-mers (with compact IDs that encode the reads that contain the extracted k-mers). An auxiliary lookup table is generated for some of the k-mer IDs that are known to cause problems during coarse mapping (e.g. very long IDs). Optionally, the k-mer collapsing stage uses the reference genome to identify k-mers that are identical to a k-mer in the reference, in which case these k-mers are reported separately as perfect k-mer links to the reference genome, skipping the coarse mapping stage.

Yellow panel: The coarse mapping stage requires an off-the-shelf mapper executable (which can be one of the built-in tools such as BWA-aln/BWA-mem, Bowtie/Bowtie2 or mrsFAST/mrsFAST-Ultra, or can be specified manually through manual mode) and its reference index in order to place each of the unique k-mers at a locus in the reference. Afterwards, the coarse mappings are converted to k-mer links within the link construction step of CORA. If k-mer collapsing was performed with the reference sequence, the perfect k-mer links from the previous stage are merged with the k-mer links generated by the link construction step.

Red panel: The homology table traversal stage takes the k-mer links from the previous stage, the reference sequence and the exact/inexact homology tables, and generates the final set of mappings (in SAM format) by traversing the homology table(s) following the k-mer links. While some mapping modes can be performed using only the exact homology table (e.g. fast best-mapping mode), other mapping modes require the inexact homology table as well. Optionally, the original read dataset is used in this stage in order to print the SAM file with the original read names (as opposed to just their order in the input file) and/or the quality scores.
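The k-mer collapsing stage (purple panel) and its optional split into perfect k-mer links can be sketched in Python as follows. This is an assumption-laden illustration: reads are cut into non-overlapping k-mers with any trailing partial k-mer dropped, each unique k-mer keeps a plain list of (read index, offset) pairs instead of CORA's compact IDs, and only the first matching reference locus is recorded. All names are hypothetical.

```python
from collections import defaultdict

def split_into_kmers(read, k):
    """Non-overlapping k-mers of a read as (offset, k-mer) pairs."""
    return [(off, read[off:off + k]) for off in range(0, len(read) - k + 1, k)]

def collapse_kmers(reads, k):
    """Unique k-mer -> list of (read index, offset) occurrences."""
    occurrences = defaultdict(list)
    for read_idx, read in enumerate(reads):
        for off, kmer in split_into_kmers(read, k):
            occurrences[kmer].append((read_idx, off))
    return dict(occurrences)

def partition_by_reference(occurrences, reference, k):
    """Split unique k-mers into perfect links and k-mers needing coarse mapping."""
    ref_index = {}
    for pos in range(len(reference) - k + 1):
        ref_index.setdefault(reference[pos:pos + k], pos)  # keep first locus
    perfect_links, to_coarse_map = {}, {}
    for kmer, occ in occurrences.items():
        if kmer in ref_index:
            perfect_links[kmer] = (ref_index[kmer], occ)
        else:
            to_coarse_map[kmer] = occ
    return perfect_links, to_coarse_map
```

Only the k-mers left in `to_coarse_map` would be handed to the off-the-shelf mapper; the perfect links would be merged back in at link construction.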

Supplementary information

Supplementary Figures and Text

Supplementary Figures 1–8, Supplementary Tables 1 and 2 and Supplementary Notes 1 and 2 (PDF 4189 kb)

About this article

Cite this article

Yorukoglu, D., Yu, Y., Peng, J. et al. Compressive mapping for next-generation sequencing. Nat Biotechnol 34, 374–376 (2016). https://doi.org/10.1038/nbt.3511

Published: 07 April 2016
Issue Date: April 2016
DOI: https://doi.org/10.1038/nbt.3511

