To the Editor:
Because high-throughput sequencing generates large amounts of data, several tools [1-8] are used to compress raw sequencing data for efficient storage and data transfer. Mapped sequence data represented in SAM [8] files (plain-text, human-readable files that include the raw reads as well as information about their mapping loci, quality, etc.) are also compressed, typically by blocked Gzip (BGZF), to BAM files.
One alternative to the BAM format is the CRAM format of CRAM Tools and Scramble [7], which improves on BAM compression by encoding the differences between the reads and their mapping loci on the reference genome. However, because CRAM Tools and Scramble represent the differences between each read and its mapping locus separately (Supplementary Note), common sequence features of the reads mapping to the same locus are redundantly encoded. In addition, they are not lossless: they modify some of the fields of the SAM format during compression and do not reconstruct them exactly during decompression, which may affect downstream analysis.
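The redundancy of per-read difference encoding can be illustrated with a small Python sketch. It is not the CRAM Tools or Scramble implementation: the function name `read_diffs`, the toy reference and the three reads are hypothetical, and only substitutions in ungapped alignments are modeled.

```python
def read_diffs(reference, pos, read):
    """Return (offset, base) pairs where the read differs from the reference.

    Assumes an ungapped alignment starting at 0-based position `pos`.
    """
    window = reference[pos:pos + len(read)]
    return [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]


reference = "ACGTACGTACGT"
# Three hypothetical reads covering one donor variant at reference position 5 (C -> T).
reads = [(2, "GTATGT"), (3, "TATGTA"), (4, "ATGTAC")]

# One difference list per read: the same variant is recorded three times,
# at a different read offset each time.
per_read = [read_diffs(reference, pos, seq) for pos, seq in reads]

# Converting offsets back to reference coordinates shows it is a single site.
absolute = {pos + off for (pos, _), d in zip(reads, per_read) for off, _ in d}
```

In this toy case three reads yield three stored differences that all describe one site on the reference, which is the redundancy a shared representation could avoid.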
There are additional compression tools that are based on arithmetic coding (AC) and other data modeling methods, such as Quip [5] and Samcomp [6]. Although these tools provide superior compression, they do not provide random-access capability: one needs to decompress the whole file (requiring large memory and running time) and perform a manual search to locate the region of interest.
Here we present DeeZ, a SAM and BAM file compression tool, which provides both a better compression ratio than SAMtools and random-access capability. DeeZ's compression performance, which is on par with that of state-of-the-art AC tools, is a result of the observation that the vast majority of the nucleotide differences between each read and its mapping locus on the reference are shared with other reads mapped to the same locus. DeeZ lowers the cost of representing such common differences by obtaining the 'consensus' of the reads mapped to a specific locus (implicitly 'assembling' the donor genome by the use of mapping information) and encoding the differences between the consensus (i.e., implicitly assembled) contigs and the reference genome once. As there is no difference between the consensus contigs and the reads with the exception of mapping errors or highly allelic regions, DeeZ encodes the positional information of each read within only the relevant contig. Moreover, DeeZ uses a unique compression method for each field of the SAM record in order to exploit its specific properties: read names are 'tokenized' and compressed by the use of delta encoding; quality scores are encoded using an order-2 AC, etc. (Supplementary Note).
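The consensus idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DeeZ's actual algorithm: the helper names `build_consensus` and `diffs` are hypothetical, the consensus is a simple majority vote, and reads are ungapped (no indels, clipping or splicing).

```python
from collections import Counter


def build_consensus(reference, reads):
    """Majority-vote consensus over ungapped reads given as (pos, sequence)."""
    votes = [Counter() for _ in reference]
    for pos, seq in reads:
        for i, base in enumerate(seq):
            votes[pos + i][base] += 1
    # Positions not covered by any read fall back to the reference base.
    return "".join(
        v.most_common(1)[0][0] if v else ref_base
        for v, ref_base in zip(votes, reference)
    )


def diffs(a, b):
    """Return (index, base) pairs where sequence `b` differs from sequence `a`."""
    return [(i, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


reference = "ACGTACGTACGT"
# Hypothetical reads: all carry a C -> T variant at position 5;
# the last read also has one isolated mismatch (e.g., a sequencing error).
reads = [(2, "GTATGT"), (3, "TATGTA"), (4, "ATGTAC"), (3, "TATGAA")]

consensus = build_consensus(reference, reads)
# The shared variant is recorded once, against the reference.
consensus_vs_reference = diffs(reference, consensus)
# Each read is then compared with the consensus; most lists are empty.
read_vs_consensus = [
    diffs(consensus[pos:pos + len(seq)], seq) for pos, seq in reads
]
```

Here the shared variant costs one entry in `consensus_vs_reference`, and only the read with the isolated mismatch needs a per-read entry.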
DeeZ provides random-access capability by encoding the input SAM or BAM file in a block-by-block manner, via AC for the quality scores and mapping locations and via Gzip for the other fields. Additional features of DeeZ include support for fast flag statistics of a SAM file and location-based read-sorting ability (as per SAMtools).
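Block-by-block encoding enables random access because each block can be decompressed on its own. The Python sketch below illustrates the general technique only: the functions `compress_blocks` and `fetch`, the record layout and the in-memory index are assumptions, and `zlib` stands in for all fields, whereas DeeZ uses AC for quality scores and mapping locations.

```python
import bisect
import zlib


def compress_blocks(records, block_size):
    """Compress position-sorted (pos, text) records in independent blocks.

    Returns (index, blocks), where index[i] is the first position in blocks[i].
    """
    index, blocks = [], []
    for start in range(0, len(records), block_size):
        chunk = records[start:start + block_size]
        payload = "\n".join(f"{pos}\t{text}" for pos, text in chunk)
        index.append(chunk[0][0])
        blocks.append(zlib.compress(payload.encode()))
    return index, blocks


def fetch(index, blocks, lo, hi):
    """Return records with lo <= pos <= hi, decompressing only candidate blocks."""
    # Start one block before the first block whose start is >= lo,
    # since that earlier block may still end inside the query range.
    first = max(bisect.bisect_left(index, lo) - 1, 0)
    # Blocks whose first position is > hi cannot contain a hit.
    last = bisect.bisect_right(index, hi)
    hits = []
    for block in blocks[first:last]:
        for line in zlib.decompress(block).decode().split("\n"):
            pos, text = line.split("\t", 1)
            if lo <= int(pos) <= hi:
                hits.append((int(pos), text))
    return hits


# Hypothetical records: (mapping position, remaining fields as text).
records = [(100, "r1"), (150, "r2"), (200, "r3"),
           (250, "r4"), (300, "r5"), (350, "r6")]
index, blocks = compress_blocks(records, block_size=2)
region = fetch(index, blocks, 210, 300)
```

The query above touches two of the three blocks. Smaller blocks make lookups cheaper but give the compressor less context per block, so block size trades access time against compression ratio.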
We compared DeeZ to other tools on bacterial RNA-seq data as well as human HiSeq and RNA-seq libraries (Table 1). DeeZ outperformed all competitors except Samcomp, whose compression performance was comparable to that of DeeZ. However, Samcomp does not provide random-access ability and does not compress all fields in the SAM format. For the human HiSeq data set, Quip performed slightly better than DeeZ on the default settings owing to its use of high-order quality-score compression. However, Quip does not provide random-access ability either. For users in need of high compression performance but not random-access capability (for quality scores), DeeZ provides the option to use the AC model from Samcomp [6]. With this quality model, DeeZ outperformed Quip on this data set while still providing partial random-access ability for all fields except quality scores.
In terms of compression speed, DeeZ was the fastest on the bacterial RNA-seq and human HiSeq data sets. For the human RNA-seq data set, DeeZ compression speed was slower because many reads of eukaryotic RNA-seq originated from splice junctions. DeeZ's decompression speed was also on par with or better than that of its competitors, with the exception of SAMtools and Scramble. This is because the LZ77 decompression scheme employed by these tools is much faster than the AC decompression used by DeeZ. Because quality scores usually consume the largest portion of a compressed file owing to their high entropy, DeeZ provides an optional lossy quality transformation similar to that of our reference-free compression tool SCALCE [1], with minimal impact on standard downstream analyses.
DeeZ is available for download at http://deez-compression.sourceforge.net.
1. Hach, F., Numanagic, I., Alkan, C. & Sahinalp, S.C. Bioinformatics 28, 3051–3057 (2012).
2. Kozanitis, C., Saunders, C., Kruglyak, S., Bafna, V. & Varghese, G. J. Comput. Biol. 18, 401–413 (2011).
3. Cox, A.J., Bauer, M.J., Jakobi, T. & Rosone, G. Bioinformatics 28, 1415–1419 (2012).
4. Deorowicz, S. & Grabowski, S. Bioinformatics 27, 860–862 (2011).
5. Jones, D.C., Ruzzo, W.L., Peng, X. & Katze, M.G. Nucleic Acids Res. 40, e171 (2012).
6. Bonfield, J.K. & Mahoney, M.V. PLoS ONE 8, e59190 (2013).
7. Bonfield, J.K. Bioinformatics 30, 2818–2819 (2014).
8. Li, H. et al. Bioinformatics 25, 2078–2079 (2009).
We thank the Natural Sciences and Engineering Research Council of Canada (NSERC) for the Discovery Frontiers “Cancer Genome Collaboratory” project, Discovery Grants program and Genome Canada (S.C.S.) and the Vanier Canada Graduate Scholarship program (I.N.) for funding this research. We also thank L. Stein, T. Beck and H. Babaran as well as L. Ding and R.J. Mashi for testing and benchmarking the performance of DeeZ. We thank J. Bonfield for helping us to improve DeeZ. We thank J. Dale, G. Asimenos and S. Batzoglou for support in the initial stages of this research.
The authors declare no competing financial interests.
Hach, F., Numanagic, I. & Sahinalp, S. DeeZ: reference-based compression by local assembly. Nat Methods 11, 1082–1084 (2014). https://doi.org/10.1038/nmeth.3133