Abstract
High-throughput sequencing (HTS) data are commonly stored as raw sequencing reads in FASTQ format or as reads mapped to a reference, in SAM format, both with large memory footprints. Worldwide growth of HTS data has prompted the development of compression methods that aim to significantly reduce HTS data size. Here we report on a benchmarking study of available compression methods on a comprehensive set of HTS data using an automated framework.
Acknowledgements
This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Frontiers program 'Cancer Genome Collaboratory' project (S.C.S., F.H., I.N.); the Vanier Canada Graduate Scholarships program (I.N.); the National Institutes of Health (NIH) (R01GM108348 to S.C.S.); the National Science Foundation (NSF) (1619081 to S.C.S.); the Indiana University Grand Challenges Program Precision Health Initiative (S.C.S.); the Wellcome Trust (098051 to J.K.B.); a Leibniz Universität Hannover eNIFE grant (J.V. and J.O.); and the Swiss Platform for Advanced Scientific Computing (PASC) PoSeNoGap project (C.A. and M.M.). We also thank the authors of the evaluated compression tools for supporting their tools and replying to our bug reports.
Author information
Contributions
The study was initiated by I.N., C.A. and M.M. I.N. designed the benchmarking framework and performed the experiments. J.K.B. evaluated the framework. I.N., J.K.B., J.V., J.O., F.H., C.A., M.M. and S.C.S. contributed to writing the manuscript. S.C.S. and F.H. oversaw the project.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Text and Figures
Supplementary Figure 1, Supplementary Tables 1–7 and Supplementary Notes 1–6. (PDF 2002 kb)
About this article
Cite this article
Numanagić, I., Bonfield, J., Hach, F. et al. Comparison of high-throughput sequencing data compression tools. Nat Methods 13, 1005–1008 (2016). https://doi.org/10.1038/nmeth.4037
DOI: https://doi.org/10.1038/nmeth.4037