CoLoRd: compressing long reads

Kokot, Marek; Gudyś, Adam; Li, Heng; Deorowicz, Sebastian

doi:10.1038/s41592-022-01432-3

Brief Communication
Published: 28 March 2022

Nature Methods volume 19, pages 441–444 (2022)

Abstract

The cost of maintaining exabytes of data produced by sequencing experiments every year has become a major issue in today’s genomic research. In spite of the increasing popularity of third-generation sequencing, the existing algorithms for compressing long reads exhibit a minor advantage over the general-purpose gzip. We present CoLoRd, an algorithm able to reduce the size of third-generation sequencing data by an order of magnitude without affecting the accuracy of downstream analyses.

**Fig. 1: CoLoRd operation principles.**

**Fig. 2: Analysis of lossless and lossy compression.**

Data availability

ONT datasets: Mycoplasma bovis²⁴, ERR4179765; M. bovis Bonito²⁴, ERR4179766; vir²⁵, ERR2708427–ERR2708436; lun²⁶, PRJEB30781; Sorghum²⁷, SRX4104135–SRX4104138; Zymo2, https://nanopore.s3.climb.ac.uk/Zymo-GridION-EVEN-BB-SN.fq.gz; Zymo2 R10, https://s3.climb.ac.uk/nanopore/Zymo-GridION-EVEN-BB-SN-PCR-R10HC-flipflop.fq.gz; Zymo2 Bonito, ERR5396170; Zymo2 PromethION, https://nanopore.s3.climb.ac.uk/Zymo-PromethION-EVEN-BB-SN.fq.gz; NA12878 (ref. ¹), http://s3.amazonaws.com/nanopore-human-wgs/rel6/rel_6.fastq.gz; CHM13, https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/nanopore/rel6/rel6.fastq.gz; CHM13 Bonito, https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/nanopore/rel7/rel7.fastq.gz.

PacBio HiFi datasets: mosquito, SRX8642991, SRX8642992; Drosophila²⁸, SRX499318; strawberry²⁹, SRR11606867; ATCC WGS, PRJNA546278; mouse²⁹, SRR11606870; HG002 HiFi³⁰, SRR10382244, SRR10382245, SRR10382248, SRR10382249; CHM13 HiFi, SRR11292120–SRR11292123.

PacBio CLR datasets: yeast, http://gembox.cbcb.umd.edu/mhap/raw/yeast_filtered.fastq.gz; Arabidopsis, http://gembox.cbcb.umd.edu/mhap/raw/athal_filtered.fastq.gz; Macadamia³¹, SRR11191909; 48 plex, https://downloads.pacbcloud.com/public/dataset/MicrobialMultiplexing_48plex/48-plex_sequences/; CHM1, http://datasets.pacb.com/2014/Human54x/fastq.html; HG002 CLR³², SRX5590586.

Other: CHM13 T2T assembly version 1.1 (ref. ¹⁶), GCA_009914755.3; GRCh38 assembly, GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz; Sorghum assembly, GCF_000003195.3; M. bovis assembly, GCF_000183385.1; Genome in a Bottle version 4.2.1 HG002 reference variants²⁰, HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.
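
Several entries in the data list above are given as accession ranges (for example ERR2708427–ERR2708436), whereas download tools typically expect one accession per run. The following is a minimal Python sketch of how such a range could be expanded into individual accessions. The function name `expand_accession_range` is hypothetical and not part of CoLoRd or any archive toolkit, and the sketch assumes that both ends of a range share the same letter prefix and that the numeric part simply increments.

```python
import re


def expand_accession_range(text):
    """Expand a range such as 'ERR2708427–ERR2708436' into single accessions.

    Hypothetical helper for illustration only. A value without a dash is
    returned unchanged as a one-element list.
    """
    # Accept both the en dash used in the article and a plain hyphen.
    parts = re.split(r"[–-]", text.strip())
    if len(parts) == 1:
        return [parts[0]]
    if len(parts) != 2:
        raise ValueError(f"not a simple range: {text!r}")

    first = re.fullmatch(r"([A-Z]+)(\d+)", parts[0])
    last = re.fullmatch(r"([A-Z]+)(\d+)", parts[1])
    if not first or not last or first.group(1) != last.group(1):
        raise ValueError(f"range ends do not share a prefix: {text!r}")

    prefix = first.group(1)
    width = len(first.group(2))  # keep any zero padding of the numeric part
    start, end = int(first.group(2)), int(last.group(2))
    if end < start:
        raise ValueError(f"range is descending: {text!r}")

    return [f"{prefix}{number:0{width}d}" for number in range(start, end + 1)]


if __name__ == "__main__":
    # Ranges copied from the data availability statement.
    for item in ("ERR2708427–ERR2708436", "SRX4104135–SRX4104138"):
        print(item, "->", len(expand_accession_range(item)), "accessions")
```

Project-level identifiers such as PRJEB30781 are passed through unchanged; resolving them to individual runs requires a query to the archive itself, which this sketch does not attempt.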

Code availability

Source code for CoLoRd is available at https://github.com/refresh-bio/colord. The package can be installed via Bioconda at https://anaconda.org/bioconda/colord.

References

1. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).
2. Stancu, M. C. et al. Mapping and phasing of structural variation in patient genomes using Nanopore sequencing. Nat. Commun. 8, 1326 (2017).
3. Jones, D. C., Ruzzo, W. L., Peng, X. & Katze, M. G. Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Res. 40, e171 (2012).
4. Bonfield, J. K. & Mahoney, M. V. Compression of FASTQ and SAM format sequencing data. PLoS ONE 8, e59190 (2013).
5. Roguski, Ł. & Deorowicz, S. DSRC 2: industry-oriented compression of FASTQ files. Bioinformatics 30, 2213–2215 (2014).
6. Grabowski, S., Deorowicz, S. & Roguski, Ł. Disk-based compression of data from genome sequencing. Bioinformatics 31, 1389–1395 (2015).
7. Roguski, Ł., Ochoa, I., Hernaez, M. & Deorowicz, S. FaStore: a space-saving solution for raw sequencing data. Bioinformatics 34, 2748–2756 (2018).
8. Liu, Y., Yu, Z., Dinger, M. E. & Li, J. Index suffix–prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression. Bioinformatics 35, 2066–2074 (2018).
9. Chandak, S., Tatwawadi, K., Ochoa, I., Hernaez, M. & Weissman, T. SPRING: a next-generation compressor for FASTQ data. Bioinformatics 35, 2674–2676 (2018).
10. Dufort y Álvarez, G. et al. ENANO: encoder for NANOpore FASTQ files. Bioinformatics 36, 4506–4507 (2020).
11. Nicolae, M., Pathak, S. & Rajasekaran, S. LFQC: a lossless compression algorithm for FASTQ files. Bioinformatics 31, 3276–3281 (2015).
12. Myers, E. The fragment assembly string graph. Bioinformatics 21, 79–85 (2005).
13. Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).
14. Koren, S., Walenz, B. P., Berlin, K., Miller, J. R. & Phillippy, A. M. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).
15. Dufort y Álvarez, G. et al. RENANO: a REference-based compressor for NANOpore FASTQ files. Bioinformatics 37, 4862–4864 (2021).
16. Nurk, S. et al. The complete sequence of a human genome. Preprint at bioRxiv https://doi.org/10.1101/2021.05.26.445798v1 (2021).
17. Vaser, R., Sovic, I., Nagarajan, N. & Sikic, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).
18. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
19. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
20. Zook, J. M. et al. An open resource for accurately benchmarking small variant and reference calls. Nat. Biotechnol. 37, 561–566 (2019).
21. Deorowicz, S. FQSqueezer: k-mer-based compression of sequencing data. Sci. Rep. 10, 578 (2020).
22. Kokot, M., Długosz, M. & Deorowicz, S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33, 2759–2761 (2017).
23. Sosić, M. & Sikić, M. Edlib: a C/C++ library for fast, exact sequence alignment using edit distance. Bioinformatics 33, 1394–1395 (2017).
24. Vereecke, N. et al. High quality genome assemblies of Mycoplasma bovis using a taxon-specific Bonito basecaller for MinION and Flongle long-read Nanopore sequencing. BMC Bioinformatics 21, 517 (2020).
25. Depledge, D. P. et al. Direct RNA sequencing on Nanopore arrays redefines the transcriptional complexity of a viral pathogen. Nat. Commun. 10, 754 (2019).
26. Charalampous, T. et al. Nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection. Nat. Biotechnol. 37, 783–792 (2019).
27. Deschamps, S. et al. A chromosome-scale assembly of the sorghum genome using Nanopore sequencing and optical mapping. Nat. Commun. 9, 4844 (2018).
28. Kim, K. et al. Long-read, whole-genome shotgun sequence data for five model organisms. Sci. Data 1, 140045 (2014).
29. Hon, T. et al. Highly accurate long-read HiFi sequencing data for five complex genomes. Sci. Data 7, 399 (2020).
30. Cheng, H., Concepcion, G. T., Feng, X. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
31. Murigneux, V. et al. Comparison of long-read methods for sequencing and assembly of a plant genome. GigaScience 9, giaa146 (2020).
32. Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).

Acknowledgements

This work was supported by the National Science Centre, Poland, project DEC-2019/33/B/ST6/02040 (M.K., A.G. and S.D.) and by the US National Institutes of Health (grants R01HG010040, U01HG010971 and U41HG010972 to H.L.).

Author information

Authors and Affiliations

Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
Marek Kokot, Adam Gudyś & Sebastian Deorowicz
Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
Heng Li
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
Heng Li

Contributions

M.K. and S.D. designed and implemented the algorithm. H.L. designed the variant-calling and consensus analyses, and investigated and described their results. A.G., S.D. and M.K. conducted experiments. A.G. prepared visualizations and wrote the majority of the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Heng Li or Sebastian Deorowicz.

Ethics declarations

Competing interests

H.L. is a consultant of Integrated DNA Technologies and is on the scientific advisory boards of Sentieon and Innozeen. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Kokot, M., Gudyś, A., Li, H. et al. CoLoRd: compressing long reads. Nat Methods 19, 441–444 (2022). https://doi.org/10.1038/s41592-022-01432-3

Received: 20 July 2021
Accepted: 23 February 2022
Published: 28 March 2022
Issue Date: April 2022
DOI: https://doi.org/10.1038/s41592-022-01432-3
