The cost of maintaining exabytes of data produced by sequencing experiments every year has become a major issue in today’s genomic research. In spite of the increasing popularity of third-generation sequencing, the existing algorithms for compressing long reads exhibit only a minor advantage over general-purpose gzip. We present CoLoRd, an algorithm able to reduce the size of third-generation sequencing data by an order of magnitude without affecting the accuracy of downstream analyses.
ONT datasets: Mycoplasma bovis (ref. 24), ERR4179765; M. bovis Bonito (ref. 24), ERR4179766; vir (ref. 25), ERR2708427–ERR2708436; lun (ref. 26), PRJEB30781; Sorghum (ref. 27), SRX4104135–SRX4104138; Zymo2, https://nanopore.s3.climb.ac.uk/Zymo-GridION-EVEN-BB-SN.fq.gz; Zymo2 R10, https://s3.climb.ac.uk/nanopore/Zymo-GridION-EVEN-BB-SN-PCR-R10HC-flipflop.fq.gz; Zymo2 Bonito, ERR5396170; Zymo2 PromethION, https://nanopore.s3.climb.ac.uk/Zymo-PromethION-EVEN-BB-SN.fq.gz; NA12878 (ref. 1), http://s3.amazonaws.com/nanopore-human-wgs/rel6/rel_6.fastq.gz; CHM13, https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/nanopore/rel6/rel6.fastq.gz; CHM13 Bonito, https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/nanopore/rel7/rel7.fastq.gz.
PacBio HiFi datasets: mosquito, SRX8642991, SRX8642992; Drosophila (ref. 28), SRX499318; strawberry (ref. 29), SRR11606867; ATCC WGS, PRJNA546278; mouse (ref. 29), SRR11606870; HG002 HiFi (ref. 30), SRR10382244, SRR10382245, SRR10382248, SRR10382249; CHM13 HiFi, SRR11292120–SRR11292123.
PacBio CLR datasets: yeast, http://gembox.cbcb.umd.edu/mhap/raw/yeast_filtered.fastq.gz; Arabidopsis, http://gembox.cbcb.umd.edu/mhap/raw/athal_filtered.fastq.gz; Macadamia (ref. 31), SRR11191909; 48 plex, https://downloads.pacbcloud.com/public/dataset/MicrobialMultiplexing_48plex/48-plex_sequences/; CHM1, http://datasets.pacb.com/2014/Human54x/fastq.html; HG002 CLR (ref. 32), SRX5590586.
Other: CHM13 T2T assembly version 1.1 (ref. 16), GCA_009914755.3; GRCh38 assembly, GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz; Sorghum assembly, GCF_000003195.3; M. bovis assembly, GCF_000183385.1; Genome in a Bottle version 4.2.1 HG002 reference variants (ref. 20), HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.
Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).
Stancu, M. C. et al. Mapping and phasing of structural variation in patient genomes using Nanopore sequencing. Nat. Commun. 8, 1326 (2017).
Jones, D. C., Ruzzo, W. L., Peng, X. & Katze, M. G. Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Res. 40, e171 (2012).
Bonfield, J. K. & Mahoney, M. V. Compression of FASTQ and SAM format sequencing data. PLoS ONE 8, e59190 (2013).
Roguski, Ł. & Deorowicz, S. DSRC 2: industry-oriented compression of FASTQ files. Bioinformatics 30, 2213–2215 (2014).
Grabowski, S., Deorowicz, S. & Roguski, Ł. Disk-based compression of data from genome sequencing. Bioinformatics 31, 1389–1395 (2015).
Roguski, Ł., Ochoa, I., Hernaez, M. & Deorowicz, S. FaStore: a space-saving solution for raw sequencing data. Bioinformatics 34, 2748–2756 (2018).
Liu, Y., Yu, Z., Dinger, M. E. & Li, J. Index suffix–prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression. Bioinformatics 35, 2066–2074 (2018).
Chandak, S., Tatwawadi, K., Ochoa, I., Hernaez, M. & Weissman, T. SPRING: a next-generation compressor for FASTQ data. Bioinformatics 35, 2674–2676 (2018).
Dufort y Álvarez, G. et al. ENANO: encoder for NANOpore FASTQ files. Bioinformatics 36, 4506–4507 (2020).
Nicolae, M., Pathak, S. & Rajasekaran, S. LFQC: a lossless compression algorithm for FASTQ files. Bioinformatics 31, 3276–3281 (2015).
Myers, E. The fragment assembly string graph. Bioinformatics 21, 79–85 (2005).
Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).
Koren, S., Walenz, B. P., Berlin, K., Miller, J. R. & Phillippy, A. M. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).
Dufort y Álvarez, G. et al. RENANO: a REference-based compressor for NANOpore FASTQ files. Bioinformatics 37, 4862–4864 (2021).
Nurk, S. et al. The complete sequence of a human genome. Preprint at bioRxiv https://doi.org/10.1101/2021.05.26.445798v1 (2021).
Vaser, R., Sovic, I., Nagarajan, N. & Sikic, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
Zook, J. M. et al. An open resource for accurately benchmarking small variant and reference calls. Nat. Biotechnol. 37, 561–566 (2019).
Deorowicz, S. FQSqueezer: k-mer-based compression of sequencing data. Sci. Rep. 10, 578 (2020).
Kokot, M., Długosz, M. & Deorowicz, S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33, 2759–2761 (2017).
Sosić, M. & Sikić, M. Edlib: a C/C++ library for fast, exact sequence alignment using edit distance. Bioinformatics 33, 1394–1395 (2017).
Vereecke, N. et al. High quality genome assemblies of Mycoplasma bovis using a taxon-specific Bonito basecaller for MinION and Flongle long-read Nanopore sequencing. BMC Bioinformatics 21, 517 (2020).
Depledge, D. P. et al. Direct RNA sequencing on Nanopore arrays redefines the transcriptional complexity of a viral pathogen. Nat. Commun. 10, 754 (2019).
Charalampous, T. et al. Nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection. Nat. Biotechnol. 37, 783–792 (2019).
Deschamps, S. et al. A chromosome-scale assembly of the sorghum genome using Nanopore sequencing and optical mapping. Nat. Commun. 9, 4844 (2018).
Kim, K. et al. Long-read, whole-genome shotgun sequence data for five model organisms. Sci. Data 1, 140045 (2014).
Hon, T. et al. Highly accurate long-read HiFi sequencing data for five complex genomes. Sci. Data 7, 399 (2020).
Cheng, H., Concepcion, G. T., Feng, X. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Murigneux, V. et al. Comparison of long-read methods for sequencing and assembly of a plant genome. GigaScience 9, giaa146 (2020).
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).
This work was supported by the National Science Centre, Poland, project DEC-2019/33/B/ST6/02040 (M.K., A.G. and S.D.) and by the US National Institutes of Health (grants R01HG010040, U01HG010971 and U41HG010972 to H.L.).
H.L. is a consultant of Integrated DNA Technologies and is on the scientific advisory boards of Sentieon and Innozeen. The remaining authors declare no competing interests.
Peer review information
Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.
Cite this article
Kokot, M., Gudyś, A., Li, H. et al. CoLoRd: compressing long reads. Nat Methods 19, 441–444 (2022). https://doi.org/10.1038/s41592-022-01432-3