CoLoRd: compressing long reads

Abstract

The cost of maintaining the exabytes of data produced by sequencing experiments every year has become a major issue in today’s genomic research. Despite the increasing popularity of third-generation sequencing, existing algorithms for compressing long reads offer only a minor advantage over general-purpose gzip. We present CoLoRd, an algorithm able to reduce the size of third-generation sequencing data by an order of magnitude without affecting the accuracy of downstream analyses.

Fig. 1: CoLoRd operation principles.
Fig. 2: Analysis of lossless and lossy compression.

Data availability

ONT datasets: Mycoplasma bovis (ref. 24), ERR4179765; M. bovis Bonito (ref. 24), ERR4179766; vir (ref. 25), ERR2708427–ERR2708436; lun (ref. 26), PRJEB30781; Sorghum (ref. 27), SRX4104135–SRX4104138; Zymo (ref. 2), https://nanopore.s3.climb.ac.uk/Zymo-GridION-EVEN-BB-SN.fq.gz; Zymo R10 (ref. 2), https://s3.climb.ac.uk/nanopore/Zymo-GridION-EVEN-BB-SN-PCR-R10HC-flipflop.fq.gz; Zymo Bonito (ref. 2), ERR5396170; Zymo PromethION (ref. 2), https://nanopore.s3.climb.ac.uk/Zymo-PromethION-EVEN-BB-SN.fq.gz; NA12878 (ref. 1), http://s3.amazonaws.com/nanopore-human-wgs/rel6/rel_6.fastq.gz; CHM13, https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/nanopore/rel6/rel6.fastq.gz; CHM13 Bonito, https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/nanopore/rel7/rel7.fastq.gz.

PacBio HiFi datasets: mosquito, SRX8642991, SRX8642992; Drosophila (ref. 28), SRX499318; strawberry (ref. 29), SRR11606867; ATCC WGS, PRJNA546278; mouse (ref. 29), SRR11606870; HG002 HiFi (ref. 30), SRR10382244, SRR10382245, SRR10382248, SRR10382249; CHM13 HiFi, SRR11292120–SRR11292123.

PacBio CLR datasets: yeast, http://gembox.cbcb.umd.edu/mhap/raw/yeast_filtered.fastq.gz; Arabidopsis, http://gembox.cbcb.umd.edu/mhap/raw/athal_filtered.fastq.gz; Macadamia (ref. 31), SRR11191909; 48 plex, https://downloads.pacbcloud.com/public/dataset/MicrobialMultiplexing_48plex/48-plex_sequences/; CHM1, http://datasets.pacb.com/2014/Human54x/fastq.html; HG002 CLR (ref. 32), SRX5590586.

Other: CHM13 T2T assembly version 1.1 (ref. 16), GCA_009914755.3; GRCh38 assembly, GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz; Sorghum assembly, GCF_000003195.3; M. bovis assembly, GCF_000183385.1; Genome in a Bottle version 4.2.1 HG002 reference variants (ref. 20), HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.
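Several of the entries above give a first and a last accession rather than a full list (for example, the vir runs from ERR2708427 to ERR2708436). A download script usually needs each accession spelled out, so the following is a minimal sketch of how such a range might be expanded. It assumes the range is inclusive and contiguous, which the listing does not state explicitly; the function name `expand_accession_range` is hypothetical and is not part of CoLoRd or any archive toolkit.

```python
import re
from typing import List

# Hypothetical helper: an accession is a letter prefix followed by digits.
_ACCESSION = re.compile(r"([A-Z]+)(\d+)")


def expand_accession_range(first: str, last: str) -> List[str]:
    """Return every accession from `first` to `last`, both included.

    Assumption: the range is contiguous and both ends share the same
    letter prefix (e.g. ERR, SRR, SRX).
    """
    m_first = _ACCESSION.fullmatch(first)
    m_last = _ACCESSION.fullmatch(last)
    if m_first is None or m_last is None:
        raise ValueError("accessions must look like ERR2708427")
    if m_first.group(1) != m_last.group(1):
        raise ValueError("range endpoints must share a prefix")

    prefix = m_first.group(1)
    width = len(m_first.group(2))  # keep any zero padding intact
    start, end = int(m_first.group(2)), int(m_last.group(2))
    if end < start:
        raise ValueError("last accession precedes the first")

    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


if __name__ == "__main__":
    for accession in expand_accession_range("ERR2708427", "ERR2708436"):
        print(accession)
```

Each expanded accession can then be passed to whichever archive client is in use; whether every accession in a range actually exists in the archive should be checked before downloading.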

Code availability

Source code for CoLoRd is available at https://github.com/refresh-bio/colord. The package can be installed via Bioconda at https://anaconda.org/bioconda/colord.

References

  1. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).

  2. Stancu, M. C. et al. Mapping and phasing of structural variation in patient genomes using Nanopore sequencing. Nat. Commun. 8, 1326 (2017).

  3. Jones, D. C., Ruzzo, W. L., Peng, X. & Katze, M. G. Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Res. 40, e171 (2012).

  4. Bonfield, J. K. & Mahoney, M. V. Compression of FASTQ and SAM format sequencing data. PLoS ONE 8, e59190 (2013).

  5. Roguski, Ł. & Deorowicz, S. DSRC 2: industry-oriented compression of FASTQ files. Bioinformatics 30, 2213–2215 (2014).

  6. Grabowski, S., Deorowicz, S. & Roguski, Ł. Disk-based compression of data from genome sequencing. Bioinformatics 31, 1389–1395 (2015).

  7. Roguski, Ł., Ochoa, I., Hernaez, M. & Deorowicz, S. FaStore: a space-saving solution for raw sequencing data. Bioinformatics 34, 2748–2756 (2018).

  8. Liu, Y., Yu, Z., Dinger, M. E. & Li, J. Index suffix–prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression. Bioinformatics 35, 2066–2074 (2018).

  9. Chandak, S., Tatwawadi, K., Ochoa, I., Hernaez, M. & Weissman, T. SPRING: a next-generation compressor for FASTQ data. Bioinformatics 35, 2674–2676 (2018).

  10. Dufort y Álvarez, G. et al. ENANO: encoder for NANOpore FASTQ files. Bioinformatics 36, 4506–4507 (2020).

  11. Nicolae, M., Pathak, S. & Rajasekaran, S. LFQC: a lossless compression algorithm for FASTQ files. Bioinformatics 31, 3276–3281 (2015).

  12. Myers, E. The fragment assembly string graph. Bioinformatics 21, 79–85 (2005).

  13. Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).

  14. Koren, S., Walenz, B. P., Berlin, K., Miller, J. R. & Phillippy, A. M. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).

  15. Dufort y Álvarez, G. et al. RENANO: a REference-based compressor for NANOpore FASTQ files. Bioinformatics 37, 4862–4864 (2021).

  16. Nurk, S. et al. The complete sequence of a human genome. Preprint at bioRxiv https://doi.org/10.1101/2021.05.26.445798v1 (2021).

  17. Vaser, R., Sovic, I., Nagarajan, N. & Sikic, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).

  18. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  19. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).

  20. Zook, J. M. et al. An open resource for accurately benchmarking small variant and reference calls. Nat. Biotechnol. 37, 561–566 (2019).

  21. Deorowicz, S. FQSqueezer: k-mer-based compression of sequencing data. Sci. Rep. 10, 578 (2020).

  22. Kokot, M., Długosz, M. & Deorowicz, S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33, 2759–2761 (2017).

  23. Šošić, M. & Šikić, M. Edlib: a C/C++ library for fast, exact sequence alignment using edit distance. Bioinformatics 33, 1394–1395 (2017).

  24. Vereecke, N. et al. High quality genome assemblies of Mycoplasma bovis using a taxon-specific Bonito basecaller for MinION and Flongle long-read Nanopore sequencing. BMC Bioinformatics 21, 517 (2020).

  25. Depledge, D. P. et al. Direct RNA sequencing on Nanopore arrays redefines the transcriptional complexity of a viral pathogen. Nat. Commun. 10, 754 (2019).

  26. Charalampous, T. et al. Nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection. Nat. Biotechnol. 37, 783–792 (2019).

  27. Deschamps, S. et al. A chromosome-scale assembly of the sorghum genome using Nanopore sequencing and optical mapping. Nat. Commun. 9, 4844 (2018).

  28. Kim, K. et al. Long-read, whole-genome shotgun sequence data for five model organisms. Sci. Data 1, 140045 (2014).

  29. Hon, T. et al. Highly accurate long-read HiFi sequencing data for five complex genomes. Sci. Data 7, 399 (2020).

  30. Cheng, H., Concepcion, G. T., Feng, X. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).

  31. Murigneux, V. et al. Comparison of long-read methods for sequencing and assembly of a plant genome. GigaScience 9, giaa146 (2020).

  32. Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).

Acknowledgements

This work was supported by the National Science Centre, Poland, project DEC-2019/33/B/ST6/02040 (M.K., A.G. and S.D.) and by the US National Institutes of Health (grants R01HG010040, U01HG010971 and U41HG010972 to H.L.).

Author information

Contributions

M.K. and S.D. designed and implemented the algorithm. H.L. designed the variant-calling and consensus analyses, and investigated and described their results. A.G., S.D. and M.K. conducted experiments. A.G. prepared visualizations and wrote the majority of the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Heng Li or Sebastian Deorowicz.

Ethics declarations

Competing interests

H.L. is a consultant of Integrated DNA Technologies and is on the scientific advisory boards of Sentieon and Innozeen. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Tables 1–4 and Fig. 1

Reporting Summary

Peer Review Information

Supplementary Data

Detailed results: compression, consensus generation and variant calling.

About this article

Cite this article

Kokot, M., Gudyś, A., Li, H. et al. CoLoRd: compressing long reads. Nat Methods 19, 441–444 (2022). https://doi.org/10.1038/s41592-022-01432-3

  • DOI: https://doi.org/10.1038/s41592-022-01432-3