Introduction

In recent years, genome sequencing has become a mature technology with numerous applications in medicine. The instruments by Illumina (counted among the 2nd generation of sequencers) produce the majority of the available data at low cost (e.g., about one thousand U.S. dollars for whole human genome sequencing). The sequenced reads are relatively short (up to a few hundred bases) but are of high quality. The 3rd generation instruments by PacBio or Oxford Nanopore can deliver much longer reads, but unfortunately of much worse quality and at much lower throughput.

Estimates of the amount of genomic data and the related costs can be found in studies like1,2. The huge numbers presented there lead directly to the conclusion that in the near future merely the storage and transfer of sequenced (and mapped) reads will consume a lot of money and could become a dominant factor in the total costs related to sequencing. Therefore, it is not surprising that a lot of research has been devoted to this problem. The obvious first step was the application of gzip, a general-purpose compressor. The roughly 3-fold reduction of file sizes was remarkable, but the data deluge asked for more. The main problem with gzip is that it was designed mainly for textual data, or more precisely, for data with text-like types of redundancy. Unfortunately, redundancies such as repetitions of parts of the data at short distances are uncommon in read collections.

The next step was the invention of specialized algorithms taking into account the types of redundancy specific to FASTQ files3. Some of the most important early results were presented in4,5,6,7. The details of the proposed algorithms differed, but in general the authors tried (with some exceptions) to compress the reads locally, i.e., without looking for large-scale relations between the reads. The reason was simple and practical: the amount of memory necessary to construct a dictionary data structure allowing overlaps between reads to be found could be a few times larger than the input file size, e.g., hundreds of GB for human genomes. The improvement over gzip was, however, limited. For example, the best algorithms were able to reduce the space necessary for DNA bases about 5 times. This value should be compared to the 4-fold reduction obtained by simply spending 2 bits to distinguish among the 4 valid bases. Moreover, it turned out that the compression of quality values was even more problematic.

This situation motivated researchers to look for alternatives. At the beginning, they focused just on the compression of bases. The key idea was to reorder the data to gather reads originating from close regions of the genome. This could seem to be a loss of information, but as the original ordering of reads in a FASTQ file is usually more or less random, one can argue that it is hard to say which of two random orderings is better (and even how "better" should be defined here). The first notable attempt in this direction was described in8. The authors introduced a variant of the Burrows–Wheeler transform to find overlaps between reads. For human reads with 40-fold coverage they were able to spend about 0.5 bits per base, which was a significant improvement.

In the following years, other researchers explored the concept of using minimizers9, i.e., the lexicographically smallest short substrings of sequences, to find reads from close regions. The key observation was that if two reads originate from close regions of a genome, their minimizers are usually the same. Thus, the reads can be grouped by their minimizers. In the first work following this idea10, about 0.3 bits per base was sufficient for the mentioned human dataset. A similar result was obtained later in11. The possible gains were, however, limited by the fact that the reads identified as originating from the same genome region could span no more than two read lengths.

In12, it was shown how to group reads from somewhat larger genome regions. Significantly better results were, however, obtained in three recent articles presenting HARC13, Spring14, and Minicom15. The approaches differ in details, but are based on similar ideas. The overlaps are found for much larger genome regions (in theory up to chromosome size) thanks to dictionary structures storing minimizers of parts of reads. It is also worth mentioning that FaStore12 and Spring14 do not focus only on DNA bases; they can also compress complete FASTQ files.

As the compression ratio for DNA symbols improved, the quality scores became responsible for a dominant part of the compressed archives. Therefore, a number of works focused on this problem. One of the simplest strategies was to reduce the resolution of quality scores. Illumina restricted the quality scores in their HiSeq sequencers to eight values, and then in the NovaSeq instruments to just four values. The rationale for these decisions was that the quality of sequencing is currently very good and the prices of sequencing are low. Therefore, if necessary, it is easier (and cheaper) to perform sequencing with a somewhat larger coverage than to store high-resolution quality scores. Recent experiments suggest that the reduction of quality score resolution has very little (if any) impact on the quality of variant calling. For example, in12 it was shown that an even more aggressive reduction to just two quality values can be justified, at least in some situations. Moreover, there are several algorithms like QVZ16,17 and Crumble18 that perform advanced analysis of quality scores to preserve only the most important information.

In this article, we propose a novel compression algorithm for FASTQ files. The main novelty is in the compression of DNA bases, as for quality scores and read identifiers we follow strategies similar to those in the state-of-the-art tools. Our algorithm, FQSqueezer, makes use of the ideas from the prediction by partial matching (PPM)19,20 and dynamic Markov coder (DMC)21 general-purpose methods. A direct adaptation of the PPM-like strategy to sequencing reads would, however, be very hard and likely unsuccessful. There are at least four main reasons for that. First, in the ideal case, the PPM algorithm should construct a dictionary of all already seen strings of length up to some threshold, significantly larger than \(\log_4({\rm genome\; size})\), which for human genomes seems to be unimplementable on workstations and even medium-sized servers. Second, the PPM algorithms often need many accesses to the main memory to compress a single symbol. For huge dictionaries this could result in very slow processing (due to cache misses). Third, sequenced data contain errors that should be corrected to prevent the expansion of the dictionary structures. Fourth, the PPM algorithms usually learn slowly, which is a good strategy for texts, but seems to be bad for genomic data.
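For a rough sense of this threshold (a back-of-the-envelope calculation added here for illustration), a human genome of about 3.2 Gbp gives \(\log_4(3.2\times {10}^{9})=\mathrm{ln}(3.2\times {10}^{9})/\mathrm{ln}\,4\approx 15.8\), so the dictionary would have to cover contexts noticeably longer than about 16 symbols.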

To overcome these problems we designed a few fixed-k dictionaries for k-mers (k-symbol-long substrings) found in the reads. Moreover, the dictionaries are organized in a way that reduces the number of cache misses. We also estimate the probability of symbol occurrence much more aggressively, which results in significantly better compression (compared to classical PPM-like estimation). Finally, before storing k-mers in the dictionaries we perform a form of error correction.

The proposed ideas have some similarity to earlier works. For example, some correction of bases was employed in AssembleTrie22. The authors used it, however, just to "synchronize" reads from both strands (forward and reverse). Earlier, in GeCo23, a similar correction was used to improve context determination in the compression of genomic data (complete genomes, chromosomes, contigs). The estimation of the probability of appearance of the current symbol based on k-mers of some size (much smaller than in our solution due to the huge memory requirements of the chosen dictionary implementation) was used in Fqzcomp4. Modeling genomic data (genomes, chromosomes, contigs) as a Markov source was studied by Pinho et al.24,25. It is, however, worth emphasizing the differences between the compression of sequencing reads and collections of longer genomic fragments, e.g., contigs, chromosomes, genomes. The longer (assembled) fragments are of much better quality than reads, so small differences between very similar fragments are usually due to variations between organisms and are expected. In sequencing datasets, we usually work with reads originating from a single genome and the differences between very similar fragments are mostly due to sequencing errors. A minority of them are also due to the diploid structure of some genomes. Even in the case of metagenomic studies, when the reads come from many genomes, a significant part of the differences is due to sequencing errors. Therefore, a compressor of sequencing data should be designed in a different way than a compressor of assembled genomes or their parts. The two should also not be compared directly in practice, as they were designed to solve different tasks.

In the present article, we significantly improved the mentioned ideas and proposed further techniques. The most important ones are: the organization of the huge dictionaries, the design of the PPM-like estimation of probabilities, the use of a custom DMC as the stage following the PPM, the technique for prediction and correction of sequencing errors, and the technique of ordering the reads and making use of shared prefixes. Importantly, we also developed a complete FASTQ compressor.

The main asset of FQSqueezer is its compression ratio, usually much better than those of the state-of-the-art competitors, i.e., FaStore12, Spring14, and Minicom15. Our tool has, however, also some drawbacks in terms of speed and memory usage. Namely, it is a few times slower than the mentioned competitors in compression and much slower in decompression.

Results

Tools and datasets

For the evaluation we used the state-of-the-art competitors, i.e., FaStore12, Spring14, and Minicom15. We refrained from testing some other good compressors like BEETL8, Orcom10, AssembleTrie22, and HARC13, as previous works demonstrated that they perform worse than the selected tools. The older utilities are not competitive in terms of compression ratio, as was demonstrated in the recent papers12,14 (see also Table 1 for experiments with one of our datasets). Some of them are very fast, e.g., DSRC 27. Nevertheless, in this article we focus mainly on the compression ratio.

Table 1 Comparison of compression ratios and running times of selected general-purpose compressors and FASTQ-specialized compressors.

The datasets for the experiments were taken from previous studies. They are characterized in Supplementary Table 1. In the main part of the article, we used 9 datasets, but more results can be found in the Supplementary Worksheet. Unfortunately, some compressors do not support all the examined modes, which is the reason their results are missing from some tables.

All experiments were run on a workstation equipped with two Intel Xeon E5-2670 v3 CPUs (2 × 12 double-threaded 2.3 GHz cores), 256 GB of RAM, and six 1 TB HDDs in RAID-5. Unless stated otherwise, the programs were run with 12 threads.

Compression of the bases

The most important part of the present work is the compression of bases. Therefore, in the first experiment we evaluated the tools in this scenario. The results for single-end (SE) reads are given in Table 2. The ratios are in output bits per input base. As can be seen, in the majority of cases FQSqueezer outperforms the competitors. For some datasets the gain is large. Nevertheless, for two datasets FQSqueezer loses to Minicom in the original-order-preserving (OO) mode. We investigated this situation a bit more closely by checking whether the compression ratio depends on the initial ordering of the reads. When we shuffled the reads prior to compression, the compression ratios for SRR1265495_1 and SRR1265496_1 changed significantly for all the examined compressors, i.e., 0.632 → 0.531 and 0.646 → 0.527 (Spring), 0.448 → 0.539 and 0.484 → 0.568 (Minicom), 0.506 → 0.472 and 0.517 → 0.487 (FQSqueezer). This shows that the initial ordering of the reads in these datasets is far from random and Minicom is able to benefit from this. In the mode allowing reordering of the reads (REO), FQSqueezer always wins. Moreover, the compression ratios are roughly twice as good as in the OO mode for all the examined methods.

Table 2 Compression ratios for single-end reads.

The results for paired-end (PE) reads can be found in Table 3. In the OO mode, FQSqueezer usually outperforms Spring significantly. Nevertheless, for the largest dataset it loses slightly. It is hard to say what the reason is. The situation in the REO mode is similar, but the gains are usually even larger than in the OO mode.

Table 3 Compression ratios for paired-end reads.

The most important drawbacks of FQSqueezer are, however, its time and space requirements (Table 4 and Supplementary Worksheet). In compression, it is a few times slower than the competitors, but in decompression the difference is larger. The reason is simple. FaStore, Spring, and Minicom need time to find overlapping reads that likely originate from close genome positions. Nevertheless, decoding of the matches between the overlapping reads is very fast. FQSqueezer can be classified as a PPM algorithm. In decompression, the algorithms from this family essentially repeat the same work done during compression, so the differences between compression and decompression times are negligible.

Table 4 Time and memory requirements for compression of SE reads in the reordering mode.

The case of memory usage is similar. The same dictionaries that are necessary in compression must be maintained by FQSqueezer in decompression. Moreover, to predict the successive symbols a lot of statistics must be collected. Nevertheless, without the applied correction mechanisms the occupied memory would likely be doubled.

The dictionaries are updated only at the synchronization points, i.e., after processing each FASTQ block of size 16 MB, so the number of threads has some impact on the compression ratio. The results in Table 5 show that by reducing the number of threads from 12 to 1 we can gain 1–2% in ratio, but the processing becomes significantly slower.

Table 5 Multithreading scalability of FQSqueezer for the SRR327342_1 dataset.

One of the parameters of FQSqueezer is the assumed genome size, which should be comparable to the true genome size in the sequencing experiment. The genome size is used to set the lengths of the k-mers stored in the dictionaries. In the last experiment in this part, we checked the impact of this parameter. The results presented in Table 6 show that declaring an improper genome size deteriorates the compression ratio, but the differences are not large (Supplementary Section 1.1 gives some suggestions on how to set this parameter).

Table 6 Impact of the genome size declared to FQSqueezer for the SRR870667_1 dataset.

Full FASTQ compressors

FQSqueezer is a full FASTQ compressor, so we ran it on a few datasets to verify its performance in such a scenario. There were only two competitors, FaStore and Spring, as Minicom was designed for bases only. The results in Table 7 are for three modes. In the lossless mode, all data were preserved. In the reduced mode, the IDs were truncated to just the instrument name and the quality values were down-sampled to 8 levels (i.e., Illumina 8-level binning). In the bases-only mode, only the bases were stored. The table presents only the sizes of the compressed archives; the timings and memory occupation can be found in the Supplementary Worksheet.

Table 7 Compression of complete FASTQ files.

The experiment confirmed the advantage of FQSqueezer in terms of compression ratio. Nevertheless, our compressor is slower and needs more memory than the competitors. It is interesting to note that in the lossless mode both Spring and FQSqueezer give smaller archives when they do not permute the reads. This is caused by the large cost of compressing IDs that are no longer similar for consecutive reads after reordering. In the remaining modes, the cost of ID storage is negligible, so the reordering modes win.

Discussion

In this article, we presented a novel compression algorithm for FASTQ files. Its architecture was motivated by the PPM and DMC general-purpose compressors. Nevertheless, a significant amount of work was necessary to adapt these two approaches to genomic data. First, it was crucial to prepare specialized data structures for statistics gathering, with care taken over fast memory accesses (i.e., reduction of cache misses). Second, a dedicated correction of bases was implemented for better prediction and reduced memory usage. Third, a special approach was necessary to efficiently store paired reads. Fourth, a DMC-like mechanism for aggressive estimation of symbol occurrence probabilities was proposed.

The experiments show the advantage of FQSqueezer in terms of compression ratio for the majority of datasets. The differences between our tool and the state-of-the-art competitors were sometimes quite large. Nevertheless, for the largest paired-end dataset we perform slightly worse than Spring. This phenomenon deserves further investigation.

The most important drawbacks of FQSqueezer are its slow processing and large memory consumption. These features are typical for PPM-based algorithms. Nevertheless, some reduction of these drawbacks is probably possible. For example, the two components most responsible for the slow processing are the search for approximate matches and the queries for incomplete k-mers. In future work, it should be possible to attack at least these two problems, e.g., by trying to minimize the number of queries to the dictionary data structures without deteriorating the compression ratios.

For those who would find the memory and time requirements prohibitive for the application of the proposed tool in real pipelines, FQSqueezer can be seen as an attempt at a better estimation of the entropy present in sequencing data. It should also be possible to implement some of the concepts present in FQSqueezer in more practical FASTQ compressors to improve their compression ratios.

Methods

Basic definitions

For clarity of the description of the algorithm, let us assume that DNA sequences \(x={x}_{1}{x}_{2}\ldots {x}_{r}\) contain symbols from the alphabet {A,C,G,T,N}. The length (size) of a sequence is the number of elements it is composed of. A substring can be obtained from a sequence by removing (possibly 0) symbols from its beginning and end. The notation \({x}_{i,j}\) means the substring \({x}_{i}{x}_{i+1}\ldots {x}_{j}\). A k-mer is a sequence of length k. A canonical k-mer is the lexicographically smaller of a k-mer and its reverse complement.
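As a small illustration of the last definition (a minimal sketch, not taken from the FQSqueezer source), a canonical k-mer can be computed as follows:

```cpp
#include <algorithm>
#include <string>

// Reverse complement of a DNA string over {A,C,G,T,N}.
std::string reverse_complement(const std::string& s) {
    std::string rc(s.rbegin(), s.rend());
    for (char& c : rc) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            case 'T': c = 'A'; break;
            default:  c = 'N'; break;   // N stays N
        }
    }
    return rc;
}

// Canonical k-mer: the lexicographically smaller of the k-mer and its reverse complement.
std::string canonical_kmer(const std::string& kmer) {
    std::string rc = reverse_complement(kmer);
    return std::min(kmer, rc);
}
```

For instance, AAGT and its reverse complement ACTT both map to the same canonical form AAGT, so both orientations of a fragment are counted together.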

Basic description of the algorithm

FQSqueezer is a multi-threaded algorithm, but for simplicity of presentation we will start with a single-threaded variant. Our tool accepts both single-end (SE) and paired-end (PE) reads. The reads can be stored in the original ordering (OO) or can be reordered (REO). In the reordering mode, the reads are initially sorted according to the DNA sequence (the first read of a pair in the PE mode).

The input FASTQ files (or the sorted files in the REO mode) are loaded in blocks of size 16 MB. The reads from a single block (a pair of blocks in the PE mode) are compressed one by one (or pair by pair). The read IDs and quality values are compressed using rather standard means (similar to those in the top existing FASTQ compressors). The details are described at the end of this section. Below, we focus just on the DNA symbols.

Compression of bases

In the compression of an SE read (or the first read of a pair), we process the bases from the beginning of the read, position by position. For each base we determine the statistics of occurrences of k-mers ending at this base in the already processed part of the input data. To this end, we maintain a few dictionaries: De, Dp, Ds, and Db, which store the numbers of occurrences of e-mers, p-mers, s-mers, and b-mers, respectively, where \(e < p < s < b\). The details of the organization of the dictionaries are given in Supplementary Section 1.1.

At the beginning, we will discuss the OO-SE mode. In general, the longest possible context (the substring preceding the current symbol in a read), but no longer than b–1 symbols, is taken to predict the current symbol. Then, the symbol is encoded according to these predictions (as well as some other properties of the read and the current position) with the use of a range coder.

Let us follow the example given in Fig. 1, assuming the k-mer sizes e = 4, p = 6, s = 9, b = 12. The subfigures (a)–(c) show how the probabilities of appearance of each symbol from the alphabet {A,C,G,T,N} are estimated for some symbols of a read. In Fig. 1a, the 6th symbol (orange cell in the figure) is encoded. We consult the Dp dictionary (as there are too few already processed read symbols to use Ds or Db) for the statistics of appearance of all p-mers ATACC*, where "*" represents any symbol. The answer (blue font in the figure) is that ATACCA appeared 31 times, ATACCC 10 times, ATACCG 454 times, and ATACCT 5 times. Then, we reorder the alphabet symbols according to these statistics (for simplicity let us assume that for equal counters the symbols are ordered lexicographically): G (rank 0), A (rank 1), C (rank 2), T (rank 3), N (rank 4). In the next step, we check in the Model dictionary how many times, for such ordered counters, at the 6th position, when the statistics come from the Dp dictionary, the current symbol was the one with rank 0, 1, …. (In fact we use more items that define the query to the Model dictionary, but for simplicity of presentation we omit them here. More details are given in Supplementary Section 1.2.) The answer is that 1204 times the symbol of rank 0 was encoded, 15 times the symbol of rank 1, and so on. The frequencies 1204, 15, 8, 8, 2 are then used by the range coder to encode the current symbol.
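The two-step estimation described above can be sketched as follows (illustrative code with hypothetical names; the Model query and the range coder are only indicated in comments, as their actual interfaces are described in Supplementary Section 1.2):

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Counts of A, C, G, T, N following the current context, e.g. returned by Dp.
using Counts = std::array<uint32_t, 5>;

// Order the alphabet by decreasing count; ties are broken lexicographically
// because std::stable_sort keeps the original (lexicographic) order of equal keys.
std::array<int, 5> rank_symbols(const Counts& c) {
    std::array<int, 5> order = {0, 1, 2, 3, 4};        // indices of A, C, G, T, N
    std::stable_sort(order.begin(), order.end(),
                     [&c](int a, int b) { return c[a] > c[b]; });
    return order;                                       // order[r] = symbol with rank r
}

// Usage for the example of Fig. 1a (counts for ATACCA, ATACCC, ATACCG, ATACCT, N):
//   Counts c = {31, 10, 454, 5, 0};
//   auto order = rank_symbols(c);                      // G, A, C, T, N
// The Model dictionary is then queried with (position, source dictionary, ...) and
// returns the per-rank frequencies, e.g. {1204, 15, 8, 8, 2}, which the range coder
// uses to encode the rank of the actual symbol.
```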

Figure 1

Illustration of compression of a single read in the single-end, original ordering mode. The subfigures (a–c) show how the estimation of probabilities for entropy coding is performed for some symbols.

The situation presented in Fig. 1b is a bit different. Here we are at the 8th position in the read, so we are able to look for statistics in the Ds dictionary. Nevertheless, this dictionary stores statistics for 9-mers and we can construct only 8-mers from the already processed symbols. Therefore, the dictionary is asked for the 9-mers *ATACCGT*, where "*" represents any symbol. The answer is that *ATACCGTA appeared 2 times, *ATACCGTC 155 times, *ATACCGTG 54 times, and *ATACCGTT 12 times. Then, we ask the Model dictionary and obtain the estimates: 350 for the rank-0 symbol (C in this situation), 100 for the rank-1 symbol (G), 14 for the rank-2 symbol (T), 4 for the rank-3 symbol (A), and 3 for the rank-4 symbol (N). These estimates are used for encoding the current symbol with the range coder.

In Fig. 1c, we show the processing of the 14th symbol. Now, the number of processed symbols is sufficient to use the Db dictionary for the statistics of occurrence of the 12-mers ACCGTCAGGTA*. Unfortunately, the answer is that no such 12-mer has been seen so far. Therefore, we use the Ds dictionary for GTCAGGTA* and obtain the statistics: 15 for A, 0 for C, 15 for G, and 0 for T. Then we use the Model dictionary and see that the estimates for the current symbol are: 102 for the rank-0 symbol (A), 104 for the rank-1 symbol (G), 7 for the rank-2 symbol (C), 5 for the rank-3 symbol (T), and 1 for the rank-4 symbol (N). As previously, these values are used to encode the current symbol with the range coder.
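The fallback between dictionaries can be sketched as below (a simplified stand-in for the real structures: incomplete k-mer queries with a leading "*" and the cache-efficient layout are omitted; the names are hypothetical):

```cpp
#include <array>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using Counts = std::array<uint32_t, 5>;   // occurrences of A, C, G, T, N

// A fixed-k dictionary mapping a (k-1)-symbol context to the counts of the
// symbols that followed it in the already processed reads.
struct KmerDict {
    size_t k;
    std::unordered_map<std::string, Counts> stats;

    Counts lookup(const std::string& read, size_t pos) const {
        if (pos < k - 1) return Counts{};                        // context too short
        auto it = stats.find(read.substr(pos - (k - 1), k - 1));
        return it != stats.end() ? it->second : Counts{};        // unseen context -> zeros
    }
};

// Use the longest dictionary that has any statistics for the current context
// (e.g. in Fig. 1c: Db has none for ACCGTCAGGTA*, so Ds answers for GTCAGGTA*).
Counts pick_counts(const std::vector<const KmerDict*>& longest_first,
                   const std::string& read, size_t pos) {
    for (const KmerDict* d : longest_first) {
        Counts c = d->lookup(read, pos);
        uint64_t total = 0;
        for (uint32_t v : c) total += v;
        if (total > 0) return c;
    }
    return Counts{};                                             // nothing seen anywhere
}
```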

When a symbol has been encoded, it is checked whether it looks like a sequencing error. Let us assume that the statistics from the Db dictionary are \(C=[{0}_{{\rm{A}}},{5}_{{\rm{C}}},{0}_{{\rm{G}}},{0}_{{\rm{T}}}]\) and the current symbol is A. The dominance of C symbols suggests that A is a sequencing error. Therefore, we encode A at this position but replace A by C in the read, which affects the processing of the next symbols in the read. To classify a symbol as a sequencing error, the most frequent symbol must have a counter of at least 3 and the current symbol must have a counter of 0.
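A minimal sketch of this correction rule (using the thresholds given above; the second, rarely used mechanism mentioned later is not shown):

```cpp
#include <array>
#include <cstdint>

// Decide whether the just-encoded symbol looks like a sequencing error.
// 'counts' holds the occurrences of A, C, G, T for the current context (e.g. from Db);
// 'current' is the index (0..3) of the symbol actually present in the read.
// Returns the symbol that should be kept in the read for further processing.
int correct_symbol(const std::array<uint32_t, 4>& counts, int current) {
    int dominant = 0;
    for (int i = 1; i < 4; ++i)
        if (counts[i] > counts[dominant]) dominant = i;
    // Correction only when the dominant symbol was seen at least 3 times
    // and the current symbol has never been seen in this context.
    if (counts[dominant] >= 3 && counts[current] == 0)
        return dominant;          // e.g. C = [0,5,0,0], current A -> replaced by C
    return current;               // otherwise keep the original symbol
}
```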

In the REO mode, the prefix of size p of an SE read (or the first read of a pair in the PE mode) is encoded in a different way. Roughly speaking, we treat it as a 2p-bit unsigned integer and encode the difference between it and the integer representing the prefix of the previous read. Using the same read as in the above example, we take the 6-mer ATACCG and convert it to a 12-bit integer using the mapping: \({00}_{2}\) for A, \({01}_{2}\) for C, \({10}_{2}\) for G, \({11}_{2}\) for T. Therefore the read prefix is represented as \({001100010110}_{2}={790}_{10}\). Let us also assume that the previous read prefix was ATACAT, which translates to \({001100010011}_{2}={787}_{10}\). Therefore the current read prefix is encoded as the difference \(790-787=3\). The differences are usually small numbers, which can be encoded efficiently using a range coder. The suffix is compressed in the same way as in the OO mode.
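The prefix coding can be illustrated as follows (a sketch of the 2-bit packing and delta step only; the subsequent range coding of the difference is omitted):

```cpp
#include <cstdint>
#include <string>

// Pack a p-symbol DNA prefix into a 2p-bit unsigned integer
// using the mapping A=00, C=01, G=10, T=11.
uint64_t pack_prefix(const std::string& prefix) {
    uint64_t v = 0;
    for (char c : prefix) {
        uint64_t code = (c == 'A') ? 0 : (c == 'C') ? 1 : (c == 'G') ? 2 : 3;
        v = (v << 2) | code;
    }
    return v;
}

// Example from the text:
//   pack_prefix("ATACCG") == 0b001100010110 == 790
//   pack_prefix("ATACAT") == 0b001100010011 == 787
//   encoded difference   == 790 - 787 == 3
```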

In the PE modes, the first read of a pair is compressed in exactly the same way as in the SE mode. The processing of the second read is a bit more complex, but it is the same in the OO and REO modes. Initially, we try to predict some b-mer of the second read. To this end, we use a dictionary Mb that stores the pairs of minimizers of read pairs seen so far. Quite often this allows the b-mer of the second read to be encoded in an efficient way. Then, we store the substrings of the read following and preceding this b-mer. If we are unable to predict the minimizer of the second read, we store the read in the same way as an SE read in the OO mode.

Figure 2 shows an example of how the pairs of minimizers are used to predict some b-mer of the second read. At the beginning, a minimizer \({m}_{1}={\rm{ACCGAGGTAG}}\) is found in the first read (green cells). Then, we look in the Mb dictionary for the minimizers of second reads that appeared together with m1 in the past. We obtain four candidates ordered by the number of appearances. Then, for each of the candidates, we check whether it appears in the second read. In our example, we find AAGATGTCCAGT (orange cells). Thus, we encode just the rank of the candidate in the ordered list of candidates, i.e., 2 (counting from 0), and the position of the candidate in the second read, i.e., 10.

Figure 2

Example of the prediction of some b-mer of the paired read from the statistics of occurrences of pairs of minimizers in PE reads.
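A sketch of the lookup described above (the structures and names are hypothetical simplifications of the Mb dictionary):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// For a minimizer m1 of the first read, Mb keeps the minimizers of second reads
// seen with it so far, ordered by decreasing number of co-occurrences.
using MbDict = std::unordered_map<std::string, std::vector<std::string>>;

struct BmerPrediction {
    bool   found = false;
    size_t rank  = 0;    // rank of the matching candidate (counted from 0)
    size_t pos   = 0;    // position of the candidate b-mer in the second read
};

BmerPrediction predict_second_bmer(const MbDict& mb, const std::string& m1,
                                   const std::string& read2) {
    BmerPrediction p;
    auto it = mb.find(m1);
    if (it == mb.end()) return p;                      // m1 not seen before
    for (size_t r = 0; r < it->second.size(); ++r) {
        size_t where = read2.find(it->second[r]);      // does the candidate occur in read2?
        if (where != std::string::npos) {
            p.found = true;
            p.rank = r;                                // e.g. 2 in the example of Fig. 2
            p.pos = where;                             // e.g. 10 in the example of Fig. 2
            break;
        }
    }
    return p;   // if not found, the second read is coded like an SE read in the OO mode
}
```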

After processing a read, the dictionaries De, Dp, Ds, Db, and, in the case of PE reads, Mb are updated.

For clarity of presentation, the above description of base compression is somewhat simplified. For example, in practice we work on canonical k-mers, a second (rarely used) correction mechanism is also employed, more than a single pair of minimizers is stored in the Mb dictionary, the Model dictionary is indexed with the use of some function of the mentioned (and other) properties of the current position in a read, and a dynamic Markov coder-like mechanism is used in the Model dictionary to provide the estimated probabilities. A discussion of all of this can be found in Supplementary Section 1.2.

Compression of IDs

In the lossless mode, IDs are compressed similarly as in the state-of-the-art compressors, like Spring or FaStore. The ID of each read is tokenized (the separators are non-alphanumeric characters). Then the tokens are compared with the tokens of the previous read. If the tokens at corresponding positions contain numerical values, the difference between the integers is calculated and stored. Otherwise, the corresponding tokens are compared as strings. If they are equal, we just store a flag. In the opposite case, we store a mismatch flag and compare the tokens character by character, storing the result of the comparison and (if necessary) the letter from the current read. It can also happen that the lists of tokens differ significantly, i.e., they are of different lengths or the corresponding tokens are of different types. In this situation, the ID is stored character by character.
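A sketch of the tokenization step (the subsequent per-token coding is indicated only in the final comment; the names are illustrative):

```cpp
#include <cctype>
#include <string>
#include <vector>

// Split a read ID into tokens; non-alphanumeric characters act as separators
// and are kept as one-character tokens so that the ID can be reconstructed exactly.
std::vector<std::string> tokenize_id(const std::string& id) {
    std::vector<std::string> tokens;
    std::string cur;
    for (char c : id) {
        if (std::isalnum(static_cast<unsigned char>(c))) {
            cur.push_back(c);
        } else {
            if (!cur.empty()) { tokens.push_back(cur); cur.clear(); }
            tokens.push_back(std::string(1, c));        // separator token
        }
    }
    if (!cur.empty()) tokens.push_back(cur);
    return tokens;
}

// Corresponding numeric tokens of consecutive IDs are stored as differences
// (e.g. "12345" vs. "12347" -> +2); equal string tokens are stored as a flag,
// mismatching ones character by character.
```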

FQSqueezer also offers two lossy modes. In the first one, it preserves just the instrument name (the first part of the ID). These names are organized in a move-to-front list26 and the position of the current ID on the list is encoded. If the current instrument name is absent from the list, it is encoded explicitly. In the last possible mode, the IDs are discarded.
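A minimal move-to-front list for the instrument names (a sketch; the emitted positions are then entropy coded):

```cpp
#include <list>
#include <string>

// Returns the 0-based position of 'name' on the list and moves it to the front,
// or -1 if it is absent; in the latter case the name is encoded explicitly
// and then placed at the front of the list.
int mtf_encode(std::list<std::string>& names, const std::string& name) {
    int pos = 0;
    for (auto it = names.begin(); it != names.end(); ++it, ++pos) {
        if (*it == name) {
            names.erase(it);
            names.push_front(name);
            return pos;
        }
    }
    names.push_front(name);
    return -1;
}
```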

Compression of quality scores

The quality scores can be compressed in five modes allowing different numbers of values: 96, 8, 4, 2, or none. If the input FASTQ file already has a reduced resolution of quality scores (e.g., 4 for Illumina NovaSeq sequencers), no conversion is necessary. Otherwise, in a lossy mode the necessary resolution reduction is made by the compressor. The encoding is done using contexts containing the position in the read and 2 (96-value alphabet), 6 (8-value alphabet), 9 (4-value alphabet), or 10 (binary alphabet) previous scores.
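The context formation can be sketched as follows (an illustrative function; the exact way FQSqueezer combines the position and the previous scores may differ):

```cpp
#include <cstdint>
#include <vector>

// Build a context identifier from the position in the read and the last
// 'history' quality symbols, already mapped to an alphabet of 'levels' values
// (e.g. history = 6 and levels = 8 for the 8-value mode).
uint64_t quality_context(size_t pos, const std::vector<uint8_t>& prev_quals,
                         size_t history, uint32_t levels) {
    uint64_t ctx = pos;
    size_t n = prev_quals.size();
    for (size_t i = 0; i < history; ++i) {
        uint8_t q = (i < n) ? prev_quals[n - 1 - i] : 0;   // pad at the read start
        ctx = ctx * levels + q;                            // mix in the previous scores
    }
    return ctx;    // indexes the per-context symbol statistics used by the range coder
}
```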

Implementation details

The implementation is in the C++14 programming language. Most of the dictionaries are implemented as hash tables with linear probing for collision handling. To reduce delays caused by cache misses, we make use of software prefetching. The multithreading is implemented using native C++ threads. The FASTQ blocks of size 16 MB are split into as many parts as the number of threads. Each thread processes its part independently and the global dictionaries are available only for querying. At the synchronization points the threads update the global dictionaries. Nevertheless, the threads update different parts of the dictionaries, so they can operate in parallel.
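The flavor of the dictionary lookups can be illustrated by the following simplified open-addressing table (a sketch only; the production structures use more compact entries, and the prefetch intrinsic shown here is GCC/Clang-specific):

```cpp
#include <cstdint>
#include <vector>

// A hash table with linear probing for 2-bit-packed k-mers and their counters.
struct KmerTable {
    static constexpr uint64_t EMPTY = ~0ULL;
    struct Entry { uint64_t key = EMPTY; uint32_t count = 0; };
    std::vector<Entry> slots;

    explicit KmerTable(size_t size_log2) : slots(size_t(1) << size_log2) {}

    size_t bucket(uint64_t key) const {
        return (key * 0x9E3779B97F4A7C15ULL) & (slots.size() - 1);   // power-of-2 table
    }

    // Issuing a prefetch well before the actual probe hides part of the
    // main-memory latency caused by the random access pattern of k-mer queries.
    void prefetch(uint64_t key) const { __builtin_prefetch(&slots[bucket(key)]); }

    void increment(uint64_t key) {
        for (size_t i = bucket(key); ; i = (i + 1) & (slots.size() - 1)) {  // linear probing
            if (slots[i].key == key)   { ++slots[i].count; return; }
            if (slots[i].key == EMPTY) { slots[i].key = key; slots[i].count = 1; return; }
        }
    }
};
```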