FQSqueezer: k-mer-based compression of sequencing data

The amount of data produced by modern sequencing instruments that needs to be stored is huge. It is therefore not surprising that a lot of work has been done in the field of specialized compression of FASTQ files. The existing algorithms are, however, still imperfect, and even the best tools produce quite large archives. We present FQSqueezer, a novel compression algorithm for sequencing data, able to process single- and paired-end reads of variable lengths. It is based on ideas from the well-known prediction by partial matching and dynamic Markov coder algorithms from the world of general-purpose compressors. Its compression ratios are often tens of percent better than those offered by the state-of-the-art tools. The drawbacks of the proposed method are its large memory and time requirements.


Data structures organization
The dictionaries are organized in various ways. The D_p dictionary is a vector of 2-bit integers of size 4^p. It is indexed by the integer representation of a p-mer. Each 2-bit field stores the number of occurrences of the related p-mer, saturated at 3. When a p-mer is added to the dictionary, the counters are incremented both for the p-mer and for its reverse complement. The concept of treating a k-mer and its reverse complement in the same way is present in many previous works, e.g., [3,2].
The D_s dictionary is a collection of 4^{s−10} hash tables representing canonical s-mers and their counters (numbers of occurrences of each s-mer). When looking for (or inserting) an s-mer x in the dictionary, the hash table in the collection is identified by the substring x_{3,s−8}. Each 32-bit entry of a hash table contains: 4 bits for x_{1,2}, 16 bits for x_{s−7,s}, and 12 bits for the counter. A hash function calculates its value using just x_{3,s−2}, i.e., the kernel of the s-mer. Such an organization allows making use of the software prefetching mechanism to reduce delays related to cache misses. Moreover, the s-mers sharing their s − 1 starting symbols are stored at close positions, which also reduces the number of cache misses. The maximal value of a counter is 2^12 − 1, which could saturate for repetitive genomes. Therefore, the counter incrementation is a bit tricky. Counter values v not exceeding 2^11 are always incremented. The larger values are incremented with probability 1/(v − 2^11). Thus, the large counts are approximate, but the maximal representable value is much larger.
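The probabilistic incrementation can be illustrated by the following sketch. The choice of RNG and the exact acceptance test are our assumptions; only the 2^11 threshold and the 1/(v − 2^11) probability follow the text.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Probabilistic update of a 12-bit counter: values up to 2^11 are always
// incremented; a larger value v is incremented with probability
// 1/(v - 2^11), so large counts are approximate but the representable
// effective range is much larger than 2^12 - 1.
uint16_t increment_counter(uint16_t v, std::mt19937 &rng) {
    const uint16_t max_val = (1u << 12) - 1;
    const uint16_t threshold = 1u << 11;
    if (v >= max_val) return v;                  // already saturated
    if (v <= threshold) return uint16_t(v + 1);  // deterministic region
    std::uniform_int_distribution<uint32_t> dist(0, v - threshold - 1);
    return dist(rng) == 0 ? uint16_t(v + 1) : v; // prob. 1/(v - 2^11)
}
```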
The D_b dictionary is organized in a similar way to D_s, but here we have a collection of 4^{b−13} hash tables. The hash table is identified by the integer representation of x_{3,b−11}. The 32-bit entries are composed of: 4 bits for x_{1,2}, 22 bits for x_{b−10,b}, and 6 bits for the counter. A counter value v larger than 8 is incremented with probability 1/(2(v − 8)).
The compressor maintains global dictionaries D_p, D_s, and D_b that can be queried in parallel by separate threads. These dictionaries are updated after processing complete 16 MB blocks (in fact, for the 1st, 2nd, . . . , 99th blocks the updates are more frequent, i.e., the ith block is updated 100 − i times). The update is made by all threads, as each thread updates different hash tables from a collection (or a different part of the vector D_p). The delayed update means, however, that the recently processed k-mers are absent from the global dictionaries. To overcome this problem, each thread also stores small local dictionaries D_s^local and D_b^local representing the s-mers and b-mers seen in this thread since the last update of the global dictionaries. These local dictionaries are hash tables with 64-bit entries.
The D_e dictionary is a hash table storing statistics of occurrences of all substrings of lengths from 1 to e.
In the PE mode, we also need the M_b dictionary. It is a hash table containing 128-bit entries, each storing two b-mers that are signatures (a variant of a minimizer, defined as in [2]) of some parts of reads, and a counter stored in 64 − 2b bits. More precisely, for each pair of reads x and y we determine several signatures:
• x_1, x_2, x_3 - the signatures in the first, second, and third part of x,
• y_1, y_2, y_3 - as above for y,
• x_23rc and y_23rc - the signatures in the reverse-complemented second and third parts of x and y.
Then, we add the following pairs to M_b: (x_1, y_1), (x_1, y_3), (x_1, x_23rc), (x_2, y_1), (x_2, y_3), (x_3, y_1), (x_3, y_3), (y_1, x_1), (y_1, x_3), (y_1, y_23rc), (y_2, x_1), (y_2, x_3), (y_3, x_1), (y_3, x_3). The local dictionary M_b^local is maintained in a similar way to the local variants of the other dictionaries. The choice of the parameters f, p, s, b has some impact on the compression ratios. They should be adjusted to the genome size of the sequenced organism (or the sum of genome sizes in metagenomic experiments). Therefore, the user should specify the genome size, which is used to set the parameters as shown in Table 1. If the genome size is unknown, it can be roughly estimated from the FASTQ file size and the sequencing coverage.
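For clarity, the inserted pairs can be enumerated programmatically. This is an illustrative sketch: the `Sig` type and the function name are our assumptions, and the signatures are assumed to be already computed b-mers.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Builds the list of the 14 signature pairs added to M_b for a read pair,
// exactly as enumerated above.
using Sig = std::string;
std::vector<std::pair<Sig, Sig>> mb_pairs(
    const Sig &x1, const Sig &x2, const Sig &x3, const Sig &x23rc,
    const Sig &y1, const Sig &y2, const Sig &y3, const Sig &y23rc) {
    return {
        {x1, y1}, {x1, y3}, {x1, x23rc},
        {x2, y1}, {x2, y3},
        {x3, y1}, {x3, y3},
        {y1, x1}, {y1, x3}, {y1, y23rc},
        {y2, x1}, {y2, x3},
        {y3, x1}, {y3, x3},
    };
}
```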

Algorithmic details
FQSqueezer can be run in 4 different modes: (i) for single-end (SE) reads, preserving the original ordering of the reads, (ii) for SE reads, without preserving the original ordering, (iii) for paired-end (PE) reads, preserving the original ordering, (iv) for PE reads, without preserving the original ordering. Most of the algorithmic details are the same in all modes, so we start the description with the simplest SE order-preserving mode.

SE reads - original ordering
The first f symbols of a read are compressed using the statistics of occurrences of up to e previous symbols. The context length is 1 at the beginning of the compression and quickly grows to 9 (when a sufficient amount of statistics has been gathered). The processing of the successive symbols is as follows. We maintain 3 pairs of k-mers:
• (p − 1)-mers: z_p^unc and z_p, where the first contains the recently seen p − 1 symbols and the second is a copy of z_p^unc in which some symbols could have been changed by our correction mechanism,
• (s − 1)-mers: z_s^unc and z_s, defined as above for the s − 1 recent symbols,
• (b − 1)-mers: z_b^unc and z_b, defined as above for the b − 1 recent symbols.
At the beginning of the processing of a read the k-mers are incomplete.
Let us assume that we are at the ith position in a read x. First, we try to obtain predictions of the current symbol that are as accurate as possible. To this end, if s < i, we ask D_b and D_b^local for the statistics of occurrences of z_b followed by A, C, G, or T. If z_b is complete, the queries are straightforward. Otherwise, to collect the statistics, we fill z_b (at its beginning) with all possible combinations of bases, which can be costly in terms of time. In case of no success, we also try the same for z_b^unc. If we were unable to collect any statistics and p < i, we try a similar procedure for z_s and D_s, D_s^local. Moreover, we use this step also in the situation in which, for more than one base, the counters obtained from querying D_b achieve 63, the maximal counter value. If necessary (no success), the search is performed for z_p and D_p. If we were not able to find any statistics in the previous stages, we perform a more aggressive search. If b ≤ i, we try to collect the statistics for z_b and D_b for approximate matches (with 1 mismatch). Unfortunately, this means dozens of queries. In case of no success and s ≤ i, we try the same for z_s and D_s. Finally, if necessary, we also use z_p and D_p.
If after this search procedure we still do not have any statistics, we encode the current symbol taking into account from 1 to 9 previous symbols (exactly as for the read prefix). The situation is more interesting if we were able to collect some statistics. We form a context for the encoding of the current symbol. The context components are:
1. pos - the current position in a read,
2. pattern - the sorted counters of occurrences of symbols,
3. cnt_lev - the origin of the statistics, one of:
• there was a correction in the most recent half of z_b,
• there was a correction in the former half of z_b,
• the statistics come from the approximate queries,
• none of the above.
The actual context is constructed from the context components in a way motivated by the dynamic Markov coder [1]. We allow 7 levels of contexts. In the parentheses we give the number of times a context can be used before it is split into more detailed ones (i.e., before we go to the next level):
1. the context is empty (1),

2. the context contains (32):
• cnt_lev,

At the end of the processing of the current position, we try to correct the current symbol (just to obtain a better representation in the dictionaries). If i < b, we do nothing, with the exception of the case in which the current symbol is N; then we change it to A. For i ≥ b, we alternatively run two correction procedures. The first is applied if the counters come from inspecting D_b or D_b^local: if the counter value for the current symbol is 0 and the most frequent symbol has a counter of 3 or more, we make a substitution. The second is applied in case of no statistics, or if the statistics come from D_p. Moreover, to apply this correction, the current coverage (estimated from the number of updates to D_p) must be at least 7. We try to change one of the previous 6 symbols so as to find a match in D_b.
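The first correction rule can be sketched as follows. The function name and the 0..3 symbol encoding are our assumptions; the thresholds (counter 0 for the read symbol, counter at least 3 for the most frequent one) follow the text.

```cpp
#include <array>
#include <cassert>

// If the counter of the symbol actually read is 0 while the most frequent
// symbol at this position has a counter of at least 3, substitute the read
// symbol by the most frequent one. Symbols are encoded as 0..3 (A, C, G, T).
int maybe_correct(int symbol, const std::array<int, 4> &counts) {
    int best = 0;
    for (int b = 1; b < 4; ++b)
        if (counts[b] > counts[best]) best = b;
    if (counts[symbol] == 0 && counts[best] >= 3)
        return best;       // substitution
    return symbol;         // keep the original symbol
}
```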
As a simple time optimization, at the beginning of the processing of a read we check whether it is a copy of the previous read. We store this information and do not process the copied reads position by position.

SE reads - reordered
In the REO mode, the reads are initially split into 256 bins according to their first 4 symbols (Ns are treated as Ts). Then the reads are read back from the bins and sorted in the main memory before further processing.
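The binning can be illustrated by the following sketch (the function name is our assumption; 4 symbols over a 4-letter alphabet give 4^4 = 256 bins):

```cpp
#include <cassert>
#include <string>

// Maps the first 4 symbols of a read to a bin index in [0, 255];
// N (or any other non-ACGT symbol) is treated as T.
int bin_index(const std::string &read) {
    int idx = 0;
    for (int i = 0; i < 4; ++i) {
        int code;
        switch (read[i]) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            default:  code = 3; break;   // T and N
        }
        idx = idx * 4 + code;
    }
    return idx;
}
```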
The only difference in the compression with respect to the OO mode is the handling of the read prefix. We treat the first p symbols as a 2p-bit unsigned integer t. Let t_prev be the corresponding value for the previous read. We check in D_p the counter c for t. We encode c and then determine (in D_p) the number of p-mers between t_prev and t with a counter equal to c, and store this number.
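The idea can be sketched as follows. This is a simplified illustration with D_p modeled as a plain array of counters; in the real structure the counters are 2-bit, and the rank query is presumably computed more efficiently than by a linear scan.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Encodes the p-mer prefix t of the current read, given the prefix t_prev
// of the previous (sorted) read: the code is the counter c of t in a
// D_p-like table plus the number of p-mers strictly between t_prev and t
// whose counter also equals c. After sorting, this rank is usually small.
std::pair<unsigned, uint64_t> encode_prefix(const std::vector<uint8_t> &dp,
                                            uint64_t t_prev, uint64_t t) {
    unsigned c = dp[t];
    uint64_t rank = 0;
    for (uint64_t u = t_prev + 1; u < t; ++u)
        if (dp[u] == c) ++rank;
    return {c, rank};
}
```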
Special care is necessary for Ns in the read prefix. They are replaced by Ts for the conversion to integers, but a special flag is stored to tell whether such substitutions took place in the present read. If necessary, the substitutions are stored to allow a perfect reconstruction of the prefix.

PE reads - original ordering
The compression of PE reads in the OO mode is similar to the compression of SE reads. The first read of a pair is compressed in exactly the same way. The second read can also be compressed in the same way but, fortunately, quite often we can do much better. To this end, we determine 4 signatures of the first read, i.e., from its 1st, 2nd, 3rd, and 4th parts. For each signature, we ask M_b for the paired candidate signatures. We merge the lists and pick the top 16 candidates. Then, we look for the candidates in the second read of the pair. In case of success, we just store the position of the found signature in the candidate list and the position of the signature in the second read. Then, we store the parts after and before the signature in the second read in the same way as in the SE mode.

PE reads - reordered
The compression of PE reads in the REO mode is a mix of the compression of SE-REO and PE-OO modes. The first read of a pair is stored as in the SE-REO mode. The second one is processed as in the PE-OO mode.

Examined programs
The following programs were used in the experimental part. The running parameters are also given.

Environment
The computer used in the tests had the following configuration:
• 2 Intel Xeon E5-2670 v3 CPUs, 12 double-threaded cores per CPU, each clocked at 2.3 GHz,
• 256 GiB RAM,
• 6 HDDs of size 1 TiB each in a RAID-5 configuration; hdparm -t reported a buffered read speed of 528.45 MB/s.
For compilation we used G++ v. 7.2.0. The machine was running Debian 9.3.1 x86-64 OS.