Introduction

Genome sequencing technology has recently become so cheap that it is starting to be considered a useful tool in medicine. Companies like Illumina offer whole human genome sequencing for medical purposes for five thousand U.S. dollars1. There are also large-scale projects designed to find the common differences between individual genomes. One of the most famous is the 1000 Genomes Project2, which aims at sequencing the genomes of several thousand humans and determining the genetic variants with at least 1% frequency. There are, however, even broader human genome sequencing efforts, such as the UK10K project3, the Personal Genomes Project4 and the Million Veteran Program (MVP)5. The planned numbers of sequenced genomes are 10 K, 100 K and 1 M, respectively. Large collections of genomes are also built for other species. E.g., in the 1001 Genomes Project (1001 GP)6,7 about 1000 genomes of Arabidopsis thaliana are to be sequenced.

Sequencing itself is of course challenging, but due to the large amounts of produced data, the mere storage and transfer of the results becomes a challenge too. Recent papers8,9 show that the IT costs are (or soon will be) comparable to the sequencing costs. Due to the slow progress in reducing IT prices, effective ways of representing genomic data in compact form are intensively investigated. Several subproblems can be identified here. The first is the compression of raw sequencing reads10,11,12,13. The second is the compression of reads after mapping onto reference genomes10,14,15. The third is the compression of results of variant calling16,17,18. The fourth is the compression of complete genomic sequences19,20,21,22. These subproblems are related, but nevertheless require different approaches. Recent surveys discuss most of the existing algorithms9,23,24.

In this paper we deal with the last of the mentioned tasks, i.e., storage of collections of genomes. We propose Genome Differential Compressor 2 (GDC 2), a utility for compression of large sets of genomes of the same species. Since such genomes are highly similar, e.g., it was estimated that the genomes of two humans are 99.5 percent identical25, it is clear that when compressing a collection of genomes one can obtain better compression ratios than when compressing the sequences separately. Initially, researchers tried to use the similarity between a sequence to be compressed and a reference sequence. The first impressive result was by Christley et al.16. They showed that the description of differences between James Watson’s genome and the reference genome can be stored in as little as 4.1 MB. Taking into account that the complete haploid human genome is of size 3.1 Gbases, this translates to ~750-fold compression. This result was recently improved by Pavlichin et al.17, who reduced the space for the JW genome to about 2.5 MB (compression ratio ~1250).

Such a large compression ratio was possible because the data were preprocessed, i.e., precise information about all variants was available. This is not always the case, as the genomes can be obtained in different experiments with different reference genomes, or the genomes can be de novo assembled. In such situations the data to be compressed are collections of complete genomic sequences. This significantly complicates the compression task, as the differences between sequences are not given explicitly; they have to be found, e.g., by multiple complete genome alignment, which is a very complex problem. Moreover, for technological reasons, the differences between de novo assembled genomes are usually larger than between reassembled genomes.

Several papers on the problem of compressing collections of genomic sequences have been published19,20,22,26,27. In the majority of them, each single sequence is compressed separately, by identifying the differences between it and a single reference genome. This allowed compression ratios for human genomes of up to 400 to be obtained, much poorer than the ~1250 obtained by Pavlichin et al.17. This is the price for the lack of prior knowledge about the compressed data. The most successful attempts at obtaining higher compression ratios exploited the knowledge of similarities between more sequences in the collection. Since such approaches are the real competitors of the proposed algorithm, we describe them in a little more detail.

The first attempt in this direction was GDC-ultra19. It takes a single reference sequence and constructs a search structure (namely, a hash table) for it. Then it compresses the first sequence of the collection by looking for similarities between this sequence and the reference. When a sequence is processed, it is used as an additional reference sequence for further sequences, so a separate search structure is constructed for it. The same holds for the following sequences; for example, the 25th input sequence of the collection is compressed by looking for the differences between it and the main reference sequence as well as the formerly processed 24 sequences of the collection. The number of additional reference sequences is limited to 39 (for technical reasons only, mainly to keep the necessary amount of memory at a reasonable level). If the collection consists of more than 39 sequences, the 40th, 41st, etc. sequence is compressed with the 40 references only. The differences between the current sequence and the reference sequences are finally Huffman coded. This approach proved to be promising, since the collection of 69 human genomes was compressed with a ratio of ~1000.

A different approach was used by Wandelt et al.21 in their FRESCO algorithm. They investigated several variants, and below we describe the one that gave the best results. The collection is divided into two sets: (i) additional references, (ii) remaining sequences. FRESCO constructs a search structure (a suffix tree) for the main reference sequence. Then it looks for similarities between the additional reference sequences and the main reference, performing classical Ziv–Lempel parsing28 of the additional reference sequences. As a result, for each additional reference it obtains a sequence of triples (position in the main reference, length of the identical part, next symbol). For the Ziv–Lempel-parsed additional reference sequences a search structure (a hash table) is built. After that, FRESCO is ready to compress the remaining sequences of the collection. Each sequence is Ziv–Lempel-parsed against the main reference sequence. Then, the sequence of triples is compressed using the Ziv–Lempel-parsed additional reference sequences as second-level references. The obtained compression ratios are impressive: approximately 3000 for the collection of about 1000 haploid genomes of the 1000 GP, when 70 additional reference sequences were used.

The best compression ratios for genomic collections were obtained by the TGC algorithm18. It is, however, from a different category, since as an input it takes a Variant Call Format (VCF)29 file describing the differences between the genomes and the reference sequence, so it processes essentially the same data as Pavlichin et al.17. In this work we deal with complete genomes stored in FASTA format. In theory it is possible to convert FASTA files into VCF files, but it would require a close to optimal alignment of many complete genomes (i.e., finding the smallest set of differences between these genomes), which is far from trivial, especially due to the presence of long structural variants. Nevertheless, comparing the obtained results with TGC will be interesting, as it will allow us to see how far we are from the top algorithm for a similar problem. The main idea of TGC is to split the VCF file into two files. The first (a dictionary of variants) stores a description of each variant (i.e., its type, position, alternative alleles, etc.). The second file stores a binary representation of the presence/absence of each single variant in each single sequence. The bit vectors (one for each sequence) are compressed using a specialized Ziv–Lempel-based algorithm. The dictionary file is also compressed using a specialized algorithm. The compression ratio of TGC for the collection of 1092 diploid human genomes (when taking only 1 reference sequence) is about 15,500.

Methods

Definitions

For a precise description of the proposed algorithm let us define some terms. As an input we have a single reference sequence R and a collection of genome sequences S = {S1, S2, ..., Sn}. Each sequence is composed of symbols from some alphabet Σ, i.e., Sk = sk,1 sk,2 ... sk,ℓ(k) for each 1 ≤ k ≤ n, where sk,i ∈ Σ for each valid i and ℓ(k) denotes the length of Sk. Also R = r1 r2 ... rℓ(R), where ri ∈ Σ for each valid i and ℓ(R) denotes the length of R. For any sequence X (a reference or from the collection), Xi,j denotes the substring xi xi+1 ... xj.

For DNA sequences the alphabet should ideally contain only 4 symbols (A, C, G, T), but in practice N (unknown) symbols are quite frequent. Moreover, sometimes other IUPAC codes appear as well. Thus, in this work we assume only that the symbols are letters from the ASCII code (we also distinguish between lower- and uppercase letters).

Compression algorithm

At the beginning, the compression algorithm reads the reference sequence R and constructs a search structure HTR (namely, a hash table with linear probing30) for it. The hash value is computed for each h1m-symbol-long substring of R (h1m = 15 by default, but a different value can be specified by the user), i.e., for all Ri,i+h1m−1, where 1 ≤ i ≤ ℓ(R) − h1m + 1. After that, the main processing of the collection starts. The compression algorithm is two-level.
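
To make the construction of HTR more concrete, the listing below gives a minimal, simplified C++11 sketch of such a linear-probing index over all h1m-symbol substrings of R. It is not the actual GDC 2 data structure: the hash function, the 0.5 load factor and the ReferenceIndex/find names are our own assumptions made for brevity.

    #include <cstdint>
    #include <string>
    #include <vector>

    // Simplified index of all k-mers (k = h1m, 15 by default) of the reference R.
    // Open addressing with linear probing; each slot stores the starting position
    // of one k-mer in R (or EMPTY). Collisions are resolved by probing forward.
    struct ReferenceIndex {
        static const uint32_t EMPTY = 0xFFFFFFFFu;
        const std::string& R;
        size_t k;
        std::vector<uint32_t> slots;               // positions in R, EMPTY if unused

        ReferenceIndex(const std::string& ref, size_t kmer_len)
            : R(ref), k(kmer_len) {
            size_t n_kmers = R.size() >= k ? R.size() - k + 1 : 0;
            size_t size = 1;
            while (size < 2 * n_kmers) size <<= 1; // keep load factor <= 0.5
            slots.assign(size, EMPTY);
            for (size_t i = 0; i < n_kmers; ++i)
                insert(i);
        }

        uint64_t hash(size_t pos) const {          // simple polynomial hash
            uint64_t h = 0;
            for (size_t j = 0; j < k; ++j)
                h = h * 127 + (unsigned char)R[pos + j];
            return h & (slots.size() - 1);
        }

        void insert(size_t pos) {
            size_t s = hash(pos);
            while (slots[s] != EMPTY)              // linear probing
                s = (s + 1) & (slots.size() - 1);
            slots[s] = (uint32_t)pos;
        }

        // Returns positions in R whose k-mer equals the k-mer of `query` at `qpos`.
        std::vector<uint32_t> find(const std::string& query, size_t qpos) const {
            std::vector<uint32_t> hits;
            uint64_t h = 0;
            for (size_t j = 0; j < k; ++j)
                h = h * 127 + (unsigned char)query[qpos + j];
            size_t s = h & (slots.size() - 1);
            while (slots[s] != EMPTY) {
                if (R.compare(slots[s], k, query, qpos, k) == 0)
                    hits.push_back(slots[s]);
                s = (s + 1) & (slots.size() - 1);
            }
            return hits;
        }
    };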

At the first level, we perform Ziv–Lempel factoring of all sequences from the collection S. This means that for each sequence Sk from S we produce a sequence Lk composed of tuples (the first symbol of a tuple, denoted as fx, where x indicates the type of the tuple, is an identifier, as will become clear later in the text). To this end, we start from i = 1 and look for the longest substring starting at position i of Sk that is also present in R. Since the search structure HTR contains substrings of length h1m, it is not possible to find shorter matches. There are two possibilities here:

  • No match of length at least h1m is found. Then, we append a tuple describing the single symbol sk,i, i.e., ⟨fliteral, sk,i⟩, to Lk and update the current sequence position: i ← i + 1.

  • Otherwise, we have a match of length j − i + 1. We encode it by appending the tuple ⟨fmatch_1st_lev, match position in R, match length⟩ to Lk. Then, we update the current sequence position: i ← j + 1.

There is, however, an exception to the general rule that no match shorter than h1m symbols can be found. Genomic sequences often differ by single nucleotide polymorphisms (SNPs) or short indels (insertions or deletions a few symbols long). Thus, when some match is found, before looking for another match in R using the hash table HTR, we perform 3 (or 5, depending on the user-specified option) simple verifications. We check whether the next symbol(s) after the current match form just a single nucleotide mutation or a single-symbol (or double-symbol) indel. We allow matches found after such a variation to be of length h1e (equal to 4 by default). The rationale for this decision is twofold. Firstly, it speeds up the search, as for the verification we do not need to query the hash table HTR. Secondly, such matches (even if they are short) can be encoded quite efficiently, as the match position is easy to predict (encoding of the Ziv–Lempel parsing results is described below). Thus, even though the sequence Lk becomes longer when such short matches are allowed, the final compression ratio can be better. A simplified sketch of the first-level factoring is given below.
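
A minimal sketch of this first-level factoring (assuming the ReferenceIndex listing above) could look as follows; the greedy choice of the longest candidate and the omission of the SNP/indel extension are simplifications, and the tuple layout is ours, not the exact GDC 2 encoding.

    #include <cstdint>
    #include <string>
    #include <vector>
    // assumes ReferenceIndex from the previous listing

    // First-level tuple: either a literal (one symbol) or a match in R.
    struct L1Tuple {
        enum Type { LITERAL, MATCH } type;
        char symbol;        // valid for LITERAL
        uint32_t pos;       // match position in R, valid for MATCH
        uint32_t len;       // match length, valid for MATCH
    };

    // Greedy LZSS-style factoring of S against the reference R indexed by idx.
    // The SNP/indel extension that allows short (h1e-long) matches right after
    // a previous match is omitted here for brevity.
    std::vector<L1Tuple> factor_first_level(const std::string& S,
                                            const ReferenceIndex& idx) {
        std::vector<L1Tuple> L;
        const size_t h1m = idx.k;                      // minimal match length
        size_t i = 0;
        while (i < S.size()) {
            uint32_t best_pos = 0, best_len = 0;
            if (i + h1m <= S.size())
                for (uint32_t p : idx.find(S, i)) {    // candidate anchors in R
                    uint32_t len = 0;
                    while (i + len < S.size() && p + len < idx.R.size() &&
                           S[i + len] == idx.R[p + len])
                        ++len;                         // extend the match
                    if (len > best_len) { best_len = len; best_pos = p; }
                }
            if (best_len >= h1m) {
                L.push_back({L1Tuple::MATCH, 0, best_pos, best_len});
                i += best_len;
            } else {
                L.push_back({L1Tuple::LITERAL, S[i], 0, 0});
                ++i;
            }
        }
        return L;
    }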

At the second level, the algorithm performs a similar Ziv–Lempel factoring of the collection L = {L1, L2, ..., Ln} to obtain the collection D = {D1, D2, ..., Dn}. We use here a notation similar to that for the sequences of S, i.e., Lk,i denotes the ith tuple of sequence Lk and Lk,i..j denotes the substring of tuples Lk,i Lk,i+1 ... Lk,j. Additionally, we define the weight of a substring of tuples as the sum of the weights of the tuples it is composed of, where the weight of a literal tuple is 1 and the weight of a match tuple is 7 (values chosen experimentally). A search structure HTL (namely, a hash table with linear probing) is used here to look for matches in L. At the beginning HTL is empty, but we update it by adding the already processed sequences of L, i.e., when processing Lk, the hash table HTL contains all substrings of tuples of weight “close” to h2 = 11 of L1, L2,…, Lk−1. (For each position i in a tuple sequence Lu we take the shortest substring (in terms of the number of tuples) of weight not smaller than h2.) The substrings of tuples of Lk are added to the hash table after Lk is processed.

Now, when we process Lk starting from i = 1 to obtain Dk, we look for the match of the largest weight. There are two possible situations here:

  • No match of weight at least h2 is found. In this case we append the tuple Lk,i (describing a first-level literal or a first-level match) to Dk and update the current sequence position: i ← i + 1.

  • A match is found. In this case we append the tuple ⟨fmatch_2nd_lev, id of the matching sequence Lu, match position in Lu, match length⟩ to Dk and update the current sequence position: i ← j + 1.

The sequence Dk is thus composed of tuples of three kinds: first-level literals (pairs), first-level matches (triples) and second-level matches (quadruples). Since the search structure HTL is empty when processing L1, we have D1 = L1. A sketch of the second-level factoring is given below.
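
The following sketch illustrates the three tuple kinds of Dk and a weight-driven second-level factoring. It is only a simplified, brute-force stand-in for the real HTL-based search (matches are found by scanning all previously processed Lu), and the tuple layout mirrors the earlier listing rather than the actual GDC 2 representation.

    #include <cstdint>
    #include <vector>
    // assumes L1Tuple from the previous listing

    // Second-level tuple: either a copy of a first-level tuple or a match
    // inside an already processed tuple sequence Lu.
    struct L2Tuple {
        enum Type { L1_LITERAL, L1_MATCH, L2_MATCH } type;
        L1Tuple l1;         // valid for L1_LITERAL / L1_MATCH
        uint32_t seq_id;    // L2_MATCH: which Lu the match points into
        uint32_t pos;       // L2_MATCH: starting tuple index in Lu
        uint32_t len;       // L2_MATCH: number of copied tuples
    };

    // Weights chosen experimentally: literal tuple = 1, match tuple = 7.
    inline int tuple_weight(const L1Tuple& t) {
        return t.type == L1Tuple::LITERAL ? 1 : 7;
    }

    static bool same(const L1Tuple& a, const L1Tuple& b) {
        if (a.type != b.type) return false;
        if (a.type == L1Tuple::LITERAL) return a.symbol == b.symbol;
        return a.pos == b.pos && a.len == b.len;
    }

    struct L2Match { uint32_t seq_id, pos, len, weight; };

    // Brute-force stand-in for the HTL lookup: the heaviest run of equal
    // tuples starting at L[i] inside any previously processed Lu.
    static L2Match find_best_match(const std::vector<std::vector<L1Tuple>>& prev,
                                   const std::vector<L1Tuple>& L, size_t i) {
        L2Match best{0, 0, 0, 0};
        for (uint32_t u = 0; u < prev.size(); ++u)
            for (uint32_t p = 0; p < prev[u].size(); ++p) {
                uint32_t len = 0, w = 0;
                while (i + len < L.size() && p + len < prev[u].size() &&
                       same(L[i + len], prev[u][p + len])) {
                    w += tuple_weight(L[i + len]);
                    ++len;
                }
                if (w > best.weight) best = {u, p, len, w};
            }
        return best;
    }

    // Factors Lk (here: L) against L1..Lk-1 (here: prev) into Dk.
    std::vector<L2Tuple> factor_second_level(
            const std::vector<std::vector<L1Tuple>>& prev,
            const std::vector<L1Tuple>& L, uint32_t h2) {
        std::vector<L2Tuple> D;
        size_t i = 0;
        while (i < L.size()) {
            L2Match m = find_best_match(prev, L, i);
            if (m.weight >= h2) {
                L2Tuple t{};
                t.type = L2Tuple::L2_MATCH;
                t.seq_id = m.seq_id; t.pos = m.pos; t.len = m.len;
                D.push_back(t);
                i += m.len;                 // skip the copied tuples
            } else {
                L2Tuple t{};
                t.type = (L[i].type == L1Tuple::LITERAL) ? L2Tuple::L1_LITERAL
                                                         : L2Tuple::L1_MATCH;
                t.l1 = L[i];
                D.push_back(t);
                ++i;
            }
        }
        return D;
    }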

The reason for using two-level Ziv–Lempel factoring is that the genome sequences are usually highly similar, so in the whole collection the same series of matches and literals between the current sequence and the reference sequence can often be found. Thus, instead of storing such a series of tuples many times, it is beneficial to encode it once and only refer to it for the other sequences. Figure 1 shows how the two-level factoring is performed.

Figure 1

Example of first- and second-level factoring in GDC 2 algorithm, where: h1m = 3, h1e = 2, h2 = 3, weight of a literal tuple is 1 and weight of a match tuple is 2.

Blue and green colors are used only to distinguish between adjacent first-level matches. The red underline indicates the second-level matches. Abbreviations used: L1L — fliteral, L1M — fmatch_1st_lev, L2M — fmatch_2nd_lev.

The collection D is a succinct representation of the input collection S. Nevertheless, it can be compressed even further if we use an arithmetic coder31. Importantly, instead of encoding the tuples as they are, we predict some of their values (e.g., matching positions) and encode only the differences between our predictions and the real values. The successive fields of the tuples are arithmetically encoded as follows.

Flags

There are only 3 different flags distinguishing between the tuple types. We encode them contextually, where the context is composed of the two most recently encoded flags.
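
As an illustration of such contextual coding, the sketch below maintains an order-2 adaptive model for the flags (3 × 3 contexts); the counter initialisation and increment are assumptions, and the range coder that would consume these frequencies is omitted.

    #include <array>
    #include <cstdint>

    // Adaptive order-2 model for the 3 tuple flags (literal, 1st-level match,
    // 2nd-level match). The context is formed by the two most recent flags,
    // so there are 3*3 = 9 contexts, each with its own symbol counters.
    // The counters would be fed to the range coder as cumulative frequencies.
    class FlagModel {
        std::array<std::array<uint32_t, 3>, 9> counts;
        int prev1 = 0, prev2 = 0;                 // two recently encoded flags
    public:
        FlagModel() {
            for (auto& ctx : counts) ctx.fill(1); // start with uniform counts
        }
        int context() const { return prev2 * 3 + prev1; }

        // Frequencies the entropy coder would use for the next flag.
        const std::array<uint32_t, 3>& frequencies() const {
            return counts[context()];
        }

        // Update after a flag has been (de)coded; keeps the encoder and
        // decoder models in sync.
        void update(int flag) {
            counts[context()][flag] += 16;        // assumed increment
            prev2 = prev1;
            prev1 = flag;
        }
    };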

Codes of symbols in the first-level literals

Codes of symbols are encoded contextually, where the context is the recently encoded symbol.

Positions of the first-level matches

These positions can be from a broad range, i.e., between 1 and ℓ(R). Since the genomic sequences are similar, the position of the current match is likely to be close to the position of the previous match increased by the number of symbols encoded in the meantime. Thus, before encoding the position pos we estimate its value expected_pos and encode only the difference relative_pos = expected_pos − pos. The expected_pos is calculated by increasing the most recently encoded pos by: (i) the length of the last match, (ii) the number of literals encoded since the last match, (iii) the number of symbols encoded as second-level matches since the most recent first-level match. Then, the estimation is classified as: perfect (relative_pos = 0), good (relative_pos fits in a single byte), poor (other values). Finally, the estimation type is encoded without a context and the necessary number of bytes (0, 1, or 4) of relative_pos is encoded with the context being the estimation type and the number of the encoded byte. (Please note that various decisions on boundaries between, e.g., classes of estimations and lengths of matches are based on preliminary experiments on parts of the input data; the exact influence of these decisions on the overall compression ratio is not presented.)
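
A possible rendering of this prediction scheme is sketched below; the names and the exact byte threshold used for the “good” class are assumptions, and the interface to the range coder is omitted.

    #include <cstdint>

    // Predicts the position of the next first-level match from the previous
    // one and from the number of symbols emitted since then, so that usually
    // only a small (often zero) difference has to be entropy coded.
    struct L1PositionPredictor {
        int64_t last_pos = 0;       // position in R of the previous 1st-level match
        int64_t symbols_since = 0;  // symbols covered by the last match, literals
                                    // and 2nd-level matches seen since then

        enum Class { PERFECT, GOOD, POOR };

        // Returns the difference to encode and its class; then resets the counter.
        Class predict(int64_t pos, int64_t& relative_pos) {
            int64_t expected_pos = last_pos + symbols_since;
            relative_pos = expected_pos - pos;
            last_pos = pos;
            symbols_since = 0;
            int64_t a = relative_pos < 0 ? -relative_pos : relative_pos;
            if (a == 0) return PERFECT;
            if (a < 256) return GOOD;       // assumed: fits in one byte
            return POOR;                    // 4 bytes needed
        }

        // Called for every encoded tuple to advance the expectation.
        void advance(int64_t covered_symbols) { symbols_since += covered_symbols; }
    };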

Lengths of the first-level matches

Each length is classified as: short (not longer than 2^8 symbols), long (of length between 2^8 and 2^16 + 2^8 symbols), very long (longer than 2^16 + 2^8 symbols). Then, the length type is encoded (without a context). Finally, the necessary number of bytes (1, 2, or 4) of the length is encoded with the context being the length type and the number of the encoded byte.
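
For illustration, a sketch of this length classification is given below; the exact offsets subtracted before splitting the value into bytes are assumptions, and only the classification and byte splitting are shown (the contextual coding of the bytes is omitted).

    #include <cstdint>
    #include <vector>

    // Classifies a first-level match length and splits it into the bytes that
    // are then coded contextually (context = class and byte index).
    enum class LenClass { SHORT, LONG, VERY_LONG };

    LenClass classify_l1_length(uint32_t len, std::vector<uint8_t>& bytes) {
        bytes.clear();
        if (len <= (1u << 8)) {                          // short: 1 byte
            bytes.push_back((uint8_t)(len - 1));
            return LenClass::SHORT;
        }
        if (len <= (1u << 16) + (1u << 8)) {             // long: 2 bytes
            uint32_t v = len - (1u << 8) - 1;
            bytes.push_back((uint8_t)(v & 0xFF));
            bytes.push_back((uint8_t)(v >> 8));
            return LenClass::LONG;
        }
        uint32_t v = len - (1u << 16) - (1u << 8) - 1;   // very long: 4 bytes
        for (int b = 0; b < 4; ++b) bytes.push_back((uint8_t)(v >> (8 * b)));
        return LenClass::VERY_LONG;
    }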

Sequence ids of the second-level matches

The value id is split into two integers: a prefix and a suffix. The prefix is encoded without a context. The context for the suffix is the prefix.

Positions of the second-level matches

Similarly to the positions of the first-level matches, these values can be from a broad range. Thus, instead of encoding them as they are, we estimate the position and encode only the difference. Let us assume the current sequence is Lk. We need an auxiliary array A[1..k] to make the estimations possible. We first discuss how A is maintained when processing Lk. Then, we show how it is used to estimate the positions of the second-level matches.

Let us assume that we have a match in the sequence Lu. After encoding it, we store in A[u] the pair ⟨pA, sA⟩, where pA is the match position in Lu and sA is the number of symbols of Sk processed before the current match.

The encoding of the match positions is then made as follows. For a match in the sequence Lu we calculate the difference d between the current position in Sk and the position sA stored in A[u]. Then, we advance the position pA (stored in A[u]) in Lu as long as the number of symbols covered by the first-level literals and matches is not larger than d. What we obtain is the expected position in Lu of the current match.

Then, we calculate the difference between this expectation and the value stored in the current tuple. The estimations are classified as: perfect (difference is 0), good (absolute value of the difference between 1 and 16), moderate (absolute value of the difference between 16 and 256) and poor (other values). Finally, the estimation type is encoded without a context and the necessary number of bytes of the difference is encoded with the context being the estimation type and the number of the encoded byte.
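
One possible rendering of the array A and of this estimation procedure is sketched below (assuming the L1Tuple type from the earlier listing); the way pA is advanced over the tuples of Lu is a simplification of the real implementation.

    #include <cstdint>
    #include <vector>
    // assumes L1Tuple from the earlier listing

    // Per-reference bookkeeping for 2nd-level match positions: for every already
    // used reference sequence Lu we remember where the previous 2nd-level match
    // into Lu was placed (pA) and how many symbols of the current sequence had
    // been processed at that moment (sA).
    struct RefState { int64_t pA = 0; int64_t sA = 0; };

    struct L2PositionEstimator {
        std::vector<RefState> A;                       // one entry per Lu
        const std::vector<std::vector<L1Tuple>>& refs; // L1..Lk-1

        explicit L2PositionEstimator(const std::vector<std::vector<L1Tuple>>& r)
            : A(r.size()), refs(r) {}

        // Estimate the position in Lu of a match that starts after
        // `symbols_processed` symbols of the current sequence Sk.
        int64_t expected_pos(uint32_t u, int64_t symbols_processed) const {
            const RefState& st = A[u];
            int64_t d = symbols_processed - st.sA;   // symbols seen since last match
            int64_t p = st.pA;
            int64_t covered = 0;
            // Advance p while the symbols covered by the 1st-level tuples of Lu
            // do not exceed d.
            while (p < (int64_t)refs[u].size()) {
                const L1Tuple& t = refs[u][p];
                int64_t c = (t.type == L1Tuple::LITERAL) ? 1 : t.len;
                if (covered + c > d) break;
                covered += c;
                ++p;
            }
            return p;
        }

        // Record the actual match so that the next estimation for Lu starts here.
        void update(uint32_t u, int64_t pos_in_Lu, int64_t symbols_processed) {
            A[u].pA = pos_in_Lu;
            A[u].sA = symbols_processed;
        }
    };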

Lengths of the second-level matches

The lengths are classified according to their value as: short (not longer than 2^4 tuples), medium-sized (between 2^4 and 2^5 + 2^4 tuples), long (between 2^5 + 2^4 and 2^7 + 2^5 + 2^4 tuples), very long (between 2^7 + 2^5 + 2^4 and 2^8 + 2^7 + 2^5 + 2^4 tuples), extremely long (the rest). Then, the length type is encoded (without a context). Finally, the necessary number of bytes is encoded, where the context is the length type and additionally (for extremely long lengths) also the number of the encoded byte.

Decompression algorithm

Decompression is straightforward. At the beginning, the collection D is obtained by arithmetically decoding the compressed file. Then, the collection L is decoded. Finally, the sequences of S are constructed from L and R.

Access to a single compressed sequence

A drawback of the proposed algorithm is that to decompress Sm we need to decompress (at least at the second level) all preceding sequences. More precisely, to obtain Sm we need to have L1, L2,…, Lm−1, as they must be known to obtain Lm. Then, we can obtain Sm from Lm and R. This can be important especially when m is large. To partially solve this problem, we implemented a variant of the compression algorithm in which the user can set (during compression) the fraction of the sequences that can be used as second-level references. Thus, when this parameter is, e.g., 30%, in the worst case only 30% of L must be decompressed. This deteriorates the compression ratio, so it is a compromise rather than a perfect solution.

Real implementation

To increase the speed of compression and decompression, we designed the compressor in a multithreaded fashion. There are several (user-defined) threads performing the first-level compression (and decompression) and a single thread performing the second-level compression (and decompression). For example, during compression each of the first-level threads reads a sequence Sk from a queue of sequences to compress and performs the Ziv–Lempel factoring of Sk according to R. The result Lk is stored in an in-memory queue Q. The second-level-compression thread reads sequences Lk from Q, performs the Ziv–Lempel factoring of each of them according to the already processed sequences of L, obtaining Dk, and finally performs the entropy coding of Dk. (We use a popular and fast arithmetic coding variant by Schindler, also known as a range coder (http://www.compressconsult.com/rangecoder/).) The queue Q has FIFO (first in, first out) organization, so there is no guarantee in which order the sequences of S will be processed (it depends on the processing time of the sequences by the first-level threads). Thus, the compression ratios can slightly differ between executions of the algorithm. A condensed sketch of such a pipeline is shown below.
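
The sketch below shows how such a producer-consumer pipeline can be organised with C++11 threads; it is a schematic skeleton, not the GDC 2 source code, and the first- and second-level stages are reduced to placeholders.

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>
    #include <utility>
    #include <vector>

    // Minimal thread-safe FIFO used to hand over first-level factored sequences
    // to the single second-level / entropy-coding thread.
    template <typename T>
    class SharedQueue {
        std::queue<T> q;
        std::mutex m;
        std::condition_variable cv;
        bool closed = false;
    public:
        void push(T item) {
            { std::lock_guard<std::mutex> lk(m); q.push(std::move(item)); }
            cv.notify_one();
        }
        bool pop(T& item) {             // returns false when the queue is drained
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [this] { return !q.empty() || closed; });
            if (q.empty()) return false;
            item = std::move(q.front());
            q.pop();
            return true;
        }
        void close() {
            { std::lock_guard<std::mutex> lk(m); closed = true; }
            cv.notify_all();
        }
    };

    // Skeleton of the pipeline: n_workers threads run the first-level factoring,
    // one thread consumes the results in arrival (FIFO) order.
    void compress_collection(const std::vector<std::string>& sequences,
                             int n_workers) {
        SharedQueue<std::string> input, factored;
        for (const auto& s : sequences) input.push(s);
        input.close();

        std::vector<std::thread> workers;
        for (int t = 0; t < n_workers; ++t)
            workers.emplace_back([&] {
                std::string s;
                while (input.pop(s))
                    factored.push("L(" + s + ")");  // stand-in for 1st-level factoring
            });

        std::thread second_level([&] {
            std::string l;
            while (factored.pop(l)) {
                // stand-in for 2nd-level factoring + range coding of l
            }
        });

        for (auto& w : workers) w.join();
        factored.close();
        second_level.join();
    }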

The parallel design of the decompression algorithm is similar.

The compression output is composed of three files. The one with the extension gdc2_desc stores file names, sequence sizes and ids of the multi-FASTA sequences. It is small, but to provide the best possible compression ratio of the whole algorithm, it is compressed using the popular zlib library. The file with the extension gdc2_rc contains the compressed representation of the collection D. Finally, the file with the extension gdc2_ref stores the compressed reference sequence R. As it is not a part of the collection to be compressed, its size is not counted in the experimental results. Nevertheless, we decided to compress it for situations in which the user is interested in storing both the reference R and the collection S in a single place in a compact form. This file is compressed by gathering symbols into triples and encoding them arithmetically.

Relation of the proposed compressor to the existing works

The proposed compressor bears some similarities to existing works. The main concept of two-level Ziv–Lempel factoring is an extension of what was done in FRESCO21. There are, however, significant differences between these two approaches. FRESCO uses LZ77 factoring28, in which the sequences are divided into triples (match position, match length, next symbol after the match), while GDC 2 uses LZSS factoring32 (which encodes the sequence as a list of matches and literals, as described in the previous section). Moreover, FRESCO looks for longer matches, while GDC 2 allows very short matches if they are close to the previous matches. These different approaches have an impact on the next stage of the compression, as in FRESCO there are fewer factors to be entropy coded, but each needs more bits. The concept of looking for short matches after some longer ones is an extension of what was done in our previous work19. In GDC 2, however, we permit not only single-letter mismatches, but also short indels. We also do not limit the number of short matches in a series.

In FRESCO, the collection of sequences is split into two sets: additional references and the remaining sequences. The additional references are compressed only according to the main reference sequence. The remaining sequences are compressed only according to the main and the additional references. This design decision of FRESCO means that the maximal number of additional reference sequences is, in a way, limited. The reason is that if there are too many additional references, they could occupy a significant amount of space and the gain in better compression of the remaining sequences might not compensate for that. In GDC 2, we do not split the collection into two sets. We simply use all of the already processed sequences as additional references for the current sequence, with a significant boost in the compression ratio.

The most important difference between FRESCO and GDC 2 is, however, in the compression of the tuple fields. FRESCO estimates the positions of the first-level matches in a similar way as GDC 2, but the positions and lengths of the second-level matches are processed as they are, without estimations. All the integers are stored using a specific byte code. The whole resulting byte sequence is then Huffman coded (with no contextual statistics). In GDC 2, the positions of the second-level matches are estimated in a more elaborate way, which allows storing only small values (differences between estimations and real positions). Moreover, the second-level matches are also taken into account in the estimation of the first-level match positions. Finally, GDC 2 uses a better entropy coder and the fields are compressed contextually, to reduce the redundancy even more. The way the tuples are encoded using an arithmetic coder, especially the calculation of the expected positions of the first- and second-level matches, is novel in this context.

GDC 2 and FRESCO differ also in the design of the internal data structures used for indexing sequences, which influences the processing speed.

The multithreaded design of GDC 2 was not used by existing multi-reference genome compressors.

Results

Our compressor, GDC 2, was implemented in the C++11 language using the C++ built-in concurrency mechanisms. The test machine was equipped with an Intel i7-4930K CPU (6 cores, clocked at 3.4 GHz), 64 GB of RAM and two 3 TB HDDs in RAID 0 (measured average read speed of about 350 MB/s).

For the experiments we used two large datasets. The A.thaliana dataset, of total size 94 GB, was obtained from the 1001 GP7 and contains 775 sequences. The H.sapiens dataset, of total size 6670 GB, was obtained from the 1000 GP2 and contains 2184 sequences (from 1092 diploid human genomes). We also used one smaller dataset, H.sapiens alternate assemblies, which contains 11 alternatively assembled (haploid) human genomes. A more detailed description of all datasets (e.g., the chosen reference sequences) is given in the supplementary material.

A comparison of all existing genomic data compressors would be very hard for many reasons. For example, some compressors do not support symbols other than ACGT, some cannot work with such huge data, and some are very slow, so performing complete experiments would take months. Thus, we selected the compressors that proved to be the best (in terms of compression ratio) in previous studies: 7z (a general purpose compressor from the Ziv–Lempel family), RLZ26, GReEn27, ABRC20, GDC-normal19, GDC-ultra19, iDoComp22, FRESCO21. In preliminary experiments (Table 1), we evaluated them on subsets of our datasets to select the candidates for a more complete evaluation. As the results show, the single-reference compressors (RLZ, GReEn, ABRC, GDC-normal, iDoComp) give ratios much smaller than 1000 for H.sapiens chromosomes and smaller than 160 for A.thaliana chromosomes.

Table 1 Compression ratios for subsets of the datasets for various compressors.

The general purpose 7z can be seen as a multi-reference compressor, since it looks for matches between the present sequence and the sequences seen within the preceding 1 GB of data. For H.sapiens Chromosome 21 this means about 20 recently processed sequences; for H.sapiens Chromosome 1, however, it would be only 4 sequences. The true multi-reference compressors, GDC-ultra and FRESCO, give much better ratios for human chromosomes. For FRESCO we set the number of additional reference sequences to 100, as in a preliminary experiment (results not shown) this led to better compression ratios than the value of 70 used in the original paper21.

As a consequence, for further experiments we selected the two best single-reference compressors, i.e., GDC-normal and iDoComp, and the two best multi-reference compressors, i.e., GDC-ultra and FRESCO. The results of the evaluation of the chosen compressors and the proposed GDC 2 are presented in Tables 2 and 3. For the H.sapiens dataset (Table 2) the compression ratio of GDC 2 is about 9500, which is approximately 4 times better than the best of the existing competitors. As mentioned earlier, GDC 2 bears some similarity to FRESCO, so it is interesting to ask which of the changes between these algorithms have the highest influence on the advantage of GDC 2 in the compression ratio. It is hard to answer precisely, as the algorithms differ in many details and the improvements implemented in GDC 2 are not independent. Nevertheless, from the presented results and other preliminary tests, it seems that the two most important factors are: careful estimation of the positions of the second-level matches and allowing more reference sequences (especially, using all the already processed sequences as references for the current one). Most of the remaining advantage is probably due to allowing short matches if they are close to the previous ones and the contextual encoding of integers.

Table 2 Compression ratios for H.sapiens dataset.
Table 3 Compression ratios for A.thaliana dataset.

In compression, the fastest is GDC 2, which works at a speed of about 200 MB/s. Measuring the decompression speed is problematic, as some of the compressors work faster than the disk speed (~350 MB/s), which in practice is more than sufficient. Nevertheless, we were interested in the true decompression speed of the GDC 2 algorithm, so we measured it with the output redirected to /dev/null (i.e., the sequences were decompressed but not stored), obtaining about 1000 MB/s.

The experiment for the A.thaliana dataset (Table 3) shows that the compression ratios are much worse. The best ratio, almost 600, was obtained by GDC 2. This result is approximately 2.4 times better than the second best, GDC-ultra. The compression speeds are also worse here.

We also experimented with the H.sapiens alternate assemblies dataset (Table 4). In contrast to the previous datasets, here the sequences are much more diversified. The collection is also much smaller, as it contains only 11 individuals. Now the best compression ratios were obtained by the GDC-ultra algorithm. Its advantage over GDC 2 is, however, rather small. Moreover, GDC-ultra is more than 30 times slower. It is also important to stress that GDC-ultra scales poorly, as its maximal number of reference sequences is 40, so for larger collections GDC 2 should win in compression ratio. The fastest algorithm here was iDoComp. When running FRESCO, we selected one-level compression here, because the collection was so small (or too divergent; it is hard to give a definite answer due to the small number of individuals) that the two-level approach gave worse results.

Table 4 Compression ratios for H.sapiens alternate assemblies dataset.

It is also interesting to compare the compression ratios with what is possible when much more knowledge about the data is given (cf. the TGC columns in Tables 2 and 3). Namely, when the input data are given as differences between the sequences and the reference (in VCF format), the best available compressor, TGC, obtained even better ratios. For the human dataset they were about 15,500 on average. When we compare this with the about 9,500 of GDC 2, we see that we are quite close to what is theoretically possible. The results for the A.thaliana dataset are similar: a ~590 ratio for GDC 2 and a ~860 ratio for TGC. It is, however, worth stressing that GDC 2 is able to compress collections of sequences of the same species gathered from various sources (e.g., de novo assembled), when no alignment of them is given, while the TGC input must be provided as aligned sequences described by the variants between them. Therefore, the TGC ratios should be seen only as a hint of what is theoretically possible for the examined collections of genomes, even if it would be extremely hard to obtain such results, and should not be directly compared with the ratios of the rest of the examined compressors.

In the next experiment, we measured the influence of the number of sequences in the input collection on the compression ratio, the compression and decompression speeds and the memory usage. The results for two chromosomes are shown in Fig. 2. As one can see, for the human chromosome the compression ratio is about 8000 for 300 input sequences and increases moderately with a growing number of input sequences. The same phenomenon can be observed for the A.thaliana dataset, but the ratio is about an order of magnitude lower.

Figure 2

Influence of the number of sequences in the input collection on GDC 2: compression ratio (left top), memory usage (right top), compression speed (left bottom), decompression speed (right bottom).

The decompression speed was measured when the output was redirected to /dev/null, i.e., the sequences were decompressed but not stored.

The memory usage of GDC 2 depends mainly on the number of sequences serving as second-level references, as they must be stored (and indexed) in memory during compression. In this experiment all sequences were used as additional references, so the memory consumption grew steadily up to about 6 GB. (The most memory-consuming was the compression of H.sapiens Chromosome 2, for which about 24 GB of RAM was necessary.) The visible stepwise increments of the memory usage are a consequence of the assumed possible hash table sizes (always a power of 2).

The compression and decompression speeds for the human dataset initially grow with the increasing number of sequences and are the highest for collections of about 300–500 sequences. This is correlated with the growing compression ratio. Roughly speaking, the more second-level references, the better the second-level factoring (i.e., longer matches can be found), and so there are significantly less data to process by the arithmetic coder. However, for larger collections, much more data must be analyzed during the second-level factoring, so the compression speed drops. A similar thing happens in decompression. Better second-level factoring means less data to be arithmetically decoded, which increases the speed. Unfortunately, more second-level references mean much more computation for the estimation of the positions of matches, which significantly influences the decompression time for large collections.

In the next experiment, we measured the influence of the number of reference sequences in the second level of GDC 2 on the compression ratio, the (de)compression speeds and the extraction time of a single sequence of a collection. The most important results are presented in Fig. 3 (the complete results are in Supplementary Figure S1). Decreasing the number of second-level references by half results in reduced RAM usage (about half as much RAM is used) and a noticeable speed-up of compression (24% for the H.sapiens dataset and 17% for the A.thaliana dataset) at the cost of some decrease in compression ratio (26% and 14%, respectively). Using even fewer sequences in the second level of GDC 2 also leads to significant gains in the speed of decompression of the complete collection or of a single sequence, obviously at the cost of a decreased compression ratio. When 10% of the sequences were used, the average single-sequence access time decreased from 53 to 31 seconds for the H.sapiens dataset (at the cost of a 2.85 times worse compression ratio) and from 63 to 21 seconds for the A.thaliana dataset (at the cost of a 1.79 times worse compression ratio).

Figure 3

Influence of the percentage of second-level references on the GDC 2 compression ratio (left) and the decompression (access) time of a single sequence (right).

In both cases the input dataset contained all available sequences.

GDC 2 is implemented in a multithreaded fashion, so it is natural to ask how its speed scales when the number of threads is increased. By default, GDC 2 uses 4 threads: 3 for the first-level Ziv–Lempel factoring and 1 for the second-level factoring and arithmetic coding. The results presented in Fig. 4 show that a value of 3 or 4 seems to be the best choice for the used datasets and the test machine. The speed is limited by the disk speed or (for fast disks) by the single second-level compressing thread. This suggests that splitting this thread into two, e.g., one performing the Ziv–Lempel factoring and the other performing the arithmetic compression, would increase the total performance of GDC 2. Nevertheless, since the absolute values of the compression speeds are high, we refrained from doing so in the present version of the software.

Figure 4

Influence of the number of threads used by GDC 2 algorithm on compression and decompression speeds.

Discussion

We proposed a new algorithm for the compression of collections of complete genome sequences. The evaluation shows that its compression ratios are roughly 4 times better than those of the best existing competitors. Moreover, it is very fast, as the compression speed for the H.sapiens dataset is about 200 MB/s. The decompression speed is limited by the speed of the disk used in the experiments; when we measured this speed without storing the files onto disk, it was about 1000 MB/s. The algorithm is designed primarily to compress and decompress a large collection of genomes efficiently all at once. However, it also performs well for relatively small, divergent genome sets. Moreover, extraction of a single sequence is also possible. The access time, although not impressive (counted in tens of seconds), can be significantly improved at the cost of some decrease in the overall compression ratio.

Additional Information

How to cite this article: Deorowicz, S. et al. GDC 2: Compression of large collections of genomes. Sci. Rep. 5, 11565; doi: 10.1038/srep11565 (2015).