Introduction

Recent advances in high-throughput sequencing (HTS) have made it possible to sequence genomes of complex organisms in a matter of hours. As the worldwide increase in genomic sequence data generation puts strain on storage and communication systems, new compression methods have been designed to reduce this burden by exploiting redundancy within and across reads extracted from genome sequences. Currently available HTS data compression methods reduce the redundancy within a genomic sequence data set either by (i) assembling reads into long contigs, typically through a de Bruijn graph-based approach (e.g., Quip1, Leon2, k-Path3, and KIC4), or by (ii) aligning the reads to a reference genome (e.g., LW-FQZip5); the reads are then encoded as simple pointers to the reference or the assembled contigs.

Because both sequence mapping and assembly are computationally intensive tasks, all of the above HTS compression methods are typically slow and thus not commonly used, even though they achieve high compression performance. The best compromise between compression rate and running time is typically achieved by HTS data compressors that perform read mapping or assembly only implicitly; these methods first cluster reads that share long substrings, and then independently compress the reads in each cluster after a reordering or implicit assembly. Since the read order within unmapped read files (in FASTA/Q formats) is not important, reordering the reads in a manner that brings similar reads together can significantly boost compression rates while avoiding information loss (e.g., SCALCE6, Orcom7, Mince8, and BEETL9). If the underlying genome is highly repetitive, or if the coverage of the data is high, even general-purpose compressors, such as gzip10 or bzip11, benefit significantly from the improved data locality that results from reordering.

In this paper, we introduce a new reference-free compressed representation for HTS data that is based on a crude yet more explicit de novo assembly of reads for improving compression rates. (Note that compression performance is measured in terms of either the "compression rate", the number of bits in the output per input bit, or the "compression ratio", its inverse.) Our representation encodes a set of input reads as nodes of compact tries. Each edge, from a node v to its parent u, represents the suffix of the read corresponding to v that is not covered by the read corresponding to u. Unique to our representation, the tries are organized not in a top-down but rather in a bottom-up fashion, i.e., each node v has a link only to its parent u and not to its children.

Next, we describe an iterative method called Assembltrie to build a forest of such tries. For each read r, it (i) greedily picks an already processed read r′ for which the overlap between a prefix of r and a suffix of r′ is the maximum possible, and creates a new node v with a link to the node u corresponding to r′ as its parent (if no such r′ exists, it starts a new tree with r only); and (ii) greedily identifies each already processed read r″ that has a longer prefix matching a suffix of r than it has with the read corresponding to its existing parent, and updates the parent of r″ to v.

As a first in HTS data compression, we show that Assembltrie achieves optimality from a combinatorial point of view (on finite size HTS data) as follows. Given a finite HTS read collection to be compressed by an explicit or implicit representation of the input as a subgraph of the standard read overlap graph, where reads are represented through pointers to nodes, Assembltrie is guaranteed to produce the smallest possible representation in terms of the number of symbols used in the subgraph as well as the number of pointers; Assembltrie uses no additional information to maintain the topology of the subgraph. This guarantee covers all algorithms that are constrained with the compressed representation described above, including those methods that represent the reads as pointers to their (string graph based) de novo assembly.

Note that this notion of combinatorial optimality is achieved for read sets that do not have read errors; it also does not extend to compressors that represent the input as a subgraph of a de Bruijn graph (even though in such representations, the number of pointers could be superlinear in the number of reads). Furthermore, our notion of optimality does not imply bitwise optimality on all inputs due to standard information theoretic limitations.

We also provide the first information theoretic lower bound on the compression rate achievable on a collection of reads that are uniformly selected from a reference with known entropy, and demonstrate that Assembltrie closely approaches this ultimate performance limit.

We have evaluated our method on a recent benchmarking data set12 comprised of unmapped read (FASTQ) collections with both deep and shallow coverage from various organisms (H. sapiens, bacteria, plants) and from several sample types (genomic and metagenomic), as well as some simulated and aligned HTS data (i.e., reads extracted from SAM/BAM files, with the positions of the reads discarded). Our method improves the compression rate achieved by the best available software tools, such as Orcom7 or Mince8, by 10–50%, depending on the type of the data set. With respect to running time, Assembltrie is competitive with the reordering-based methods and offers a remarkable gain over the alignment/assembly based methods. Furthermore, its runtime performance is significantly improved through parallelization without sacrificing compression performance.

Results

Assembltrie was tested on a Linux server equipped with 39 10-core Intel® Xeon® E5-2650 v3 2.30 GHz CPUs, 1058 GB of RAM in total, and a Lustre-based file system with 750 TB of disk space. For convenience, the units in this section use decimal multiples, i.e., 1 M = \(10^6\), 1 G = \(10^9\), etc., to represent the number of reads, file sizes, and memory usage.

Benchmarking Assembltrie against other methods

The data set we used to benchmark Assembltrie consists of a number of FASTQ files compiled by the MPEG HTS working group, composed of ~200 GB of human, human microbiome (metagenomic), bacterial, and plant genome sequences; it provides a comprehensive assessment of the relative performance of Assembltrie against all existing FASTQ compression methods (see Table 1). Note that since the read collection from the T. cacao genome is of poor quality with a very high error rate, we also added a simulated set of reads sampled from the T. cacao reference genome (with a uniform i.i.d. error rate of 1%) for comparison. In addition to this data set collection, we have extracted reads from a BAM-formatted human genome data set, NA12878, and converted them to FASTA format. (This file is not a part of the MPEG HTS working group FASTQ/FASTA compression benchmarking data set. Instead, a higher coverage version is a part of the SAM/BAM benchmarking data set. We downsampled it to match the coverage of ERR174310.) Assembltrie is developed for compressing unmapped reads in FASTQ/FASTA format, and a conversion from mapped reads to unmapped read collections is wasteful. However, we used this data set to measure the effectiveness of the strand correction heuristic used by Assembltrie (see “Methods” section), since reads in a SAM/BAM file are already strand corrected.

Table 1 The MPEG HTS benchmarking data set

Compression performance and computational resource utilization

The experimental results, including the compression performance and the computational resources used by our method, are presented in Tables 2 and 3. In addition to Assembltrie, we present the compression performance (on the MPEG HTS compression benchmark data set) of Orcom7, BEETL9, Mince8, and k-Path3, as provided in the MPEG benchmarking study12. Like Assembltrie, all of these compressors are designed to compress only the sequence information in a FASTQ file. Among these methods, we observed that Orcom is much faster than the others; thus, we compared the compression time of Assembltrie against Orcom only.

Table 2 Comparative compression ratios achieved by Assembltrie on the MPEG HTS benchmarking data set
Table 3 Compression times and memory usage of Assembltrie and Orcom

Table 2 presents the compression rates achieved by the above methods on the MPEG benchmark data set. As can be seen, Assembltrie outperforms the alternatives on all samples, typically providing an improvement of 10–35% in compression ratio over the best alternative (usually Orcom or Mince), with the exception of the SRR870667 data set. This data set from the plant T. cacao is comprised of low-quality reads: in fact, more than 40% of the reads do not share an overlap greater than L/5 = 21 with any other read (here L = 108), even when the number of mismatches allowed is 4. In contrast, more than 90% of the reads from the simulated T. cacao data (with similar coverage and depth, but with a 1% error rate) get “assembled” (placed in the cycle-rooted tries).

Note that Assembltrie performs especially well on low-coverage samples. In fact, as illustrated in Fig. 1, the compression gap between Assembltrie and Orcom increases as the coverage decreases. Table 2 also demonstrates the effectiveness of our greedy strand correction heuristic (see “Methods” section for details). As can be seen, Assembltrie can gain as much as 20% in compression ratio by applying this simple greedy heuristic (which works especially well on low-coverage data); however, the heuristic cannot determine the correct strand orientation for many of the reads. To demonstrate the gap between what our greedy heuristic achieves and how Assembltrie performs on data in which all reads are correctly oriented, we give results on a read collection we extracted from a human genome BAM file, NA12878, where the reads are already strand corrected. We extracted reads from this BAM file while ensuring that the resulting data set would have coverage and read length similar to those of ERR174310. The two human data sets are compressed equally well by Orcom. Interestingly, there is a large gap between the compression rates achieved by Assembltrie on these files. Orcom’s performance is comparable to that of Assembltrie on the non-strand-corrected file when the heuristic is not used. The use of the heuristic improves Assembltrie’s results by ~20%. On top of this, Assembltrie achieves another 22% improvement on data that is already strand corrected. In other words, Assembltrie is likely to provide significantly improved compression rates with better strand correction methods. (It is possible to perform mapping for strand correction, but that would add burden to the running time.)

Fig. 1

Compression performance of Assembltrie as a function of read coverage. The compression performance (with 8 threads) of Assembltrie in comparison to Orcom on downsampled read collections from SRR327342, as a function of coverage

Table 3 presents the running time and memory usage of Assembltrie in comparison to Orcom. Note that Assembltrie is slower than Orcom with the default read overlap length K = L/5, but with increasing K it achieves a similar running time (with the exception of the single problematic read collection, SRR870667, where lower compressibility results in a longer search time for the placement of each read). Further, note that the runtime and RAM usage of Assembltrie are affected by the use of its strand correction heuristic. In terms of memory usage, Orcom has an advantage, since Assembltrie must maintain all of the processed reads as well as the prefix and suffix hash tables, necessitating Ω(NL) memory.

Finally, Table 4 presents the running time and memory usage of Assembltrie vs Orcom for decompression. Assembltrie is consistently faster than Orcom, with or without strand correction, sometimes significantly so (e.g., for the SRR870667 data set, Assembltrie’s decompression is 3.5 times faster). The memory usage of Orcom, however, is better, as explained above.

Table 4 Decompression times and memory usage of Assembltrie and Orcom

Assembltrie performance vs our entropy approximation

We have also compared Assembltrie’s performance with our entropy approximation for a given collection of HTS reads (see “Methods” section for details). For this, we generated six simulated data sets, where the reads were obtained by randomly sampling the E. coli K-12 DH10B genome (length 4.69 Mbases), varying the read length, coverage, and error rate (the probability that each base is mutated). The number of reads (in millions), read length, coverage, and error rate in each of these data sets are (respectively): DH10B 1: 1.56 M, 120 bp, 40x, 0.30%; DH10B 2: 1.88 M, 100 bp, 40x, 0.35%; DH10B 3: 2.34 M, 80 bp, 40x, 0.27%; DH10B 4: 1.17 M, 100 bp, 25x, 0.35%; DH10B 5: 3.75 M, 100 bp, 80x, 0.35%; and DH10B 6: 1.88 M, 100 bp, 40x, 0.10%.

Table 5 presents the performance of Assembltrie on these simulated data sets. The entropy rate (in bits per base) \(H({\cal R})\) of each read collection \({\cal R}\) is calculated according to Eq. (1) (see “Methods” section). Clearly, Assembltrie is the only software tool we tested that produces compression results that come close to \(H({\cal R})\); there is a large gap between our entropy approximation and even the compression results of Orcom, the best performing alternative. In fact, Assembltrie outperforms Orcom by a factor of 2.1 to 2.6 on these data sets.

Table 5 Assembltrie’s compression performance is comparable to our entropy approximation for E. coli read collection

In addition to the data sets above, we have also experimented with simulated read collections sampled from the same E. coli strain, this time with fixed coverage (1.8 M reads altogether) and read length (101 bp) but varying error rate. Figure 2 demonstrates how Assembltrie’s performance varies on this data set as a function of the error rate (here, only substitutions are considered as errors, since they are the major form of sequencing errors in Illumina data). As can be expected, the gap (shown in Fig. 2) between the maximum achievable compression ratio and what is obtained by Assembltrie grows as the error rate increases. This gap is primarily due to Assembltrie’s requirement of an exact K-mer match during the process of identifying potential parents/children for each read. As the error rate approaches 0, the gap between Assembltrie’s performance and our entropy approximation diminishes.

Fig. 2

Compression performance of Assembltrie as a function of error rate. The compression ratio achieved by Assembltrie is close to our information theoretic approximation on simulated reads from the E. coli K-12 DH10B genome (the data set involves 1.8 M “reads,” i.e., substrings of length L = 101, sampled uniformly with simulated errors)

Parameter selection

As mentioned earlier, the running time of Assembltrie is primarily determined by the minimum overlap length K: larger values of K result in shorter running times. Unfortunately, larger K values may result in missed overlaps, potentially sacrificing overall compression performance. Figure 3 depicts the tradeoff between Assembltrie’s running time and compression performance for varying values of K. On read data sets from a large (e.g., human) genome, choosing K ≥ L/4 provides a good tradeoff between running time and compression performance.

Fig. 3

The tradeoff between the running time and compression performance of Assembltrie. The tradeoff between Assembltrie’s running time and compression rate as a function of the user-set parameter K (initial match length) on the data set ERR174310. The blue point depicts the performance of Orcom on the same data set with default parameters

Discussion

As demonstrated above, Assembltrie is a high-performance sequence compression tool for large genomic read collections, capable of significantly improving the compression ratio achieved by all available methods. It may be possible to further improve its performance through better strand correction heuristics; however, this should not come at the expense of running time. Assembltrie’s main contribution is in providing improved compression without sacrificing running time; its memory usage, however, is relatively high. It may be possible to trade running time or compression ratio for memory by limiting the branching factor in the compact tries it builds. In fact, limiting the branching factor to 1 would result in contigs, which would not only avoid the use of pointers and improve memory usage, but also improve the running time; this, however, is likely to result in poorer compression performance.

Methods

Problem definition

The main goal of Assembltrie is to achieve lossless compression of a collection (multiset) of fixed-length genomic reads, i.e., strings from the four-letter DNA alphabet (possibly with the letter N to represent unknown nucleotides), denoted by \({\cal R}\) (see the literature13,14,15,16,17,18 for discussions on the general problem of lossless multiset compression). Since the reads form a multiset, and because the read locations are arbitrary, we do not maintain the order of the reads.

Assembltrie is based on a light assembly of reads in the sense that, instead of assembling the reads into independent contigs, they are organized in a more compact data structure that we call a read forest. Specifically, a read forest \({\cal T} = (V,E,w,{\mathrm{st}})\) is a directed graph, where each node \(v \in V\) corresponds to a specific read \(r \in {\cal R}\) associated with a string st(v) of fixed length \(L = \left| {{\mathrm{st}}(v)} \right|\). Each connected component of \({\cal T}\) can be thought of as a trie-like connected subgraph \(T = (V_T, E_T, w, {\mathrm{st}})\) (\(V_T \subseteq V\), \(E_T \subseteq E\)), such that for each node v in \(V_T\), there is exactly one outgoing directed edge, say (v, u), where \(u \in V_T\) is said to be the parent of v and is denoted π(v) = u (it is possible that π(v) = NIL, indicating that v is a root). Each edge \((v,u) \in E_T\) has weight w(v, u), the length of the shortest suffix of st(v) that cannot be covered (i.e., matched exactly) by a suffix of st(u). In other words, w(v, u) = L − l implies that st(v)[1:l] = st(u)[L − l + 1:L], where l is the length of the suffix–prefix overlap between st(v) and st(u), and s[i:j] denotes the substring from index i to j in string s. Since T is comprised of nodes with a single outgoing edge, it forms a graph with at most one cycle. Note that each edge of T represents a string; hence, we can think of T as a trie that has either a single node or a cycle (instead of a single node) acting as the root. As a result, the set \({\cal T}\) of trie-like graphs T (we will call them cycle-rooted tries; an example is given in Fig. 4) forms a forest, which can be thought of as a subgraph of the standard read-overlap graph (one of the two basic frameworks commonly used in genome assembly; see ref. 19 for a definition) formed by the input reads.
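As an illustration of this weight definition, the following minimal Python sketch (our own illustrative code, not Assembltrie's implementation; the name edge_weight and the parameter k_min are hypothetical) computes w(v, u) for a pair of equal-length reads by scanning for the longest admissible suffix–prefix overlap.

```python
# A minimal sketch of the edge-weight definition; illustrative only.

def edge_weight(st_v: str, st_u: str, k_min: int = 1) -> int:
    """w(v, u) = L - l for the longest overlap l >= k_min such that a
    prefix of st_v equals a suffix of st_u; returns L if none exists."""
    L = len(st_v)
    assert len(st_u) == L
    for l in range(L - 1, k_min - 1, -1):  # try the longest overlap first
        if st_v[:l] == st_u[L - l:]:       # prefix of v == suffix of u
            return L - l
    return L                               # v must store all L symbols

# Example: for v = "CGTAC" and u = "AACGT", the overlap is "CGT" (l = 3),
# so the edge from v to u stores only the 2 uncovered symbols "AC".
assert edge_weight("CGTAC", "AACGT") == 2
```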

Fig. 4

Cycle-rooted trie. An example cycle-rooted trie constructed from a collection of reads of length 5, with a cycle (in the center) constituting the root

Fig. 5

The read forest construction process of Assembltrie. To place each read in a cycle-rooted trie, Assembltrie greedily identifies its parent and children using prefix and suffix K-mer hash tables. The initial K-mer matches (i.e., hash table hits) are extended, allowing a maximum number \({\it{\epsilon }}\) of mismatches. As a user option, the greedy strand synchronization heuristic picks, for each read, the strand that has the longest prefix–suffix overlap with the implied parent

Unique to our representation, a cycle-rooted trie T is arranged not in a top-down but rather in a bottom-up fashion, since each node v has a link only to its parent π(v). Thus, the construction of a read forest \({\cal T}\) simply constitutes the computation of the parent node π(v) for each node v. Since the compressed representation of the input \({\cal R}\) is simply an efficient encoding of \({\cal T}\), the objective of Assembltrie is to construct a read forest \({\cal T}^ \ast\) that contains the minimum number of symbols, i.e., in which the sum of edge weights is the minimum possible.

Given a read forest \({\cal T}\), let its total weight be the sum of the weights of its edges, i.e.,

$$\left| {\cal T} \right| = \mathop {\sum }\limits_{T \in {\cal T}} \mathop {\sum }\limits_{v \in V_T} w(v,\pi (v)),$$

where we can assume that w(r, NIL) = L for each root node r. In the remainder of the paper, we will consider read forests \({\cal T}_K\) in which the minimum overlap length between any string st(v) and st(π(v)) is a user-specified K (see Fig. 3 for details). As we will show, Assembltrie produces the read forest with minimum total weight, i.e., \({\cal T}^ \ast = {\mathrm{argmin}}\left| {{\cal T}_K} \right|\), for any value of K. Compared with the shortest superstring \({\cal S}\), an alternative representation of the input set \({\cal R}\) of reads, \({\cal T}^ \ast\) is guaranteed to contain at most as many symbols (and possibly fewer), while maintaining a single pointer per read.
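As a toy illustration of the total weight (our own example, reusing the hypothetical edge_weight sketch above), three overlapping reads chained into a single trie yield a representation much shorter than the raw 3L symbols:

```python
reads = ["AACGT", "CGTAC", "GTACC"]        # three reads, L = 5
# Chain them into one trie: "AACGT" is the root; "CGTAC" overlaps its
# suffix "CGT"; "GTACC" overlaps the suffix "GTAC" of "CGTAC".
total = 5                                  # w(root, NIL) = L
total += edge_weight("CGTAC", "AACGT")     # stores 2 symbols: "AC"
total += edge_weight("GTACC", "CGTAC")     # stores 1 symbol:  "C"
assert total == 8                          # vs 3L = 15 symbols for raw reads
```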

Constructing read forest

As mentioned above, in order to construct the read forest \({\cal T}_K\), one simply needs to identify the parent node π(v) for each node v. This process is performed iteratively: for each distinct read \(r \in {\cal R}\) it considers, Assembltrie creates a new vertex v that corresponds to r. (Note that the number of occurrences of each specific read can be maintained in a separate data structure.) It then greedily picks an existing node π(v) satisfying two conditions: (i) the overlap between a suffix of st(π(v)) and a prefix of st(v) is maximal among all existing nodes, and (ii) the overlap size is at least K. In case the longest overlap is shorter than K, v forms an independent (cyclic) trie with π(v) = NIL. Next, Assembltrie identifies all children of v, i.e., those nodes u such that the (longest) overlap between a prefix of st(u) and a suffix of st(v) is longer than that between a prefix of st(u) and a suffix of st(π(u)), provided that the overlap length is, again, at least K. For each such node u, its parent π(u) is reset to v.

In order to find the suffix–prefix overlaps, Assembltrie constructs two hash tables, one maintaining the length-K prefix and the other the length-K suffix of each read. Given a node v, Assembltrie searches for its potential parents by sliding a window of length K across st(v) from right to left, identifying each node u for which the length-K suffix of st(u) exactly matches this window. In case this initial match extends to a full match between the prefix of st(v) that includes this window and the corresponding suffix of st(u) of identical length (see the implementation section below for a detailed description of how we extend an initial match), u is declared the parent of v, completing the search. If no such node u can act as a parent of v, the window slides one position to the left and the search resumes. Assembltrie searches for possible children of each vertex v in a similar manner; note that we now need to slide the window from left to right and use the hash table for prefixes. If, for a matching node u, st(u) has an overlap with st(v) that is longer than that with its current parent, i.e., w(u, v) < w(u, π(u)) ≤ L − K, then π(u) is reset to v. After the parent and the children of v are established, the length-K suffix and prefix of st(v) are inserted into the corresponding hash tables. The overall workflow of Assembltrie is depicted in Fig. 5.
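The following Python sketch transcribes this construction under simplifying assumptions (exact overlaps only, no strand correction, single thread); all identifiers are ours and do not correspond to Assembltrie's actual code.

```python
from collections import defaultdict

def build_read_forest(reads, K):
    """Greedy read-forest construction sketch: returns parent pointers
    and edge weights; sum(weight.values()) is the total weight |T_K|."""
    L = len(reads[0])
    pref, suff = defaultdict(list), defaultdict(list)  # K-mer -> node ids
    parent, weight = {}, {}

    for v, r in enumerate(reads):
        parent[v], weight[v] = None, L    # start as a root of its own trie
        # (i) Parent search: slide a K-window right to left over r; a hit
        # u in the suffix table is a parent if the initial match extends
        # to the whole prefix r[:i+K].
        for i in range(L - K, -1, -1):
            u = next((u for u in suff[r[i:i + K]]
                      if reads[u].endswith(r[:i + K])), None)
            if u is not None:
                parent[v], weight[v] = u, L - (i + K)
                break
        # (ii) Children search: slide a K-window left to right; re-parent
        # any u whose overlap with v beats its current parent's overlap.
        for i in range(1, L - K + 1):
            l = L - i                     # overlap length of suffix r[i:]
            for u in pref[r[i:i + K]]:
                if reads[u].startswith(r[i:]) and L - l < weight[u]:
                    parent[u], weight[u] = v, L - l
        pref[r[:K]].append(v)             # index v for future iterations
        suff[r[-K:]].append(v)
    return parent, weight
```

Note that st(v) is recoverable from st(π(v)) as the last l symbols of the parent followed by the L − l symbols stored on the edge, so the forest can be decoded from the roots downward.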

Running time

The process of identifying the potential parent and children of each read (or node) takes O(m(L − K)) time, where m is the maximum load of an entry in either hash table. This implies that, in the worst case, we need O(mN(L − K)) = O(mNL) time to construct the read forest. Under the condition that the underlying reference genome G is uniform i.i.d. and the reads \({\cal R}\) are extracted from G uniformly at random, m will be a small constant20,21, and the expected time to construct the read forest becomes O(NL). In practice, however, long genomic repeats as well as sequencing errors lead to hash collisions and hence a moderate slowdown.

Combinatorial optimality

We now prove that the greedy algorithm used by Assembltrie produces the optimal read forest \({\cal T}^ \ast\) (that is, the one containing the minimum total number of symbols), under the assumption that the reads have no sequencing errors. In particular, the total weight of \({\cal T}^ \ast\) is at most the length of a shortest superstring constructed from \(\{ {\mathrm{st}}(v):v \in V\}\), since each superstring corresponds to a set of paths connecting all \(v \in V\), which also forms a valid read forest \({\cal T}\).

In order to establish the invariant maintained throughout the greedy construction of \({\cal T}_K = \left( {V,E_K,w,{\mathrm{st}}} \right)\), we extend the definition of the shortest non-overlapping length w to each pair of nodes and define, for each node \(v \in V\), its representation as:

$${\mathrm{rep}}\left( v \right) = {\mathrm{min}}\left\{ {\begin{array}{*{20}{l}}{w\left( {v,u} \right)} \hfill & {{\mathrm{if}}\,w\left( {v,u} \right) \le L - K,u \ne v \in V} \hfill \\ L \hfill & {\mathrm{otherwise}} \hfill \end{array}.} \right.$$
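For concreteness, rep(v) can be transcribed directly as a brute-force Python sketch (our own code, reusing the hypothetical edge_weight helper from the earlier sketch):

```python
def rep(v: int, reads: list, K: int) -> int:
    """Brute-force transcription of the definition: the smallest w(v, u)
    over all u != v with overlap >= K (i.e., w(v, u) <= L - K), else L."""
    L = len(reads[v])
    candidates = [edge_weight(reads[v], reads[u], k_min=K)
                  for u in range(len(reads)) if u != v]
    return min([w for w in candidates if w <= L - K], default=L)
```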

Lemma 1 For any read forest \({\cal T}_K = (V,E_K,w,{\mathrm{st}})\), its total weight \(\left| {{\cal T}_K} \right| \ge \sum _{v \in V}{\mathrm{rep}}(v)\).

Proof Let \(\pi :V \to V \cup \{ {\mathrm{NIL}}\}\) map each node to its parent, that is, π(v) = u if \((v,u) \in E_K\) and π(v) = NIL if there exists no parent. By definition,

$$\begin{array}{*{20}{l}} {\left| {{\cal T}_K} \right|} \hfill & = \hfill & {\mathop {\sum }\limits_{v \in V,\pi (v) \ne {\mathrm{NIL}}} w(v,\pi (v)) + L \cdot \left| {\left\{ {v:\pi (v) = {\mathrm{NIL}}} \right\}} \right|} \hfill \\ {} \hfill & \geq \hfill & {\mathop {\sum }\limits_{v \in V,\pi (v) \ne {\mathrm{NIL}}} {\mathrm{rep}}(v) + \mathop {\sum }\limits_{v \in V,\pi \left( v \right) = {\mathrm{NIL}}} {\mathrm{rep}}(v)} \hfill \\ {} \hfill & = \hfill & {\mathop {\sum }\limits_{v \in V} {\mathrm{rep}}(v).} \hfill \end{array}$$

With the above lemma, we can see that the greedy algorithm indeed returns a read forest with the minimum total weight (i.e., number of symbols), since it guarantees that for any node v, the parent it identifies, π(v), satisfies w(v, π(v)) = rep(v); thus, we have the following theorem.

Theorem 1 The greedy algorithm computes \({\cal T}_K^ \ast\).

An information theoretic lower bound for HTS data compression

In the previous section, we established that Assembltrie constructs the smallest representation of error-free reads among all descriptions that take the form of a read forest. However, this does not preclude the possibility that representing the reads by a fundamentally different data structure could achieve better compression performance. In this section, we address this question through the lens of information theory. Specifically, in his seminal 1948 paper22, Shannon proved that any realization of a random source can, on average, be described by a number of bits not exceeding the entropy of the source. Conversely, any compression scheme that uses fewer bits is doomed to encounter errors in reproducing the source from its compressed form. The key point is that this latter bound applies uniformly to any compression algorithm, and therefore provides a fundamental benchmark that no viable compression scheme can outperform. In this section, we estimate this performance limit for HTS data under realistic assumptions; as demonstrated in the Results, Assembltrie comes close to achieving this bound on bacterial genomes across a spectrum of read-error rates.

For a random variable X taking values in a set \({\cal X}\) with probability mass function \(p_X\), the entropy associated with X is given by:

$$H\left( X \right): = \mathop {\sum }\limits_{x \in {\cal X}} p_X(x){\mathrm{log}}\frac{1}{{p_X(x)}},$$

where log(·) denotes the base-2 logarithm, and H(X) has units of bits. Direct computation of the entropy associated with a set of reads \({\cal R}\) sampled from a genome G is not practical, since a probability distribution over possible genomes G is not available. Nevertheless, it is well known that universal compression algorithms, such as LZ77, are guaranteed to produce descriptions that approach the entropy of a random source with an unknown probability distribution under mild assumptions23,24,25,26. As such, we are justified in estimating H(G) by LZ(G), defined as the description length of G produced by the universal compression algorithm LZ77 (as implemented in gzip). Using this, we may approximate \(H({\cal R})\), the minimum number of bits needed by any algorithm to describe the reads \({\cal R}\). Specifically, we argue below that

$$H\left( {\cal R} \right) \approx NL\left( {h_2\left( p \right) + p\,{\mathrm{log}}\,3} \right) + \left| G \right| \cdot H\left( {{\mathrm{Poisson}}\left( {N{\mathrm{/}}\left| G \right|} \right)} \right) + {\mathrm{LZ}}\left( G \right),$$
(1)

where \(h_2(p) = - p\,{\mathrm{log}}(p) - (1 - p)\,{\mathrm{log}}(1 - p)\) is the binary entropy of \(p \in (0,1)\), \(H\left( {{\mathrm{Poisson}}\left( {N{\mathrm{/}}\left| G \right|} \right)} \right)\) denotes the entropy of a Poisson random variable with mean \(N{\mathrm{/}}\left| G \right|\), N is the number of reads, and \(\left| G \right|\) denotes the length of the sequence G. This approximation is generally valid under the following assumptions (a numerical sketch of the estimate is given after the list):

  • Poisson sampling: Reads in \({\cal R}\) are assumed to be sampled independently and uniformly at random from all positions in G. Under such sampling, the number of reads starting at each position is well approximated by independent Poisson(N/|G|) random variables for N and |G| large.

  • Independent substitution errors: We assume that each base of each read in \({\cal R}\) is corrupted independently with probability \(p \in (0,1)\). When a particular base is selected for corruption, it is substituted with another base, chosen with equal probability from the remaining three bases.
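Under these two assumptions, the right-hand side of (1) can be evaluated numerically. The sketch below is our own code (the names entropy_estimate_bits, poisson_entropy, and h2 are ours); as in the text, LZ(G) is taken as the gzip output size of the genome.

```python
import gzip
import math

def h2(p: float) -> float:
    """Binary entropy of p in (0, 1), in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def poisson_entropy(lam: float, tol: float = 1e-12) -> float:
    """Entropy (bits) of a Poisson(lam) variable, summed from the pmf."""
    h, k, log_pmf = 0.0, 0, -lam                 # ln pmf(0) = -lam
    while True:
        p = math.exp(log_pmf)
        h -= p * log_pmf / math.log(2)           # accumulate -p * log2(p)
        k += 1
        log_pmf += math.log(lam) - math.log(k)   # ln pmf(k) from ln pmf(k-1)
        if k > lam and p < tol:
            break
    return h

def entropy_estimate_bits(genome: str, N: int, L: int, p: float) -> float:
    """Right-hand side of Eq. (1), in bits."""
    lz_bits = 8 * len(gzip.compress(genome.encode()))  # LZ(G)
    G = len(genome)
    return (N * L * (h2(p) + p * math.log2(3))   # read-error process
            + G * poisson_entropy(N / G)         # read sampling process
            + lz_bits)                           # the sequence G itself
```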

Estimating \(H({\cal R})\)

Our goal in this section is to justify the approximation in (1) for \(H({\cal R})\), the entropy of the set of reads sampled from a genome G, possibly containing read errors. To do this, we first make the assumption that the random process of sampling reads from G is independent of the read-error process that corrupts individual bases through substitutions, insertions, or deletions. In other words, if \({\cal R}^ \star\) denotes the set of error-free reads (i.e., the reads in \({\cal R}^ \star\) coincide with those in \({\cal R}\), but do not contain any read errors), then the genome G and the sampled reads (with errors) \({\cal R}\) are conditionally independent given the corresponding error-free reads \({\cal R}^ \star\).

Using this assumption and elementary properties of entropy27, we may write

$$H({\cal R}) = \underbrace {H\left( {{\cal R}|{\cal R}^ \star } \right)}_{{\mathrm{entropy}}\,{\mathrm{of}}\,{\mathrm{error}}\,{\mathrm{process}}} + \underbrace {H\left( {{\cal R}^ \star |G} \right)}_{{\mathrm{entropy}}\,{\mathrm{of}}\,{\mathrm{read}}\,{\mathrm{sampling}}\,{\mathrm{process}}} + \underbrace {H(G)}_{{\mathrm{entropy}}\,{\mathrm{of}}\,{\mathrm{sequences}}} \\ - H(G|{\cal R}) - H\left( {{\cal R}^ \star |G,{\cal R}} \right),$$

where H(· | ·) denotes conditional entropy. As indicated, the first three terms in the expression for \(H({\cal R})\) have intuitive interpretations and may be estimated in practice:

  • The first term \(H\left( {{\cal R}|{\cal R}^ \star } \right)\) is the entropy of the reads \({\cal R}\), given that the error-free reads \({\cal R}^ \star\) are available to us. That is, \(H\left( {{\cal R}|{\cal R}^ \star } \right)\) is roughly equal to the number of bits needed to describe the read errors, if the error-free reads themselves were already known. For example, if the read-error process corresponds to corrupting each base independently with probability \(p \in (0,1)\) by replacing it randomly with one of the remaining three bases, we have \(H\left( {{\cal R}{\mathrm{|}}{\cal R}^ \star } \right) = NL\left( {h_2(p) + p\,{\mathrm{log}}\,3} \right)\), where N denotes the total number of reads, and L denotes their length (in bases).

  • The second term \(H\left( {{\cal R}^ \star |G} \right)\) denotes the entropy of the error-free reads, given that the sequence is known. In other words, this is the entropy of the random process of sampling reads from different locations in the genome. If we adopt the Poisson sampling model, in which reads are sampled uniformly at random from the genome, we have, to first order, the approximation

$$H\left( {{\cal R}^ \star |G} \right) \approx \left| G \right| \cdot H\left( {{\mathrm{Poisson}}\left( {N{\mathrm{/}}\left| G \right|} \right)} \right),$$

where \(H\left( {{\mathrm{Poisson}}\left( {N{\mathrm{/}}|G|} \right)} \right)\) again denotes the entropy of a Poisson random variable with mean \(N{\mathrm{/}}\left| G \right|\). Indeed, given that the sequence G is known, the entropy of the reads is determined entirely by the locations of the reads on the genome G. For N and \(\left| G \right|\) both modestly large, the number of reads sampled from each locus is well approximated by independent Poisson random variables, each with mean \(N{\mathrm{/}}\left| G \right|\), provided the number of non-unique L-mers in the genome is much smaller than |G|.

  • The third term H(G), as already discussed above, corresponds to the descriptive complexity of the sequence G itself and may thus be estimated by LZ(G), the description length of G produced by the universal compressor LZ77 (implemented by gzip).

In practice, the contribution of the remaining terms \(H(G|{\cal R})\) and \(H\left( {{\cal R}^ \star |G,{\cal R}} \right)\) appearing in the decomposition of \(H({\cal R})\) is negligible compared to the first three terms described above. To be more precise, the term \(H(G|{\cal R})\) represents our remaining uncertainty about the sequence G (in bits), given that we observe the sampled reads \({\cal R}\). In practice, the set of reads \({\cal R}\) is generally rich enough (in terms of coverage) to permit reliable assembly of a small number of contigs that collectively cover the sequence G. Thus, \(H(G|{\cal R})\) is at most the logarithm of the number of arrangements of the assembled contigs, so that

$$H(G|{\cal R}) \le {\mathrm{log}}\left( {N_{\mathrm{c}}!} \right) \le N_{\mathrm{c}}{\mathrm{log}}\,N_{\mathrm{c}},$$

where \(N_{\mathrm{c}}\) denotes the number of contigs assembled from \({\cal R}\). Since \(N_{\mathrm{c}}\) will generally be several orders of magnitude smaller than NL or \(\left| G \right|\), the contribution of \(H(G|{\cal R})\) to computing \(H({\cal R})\) will be negligible in practical settings. In the case of error-free reads, this heuristic argument can be made rigorous, in which case \(H(G|{\cal R})\) is only on the order of tens of bits for real genomes28.

Finally, the term \(H\left( {{\cal R}^ \star |G,{\cal R}} \right)\) is seen to be small as follows: the entropy of the error-free reads \({\cal R}^ \star\) given the sequence G coincides precisely with the entropy of the random locations from which the reads are sampled. However, if the error rate is not too severe, access to both G and \({\cal R}\) will effectively determine these locations. Indeed, mapping the reads in \({\cal R}\) to the sequence G should, in all but a few extreme cases, reveal the locations on the genome from which the reads in \({\cal R}^ \star\) were sampled. Thus, we may conclude that \(H\left( {{\cal R}^ \star |G,{\cal R}} \right)\) is much smaller than \(H\left( {{\cal R}^ \star |G} \right)\), implying that its contribution to the calculation of \(H({\cal R})\) can be safely ignored.

Putting everything together yields the approximation (1).

Implementation of Assembltrie

An actual implementation of the Assembltrie approach needs to address a number of issues, such as read errors, strand correction for diploid or polyploid genomes, and parallelization, as discussed below.

Noisy reads: Assembltrie allows a maximum of ε (user specified) mismatches between a parent and a child in the read forest. To this end, for each suffix/prefix hash table hit (of the current window of size K) encountered during the search for the parent/children of a read, Assembltrie extends the corresponding initial match to a full match with Hamming distance at most ε. The encoding of mismatching symbols is described below.
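A minimal Python sketch of this extension step follows (our own illustrative code; the function name and signature are hypothetical): a K-mer hash hit at window position i is accepted if the whole implied prefix–suffix overlap has at most ε mismatches.

```python
def extends_with_mismatches(child: str, par: str, i: int, K: int,
                            eps: int) -> bool:
    """Given an exact K-mer hit child[i:i+K] == par[-K:], check whether
    the full overlap child[:i+K] vs the length-(i+K) suffix of par holds
    with Hamming distance <= eps (the seed itself contributes 0)."""
    L = len(par)
    suffix = par[L - (i + K):]
    mismatches = sum(a != b for a, b in zip(child[:i + K], suffix))
    return mismatches <= eps
```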

Strand correction: As the underlying DNA sequence is double stranded and the reads are obtained from either of the two strands, the following greedy heuristic is implemented as a user option to correct (i.e., synchronize) read orientation: for each node v, pick the strand (i.e., either the read as is or its reverse complement) for which the prefix–suffix overlap between v and an already processed read u is as long as possible. Then π(v) is set to u with that strand of v. Note that when π(v) is reset afterward, the strand of v does not change.
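The heuristic can be sketched as follows (our own code; best_overlap stands in for the hash-table search over already processed reads and is hypothetical):

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(r: str) -> str:
    """Reverse complement of a DNA string."""
    return r.translate(COMPLEMENT)[::-1]

def pick_strand(r: str, best_overlap) -> str:
    """Return r or its reverse complement, whichever yields the longer
    prefix-suffix overlap with a processed read (ties keep r as is)."""
    fwd, rev = best_overlap(r), best_overlap(reverse_complement(r))
    return r if fwd >= rev else reverse_complement(r)
```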

Trie encoding: Assembltrie encodes the read forest \({\cal T}\) in a simple, bottom-up fashion that preserves the total weight of \({\cal T}\). Specifically, the encoding is done by breaking the constructed read forest back down into a set \({\cal S}\) of disjoint contigs, each of which represents a simple path from a leaf node to an internal node or root, with its own metadata indicating the start indices of the reads it includes. Each step of the encoding process starts with an unprocessed leaf node (the special case of single cycles is detected and handled afterward) and traverses the path toward the root, until reaching an already processed node or the root itself. The contig \(S \in {\cal S}\) corresponding to such a path is represented by the Assembltrie encoder as a 6-tuple (a schematic transcription as a record type is given after this list):

  • S′ = π(S): a pointer to the parent sequence S′ of S, or NIL if the last read in S corresponds to a root node. Note that it is possible to have S′ = S, which implies that a single cycle exists in S.

  • {start(r)}: the start location of each read r contained in S, encoded differentially with a fixed-codebook Huffman code. The sum of the differential codes is \(\left| {{\mathrm{seq}}(S)} \right|\) (i.e., the number of symbols to be read in a decompression process from the main stream—as defined below).

  • {rev(r)}: binary flags indicating whether each read r is in its original order, or its reverse-complement order. (This only takes effect if the read-orientation correction heuristic is applied by the user.)

  • \(r\prime \in S\prime\): a pointer to a parent read of S, encoded with \({\mathrm{log}}_2\,n(S\prime)\) bits, where n(S′) is the number of reads in contig S′, provided that π(S) ≠ NIL.

  • seq(S): the main stream of symbols indicating the consensus sequence of S (an example is given in Fig. 4b). Although each base can be naively encoded in 2 bits, (note that the letter “N” is provisionally regarded as an arbitrary symbol in {A, C, G, T}) in Assembltrie, the main stream may be further compressed with a universal compressor.

  • {error(r)}: a record of all mismatching symbols, as well as their positions, between each read r and the consensus sequence seq(S).
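For orientation, the 6-tuple can be transcribed schematically as a record type (our own Python sketch; the field names are ours, and the actual on-disk encoding, with Huffman-coded deltas and bit-packed flags, is not reproduced here):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class ContigRecord:
    parent_contig: Optional[int]        # S' = pi(S); None encodes NIL (root)
    read_starts: List[int]              # start(r); stored as deltas on disk
    read_reversed: List[bool]           # rev(r) flags (strand option only)
    parent_read: Optional[int]          # r' in S': index within parent contig
    consensus: str                      # seq(S); 2 bits/base before final pass
    errors: List[Tuple[int, int, str]]  # (read, position, symbol) mismatches
```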

In addition to the above encoding, a separate data stream is maintained for the reads for which neither a parent nor a child could be identified (due to insufficient overlaps in the filtered read overlap graph \(G_K\)). Finally, a third stream is maintained for representing the positions of the symbol N. Note that even though these positions are detectable through the use of quality scores, Assembltrie is designed to process only the reads and not the quality scores; thus, in order to achieve lossless compression, it needs to maintain these positions explicitly.

Parallelization: Unlike most available HTS compression software, Assembltrie’s parallelization is based on a shared memory scheme, which avoids any tradeoff between speed and compression rate. In particular, Assembltrie utilizes the concurrent containers offered by Intel® Threading Building Blocks to construct hash tables that support concurrent insertion and lookup without explicit locks.

Paired-end reads: Like many available reference-free sequence compressors (including Orcom), Assembltrie is primarily designed for single-end reads. For data sets comprised of paired-end reads, it is possible to convert each paired-end read to a single-end read by simply concatenating the two ends into a single string. This conversion (which could be applied as a preprocessing step to all reads) works especially well if the insert size (the distance between the two ends) has limited variation across the reads. Additional ideas for handling paired-end reads are discussed, for instance, in ref. 29.

Data availability

The source code, including some Python scripts used for benchmarking, is available at https://github.com/kyzhu/assembltrie.