The amount of sequence information in public repositories is growing at a rapid rate. Although these data are likely to contain clinically important information that has not yet been uncovered, our ability to effectively mine these repositories is limited. Here we introduce Sequence Bloom Trees (SBTs), a method for querying thousands of short-read sequencing experiments by sequence, 162 times faster than existing approaches. The approach searches large data archives for all experiments that involve a given sequence. We use SBTs to search 2,652 human blood, breast and brain RNA-seq experiments for all 214,293 known transcripts in under 4 days using less than 239 MB of RAM and a single CPU. Searching sequence archives at this scale and in this time frame is currently not possible using existing tools.
- Supplementary Figure 1: Schematic of a Sequence Bloom Tree. (64 KB)
Each node contains a Bloom filter that holds the k-mers present in the sequencing experiments under it. θ is the fraction of query k-mers that must be found at a node for the search to continue into its subtree. The SBT returns the experiments that likely contain the query sequence, on which further analysis can then be performed.
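The search described in this caption can be illustrated with a minimal Python sketch. It is a toy model, not the authors' implementation: the `BloomFilter`, `Node`, `leaf`, `parent`, and `query` names, the filter size, the SHA-256-based hashing, and the construction of a parent filter as the bitwise OR of its children are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Toy fixed-size Bloom filter over strings (parameters are illustrative)."""

    def __init__(self, num_bits=4096, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit vector

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class Node:
    def __init__(self, bloom, name=None, children=()):
        self.bloom = bloom
        self.name = name  # experiment name for leaves, None for internal nodes
        self.children = list(children)


def leaf(name, reads, k):
    """Build a leaf whose filter holds every k-mer of one experiment's reads."""
    bf = BloomFilter()
    for read in reads:
        for km in kmers(read, k):
            bf.add(km)
    return Node(bf, name=name)


def parent(left, right):
    """Assumed construction: the parent filter is the OR of its children."""
    bf = BloomFilter(left.bloom.num_bits, left.bloom.num_hashes)
    bf.bits = left.bloom.bits | right.bloom.bits
    return Node(bf, children=[left, right])


def query(node, query_kmers, theta):
    """Return names of leaves where at least theta of the query k-mers are found."""
    if not query_kmers:
        return []
    found = sum(1 for km in query_kmers if km in node.bloom)
    if found < theta * len(query_kmers):
        return []  # prune: nothing below this node can pass the threshold
    if not node.children:
        return [node.name]
    hits = []
    for child in node.children:
        hits.extend(query(child, query_kmers, theta))
    return hits


# Two hypothetical experiments with made-up reads.
K = 5
exp_a = leaf("exp_a", ["ACGTACGTTGCA"], K)
exp_b = leaf("exp_b", ["TTTTGGGGCCCCAAAA"], K)
root = parent(exp_a, exp_b)
hits = query(root, kmers("ACGTACGTTGCA", K), theta=0.9)
```

Because a Bloom filter never reports a stored item as absent, a subtree is pruned only when too few query k-mers can be present anywhere beneath it; false positives can add spurious hits, which is why the caption speaks of experiments that "likely" contain the query.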
- Supplementary Figure 2: Comparison with STAR on batched queries. (27 KB)
STAR was run using an index built from 100 batch queries and a pre-index string of size 11. Both SBT and STAR were run using one thread, and SBT was limited to a single filter in RAM. SBT is an estimated 4,056 times faster than STAR under these conditions. STAR times were estimated by extrapolating from queries against 100 random SRR files.
- Supplementary Figure 3: Average time to query a >1,000-nt sequence across 2,652 SRA files indexed in a Sequence Bloom Tree. (36 KB)
To explore the parameter space of the Sequence Bloom Tree, each of the High, Medium, and Low query sets defined in Online Methods was queried individually against the Sequence Bloom Tree for k-mer thresholds θ = 0.7, 0.8, and 0.9, and the results were averaged. As the threshold increases from 0.7 to 0.9, the time per query decreases, whereas the time per query varies little across the TPM levels of the query sets.
- Supplementary Figure 4: Receiver operating characteristic curve averaged over 100 queries with estimated expression >10 TPM and variable θ. (28 KB)
Solid lines represent mean TP and FP rates; dashed lines represent the median rates on the same experiment. Accuracy is determined by taking Sailfish estimates as ground truth and comparing them against SBT estimates on a per-transcript basis.
- Supplementary Figure 5: Total number of Sequence Bloom Tree nodes visited as a function of the number of leaf hits when querying 100 random human transcripts in the Low query set. (49 KB)
The number of nodes includes both internal and leaf nodes of the SBT. Each point represents a single query. When a query is found in many of the leaves, the search must also visit a nearly equal number of internal tree nodes, so the tree structure provides no benefit over simply searching all the leaf filters directly. When the query is found in only a few leaves, however, the total number of nodes visited can be significantly smaller than the number of leaves. For the SBT built here, queries that are found in 600 or fewer leaves visit fewer than 2,652 nodes, so the tree structure and internal nodes improve overall efficiency. A naive approach that did not use the tree would require querying 2,652 leaf filters for every query (denoted by the dashed line). Approximately half of the randomly selected queries known to be expressed in the included experiments fall below this threshold.
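The trade-off in this caption can be checked on a toy tree. The sketch below counts how many nodes a top-down search queries in a balanced binary tree, under the simplifying assumption of idealized filters: an internal node passes exactly when at least one leaf beneath it is a hit, with no false positives and no θ threshold. The function name `visited_nodes` and the leaf numbering are hypothetical, and the toy does not reproduce the empirical 600-leaf threshold reported for the real tree.

```python
def visited_nodes(num_leaves, hit_leaves):
    """Count nodes queried in a balanced binary tree with idealized filters.

    Leaves are numbered 0 .. num_leaves - 1. Each node covers a contiguous
    range of leaves; a node's children are queried only if the node passes.
    """
    hits = set(hit_leaves)

    def visit(lo, hi):
        count = 1  # the node covering leaves lo .. hi-1 is itself queried
        is_internal = hi - lo > 1
        if is_internal and any(lo <= h < hi for h in hits):
            mid = (lo + hi) // 2
            count += visit(lo, mid) + visit(mid, hi)
        return count

    return visit(0, num_leaves)


few = visited_nodes(8, [3])        # query present in a single leaf
many = visited_nodes(8, range(8))  # query present in every leaf
```

With eight leaves, a query found in one leaf touches 7 nodes, fewer than the 8 leaf filters a flat scan would need, whereas a query found in every leaf touches all 15 nodes of the tree.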
- Supplementary Figure 6: The benefit of SBT as a pre-processing filter. (38 KB)
Estimated time to process the full set of 2,652 experiments using SBT followed by STAR or SRA-BLAST, compared with the time to process it with STAR or SRA-BLAST without SBT. Dark bars denote the SBT time, and the total bar height represents the total time when conventional search tools are run on the post-SBT set.
- Supplementary Figure 7: Distribution of hit counts (the number of individual SRR files that matched a given query) for all known human transcripts as a function of threshold. (71 KB)
Approximately 83%, 89%, and 95% of all queries (depending on TPM threshold) have fewer than 600 matching files, and the SBT hierarchy would provide significant improvements over a search that used leaf filters only on these queries.
- Supplementary Figure 8: Time for querying all known human transcripts. (34 KB)
Total times (single-threaded) for querying all 214,293 human transcripts (in batch mode) against all publicly available blood, breast, and brain RNA-seq experiments in the SRA for θ = 0.7, 0.8, 0.9 as well as the extrapolated time to run Sailfish on the full dataset. Sailfish is significantly faster than nearly all other algorithms for RNA-seq quantification.
- Supplementary Text and Figures (1,015 KB)
Supplementary Figures 1–8 and Supplementary Tables 2–5
- Supplementary Table 1 (83,147 KB)
- Supplementary Software zip file (174 KB)