Fast search of thousands of short-read sequencing experiments

Nature Biotechnology 34, 300–302 (2016)
doi:10.1038/nbt.3442

Abstract

The amount of sequence information in public repositories is growing at a rapid rate. Although these data are likely to contain clinically important information that has not yet been uncovered, our ability to effectively mine these repositories is limited. Here we introduce Sequence Bloom Trees (SBTs), a method for querying thousands of short-read sequencing experiments by sequence, 162 times faster than existing approaches. The approach searches large data archives for all experiments that involve a given sequence. We use SBTs to search 2,652 human blood, breast and brain RNA-seq experiments for all 214,293 known transcripts in under 4 days using less than 239 MB of RAM and a single CPU. Searching sequence archives at this scale and in this time frame is currently not possible using existing tools.
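
The core primitive behind this approach (see Supplementary Fig. 1 below) is a Bloom filter over the k-mers of a sequencing experiment. As a point of reference for the figure captions, here is a minimal sketch of such a filter in Python; the class name, hash scheme, and parameter defaults are illustrative choices for this sketch, not those of the published SBT software:

```python
import hashlib

class KmerBloomFilter:
    """Bloom filter over the k-mers of a sequencing experiment.
    A minimal illustrative sketch, not the published implementation."""

    def __init__(self, size_bits=1 << 20, num_hashes=3, k=20):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.k = k
        self.bits = bytearray(size_bits // 8)

    def _positions(self, kmer):
        # Derive num_hashes bit positions from one digest (double hashing).
        d = hashlib.sha256(kmer.encode()).digest()
        h1 = int.from_bytes(d[:8], "little")
        h2 = int.from_bytes(d[8:16], "little") | 1
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add_sequence(self, seq):
        # Insert every k-mer of a read (or transcript) into the filter.
        for i in range(len(seq) - self.k + 1):
            for p in self._positions(seq[i:i + self.k]):
                self.bits[p // 8] |= 1 << (p % 8)

    def contains(self, kmer):
        # May report a false positive, but never a false negative.
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(kmer))
```

A query sequence is then scored against a filter by the fraction of its k-mers reported present, which is compared with the threshold θ used throughout the figures.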

Figures

  1. Figure 1: Estimated running times of search tools for one transcript.

    The SBT per-query time was recorded using at most a single filter in active memory and one thread. The other bars show the estimated time to achieve the same query results using SRA-BLAST and STAR.

  2. Figure 2: Receiver operating characteristic (ROC) curve averaged over 100 queries with estimated expression >100, >500 and >1,000 TPM and variable θ (Online Methods).

    Solid lines show mean true-positive and false-positive rates; dashed lines show the median rates on the same experiments. Relaxing θ increases sensitivity at the cost of specificity. In more than half of all queries, 100% of true-positive hits can be found with θ as high as 0.9.

  3. Supplementary Fig. 1: Schematic of a Sequence Bloom Tree.

    Each node contains a Bloom filter that holds the k-mers present in the sequencing experiments below it. θ is the fraction of query k-mers that must be found at a node in order to continue searching its subtree. The SBT returns the experiments that likely contain the query sequence, on which further analysis can then be performed (a sketch of this pruned traversal follows the figure list).

  4. Supplementary Fig. 2: Comparison with STAR on batched queries.

    STAR was run using an index built from 100 batched queries and a size-11 pre-index string. Both SBT and STAR were run with one thread, and SBT was limited to a single filter in RAM. Under these conditions, SBT is an estimated 4,056 times faster than STAR. STAR times are estimated by extrapolating from queries against 100 random SRR files.

  5. Supplementary Fig. 3: Average time to query a >1,000-nt sequence across 2,652 SRA files indexed in a Sequence Bloom Tree.

    To explore the parameter space of the Sequence Bloom Tree, each of the High, Medium, and Low query sets defined in Online Methods was queried individually at k-mer thresholds θ = 0.7, 0.8, and 0.9, and the results were averaged. As the threshold increases from 0.7 to 0.9, the time per query decreases, while it varies little across TPM levels.

  6. Supplementary Fig. 4: Receiver operating characteristic curve averaged over 100 queries with estimated expression >10 TPM and variable θ.

    Solid lines show mean true-positive and false-positive rates; dashed lines show the median rates on the same experiments. Accuracy is determined by taking Sailfish estimates as ground truth and comparing them against SBT results on a per-transcript basis.

  7. Supplementary Fig. 5: Total number of Sequence Bloom Tree nodes visited as a function of the number of leaf hits when querying 100 random human transcripts in the Low query set.

    The node count includes both internal and leaf nodes of the SBT, and each point represents a single query. When a query is found in many of the leaves, it must also visit a nearly equal number of internal tree nodes, so the tree structure provides no benefit over searching all the leaf filters directly. Conversely, when the query is found in only a few leaves, the total number of nodes visited can be far smaller than the number of leaves. For the SBT built here, queries found in 600 or fewer leaves visit fewer than 2,652 nodes, so the tree structure and internal nodes improve overall efficiency. A naive approach without the tree would query all 2,652 leaf filters for every query (dashed line). Approximately half of the randomly selected queries known to be expressed in the included experiments fall below this threshold (a toy simulation of this trade-off follows the figure list).

  8. Supplementary Fig. 6: The benefit of SBT as a pre-processing filter.

    Estimated time to process the full set of 2,652 experiments using SBT as a filter followed by STAR or SRA-BLAST, versus the time to process with STAR or SRA-BLAST alone. Dark bars denote the SBT time; the total bar height represents the combined time, including the conventional search tool run on the post-SBT set.

  9. Supplementary Fig. 7: Distribution of hit counts (the number of individual SRR files that matched a given query) for all known human transcripts as a function of threshold.

    Approximately 83%, 89%, and 95% of all queries (depending on TPM threshold) have fewer than 600 matching files; on these queries, the SBT hierarchy provides significant improvements over a search that uses leaf filters only.

  10. Supplementary Fig. 8: Time for querying all known human transcripts.

    Total single-threaded times for querying all 214,293 human transcripts (in batch mode) against all publicly available blood, breast, and brain RNA-seq experiments in the SRA for θ = 0.7, 0.8, and 0.9, together with the extrapolated time to run Sailfish on the full dataset. Sailfish is significantly faster than nearly all other algorithms for RNA-seq quantification.
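
The θ-pruned traversal described in the caption of Supplementary Fig. 1 can be summarized in a few lines. The sketch below assumes the hypothetical KmerBloomFilter from the earlier sketch and a simple node object with filter, children, and experiment_id attributes; it illustrates the search strategy, not the authors' implementation:

```python
def query_sbt(node, kmers, theta):
    """Return experiment IDs whose leaf filters pass the theta test.

    node: hypothetical tree node with .filter (a KmerBloomFilter),
          .children (list of nodes, empty at leaves) and, at leaves,
          .experiment_id. kmers: the query's k-mer strings.
    """
    present = sum(node.filter.contains(km) for km in kmers)
    if present < theta * len(kmers):
        return []                        # prune this entire subtree
    if not node.children:
        return [node.experiment_id]      # leaf passed: report the experiment
    hits = []
    for child in node.children:          # descend into every surviving child
        hits.extend(query_sbt(child, kmers, theta))
    return hits
```

Because a subtree is abandoned as soon as its filter fails the θ test, queries that match few experiments touch only a small fraction of the tree.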

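The trade-off shown in Supplementary Fig. 5 can also be illustrated with a toy simulation. Under simplifying assumptions (a complete binary tree padded to a power of two, and no Bloom filter false positives, so a node's children are probed exactly when its subtree contains a hit), the number of visited nodes grows with the number of hit leaves roughly as the figure describes:

```python
import math
import random

def visited_nodes(num_leaves, hit_leaves):
    """Idealized count of SBT nodes probed for a query that truly matches
    the experiments in hit_leaves. A toy model, not the published traversal."""
    depth = math.ceil(math.log2(num_leaves))
    n = 1 << depth                       # number of leaves after padding
    passing = set()                      # nodes whose subtree contains a hit
    for leaf in hit_leaves:
        node = n - 1 + leaf              # implicit-heap index of the leaf
        while True:
            passing.add(node)
            if node == 0:
                break
            node = (node - 1) // 2       # climb to the parent
    visited = {0}                        # the root filter is always probed
    for node in passing:
        if node < n - 1:                 # internal node that passed theta:
            visited.add(2 * node + 1)    # both children are probed, even
            visited.add(2 * node + 2)    # those that will then be pruned
    return len(visited)

random.seed(0)
for hits in (10, 100, 600, 2000):
    sample = random.sample(range(2652), hits)
    print(hits, "leaf hits ->", visited_nodes(2652, sample), "nodes visited")
```

For small hit counts the visited set stays well under the 2,652 leaf filters a flat scan would require, while for queries matching most leaves the traversal approaches the full tree, matching the dashed-line comparison in the figure.
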
References

  1. Leinonen, R., Sugawara, H. & Shumway, M. The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2011).
  2. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).
  3. Burrows, M. & Wheeler, D.J. A block sorting lossless data compression algorithm. Technical Report 124 (Digital Equipment Corporation, 1994).
  4. Ferragina, P. & Manzini, G. Indexing compressed text. J. Assoc. Comput. Mach. 52, 552–581 (2005).
  5. Grossi, R. & Vitter, J.S. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35, 378–407 (2005).
  6. Grossi, R., Vitter, J.S. & Xu, B. Wavelet trees: from theory to practice. in Proceedings of the First International Conference on Data Compression, Communications and Processing (CCP 2011) 210–221 (IEEE, 2011).
  7. Navarro, G. & Mäkinen, V. Compressed full-text indexes. ACM Comput. Surv. 39, article 2, doi:10.1145/1216370.1216372 (2007).
  8. Ziviani, N., Moura, E., Navarro, G. & Baeza-Yates, R. Compression: a key for next-generation text retrieval systems. IEEE Computer 33, 37–44 (2000).
  9. Navarro, G., Moura, E., Neubert, M., Ziviani, N. & Baeza-Yates, R. Adding compression to block addressing inverted indexes. Inf. Retrieval 3, 49–77 (2000).
  10. Loh, P.-R., Baym, M. & Berger, B. Compressive genomics. Nat. Biotechnol. 30, 627–630 (2012).
  11. Daniels, N.M. et al. Compressive genomics for protein databases. Bioinformatics 29, i283–i290 (2013).
  12. Yu, Y.W., Daniels, N.M., Danko, D.C. & Berger, B. Entropy-scaling search of massive biological data. Cell Syst. 1, 130–140 (2015).
  13. Bloom, B.H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 422–426 (1970).
  14. Broder, A. & Mitzenmacher, M. Network applications of Bloom filters: a survey. Internet Math. 1, 485–509 (2005).
  15. Raman, R., Raman, V. & Srinivasa Rao, S. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. in Proceedings of the Thirteenth Annual ACM–SIAM Symposium on Discrete Algorithms (SODA '02) 233–242 (Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2002).
  16. Crainiceanu, A. Bloofi: a hierarchical Bloom filter index with applications to distributed data provenance. in Proceedings of the 2nd International Workshop on Cloud Intelligence, article 4, doi:10.1145/2501928.2501931 (ACM, 2013).
  17. Pell, J. et al. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc. Natl. Acad. Sci. USA 109, 13272–13277 (2012).
  18. Chikhi, R. & Rizk, G. Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms Mol. Biol. 8, 22 (2013).
  19. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
  20. Patro, R., Mount, S.M. & Kingsford, C. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat. Biotechnol. 32, 462–464 (2014).
  21. Stranneheim, H. et al. Classification of DNA sequences using Bloom filters. Bioinformatics 26, 1595–1600 (2010).
  22. Melsted, P. & Pritchard, J.K. Efficient counting of k-mers in DNA sequences using a Bloom filter. BMC Bioinformatics 12, 333 (2011).
  23. Crainiceanu, A. & Lemire, D. Multidimensional Bloom filters. Inf. Syst. 54, 311–324 (2015).
  24. Salikhov, K., Sacomoto, G. & Kucherov, G. Using cascading Bloom filters to improve the memory usage for de Bruijn graphs. Algorithms Mol. Biol. 9, 2 (2014).
  25. Rozov, R., Shamir, R. & Halperin, E. Fast lossless compression via cascading Bloom filters. BMC Bioinformatics 15 (suppl. 9), S7 (2014).
  26. Witten, I., Moffat, A. & Bell, T. Managing Gigabytes, 2nd edn. (Morgan Kaufmann, 1999).
  27. Baeza-Yates, R. & Ribeiro, B. Modern Information Retrieval (Addison-Wesley, 1999).
  28. Zhang, Q., Pell, J., Canino-Koning, R., Howe, A.C. & Brown, C.T. These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PLoS One 9, e101271 (2014).
  29. Brown, C.T., Howe, A.C., Zhang, Q., Pyrkosz, A.B. & Brom, T.H. A reference-free algorithm for computational normalization of shotgun sequencing data. Preprint at http://arxiv.org/abs/1203.4802 (2012).
  30. Rasmussen, K.R., Stoye, J. & Myers, E.W. Efficient q-gram filters for finding all ε-matches over a given length. J. Comput. Biol. 13, 296–308 (2006).
  31. Philippe, N., Salson, M., Commes, T. & Rivals, E. CRAC: an integrated approach to the analysis of RNA-seq reads. Genome Biol. 14, R30 (2013).
  32. Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770 (2011).
  33. Gog, S., Beller, T., Moffat, A. & Petri, M. From theory to practice: plug and play with succinct data structures. in 13th International Symposium on Experimental Algorithms, Copenhagen, 29 June–1 July 2014 (eds. Gudmundsson, J. & Katajainen, J.) 326–337 (Springer, 2014).


Author information

Affiliations

  1. Joint Carnegie Mellon University–University of Pittsburgh Ph.D. Program in Computational Biology, Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.

    • Brad Solomon
  2. Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.

    • Carl Kingsford

Contributions

B.S. and C.K. designed the method, devised the experiments, implemented the software and wrote the manuscript. B.S. performed the experiments.

Competing financial interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to: Carl Kingsford.


Supplementary information

Supplementary Figures

  1. Supplementary Figure 1: Schematic of a Sequence Bloom Tree. (64 KB)

  2. Supplementary Figure 2: Comparison with STAR on batched queries. (27 KB)

  3. Supplementary Figure 3: Average time to query a >1,000-nt sequence across 2,652 SRA files indexed in a Sequence Bloom Tree. (36 KB)

  4. Supplementary Figure 4: Receiver operating characteristic curve averaged over 100 queries with estimated expression >10 TPM and variable θ. (28 KB)

  5. Supplementary Figure 5: Total number of Sequence Bloom Tree nodes visited as a function of the number of leaf hits when querying 100 random human transcripts in the Low query set. (49 KB)

  6. Supplementary Figure 6: The benefit of SBT as a pre-processing filter. (38 KB)

  7. Supplementary Figure 7: Distribution of hit counts (the number of individual SRR files that matched a given query) for all known human transcripts as a function of threshold. (71 KB)

  8. Supplementary Figure 8: Time for querying all known human transcripts. (34 KB)

PDF files

  1. Supplementary Text and Figures (1,015 KB)

    Supplementary Figures 1–8 and Supplementary Tables 2–5

Text files

  1. Supplementary Table 1 (83,147 KB)

Zip files

  1. Supplementary Software (174 KB)
