Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms

Journal name: Nature Biotechnology
Volume: 32
Pages: 462–464
DOI: 10.1038/nbt.2862

Abstract

We introduce Sailfish, a computational method for quantifying the abundance of previously annotated RNA isoforms from RNA-seq data. Because Sailfish entirely avoids mapping reads, a time-consuming step in all current methods, it provides quantification estimates much faster than do existing approaches (typically 20 times faster) without loss of accuracy. By facilitating frequent reanalysis of data and reducing the need to optimize parameters, Sailfish exemplifies the potential of lightweight algorithms for efficiently processing sequencing reads.

At a glance

Figures

  1. Figure 1: Overview of the Sailfish pipeline.

    (a,b) Sailfish consists of an indexing phase (a) that is invoked by the command 'sailfish index' and a quantification phase (b) invoked by the command 'sailfish quant'. The Sailfish index has four components: (1) a perfect hash function mapping each k-mer in the transcript set to a unique integer between 0 and N−1, where N is the number of unique k-mers in the set of transcripts; (2) an array recording the number of times each k-mer occurs in the reference set; (3) an index mapping each transcript to the multiset of k-mers that it contains; (4) an index mapping each k-mer to the set of transcripts in which it appears. The quantification phase consists of counting the indexed k-mers in the set of reads and then applying an EM procedure to determine the maximum-likelihood estimates of relative transcript abundance. K-mer count assignments are illustrated by vertical gray bars on lines representing known transcripts; the horizontal lines intersecting the gray bars represent the average of the current k-mer count assignments for each transcript.

  2. Figure 2: Speed and accuracy of Sailfish.

    (a) Correlation between qPCR estimates of gene abundance (x axis) and the Sailfish estimates. The qPCR results are taken from the microarray quality control (MAQC) study (ref. 15). The results shown are for human brain tissue, and the RNA-seq–based estimates were computed from the reads in SRA accession SRX016366 (81,250,481 35-bp single-end reads). The set of transcripts used in this experiment was the curated RefSeq (ref. 20) transcripts (accession prefix NM) from hg18 (31,148 transcripts). (b) Correlation between the ground-truth FPKM in a simulated data set (x axis) and the Sailfish abundance estimates. Quantification in this experiment was performed on a set of 96,520 transcript sequences taken from Ensembl (ref. 21) GRCh37.73. (c) Total time taken by each method (Sailfish, RSEM, eXpress and Cufflinks) to estimate isoform abundance on each data set. The total time taken by a method is the height of the corresponding bar. The total is further broken down into the time taken to perform read alignment (for Sailfish, the time taken to count the k-mers in the read set was measured instead) and the time taken to quantify abundance given the aligned reads (or k-mer counts). All tools were run in multithreaded mode (where applicable) and were allowed to use up to 16 threads. (d) Accuracy of each method on human brain tissue and a synthetic data set. Accuracy is measured by the Pearson (log-transformed) and Spearman correlation coefficients between estimated abundance values and MAQC qPCR data (for human brain tissue) or ground truth (for simulated data). Root-mean-square error (RMSE) and median percentage error (medPE) are calculated as described in the Online Methods.
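
As a rough illustration of the pipeline described in the Figure 1 caption, the following Python sketch builds the four index components, counts indexed k-mers in a read set, and runs a simple EM loop. It is not the Sailfish implementation. A plain dictionary stands in for the perfect hash function, the EM update is a generic textbook form that ignores strand, sequencing errors and bias, and all function names (`build_index`, `count_read_kmers`, `em_abundance`) are hypothetical.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(transcripts, k):
    """Build the four index components from a dict {name: sequence}."""
    kmer_id = {}                 # (1) k-mer -> integer in [0, N); dict stands in for a perfect hash
    tx_kmers = {}                # (3) transcript -> multiset of k-mer ids
    kmer_txs = defaultdict(set)  # (4) k-mer id -> set of transcripts containing it
    for name, seq in transcripts.items():
        ids = Counter()
        for km in kmers(seq, k):
            i = kmer_id.setdefault(km, len(kmer_id))
            ids[i] += 1
            kmer_txs[i].add(name)
        tx_kmers[name] = ids
    ref_counts = [0] * len(kmer_id)  # (2) occurrences of each k-mer in the reference set
    for ids in tx_kmers.values():
        for i, c in ids.items():
            ref_counts[i] += c
    return kmer_id, ref_counts, tx_kmers, kmer_txs


def count_read_kmers(reads, kmer_id, k):
    """Count only those read k-mers that are present in the index."""
    counts = [0] * len(kmer_id)
    for read in reads:
        for km in kmers(read, k):
            i = kmer_id.get(km)
            if i is not None:
                counts[i] += 1
    return counts


def em_abundance(read_counts, tx_kmers, kmer_txs, iterations=100):
    """Estimate relative abundances with a plain EM loop (assumed, simplified model)."""
    mu = {t: 1.0 for t in tx_kmers}  # current mean k-mer count per transcript
    for _ in range(iterations):
        alloc = {t: 0.0 for t in tx_kmers}
        # E-step: split each observed k-mer count among the transcripts that
        # contain it, weighted by current abundance times k-mer multiplicity.
        for i, txs in kmer_txs.items():
            if read_counts[i] == 0:
                continue
            weights = {t: mu[t] * tx_kmers[t][i] for t in txs}
            total = sum(weights.values())
            if total == 0:
                continue
            for t, w in weights.items():
                alloc[t] += read_counts[i] * w / total
        # M-step: abundance is the average allocated count per k-mer occurrence.
        for t, ids in tx_kmers.items():
            n = sum(ids.values())
            mu[t] = alloc[t] / n if n else 0.0
    norm = sum(mu.values())
    return {t: (m / norm if norm else 0.0) for t, m in mu.items()}
```

A typical call chain would be `build_index(transcripts, k)`, then `count_read_kmers(reads, kmer_id, k)`, then `em_abundance(counts, tx_kmers, kmer_txs)`. The per-transcript average in the M-step loosely mirrors the horizontal lines in Figure 1, which represent the average of the current k-mer count assignments for each transcript.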

Accession codes

Referenced accessions

Sequence Read Archive

References

  1. Soneson, C. & Delorenzi, M. BMC Bioinformatics 14, 91 (2013).
  2. Roychowdhury, S. et al. Sci. Transl. Med. 3, 111ra121 (2011).
  3. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. Genome Biol. 10, R25 (2009).
  4. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Nat. Methods 5, 621–628 (2008).
  5. Trapnell, C. et al. Nat. Biotechnol. 28, 511–515 (2010).
  6. Li, B. & Dewey, C. BMC Bioinformatics 12, 323 (2011).
  7. Roberts, A. & Pachter, L. Nat. Methods 10, 71–73 (2012).
  8. Philippe, N., Salson, M., Commes, T. & Rivals, E. Genome Biol. 14, R30 (2013).
  9. Botelho, F.C., Pagh, R. & Ziviani, N. Proceedings of the 10th International Workshop on Algorithms and Data Structures, Halifax, NS, Canada, August 15–17, 2007 (eds. Dehne, F., Sack, J.-R. & Zeh, N.) 139–150 (Springer, 2007).
  10. Marçais, G. & Kingsford, C. Bioinformatics 27, 764–770 (2011).
  11. Varadhan, R. & Roland, C. Scand. J. Stat. 35, 335–353 (2008).
  12. Nicolae, M., Mangul, S., Mandoiu, I. & Zelikovsky, A. Algorithms Mol. Biol. 6, 9 (2011).
  13. Salzman, J., Jiang, H. & Wong, W.H. Stat. Sci. 26, 62–83 (2011).
  14. Zheng, W., Chung, L.M. & Zhao, H. BMC Bioinformatics 12, 290 (2011).
  15. Shi, L. et al. Nat. Biotechnol. 24, 1151–1161 (2006).
  16. Bullard, J.H., Purdom, E., Hansen, K.D. & Dudoit, S. BMC Bioinformatics 11, 94 (2010).
  17. Griebel, T. et al. Nucleic Acids Res. 40, 10073–10083 (2012).
  18. Grabherr, M.G. et al. Nat. Biotechnol. 29, 644–652 (2011).
  19. Sacomoto, G.A. et al. BMC Bioinformatics 13 (suppl. 6), S5 (2012).
  20. Pruitt, K.D., Tatusova, T., Brown, G.R. & Maglott, D.R. Nucleic Acids Res. 40, D1, D130–D135 (2012).
  21. Flicek, P. et al. Nucleic Acids Res. 41, D1, D48–D55 (2013).
  22. Trapnell, C., Pachter, L. & Salzberg, S. Bioinformatics 25, 1105–1111 (2009).
  23. Pheatt, C. J. Comput. Sci. Coll. 23, 298 (2008).


Author information

Affiliations

  1. Lane Center for Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.

    • Rob Patro &
    • Carl Kingsford
  2. Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, Maryland, USA.

    • Stephen M Mount
  3. Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, USA.

    • Stephen M Mount

Contributions

R.P., S.M.M. and C.K. designed the method and algorithms, devised the experiments, and wrote the manuscript. R.P. implemented the Sailfish software.

Competing financial interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to:


Supplementary information

PDF files

  1. Supplementary Text and Figures (1,369 KB)

    Supplementary Figures 1–7, Supplementary Table 1 and Supplementary Notes 1–3

Zip files

  1. Supplementary Data (682 KB)

    Version 0.6.3 of the Sailfish source code
