Near-optimal probabilistic RNA-seq quantification

An Erratum to this article was published on 09 August 2016

This article has been updated

Abstract

We present kallisto, an RNA-seq quantification program that is two orders of magnitude faster than previous approaches and achieves similar accuracy. Kallisto pseudoaligns reads to a reference, producing a list of transcripts that are compatible with each read while avoiding alignment of individual bases. We use kallisto to analyze 30 million unaligned paired-end RNA-seq reads in <10 min on a standard laptop computer. This removes a major computational bottleneck in RNA-seq analysis.

Figure 1: Overview of kallisto.
Figure 2: Performance of kallisto and other methods.

Change history

  • 27 July 2016

    In the version of this article initially published, in the HTML version only, the equation “αₜN > 0.01” was written as “αtN > 0.01.” In addition, in the Figure 1 legend, the formatting of the nodes was incorrect (v_1, etc., rather than v₁). The errors have been corrected in the HTML and PDF versions of the article.

Acknowledgements

N.L.B., H.P. and L.P. were partially funded by NIH R01 HG006129. P.M. was partially funded by a Fulbright fellowship.

Author information

Contributions

N.L.B. and L.P. developed the concept of pseudoalignment and conceived the idea for applying it to RNA-seq quantification. P.M. conceived the implementation using De Bruijn graphs. N.L.B., H.P., P.M. and L.P. designed the kallisto software and N.L.B. implemented a prototype. H.P. and P.M. wrote the current kallisto implementation. N.L.B. and H.P. automated production of the results. N.L.B., H.P., P.M. and L.P. analyzed results and wrote the paper.

Corresponding author

Correspondence to Lior Pachter.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Median relative difference for abundance estimates using varying values of k.

Median relative difference for abundance estimates using varying values of k on a dataset of 30 million 75bp paired-end reads that were simulated without errors. The “k-mers method” uses the k-compatibility of each k-mer independently and runs the EM algorithm on k-mers, whereas kallisto uses the intersection of k-compatibility classes across both ends of a read. Even for k=75, the full read length in the simulation, independent use of k-mers results in a significant drop in accuracy due to the loss of paired-end information.
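The difference between the two approaches in this legend can be illustrated with a toy Python sketch. It assumes a plain dictionary that maps each k-mer to the set of transcripts containing it; this is a hypothetical stand-in, not kallisto's actual index. All function names and the three toy transcripts are invented for illustration, and both ends of the pair are assumed to be given in transcript orientation.

```python
from typing import Dict, FrozenSet, List, Set


def build_kmer_index(transcripts: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the set of transcripts containing it (its k-compatibility class)."""
    index: Dict[str, Set[str]] = {}
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(name)
    return index


def kmers(read: str, k: int) -> List[str]:
    """All overlapping k-mers of a read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def independent_kmer_classes(left: str, right: str,
                             index: Dict[str, Set[str]], k: int) -> List[FrozenSet[str]]:
    """'k-mers method': every k-mer keeps its own compatibility class."""
    return [frozenset(index[km])
            for read in (left, right) for km in kmers(read, k) if km in index]


def pseudoalign_pair(left: str, right: str,
                     index: Dict[str, Set[str]], k: int) -> FrozenSet[str]:
    """Intersect the compatibility classes of all k-mers from both ends of a pair."""
    compatible = None
    for read in (left, right):
        for km in kmers(read, k):
            cls = index.get(km)
            if cls is None:
                continue  # k-mer not in the index; ignored in this toy version
            compatible = set(cls) if compatible is None else compatible & cls
    return frozenset(compatible or ())


# Toy transcripts (hypothetical): t1 shares its start with t2 and its end with t3.
TRANSCRIPTS = {
    "t1": "AAACCCGGGTTT",
    "t2": "AAACCCTTTGGG",
    "t3": "CGCGCGGGGTTT",
}
INDEX = build_kmer_index(TRANSCRIPTS, k=4)
PAIR_CLASS = pseudoalign_pair("AAACCC", "GGGTTT", INDEX, k=4)
KMER_CLASSES = independent_kmer_classes("AAACCC", "GGGTTT", INDEX, k=4)
```

In this toy case the left end alone is compatible with t1 and t2 and the right end alone with t1 and t3, so no single k-mer identifies t1. Only the intersection across both ends does, which is the paired-end information the legend says is lost when k-mers are used independently.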

Supplementary Figure 2 Accuracy of kallisto, Cufflinks, Sailfish, eXpress and RSEM.

Accuracy of kallisto, Cufflinks, Sailfish, eXpress and RSEM on 20 RSEM simulations of 30 million 75bp paired-end reads based on the TPM estimates and error profile of Geuvadis sample NA12716 (selected for its depth of sequencing). For each simulation we report the accuracy as the median relative difference in the estimated TPM value of each transcript. The values reported are means across the 20 simulations (the variance across simulations was too small to be visible on this plot). Relative difference is defined as the absolute difference between the estimated TPM value and the ground truth, divided by the average of the two.
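The accuracy metric defined in this legend can be written out as a short Python sketch. The function names are hypothetical, and the handling of a transcript whose estimate and ground truth are both zero (treated here as a difference of 0) is an assumption, since the legend does not specify that case.

```python
from statistics import median
from typing import Sequence


def relative_difference(estimate: float, truth: float) -> float:
    """|estimate - truth| divided by the average of the two values."""
    if estimate == 0 and truth == 0:
        return 0.0  # assumption: perfect agreement at zero counts as no error
    return abs(estimate - truth) / ((estimate + truth) / 2)


def median_relative_difference(estimates: Sequence[float],
                               truths: Sequence[float]) -> float:
    """Median over transcripts of the per-transcript relative difference."""
    return median(relative_difference(e, t) for e, t in zip(estimates, truths))


# Toy TPM values (made up): per-transcript differences are 0.0, 2.0 and 1.0.
EXAMPLE = median_relative_difference([10.0, 0.0, 5.0], [10.0, 2.0, 15.0])
```

For non-negative TPM values the per-transcript difference lies between 0 and 2, reaching 2 when exactly one of the two values is zero.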

Supplementary Figure 3 Performance of different quantification programs on the set of paralogs in the human genome.

Performance of different quantification programs on the set of paralogs in the human genome supplied by the Duplicated Genes Database (http://dgd.genouest.org). This set includes 8,636 transcripts in 3,163 genes.

Supplementary Figure 4 Count distribution of one simulation.

Count distribution of one simulation. The left panel contains the transcripts used in Supplementary Figure 3. The right panel contains the remaining transcripts. The x-axis is on the log scale. Both distributions appear very similar, suggesting that the drop in performance in Supplementary Figure 3 is from sequence similarity and not oddities in the distribution such as very low counts.

Supplementary Figure 5 Comparison of technical variance in abundances.

The data come from a single library of 216M 101-bp paired-end reads. Each point corresponds to a transcript and is colored by the decile of its expression level in the single bootstrapped subsample. The y-axis represents the variance of abundance estimates across 40 subsamples of 30M reads each. The x-axis represents the variance computed from 40 bootstraps of a single subsampled dataset of 30M reads. The red lines emanating from the lower left corner consist of transcripts that have an estimated abundance of zero in the single bootstrapped experiment but show expression in some of the subsamples (12,968 transcripts), and vice versa (720 transcripts).

Supplementary Figure 6 Median relative error (with respect to 1,000 bootstraps) of inferred transcript variances.

Median relative error (with respect to 1,000 bootstraps) of inferred transcript variances as a function of number of bootstrap samples performed. The relative error with 40 bootstraps (red line) is 7.8%.
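One way such a curve could be computed is sketched below in Python. The legend does not give the formula, so the definition used here is an assumption: the variance from the first n bootstraps is compared with the variance from all bootstraps, and the absolute difference is divided by the reference variance. The function name is hypothetical, and the bootstrap replicates are simulated Gaussian values, not kallisto output.

```python
import random
from statistics import median, variance
from typing import List


def median_relative_variance_error(boot: List[List[float]], n_boot: int) -> float:
    """Median over transcripts of |var(first n_boot) - var(all)| / var(all).

    boot[b][t] is the estimate for transcript t in bootstrap replicate b.
    """
    errors = []
    for t in range(len(boot[0])):
        reference = variance([rep[t] for rep in boot])
        if reference == 0:
            continue  # relative error is undefined for a zero reference variance
        subset = variance([rep[t] for rep in boot[:n_boot]])
        errors.append(abs(subset - reference) / reference)
    return median(errors)


# Simulated stand-in for 1,000 bootstrap replicates of 50 transcripts.
random.seed(0)
BOOT = [[random.gauss(100.0, 10.0) for _ in range(50)] for _ in range(1000)]
ERR_40 = median_relative_variance_error(BOOT, 40)
ERR_ALL = median_relative_variance_error(BOOT, 1000)
```

With all replicates the error is zero by construction. The value at 40 replicates depends on the simulated data and should not be read as a reproduction of the 7.8% reported in the legend.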

Supplementary Figure 7 Relationship between the mean and variance of estimated counts from subsamples.

Relationship between the mean and variance of estimated counts for each transcript (x and y axes are on log scale) based on 40 subsamples of 30M reads from a dataset of 216M PE reads. The x-axis is the mean of each count estimate calculated across the subsamples. The y-axis is the variance of the count estimates calculated across subsamples.

Supplementary Figure 8 Relationship between the mean and variance of estimated counts from bootstraps.

Relationship between the mean and variance of estimated counts for each transcript (x and y axes are on log scale) based on 40 bootstraps of a single subsample of 30M reads from the same 216M PE read dataset. The x-axis is the mean of the count estimates calculated across the 40 bootstraps. The y-axis is the variance of the count estimates calculated across the 40 bootstraps.

Supplementary Figure 9 Median relative difference from 30 million 75-bp PE reads simulated with error for different values of k.

Median relative difference from 30M 75bp PE reads simulated with error for different values of k. The “k-mers method” uses the k-compatibility of each k-mer independently and runs the EM algorithm on k-mers, whereas kallisto uses the intersection of k-compatibility classes across both ends of reads. When there are errors in the reads, kallisto requires smaller k-mer lengths for robustness in pseudoalignment.

Supplementary Figure 10 Run time for index building and quantification

Run time for index building and quantification as a function of k-mer length for one of the simulated samples.

Supplementary Figure 11 The distribution of the number of k-mers hashed per read.

The distribution of the number of k-mers hashed per read for k=31. Note that for the majority of reads (61.35%) only two k-mers are hashed. This happens when the entire read pseudoaligns to a single contig of the T-DBG and we can skip to the end of the read. Since we also check the last k-mer we can skip over, the most common cases are checking 2, 4, 6, and 8 k-mers. Only 1.6% of reads required hashing every k-mer of the read.
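The skipping behaviour described in this legend can be modelled with a small counting sketch in Python. It is a toy model under stated assumptions, not kallisto's implementation: the index is a plain dictionary from k-mer to a hypothetical (contig id, offset) pair, each k-mer is assumed to occur exactly once, reverse complements are ignored, and a failed skip is charged as if every skipped k-mer had been hashed.

```python
from typing import Dict, List, Tuple


def index_contigs(contigs: List[str], k: int) -> Dict[str, Tuple[int, int]]:
    """Toy index: k-mer -> (contig id, offset of the k-mer within that contig)."""
    index: Dict[str, Tuple[int, int]] = {}
    for cid, seq in enumerate(contigs):
        for off in range(len(seq) - k + 1):
            index[seq[off:off + k]] = (cid, off)
    return index


def count_hashed_kmers(read: str, contigs: List[str],
                       index: Dict[str, Tuple[int, int]], k: int) -> int:
    """Count index lookups needed for one read when skipping along contigs."""
    n = len(read) - k + 1  # number of k-mers in the read
    pos = 0
    hashed = 0
    while pos < n:
        hit = index.get(read[pos:pos + k])
        hashed += 1
        if hit is None:
            pos += 1
            continue
        cid, off = hit
        last_offset = len(contigs[cid]) - k
        # Skip as far as the end of the contig or the end of the read allows.
        jump = min(n - 1 - pos, last_offset - off)
        if jump > 0:
            target = index.get(read[pos + jump:pos + jump + k])
            hashed += 1  # the k-mer at the skip target is always checked
            if target != (cid, off + jump):
                hashed += jump - 1  # skip rejected: count the skipped k-mers too
        pos += jump + 1
    return hashed


# Two made-up contigs with no shared 5-mers.
CONTIGS = ["ACGTTGCAAGCTA", "GGATCCTTAGGCA"]
INDEX = index_contigs(CONTIGS, k=5)
WITHIN_CONTIG = count_hashed_kmers("CGTTGCAAGC", CONTIGS, INDEX, k=5)
WITH_END_ERROR = count_hashed_kmers("CGTTGCAAGG", CONTIGS, INDEX, k=5)
```

A read lying entirely within one contig costs two lookups in this model (its first and last k-mer), matching the most common case in the legend. A read whose final k-mer does not verify the skip falls back to the full count of six k-mers.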

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–11 (PDF 878 kb)

Supplementary Table 1a

Performance of quantification as measured by SEQC qPCR (XLSX 10 kb)

Supplementary Table 1b

Gene level performance of quantification as measured by SEQC (XLSX 10 kb)

Supplementary Table 2

Performance of kallisto with and without bias (XLSX 10 kb)

Supplementary Software (ZIP 1034 kb)

Supplementary Code (ZIP 25926 kb)

About this article

Cite this article

Bray, N., Pimentel, H., Melsted, P. et al. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 34, 525–527 (2016). https://doi.org/10.1038/nbt.3519

  • DOI: https://doi.org/10.1038/nbt.3519
