Nature Biotechnology  Research  Brief Communications
Near-optimal probabilistic RNA-seq quantification
 Nicolas L Bray^{1},
 Harold Pimentel^{2},
 Páll Melsted^{3},
 Lior Pachter^{2, 4, 5}
 Journal name: Nature Biotechnology
 Volume: 34
 Pages: 525–527
 Year published:
 DOI: 10.1038/nbt.3519
Abstract
We present kallisto, an RNA-seq quantification program that is two orders of magnitude faster than previous approaches and achieves similar accuracy. Kallisto pseudoaligns reads to a reference, producing a list of transcripts that are compatible with each read while avoiding alignment of individual bases. We use kallisto to analyze 30 million unaligned paired-end RNA-seq reads in <10 min on a standard laptop computer. This removes a major computational bottleneck in RNA-seq analysis.
Change history
 Corrected online 27 July 2016
 In the version of this article initially published, in the HTML version only, the equation “α_{t}N > 0.01” was written as “α_{tN} > 0.01.” In addition, in the Figure 1 legend, the formatting of the nodes was incorrect (v_1, etc., rather than v_{1}). The errors have been corrected in the HTML and PDF versions of the article.
Author information
Affiliations

1. Innovative Genomics Initiative, University of California, Berkeley, California, USA.
 Nicolas L Bray

2. Department of Computer Science, University of California, Berkeley, California, USA.
 Harold Pimentel & Lior Pachter

3. Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavik, Iceland.
 Páll Melsted

4. Department of Mathematics, University of California, Berkeley, California, USA.
 Lior Pachter

5. Department of Molecular & Cell Biology, University of California, Berkeley, California, USA.
 Lior Pachter
Contributions
N.L.B. and L.P. developed the concept of pseudoalignment and conceived the idea of applying it to RNA-seq quantification. P.M. conceived the implementation using de Bruijn graphs. N.L.B., H.P., P.M. and L.P. designed the kallisto software, and N.L.B. implemented a prototype. H.P. and P.M. wrote the current kallisto implementation. N.L.B. and H.P. automated production of the results. N.L.B., H.P., P.M. and L.P. analyzed results and wrote the paper.
Competing financial interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Figures
 Supplementary Figure 1: Median relative difference for abundance estimates using varying values of k. (39 KB)
Median relative difference for abundance estimates using varying values of k on a dataset of 30 million 75-bp paired-end reads that were simulated without errors. The “k-mers method” uses the k-compatibility of each k-mer independently and runs the EM algorithm on k-mers, whereas kallisto uses the intersection of k-compatibility classes across both ends of a read. Even for k=75, the full read length in the simulation, independent use of k-mers results in a significant drop in accuracy due to the loss of paired-end information.
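The intersection of k-compatibility classes described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not kallisto's implementation: it uses a plain dictionary instead of a de Bruijn graph, ignores strand and sequencing errors, and skips k-mers missing from the index. The function names `build_index` and `pseudoalign` and the toy sequences are hypothetical.

```python
from collections import defaultdict


def build_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read_ends, index, k):
    """Intersect the k-compatibility classes of all k-mers in all read ends.

    Assumption for this sketch: k-mers absent from the index are skipped
    rather than treated as evidence against every transcript.
    """
    compatible = None
    for end in read_ends:
        for i in range(len(end) - k + 1):
            hits = index.get(end[i:i + k])
            if not hits:
                continue
            compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


# Toy transcriptome: t1 shares its start with t2 and its end with t3.
transcripts = {
    "t1": "ACGTACGGTCA",
    "t2": "ACGTACGGAGT",
    "t3": "TTTTACGGTCA",
}
index = build_index(transcripts, k=5)

single_end = pseudoalign(("ACGTAC",), index, k=5)
paired_end = pseudoalign(("ACGTAC", "CGGTCA"), index, k=5)
```

In this toy case the first end alone is compatible with both t1 and t2, while intersecting across both ends leaves only t1, which is the kind of information that is lost when k-mers are used independently.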
 Supplementary Figure 2: Accuracy of kallisto, Cufflinks, Sailfish, eXpress and RSEM. (29 KB)
Accuracy of kallisto, Cufflinks, Sailfish, eXpress and RSEM on 20 RSEM simulations of 30 million 75-bp paired-end reads based on the TPM estimates and error profile of Geuvadis sample NA12716 (selected for its depth of sequencing). For each simulation we report the accuracy as the median relative difference in the estimated TPM value of each transcript. The values reported are means across the 20 simulations (the variance was too small to display in this plot). Relative difference is defined as the absolute difference between the estimated TPM value and the ground truth, divided by the average of the two.
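The relative difference defined in this caption can be written out directly. The sketch below is illustrative only; the function names are hypothetical, and treating the 0/0 case (estimate and truth both zero) as a difference of zero is an assumption that the caption does not state.

```python
from statistics import median


def relative_difference(estimate, truth):
    """Absolute difference divided by the average of the two values."""
    if estimate == 0 and truth == 0:
        return 0.0  # assumption: perfect agreement at zero counts as no difference
    return abs(estimate - truth) / ((estimate + truth) / 2)


def median_relative_difference(estimates, truths):
    """Median of the per-transcript relative differences."""
    return median(relative_difference(e, t) for e, t in zip(estimates, truths))


# Three toy transcripts: estimated TPM versus simulated ground truth.
estimated_tpm = [3.0, 0.0, 0.0]
true_tpm = [1.0, 5.0, 0.0]
mrd = median_relative_difference(estimated_tpm, true_tpm)
```

With this definition the measure is symmetric in the two values and bounded above by 2, which is reached when exactly one of them is zero.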
 Supplementary Figure 3: Performance of different quantification programs on the set of paralogs in the human genome. (27 KB)
Performance of different quantification programs on the set of paralogs in the human genome supplied by the Duplicated Genes Database (http://dgd.genouest.org). This set includes 8,636 transcripts in 3,163 genes.
 Supplementary Figure 4: Count distribution of one simulation. (28 KB)
Count distribution of one simulation. The left panel contains the transcripts used in Supplementary Figure 3. The right panel contains the remaining transcripts. The x-axis is on the log scale. The two distributions appear very similar, suggesting that the drop in performance in Supplementary Figure 3 is due to sequence similarity and not to oddities in the distribution such as very low counts.
 Supplementary Figure 5: Comparison of technical variance in abundances. (113 KB)
The data come from a single library of 216M 101-bp paired-end reads. Each point corresponds to a transcript and is colored by the decile of its expression level in the single bootstrapped subsample. The y-axis represents the variance of abundance estimates across 40 subsamples, with 30M reads in each subsample. The x-axis represents the variance as computed from 40 bootstraps of a single subsampled dataset of 30M reads. The red lines emanating from the lower left corner consist of transcripts that have an estimated abundance of zero in the single bootstrapped experiment but show expression in some of the subsamples (12,968 transcripts), and vice versa (720 transcripts).
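The bootstrap variance on the x-axis can be illustrated with a small sketch. It assumes that a bootstrap resamples reads with replacement, which is equivalent to a multinomial draw over compatibility-class counts, and that abundances are re-estimated on each resample. The quantifier `split_evenly` is a deliberately simple stand-in, not kallisto's EM, and all names and counts here are hypothetical.

```python
import random
from statistics import variance


def resample_classes(class_counts, rng):
    """Draw as many reads as observed, with replacement, from the class frequencies."""
    classes = list(class_counts)
    weights = [class_counts[c] for c in classes]
    draws = rng.choices(classes, weights=weights, k=sum(weights))
    resampled = dict.fromkeys(classes, 0)
    for c in draws:
        resampled[c] += 1
    return resampled


def split_evenly(class_counts):
    """Toy stand-in quantifier: share each class count equally among its transcripts."""
    estimates = {}
    for members, count in class_counts.items():
        for t in members:
            estimates[t] = estimates.get(t, 0.0) + count / len(members)
    return estimates


def bootstrap_variance(class_counts, quantify=split_evenly, n_boot=40, seed=0):
    """Per-transcript sample variance of the estimates across n_boot resamples."""
    rng = random.Random(seed)
    runs = [quantify(resample_classes(class_counts, rng)) for _ in range(n_boot)]
    transcripts = sorted({t for run in runs for t in run})
    return {t: variance([run.get(t, 0.0) for run in runs]) for t in transcripts}


# Keys are tuples of compatible transcripts; values are read counts.
counts = {("a",): 300, ("a", "b"): 200, ("b",): 100}
one_resample = resample_classes(counts, random.Random(1))
variances = bootstrap_variance(counts)
```

Passing a different `quantify` callable would let the same resampling loop wrap any estimator that maps class counts to per-transcript estimates.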
 Supplementary Figure 6: Median relative error (with respect to 1,000 bootstraps) of inferred transcript variances. (34 KB)
Median relative error (with respect to 1,000 bootstraps) of inferred transcript variances as a function of the number of bootstrap samples performed. The relative error with 40 bootstraps (red line) is 7.8%.
 Supplementary Figure 7: Relationship between the mean and variance of estimated counts from subsamples. (109 KB)
Relationship between the mean and variance of estimated counts for each transcript (x- and y-axes are on the log scale) based on 40 subsamples of 30M reads from a dataset of 216M PE reads. The x-axis is the mean of each count estimate calculated across the subsamples. The y-axis is the variance of the count estimates calculated across the subsamples.
 Supplementary Figure 8: Relationship between the mean and variance of estimated counts from bootstraps. (92 KB)
Relationship between the mean and variance of estimated counts for each transcript (x- and y-axes are on the log scale) based on 40 bootstraps of a single subsample of 30M reads from the same 216M PE read dataset. The x-axis is the mean of the count estimates calculated across the 40 bootstraps. The y-axis is the variance of the count estimates calculated across the 40 bootstraps.
 Supplementary Figure 9: Median relative difference from 30 million 75-bp PE reads simulated with error for different values of k. (40 KB)
Median relative difference from 30M 75-bp PE reads simulated with error for different values of k. The “k-mers method” uses the k-compatibility of each k-mer independently and runs the EM algorithm on k-mers, whereas kallisto uses the intersection of k-compatibility classes across both ends of reads. When there are errors in the reads, kallisto requires smaller k-mer lengths for robustness in pseudoalignment.
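For readers unfamiliar with the EM step mentioned in this caption, the following is a minimal sketch of an EM iteration over compatibility classes. It assumes the common model in which a read from a class is attributed to transcript t in proportion to α_t divided by an effective length; the function name, parameter names, and toy counts are hypothetical, and this is not presented as kallisto's actual code.

```python
def em_abundances(class_counts, eff_lengths, n_iter=200):
    """Estimate the fraction of reads from each transcript by EM.

    class_counts: maps a tuple of compatible transcripts to a read count.
    eff_lengths:  maps each transcript to its (assumed) effective length.
    """
    transcripts = sorted(eff_lengths)
    alpha = {t: 1.0 / len(transcripts) for t in transcripts}
    total = sum(class_counts.values())
    for _ in range(n_iter):
        assigned = dict.fromkeys(transcripts, 0.0)
        for members, count in class_counts.items():
            # E-step: split the class count in proportion to alpha_t / length_t.
            weights = {t: alpha[t] / eff_lengths[t] for t in members}
            denom = sum(weights.values())
            if denom == 0:
                continue
            for t, w in weights.items():
                assigned[t] += count * w / denom
        # M-step: renormalize the expected counts into read fractions.
        alpha = {t: assigned[t] / total for t in transcripts}
    return alpha


# Two equal-length transcripts with 30 and 10 unique reads and 20 shared reads.
counts = {("a",): 30, ("b",): 10, ("a", "b"): 20}
lengths = {"a": 100.0, "b": 100.0}
alpha = em_abundances(counts, lengths)
```

In this toy case the shared reads are split 3:1, following the ratio of uniquely assigned reads, so the estimates settle at 0.75 and 0.25.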
 Supplementary Figure 10: Run time for index building and quantification (35 KB)
Run time for index building and quantification as a function of k-mer length for one of the simulated samples.
 Supplementary Figure 11: The distribution of the number of k-mers hashed per read. (34 KB)
The distribution of the number of k-mers hashed per read for k=31. Note that for the majority of reads (61.35%) only two k-mers are hashed. This happens when the entire read pseudoaligns to a single contig of the T-DBG and we can skip to the end of the read. Since we also check the last k-mer we can skip over, the most common cases are checking 2, 4, 6, and 8 k-mers. Only 1.6% of reads required hashing every k-mer of the read.
PDF files
 Supplementary Text and Figures (899 KB)
Supplementary Figures 1–11
Excel files
 Supplementary Table 1a (10 KB)
Performance of quantification as measured by SEQC qPCR
 Supplementary Table 1b (10 KB)
Gene level performance of quantification as measured by SEQC
 Supplementary Table 2 (11 KB)
Performance of kallisto with and without bias