Near-optimal probabilistic RNA-seq quantification

Journal: Nature Biotechnology
Volume: 34
Pages: 525–527
DOI: 10.1038/nbt.3519

Abstract

We present kallisto, an RNA-seq quantification program that is two orders of magnitude faster than previous approaches and achieves similar accuracy. Kallisto pseudoaligns reads to a reference, producing a list of transcripts that are compatible with each read while avoiding alignment of individual bases. We use kallisto to analyze 30 million unaligned paired-end RNA-seq reads in <10 min on a standard laptop computer. This removes a major computational bottleneck in RNA-seq analysis.

Figures

  1. Figure 1: Overview of kallisto.

    The input consists of a reference transcriptome and reads from an RNA-seq experiment. (a) An example of a read (in black) and three overlapping transcripts with exonic regions as shown. (b) An index is constructed by creating the transcriptome de Bruijn Graph (T-DBG) where nodes (v1, v2, v3, ... ) are k-mers, each transcript corresponds to a colored path as shown and the path cover of the transcriptome induces a k-compatibility class for each k-mer. (c) Conceptually, the k-mers of a read are hashed (black nodes) to find the k-compatibility class of a read. (d) Skipping (black dashed lines) uses the information stored in the T-DBG to skip k-mers that are redundant because they have the same k-compatibility class. (e) The k-compatibility class of the read is determined by taking the intersection of the k-compatibility classes of its constituent k-mers.
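The indexing-and-intersection idea in panels b–e can be sketched in a few lines of Python. This is an illustrative toy (a plain hash map from k-mers to transcript sets, with no T-DBG contigs or skipping), not the kallisto implementation; the function names and example sequences are ours.

```python
# Toy sketch of pseudoalignment via k-compatibility classes.
# A real T-DBG index also stores contigs to enable skipping (panel d);
# here we use a plain dict from each k-mer to the set of transcripts
# containing it (its k-compatibility class).

def build_index(transcripts, k):
    """Map each k-mer to its k-compatibility class (set of transcript ids)."""
    index = {}
    for tid, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(tid)
    return index

def pseudoalign(read, index, k):
    """Intersect the k-compatibility classes of the read's k-mers (panel e)."""
    compat = None
    for i in range(len(read) - k + 1):
        klass = index.get(read[i:i + k])
        if klass is None:
            continue  # k-mer absent from the transcriptome (e.g. a sequencing error)
        compat = set(klass) if compat is None else compat & klass
    return compat or set()

transcripts = {"t1": "ATCGATCGAT", "t2": "TTCGATCGAA", "t3": "ATCGATCGAA"}
index = build_index(transcripts, k=5)
print(sorted(pseudoalign("TCGATCGAA", index, k=5)))  # ['t2', 't3']
```

The read is compatible with t2 and t3 but not t1, because its final 5-mer occurs only in those two transcripts; no base-level alignment is ever computed.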

  2. Figure 2: Performance of kallisto and other methods.

    (a) Accuracy of kallisto, Cufflinks, Sailfish, EMSAR, eXpress and RSEM on 20 RSEM simulations of 30 million 75-bp paired-end reads based on the abundances and error profile of GEUVADIS sample NA12716_7 (selected for its depth of sequencing). For each simulation, we report the accuracy as the median relative difference in the estimated read count of each transcript. Estimated counts were used rather than transcripts per million (TPM) because the latter is based on both the assignment of ambiguous reads and the estimation of effective lengths of transcripts, so a program might be penalized for having a differing notion of effective length despite accurately assigning reads. The values reported are means across the 20 simulations (the variance was too small to be visible in this plot). Relative difference is defined as the absolute difference between the estimated abundance and the ground truth divided by the average of the two. (b) Total running time in minutes for processing the 20 simulated data sets of 30 million paired-end reads described in a. All processing was done using 20 cores, with programs being run with 20 threads when possible (Bowtie2, TopHat2, RSEM, Cufflinks) and 20 parallel processes otherwise (eXpress, kallisto). Each box represents one dataset. Since eXpress and kallisto process all datasets in parallel, the only quantification time shown is the maximum of all the quantifications.
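The relative-difference metric defined in panel a is simple to compute directly from its definition; a minimal sketch (function name and the zero/zero convention are ours):

```python
from statistics import median

def median_relative_difference(estimated, truth):
    """Median over transcripts of |est - truth| / mean(est, truth)."""
    diffs = []
    for est, tru in zip(estimated, truth):
        if est == 0 and tru == 0:
            diffs.append(0.0)  # both zero: treat as perfect agreement
        else:
            diffs.append(abs(est - tru) / ((est + tru) / 2))
    return median(diffs)

# Three transcripts: off by 20/90, off by 10/15, and exact.
print(median_relative_difference([100, 10, 50], [80, 20, 50]))  # 2/9 ≈ 0.222
```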

  3. Supplementary Fig. 1: Median relative difference for abundance estimates using varying values of k.

    Median relative difference for abundance estimates using varying values of k on a dataset of 30 million 75-bp paired-end reads that were simulated without errors. The “k-mers method” uses the k-compatibility of each k-mer independently and runs the EM algorithm on k-mers, whereas kallisto uses the intersection of k-compatibility classes across both ends of a read. Even for k=75, the full read length in the simulation, independent use of k-mers results in a significant drop in accuracy due to the loss of paired-end information.

  4. Supplementary Fig. 2: Accuracy of kallisto, Cufflinks, Sailfish, eXpress and RSEM.

    Accuracy of kallisto, Cufflinks, Sailfish, eXpress and RSEM on 20 RSEM simulations of 30 million 75-bp paired-end reads based on the TPM estimates and error profile of GEUVADIS sample NA12716 (selected for its depth of sequencing). For each simulation we report the accuracy as the median relative difference in the estimated TPM value of each transcript. The values reported are means across the 20 simulations (the variance was too small to be visible in this plot). Relative difference is defined as the absolute difference between the estimated TPM values and the ground truth divided by the average of the two.

  5. Supplementary Fig. 3: Performance of different quantification programs on the set of paralogs in the human genome.

    Performance of different quantification programs on the set of paralogs in the human genome supplied by the Duplicated Genes Database (http://dgd.genouest.org). This set includes 8,636 transcripts in 3,163 genes.

  6. Supplementary Fig. 4: Count distribution of one simulation.

    Count distribution of one simulation. The left panel contains the transcripts used in Supplementary Figure 3. The right panel contains the remaining transcripts. The x-axis is on the log scale. Both distributions appear very similar, suggesting that the drop in performance in Supplementary Figure 3 is from sequence similarity and not oddities in the distribution such as very low counts.

  7. Supplementary Fig. 5: Comparison of technical variance in abundances.

    The data come from a single library of 216 million (216M) 101-bp paired-end reads. Each point corresponds to a transcript and is colored by the decile of its expression level in the single bootstrapped subsample. The y-axis shows the variance of abundance estimates across 40 subsamples of 30M reads each. The x-axis shows the variance computed from 40 bootstraps of a single subsampled dataset of 30M reads. The red lines emanating from the lower left corner consist of transcripts that have an estimated abundance of zero in the single bootstrapped experiment but show expression in some of the subsamples (12,968 transcripts), and vice versa (720 transcripts).
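The bootstrap principle behind this comparison — resample the reads with replacement and re-quantify to estimate technical variance without re-sequencing — can be illustrated with a toy counter. This sketch resamples read labels and re-counts; kallisto re-runs its EM on resampled equivalence-class counts, which this simplification omits. All names are ours.

```python
import random

def bootstrap_counts(read_classes, n_boot, seed=0):
    """Variance of per-class read counts across bootstrap resamples.

    read_classes: one equivalence-class label per sequenced read.
    Each bootstrap draws len(read_classes) reads with replacement and
    re-counts them, mimicking technical resampling of the library.
    """
    rng = random.Random(seed)
    n = len(read_classes)
    labels = sorted(set(read_classes))
    counts = {lab: [] for lab in labels}
    for _ in range(n_boot):
        resample = [read_classes[rng.randrange(n)] for _ in range(n)]
        for lab in labels:
            counts[lab].append(resample.count(lab))
    # population variance of each class's count across the bootstraps
    variances = {}
    for lab, cs in counts.items():
        mean = sum(cs) / n_boot
        variances[lab] = sum((c - mean) ** 2 for c in cs) / n_boot
    return variances

reads = ["t1"] * 70 + ["t2"] * 30   # toy library: 100 reads, two transcripts
print(bootstrap_counts(reads, n_boot=40))
```

With only two classes the counts are complementary, so both variances come out equal; with real equivalence classes the per-transcript variances differ, which is what the figure compares against subsampling.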

  8. Supplementary Fig. 6: Median relative error (with respect to 1,000 bootstraps) of inferred transcript variances.

    Median relative error (with respect to 1,000 bootstraps) of inferred transcript variances as a function of the number of bootstrap samples performed. The relative error with 40 bootstraps (red line) is 7.8%.

  9. Supplementary Fig. 7: Relationship between the mean and variance of estimated counts from subsamples.

    Relationship between the mean and variance of estimated counts for each transcript (x and y axes are on log scale) based on 40 subsamples of 30M reads from a dataset of 216M PE reads. The x-axis is the mean of each count estimate calculated across the subsamples. The y-axis is the variance of the count estimates calculated across subsamples.

  10. Supplementary Fig. 8: Relationship between the mean and variance of estimated counts from bootstraps.

    Relationship between the mean and variance of estimated counts for each transcript (x and y axes are on log scale) based on 40 bootstraps of a single subsample of 30M reads from the same 216M PE read dataset. The x-axis is the mean of the count estimates calculated across the 40 bootstraps. The y-axis is the variance of the count estimates calculated across the 40 bootstraps.

  11. Supplementary Fig. 9: Median relative difference from 30 million 75-bp PE reads simulated with error for different values of k.

    Median relative difference from 30 million 75-bp PE reads simulated with error for different values of k. The “k-mers method” uses the k-compatibility of each k-mer independently and runs the EM algorithm on k-mers, whereas kallisto uses the intersection of k-compatibility classes across both ends of reads. When there are errors in the reads, kallisto requires smaller k-mer lengths for robustness in pseudoalignment.

  12. Supplementary Fig. 10: Run time for index building and quantification.

    Run time for index building and quantification as a function of k-mer length for one of the simulated samples.

  13. Supplementary Fig. 11: The distribution of the number of k-mers hashed per read.

    The distribution of the number of k-mers hashed per read for k=31. Note that for the majority of reads (61.35%) only two k-mers are hashed. This happens when the entire read pseudoaligns to a single contig of the T-DBG and we can skip to the end of the read. Since we also check the last k-mer we can skip over, the most common cases are checking 2, 4, 6, and 8 k-mers. Only 1.6% of reads required hashing every k-mer of the read.
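The 2-4-6-8 pattern follows directly from the skipping rule, which can be modeled with a toy counter. Here `index_lookup` is a hypothetical stand-in for the T-DBG index: it reports how far the contig containing the current k-mer extends, which is the information kallisto uses to decide how far to jump.

```python
def kmers_hashed(read_len, k, index_lookup):
    """Count k-mer hash lookups for one read under a toy skipping rule.

    index_lookup(pos) -> number of k-mer positions that can be skipped
    from `pos` (i.e. how far the contig containing this k-mer extends).
    After each jump we also hash the k-mer we landed on, which is why
    even counts (2, 4, 6, ...) dominate.
    """
    last = read_len - k          # position of the read's final k-mer
    pos, hashed = 0, 0
    while pos <= last:
        hashed += 1              # hash the k-mer at `pos`
        skip = min(index_lookup(pos), last - pos)
        if skip > 0:
            pos += skip
            hashed += 1          # also check the k-mer we skipped to
        pos += 1
    return hashed

# A 75-bp read whose 31-mers all lie in one long contig: 2 lookups total.
print(kmers_hashed(75, 31, lambda pos: 10**6))  # 2
```

A read spanning a contig boundary mid-way requires one more jump, giving 4 lookups; only when every k-mer sits at a contig end does the count reach the full 45 k-mers of a 75-bp read at k=31.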

Change history

Corrected online 27 July 2016
In the version of this article initially published, in the HTML version only, the equation “αtN > 0.01” was formatted incorrectly. In addition, in the Figure 1 legend, the formatting of the nodes was incorrect (v_1, etc., rather than v1). The errors have been corrected in the HTML and PDF versions of the article.

References

  1. Kim, D. et al. Genome Biol. 14, R36 (2013).
  2. Trapnell, C. et al. Nat. Biotechnol. 28, 511–515 (2010).
  3. Roberts, A. & Pachter, L. Nat. Methods 10, 71–73 (2013).
  4. Anders, S., Pyl, P.T. & Huber, W. Bioinformatics 31, 166–169 (2015).
  5. Patro, R., Mount, S.M. & Kingsford, C. Nat. Biotechnol. 32, 462–464 (2014).
  6. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Nat. Methods 5, 621–628 (2008).
  7. Nicolae, M., Mangul, S., Măndoiu, I. & Zelikovsky, A. in Algorithms in Bioinformatics (eds. Moulton, V. & Singh, M.) 202–214 (Springer, 2010).
  8. Compeau, P.E.C., Pevzner, P.A. & Tesler, G. Nat. Biotechnol. 29, 987–991 (2011).
  9. Li, B. & Dewey, C.N. BMC Bioinformatics 12, 323 (2011).
  10. SEQC/MAQC-III Consortium. Nat. Biotechnol. 32, 903–914 (2014).
  11. Lappalainen, T. et al. Nature 501, 506–511 (2013).
  12. Roberts, A., Trapnell, C., Donaghey, J., Rinn, J.L. & Pachter, L. Genome Biol. 12, R22 (2011).
  13. Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M. & Gilad, Y. Genome Res. 18, 1509–1517 (2008).
  14. Wold, B. & Myers, R.M. Nat. Methods 5, 19–21 (2008).
  15. Iqbal, Z., Caccamo, M., Turner, I., Flicek, P. & McVean, G. Nat. Genet. 44, 226–232 (2012).
  16. Lee, S., Seo, C.H., Alver, B.H., Lee, S. & Park, P.J. BMC Bioinformatics 16, 278 (2015).
  17. Köster, J. & Rahmann, S. Bioinformatics 28, 2520–2522 (2012).

Author information

Affiliations

  1. Innovative Genomics Initiative, University of California, Berkeley, California, USA.

    • Nicolas L Bray
  2. Department of Computer Science, University of California, Berkeley, California, USA.

    • Harold Pimentel &
    • Lior Pachter
  3. Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavik, Iceland.

    • Páll Melsted
  4. Department of Mathematics, University of California, Berkeley, California, USA.

    • Lior Pachter
  5. Department of Molecular & Cell Biology, University of California, Berkeley, California, USA.

    • Lior Pachter

Contributions

N.L.B. and L.P. developed the concept of pseudoalignment and conceived the idea for applying it to RNA-seq quantification. P.M. conceived the implementation using De Bruijn graphs. N.L.B., H.P., P.M. and L.P. designed the kallisto software and N.L.B. implemented a prototype. H.P. and P.M. wrote the current kallisto implementation. N.L.B. and H.P. automated production of the results. N.L.B., H.P., P.M. and L.P. analyzed results and wrote the paper.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Text and Figures (899 KB)

    Supplementary Figures 1–11

Excel files

  1. Supplementary Table 1a (10 KB)

    Performance of quantification as measured by SEQC qPCR

  2. Supplementary Table 1b (10 KB)

    Gene level performance of quantification as measured by SEQC

  3. Supplementary Table 2 (11 KB)

    Performance of kallisto with and without bias

Zip files

  1. Supplementary Software (1059 KB)
  2. Supplementary Code (26 KB)

Additional data