Near-optimal probabilistic RNA-seq quantification

Bray, Nicolas L; Pimentel, Harold; Melsted, Páll; Pachter, Lior

doi:10.1038/nbt.3519

Brief Communication
Published: 04 April 2016

Near-optimal probabilistic RNA-seq quantification

Nicolas L Bray¹,
Harold Pimentel²,
Páll Melsted ORCID: orcid.org/0000-0002-8418-6724³ &
…
Lior Pachter^2,4,5

Nature Biotechnology volume 34, pages 525–527 (2016)Cite this article

105k Accesses
3053 Citations
198 Altmetric
Metrics details

Subjects

An Erratum to this article was published on 09 August 2016

This article has been updated

Abstract

We present kallisto, an RNA-seq quantification program that is two orders of magnitude faster than previous approaches and achieves similar accuracy. Kallisto pseudoaligns reads to a reference, producing a list of transcripts that are compatible with each read while avoiding alignment of individual bases. We use kallisto to analyze 30 million unaligned paired-end RNA-seq reads in <10 min on a standard laptop computer. This removes a major computational bottleneck in RNA-seq analysis.

Access through your institution

Buy or subscribe

This is a preview of subscription content, access via your institution

Access options

Access through your institution

Buy this article

Purchase on Springer Link
Instant access to full article PDF

Buy now

Prices may be subject to local taxes which are calculated during checkout

**Figure 2: Performance of kallisto and other methods.**

Inferring gene regulatory networks from single-cell multiome data using atlas-scale external data

Article Open access 12 April 2024

Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis

Article Open access 25 March 2024

Tissue-specific enhancer–gene maps from multimodal single-cell data identify causal disease alleles

Article 09 April 2024

Change history

27 July 2016
In the version of this article initially published, in the HTML version only, the equation “α_tN > 0.01” was written as “α_tN > 0.01.” In addition, in the Figure 1 legend, the formatting of the nodes was incorrect (v_1, etc., rather than v₁). The errors have been corrected in the HTML and PDF versions of the article.

References

Kim, D. et al. Genome Biol. 14, R36 (2013).
Article Google Scholar
Trapnell, C. et al. Nat. Biotechnol. 28, 511–515 (2010).
Article CAS Google Scholar
Roberts, A. & Pachter, L. Nat. Methods 10, 71–73 (2013).
Article CAS Google Scholar
Anders, S., Pyl, P.T. & Huber, W. Bioinformatics 31, 166–169 (2015).
Article CAS Google Scholar
Patro, R., Mount, S.M. & Kingsford, C. Nat. Biotechnol. 32, 462–464 (2014).
Article CAS Google Scholar
Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Nat. Methods 5, 621–628 (2008).
Article CAS Google Scholar
Nicolae, M., Mangul, S., Măndoiu, I. & Zelikovsky, A. in Algorithms in Bioinformatics (eds. Moulton, V. & Singh, M.) 202–214 (Springer, 2010).
Compeau, P.E.C., Pevzner, P.A. & Tesler, G. Nat. Biotechnol. 29, 987–991 (2011).
Article CAS Google Scholar
Li, B. & Dewey, C.N. BMC Bioinformatics 12, 323 (2011).
Article CAS Google Scholar
SEQC/MAQC-III Consortium. Nat. Biotechnol. 32, 903–914 (2014).
Lappalainen, T. et al. Nature 501, 506–511 (2013).
Article CAS Google Scholar
Roberts, A., Trapnell, C., Donaghey, J., Rinn, J.L. & Pachter, L. Genome Biol. 12, R22 (2011).
Article CAS Google Scholar
Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M. & Gilad, Y. Genome Res. 18, 1509–1517 (2008).
Article CAS Google Scholar
Wold, B. & Myers, R.M. Nat. Methods 5, 19–21 (2008).
Article CAS Google Scholar
Iqbal, Z., Caccamo, M., Turner, I., Flicek, P. & McVean, G. Nat. Genet. 44, 226–232 (2012).
Article CAS Google Scholar
Lee, S., Seo, C.H., Alver, B.H., Lee, S. & Park, P.J. BMC Bioinformatics 16, 278 (2015).
Article Google Scholar
Köster, J. & Rahmann, S. Bioinformatics 28, 2520–2522 (2012).
Article Google Scholar

Download references

Acknowledgements

N.L.B., H.P. and L.P. were partially funded by NIH R01 HG006129. P.M. was partially funded by a Fulbright fellowship.

Author information

Authors and Affiliations

Innovative Genomics Initiative, University of California, Berkeley, California, USA.,
Nicolas L Bray
Department of Computer Science, University of California, Berkeley, California, USA
Harold Pimentel & Lior Pachter
Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavik, Iceland
Páll Melsted
Department of Mathematics, University of California, Berkeley, California, USA
Lior Pachter
Department of Molecular & Cell Biology, University of California, Berkeley, California, USA.,
Lior Pachter

Authors

Nicolas L Bray
View author publications
You can also search for this author in PubMed Google Scholar
Harold Pimentel
View author publications
You can also search for this author in PubMed Google Scholar
Páll Melsted
View author publications
You can also search for this author in PubMed Google Scholar
Lior Pachter
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

N.L.B. and L.P. developed the concept of pseudoalignment and conceived the idea for applying it to RNA-seq quantification. P.M. conceived the implementation using De Bruijn graphs. N.L.B., H.P., P.M. and L.P. designed the kallisto software and N.L.B. implemented a prototype. H.P. and P.M. wrote the current kallisto implementation. N.B. and H.P. automated production of the results. N.L.B., H.P., P.M. and L.P. analyzed results and wrote the paper.

Corresponding author

Correspondence to Lior Pachter.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Median relative difference for abundance estimates using varying values of k.

Median relative difference for abundance estimates using varying values of k on a dataset of 30 million 75bp paired-end reads that were simulated without errors. The “k-mers method” uses the k-compatibility of each k-mer independently and runs the EM algorithm on k-mers, whereas kallisto uses the intersection of k-compatibility classes across both ends of a read. Even for k=75, the full read length in the simulation, independent use of k-mers results in a significant drop in accuracy due to the loss of paired-end information.

Supplementary Figure 2 Accuracy of kallisto, Cufflinks, Sailfish, eXpress and RSEM.

Accuracy of kallisto, Cufflinks, Sailfish, eXpress and RSEM on 20 RSEM simulations of 30 million 75bp paired-end reads based on the TPM estimates and error profile of Geuvadis sample NA12716 (selected for its depth of sequencing). For each simulation we report the accuracy as the median relative difference in the estimated TPM value of each transcript. The values reported are means across the 20 simulations (the variance was too small for this plot). Relative difference is defined as the absolute difference between the estimated TPM values and the ground truth divided by the average of the two.

Supplementary Figure 3 Performance of different quantification programs on the set of paralogs in the human genome.

Performance of different quantification programs on the set of paralogs in the human genome supplied by the Duplicated Genes Database (http://dgd.genouest.org). This set includes 8,636 transcripts in 3,163 genes.

Supplementary Figure 4 Count distribution of one simulation.

Count distribution of one simulation. The left panel contains the transcripts used in Supplementary Figure 3. The right panel contains the remaining transcripts. The x-axis is on the log scale. Both distributions appear very similar, suggesting that the drop in performance in Supplementary Figure 3 is from sequence similarity and not oddities in the distribution such as very low counts.

Supplementary Figure 5 Comparison of technical variance in abundances.

The data comes from a single library with 216M, 101bp paired-end reads sequenced. Each point corresponds to a transcript and is colored by the decile of its expression level in the single bootstrapped subsample. The Y-axis represents variance of abundance estimates across 40 subsamples, with 30M reads in each subsample. The X-axis represents variance as computed from 40 bootstraps of a single subsampled dataset of 30M reads. The red lines emanating from the lower left corner consist of transcripts that have an estimated abundance of zero in the single bootstrapped experiment, but show expression in some of the subsamples (12968 transcripts), and vice versa (720 transcripts).

Supplementary Figure 6 Median relative error (with respect to 1,000 bootstraps) of inferred transcript variances.

Median relative error (with respect to 1000 bootstraps) of inferred transcript variances as a function of number of bootstrap samples performed. The relative error with 40 bootstraps (red line) is 7.8%.

Supplementary Figure 7 Relationship between the mean and variance of estimated counts from subsamples.

Relationship between the mean and variance of estimated counts for each transcript (x and y axes are on log scale) based on 40 subsamples of 30M reads from a dataset of 216M PE reads. The x-axis is the mean of each count estimate calculated across the subsamples. The y-axis is the variance of the count estimates calculated across subsamples.

Supplementary Figure 8 Relationship between the mean and variance of estimated counts from bootstraps.

Relationship between the mean and variance of estimated counts for each transcript (x and y axes are on log scale) based on 40 bootstraps of a single subsample of 30M reads from the same 216M PE read dataset. The x-axis is the mean of the count estimates calculated across the 40 bootstraps. The y-axis is the variance of the count estimates calculated across the 40 bootstraps.

Supplementary Figure 9 Median relative difference from 30 million 75-bp PE reads simulated with error for different values of k.

Median relative difference from 30M 75bp PE reads simulated with error for different values of k. The “k-mers method” uses the k-compatibility of each k-mer independently and runs the EM algorithm on k-mers, whereas kallisto uses the intersection of k-compatibility classes across both ends of reads. When there are errors in the reads, kallisto requires smaller k-mer lengths for robustness in pseudoalignment.

Supplementary Figure 10 Run time for index building and quantification

Run time for index building and quantification as a function of k-mer length for one of the simulated samples.

Supplementary Figure 11 The distribution of the number of k-mers hashed per read.

The distribution of the number of k-mers hashed per read for k=31. Note that for the majority of reads (61.35%) only two k-mers are hashed. This happens when the entire read pseudoaligns to a single contig of the T-DBG and we can skip to the end of the read. Since we also check the last k-mer we can skip over, the most common cases are checking 2, 4, 6, and 8 k-mers. Only 1.6% of reads required hashing every k-mer of the read.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Bray, N., Pimentel, H., Melsted, P. et al. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 34, 525–527 (2016). https://doi.org/10.1038/nbt.3519

Download citation

Received: 15 October 2015
Accepted: 25 February 2016
Published: 04 April 2016
Issue Date: May 2016
DOI: https://doi.org/10.1038/nbt.3519

This article is cited by

Application of bulk segregant RNA-Seq (BSR-Seq) and allele-specific primers to study soybean powdery mildew resistance
- Cheng-Chun Huang
- Chen-Hsiang Lin
- Hao-Xun Chang
BMC Plant Biology (2024)
Whole-genome sequencing and comparative genomics reveal candidate genes associated with quality traits in Dioscorea alata
- Ana Paula Zotta Mota
- Komivi Dossa
- Hana Chaïr
BMC Genomics (2024)
Building in vitro tools for livestock genomics: chromosomal variation within the PK15 cell line
- M. Johnsson
- J. M. Hickey
- M. K. Jungnickel
BMC Genomics (2024)
Introgressions lead to reference bias in wheat RNA-seq analysis
- Benedict Coombes
- Thomas Lux
- Anthony Hall
BMC Biology (2024)
Natural variation of STKc_GSK3 kinase TaSG-D1 contributes to heat stress tolerance in Indian dwarf wheat
- Jie Cao
- Zhen Qin
- Mingming Xin
Nature Communications (2024)