Salmon provides fast and bias-aware quantification of transcript expression

Journal name:
Nature Methods
Volume:
14
Pages:
417–419
DOI:
doi:10.1038/nmeth.4197

We introduce Salmon, a lightweight method for quantifying transcript abundance from RNA-seq reads. Salmon combines a new dual-phase parallel inference algorithm and feature-rich bias models with an ultra-fast read mapping procedure. It is the first transcriptome-wide quantifier to correct for fragment GC-content bias, which, as we demonstrate here, substantially improves the accuracy of abundance estimates and the sensitivity of subsequent differential expression analysis.

At a glance

Figures

  1. Performance of Salmon.
    Figure 1: Performance of Salmon.

    (a) The median of absolute log-transformed fold changes (lfc) between the estimated and true transcript abundances under all 16 replicates of the Polyester simulated data. An lfc closer to 0 indicates more similar true and estimated abundances. The two halves of the plot show the distributions of each method's lfc values for samples simulated with different GC-bias curves that were learned from experimental data using alpine [5] (Online Methods). In the box plot, the middle line of each box is the median, the box spans the interquartile range (IQR), and the whiskers extend to the farthest observation within 1.5 times the IQR of the box; all points are plotted for clarity. (b) The distribution of mean absolute relative differences (MARDs; Online Methods) of Salmon, Salmon using traditional alignments (“Salmon (a)”), kallisto, and eXpress under 20 simulated replicates generated by RSEM-sim. (c) The sensitivity of Salmon, Salmon using traditional alignments (“Salmon (a)”), kallisto, and eXpress in finding truly differentially expressed (DE) transcripts at typical FDR values for Polyester simulated data. (d) For 30 GEUVADIS samples, the number of transcripts called as DE at an expected FDR of 1% when the contrast between groups (i.e., the center at which they were sequenced) is a technical confound and not related to biology.
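The MARD summary in panel (b) can be illustrated with a short Python sketch. The formula used here (absolute difference divided by the sum of the two values, taken as 0 when both are 0) is an assumption about the definition given in the Online Methods, which is not reproduced on this page. The function names `ard` and `mard` and the toy values are hypothetical.

```python
from statistics import mean


def ard(true_value, estimate):
    """Absolute relative difference for one transcript (assumed definition)."""
    if true_value == 0 and estimate == 0:
        # Both values are zero: treat the estimate as exact.
        return 0.0
    return abs(true_value - estimate) / (true_value + estimate)


def mard(true_values, estimates):
    """Mean of the per-transcript absolute relative differences."""
    if len(true_values) != len(estimates):
        raise ValueError("inputs must have the same length")
    return mean(ard(t, e) for t, e in zip(true_values, estimates))


# Toy abundances for four hypothetical transcripts.
truth = [10.0, 0.0, 5.0, 0.0]
estimated = [10.0, 0.0, 15.0, 2.0]
score = mard(truth, estimated)
```

Under this assumed definition each per-transcript value lies between 0 and 1, so a transcript that is truly absent but receives a nonzero estimate contributes the maximum penalty of 1.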

  2. Overview of Salmon’s method and components and execution timeline.
    Supplementary Fig. 1: Overview of Salmon’s method and components and execution timeline.

    Salmon accepts either raw (green arrows) or aligned (gray arrow) reads as input. When processing quasi-mappings or aligned reads, Salmon executes an online inference algorithm. This ensures that transcript abundance estimates are available to estimate weights for the rich equivalence classes, and to consider the appropriate conditional probabilities when learning the experimental parameters and foreground bias models. After a fragment’s contributions to the online abundance estimates and bias models have been computed, the fragment is placed into an appropriate equivalence class (or one is created if it does not yet exist). Once all of the fragments have been observed, the initial abundances and fragment equivalence classes are passed to the offline inference module. The offline module learns the background bias models (based on initial abundance estimates) and then corrects the effective transcript lengths to account for the appropriate biases. Finally, the offline inference algorithm (EM or VBEM) is run over the reduced representation of the data until convergence. Once estimation is complete, posterior samples are generated via Gibbs sampling or a bootstrap procedure if the user has requested this.
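The offline phase described in this caption runs EM over fragment equivalence classes, not over individual fragments. The sketch below is a deliberately simplified illustration of that idea and is not Salmon's implementation: it starts from a uniform guess where Salmon uses the online estimates, omits the per-class conditional weights and bias-corrected effective lengths, and does not implement the VBEM variant. The function name, the data layout, and the toy counts are hypothetical.

```python
def em_over_equivalence_classes(eq_classes, eff_lengths, n_iter=1000, tol=1e-10):
    """Estimate per-transcript fragment counts.

    eq_classes maps a tuple of transcript indices to the number of
    fragments compatible with exactly that set of transcripts.
    eff_lengths holds one effective length per transcript.
    """
    n = len(eff_lengths)
    total = sum(eq_classes.values())
    # Uniform starting point (an assumption made for this sketch).
    alpha = [total / n] * n
    for _ in range(n_iter):
        new_alpha = [0.0] * n
        for transcripts, count in eq_classes.items():
            # Weight each compatible transcript by abundance per unit length.
            weights = [alpha[t] / eff_lengths[t] for t in transcripts]
            denom = sum(weights)
            if denom == 0.0:
                continue
            # Split the class count among its transcripts in proportion.
            for t, w in zip(transcripts, weights):
                new_alpha[t] += count * w / denom
        change = max(abs(a - b) for a, b in zip(alpha, new_alpha))
        alpha = new_alpha
        if change < tol:
            break
    return alpha


# Two transcripts of equal effective length; 60 fragments are ambiguous.
classes = {(0,): 30, (0, 1): 60, (1,): 10}
counts = em_over_equivalence_classes(classes, [1000.0, 1000.0])
```

Because the loop visits each equivalence class once per iteration, the cost of an iteration depends on the number of classes, not on the number of fragments, which is the point of the reduced representation.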

  3. The false discovery rate (FDR) vs. sensitivity of detecting differentially expressed transcripts on Polyester simulated data
    Supplementary Fig. 2: The false discovery rate (FDR) vs. sensitivity of detecting differentially expressed transcripts on Polyester simulated data

    The false discovery rate (FDR) vs. the sensitivity of Salmon, Salmon (align), kallisto and eXpress on Polyester simulated RNA-seq data using empirically-derived fragment GC bias profiles. All methods were run with bias-correction enabled, but only Salmon’s model incorporates corrections for fragment GC bias. This leads to a large improvement in sensitivity at almost every FDR value.
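For simulated data, where the truly differentially expressed transcripts are known, the two axes of this figure can be computed as in the following sketch. The function name and the transcript identifiers are hypothetical, and returning 0 for an empty call set is a convention chosen for this example.

```python
def fdr_and_sensitivity(called, truly_de):
    """Return (observed FDR, sensitivity) for a set of DE calls."""
    called, truly_de = set(called), set(truly_de)
    true_pos = len(called & truly_de)
    false_pos = len(called - truly_de)
    # FDR: fraction of calls that are wrong.
    fdr = false_pos / len(called) if called else 0.0
    # Sensitivity: fraction of truly DE transcripts that were called.
    sensitivity = true_pos / len(truly_de) if truly_de else 0.0
    return fdr, sensitivity


calls = {"t1", "t2", "t3", "t4"}
truth = {"t1", "t2", "t3", "t5", "t6"}
fdr, sens = fdr_and_sensitivity(calls, truth)
```

Repeating this calculation over a range of significance thresholds traces one curve per quantification method.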

  4. Abundance vs. fold change accuracy on Polyester simulated data
    Supplementary Fig. 3: Abundance vs. fold change accuracy on Polyester simulated data

    The log2 fold change between the estimated and true abundances as a function of the true abundance (measured in TPM), for all methods and for all replicates of both simulated “conditions” (each row displays points from all samples within a given condition). The top row corresponds to the 8 samples simulated from the data showing the weak fragment GC content bias, while the bottom row corresponds to the 8 samples simulated from the data showing the stronger fragment GC content bias. Points with an estimated log2 fold change of > 0.5 or < -0.5 are colored red. The fraction of red points appears in the upper right-hand corner of each plot. Salmon consistently demonstrates log fold changes closer to 0 than either kallisto or eXpress, across most of the range of expression.
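The quantities plotted here can be sketched as follows. The caption does not say how zero abundances are handled, so the pseudocount of 1 is an assumption made to keep the logarithm defined. The function names and toy values are hypothetical.

```python
import math
from statistics import median


def log2_fold_changes(true_tpm, est_tpm, pseudocount=1.0):
    """log2 ratio of estimated to true abundance, per transcript."""
    return [
        math.log2((e + pseudocount) / (t + pseudocount))
        for t, e in zip(true_tpm, est_tpm)
    ]


def fraction_outside(lfcs, threshold=0.5):
    """Fraction of transcripts whose |lfc| exceeds the threshold."""
    return sum(1 for x in lfcs if abs(x) > threshold) / len(lfcs)


true_tpm = [1.0, 3.0, 7.0, 15.0]
est_tpm = [1.0, 7.0, 7.0, 7.0]
lfcs = log2_fold_changes(true_tpm, est_tpm)
red_fraction = fraction_outside(lfcs)
median_abs_lfc = median(abs(x) for x in lfcs)
```

Here `red_fraction` corresponds to the fraction of red points reported in each panel, and `median_abs_lfc` is the kind of per-sample summary shown in Figure 1a.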

  5. Consistency of estimates on SEQC data within and between centers
    Supplementary Fig. 4: Consistency of estimates on SEQC data within and between centers

    The distribution of the mean absolute error of (inverse hyperbolic sine-transformed) TPMs between different replicates of data from the SEQC [12] study. The A sample corresponds to universal human reference tissue (UHRR) and the B sample corresponds to human brain tissue (HBRR). When comparing the replicates that were sequenced at different centers, the inter-replicate distances are larger. However, we observe that Salmon’s bias correction methodology results in improved consistency (i.e., reduced distances) compared to the estimates produced by other methods, especially when comparing replicates sequenced at different centers, where we expect the effects of bias to be more pronounced.
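The inter-replicate distance used in this figure can be sketched in a few lines. The function name is hypothetical, and the inputs are assumed to be TPM vectors over the same transcripts in the same order.

```python
import math


def asinh_mae(tpm_a, tpm_b):
    """Mean absolute error between asinh-transformed TPM vectors."""
    if len(tpm_a) != len(tpm_b):
        raise ValueError("replicates must cover the same transcripts")
    diffs = (abs(math.asinh(a) - math.asinh(b)) for a, b in zip(tpm_a, tpm_b))
    return sum(diffs) / len(tpm_a)


replicate_1 = [0.0, 10.0, 100.0]
replicate_2 = [1.0, 10.0, 100.0]
distance = asinh_mae(replicate_1, replicate_2)
```

The inverse hyperbolic sine behaves like a logarithm for large values but is defined at 0, so transcripts with zero TPM need no pseudocount.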

  6. Salmon reduces false isoform switching
    Supplementary Fig. 5: Salmon reduces false isoform switching

    Transcripts demonstrating dominant isoform switching that results from technical bias. In the quantification estimates computed using kallisto and eXpress, these two-isoform genes show a change in the dominant isoform between conditions (an asterisk denotes a t-test on log2(TPM+1) with p < 1 × 10^−6). However, Salmon directly corrects for technical biases that appear to underlie differences across sequencing centers, revealing that the dominant isoform has not, in fact, switched across centers.

  7. Quantification accuracy for Salmon, Salmon (align), kallisto and eXpress using RSEM-sim data.
    Supplementary Fig. 6: Quantification accuracy for Salmon, Salmon (align), kallisto and eXpress using RSEM-sim data.

    The distribution of Spearman correlations over all 20 replicates of the RSEM-sim data for Salmon, kallisto and eXpress. Salmon and kallisto yield very similar distributions of correlations (no statistically significant difference), while both methods yield correlations greater than that of eXpress (Mann-Whitney U test, p = 3.39780 × 10^−8).
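As a reminder of what is being compared, a Spearman correlation is a Pearson correlation computed on ranks. The dependency-free sketch below assigns tied values their average rank. The function names and toy values are hypothetical, and the code raises `ZeroDivisionError` if either input is constant.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation."""
    return pearson(average_ranks(x), average_ranks(y))


true_tpm = [0.0, 1.0, 2.0, 3.0]
est_tpm = [0.0, 10.0, 5.0, 50.0]
rho = spearman(true_tpm, est_tpm)
```

Computing one such value per simulated replicate and per method gives the distributions compared in this figure.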

  8. Effect of number of GC models
    Supplementary Fig. 7: Effect of number of GC models

    The effect of the number of conditional GC models used to account for correlation between fragment GC and sequence-specific bias. We choose the default to be 3 bins, the simplest model that demonstrates the majority of the benefit. Panels a, b and c show the result of varying the number of conditional GC models on an analysis of the GEUVADIS data for all genes, all transcripts, and genes with only two transcripts, respectively.

References

  1. Hoadley, K.A. et al. Cell 158, 929–944 (2014).
  2. Li, J.J., Huang, H., Bickel, P.J. & Brenner, S.E. Genome Res. 24, 1086–1101 (2014).
  3. Weinstein, J.N. et al. Nat. Genet. 45, 1113–1120 (2013).
  4. Roberts, A., Trapnell, C., Donaghey, J., Rinn, J.L. & Pachter, L. Genome Biol. 12, R22 (2011).
  5. Love, M.I., Hogenesch, J.B. & Irizarry, R.A. Nat. Biotechnol. 34, 1287–1291 (2016).
  6. Morán, I. et al. Cell Metab. 16, 435–448 (2012).
  7. Teng, M. et al. Genome Biol. 17, 74 (2016).
  8. Kodama, Y., Shumway, M. & Leinonen, R. Nucleic Acids Res. 40, D54–D56 (2012).
  9. Patro, R., Mount, S.M. & Kingsford, C. Nat. Biotechnol. 32, 462–464 (2014).
  10. Bray, N.L., Pimentel, H., Melsted, P. & Pachter, L. Nat. Biotechnol. 34, 525–527 (2016).
  11. Lappalainen, T. et al. Nature 501, 506–511 (2013).
  12. SEQC/MAQ-III Consortium. Nat. Biotechnol. 32, 903–914 (2014).
  13. Frazee, A.C., Jaffe, A.E., Langmead, B. & Leek, J.T. Bioinformatics 31, 2778–2784 (2015).
  14. Li, B., Ruotti, V., Stewart, R.M., Thomson, J.A. & Dewey, C.N. Bioinformatics 26, 493–500 (2010).
  15. Roberts, A. & Pachter, L. Nat. Methods 10, 71–73 (2013).
  16. Langmead, B. & Salzberg, S.L. Nat. Methods 9, 357–359 (2012).
  17. Srivastava, A., Sarkar, H., Gupta, N. & Patro, R. Bioinformatics 32, i192–i200 (2016).
  18. ’t Hoen, P.A. et al. Nat. Biotechnol. 31, 1015–1022 (2013).
  19. Foulds, J., Boyles, L., DuBois, C., Smyth, P. & Welling, M. in Proc. 19th ACM SIGKDD Int. Conf. Knowledge Discov. & Data Mining 446–454 (ACM, 2013).
  20. Bishop, C.M. et al. Pattern Recognition and Machine Learning (Springer, 2006).
  21. Hensman, J., Papastamoulis, P., Glaus, P., Honkela, A. & Rattray, M. Bioinformatics 31, 3881–3889 (2015).
  22. Nariai, N. et al. BMC Genomics 15 (Suppl. 10), S5 (2014).
  23. Cappé, O. in Mixtures: Estimation and Applications (eds. Mengersen, K.L., Robert, C.P. & Titterington, D.M.) Ch. 2 (John Wiley & Sons, 2011).
  24. Hsieh, C.-J., Yu, H.-F. & Dhillon, I.S. ICML 15, 2370–2379 (2015).
  25. Salzman, J., Jiang, H. & Wong, W.H. Stat. Sci. 26, 1 (2011).
  26. Nicolae, M., Mangul, S., Măndoiu, I.I. & Zelikovsky, A. Algorithms Mol. Biol. 6, 9 (2011).
  27. Turro, E. et al. Genome Biol. 12, R13 (2011).
  28. Li, X., David, G., Andersen, M.K. & Freedman, M.J. in Proc. Ninth Eur. Conf. Computer Syst. 27 (ACM, 2014).
  29. Jackman, S. & Birol, I. F1000Research 5, 1795 (2016).
  30. Merkel, D. Linux J. 2014 (2014).
  31. Di Tommaso, P., Chatzou, M., Baraja, P.P. & Notredame, C. figshare https://dx.doi.org/10.6084/m9.figshare.1254958.v2 (2014).
  32. Brett, K.B.-J. & Greene, C.S. Preprint at https://doi.org/10.1101/056473 (2016).


Author information

Affiliations

  1. Department of Computer Science, Stony Brook University, Stony Brook, New York, USA.

    • Rob Patro
  2. DNAnexus, Mountain View, California, USA.

    • Geet Duggal
  3. Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Cambridge, Massachusetts, USA.

    • Michael I Love &
    • Rafael A Irizarry
  4. Department of Biostatistics, Harvard T.H. Chan School of Public Health, Cambridge, Massachusetts, USA.

    • Michael I Love &
    • Rafael A Irizarry
  5. Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.

    • Carl Kingsford

Contributions

R.P. and C.K. designed the method, which was implemented by R.P. R.P., G.D., M.I.L., R.A.I., and C.K. designed the experiments, and R.P., G.D., and M.I.L. conducted the experiments. R.P., G.D., M.I.L., R.A.I., and C.K. wrote the manuscript.

Competing financial interests

The authors declare no competing financial interests.


Supplementary information

Supplementary Figures

  1. Supplementary Figure 1: Overview of Salmon’s method and components and execution timeline. (361 KB)

  2. Supplementary Figure 2: The false discovery rate (FDR) vs. sensitivity of detecting differentially expressed transcripts on Polyester simulated data (182 KB)

  3. Supplementary Figure 3: Abundance vs. fold change accuracy on Polyester simulated data (246 KB)

  4. Supplementary Figure 4: Consistency of estimates on SEQC data within and between centers (113 KB)

  5. Supplementary Figure 5: Salmon reduces false isoform switching (671 KB)

  6. Supplementary Figure 6: Quantification accuracy for Salmon, Salmon (align), kallisto and eXpress using RSEM-sim data. (85 KB)

  7. Supplementary Figure 7: Effect of number of GC models (111 KB)

PDF files

  1. Supplementary Text and Figures (1,997 KB)

    Supplementary Figures 1–7, Supplementary Tables 1–4, Supplementary Notes 1 and 2, and Supplementary Algorithm 1
