Computational methods for transcriptome annotation and quantification using RNA-seq

This article has been updated

Abstract

High-throughput RNA sequencing (RNA-seq) promises a comprehensive picture of the transcriptome, allowing for the complete annotation and quantification of all genes and their isoforms across samples. Realizing this promise requires increasingly complex computational methods. These computational challenges fall into three main categories: (i) read mapping, (ii) transcriptome reconstruction and (iii) expression quantification. Here we explain the major conceptual and practical challenges, and the general classes of solutions for each category. Finally, we highlight the interdependence between these categories and discuss the benefits for different biological applications.

Figure 1: Strategies for gapped alignments of RNA-seq reads to the genome.
Figure 2: Transcriptome reconstruction methods.
Figure 3: An overview of gene expression quantification with RNA-seq.
Figure 4: Overview of RNA-seq differential expression analysis.

Change history

  • 15 June 2011

    In the HTML version of this article as initially published, the corresponding author was listed as Manfred G. Grabherr instead of Manuel Garber. The error has been corrected in the HTML version of the article.

Acknowledgements

We thank L. Gaffney for help with figures; B. Haas for making available scripts to run transAbyss and for many discussions; Y. Katz, C. Nusbaum, A. Pauli and M. Zody for helpful discussions and comments on the manuscript; and J. Alfoldi, C. Burge, M. Cabili, K. Lindblad-Toh, J. Rinn, L. Pachter, S. Salzberg and O. Zuk for helpful comments on the manuscript.

Author information

Corresponding author

Correspondence to Manuel Garber.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–5, Supplementary Table 1 (PDF 391 kb)

About this article

Cite this article

Garber, M., Grabherr, M., Guttman, M. et al. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods 8, 469–477 (2011). https://doi.org/10.1038/nmeth.1613

  • DOI: https://doi.org/10.1038/nmeth.1613
