Count-based differential expression analysis of RNA sequencing data using R and Bioconductor

Abstract

RNA sequencing (RNA-seq) has been rapidly adopted for the profiling of transcriptomes in many areas of biology, including studies into gene regulation, development and disease. Of particular interest is the discovery of differentially expressed genes across different conditions (e.g., tissues, perturbations) while optionally adjusting for other systematic factors that affect the data-collection process. There are a number of subtle yet crucial aspects of these analyses, such as read counting, appropriate treatment of biological variability, quality control checks and appropriate setup of statistical modeling. Several variations have been presented in the literature, and there is a need for guidance on current best practices. This protocol presents a state-of-the-art computational and statistical RNA-seq differential expression analysis workflow largely based on the free open-source R language and Bioconductor software and, in particular, on two widely used tools, DESeq and edgeR. Hands-on time for typical small experiments (e.g., 4–10 samples) can be <1 h, with computation time <1 d using a standard desktop PC.

Access options

Rent or Buy article

Get time limited or full article access on ReadCube.

from$8.99

All prices are NET prices.

Figure 1: Count-based differential expression pipeline for RNA-seq data using edgeR and/or DESeq.
Figure 2
Figure 3
Figure 4: Plots of sample relations.
Figure 5: Plots of mean-variance relationship and dispersion.
Figure 6: M ('minus') versus A ('add') plots for RNA-seq data.
Figure 7

Accession codes

Accessions

Gene Expression Omnibus

References

  1. 1

    Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat. Methods 5, 621–628 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. 2

    Wang, Z., Gerstein, M. & Snyder, M. RNA-seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. 3

    Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. 4

    Robinson, M.D. & Smyth, G.K. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics 23, 2881–2887 (2007).

    Article  CAS  PubMed  Google Scholar 

  5. 5

    Robinson, M.D. & Smyth, G.K. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9, 321–332 (2008).

    Article  PubMed  Google Scholar 

  6. 6

    Robinson, M.D., McCarthy, D.J. & Smyth, G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. 7

    McCarthy, D.J., Chen, Y. & Smyth, G.K. Differential expression analysis of multifactor RNA-seq experiments with respect to biological variation. Nucleic Acids Res. 40, 4288–4297 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. 8

    Gentleman, R.C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004).

    Article  PubMed  PubMed Central  Google Scholar 

  9. 9

    Zemach, A. et al. The Arabidopsis nucleosome remodeler DDM1 allows DNA methyltransferases to access H1-containing heterochromatin. Cell 153, 193–205 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. 10

    Lam, M.T. et al. Rev-Erbs repress macrophage gene expression by inhibiting enhancer-directed transcription. Nature 498, 511–515 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. 11

    Ross-Innes, C.S. et al. Differential oestrogen receptor binding is associated with clinical outcome in breast cancer. Nature 481, 389–393 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. 12

    Robinson, M.D. et al. Copy-number-aware differential analysis of quantitative DNA sequencing data. Genome Res. 22, 2489–2496 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  13. 13

    Vanharanta, S. et al. Epigenetic expansion of VHL-HIF signal output drives multiorgan metastasis in renal cancer. Nat. Med. 19, 50–56 (2013).

    Article  CAS  PubMed  Google Scholar 

  14. 14

    Samstein, R.M. et al. Foxp3 exploits a pre-existent enhancer landscape for regulatory T cell lineage specification. Cell 151, 153–166 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  15. 15

    Johnson, E.K. et al. Proteomic analysis reveals new cardiac-specific dystrophin-associated proteins. PloS ONE 7, e43515 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. 16

    Fonseca, N.A., Rung, J., Brazma, A. & Marioni, J.C. Tools for mapping high-throughput sequencing data. Bioinformatics 28, 3169–3177 (2012).

    Article  CAS  PubMed  Google Scholar 

  17. 17

    Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7, 562–578 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  18. 18

    Bullard, J.H., Purdom, E., Hansen, K.D. & Dudoit, S. Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments. BMC Bioinform. 11, 94 (2010).

    Article  CAS  Google Scholar 

  19. 19

    Grabherr, M.G. et al. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. 20

    Siebert, S. et al. Differential gene expression in the siphonophore Nanomia bijuga (Cnidaria) assessed with multiple next-generation sequencing workflows. PLoS ONE 6, 12 (2011).

    Article  CAS  Google Scholar 

  21. 21

    Trapnell, C., Pachter, L. & Salzberg, S.L. TopHat: discovering splice junctions with RNA-seq. Bioinformatics 25, 1105–1111 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  22. 22

    Trapnell, C. et al. Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. 23

    Hardcastle, T.J. & Kelly, K.A. baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinform. 11, 422 (2010).

    Article  Google Scholar 

  24. 24

    Zhou, Y.-H., Xia, K. & Wright, F.A. A powerful and flexible approach to the analysis of RNA sequence count data. Bioinformatics 27, 2672–2678 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  25. 25

    Tarazona, S., Garcia-Alcalde, F., Dopazo, J., Ferrer, A. & Conesa, A. Differential expression in RNA-seq: a matter of depth. Genome Res. 21, 2213–2223 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  26. 26

    Lund, S.P., Nettleton, D., McCarthy, D.J. & Smyth, G.K. Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Stat. Appl. Genet. Mol. Biol. 11, pii (2012).

    Article  CAS  Google Scholar 

  27. 27

    Soneson, C. & Delorenzi, M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinform. 14, 91 (2013).

    Article  Google Scholar 

  28. 28

    Lareau, L.F., Inada, M., Green, R.E., Wengrod, J.C. & Brenner, S.E. Unproductive splicing of SR genes associated with highly conserved and ultraconserved DNA elements. Nature 446, 926–929 (2007).

    Article  CAS  PubMed  Google Scholar 

  29. 29

    Anders, S., Reyes, A. & Huber, W. Detecting differential usage of exons from RNA-seq data. Genome Res. 22, 2008–2017 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. 30

    Glaus, P., Honkela, A. & Rattray, M. Identifying differentially expressed transcripts from RNA-seq data with biological variation. Bioinformatics 28, 1721–1728 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. 31

    Van De Wiel, M.A. et al. Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors. Biostatistics 14, 113–128 (2013).

    Article  PubMed  Google Scholar 

  32. 32

    Blekhman, R., Marioni, J.C., Zumbo, P., Stephens, M. & Gilad, Y. Sex-specific and lineage-specific alternative splicing in primates. Genome Res. 20, 180–189 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  33. 33

    Okoniewski, M.J. et al. Preferred analysis methods for single genomic regions in RNA sequencing revealed by processing the shape of coverage. Nucleic Acids Res. 40, e63 (2012).

    Article  CAS  PubMed  Google Scholar 

  34. 34

    Hansen, K.D., Wu, Z., Irizarry, R.A. & Leek, J.T. Sequencing technology does not eliminate biological variability. Nat. Biotechnol. 29, 572–573 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  35. 35

    Leek, J.T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11, 733–739 (2010).

    Article  CAS  PubMed  Google Scholar 

  36. 36

    Auer, P.L. & Doerge, R.W. Statistical design and analysis of RNA sequencing data. Genetics 185, 405–416 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  37. 37

    Gagnon-Bartsch, J.A. & Speed, T.P. Using control genes to correct for unwanted variation in microarray data. Biostatistics 13, 539–552 (2011).

    Article  PubMed  Google Scholar 

  38. 38

    Leek, J.T. & Storey, J.D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3, 1724–1735 (2007).

    Article  CAS  PubMed  Google Scholar 

  39. 39

    Myers, R.M. Classical and Modern Regression with Applications 2nd edn. (Duxbury Classic Series, 2000).

  40. 40

    Gentleman, R. Reproducible research: a bioinformatics case study. Stat. Appl. Genet. Mol. Biol. 4, Article2 (2005).

    Article  PubMed  Google Scholar 

  41. 41

    Trapnell, C. & Salzberg, S.L. How to map billions of short reads onto genomes. Nat. Biotechnol. 27, 455–457 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  42. 42

    Wu, T.D. & Nacu, S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26, 873–881 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  43. 43

    Wang, K. et al. MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res. 38, e178 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  44. 44

    Liao, Y., Smyth, G.K. & Shi, W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Res. 41, e108 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  45. 45

    Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).

    Article  CAS  Google Scholar 

  46. 46

    Thorvaldsdóttir, H., Robinson, J.T. & Mesirov, J.P. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief. Bioinform. 14, 178–192 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  47. 47

    Fiume, M., Williams, V., Brook, A. & Brudno, M. Savant: genome browser for high-throughput sequencing data. Bioinformatics 26, 1938–1944 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  48. 48

    Fiume, M. et al. Savant genome browser 2: visualization and analysis for population-scale genomics. Nucleic Acids Res. 40, 1–7 (2012).

    Article  CAS  Google Scholar 

  49. 49

    Morgan, M. et al. ShortRead: a Bioconductor package for input, quality assessment and exploration of high-throughput sequence data. Bioinformatics 25, 2607–2608 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  50. 50

    Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  51. 51

    Brooks, A.N. et al. Conservation of an RNA regulatory map between Drosophila and mammals. Genome Res. 21, 193–202 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  52. 52

    Edgar, R., Domrachev, M. & Lash, A.E. Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30, 207–210 (2002).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  53. 53

    Cox, D.R. & Reid, N. Parameter orthogonality and approximate conditional inference. J. Roy. Stat. Soc. Ser. B Method. 49, 1–39 (1987).

    Google Scholar 

  54. 54

    Dudoit, S., Yang, Y.H., Callow, M.J. & Speed, T.P. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat. Sinica 12, 111–139 (2002).

    Google Scholar 

  55. 55

    Bourgon, R., Gentleman, R. & Huber, W. Independent filtering increases detection power for high-throughput experiments. Proc. Natl. Acad. Sci. USA 107, 9546–9551 (2010).

    Article  PubMed  Google Scholar 

  56. 56

    Robinson, M.D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  57. 57

    Cappiello, C., Francalanci, C. & Pernici, B. Data quality assessment from the user's perspective. Architecture 22, 68–73 (2004).

    Google Scholar 

  58. 58

    Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. Ser. B Methodol. 57, 289–300 (1995).

    Google Scholar 

  59. 59

    Wu, H., Wang, C. & Wu, Z. A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics 14, 232–243 (2012).

    Article  PubMed  PubMed Central  Google Scholar 

  60. 60

    Smyth, G.K. Limma: linear models for microarray data. In Bioinformatics and Computational Biology Solutions Using R and Bioconductor (eds. Gentleman, R. et al.) 397–420 (Springer, 2005).

  61. 61

    Nookaew, I. et al. A comprehensive comparison of RNA-seq–based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae. Nucleic Acids Res. 40, 10084–10097 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  62. 62

    Rapaport, F. et al. Comprehensive evaluation of differential expression analysis methods for RNA-seq data http://arXiv.org/abs/1301.5277v2 (23 January 2013).

  63. 63

    Hansen, K.D., Irizarry, R.A. & Wu, Z. Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics 13, 204–216 (2012).

    Article  PubMed  PubMed Central  Google Scholar 

  64. 64

    Risso, D., Schwartz, K., Sherlock, G. & Dudoit, S. GC-content normalization for RNA-seq data. BMC Bioinform. 12, 480 (2011).

    Article  CAS  Google Scholar 

  65. 65

    Delhomme, N., Padioleau, I., Furlong, E.E. & Steinmetz, L. easyRNASeq: a Bioconductor package for processing RNA-seq data. Bioinformatics 28, 2532–2533 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  66. 66

    Leisch, F. Sweave: dynamic generation of statistical reports using literate data analysis. In Compstat 2002 Proceedings in Computational Statistics Vol. 69 (eds. Härdle, W. & Rönz, B.) 575–580. Institut für Statistik und Wahrscheinlichkeitstheorie, Technische Universität Wien (Physica Verlag, 2002).

Download references

Acknowledgements

We thank X. Zhou for comparing counting methods, O. Nikolayeva for feedback on an earlier version of the manuscript and participants in the European Conference on Computational Biology Workshop (Basel, September 2012) for their feedback. G.K.S. acknowledges funding from a National Health and Medical Research Council Project Grant (no. 1023454). D.J.M. acknowledges funding from the General Sir John Monash Foundation, Australia. M.D.R. wishes to acknowledge funding from the University of Zurich's Research Priority Program in Systems Biology and Functional Genomics and Swiss National Science Foundation Project grant (no. 143883). S.A., W.H. and M.D.R. acknowledge funding from the European Commission through the 7th Framework Collaborative Project RADIANT (grant agreement no. 305626).

Author information

Affiliations

Authors

Contributions

S.A. and W.H. are authors of the DESeq package. D.J.M., Y.C., G.K.S. and M.D.R. are authors of the edgeR package. S.A., M.O., W.H. and M.D.R. initiated the protocol format on the basis of the ECCB 2012 Workshop. S.A. and M.D.R. wrote the first draft and additions were made from all authors.

Corresponding authors

Correspondence to Wolfgang Huber or Mark D Robinson.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Data

Archive of files used in the protocol. This file is a compressed archive with the following files: the intermediate COUNT files used, a count table used in the statistical analysis, the metadata table and the original “SraRunInfo” CSV file that was downloaded from the NCBI's SRA. (ZIP 533 kb)

Rights and permissions

Reprints and Permissions

About this article

Cite this article

Anders, S., McCarthy, D., Chen, Y. et al. Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nat Protoc 8, 1765–1786 (2013). https://doi.org/10.1038/nprot.2013.099

Download citation

Further reading

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.