Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks

Journal name: Nature Protocols
Volume: 7
Pages: 562–578
DOI: 10.1038/nprot.2012.016
Corrected online: 07 August 2014

Abstract

Recent advances in high-throughput cDNA sequencing (RNA-seq) can reveal new genes and splice variants and quantify expression genome-wide in a single assay. The volume and complexity of data from RNA-seq experiments necessitate scalable, fast and mathematically principled analysis software. TopHat and Cufflinks are free, open-source software tools for gene discovery and comprehensive expression analysis of high-throughput mRNA sequencing (RNA-seq) data. Together, they allow biologists to identify new genes and new splice variants of known ones, as well as compare gene and transcript expression under two or more conditions. This protocol describes in detail how to use TopHat and Cufflinks to perform such analyses. It also covers several accessory tools and utilities that aid in managing data, including CummeRbund, a tool for visualizing RNA-seq analysis results. Although the procedure assumes basic informatics skills, these tools assume little to no background with RNA-seq analysis and are meant for novices and experts alike. The protocol begins with raw sequencing reads and produces a transcriptome assembly, lists of differentially expressed and regulated genes and transcripts, and publication-quality visualizations of analysis results. The protocol's execution time depends on the volume of transcriptome sequencing data and available computing resources but takes less than 1 d of computer time for typical experiments and ~1 h of hands-on time.

At a glance

Figures

  1. Software components used in this protocol.

    Bowtie (ref. 33) forms the algorithmic core of TopHat, which aligns millions of RNA-seq reads to the genome per CPU hour. TopHat's read alignments are assembled by Cufflinks and its associated utility program to produce a transcriptome annotation of the genome. Cuffdiff quantifies this transcriptome across multiple conditions using the TopHat read alignments. CummeRbund helps users rapidly explore and visualize the gene expression data produced by Cuffdiff, including differentially expressed genes and transcripts.

  2. An overview of the Tuxedo protocol.

    In an experiment involving two conditions, reads are first mapped to the genome with TopHat. The reads for each biological replicate are mapped independently. These mapped reads are provided as input to Cufflinks, which produces one file of assembled transfrags for each replicate. The assembly files are merged with the reference transcriptome annotation into a unified annotation for further analysis. This merged annotation is quantified in each condition by Cuffdiff, which produces expression data in a set of tabular files. These files are indexed and visualized with CummeRbund to facilitate exploration of genes identified by Cuffdiff as differentially expressed, spliced or transcriptionally regulated. FPKM, fragments per kilobase of transcript per million fragments mapped.

  3. Merging sample assemblies with a reference transcriptome annotation.

    Genes with low expression may receive insufficient sequencing depth to permit full reconstruction in each replicate. However, merging the replicate assemblies with Cuffmerge often recovers the complete gene. Newly discovered isoforms are also integrated with known ones at this stage into more complete gene models.

  4. Analyzing groups of transcripts identifies differentially regulated genes.

    (a) Genes may produce multiple splice variants (labeled A–C) at different abundances through alternative transcription start sites (TSS), alternative cleavage and polyadenylation of 3′ ends, or alternative splicing of primary transcripts. (b) Grouping isoforms by TSS and looking for changes in relative abundance between and within these groups yield mechanistic clues into how genes are differentially regulated. (c) For example, in the above hypothetical gene, changes across conditions in the relative abundance of isoforms A and B within the TSS I group may be attributable to differential splicing of the primary transcript from which they are both produced. (d) Adding their expression levels yields a proxy expression value for this primary transcript. (e) Changes in this level relative to the gene's other primary transcript (i.e., isoform C) indicate possible differential promoter preference across conditions. (f,g) Similarly, genes with multiple annotated coding sequences (CDS) (f) can be analyzed for differential output of protein-coding sequences (g).

  5. CummeRbund helps users rapidly explore their expression data and create publication-ready plots of differentially expressed and regulated genes.

    With just a few lines of plotting code, CummeRbund can visualize differential expression at the isoform level, as well as broad patterns among large sets of genes. (a) A myoblast differentiation time-course experiment reveals the emergence of a skeletal muscle-specific isoform of tropomyosin I. (b) These same time-course data capture the dynamics of hundreds of other genes in the mouse transcriptome during muscle development (ref. 8). FPKM, fragments per kilobase of transcript per million fragments mapped.

  6. CummeRbund plots of the expression level distribution for all genes in simulated experimental conditions C1 and C2.

    FPKM, fragments per kilobase of transcript per million fragments mapped.

  7. CummeRbund scatter plots highlight general similarities and specific outliers between conditions C1 and C2.

    Scatter plots can be created from expression data for genes, splice isoforms, TSS groups or CDS groups.

  8. CummeRbund volcano plots reveal genes, transcripts, TSS groups or CDS groups that differ significantly between the pairs of conditions C1 and C2.
  9. Differential analysis results for regucalcin.

    (a) Expression plot shows clear differences in the expression of regucalcin across conditions C1 and C2, measured in FPKM (Box 2). Expression of a transcript is proportional to the number of reads sequenced from that transcript after normalizing for that transcript's length. Each gene and transcript expression value is annotated with error bars that capture both cross-replicate variability and measurement uncertainty as estimated by Cuffdiff's statistical model of RNA-seq. (b) Changes in regucalcin expression are attributable to a large increase in the expression of one of four alternative isoforms. (c) The read coverage, viewed through the genome browsing application IGV (ref. 42), shows an increase in sequencing reads originating from the gene in condition C2.

  10. Differential analysis results for Rala.

    (a) This gene has four isoforms in the merged assembly. (b) Cuffdiff identifies TCONS_00024713 and TCONS_00024715 as being significantly differentially expressed. The relatively modest overall change in gene-level expression, combined with high isoform-level measurement variability, leaves Cuffdiff unable to reject the null hypothesis that the observed gene-level change is attributable to measurement or cross-replicate variability.
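
The FPKM unit used in the figure legends above, and the TSS-group sum described for Figure 4d, can be written out explicitly. The following is a simplified sketch rather than Cuffdiff's actual computation: the symbols C_t (fragments assigned to transcript t), L_t (length of transcript t in bases), N (total mapped fragments) and g (a group of isoforms sharing a TSS) are notation introduced here for illustration, and Cufflinks estimates fragment assignments with a statistical model instead of using raw counts.

```latex
\mathrm{FPKM}_t = \frac{10^{9}\, C_t}{N \, L_t},
\qquad
\mathrm{FPKM}_g = \sum_{t \in g} \mathrm{FPKM}_t
```

For instance, under these assumptions a 2,000-base transcript with 500 assigned fragments in a library of 20 million mapped fragments would have an FPKM of 10^9 × 500 / (2 × 10^7 × 2,000) = 12.5.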

Accession codes

Referenced accessions

Gene Expression Omnibus

Change history

Corrected online 07 August 2014

In the version of this article initially published, the computer script in Box 1 sections B and C, and in Procedure Step 1, contained errors: the last section of the final three lines of the script had ‘C1’ where it should have been ‘C2’, as follows:

C1_R1_2.fq should have been C2_R1_2.fq

C1_R2_2.fq should have been C2_R2_2.fq

C1_R3_2.fq should have been C2_R3_2.fq

Users are also directed to an official release version of Cufflinks (version 1.3.0) that produces nearly identical results to those shown in the manuscript, which were produced by Cufflinks 1.2.1 (an unofficial and undocumented development build that was the latest build available when the manuscript was originally written). The script in Procedure Step 16 and the data in Table 5 have been updated to reflect the output of version 1.3.0. The errors have been corrected in the HTML and PDF versions of the article.
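
The mate-file mix-up described above is easy to make when file names are typed by hand. One way to guard against it, shown here as a minimal sketch and not as part of the published protocol, is to derive every file name from a single sample label. The options and file names mirror the corrected Step 1 commands; the function name `print_tophat_cmds` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build each TopHat command from one sample label so that both
# mate files (_1.fq and _2.fq) always come from the same sample.
# print_tophat_cmds is a hypothetical helper, not part of TopHat.
print_tophat_cmds() {
    local cond rep sample
    for cond in C1 C2; do
        for rep in R1 R2 R3; do
            sample="${cond}_${rep}"
            echo "tophat -p 8 -G genes.gtf -o ${sample}_thout genome ${sample}_1.fq ${sample}_2.fq"
        done
    done
}

# Print the six commands for review before running anything.
print_tophat_cmds
```

Once the printed commands look right, they could be executed with `print_tophat_cmds | bash`, assuming TopHat, the genome index and genes.gtf are available in the working directory.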

References

  1. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat. Methods 5, 621–628 (2008).
  2. Cloonan, N. et al. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat. Methods 5, 613–619 (2008).
  3. Nagalakshmi, U. et al. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320, 1344–1349 (2008).
  4. Mardis, E.R. The impact of next-generation sequencing technology on genetics. Trends Genet. 24, 133–141 (2008).
  5. Adams, M.D. et al. Sequence identification of 2,375 human brain genes. Nature 355, 632–634 (1992).
  6. Cabili, M.N. et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 25, 1915–1927 (2011).
  7. Jiang, H. & Wong, W.H. Statistical inferences for isoform expression in RNA-seq. Bioinformatics 25, 1026–1032 (2009).
  8. Trapnell, C. et al. Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).
  9. Mortimer, S.A. & Weeks, K.M. A fast-acting reagent for accurate analysis of RNA secondary and tertiary structure by SHAPE chemistry. J. Am. Chem. Soc. 129, 4144–4145 (2007).
  10. Li, B., Ruotti, V., Stewart, R.M., Thomson, J.A. & Dewey, C.N. RNA-seq gene expression estimation with read mapping uncertainty. Bioinformatics 26, 493–500 (2010).
  11. Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M. & Gilad, Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18, 1509–1517 (2008).
  12. Garber, M., Grabherr, M.G., Guttman, M. & Trapnell, C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat. Methods 8, 469–477 (2011).
  13. Trapnell, C., Pachter, L. & Salzberg, S.L. TopHat: discovering splice junctions with RNA-seq. Bioinformatics 25, 1105–1111 (2009).
  14. Lister, R. et al. Hotspots of aberrant epigenomic reprogramming in human induced pluripotent stem cells. Nature 470, 68–73 (2011).
  15. Graveley, B.R. et al. The developmental transcriptome of Drosophila melanogaster. Nature 471, 473–479 (2011).
  16. Twine, N.A., Janitz, K., Wilkins, M.R. & Janitz, M. Whole transcriptome sequencing reveals gene expression and splicing differences in brain regions affected by Alzheimer's disease. PLoS ONE 6, e16266 (2011).
  17. Mizuno, H. et al. Massive parallel sequencing of mRNA in identification of unannotated salinity stress-inducible transcripts in rice (Oryza sativa L.). BMC Genomics 11, 683 (2010).
  18. Goecks, J., Nekrutenko, A., Taylor, J. & Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11, R86 (2010).
  19. Wu, T.D. & Nacu, S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26, 873–881 (2010).
  20. Wang, K. et al. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res. 38, e178 (2010).
  21. Au, K.F., Jiang, H., Lin, L., Xing, Y. & Wong, W.H. Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic Acids Res. 38, 4570–4578 (2010).
  22. Guttman, M. et al. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. 28, 503–510 (2010).
  23. Griffith, M. et al. Alternative expression analysis by RNA sequencing. Nat. Methods 7, 843–847 (2010).
  24. Katz, Y., Wang, E.T., Airoldi, E.M. & Burge, C.B. Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat. Methods 7, 1009–1015 (2010).
  25. Nicolae, M., Mangul, S., Măndoiu, I.I. & Zelikovsky, A. Estimation of alternative splicing isoform frequencies from RNA-seq data. Algorithms Mol. Biol. 6, 9 (2011).
  26. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).
  27. Robinson, M.D., McCarthy, D.J. & Smyth, G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2009).
  28. Wang, L., Feng, Z., Wang, X., Wang, X. & Zhang, X. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics 26, 136–138 (2010).
  29. Grabherr, M.G. et al. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011).
  30. Robertson, G. et al. De novo assembly and analysis of RNA-seq data. Nat. Methods 7, 909–912 (2010).
  31. Johnson, D.S., Mortazavi, A., Myers, R.M. & Wold, B. Genome-wide mapping of in vivo protein-DNA interactions. Science 316, 1497–1502 (2007).
  32. Ingolia, N.T., Ghaemmaghami, S., Newman, J.R.S. & Weissman, J.S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324, 218–223 (2009).
  33. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
  34. Ferragina, P. & Manzini, G. An experimental study of a compressed index. Information Sci. 135, 13–28 (2001).
  35. Roberts, A., Pimentel, H., Trapnell, C. & Pachter, L. Identification of novel transcripts in annotated genomes using RNA-seq. Bioinformatics 27, 2325–2329 (2011).
  36. Li, J., Jiang, H. & Wong, W.H. Modeling non-uniformity in short-read rates in RNA-seq data. Genome Biol. 11, R50 (2010).
  37. Hansen, K.D., Brenner, S.E. & Dudoit, S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 38, e131 (2010).
  38. Roberts, A., Trapnell, C., Donaghey, J., Rinn, J.L. & Pachter, L. Improving RNA-seq expression estimates by correcting for fragment bias. Genome Biol. 12, R22 (2011).
  39. Levin, J.Z. et al. Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nat. Methods 7, 709–715 (2010).
  40. Hansen, K.D., Wu, Z., Irizarry, R.A. & Leek, J.T. Sequencing technology does not eliminate biological variability. Nat. Biotechnol. 29, 572–573 (2011).
  41. Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Use R) p 224 (Springer, 2009).
  42. Robinson, J.T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).
  43. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
  44. Schatz, M.C., Langmead, B. & Salzberg, S.L. Cloud computing and the DNA data race. Nat. Biotechnol. 28, 691–693 (2010).


Author information

Affiliations

  1. Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA.

    • Cole Trapnell,
    • Loyal Goff,
    • David R Kelley &
    • John L Rinn
  2. Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, Massachusetts, USA.

    • Cole Trapnell,
    • Loyal Goff,
    • David R Kelley &
    • John L Rinn
  3. Department of Computer Science, University of California, Berkeley, California, USA.

    • Adam Roberts,
    • Harold Pimentel &
    • Lior Pachter
  4. Computer Science and Artificial Intelligence Lab, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.

    • Loyal Goff
  5. Department of Medicine, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA.

    • Geo Pertea,
    • Daehwan Kim &
    • Steven L Salzberg
  6. Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland, USA.

    • Geo Pertea &
    • Steven L Salzberg
  7. Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, USA.

    • Daehwan Kim
  8. Department of Mathematics, University of California, Berkeley, California, USA.

    • Lior Pachter
  9. Department of Molecular and Cell Biology, University of California, Berkeley, California, USA.

    • Lior Pachter

Contributions

C.T. is the lead developer for the TopHat and Cufflinks projects. L.G. designed and wrote CummeRbund. D.K., H.P. and G.P. are developers of TopHat. A.R. and G.P. are developers of Cufflinks and its accompanying utilities. C.T. developed the protocol, generated the example experiment and performed the analysis. L.P., S.L.S. and C.T. conceived the TopHat and Cufflinks software projects. C.T., D.R.K. and J.L.R. wrote the manuscript.

Competing financial interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to:


Comments

  1. Comment #63265

    Tanyeli Taze said:
    From Nature Protocols editorial staff:

    It has been brought to our attention that there is an error in the code of step 1 of the Procedure of this Protocol.

    The code in step 1 of the Procedure currently reads as follows:
    $ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
    $ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
    $ tophat -p 8 -G genes.gtf -o C1_R3_thout genome C1_R3_1.fq C1_R3_2.fq
    $ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C1_R1_2.fq
    $ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C1_R2_2.fq
    $ tophat -p 8 -G genes.gtf -o C2_R3_thout genome C2_R3_1.fq C1_R3_2.fq

    However, the correct code should read as follows (note the last 3 lines where C1 has been changed to C2 so that both files are from the same sample):
    $ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
    $ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
    $ tophat -p 8 -G genes.gtf -o C1_R3_thout genome C1_R3_1.fq C1_R3_2.fq
    $ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
    $ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
    $ tophat -p 8 -G genes.gtf -o C2_R3_thout genome C2_R3_1.fq C2_R3_2.fq

    A formal correction will be published in due course.
