Differential analysis of gene regulation at transcript resolution with RNA-seq

Journal name: Nature Biotechnology
Volume: 31
Pages: 46–53
DOI: 10.1038/nbt.2450

Abstract

Differential analysis of gene and transcript expression using high-throughput RNA sequencing (RNA-seq) is complicated by several sources of measurement variability and poses numerous statistical challenges. We present Cuffdiff 2, an algorithm that estimates expression at transcript-level resolution and controls for variability evident across replicate libraries. Cuffdiff 2 robustly identifies differentially expressed transcripts and genes and reveals differential splicing and promoter-preference changes. We demonstrate the accuracy of our approach through differential analysis of lung fibroblasts in response to loss of the developmental transcription factor HOXA1, which we show is required for lung fibroblast and HeLa cell cycle progression. Loss of HOXA1 results in significant expression level changes in thousands of individual transcripts, along with isoform switching events in key regulators of the cell cycle. Cuffdiff 2 performs robust differential analysis in RNA-seq experiments at transcript resolution, revealing a layer of regulation not readily observable with other high-throughput technologies.

Figures

  1. A change in fragment count for a gene does not necessarily equal a change in expression.

    (a) Simple read-counting schemes sum the fragments incident on a gene's exons. The exon-union model counts reads falling on any of a gene's exons, whereas the exon-intersection model counts only reads on constitutive exons. (b) Both the exon-union and exon-intersection counting schemes may incorrectly estimate a change in expression in genes with multiple isoforms. The true expression is estimated by the sum of the length-normalized isoform read counts. The discrepancy between a change in the union or intersection count and a change in gene expression is driven by a change in the abundance of the isoforms with respect to one another. In the top row, the gene generates the same number of reads in conditions A and B, but in condition B, all of the reads come from the shorter of the two isoforms, and thus the true expression for the gene is higher in condition B. The intersection count scheme underestimates the true change in gene expression, and the union scheme fails to detect the change entirely. In the middle row, the intersection count fails to detect a change driven by a shift in the dominant isoform for the gene. The union scheme detects a shift in the wrong direction. In the bottom row, the gene's expression is constant, but the isoforms undergo a complete switch between conditions A and B. Both simplified counting schemes register a change in count that does not reflect a change in gene expression.

  2. An overview of the Cuffdiff 2 approach to isoform-level differential analysis of RNA-seq data.

    (1) The variability in fragment count for each gene across replicates is modeled. (2) The fragment count for each isoform is estimated in each replicate, along with (3) a measure of uncertainty in this estimate arising from ambiguously mapped reads, which are extremely prevalent in alternatively spliced transcriptomes. (4) The algorithm combines estimates of uncertainty and cross-replicate variability under a beta negative binomial model of fragment count variability to estimate count variances for each transcript in each library. (5) These variance estimates are used during statistical testing to report significantly differentially expressed genes and transcripts.

  3. Comparison of Cuffdiff 2 with other expression platforms.

    (a) Fold changes in multi-isoform gene expression measured by RNA-seq and microarrays before and after HOXA1 knockdown are highly concordant (Spearman correlation = 0.86). (b) Computing gene expression by isoform deconvolution instead of by fold change in gene-level fragment counts improves agreement with microarrays. Genes shown are those where Cuffdiff 2 and intersection-count fold changes are most discrepant (1% tails). (c) Methods for performing differential gene expression analysis with RNA-seq based on the exon-union counting method returned nearly all of the genes returned by Cuffdiff 2. (d) Lung fibroblasts show lower splicing complexity than human ESCs. Complexity is measured by the Shannon entropy of the relative isoform abundances of multi-isoform genes (Methods), where zero complexity indicates that a gene has only a single detectable isoform and a gene with equally abundant isoforms has maximal complexity. hLF, human lung fibroblast. (e) The contribution of the major isoform of each gene to total gene expression in lung fibroblasts and ESCs.

  4. Accuracy of Cuffdiff 2 over varied experimental designs.

    (a) Accuracy of the proposed model explored through simulated RNA-seq. The read-length series was generated with single-ended sequencing data. FPKM, fragments per kilobase per million fragments mapped. (b) Squared coefficient of variation versus expression for genes and transcripts as measured by the HiSeq 2000 and MiSeq. Each series is a fit of a generalized additive model to the individual (expression, squared coefficient of variation) pairs. (c) Significant gene lists returned by Cuffdiff 2 using the HiSeq 2000 and MiSeq compared against expression microarrays (FDR ≤ 1%). (d) Significant isoform lists returned by Cuffdiff 2 when using the HiSeq 2000 and MiSeq (FDR ≤ 1%).

  5. Changes in expression of cell cycle regulatory genes in response to HOXA1 knockdown.

    (a) GSEA of the knockdown for selected REACTOME gene sets. (b) Cuffdiff 2 reports an increase in CDK2 expression, which is attributable to a single isoform that includes the full activation loop, a feature required for maximal CDK2 activity. KD, knockdown. (c) Cuffdiff 2 reports a decrease in ORC6 expression, which is attributable to a single isoform that includes the full suite of residues required for optimal DNA binding not present in the minor isoforms arising from the gene. (d) Cuffdiff 2 reports an increase in TBX3 attributable to a single isoform lacking an exon situated within the T-box DNA binding domain that is present in a highly similar minor isoform. (e) Cuffdiff 2 reports a decrease in CDC14B attributable to decreases in the two major isoforms. Error bars indicate 95% confidence intervals in expression. (f) Changes in isoform expression reported by Cuffdiff 2 compared against measurements made with isoform-specific qPCR. The black line indicates perfect correspondence between the two platforms. The orange line is a linear regression through all points, and the red line excludes the three major outliers, which target low abundance isoforms, two of which cannot be distinguished from primary transcript or genomic DNA. DE, differentially expressed.

  6. Cell cycle analysis after HOXA1 knockdown.

    (a) Human lung fibroblasts transfected with scrambled siRNAs and a HOXA1 siRNA pool 48 h after transfection. HeLa cells transfected with scrambled siRNAs (left) and a HOXA1 siRNA pool 72 h after transfection (right). Scale bars, 500 μm. (b) HOXA1 siRNAs disrupt normal cell cycle distribution. Top, histograms of Hoechst-33342 staining in human lung fibroblasts transfected with scrambled siRNAs (left) and a HOXA1 siRNA pool 48 h after transfection (right). Bottom, same as above using HeLa cells 72 h after transfection. Cell cycle phase gates were drawn as approximations of the Dean-Jett-Fox cell cycle modeling algorithm. (c) HOXA1 siRNAs shift cell cycle distribution toward the G1 phase and sub-G1 phase fraction. Percentage of cells in each phase of the cell cycle (average of three experiments) using gates from b with scrambled and HOXA1 siRNAs in human lung fibroblasts and in HeLa cells. (d) Representative dot plot of Annexin V (x-axis) versus propidium iodide (PI) (y-axis) analyses of scrambled (left) and HOXA1 siRNA (right) transfected lung fibroblasts (top) and HeLa cells (bottom). Cells populating the upper (late apoptosis) and lower (early apoptosis) right quadrants were designated as apoptotic cells. (e) Percentage of apoptotic cells (average of three experiments) populating upper and lower right quadrants from lung fibroblasts and HeLa cells transfected with scrambled or HOXA1 siRNAs as shown in d. Error bars indicate 95% confidence intervals in expression.
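
The discrepancy described in the Figure 1 caption between a raw exon-union count and length-normalized expression can be made concrete with a small worked example. The sketch below is illustrative only: the two-isoform gene, its isoform lengths, and the read counts are hypothetical values chosen to mirror the top-row and bottom-row scenarios, and the exon-intersection scheme is deliberately left out because its behavior depends on exon structure that the caption does not specify.

```python
def union_count(reads_per_isoform):
    """Exon-union count: every fragment on any exon of the gene counts once."""
    return sum(reads_per_isoform.values())


def length_normalized_expression(reads_per_isoform, length_kb):
    """Sum of per-isoform fragment counts, each divided by isoform length (kb)."""
    return sum(n / length_kb[iso] for iso, n in reads_per_isoform.items())


# Hypothetical gene with a long (3 kb) and a short (2 kb) isoform.
LENGTH_KB = {"long": 3.0, "short": 2.0}

# Top-row scenario: same number of reads in A and B, but in B all reads
# come from the shorter isoform.
top_a = {"long": 300, "short": 300}
top_b = {"long": 0, "short": 600}

# Bottom-row scenario: complete isoform switch at constant gene expression.
bottom_a = {"long": 600, "short": 0}
bottom_b = {"long": 0, "short": 400}

for label, cond_a, cond_b in [("top", top_a, top_b), ("bottom", bottom_a, bottom_b)]:
    print(
        label,
        "union:", union_count(cond_a), "->", union_count(cond_b),
        "| expression:",
        length_normalized_expression(cond_a, LENGTH_KB), "->",
        length_normalized_expression(cond_b, LENGTH_KB),
    )
```

In the first scenario the union count is unchanged (600 in both conditions) while the length-normalized expression rises from 250 to 300 reads per kilobase. In the second the union count falls from 600 to 400 although the length-normalized expression stays at 200.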
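
Step (1) of the Figure 2 caption, modeling cross-replicate variability in fragment counts, can be illustrated with a generic overdispersion estimate. The sketch below is not the beta negative binomial model that Cuffdiff 2 uses. It assumes a plain negative binomial parameterization, var = mu + alpha * mu^2, and a method-of-moments estimate of alpha from replicate counts. The function name and the example counts are hypothetical.

```python
from statistics import mean, variance


def nb_dispersion(replicate_counts):
    """Method-of-moments dispersion alpha, assuming var = mu + alpha * mu**2.

    This is a generic negative binomial illustration, not Cuffdiff 2's
    beta negative binomial model. Returns 0.0 when the replicates are no
    more variable than a Poisson model (var <= mu) would predict.
    """
    mu = mean(replicate_counts)
    if mu == 0:
        return 0.0
    var = variance(replicate_counts)  # sample variance (n - 1 denominator)
    return max(0.0, (var - mu) / mu**2)


# Hypothetical fragment counts for one gene across three replicates.
tight_replicates = [90, 100, 110]   # variance equals the mean: no overdispersion
noisy_replicates = [50, 100, 150]   # variance far exceeds the mean

print(nb_dispersion(tight_replicates))
print(nb_dispersion(noisy_replicates))
```

A larger alpha means that replicate-to-replicate variability exceeds what counting noise alone would produce, so a correspondingly larger change between conditions is needed before it is called significant.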
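
The splicing complexity measure in the Figure 3 caption is the Shannon entropy of a gene's relative isoform abundances. A minimal sketch follows, assuming base-2 logarithms (the base is not stated in the caption) and a hypothetical function name.

```python
import math


def splicing_complexity(isoform_abundances):
    """Shannon entropy (in bits) of the relative isoform abundances of a gene.

    Isoforms with zero abundance contribute nothing to the entropy.
    The base-2 logarithm is an assumption made for this sketch.
    """
    total = sum(isoform_abundances)
    if total <= 0:
        return 0.0
    fractions = [x / total for x in isoform_abundances if x > 0]
    return sum(-p * math.log2(p) for p in fractions)


print(splicing_complexity([12.0]))             # single detectable isoform
print(splicing_complexity([5.0, 5.0]))         # two equally abundant isoforms
print(splicing_complexity([9.0, 1.0]))         # one dominant isoform
print(splicing_complexity([2.0, 2.0, 2.0, 2.0]))
```

A gene with a single detectable isoform scores zero, and for a fixed number of isoforms the score is largest when they are equally abundant, which matches the two extremes named in the caption.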
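
The two quantities plotted in Figure 4, FPKM and the squared coefficient of variation, can be computed as follows. This is a simplified sketch: it uses the raw transcript length, without the effective-length or bias corrections that a full pipeline might apply, and the sample variance across replicates. Function names and input values are hypothetical.

```python
from statistics import mean, variance


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million fragments mapped."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def squared_cv(values):
    """Squared coefficient of variation: sample variance divided by mean squared."""
    mu = mean(values)
    if mu == 0:
        raise ValueError("squared CV is undefined when the mean is zero")
    return variance(values) / mu**2


# Hypothetical transcript: 500 fragments, 2,000 bp long, 25 million mapped fragments.
print(fpkm(500, 2000, 25_000_000))

# Hypothetical FPKM values for one transcript across three replicates.
print(squared_cv([8.0, 10.0, 12.0]))
```

Plotting the squared coefficient of variation against mean expression, as in panel b, shows how relative variability changes with abundance.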

Accession codes

Primary accessions

Gene Expression Omnibus


Author information

  1. These authors contributed equally to this work.

    • Cole Trapnell &
    • David G Hendrickson
  2. These authors contributed equally to this work.

    • John L Rinn &
    • Lior Pachter

Affiliations

  1. Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, Massachusetts, USA.

    • Cole Trapnell,
    • David G Hendrickson,
    • Martin Sauvageau,
    • Loyal Goff &
    • John L Rinn
  2. The Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, Massachusetts, USA.

    • Cole Trapnell,
    • David G Hendrickson,
    • Martin Sauvageau,
    • Loyal Goff &
    • John L Rinn
  3. Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.

    • Loyal Goff
  4. Department of Mathematics, University of California Berkeley, Berkeley, California, USA.

    • Lior Pachter
  5. Department of Molecular & Cell Biology, University of California Berkeley, Berkeley, California, USA.

    • Lior Pachter

Contributions

C.T. and L.P. developed the mathematics and statistics. D.G.H. and M.S. performed the experiments. D.G.H. and C.T. designed the experiments and performed the analysis. C.T. and L.G. implemented the software. L.P., J.L.R., D.G.H. and C.T. conceived the research. All authors wrote and approved the manuscript.

Competing financial interests

The authors declare no competing financial interests.

Corresponding authors

Correspondence to:

Supplementary information

PDF files

  1. Supplementary Text and Figures (22M)

    Supplementary Figures 1–87 and Supplementary Tables 1–3
