  • Analysis

Comprehensive comparative analysis of 5′-end RNA-sequencing methods

An Author Correction to this article was published on 20 November 2018

This article has been updated

Abstract

Specialized RNA-seq methods are required to identify the 5′ ends of transcripts, which are critical for studies of gene regulation, but these methods have not been systematically benchmarked. We directly compared six such methods, including the performance of five methods on a single human cellular RNA sample and a new spike-in RNA assay that helps circumvent challenges resulting from uncertainties in annotation and RNA processing. We found that the ‘cap analysis of gene expression’ (CAGE) method performed best for mRNA and that most of its unannotated peaks were supported by evidence from other genomic methods. We applied CAGE to eight brain-related samples and determined sample-specific transcription start site (TSS) usage, as well as a transcriptome-wide shift in TSS usage between fetal and adult brain.

Fig. 1: Methods for 5′-end RNA-seq.
Fig. 2: Sequence read performance metrics for 5′ end methods.
Fig. 3: TSS peak performance metrics.
Fig. 4: TSS discovery for unannotated CAGE peaks.
Fig. 5: Differential TSS usage in brain-related samples.
Fig. 6: Adult brain samples preferentially use more downstream TSSs.

Change history

  • 20 November 2018

    The original version of this paper contained an incorrect primer sequence. In the Methods subsection “Rampage libraries,” the text for modification 3 stated that the reverse primer used for library indexing was 5′-CAAGCAGAAGACGGCATACGAGATXXXXXXXXGTGACTGGAGT-3′. The correct sequence of the oligonucleotide used is 5′-CAAGCAGAAGACGGCATACGAGATXXXXXXXXGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT-3′. This error has been corrected in the PDF and HTML versions of the paper.

Acknowledgements

We are grateful to M. Salit and J. McDaniel (National Institute of Standards and Technology, Gaithersburg, MD, USA) for ERCC spike-in RNA; P. Batut for sharing RAMPAGE peak-calling code; N. Shoresh for advice on epigenomics datasets; N. Sanjana for advice on preparing the NGN1/2 in vitro neuron sample; B. Haas, Y. Farjoun, and M. Hofree for statistical advice; L. Gaffney for assistance with figures; I. Wortman and C. Cheng for K-562 experiments; C. de Boer for helpful comments on this manuscript; and the Broad Genomics Platform for sequencing. We thank S. McCarroll for suggesting this research direction and helpful discussions in the early phases of this study. This work was supported by the Stanley Center for Psychiatric Research, the Klarman Cell Observatory, and the BRAIN Initiative (U01-MH105960-01 to A.R.).

Author information

Contributions

J.Z.L., X.A., and A.R. conceived the research. X.A. prepared the 5′-end RNA-seq libraries. J.J. prepared the standard RNA-seq library. X.S. prepared the in vitro neurons under the supervision of J.Q.P. M.A.L. prepared the brain organoid RNA. A.L.H., S.K.S., A.L.M., Z.J., and M.A.B. developed and performed computational analysis. J.Z.L., X.A., A.L.H., S.K.S., and A.R. wrote the paper. All of the authors edited the paper.

Corresponding authors

Correspondence to Aviv Regev or Joshua Z. Levin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Comparing lab methods of 5′-end sequencing for specific genes.

Sequencing coverage with five different lab methods for three highly expressed genes in K-562 cells. Shown is the scaled number of reads (y-axis) at each position in the genome (x-axis; top track). Bottom track shows position of annotated exons (filled boxes) and introns (lines) with direction of transcription shown by arrows based on UCSC annotation. Plots generated with IGV (Robinson, J.T. et al. Integrative genomics viewer. Nat Biotechnol 29, 24-26 (2011)).

Supplementary Figure 2 CapFilter improved peak-calling.

Sensitivity, Precision, and F1 scores (bars, y-axis) at varying levels of filtering by CapFilter (x-axis) for each of four lab methods. Each level corresponds to the minimum percent of reads per peak that begin with an extra G (Online Methods).
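The filtering levels in this legend can be illustrated with a minimal sketch. It assumes each peak is summarized by two counts, the total reads and the reads that begin with an extra G. The `Peak` class, its field names, and the example numbers are hypothetical and are not taken from the CapFilter software.

```python
from dataclasses import dataclass


@dataclass
class Peak:
    """Hypothetical per-peak summary used only for this illustration."""
    name: str
    total_reads: int
    extra_g_reads: int  # reads in the peak that begin with an extra G


def extra_g_percent(peak: Peak) -> float:
    """Percent of reads in the peak that begin with an extra G."""
    if peak.total_reads == 0:
        return 0.0
    return 100.0 * peak.extra_g_reads / peak.total_reads


def cap_filter(peaks, min_percent):
    """Keep peaks whose extra-G percent reaches the chosen filtering level."""
    return [p for p in peaks if extra_g_percent(p) >= min_percent]


# Made-up example data: 30%, 2%, and an empty peak.
peaks = [Peak("peak_a", 200, 60), Peak("peak_b", 150, 3), Peak("peak_c", 0, 0)]
kept_names = [p.name for p in cap_filter(peaks, min_percent=10.0)]
print(kept_names)
```

Raising `min_percent` corresponds to moving along the x-axis of the figure toward stricter filtering.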

Supplementary Figure 3 Strand invasion and RAMPAGE filters did not improve peak-calling.

Sensitivity, Precision, and F1 scores (bars, y-axis) at (a) different levels of filtering with a strand invasion filter (Online Methods); and (b) comparing RAMPAGE (with and without read 2) and ParaClu peak callers. In all cases CapFilter was used.

Supplementary Figure 4 Sensitivity with and without corroborative DNase-seq data.

Shown is the sensitivity (y-axis) for each method (x-axis). False negatives were defined as all TSSs without overlapping 5′-end RNA-Seq peaks (“without” DNase-Seq) or only the subset overlapping DNase-Seq peaks in K-562 cells (“with” DNase-Seq, the method used in Fig. 4). DNase-Seq data permit a better assessment of actual performance in K-562 cells than comparison to the UCSC annotation alone, which is compiled from diverse samples.
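The two false-negative definitions can be compared with a short sketch. It assumes TSSs are represented as sets of identifiers and that overlap with RNA-seq or DNase-Seq peaks has already been computed. The function names and example identifiers are hypothetical, and the F1 formula is the standard harmonic mean rather than code from the paper.

```python
def sensitivity(annotated_tss, detected_tss, dnase_supported=None):
    """Sensitivity = TP / (TP + FN).

    annotated_tss: set of annotated TSS identifiers.
    detected_tss: annotated TSSs overlapped by a 5'-end RNA-seq peak.
    dnase_supported: optional set of TSSs overlapping DNase-Seq peaks;
        when given, only missed TSSs in this set count as false negatives.
    """
    true_positives = annotated_tss & detected_tss
    missed = annotated_tss - detected_tss
    if dnase_supported is not None:
        missed = missed & dnase_supported
    denominator = len(true_positives) + len(missed)
    return len(true_positives) / denominator if denominator else 0.0


def f1_score(sens, prec):
    """Harmonic mean of sensitivity and precision."""
    return 2 * sens * prec / (sens + prec) if (sens + prec) else 0.0


# Made-up example: five annotated TSSs, two detected, three with DNase support.
annotated = {"tss1", "tss2", "tss3", "tss4", "tss5"}
detected = {"tss1", "tss2"}
dnase = {"tss1", "tss2", "tss3"}

sens_without = sensitivity(annotated, detected)
sens_with = sensitivity(annotated, detected, dnase)
print(sens_without, sens_with)
```

In this toy case, restricting false negatives to DNase-supported TSSs raises the sensitivity because TSSs that show no evidence of activity in the sample no longer count against the method.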

Supplementary Figure 5 STRT performance is essentially independent of RNA input amount.

Sensitivity, precision, and F1 score (y-axis) for STRT with RNA input amounts ranging from 10 ng to 10 μg. Also included to aid comparison are the STRT data shown in Fig. 3a (10 ng input).

Supplementary Figure 6 5′-end-method performance metrics with GENCODE annotation.

Sensitivity, precision, and F1 score (y-axis) for each lab method (x-axis) relative to the GENCODE annotation.

Supplementary Figure 7 Performance of the 5′-end methods in published datasets.

Sensitivity, precision, and F1 score (y-axis) for each lab method (x-axis). Comparison of (a) CAGE (replicates A and B) to RAMPAGE for K-562, (b) CAGE to Oligo capping for MCF-7, and (c) CAGE to STRT for mouse hippocampus. CAGE performed better than other methods in these comparisons.

Supplementary Figure 8 TSS initiator sequences.

For each method, shown are the nucleotides immediately before (−1 position) and after (+1 position) the dominant TSS for each tag cluster (TC). The results are displayed as sequence logos for (a) broad TCs and (b) narrow TCs. Although the nucleotide distributions differ among methods, all methods show a preference for a pyrimidine at position −1 and a purine at position +1, as has been found previously.
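The −1/+1 positions can be made concrete with a small sketch. It assumes plus-strand sequences and a 0-based index pointing at the +1 nucleotide; minus-strand TSSs would need reverse complementing first, which is not handled here. The helper names and toy sequences are hypothetical.

```python
from collections import Counter

PYRIMIDINES = set("CT")
PURINES = set("AG")


def initiator_dinucleotide(sequence, tss_index):
    """Return the nucleotides at positions -1 and +1 around a TSS.

    tss_index is the 0-based index of the +1 nucleotide on the plus strand.
    """
    if tss_index < 1 or tss_index >= len(sequence):
        raise ValueError("TSS index must leave room for the -1 position")
    return sequence[tss_index - 1:tss_index + 1].upper()


def is_pyrimidine_purine(dinucleotide):
    """True when -1 is a pyrimidine (C/T) and +1 is a purine (A/G)."""
    return dinucleotide[0] in PYRIMIDINES and dinucleotide[1] in PURINES


# Made-up sequences paired with the index of their dominant TSS.
examples = [("GGTCAGTT", 4), ("AACGTT", 3), ("TTAATT", 3)]
dinucleotides = [initiator_dinucleotide(seq, idx) for seq, idx in examples]
counts = Counter(dinucleotides)
yr_fraction = sum(is_pyrimidine_purine(d) for d in dinucleotides) / len(dinucleotides)
print(counts, yr_fraction)
```

Tallies like `counts` are the kind of per-position frequencies that a sequence logo summarizes.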

Supplementary Figure 9 Reproducibility of 5′-end methods.

(a) Shared peaks across CAGE replicates. Shown is the proportion of shared peaks. Main-1, Main-4, and Main-6 were processed in the same batch. (b) Normalized coverage by position for CAGE, RAMPAGE, and STRT replicates. For each library, shown is the average relative coverage (y-axis) at each relative position along the transcripts’ length (x-axis).
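One way to compute a profile like the one in panel (b) is sketched below. It assumes that "relative coverage" means each transcript's per-base depth divided by that transcript's total depth, binned by relative position and then averaged across transcripts. This definition, the function names, and the example depths are assumptions for illustration and may differ from the procedure in the paper.

```python
def relative_coverage(per_base_depth, n_bins=10):
    """Share of a transcript's coverage falling in each relative-position bin."""
    total = sum(per_base_depth)
    length = len(per_base_depth)
    if total == 0 or length == 0:
        return [0.0] * n_bins
    bins = [0.0] * n_bins
    for i, depth in enumerate(per_base_depth):
        # Map base i of a transcript of this length onto one of n_bins bins.
        b = min(i * n_bins // length, n_bins - 1)
        bins[b] += depth / total
    return bins


def average_profile(transcripts, n_bins=10):
    """Average the per-transcript relative coverage at each bin."""
    profiles = [relative_coverage(t, n_bins) for t in transcripts]
    return [sum(p[b] for p in profiles) / len(profiles) for b in range(n_bins)]


# Made-up per-base depths for two transcripts of different lengths.
transcripts = [[8, 2, 0, 0], [6, 3, 1, 0, 0, 0, 0, 0]]
profile = average_profile(transcripts, n_bins=4)
print(profile)
```

Normalizing within each transcript before averaging keeps highly expressed transcripts from dominating the profile; a library enriched for 5′ ends would show most of its mass in the first bins.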

Supplementary Figure 10 Correlation of gene expression levels.

Shown are scatter plots for an all-versus-all comparison of gene expression levels (ln(TPM+1)) for (a) CAGE replicates, (b) RAMPAGE replicates, (c) STRT replicates, and (d) each 5′-end method and standard RNA-Seq. Points are colored based on their normalized density (Online Methods). Pearson's r is shown for each comparison. Sample size for each method: n = 1 library per replicate or method, except that CAGE in (d) is a combination of 3 libraries.
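The comparison metric in this legend can be sketched as follows, assuming two equal-length lists of per-gene TPM values in the same gene order. The function names and the TPM values are made up for illustration, and the standard library is used in place of a statistics package.

```python
import math


def log_tpm(values):
    """Apply the ln(TPM + 1) transform to each expression value."""
    return [math.log(v + 1.0) for v in values]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up TPM values for four genes in two replicate libraries.
replicate_a = [0.0, 5.0, 20.0, 100.0]
replicate_b = [0.0, 6.0, 18.0, 120.0]
r = pearson_r(log_tpm(replicate_a), log_tpm(replicate_b))
print(round(r, 3))
```

Adding 1 before taking the logarithm keeps genes with zero TPM in the comparison, since ln(0 + 1) = 0.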

Supplementary Figure 11 TSS discovery for unannotated peaks.

(a,b) Corroborative data for TSS peaks from all methods. Shown are the proportion (a) and number (b) of peaks (y axis) with support from each corroborative data source (color legend) for peaks initially defined as ‘true positive’, ‘false positive’ and ‘intergenic’ based on the UCSC annotation. (a) Peaks were assigned to only one category of support as in Fig. 4a. (b) Peaks were assigned to as many corroborative categories as evidence supported as in Fig. 4b.

Supplementary Figure 12 Corroborative evidence for 5′ ends identified by standard RNA-seq.

Venn diagram showing TSS prediction with Standard RNA-Seq, DNase-Seq and H3K4me3 ChIP-Seq data. Peak counts in overlapping categories are numbers of RNA-Seq peaks for every overlap that involves RNA-Seq peaks, and numbers of DNase-Seq peaks for the overlap with only H3K4me3 ChIP-Seq peaks. For each subset of RNA-Seq peaks, we also show the percentage of true positives (TPs) out of all the RNA-Seq peaks in that category. Areas are not to scale.

Supplementary Figure 13 Correlation of CAGE-based gene expression for brain-related samples.

Heatmap showing the Pearson correlation of expression levels based on ln(TPM+1) between each pair of brain-related samples. Correlation was calculated using all genes expressed in at least one sample. The associated hierarchical clustering is displayed above and to the left of the heatmap. Sample size for each method: n = 1 library per sample.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–13, Supplementary Notes 1–5 and Supplementary Tables 1–6, 8–12

Reporting Summary

Supplementary Table 7

List of differential TSS usage in brain-related samples.

Source Data, Figure 2

Source Data, Figure 3

Source Data, Figure 4

Source Data, Figure 5

Source Data, Figure 6

About this article

Cite this article

Adiconis, X., Haber, A.L., Simmons, S.K. et al. Comprehensive comparative analysis of 5′-end RNA-sequencing methods. Nat Methods 15, 505–511 (2018). https://doi.org/10.1038/s41592-018-0014-2

  • DOI: https://doi.org/10.1038/s41592-018-0014-2
