Specialized RNA-seq methods are required to identify the 5′ ends of transcripts, which are critical for studies of gene regulation, but these methods have not been systematically benchmarked. We directly compared six such methods, including the performance of five methods on a single human cellular RNA sample and a new spike-in RNA assay that helps circumvent challenges resulting from uncertainties in annotation and RNA processing. We found that the ‘cap analysis of gene expression’ (CAGE) method performed best for mRNA and that most of its unannotated peaks were supported by evidence from other genomic methods. We applied CAGE to eight brain-related samples and determined sample-specific transcription start site (TSS) usage, as well as a transcriptome-wide shift in TSS usage between fetal and adult brain.
We are grateful to M. Salit and J. McDaniel (National Institute of Standards and Technology, Gaithersburg, MD, USA) for ERCC spike-in RNA; P. Batut for sharing RAMPAGE peak-calling code; N. Shoresh for advice on epigenomics datasets; N. Sanjana for advice on preparing the NGN1/2 in vitro neuron sample; B. Haas, Y. Farjoun, and M. Hofree for statistical advice; L. Gaffney for assistance with figures; I. Wortman and C. Cheng for K-562 experiments; C. de Boer for helpful comments on this manuscript; and the Broad Genomics Platform for sequencing. We thank S. McCarroll for suggesting this research direction and helpful discussions in the early phases of this study. This work was supported by the Stanley Center for Psychiatric Research, the Klarman Cell Observatory, and the BRAIN Initiative (U01-MH105960-01 to A.R.).
The authors declare no competing interests.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Integrated supplementary information
Sequencing coverage with five different lab methods for three highly expressed genes in K-562 cells. Shown is the scaled number of reads (y-axis) at each position in the genome (x-axis; top track). Bottom track shows position of annotated exons (filled boxes) and introns (lines) with direction of transcription shown by arrows based on UCSC annotation. Plots generated with IGV (Robinson, J.T. et al. Integrative genomics viewer. Nat Biotechnol 29, 24-26 (2011)).
Sensitivity, Precision, and F1 scores (bars, y-axis) at varying levels of filtering by CapFilter (x-axis) for each of four lab methods. Each level corresponds to the minimum percent of reads per peak that begin with an extra G (Online Methods).
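The filtering levels in this caption can be illustrated with a minimal sketch. It assumes each peak already carries two read counts: all reads in the peak, and the subset that begin with an extra G. The `Peak` class, its field names, and the example numbers are hypothetical; how a read is judged to begin with an extra G is described in the Online Methods and is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Peak:
    """Hypothetical per-peak summary; field names are illustrative only."""
    name: str
    total_reads: int    # all reads assigned to the peak
    extra_g_reads: int  # subset of those reads that begin with an extra G


def extra_g_percent(peak: Peak) -> float:
    """Percent of the peak's reads that begin with an extra G."""
    if peak.total_reads == 0:
        return 0.0
    return 100.0 * peak.extra_g_reads / peak.total_reads


def filter_peaks(peaks, min_percent):
    """Keep peaks whose extra-G percent is at least min_percent."""
    return [p for p in peaks if extra_g_percent(p) >= min_percent]


# Made-up example peaks: 30%, 4% and 0% extra-G reads respectively.
peaks = [Peak("p1", 200, 60), Peak("p2", 50, 2), Peak("p3", 10, 0)]

for level in (0, 5, 10, 20):
    kept = [p.name for p in filter_peaks(peaks, level)]
    print(f"min extra-G percent {level}: {kept}")
```

Raising the threshold removes peaks with little cap-associated signal, which is why the caption reports sensitivity, precision, and F1 at each level rather than at a single cutoff.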
Sensitivity, Precision, and F1 scores (bars, y-axis) at (a) different levels of filtering with a strand invasion filter (Online Methods); and (b) comparing RAMPAGE (with and without read 2) and ParaClu peak callers. In all cases CapFilter was used.
Shown is the sensitivity (y-axis) for each method (x-axis). False negatives were defined either as all TSSs without overlapping 5′-end RNA-Seq peaks (“without” DNase-Seq) or as only the subset of those TSSs that overlap DNase-Seq peaks in K-562 cells (“with” DNase-Seq, the method used in Fig. 4). DNase-Seq data permit a better assessment of actual performance in K-562 cells than comparison to the UCSC annotation alone, which is compiled from diverse samples.
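The two false-negative definitions can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: intervals are hypothetical half-open `(chrom, start, end)` tuples, any shared base counts as an overlap, and true positives are taken to be all annotated TSSs with an overlapping 5′-end peak in both modes.

```python
def overlaps(a, b):
    """True if half-open intervals (chrom, start, end) share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def any_overlap(query, intervals):
    """True if query overlaps any interval in the list."""
    return any(overlaps(query, iv) for iv in intervals)


def sensitivity(tss_list, rna_peaks, dnase_peaks=None):
    """Sensitivity = TP / (TP + FN).

    TP: TSSs with an overlapping 5'-end RNA-seq peak.
    FN: TSSs without such a peak; if dnase_peaks is given, only those
        missed TSSs that also overlap a DNase-seq peak are counted.
    """
    detected = [t for t in tss_list if any_overlap(t, rna_peaks)]
    missed = [t for t in tss_list if not any_overlap(t, rna_peaks)]
    if dnase_peaks is not None:
        missed = [t for t in missed if any_overlap(t, dnase_peaks)]
    total = len(detected) + len(missed)
    return len(detected) / total if total else float("nan")


# Made-up example: one detected TSS, one missed TSS in open chromatin,
# one missed TSS with no DNase-seq support.
tss = [("chr1", 100, 101), ("chr1", 500, 501), ("chr2", 900, 901)]
rna = [("chr1", 90, 110)]
dnase = [("chr1", 80, 120), ("chr1", 450, 550)]

print("without DNase-Seq:", sensitivity(tss, rna))
print("with DNase-Seq:   ", sensitivity(tss, rna, dnase))
```

Restricting false negatives to TSSs in open chromatin avoids penalizing a method for annotated TSSs that are probably inactive in the sample being tested.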
Sensitivity, precision, and F1 score (y-axis) for STRT with RNA input amounts ranging from 10 ng to 10 μg. Also included to aid comparison are the STRT data shown in Fig. 3a (10 ng input).
Sensitivity, precision, and F1 score (y-axis) for each lab method (x-axis) relative to the Gencode annotation.
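For reference, the three scores reported in these panels follow the standard definitions, sketched below from counts. The function and argument names are hypothetical, and the rule for matching a peak to an annotated TSS (Online Methods) is assumed to have been applied already. Peak-level and TSS-level counts are kept separate because one peak can cover several annotated TSSs.

```python
def precision(tp_peaks, fp_peaks):
    """Fraction of called peaks that match an annotated TSS."""
    called = tp_peaks + fp_peaks
    return tp_peaks / called if called else float("nan")


def sensitivity(detected_tss, missed_tss):
    """Fraction of annotated TSSs that are matched by at least one peak."""
    annotated = detected_tss + missed_tss
    return detected_tss / annotated if annotated else float("nan")


def f1_score(p, s):
    """Harmonic mean of precision and sensitivity."""
    return 2 * p * s / (p + s) if (p + s) else 0.0


# Made-up counts for illustration only.
p = precision(tp_peaks=80, fp_peaks=20)            # 0.8
s = sensitivity(detected_tss=60, missed_tss=40)    # 0.6
print(f"precision={p:.3f} sensitivity={s:.3f} F1={f1_score(p, s):.3f}")
```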
Sensitivity, precision, and F1 score (y-axis) for each lab method (x-axis). Comparison of (a) CAGE (replicates A and B) to RAMPAGE for K-562, (b) CAGE to Oligo capping for MCF-7, and (c) CAGE to STRT for mouse hippocampus. CAGE performed better than other methods in these comparisons.
For each method, shown are the nucleotides right before (−1 position) and after (+1 position) the dominant TSS for each tag cluster (TC). The results are displayed as sequence logos for (a) broad TCs and (b) narrow TCs. Although the methods differ in the nucleotide distributions, in all cases, we do see a preference for a pyrimidine at position −1 and a purine at position +1, as has been found previously.
(a) Shared peaks across CAGE replicates. Shown is the proportion of shared peaks. Main-1, Main-4, and Main-6 were processed in the same batch. (b) Normalized coverage by position for CAGE, RAMPAGE, and STRT replicates. For each library, shown is the average relative coverage (y-axis) at each relative position along the transcripts’ length (x-axis).
Shown are scatter plots for an all-versus-all comparison of gene expression levels (ln(TPM+1)) for (a) CAGE replicates, (b) RAMPAGE replicates, (c) STRT replicates, and (d) each 5′-end method and standard RNA-Seq. Points are colored based on their normalized density (Online Methods). Pearson's r is shown for each comparison. Sample size for each method: n = 1 library per replicate or method, except that CAGE in (d) is a combination of 3 libraries.
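The comparison described in this caption, Pearson's r on ln(TPM+1) values, can be sketched in plain Python. The replicate vectors below are invented for illustration, and the sketch assumes both lists hold TPM values for the same genes in the same order.

```python
import math


def log_tpm(values):
    """Apply the ln(TPM + 1) transform; log1p(x) computes ln(1 + x)."""
    return [math.log1p(v) for v in values]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical TPM values for the same four genes in two replicates.
replicate_a = [0.0, 1.5, 10.0, 250.0]
replicate_b = [0.0, 3.0, 20.0, 500.0]

r = pearson_r(log_tpm(replicate_a), log_tpm(replicate_b))
print(f"Pearson's r on ln(TPM+1): {r:.3f}")
```

Adding 1 before taking the logarithm keeps genes with zero TPM in the comparison, since ln(0) is undefined.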
(a,b) Corroborative data for TSS peaks from all methods. Shown are the proportion (a) and number (b) of peaks (y axis) with support from each corroborative data source (color legend) for peaks initially defined as ‘true positive’, ‘false positive’ and ‘intergenic’ based on the UCSC annotation. (a) Peaks were assigned to only one category of support as in Fig. 4a. (b) Peaks were assigned to as many corroborative categories as evidence supported as in Fig. 4b.
Venn diagram showing TSS prediction with standard RNA-Seq, DNase-Seq and H3K4me3 ChIP-Seq data. In overlapping categories, the numbers shown are counts of RNA-Seq peaks for every overlap that involves RNA-Seq peaks, and counts of DNase-Seq peaks for the overlap with only H3K4me3 ChIP-Seq peaks. For each subset of RNA-Seq peaks, we also show the percentage of true positives (TPs) among all RNA-Seq peaks in that category. Areas are not to scale.
Heatmap showing the Pearson correlation of expression levels based on ln(TPM+1) between each pair of brain-related samples. Correlation was calculated using all genes expressed in at least one sample. The associated hierarchical clustering is displayed above and to the left of the heatmap. Sample size for each method: n = 1 library per sample.
Supplementary Figures 1–13, Supplementary Notes 1–5 and Supplementary Tables 1–6, 8–12
List of differential TSS usage in brain-related samples.
About this article
Cite this article
Adiconis, X., Haber, A.L., Simmons, S.K. et al. Comprehensive comparative analysis of 5′-end RNA-sequencing methods. Nat Methods 15, 505–511 (2018). https://doi.org/10.1038/s41592-018-0014-2