Abstract

Specialized RNA-seq methods are required to identify the 5′ ends of transcripts, which are critical for studies of gene regulation, but these methods have not been systematically benchmarked. We directly compared six such methods, including the performance of five methods on a single human cellular RNA sample and a new spike-in RNA assay that helps circumvent challenges resulting from uncertainties in annotation and RNA processing. We found that the ‘cap analysis of gene expression’ (CAGE) method performed best for mRNA and that most of its unannotated peaks were supported by evidence from other genomic methods. We applied CAGE to eight brain-related samples and determined sample-specific transcription start site (TSS) usage, as well as a transcriptome-wide shift in TSS usage between fetal and adult brain.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Acknowledgements

We are grateful to M. Salit and J. McDaniel (National Institute of Standards and Technology, Gaithersburg, MD, USA) for ERCC spike-in RNA; P. Batut for sharing RAMPAGE peak-calling code; N. Shoresh for advice on epigenomics datasets; N. Sanjana for advice on preparing the NGN1/2 in vitro neuron sample; B. Haas, Y. Farjoun, and M. Hofree for statistical advice; L. Gaffney for assistance with figures; I. Wortman and C. Cheng for K-562 experiments; C. de Boer for helpful comments on this manuscript; and the Broad Genomics Platform for sequencing. We thank S. McCarroll for suggesting this research direction and helpful discussions in the early phases of this study. This work was supported by the Stanley Center for Psychiatric Research, the Klarman Cell Observatory, and the BRAIN Initiative (U01-MH105960-01 to A.R.).

Author information

Author notes

  1. These authors contributed equally: Xian Adiconis, Adam L. Haber, Sean K. Simmons.

Affiliations

  1. Klarman Cell Observatory, Broad Institute of MIT and Harvard, Cambridge, MA, USA

    Xian Adiconis, Adam L. Haber, Zhe Ji, Aviv Regev & Joshua Z. Levin

  2. Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA

    Xian Adiconis, Sean K. Simmons, Xi Shi, Justin Jacques, Jen Q. Pan & Joshua Z. Levin

  3. Broad Institute of MIT and Harvard, Cambridge, MA, USA

    Ami Levy Moonshine, Michele A. Busby & Aviv Regev

  4. Laboratory of Molecular Biology, Medical Research Council, Cambridge, UK

    Madeline A. Lancaster

  5. Department of Biology, Howard Hughes Medical Institute, Massachusetts Institute of Technology, Cambridge, MA, USA

    Aviv Regev

  6. The David H. Koch Institute for Integrative Cancer Research at Massachusetts Institute of Technology, Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA

    Aviv Regev

Contributions

J.Z.L., X.A., and A.R. conceived the research. X.A. prepared the 5′-end RNA-seq libraries. J.J. prepared the standard RNA-seq library. X.S. prepared the in vitro neurons under the supervision of J.Q.P. M.A.L. prepared the brain organoid RNA. A.L.H., S.K.S., A.L.M., Z.J., and M.A.B. developed and performed computational analysis. J.Z.L., X.A., A.L.H., S.K.S., and A.R. wrote the paper. All of the authors edited the paper.

Competing interests

The authors declare no competing interests.

Corresponding authors

Correspondence to Aviv Regev or Joshua Z. Levin.

Integrated supplementary information

  1. Supplementary Figure 1 Comparing lab methods of 5′-end sequencing for specific genes.

    Sequencing coverage with five different lab methods for three highly expressed genes in K-562 cells. Shown is the scaled number of reads (y-axis) at each position in the genome (x-axis; top track). Bottom track shows position of annotated exons (filled boxes) and introns (lines) with direction of transcription shown by arrows based on UCSC annotation. Plots generated with IGV (Robinson, J.T. et al. Integrative genomics viewer. Nat Biotechnol 29, 24-26 (2011)).

  2. Supplementary Figure 2 CapFilter improved peak-calling.

    Sensitivity, Precision, and F1 scores (bars, y-axis) at varying levels of filtering by CapFilter (x-axis) for each of four lab methods. Each level corresponds to the minimum percent of reads per peak that begin with an extra G (Online Methods).

  3. Supplementary Figure 3 Strand invasion and RAMPAGE filters did not improve peak-calling.

    Sensitivity, Precision, and F1 scores (bars, y-axis) at (a) different levels of filtering with a strand invasion filter (Online Methods); and (b) comparing RAMPAGE (with and without read 2) and ParaClu peak callers. In all cases CapFilter was used.

  4. Supplementary Figure 4 Sensitivity with and without corroborative DNase-seq data.

    Shown is the sensitivity (y-axis) for each method (x-axis). False negatives were defined either as all TSSs without overlapping 5′-end RNA-Seq peaks (“without” DNase-Seq) or as only the subset of those TSSs that overlap DNase-Seq peaks in K-562 cells (“with” DNase-Seq, the method used in Fig. 4). DNase-Seq data permit a better assessment of actual performance in K-562 cells than comparison with the UCSC annotation alone, which is compiled from diverse samples.

  5. Supplementary Figure 5 STRT performance is essentially independent of RNA input amount.

    Sensitivity, precision, and F1 score (y-axis) for STRT with RNA input amounts ranging from 10 ng to 10 μg. Also included to aid comparison are the STRT data shown in Fig. 3a (10 ng input).

  6. Supplementary Figure 6 5′-end-method performance metrics with Gencode annotation.

    Sensitivity, precision, and F1 score (y-axis) for each lab method (x-axis) relative to the Gencode annotation.

  7. Supplementary Figure 7 Performance of the 5′-end methods in published datasets.

    Sensitivity, precision, and F1 score (y-axis) for each lab method (x-axis). Comparison of (a) CAGE (replicates A and B) to RAMPAGE for K-562, (b) CAGE to Oligo capping for MCF-7, and (c) CAGE to STRT for mouse hippocampus. CAGE performed better than other methods in these comparisons.

  8. Supplementary Figure 8 TSS initiator sequences.

    For each method, shown are the nucleotides immediately before (−1 position) and after (+1 position) the dominant TSS for each tag cluster (TC). The results are displayed as sequence logos for (a) broad TCs and (b) narrow TCs. Although the methods differ in their nucleotide distributions, all show a preference for a pyrimidine at position −1 and a purine at position +1, as has been found previously.

  9. Supplementary Figure 9 Reproducibility of 5′-end methods.

    (a) Shared peaks across CAGE replicates. Shown is the proportion of shared peaks. Main-1, Main-4, and Main-6 were processed in the same batch. (b) Normalized coverage by position for CAGE, RAMPAGE, and STRT replicates. For each library, shown is the average relative coverage (y-axis) at each relative position along the transcripts’ length (x-axis).

  10. Supplementary Figure 10 Correlation of gene expression levels.

    Shown are scatter plots for an all-versus-all comparison of gene expression levels (ln(TPM+1)) for (a) CAGE replicates, (b) RAMPAGE replicates, (c) STRT replicates, and (d) each 5′-end method and standard RNA-Seq. Points are colored based on their normalized density (Online Methods). Pearson's r is shown for each comparison. Sample size for each method: n = 1 library per replicate or method, except that CAGE in (d) is a combination of 3 libraries.

  11. Supplementary Figure 11 TSS discovery for unannotated peaks.

    (a,b) Corroborative data for TSS peaks from all methods. Shown are the proportion (a) and number (b) of peaks (y axis) with support from each corroborative data source (color legend) for peaks initially defined as ‘true positive’, ‘false positive’ and ‘intergenic’ based on the UCSC annotation. (a) Peaks were assigned to only one category of support as in Fig. 4a. (b) Peaks were assigned to as many corroborative categories as evidence supported as in Fig. 4b.

  12. Supplementary Figure 12 Corroborative evidence for 5′ ends identified by standard RNA-seq.

    Venn diagram showing TSS prediction with standard RNA-Seq, DNase-Seq and H3K4me3 ChIP-Seq data. In every overlapping category that involves RNA-Seq peaks, the number shown counts RNA-Seq peaks; in the overlap between only DNase-Seq and H3K4me3 ChIP-Seq peaks, it counts DNase-Seq peaks. For each subset of RNA-Seq peaks, we also show the percentage of true positives (TPs) out of all the RNA-Seq peaks in that category. Areas are not to scale.

  13. Supplementary Figure 13 Correlation of CAGE-based gene expression for brain-related samples.

    Heatmap showing the Pearson correlation of expression levels based on ln(TPM+1) between each pair of brain-related samples. Correlation was calculated using all genes expressed in at least one sample. The associated hierarchical clustering is displayed above and to the left of the heatmap. Sample size for each method: n = 1 library per sample.
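
The pairwise correlations described for Supplementary Figures 10 and 13 can be illustrated with a minimal sketch. It is not the authors' code: the function names, the sample and gene labels, the toy TPM values, and the choice of TPM > 0 as the definition of "expressed" are all assumptions made for illustration. It keeps genes expressed in at least one sample, transforms values to ln(TPM+1), and computes Pearson's r for every pair of samples. The hierarchical clustering shown alongside the heatmap is not covered here.

```python
import math


def log_tpm(tpm_values):
    """Transform a list of TPM values to ln(TPM + 1)."""
    return [math.log(value + 1.0) for value in tpm_values]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(tpm_by_sample, min_tpm=0.0):
    """Pairwise Pearson r of ln(TPM + 1) across samples.

    tpm_by_sample maps sample name -> {gene name -> TPM}.
    A gene is kept if its TPM exceeds min_tpm in at least one sample
    (the threshold is an assumption, not taken from the paper).
    """
    samples = sorted(tpm_by_sample)
    genes = sorted(
        {
            gene
            for sample in samples
            for gene, tpm in tpm_by_sample[sample].items()
            if tpm > min_tpm
        }
    )
    logged = {
        sample: log_tpm([tpm_by_sample[sample].get(gene, 0.0) for gene in genes])
        for sample in samples
    }
    matrix = [
        [pearson_r(logged[a], logged[b]) for b in samples] for a in samples
    ]
    return samples, genes, matrix


# Hypothetical toy input; GENE_D is not expressed anywhere and is dropped.
toy_tpm = {
    "fetal_brain": {"GENE_A": 10.0, "GENE_B": 0.0, "GENE_C": 5.0, "GENE_D": 0.0},
    "adult_brain": {"GENE_A": 2.0, "GENE_B": 8.0, "GENE_C": 5.0, "GENE_D": 0.0},
    "organoid": {"GENE_A": 9.0, "GENE_B": 1.0, "GENE_C": 4.0, "GENE_D": 0.0},
}
samples, genes, matrix = correlation_matrix(toy_tpm)
```

Filtering before the transform matters because genes that are zero in every sample would otherwise add identical points at the origin and inflate the correlation between any pair of samples.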

Supplementary information

  1. Supplementary Text and Figures

    Supplementary Figures 1–13, Supplementary Notes 1–5 and Supplementary Tables 1–6, 8–12

  2. Reporting Summary

  3. Supplementary Table 7

    List of differential TSS usage in brain-related samples.

  4. Source Data, Figure 2

  5. Source Data, Figure 3

  6. Source Data, Figure 4

  7. Source Data, Figure 5

  8. Source Data, Figure 6

About this article

DOI

https://doi.org/10.1038/s41592-018-0014-2