Specialized RNA-seq methods are required to identify the 5′ ends of transcripts, which are critical for studies of gene regulation, but these methods have not been systematically benchmarked. We directly compared six such methods, including an assessment of five methods on a single human cellular RNA sample and a new spike-in RNA assay that helps circumvent challenges resulting from uncertainties in annotation and RNA processing. We found that the ‘cap analysis of gene expression’ (CAGE) method performed best for mRNA and that most of its unannotated peaks were supported by evidence from other genomic methods. We applied CAGE to eight brain-related samples and determined sample-specific transcription start site (TSS) usage, as well as a transcriptome-wide shift in TSS usage between fetal and adult brain.
We are grateful to M. Salit and J. McDaniel (National Institute of Standards and Technology, Gaithersburg, MD, USA) for ERCC spike-in RNA; P. Batut for sharing RAMPAGE peak-calling code; N. Shoresh for advice on epigenomics datasets; N. Sanjana for advice on preparing the NGN1/2 in vitro neuron sample; B. Haas, Y. Farjoun, and M. Hofree for statistical advice; L. Gaffney for assistance with figures; I. Wortman and C. Cheng for K-562 experiments; C. de Boer for helpful comments on this manuscript; and the Broad Genomics Platform for sequencing. We thank S. McCarroll for suggesting this research direction and helpful discussions in the early phases of this study. This work was supported by the Stanley Center for Psychiatric Research, the Klarman Cell Observatory, and the BRAIN Initiative (U01-MH105960-01 to A.R.).
Integrated supplementary information
Sequencing coverage with five different lab methods for three highly expressed genes in K-562 cells. Shown is the scaled number of reads (y-axis) at each position in the genome (x-axis; top track). The bottom track shows the positions of annotated exons (filled boxes) and introns (lines), with the direction of transcription shown by arrows, based on the UCSC annotation. Plots were generated with IGV (Robinson, J.T. et al. Integrative genomics viewer. Nat Biotechnol 29, 24-26 (2011)).
Sensitivity, precision, and F1 scores (bars, y-axis) at varying levels of filtering by CapFilter (x-axis) for each of four lab methods. Each level corresponds to the minimum percentage of reads per peak that begin with an extra G (Online Methods).
Sensitivity, precision, and F1 scores (bars, y-axis) (a) at different levels of filtering with a strand invasion filter (Online Methods); and (b) for the RAMPAGE (with and without read 2) and ParaClu peak callers. In all cases, CapFilter was used.
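As an illustrative sketch of the filtering sweep described in this caption, the Python below drops peaks whose percentage of reads beginning with an extra G falls under a threshold and then scores the surviving peaks. The `Peak` record, its field names, and the toy values are hypothetical, and the formulas are the standard definitions of sensitivity, precision, and F1; the actual CapFilter procedure is described in the Online Methods and may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Peak:
    """Hypothetical peak record; field names are illustrative only."""
    total_reads: int
    extra_g_reads: int       # reads in the peak that begin with an extra G
    tss_id: Optional[str]    # annotated TSS the peak overlaps, or None


def cap_filter(peaks: List[Peak], min_percent_g: float) -> List[Peak]:
    """Keep peaks where at least min_percent_g percent of reads start with an extra G."""
    return [
        p for p in peaks
        if p.total_reads > 0
        and 100.0 * p.extra_g_reads / p.total_reads >= min_percent_g
    ]


def score(peaks: List[Peak], n_annotated_tss: int) -> Tuple[float, float, float]:
    """Return (sensitivity, precision, F1) for a set of called peaks."""
    # Sensitivity is counted over annotated TSSs, precision over called peaks.
    detected = {p.tss_id for p in peaks if p.tss_id is not None}
    true_peaks = sum(1 for p in peaks if p.tss_id is not None)
    sensitivity = len(detected) / n_annotated_tss if n_annotated_tss else 0.0
    precision = true_peaks / len(peaks) if peaks else 0.0
    denom = sensitivity + precision
    f1 = 2 * sensitivity * precision / denom if denom else 0.0
    return sensitivity, precision, f1


if __name__ == "__main__":
    # Toy data (made up) with four annotated TSSs in total.
    toy_peaks = [
        Peak(100, 30, "A"),
        Peak(50, 2, None),
        Peak(40, 10, "B"),
        Peak(10, 0, "C"),
    ]
    for threshold in (0, 10, 20):
        kept = cap_filter(toy_peaks, threshold)
        print(threshold, score(kept, n_annotated_tss=4))
```

Raising the threshold in this toy example removes the unannotated peak (raising precision) but also removes a weakly supported true peak (lowering sensitivity), which is the trade-off such a sweep is meant to expose.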
Shown is the sensitivity (y-axis) for each method (x-axis). False negatives were defined either as all TSSs without overlapping 5′ end RNA-Seq peaks (“without” DNase-Seq) or as only the subset of those TSSs that overlap DNase-Seq peaks in K-562 cells (“with” DNase-Seq, the method used in Fig. 4). DNase-Seq data permit a better assessment of actual performance in K-562 cells than comparison to the UCSC annotation alone, which is compiled from diverse samples.
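The two false-negative definitions in this caption can be sketched with plain set operations. This is a minimal illustration under assumed inputs: the function name, the use of TSS identifiers as set members, and the toy values are hypothetical, and overlaps between TSSs and peaks are assumed to have been computed beforehand.

```python
from typing import Optional, Set


def sensitivity(annotated: Set[str],
                detected: Set[str],
                dnase_supported: Optional[Set[str]] = None) -> float:
    """Sensitivity = TP / (TP + FN) over annotated TSSs.

    annotated:       all annotated TSS identifiers
    detected:        TSSs overlapped by a 5' end RNA-Seq peak
    dnase_supported: TSSs overlapped by a DNase-Seq peak; if given, only
                     missed TSSs in this set count as false negatives
    """
    true_pos = annotated & detected
    missed = annotated - detected
    if dnase_supported is not None:
        # "with" DNase-Seq: ignore missed TSSs lacking DNase-Seq support
        missed = missed & dnase_supported
    total = len(true_pos) + len(missed)
    return len(true_pos) / total if total else 0.0


if __name__ == "__main__":
    annotated = {"a", "b", "c", "d", "e"}   # toy identifiers
    detected = {"a", "b"}
    dnase = {"a", "c"}
    print(sensitivity(annotated, detected))          # "without" DNase-Seq
    print(sensitivity(annotated, detected, dnase))   # "with" DNase-Seq
```

Because missed TSSs without DNase-Seq support are no longer penalized, the "with" value can only be equal to or higher than the "without" value.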
Sensitivity, precision, and F1 score (y-axis) for STRT with RNA input amounts ranging from 10 ng to 10 μg. Also included to aid comparison are the STRT data shown in Fig. 3a (10 ng input).
Sensitivity, precision, and F1 score (y-axis) for each lab method (x-axis) relative to the Gencode annotation.
Sensitivity, precision, and F1 score (y-axis) for each lab method (x-axis). Comparison of (a) CAGE (replicates A and B) to RAMPAGE for K-562, (b) CAGE to Oligo capping for MCF-7, and (c) CAGE to STRT for mouse hippocampus. CAGE performed better than other methods in these comparisons.
For each method, shown are the nucleotides right before (−1 position) and after (+1 position) the dominant TSS for each tag cluster (TC). The results are displayed as sequence logos for (a) broad TCs and (b) narrow TCs. Although the methods differ in their nucleotide distributions, in all cases we see a preference for a pyrimidine at position −1 and a purine at position +1, as has been found previously.
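The base preferences summarized by such logos come down to per-position nucleotide frequencies. The sketch below tallies them for the −1 and +1 positions; the function name, the input layout (sequence already oriented on the transcribed strand, with a 0-based index marking the +1 base), and the toy sequences are assumptions for illustration rather than the procedure used to draw the logos.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def flanking_base_frequencies(
    sites: Iterable[Tuple[str, int]]
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return base frequencies at the -1 and +1 positions.

    Each site is (sequence on the transcribed strand, 0-based index of the
    dominant TSS, i.e. the +1 base). Sites with no -1 base are skipped.
    """
    minus1: Counter = Counter()
    plus1: Counter = Counter()
    for seq, tss in sites:
        if tss < 1 or tss >= len(seq):
            continue
        minus1[seq[tss - 1].upper()] += 1
        plus1[seq[tss].upper()] += 1

    def normalise(counts: Counter) -> Dict[str, float]:
        total = sum(counts.values())
        return {base: n / total for base, n in counts.items()} if total else {}

    return normalise(minus1), normalise(plus1)


if __name__ == "__main__":
    # Toy sequences (made up); index 2 is the +1 base in each.
    toy_sites = [("ACAG", 2), ("TTGA", 2), ("GCAT", 2)]
    minus1, plus1 = flanking_base_frequencies(toy_sites)
    pyrimidine_minus1 = minus1.get("C", 0.0) + minus1.get("T", 0.0)
    purine_plus1 = plus1.get("A", 0.0) + plus1.get("G", 0.0)
    print(minus1, plus1, pyrimidine_minus1, purine_plus1)
```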
(a) Shared peaks across CAGE replicates. Shown is the proportion of shared peaks. Main-1, Main-4, and Main-6 were processed in the same batch. (b) Normalized coverage by position for CAGE, RAMPAGE, and STRT replicates. For each library, shown is the average relative coverage (y-axis) at each relative position along the transcripts’ length (x-axis).
Shown are scatter plots for an all-versus-all comparison of gene expression levels (ln(TPM+1)) for (a) CAGE replicates, (b) RAMPAGE replicates, (c) STRT replicates, and (d) each 5′ end method and standard RNA-Seq. Points are colored by their normalized density (Online Methods). Pearson's r is shown for each comparison. Sample size for each method: n = 1 library per replicate or method, except that CAGE in (d) is a combination of 3 libraries.
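For readers who want to reproduce this kind of comparison, the following minimal sketch applies the ln(TPM+1) transform and computes Pearson's r between two libraries. The function names and toy TPM values are hypothetical; gene-level TPM values are assumed to be available already and to be listed in the same gene order for both libraries.

```python
import math
from typing import List, Sequence


def log_tpm(tpm: Sequence[float]) -> List[float]:
    """Transform TPM values to ln(TPM + 1)."""
    return [math.log1p(v) for v in tpm]


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Toy TPM values (made up) for the same five genes in two libraries.
    library_a = [0.0, 5.2, 130.0, 12.5, 980.0]
    library_b = [0.3, 4.1, 150.0, 9.8, 1100.0]
    print(pearson_r(log_tpm(library_a), log_tpm(library_b)))
```

The log transform keeps a handful of very highly expressed genes from dominating the correlation, and adding 1 keeps genes with zero TPM defined.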
(a,b) Corroborative data for TSS peaks from all methods. Shown are the proportion (a) and number (b) of peaks (y-axis) with support from each corroborative data source (color legend) for peaks initially defined as ‘true positive’, ‘false positive’ and ‘intergenic’ based on the UCSC annotation. (a) Peaks were assigned to only one category of support, as in Fig. 4a. (b) Peaks were assigned to as many corroborative categories as the evidence supported, as in Fig. 4b.
Venn diagram showing TSS prediction with standard RNA-Seq, DNase-Seq and H3K4me3 ChIP-Seq data. For every overlapping category that involves RNA-Seq peaks, the number shown counts RNA-Seq peaks; for the overlap between DNase-Seq and H3K4me3 ChIP-Seq peaks only, it counts DNase-Seq peaks. For each subset of RNA-Seq peaks, we also show the percentage of true positives (TPs) out of all the RNA-Seq peaks in that category. Areas are not to scale.
Heatmap showing the Pearson correlation of expression levels based on ln(TPM+1) between each pair of brain-related samples. Correlation was calculated using all genes expressed in at least one sample. The associated hierarchical clustering is displayed above and to the left of the heatmap. Sample size for each method: n = 1 library per sample.
Supplementary Figures 1–13, Supplementary Notes 1–5 and Supplementary Tables 1–6, 8–12
List of differential TSS usage in brain-related samples.