Abstract
Precise regulation of transcription by RNA polymerase II (RNAPII) is critical for organismal growth and development. However, what determines whether an engaged RNAPII will synthesize a full-length transcript or terminate prematurely is poorly understood. Notably, RNAPII is far more susceptible to termination when transcribing non-coding RNAs than when synthesizing protein-coding mRNAs, but the mechanisms underlying this are unclear. To investigate the impact of transcribed sequence on elongation potential, we developed a method to screen the effects of thousands of INtegrated Sequences on Expression of RNA and Translation using high-throughput sequencing (INSERT-seq). We found that higher AT content in non-coding RNAs, rather than specific sequence motifs, drives RNAPII termination. Further, we demonstrate that 5′ splice sites autonomously stimulate processive transcription, even in the absence of polyadenylation signals. Our results reveal a potent role for the transcribed sequence in dictating gene output and demonstrate the power of INSERT-seq toward illuminating these contributions.
Data availability
Raw and processed data files of all INSERT-seq experiments, PRO-seq, H3K4me3 ChIP–seq, and TT-seq are available at the Gene Expression Omnibus, accession no. GSE178230. H3K27ac ChIP–seq data are available through the 4DN data portal (https://data.4dnucleome.org/), ExperimentSet accession no. 4DNESQ33L4G7. H3K4me1 mESC ChIP–seq data were downloaded from the Gene Expression Omnibus, accession no. GSE56138. Reference genome mm10 (GRCm38) can be downloaded using RefSeq assembly accession number GCF_000001635.20. Supplementary Tables 3–7 provide all normalized and averaged data from INSERT-seq experiments, as well as which inserts are included in which plot. Uncropped image files and processed data shown in each plot are provided as source data. Source data are provided with this paper.
Code availability
All scripts used for analysis of INSERT-seq data can be found on GitHub: https://github.com/AdelmanLab/Vlaming2021_INSERT-seq_paper. URLs for all custom scripts used for PRO-seq, TT-seq and ChIP–seq analysis are provided in the Methods; these can be found at https://github.com/AdelmanLab/NIH_scripts/ and https://github.com/benjaminmartin02/binBedGraph.
References
Lykke-Andersen, S. et al. Integrator is a genome-wide attenuator of non-productive transcription. Mol. Cell 81, 514–529.e6 (2021).
Scruggs, B. S. et al. Bidirectional transcription arises from two distinct hubs of transcription factor binding and active chromatin. Mol. Cell 58, 1101–1112 (2015).
Tian, B., Hu, J., Zhang, H. & Lutz, C. S. A large-scale analysis of mRNA polyadenylation of human and mouse genes. Nucleic Acids Res. 33, 201–212 (2005).
Shi, Y. & Manley, J. L. The end of the message: multiple protein-RNA interactions define the mRNA polyadenylation site. Genes Dev. 29, 889–897 (2015).
Ntini, E. et al. Polyadenylation site–induced decay of upstream transcripts enforces promoter directionality. Nat. Struct. Mol. Biol. 20, 923–928 (2013).
Almada, A. E., Wu, X., Kriz, A. J., Burge, C. B. & Sharp, P. A. Promoter directionality is controlled by U1 snRNP and polyadenylation signals. Nature 499, 360–363 (2013).
Core, L. J. et al. Analysis of nascent RNA identifies a unified architecture of initiation regions at mammalian promoters and enhancers. Nat. Genet. 46, 1311–1320 (2014).
Chiu, A. C. et al. Transcriptional pause sites delineate stable nucleosome-associated premature polyadenylation suppressed by U1 snRNP. Mol. Cell 69, 648–663 (2018).
Le Hir, H., Nott, A. & Moore, M. J. How introns influence and enhance eukaryotic gene expression. Trends Biochem. Sci. 28, 215–220 (2003).
Damgaard, C. K. et al. A 5′ splice site enhances the recruitment of basal transcription initiation factors in vivo. Mol. Cell 29, 271–278 (2008).
Bieberstein, N. I., Carrillo Oesterreich, F., Straube, K. & Neugebauer, K. M. First exon length controls active chromatin signatures and transcription. Cell Rep. 2, 62–68 (2012).
Fiszbein, A., Krick, K. S., Begg, B. E. & Burge, C. B. Exon-mediated activation of transcription starts. Cell 179, 1551–1565 (2019).
Sousa-Luís, R. et al. POINT technology illuminates the processing of polymerase-associated intact nascent transcripts. Mol. Cell 81, 1935–1950 (2021).
Caizzi, L. et al. Efficient RNA polymerase II pause release requires U2 snRNP function. Mol. Cell 81, 1920–1934.e9 (2021).
Kaida, D. et al. U1 snRNP protects pre-mRNAs from premature cleavage and polyadenylation. Nature 468, 664–668 (2010).
Berg, M. G. et al. U1 snRNP determines mRNA length and regulates isoform expression. Cell 150, 53–64 (2012).
Andersen, P. K., Lykke-Andersen, S. & Jensen, T. H. Promoter-proximal polyadenylation sites reduce transcription activity. Genes Dev. 26, 2169–2179 (2012).
Zhang, S. et al. Structure of a transcribing RNA polymerase II–U1 snRNP complex. Science 371, 305–309 (2021).
Kinney, J. B., Murugan, A., Callan, C. G. & Cox, E. C. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc. Natl Acad. Sci. USA 107, 9158–9163 (2010).
Sharon, E. et al. Inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters. Nat. Biotechnol. 30, 521–530 (2012).
Field, A. & Adelman, K. Evaluating enhancer function and transcription. Annu. Rev. Biochem. 89, 213–234 (2020).
Hnisz, D. et al. Super-enhancers in the control of cell identity and disease. Cell 155, 934–947 (2013).
Flynn, R. A. et al. 7SK–BAF axis controls pervasive transcription at enhancers. Nat. Struct. Mol. Biol. 23, 231–238 (2016).
Preker, P. et al. RNA exosome depletion reveals transcription upstream of active human promoters. Science 322, 1851–1854 (2008).
Seila, A. C. et al. Divergent transcription from active promoters. Science 322, 1849–1851 (2008).
Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014).
Kwak, H., Fuda, N. J., Core, L. J. & Lis, J. T. Precise maps of RNA polymerase reveal how promoters direct initiation and pausing. Science 339, 950–953 (2013).
Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).
Krinner, S. et al. CpG domains downstream of TSSs promote high levels of gene expression. Nucleic Acids Res. 42, 3551–3564 (2014).
Noe Gonzalez, M., Blears, D. & Svejstrup, J. Q. Causes and consequences of RNA polymerase II stalling during transcript elongation. Nat. Rev. Mol. Cell Biol. 22, 3–21 (2021).
Zamft, B., Bintu, L., Ishibashi, T. & Bustamante, C. Nascent RNA structure modulates the transcriptional dynamics of RNA polymerases. Proc. Natl Acad. Sci. USA 109, 8948–8953 (2012).
Turowski, T. W. et al. Nascent transcript folding plays a major role in determining RNA polymerase elongation rates. Mol. Cell 79, 488–503 (2020).
Roberts, J. W. Mechanisms of bacterial transcription termination. J. Mol. Biol. 431, 4030–4039 (2019).
Mishra, S. & Maraia, R. J. RNA polymerase III subunits C37/53 modulate rU:dA hybrid 3′ end dynamics during transcription termination. Nucleic Acids Res. 47, 310–327 (2019).
Fouqueau, T. et al. The cutting edge of archaeal transcription. Emerg. Top. Life Sci. 2, 517–533 (2018).
Davidson, L., Francis, L., Eaton, J. D. & West, S. Integrator-dependent and allosteric/intrinsic mechanisms ensure efficient termination of snRNA transcription. Cell Rep. 33, 108319 (2020).
White, E., Kamieniarz-Gdula, K., Dye, M. J. & Proudfoot, N. J. AT-rich sequence elements promote nascent transcript cleavage leading to RNA polymerase II termination. Nucleic Acids Res. 41, 1797–1806 (2013).
Bailey, T. L. & Elkan, C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2, 28–36 (1994).
Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).
Levitt, N., Briggs, D., Gil, A. & Proudfoot, N. J. Definition of an efficient synthetic poly(A) site. Genes Dev. 3, 1019–1025 (1989).
Yeo, G. & Burge, C. B. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J. Comput. Biol. 11, 377–394 (2004).
Mordstein, C. et al. Codon usage and splicing jointly influence mRNA localization. Cell Syst. 10, 351–362.e8 (2020).
Elrod, N. D. et al. The integrator complex attenuates promoter-proximal transcription at protein-coding genes. Mol. Cell 76, 738–752 (2019).
Austenaa, L. M. I. et al. A first exon termination checkpoint preferentially suppresses extragenic transcription. Nat. Struct. Mol. Biol. 28, 337–346 (2021).
Estell, C., Davidson, L., Steketee, P. C., Monier, A. & West, S. ZC3H4 restricts non-coding transcription in human cells. eLife 10, e67305 (2021).
Rivera-Mulia, J. C. et al. Allele-specific control of replication timing and genome organization during development. Genome Res. 28, 800–811 (2018).
Williams, L. H. et al. Pausing of RNA polymerase II regulates mammalian developmental potential through control of signaling networks. Mol. Cell 58, 311–322 (2015).
Brinkman, E. K., Chen, T., Amendola, M. & van Steensel, B. Easy quantitative assessment of genome editing by sequence trace decomposition. Nucleic Acids Res. 42, e168–e168 (2014).
Reimer, K. A., Mimoso, C. A., Adelman, K. & Neugebauer, K. M. Co-transcriptional splicing regulates 3′ end cleavage during mammalian erythropoiesis. Mol. Cell 81, 998–1012.e7 (2021).
Henriques, T. et al. Stable pausing by RNA polymerase II provides an opportunity to target and integrate regulatory signals. Mol. Cell 52, 517–528 (2013).
Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 17, 10 (2011).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).
Korhonen, J. H., Palin, K., Taipale, J. & Ukkonen, E. Fast motif matching revisited: high-order PWMs, SNPs and indels. Bioinformatics 33, 514–521 (2016).
Georgiou, G. & van Heeringen, S. J. fluff: exploratory analysis and visualization of high-throughput sequencing data. PeerJ 4, e2209 (2016).
Buecker, C. et al. Reorganization of enhancer patterns in transition from naive to primed pluripotency. Cell Stem Cell 14, 838–853 (2014).
Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
Schwalb, B. et al. TT-seq maps the human transient transcriptome. Science 352, 1225–1228 (2016).
Acknowledgements
We thank K. Sasaki for her help in optimizing the run-on protocol for screening purposes and E. Kaye for discussions on library design. We thank S. Buratowski for useful discussions on the project, and D. Shlyueva and T. H. Jensen for feedback on the manuscript. We are also grateful to the Flow Cytometry Facility at the HMS Department of Immunology for cell sorting help and advice, the HMS Nascent Transcriptomics Core for PRO-seq library construction, and the HMS Biopolymers Facility and The Bauer Core Facility at Harvard University for next-generation sequencing. This research was supported by the European Molecular Biology Organization (ALTF 531-2017 to H. V.), the Human Frontier Science Program (LT000651/2018-L to H. V.), the National Institutes of Health (NIH R01 GM139960 to K. A.), startup funding from Harvard Medical School to K. A., the National Science Foundation Graduate Research Fellowship (DGE1745303 to C. A. M.) and the Canadian Institutes of Health Research (Banting fellowship to B. J. E. M.).
Author information
Contributions
H. V. and K. A. conceived the study and designed experiments. H. V. performed experiments and analyzed data. C. A. M. performed PRO-seq data analysis, helped generate intron-containing clonal cell lines, and optimized the run-on assay and knockdown conditions. B. J. E. M. and A. R. F. performed ChIP–seq and TT-seq experiments. K. A. supervised the study. H. V. and K. A. wrote the manuscript with input from all co-authors.
Ethics declarations
Competing interests
K. A. is a consultant for Syros Pharmaceuticals, is on the scientific advisory board of CAMP4 Therapeutics, and receives research funding from Novartis unrelated to this work. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Structural and Molecular Biology thanks Yongsheng Shi and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Carolina Perdigoto, in collaboration with the Nature Structural & Molecular Biology team. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Correlations between INSERT-seq experiments.
a, Spearman correlation coefficients between steady-state RNA and Sort-seq experiments, using all inserts for which data were obtained in each of the six experiments (n = 12,090). b, Sort-seq scores of inserts containing TSS-proximal and TSS-distal genomic regions of indicated RNA classes. Same groups as in Fig. 1e. Comparisons between proximal and distal regions by Kruskal-Wallis test, **** indicates P < 0.0001. c, Sort-seq scores of inserts containing TSS-proximal regions from typical enhancers (TE, n = 1,506) and super enhancers (SE; ref. 22; n = 600), compared by Mann–Whitney test. d, Correlation between steady-state RNA levels at the Oct4 uaRNA locus (average of 4 replicates) and 4930461G14Rik lincRNA locus (average of 3 replicates). Plotted are all inserts used for Fig. 1, as well as synthetic control sequences (Fig. 3), for which data were obtained at the lincRNA locus (n = 11,600).
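As an illustration of the comparison in panel a, the following minimal sketch restricts two experiments to the inserts measured in both and computes a Spearman coefficient as the Pearson correlation of tie-averaged ranks. It is not the authors' analysis code; the function names, the dictionary layout (insert ID mapped to score) and the toy values are assumptions made for this example.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def shared_inserts(*experiments):
    """Sorted insert IDs that have a score in every experiment."""
    common = set(experiments[0])
    for exp in experiments[1:]:
        common &= set(exp)
    return sorted(common)

# Hypothetical toy scores, keyed by a made-up insert ID.
steady_state = {"ins1": 0.2, "ins2": 1.5, "ins3": 0.9, "ins4": 2.0}
sort_seq = {"ins1": 0.1, "ins2": 1.1, "ins3": 1.3, "ins5": 0.4}

ids = shared_inserts(steady_state, sort_seq)
rho = spearman([steady_state[i] for i in ids], [sort_seq[i] for i in ids])
print(ids, round(rho, 3))
```

Restricting to shared inserts before ranking mirrors the caption's statement that only inserts with data in every experiment were used.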
Extended Data Fig. 2 EXOSC3 knockdown validation and correlation between nascent RNA and steady-state RNA results.
a, Immunoblot showing EXOSC3 protein level in control and siEXOSC3 conditions, harvested from the same experiment as the screen in Fig. 2a, b. b, RT-qPCR on steady-state RNA samples with which the screen was performed, showing levels of the EXOSC3 mRNA and the reporter transcript, just downstream of the library integration site, both internally normalized to TBP. Bars show mean, whiskers indicate standard deviation, n = 3 biologically independent experiments. c, Correlation between nascent RNA (average of 2 replicates) and steady-state RNA (average of 4 replicates) levels, showing all inserts used for Fig. 1, as well as synthetic control sequences (Fig. 3), n = 11,132. d, Chromatin-associated RNA (Chr-RNA) results with library at uaRNA locus. mRNAs n = 3,832, lincRNAs n = 339, uaRNAs n = 1,730, eRNAs n = 2,074, mRNA terminators n = 414. Neighbors were compared by Kruskal-Wallis test, **** indicates P < 0.0001, higher P values are indicated in the panel. e, Correlation between Chr-RNA (average of 2 replicates) and steady-state RNA (average of 4 replicates) levels, all inserts from panel c for which Chr-RNA data were obtained (n = 11,029).
Extended Data Fig. 3 GC content in genomic regions and its effect on expression.
a, Distribution of GC contents in inserts of the indicated classes included in the library. Open violins show TSS-proximal regions, patterned violins show TSS-distal regions. b,c, Nascent RNA abundance (b) and Sort-seq scores (c) of control sequences grouped by GC content percentage. n = 39, 281, 330, 292 and 117 for <41%, 41–50%, 51–60%, 61–70% and >70% GC, respectively. Neighbors were compared by Kruskal-Wallis test, **** indicates P < 0.0001, higher P values are indicated in the panel. d, Relation between the number of CpG dinucleotides in synthetic control sequences and their steady-state RNA levels (n = 1,059). The red line is the best linear fit through the data. Pearson r = 0.47, P < 0.0001. e, Metagene representations of PRO-seq signal around TSSs of uaRNAs (left) or eRNAs (right), grouped by GC content of the transcribed sequence from +6 to +179 downstream of the TSS (the region included in our screening library). Data shown are from endogenous genomic locations of sequences included in the INSERT-seq screen. Read counts were summed into 25-nt bins.
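The grouping in panels b and c and the CpG count in panel d can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names and example sequences are hypothetical, and the handling of non-integer percentages at group boundaries (for example, 50.5% GC) is an assumption, since the caption gives only the group labels.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def gc_group(seq):
    """Assign a sequence to one of the five GC-content groups in the caption.

    Assumption: boundaries are applied as <41, <51, <61, <=70, >70, so a
    non-integer value such as 50.5% falls into the 41-50 group.
    """
    pct = gc_percent(seq)
    if pct < 41:
        return "<41"
    if pct < 51:
        return "41-50"
    if pct < 61:
        return "51-60"
    if pct <= 70:
        return "61-70"
    return ">70"

def cpg_count(seq):
    """Number of CpG dinucleotides (a C immediately followed by a G)."""
    # "CG" cannot overlap itself, so str.count finds every occurrence.
    return seq.upper().count("CG")

# Hypothetical example sequences (not taken from the library).
examples = ["ATATATATAT", "ACGTACGTAC", "GCGCGCGCGC"]
summary = [(s, gc_group(s), cpg_count(s)) for s in examples]
for row in summary:
    print(row)
```

Counting CpG dinucleotides separately from overall GC content matters because two sequences with the same GC percentage can differ in how many CpGs they contain.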
Extended Data Fig. 4 Co-transcriptionally spliced introns boost transcription and protein expression.
Nascent RNA levels (left) and Sort-seq scores (right) of inserts containing wild-type introns (unbarcoded) grouped by splicing efficiency measured using the nascent RNA screen data. <3% spliced n = 76, 3-30% spliced n = 107, >30% spliced n = 198, significance tested by Kruskal-Wallis test. **** indicates P < 0.0001, higher P values are indicated in the figure.
Extended Data Fig. 5 Effects of splice site mutants and 5′SS insertion in INSERT-seq and clonal lines.
a, Nascent RNA levels (left) and Sort-seq scores (right) of intron-containing inserts with wild-type (wt) or mutant (m) splice sites. As in Fig. 5a, only introns whose wild-type version was >30% spliced in nascent RNA and whose mutants were <3% spliced are shown. 5′SS mutants n = 51, 3′SS mutants n = 23, WT n = 52, comparisons by Kruskal-Wallis test. The difference between 5′SS and 3′SS mutants was not significant in these analyses, but the pattern of the 3′SS mutants being more abundant on average was consistent with the steady-state RNA result (Fig. 5a). b, Steady-state RNA levels of intron-containing inserts with wild-type (+) and mutant (−) splice sites as in Fig. 5a, but showing only inserts that do not contain a PAS hexamer (any of the top-10 PASs in mouse; ref. 3). 5′SS mutants n = 19, 3′SS mutants n = 10, WT n = 20, comparisons by Kruskal-Wallis test. c, Characterization of all clonal cell lines shown in Fig. 5b, where versions of the 14th intron of the Smc1 gene with wild-type (+) or mutant (−) splice sites were integrated at the Oct4 uaRNA reporter locus. Top shows RT-PCR, bottom shows PCR on genomic DNA. All clonal lines show genomic integration of the same size in the genomic DNA, but only lines where the intron is flanked by two wild-type splice sites show evidence of splicing. Note that lanes should not be quantitatively compared to each other, as amounts of template material were not controlled. d, Density plot of GC-corrected steady-state RNA levels of unspliced TSS-proximal/distal uaRNA/eRNA regions grouped by the presence and strength (MaxEnt score; ref. 41) of a 5′SS motif (see Methods). None n = 2,632, medium (MaxEnt 5–10) n = 1,554, strong (MaxEnt 10+) n = 106. All groups are significantly different from each other (P < 0.0001) by Kruskal-Wallis test. e, Density plot of GC-corrected steady-state RNA levels of unspliced TSS-proximal mRNA regions (left) and TSS-proximal/distal uaRNA/eRNA regions (right) grouped by the number of 5′SS motifs (MaxEnt score >5). mRNAs: none n = 1,392, 1 n = 1,604, >1 n = 691. uaRNA/eRNAs: none n = 2,632, 1 n = 1,232, >1 n = 428, comparisons by Kruskal-Wallis test. f, Relative nascent RNA levels (left) and Sort-seq scores (right) of 10-nt annotated 5′SSs with a MaxEnt score of >5, embedded into several background sequences. Only unspliced inserts (<3% spliced in nascent RNA) were considered. Same groups as in Fig. 5d: scrambled (Scr, n = 24) and antisense (AS, n = 24) versions of 5′SSs were compared to sense (S) 5′SSs (n = 50) by Kruskal-Wallis test. In all panels, **** indicates P < 0.0001, higher P values are indicated in each plot.
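The filter described for panel b, which keeps only inserts lacking a PAS hexamer, can be sketched as a simple substring scan. The sketch below is illustrative and not the authors' code: it uses only the two canonical hexamers AATAAA and ATTAAA as a stand-in, whereas the full top-10 mouse hexamer list used in the paper is not reproduced here. The function names and example inserts are hypothetical.

```python
# Illustrative subset only: the two canonical PAS hexamers (DNA alphabet).
# The paper filters on a top-10 list for mouse, which is not reproduced here.
CANONICAL_PAS = ("AATAAA", "ATTAAA")

def find_pas(seq, hexamers=CANONICAL_PAS):
    """Return (position, hexamer) for every PAS hexamer match in seq.

    Positions are 0-based; RNA input (U) is converted to the DNA alphabet.
    """
    seq = seq.upper().replace("U", "T")
    hits = []
    for i in range(len(seq) - 5):
        window = seq[i:i + 6]
        if window in hexamers:
            hits.append((i, window))
    return hits

def pas_free(inserts, hexamers=CANONICAL_PAS):
    """Keep only the inserts (name -> sequence) without any PAS hexamer."""
    return {name: s for name, s in inserts.items() if not find_pas(s, hexamers)}

# Hypothetical example inserts.
inserts = {
    "with_aataaa": "GGCCAATAAAGGCC",
    "no_pas": "GGCCGGCCGGCC",
    "rna_with_auuaaa": "ccauuaaagg",
}
kept = pas_free(inserts)
print(find_pas(inserts["with_aataaa"]))
print(sorted(kept))
```

Passing a different tuple as `hexamers` would let the same scan be applied to a longer hexamer list.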
Supplementary information
Supplementary Tables 1–9
Contains library composition, all INSERT-seq data, and plasmids and primers used in this study
Source data
Source Data Fig. 1
Statistical Source Data
Source Data Fig. 2
Statistical Source Data
Source Data Fig. 3
Statistical Source Data
Source Data Fig. 4
Statistical Source Data
Source Data Fig. 5
Statistical Source Data
Source Data Extended Data Fig. 1
Statistical Source Data
Source Data Extended Data Fig. 2
Statistical Source Data
Source Data Extended Data Fig. 2
Unprocessed Western Blots
Source Data Extended Data Fig. 3
Statistical Source Data
Source Data Extended Data Fig. 4
Statistical Source Data
Source Data Extended Data Fig. 5
Statistical Source Data
Source Data Extended Data Fig. 5
Unprocessed gel image
About this article
Cite this article
Vlaming, H., Mimoso, C.A., Field, A.R. et al. Screening thousands of transcribed coding and non-coding regions reveals sequence determinants of RNA polymerase II elongation potential. Nat Struct Mol Biol 29, 613–620 (2022). https://doi.org/10.1038/s41594-022-00785-9