Transcriptional cofactors (COFs) communicate regulatory cues from enhancers to promoters and are central effectors of transcription activation and gene expression (ref. 1). Although some COFs have been shown to prefer certain promoter types (refs 2,3,4,5) over others (for example, see refs 6,7), the extent to which different COFs display intrinsic specificities for distinct promoters is unclear. Here we use a high-throughput promoter-activity assay in Drosophila melanogaster S2 cells to screen 23 COFs for their ability to activate 72,000 candidate core promoters (CPs). We observe differential activation of CPs, indicating distinct regulatory preferences or ‘compatibilities’ (refs 8,9) between COFs and specific types of CPs. These functionally distinct CP types are differentially enriched for known sequence elements (refs 2,4), such as the TATA box, downstream promoter element (DPE) or TCT motif, and display distinct chromatin properties at endogenous loci. Notably, the CP types differ in their relative abundance of H3K4me3 and H3K4me1 marks (see also refs 10,11,12), suggesting that these histone modifications might distinguish trans-regulatory factors rather than promoter- versus enhancer-type cis-regulatory elements. We confirm the existence of distinct COF–CP compatibilities in two additional Drosophila cell lines and in human cells, for which we find COFs that prefer TATA-box or CpG-island promoters, respectively. Distinct compatibilities between COFs and promoters can explain how different enhancers specifically activate distinct sets of genes (ref. 9), alternative promoters within the same genes, and distinct transcription start sites within the same promoter (ref. 13). Thus, COF–promoter compatibilities may underlie distinct transcriptional programs in species as divergent as flies and humans.
All raw sequencing and processed data generated in this study have been deposited in the NCBI Gene Expression Omnibus (GEO) under accession numbers GSE116197 (D. melanogaster data) and GSE126221 (human data). Previously published datasets reanalysed in this study are available in the GEO repository under the following accession numbers: GSE47691 (RNA-seq), GSE58955 (GRO-seq), GSE40739 (DHS-seq), GSE22119 (MNase-seq), GSE52029 (ChIP–seq for Tbp and Trf2), GSE97841 (ChIP-exo for TAF1 and M1BP), GSE39664 (ChIP–seq for DREF), GSE64464 (ChIP–seq for P300/CBP), GSE30820 (ChIP–seq for Fsh/Brd4), GSE37864 (ChIP–seq for Mof), GSE47263 (ChIP–seq for Chro), GSE41440 (ChIP–seq for Lpt, Pol II, H3K4me1 and H3K4me3), GSE81795 (ChIP–seq for Set1, Trr and Trx; RNA-seq upon Trx depletion), GSE81649 (PRO-seq upon P300/CBP inhibition), GSE43180 (RNA-seq upon Fsh/Brd4 depletion), GSE95025 (single-cell RNA-seq of D. melanogaster embryo). S2 cells CAGE and Chro ChIP–seq data are available from modENCODE (http://data.modencode.org/, sample ID: 5331 and 5068, respectively). The full sequences of plasmids used in this study are available at www.addgene.org. No restrictions on data availability apply.
All custom code used for data processing and computational analyses is available from the corresponding author upon request.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Zabidi, M. A. & Stark, A. Regulatory enhancer-core-promoter communication via transcription factors and cofactors. Trends Genet. 32, 801–814 (2016).
Ohler, U., Liao, G.-C., Niemann, H. & Rubin, G. M. Computational analysis of core promoters in the Drosophila genome. Genome Biol. 3, R87 (2002).
Rach, E. A., Yuan, H.-Y., Majoros, W. H., Tomancak, P. & Ohler, U. Motif composition, conservation and condition-specificity of single and alternative transcription start sites in the Drosophila genome. Genome Biol. 10, R73 (2009).
Parry, T. J. et al. The TCT motif, a key component of an RNA polymerase II transcription system for the translational machinery. Genes Dev. 24, 2013–2018 (2010).
Hoskins, R. A. et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 21, 182–192 (2011).
Hsu, J.-Y. et al. TBP, Mot1, and NC2 establish a regulatory circuit that controls DPE-dependent versus TATA-dependent transcription. Genes Dev. 22, 2353–2358 (2008).
Stampfel, G. et al. Transcriptional regulators form diverse groups with context-dependent regulatory functions. Nature 528, 147–151 (2015).
van Arensbergen, J., van Steensel, B. & Bussemaker, H. J. In search of the determinants of enhancer–promoter interaction specificity. Trends Cell Biol. 24, 695–702 (2014).
Zabidi, M. A. et al. Enhancer–core-promoter specificity separates developmental and housekeeping gene regulation. Nature 518, 556–559 (2015).
Rach, E. A. et al. Transcription initiation patterns indicate divergent strategies for gene regulation at the chromatin level. PLoS Genet. 7, e1001274 (2011).
Pérez-Lluch, S. et al. Absence of canonical marks of active chromatin in developmentally regulated genes. Nat. Genet. 47, 1158–1167 (2015).
Boija, A. et al. CBP regulates recruitment and release of promoter-proximal RNA polymerase II. Mol. Cell 68, 491–503 (2017).
Haberle, V. et al. Two independent transcription initiation codes overlap on vertebrate core promoters. Nature 507, 381–385 (2014).
Arnold, C. D. et al. Genome-wide assessment of sequence-intrinsic enhancer responsiveness at single-base-pair resolution. Nat. Biotechnol. 35, 136–144 (2017).
Chatterjee, S. & Struhl, K. Connecting a promoter-bound protein to TBP bypasses the need for a transcriptional activation domain. Nature 374, 820–822 (1995).
Ptashne, M. & Gann, A. Transcriptional activation by recruitment. Nature 386, 569–577 (1997).
Kockmann, T. et al. The BET protein FSH functionally interacts with ASH1 to orchestrate global gene activity in Drosophila. Genome Biol. 14, R18 (2013).
Rickels, R. et al. An evolutionary conserved epigenetic mark of Polycomb response elements implemented by Trx/MLL/COMPASS. Mol. Cell 63, 318–328 (2016).
Herz, H.-M. et al. Enhancer-associated H3K4 monomethylation by Trithorax-related, the Drosophila homolog of mammalian Mll3/Mll4. Genes Dev. 26, 2604–2620 (2012).
Straub, T., Zabel, A., Gilfillan, G. D., Feller, C. & Becker, P. B. Different chromatin interfaces of the Drosophila dosage compensation complex revealed by high-shear ChIP–seq. Genome Res. 23, 473–485 (2013).
Ho, J. W. K. et al. Comparative analysis of metazoan chromatin organization. Nature 512, 449–452 (2014).
Hochheimer, A. & Tjian, R. Diversified transcription initiation complexes expand promoter selectivity and tissue-specific gene expression. Genes Dev. 17, 1309–1320 (2003).
Burke, T. W. & Kadonaga, J. T. Drosophila TFIID binds to a conserved downstream basal promoter element that is present in many TATA-box-deficient promoters. Genes Dev. 10, 711–724 (1996).
Wang, Y.-L. et al. TRF2, but not TBP, mediates the transcription of ribosomal protein genes. Genes Dev. 28, 1550–1555 (2014).
Gurudatta, B. V., Yang, J., Van Bortle, K., Donlin-Asp, P. G. & Corces, V. G. Dynamic changes in the genomic localization of DNA replication-related element binding factor during the cell cycle. Cell Cycle 12, 1605–1615 (2013).
Baumann, D. G. & Gilmour, D. S. A sequence-specific core promoter-binding transcription factor recruits TRF2 to coordinately transcribe ribosomal protein genes. Nucleic Acids Res. 45, 10481–10491 (2017).
Karaiskos, N. et al. The Drosophila embryo at single-cell transcriptome resolution. Science 358, 194–199 (2017).
Gilchrist, D. A. et al. Pausing of RNA polymerase II disrupts DNA-specified nucleosome organization to enable precise gene regulation. Cell 143, 540–551 (2010).
Arnold, C. D. et al. Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science 339, 1074–1077 (2013).
Heintzman, N. D. et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat. Genet. 39, 311–318 (2007).
Herschlag, D. & Johnson, F. B. Synergism in transcriptional activation: a kinetic view. Genes Dev. 7, 173–179 (1993).
Adelman, K. & Lis, J. T. Promoter-proximal pausing of RNA polymerase II: emerging roles in metazoans. Nat. Rev. Genet. 13, 720–731 (2012).
Michel, M. & Cramer, P. Transitions for regulating early transcription. Cell 153, 943–944 (2013).
Muerdter, F. et al. Resolving systematic errors in widely used enhancer activity assays in human cells. Nat. Methods 15, 141–149 (2018).
Arnold, C. D. et al. Quantitative genome-wide enhancer activity maps for five Drosophila species show functional enhancer conservation and turnover during cis-regulatory evolution. Nat. Genet. 46, 685–692 (2014).
Andersen, P. R., Tirian, L., Vunjak, M. & Brennecke, J. A heterochromatin-dependent transcription machinery drives piRNA expression. Nature 549, 54–59 (2017).
Brown, J. B. et al. Diversity and dynamics of the Drosophila transcriptome. Nature 512, 393–399 (2014).
Batut, P., Dobin, A., Plessy, C., Carninci, P. & Gingeras, T. R. High-fidelity promoter profiling reveals widespread alternative promoter usage and transposon-driven developmental gene expression. Genome Res. 23, 169–180 (2013).
The FANTOM Consortium and the RIKEN PMI and CLST (DGT). A promoter-level mammalian expression atlas. Nature 507, 462–470 (2014).
Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014).
Jayaprakash, A. D., Jabado, O., Brown, B. D. & Sachidanandam, R. Identification and remediation of biases in the activity of RNA ligases in small-RNA deep sequencing. Nucleic Acids Res. 39, e141 (2011).
Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
Philip, P. et al. CBP binding outside of promoters and enhancers in Drosophila melanogaster. Epigenetics Chromatin 8, 48 (2015).
Shlyueva, D. et al. Hormone-responsive enhancer-activity maps reveal predictive motifs, indirect repression, and targeting of closed chromatin. Mol. Cell 54, 180–192 (2014).
Fuda, N. J. et al. GAGA factor maintains nucleosome-free regions and has a role in RNA polymerase II recruitment to promoters. PLoS Genet. 11, e1005108 (2015).
Trapnell, C., Pachter, L. & Salzberg, S. L. TopHat: discovering splice junctions with RNA-seq. Bioinformatics 25, 1105–1111 (2009).
Anders, S., Pyl, P. T. & Huber, W. HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169 (2015).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
FitzGerald, P. C., Sturgill, D., Shyakhtenko, A., Oliver, B. & Vinson, C. Comparative genomics of Drosophila and human core promoters. Genome Biol. 7, R53 (2006).
Falcon, S. & Gentleman, R. Using GOstats to test gene lists for GO term association. Bioinformatics 23, 257–258 (2007).
Gentleman, R. C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004).
R Core Team. R: A Language and Environment for Statistical Computing http://www.R-project.org/ (R Foundation for Statistical Computing, Vienna, Austria, 2013).
Kent, W. J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).
Barberis, A. et al. Contact with a component of the polymerase II holoenzyme suffices for gene activation. Cell 81, 359–368 (1995).
The authors thank C. Plaschka, L. Cochella, P. R. Andersen and Life Science Editors for comments on the manuscript; the IMP/IMBA Graphics Department for help with Fig. 4; J. Wysocka, T. Swigut and K. Dorighi (Stanford University), M. Seimiya and R. Paro (ETH Zürich), and P. R. Andersen and J. Brennecke (IMBA) for sharing MLL3, Trx and Trf2 cDNAs. Deep sequencing was performed at the Vienna Biocenter Core Facilities GmbH. V.H. is supported by the Human Frontier Science Program (grant no. LT000324/2016-L). Research in the Stark group is supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 647320) and by the Austrian Science Fund (FWF, P29613-B28 and F4303-B09). Basic research at the IMP is supported by Boehringer Ingelheim GmbH and the Austrian Research Promotion Agency (FFG).
Extended data figures and tables
Extended Data Fig. 1
a, List of the initial 13 D. melanogaster COFs used in this study (see Extended Data Fig. 7 for ten additional COFs). For each COF, relevant information about its function is shown (functional domain, enzymatic activity and protein complex) as well as the name of the respective mammalian homologue from the Ensembl database. b, CP candidates from the D. melanogaster genome were selected sequentially (in order of the white arrow) based on TSSs from datasets that map endogenous transcription initiation (CAGE (ref. 37) and RAMPAGE (ref. 38)), TSSs in reporter assays (STAP-seq; ref. 14), or FlyBase (v.5.57) and Ensembl (v.78) gene annotations (for each new dataset, only TSSs that were more than 10 bp away from TSSs already present in the selection were added). As negative controls, random positions without any evidence of initiation were selected. A total of 72,000 TSSs were used as reference points to design CP oligos encompassing 66 bp upstream and 66 bp downstream of the TSS. c, Overview of COF-recruitment STAP-seq (COF-STAP-seq), a high-throughput activator-bypass-like assay (refs 15,16,54) that we created by combining a plasmid-based high-throughput promoter-activity assay, self-transcribing active core promoter-sequencing (STAP-seq; ref. 14), with the GAL4-DBD-mediated recruitment of individual COFs (ref. 7). The D. melanogaster CP candidate library, pre-mixed with the D. pseudoobscura CP spike-in mix, was co-transfected with an expression plasmid for one of the GAL4-DBD–COF fusion proteins. If binding of a GAL4-DBD–COF to the 4xUAS array activates transcription from a candidate CP, this generates reporter RNAs with a short 5′ sequence tag, derived from the 3′ end of the corresponding CP. These reporter transcripts are captured with a 5′ RNA linker that includes a 10-nt-long UMI, allowing counting of individual reporter RNA molecules. In addition, the RNA linker contains a 4-nt sample barcode (BC), used for sample identification, enabling pooled processing of up to eight samples after linker ligation.
This is followed by selective reverse transcription, PCR amplification, deep sequencing and mapping of the 5′ sequence tags to quantify productive initiation events at single-base-pair resolution for all candidate CPs in the library and spike-in CPs.
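The sequential selection rule in panel b (add a TSS from a new dataset only if it lies more than 10 bp from every TSS already selected) and the 66-bp flanks of the CP oligos can be illustrated with a minimal Python sketch. The function names, the dataset labels and the choice to compare only TSSs on the same chromosome and strand are assumptions made for illustration, as is counting the TSS base itself in the oligo window; none of this is taken from the authors' code.

```python
from typing import Dict, List, Tuple

# A TSS is (chromosome, strand, 1-based position). Hypothetical representation.
Tss = Tuple[str, str, int]


def select_tss(datasets: List[Tuple[str, List[Tss]]],
               min_dist: int = 10) -> List[Tuple[Tss, str]]:
    """Go through datasets in priority order and keep a TSS only if it is
    more than `min_dist` bp away from every TSS selected from EARLIER
    datasets (assumption: same chromosome and strand only)."""
    selected: List[Tuple[Tss, str]] = []
    seen: Dict[Tuple[str, str], List[int]] = {}
    for name, tss_list in datasets:
        added: Dict[Tuple[str, str], List[int]] = {}
        for chrom, strand, pos in tss_list:
            previous = seen.get((chrom, strand), [])
            if all(abs(pos - p) > min_dist for p in previous):
                selected.append(((chrom, strand, pos), name))
                added.setdefault((chrom, strand), []).append(pos)
        # Positions become "already present" only once the dataset is done.
        for key, positions in added.items():
            seen.setdefault(key, []).extend(positions)
    return selected


def oligo_window(tss: Tss, flank: int = 66) -> Tuple[str, int, int, str]:
    """Return (chromosome, start, end, strand), 1-based inclusive, covering
    `flank` bp on each side of the TSS (assumption: the TSS base itself is
    included, giving 2 * flank + 1 bp)."""
    chrom, strand, pos = tss
    return chrom, pos - flank, pos + flank, strand


if __name__ == "__main__":
    example = [
        ("CAGE", [("chr2L", "+", 1000), ("chr2L", "+", 1005)]),
        ("annotation", [("chr2L", "+", 1008), ("chr2L", "+", 1020)]),
    ]
    for tss, source in select_tss(example):
        print(source, tss, oligo_window(tss))
```

In this toy input the annotated TSS at position 1008 is dropped because it lies within 10 bp of a CAGE-derived TSS, whereas the one at 1020 is kept.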
Extended Data Fig. 2 COF recruitment reproducibly activates transcription preferentially from annotated CP sequences.
a, Pairwise comparisons of normalized STAP-seq tag counts between three independent biological replicates per COF across all 72,000 tested CP candidates. The PCC is denoted for each comparison. b, Total unique STAP-seq tag counts for P65, GFP and the 13 COFs (left, raw counts; right, counts relative to spike-in). Bar heights, mean counts; error bars, s.d. n = 3 independent biological replicates for each COF. c, Distribution of normalized STAP-seq tag counts from all COFs at candidates grouped by different annotated genomic regions (FlyBase v.5.57). CP regions were defined as 100-bp regions from 50 bp upstream to 50 bp downstream of annotated gene TSSs, and ‘proximal promoter’ as regions up to 250 bp upstream of annotated gene TSSs. ‘Gene body’ includes both exons and introns, but excludes 5′ UTRs, which form a separate category. ‘Random negative regions’ represent candidates selected as negative controls (see Extended Data Fig. 1b) irrespective of their genomic location. n, number of independent CP candidates per box; boxes show median and interquartile range; dots are mean; whiskers indicate 5th and 95th percentiles. d, Genomic distribution of CP candidates (top; n = 72,000) and of unique STAP-seq tags; that is, transcripts initiated at CP candidates upon activation by any of the COFs (bottom; n = 41,069,770). Annotated gene CPs (red) are highly enriched for STAP-seq tags.
Extended Data Fig. 3
a, COF-STAP-seq signals (transcription initiation events) of each of the 13 COFs and the positive and negative controls (P65 and GFP, respectively) from CP candidates in the representative genomic locus (same as in Fig. 1b but showing all 13 COFs). Negative values denote transcription initiation on the antisense strand. b, Principal component analysis of STAP-seq tag count normalized to spike-ins for 30,936 CPs significantly activated above GFP by at least one COF (≥twofold enrichment over GFP and Student’s t-test FDR ≤ 0.06; see Methods) in three biological replicates per tested COF and controls. Scatter plot of projections onto the first two principal components (left) and the per cent of variance explained by each principal component (right) are shown. c, Hierarchical clustering of individual biological replicates per COF based on PCCs across 30,936 CPs activated by at least one COF. All biological replicates cluster closely together and reproduce the functional COF groups shown in Fig. 1c derived from merged replicates. Blue-to-red shading indicates the PCC for each comparison. d, Comparison of CP activation above GFP (induction) in STAP-seq (x axis) and luciferase (y axis) for 50 CPs tested with P65 and four different COFs. PCC indicated for each comparison.
Extended Data Fig. 4
a, Representative genomic locus showing differential COF-STAP-seq signals for recruitment of MED25, Lpt, Chro and Mof in three D. melanogaster cell lines. Each COF preferentially activates the same CPs in all three cell lines (S2, OSC and Kc167 cells), and these preferences differ between COFs. STAP-seq data is the merge of three independent biological replicates. b, Hierarchical clustering of P65 and six COFs tested in all three cell lines based on PCC of CP activation in each cell line. c, Activation of all 72,000 CP candidates by different COFs in the three cell lines. For each COF, the CPs are first sorted by activation in S2 cells and then the activation in OSC and Kc167 cells is displayed in the same order. PCCs (right) were calculated by comparing OSC or Kc167 with S2 cells, respectively. d, COF-STAP-seq activation of 50 CPs selected for luciferase assays in S2 cells (see Fig. 1d) by different COFs and P65 in the three cell lines (subset of c). Differential activation of CPs by each COF is consistent across all cell lines. e, Pairwise comparison of CP activation by different COFs above GFP (induction) in OSC versus S2 cells (top row) and Kc167 versus S2 cells (bottom row) for all 72,000 CP candidates.
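The comparison in panel c (sort CPs by activation in S2 cells, display the other cell lines in the same order, and report a PCC against S2) can be sketched in plain Python as follows. The function names are hypothetical, and sorting in decreasing order is an assumption, since the legend does not state the direction.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient (PCC) of two equally long vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def reorder_by_reference(
    reference: Sequence[float], others: Dict[str, Sequence[float]]
) -> Tuple[List[int], Dict[str, List[float]]]:
    """Sort CP indices by activation in the reference cell line (decreasing;
    an assumption) and apply the same order to every other cell line."""
    order = sorted(range(len(reference)), key=lambda i: reference[i], reverse=True)
    return order, {name: [values[i] for i in order] for name, values in others.items()}


if __name__ == "__main__":
    # Toy activation values for four CPs; not real data.
    s2 = [0.5, 3.0, 1.2, 2.1]
    other = {"OSC": [0.4, 2.5, 1.0, 2.4], "Kc167": [0.7, 2.8, 1.5, 1.9]}
    order, reordered = reorder_by_reference(s2, other)
    print("CP order:", order)
    for cell_line, values in other.items():
        print(cell_line, "PCC vs S2:", round(pearson(s2, values), 3))
```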
Extended Data Fig. 5 COFs preferentially activate CPs of their endogenously bound and regulated target genes.
a–e, Binding of Trr (ref. 18; a), Lpt (ref. 19; b), Mof (ref. 20; c) and Trx (ref. 18; e) in S2 cells and Chro in D. melanogaster embryos (ref. 21; d) to 5,933 CPs active in COF-STAP-seq and endogenously in S2 cells (as in Fig. 1e but for additional COFs). Per COF, CPs are sorted by STAP-seq activation (left) and ChIP–seq coverage is shown in heat maps and box plots (−150 to +50-bp window around the TSS; n = 297 independent CPs per box; box shading, mean STAP-seq tag count; boxes show median and interquartile range; whiskers indicate 5th and 95th percentiles; one-sided Wilcoxon rank-sum test; all ChIP–seq data from previous publications; see Supplementary Table 1 for details and references). For all COFs, the most strongly activated CPs in COF-STAP-seq are significantly more strongly bound by the respective COF in their endogenous genomic context compared to CPs that are activated weakly (note that even though this also holds for Lpt, the trend for Lpt starts only after the most strongly activated CPs (first two bins), which are less strongly bound than expected). f, Expression fold change upon Trx depletion by RNAi for genes associated with top and bottom 25% CPs by activation with Trx (RNA-seq data from ref. 18; see also Supplementary Table 1). Only CPs associated with genes that are active in S2 cells and activated in COF-STAP-seq by at least one COF are included. g, STAP-seq tag count for CPs of genes downregulated upon Trx depletion by RNAi versus CPs of all other genes expressed in S2 cells and activated by at least one COF (RNA-seq data from ref. 18; n, number of independent CPs; boxes show median and interquartile range; whiskers indicate 5th and 95th percentiles; one-sided Wilcoxon rank-sum test).
Extended Data Fig. 6
a, Spike-in normalized COF-STAP-seq tag counts (left heat map) for 30,936 CP candidates (columns) clustered based on their preferential activation by different COFs (rows). These tag counts were transformed for each CP separately into Z-scores (right heat map) to highlight the differential activation by different COFs independently of the overall activity of the CP. We then used these Z-score-transformed values to cluster the CPs into five groups of respectively similar activation profiles across all COFs irrespective of absolute activation levels using k-means clustering (the CPs in both heat maps are organized identically according to these groups, see coloured bar on top). The line plot on the left shows the average spike-in normalized COF-STAP-seq tag count across all CPs of each group for each of the 13 COFs and the two controls. b, Per cent of variance in the data explained by clustering CPs into different number of clusters with k-means (k ranging from 1 to 10). Increasing the number of clusters beyond five is of little benefit in explaining the variance in the data. c, Gain of per cent variance explained by increasing the number of clusters in steps of one from three to six. d, Distribution of sum of squared distances to centroids of the clusters for number of clusters ranging from one to ten, using a fivefold cross-validation approach. The data was binned randomly into five equally sized bins, one bin was left aside as a test set and clustering was performed on the remaining four bins. Sum of squared distances to the nearest centroid for each data point in the test set was then calculated. The procedure was repeated for each number of clusters (k). Increasing the number of clusters beyond five does not lead to substantially more coherent or dense clusters. For each box, n = 30,936 independent CPs. 
e–g, Clustering of 30,936 CPs (columns) based on their preferential activation by different COFs (rows) as in a, but using data for only one replicate as indicated. k-means clustering (k = 5) for each individual replicate reproduces qualitatively the same groups obtained with the merged replicates (see a). h, Agreement between assignment of CPs to groups in individual replicates and in the pooled data (left). In each replicate, around 85% of CPs are assigned to the same group as in the assignment based on pooled replicates. Bar plot, number of replicates that reproduce group assignment for individual CPs is shown on the right. For around 94% of CPs, the group assignment is reproduced in at least two replicates. i, Pairwise distances in CP response to six COFs and two controls for CPs belonging to the same (intra-) or different (inter-) clusters (defined in S2 cells) in all three D. melanogaster cell lines. n = 115,508,123 and 362,994,457 independent CP pairs for intra and inter-cluster boxes, respectively. *P ≤ 0.01; one-sided Wilcoxon rank-sum test. j, Induction (activation above GFP) of CPs (five groups defined in S2 cells; see a) by P65 and six COFs in S2 (top), OSC (middle) and Kc167 (bottom) cells. Each of the six COFs preferentially activates the same CP groups in all three cell lines; that is, COF–CP preferences appear to be cell-type independent. n = 5,723, 11,538, 3,203, 5,038 and 5,434 CPs, for groups 1 to 5, respectively. In d, i, j, boxes show median and interquartile range; whiskers indicate 5th and 95th percentiles.
Extended Data Fig. 7
a, List of ten additionally tested D. melanogaster COFs. For each COF, relevant information about its function is shown (functional domain, enzymatic activity and protein complex) as well as the name of the respective mammalian homologue. b, Total COF-STAP-seq tag counts relative to spike-in for GFP (negative control) and the ten COFs. Bar heights, mean counts; error bars, s.d.; n = 3 independent biological replicates per COF. c, Per cent of variance in the data explained by clustering CPs into different numbers of clusters with k-means (k ranging from 1 to 10) using the original dataset containing 13 COFs, P65 and GFP (as in Extended Data Fig. 6b; blue) or the extended dataset with ten additional COFs (23 total; red). The curves are highly similar for both datasets; that is, the same number of clusters explains the same amount of variance in both the original and the extended dataset. d, As in Extended Data Fig. 6a but for the extended dataset of 23 COFs: spike-in normalized STAP-seq tag counts (left heat map) for 30,936 CPs (columns) clustered based on their preferential activation by 23 different COFs and two controls (rows). Tag counts were transformed into Z-scores (right heat map), which were used to cluster CPs into five clusters with k-means. For comparison, groups defined on the dataset containing 13 COFs and two controls (Extended Data Fig. 6a) are shown in the top row and groups defined with this extended dataset are shown below. e, Correlation between each of the six activating COFs in the extended dataset and the 13 COFs of the original dataset. *PCC ≥ 0.9.
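The per-CP Z-score transformation followed by k-means clustering, and the per cent of variance explained for a given k, can be sketched in plain Python as below. This is an illustration only: the function names are hypothetical, the Z-scores use the population standard deviation, and the farthest-first initialization is a choice made here to keep the example deterministic; the legends do not specify the initialization or the software that was used.

```python
import math
from typing import List, Sequence, Tuple


def zscore_rows(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Z-score each CP (row) across COFs (columns), so that clustering sees
    the activation profile rather than the overall activity of the CP."""
    out = []
    for row in matrix:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((v - mean) ** 2 for v in row) / len(row))
        out.append([(v - mean) / sd if sd > 0 else 0.0 for v in row])
    return out


def _sqdist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Sequence[float]], k: int,
           n_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Lloyd's k-means with farthest-first initialization (an assumption)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(_sqdist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [-1] * len(points)
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: _sqdist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def variance_explained(points, labels, centroids) -> float:
    """Per cent of total variance explained by the cluster assignment."""
    grand = [sum(col) / len(points) for col in zip(*points)]
    total = sum(_sqdist(p, grand) for p in points)
    within = sum(_sqdist(p, centroids[lab]) for p, lab in zip(points, labels))
    return 100.0 * (1.0 - within / total) if total > 0 else 0.0


if __name__ == "__main__":
    # Toy tag counts: rows are CPs, columns are three hypothetical COFs.
    counts = [[10, 1, 1], [100, 10, 10], [1, 10, 1], [5, 50, 5]]
    z = zscore_rows(counts)
    labels, centroids = kmeans(z, k=2)
    print(labels, round(variance_explained(z, labels, centroids), 1))
```

In the toy input, the first two CPs differ tenfold in overall activity but share the same profile, so they fall into the same cluster after Z-scoring. Running `variance_explained` for a range of k values gives the kind of curve described in panel c.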
Extended Data Fig. 8 CPs activated by distinct COFs discriminate between housekeeping and developmental gene regulation.
a, Expression variability between around 8,000 single cells of a stage 6 D. melanogaster embryo for genes associated with each of the five different CP groups (single-cell RNA-seq data from ref. 27). b, GO term enrichment analysis (GOStats R/Bioconductor package v.2.34.0) for genes associated with the five different CP groups. c, d, Activation of 72,000 CP candidates by a developmental (dev; from the gene zfh1) and a housekeeping (hk; from the gene ssp3) enhancer (enhancers and enhancer-less control obtained from refs 9,14). CPs are grouped into five groups as in Extended Data Fig. 6a. The enhancer-less control reflects the basal activity of the CPs. Group 3 CPs have the highest basal activity but are further activated by the hk enhancer. n = 5,723, 11,538, 3,203, 5,038 and 5,434 independent CPs, for groups 1 to 5, respectively; boxes show median and interquartile range; whiskers indicate 5th and 95th percentiles. e, f, Transcription-factor motif enrichment analysis in the sequence 500 bp upstream of the TSS (e) or within the nearest developmental or housekeeping enhancer (from ref. 9; f) for the five CP groups. n = 5,723, 11,538, 3,203, 5,038 and 5,434 independent CPs, for groups 1 to 5, respectively. NS, not significant (two-sided Fisher’s exact test; P-values corrected for multiple testing by Benjamini–Hochberg procedure; FDR > 0.01).
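The statistics named for panels e and f (two-sided Fisher's exact test with Benjamini–Hochberg correction) can be written out in a few lines of standard-library Python. This is a sketch for illustration, not the authors' implementation; the two-sided P value follows the common convention of summing the probabilities of all tables that are no more likely than the observed one, and the 2 × 2 layout in the docstring is an assumed example.

```python
from math import comb
from typing import List, Sequence


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]],
    for example: rows = CPs in the group / all other CPs,
    columns = motif present / motif absent (assumed layout)."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x: int) -> float:
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest P value downwards
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Toy counts for three hypothetical motifs; not real data.
    tables = [(30, 70, 10, 90), (12, 88, 11, 89), (3, 1, 1, 3)]
    raw = [fisher_exact_two_sided(*t) for t in tables]
    for t, p, q in zip(tables, raw, benjamini_hochberg(raw)):
        print(t, f"P = {p:.3g}", f"FDR = {q:.3g}")
```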
Extended Data Fig. 9 CPs activated preferentially by distinct COFs differ in their sequence and in endogenous chromatin features.
a, Occurrence of specific dinucleotides (see label in each heat map) relative to TSSs for CPs of the five groups defined in Extended Data Fig. 6a. Within each group, CPs are sorted decreasingly by the COF-STAP-seq tag count of the respective strongest COFs (denoted on the left). Darker shade reflects higher density of the respective dinucleotides at specific positions. b, c, Examples of genomic loci with CPs active in S2 cells that are differentially activated by COFs in STAP-seq. All supporting data tracks are from S2 cells and reanalysed from previous publications (see Supplementary Table 1 for details and references). b, CPs of KLHL18 and Spt3 (group 3), and GCC185 and DCAF12 (group 4), are preferentially activated by Mof and Chro, respectively, and have high levels of H3K4me3 downstream of their TSSs. By contrast, the CP of Ect3 (group 1) is preferentially activated by P300 and has high levels of H3K4me1 both upstream and downstream of the TSS but almost no H3K4me3, although Ect3 is expressed and the CP is endogenously active in S2 cells. c, CPs of CkIIalpha-i3 (group 4) and CG13896 (group 3) are preferentially activated by Chro and Mof, respectively, and both bear high levels of H3K4me3 and low levels of H3K4me1 downstream of the TSS. By contrast, the CP of CG13895 (group 1) is preferentially activated by P300 and is marked by higher levels of H3K4me1, but lower levels of H3K4me3, although the gene is expressed in S2 cells. d, Average H3K4me1 ChIP–seq coverage in the 500-bp window upstream (left) and 500-bp window downstream (right) of the TSS for five groups of CPs active in S2 cells (as in Fig. 3b). n = 646, 363, 1,842, 1,885 and 179 CPs, for groups 1 to 5, respectively. e, Heat maps showing endogenous expression (as measured by RNA-seq (left) and GRO-seq (right)) of genes associated with CPs active in S2 cells from the five CP groups (RNA-seq and GRO-seq data from refs 44,45; see Supplementary Table 1 for details and references). 
Within each group, CPs are sorted decreasingly by STAP-seq of the respective strongest COFs (denoted on the left). f, Gene expression for genes associated with five groups of CPs as in e but shown as box plots. n = 646, 363, 1,842, 1,885 and 179 CPs, for groups 1 to 5, respectively. In d, f, boxes show median and interquartile range; whiskers indicate 5th and 95th percentiles. g, Example of differentially activated alternative promoters. h, Example of differentially activated closely spaced TSSs (g, h, merge of three independent biological replicates).
Extended Data Fig. 10
a, Total unique STAP-seq tag counts relative to spike-in for P65, GFP and five human COFs from COF-STAP-seq in human HCT116 cells. Bar heights, mean counts; error bars, s.d.; n = 3 independent biological replicates for each COF. b, COF-STAP-seq signals (transcription initiation) activated by P65, and the five human COFs for the CPs of MMP1 (TATA-box promoter; left) and CIZ1 (CpG-island promoter; right; STAP-seq data: merge of three independent biological replicates). c, Hierarchical clustering of independent biological replicates for all tested human COFs based on PCCs across 12,000 human CP candidates. d, Occurrence of different dinucleotides (TA, AT, AA, CG and GC) around TSSs in CPs sorted by the ratio between COF-STAP-seq signals with MED15 and MLL3, for 9,607 CPs activated by either COF.
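The analysis in panel d (dinucleotide occurrence by position in CPs sorted by the ratio of two COF signals) can be sketched as follows. The function names are hypothetical, and the log2 ratio with a pseudocount of 1 is an assumption made here to avoid division by zero; the legend does not state how the ratio was computed.

```python
import math
from typing import List, Sequence


def dinucleotide_matrix(seqs: Sequence[str], dinuc: str) -> List[List[int]]:
    """For each CP sequence, mark every position at which `dinuc` starts
    (1 = present, 0 = absent). Rows are CPs, columns are positions."""
    dinuc = dinuc.upper()
    return [[1 if s[i:i + 2].upper() == dinuc else 0 for i in range(len(s) - 1)]
            for s in seqs]


def sort_by_log_ratio(names: Sequence[str], signal_a: Sequence[float],
                      signal_b: Sequence[float],
                      pseudocount: float = 1.0) -> List[str]:
    """Sort CP names by log2((a + pseudocount) / (b + pseudocount)),
    largest first. The pseudocount is an illustrative assumption."""
    ratio = {
        n: math.log2((a + pseudocount) / (b + pseudocount))
        for n, a, b in zip(names, signal_a, signal_b)
    }
    return sorted(names, key=lambda n: ratio[n], reverse=True)


if __name__ == "__main__":
    # Toy sequences and signals; not real CPs.
    seqs = {"cp1": "TATAAA", "cp2": "GCGCGC", "cp3": "ATCGTA"}
    order = sort_by_log_ratio(list(seqs), [10, 1, 5], [1, 10, 5])
    for name, row in zip(order, dinucleotide_matrix([seqs[n] for n in order], "TA")):
        print(name, row)
```

Stacking the rows in the sorted order gives a matrix that can be drawn as a heat map, one per dinucleotide.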
List of all previously published datasets reanalysed in this study, with respective references, GEO and SRA accessions and mapping statistics.
Sequences of Drosophila pseudoobscura enhancers and promoters used as drivers for cofactor expression and for expression of spike-in core promoters, with respective primers used to amplify them. Sequence of the 4xUAS array and the gBlock used for cloning pSTAP-seq_human-4xUAS.
Sequences of primers used to clone the human COFs and cDNA sequences of the human BRD4, EMSY, EP300, MED15 and MLL3 COFs.
Sequences of primers used in the COF-STAP-seq pipeline, including library cloning primers, nested PCR and sequencing-ready PCR primers, and 5′ RNA linkers.
Drosophila library. Table of all 72,000 Drosophila melanogaster CP candidates included in the STAP-seq library, with genomic coordinates, dataset supporting the choice and oligo sequence for each candidate.
Table of all 12,000 human CP candidates included in the STAP-seq library, with genomic coordinates, dataset supporting the choice and oligo sequence for each candidate.
Sequences of Mus musculus promoters used as spike-in core promoters, with respective genomic coordinates, full DNA sequence, primers used to amplify them, and concentrations of individual spike-in plasmids used for creating the spike-in mix co-transfected in STAP-seq. Sequence of the gBlock used for cloning pSTAP-seq_human_spike-in.
Sequences of core promoters and primers for 50 core promoter candidates selected for validation in luciferase assay.
Summary of total sequenced reads, mapped reads and unique STAP-seq tags (after collapsing by UMI) for 78 independent COF-STAP-seq and 4 enhancer STAP-seq datasets in S2 cells, 24 COF-STAP-seq datasets in OSC cells, 24 COF-STAP-seq datasets in Kc167 cells and 21 COF-STAP-seq datasets in human HCT116 cells. Counts mapping to the reference CP candidate library (Drosophila melanogaster or human) and to spike-in CPs (Drosophila pseudoobscura or Mus musculus) are reported.
Unique STAP-seq tag counts mapping to each of the 9 Drosophila pseudoobscura (for fly samples) or Mus musculus (for human samples) spike-in CPs, along with the calculated normalization factors used to scale down each of the independent COF STAP-seq datasets within a single batch.
List of 30,936 CP candidates activated significantly above GFP by at least one cofactor (COF) with normalized tag counts per COF (averaged across the 3 biological replicates).
List of 12,000 CP candidates with normalized tag counts per COF (averaged across the 3 biological replicates).
Non-overlapping subset of activated fly CPs such that only a single oligo per promoter region is kept (the one with the highest overall activity), with normalized tag counts per COF.