Identification of the human DPR core promoter element using machine learning

Vo ngoc, Long; Huang, Cassidy Yunjing; Cassidy, California Jack; Medrano, Claudia; Kadonaga, James T.

doi:10.1038/s41586-020-2689-7

Article
Published: 09 September 2020

Long Vo ngoc¹, Cassidy Yunjing Huang¹, California Jack Cassidy¹, Claudia Medrano¹ & James T. Kadonaga¹ (ORCID: orcid.org/0000-0002-2075-9458)

Nature volume 585, pages 459–463 (2020)

Abstract

The RNA polymerase II (Pol II) core promoter is the strategic site of convergence of the signals that lead to the initiation of DNA transcription¹⁻⁵, but the downstream core promoter in humans has been difficult to understand¹⁻³. Here we analyse the human Pol II core promoter and use machine learning to generate predictive models for the downstream core promoter region (DPR) and the TATA box. We developed a method termed HARPE (high-throughput analysis of randomized promoter elements) to create hundreds of thousands of DPR (or TATA box) variants, each with known transcriptional strength. We then analysed the HARPE data by support vector regression (SVR) to provide comprehensive models for the sequence motifs, and found that the SVR-based approach is more effective than a consensus-based method for predicting transcriptional activity. These results show that the DPR is a functionally important core promoter element that is widely used in human promoters. Notably, there appears to be a duality between the DPR and the TATA box, as many promoters contain one or the other element. More broadly, these findings show that functional DNA motifs can be identified by machine learning analysis of a comprehensive set of sequence variants.

**Fig. 1: HARPE comprehensively assesses the transcriptional effect of many different DNA sequences in a specific region of the promoter.**

**Fig. 2: HARPE yields consistent data under different conditions.**

**Fig. 3: Machine learning analysis of the HARPE data yields an SVR model for the DPR.**

Data availability

The HARPE data are available from Gene Expression Omnibus (GEO; accession number, GSE139635). We obtained 5′-GRO-seq files (GSE63872³³ and GSE90035¹²) and GRO-cap files (GSM1480321)³⁷ from the Gene Expression Omnibus website (https://www.ncbi.nlm.nih.gov/geo/). Source data are provided with this paper.

Code availability

All computational analyses were performed by using R version 3.6.1 and previously described packages, as noted in the Methods.

References

1. Sandelin, A. et al. Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nat. Rev. Genet. 8, 424–436 (2007).
2. Vo ngoc, L., Wang, Y.-L., Kassavetis, G. A. & Kadonaga, J. T. The punctilious RNA polymerase II core promoter. Genes Dev. 31, 1289–1301 (2017).
3. Haberle, V. & Stark, A. Eukaryotic core promoters and the functional basis of transcription initiation. Nat. Rev. Mol. Cell Biol. 19, 621–637 (2018).
4. Meylan, P., Dreos, R., Ambrosini, G., Groux, R. & Bucher, P. EPD in 2020: enhanced data visualization and extension to ncRNA promoters. Nucleic Acids Res. 48 (D1), D65–D69 (2020).
5. Roeder, R. G. 50+ years of eukaryotic transcription: an expanding universe of factors and mechanisms. Nat. Struct. Mol. Biol. 26, 783–791 (2019).
6. Butler, J. E. & Kadonaga, J. T. Enhancer-promoter specificity mediated by DPE or TATA core promoter motifs. Genes Dev. 15, 2515–2519 (2001).
7. Juven-Gershon, T., Hsu, J. Y. & Kadonaga, J. T. Caudal, a key developmental regulator, is a DPE-specific transcriptional factor. Genes Dev. 22, 2823–2830 (2008).
8. Zabidi, M. A. et al. Enhancer-core-promoter specificity separates developmental and housekeeping gene regulation. Nature 518, 556–559 (2015).
9. Parry, T. J. et al. The TCT motif, a key component of an RNA polymerase II transcription system for the translational machinery. Genes Dev. 24, 2013–2018 (2010).
10. Wang, Y. L. et al. TRF2, but not TBP, mediates the transcription of ribosomal protein genes. Genes Dev. 28, 1550–1555 (2014).
11. Duttke, S. H. C., Doolittle, R. F., Wang, Y.-L. & Kadonaga, J. T. TRF2 and the evolution of the bilateria. Genes Dev. 28, 2071–2076 (2014).
12. Vo Ngoc, L., Cassidy, C. J., Huang, C. Y., Duttke, S. H. & Kadonaga, J. T. The human initiator is a distinct and abundant element that is precisely positioned in focused core promoters. Genes Dev. 31, 6–11 (2017).
13. Burke, T. W. & Kadonaga, J. T. Drosophila TFIID binds to a conserved downstream basal promoter element that is present in many TATA-box-deficient promoters. Genes Dev. 10, 711–724 (1996).
14. Kutach, A. K. & Kadonaga, J. T. The downstream promoter element DPE appears to be as widely used as the TATA box in Drosophila core promoters. Mol. Cell. Biol. 20, 4754–4764 (2000).
15. Lim, C. Y. et al. The MTE, a new core promoter element for transcription by RNA polymerase II. Genes Dev. 18, 1606–1617 (2004).
16. Theisen, J. W. M., Lim, C. Y. & Kadonaga, J. T. Three key subregions contribute to the function of the downstream RNA polymerase II core promoter. Mol. Cell. Biol. 30, 3471–3479 (2010).
17. Burke, T. W. & Kadonaga, J. T. The downstream core promoter element, DPE, is conserved from Drosophila to humans and is recognized by TAFII60 of Drosophila. Genes Dev. 11, 3020–3031 (1997).
18. Louder, R. K. et al. Structure of promoter-bound TFIID and model of human pre-initiation complex assembly. Nature 531, 604–609 (2016).
19. Patel, A. B. et al. Structure of human TFIID and mechanism of TBP loading onto promoter DNA. Science 362, eaau8872 (2018).
20. Patwardhan, R. P. et al. High-resolution analysis of DNA regulatory elements by synthetic saturation mutagenesis. Nat. Biotechnol. 27, 1173–1175 (2009).
21. Lubliner, S. et al. Core promoter sequence in yeast is a major determinant of expression level. Genome Res. 25, 1008–1017 (2015).
22. Arnold, C. D. et al. Genome-wide assessment of sequence-intrinsic enhancer responsiveness at single-base-pair resolution. Nat. Biotechnol. 35, 136–144 (2017).
23. van Arensbergen, J. et al. Genome-wide mapping of autonomous promoter activity in human cells. Nat. Biotechnol. 35, 145–153 (2017).
24. Weingarten-Gabbay, S. et al. Systematic interrogation of human promoters. Genome Res. 29, 171–183 (2019).
25. Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).
26. Juven-Gershon, T., Cheng, S. & Kadonaga, J. T. Rational design of a super core promoter that enhances gene expression. Nat. Methods 3, 917–922 (2006).
27. Cortes, C. & Vapnik, V. Support-vector networks. Mach. Learn. 20, 273–297 (1995).
28. Vapnik, V. N. The Nature of Statistical Learning Theory (Springer, 1995).
29. Willy, P. J., Kobayashi, R. & Kadonaga, J. T. A basal transcription factor that activates or represses transcription. Science 290, 982–985 (2000).
30. Hsu, J. Y. et al. TBP, Mot1, and NC2 establish a regulatory circuit that controls DPE-dependent versus TATA-dependent transcription. Genes Dev. 22, 2353–2358 (2008).
31. Chen, K. et al. A global change in RNA polymerase II pausing during the Drosophila midblastula transition. eLife 2, e00861 (2013).
32. Kedmi, A. et al. Drosophila TRF2 is a preferential core promoter regulator. Genes Dev. 28, 2163–2174 (2014).
33. Duttke, S. H. C. et al. Human promoters are intrinsically directional. Mol. Cell 57, 674–684 (2015).
34. Dignam, J. D., Lebovitz, R. M. & Roeder, R. G. Accurate transcription initiation by RNA polymerase II in a soluble extract from isolated mammalian nuclei. Nucleic Acids Res. 11, 1475–1489 (1983).
35. Crooks, G. E., Hon, G., Chandonia, J. M. & Brenner, S. E. WebLogo: a sequence logo generator. Genome Res. 14, 1188–1190 (2004).
36. Schneider, T. D. & Stephens, R. M. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 18, 6097–6100 (1990).
37. Core, L. J. et al. Analysis of nascent RNA identifies a unified architecture of initiation regions at mammalian promoters and enhancers. Nat. Genet. 46, 1311–1320 (2014).

Acknowledgements

We thank E. P. Geiduschek, T. Juven-Gershon, G. Kassavetis, B. Delatte, J. Fei, G. Cruz-Becerra, and S. Chen for critical reading of the manuscript; J. van Arensbergen and B. van Steensel for the SuRE plasmid and protocols; B. Grant and C. Benner for advice; A. Rao for the HeLa cells; and the DNA sequencing facility at the Moores Cancer Center at UCSD (supported by NIH grant P30 CA023100 and NIH SIG grant S10 OD026929). L.V.n. received a UCSD Molecular Biology Cancer Fellowship. J.T.K. is the Amylin Chair in the Life Sciences. This work was supported by funding from NIH/NIGMS (R35 GM118060) to J.T.K.

Author information

Authors and Affiliations

Section of Molecular Biology, University of California, San Diego, La Jolla, CA, USA
Long Vo ngoc, Cassidy Yunjing Huang, California Jack Cassidy, Claudia Medrano & James T. Kadonaga

Contributions

L.V.n., C.Y.H. and J.T.K. oversaw the overall design and execution of the project. The experiments were performed mostly by L.V.n. and C.Y.H. The analysis of the natural promoters was carried out by C.M. The computational analyses were performed by L.V.n., C.J.C. and C.Y.H. L.V.n. and J.T.K. were primarily responsible for writing the manuscript.

Corresponding author

Correspondence to James T. Kadonaga.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Design and initial characterization of the HARPE assay.

a, RNA polymerase II core promoter elements that were examined in this study. This diagram shows the positions of the TATA box, initiator (Inr), motif ten element (MTE), downstream core promoter element (DPE), and downstream core promoter region (DPR) relative to the A+1 nucleotide in the Inr consensus sequence. The Inr and MTE function together with a strict spacing requirement between the two motifs. The Inr and DPE similarly act together with a strict spacing requirement between the motifs. The figure is drawn roughly to scale. The sequences that were randomized in the HARPE experiments are also indicated. b, c, Preparation of the HARPE library. b, HARPE constructs have two GC-boxes (Sp1 binding sites) upstream of the core promoter. The core promoters used in this study (SCP1m and IRF1) are TATA-less (mTATA = mutant TATA box), initiator (Inr)-containing promoters. An RNA polymerase III (Pol III) terminator prevents transcription by Pol III. The open reading frame of green fluorescent protein (ORF) and the polyadenylation signal (PAS) promote the synthesis of mature and stable transcripts. For the study of the DPR, the randomized region is from +17 to +35 relative to the +1 TSS. c, The fragments containing randomized elements are produced by annealing oligonucleotides that give protruding ends matching the KpnI and AatII sticky ends on the pre-digested plasmid. A high-complexity library of ~1M to 80M variants is typically obtained after bacterial transformation. If required, the level of complexity is decreased to ~100k to ~500k variants with a subset of the transformants. d, Nucleotide preferences can be observed in the most active DPR sequences. The nucleotide frequencies at each position of the DPR in the top 50% to the top 0.1% of the most transcribed sequences are indicated. All sequences (100%) are included as a reference. e, f, DPR motifs identified by HOMER. e, HOMER motifs found in the top 0.1% of HARPE DPR variants.
f, Position-weight matrix for the top HOMER motif. P-values associated with hypergeometric tests (one tailed, no adjustment). All panels show a representative experiment (n = 2 biologically independent samples). g–i, HARPE is highly reproducible. g, Most variants are present and detectable in biological replicates. The intersection comprises variants detected in both biological replicates (exact sequence match). PCC, Pearson’s correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶. h, Reproducibility of the DNA and RNA tag counts, and the resulting transcription strength value, for variants detected in both biological replicates. PCC, Pearson’s correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶. i, Reproducibility of the MTE, DPE, IRF1, and SCP1 (with TATA box) datasets, for variants detected in both biological replicates. PCC, Pearson’s correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶.
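
The per-position nucleotide frequencies described for panel d can be illustrated with a minimal Python sketch. It assumes that each variant already has a transcription strength value (the way that value is derived from the DNA and RNA tag counts is not given in this legend), and the function names and toy data are hypothetical rather than taken from the authors' R code.

```python
from collections import Counter


def top_fraction(variants, fraction):
    """Return the most transcribed `fraction` of (sequence, strength) pairs."""
    ranked = sorted(variants, key=lambda v: v[1], reverse=True)
    keep = max(1, int(len(ranked) * fraction))  # always keep at least one variant
    return ranked[:keep]


def position_frequencies(sequences):
    """Per-position A/C/G/T frequencies for a list of equal-length sequences."""
    length = len(sequences[0])
    freqs = []
    for i in range(length):
        counts = Counter(seq[i] for seq in sequences)
        total = sum(counts.values())
        freqs.append({nt: counts.get(nt, 0) / total for nt in "ACGT"})
    return freqs


# Toy data (hypothetical): 4-nt variants with made-up transcription strengths.
toy_variants = [("AGTC", 9.0), ("AGAC", 7.5), ("TGTC", 0.4), ("CCTA", 0.1)]
top = top_fraction(toy_variants, 0.5)
freqs = position_frequencies([seq for seq, _ in top])
```

Calling `top_fraction` with fractions from 0.5 down to 0.001 would reproduce the kind of comparison shown in the panel, with a fraction of 1.0 as the all-sequences reference.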

Extended Data Fig. 2 Further characterization of the HARPE assay and modification of the HARPE assay to include the analysis of the upstream TATA box element.

a–d, Relative promoter strengths in HARPE experiments performed in the absence versus the presence of sarkosyl. In vitro transcription reactions were performed in the absence or presence of 0.2% (w/v) sarkosyl (added immediately after transcription initiation). a, HARPE datasets with reactions performed in the presence of sarkosyl are reproducible. PCC, Pearson’s correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶. b, Relative promoter strength does not appear to be affected by the addition of sarkosyl. Comparison of HARPE data from reactions carried out in the absence (Control) or the presence of sarkosyl. PCC, Pearson’s correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶. c, The top 0.1% most highly transcribed promoter variants show similar nucleotide preferences in the absence (Control) or the presence of sarkosyl (representative experiment, n = 2 biologically independent samples). d, The individual analysis of 16 independent promoter variants shows that the relative promoter strengths are approximately the same in the absence (Control) or the presence of sarkosyl. PCC, Pearson’s correlation coefficient with two-tailed P-value = 7.1 × 10⁻¹¹ (replicate 1) or 1.7 × 10⁻¹¹ (replicate 2). For gel source data, see Supplementary Fig. 1. e–g, HARPE yields consistent data under different conditions. The nucleotide frequencies of the top 0.1% most active sequences are shown. e, HARPE analysis (in vitro) of the DPR with three different promoter cassettes: SCP1 lacking a TATA box (SCP1m), the human IRF1 core promoter (IRF1), and SCP1 containing a TATA box (SCP1). f, HARPE of the DPR (+17 to +35), DPE (+23 to +34), and MTE (+18 to +29) motifs with the SCP1m promoter in vitro. g, HARPE of the DPR in the SCP1m promoter transcribed in vitro or in cells. All panels show a representative experiment, n = 2 biologically independent samples. h–j, HARPE data generated in cells are similar to the corresponding in vitro data. 
h, The nucleotide frequencies of the top 0.1% most active DPR sequences obtained in cells are consistent with their in vitro counterparts. These HARPE experiments were performed with the human IRF1 core promoter. i, The nucleotide frequencies of the top 0.1% most active MTE and DPE sequences obtained in cells are consistent with their in vitro counterparts. These experiments examined either the MTE region or the DPE region in cells or in vitro. j, The nucleotide frequencies of the top 0.1% most active DPR sequences obtained in cells are consistent with their in vitro counterparts. These HARPE experiments were performed with the TATA-box-containing SCP1 core promoter. All panels show a representative experiment (n = 2 biologically independent samples). k–p, HARPE can be used to analyse regions upstream of the TSS. k, Design of a HARPE experiment targeting the upstream TATA-box region. Sequencing of the DNA constructs provides a correspondence between each TATA-box variant and a downstream barcode. Analysis of the barcode sequence in each transcript thus identifies its associated TATA-box variant sequence. l, HARPE was performed with a randomized region from −32 to −21 (long TATA) relative to the +1 TSS. The reproducibility of two independent experiments is shown. PCC, Pearson’s correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶; rho, Spearman’s rank correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶. m, HARPE was carried out with a randomized region from −30 to −23 (short TATA) with an upstream TA dinucleotide at positions −32 and −31. The upstream TA sequence directs the formation of the TATA box in a single phase. The reproducibility of two independent experiments is also shown. PCC, Pearson’s correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶; rho, Spearman’s rank correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶.
n, The nucleotide frequencies and top 8-nt and 12-nt HOMER motifs for the top 0.1% most transcribed variants are shown for HARPE data with the long TATA (−32 to −21) randomized sequence. The upstream T of the 8-nt TATA box motif was found to be located at position −32, −31, or −30 (representative experiment, n = 2 biologically independent samples). o, The nucleotide frequencies and top 8-nt HOMER motif for the top 0.1% most transcribed variants are shown for HARPE data with the short TATA (−30 to −23) randomized sequence. In the short TATA analysis, the upstream T of the TATA box is fixed at position −32, and thus, a distinct TATA-box sequence can be seen in the HOMER analysis (representative experiment, n = 2 biologically independent samples). p, The nucleotide frequencies in natural human focused promoters¹² are similar to those in the long TATA dataset (n), particularly with the A and T nucleotides.
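
The barcode strategy in panel k amounts to a lookup table built from the DNA sequencing reads and then applied to the transcript reads. The Python sketch below illustrates that idea under stated assumptions: the read formats, the function names, and the rule of discarding barcodes linked to more than one variant are all hypothetical and are not described in this legend.

```python
from collections import Counter


def build_barcode_map(dna_reads):
    """Map each barcode to its TATA-box variant from (variant, barcode) DNA reads.

    Assumption: a barcode seen with more than one variant is ambiguous and dropped.
    """
    seen = {}
    ambiguous = set()
    for variant, barcode in dna_reads:
        if barcode in seen and seen[barcode] != variant:
            ambiguous.add(barcode)
        seen.setdefault(barcode, variant)
    return {bc: var for bc, var in seen.items() if bc not in ambiguous}


def count_transcripts(rna_barcodes, barcode_map):
    """Count transcripts per variant by looking up each transcript's barcode."""
    counts = Counter()
    for barcode in rna_barcodes:
        variant = barcode_map.get(barcode)
        if variant is not None:  # skip barcodes absent from the DNA library
            counts[variant] += 1
    return counts


# Toy data (hypothetical): 8-nt variants paired with 3-nt barcodes.
dna_reads = [
    ("TATAAAAG", "AAC"),
    ("TATATAAG", "GGT"),
    ("CCCGGGCC", "TTA"),
    ("TATAAAAG", "TTA"),  # same barcode, different variant: ambiguous
]
rna_barcodes = ["AAC", "AAC", "GGT", "TTA", "CAT"]

barcode_map = build_barcode_map(dna_reads)
rna_counts = count_transcripts(rna_barcodes, barcode_map)
```

The lookup is needed because the randomized TATA region lies upstream of the TSS and is therefore absent from the transcript, whereas the downstream barcode is transcribed.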

Extended Data Fig. 3 Initial characterization and optimization of the SVR models and the creation of a low complexity HARPE library for further SVR analysis of the DPR.

a, Selection of sequences for training of the SVR. Different numbers of training sequences were selected either randomly (blue line) or by using a combination of the most transcribed (Best) variants and Non-Best variants (that is, those variants that are not in the Best category) at a 1:1 ratio of Best:Non-Best (orange line). The resulting SVR models were used to predict the transcriptional activity of the Test Sequences in Fig. 3b, and the correlations between the predicted versus observed transcriptional activities are shown on the Y axis. In our studies, we used the SVR model (Selected variants) that was built on the training set that consists of the 100,000 most transcribed (Best) variants and randomly selected 100,000 Non-Best variants (representative experiment n = 2 biologically independent samples). The models in this figure were built by using default parameters for SVR training. b–d, Grid search cross validation for the SVR models. Grid search results with different values for the cost of misclassification (cost) and individual training example influence (gamma) for (b) SVRb, (c) SVRc, and (d) SVRtata. Shown are Spearman’s rank correlation coefficient (rho) between the prediction of each model and the observed transcription strength with two independent datasets (validation and test sets, which are separate halves of the test sequences described in Fig. 3b) that were not used in the training of the models. SVR models were trained as described in Methods. Undefined (UD) correlation is observed when the prediction of a model is constant regardless of the sequence. The hyperparameter values that were selected in this study are as follows: SVRb (c = 10 and gamma = 0.1); SVRc (c = 1, gamma = 0.02); and SVRtata (c = 100, gamma = 0.1). e, Concordance between the predicted and observed activities of DPR sequence variants, as shown with a logarithmic scale. Analysis of 7500 independent test sequences in the HARPE dataset that were not used in the training of SVRb. 
This figure presents the data shown in Fig. 3b with a log scale for the x- and y-axes. PCC, Pearson’s correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶; rho, Spearman’s rank correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶. f–i, Design and use of a low complexity HARPE library that provides high-quality data on 8,431 unique DPR variants. f, Design of a low complexity library with multiple DNA sequence tags for each DPR variant. A restricted library was built with 8,431 unique DPR variants. Each variant was associated with about 15 downstream DNA sequence tags that enable multiple measurements of transcription strength for the same variant within the same experiment. g, To restrict the complexity of the library, the randomized region was shortened to 13 nucleotides, and each position contained one of only two different bases. h, The number of tags per variant. The median value is 13 (representative experiment, n = 2 biologically independent samples). i, The observed transcription strength for each of the DPR variants. There are multiple different sequence tags for each DPR variant. The plot shows the average (black) ± standard deviation (designated in grey) for each of the variants (representative experiment, n = 2 biologically independent samples).
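
The training-set selection in panel a (the most transcribed "Best" variants plus an equal number of randomly chosen "Non-Best" variants) can be sketched as follows. This is an illustrative Python example with hypothetical names and toy data; the study used 100,000 variants per group, and the authors' own implementation in R is not reproduced here.

```python
import random


def select_training_set(variants, n_best, seed=0):
    """Split (sequence, strength) pairs into Best and sampled Non-Best at a 1:1 ratio.

    Best = the n_best most transcribed variants.
    Non-Best = n_best variants drawn at random from all remaining variants.
    """
    ranked = sorted(variants, key=lambda v: v[1], reverse=True)
    best = ranked[:n_best]
    remainder = ranked[n_best:]
    if len(remainder) < n_best:
        raise ValueError("not enough Non-Best variants for a 1:1 ratio")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    non_best = rng.sample(remainder, n_best)
    return best, non_best


# Toy data (hypothetical): ten variants with strengths 0.0 to 9.0.
toy_variants = [("V%d" % i, float(i)) for i in range(10)]
best, non_best = select_training_set(toy_variants, n_best=3)
training_set = best + non_best
```

Compared with purely random sampling, this scheme guarantees that strongly transcribed variants, which are rare in a randomized library, are well represented during training.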

Extended Data Fig. 4 Individual assessment of the transcription activity of 16 independent variants that are not present in the SVR training set.

a, The 16 variants, which include the original SCP1m sequence, represent a wide range of SVR scores. Nucleotides that differ from the SCP1m sequence are indicated in red type. b, The 16 promoter sequences were inserted into plasmids and subjected to in vitro transcription and primer extension analysis (n = 4 biologically independent samples). The plots show the predicted SVRb scores and the observed transcription strengths. Replicate 1 is shown in Fig. 3d. PCC, Pearson’s correlation coefficient with two-tailed P-values <1.7 × 10⁻⁶; rho, Spearman’s rank correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶. For gel source data, see Supplementary Fig. 1. c, The 16 promoters were subjected to transient transfection and primer extension analysis (n = 4 biologically independent samples). The plots show the predicted SVRb scores and the observed transcription strengths. PCC, Pearson’s correlation coefficient with two-tailed P-value <3.9 × 10⁻⁶; rho, Spearman’s rank correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶. For gel source data, see Supplementary Fig. 1.
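
The PCC and rho values quoted in these legends compare predicted SVR scores with observed transcription strengths. As a rough illustration, the Python sketch below computes both coefficients with the standard library only. The data are made up, P-values are not computed, and returning `None` for a constant input is this sketch's own convention for the "undefined" case mentioned in the Extended Data Fig. 3 legend.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient; None if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant vector
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson's correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (hypothetical): predicted scores and observed strengths.
predicted = [0.5, 1.0, 2.0, 3.0]
observed = [1.0, 2.0, 8.0, 30.0]
pcc = pearson(predicted, observed)
rho = spearman(predicted, observed)
constant_case = spearman([1.0, 1.0, 1.0, 1.0], observed)
```

In this toy example the relationship is monotonic but not linear, so rho equals 1 while the PCC is lower, which is why the legends report both measures.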

Extended Data Fig. 5 Use of the SVR models to identify active sequence elements and performance assessment of the SVR models.

a–c, The relationship between SVR scores and transcription strength. Box-plot diagrams are shown for (a) SVRb, (b) SVRc, and (c) SVRtata with all of their corresponding HARPE sequence variants that are placed in bins of the indicated SVR score ranges. Sequence variants with SVRb score ≥ 2, SVRc score ≥ 2, and SVRtata score ≥ 1 are typically at least about 6 times more active than an inactive sequence (light blue shaded regions), and are thus designated as “active”. The thick horizontal lines are the medians, and the lower and upper hinges are the first and third quartiles, respectively. Each upper (or lower) whisker extends from the upper (or lower) hinge to the largest (or lowest) value no further than 1.5 * IQR from the hinge. Data beyond the end of the whiskers (outlying points) are omitted from the box plot. Sequence variants with transcription strength = 0 were removed to allow log-scale display of the diagrams. The horizontal dashed grey lines denote the transcription strengths of the median inactive sequences. d–h, Performance assessment of SVRb. All panels show a representative experiment (n = 2 biologically independent samples). d, Selection of HARPE variants used in performance assessment. The top 10% sequence variants were designated as active/positive for transcription, and an equal (randomly selected) number of the bottom 50% of sequence variants were designated as inactive/negative for transcription. These sequences were then used in the performance assessment. Intermediate variants that were between the top and bottom groups were not included. The transcription strengths of all selected sequences are shown. e, Receiver operating characteristic (ROC) curve. f, Precision-recall (PR) curve. g, Performance measures relative to the minimum SVRb score required for a positive prediction. Performance was computed by counting true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). 
Accuracy [(TP+TN) / (TP+FP+TN+FN)] reflects how often SVRb predictions are correct. Precision [TP / (TP + FP)] is the proportion of positive predictions that are correct. Sensitivity or recall or true positive rate [TP / (TP + FN)] is the proportion of transcriptionally active variants that are correctly predicted as positives. h, False positive and false negative rates. The false positive rate [FP / (FP + TN)] is the probability for an inactive sequence to be incorrectly predicted as positive. The false negative rate [FN / (FN + TP)] = (1 − Sensitivity) is the probability for an active sequence to be incorrectly predicted as negative. Performance values are shown for selected minimum SVRb scores (1.5 and 2). All panels show a representative experiment (n = 2 biologically independent samples). i–m, Performance assessment of SVRc. i, Selection of HARPE variants used in performance assessment. The top 10% sequence variants were designated as active/positive for transcription, and an equal (randomly selected) number of the bottom 50% of sequence variants were designated as inactive/negative for transcription. These sequences were then used in the performance assessment. Intermediate variants that were between the top and bottom groups were not included. The transcription strengths of all selected sequences are shown. j, Receiver operating characteristic (ROC) curve. k, Precision-recall (PR) curve. l, Performance measures relative to the minimum SVRc score required for a positive prediction. Performance was computed by counting true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Accuracy [(TP+TN) / (TP+FP+TN+FN)] reflects how often SVRc predictions are correct. Precision [TP / (TP + FP)] is the proportion of positive predictions that are correct. Sensitivity [TP / (TP + FN)] is the proportion of transcriptionally active variants that are correctly predicted as positives. m, False positive and false negative rates. 
The false positive rate [FP / (FP + TN)] is the probability for an inactive sequence to be incorrectly predicted as positive. The false negative rate [FN / (FN + TP)] = (1 − Sensitivity) is the probability for an active sequence to be incorrectly predicted as negative. Performance values are shown for selected minimum SVRc scores (1.5 and 2). All panels show a representative experiment (n = 2 biologically independent samples). n–r, Performance assessment of SVRtata. n, Selection of HARPE variants used in performance assessment. The top 10% sequence variants were designated as active/positive for transcription, and an equal (randomly selected) number of the bottom 50% of sequence variants were designated as inactive/negative for transcription. These sequences were then used in the performance assessment. Intermediate variants that were between the top and bottom groups were not included. The transcription strengths of all selected sequences are shown. One outlier variant with an exceptionally high transcription level was omitted in the graph, but was included in the performance analysis. o, Receiver operating characteristic (ROC) curve. p, Precision-recall (PR) curve. q, Performance measures relative to the minimum SVRtata score required for a positive prediction. Performance was computed by counting true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Accuracy [(TP+TN) / (TP+FP+TN+FN)] reflects how often SVRtata predictions are correct. Precision [TP / (TP + FP)] is the proportion of positive predictions that are correct. Sensitivity [TP / (TP + FN)] is the proportion of transcriptionally active variants that are correctly predicted as positives. r, False positive and false negative rates. The false positive rate [FP / (FP + TN)] is the probability for an inactive sequence to be incorrectly predicted as positive. 
The false negative rate [FN / (FN + TP)] = (1 − Sensitivity) is the probability for an active sequence to be incorrectly predicted as negative. Performance values are shown for minimum SVRtata scores = 1.0. All panels show a representative experiment (n = 2 biologically independent samples).
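
The performance measures defined in this legend follow directly from the four confusion-matrix counts. The Python sketch below applies those formulas to made-up scores and labels; the function names and data are hypothetical, and a score greater than or equal to the minimum score is treated as a positive prediction, in line with the "score ≥ 2" wording above.

```python
def confusion_counts(scores, is_active, min_score):
    """Count TP, TN, FP and FN for predictions of the form score >= min_score."""
    tp = tn = fp = fn = 0
    for score, active in zip(scores, is_active):
        predicted_positive = score >= min_score
        if predicted_positive and active:
            tp += 1
        elif predicted_positive and not active:
            fp += 1
        elif active:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def ratio(numerator, denominator):
    """Safe division; None when the denominator is zero."""
    return numerator / denominator if denominator else None


def performance(tp, tn, fp, fn):
    """Performance measures as defined in the legend."""
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
    }


# Toy data (hypothetical): four active and four inactive variants.
scores = [2.5, 2.1, 1.2, 0.3, 2.2, 0.1, 0.5, 1.8]
is_active = [True, True, True, True, False, False, False, False]

counts = confusion_counts(scores, is_active, min_score=2.0)
metrics = performance(*counts)
```

Re-running `confusion_counts` over a range of minimum scores yields the points of the ROC and precision-recall curves, since each threshold gives one (false positive rate, sensitivity) pair and one (sensitivity, precision) pair.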

Extended Data Fig. 6 Further analysis of the SVR models and their relation to consensus sequence-based approaches.

a–e, SVR models based on HARPE data with different promoter backgrounds are consistent. SVR models were tested with the 7500 DPR sequence variants used in Fig. 3b. a, SVRirf1 models trained with HARPE data for the DPR with the IRF1 promoter cassette (promoter background) are reproducible. b, SVRb based on HARPE data for the DPR with the SCP1m promoter cassette (promoter background) is similar to the SVRirf1 model trained with HARPE data for the DPR in the IRF1 background. c, SVRscp1 models trained with HARPE data for the DPR with the SCP1 (TATA-containing) promoter cassette (promoter background) are reproducible. d, SVRb for the DPR in the TATA-less SCP1m promoter cassette (promoter background) is similar to the SVRscp1 model for the DPR in the TATA-containing SCP1 promoter cassette. e, SVRb and SVRscp1 exhibit similar DNA sequence preferences. This figure shows the web logos for the top HOMER motifs identified with the top 0.1% DPR sequences (in 500,000 random sequences), as assessed with either SVRb or SVRscp1. f–h, SVR analysis incorporates information that is not encapsulated in a consensus of enriched sequences in the most active variants. f, Web logo for the top HOMER motif identified with the 0.1% most transcribed DPR sequences. This panel is adapted from Fig. 1c and shows the DPE-like RGWYGT consensus of enriched sequences from +28 to +33. In contrast, the SVR model is generated from strong, intermediate, and weak variants of the entire DPR region. g, HARPE variants with a perfect match to the RGWYGT consensus exhibit transcription strengths that range from highly active to inactive. h, SVRb accurately predicts the transcription strengths of different HARPE variants with a perfect match to the RGWYGT consensus. PCC, Pearson’s correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶; rho, Spearman’s rank correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶. 
i, An SVR-based approach provides a more accurate prediction of DPR activity than a consensus sequence-based method. The plots show the correlation between the observed transcription strength (in vitro) and the predicted scores of the DPR, as assessed with either SVRb (upper; adapted from Fig. 3b) or a consensus sequence/position-weight matrix-based method (HOMER; lower). The HOMER consensus/position-weight matrix (Fig. 1c, Extended Data Fig. 1e, f) is based on the top 0.1% most transcribed DPR sequences. The DPR variants are the 7500 Test Sequences shown in Fig. 3. The coloured density scale is identical for both plots (representative experiment, n = 2 biologically independent samples). PCC, Pearson’s correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶; rho, Spearman’s rank correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶. j, k, SVRb scores are influenced by DNA sequence context (that is, flanking nucleotides), whereas PWM-based HOMER scores treat individual nucleotide positions independently. j, Box-plot diagrams of the changes in the HOMER motif scores (top) and the SVRb scores (bottom) due to an A-to-G substitution at each of the indicated positions. The values were generated with 200 different DPR sequences in randomly-selected natural human promoters. The thick horizontal lines are the medians, and the lower and upper hinges are the first and third quartiles, respectively. Each upper (or lower) whisker extends from the upper (or lower) hinge to the largest (or lowest) value no further than 1.5 * IQR from the hinge. Data beyond the end of the whiskers (outlying points) are omitted from the box plot. A representative experiment is shown (n = 2 biologically independent samples). k, The influence of sequence context is accurately captured by the SVR model. Shown are the changes in SVRb score and transcription strength for 4,081 DPR variants when A is mutated to G at positions +30 (left) or +32 (right). 
The transcription data of the sequence variants were from the Low Complexity Library (Fig. 3c). PCC, Pearson’s correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶; rho, Spearman’s rank correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶.
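The contrast drawn in panel j can be illustrated with a toy calculation. The sketch below is not the SVRb model or the HOMER matrix: the position-weight matrix is random, the "context" score is an invented non-additive function, and the position index 13 and the 19-nt length are arbitrary choices for illustration. It only shows why an additive PWM gives the same score change for an A-to-G substitution in every sequence context, whereas a score with interactions between positions does not.

```python
import random

BASES = "ACGT"


def make_toy_pwm(length, seed=0):
    """Hypothetical weight matrix: one dict of base -> weight per position."""
    rng = random.Random(seed)
    return [{b: rng.uniform(-2.0, 2.0) for b in BASES} for _ in range(length)]


def pwm_score(seq, pwm):
    """Additive PWM score: each position contributes independently."""
    return sum(col[base] for col, base in zip(pwm, seq))


def toy_context_score(seq, pwm):
    """Toy non-additive score: PWM score plus a bonus for 'GT' at indices 13-14."""
    bonus = 1.0 if seq[13:15] == "GT" else 0.0
    return pwm_score(seq, pwm) + bonus


def a_to_g_deltas(pos, score_fn, length=19, n=200, seed=1):
    """Score changes for an A-to-G substitution at `pos` in n random contexts."""
    rng = random.Random(seed)
    deltas = []
    for _ in range(n):
        bases = [rng.choice(BASES) for _ in range(length)]
        bases[pos] = "A"  # force an A at the position of interest
        original = "".join(bases)
        mutated = original[:pos] + "G" + original[pos + 1:]
        deltas.append(score_fn(mutated) - score_fn(original))
    return deltas


pwm = make_toy_pwm(19)
pwm_deltas = a_to_g_deltas(13, lambda s: pwm_score(s, pwm))
context_deltas = a_to_g_deltas(13, lambda s: toy_context_score(s, pwm))

# The PWM change is constant (up to rounding); the context score change is not.
pwm_spread = max(pwm_deltas) - min(pwm_deltas)
context_spread = max(context_deltas) - min(context_deltas)
```

In a box plot of such changes, the PWM boxes collapse to a single value at each position, while a context-sensitive score produces boxes with a visible spread.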


Extended Data Fig. 7 Characterization of the properties of the SVR models and the generation of SVRtata for the TATA box and SVRc for the DPR with cell-based data.

a–c, SVR models capture the preferred distances between the TSS and the DPR. a, The most significantly enriched 8-nt HOMER motif found in the top 0.1% of HARPE DPR variants (top) and its associated position-weight matrix (bottom). P-value associated with hypergeometric tests (one tailed). This 8-nt DPE-like motif closely resembles the Drosophila DPE consensus sequence^2,14. Importantly, the DPE-like sequence is shorter than the DPR region and is therefore not at a fixed position. b, Positional preference analysis of the 8-nt motif in the top 0.1% HARPE DPR variants shows a preferred major position (74%) as well as a minor position (17%) that is 1 nt upstream of the major position. c, SVRb accurately predicts the transcription strength of sequence variants in all positions. This figure shows box-plot diagrams of the transcription strength for all variants within the HARPE dataset that contain the 8-nt motif at each position. The quality of the prediction at each position is indicated by Spearman’s rank correlation coefficient (rho) between the observed transcription strength and SVRb score, HOMER motif score with the 19-nt DPR motif (shown in Extended Data Fig. 1e, f), or HOMER motif score with the 8-nt DPR motif (shown in a). The thick horizontal lines are the medians, and the lower and upper hinges are the first and third quartiles, respectively. Each upper (or lower) whisker extends from the upper (or lower) hinge to the largest (or lowest) value no further than 1.5 * IQR from the hinge. Data beyond the end of the whiskers (outlying points) are omitted from the box plot. All panels show a representative experiment (n = 2 biologically independent samples). d–i, Machine learning analysis of the HARPE TATA-box data yields an SVRtata model for the TATA box. The HARPE data for the long TATA-box region (−32 to −21; Extended Data Figs. 1a, 2k–p, 8a, b) were subjected to SVR analysis. 
The resulting SVR models (derived from data generated in vitro or in cells) were termed SVRtata. d, The SVRtata model from HARPE data in cells is similar to that from HARPE data in vitro. The SVRtata (in vitro) and SVRtata (in cells) scores are compared by using 5000 independent test sequences that were not used in the training of the SVR. PCC, Pearson’s correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶; rho, Spearman’s rank correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶. e, Comparison of SVRtata scores and the observed transcription strengths of 5000 independent test sequences. These results are based on in vitro data. PCC, Pearson’s correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶; rho, Spearman’s rank correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶. f, Comparison of HOMER motif scores and the observed transcription strengths of the same 5000 test sequences used in e. The position-weight matrices of the top 12-nt (left) or 8-nt (right) HOMER motifs (Extended Data Fig. 2n) were used to determine HOMER motif scores. PCC, Pearson’s correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶; rho, Spearman’s rank correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶. g, Cumulative frequency of SVRtata scores of natural human promoters in HeLa cells. Approximately 23% of 11,932 human promoters and 4% of 100,000 random sequences (61% average G/C content, as in human core promoters) have an SVRtata (in vitro) score of at least 1 (marked with a green line), which corresponds to an active TATA box (Extended Data Fig. 5c). h, Cumulative frequency of SVRtata scores of natural human promoters in MCF7 cells. Focused promoters identified in ref. ¹² were used. Approximately 18% of 7,678 MCF7 promoters and 4% of 100,000 random sequences (61% average G/C content, as in human core promoters) have an SVRtata (in vitro) score of at least 1 (marked with a green line), which corresponds to an active TATA box. 
i, Cumulative frequency of SVRtata scores of natural human promoters in GM12878 cells. Focused promoters were identified as described in ref. ¹² by using GRO-cap data in human GM12878 cells from ref. ³⁷. Approximately 15% of 30,643 GM12878 promoters and 4% of 100,000 random sequences (61% average G/C content, as in human core promoters) have an SVRtata (in vitro) score of at least 1 (marked with a green line), which corresponds to an active TATA box. All panels show a representative experiment (n = 2 biologically independent samples). j, k, Most positions within the DPR have a moderate impact upon the overall SVR score. The influence of each position in the DPR on the model prediction score is shown by the value of the Position Index. The Position Index at position X is the average of the maximal magnitude of variation in (j) the SVR score or (k) the HOMER motif score with A, C, G or T at position X with 200 different DPR sequences that were randomly selected from natural human promoters. As a reference, the Web Logo for the top HOMER motif identified with the 0.1% most transcribed DPR sequences is also shown. l, m, SVRc model of the DPR with HARPE data generated in cells. l, HARPE libraries were transfected in cells, and normalized RNA tags were obtained. The SVRc (SVR from cell-based data) scores derived from these data correlate with measured transcription strengths in cells (with data that are independent of the SVRc training data) (representative experiment, n = 2 biologically independent samples). PCC, Pearson’s correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶; rho, Spearman’s rank correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶. m, The SVRc models obtained from cells are reproducible. PCC, Pearson’s correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶; rho, Spearman’s rank correlation coefficient with two-tailed P-value <2.2 × 10⁻¹⁶.
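The Position Index defined for panels j and k can be written out as a short calculation. This is a minimal sketch under one stated assumption: "maximal magnitude of variation" is read here as the difference between the highest and lowest score obtained when A, C, G or T is placed at position X. The scoring function is passed in as a parameter; the real SVR or HOMER scorers are not reproduced, and the names are hypothetical.

```python
from statistics import mean

BASES = "ACGT"


def position_index(sequences, pos, score_fn):
    """Average, over sequences, of the score range across the four bases at `pos`.

    `pos` is a 0-based index into each sequence; `score_fn` maps a sequence
    string to a numeric score (for example an SVR or PWM scorer).
    """
    ranges = []
    for seq in sequences:
        scores = [score_fn(seq[:pos] + base + seq[pos + 1:]) for base in BASES]
        ranges.append(max(scores) - min(scores))
    return mean(ranges)
```

With a toy scorer such as `lambda s: 3.0 if s[2] == "G" else 0.0`, the index is 3.0 at position 2 and 0.0 elsewhere, which matches the intuition that the index measures how much a single position can move the score.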


Extended Data Fig. 8 Analysis of the HARPE TATA data as well as the DPR in natural human promoters.

a, b, The nucleotide preferences of the top 0.1% most active TATA-box sequences in cells are similar to those of their in vitro counterparts. a, Long randomized TATA-box region (-32 to -21 relative to the +1 TSS). b, Short randomized TATA-box region (-30 to -23 relative to the +1 TSS). All panels show a representative experiment (n = 2 biologically independent samples). c, Distinct nucleotide preferences can be seen at the DPR in focused human promoters, which were identified as described in ref. ¹² by using 5′GRO-seq data in HeLa cells³³. d, The top ~2.5% (11,932) most active DPR sequences in cells, as assessed by HARPE, have nucleotide preferences that are similar to those seen in natural human core promoters in HeLa cells (representative experiment, n = 2 biologically independent samples). e–g, Relationship between natural human promoter sequences and HARPE data in vitro. e, The top ~2.5% (11,932) most active DPR sequences in vitro, as assessed by HARPE, have nucleotide preferences that are similar to those seen in natural human core promoters in HeLa cells. f, Cumulative frequency of SVRb DPR scores of natural human promoters. Approximately 26% of 11,932 human promoters (HeLa cells), 12% of 100,000 random sequences (61% average G/C content, as in human core promoters), and 0.4% of 10,000 inactive sequences (randomly selected from the 50% least active sequences in the HARPE assay; not used in the training of the SVR) have an SVRb score of at least 2 (marked with a green line), which corresponds to an active DPR (Extended Data Fig. 5a). g, Cumulative frequency of SVRc and SVRb DPR scores of natural human promoters in MCF7 and GM12878 cells. 
Approximately 34% of 7,678 MCF7 promoters, 34% of 30,643 GM12878 promoters, 17% of 100,000 random sequences (61% average G/C content, as in human core promoters), and 2.6% of 10,000 inactive sequences (randomly selected from the 50% least active sequences in the HARPE assay; not used in the training of the SVR) have an SVRc score of at least 2 (marked with a green line), which corresponds to an active DPR (Extended Data Fig. 5b). Approximately 26% of 7,678 MCF7 promoters, 25% of 30,643 GM12878 promoters, 12% of 100,000 random sequences (61% average G/C content, as in human core promoters), and 0.4% of 10,000 inactive sequences (randomly selected from the 50% least active sequences in the HARPE assay; not used in the training of the SVR) have an SVRb score of at least 2 (marked with a green line), which corresponds to an active DPR (Extended Data Fig. 5a). All panels show a representative experiment (n = 2 biologically independent samples). h, i, Analysis of the DPR in natural human promoters. h, Sequences of natural human promoters that contain DPR motifs with an SVRb score >6 and an SVRc score >2.5. The mutant DPR sequence has an SVRb score of 0.3 and an SVRc score of 0.3. i, Mutational analysis reveals DPR activity in different human promoters with SVRb DPR scores >6. In each of the mutant promoters, the wild-type DPR was substituted with a DNA sequence that has an SVRb DPR score of 0.3 (data are depicted as the mean with error bars denoting standard deviation, n = 3 or 4 biologically independent samples, as indicated by the points representing independent samples on the graph). The sequences of the tested promoters are shown in h. Promoter activity was measured by in vitro transcription followed by primer extension analysis of the TSSs. All P-values <0.01 (Student’s t-test, two-tailed, paired). For gel source data, see Supplementary Fig. 1.
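The percentages quoted for panels f and g are fractions of sequences whose score is at or above a cutoff, compared against random sequences with 61% average G/C content. A minimal sketch of both steps follows. It assumes that G and C are equally likely within the G/C fraction (and A and T within the remainder), which the legend does not specify, and it leaves the scoring model itself out; the function names are hypothetical.

```python
import random


def random_sequence(length, gc_content=0.61, rng=random):
    """Random DNA sequence with the given expected G/C fraction.

    Assumption: G and C are drawn with equal probability, as are A and T.
    """
    bases = []
    for _ in range(length):
        if rng.random() < gc_content:
            bases.append(rng.choice("GC"))
        else:
            bases.append(rng.choice("AT"))
    return "".join(bases)


def fraction_at_or_above(scores, threshold):
    """Fraction of scores that are greater than or equal to `threshold`."""
    scores = list(scores)
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(s >= threshold for s in scores) / len(scores)
```

Given a list of model scores for a promoter set, `fraction_at_or_above(scores, 2)` would correspond to the point where the cumulative frequency curve crosses the green line at a score of 2.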


Extended Data Fig. 9 Analysis of the DPR and its relationship to the Inr and TATA box in active human promoters in different human cell lines.

a–e, Analysis of the DPR and its relationship to the Inr and TATA box in active human promoters in HeLa cells. a, Distribution of focused human promoters derived from HeLa cells in increasing SVRc DPR score bins. Bins 9 and 10 have fewer than 100 promoters. b, The frequencies of occurrence of the Inr and Inr-like sequences in different bins of promoters with increasing SVRc DPR scores. The Inr-like sequence is as defined previously¹². c, The frequencies of occurrence of the TATA box and TATA-like sequences decrease as the SVRc DPR score increases. d, Distribution of focused human promoters in increasing SVRb DPR score bins. Promoters with SVRb scores between 4.24 and 17 were combined in bin 11. e, The frequencies of occurrence of Inr-like sequences, TATA-like sequences, and TATA-box motifs (as assessed with SVRtata ≥ 1; Extended Data Fig. 5c) in different bins of promoters with increasing SVRb DPR scores. The Inr-like and TATA-like sequences are as defined previously¹². In b and c, bins with fewer than 100 promoters are indicated with open circles and are connected by dashed lines. In e, bin 11 is shown in black circles connected by dashed black lines. All panels show a representative experiment (n = 2 biologically independent samples). f, g, Analysis of the DPR and its relationship to the Inr and TATA box in active human promoters in MCF7 and GM12878 cells. f, Distribution of focused human promoters in increasing SVRc DPR score bins. For each cell line, bin 10 has fewer than 100 promoters. MCF7 focused promoters are described in ref. ¹². GM12878 focused promoters were identified as described in ref. ¹² by using GRO-cap data in human GM12878 cells from ref. ³⁷. g, The frequencies of occurrence of Inr-like sequences, TATA-like sequences, and TATA-box motifs (as assessed with SVRtata ≥ 1; Extended Data Fig. 5c) in different bins of promoters with increasing SVRc DPR scores. The Inr-like and TATA-like sequences are as defined previously¹². 
Bins with fewer than 100 promoters are indicated with open circles and are connected by dashed lines. All panels show a representative experiment (n = 2 biologically independent samples).
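The binned analyses in this figure group promoters by DPR score, report how often a motif occurs in each bin, and mark bins with fewer than 100 promoters. A minimal sketch of that bookkeeping follows. The bin edges are supplied by the caller because the legend gives only one of them (4.24 to 17 for bin 11), and the function name and record layout are hypothetical.

```python
from bisect import bisect_right


def bin_frequencies(records, edges, min_count=100):
    """Per-bin motif frequency for (score, has_motif) records.

    `edges` are ascending bin boundaries. Bin i covers
    edges[i] <= score < edges[i + 1]; the last bin also includes its upper edge.
    Scores outside the overall range are skipped.
    """
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    hits = [0] * n_bins
    for score, has_motif in records:
        if score < edges[0] or score > edges[-1]:
            continue
        i = min(bisect_right(edges, score) - 1, n_bins - 1)
        counts[i] += 1
        hits[i] += bool(has_motif)
    return [
        {
            "bin": i + 1,
            "count": counts[i],
            "frequency": hits[i] / counts[i] if counts[i] else None,
            "sparse": counts[i] < min_count,  # plotted as open circles in the figure
        }
        for i in range(n_bins)
    ]
```

The `sparse` flag mirrors the legend's convention of drawing low-count bins differently, since frequencies estimated from few promoters are less reliable.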


Extended Data Fig. 10 Distribution of SVR DPR scores for human promoters in relation to their SVRtata scores.

Human promoters were divided into four groups according to their SVRtata score. For each TATA box category, the distribution of SVR DPR scores is shown for each of five classes of promoters (no DPR, weak DPR, intermediate DPR, good DPR, and strong DPR). a, Human focused promoters obtained from HeLa cells^12,33 analysed with SVRtata and SVRc. b, Human focused promoters obtained from HeLa cells analysed with SVRtata and SVRb. c, Human focused promoters obtained from MCF7 cells¹² analysed with SVRtata and SVRc. d, Human focused promoters obtained from GM12878 cells³⁷ analysed with SVRtata and SVRc. Focused promoters were identified as described in ref. ¹² by using GRO-cap data in human GM12878 cells from ref. ³⁷. All panels show a representative experiment (n = 2 biologically independent samples).


Supplementary information

Supplementary Information

This file contains Supplementary Discussions 1 to 7, Supplementary References, Supplementary Tables 1 and 2, and Supplementary Fig. 1.

Reporting Summary

Source data

Source Data Fig. 1

Source Data Fig. 3

Source Data Fig. 4

Source Data Extended Data Fig. 1

Source Data Extended Data Fig. 2

Source Data Extended Data Fig. 3

Source Data Extended Data Fig. 4

Source Data Extended Data Fig. 5

Source Data Extended Data Fig. 6

Source Data Extended Data Fig. 7

Source Data Extended Data Fig. 8

Source Data Extended Data Fig. 9

Source Data Extended Data Fig. 10


About this article

Cite this article

Vo ngoc, L., Huang, C.Y., Cassidy, C.J. et al. Identification of the human DPR core promoter element using machine learning. Nature 585, 459–463 (2020). https://doi.org/10.1038/s41586-020-2689-7


Received: 27 November 2019
Accepted: 16 June 2020
Published: 09 September 2020
Issue Date: 17 September 2020
DOI: https://doi.org/10.1038/s41586-020-2689-7

