Supervised enhancer prediction with epigenetic pattern recognition and targeted validation

Sethi, Anurag; Gu, Mengting; Gumusgoz, Emrah; Chan, Landon; Yan, Koon-Kiu; Rozowsky, Joel; Barozzi, Iros; Afzal, Veena; Akiyama, Jennifer A.; Plajzer-Frick, Ingrid; Yan, Chengfei; Novak, Catherine S.; Kato, Momoe; Garvin, Tyler H.; Pham, Quan; Harrington, Anne; Mannion, Brandon J.; Lee, Elizabeth A.; Fukuda-Yuzawa, Yoko; Visel, Axel; Dickel, Diane E.; Yip, Kevin Y.; Sutton, Richard; Pennacchio, Len A.; Gerstein, Mark

doi:10.1038/s41592-020-0907-8

Article
Published: 29 July 2020

Nature Methods volume 17, pages 807–814 (2020)

Abstract

Enhancers are important non-coding elements, but they have traditionally been hard to characterize experimentally. The development of massively parallel assays allows the characterization of large numbers of enhancers for the first time. Here, we developed a framework using Drosophila STARR-seq to create shape-matching filters based on meta-profiles of epigenetic features. We integrated these features with supervised machine-learning algorithms to predict enhancers. We further demonstrated that our model could be transferred to predict enhancers in mammals. We comprehensively validated the predictions using a combination of in vivo and in vitro approaches, involving transgenic assays in mice and transduction-based reporter assays in human cell lines (153 enhancers in total). The results confirmed that our model can accurately predict enhancers in different species without re-parameterization. Finally, we examined the transcription factor binding patterns at predicted enhancers versus promoters. We demonstrated that these patterns enable the construction of a secondary model that effectively distinguishes enhancers and promoters.

**Fig. 1: Flowchart of the matched-filter model.**

**Fig. 2: Performance of matched filters and integrated models for predicting STARR-seq peaks, compared with that of peak-based models.**

**Fig. 3: Performance of matched filters and integrated models for predicting promoters and enhancers.**

**Fig. 4: Performance of matched filters and integrated model for predicting active enhancers in mice.**

**Fig. 5: Differences in TF-binding patterns at enhancers and promoters.**

Data availability

A detailed description of the datasets used in each part of the study is given in the corresponding section of the Supplementary Methods. Specifically, the Drosophila epigenetics datasets used in this study were generated by the modENCODE consortium and are available online (http://data.modencode.org). The mouse epigenetics datasets were generated by the ENCODE and Roadmap Epigenomics consortia and are available online (https://www.encodeproject.org). We downloaded the Drosophila STARR-seq data²⁸ and the mouse FIREWACh data³² from previous studies. Results from transgenic-mouse enhancer assays were generated by the Pennacchio lab at LBNL. These experimental results are summarized in Supplementary Tables 4–9, with the mouse images and additional details available on the VISTA Enhancer Browser (https://enhancer.lbl.gov). The human-cell-line enhancer reporter assay results were generated by the Sutton lab at Yale University. These experimental results are summarized in Supplementary Table 10. More detailed results for each cell line are available in Supplementary Data 1.

Code availability

We have implemented our methods in Python. The source code and the output annotations referenced in the paper are available at the website http://matchedfilter.gersteinlab.org. A dockerized image is also provided at this site.

References

Banerji, J., Rusconi, S. & Schaffner, W. Expression of a beta-globin gene is enhanced by remote SV40 DNA sequences. Cell 27, 299–308 (1981).
Levo, M. et al. Unraveling determinants of transcription factor binding outside the core binding site. Genome Res. 25, 1018–1029 (2015).
Slattery, M. et al. Absence of a simple code: how transcription factors read the genome. Trends Biochem. Sci. 39, 381–399 (2014).
Corradin, O. & Scacheri, P. C. Enhancer variants: evaluating functions in common disease. Genome Med. 6, 85 (2014).
Gusev, A. et al. Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am. J. Hum. Genet. 95, 535–552 (2014).
Wray, G. A. The evolutionary significance of cis-regulatory mutations. Nat. Rev. Genet. 8, 206–216 (2007).
Erwin, G. D. et al. Integrating diverse datasets improves developmental enhancer prediction. PLoS Comput. Biol. 10, e1003677 (2014).
Pennacchio, L. A., Bickmore, W., Dean, A., Nobrega, M. A. & Bejerano, G. Enhancers: five essential questions. Nat. Rev. Genet. 14, 288–295 (2013).
Pennacchio, L. A. et al. In vivo enhancer analysis of human conserved non-coding sequences. Nature 444, 499–502 (2006).
Visel, A. et al. Ultraconservation identifies a small subset of extremely constrained developmental enhancers. Nat. Genet. 40, 158–160 (2008).
Nord, A. S. et al. Rapid and pervasive changes in genome-wide enhancer usage during mammalian development. Cell 155, 1521–1531 (2013).
Visel, A. et al. ChIP–seq accurately predicts tissue-specific activity of enhancers. Nature 457, 854–858 (2009).
Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014).
Ghandi, M., Lee, D., Mohammad-Noori, M. & Beer, M. A. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput. Biol. 10, e1003711 (2014).
Narlikar, L. et al. Genome-wide discovery of human heart enhancers. Genome Res. 20, 381–392 (2010).
Yip, K. Y. et al. Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biol. 13, R48 (2012).
Arnold, C. D. et al. Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science 339, 1074–1077 (2013).
Maston, G. A., Landt, S. G., Snyder, M. & Green, M. R. Characterization of enhancer function from genome-wide analyses. Annu. Rev. Genomics Hum. Genet. 13, 29–57 (2012).
Shlyueva, D., Stampfel, G. & Stark, A. Transcriptional enhancers: from properties to genome-wide predictions. Nat. Rev. Genet. 15, 272–286 (2014).
Yanez-Cuna, J. O. et al. Dissection of thousands of cell type-specific enhancers identifies dinucleotide repeat motifs as general enhancer features. Genome Res. 24, 1147–1156 (2014).
Creyghton, M. P. et al. Histone H3K27ac separates active from poised enhancers and predicts developmental state. Proc. Natl Acad. Sci. USA 107, 21931–21936 (2010).
Heintzman, N. D. et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat. Genet. 39, 311–318 (2007).
Gasperini, M., Tome, J. M. & Shendure, J. Towards a comprehensive catalogue of validated and target-linked human enhancers. Nat. Rev. Genet. 21, 292–310 (2020).
Liu, Y. et al. Functional assessment of human enhancer activities using whole-genome STARR-sequencing. Genome Biol. 18, 219 (2017).
Boyle, A. P. et al. Comparative analysis of regulatory information and circuits across distant species. Nature 512, 453–456 (2014).
Cheng, C. & Gerstein, M. Modeling the relative relationship of transcription factor binding and histone modifications to gene expression levels in mouse embryonic stem cells. Nucleic Acids Res. 40, 553–568 (2012).
Cheng, Y. et al. Principles of regulatory information conservation between mouse and human. Nature 515, 371 (2014).
Dong, X. et al. Modeling gene expression using chromatin features in various cellular contexts. Genome Biol. 13, R53 (2012).
Gerstein, M. B. et al. Comparative analysis of the transcriptome across distant species. Nature 512, 445–448 (2014).
Gjoneska, E. et al. Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer’s disease. Nature 518, 365–369 (2015).
Yue, F. et al. A comparative encyclopedia of DNA elements in the mouse genome. Nature 515, 355–364 (2014).
Zabidi, M. A. et al. Enhancer-core-promoter specificity separates developmental and housekeeping gene regulation. Nature 518, 556–559 (2015).
Cotney, J. et al. Chromatin state signatures associated with tissue-specific gene expression and enhancer activity in the embryonic limb. Genome Res. 22, 1069–1080 (2012).
Ernst, J. et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473, 43–49 (2011).
Burges, C. J. C. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2, 121–167 (1998).
Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Koch, C. M. et al. The landscape of histone modifications across 1% of the human genome in five human cell lines. Genome Res. 17, 691–707 (2007).
Rajagopal, N. et al. RFECS: a random-forest based algorithm for enhancer identification from chromatin state. PLoS Comput. Biol. 9, e1002968 (2013).
Murtha, M. et al. FIREWACh: high-throughput functional detection of transcriptional regulatory modules in mammalian cells. Nat. Methods 11, 559–565 (2014).
Bailey, S. D. et al. ZNF143 provides sequence specificity to secure chromatin interactions at gene promoters. Nat. Commun. 2, 6186 (2015).
Muerdter, F. et al. Resolving systematic errors in widely used enhancer activity assays in human cells. Nat. Methods 15, 141–149 (2018).
Kumar, V. B. V. K., Mahalanobis, A. & Juday, R. D. Correlation Pattern Recognition (Cambridge University Press, 2005).
modENCODE Consortium et al. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science 330, 1787–1797 (2010).
Blanchard, G., Bousquet, O. & Massart, P. Statistical performance of support vector machines. Ann. Stat. 36, 489–531 (2008).
Hoerl, A. E. & Kennard, R. W. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67 (1970).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Russell, S. & Norvig, P. Artificial Intelligence: A Modern Approach 2nd edn (Pearson, 2003).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Diao, Y. et al. A tiling-deletion-based genetic screen for cis-regulatory element identification in mammalian cells. Nat. Methods 14, 629–635 (2017).
Villar, D. et al. Enhancer evolution across 20 mammalian species. Cell 160, 554–566 (2015).
Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012).
Kothary, R. et al. Inducible expression of an hsp68-lacZ hybrid gene in transgenic mice. Development 105, 707–714 (1989).
Ernst, J. & Kellis, M. ChromHMM: automating chromatin-state discovery and characterization. Nat. Methods 9, 215–216 (2012).
Firpi, H. A., Ucar, D. & Tan, K. Discover regulatory DNA elements using chromatin signatures and artificial neural network. Bioinformatics 26, 1579–1586 (2010).
Lu, Y., Qu, W., Shan, G. & Zhang, C. DELTA: a distal enhancer locating tool based on adaboost algorithm and shape features of chromatin modifications. PLoS One 10, e0130622 (2015).
He, Y. et al. Improved regulatory element prediction based on tissue-specific local epigenomic signatures. Proc. Natl Acad. Sci. USA 114, E1633–E1640 (2017).
Hoffman, M. M. et al. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat. Methods 9, 473–476 (2012).
Arner, E. et al. Transcribed enhancers lead waves of coordinated transcription in transitioning mammalian cells. Science 347, 1010–1014 (2015).
Kleftogiannis, D., Kalnis, P. & Bajic, V. B. DEEP: a general computational framework for predicting enhancers. Nucleic Acids Res. 43, e6 (2015).


Acknowledgements

M. Gerstein was supported by NIH grant HG009446-01. A.V. and L.A.P. were supported by NHLBI grant R24HL123879 and NHGRI grants R01HG003988, U54HG006997 and UM1HG009421, where research was conducted at the E.O. Lawrence Berkeley National Laboratory and performed under Department of Energy Contract DE-AC02-05CH11231, University of California. We thank A. Paccanaro, D. Galeano, M. Torres and Y. Wu for their insightful scientific discussions. We thank all ENCODE consortium members for their feedback on this work.

Author information

These authors contributed equally: Anurag Sethi, Mengting Gu.

Authors and Affiliations

Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
Anurag Sethi, Koon-Kiu Yan, Joel Rozowsky, Chengfei Yan & Mark Gerstein
Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
Mengting Gu & Mark Gerstein
Department of Computer Science, Yale University, New Haven, CT, USA
Mengting Gu & Mark Gerstein
Department of Internal Medicine, Section of Infectious Diseases, Yale University School of Medicine, New Haven, CT, USA
Emrah Gumusgoz & Richard Sutton
School of Medicine, The Chinese University of Hong Kong, Hong Kong, China
Landon Chan
Functional Genomics Department, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Iros Barozzi, Veena Afzal, Jennifer A. Akiyama, Ingrid Plajzer-Frick, Catherine S. Novak, Momoe Kato, Tyler H. Garvin, Quan Pham, Anne Harrington, Brandon J. Mannion, Elizabeth A. Lee, Yoko Fukuda-Yuzawa, Axel Visel, Diane E. Dickel & Len A. Pennacchio
Department of Computer Science, The Chinese University of Hong Kong, Hong Kong, China
Kevin Y. Yip

Contributions

A.S. and M. Gu conceptualized and developed the matched-filter model under the supervision of M. Gerstein. L.C., K.-K.Y., J.R. and K.Y.Y. performed many initial explorations and analyses of ChIP–seq data. C.Y. did model-performance comparisons. E.G. and R.S. performed the transduction reporter assay in human cell lines. I.B., V.A., J.A.A., I.P.-F., C.S.N., M.K., T.H.G., Q.P., A.H., B.J.M., E.A.L., Y.F.-Y., A.V., D.E.D. and L.A.P. performed enhancer assays in transgenic mice. A.S., M. Gu and M. Gerstein designed the model, coordinated the experimental validation and wrote the manuscript with input from coauthors.

Corresponding author

Correspondence to Mark Gerstein.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nicole Rusk and Lin Tang were the primary editors on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Properties of double peak metaprofile.

a) The frequency of the distance between the two maxima in a double peak flanking active STARR-seq peaks is plotted. b) The symmetry of the double-peak pattern is plotted as the ratio of the distance between the two peaks to the distance between one of the maxima and the minimum. While there is a large amount of variability in the distance between the two peaks (mostly between 300 and 1,100 bp), the trough in the double peak tends to occur midway between the two peaks. c) The metaprofile around active STARR-seq peaks is plotted for different epigenetic marks. Histone marks that are enriched near STARR-seq peaks display the characteristic double-peak pattern because histone proteins are depleted at active regulatory regions. In addition, DHS displays a single peak at the center of these regulatory regions. d) In contrast, no double-peak pattern is observed for histone marks that are depleted at STARR-seq peaks. e) The matched filter score is calculated by convolving the flipped metaprofile with the epigenetic signal using a sliding window of variable length. The significant peaks in the final matched filter score are used to identify active regulatory regions.
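
The scoring step described in panel e can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the zero-mean unit-norm normalization, the linear resampling used to vary the window length, and the toy metaprofile are all assumptions made for illustration.

```python
import math

def normalize(template):
    """Center a template to zero mean and scale it to unit Euclidean norm (an assumed normalization)."""
    mean = sum(template) / len(template)
    centered = [t - mean for t in template]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0:
        raise ValueError("template must not be constant")
    return [c / norm for c in centered]

def resample(template, width):
    """Linearly interpolate a template to `width` points (width >= 2)."""
    n = len(template)
    out = []
    for i in range(width):
        pos = i * (n - 1) / (width - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(template[lo] * (1 - frac) + template[hi] * frac)
    return out

def matched_filter_scores(signal, template):
    """Sliding dot product of the signal with the normalized template.

    This cross-correlation is the same as convolving the signal with the
    flipped template.
    """
    kernel = normalize(template)
    w = len(kernel)
    return [sum(signal[start + j] * kernel[j] for j in range(w))
            for start in range(len(signal) - w + 1)]

def best_window(signal, template, widths):
    """Try several window widths and return (score, start, width) of the best match."""
    best = None
    for width in widths:
        scores = matched_filter_scores(signal, resample(template, width))
        for start, score in enumerate(scores):
            if best is None or score > best[0]:
                best = (score, start, width)
    return best

# Hypothetical double-peak metaprofile embedded in an otherwise flat signal.
metaprofile = [0.0, 1.0, 2.0, 1.0, 0.2, 1.0, 2.0, 1.0, 0.0]
signal = [0.1] * 10 + metaprofile + [0.1] * 10
score, start, width = best_window(signal, metaprofile, widths=[7, 9, 11])
print(start, width)
```

In this toy case the highest score falls where the embedded pattern begins and at the width that matches it. Real signal tracks would also need a significance threshold on the scores, which the caption mentions but this sketch omits.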

Extended Data Fig. 2 Histogram of matched filter scores.

a) The probability density of matched filter scores for different epigenetic marks is shown for STARR-seq peaks (positives) and random regions of the genome (negatives) with H3K27ac signal. In most cases, the matched filter scores for positives and negatives follow Gaussian curves. The amount of overlap between these two curves determines how accurately the matched filter for the corresponding epigenetic feature predicts STARR-seq peaks. b) The histogram of matched filter scores for a small set of epigenetic features on promoters is compared to that for random regions of the genome. c) The histogram of matched filter scores for a small set of epigenetic features on enhancers is compared to that for random regions of the genome. The features in b and c were chosen to display distinct features of epigenetic marks around promoters and enhancers.
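
The link between curve overlap and accuracy in panel a can be made concrete with a short worked example. If the positive and negative scores are assumed to be independent and exactly Gaussian, the area under the ROC curve has the closed form Φ((μ₊ − μ₋)/√(σ₊² + σ₋²)). The means and standard deviations below are hypothetical values, not numbers from the paper.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gaussian_auroc(mu_pos, sd_pos, mu_neg, sd_neg):
    """AUROC when positive and negative scores are independent Gaussians.

    AUROC equals P(positive score > negative score), and the difference of
    two independent Gaussians is itself Gaussian.
    """
    return normal_cdf((mu_pos - mu_neg) / math.sqrt(sd_pos ** 2 + sd_neg ** 2))

# Hypothetical separations between the two score distributions (unit variance each).
for separation in (0.0, 1.0, 2.0, 3.0):
    print(separation, round(gaussian_auroc(separation, 1.0, 0.0, 1.0), 3))
```

Fully overlapping curves give an AUROC of 0.5, and the value rises toward 1 as the curves separate.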

Extended Data Fig. 3 Comparison of different statistical models for predicting all STARR-seq peaks using a 30-feature model.

The performance of the different statistical models to integrate the information from 30 epigenetic features is shown. a) The numbers within the parentheses refer to the AUROC and AUPR for predicting the STARR-seq peaks (single core promoter) with histone peaks while the numbers outside the parentheses refer to the AUROC and AUPR for predicting STARR-seq peaks identified after combining multiple core promoters. b) The individual ROC and PR curves for each statistical model. c) The contribution of the matched filter score for each epigenetic feature to the different integrated models.
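
For readers unfamiliar with the two metrics reported in panel a, the following is a minimal sketch of how AUROC and AUPR can be computed from scores and binary labels. It uses the pairwise-comparison definition of AUROC and average precision as one common estimator of AUPR. The source does not say which estimator the authors used, and the toy scores are hypothetical.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Mean of the precision measured at the rank of each positive.

    Tied scores are ranked in sort order here rather than handled specially.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits

# Hypothetical matched filter scores; label 1 marks a STARR-seq peak.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(auroc(scores, labels), average_precision(scores, labels))
```

AUPR is the more informative of the two when positives are rare, because precision depends on the ratio of positives to negatives while the ROC curve does not.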

Extended Data Fig. 4 Comparison of different statistical models for predicting all STARR-seq peaks using a 6-feature model.

The performance of the different statistical models to integrate the information from six epigenetic features is shown. a) The numbers within the parentheses refer to the AUROC and AUPR for predicting the STARR-seq peaks (single core promoter) with histone peaks while the numbers outside the parentheses refer to the AUROC and AUPR for predicting STARR-seq peaks identified after combining multiple core promoters. b) The individual ROC and PR curves for each statistical model. c) The contribution of the matched filter score for each epigenetic feature to the different integrated models. The mean value is displayed in the bar plot while the error bars show the standard deviation of feature weights measured by ten-fold cross validation. d) We evaluated the accuracy of the models using different amounts of training data. The AUPR of the model increases with increasing amount of training data until it starts to saturate around 70% of the data. The mean value is displayed in the bar plot while the error bars show the standard deviation of feature weights measured by ten-fold cross validation.
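
The error bars in panel c summarize how feature weights vary across ten cross-validation folds. The sketch below shows that bookkeeping only. The `univariate_weights` function is a hypothetical stand-in for the statistical models compared in the figure, the data are simulated, and the use of the sample standard deviation is an assumption.

```python
import random
import statistics

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def univariate_weights(X, y):
    """Hypothetical stand-in model: least-squares slope of y on each feature alone."""
    n = len(y)
    y_mean = sum(y) / n
    weights = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        x_mean = sum(col) / n
        sxx = sum((x - x_mean) ** 2 for x in col)
        sxy = sum((x - x_mean) * (t - y_mean) for x, t in zip(col, y))
        weights.append(sxy / sxx if sxx else 0.0)
    return weights

def cross_validated_weights(X, y, fit, k=10, seed=0):
    """Fit on each training split and return per-feature (mean, sample standard deviation)."""
    per_fold = []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        per_fold.append(fit([X[i] for i in train], [y[i] for i in train]))
    return [(statistics.mean(ws), statistics.stdev(ws)) for ws in zip(*per_fold)]

# Simulated data: feature 0 tracks the label, feature 1 is pure noise.
rng = random.Random(1)
y = [i % 2 for i in range(200)]
X = [[label + rng.gauss(0, 0.5), rng.gauss(0, 1.0)] for label in y]
summary = cross_validated_weights(X, y, univariate_weights, k=10)
for name, (mean, sd) in zip(["informative", "noise"], summary):
    print(name, round(mean, 3), round(sd, 3))
```

Any fitting routine that returns one weight per feature can be passed as `fit`, so the same loop works for the different model types compared in the figure.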

Extended Data Fig. 5 Comparison of different statistical models for predicting enhancers and promoters using six features.

The performance of the different statistical models to integrate the information from six epigenetic features for promoter and enhancer prediction is shown. a) The numbers within the parentheses refer to the AUROC and AUPR for predicting the promoters with histone peaks while the numbers outside the parentheses refer to the AUROC and AUPR for predicting enhancers. The promoters and enhancers from multiple STARR-seq experiments with different core promoters are merged in this analysis. b) The individual ROC and PR curves for each statistical model are shown. The contribution of the matched filter score for each epigenetic feature to the different integrated models is shown for promoter prediction (c) and enhancer prediction (d). The mean value is displayed in the bar plot while the error bars show the standard deviation of feature weights measured by ten-fold cross validation.

Extended Data Fig. 6 Comparison of different statistical models for predicting enhancers and promoters using 30 features.

The performance of the different statistical models to integrate the information from thirty epigenetic features for promoter and enhancer prediction is shown. a) The numbers within the parentheses refer to the AUROC and AUPR for predicting the promoters with histone peaks while the numbers outside the parentheses refer to the AUROC and AUPR for predicting enhancers. The promoters and enhancers from multiple STARR-seq experiments with different core promoters are merged in this analysis. b) The individual ROC and PR curves for each statistical model are shown. The contribution of the matched filter score for each epigenetic feature to the different integrated models is shown for promoter prediction (c) and enhancer prediction (d).

Extended Data Fig. 7 Location of H1-hESC predictions.

a) The probability density of the distance of the predicted promoters and enhancers from the closest TSS is shown. b) The location of the enhancers and promoters on genomic elements is shown. Promoters are defined as TSS +/- 2 kb. All TSS, UTR, exon, intron, and intergenic elements are calculated based on GENCODE 19 definitions. In b, a regulatory region is considered to overlap with an element if more than 50% of the matched filter region overlaps with that element. c) The distribution of expression of the gene closest to each enhancer or promoter is plotted and compared to the expression of all genes in H1-hESC. A two-sided Wilcoxon test shows that the expression of genes close to enhancers and promoters is significantly higher than the expression of all genes in H1-hESC (P < 10⁻¹⁰⁰ each). The center line in each category represents the median expression level for all genes close to the corresponding category, while the lower and upper boundaries of the box indicate the 25th and 75th percentiles of the expression levels for genes within that category.
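
The overlap rule used in panel b (more than 50% of the matched filter region inside an element, with promoters defined as TSS +/- 2 kb) can be sketched as follows. Half-open `(start, end)` coordinates, the function names, and the example positions are assumptions for illustration. The sketch compares a region against a single element rather than the union of all elements of one type.

```python
def overlap_fraction(region, element):
    """Fraction of `region` covered by `element`; both are half-open (start, end) pairs."""
    start, end = region
    e_start, e_end = element
    if end <= start:
        raise ValueError("region must have positive length")
    covered = max(0, min(end, e_end) - max(start, e_start))
    return covered / (end - start)

def promoter_window(tss, flank=2000):
    """Promoter interval defined as the TSS plus or minus `flank` base pairs."""
    return (max(0, tss - flank), tss + flank)

def overlaps_element(region, element, min_fraction=0.5):
    """True when strictly more than `min_fraction` of the region lies inside the element."""
    return overlap_fraction(region, element) > min_fraction

# Hypothetical TSS position and two predicted regions near the promoter boundary.
promoter = promoter_window(10000)
print(overlaps_element((11400, 12400), promoter))  # 60% of the region is inside
print(overlaps_element((11500, 12500), promoter))  # exactly 50% is inside
```

The comparison is strict, so a region with exactly half of its length inside the element is not counted, which follows the "more than 50%" wording of the caption.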

Extended Data Fig. 8 Testing predicted enhancers using transgenic mouse enhancer assay.

Representative embryo images are shown for transgenic mice at the e11.5 stage. Blue staining indicates enhancers displaying reproducible activity in expected tissues (forebrain, midbrain, hindbrain, heart, neural tube, or limb). The unique identifiers under each image (accession number starting with ‘mm’) correspond to the element numbers in Supplementary Tables 4-9. Details of each experiment can be found in the VISTA enhancer browser (https://enhancer.lbl.gov) under the corresponding accession number.

Extended Data Fig. 9 Conservation of epigenetic features.

The performance of the fly-based matched filters and the integrated model for predicting active promoters and enhancers in mouse embryonic stem cells identified using FIREWACh is shown. a) Similar to Fig. 3, the numbers within parentheses refer to the AUROC and AUPR for predicting promoters while the numbers outside parentheses refer to the performance of the models for predicting enhancers. b) The weights of the different features in the integrated models for promoter and enhancer prediction are shown. c) The individual ROC and PR curves for each matched filter and d) the integrated model are shown.

Extended Data Fig. 10 Cross-comparison of integrated models for enhancer prediction.

Cross test results of the integrated model on mouse and fly. a) Models were trained in a cell line- and tissue-specific fashion. Row names show the context where the model is trained. Column names show the cell line or tissue where the model is tested. b) Similar to a), assuming identical distribution of matched filter scores for active enhancer regions in each tissue in mouse, we combined the normalized matched filter scores to get a larger training set for the model.

Extended Data Fig. 11 Enhancer validation experiments in human cell lines.

a) Schematic of the enhancer validation experiment flow. At top is the third-generation HIV-based self-inactivating vector (deletion in 3’ LTR indicated by red triangle), with PCR-amplified test DNA (blue, cloned in both orientations) inserted just 5’ of a basal Oct4 promoter (P) driving IRES-eGFP (green). Vector supernatant was prepared by plasmid co-transfection of 293T cells. Cells of interest were transduced and then analyzed by flow cytometry a few days later. Shown below is the expected post-transduction structure of the SIN HIV vector, with a duplication of the 3’ LTR deletion rendering both LTRs non-functional. b) The fold change in eGFP expression was compared between negative elements (n=20 biologically independent samples) and putative enhancers (n=20 biologically independent samples) chosen at random. Each sample in the plot is the average log fold change of the replicates for each element. c–e) Predicted enhancers increase gene expression in A549, HOS, and TZM-bl cell lines. The enhancers were predicted in H1-hESCs. The activities of these enhancers (N=20 in each plot) were compared to control regions (N=20 in each plot) in three other cell lines: c) HOS, d) A549, and e) TZM-bl. The P values were calculated by two-sided t-test. The center value represented by the green line in the box plot shows the median log FC of each group. The 25th and 75th percentiles of the log fold changes in gene expression for each group are represented by the lower and upper lines of the box, with whiskers connecting to the minimum and the maximum values.

Extended Data Fig. 12 Activity of putative enhancers tested in H1-hESCs.

Each element was tested in triplicate (biologically independent experiments) by SIN HIV vector transduction of H1-hESCs. The bar plot shows the average of the activity measured in three replicates for each element by FACS analysis gating on eGFP+ cells, with error bars showing the standard deviations. F, forward orientation; R, reverse orientation.

Extended Data Fig. 13 Performance comparison of the Matched-filter model in four mouse tissues.

We compared the performance of the matched filter model to other state-of-the-art predictive models on four mouse tissues where data are available. a) Comparing the performance of the matched filter model and ChromHMM with ROC curves using experimental results from transgenic mouse enhancer assays. The ROC curves for the matched filter are plotted as blue solid lines, and the ROC curves for ChromHMM are plotted as green dashed lines. ROC curves are shown for all four tissues in embryonic mice at the e11.5 stage. b) Comparing the performance of the matched filter model with the reported performance of other published methods, including REPTILE, RFECS, DELTA, and CSI-ANN. Bar plots show the area under the ROC curve (AUROC) of each method in different tissues of embryonic mice at the e11.5 stage.

Extended Data Fig. 14 Evaluating the Matched-filter prediction using FANTOM5 experimental data.

We assessed the percentage overlap of the matched filter predictions with the FANTOM5 enhancers/promoters, and compared the percentages with those of other state-of-the-art methods. A) Comparison of the matched filter enhancer predictions in human cell lines with the integrated ChromHMM and Segway annotations using the FANTOM5 enhancer set. Bar plots on the left show the percentage of predicted enhancers overlapping with FANTOM5 enhancers; bar plots on the right show the percentage of FANTOM5 enhancers overlapping with predicted enhancers. B) Comparison of the overlap of matched filter promoter predictions with the FANTOM5 promoter set to that of the integrated ChromHMM and Segway annotations. The bar plots show the percentage of predicted promoters overlapping with FANTOM5 promoters, with dark blue denoting the matched filter model and light blue denoting the integrated ChromHMM and Segway annotations. C) Comparison of the overlap of K562 enhancers predicted by the matched filter and other published methods with the FANTOM5 enhancer set. The bar plots show the percentage of predicted enhancers overlapping with FANTOM5 K562 enhancers for the matched filter model, CSI-ANN, DEEP, and RFECS.
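
The two percentages in panel A answer different questions, which a small sketch can make explicit. The caption does not state the minimum overlap required, so this example assumes that any shared base counts, that intervals are half-open and on a single chromosome, and that the coordinates are hypothetical.

```python
def any_overlap(interval, others):
    """True if the half-open interval shares at least one base with any interval in `others`."""
    start, end = interval
    return any(start < o_end and o_start < end for o_start, o_end in others)

def percent_overlapping(query, reference):
    """Percentage of `query` intervals that overlap at least one `reference` interval."""
    if not query:
        raise ValueError("query set must not be empty")
    hits = sum(1 for interval in query if any_overlap(interval, reference))
    return 100.0 * hits / len(query)

# Hypothetical predicted enhancers and reference enhancers on one chromosome.
predicted = [(100, 600), (1000, 1500), (3000, 3400), (5000, 5600)]
reference = [(550, 700), (1200, 1300), (8000, 8200)]

# Share of predictions supported by the reference set (left-hand bars in panel A).
pct_predicted_supported = percent_overlapping(predicted, reference)
# Share of the reference set recovered by the predictions (right-hand bars in panel A).
pct_reference_recovered = percent_overlapping(reference, predicted)
print(pct_predicted_supported, pct_reference_recovered)
```

The first number behaves like a precision and the second like a recall, so the two can differ substantially when the prediction set and the reference set differ in size. The pairwise scan is quadratic and is meant only for illustration; genome-scale sets would call for sorted intervals or an interval tree.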

Extended Data Fig. 15 Comparison of the transcription factor binding pattern of matched filter and RFECS in H1-hESCs.

Potentially positive enhancers were considered as regions with either DNase-I hypersensitive sites (DHS), or bound by transcription factors (TFs) such as NANOG, OCT4, SOX2, or p300. TSS were defined as within 2.5kb of any known GENCODE TSS. Predictions that fell out of the above categories were classified as unknown.

Supplementary information

Supplementary Information

Supplementary Methods, Tables 1–10 and Figs. 1–12

Reporting Summary

Supplementary Data 1

The raw data for human validation experiments in all four cell lines

About this article

Cite this article

Sethi, A., Gu, M., Gumusgoz, E. et al. Supervised enhancer prediction with epigenetic pattern recognition and targeted validation. Nat Methods 17, 807–814 (2020). https://doi.org/10.1038/s41592-020-0907-8


Received: 25 September 2017
Accepted: 18 June 2020
Published: 29 July 2020
Issue Date: August 2020
DOI: https://doi.org/10.1038/s41592-020-0907-8
