Supervised enhancer prediction with epigenetic pattern recognition and targeted validation

Abstract

Enhancers are important non-coding elements, but they have traditionally been hard to characterize experimentally. The development of massively parallel assays allows the characterization of large numbers of enhancers for the first time. Here, we developed a framework using Drosophila STARR-seq to create shape-matching filters based on meta-profiles of epigenetic features. We integrated these features with supervised machine-learning algorithms to predict enhancers. We further demonstrated that our model could be transferred to predict enhancers in mammals. We comprehensively validated the predictions using a combination of in vivo and in vitro approaches, involving transgenic assays in mice and transduction-based reporter assays in human cell lines (153 enhancers in total). The results confirmed that our model can accurately predict enhancers in different species without re-parameterization. Finally, we examined the transcription factor binding patterns at predicted enhancers versus promoters. We demonstrated that these patterns enable the construction of a secondary model that effectively distinguishes enhancers and promoters.

Fig. 1: Flowchart of the matched-filter model.
Fig. 2: Performance of matched filters and integrated models for predicting STARR-seq peaks, compared with that of peak-based models.
Fig. 3: Performance of matched filters and integrated models for predicting promoters and enhancers.
Fig. 4: Performance of matched filters and integrated model for predicting active enhancers in mice.
Fig. 5: Differences in TF-binding patterns at enhancers and promoters.

Data availability

A detailed description of the datasets used in each part of the study is in the corresponding section of Supplementary Methods. Specifically, the Drosophila epigenetics datasets used in this study were generated by the modENCODE consortium and are available online (http://data.modencode.org). The mouse epigenetics datasets were generated by the ENCODE and Roadmap Epigenomics consortia and are available online (https://www.encodeproject.org). We downloaded the Drosophila STARR-seq data (ref. 28) and the mouse FIREWACh data (ref. 32) from previous studies. Results from transgenic-mouse enhancer assays were generated by the Pennacchio lab at LBNL. Experimental results are summarized in Supplementary Tables 4–9, with the mouse images and additional details available on the VISTA Enhancer Browser (https://enhancer.lbl.gov). The human-cell-line enhancer reporter assay results were generated by the Sutton lab at Yale University. Experimental results are summarized in Supplementary Table 10. More detailed results for each cell line are available in Supplementary Data 1.

Code availability

We have implemented our methods in Python. The source code and the output annotations referenced in the paper are available at the website http://matchedfilter.gersteinlab.org. A dockerized image is also provided at this site.

References

  1. Banerji, J., Rusconi, S. & Schaffner, W. Expression of a beta-globin gene is enhanced by remote SV40 DNA sequences. Cell 27, 299–308 (1981).
  2. Levo, M. et al. Unraveling determinants of transcription factor binding outside the core binding site. Genome Res. 25, 1018–1029 (2015).
  3. Slattery, M. et al. Absence of a simple code: how transcription factors read the genome. Trends Biochem. Sci. 39, 381–399 (2014).
  4. Corradin, O. & Scacheri, P. C. Enhancer variants: evaluating functions in common disease. Genome Med. 6, 85 (2014).
  5. Gusev, A. et al. Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am. J. Hum. Genet. 95, 535–552 (2014).
  6. Wray, G. A. The evolutionary significance of cis-regulatory mutations. Nat. Rev. Genet. 8, 206–216 (2007).
  7. Erwin, G. D. et al. Integrating diverse datasets improves developmental enhancer prediction. PLoS Comput. Biol. 10, e1003677 (2014).
  8. Pennacchio, L. A., Bickmore, W., Dean, A., Nobrega, M. A. & Bejerano, G. Enhancers: five essential questions. Nat. Rev. Genet. 14, 288–295 (2013).
  9. Pennacchio, L. A. et al. In vivo enhancer analysis of human conserved non-coding sequences. Nature 444, 499–502 (2006).
  10. Visel, A. et al. Ultraconservation identifies a small subset of extremely constrained developmental enhancers. Nat. Genet. 40, 158–160 (2008).
  11. Nord, A. S. et al. Rapid and pervasive changes in genome-wide enhancer usage during mammalian development. Cell 155, 1521–1531 (2013).
  12. Visel, A. et al. ChIP–seq accurately predicts tissue-specific activity of enhancers. Nature 457, 854–858 (2009).
  13. Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014).
  14. Ghandi, M., Lee, D., Mohammad-Noori, M. & Beer, M. A. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput. Biol. 10, e1003711 (2014).
  15. Narlikar, L. et al. Genome-wide discovery of human heart enhancers. Genome Res. 20, 381–392 (2010).
  16. Yip, K. Y. et al. Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biol. 13, R48 (2012).
  17. Arnold, C. D. et al. Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science 339, 1074–1077 (2013).
  18. Maston, G. A., Landt, S. G., Snyder, M. & Green, M. R. Characterization of enhancer function from genome-wide analyses. Annu. Rev. Genomics Hum. Genet. 13, 29–57 (2012).
  19. Shlyueva, D., Stampfel, G. & Stark, A. Transcriptional enhancers: from properties to genome-wide predictions. Nat. Rev. Genet. 15, 272–286 (2014).
  20. Yanez-Cuna, J. O. et al. Dissection of thousands of cell type-specific enhancers identifies dinucleotide repeat motifs as general enhancer features. Genome Res. 24, 1147–1156 (2014).
  21. Creyghton, M. P. et al. Histone H3K27ac separates active from poised enhancers and predicts developmental state. Proc. Natl Acad. Sci. USA 107, 21931–21936 (2010).
  22. Heintzman, N. D. et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat. Genet. 39, 311–318 (2007).
  23. Gasperini, M., Tome, J. M. & Shendure, J. Towards a comprehensive catalogue of validated and target-linked human enhancers. Nat. Rev. Genet. 21, 292–310 (2020).
  24. Liu, Y. et al. Functional assessment of human enhancer activities using whole-genome STARR-sequencing. Genome Biol. 18, 219 (2017).
  25. Boyle, A. P. et al. Comparative analysis of regulatory information and circuits across distant species. Nature 512, 453–456 (2014).
  26. Cheng, C. & Gerstein, M. Modeling the relative relationship of transcription factor binding and histone modifications to gene expression levels in mouse embryonic stem cells. Nucleic Acids Res. 40, 553–568 (2012).
  27. Cheng, Y. et al. Principles of regulatory information conservation between mouse and human. Nature 515, 371 (2014).
  28. Dong, X. et al. Modeling gene expression using chromatin features in various cellular contexts. Genome Biol. 13, R53 (2012).
  29. Gerstein, M. B. et al. Comparative analysis of the transcriptome across distant species. Nature 512, 445–448 (2014).
  30. Gjoneska, E. et al. Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer’s disease. Nature 518, 365–369 (2015).
  31. Yue, F. et al. A comparative encyclopedia of DNA elements in the mouse genome. Nature 515, 355–364 (2014).
  32. Zabidi, M. A. et al. Enhancer-core-promoter specificity separates developmental and housekeeping gene regulation. Nature 518, 556–559 (2015).
  33. Cotney, J. et al. Chromatin state signatures associated with tissue-specific gene expression and enhancer activity in the embryonic limb. Genome Res. 22, 1069–1080 (2012).
  34. Ernst, J. et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473, 43–49 (2011).
  35. Burges, C. J. C. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2, 121–167 (1998).
  36. Roadmap Epigenomics, C. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
  37. Consortium, E. P. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
  38. Koch, C. M. et al. The landscape of histone modifications across 1% of the human genome in five human cell lines. Genome Res. 17, 691–707 (2007).
  39. Rajagopal, N. et al. RFECS: a random-forest based algorithm for enhancer identification from chromatin state. PLoS Comput. Biol. 9, e1002968 (2013).
  40. Murtha, M. et al. FIREWACh: high-throughput functional detection of transcriptional regulatory modules in mammalian cells. Nat. Methods 11, 559–565 (2014).
  41. Bailey, S. D. et al. ZNF143 provides sequence specificity to secure chromatin interactions at gene promoters. Nat. Commun. 2, 6186 (2015).
  42. Muerdter, F. et al. Resolving systematic errors in widely used enhancer activity assays in human cells. Nat. Methods 15, 141–149 (2018).
  43. Kumar, V. B. V. K., Mahalanobis, A. & Juday, R. D. Correlation Pattern Recognition (Cambridge University Press, 2005).
  44. Mod, E. C. et al. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science 330, 1787–1797 (2010).
  45. Blanchard, G., Bousquet, O. & Massaer, P. Statistical performance of support vector machines. Ann. Stat. 36, 489–531 (2008).
  46. Hoerl, A. E. & Kennard, R. W. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67 (1970).
  47. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
  48. Stuart, R. & Norvig, P. Artificial Intelligence: A Modern Approach 2nd edn (Pearson, 2003).
  49. Pedregosa, F. et al. Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
  50. Diao, Y. et al. A tiling-deletion-based genetic screen for cis-regulatory element identification in mammalian cells. Nat. Methods 14, 629–635 (2017).
  51. Villar, D. et al. Enhancer evolution across 20 mammalian species. Cell 160, 554–566 (2015).
  52. Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012).
  53. Kothary, R. et al. Inducible expression of an hsp68-lacZ hybrid gene in transgenic mice. Development 105, 707–714 (1989).
  54. Ernst, J. & Kellis, M. ChromHMM: automating chromatin-state discovery and characterization. Nat. Methods 9, 215–216 (2012).
  55. Firpi, H. A., Ucar, D. & Tan, K. Discover regulatory DNA elements using chromatin signatures and artificial neural network. Bioinformatics 26, 1579–1586 (2010).
  56. Lu, Y., Qu, W., Shan, G. & Zhang, C. DELTA: a distal enhancer locating tool based on adaboost algorithm and shape features of chromatin modifications. PLoS One 10, e0130622 (2015).
  57. He, Y. et al. Improved regulatory element prediction based on tissue-specific local epigenomic signatures. Proc. Natl Acad. Sci. USA 114, E1633–E1640 (2017).
  58. Hoffman, M. M. et al. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat. Methods 9, 473–476 (2012).
  59. Arner, E. et al. Transcribed enhancers lead waves of coordinated transcription in transitioning mammalian cells. Science 347, 1010–1014 (2015).
  60. Kleftogiannis, D., Kalnis, P. & Bajic, V. B. DEEP: a general computational framework for predicting enhancers. Nucleic Acids Res. 43, e6 (2015).


Acknowledgements

M. Gerstein was supported by NIH grant HG009446-01. A.V. and L.A.P. were supported by NHLBI grant R24HL123879 and NHGRI grants R01HG003988, U54HG006997 and UM1HG009421, where research was conducted at the E.O. Lawrence Berkeley National Laboratory and performed under Department of Energy Contract DE-AC02-05CH11231, University of California. We thank A. Paccanaro, D. Galeano, M. Torres and Y. Wu for their insightful scientific discussions. We thank all ENCODE consortium members for their feedback on this work.

Author information

Contributions

A.S. and M. Gu conceptualized and developed the matched-filter model under the supervision of M. Gerstein. L.C., K.-K.Y., J.R. and K.Y.Y. performed many initial explorations and analyses of ChIP–seq data. C.Y. did model-performance comparisons. E.G. and R.S. performed the transduction reporter assay in human cell lines. I.B., V.A., J.A.A., I.P.-F., C.S.N., M.K., T.H.G., Q.P., A.H., B.J.M., E.A.L., Y.F.-Y., A.V., D.E.D. and L.A.P. performed enhancer assays in transgenic mice. A.S., M. Gu and M. Gerstein designed the model, coordinated the experimental validation and wrote the manuscript with input from coauthors.

Corresponding author

Correspondence to Mark Gerstein.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nicole Rusk and Lin Tang were the primary editors on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Properties of double peak metaprofile.

a) The frequency of the distance between the two maxima in a double peak flanking active STARR-seq peaks is plotted. b) The symmetry of the double-peak pattern is plotted as the ratio of the distance between the two peaks to the distance between one of the maxima and the minimum. While there is a large amount of variability in the distance between the two peaks (mostly between 300 and 1,100 bp), the trough in the double peak tends to occur at the center of the two peaks. c) The metaprofile around active STARR-seq peaks is plotted for different epigenetic marks. Histone marks that are enriched near STARR-seq peaks display the characteristic double-peak pattern due to the depletion of histone proteins at active regulatory regions. In addition, DHS displays a single peak at the center of these regulatory regions. d) In contrast, no such double-peak pattern is observed for histone marks that are depleted at STARR-seq peaks. e) The matched filter score is calculated as the convolution of the flipped metaprofile with the epigenetic signal, using a sliding window of variable length. The significant peaks in the final matched filter score are used to identify active regulatory regions.
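The scoring step in panel e can be illustrated with a minimal sketch. It assumes a fixed-width template (the caption describes a variable-length window, which is not reproduced here) and a zero-mean, unit-norm normalization; the function name `matched_filter_score` and the toy double-peak profile are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def matched_filter_score(signal, metaprofile):
    """Score each position by how well the local signal matches the metaprofile.

    Assumption: the template is centered and scaled to unit norm so that
    flat background contributes nothing to the score.
    """
    template = np.asarray(metaprofile, dtype=float)
    template = template - template.mean()
    norm = np.linalg.norm(template)
    if norm == 0:
        raise ValueError("metaprofile must not be constant")
    template = template / norm
    # Convolving with the flipped template is the same as sliding the
    # unflipped template along the signal (cross-correlation).
    return np.convolve(np.asarray(signal, dtype=float), template[::-1], mode="same")

# Toy double-peak metaprofile: two bumps flanking a central trough.
offsets = np.arange(-20, 21)
profile = np.exp(-(offsets + 10) ** 2 / 18.0) + np.exp(-(offsets - 10) ** 2 / 18.0)

# Plant one copy of the pattern in an otherwise empty track, centered at 120.
track = np.zeros(200)
track[100:141] = profile

scores = matched_filter_score(track, profile)
print(int(np.argmax(scores)))  # position of the strongest response (120)
```

In practice the peaks of such a score track would still need a significance threshold, which this sketch leaves out.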

Extended Data Fig. 2 Histogram of matched filter scores.

a) The probability density of matched filter scores for different epigenetic marks is shown for STARR-seq peaks (positives) and random regions of the genome (negatives) with H3K27ac signal. In most cases, the matched filter scores for positives and negatives follow Gaussian curves. The amount of overlap between these two curves determines the accuracy with which the matched filter for the corresponding epigenetic feature predicts STARR-seq peaks. b) The histogram of matched filter scores for a small set of epigenetic features on promoters is compared to random regions of the genome. c) The histogram of matched filter scores for a small set of epigenetic features on enhancers is compared to random regions of the genome. The features in b and c were chosen to display distinct features of epigenetic marks around promoters and enhancers.
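If the positive and negative score distributions are treated as independent Gaussians, the link between their overlap and classification accuracy can be written in closed form. The notation below is illustrative and not taken from the paper: $S_+ \sim \mathcal{N}(\mu_+, \sigma_+^2)$ is the score of a positive region, $S_- \sim \mathcal{N}(\mu_-, \sigma_-^2)$ the score of a negative region, and $\Phi$ the standard normal cumulative distribution function.

```latex
\mathrm{AUROC}
  = \Pr(S_+ > S_-)
  = \Phi\!\left( \frac{\mu_+ - \mu_-}{\sqrt{\sigma_+^2 + \sigma_-^2}} \right)
```

Under this assumption, fully overlapping curves ($\mu_+ = \mu_-$) give an AUROC of 0.5, and the AUROC approaches 1 as the separation of the means grows relative to the combined spread.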

Extended Data Fig. 3 Comparison of different statistical models for predicting all STARR-seq peaks using a 30-feature model.

The performance of the different statistical models to integrate the information from 30 epigenetic features is shown. a) The numbers within the parentheses refer to the AUROC and AUPR for predicting the STARR-seq peaks (single core promoter) with histone peaks while the numbers outside the parentheses refer to the AUROC and AUPR for predicting STARR-seq peaks identified after combining multiple core promoters. b) The individual ROC and PR curves for each statistical model. c) The contribution of the matched filter score for each epigenetic feature to the different integrated models.
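The two summary metrics used in these panels can be computed directly from labels and scores. The sketch below is a plain NumPy illustration rather than the evaluation code used in the study; the function names are hypothetical, AUPR is approximated by average precision, and tied scores in the precision calculation are ranked in input order.

```python
import numpy as np

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative")
    # Compare every positive with every negative; ties count as half.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def average_precision(labels, scores):
    """Mean precision measured at the rank of each true positive."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    if not labels.any():
        raise ValueError("need at least one positive")
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = labels[order]
    precision = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return float(precision[hits].mean())

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auroc(labels, scores))              # 0.75
print(average_precision(labels, scores))  # (1 + 2/3) / 2
```

The pairwise comparison in `auroc` uses memory proportional to the number of positive-negative pairs, so a rank-based formulation would be preferable for genome-scale inputs.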

Extended Data Fig. 4 Comparison of different statistical models for predicting all STARR-seq peaks using a 6-feature model.

The performance of the different statistical models to integrate the information from six epigenetic features is shown. a) The numbers within the parentheses refer to the AUROC and AUPR for predicting the STARR-seq peaks (single core promoter) with histone peaks while the numbers outside the parentheses refer to the AUROC and AUPR for predicting STARR-seq peaks identified after combining multiple core promoters. b) The individual ROC and PR curves for each statistical model. c) The contribution of the matched filter score for each epigenetic feature to the different integrated models. The mean value is displayed in the bar plot while the error bars show the standard deviation of feature weights measured by ten-fold cross validation. d) We evaluated the accuracy of the models using different amounts of training data. The AUPR of the model increases with increasing amount of training data until it starts to saturate around 70% of the data. The mean value is displayed in the bar plot while the error bars show the standard deviation of feature weights measured by ten-fold cross validation.

Extended Data Fig. 5 Comparison of different statistical models for predicting enhancers and promoters using six features.

The performance of the different statistical models to integrate the information from six epigenetic features for promoter and enhancer prediction is shown. a) The numbers within the parentheses refer to the AUROC and AUPR for predicting the promoters with histone peaks while the numbers outside the parentheses refer to the AUROC and AUPR for predicting enhancers. The promoters and enhancers from multiple STARR-seq experiments with different core promoters are merged in this analysis. b) The individual ROC and PR curves for each statistical model are shown. c, d) The contribution of the matched filter score for each epigenetic feature to the different integrated models is shown for promoter prediction (c) and enhancer prediction (d). The mean value is displayed in the bar plot while the error bars show the standard deviation of feature weights measured by ten-fold cross validation.

Extended Data Fig. 6 Comparison of different statistical models for predicting enhancers and promoters using 30 features.

The performance of the different statistical models to integrate the information from thirty epigenetic features for promoter and enhancer prediction is shown. a) The numbers within the parentheses refer to the AUROC and AUPR for predicting the promoters with histone peaks while the numbers outside the parentheses refer to the AUROC and AUPR for predicting enhancers. The promoters and enhancers from multiple STARR-seq experiments with different core promoters are merged in this analysis. b) The individual ROC and PR curves for each statistical model are shown. c, d) The contribution of the matched filter score for each epigenetic feature to the different integrated models is shown for promoter prediction (c) and enhancer prediction (d).

Extended Data Fig. 7 Location of H1-hESC predictions.

a) The probability density of the distance of the predicted promoters and enhancers from the closest TSS is shown. b) The locations of the enhancers and promoters on genomic elements are shown. Promoters are defined as TSS ± 2 kb. All TSS, UTR, exons, introns, and intergenic elements are calculated based on GENCODE 19 definitions. A regulatory region is considered to overlap with an element if more than 50% of the matched filter region overlaps with the corresponding element. c) The distribution of expression of the gene closest to each enhancer or promoter is plotted and compared to the expression of all genes in H1-hESC. A two-sided Wilcoxon test shows that the expression of genes close to enhancers and promoters is significantly higher than the expression of all genes in H1-hESC (P < 10^−100 each). The center line in each category represents the median expression level for all genes close to the corresponding category while the lower and upper boundaries of the box indicate the 25th and 75th percentiles of the expression levels for genes within that category.
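The overlap rule in panel b can be expressed as a short sketch. It assumes half-open integer coordinates (start inclusive, end exclusive) on a single chromosome; the helper names and the clamping of the promoter window at zero are illustrative choices, not details from the paper.

```python
def promoter_window(tss, flank=2000):
    """Return the TSS +/- flank window as a half-open (start, end) interval."""
    return max(0, tss - flank), tss + flank

def overlap_fraction(region, element):
    """Fraction of `region` covered by `element`; both are (start, end)."""
    r_start, r_end = region
    e_start, e_end = element
    if r_end <= r_start:
        raise ValueError("region must have positive length")
    shared = max(0, min(r_end, e_end) - max(r_start, e_start))
    return shared / (r_end - r_start)

def overlaps_element(region, element, min_fraction=0.5):
    """Apply the 'more than 50% of the region' rule (strict inequality)."""
    return overlap_fraction(region, element) > min_fraction

promoter = promoter_window(10_000)                # (8000, 12000)
print(overlaps_element((11_000, 12_500), promoter))  # two thirds inside
print(overlaps_element((11_500, 12_500), promoter))  # exactly half inside
```

Because the caption says "more than 50%", a region that is exactly half covered is not counted as overlapping in this sketch.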

Extended Data Fig. 8 Testing predicted enhancers using transgenic mouse enhancer assay.

Representative embryo images are shown for transgenic mice at the e11.5 stage. Blue staining indicates enhancers displaying reproducible activity in expected tissues (forebrain, midbrain, hindbrain, heart, neural tube, or limb). The unique identifiers under each image (accession number starting with ‘mm’) correspond to the element numbers in Supplementary Tables 4-9. Details of each experiment can be found in the VISTA enhancer browser (https://enhancer.lbl.gov) under the corresponding accession number.

Extended Data Fig. 9 Conservation of epigenetic features.

The performance of the fly-based matched filters and the integrated model for predicting active promoters and enhancers in mouse embryonic stem cells identified using FIREWACh is shown. a) Similar to Fig. 3, the numbers within parentheses refer to the AUROC and AUPR for predicting promoters while the numbers outside parentheses refer to the performance of the models for predicting enhancers. b) The weights of the different features in the integrated models for promoter and enhancer prediction are shown. c) The individual ROC and PR curves for each matched filter and d) the integrated model are shown.

Extended Data Fig. 10 Cross-comparison of integrated models for enhancer prediction.

Cross test results of the integrated model on mouse and fly. a) Models were trained in a cell line- and tissue-specific fashion. Row names show the context where the model is trained. Column names show the cell line or tissue where the model is tested. b) Similar to a), assuming identical distribution of matched filter scores for active enhancer regions in each tissue in mouse, we combined the normalized matched filter scores to get a larger training set for the model.

Extended Data Fig. 11 Enhancer validation experiments in human cell lines.

a) Schematic of the enhancer validation experiment flow. At top is the third-generation HIV-based self-inactivating vector (deletion in 3’ LTR indicated by red triangle), with PCR-amplified test DNA (blue, cloned in both orientations) inserted just 5’ of a basal Oct4 promoter (P) driving IRES-eGFP (green). Vector supernatant was prepared by plasmid co-transfection of 293T cells. Cells of interest were transduced and then analyzed by flow cytometry a few days later. Shown below is the expected post-transduction structure of the SIN HIV vector, with a duplication of the 3’ LTR deletion rendering both LTRs non-functional. b) The fold change in eGFP expression was compared between negative elements (n = 20 biologically independent samples) and putative enhancers (n = 20 biologically independent samples) chosen at random. Each sample in the plot is the average log fold change of the replicates for each element. c–e) Predicted enhancers increase gene expression in HOS, A549, and TZM-bl cell lines. The enhancers were predicted in H1-hESCs. The activities of these enhancers (n = 20 in each plot) were compared to control regions (n = 20 in each plot) in three other cell lines: c) HOS, d) A549, and e) TZM-bl. P values were calculated using a two-sided t-test. The center value, represented by the green line in the box plot, shows the median log fold change of each group. The 25th and 75th percentiles of the log fold changes in gene expression for each group are represented by the lower and upper lines of the box, with whiskers connecting to the minimum and maximum values.

Extended Data Fig. 12 Activity of putative enhancers tested in H1-hESCs.

Each element was tested in triplicate (biologically independent experiments) by SIN HIV vector transduction of H1-hESCs. The bar plot shows the average of the activity measured in three replicates for each element by FACS analysis gating on eGFP+ cells, with error bars showing the standard deviations. F, forward orientation; R, reverse orientation.

Extended Data Fig. 13 Performance comparison of the Matched-filter model in four mouse tissues.

We compared the performance of the matched filter model to other state-of-the-art predictive models on four mouse tissues where data are available. a) Comparison of the performance of the matched filter model and ChromHMM with ROC curves, using experimental results from transgenic mouse enhancer assays. The ROC curves for the matched filter are plotted as blue solid lines, and the ROC curves for ChromHMM are plotted as green dashed lines. ROC curves are shown for all four tissues in embryonic mice at the e11.5 stage. b) Comparison of the performance of the matched filter model with the reported performance of other published methods, including REPTILE, RFECS, DELTA, and CSI-ANN. Bar plots show the area under the ROC curve (AUROC) of each method in different tissues of embryonic mice at the e11.5 stage.

Extended Data Fig. 14 Evaluating the Matched-filter prediction using FANTOM5 experimental data.

We assessed the percentage overlap of the matched filter predictions with the FANTOM5 enhancers and promoters, and compared the percentages with those of other state-of-the-art methods. a) Comparison of the matched filter enhancer predictions in human cell lines with the integrated ChromHMM and Segway annotations using the FANTOM5 enhancer set. Bar plots on the left show the percentage of predicted enhancers overlapping with FANTOM5 enhancers; bar plots on the right show the percentage of FANTOM5 enhancers overlapping with predicted enhancers. b) Comparison of the overlap of matched filter promoter predictions with the FANTOM5 promoter set to that of the integrated ChromHMM and Segway annotations. The bar plots show the percentage of predicted promoters overlapping with FANTOM5 promoters, with dark blue denoting the matched filter model and light blue denoting the integrated ChromHMM and Segway annotations. c) Comparison of the overlap of K562 enhancers predicted by the matched filter and other published methods with the FANTOM5 enhancer set. The bar plots show the percentage of predicted enhancers overlapping with FANTOM5 K562 enhancers for the matched filter model, CSI-ANN, DEEP, and RFECS.

Extended Data Fig. 15 Comparison of the transcription factor binding pattern of matched filter and RFECS in H1-hESCs.

Potentially positive enhancers were considered as regions with either DNase-I hypersensitive sites (DHS), or bound by transcription factors (TFs) such as NANOG, OCT4, SOX2, or p300. TSS were defined as within 2.5kb of any known GENCODE TSS. Predictions that fell out of the above categories were classified as unknown.

Supplementary information

Supplementary Information

Supplementary Methods, Tables 1–10 and Figs. 1–12

Reporting Summary

Supplementary Data 1

The raw data for human validation experiments in all four cell lines

Cite this article

Sethi, A., Gu, M., Gumusgoz, E. et al. Supervised enhancer prediction with epigenetic pattern recognition and targeted validation. Nat Methods 17, 807–814 (2020). https://doi.org/10.1038/s41592-020-0907-8
