Deep neural networks identify sequence context features predictive of transcription factor binding

Abstract

Transcription factors bind DNA by recognizing specific sequence motifs, which are typically 6–12 bp long. A motif can occur many thousands of times in the human genome, but only a subset of those sites are actually bound. Here we present a machine-learning framework leveraging existing convolutional neural network architectures and model interpretation techniques to identify and interpret sequence context features most important for predicting whether a particular motif instance will be bound. We apply our framework to predict binding at motifs for 38 transcription factors in a lymphoblastoid cell line, score the importance of context sequences at base-pair resolution and characterize context features most predictive of binding. We find that the choice of training data heavily influences classification accuracy and the relative importance of features such as open chromatin. Overall, our framework enables novel insights into features predictive of transcription factor binding and is likely to inform future deep learning applications to interpret non-coding genetic variants.

A preprint version of the article is available at bioRxiv.

Fig. 1: AgentBind overview.
Fig. 2: Interpreting context-specific determinants of TF binding.
Fig. 3: Identifying key context sequence features for TF binding in GM12878.
Fig. 4: Cell-type-specific enrichment of 5-mers influential for STAT3 binding.

Data availability

Variant annotation scores for each TF analysed can be found at https://github.com/Pandaman-Ryan/AgentBind. Peak files for ENCODE ChIP-seq datasets can be found at http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeAwgTfbsUniform. Peak files for STAT3 in CD4+ T cells were obtained from the Gene Expression Omnibus (GEO accession GSM2545819).

Code availability

Code used for training models and performing analyses is available in our GitHub repository https://github.com/Pandaman-Ryan/AgentBind (ref. 45).

References

1. Lambert, S. A. et al. The human transcription factors. Cell 172, 650–665 (2018).

2. Zaret, K. S. & Mango, S. E. Pioneer transcription factors, chromatin dynamics, and cell fate control. Curr. Opin. Genet. Dev. 37, 76–81 (2016).

3. Segal, E. et al. A genomic code for nucleosome positioning. Nature 442, 772–778 (2006).

4. Morgunova, E. & Taipale, J. Structural perspective of cooperative transcription factor binding. Curr. Opin. Struct. Biol. 47, 1–8 (2017).

5. Wang, J. et al. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 22, 1798–1812 (2012).

6. Zhou, T. et al. Quantitative modeling of transcription factor binding specificities using DNA shape. Proc. Natl Acad. Sci. USA 112, 4654–4659 (2015).

7. Guo, Y., Tian, K., Zeng, H., Guo, X. & Gifford, D. K. A novel k-mer set memory (KSM) motif representation improves regulatory variant prediction. Genome Res. 28, 891–900 (2018).

8. Westholm, J. O., Xu, F., Ronne, H. & Komorowski, J. Genome-scale study of the importance of binding site context for transcription factor binding and gene regulation. BMC Bioinform. 9, 484 (2008).

9. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).

10. Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).

11. Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).

12. Lee, D. et al. A method to predict the impact of regulatory variants from DNA sequence. Nat. Genet. 47, 955–961 (2015).

13. Quang, D. & Xie, X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 44, e107 (2016).

14. Quang, D. & Xie, X. FactorNet: a deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data. Methods 166, 40–47 (2019).

15. Zeng, H., Hashimoto, T., Kang, D. D. & Gifford, D. K. GERV: a statistical method for generative evaluation of regulatory variants for transcription factor binding. Bioinformatics 32, 490–496 (2016).

16. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning–based sequence model. Nat. Methods 12, 931–934 (2015).

17. Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proc. IEEE International Conference on Computer Vision 618–626 (2017).

18. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).

19. Khan, A. et al. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 46, D260–D266 (2018).

20. Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).

21. Ma, H., Bandos, A. I., Rockette, H. E. & Gur, D. On use of partial area under the ROC curve for evaluation of diagnostic performance. Stat. Med. 32, 3449–3458 (2013).

22. Amariuta, T. et al. IMPACT: genomic annotation of cell-state-specific regulatory elements inferred from the epigenome of bound transcription factors. Am. J. Hum. Genet. 104, 879–895 (2019).

23. Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. In Proc. 34th International Conference on Machine Learning 70, 3145–3153 (JMLR, 2017).

24. Lanchantin, J., Singh, R., Wang, B. & Qi, Y. Deep motif dashboard: visualizing and understanding genomic sequences using deep neural networks. In Pacific Symposium on Biocomputing 2017 254–265 (World Scientific, 2017).

25. Gotea, V. et al. Homotypic clusters of transcription factor binding sites are a key component of human promoters and enhancers. Genome Res. 20, 565–577 (2010).

26. Roder, K., Wolf, S. S., Larkin, K. J. & Schweizer, M. Interaction between the two ubiquitously expressed transcription factors NF-Y and Sp1. Gene 234, 61–69 (1999).

27. Dolfini, D., Zambelli, F., Pedrazzoli, M., Mantovani, R. & Pavesi, G. A high definition look at the NF-Y regulome reveals genome-wide associations with selected transcription factors. Nucleic Acids Res. 44, 4684–4702 (2016).

28. Van Dam, H. & Castellazzi, M. Distinct roles of Jun:Fos and Jun:ATF dimers in oncogenesis. Oncogene 20, 2453–2464 (2001).

29. Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).

30. Li, P. et al. BATF–JUN is critical for IRF4-mediated transcription in T cells. Nature 490, 543–546 (2012).

31. Benner, C. et al. Decoding a signature-based model of transcription cofactor recruitment dictated by cardinal cis-regulatory elements in proximal promoter regions. PLoS Genet. 9, e1003906 (2013).

32. Mevel, R., Draper, J. E., Lie, A. L. M., Kouskoff, V. & Lacaud, G. RUNX transcription factors: orchestrators of development. Development 146, dev148296 (2019).

33. Kroger, A. IRFs as competing pioneers in T-cell differentiation. Cell. Mol. Immunol. 14, 649–651 (2017).

34. Tewhey, R. et al. Direct identification of hundreds of expression-modulating variants using a multiplexed reporter assay. Cell 165, 1519–1529 (2016).

35. Liu, G., Zeng, H. & Gifford, D. K. Visualizing complex feature interactions and feature sharing in genomic deep neural networks. BMC Bioinform. 20, 401 (2019).

36. Zeng, H., Edwards, M. D., Guo, Y. & Gifford, D. K. Accurate eQTL prioritization with an ensemble-based framework. Hum. Mutat. 38, 1259–1265 (2017).

37. Grant, C. E., Bailey, T. L. & Noble, W. S. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).

38. Abadi, M. et al. TensorFlow: a system for large-scale machine learning. In Proc. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16) 265–283 (USENIX, 2016).

39. Avsec, Ž. et al. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat. Biotechnol. 37, 592–600 (2019).

40. Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).

41. Kulakovskiy, I. V. et al. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Res. 46, D252–D259 (2018).

42. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

43. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).

44. Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).

45. Zheng, A. Pandaman-Ryan/AgentBind v.0.1 (accessed 19 November 2020); https://doi.org/10.5281/zenodo.4281456

Acknowledgements

This study was supported in part by NIH/NHGRI 1R21HG010070-01 (M.G.), the Microsoft Genomics for Research programme and an Amazon Web Services research award. We thank NVIDIA for donating a Tesla K40 GPU to support this project. We additionally thank C. Benner, C. Glass and A. Goren for helpful comments.

Author information

Contributions

A.Z. designed and performed analyses and helped write the manuscript. M.L., H.Z. and C.W. helped perform analyses. H.S. helped design the study. M.G. conceived the study, supervised analyses and helped write the manuscript.

Corresponding author

Correspondence to Melissa Gymrek.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Nature Machine Intelligence thanks David Gifford, Peter Koo and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Model performance related to GC content and open chromatin.

(a,b) GC content differences correlated with model performance. The x-axis shows the absolute value of the difference in mean GC content for positive vs. negative sequences with the motif for each TF. The y-axis shows auROCs. Each dot represents one TF. Results in (a,b) are for baseline models with motifs blocked (a) or unblocked (b). (c) Comparison of training data size and change in model performance. The x-axis (log10 scale) shows the number of training samples. Orange points show the difference in auROC (y-axis) for baseline vs. GC-controlled models. Green points show GC-controlled vs. DNase-I-controlled models. Each dot represents one TF. (d) Model performance for each TF. The y-axis gives the auROC obtained for different models for each TF. Grey = baseline; orange = GC-controlled; green = DNase-I-controlled. TFs are ranked by the change in auROC between the DNase-I and GC-controlled models. (e) Comparison of cross-TF model performance. Heatmaps show the auROC using a GC-controlled model trained on one TF (rows) and tested on another TF (columns). Red squares denote the model with highest auROC for each TF. (f) Distribution of the difference in auROC between top models and TF-specific models. For TFs where the TF-specific model was best, we computed the difference between the TF-specific model and the next best model (red). For all other TFs, we compared performance of the best model to the TF-specific model (blue). (g,h) are the same as in (e,f) but based on DNase-I-controlled models.
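The two quantities plotted in panels (a,b) of this figure can be illustrated with a minimal Python sketch. It is not taken from the AgentBind repository: the function names are hypothetical, sequences are assumed to be plain A/C/G/T strings, and the auROC is computed with the generic pairwise (Mann–Whitney) definition rather than with whatever library the authors used.

```python
from statistics import mean


def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def abs_gc_difference(positives, negatives) -> float:
    """Absolute difference in mean GC content between two sets of sequences.

    Corresponds to the x-axis quantity described for panels (a,b).
    """
    pos_gc = mean([gc_fraction(s) for s in positives])
    neg_gc = mean([gc_fraction(s) for s in negatives])
    return abs(pos_gc - neg_gc)


def auroc(pos_scores, neg_scores) -> float:
    """auROC as the probability that a positive outscores a negative.

    Ties count as half a win. This O(n*m) loop is for illustration only.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

For example, `auroc([0.9, 0.8], [0.1, 0.85])` compares four positive/negative pairs, three of which rank the positive higher. A large GC difference between positive and negative sets means a model can separate them on base composition alone, which is presumably why GC-controlled negatives are used as a comparison.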

Extended Data Fig. 2 Aggregate Grad-CAM score profiles for each TF.

For each TF, we computed the average absolute value of the Grad-CAM score per position using either models with the core motif unblocked (left) or blocked (right). Values shown are Z-normalized across rows. (a) shows aggregate scores for sequences labelled as positive (bound) and is reproduced from Fig. 2d. (b) shows aggregate scores for sequences labelled as negative (unbound).
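The aggregation described in this caption can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes the Grad-CAM scores for one TF are available as a list of per-sequence score lists of equal length, the function names are hypothetical, and the use of the population standard deviation for Z-normalization is an assumption, since the caption does not specify it.

```python
from statistics import mean, pstdev


def mean_abs_profile(scores):
    """Average absolute score at each position.

    `scores` is a list of rows, one row per sequence, one value per position.
    """
    n_positions = len(scores[0])
    return [mean([abs(row[i]) for row in scores]) for i in range(n_positions)]


def z_normalize(profile):
    """Z-normalize one row (one TF's profile) to mean 0 and unit variance.

    Uses the population standard deviation (an assumption). A constant
    profile is mapped to all zeros to avoid division by zero.
    """
    mu = mean(profile)
    sd = pstdev(profile)
    if sd == 0:
        return [0.0] * len(profile)
    return [(x - mu) / sd for x in profile]
```

Applying `z_normalize(mean_abs_profile(scores))` to each TF in turn yields one heatmap row per TF, so that rows are comparable even when raw score magnitudes differ between models.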

Extended Data Fig. 3 Comparing key context sequence features identified in pre-trained vs. fine-tuned models.

The heatmap shows the enrichment of each 5-mer in regions with the highest Grad-CAM scores for each TF using baseline models before (a) and after (b) fine-tuning. Rows and columns are ordered the same as in Fig. 3. Colours denote odds ratios and the sizes of the boxes denote statistical significance as in Fig. 3. Panel (b) is reproduced from Fig. 3a for comparison.

Extended Data Fig. 4 Context sequence features specific to proximal vs. distal sites.

(a) Enrichment of 5-mers in high-scoring Grad-CAM regions for proximal (left) and distal (right) binding sites. Proximal and distal TF binding sites are defined as described in Methods. Rows and columns are ordered the same as in Fig. 3. (b,c) are the same as in (a) but show data for GC-controlled (b) and DNase-I-controlled (c) models. For (a–c), colours denote odds ratios and the sizes of the boxes denote statistical significance as in Fig. 3. (d) Comparison of top scoring 5-mers in proximal vs. distal SP1 sites. Bars show the odds ratio of enrichment of each sequence in top 5-mers for all (grey), proximal (red) and distal (blue) SP1 sites. The top 20 5-mers ranked by the best odds ratio across all three SP1 models (all, proximal, and distal sites) are shown. Error bars show 95% confidence intervals on odds ratios. (e,f) are the same as in (d) but show data for GC-controlled (e) and DNase-I-controlled (f) models.
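An odds ratio with a 95% confidence interval, as shown in panels (d–f), can be computed from a 2×2 contingency table. The sketch below is a generic illustration with hypothetical function names. It assumes enrichment is measured by whether a region contains the 5-mer at least once, and it uses the Wald interval on the log odds ratio with a 0.5 continuity correction for zero cells; the statistical procedure actually used in the paper is described in its Methods and may differ.

```python
import math


def enrichment_table(kmer, high_regions, background_regions):
    """Count regions with and without `kmer` in two sets of sequences.

    Returns (a, b, c, d): present/absent in high-scoring regions,
    then present/absent in background regions.
    """
    a = sum(kmer in seq for seq in high_regions)
    b = len(high_regions) - a
    c = sum(kmer in seq for seq in background_regions)
    d = len(background_regions) - c
    return a, b, c, d


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio (a*d)/(b*c) with a Wald confidence interval.

    z=1.96 gives an approximate 95% interval. If any cell is zero,
    0.5 is added to every cell (Haldane-Anscombe correction).
    """
    if min(a, b, c, d) == 0:
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper
```

An interval whose lower bound exceeds 1 indicates that the 5-mer is enriched in high-scoring regions relative to the background at roughly the 5% level under this approximation.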

Extended Data Fig. 5 Singleton rate of context SNPs vs. core motif regions.

(a) Singleton rate of context SNPs. The plot shows the percent of SNPs in each category that are singletons. Black = all context sites, orange = context sites with top 5% Grad-CAM scores, red = context sites with top 0.5% Grad-CAM scores. Error bars show ±1 s.e. (b) is the same as (a), but additionally shows singleton rates for SNPs in core motif regions (blue). The number of SNPs in each category for each TF is annotated above each plot.
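The singleton rate and its error bar can be illustrated with a short sketch. It assumes each SNP is represented by its allele count in the reference panel, that a singleton is a SNP with allele count 1, and that the standard error is the binomial one, sqrt(p(1 − p)/n); the caption does not state how the s.e. was computed, so that formula and the function name are assumptions.

```python
import math


def singleton_rate(allele_counts):
    """Percent of SNPs that are singletons, with a binomial standard error.

    `allele_counts` holds one allele count per SNP in a category.
    Returns (rate_percent, standard_error_percent).
    """
    n = len(allele_counts)
    p = sum(1 for ac in allele_counts if ac == 1) / n
    se = math.sqrt(p * (1 - p) / n)  # assumed binomial s.e.
    return 100 * p, 100 * se
```

A higher singleton rate in a category is commonly read as a sign of stronger purifying selection, since deleterious variants tend to stay rare.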

Supplementary information

Supplementary Information

Supplementary discussion.

Reporting Summary

Supplementary Tables

Supplementary Tables 1–11.

About this article

Cite this article

Zheng, A., Lamkin, M., Zhao, H. et al. Deep neural networks identify sequence context features predictive of transcription factor binding. Nat Mach Intell 3, 172–180 (2021). https://doi.org/10.1038/s42256-020-00282-y
