Analysis

Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning

Abstract

Knowing the sequence specificities of DNA- and RNA-binding proteins is essential for developing models of the regulatory processes in biological systems and for identifying causal disease variants. Here we show that sequence specificities can be ascertained from experimental data with 'deep learning' techniques, which offer a scalable, flexible and unified computational approach for pattern discovery. Using a diverse array of experimental data and evaluation metrics, we find that deep learning outperforms other state-of-the-art methods, even when training on in vitro data and testing on in vivo data. We call this approach DeepBind and have built a stand-alone software tool that is fully automatic and handles millions of sequences per experiment. Specificities determined by DeepBind are readily visualized as a weighted ensemble of position weight matrices or as a 'mutation map' that indicates how variations affect binding within a specific sequence.
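
As an illustration of the 'mutation map' idea mentioned above, the following is a minimal sketch that scores every single-base substitution of a sequence and records the change relative to the unmodified sequence. It is not the DeepBind implementation: the scorer `pwm_score`, the toy matrix `pwm`, and the use of a plain score difference are all assumptions made for illustration, and the paper's exact definition of the map may differ. Any trained model that maps a sequence to a binding score could be passed in place of the hypothetical scorer.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) one-hot array."""
    idx = [BASES.index(b) for b in seq]
    return np.eye(4)[idx]


def pwm_score(seq, pwm):
    """Hypothetical stand-in scorer: best PWM match over all windows.

    This is NOT the DeepBind model; it only provides a score function
    so that the mutation map below has something to evaluate.
    """
    x = one_hot(seq)
    width = pwm.shape[0]
    return max(
        float(np.sum(x[i:i + width] * pwm))
        for i in range(len(seq) - width + 1)
    )


def mutation_map(seq, score_fn):
    """Return a (4, len(seq)) array of score changes.

    Entry [b, pos] is score(mutant) - score(original), where the mutant
    has base BASES[b] substituted at position pos. Entries for the
    original base are left at zero.
    """
    ref = score_fn(seq)
    delta = np.zeros((4, len(seq)))
    for pos in range(len(seq)):
        for b, base in enumerate(BASES):
            if base == seq[pos]:
                continue
            mutant = seq[:pos] + base + seq[pos + 1:]
            delta[b, pos] = score_fn(mutant) - ref
    return delta


# Toy 3-column matrix rewarding the motif "TGA" (+1 match, -1 mismatch).
pwm = -np.ones((3, 4))
for col, base in enumerate("TGA"):
    pwm[col, BASES.index(base)] = 1.0

seq = "ACTGAC"
delta = mutation_map(seq, lambda s: pwm_score(s, pwm))
print(np.round(delta, 2))
```

Rows of `delta` correspond to the substituted base (A, C, G, T) and columns to positions in `seq`; negative entries mark substitutions that weaken the toy motif match, which is the kind of information a mutation map is meant to show.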

Acknowledgements

We are grateful to K.B. Cook, Q.D. Morris and T.R. Hughes for helpful discussions. This work was supported by a grant from the Canadian Institutes of Health Research (OGP-106690) to B.J.F., a John C. Polanyi Fellowship Grant to B.J.F., and funding from the Canadian Institute for Advanced Research to B.J.F. and M.T.W. B.A. was supported by a joint Autism Research Training and NeuroDevNet Fellowship. A.D. was supported by a Fellowship from the Natural Sciences and Engineering Research Council of Canada.

Author information

Author notes

    Babak Alipanahi & Andrew Delong contributed equally to this work.

Affiliations

  1. Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada.

    Babak Alipanahi, Andrew Delong & Brendan J Frey

  2. Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada.

    Babak Alipanahi & Brendan J Frey

  3. Canadian Institute for Advanced Research, Programs on Genetic Networks and Neural Computation, Toronto, Ontario, Canada.

    Matthew T Weirauch & Brendan J Frey

  4. Center for Autoimmune Genomics and Etiology, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, USA.

    Matthew T Weirauch

  5. Divisions of Biomedical Informatics and Developmental Biology, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, USA.

    Matthew T Weirauch

Authors

  1. Babak Alipanahi

  2. Andrew Delong

  3. Matthew T Weirauch

  4. Brendan J Frey

Contributions

B.A., A.D. and B.J.F. conceived the method. A.D. implemented DeepBind and the online database of models. B.A. designed the experiments with input from A.D., M.T.W., and B.J.F., and also implemented DeepFind. B.A., A.D. and B.J.F. wrote the manuscript with valuable input from M.T.W.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Brendan J Frey.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Figures 1–10

  2. Supplementary Notes

Excel files

  1. Supplementary Table 1

    Performance of in vitro trained models on DREAM5 in vitro and in vivo test data

  2. Supplementary Table 2

    In vitro performance metrics for models trained on RNAcompete RBP data

  3. Supplementary Table 3

    In vivo performance metrics for models trained on RNAcompete RBP data

  4. Supplementary Table 4

    The list of all ENCODE ChIP-seq data sets analyzed

  5. Supplementary Table 5

    Performance of models trained on ENCODE ChIP-seq data on held-out data

  6. Supplementary Table 6

    The list of all HT-SELEX data sets analyzed

  7. Supplementary Table 7

    Performance of models trained on HT-SELEX data on held-out data

  8. Supplementary Table 8

    Performance of models trained on HT-SELEX data on ENCODE ChIP-seq data

  9. Supplementary Table 9

    P-values for differential binding scores of RBPs regulating alternatively spliced exons

  10. Supplementary Table 10

    All calibration parameters for DeepBind models and the SGD learning algorithm. Each parameter is either fixed for all calibration trials or independently sampled for each trial from the given search space.

Zip files

  1.

    Supplementary Software

    This code download is distributed as part of the Nature Biotechnology supplementary software release for DeepBind. Users of DeepBind are encouraged to instead use the latest source code and binaries for scoring sequences at http://tools.genes.toronto.edu/deepbind/

    Your access to and use of the downloadable code (the “Code”) contained in this Supplementary Software is subject to a non-exclusive, revocable, non-transferable, and limited right to use the Code for the exclusive purpose of undertaking academic, governmental, or not-for-profit research. Use of the Code or any part thereof for commercial or clinical purposes is strictly prohibited in the absence of a Commercial License Agreement from Deep Genomics (info@deepgenomics.com).