Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning

Journal: Nature Biotechnology
Volume: 33
Pages: 831–838
Year published: 2015
DOI: 10.1038/nbt.3300

Abstract

Knowing the sequence specificities of DNA- and RNA-binding proteins is essential for developing models of the regulatory processes in biological systems and for identifying causal disease variants. Here we show that sequence specificities can be ascertained from experimental data with 'deep learning' techniques, which offer a scalable, flexible and unified computational approach for pattern discovery. Using a diverse array of experimental data and evaluation metrics, we find that deep learning outperforms other state-of-the-art methods, even when training on in vitro data and testing on in vivo data. We call this approach DeepBind and have built a stand-alone software tool that is fully automatic and handles millions of sequences per experiment. Specificities determined by DeepBind are readily visualized as a weighted ensemble of position weight matrices or as a 'mutation map' that indicates how variations affect binding within a specific sequence.

Figures

  1. Figure 1: DeepBind's input data, training procedure and applications.

    1. The sequence specificities of DNA- and RNA-binding proteins can now be measured by several types of high-throughput assay, including PBM, SELEX, and ChIP- and CLIP-seq techniques. 2. DeepBind captures these binding specificities from raw sequence data by jointly discovering new sequence motifs along with rules for combining them into a predictive binding score. Graphics processing units (GPUs) are used to automatically train high-quality models, with expert tuning allowed but not required. 3. The resulting DeepBind models can then be used to identify binding sites in test sequences and to score the effects of novel mutations.

  2. Figure 2: Details of inner workings of DeepBind and its training procedure.

    (a) Five independent sequences being processed in parallel by a single DeepBind model. The convolve, rectify, pool and neural network stages predict a separate score for each sequence using the current model parameters (Supplementary Notes, sec. 1). During the training phase, the backprop and update stages simultaneously update all motifs, thresholds and network weights of the model to improve prediction accuracy. (b) The calibration, training and testing procedure used throughout (Supplementary Notes, sec. 2).
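The convolve, rectify, pool and neural network stages named in this caption can be sketched in a few lines of plain Python. This is an illustrative toy, not the actual GPU implementation: the motif matrix, threshold and output weights below are made-up stand-ins for trained DeepBind parameters.

```python
# Toy sketch of DeepBind's scoring stages: convolve, rectify, pool, network.
# All parameters are illustrative values, not trained model weights.
BASES = "ACGT"

def one_hot(seq):
    # Each position becomes a length-4 indicator vector; 'N' maps to all zeros.
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]

def score_sequence(seq, motifs, thresholds, weights, bias):
    """Each motif detector is a (motif_len x 4) weight matrix."""
    x = one_hot(seq)
    pooled = []
    for motif, thresh in zip(motifs, thresholds):
        m = len(motif)
        # Convolve: slide the motif detector over every window of the sequence.
        acts = [sum(motif[i][j] * x[p + i][j] for i in range(m) for j in range(4))
                for p in range(len(seq) - m + 1)]
        # Rectify: subtract the detector's threshold and clip negatives to zero.
        rect = [max(0.0, a - thresh) for a in acts]
        # Pool: keep only the strongest detection per motif.
        pooled.append(max(rect))
    # Neural network stage, reduced here to a single affine output unit.
    return sum(w * p for w, p in zip(weights, pooled)) + bias
```

For example, a single detector preferring "ACG" scores `"TTACGTT"` higher than `"TTTTTTT"`; training (backprop and update) would adjust the motif entries, thresholds and weights to improve such predictions.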

  3. Figure 3: Quantitative performance on various types of held-out experimental test data.

    (a) Revised PBM evaluation scores for the DREAM5 in vitro transcription factor challenge. The DREAM5 PBM score is based on Pearson correlations and AUCs across 66 transcription factors (cf. ref. 17, Table 2; Supplementary Notes, sec. 5.2). (b) DREAM5 in vivo transcription factor challenge ChIP AUC, using the in vitro models (cf. ref. 17, Fig. 3; Supplementary Notes, sec. 5.3); only DeepBind ranks highly for both in vitro and in vivo. (c) RBP in vitro performance using RNAcompete data22 (Wilcoxon two-sided signed-rank test, n = 244; Supplementary Notes, sec. 4.2); all box-plot whiskers show the 95th/5th percentiles. (d) RBP in vivo performance using PBM-trained models (cf. ref. 22, Fig. 1c; Supplementary Notes, sec. 4.3). (e) AUCs of ChIP-seq models on ChIP-seq data, and of HT-SELEX models on HT-SELEX data (Wilcoxon one-sided signed-rank test, n = 506; Supplementary Notes, sec. 6.1 and 7.2). (f) Performance of HT-SELEX models when used to score ChIP-seq data (Wilcoxon one-sided signed-rank test, n = 35; Supplementary Notes, sec. 7.3).

  4. Figure 4: Analysis of potentially disease-causing genomic variants.

    DeepBind mutation maps (Supplementary Notes, sec. 10.1) were used to understand disease-causing SNVs associated with transcription factor binding. (a) A disrupted SP1 binding site in the LDL-R promoter that leads to familial hypercholesterolemia. (b) A cancer risk variant in a MYC enhancer weakens a TCF7L2 binding site. (c) A gained GATA1 binding site that disrupts the original globin cluster promoters. (d) A lost GATA4 binding site in the BCL-2 promoter, potentially playing a role in ovarian granulosa cell tumors. (e) Loss of two potential RFX3 binding sites leads to abnormal cortical development. (f,g) HGMD SNVs disrupt several transcription factor binding sites in the promoters of HBB and F7, potentially leading to β-thalassemia and hemophilia, respectively. (h) Gained GABP-α binding sites in the TERT promoter, which are linked to several types of aggressive cancer. WT, wild type.
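A mutation map of the kind used in this figure can be tabulated as follows. The exact-match scorer below is a toy stand-in for a trained DeepBind model; the sketch only shows how per-variant score changes are collected.

```python
# Sketch of a "mutation map": for every position in a sequence and every
# possible single-nucleotide substitution, record how the binding score
# changes. toy_score is a stand-in for a trained model's prediction.
def toy_score(seq, motif="GATA"):
    # Max-pooled exact-match score: 1.0 if the motif occurs anywhere, else 0.0.
    return 1.0 if motif in seq else 0.0

def mutation_map(seq, score=toy_score):
    deltas = {}
    for pos, ref in enumerate(seq):
        for alt in "ACGT":
            if alt == ref:
                continue
            mutant = seq[:pos] + alt + seq[pos + 1:]
            # Positive delta: the variant strengthens predicted binding
            # (a gained site); negative: it weakens or destroys a site.
            deltas[(pos, ref, alt)] = score(mutant) - score(seq)
    return deltas
```

With this toy scorer, mutating the G of a GATA site yields a negative delta (a lost site, as in panels a, b, d, e, f and g), while a substitution that creates a GATA match yields a positive delta (a gained site, as in panels c and h).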

  5. Figure 5: DeepBind models are used to describe the regulation mechanism for different RBPs.

    All P values are computed between predicted scores of upregulated and/or downregulated exons and scores of control exons (Mann-Whitney U test; n = c + u for upregulated vs. control exons, and n = c + d for downregulated vs. control exons). * 1 × 10−8 < P ≤ 1 × 10−4; ** 1 × 10−16 < P ≤ 1 × 10−8; *** 1 × 10−32 < P ≤ 1 × 10−16; + P ≤ 1 × 10−32. The numbers of up-, down- and control exons are denoted by u, d and c, respectively. All box-plot whiskers show the 95th and 5th percentiles. u5SS, 3SS, 5SS and d3SS: intronic regions close to the upstream exon's 5′ splice site, the target exon's 3′ and 5′ splice sites, and the downstream exon's 3′ splice site, respectively.

  6. Figure 6: Comparison of motifs learned by DeepBind with known motifs.

    Example motif detectors learned by DeepBind models, along with known motifs from CISBP-RNA22 (for RBPs) and JASPAR30 (for transcription factors). A protein's motifs can collectively suggest putative RNA- and DNA-binding properties, as outlined51, such as variable-width gaps (HNRNPA1, Tp53), position interdependence (CTCF, NR4A2), and secondary motifs (PTBP1, Sox10, Pou2f2). Motifs learned from in vivo data (e.g., ChIP) can suggest potential co-factors (PRDM1/EBF1) as in Teytelman et al.12. Details and references for 'known motifs' are in Supplementary Notes, sec. 10.2.

  7. Supplementary Fig. 1: An extended version of Figure 2a, depicting multi-model training and reverse-complement mode.

    To use the GPU’s full computational power, we train several independent models in parallel on the same data, each with different calibration parameters. The calibration parameters with the best validation performance are used to train the final model. Shown is an example with batch_size=5, motif_len=6, num_motifs=4, num_models=3. Sequences are padded with ‘N’s so that the motif scan operation can find detections at both extremities. Yellow cells represent the reverse complement of the input located above; both strands are fed to the model, and the strand with the maximum score is used for the output prediction (the max strand stage). The output dimension of the pool stage, depicted as num_motifs (*), depends on whether “max” or “max and avg” pooling was used.
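The padding and max-strand logic described in this caption can be sketched as below. The `score` argument is a placeholder for a trained model; here any scorer that maps a sequence to a number will do, and the toy test scorer is an assumption for illustration.

```python
# Sketch of DeepBind's reverse-complement ("max strand") mode: both strands
# are scored and the maximum is kept, since a motif on either strand can
# indicate binding. `score` stands in for a trained model's prediction.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def pad(seq, motif_len):
    # Pad with 'N's so motif detections can occur at both extremities.
    n = "N" * (motif_len - 1)
    return n + seq + n

def max_strand_score(seq, motif_len, score):
    fwd = score(pad(seq, motif_len))
    rev = score(pad(reverse_complement(seq), motif_len))
    return max(fwd, rev)
```

For instance, a sequence whose reverse complement contains the motif still receives a high score even though the forward strand does not match.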

  8. Supplementary Fig. 2: Performance of in vitro trained TF models on in vivo data (DREAM5 ChIP-seq).

    (a) All DREAM5 ChIP-seq AUCs used to compute mean performance shown in Figure 3b. The models were trained on the DREAM5 PBM training data only, and evaluated against three different backgrounds22. (b) Cross-validation performance of methods trained directly on ChIP-seq data (sequence length 100), evaluated against a dinucleotide shuffled background (Supplementary Table 1).

  9. Supplementary Fig. 3: Performance on in vitro RBP data using several evaluation metrics.

    Box plots showing distribution of RNAcompete in vitro RBP performance over 244 different microarray experiments using 6 evaluation metrics (columns) and two types of correlation (rows). Models were trained on RNAcompete PBM probes labeled “Set A”, and tested on “Set B” probes.

  10. Supplementary Fig. 4: Performance of in vitro trained RBP models on in vivo data.

    Performance of all RBP models for which RNAcompete in vivo data was available (cf. Ray et al.19, Fig. 1C). Figure 3d shows only the subset of RBPs for which the in vivo test sequences have an average length <1000. All AUCs are calculated with 100 bootstrap samples, and the standard deviation is shown as vertical lines. “Base counts” shows the best performance achievable by ranking test sequences by the proportion of a single nucleotide or by sequence length; for example, ranking the QKI test sequences by 1/(fraction of Gs) gives an AUC of 0.95. There are 9 RBPs for which at least one method performs better than base counts on this test data. RNAcompete PFMs beat base counts for PUM2, SRSF1, FMR1 and Vts1p. DeepBind beats base counts for 8 RBPs (no significant improvement for FMR1). See Supplementary Table 3 (“In vivo AUCs”) for the raw data for this plot.
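The "base counts" baseline amounts to ranking test sequences by a trivial statistic and computing the AUC of that ranking. A minimal sketch, where the AUC is computed as the normalized Mann-Whitney U statistic (the probability that a randomly chosen positive outranks a randomly chosen negative):

```python
# AUC of a ranking, computed as the normalized Mann-Whitney U statistic;
# ties between a positive and a negative count as half a win.
def auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# "Base counts"-style statistic: e.g. 1/(fraction of Gs), which by itself
# ranked the QKI test sequences surprisingly well.
def base_count_score(seq, base="G"):
    frac = seq.count(base) / len(seq)
    return 1.0 / frac if frac > 0 else float("inf")
```

A model only demonstrates real sequence specificity on this test if its AUC exceeds the best such trivial ranking.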

  11. Supplementary Fig. 5: ROC curves for the AUCs shown in Figure 3d.

    ROC curves for the AUCs shown in Figure 3d, where the RNAcompete-trained (in vitro) RBP models were applied to in vivo (CLIP, RIP) sequences. Importantly, several DeepBind models have higher recall at low false positive rates.

  12. Supplementary Fig. 6: Detailed explanation of how ChIP-seq peaks were divided into training and testing data for each experiment.

    The ChIP-seq performance from Figure 3e is reproduced at left with extra annotations for clarity. At right is the breakdown of ChIP-seq peaks used to train a model on each ChIP experiment. We train each method on peaks labeled A (“top 500 odd”), then test each method on peaks labeled B (“top 500 even”). DeepBind* is a special case where we show that including the lower-ranked peaks labeled C (“all remaining peaks”) in the training set can significantly improve accuracy when scoring the top-ranked peaks labeled B.

  13. Supplementary Fig. 7: Evaluation of FoxA2 models learned from ChIP-seq data on EMSA-measured affinities.

    FoxA2 ChIP model predictions validated by EMSA-measured affinities of FoxA2 binding to 64 probe sequences32. The column marked “DeepBind*” is an extra model that we trained on the same ENCODE ChIP data as “DeepBind”, but using motif_len=16 instead of the usual motif_len=24. The shorter motif length was tried because of the post-hoc observation that our FoxA2 model learns patterns of length 10, and we heuristically found that a motif_len of ~1.5× the true motif length often works well. The fact that DeepBind* performed best suggests that there is still room for refinement in the DeepBind training procedure we use.

  14. Supplementary Fig. 8: ROC curves for the AUCs shown in Figure 3f.

    ROC curves for the AUCs shown in Figure 3f, where the HT-SELEX-trained (in vitro) TF models were applied to in vivo (ChIP) sequences. For the semi-automatic method of Jolma et al. we show the curve for whichever PWM performed best on the test data; summing the scores of their choices of PFMs resulted in worse performance overall, so it is not shown.

  15. Supplementary Fig. 9: Schematic diagram of the DeepFind model.

    Schematic diagram of the DeepFind model, using 2n TF scores (n wild type, n mutant) as features to a deep neural network.
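The 2n-feature construction can be sketched as below. The `models` argument stands in for n trained DeepBind TF models; the toy callables in the usage example are assumptions for illustration only.

```python
# Sketch of DeepFind's feature construction: for a candidate SNV, the n TF
# binding scores on the wild-type sequence are concatenated with the n
# scores on the mutant sequence, giving 2n features for the downstream
# deep neural network. `models` stands in for trained DeepBind models.
def deepfind_features(wt_seq, mut_seq, models):
    wt = [m(wt_seq) for m in models]
    mut = [m(mut_seq) for m in models]
    return wt + mut  # length-2n feature vector
```

For example, with two toy "models" that count A's and G's, `deepfind_features("AAG", "ATG", models)` yields a four-element feature vector, and a classifier over such features can learn which score changes are characteristic of damaging variants.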

  16. Supplementary Fig. 10: DeepFind score distributions for the observed and simulated SNVs.

References

  1. Stormo, G.D. DNA binding sites: representation and discovery. Bioinformatics 16, 16–23 (2000).
  2. Rohs, R. et al. Origins of specificity in protein-DNA recognition. Annu. Rev. Biochem. 79, 233–269 (2010).
  3. Kazan, H., Ray, D., Chan, E.T., Hughes, T.R. & Morris, Q. RNAcontext: a new method for learning the sequence and structure binding preferences of RNA-binding proteins. PLoS Comput. Biol. 6, e1000832 (2010).
  4. Nutiu, R. et al. Direct measurement of DNA affinity landscapes on a high-throughput sequencing instrument. Nat. Biotechnol. 29, 659–664 (2011).
  5. Siggers, T. & Gordân, R. Protein-DNA binding: complexities and multi-protein codes. Nucleic Acids Res. 42, 2099–2111 (2014).
  6. Krizhevsky, A., Sutskever, I. & Hinton, G.E. in Advances in Neural Information Processing Systems (eds. Pereira, F., Burges, C.J.C., Bottou, L. & Weinberger, K.Q.) 1097–1105 (Curran Associates, 2012).
  7. Graves, A., Mohamed, A. & Hinton, G. Speech recognition with deep recurrent neural networks. ICASSP 6645–6649 (2013).
  8. Mukherjee, S. et al. Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays. Nat. Genet. 36, 1331–1339 (2004).
  9. Ray, D. et al. Rapid and systematic analysis of the RNA recognition specificities of RNA-binding proteins. Nat. Biotechnol. 27, 667–670 (2009).
  10. Kharchenko, P., Tolstorukov, M. & Park, P. Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat. Biotechnol. 26, 1351–1359 (2008).
  11. Jolma, A. et al. Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. Genome Res. 20, 861–873 (2010).
  12. Teytelman, L., Thurtle, D.M., Rine, J. & van Oudenaarden, A. Highly expressed loci are vulnerable to misleading ChIP localization of multiple unrelated proteins. Proc. Natl. Acad. Sci. USA 110, 18602–18607 (2013).
  13. LeCun, Y. et al. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1, 541–551 (1989).
  14. Cotter, A., Shamir, O., Srebro, N. & Sridharan, K. in Advances in Neural Information Processing Systems (eds. Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F. & Weinberger, K.Q.) 1647–1655 (Curran Associates, 2011).
  15. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).
  16. Bergstra, J. & Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305 (2012).
  17. Weirauch, M.T. et al. Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 31, 126–134 (2013).
  18. Zhao, Y. & Stormo, G.D. Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nat. Biotechnol. 29, 480–483 (2011).
  19. Foat, B.C., Morozov, A.V. & Bussemaker, H.J. Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics 22, e141–e149 (2006).
  20. Chen, X., Hughes, T.R. & Morris, Q. RankMotif++: a motif-search algorithm that accounts for relative ranks of K-mers in binding transcription factors. Bioinformatics 23, i72–i79 (2007).
  21. Berger, M.F. et al. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat. Biotechnol. 24, 1429–1435 (2006).
  22. Ray, D. et al. A compendium of RNA-binding motifs for decoding gene regulation. Nature 499, 172–177 (2013).
  23. Oberstrass, F.C. et al. Shape-specific recognition in the structure of the Vts1p SAM domain with RNA. Nat. Struct. Mol. Biol. 13, 160–167 (2006).
  24. Daubner, G.M., Cléry, A. & Allain, F.H.-T. RRM-RNA recognition: NMR or crystallography...and new findings. Curr. Opin. Struct. Biol. 23, 100–108 (2013).
  25. Gupta, A. & Gribskov, M. The role of RNA sequence and structure in RNA–protein interactions. J. Mol. Biol. 409, 574–587 (2011).
  26. Landt, S. et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 22, 1813–1831 (2012).
  27. Wang, J. et al. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 22, 1798–1812 (2012).
  28. Machanick, P. & Bailey, T.L. MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics 27, 1696–1697 (2011).
  29. Levitsky, V.G. et al. Application of experimentally verified transcription factor binding sites models for computational analysis of ChIP-Seq data. BMC Genomics 15, 80 (2014).
  30. Mathelier, A. et al. JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Res. 42, D142–D147 (2014).
  31. Matys, V. et al. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 34, D108–D110 (2006).
  32. Jolma, A. et al. DNA-binding specificities of human transcription factors. Cell 152, 327–339 (2013).
  33. Lee, T.I. & Young, R.A. Transcriptional regulation and its misregulation in disease. Cell 152, 1237–1251 (2013).
  34. Stenson, P. et al. The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Hum. Genet. 133, 1–9 (2014).
  35. De Castro-Orós, I. et al. Functional analysis of LDLR promoter and 5′ UTR mutations in subjects with clinical diagnosis of familial hypercholesterolemia. Hum. Mutat. 32, 868–872 (2011).
  36. Pomerantz, M.M. et al. The 8q24 cancer risk variant rs6983267 shows long-range interaction with MYC in colorectal cancer. Nat. Genet. 41, 882–884 (2009).
  37. De Gobbi, M. et al. A regulatory SNP causes a human genetic disease by creating a new transcriptional promoter. Science 312, 1215–1217 (2006).
  38. Kyrönlahti, A. et al. GATA-4 regulates Bcl-2 expression in ovarian granulosa cell tumors. Endocrinology 149, 5635–5642 (2008).
  39. Forbes, S.A. et al. COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res. 39, D945–D950 (2011).
  40. Bae, B.-I. et al. Evolutionarily dynamic alternative splicing of GPR56 regulates regional cerebral cortical patterning. Science 343, 764–768 (2014).
  41. Bell, R.J.A. et al. The transcription factor GABP selectively binds and activates the mutant TERT promoter in cancer. Science 348, 1036–1039 (2015).
  42. Horn, S. et al. TERT promoter mutations in familial and sporadic melanoma. Science 339, 959–961 (2013).
  43. Huang, F. et al. Highly recurrent TERT promoter mutations in human melanoma. Science 339, 957–959 (2013).
  44. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
  45. Pan, Q., Shai, O., Lee, L.J., Frey, B.J. & Blencowe, B.J. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 40, 1413–1415 (2008).
  46. Han, H. et al. MBNL proteins repress ES-cell-specific alternative splicing and reprogramming. Nature 498, 241–245 (2013).
  47. Fogel, B.L. et al. RBFOX1 regulates both splicing and transcriptional networks in human neuronal development. Hum. Mol. Genet. 21, 4171–4186 (2012).
  48. Ule, J. et al. An RNA map predicting Nova-dependent splicing regulation. Nature 444, 580–586 (2006).
  49. Del Gatto-Konczak, F. et al. The RNA-binding protein TIA-1 is a novel mammalian splicing regulator acting through intron sequences adjacent to a 5′ splice site. Mol. Cell. Biol. 20, 6287–6299 (2000).
  50. Xue, Y. et al. Genome-wide analysis of PTB-RNA interactions reveals a strategy used by the general splicing repressor to modulate exon inclusion or skipping. Mol. Cell 36, 996–1006 (2009).
  51. Badis, G. et al. Diversity and complexity in DNA recognition by transcription factors. Science 324, 1720–1723 (2009).

Author information

  1. These authors contributed equally to this work.

    • Babak Alipanahi &
    • Andrew Delong

Affiliations

  1. Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada.

    • Babak Alipanahi,
    • Andrew Delong &
    • Brendan J Frey
  2. Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada.

    • Babak Alipanahi &
    • Brendan J Frey
  3. Canadian Institute for Advanced Research, Programs on Genetic Networks and Neural Computation, Toronto, Ontario, Canada.

    • Matthew T Weirauch &
    • Brendan J Frey
  4. Center for Autoimmune Genomics and Etiology, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, USA.

    • Matthew T Weirauch
  5. Divisions of Biomedical Informatics and Developmental Biology, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, USA.

    • Matthew T Weirauch

Contributions

B.A., A.D. and B.J.F. conceived the method. A.D. implemented DeepBind and the online database of models. B.A. designed the experiments with input from A.D., M.T.W., and B.J.F., and also implemented DeepFind. B.A., A.D. and B.J.F. wrote the manuscript with valuable input from M.T.W.

Competing financial interests

The authors declare no competing financial interests.


Supplementary information

Supplementary Figures

  1. Supplementary Figure 1: An extended version of Figure 2a, depicting multi-model training and reverse-complement mode (90 KB)
  2. Supplementary Figure 2: Performance of in vitro trained TF models on in vivo data (DREAM5 ChIP-seq) (129 KB)
  3. Supplementary Figure 3: Performance on in vitro RBP data using several evaluation metrics (65 KB)
  4. Supplementary Figure 4: Performance of in vitro trained RBP models on in vivo data (90 KB)
  5. Supplementary Figure 5: ROC curves for the AUCs shown in Figure 3d (199 KB)
  6. Supplementary Figure 6: Detailed explanation of how ChIP-seq peaks were divided into training and testing data for each experiment (66 KB)
  7. Supplementary Figure 7: Evaluation of FoxA2 models learned from ChIP-seq data on EMSA-measured affinities (68 KB)
  8. Supplementary Figure 8: ROC curves for the AUCs shown in Figure 3f (649 KB)
  9. Supplementary Figure 9: Schematic diagram of the DeepFind model (81 KB)
  10. Supplementary Figure 10: DeepFind score distributions for the observed and simulated SNVs (72 KB)

    Captions for these supplementary figures are given in the Figures section above.

PDF files

  1. Supplementary Text and Figures (1,241 KB)

    Supplementary Figures 1–10

  2. Supplementary Notes (1,631 KB)

Excel files

  1. Supplementary Table 1 (39 KB)

    Performance of in vitro trained models on DREAM5 in vitro and in vivo test data

  2. Supplementary Table 2 (97 KB)

    In vitro performance metrics for models trained on RNAcompete RBP data

  3. Supplementary Table 3 (18 KB)

    In vivo performance metrics for models trained on RNAcompete RBP data

  4. Supplementary Table 4 (35 KB)

    The list of all ENCODE ChIP-seq data sets analyzed

  5. Supplementary Table 5 (50 KB)

    Performance of models trained on ENCODE ChIP-seq data on held out data

  6. Supplementary Table 6 (67 KB)

    The list of all HT-SELEX data sets analyzed

  7. Supplementary Table 7 (34 KB)

    Performance of models trained on HT-SELEX data on held out data

  8. Supplementary Table 8 (26 KB)

    Performance of models trained on HT-SELEX data on ENCODE ChIP-seq data

  9. Supplementary Table 9 (13 KB)

    P-values for differential binding scores of RBPs regulating alternatively-spliced exons

  10. Supplementary Table 10 (36 KB)

    All calibration parameters for DeepBind models and the SGD learning algorithm. Each parameter is either fixed for all calibration trials, or is independently sampled for each trial from the given search space.

Zip files

  1. Supplementary Software (134 MB)

    This code download is distributed as part of the Nature Biotechnology supplementary software release for DeepBind. Users of DeepBind are encouraged to instead use the latest source code and binaries for scoring sequences at http://tools.genes.toronto.edu/deepbind/

    Your access to and use of the downloadable code (the “Code”) contained in this Supplementary Software is subject to a non-exclusive, revocable, non-transferable, and limited right to use the Code for the exclusive purpose of undertaking academic, governmental, or not-for-profit research. Use of the Code or any part thereof for commercial or clinical purposes is strictly prohibited in the absence of a Commercial License Agreement from Deep Genomics. (info@deepgenomics.com)