Evaluation of methods for modeling transcription factor sequence specificity

Journal name:
Nature Biotechnology


Genomic analyses often involve scanning for potential transcription factor (TF) binding sites using models of the sequence specificity of DNA-binding proteins. Many approaches have been developed to model and learn a protein's DNA-binding specificity, but these methods have not been systematically compared. Here we applied 26 such approaches to in vitro protein binding microarray data for 66 mouse TFs belonging to various families. For nine TFs, we also scored the resulting motif models on in vivo data, and found that the best in vitro–derived motifs performed similarly to motifs derived from the in vivo data. Our results indicate that simple models based on mononucleotide position weight matrices trained by the best methods perform similarly to more complex models for most TFs examined, but fall short in specific cases (<10% of the TFs examined here). In addition, the best-performing motifs typically have relatively low information content, consistent with widespread degeneracy in eukaryotic TF sequence preferences.
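
As a rough illustration of what scanning with a mononucleotide position weight matrix (PWM) and measuring its information content can look like, here is a minimal Python sketch. It is not any of the 26 evaluated methods. The example matrix, the uniform background of 0.25 per base, the forward-strand-only scan and all function names are assumptions made for illustration.

```python
import math

# Hypothetical 4-position PWM (illustrative values only): one dict of base
# frequencies per position. Frequencies must be non-zero for log-odds scoring.
EXAMPLE_PWM = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.70, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # degenerate position
]


def log_odds_score(pwm, site, background=0.25):
    """Sum of log2(frequency / background) over the positions of one site."""
    return sum(math.log2(col[base] / background) for col, base in zip(pwm, site))


def scan(pwm, sequence):
    """Score every window of the PWM's width (forward strand only)."""
    width = len(pwm)
    return [
        (start, log_odds_score(pwm, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]


def information_content(pwm):
    """Total information content in bits, assuming a uniform background.

    Each position contributes 2 - H, where H is the Shannon entropy of
    that position's base frequencies.
    """
    total = 0.0
    for col in pwm:
        entropy = -sum(p * math.log2(p) for p in col.values() if p > 0)
        total += 2.0 - entropy
    return total


hits = scan(EXAMPLE_PWM, "TTTACGTTT")
best_start, best_score = max(hits, key=lambda hit: hit[1])
```

In this sketch a fully degenerate position adds nothing to either the score or the information content, which is one way to picture why a low-information motif tolerates many sequence variants.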

At a glance


  1. Figure 1: Evaluation criteria used in this study.

    For each TF, we scored an algorithm's probe intensity predictions using two evaluation criteria, which are illustrated here for TF_16 (Prdm11), using the predictions of BEEML-PBM on the raw array intensity data. (a) Pearson correlation between predicted and actual probe intensities across all ~40,000 probes. (b) AUROC of the set of positive probes. Positive probes (black lines) were defined as all probes on the test array with intensities >4 s.d. above the mean probe intensity for the given array.

  2. Figure 2: Comparison of algorithm performance by TF.

    (a) Final score of each algorithm for each TF. TF name, ID and family are depicted across the columns, and sequence specificity model type and name are depicted across the rows. Algorithms are sorted in decreasing order of final performance across all TFs. TFs are sorted in decreasing order of mean final score across all algorithms. Numbers in parentheses indicate the number of zinc fingers in the protein. (b) Summary statistics for each TF across all algorithms: mean final score, maximum final score achieved by any k-mer, dinucleotide or PWM-based algorithm, Pearson correlation of 8-mer Z-scores between replicate arrays, and the number of 8-mers with E-scores > 0.45 on the training array (normalized by the maximum such value across all TFs). (c) Difference between the best score achieved by any k-mer–based algorithm and the best score achieved by any PWM-based algorithm for each TF.

  3. Figure 3: Comparison of algorithm performance on in vivo data.

    For each algorithm, we trained a model (PWM, 2 PWMs, k-mer or dinucleotide) using PBM data, and gauged its ability to discriminate real from random ChIP peaks using the AUROC (Online Methods). Data for the first five TFs are from mouse ChIP-seq experiments; data for the final four are from yeast ChIP-exo experiments. The color scale is indicated at the bottom. Team_E was not run on the ChIP-exo data, because it requires initialization parameters specific to the individual TF. FeatureREDUCE was run using models of length 8, instead of length 10, owing to the superior performance of this length model on in vivo data (T.R.R. and H.J.B., unpublished data).

  4. Figure 4: Characteristics of Klf9 motifs produced by the eight PWM-based algorithms evaluated in this study.

    The algorithms are ranked top to bottom in order of the overall score of their PWM for this TF in our evaluation scheme. Two popular visualization methods of the PWMs produced by each algorithm are depicted. On the left are traditional sequence logos (refs. 39, 40), which display the information content of each nucleotide at each position; the total information content (I.C.) of the PWM is given to the left of this logo. On the right are frequency logos, in which the height of each nucleotide corresponds to its frequency of occurrence at the given position (ref. 40).
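
The two evaluation criteria described in the Figure 1 caption (Pearson correlation across all probes, and AUROC for probes more than 4 s.d. above the array mean) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the study's evaluation code. The function names, the use of the population standard deviation and the handling of tied scores are all assumptions, and the way the study combines the criteria into a final score is not shown.

```python
import math
import statistics


def pearson(predicted, actual):
    """Pearson correlation between predicted and actual probe intensities."""
    mean_p = statistics.fmean(predicted)
    mean_a = statistics.fmean(actual)
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_p * var_a)


def positive_probes(actual, n_sd=4.0):
    """Label probes whose intensity exceeds mean + n_sd * s.d. of the array.

    Assumption: the population standard deviation is used.
    """
    threshold = statistics.fmean(actual) + n_sd * statistics.pstdev(actual)
    return [value > threshold for value in actual]


def auroc(predicted, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. This pairwise form is quadratic in the worst case,
    which is acceptable for a sketch because positives are rare.
    """
    pos = [s for s, is_pos in zip(predicted, labels) if is_pos]
    neg = [s for s, is_pos in zip(predicted, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A typical call would compute `labels = positive_probes(actual)` once per test array and then report `pearson(predicted, actual)` and `auroc(predicted, labels)` for each algorithm's predictions.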

References


  1. Stormo, G.D., Schneider, T.D., Gold, L. & Ehrenfeucht, A. Use of the 'Perceptron' algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Res. 10, 2997–3011 (1982).
  2. Berg, O.G. & von Hippel, P.H. Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J. Mol. Biol. 193, 723–743 (1987).
  3. Stormo, G.D. Consensus patterns in DNA. Methods Enzymol. 183, 211–221 (1990).
  4. Siddharthan, R. Dinucleotide weight matrices for predicting transcription factor binding sites: generalizing the position weight matrix. PLoS ONE 5, e9722 (2010).
  5. Zhao, X., Huang, H. & Speed, T.P. Finding short DNA motifs using permuted Markov models. J. Comput. Biol. 12, 894–906 (2005).
  6. Sharon, E., Lubliner, S. & Segal, E. A feature-based approach to modeling protein-DNA interactions. PLoS Comput. Biol. 4, e1000154 (2008).
  7. Badis, G. et al. Diversity and complexity in DNA recognition by transcription factors. Science 324, 1720–1723 (2009).
  8. Nutiu, R. et al. Direct measurement of DNA affinity landscapes on a high-throughput sequencing instrument. Nat. Biotechnol. 29, 659–664 (2011).
  9. Maerkl, S.J. & Quake, S.R. A systems approach to measuring the binding energy landscapes of transcription factors. Science 315, 233–237 (2007).
  10. Agius, P., Arvey, A., Chang, W., Noble, W.S. & Leslie, C. High resolution models of transcription factor-DNA affinities improve in vitro and in vivo binding predictions. PLoS Comput. Biol. 6, e1000916 (2010).
  11. Annala, M., Laurila, K., Lähdesmäki, H. & Nykter, M. A linear model for transcription factor binding affinity prediction in protein binding microarrays. PLoS ONE 6, e20059 (2011).
  12. Zhao, Y., Granas, D. & Stormo, G.D. Inferring binding energies from selected binding sites. PLoS Comput. Biol. 5, e1000590 (2009).
  13. Slattery, M. et al. Cofactor binding evokes latent differences in DNA binding specificity between Hox proteins. Cell 147, 1270–1282 (2011).
  14. Jolma, A. et al. Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. Genome Res. 20, 861–873 (2010).
  15. Zykovich, A., Korf, I. & Segal, D.J. Bind-n-Seq: high-throughput analysis of in vitro protein-DNA interactions using massively parallel sequencing. Nucleic Acids Res. 37, e151 (2009).
  16. Fordyce, P.M. et al. De novo identification and biophysical characterization of transcription-factor binding sites with microfluidic affinity analysis. Nat. Biotechnol. 28, 970–975 (2010).
  17. Warren, C.L. et al. Defining the sequence-recognition profile of DNA-binding molecules. Proc. Natl. Acad. Sci. USA 103, 867–872 (2006).
  18. Meng, X., Brodsky, M.H. & Wolfe, S.A. A bacterial one-hybrid system for determining the DNA-binding specificity of transcription factors. Nat. Biotechnol. 23, 988–994 (2005).
  19. Berger, M.F. et al. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat. Biotechnol. 24, 1429–1435 (2006).
  20. Stormo, G.D. & Zhao, Y. Determining the specificity of protein-DNA interactions. Nat. Rev. Genet. 11, 751–760 (2010).
  21. Prill, R.J. et al. Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS ONE 5, e9202 (2010).
  22. Stolovitzky, G., Monroe, D. & Califano, A. Dialogue on reverse-engineering assessment and methods: the DREAM of high-throughput pathway inference. Ann. NY Acad. Sci. 1115, 1–22 (2007).
  23. Stolovitzky, G., Prill, R.J. & Califano, A. Lessons from the DREAM2 Challenges. Ann. NY Acad. Sci. 1158, 159–195 (2009).
  24. Zhao, Y. & Stormo, G.D. Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nat. Biotechnol. 29, 480–483 (2011).
  25. Zhao, Y., Ruan, S., Pandey, M. & Stormo, G.D. Improved models for transcription factor binding site identification using non-independent interactions. Genetics 191, 781–790 (2012).
  26. Foat, B.C., Morozov, A.V. & Bussemaker, H.J. Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics 22, e141–e149 (2006).
  27. Chen, X., Hughes, T.R. & Morris, Q. RankMotif++: a motif-search algorithm that accounts for relative ranks of K-mers in binding transcription factors. Bioinformatics 23, i72–i79 (2007).
  28. Berger, M.F. et al. Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell 133, 1266–1276 (2008).
  29. Rhee, H.S. & Pugh, B.F. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 147, 1408–1419 (2011).
  30. Wei, G.H. et al. Genome-wide analysis of ETS-family DNA-binding in vitro and in vivo. EMBO J. 29, 2147–2160 (2010).
  31. de Boer, C.G. & Hughes, T.R. YeTFaSCo: a database of evaluated yeast transcription factor sequence specificities. Nucleic Acids Res. 40, D169–D179 (2012).
  32. Kulakovskiy, I.V., Boeva, V.A., Favorov, A.V. & Makeev, V.J. Deep and wide digging for binding motifs in ChIP-Seq data. Bioinformatics 26, 2622–2623 (2010).
  33. Machanick, P. & Bailey, T.L. MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics 27, 1696–1697 (2011).
  34. Zhu, C. et al. High-resolution DNA-binding specificity analysis of yeast transcription factors. Genome Res. 19, 556–566 (2009).
  35. John, S., Marais, R., Child, R., Light, Y. & Leonard, W.J. Importance of low affinity Elf-1 sites in the regulation of lymphoid-specific inducible gene expression. J. Exp. Med. 183, 743–750 (1996).
  36. Tanay, A. Extensive low-affinity transcriptional interactions in the yeast genome. Genome Res. 16, 962–972 (2006).
  37. Jaeger, S.A. et al. Conservation and regulatory associations of a wide affinity range of mouse transcription factor binding sites. Genomics 95, 185–195 (2010).
  38. Segal, E., Raveh-Sadka, T., Schroeder, M., Unnerstall, U. & Gaul, U. Predicting expression patterns from regulatory sequence in Drosophila segmentation. Nature 451, 535–540 (2008).
  39. Schneider, T.D. & Stephens, R.M. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 18, 6097–6100 (1990).
  40. Crooks, G.E., Hon, G., Chandonia, J.M. & Brenner, S.E. WebLogo: a sequence logo generator. Genome Res. 14, 1188–1190 (2004).
  41. Keilwagen, J. et al. De-novo discovery of differentially abundant transcription factor binding sites including their positional preference. PLoS Comput. Biol. 7, e1001070 (2011).
  42. Bailey, T.L. & Elkan, C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2, 28–36 (1994).
  43. Schutz, F. & Delorenzi, M. MAMOT: hidden Markov modeling tool. Bioinformatics 24, 1399–1400 (2008).
  44. Kinney, J.B., Tkacik, G. & Callan, C.G. Jr. Precise physical models of protein-DNA interaction from high-throughput data. Proc. Natl. Acad. Sci. USA 104, 501–506 (2007).
  45. Kinney, J.B., Murugan, A., Callan, C.G. Jr. & Cox, E.C. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc. Natl. Acad. Sci. USA 107, 9158–9163 (2010).
  46. Linhart, C., Halperin, Y. & Shamir, R. Transcription factor and microRNA motif discovery: the Amadeus platform and a compendium of metazoan target sets. Genome Res. 18, 1180–1189 (2008).
  47. Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc., B 58, 267–288 (1996).
  48. Chen, C.Y. et al. Discovering gapped binding sites of yeast transcription factors. Proc. Natl. Acad. Sci. USA 105, 2527–2532 (2008).
  49. Philippakis, A.A., Qureshi, A.M., Berger, M.F. & Bulyk, M.L. Design of compact, universal DNA microarrays for protein binding microarray experiments. J. Comput. Biol. 15, 655–665 (2008).
  50. Lam, K.N., van Bakel, H., Cote, A.G., van der Ven, A. & Hughes, T.R. Sequence specificity is obtained from the majority of modular C2H2 zinc-finger arrays. Nucleic Acids Res. 39, 4680–4690 (2011).
  51. Finn, R.D. et al. The Pfam protein families database. Nucleic Acids Res. 38, D211–D222 (2010).
  52. Eddy, S.R. A new generation of homology search tools based on probabilistic inference. Genome Inform. 23, 205–211 (2009).
  53. Chen, L., Wu, G. & Ji, H. hmChIP: a database and web server for exploring publicly available human and mouse ChIP-seq and ChIP-chip data. Bioinformatics 27, 1447–1448 (2011).
  54. Parkinson, H. et al. ArrayExpress update–an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res. 39, D1002–D1004 (2011).
  55. Barrett, T. et al. NCBI GEO: archive for functional genomics data sets–10 years on. Nucleic Acids Res. 39, D1005–D1010 (2011).
  56. Dreszer, T.R. et al. The UCSC Genome Browser database: extensions and updates 2011. Nucleic Acids Res. 40, D918–D923 (2012).


Author information


  1. Banting and Best Department of Medical Research and Donnelly Centre, University of Toronto, Toronto, Ontario, Canada.

    • Matthew T Weirauch,
    • Atina Cote,
    • Shaheynoor Talukder,
    • Quaid D Morris &
    • Timothy R Hughes
  2. Center for Autoimmune Genomics and Etiology (CAGE) and Divisions of Rheumatology and Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, USA.

    • Matthew T Weirauch
  3. IBM Computational Biology Center, Yorktown Heights, New York, USA.

    • Raquel Norel &
    • Gustavo Stolovitzky
  4. Department of Signal Processing, Tampere University of Technology, Tampere, Finland.

    • Matti Annala
  5. Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania, USA.

    • Yue Zhao
  6. Department of Biological Sciences, Columbia University, and Center for Computational Biology and Bioinformatics, Columbia University Medical Center, New York, New York, USA.

    • Todd R Riley &
    • Harmen J Bussemaker
  7. EMBL-EBI European Bioinformatics Institute, Cambridge, UK.

    • Julio Saez-Rodriguez &
    • Thomas Cokelaer
  8. Department of Medicine, Division of Genetics, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts, USA.

    • Anastasia Vedenko &
    • Martha L Bulyk
  9. Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada.

    • Quaid D Morris &
    • Timothy R Hughes
  10. Department of Pathology, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts, USA.

    • Martha L Bulyk
  11. Harvard-MIT Division of Health Sciences and Technology, Harvard Medical School, Boston, Massachusetts, USA.

    • Martha L Bulyk
  12. Computational Biology Program, Sloan-Kettering Institute, Memorial Sloan-Kettering Cancer Center, New York, New York, USA.

    • Phaedra Agius,
    • Aaron Arvey &
    • Christina Leslie
  13. Swiss Institute of Bioinformatics, Lausanne, Switzerland.

    • Philipp Bucher,
    • Vidhya Jagannathan &
    • Christoph D Schmid
  14. EPFL (École Polytechnique Fédérale de Lausanne) SV ISREC (The Swiss Institute for Experimental Cancer Research) GR-BUCHER, Lausanne, Switzerland.

    • Philipp Bucher
  15. Department of Physics, Princeton University, Princeton, New Jersey, USA.

    • Curtis G Callan Jr &
    • Anand Murugan
  16. Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA.

    • Curtis G Callan Jr
  17. Genome Institute of Singapore, Singapore.

    • Cheng Wei Chang &
    • Wing-Kin Sung
  18. Department of Bio-Industrial Mechatronics Engineering, National Taiwan University, Taipei, Taiwan.

    • Chien-Yu Chen,
    • Yong-Syuan Chen &
    • Yu-Wei Chu
  19. Graduate Institute of Communication Engineering, National Taiwan University, Taipei, Taiwan.

    • Yu-Wei Chu
  20. Institute of Computer Science, Martin Luther University, Halle-Wittenberg, Germany.

    • Jan Grau,
    • Ivo Grosse &
    • Stefan Posch
  21. Institute for Genetics, University of Bern, Bern, Switzerland.

    • Vidhya Jagannathan
  22. Leibniz Institute of Plant Genetics and Crop Plant Research, Gatersleben, Germany.

    • Jens Keilwagen
  23. Max Planck Institute for Molecular Genetics, Berlin, Germany.

    • Szymon M Kiełbasa,
    • Alena Myšičková &
    • Martin Vingron
  24. Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA.

    • Justin B Kinney
  25. MicroDiscovery GmbH, Berlin, Germany.

    • Holger Klein
  26. Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Warsaw, Poland.

    • Miron B Kursa &
    • Witold R Rudnicki
  27. Department of Information and Computer Science, Aalto University School of Science and Technology, Aalto, Finland.

    • Harri Lähdesmäki
  28. Turku Centre for Biotechnology, Turku University, Turku, Finland.

    • Harri Lähdesmäki
  29. Department of Signal Processing, Tampere University of Technology, Tampere, Finland.

    • Kirsti Laurila &
    • Matti Nykter
  30. Department of Computer Science, University of Texas at San Antonio, San Antonio, Texas, USA.

    • Chengwei Lei &
    • Jianhua Ruan
  31. Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel.

    • Chaim Linhart,
    • Yaron Orenstein &
    • Ron Shamir
  32. Department of Genome Sciences, University of Washington, Seattle, Washington, USA.

    • William Stafford Noble
  33. Swiss Tropical and Public Health Institute (Swiss TPH), Basel, Switzerland.

    • Christoph D Schmid
  34. University of Basel, Basel, Switzerland.

    • Christoph D Schmid
  35. School of Computing, National University of Singapore, Singapore.

    • Wing-Kin Sung &
    • Zhizhuo Zhang


  1. DREAM5 Consortium

    • Phaedra Agius,
    • Aaron Arvey,
    • Philipp Bucher,
    • Curtis G Callan Jr,
    • Cheng Wei Chang,
    • Chien-Yu Chen,
    • Yong-Syuan Chen,
    • Yu-Wei Chu,
    • Jan Grau,
    • Ivo Grosse,
    • Vidhya Jagannathan,
    • Jens Keilwagen,
    • Szymon M Kiełbasa,
    • Justin B Kinney,
    • Holger Klein,
    • Miron B Kursa,
    • Harri Lähdesmäki,
    • Kirsti Laurila,
    • Chengwei Lei,
    • Christina Leslie,
    • Chaim Linhart,
    • Anand Murugan,
    • Alena Myšičková,
    • William Stafford Noble,
    • Matti Nykter,
    • Yaron Orenstein,
    • Stefan Posch,
    • Jianhua Ruan,
    • Witold R Rudnicki,
    • Christoph D Schmid,
    • Ron Shamir,
    • Wing-Kin Sung,
    • Martin Vingron &
    • Zhizhuo Zhang


M.T.W. and T.R.H. wrote the manuscript. T.R.H., M.T.W., M.L.B. and A.V. conceived of the study. M.T.W. did the majority of the computational analyses. M.A., Y.Z. and T.R.R. did additional computational analyses. A.C. and S.T. performed the PBM experiments. T.R.H., M.T.W., G.S. and R.N. designed and carried out the DREAM5 TF challenge. The DREAM5 Consortium and M.A. participated in the DREAM5 TF challenge. R.N., J.S.-R., T.C. and M.T.W. designed and created the prediction server. M.L.B., G.S., Q.D.M. and H.J.B. provided critical feedback on the manuscript.

Competing financial interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to:


Supplementary information

PDF files

  1. Supplementary Text and Figures (9 MB)

    Supplementary Notes 1–9, Supplementary Tables 1–8 and Supplementary Figures 1–4

Excel files

  1. Supplementary Table 1 (35 KB)

    Information on transcription factors and associated experiments

  2. Supplementary Table 3 (93 KB)

    Full evaluations for all algorithms, by TF

  3. Supplementary Table 6 (49 KB)

    Improvement of secondary over primary motifs, for each TF

  4. Supplementary Table 7 (27 KB)

    Full comparison to ChIP-seq and ChIP-exo data

  5. Supplementary Table 8 (46 KB)

    Information on plasmids used for PBMs in this study

Zip files

  1. Supplementary Code 1 (34 KB)

    Final set of PWMs for each transcription factor

  2. Supplementary Code 2 (4 MB)

    Algorithm source code
