Similarity regression predicts evolution of transcription factor sequence specificity


Transcription factor (TF) binding specificities (motifs) are essential for the analysis of gene regulation. Accurate prediction of TF motifs is critical, because it is infeasible to assay all TFs in all sequenced eukaryotic genomes. There is ongoing controversy regarding the degree of motif diversification among related species that is, in part, because of uncertainty in motif prediction methods. Here we describe similarity regression, a significantly improved method for predicting motifs, which we use to update and expand the Cis-BP database. Similarity regression inherently quantifies TF motif evolution, and shows that previous claims of near-complete conservation of motifs between human and Drosophila are inflated, with nearly half of the motifs in each species absent from the other, largely due to extensive divergence in C2H2 zinc finger proteins. We conclude that diversification in DNA-binding motifs is pervasive, and present a new tool and updated resource to study TF diversity and gene regulation across eukaryotes.


Fig. 1: Overview of the similarity regression method.
Fig. 2: Similarity regression classification of TFs for highly similar or dissimilar sequence specificities.
Fig. 3: PBM data from the plant C. sativa and model fungi A. nidulans and N. crassa for TFs with conserved and dissimilar motifs.
Fig. 4: Conservation of TF motifs within major eukaryotic kingdoms.
Fig. 5: Motif divergence of TF families in metazoans and plants.
Fig. 6: TF motif conservation between human and Drosophila melanogaster.

Data availability

New PBM data and motifs are deposited in GEO (accession number GSE121420) and the Cis-BP database (v.2.0;

Code availability

The Similarity Regression code, and examples, are available on GitHub (




We thank Xiaoting Chen and Mario Pujato for computational support. S.A.L. was funded by a Natural Sciences and Engineering Research Council of Canada Doctoral Fellowship. T.R.H. holds the Billes Chair of Medical Research at the University of Toronto. This work was supported by a Canadian Institutes of Health Research grant (FDN-148403) and a Natural Sciences and Engineering Research Council of Canada grant (RPGIN-2016-05643) to T.R.H., National Institutes of Health (NIH) grants R01 AR073228, R01 NS099068 and R01 GM055479, Lupus Research Alliance ‘Novel Approaches’, CCRF Endowed Scholar and CCHMC CpG Award 53553 to M.T.W. and a Canadian Institutes of Health Research Operating grant (MOP-125894) to Q.D.M. and T.R.H.

Author information




S.A.L., M.T.W. and T.R.H. conceived the study and oversaw it to completion. S.A.L. analyzed the data, made the figures and performed all computational analyses except for experiments for which A.S. reimplemented the affinity regression pipeline and applied it to new data. Q.D.M. guided the computational and statistical analyses. M.A., S.A.L. and M.T.W. maintained and updated the Cis-BP database. G.C. and M.X.C. produced the clones for Aspergillus PBM experiments. A.W.H.Y. produced the remainder of the clones and performed all PBM experiments. S.A.L. and T.R.H. wrote the manuscript with feedback and approval from all authors.

Corresponding author

Correspondence to Timothy R. Hughes.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Additional similarity regression model building and selection details.

Four similarity regression (SR) models are built for each TF family and compared with alignment percent identity to identify the best model. The best model is selected after cross-validation and threshold selection by Matthews correlation coefficient. Homeodomains are used as the example in this figure; their 465 PBM constructs yield 13,832 highly similar, 22,714 ambiguous and 71,793 dissimilar pairs.
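As a rough illustration of threshold selection by Matthews correlation coefficient (MCC), the sketch below scans candidate score cutoffs and keeps the one with the highest MCC. It is not the authors' code: the function names, the toy scores and the reduction to two classes are assumptions made for the example, whereas the classification described here has three classes (highly similar, ambiguous and dissimilar).

```python
import math

def mcc(labels, predictions):
    """Binary Matthews correlation coefficient; returns 0.0 when undefined."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_threshold(scores, labels):
    """Return (threshold, mcc) maximising MCC when score >= threshold is called positive."""
    best = (None, -1.0)
    for t in sorted(set(scores)):
        value = mcc(labels, [s >= t for s in scores])
        if value > best[1]:
            best = (t, value)
    return best

# Toy data (hypothetical): model scores and whether each TF pair is truly similar.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [True, True, True, False, True, False]
threshold, value = best_threshold(scores, labels)
print(threshold, round(value, 3))
```

In practice the cutoff would be chosen on held-out folds during cross-validation rather than on the training pairs themselves.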

Supplementary Figure 2 Application of similarity regression to TFs with an array of DBDs.

(a) DBDs are first aligned to find the best (maximizing amino acid identity) ungapped and internal alignment. Examples of permissible and non-permissible alignment configurations are shown. (b) Alignments are then scored by calculating positional protein similarity features in each finger of a DBD array (for example, C2H2 ZFs), and combined into a single representation by averaging the features by the length of the longest DBD array.
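The alignment step in this legend can be sketched as follows. The example slides the shorter array along the longer one with no gaps and no overhang, keeps the offset with the highest summed identity, and divides by the length of the longer array. Several things here are assumptions for illustration only: the reading of "internal" as "no overhang", the use of per-finger identity as the single feature, the requirement that fingers have equal length, and all function names.

```python
def finger_identity(a, b):
    """Fraction of identical positions between two equal-length finger sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def best_internal_offset(array_a, array_b):
    """Find the best ungapped placement of the shorter array inside the longer one.

    Returns (offset, score), where score is the summed per-finger identity
    divided by the number of fingers in the longer array.
    """
    if len(array_a) >= len(array_b):
        longer, shorter = array_a, array_b
    else:
        longer, shorter = array_b, array_a
    best_offset, best_sum = 0, -1.0
    # Only offsets that keep the shorter array fully inside the longer one.
    for offset in range(len(longer) - len(shorter) + 1):
        total = sum(
            finger_identity(longer[offset + i], finger)
            for i, finger in enumerate(shorter)
        )
        if total > best_sum:
            best_offset, best_sum = offset, total
    return best_offset, best_sum / len(longer)

# Toy arrays (hypothetical sequences, four residues per finger).
array_a = ["AAAA", "CCCC", "DDDD"]
array_b = ["CCCC", "DDDE"]
offset, score = best_internal_offset(array_a, array_b)
print(offset, round(score, 3))
```

Dividing by the longer array's length means that unmatched fingers lower the score, so a two-finger protein cannot look identical to a three-finger one.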

Supplementary Figure 3 Comparison of similarity regression weights to known DNA-contacting residues.

(a) Homeodomain, or (b) C2H2 ZF similarity regression (SR) weights are compared to DNAproDB contact frequencies for DNA backbone, major and minor groove contacts, using partial Pearson correlations. TF amino acid sequence diversity (for the similarity regression model training sequences) is displayed, for reference (above). Figures comparing contact frequencies with similarity regression weights are provided for all similarity regression models in Supplementary Data 1. (c) Partial correlations for 25 TF families with similarity regression models and structural information in DNAproDB (Sagendorf, J.M., et al., Nucleic Acids Res. 45, W89-W97, 2017) are displayed and coloured according to the statistical significance, as -log10(p-value).
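For readers unfamiliar with partial Pearson correlation, the sketch below computes the first-order version: the correlation between x and y after removing the linear effect of a third variable z. The legend does not say which variable is controlled for, so z is a hypothetical covariate, and the function names and toy numbers are likewise assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def partial_pearson(x, y, z):
    """Correlation of x and y controlling for a single covariate z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Toy data (hypothetical): x and y look positively correlated only because
# both track z; once z is controlled for, the relationship reverses.
x = [1, 2, 3, 4]
y = [2, 1, 4, 3]
z = [1, 1, 2, 2]
print(round(pearson(x, y), 3), round(partial_pearson(x, y, z), 3))
```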

Supplementary Figure 4 Comparison of similarity regression to percent identity at predicting TF pairs with dissimilar specificities.

(a) Scatter plot comparing the fraction of all dissimilar TF pairs captured by the 95% NPV threshold (specificity) for 17 TF families that have dissimilar TFs. (b) Scatter plot showing Matthews correlation coefficient, which summarizes multi-class (for highly similar, ambiguous, and dissimilar TF sequence specificity) classification accuracy for 29 TF families. In both panels, points are sized according to the number of PBM experiments used for training, and coloured according to the AA features used in each model.
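One way to read the 95% NPV threshold in panel a is sketched below: pairs scoring at or below a cutoff are called dissimilar, the cutoff is raised as far as possible while at least 95% of the called pairs are truly dissimilar, and the fraction of all dissimilar pairs captured is then reported. This is an assumed interpretation, not the authors' procedure, and the function name, tie handling and toy data are hypothetical.

```python
def npv_threshold(scores, is_dissimilar, min_npv=0.95):
    """Return (threshold, fraction_captured) or None if no cutoff qualifies.

    Pairs with score <= threshold are called dissimilar. The largest
    threshold whose negative predictive value is >= min_npv is returned.
    """
    pairs = sorted(zip(scores, is_dissimilar))
    total_dissimilar = sum(1 for flag in is_dissimilar if flag)
    if total_dissimilar == 0:
        return None
    best = None
    called = true_negatives = 0
    i = 0
    while i < len(pairs):
        cutoff = pairs[i][0]
        # Pairs sharing the same score are added together.
        while i < len(pairs) and pairs[i][0] == cutoff:
            called += 1
            true_negatives += 1 if pairs[i][1] else 0
            i += 1
        if true_negatives / called >= min_npv:
            best = (cutoff, true_negatives / total_dissimilar)
    return best

# Toy data (hypothetical): similarity scores and true dissimilarity labels.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
dissimilar = [True, True, True, False, True, False]
strict = npv_threshold(scores, dissimilar)
relaxed = npv_threshold(scores, dissimilar, min_npv=0.75)
print(strict, relaxed)
```

A stricter NPV requirement trades coverage for confidence: in the toy data the 95% cutoff captures three of the four dissimilar pairs, while the 75% cutoff captures all four.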

Supplementary Figure 5 Comparison of similarity regression scores with experimentally determined similarity in DNA sequence specificity, for new PBM data.

Predicted TF similarity (similarity regression (SR) score) and actual DNA-binding similarity (PBM E-score overlap) are plotted for 275 PBM experiments, versus the most similar (by similarity regression score) TF in the training set. Results are displayed for each TF family with more than three TFs. Linear fit is shown, with corresponding R² value. Points are coloured by their actual TF similarity based on family-specific E-score overlap thresholds.

Supplementary Figure 6 Comparison of predicted Z-score profiles for similarity regression, affinity regression and percent identity.

(a) Individual points show the Pearson correlation coefficient of predicted vs. actual Z-score profiles for 315 TFs (those among the 340 that have similarity regression models), for each reconstruction method tested (similarity regression (SR), affinity regression (AR) and percent identity). Reconstruction methods are grouped by whether they use a single nearest-neighbour (NN) profile or a mixture of multiple TF profiles (Z-score reconstructions), as indicated by grey bars above. Points are coloured by TF family (see legend). (b-d) Individual results for the three most abundant TF families in the test set are plotted separately: (b) C2H2 ZFs (n = 34), (c) Homeodomain (n = 17), and (d) zinc cluster (n = 107). Boxplots are defined with center line, median; box limits, upper and lower quartiles; whiskers, smallest or largest data point within 1.5× interquartile range.

Supplementary Figure 7 Comparison of similarity regression predicted motifs with C2H2 and homeodomain-recognition code predictions.

(a) Individual points show the motif similarity of predicted vs. actual PFMs, for the Homeodomain motif prediction methods tested (recognition codes, similarity regression (SR), and percent identity). Boxplots summarizing the predictions for 17 TFs are coloured by the motif prediction method and defined with center line, median; box limits, upper and lower quartiles; whiskers, smallest or largest data point within 1.5× interquartile range. (b) Motif similarity of predicted vs. actual PFMs for C2H2 ZF prediction methods, displayed as in panel a, for 34 TFs.

Supplementary Figure 8 Increase in percentage of TFs with a predicted motif in CIS-BP (Similarity Regression compared to percent identity).

(a) The percentage of TFs with a ‘Direct’ (that is experimentally determined) (black bars), or predicted (grey bars) motif are plotted for the 50 largest TF families in CIS-BP. Increase in percentage due to similarity regression (SR) models is shown by red bars. Total number of TFs encompassed is shown at right. (b) Motif coverage in well-studied eukaryotes, plotted as in panel a. Relationships between the species are represented by divergence time (million years ago) obtained from the TimeTree database (Kumar, S., et al., Mol Biol Evol. 34, 1812-1819, 2017). The major clades of fungi, metazoans, and plants are coloured in red, blue, and green respectively.

Supplementary Figure 9 Motif divergence of TF families in fungi.

Classifications of motif similarity are shown as in Fig. 5. The outer ring of each nested pie chart represents the similarities of Saccharomyces cerevisiae TFs with respect to the species being compared (displayed along the phylogeny). The inner ring represents the similarities of the compared species' TFs with respect to S. cerevisiae. Branch length is the divergence time between species (millions of years).

Supplementary Figure 10 Motif similarity between corresponding Drosophila and human TFs (highest scoring BLASTP hits with Drosophila as query).

Motif similarity was calculated between 322 fly (1128 PWMs) and 251 human TFs (with 2177 PWMs) with experimentally determined motifs, using MoSBAT energy scores (Lambert, S.A., et al., Bioinformatics. 32, 3504–3506, 2016). (a) The maximum motif similarity for all pairs of human and fly TFs (that is considering that there are often multiple motifs per TF) is displayed as a boxplot, according to the Similarity Regression-predicted TF similarity for each NN pair. (b) Similar plot as panel a, but only HT-SELEX data (306 fly PWMs and 410 human PWMs) is used in the analysis. Boxplots are defined with: center line, median; box limits, upper and lower quartiles; whiskers, smallest or largest data point within 1.5× interquartile range; points, outliers.


Cite this article

Lambert, S.A., Yang, A.W.H., Sasse, A. et al. Similarity regression predicts evolution of transcription factor sequence specificity. Nat Genet 51, 981–989 (2019).
