Transcription factor (TF) binding specificities (motifs) are essential for the analysis of gene regulation. Accurate prediction of TF motifs is critical, because it is infeasible to assay all TFs in all sequenced eukaryotic genomes. There is ongoing controversy regarding the degree of motif diversification among related species that is, in part, because of uncertainty in motif prediction methods. Here we describe similarity regression, a significantly improved method for predicting motifs, which we use to update and expand the Cis-BP database. Similarity regression inherently quantifies TF motif evolution, and shows that previous claims of near-complete conservation of motifs between human and Drosophila are inflated, with nearly half of the motifs in each species absent from the other, largely due to extensive divergence in C2H2 zinc finger proteins. We conclude that diversification in DNA-binding motifs is pervasive, and present a new tool and updated resource to study TF diversity and gene regulation across eukaryotes.
The Similarity Regression code, and examples, are available on GitHub (https://github.com/smlmbrt/SimilarityRegression).
Stormo, G. D. DNA binding sites: representation and discovery. Bioinformatics 16, 16–23 (2000).
Mathelier, A. et al. JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 44, D110–D115 (2016).
Weirauch, M. T. et al. Determination and inference of eukaryotic transcription factor sequence specificity. Cell 158, 1431–1443 (2014).
Pelossof, R. et al. Affinity regression predicts the recognition code of nucleic acid-binding proteins. Nat. Biotechnol. 33, 1242–1249 (2015).
Christensen, R. G. et al. Recognition models to predict DNA-binding specificities of homeodomain proteins. Bioinformatics 28, i84–i89 (2012).
Persikov, A. V. et al. A systematic survey of the Cys2His2 zinc finger DNA-binding landscape. Nucleic Acids Res. 43, 1965–1984 (2015).
Najafabadi, H. S. et al. C2H2 zinc finger proteins greatly expand the human regulatory lexicon. Nat. Biotechnol. 33, 555–562 (2015).
Nitta, K. R. et al. Conservation of transcription factor binding specificities across 600 million years of bilateria evolution. eLife 4, e04837 (2015).
Liu, H., Chang, L. H., Sun, Y., Lu, X. & Stubbs, L. Deep vertebrate roots for mammalian zinc finger transcription factor subfamilies. Genome Biol. Evol. 6, 510–525 (2014).
Nadimpalli, S., Persikov, A. V. & Singh, M. Pervasive variation of transcription factor orthologs contributes to regulatory network evolution. PLoS Genet. 11, e1005011 (2015).
Lynch, V. J. & Wagner, G. P. Resurrecting the role of transcription factor change in developmental evolution. Evolution 62, 2131–2154 (2008).
Baker, C. R., Tuch, B. B. & Johnson, A. D. Extensive DNA-binding specificity divergence of a conserved transcription regulator. Proc. Natl Acad. Sci. USA 108, 7493–7498 (2011).
Sayou, C. et al. A promiscuous intermediate underlies the evolution of LEAFY DNA binding specificity. Science 343, 645–648 (2014).
Morgunova, E. et al. Structural insights into the DNA-binding specificity of E2F family transcription factors. Nat. Commun. 6, 10050 (2015).
McKeown, A. N. et al. Evolution of DNA specificity in a transcription factor family produced a new gene regulatory module. Cell 159, 58–68 (2014).
Najafabadi, H. S. et al. Non-base-contacting residues enable kaleidoscopic evolution of metazoan C2H2 zinc finger DNA binding. Genome Biol. 18, 167 (2017).
Berger, M. F. et al. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat. Biotechnol. 24, 1429–1435 (2006).
Weirauch, M. T. et al. Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 31, 126–134 (2013).
Love, J. J. et al. Structural basis for DNA bending by the architectural transcription factor LEF-1. Nature 376, 791–795 (1995).
Marmorstein, R., Carey, M., Ptashne, M. & Harrison, S. C. DNA recognition by GAL4: structure of a protein–DNA complex. Nature 356, 408–414 (1992).
King, D. A., Zhang, L., Guarente, L. & Marmorstein, R. Structure of a HAP1–DNA complex reveals dramatically asymmetric DNA binding by a homodimeric protein. Nat. Struct. Biol. 6, 64–71 (1999).
Persikov, A. V. & Singh, M. De novo prediction of DNA-binding specificities for Cys2His2 zinc finger proteins. Nucleic Acids Res. 42, 97–108 (2014).
Gupta, A. et al. An improved predictive recognition model for Cys2-His2 zinc finger proteins. Nucleic Acids Res. 42, 4800–4812 (2014).
de Mendoza, A. et al. Transcription factor evolution in eukaryotes and the assembly of the regulatory toolkit in multicellular lineages. Proc. Natl Acad. Sci. USA 110, E4858–E4866 (2013).
Narasimhan, K. et al. Mapping and analysis of Caenorhabditis elegans transcription factor sequence specificities. eLife 4, e06967 (2015).
Robinson-Rechavi, M., Maina, C. V., Gissendanner, C. R., Laudet, V. & Sluder, A. Explosive lineage-specific expansion of the orphan nuclear receptor HNF4 in nematodes. J. Mol. Evol. 60, 577–586 (2005).
Stracke, R., Werber, M. & Weisshaar, B. The R2R3-MYB gene family in Arabidopsis thaliana. Curr. Opin. Plant Biol. 4, 447–456 (2001).
Grove, C. A. et al. A multiparameter network reveals extensive divergence between C. elegans bHLH transcription factors. Cell 138, 314–327 (2009).
Reinke, A. W., Baek, J., Ashenberg, O. & Keating, A. E. Networks of bZIP protein–protein interactions diversified over a billion years of evolution. Science 340, 730–734 (2013).
Jolma, A. et al. Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. Genome Res. 20, 861–873 (2010).
Noyes, M. B. et al. A systematic characterization of factors that regulate Drosophila segmentation via a bacterial one-hybrid system. Nucleic Acids Res. 36, 2547–2560 (2008).
Zhu, L. J. et al. FlyFactorSurvey: a database of Drosophila transcription factor binding specificities determined using the bacterial one-hybrid system. Nucleic Acids Res. 39, D111–D117 (2011).
MacPherson, S., Larochelle, M. & Turcotte, B. A fungal family of transcriptional regulators: the zinc cluster proteins. Microbiol. Mol. Biol. Rev. 70, 583–604 (2006).
Lambert, S. A. et al. The human transcription factors. Cell 175, 598–599 (2018).
Ecco, G., Imbeault, M. & Trono, D. KRAB zinc finger proteins. Development 144, 2719–2729 (2017).
Schmitges, F. W. et al. Multiparameter functional diversity of human C2H2 zinc finger proteins. Genome Res. 26, 1742–1752 (2016).
Noyes, M. B. et al. Analysis of homeodomain specificities allows the family-wide prediction of preferred recognition sites. Cell 133, 1277–1289 (2008).
Wilkinson, S. P. aphid: an R package for analysis with profile hidden Markov models. Bioinformatics https://doi.org/10.1093/bioinformatics/btz159 (2019).
Henikoff, S. & Henikoff, J. G. Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA 89, 10915–10919 (1992).
R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2013); http://www.R-project.org/
Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010).
Kuhn, M. Building predictive models in R using the caret package. J. Stat. Softw. 28, 1–26 (2008).
Gorodkin, J. Comparing two K-category assignments by a K-category correlation coefficient. Comput. Biol. Chem. 28, 367–374 (2004).
Sagendorf, J. M., Berman, H. M. & Rohs, R. DNAproDB: an interactive tool for structural analysis of DNA–protein complexes. Nucleic Acids Res. 45, W89–W97 (2017).
Berman, H. M. et al. The protein data bank. Nucleic Acids Res. 28, 235–242 (2000).
HMMER: biosequence analysis using profile hidden Markov models (Howard Hughes Medical Institute, 2015); http://hmmer.org/
Finn, R. D. et al. The Pfam protein families database. Nucleic Acids Res. 38, D211–D222 (2010).
Lambert, S. A., Albu, M., Hughes, T. R. & Najafabadi, H. S. Motif comparison based on similarity of binding affinity profiles. Bioinformatics 32, 3504–3506 (2016).
Yin, Y. et al. Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science 356, eaaj2239 (2017).
O’Malley, R. C. et al. Cistrome and epicistrome features shape the regulatory DNA landscape. Cell 165, 1280–1292 (2016).
Barazandeh, M., Lambert, S. A., Albu, M. & Hughes, T. R. Comparison of ChIP-seq data and a reference motif set for human KRAB C2H2 zinc finger proteins. G3 (Bethesda) 8, 219–229 (2018).
Hume, M. A., Barrera, L. A., Gisselbrecht, S. S. & Bulyk, M. L. UniPROBE, update 2015: new tools and content for the online database of protein-binding microarray data on protein–DNA interactions. Nucleic Acids Res. 43, D117–D122 (2015).
Matys, V. et al. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 34, D108–D110 (2006).
Khan, A. et al. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 46, D1284 (2018).
Kulakovskiy, I. V. et al. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-seq analysis. Nucleic Acids Res. 46, D252–D259 (2018).
Sigrist, C. J. et al. PROSITE: a documented database using patterns and profiles as motif descriptors. Brief. Bioinform. 3, 265–274 (2002).
Kumar, S., Stecher, G., Suleski, M. & Hedges, S. B. Timetree: a resource for timelines, timetrees, and divergence times. Mol. Biol. Evol. 34, 1812–1819 (2017).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Lam, K. N., van Bakel, H., Cote, A. G., van der Ven, A. & Hughes, T. R. Sequence specificity is obtained from the majority of modular C2H2 zinc-finger arrays. Nucleic Acids Res. 39, 4680–4690 (2011).
Zhao, Y. & Stormo, G. D. Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nat. Biotechnol. 29, 480–483 (2011).
We thank Xiaoting Chen and Mario Pujato for computational support. S.A.L. was funded by a Natural Sciences and Engineering Research Council of Canada Doctoral Fellowship. T.R.H. holds the Billes Chair of Medical Research at the University of Toronto. This work was supported by a Canadian Institutes of Health Research grant (FDN-148403) and a Natural Sciences and Engineering Research Council of Canada grant (RPGIN-2016-05643) to T.R.H., National Institutes of Health (NIH) grants R01 AR073228, R01 NS099068 and R01 GM055479, Lupus Research Alliance ‘Novel Approaches’, CCRF Endowed Scholar and CCHMC CpG Award 53553 to M.T.W. and a Canadian Institutes of Health Research Operating grant (MOP-125894) to Q.D.M. and T.R.H.
Integrated supplementary information
Four similarity regression (SR) models are built for each TF family and compared with alignment percent identity to identify the best-performing model. The best model is selected after cross-validation, with thresholds chosen by Matthews correlation coefficient. Homeodomains are used as the example here: their 465 PBM constructs yield 13,832 highly similar, 22,714 ambiguous and 71,793 dissimilar pairs.
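The three-class setting described above (highly similar, ambiguous, dissimilar) can be scored with the K-category Matthews correlation coefficient of Gorodkin (2004), which appears in the reference list. The following is a minimal sketch of that coefficient only, not the authors' implementation; the function name and the label strings are assumptions made for illustration.

```python
from math import sqrt


def multiclass_mcc(true_labels, predicted_labels):
    """K-category Matthews correlation coefficient (Gorodkin, 2004).

    Hypothetical helper for illustration; labels can be any hashable values.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")

    classes = sorted(set(true_labels) | set(predicted_labels))
    s = len(true_labels)                                    # total samples
    c = sum(t == p for t, p in zip(true_labels, predicted_labels))  # correct calls
    t_k = [true_labels.count(k) for k in classes]           # true count per class
    p_k = [predicted_labels.count(k) for k in classes]      # predicted count per class

    numerator = c * s - sum(t * p for t, p in zip(t_k, p_k))
    denominator = sqrt(
        (s * s - sum(p * p for p in p_k)) * (s * s - sum(t * t for t in t_k))
    )
    # By convention, return 0 when the coefficient is undefined
    # (for example, when every pair is assigned to a single class).
    return numerator / denominator if denominator else 0.0


# Assumed label names, chosen only for this example.
truth = ["similar", "ambiguous", "dissimilar", "similar"]
calls = ["similar", "ambiguous", "dissimilar", "similar"]
print(multiclass_mcc(truth, calls))
```

A coefficient of this kind could be evaluated over a grid of candidate score thresholds, keeping the thresholds that maximize it; how the thresholds were actually searched is not stated in this caption.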
(a) DBDs are first aligned to find the best (maximizing amino acid identity) ungapped and internal alignment. Examples of permissible and non-permissible alignment configurations are shown. (b) Alignments are then scored by calculating positional protein similarity features in each finger of a DBD array (for example, C2H2 ZFs), and combined into a single representation by averaging the features by the length of the longest DBD array.
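The two steps in this caption can be sketched as follows. This is an illustration under stated assumptions, not the published code: each DBD in an array is assumed to be pre-aligned to a common length, "internal" is read as meaning the shorter array may not overhang the longer one, and plain amino acid identity stands in for the positional similarity features used by the method. All function names are hypothetical.

```python
def domain_identity(domain_a, domain_b):
    """Fraction of identical residues between two equal-length, pre-aligned domains."""
    if len(domain_a) != len(domain_b):
        raise ValueError("domains must be aligned to the same length")
    return sum(x == y for x, y in zip(domain_a, domain_b)) / len(domain_a)


def best_ungapped_internal_alignment(array_a, array_b):
    """Slide the shorter DBD array along the longer one, without gaps or overhang.

    Returns (offset, score). The score is the summed per-domain identity
    divided by the length of the longer array, so unmatched domains lower it.
    """
    if not array_a or not array_b:
        raise ValueError("both DBD arrays must contain at least one domain")

    short, long_ = sorted((array_a, array_b), key=len)
    best_offset, best_score = 0, -1.0
    for offset in range(len(long_) - len(short) + 1):
        total = sum(
            domain_identity(dom, long_[offset + i]) for i, dom in enumerate(short)
        )
        score = total / len(long_)  # average over the longest array
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score


# Toy three-domain and two-domain arrays (made-up sequences).
offset, score = best_ungapped_internal_alignment(
    ["ABCD", "EFGH", "IJKL"], ["EFGH", "IJKL"]
)
print(offset, round(score, 3))
```

Dividing by the length of the longer array means a two-domain array that matches two of three domains perfectly scores 2/3 rather than 1, which penalizes unmatched domains.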
Supplementary Figure 3 Comparison of similarity regression weights to known DNA-contacting residues.
(a) Homeodomain, or (b) C2H2 ZF similarity regression (SR) weights are compared to DNAproDB contact frequencies for DNA backbone, major and minor groove contacts, using partial Pearson correlations. TF amino acid sequence diversity (for the similarity regression model training sequences) is displayed, for reference (above). Figures comparing contact frequencies with similarity regression weights are provided for all similarity regression models in Supplementary Data 1. (c) Partial correlations for 25 TF families with similarity regression models and structural information in DNAproDB (Sagendorf, J.M., et al., Nucleic Acids Res. 45, W89-W97, 2017) are displayed and coloured according to the statistical significance, as -log10(p-value).
Supplementary Figure 4 Comparison of similarity regression to percent identity at predicting TF pairs with dissimilar specificities.
(a) Scatter plot comparing the fraction of all dissimilar TF pairs captured by the 95% NPV threshold (specificity) for 17 TF families that have dissimilar TFs. (b) Scatter plot showing Matthews correlation coefficient, which summarizes multi-class (for highly similar, ambiguous, and dissimilar TF sequence specificity) classification accuracy for 29 TF families. In both panels, points are sized according to the number of PBM experiments used for training, and coloured according to the AA features used in each model.
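One plausible reading of the quantity in panel a is sketched below: find the most permissive score cutoff at which pairs called dissimilar are truly dissimilar at least 95% of the time (negative predictive value, NPV), then report the fraction of all dissimilar pairs falling at or below that cutoff. The function name, the "at or below" convention and the tie handling are assumptions; the caption does not specify them.

```python
def specificity_at_npv(scores, is_dissimilar, target_npv=0.95):
    """Return (cutoff, fraction of dissimilar pairs captured) at the target NPV.

    Pairs scoring at or below the cutoff are called dissimilar. Returns
    (None, 0.0) if no cutoff reaches the target NPV. Hypothetical helper.
    """
    if len(scores) != len(is_dissimilar):
        raise ValueError("scores and labels must have the same length")
    total_dissimilar = sum(bool(flag) for flag in is_dissimilar)
    if total_dissimilar == 0:
        raise ValueError("at least one dissimilar pair is required")

    pairs = sorted(zip(scores, (bool(flag) for flag in is_dissimilar)))
    best_cutoff, best_fraction = None, 0.0
    true_negatives = 0
    for i, (score, dissimilar) in enumerate(pairs):
        true_negatives += dissimilar
        # Evaluate only at the last pair of a run of tied scores.
        if i + 1 < len(pairs) and pairs[i + 1][0] == score:
            continue
        called_dissimilar = i + 1
        if true_negatives / called_dissimilar >= target_npv:
            best_cutoff = score
            best_fraction = true_negatives / total_dissimilar
    return best_cutoff, best_fraction


# Toy scores: three truly dissimilar pairs score lowest.
print(specificity_at_npv([0.1, 0.2, 0.3, 0.4, 0.9],
                         [True, True, True, False, False]))
```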
Supplementary Figure 5 Comparison of similarity regression scores with experimentally determined similarity in DNA sequence specificity, for new PBM data.
Predicted TF similarity (similarity regression (SR) score) and actual DNA-binding similarity (PBM E-score overlap) are plotted for 275 PBM experiments, versus the most similar (by similarity regression score) TF in the training set. Results are displayed for each TF family with more than three TFs. Linear fit is shown, with corresponding R² value. Points are coloured by their actual TF similarity based on family-specific E-score overlap thresholds.
Supplementary Figure 6 Comparison of predicted Z-score profiles for similarity regression, affinity regression and percent identity.
(a) Individual points show the Pearson correlation coefficient of predicted vs. actual Z-score profiles for 315 TFs (those among the 340 that have Similarity Regression models), for each reconstruction method tested (similarity regression (SR), affinity regression (AR) and percent identity). Reconstruction methods are grouped, as indicated by grey bars above, by whether they use one NN or a mixture of multiple TF profiles (Z-score reconstructions). Points are coloured by TF family (see legend). (b-d) Individual results for the three most abundant TF families in the test set are plotted separately: (b) C2H2 ZFs (n = 34), (c) Homeodomain (n = 17), and (d) zinc cluster (n = 107). Boxplots are defined with center line, median; box limits, upper and lower quartiles; whiskers, smallest or largest data point within 1.5× interquartile range.
Supplementary Figure 7 Comparison of similarity regression predicted motifs with C2H2 and homeodomain-recognition code predictions.
(a) Individual points show the motif similarity of predicted vs. actual PFMs, for the Homeodomain motif prediction methods tested (recognition codes, similarity regression (SR), and percent identity). Boxplots summarizing the predictions for 17 TFs are coloured by the motif prediction method and defined with center line, median; box limits, upper and lower quartiles; whiskers, smallest or largest data point within 1.5× interquartile range. (b) Motif similarity of predicted vs. actual PFMs for C2H2 ZF prediction methods, displayed as in panel a, for 34 TFs.
Supplementary Figure 8 Increase in percentage of TFs with a predicted motif in CIS-BP (Similarity Regression compared to percent identity).
(a) The percentage of TFs with a ‘Direct’ (that is, experimentally determined) (black bars) or predicted (grey bars) motif is plotted for the 50 largest TF families in CIS-BP. The increase in percentage due to similarity regression (SR) models is shown by red bars. The total number of TFs encompassed is shown at right. (b) Motif coverage in well-studied eukaryotes, plotted as in panel a. Relationships between the species are represented by divergence time (million years ago) obtained from the TimeTree database (Kumar, S., et al., Mol Biol Evol. 34, 1812-1819, 2017). The major clades of fungi, metazoans, and plants are coloured in red, blue, and green respectively.
Classifications of motif similarity are shown as in Fig. 5. The outer ring of each nested pie chart represents the similarities of Saccharomyces cerevisiae TFs with respect to the species being compared (displayed along the phylogeny). The inner ring represents the similarities of the compared species' TFs with respect to S. cerevisiae. Branch length is the divergence time between species (millions of years).
Supplementary Figure 10 Motif similarity between corresponding Drosophila and human TFs (highest scoring BLASTP hits with Drosophila as query).
Motif similarity was calculated between 322 fly TFs (1,128 PWMs) and 251 human TFs (2,177 PWMs) with experimentally determined motifs, using MoSBAT energy scores (Lambert, S.A., et al., Bioinformatics. 32, 3504–3506, 2016). (a) The maximum motif similarity for all pairs of human and fly TFs (that is, taking into account that there are often multiple motifs per TF) is displayed as a boxplot, according to the Similarity Regression-predicted TF similarity for each NN pair. (b) As in panel a, but using only HT-SELEX data (306 fly PWMs and 410 human PWMs). Boxplots are defined with: center line, median; box limits, upper and lower quartiles; whiskers, smallest or largest data point within 1.5× interquartile range; points, outliers.