Accurate modeling of the sequence specificities of TFs is of central importance for understanding the function and evolution of genomes. Ideally, sequence specificity models should predict the relative affinity (or dissociation constant) for different individual sequences and/or the probability of occupancy at any position in the genome. The major paradigm in modeling TF sequence specificity is the position weight matrix (PWM) model1,2,3. PWMs represent the DNA sequence preference of a TF as an N by B matrix, where N is the length of the site bound by the TF, and B is the number of possible nucleotide bases (that is, A, C, G or T). Each position provides a score for each nucleotide, representing the relative preference for the given base. PWM models provide an intuitive representation of the sequence preferences of a TF, including the exact position where it would bind the DNA, and involve relatively few parameters. However, recent studies suggest that shortcomings of PWMs, including their inability to model variable-width gaps, capture dependencies between the residues in the binding site or account for the fact that TFs can have more than one DNA-binding interface, can make them inaccurate4,5,6,7,8,9. Alternative models have been developed that extend the PWM model by considering the contribution of combinations of nucleotides, for example, dinucleotides or combinations of multiple motifs4,6,7,10. Another alternative, k-mer–based approaches7,11, assign a score to every possible sequence of length k, and hence make no assumptions about position dependence, variable gap lengths or multiple binding motifs. To our knowledge, the relative efficacies of these approaches have not been systematically compared.
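To make the N by B representation concrete, the following minimal sketch scores a candidate site by summing one entry per position, and slides the matrix along a longer sequence to find the best-scoring window. The matrix values and the function names (`pwm_score`, `best_window`) are invented for illustration and are not taken from any algorithm evaluated here.

```python
# Illustrative only: the matrix values below are made up, not taken from the study.
BASES = "ACGT"

# An N-by-B matrix (N = 4 positions, B = 4 bases); each row scores A, C, G, T.
PWM = [
    [1.0, -1.0, -1.0, -0.5],   # position 1 prefers A
    [-1.0, 1.2, -0.8, -1.0],   # position 2 prefers C
    [-0.7, -1.0, 1.1, -0.9],   # position 3 prefers G
    [-1.0, -0.6, -1.0, 0.9],   # position 4 prefers T
]


def pwm_score(site, pwm=PWM):
    """Sum the per-position scores of one site (positions treated as independent)."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal the number of PWM rows")
    return sum(row[BASES.index(base)] for row, base in zip(pwm, site))


def best_window(sequence, pwm=PWM):
    """Slide the PWM along a longer sequence; return (best score, start offset)."""
    n = len(pwm)
    windows = [(pwm_score(sequence[i:i + n], pwm), i)
               for i in range(len(sequence) - n + 1)]
    return max(windows)


score, offset = best_window("TTACGTTT")
print(f"best window starts at offset {offset} with score {score:.2f}")
```

Because each position contributes independently, a model of this kind needs only N × B numbers, which is the source of both its compactness and its inability to represent dependencies between positions.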

A major difficulty in studying TF-DNA binding specificity and, therefore, in evaluating models for representing this specificity has been scarcity of data. The process of training and testing models benefits from a large number of unbiased data points. In the case of TF-DNA binding models, the required data are the relative preferences of a TF for a large number of individual sequences. Ideally, such data should be obtained in an in vitro setting, as many confounding factors can influence the binding of a transcription factor in vivo (e.g., chromatin state, TF concentration or interactions with cofactors). Methods for measuring in vitro binding specificity include (HT)-SELEX/SELEX-seq12,13,14,15, HiTS-FLIP8, mechanically induced trapping of molecular interactions (MITOMI)9,16, cognate site identifier17, bacterial one-hybrid18 and protein binding microarrays (PBMs)19.

PBMs have enjoyed widespread use owing to the ease, accessibility and relatively high information content of the assay. Raw PBM data consist of a score (that is, fluorescence signal intensity) representing the relative preference of a given TF to the sequence of each probe contained on the array. PBM data represent specificity (that is, how strongly a given TF binds to a given sequence, relative to all other sequences), as opposed to binding affinity (that is, how strongly a TF binds to a single sequence); specificity is the more important measure, because in vivo, the TF must be able to distinguish its functional sites from all accessible sequences in the genome20. A typical universal PBM is designed using a de Bruijn sequence, such that all possible 10-mers and 32 copies of every nonpalindromic 8-mer are contained within 40,000 60-base probe sequences (each containing either 35 or 36 unique bases) on each array, offering an unbiased survey of TF sequence specificities19. Constructing arrays with different de Bruijn sequences, each capturing the sequence specificities of the same TF to entirely different sets of sequences, provides a means to test the relative performance of various algorithms for modeling and predicting TF sequence specificities, because models can be trained on one array and tested on the other7,19. Here we present an evaluation of 26 different algorithms for modeling the DNA sequence specificity of a diversity of TFs, using two PBM array designs for each TF.
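The defining property of a de Bruijn sequence, that every k-mer occurs exactly once, can be checked on a toy scale. The sketch below builds an order-3 sequence over A, C, G and T using a textbook Lyndon-word construction; this is an assumption made for illustration and is not necessarily the procedure used to design the ME and HK arrays, which use order 10 and split the sequence across many probes.

```python
from itertools import product


def de_bruijn(alphabet, k):
    """Return a cyclic de Bruijn sequence of order k (textbook Lyndon-word construction)."""
    n_sym = len(alphabet)
    a = [0] * (n_sym * k)
    seq = []

    def db(t, p):
        if t > k:
            if k % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, n_sym):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in seq)


def cyclic_kmers(cyclic_seq, k):
    """List the k-mer starting at every position, wrapping around the end."""
    linear = cyclic_seq + cyclic_seq[:k - 1]
    return [linear[i:i + k] for i in range(len(cyclic_seq))]


seq = de_bruijn("ACGT", 3)
windows = cyclic_kmers(seq, 3)
all_3mers = sorted("".join(p) for p in product("ACGT", repeat=3))
print(len(seq), sorted(windows) == all_3mers)
```

Two sequences built with different constructions cover the same k-mers in entirely different flanking contexts, which is what allows one array to serve as training data and the other as an independent test.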


The DREAM5 challenge

The Dialogue for Reverse Engineering Assessments and Methods (DREAM) is a series of annual reverse-engineering systems biology challenges21,22,23. The DREAM5 TF-DNA Motif Recognition Challenge formed the basis for the analyses presented here. The challenge used PBM data to test the ability of different algorithms to represent the sequence preferences of TFs (here, 'algorithm' refers to the combination of data preprocessing, TF sequence specificity model, training and scoring). Briefly, we generated PBM data measuring the DNA sequence preferences of 86 mouse TFs, taken from 15 diverse TF families (Supplementary Table 1). All TFs were assayed in duplicate on two arrays with independent de Bruijn sequences (denoted 'ME' and 'HK' after the initials of the designers). In the DREAM5 challenge, the sequences of both arrays were made known, but only a subset of the PBM data was provided to participants, and the teams submitted predictions on the array data held back. For 20 randomly chosen TFs, array intensity data were provided from both array types, in order for the participants to calibrate and test their algorithms. For 33 TFs, intensity data were provided only from the ME type of array; data for the remaining 33 TFs were provided only for the HK type of array. Given the output of probe intensities of one PBM array type, the challenge consisted of predicting the probe intensities of the second array type for each of the 66 TFs.

The probe intensity predictions from each participant were then evaluated using five criteria (that is, scores) that assess the ability of an algorithm to either predict probe sequence intensities or assign high ranks to preferred 8-mer sequences. These criteria and a combined score that summarizes the performance of each algorithm are described in Supplementary Note 1. Briefly, the k-mer–based method of Team_D11 outperformed all other algorithms, with algorithms ranked two through five performing similarly to each other (Supplementary Table 2). Of note, the top five teams represent a wide range of sequence specificity models (Table 1), suggesting that the algorithm, its implementation and its scoring system might be of greater importance than the type of model used.

Table 1 Summary of evaluated algorithms

The DREAM5 outcome, along with feedback from participants and others, led us to revisit and investigate several aspects of the results. First, we wanted to revisit the evaluation criteria. Second, we wanted to account for the possibility that microarray data preprocessing might have an effect on the final performance of a model or algorithm, as it clearly did for Team_D11. Third, we wanted to incorporate published algorithms that were not represented in the challenge, including three biophysical energy–based algorithms, BEEML-PBM24,25, FeatureREDUCE (T.R.R. and H.J.B., unpublished data) and MatrixREDUCE26, as well as two statistical algorithms, RankMotif++27 and Seed-and-Wobble19. We also wanted to examine the impact of dinucleotide-based PWM models and 'secondary motifs', which can model proteins with multiple modes of binding DNA7. Here, we include 15 published and unpublished algorithms, in addition to 11 algorithms submitted as part of the original challenge (Table 1). Fourth, we wished to examine whether the results we obtained for in vitro data were supported by in vivo analyses and alternative in vitro assays.

Revised evaluation criteria

We considered two general issues in revisiting the evaluation criteria. The first is that, ideally, a representation of DNA sequence preference (e.g., a PWM) should output a number that reflects relative preference to a given sequence20. Most of the algorithms we considered aim to do this. In such cases it is reasonable to score using Pearson correlation. We note, however, that other models are intended to discriminate bound from unbound sets of sequences, or to represent the best binding sequences. In addition, microarray data can be subject to noise and saturation effects. In such cases it is appropriate to ask whether highly bound sequences can be discriminated from unbound sequences, which can be measured by the area under the receiver operating characteristic (AUROC).

The second issue is whether scoring should be based on predicting the 35-mer probe intensities or predicting their transformed 8-mer values (we refer to full probe sequences as 35-mers, because each 60-base probe sequence contains 35 unique bases). The original DREAM5 challenge included both. There are arguments for and against both7,19,24 and our comparisons to independent data did not support either as being superior overall (Supplementary Note 2). In addition, the 8-mer values can be derived by different means; one previous study7 directly predicted values for the test 8-mers with PWMs, whereas another24 first scored the test 35-mer scores and then converted these to 8-mer scores. We found that the latter approach24 results in dramatically improved correlations to the measured test 8-mer Z-scores (Supplementary Note 2), suggesting that previous conclusions regarding secondary motifs, which were derived using the former approach7, should be revisited. Using the latter procedure24, the correlations obtained for 8-mers and for 35-mers on the same array scale with each other almost perfectly, whether the 35-mers are scored with PWMs or with 8-mers (Supplementary Note 2). The only substantial difference we have observed between scoring 35-mers or 8-mers is that secondary motifs appear to confer a slight advantage when scoring 8-mers, but not 35-mers.
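The conversion from probe-level scores to 8-mer scores is not spelled out in this section. One common summary in PBM analysis, assumed here purely for illustration, is the median score of all probes that contain a given k-mer or its reverse complement. The sketch below uses 3-mers and invented probe sequences; the function names are hypothetical.

```python
from statistics import median

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Represent a k-mer and its reverse complement by a single key."""
    return min(kmer, revcomp(kmer))


def kmer_medians(probes, scores, k):
    """Median probe score for each canonical k-mer (assumed summary, see lead-in)."""
    by_kmer = {}
    for seq, score in zip(probes, scores):
        # Count each k-mer once per probe, even if it occurs several times.
        seen = {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}
        for kmer in seen:
            by_kmer.setdefault(kmer, []).append(score)
    return {kmer: median(vals) for kmer, vals in by_kmer.items()}


# Invented probes and scores (predicted or measured), for illustration only.
table = kmer_medians(["ACGTA", "TTACG", "GGGGG"], [10, 20, 1], k=3)
print(table["ACG"], table["CCC"])
```

Applying the same summary to predicted and to measured probe scores yields two k-mer tables that can then be compared directly.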

In the evaluations below, we use two criteria that are based on prediction of 35-mer intensities (which was the original DREAM5 challenge), but acknowledge that the data may be noisy and semiquantitative: (i) Pearson correlation between predicted and actual probe intensities (in the linear domain), and (ii) the AUROC of the set of positive probes, where positive probes are defined as those with actual intensities >4 s.d. above the mean probe intensity for the given experiment (average of 350 probes per experiment, out of 40,000 probes total) (Fig. 1). We calculate a normalized score in which the top-performing algorithm for the given evaluation criterion receives a 1, and all other algorithms receive scores proportional to the top algorithm. The final score for an algorithm is the average of its two normalized scores. We also report the Pearson correlation between measured and predicted 8-mer scores, and the AUROC of positive 8-mers, where positive 8-mers are defined as those with associated E-scores > 0.45 in the actual experiment28, although these are not used to gauge the efficacy of algorithms or models.
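The two criteria and the normalization can be sketched as follows. This is a minimal standard-library illustration, not the evaluation code used in the study: the use of the population standard deviation, the pairwise AUROC computation and the names of the helper functions are all assumptions.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def positive_probes(actual, n_sd=4.0):
    """Label probes whose measured intensity exceeds mean + n_sd * s.d."""
    cutoff = mean(actual) + n_sd * pstdev(actual)  # population s.d. is an assumption
    return [value > cutoff for value in actual]


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, lab in zip(scores, labels) if lab]
    neg = [s for s, lab in zip(scores, labels) if not lab]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def final_scores(per_algorithm):
    """Map {name: (pearson, auroc)} to {name: mean of the two top-normalized scores}."""
    best_r = max(r for r, _ in per_algorithm.values())
    best_a = max(a for _, a in per_algorithm.values())
    return {name: (r / best_r + a / best_a) / 2
            for name, (r, a) in per_algorithm.items()}


# Toy data: 99 background probes and one strongly bound probe.
actual = [1.0] * 99 + [100.0]
predicted = [0.1 * i for i in range(100)]
labels = positive_probes(actual)
print(sum(labels), auroc(predicted, labels))
print(final_scores({"A": (0.8, 0.9), "B": (0.4, 0.9)}))
```

In the toy comparison, algorithm B matches A on AUROC but reaches only half of A's correlation, so its final score is the average of 0.5 and 1.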

Figure 1: Evaluation criteria used in this study.

For each TF, we scored an algorithm's probe intensity predictions using two evaluation criteria, which are illustrated here for TF_16 (Prdm11), using the predictions of BEEML-PBM on the raw array intensity data. (a) Pearson correlation between predicted and actual probe intensities across all 40,000 probes. (b) AUROC of the set of positive probes. Positive probes (black lines) were defined as all probes on the test array with intensities >4 s.d. above the mean probe intensity for the given array.

Results of new evaluations

In the revised evaluations, we used the 35-mer scores from the DREAM5 challenge directly for eight of the algorithms. For the top three algorithms in the initial DREAM5 challenge that take <24 CPU hours to run per experiment (originally ranked 1, 3 and 4), as well as the algorithms BEEML-PBM24, FeatureREDUCE (T.R.R. and H.J.B., unpublished data), RankMotif++27, Seed-and-Wobble19 and five simple algorithms we implemented to provide a baseline (PWM_align, PWM_align_E, 8mer_max, 8mer_sum and 8mer_pos), we constructed a training data set from the combination of preprocessing steps that resulted in the best final score for the given algorithm (Supplementary Note 3). Algorithms that were not subjected to the preprocessing analysis may work better if given the same benefit. We scored all 26 algorithms across the panel of 66 mouse TFs used in the challenge (see Supplementary Table 3 for all evaluation scores of each algorithm on each TF).

The results of our revised evaluation scheme produced rankings similar to those of the DREAM5 challenge, with the algorithm of Team_D again performing best among the original challenge participants (Table 2). Final performance was robust to the choice of evaluation criteria (Supplementary Fig. 1). Overall, the highest-scoring algorithm is FeatureREDUCE, which combines a dinucleotide model in a biophysical framework with a background k-mer model explicitly intended to capture PBM-specific biases. In general, k-mer– and dinucleotide-based algorithms scored highest, although some PWM-based algorithms produced competitive results. Overall, it is notable that the specific algorithm is still more important than the type of sequence specificity model used by the algorithm. For example, BEEML-PBM, a published PWM-based algorithm, receives a better final score than four k-mer–based algorithms. Furthermore, algorithms based on the same sequence-specificity model type (e.g., PWM, dinucleotides or k-mers) do not necessarily produce similar probe intensity predictions (Supplementary Fig. 2).

Table 2 Final evaluation results

Algorithm performance varied substantially across the 66 TFs (Fig. 2a). The quality of the underlying experimental data, as opposed to inherent differences between TF families, appears to be the major factor in the overall ease of predicting probe intensities for a given TF (Fig. 2b and Supplementary Note 4). For example, TFs that were harder for most algorithms to model tended to have lower correlations between the 8-mer Z-scores of their training and test arrays, and fewer 8-mer E-scores > 0.45 on their training arrays.

Figure 2: Comparison of algorithm performance by TF.

(a) Final score of each algorithm for each TF. TF name, ID and family are depicted across the columns, and sequence specificity model type and name are depicted across the rows. Algorithms are sorted in decreasing order of final performance across all TFs. TFs are sorted in decreasing order of mean final score across all algorithms. Numbers in parentheses indicate the number of zinc fingers in the protein. (b) Summary statistics for each TF across all algorithms: mean final score, maximum final score achieved by any k-mer, dinucleotide or PWM-based algorithm, Pearson correlation of 8-mer Z-scores between replicate arrays, and the number of 8-mers with E-scores > 0.45 on the training array (normalized by the maximum such value across all TFs). (c) Difference between the best score achieved by any k-mer–based algorithm and the best score achieved by any PWM-based algorithm for each TF.

To further examine the relative performance of the k-mer, dinucleotide and PWM models, we compared the final scores produced by the single algorithm from each model category that performed best for each TF. On average, the best k-mer–based algorithm outperformed the best dinucleotide or PWM algorithm, but this is mainly because of large differences in a handful of specific TFs (Fig. 2b,c). Algorithms based on dinucleotides did substantially worse on these harder-to-model TFs, suggesting that they might be overfitting to array-specific noise. The best PWM-based algorithm performs as well as the best k-mer–based algorithm for the majority of TFs (Fig. 2c), with a median difference of only 0.014. PWM algorithms, in fact, did slightly better than k-mer–based algorithms for 18 TFs (Fig. 2c). However, of the five cases in which the final score for the best of one model type beats the best of the other type by >0.10, all but one favor k-mer algorithms (Fig. 2c). The majority of TFs for which the k-mer model performed better contained C2H2 zinc-finger arrays, which, depending on which C2H2 fingers are engaged, may have different binding modes; there is previous evidence for such phenomena both in vivo and in vitro7,29. However, some of these C2H2 zinc fingers present a challenge for all sequence-specificity models, perhaps owing to the small number of sequences they preferentially bind (Fig. 2 and Supplementary Note 4).

Despite the fact that more complicated algorithms produce higher scores, the results of these analyses suggest that the PWM model can accurately capture the sequence preferences for most TFs. Nevertheless, we observed a wide range in PWM-based algorithm performance across the 66 TFs (Fig. 2a). The fact that the two highest-scoring PWM-based algorithms, Team_E and BEEML-PBM (Table 2), both model PBM-specific effects suggests that their high scores might not be solely due to superior PWMs. We carried out a series of analyses aimed at isolating the predictive ability of the PWMs produced by all of the PWM-based algorithms. Those produced by BEEML-PBM were the most accurate of all of the algorithms; the high performance of Team_E is due to its extensive modeling of PBM background effects and not to the quality of its PWMs (Supplementary Note 5). We also found this to be the case for predicting in vivo TF binding (see below).

Analysis of dinucleotide matrices and secondary motifs

Numerous studies have called into question the accuracy of the assumption inherent to the PWM model that bases are independent and many propose the use of dinucleotide dependencies to model TF binding. To quantify the relative accuracies of the dinucleotide and PWM models, we compared the performance of two of the top algorithms, FeatureREDUCE and BEEML-PBM, both of which can be run using either type of model. Both did better overall when using the dinucleotide model (Table 2), although the difference was not dramatic, and certain TFs benefitted more than others (Supplementary Table 4; median improvement of 0.019 and 0.006, respectively). In general, an overall improvement is not surprising because the dinucleotide model has more parameters. We note that the degree of improvement between FeatureREDUCE and BEEML-PBM is poorly correlated and is negatively correlated with how well each performs using only a mononucleotide PWM (Supplementary Fig. 3), suggesting that much of the improvement may be due to suboptimal mononucleotide PWMs. Of the six cases in which a dinucleotide model results in an improvement of >5% in the final score for both FeatureREDUCE and BEEML-PBM, five are among the TFs for which it appears to be difficult to train a PWM (Fig. 2). These observations suggest that there are relatively few cases in which there are bona fide dinucleotide interactions that have a major impact on model performance.

Secondary motifs would represent alternative binding modes for a TF that are also not possible to capture with a single PWM7. The previously claimed widespread prevalence of secondary motifs7 was recently contested by the finding that a single BEEML-PBM PWM is more predictive than two PWMs derived by Seed-and-Wobble24, on the same data set used to support the original claim7. To more directly examine the importance of secondary motifs, we identified secondary motifs in both the PBM data of this study and that of the previous study7. We discovered secondary motifs by using the residuals of the primary motif probe signal intensity predictions for both BEEML-PBM and FeatureREDUCE, used regression on the training data to assign weights to the two motifs and evaluated their impact on the overall performance of each algorithm (Online Methods). The performance of both BEEML-PBM and FeatureREDUCE was, in fact, slightly weakened using this scheme (Table 2).

Because the decreased performance might be due to probe-level noise drowning out the comparatively weaker secondary motif signal, we evaluated the performance of the secondary motifs using 8-mer scores computed with the newer 8-mer scoring procedure24 (Online Methods). Under this scoring scheme, secondary motifs provided a slight increase in overall performance (2–8% improvement in average correlation) (Supplementary Table 5). However, examination of secondary motif performance for each TF revealed that secondary motifs substantially increase performance only in specific cases (Supplementary Table 6). Moreover, as in the case of dinucleotide-based models, the degree of improvement in FeatureREDUCE and BEEML-PBM is poorly correlated, and again correlates negatively with how well each algorithm scores using only a mononucleotide PWM (Supplementary Fig. 3). Manual inspection of these examples revealed that improvement can typically be attributed to the identification of a minor variation on the primary motif, to a 'second chance' after producing an inaccurate motif on the first attempt, or to the identification of the second half-site for a TF that can bind DNA as a homodimer (Supplementary Note 6). We did identify several instances of what appear to be alternative binding modes, including three examples capturing the classic TAATA and ATGCWWW sequences of Pou+Homeodomain TFs, and extensions of primary motifs (e.g., extending the consensus sequence of Nr5a2 from AAGGTCA to TCAAGGTCA), indicating that our methodology can detect bona fide cases of secondary motifs (Supplementary Note 6). Nonetheless, it appears that the major benefit of secondary motifs is to make up for shortcomings in the initial motif-finding process.

In vitro–derived PWMs accurately reflect in vivo binding

We next asked whether conclusions reached using in vitro data also apply to TF binding in vivo. The sequence specificity of a TF is only one of several factors that determine where it binds in vivo (others include cofactors and DNA accessibility); nonetheless, motifs consistent with those obtained in vitro can often be derived directly from in vivo data7,30,31, indicating that the intrinsic sequence specificity of TFs is a major factor in controlling their DNA binding in vivo. We obtained publicly available ChIP-seq data for five of the mouse TFs whose DNA sequence preferences were measured using PBMs in this study, and ChIP-exo data from four yeast TFs whose preferences have been measured using PBMs in other studies. We then trained models on the PBM data using each algorithm and gauged their ability to accurately distinguish sequences bound by ChIP-seq and ChIP-exo peaks from control sequences. We also trained PWMs on the same in vivo data by running ChIPMunk32 and MEME-Chip33, methods that have been specifically tailored for motif discovery from ChIP-seq data, in a cross-validation setting. We evaluated each algorithm with AUROCs, which here measure the ability of a given algorithm to assign higher scores to positive (bound) sequences relative to control (random) sequences (Online Methods).

All PWM-based algorithms could discriminate ChIP-seq and ChIP-exo peaks from control sequences to some degree, as evidenced by the fact that the average AUROC scores of all algorithms exceeded the random expectation of 0.5 (Fig. 3). Conversely, the algorithms that performed best in our in vitro evaluations (FeatureREDUCE and Team_D, which both incorporate k-mer sequence specificity models) perform poorly (Team_D) or substantially worse (FeatureREDUCE) in nearly all cases analyzed (as does the simple 8mer_sum algorithm; Fig. 3). Likewise, the dinucleotide versions of BEEML-PBM and FeatureREDUCE do not improve upon their PWM-based counterparts. The performance of the k-mer and dinucleotide-based in vitro-trained models on in vivo data could be due to a combination of modeling probe-specific effects such as GC content and complications arising from biases in genomic nucleotide content relative to PBM probe sequences. Indeed, the 8mer_sum_high algorithm, which incorporates only 8-mers with Z-scores >3 (a cutoff that likely excludes PBM-specific background noise), performs substantially better than the 8mer_sum algorithm, which incorporates scores across the entire range of k-mer values (Fig. 3).
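The difference between the two baseline k-mer scorers can be illustrated with a short sketch. Only the Z-score cutoff of 3 comes from the text; scoring a sequence by summing over every window, treating missing k-mers as 0 and ignoring reverse complements are assumptions of this sketch, and the 3-mer table stands in for real 8-mer Z-scores.

```python
def kmer_sum(sequence, z_scores, k, min_z=None):
    """Sum the Z-scores of every k-mer window in `sequence`.

    With min_z set, windows whose Z-score is not above the cutoff contribute
    nothing. K-mers missing from the table are treated as 0 (an assumption).
    """
    total = 0.0
    for i in range(len(sequence) - k + 1):
        z = z_scores.get(sequence[i:i + k], 0.0)
        if min_z is None or z > min_z:
            total += z
    return total


# Hypothetical 3-mer table standing in for 8-mer Z-scores.
Z = {"ACG": 6.0, "CGT": 4.0, "TTT": -1.5, "GTT": 0.5, "TAC": -0.5, "TTA": -1.0}

print(kmer_sum("TTACGTTT", Z, k=3))             # all windows, in the spirit of 8mer_sum
print(kmer_sum("TTACGTTT", Z, k=3, min_z=3.0))  # high-scoring windows only, as in 8mer_sum_high
```

With the cutoff in place, low and negative Z-scores, which may largely reflect array background rather than binding, no longer shift the score of a genomic sequence.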

Figure 3: Comparison of algorithm performance on in vivo data.

For each algorithm, we trained a model (PWM, 2 PWMs, k-mer or dinucleotide) using PBM data, and gauged its ability to discriminate real from random ChIP peaks using the AUROC (Online Methods). Data for the first five TFs were taken from mouse ChIP-seq data. The final four are from yeast ChIP-exo data. The color scale is indicated at the bottom. Team_E was not run on the ChIP-exo data, because it requires initialization parameters specific to the individual TF. FeatureREDUCE was run using models of length 8, instead of length 10, owing to the superior performance of this length model on in vivo data (T.R.R. and H.J.B., unpublished data).

Overall, PWMs produced by the FeatureREDUCE_PWM algorithm perform best on in vivo data (Fig. 3). Notably, FeatureREDUCE_PWM performs similarly to ChIPMunk, and out-performs the MEME-Chip algorithm, despite the fact that the latter algorithms trained their PWMs on the ChIP-seq data, and should thus incorporate features unique to in vivo data, such as nucleotide bias. All of our conclusions were robust to a variety of positive and negative sequence settings (Supplementary Table 7). Thus, at least for the nine TFs we examined here, in vitro–derived PWMs are in general better than in vitro–derived k-mer and dinucleotide models, and similar to in vivo–derived PWMs, in terms of predicting bound versus unbound ChIP-seq and ChIP-exo sequences.

Accurate prediction of data from alternative in vitro assays

Finally, we examined how well PBM-derived motifs, with or without dinucleotides, secondary motifs or k-mers, could predict data for 24 TFs that have been assayed using the MITOMI9,16 or HiTS-FLIP technologies8, all of which also have PBM data available from other studies7,31,34. We trained the best-performing FeatureREDUCE algorithm on the PBM data in each of its possible settings: PWM only, 2 PWMs (secondary motifs), dinucleotides and dinucleotides+k-mers. We then compared the ability of each model to predict the values produced by the other technology.

The inclusion of features beyond mononucleotide PWMs had limited impact on the majority of these 24 TFs (Supplementary Note 7). We note, however, that we detected specific examples where more complicated models provided an increase in performance across platforms (Supplementary Note 7). For example, k-mers and secondary motifs both improve cross-platform performance for Cbf1. This finding confirms that PBMs are capable of detecting cases where more complicated binding modes exist, and that these models are capable of improving predictive performance on other data sources. Taken together, these results are consistent with our findings that PWMs work well for most TFs, although certain TFs require more complicated models.


We have come to several major conclusions on the basis of this study, which have broad implications for the representation of sequence specificity of DNA-binding proteins. We note that the exact conclusions reached depend on both the TFs used for evaluation and the evaluation criteria, a fact that likely accounts for the ongoing controversy in this area. However, our general conclusions are robust to changes in the evaluation procedure. In addition, our conclusion that well-implemented PWMs can perform as effectively as more complicated models in most cases is supported by cross-technology analysis of in vitro data and by analysis of in vivo data.

Our first major conclusion is that, when testing on PBM data, k-mer–based models score best overall. Other approaches do nearly as well, however, and details of implementation, such as parameter estimation techniques, can be as important to the performance of an algorithm as the underlying model. Indeed, the algorithms that produce the most predictive PWMs, FeatureREDUCE_PWM and BEEML-PBM, which both train PWMs in an energy-based framework (Supplementary Note 8), perform similarly to more complicated models for the majority of TFs, supporting the contention that imperfections in motif derivation (and scoring) underlie most of the apparent superiority of k-mer scoring that we previously reported7,24. PWMs consistently fared poorly in 10% of the TFs, relative to k-mer–based sequence specificity models; however, many of these cases are characterized by having few high-scoring 8-mers (Fig. 2b and Supplementary Note 4). Thus, the scarcity of the data itself may limit the ability of algorithms to train a PWM. Modification of the algorithms may help improve these cases.

The fact that incorporation of dinucleotide interactions improves the performance of both BEEML-PBM and FeatureREDUCE, but for different sets of TFs, suggests that the need for these extensions to mononucleotide PWM is driven more by the algorithm than by a property of the TF. Dinucleotide interactions clearly do exist25 and were highlighted in previous analyses using MITOMI9, HiTS-FLIP8 and PBM19 data. However, these studies did not specifically ask how much of the overall variation in the data (e.g., using Pearson correlation) is accounted for by mononucleotide versus dinucleotide PWMs. We also note that more complex models can be more prone to learning platform-specific noise. At present it is not clear what the best approach is for different platforms; resolving the source and relative contribution of complexities in DNA-binding data would benefit from analysis of the same TFs on multiple high-resolution platforms.

One striking outcome of our study is that the appearance and information content of a motif has little bearing on its accuracy. The motifs produced by BEEML-PBM and FeatureREDUCE_PWM—two of the highest-scoring PWM algorithms—are, in general, those with the lowest information content (Box 1, Fig. 4 and Supplementary Fig. 4). Conversely, PWMs produced by Seed-and-Wobble and PWM_align appear to be the strongest (that is, they are wider and have larger letters in the traditional 'information content' sequence logos), but they score substantially lower than those of BEEML-PBM and FeatureREDUCE_PWM, on both PBM and ChIP-seq data. We conclude from this analysis that information content has little to do with the accuracy and utility of a motif, underscoring the fact that degeneracy is common among eukaryotic TF sequence specificities, and that most TFs will bind to many variations of their 'consensus sequence', albeit at lower affinity. Indeed, previous studies have demonstrated the importance of low affinity binding sites in vivo35,36,37,38. PWMs that allow for a greater amount of degeneracy (and hence have lower information content) are able to better capture the full range of lower affinity sites.
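For reference, the information content shown in traditional sequence logos can be computed per position as 2 + Σ p log2 p bits, assuming a uniform background over the four bases and no small-sample correction. The sketch below applies this standard formula to two invented frequency matrices; neither corresponds to a motif from this study.

```python
from math import log2


def column_ic(freqs):
    """Information content (bits) of one position, assuming a uniform background."""
    return 2.0 + sum(p * log2(p) for p in freqs if p > 0)


def total_ic(frequency_matrix):
    """Total information content of a motif given per-position base frequencies."""
    return sum(column_ic(column) for column in frequency_matrix)


# Invented motifs: each row holds the frequencies of A, C, G, T at one position.
sharp = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
degenerate = [[0.5, 0.5, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25], [0.0, 0.0, 0.5, 0.5]]

print(total_ic(sharp), total_ic(degenerate))
```

The sharp motif carries 6 bits and the degenerate one 2 bits, yet only the latter assigns non-zero frequency to variant sites, which is the property that the lower-information PWMs discussed above exploit.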

Figure 4: Characteristics of Klf9 motifs produced by the eight PWM-based algorithms evaluated in this study.

The algorithms are ranked top to bottom in order of the overall score of their PWM for this TF in our evaluation scheme. Two popular visualization methods of the PWMs produced by each algorithm are depicted. On the left are traditional sequence logos39,40, which display the information content of each nucleotide at each position; the total information content (I.C.) of the PWM is given to the left of this logo. On the right are frequency logos, in which the height of each nucleotide corresponds to its frequency of occurrence at the given position40.

The finding that different algorithms excel (and fail) for different TFs suggests that an algorithm incorporating all of their advantages will likely outperform any individual one. To aid in the continued improvement of algorithms for the modeling of TF binding specificities, we have created a web server that allows users to upload their own probe intensity predictions and compare them to those of the algorithms evaluated here. We anticipate that the availability of this resource will help encourage future improvements to algorithms for the modeling and prediction of TF binding specificities.


Protein binding microarray experiments.

Details of the design and use of PBMs have been described elsewhere19,28,49,50. Here, we used two different universal PBM array designs, designated 'ME' and 'HK', after the initials of their designers. Information about individual plasmids is available in Supplementary Table 8. We identified the DNA binding domain (DBD) of each TF by searching for Pfam domains51 using the HMMER tool52. DBD sequences along with 50 amino acid residues on either side of the DBD in the native protein were cloned as SacI–BamHI fragments into pTH5325, a modified T7-driven glutathione S-transferase (GST) expression vector. Briefly, we used 150 ng of plasmid DNA in a 15 μl in vitro transcription and/or translation reaction using a PURExpress In Vitro Protein Synthesis Kit (New England BioLabs) supplemented with RNase inhibitor and 50 μM zinc acetate. After a 2-h incubation at 37 °C, 12.5 μl of the mix was added to 137.5 μl of protein-binding solution for a final mix of PBS/2% skim milk/0.2 mg per ml BSA/50 μM zinc acetate/0.1% Tween-20. This mixture was added to an array previously blocked with PBS/2% skim milk and washed once with PBS/0.1% Tween-20 and once with PBS/0.01% Triton-X 100. After a 1-h incubation at room temperature, the array was washed once with PBS/0.5% Tween-20/50 mM zinc acetate and once with PBS/0.01% Triton-X 100/50 mM zinc acetate. Cy5-labeled anti-GST antibody was added, diluted in PBS/2% skim milk/50 mM zinc acetate. After a 1-h incubation at room temperature, the array was washed three times with PBS/0.05% Tween-20/50 mM zinc acetate and once with PBS/50 mM zinc acetate. The array was then imaged using an Agilent microarray scanner at 2 μm resolution. Images were scanned at two power settings: 100% photomultiplier tube (PMT) voltage (high), and 10% PMT (low). The two resulting grid images were then manually examined, and the scan with the fewest saturated spots was used. Image spot intensities were quantified using ImaGene software (BioDiscovery).

Prediction of array intensities.

We evaluated a panel of 26 algorithms, based on their ability to accurately predict array intensities (Table 1). Parameters used for the published and novel algorithms and full descriptions of the algorithms submitted as part of the DREAM5 challenge can be found in Supplementary Note 9.

Evaluation criteria.

We evaluated the probe intensity predictions produced by each algorithm for each TF using two evaluation criteria (see Fig. 1 for illustrations, and below for descriptions). Before performing our evaluations, we removed all spots manually flagged as bad or suspect from the set of test probe intensities used in the evaluations. Each of the 66 experiments was scored individually using each criterion. The final score for each criterion was calculated as the average across all 66 experiments. To assign a final score to each algorithm, the score distributions of both criteria were first converted to relative scores, such that the best-performing algorithm for the given criterion received a score of 1, and the scores of all other algorithms were expressed relative to this best score (e.g., 0.90 as good as the top score, 0.80 as good). The final score for each algorithm was then calculated as the average of its two relative scores, and can hence be interpreted as how well the algorithm performed relative to the best algorithm, on average. A similar calculation was used to obtain the final scores of the individual TFs depicted in Figure 2. In this case, the calculations were carried out as described above, but individually for each of the 66 experiments (that is, skipping the step of averaging across all 66 experiments).
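The relative-score calculation described above can be sketched as follows. This is a minimal illustration, not the code used in the study; the function names and the toy numbers are hypothetical, and it assumes the best average score for each criterion is positive.

```python
def mean(values):
    return sum(values) / len(values)

def relative_to_best(per_algorithm_scores):
    """Average each algorithm's per-experiment scores, then divide by the best average."""
    averages = {alg: mean(scores) for alg, scores in per_algorithm_scores.items()}
    best = max(averages.values())  # assumed to be positive
    return {alg: avg / best for alg, avg in averages.items()}

def final_scores(pearson_scores, auroc_scores):
    """Final score = mean of the two relative scores (Pearson and AUROC)."""
    rel_pearson = relative_to_best(pearson_scores)
    rel_auroc = relative_to_best(auroc_scores)
    return {alg: (rel_pearson[alg] + rel_auroc[alg]) / 2 for alg in pearson_scores}

# Toy example with two algorithms and two experiments per criterion.
toy_pearson = {"alg_A": [0.7, 0.9], "alg_B": [0.5, 0.7]}    # averages 0.8 and 0.6
toy_auroc = {"alg_A": [0.3, 0.5], "alg_B": [0.45, 0.55]}    # averages 0.4 and 0.5
toy_final = final_scores(toy_pearson, toy_auroc)
```

Here alg_A is best on the Pearson criterion and alg_B on the AUROC criterion, so neither reaches a final score of 1: alg_A ends at about 0.90 and alg_B at about 0.875.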

Pearson correlation of probe intensities.

We measured the correlation between the predicted probe intensities p and the actual intensities a using the (centered) Pearson correlation, r:

$$r = \frac{\sum_{i=1}^{N} (p_i - \bar{p})(a_i - \bar{a})}{\sqrt{\sum_{i=1}^{N} (p_i - \bar{p})^2}\,\sqrt{\sum_{i=1}^{N} (a_i - \bar{a})^2}}$$

where N is the total number of probe sequences on the array, p̄ indicates the mean across all predicted probe intensities, and ā indicates the mean across all actual probe intensities. We chose not to use the Spearman correlation because its rank transformation results in a loss of resolution in the high probe intensity range, placing greater emphasis on the (majority of) unbound, low intensity probes.
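For concreteness, the centered Pearson correlation between predicted and actual probe intensities can be computed as in the sketch below. The function name is hypothetical, and the choice to raise an error for constant inputs is an assumption of this example.

```python
import math

def pearson_r(predicted, actual):
    """Centered Pearson correlation between two equal-length intensity lists."""
    n = len(predicted)
    if n != len(actual) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    p_mean = sum(predicted) / n
    a_mean = sum(actual) / n
    cov = sum((p - p_mean) * (a - a_mean) for p, a in zip(predicted, actual))
    p_ss = sum((p - p_mean) ** 2 for p in predicted)
    a_ss = sum((a - a_mean) ** 2 for a in actual)
    if p_ss == 0 or a_ss == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(p_ss * a_ss)
```

Because the values are used directly rather than as ranks, a few very bright probes contribute strongly to r, which matches the stated reason for preferring Pearson over Spearman here.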

AUROC of probe intensity predictions.

As a second measure of an algorithm's accuracy, we quantified the ability of the given algorithm to assign high ranks to bright probes. We defined bright probes as those whose intensities were 4 standard deviations above the mean in the actual experiment27. This results in an average of 350 bright probes per experiment, with an enforced minimum of 50, and a maximum of 1,300. For each algorithm's predictions for each TF, we ranked the 40,000 probes based on their predicted intensities and calculated the AUROC of the actual bright probes. We subtracted 0.50 from the final AUROC score, so that a value of 0 corresponds to random expectation.
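A minimal sketch of this criterion follows. Several details are assumptions of the example rather than statements from the text: the population standard deviation is used, the minimum and maximum are enforced by taking the top-ranked probes by actual intensity, and tied predictions receive average ranks. All function names are hypothetical.

```python
import statistics

def bright_probe_indices(actual, n_sd=4.0, min_bright=50, max_bright=1300):
    """Indices of 'bright' probes: actual intensity >= mean + n_sd * SD,
    with the count clamped to [min_bright, max_bright]."""
    cutoff = statistics.fmean(actual) + n_sd * statistics.pstdev(actual)
    n_bright = sum(1 for x in actual if x >= cutoff)
    n_bright = max(min_bright, min(max_bright, n_bright))
    order = sorted(range(len(actual)), key=lambda i: actual[i], reverse=True)
    return set(order[:n_bright])

def auroc(scores, positives):
    """AUROC from the rank-sum (Mann-Whitney U) statistic, average ranks for ties."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = len(positives)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")
    u = sum(ranks[idx] for idx in positives) - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def centered_auroc(predicted, actual, **bright_kwargs):
    """AUROC of the bright probes minus 0.5, so that 0 means random ranking."""
    return auroc(predicted, bright_probe_indices(actual, **bright_kwargs)) - 0.5
```

With this centering, a perfect ranking of the bright probes scores 0.5 and a random ranking scores approximately 0.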

Identification and evaluation of secondary motifs.

We identified primary and secondary PWMs for each TF in this paper and a set of previously published TFs7 using two of the top algorithms (FeatureREDUCE and BEEML-PBM), and used a combination of both PWMs to predict probe intensities using the following procedure:

  1. Run the algorithm to train a single PFM, PFM1, on the training array data.

  2. Use PFM1 to predict the probe intensities of the training array (intensities1).

  3. Regress the values of intensities1 against the actual training array intensities.

  4. Calculate the residuals by subtracting the regressed intensities from the actual training array intensities. Set any resulting negative values to 0.

  5. Run the algorithm to train a single PFM, PFM2, on the residuals.

  6. Use PFM2 to predict the probe intensities of the training array (intensities2).

  7. Regress the two sets of probe scores (intensities1 and intensities2) against the training probe intensities to learn the weights of the two PFMs.

  8. Use PFM1 to predict the probe intensities of the test array.

  9. Use PFM2 to predict the probe intensities of the test array.

  10. Combine the two sets of predicted probe intensities using the regression coefficients learned on the training array in step 7.
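The ten steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the implementation used in the study: the regressions are taken to be ordinary least squares with an intercept, and `train_pfm` and `predict` are hypothetical callables standing in for whichever motif algorithm is being evaluated.

```python
def _mean(xs):
    return sum(xs) / len(xs)

def fit_line(x, y):
    """Least-squares slope and intercept for y ~ slope * x + intercept."""
    mx, my = _mean(x), _mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def fit_plane(x1, x2, y):
    """Least-squares weights for y ~ w1 * x1 + w2 * x2 + b (2x2 normal equations)."""
    m1, m2, my = _mean(x1), _mean(x2), _mean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2  # zero if the two predictors are collinear
    w1 = (s22 * s1y - s12 * s2y) / det
    w2 = (s11 * s2y - s12 * s1y) / det
    return w1, w2, my - w1 * m1 - w2 * m2

def residuals_clipped(pred1, actual):
    """Steps 3-4: regress, subtract the fitted values, set negatives to 0."""
    slope, intercept = fit_line(pred1, actual)
    return [max(0.0, a - (slope * p + intercept)) for p, a in zip(pred1, actual)]

def two_motif_predictions(train_seqs, train_int, test_seqs, train_pfm, predict):
    """train_pfm(seqs, values) -> model; predict(model, seqs) -> list of scores.
    Both callables are placeholders supplied by the caller."""
    pfm1 = train_pfm(train_seqs, train_int)              # step 1
    int1 = predict(pfm1, train_seqs)                     # step 2
    resid = residuals_clipped(int1, train_int)           # steps 3-4
    pfm2 = train_pfm(train_seqs, resid)                  # step 5
    int2 = predict(pfm2, train_seqs)                     # step 6
    w1, w2, b = fit_plane(int1, int2, train_int)         # step 7
    test1 = predict(pfm1, test_seqs)                     # step 8
    test2 = predict(pfm2, test_seqs)                     # step 9
    return [w1 * a + w2 * c + b for a, c in zip(test1, test2)]  # step 10
```

Clipping the residuals at 0 means the second motif is trained only on probes that the first motif under-predicts, which is where a secondary binding mode would be expected to appear.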

We found that the resulting secondary motif probe intensity predictions decreased performance for both algorithms in our evaluation scheme (Table 2). We therefore tried an alternative scheme24 where we converted the training intensities and probe intensity predictions of PFM1 and PFM2 to 8-mers (using the median probe intensity), and then learned the weights of the two PWMs by performing regression on these 8-mer values. The resulting weights were then used to combine the predicted 8-mer scores of PWM1 and PWM2 on the test data. Using this strategy, we observed a minor increase in overall performance for both algorithms on both data sets (Supplementary Table 6).
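One way to convert probe-level values to 8-mer values using the median is sketched below. It assumes that each 8-mer is merged with its reverse complement and that a probe contributes at most once to each 8-mer; neither detail is specified in the text, and the function names are hypothetical.

```python
import statistics
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Merge a k-mer with its reverse complement (assumption of this sketch)."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def median_kmer_scores(probes, values, k=8):
    """Median value over all probes that contain each (canonical) k-mer."""
    per_kmer = defaultdict(list)
    for seq, value in zip(probes, values):
        # A set ensures each probe is counted once per k-mer.
        kmers = {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}
        for kmer in kmers:
            per_kmer[kmer].append(value)
    return {kmer: statistics.median(vals) for kmer, vals in per_kmer.items()}
```

The same function can be applied to the measured training intensities and to the predictions of each PFM, giving three 8-mer score tables on which the weights can then be learned by regression.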

Comparison of algorithm performance on in vivo data.

We gauged the ability of each algorithm to predict in vivo TF binding by comparing the ability of their models to accurately predict ChIP-seq and ChIP-exo binding data. We searched for publicly available ChIP-seq data measuring the in vivo binding of any of the 66 mouse TFs evaluated here using a variety of sources, including the hmCHIP database53, ArrayExpress54 and the NCBI Gene Expression Omnibus55. Some data were unusable because scores were not assigned to individual peak calls. In total, we obtained data for five TFs: Esrrb (GEO accession GSM288355), Zfx (GEO accession GSM288352), Tbx20 (GEO accession GSM734426), Tbx5 (GEO accession GSM558908) and Gata4 (GEO accession GSM558904). We also obtained four yeast ChIP-exo experiments from the literature29.

For each in vivo data set, we defined a set of positive (bound) sequences and negative (control) sequences. Positive sequences were defined for ChIP-seq data as the 500 highest-confidence peaks, using only the middle 100 bases of each peak (similar results were obtained when using the middle 50 bases; Supplementary Table 7). Full-length sequence reads were used for ChIP-exo data. Negative sequences were defined in one of three ways: (i) 500 randomly chosen genomic regions of the same length as the positive sequences, excluding all repeat sequences using RepeatMasker; (ii) 500 sequences of length 100 (or 50) randomly chosen from promoter sequences, where promoters were defined as the 5,000 bases upstream of the transcription start site of RefSeq genes, excluding all sequences flagged by RepeatMasker (obtained from the UCSC Genome Browser56); (iii) 500 randomly shuffled positive sequences, where dinucleotide frequencies were maintained.
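Selecting the positive ChIP-seq windows can be sketched as follows. The peak representation (chromosome, start, end, score, with 0-based half-open coordinates) and the function name are assumptions of this example; peaks shorter than the requested width are simply kept at their full length.

```python
def top_peak_windows(peaks, n_top=500, width=100):
    """Return the central `width` bases of the `n_top` highest-scoring peaks.

    `peaks` is a list of (chrom, start, end, score) tuples.
    """
    ranked = sorted(peaks, key=lambda peak: peak[3], reverse=True)[:n_top]
    windows = []
    for chrom, start, end, _score in ranked:
        mid = (start + end) // 2
        w_start = max(start, mid - width // 2)
        w_end = min(end, w_start + width)
        windows.append((chrom, w_start, w_end))
    return windows
```

Negative sets (i) and (ii) can then be drawn with windows of the same length, so that sequence length does not differ between positives and controls.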

We assessed the PWMs produced by each algorithm by scoring the positive and negative sequences, and calculating the AUROC of the sequence scores using the positive and negative sequence labels. Positive and negative ChIP sequences were scored using the energy scoring framework of BEEML-PBM (setting mu to 0, and ignoring strand-specific biases). The final score for each algorithm on each TF was calculated as the mean AUROC across the three negative sequence sets. We also scored the sequences using the k-mer–based algorithms of Team_D, 8mer_sum, and FeatureREDUCE, and the dinucleotide algorithms of BEEML-PBM_dinuc and FeatureREDUCE_dinuc. We examined the performance of BEEML-PBM and FeatureREDUCE secondary motifs on the in vivo data using the PWMs and PWM weights learned from the in vitro data, as described above. To compare the in vitro generated motifs to in vivo-derived ones, we also used PWMs derived by ChIPMunk32 and MEME-ChIP33 when run on the same in vivo data in a cross-validation setting. For these analyses, half of the positive sequences were randomly chosen for training, and the other half were used for testing. This procedure was applied ten times, and the final numbers reported are the average evaluation scores across all ten iterations.
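The repeated half-and-half cross-validation used for the in vivo-derived motifs can be sketched as below. The function names, the fixed random seed and the `train_and_score` callable (which would train a motif on one half and return an evaluation score on the other) are hypothetical placeholders.

```python
import random

def repeated_half_splits(items, n_repeats=10, seed=0):
    """Yield (train, test) halves from independent random shuffles."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        shuffled = list(items)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        yield shuffled[:half], shuffled[half:]

def mean_cv_score(items, train_and_score, n_repeats=10, seed=0):
    """Average of train_and_score(train, test) over the repeated splits."""
    scores = [
        train_and_score(train, test)
        for train, test in repeated_half_splits(items, n_repeats, seed)
    ]
    return sum(scores) / len(scores)
```

Averaging over ten independent splits reduces the dependence of the reported score on any single random partition of the peaks.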

Data availability.

PBM data, GEO: GSE42864. The data are also available on the project website ( PWMs and algorithm source code are found in Supplementary Data 1 and 2.