Predicting conformational ensembles and genome-wide transcription factor binding sites from DNA sequences

DNA shape is emerging as an important determinant of transcription factor binding beyond just the DNA sequence. The only tool for large scale DNA shape estimates, DNAshape was derived from Monte-Carlo simulations and predicts four broad and static DNA shape features, Propeller twist, Helical twist, Minor groove width and Roll. The contributions of other shape features e.g. Shift, Slide and Opening cannot be evaluated using DNAshape. Here, we report a novel method DynaSeq, which predicts molecular dynamics-derived ensembles of a more exhaustive set of DNA shape features. We compared the DNAshape and DynaSeq predictions for the common features and applied both to predict the genome-wide binding sites of 1312 TFs available from protein interaction quantification (PIQ) data. The results indicate a good agreement between the two methods for the common shape features and point to advantages in using DynaSeq. Predictive models employing ensembles from individual conformational parameters revealed that base-pair opening - known to be important in strand separation - was the best predictor of transcription factor-binding sites (TFBS) followed by features employed by DNAshape. Of note, TFBS could be predicted not only from the features at the target motif sites, but also from those as far as 200 nucleotides away from the motif.


Training models: SVR, MLR and Elastic Nets:
Three different machine--learning models are used in this work selected subject to their suitability in different contexts, as follows.
1. A support vector regression (SVR) model with radial basis function (RBF) kernel was used to develop the DynaSeq core predictor, with DNA sequences from the MD trajectory data as inputs and the corresponding ensemble bin populations as an output. SVR with a radial basis function allows for non--additive contributions of various features hence this model was selected to account for sequence neighbors impacting the conformational ensembles in a non--additive manner. 2. For the predictions of DREAM5 sequence specificity values over training and test data, a Multiple Linear Regression (MLR) model was used considering the fact that the entire ensemble for all base positions within the input sequences is quite large and we do not wish to increase its complexity by introducing non--additive terms. This model was not cross-validated during training as the test data provides for the blind benchmark and was not utilized during the training of models. 3. For predicting TF binding sites compared to control, we opted for a regularized logistic regression. Lasso and ridge regression are well-known techniques to avoid overfitting in generalized linear models. Elastic net models provide a balance between the two regularization models (28) and have been successfully utilized in addressing important biological problems (e.g. (29)). Since, the objective of TF binding site classification is to avoid overfitting and estimating performance levels to allow comparison of various models, an elastic net regularized model with alpha=0.1 was utilized for classifying TF binding sites (30). During training, glmnet attempts to find the optimum model by trying different values of lambda and such models produce different prediction performances. We have selected a model with highest AUC score in each case.

Supplementary method SM3:
Benchmarking DynaSeq with known structures in PDB All high resolution (≤2.5Å) free DNA structures (without protein or RNA) available in the PDB were compiled and those with structural or sequence anomalies (e.g. quadruplex, triplex and modified base) were removed, leaving 115 entries (32). For each of these entries from the PDB, additional 1000 random sequences of the same length were generated and conformational ensembles were predicted for each one of them. The ability of DynaSeq to predict the sequence specificity of a given structure is determined by evaluating the agreement between observed conformational parameters on the one hand and those predicted for native and random genomic sequence on the other (native sequence refers to the one observed in the PDB file for the given structure under consideration). If the agreement on predicted conformations is better for the native sequence compared with random sequences, favorable sequence-specificity is established. The agreement between a single pair of structures is computed by first taking the Pearson's correlation coefficient between predicted and observed parameters one at a time. An average of 12 correlation coefficients obtained in this way is taken as the final measure of agreement between two structures. In this way an average correlation coefficients for each comparison (one native and 100 random) are obtained. The superiority of native predicted/observed agreement over random sequences is measured by taking the Z--score of these averaged Pearson's correlations.
! is the average Pearson correlation difference between observed (o) and predicted (p) structures of DNA sequence when the sequence being used for prediction is the same as observed in the PDB data and !" ! are the set of averaged correlation coefficients for a random sequences (r), which have the same length as (o). Denominator represents the standard deviation and <X> measures the average of values contained in X. The Z--score is converted to a p-value using a normal distribution function.
(2) To reflect the true nature of DynaSeq predictions, comparisons were made between the averages of predicted individual conformational parameters with those observed in the crystal structure. In addition, a complete atomic structure was modeled by using the conformational parameter values corresponding to the conformational bins, most populated in the predicted ensemble. Using the most populated bins instead of averages from each bin makes it more likely that the final three--dimensional structure represents a real conformation allowing structural alignment in a traditional manner and computing RMSD, which may be easier to understand. Ensemble populations for all these sequences were time--averaged and 12 conformational parameters are predicted at each position and the agreements between predicted and observed structures for native and random sequences are computed by Pearson's correlation for each parameter independently and averaging all these values. Z--score is used to quantify the degree of increase in the Pearson's correlation for the native sequence compared to the random set. By definition a higher negative Z--score indicates a favorable prediction for the native sequence compared to a random one.

Supplementary method SM4:
Treatment of DREAM5 data sets: DREAM5 data consist of pairs of experiments reporting binding activities of 86 mouse TFs for 32896 De Bruijn DNA sequences (8--mers each) (33). Each pair consists of a training set and a test set on binding affinities measured from protein binding arrays. We developed multiple linear regression (MLR) models over these data in two ways. First, the TFs were annotated to be binding or non-binding to each DNA sequence at a threshold of Z--score=4 (which is consistent with the class labels used in the original competition). MLR models were then trained to predict the binding class label from the DynaSeq--predicted and DNAshape--predicted features for the entire sequence. Leaving 2--bases on either terminal, the predictions are made for the remaining six positions and all predicted values are concatenated to form the inputs for the MLR model. Models are trained over TF data labeled as "HK design" and tested on one labeled as "ME design" in each case. MLR returns numerical values, which using various cutoffs were utilized to plot ROC curves and evaluate AUC under each feature set choice. Input feature sets used in this work consist of (1) four DNAshape features (2) DynaSeq--predicted values of the same four features that appear in DNAshape (named as DynaSeq.4D in this paper) and (3) DynaSeq.60D which uses the entire set of 60 features predicted by DynaSeq. Note that these numbers (4 or 60) refer to a single base position in one of the De Bruijn Sequences and the actual number of features input to MLR model is 6 times this number. In addition to AUC values, MLRs were also separately trained to predict raw scores available from DREAM5 data. This allows us to give a more quantitative estimate of affinity prediction performance, which is measured by a Pearson's correlation between predicted and observed values (in the test data).

Supplementary Result SR1:
Benchmarking with DREAM5 TF--binding specificity data: As a final benchmark to compare DNAshape and DynaSeq predictions, we used the data sets from DREAM5 competition. This data consists of 87 TFs with paired set of independently determined binding sites for each of them. Models are trained on one set and tested on the other within a pair using MLR. We computed the AUC and Pearson correlation between predictions by DNAshape features. For DynaSeq, we trained our models in two ways. First, we used predictions of the same four features as used in DNAshape for training the classifier (DynaSeq.4D) and compared if the two feature sets differ. Our prediction results for revealed that both DNAshape and DynaSeq performed similarly (Pearson correlation between performances in all 87 TFs is close to 0.9) when similar features are used for training these models (Supplementary Figure  SF2(a)). Even though the prediction results from DNAshape features and DynaSeq.4D features are similar, the AUC of the consensus shows significant improvement over DNAshape--based results with a mean AUC rising from 75.68% to 77.60% (p--value from t--test=0.0015, Supplementary Figure  SF2 (a)) and consensus is taken over the two models, the AUC of the consensus shows significant imporvement over DNAshape--based results with a mean AUC rising from 75.68% to 77.60% (p--value from t--test=0.0015). The scatterplot shows that the gain is more obvious for TFs poorly modeled by DNAshape. (c) and (d) compare the predictions from full set of DynaSeq features with the best of prediction scores obtained from DNAshape or corresponding 4D features from DynaSeq. These plots clearly establish that DynaSeq feature set outperformed DNAshape features by raising the average AUC to 93% (a large gain of 18% AUC). Supplementary Table ST1: Base pair and base--step parameters, representing DNA structure, considered in this work (Minor groove width not shown).