Local DNA shape is a general principle of transcription factor binding specificity in Arabidopsis thaliana

Understanding gene expression will require understanding where regulatory factors bind genomic DNA. The frequently used sequence-based motifs of protein-DNA binding are not predictive, since a genome contains many more binding sites than are actually bound and transcription factors of the same family share similar DNA-binding motifs. Traditionally, these motifs only depict sequence but neglect DNA shape. Since shape may contribute non-linearly and combinational to binding, machine learning approaches ought to be able to better predict transcription factor binding. Here we show that a random forest machine learning approach, which incorporates the 3D-shape of DNA, enhances binding prediction for all 216 tested Arabidopsis thaliana transcription factors and improves the resolution of differential binding by transcription factor family members which share the same binding motif. We observed that DNA shape features were individually weighted for each transcription factor, even if they shared the same binding sequence.

The comparison is based on the area under the precision recall curve (AUPRC) using ampDAP-seq peaks as ground truth. Only families with available data of at least 10 members were investigated to ensure a robust analysis. For the families C2C2dof, MYB related and G2like a significantly (p < 0.05, Wilcoxon-Mann-Whitney-Test) worse performance of the random forest model regarding precisionrecall relation was observed. For the transcription factor families NAC, AP2/EREBP and bZIP the performance was significantly (p < 0.05, Wilcoxon-Mann-Whitney-Test) better. The visualisation as well as the statistical tests were performed using the Dabest package 3 . Blue dots represent the area under the precision recall curve (AUPRC) using the regressor model which was trained on the DNA shape using DAP-seq data as ground truth, whereas red dots represent the AUPRC using solely the sequence motif derived from the ampDAP-seq peaks. For Members of the AP2EREBP, NAC and bZIP family, the binding prediction was consistently substantially improved. The improvement regarding the MYBrelated and C2C2dof family members was comparatively low.

Supplementary Figure 7: Correlation between dataset size and prediction performance.
The performance of each random forest model was plotted against the number of genomic sequences containing the binding motif. Overall, performance and dataset size correlate slightly negative with a R 2 of 0.154. The linear regression, as well as the two-sided p-value, was calculated using the SciPy 4 Python package, which applies the Wald test.

Supplementary Figure 8: Comparison of ChIP-seq and ampDAP-seq data performance.
For five transcription factors, which had data available for both experimental procedures 2,5-8 , the performance of binding site inference was compared. For ChIP-seq the ChIP-seq data were used as ground truth for training and for ampDAP-seq the ampDAP-seq data were used as ground truth. The sequence motif was searched throughout the Arabidopsis thaliana genome to extract all putative binding sites. For each TF the random forest models trained on ampDAP-seq data had higher AUPRCs than models trained on ChIP-seq data.
Supplementary Figure 9: Binding site prediction performance for differing sequence windows.
For each transcription factor the binding sequence window was varied up to 32 additional bases upstream and downstream from the core motif. The average AUPRC regarding the sequence search for the core motif is outperformed by calculating the DNA shape using only bases belonging to the core motif. Widening the sequence window with additional bases upstream and downstream from the core motif improves binding site predictiosn to an average AUPRC of approximately 0.6. The spread of prediction performance varies widely and is dependent on each transcription factor.

Supplementary Figure 10: Comparison of prediction performance between different machine learning approaches.
For each of the 216 transcription factors, three models using different machine learning approaches (gradient boosting, random forest and a baseline neural network) were trained. Each dot represents the performance of one of the 648 respective models. The data points are sorted according to the performance of the random forest approach.

Supplementary Figure 11: Frequency of the respective top 5 most important shape features for all TFs.
For each of the 216 TFs transcription factor a random forest model was trained. The respective feature importance was extracted and all features but the top 5 most important were discarded for each TF.
Most of the important shape features are located within the core motif.
Supplementary Figure 12: Intersections of "top 600 motif" occurrences and verified bind sites for AT3G16280 and AT5G51990.
Using the published motif derived from the top 600 peaks to scan for motif occurrences results in 103,779 hits. Although, the overlap between the locations is not as strong compared to using the core motif derived from all binding events instead of the top 600. Further, a noticeable amount of verified binding sites for AT5G51990 is only found in the extracted sequences for this transcription factor when the "top 600 motif" is applied. For experimental validation of the regressor predictions, DNA sequence with a high and low regressor prediction containing the same sequence motif not present in the genome of A. thaliana were generated. One sequence with a high prediction was labeled with biotin to detect DNA binding of HY5 by performing an EMSA. To compare binding affinities three sequences with high and low regressor predictions were used as competitors for the labeled sequence containing the same sequence motif. A shifted band is visible for all samples with low binding affinity prediction, one sample with high affinity prediction and the control without competitor sequence. Less visible shifted bands are observed in two samples with high predicted affinity and the control with the same competitor sequence. This EMSA experiment was performed two times. A cropped version of this gel image is visible in figure 3 and an uncropped version is available in the source data file. For experimental validation of the regressor predictions, DNA sequences not present in the genome of A. thaliana were generated. All sequences contain the extracted sequence motif for ANAC050. One sequence with a high regressor prediction was labeled with biotin to detect DNA binding of ANAC050 by performing an EMSA. To compare binding affinities three sequences with high and low regressor predictions were used as competitors for the labeled sequence. A shifted band is visible for two out of three samples with low binding affinity prediction and the control without competitor sequence. Less visible shifted bands are observed in all samples with high predicted affinity. This EMSA experiment was performed once. A cropped version of this gel image is visible in figure 3 and an uncropped version is available in the source data file.