The DNA binding landscape of the maize AUXIN RESPONSE FACTOR family

AUXIN RESPONSE FACTORS (ARFs) are plant-specific transcription factors (TFs) that couple perception of the hormone auxin to gene expression programs essential to all land plants. As with many large TF families, a key question is whether individual members determine developmental specificity by binding distinct target genes. We use DAP-seq to generate genome-wide in vitro TF:DNA interaction maps for fourteen maize ARFs from the evolutionarily conserved A and B clades. Comparative analysis reveal a high degree of binding site overlap for ARFs of the same clade, but largely distinct clade A and B binding. Many sites are however co-occupied by ARFs from both clades, suggesting transcriptional coordination for many genes. Among these, we investigate known QTLs and use machine learning to predict the impact of cis-regulatory variation. Overall, large-scale comparative analysis of ARF binding suggests that auxin response specificity may be determined by factors other than individual ARF binding site selection.

7-P14 on the QTL regulating TB1: it is not clear why the observations reported are significant. First, given the number of binding sites found in the genome, a statistical analysis is needed. Second, domestication has drastically changed plant architecture between teosinte and modern maize and the authors indicate strong conservation of the sites. It is then not clear that such sites would have played a role in domestication and why finding such sites is of interest. 8-From P14 on the use of the machine learning approach: It is not really clear what kind of information the authors really get with this approach notably on the sequences bound by the ARFs. They would need to explain that better so that one can understand why they suddenly use this approach rather than the logo-based analyses they have used before. Also finding ARF binding sites in a herbivore resistance QTL is an interesting correlation but the authors do not provide any evidence that these binding sites are of any functional significance. They authors would need to prov ide some functional evidence coming for example from existing mutations at these sites or maybe using auxin treatments to show that the sites they find confer different auxin response capacities to the DIMBOAgenerating genes in Mo17 and B73.
More minor comments: 9-P3 first paragraph: ARFs are not just activators; "ARFs to activate downstream target genes" needs to be rephrased. 10-P3 "C ARFs show neutral transcriptional activity": I am not sure I know what a neutral transcriptional activity is; this should be rephrased. 11-P14 "we observed that in total ~5-25% of ARF peaks overlapped with regions of open chromatin from at least one of these four tissue types (Figure S1F), with a subset of peaks falling in regions of open chromatin that were unique to each tissue type ( Figure 5A, Figure S7A)." C ould the authors describe further whether finding 5-25% of peaks in regions of open chromatin is meaningful? As such these numbers are not very informative.
Reviewer #2 (Remarks to the Author): The ARF family of transcription factors are central mediators of nearly aspect of plant growth. For decades, members of this family have been characterized as either transcriptional activators or transcriptional repressors, depending on the amino acid composition o f the middle region of the protein. Much progress has been made recently on the molecular basis of target gene binding by ARFs, which has led to many new hypotheses on potential mechanisms for ARF target gene specificity. This study fills many of the gaps in our understanding and clarifies what is known about ARF gene targets.
Overall, this manuscript is well-written and clearly laid out and I particularly appreciate the colorcoding used throughout the figures. I have no major criticisms of this work, wh ich appears to be well done and thoughtfully analyzed. I have a few minor comments that may improve the manuscript.
Minor comments: -As an Arabidopsis researcher, I felt that it might be helpful to also have the Arabidopsis ARFs in Figure S1, particularly because roles for these in development have been better characterized than the maize ARF family. Do the authors feel there is a benefit to this, or would this lead to more confusion (i.e., no clear conservation of roles of closely related members)?
-Some of the GEO accession data was missing from the Online Methods section. For example, in the Read Mapping section stating "A blacklist of peak regions appearing in all samples and the HALO -GST negative control was generated (see GEO accession).", no GEO accession number is listed.
-There were a few typos in the Online Methods section, mostly limited to capitalization and space issues. These should be corrected.
Reviewer #3 (Remarks to the Author): The manuscript by Mary Galli and collaborators reports on the maize DAP-seq exploration of 14 ARF transcription factors (TF), 7 of clade A (activators) and 7 of clade B (repressors). Due to the amount of data and the complexity of the analyses required, this is an important piece of work.
In a nutshell, they obtained their sequence data after cloning and expressing full -length maize ORFs of those TFs fused to Glutathione S-transferase (GST) in order to affinity-capture them bound to genomic DNA fragments. They also tested a negative control protein. A s this was done for a collection of cloned TFs one-by-one, it can be assumed that all captured DNA fragments must have been bound by either monomers or homo-oligomers of the same TF.
Regarding the experiment, I have a couple of questions from the point o f view of a computational biologist: 1) Is there a way to show that the GST fusion is not changing significantly DNA binding, or dimerization, at least for one ARF of each clade? Are there any C hIPseq datasets to compare?
2) It would be helpful to see a distribution of length of affinity-captured genomic fragments, so that we know how hard is to discover DNA motifs within them.
Regarding the data analysis, I have more comments/requests: 3) On Figure 1 the authors report a large number of DAP -seq peaks (124K). In C hIPseq analysis it is well known that tweaking parameters substantially affects the number of reported peaks, from a few hundreds to hundreds of thousands. Since this number is used all along the paper the authors might want to optimize the related parameters by comparison to known ARF sites, so that precision and recall can be estimated. For instance, they could check whether different parameters change the proportions in barplot Fig1B. Figure 2D suggests this issue is important. 4) On page 10 and Figure 3 the authors report on the co-occurrence of TGTC nearby motifs in any orientation. While their analyses make sense, it might be a good idea to confirm those exercises using an algorithm such as dyad-analysis, which is integrated in tool peak-motifs (https://www.ncbi.nlm.nih.gov/pubmed/18802440 , https://www.ncbi.nlm.nih.gov/pubmed/27557775), which rigorously calculates enrichment of this kind of split motifs provided that spacer length is conserved. Figure 3B proofs that DNA binding affinity increases with the number of TGTC motifs in a peak. It rather shows that DAP is more efficient (captures more genomic fragments) if several TGTC sites are found.

5) I don't agree that
6) The exact phases described in Figure 3D are hard to see. It would be easier if bars of 10bp are added on top of the heatmap, at least for clade A. For both clades it would be helpful to have conceptual models/plots of how dimers bind to those sites, if the A. thalina protein structures allow it. 7) Regarding Figure 4B, isn't 10 kb too wide a distance interval, particularly since gene bodies and proximal regions are enriched? What happens if 1kb is used instead? This is related to Fig4C . In eiter case, the barplot would improve if notches were added to define, for instance, 95% confidence intervals around the medians. 8) I have several comments about the machine learning section: 8.1) It's not clear the size of the training and validation sequences sets, nor whether training and validation sets were made by separate. In addition, the code should be published in a source code repo for the sake of transparency and reproducibility. "The python code to implement this is available upon request" is not good enough because the code could change in the future with no logged changes.
8.2) The authors say that "ARF16 binding peaks in the second DIC E element differed slightly relative to those in the first DIC E element, likely due to substantial sequence differences ( Figure 6D)." Instead of this the authors should report objectively on the exact sequences differences at that locus. 9) On page 4 the authors say that "Maize contains 32 expr essed ARFs". In the current context of pangenomes and pantranscriptomes (http://www.plantcell.org/content/early/2014/0131/tpc.113.119982) the authors should say to which cultivar(s) this figure applies. is it B73?
We would like to thank the reviewers for their comments and suggestions that we feel overall significantly improve and strengthen our manuscript. We addressed all of the concerns raised and adjusted the manuscript text and Figures accordingly. All changes have been highlighted in the revised text.

Reviewers' comments:
Reviewer #1 (Remarks to the Author): This manuscript by Galli and colleagues describes the generation and the analysis of the binding landscape of a large number of Auxin Response Factors (ARFs) in Maize. Auxin is a crucial morphogenetic regulator in plants and ARFs are central to transcriptional regulation in response to auxin. How ARFs contribute to the specificity of auxin responses is an important and open question. The authors use an in vitro high-throughput approach called DAP-Seq to tackle this question and the datasets presented in this manuscript are of broad interest to understand auxin signaling and auxin function during development. The authors also analyzed chromatin opening using ATAC-Seq and analyze the relation between binding and chromatin opening. The analysis presented by the authors is well executed and they consolidate significantly the idea that ARF activators and repressors prefer different motif orientations and spacing (an idea already found in O' Malley et al. 2016). They also show that ARF binding sites are present in important maize QTL. While there are some limited concerns with the analysis of ARF binding (but see below), the significance of binding sites present in regions where QTL are localized (domestication QTL and an herbivore-resistance QTL) is not demonstrated. The paper would benefit from data showing that indeed identifying ARF binding sites in regions where QTL are mapped can help understanding the molecular basis of QTLs.
Major concerns: 1-The authors should make clear from the very start (i.e. abstract, introduction and title of first result section) that the DAP-Seq is an in vitro approach. This is clearly stated quite late in the manuscript, which is quite confusing.
This was an unintentional oversight. We have modified the text of the abstract, intro and first results as follows: ABSTRACT INTRODUCTION 2-P5: it is confusing to introduce already some of the results using ATAC-Seq without any presentation of the approach. The authors should keep that for the dedicated result paragraph and revise/reorganize the figures accordingly.
We have moved Fig 3-P5: "We also observed strong peaks in putative regulatory regions of other known direct targets of Arabidopsis ARFs including homologs of TMO6 and LFY14,15, suggesting conserved transcriptional regulation across the 150 million years of evolution that separate Arabidopsis and maize ( Figure S2B,C)" It is quite farfetched to use only two ARF targets to support a conclusion on evolutionary conservation. The authors should extend their analysis to other known targets (DRNL, AHP6, etc) and also use the published ARF2/5 DAP-Seq for Arabidopsis (O'Malley et al. 2016) to check whether this is effectively the case. This comparison with the Arabidopsis DAP-Seq datasets will also strengthen the conclusions on the biological relevance of Maize datasets.
We appreciate the reviewer pointing out this potentially misleading statement. Our observation about the conservation of two conserved target genes was not intended to imply that there was a high degree of conservation among all Arabidopsis and maize target genes. We have modified the text to clarify our statement that some ARF target genes are likely conserved across evolution. We have modified the text as follows: 4-Beginning of P9: "Only seven conserved class-specific amino acid differences are present within this region and only two are found in the B3 DBD itself ( Figure  S3J)." A single amino-acid difference can entirely change the behavior of the protein. It would be interesting to see where these amino-acids are in the structure and whether their localization could suggest a limited impact on the DNA binding.
Alternatively the text should be revised. We have modified the text to account for the reviewer's comment. We chose to modify the text instead of including an analysis of where these residues are on the protein structure, because we feel that such analysis can be more thoroughly addressed in future studies with experimental validation. The new text reads: "Seven conserved class-specific amino acid differences are present within this region, including two that are found in the B3 DBD itself" 5-P10 "In contrast, randomly selected 100bp genomic regions contained a much higher percentage of instances with zero or only one TGTC (50% and ~34% respectively), and a much lower percentage of regions with two or more TGTCs (~16%; Figure 3A and Figure S4F), indicating that ARFs bind more frequently to sites containing more than one TGTC motif. " A statistical analysis would be required to properly support this conclusion.
We have corrected this unintentional omission. The text now reads: " 6-P11 first half of the page: the trend that the authors describe in the text (increase of peak intensity with two TGTC and extra increase with three or more TGTC) is not really seen on Fig 3b where the peak intensity seems to go up more around 7-8 TGTC. Also the authors indicate p-values for significance in the text but nothing is said about the statistical treatment of these data. The authors need to clarify this as it is not clear that their results support their conclusions.
We have clarified the text to address the reviewer's concern. The text now reads: " 7-P14 on the QTL regulating TB1: it is not clear why the observations reported are significant. First, given the number of binding sites found in the genome, a statistical analysis is needed. Second, domestication has drastically changed plant architecture between teosinte and modern maize and the authors indicate strong conservation of the sites. It is then not clear that such sites would have played a role in domestication and why finding such sites is of interest.
To address the first concern, we now provide statistical support showing that given an effective genome size of ~2Gb (nuclear chromosomes), the chances that one of our ARF peaks lies in a region of ear open chromatin is unlikely to occur by random chance. We have modified the main text and methods as follows:

Main text:
Regarding the reviewer's second comment that we observe no difference at the DNA sequence level in the ARF binding site between maize and teosinte: Despite this high degree of sequence homology, we feel that ARF binding events located in this region are of interest for at least two reasons. First, regardless of whether these binding events occur in teosinte, our finding that ARF peaks are present in an ear-specific region of open chromatin in modern maize is interesting in and of itself for those studying tissue-specific TF binding dynamics and maize inflorescence architecture. Second, a high degree of DNA sequence conservation among maize and teosinte does not eliminate the possibility that there could be DNA methylation differences and/or differences in tissue-specific open chromatin within the teosinte sequence that could affect ARF binding. Because such factors can influence TF binding, we feel that the presence of the ARF binding site will be of interest to people investigating this particular QTL.
We have modified the text to hopefully better clarify these possibilities: " 8-From P14 on the use of the machine learning approach: It is not really clear what kind of information the authors really get with this approach notably on the sequences bound by the ARFs. They would need to explain that better so that one can understand why they suddenly use this approach rather than the logobased analyses they have used before. Also finding ARF binding sites in a herbivore resistance QTL is an interesting correlation but the authors do not provide any evidence that these binding sites are of any functional significance. They authors would need to provide some functional evidence coming for example from existing mutations at these sites or maybe using auxin treatments to show that the sites they find confer different auxin response capacities to the DIMBOA -generating genes in Mo17 and B73.
learning approach was first intended to serve as an unbiased method to assess ARF binding trends. Indeed the independent models generated by this method supported our findings regarding the differences and similarities among clade A and clade B ARFs. We then chose to demonstrate how this machine learning approach could be used predict TF binding across maize inbred lines. Certainly we could have run 14 DAP-seq assays with each of our ARFs using Mo17 DNA, however we intended to show that a machine learning approach could be used as a scalable, cost-saving measure to instead predict how binding would be affected in these lines. The empirical binding to the Mo17 DICE element that we observed with ARF16 supported the machine learning findings and demonstrated that our predictions for the other thirteen ARF datasets were reliable. We envision that such an approach could ultimately be used to predict large scale TF binding events across many maize inbred lines and serve as a basis to investigate cis-regulatory The second concern raised by the reviewer regards providing evidence for the function of the additional ARF binding sites in Mo17. Unfortunately, we are not aware of any existing material containing mutations at these sites that could be used to support their functionality. Targeted mutagenesis of these binding sites in Mo17 would be an excellent follow up experiment and we are certainly interested in testing this, although it is a long-term goal given the time needed for maize transformation. We also like the proposed idea to carry out auxin treatments on B73 and Mo17, however we feel that thoroughly substantiating a role for auxin in the BX biosynthesis pathway would require extensive experiments and functional data that would distract from the central focus of this manuscript. We do note however that while no direct connection has yet been made between BX biosynthesis and auxin signaling, there is some evidence suggesting that benzoxazinoids could have a role in auxin signaling. We have included a reference to the paper We have modified the text as follows: " More minor comments: 9-P3 first paragraph: ARFs are not just activators; "ARFs to activate downstream target genes" needs to be rephrased. We intended here to highlight how the canonical auxin signaling pathway is believed to operate. We have modified the text to account for the fact that not all ARFs fit into this pathway. " 10-P3 "C ARFs show neutral transcriptional activity": I am not sure I know what a neutral transcriptional activity is; this should be rephrased.
The text has been changed to: " 11-P14 "we observed that in total ~5-25% of ARF peaks overlapped with regions of open chromatin from at least one of these four tissue types ( Figure S1F), with a subset of peaks falling in regions of open chromatin that were unique to each tissue type ( Figure 5A, Figure S7A)." Could the authors describe further whether finding 5-25% of peaks in regions of open chromatin is meaningful? As such these numbers are not very informative. As stated in the text ("In agreement with previous DAP-seq finding, we observed that in total ~5-25% The ARF family of transcription factors are central mediators of nearly aspect of plant growth. For decades, members of this family have been characterized as either transcriptional activators or transcriptional repressors, depending on the amino acid composition of the middle region of the protein. Much progress has been made recently on the molecular basis of target gene binding by ARFs, which has led to many new hypotheses on potential mechanisms for ARF target gene specificity. This study fills many of the gaps in our understanding and clarifies what is known about ARF gene targets.
Overall, this manuscript is well-written and clearly laid out and I particularly appreciate the color-coding used throughout the figures. I have no major criticisms of this work, which appears to be well done and thoughtfully analyzed. I have a few minor comments that may improve the manuscript.
Minor comments: -As an Arabidopsis researcher, I felt that it might be helpful to also have the Arabidopsis ARFs in Figure S1, particularly because roles for these in development have been better characterized than the maize ARF family. Do the authors feel there is a benefit to this, or would this lead to more confusion (i.e., no clear conservation of roles of closely related members)? We feel this is an excellent addition and have modified the phylogeny in Figure  S1.
-Some of the GEO accession data was missing from the Online Methods section. For example, in the Read Mapping section stating "A blacklist of peak regions appearing in all samples and the HALO-GST negative control was generated (see GEO accession).", no GEO accession number is listed. We appreciate the reviewer catching this error. We were unable to submit this data to GEO due to their submission criteria. We instead had attached this as a supplemental table, but forgot to update the Online methods section. We have now modified the methods to read: -There were a few typos in the Online Methods section, mostly limited to capitalization and space issues. These should be corrected. We proofread the methods section and corrected many of errors. Please see track changes in text document.
Reviewer #3 (Remarks to the Author): The manuscript by Mary Galli and collaborators reports on the maize DAP-seq exploration of 14 ARF transcription factors (TF), 7 of clade A (activators) and 7 of clade B (repressors). Due to the amount of data and the complexity of the analyses required, this is an important piece of work.
In a nutshell, they obtained their sequence data after cloning and expressing fulllength maize ORFs of those TFs fused to Glutathione S-transferase (GST) in order to affinity-capture them bound to genomic DNA fragments. They also tested a negative control protein. As this was done for a collection of cloned TFs one-by-one, it can be assumed that all captured DNA fragments must have been bound by either monomers or homo-oligomers of the same TF.
Regarding the experiment, I have a couple of questions from the point of view of a computational biologist: 1) Is there a way to show that the GST fusion is not changing significantly DNA binding, or dimerization, at least for one ARF of each clade? Are there any ChIPseq datasets to compare? Unfortunately, there are no maize ChIPseq dataseqs that we can compare to. We have however performed a DAP-seq experiment using the DNA-binding domain (DBD) of ARF4 (cladeA) and ARF39 (cladeB) fused to a HALO-tag and expressed in an in vitro rabbit reticulocyte-based transcription/translation system. (Full length HALO-ARF constructs were also tested but failed to produce any peaks). While these HALO-ARF-DBD experiments returned fewer peaks (below our cutoff for success datasets) than our E.coli purified GST-ARF DAP-seq experiments (possibly due to a lower amount of protein), over 93% of the HALOtagged ARF peaks were present in the GST-tagged ARF datasets. Motif enrichment produced identical top motifs for HALO-DBD fusions, no statistical difference was seen between the average number of TGTCs within peaks (Fisher's exact test), and spacing patterns were similar. We therefore believe that the GST tag likely does not influence ARF binding activity.
2) It would be helpful to see a distribution of length of affinity-captured genomic fragments, so that we know how hard is to discover DNA motifs within them.
As the first step in our library construction we have sheared our genomic DNA to 200bp fragments using a covaris S2 followed by an Ampure bead size selection step. This procedure produces very well defined genomic fragments of the expected size. When viewed by agarose gel separation, a typical final library which includes illumina sequencing adapters produces fragments of 350-400bp. Given that the illumina sequencing adapter adds ~170 bp total, we note that most input genomic fragments are ~200bp in length.
Regarding the data analysis, I have more comments/requests: 3) On Figure 1 the authors report a large number of DAP-seq peaks (124K). In ChIPseq analysis it is well known that tweaking parameters substantially affects the number of reported peaks, from a few hundreds to hundreds of thousands. Since this number is used all along the paper the authors might want to optimize the related parameters by comparison to known ARF sites, so that precision and recall can be estimated. For instance, they could check whether different parameters change the proportions in barplot Fig1B. Figure 2D suggests this issue is important. Unfortunately, there are no known maize ARF sites with which to calibrate our cutoffs. We initially tested our data however with three different FDR thresholds (FDR<0.01, 1e-5, and 1e-7) and chose a universal cutoff of FDR< 1e-5 which maximized the number of datasets giving greater than 2% reads in peaks (RiP) while restricting peak numbers such that comparative analysis between ARF datasets was reasonable. At each cutoff level, similar numbers of gene feature enrichments (Fig.1B) were observed, although clade A ARFs run with default parameters produced slightly decreased enrichment in promoter, TTS and exonic peaks. Figure 3 the authors report on the co-occurrence of TGTC nearby motifs in any orientation. While their analyses make sense, it might be a good idea to confirm those exercises using an algorithm such as dyad-analysis, which is integrated in tool peak-motifs (https://www.ncbi.nlm.nih.gov/pubmed/18802440 , https://www.ncbi.nlm.nih.gov/pubmed/27557775), which rigorously calculates enrichment of this kind of split motifs provided that spacer length is conserved. We very much appreciate the recommendation to use this analysis tool, which we were not previously aware of.

4) On page 10 and
By performing the suggested de novo dyad motif discovery analysis on each of our ARF datasets we were able to independently verify enrichment for many of the spacing configurations that we observed in our initial directed analysis, confirming that certain clade A and clade B ARFs prefer different spacing/orientation configurations. The most frequently returned spacing pattern identified by this program was the TGTC:GACA configuration with a spacing of 11 and 12 nucleotides and was found in both clade A and clade B ARFs. This configuration corresponds to the ER7,8 configuration that was captured in the crystal structure of Boer et al., 2014.
We have added a sentence in the main text to highlight the fact that de novo motif analysis uncovered many of the same motifs as our directed analysis. The text now reads: " (Defrance et al., 2008, Castro-Mondragon et al., 2016 was also performed ." Methods Figure 3B proofs that DNA binding affinity increases with the number of TGTC motifs in a peak. It rather shows that DAP is more efficient (captures more genomic fragments) if several TGTC sites are found. We understand the reviewer's concern regarding this section. We have reworked the text already in response to reviewer #1's concern and have added an additional comment to the main text to acknowledge reviewer #2's concern. " 6) The exact phases described in Figure 3D are hard to see. It would be easier if bars of 10bp are added on top of the heatmap, at least for clade A. For both clades it would be helpful to have conceptual models/plots of how dimers bind to those sites, if the A. thalina protein structures allow it. We have added 10bp bars to the heatmap figure to highlight the 10bp phasing.

5) I don't agree that
We would prefer to not include conceptual models because we feel it would be too speculative based on our data and the fact that a crystal structure is only available for the TGTC GACA (also referred to as the everted repeat orientation). Further experiments with mutations in key residues would be needed to support any speculative model. Figure 4B, isn't 10 kb too wide a distance interval, particularly since gene bodies and proximal regions are enriched? What happens if 1kb is used instead? This is related to Fig4C. In eiter case, the barplot would improve if notches were added to define, for instance, 95% confidence intervals around the medians. We have selected a distance of 10kb to assess the number of peaks in auxin induced genes given that this is the distance at which the majority of ARF peaks, (particularly in the case of the cladeA ARFs). We performed similar calculations using 1kb, 2kb, and 5kb distances and always observe a significance difference in the number of peaks relative to random genes, however the differences are not as striking at these shorter distances. Hand curation of several previously studied genes such as the Aux/IAAs Bif1 and Bif4 (Galli et al., PNAS 2015) as well as many new examples added to the supplementary figures (FigS2C-K), demonstrates that peaks are often observed at distances greater than 5kb and we feel taking this into consideration is important.