Quantitative integration of epigenomic variation and transcription factor binding using MAmotif toolkit identifies an important role of IRF2 as transcription activator at gene promoters

Eukaryotic gene transcription is controlled by a large cohort of chromatin-associated proteins including transcription factors (TFs) and epigenetic regulators 1,2 . ChIP-seq experiments are now widely used to characterize the genome-wide binding of these proteins, and comparing ChIP-seq data from different cell types can provide valuable insight into understanding how cell type-speci ﬁ c transcriptional programs are established 3 . Speci ﬁ cally, epigenetic regulators often show dynamic chromatin binding during development and disease progressions 2 . However, most of them are broadly expressed across tissues, and their chromatin binding is thought to be mainly modulated by crosstalk with TFs, which could be considered as their cell type-speci ﬁ c co-factors 4 . Thus, identifying TFs that preferentially bind at the genomic regions differentially bound by a chromatin-associated protein between different cell types has become an important step toward deciphering the molecular mechanism modulating its chromatin binding 4 . Moreover, applying this analysis to histone modi ﬁ cations marking active regulatory elements such as H3K4me1-3 and H3K27ac is frequently used for discovering cell type-speci ﬁ c regulators 5 . A traditional way of this analysis is to ﬁ rst detect the cell type-speci ﬁ c ChIP-seq peaks of the protein of interest, which are typically de ﬁ ned as those

MAmotif is used to quantitatively compare two ChIP-seq samples of the same protein but from different cell types (or conditions) and identify co-factors associated with the cell type-biased binding of this protein using the binding information of candidate TFs obtained from motif analysis (or other resources such as ChIP-chip/seq data). In MAmotif, we switch to use the log2-ratio of ChIP-seq intensities between two samples at each peak, i.e. the M value calculated by MAnorm model [5], to quantitatively represent the cell type specificity of this peak. This is mainly due to two reasons. First, when comparing ChIP-seq samples from different cell types, it was frequently found that the cell type specificity of peaks cannot be precisely characterized by simply classifying them into cell type-specific and non-specific ones by whether or not they overlap with peaks identified from other cell types [5][6][7], especially when the ChIP-seq samples are from two closely related cell types or conditions [2]. On this account, it has been suggested to better describe the binding changes in a quantitative way [5,7]. Second, it was also found that genomic regions with greater chromatin state changes across cell types usually are more likely to be directly co-occupied by key cell type-specific regulators [8]. As an example, we applied MAnorm to compare the ChIP-seq data of H3K27ac, a histone mark of active gene promoters and distal enhancers, between H1 hESCs and K562 cells. After ranking all the H3K27ac peaks by their log2-ratios of ChIP-seq intensities between two cell types, it can be clearly viewed that peaks with higher hESC-biased H3K27ac levels are more likely to contain ChIP-seq peaks of hESC-specific regulators POU5F1, SOX2 and NANOG, while peaks with higher K562-biased H3K27ac levels are more likely to be co-occupied by K562-specific regulators TAL1 and GATA1 ( Supplementary Fig. S1a). To test whether we can reproduce this observation with motif analysis, we scanned all the H3K27ac peaks of two cell types with the vertebrate TF binding motifs downloaded from JASPAR database [1] using our Motif-Scan toolkit (a detailed description of this toolkit can be found in the following sections), which has been successfully tested in our previous studies [2,5,[9][10][11]. Consistently, we found H3K27ac peaks with greater ChIP-seq intensity changes showed a higher enrichment of corresponding cell type-specific regulators' binding motifs ( Supplementary Fig. S1e), which in turn could also support the effectiveness of Motif-Scan toolkit in predicting TF binding sites. On the other hand, for each motif, we systematically classified the H3K27ac peaks identified from each cell type into motif-present and motif-absent ones, based on whether or not they were detected to contain this motif in their sequences by Motif-Scan. Interestingly, it could be easily found that peaks having the binding motifs of POU5F1, SOX2, TAL1 and GATA1 exhibited obviously greater changes of H3K27ac levels than the corresponding motif-absent peaks ( Supplementary Fig. S1f). Meanwhile, peaks with the binding motifs of SPZ1 and HNF4G, two TFs that haven't been reported to have clear functional selectivity between two cell types, failed to show significantly different levels of H3K27ac changes compared to the other peaks ( Supplementary Fig. S1f). Taken together, these analyses not only validated the power of quantitative comparison on ChIP-seq data, but also shed light on how to utilize the quantitative measure of ChIP-seq intensity changes.
Based on these observations, we combine MAnorm and Motif-scan as two basic functional modules of MAmotif, and incorporate a new integration module to examine the association between the quantitative changes of ChIP-seq intensities calculated by MAnorm and the occurrences of each TF binding motif detected by Motif-Scan. The significance of this association could be used to infer whether the corresponding TF can be a candidate cell type-specific co-factor accounting for the cell type-biased binding detected between the two ChIP-seq samples under comparison. Here we briefly describe its workflow ( Figure 1). First, it takes the coordinates of the peaks and aligned reads of two ChIP-seq samples as main input, together with the TF binding motifs chosen for motif analysis. Then, the input data are automatically processed by MAnorm and Motif-Scan modules, to get the log2-ratio of normalized ChIP-seq read densities as well as the occurrences of all input motifs at each peak region.
Subsequently, they are sent to the integration module, and two statistical tests are applied to assess the association between the ChIP-seq intensity changes and the occurrence of each input motif (right panel of Figure 1a), by checking whether the peaks having this motif (named as motif-present peaks) show significantly higher ChIP-seq intensity changes between two ChIP-seq samples than peaks with no motif occurrence (named as motif-absent peaks). If a motif passes both tests, which means the motif-present peaks of this motif are significantly more likely to have cell type-biased binding than the motif-absent peaks, this motif (i.e. the corresponding TF) will be reported as a candidate cell type-specific co-factor.

Introduction of the workflow and implementation of MAmotif toolkit
MAmotif takes four bed files describing the coordinates of all predefined peaks and aligned reads of two ChIP-seq samples as the main input. Besides, it also needs the gene annotation file of the corresponding genome assembly and fasta files of genome sequences, as well as a text file containing the PWMs of all candidate motifs as input for motif analysis. Of note, users can use the processed motif data for the motifs in JASPAR database (including the PWM of each motif and the motif score cutoffs corresponding to difference significance levels), which can be found on the webpage of Motif-Scan, to directly run a MAmotif analysis.
MAmotif starts from processing the input peak and read information of two ChIP-seq samples using MAnorm module to calculate the log2-ratio of normalized ChIP-seq intensities (represented by the ChIP-seq read densities) at each peak region (extended/truncated to the same length from peak summit, and here this length is set to be 2kb for H3K27ac and H3K4me3 as suggested for histone modifications with sharp peaks). Meanwhile, the DNA sequence of each peak region is extracted and scanned by the Motif-Scan module with input motifs, to detect the occurrence of each candidate motif in the sequence with motif score higher than the cutoff specified by the user. Here the length of peak region for motif scanning is set to be 1 kb as suggested. Next, the log2-ratios of ChIP-seq intensities calculated by MAnorm and the occurrences of each candidate motif detected by Motif-Scan are combined in the integration module. For each motif, all the peaks identified from one cell type, e.g. cell type A, are systematically classified into motif-present and motif-absent ones, based on whether or not it is found to contain at least one occurrence of this motif. Here users can choose to exclude the motifs being present in a very large fraction of peaks (e.g. >50%), which usually are motifs with low information content and may introduce a lot of false positives. Then, two statistical tests, including the Students' t-test and Mann-Whitney-Wilcoxon (MWW) rank-sum test, are applied to test whether the motif-present peaks generally have higher log2-ratios of ChIP-seq intensities (the ratios are calculated as cell type A/B) than motif-absent ones. Finally, the P-values of the two tests are corrected for multiple testing using the Benjamini-Hochberg approach, and the less significant one is used to rank the candidate motifs. Of note, MAmotif provides an option to separate promoter and non-promoter peaks based on input gene annotations before doing the statistical tests, since it was often found that the cell type-biased binding of many chromatin-associated proteins, especially the histone modifications, may be associated with different co-factors at gene promoter and distal regions. Besides, MAmotif can also be used to compared DNase-seq and ATAC-seq data and identify candidate TFs associated with the cell type-biased open chromatin sites.
Currently MAmotif is written in Python. The stand-alone version of all its three main functional modules are also provided. Their source codes and user manuals can be found at https://github.com/shao-lab/ and http://bioinfo.sibs.ac.cn/shaolab/opendata.php.

Workflow of the Motif-Scan toolkit embedded in MAmotif
Motif-Scan is a computational toolkit designed to scan input genomic regions with known DNA motifs, and check whether any of the motifs are significantly over-or under-represented in input genomic regions compared to the control regions randomly selected from the genome ( Supplementary Fig. S1b), or compared to another set of genomic regions provided by the user as controls (this option can be used to identify motifs differentially enriched in two groups of genomic regions, e.g. the unique peaks identified from two ChIP/DNase/ATAC-seq samples).
Motif-Scan takes a set of genomic regions (or two sets of genomic regions to perform differential enrichment analysis), the DNA motifs of interest, as well as the fasta file of each individual chromosome's DNA sequence and the gene annotation file of the corresponding genome assembly as input. If the input genomic regions are ChIP/DNase/ATAC-seq peaks, we suggest to truncate/extend them to the same length (from peak center by default, or from peak summit if available) before doing motif scanning, in order to make the number of a motif's occurrences comparable across different peaks. Here the length for motif scanning can be customized by the users (otherwise they can choose to scan the entire input regions without any truncation/extension), which is suggested to be 1 kb for DNase/ATAC-seq peaks and ChIP-seq peaks of TFs and histone modifications with sharp peaks. If the user did not provide any control regions, Motif-Scan will select a set of random control regions from the genome for each input region (typically the number is set to be 5), with controlling the distance to the nearest gene's TSS to be the same as the input region (unless it is far from known genes' TSSs, e.g. with a distance to the nearest gene's TSS >10kb). In this way, the random control regions can be assumed to have a similar sequence background to the input regions, as gene promoters often have high GC content and are rich of TF binding motifs.
After extracting the genome sequences of all the input and control regions, for each motif Motif-Scan uses a sliding window with the same length of the motif to scan the sequences at a step size of 1 base pair (bp). At each step, a motif score is calculated to represent the similarity between the sequence fragment in the window and the motif ( Supplementary Fig. S1c), which equals to 1 if the sequence fragment is right the best match of the motif, given the genome background nucleotide frequencies [5,10]. Then, the motif score is compared with the genome background motif score distribution of this motif, which is generated from 5*10 6 sequence fragments randomly sampled from the genome, and the sequence fragment will be labeled as a significant match of the motif if its motif score is higher than the cutoff specified by user. Here we suggest to use the motif score cutoff corresponding to a certain significance level, e.g. P-value 0.0001 based on the genome background motif score distribution. It means that among 10000 random sequence fragments, it's expected to have at most one sequence fragment with motif score higher than this cutoff. Besides providing the users with a script to model the background motif score distribution in a specific genome with their own motifs, we also compiled the motif score cutoffs corresponding to difference significance levels for the TF-binding motifs obtained from JASPAR database [1,12] in several frequently used genomes (such as human and mouse).
In this way, the users can use these motif annotation files to directly perform motif scanning with all JASPAR motifs.
When motif scanning is done, several statistics will be calculated for each motif to represent its enrichment/depletion in input genomic regions as compared to the control regions: 1) fold enrichment calculated as the ratio between the fraction of input regions having this motif and that of control regions, with the value higher/lower than 1 indicating the motif is over/under-represented (enriched/depleted) in input regions, respectively; 2) a P-value calculated based on hyper-geometric distribution to represent whether the enrichment/depletion is of statistical significance.
Since Motif-Scan is mainly developed for performing motif analysis on peaks identified from ChIP/DNase/ATAC-seq data, it will additionally output several figures to visualize the enrichment and also the distribution of motif occurrences in the input regions ( Supplementary Fig. S1d).
They can help users to assess the association between a motif's occurrence and the ChIP/DNase/ATAC-seq signal intensities at peak regions. This analysis can be further used to infer the data quality if the association is well established. For example, the occurrence of a TF's binding motif typically is positively correlated with the ChIP-Seq intensities at its ChIP-Seq peaks.
More specifically, peaks with stronger ChIP-seq intensities are more likely to contain its binding motifs (as shown in the left panels of Supplementary Fig. S1d), and within each peak, its binding motif is also more likely to be observed at the peak summit compared to flanking regions (as shown in the right panels of Supplementary Fig. S1d).

MAmotif can also utilize TF binding information from available ChIP-chip/seq data
Inspired by the analysis with IRF2 ChIP-seq peaks, we additionally provide a function in MAmotif toolkit to utilize TF binding information obtained from other resources, such as peaks of ChIP-chip/seq samples, to test the association between each candidate TF's binding and the ChIP-seq signal changes detected from the samples under comparison. To validate the effectiveness of this approach, we repeated the MAmotif analysis with all the human TF ChIP-seq peaks downloaded from Cistrome Data Browser [3], instead of the occurrences of JASPAR motifs detected by Motif-Scan. Consistently, we found only the IRF2 ChIP-seq peaks of adult proEs exhibited significant association with adult-biased H3K4me3 promoter peaks (P-value=3e-5). This finding not only validates the reproducibility of our prediction based on motif analysis, but also suggests this additional function of MAmotif can provide a valuable approach to reuse the huge amount of public ChIP-seq data.

Traditional overlap-based approach used in this study to detect co-factors associated the differential H3K4me3 peaks between adult and fetal proEs
In this study, the traditional overlap-based approach was also used to compare the H3K4me3 ChIP-seq data of adult and fetal proEs and identify co-factors associated with adult-specific H3K4me3 promoter peaks. First, adult-specific H3K4me3 peaks were defined as the H3K4me3 peaks of adult proEs that do not overlap with any H3K4me3 peak detected in fetal proEs, and the other H3K4me3 peaks of adult proEs were named as non-specific peaks. Then, Motif-Scan was applied to scan the adult-specific and non-specific H3K4me3 promoter peaks with the same parameters as those used in MAmotif analysis, and a P-value was calculated for each motif based on hypergeometric distribution to represent its differential enrichment between adult-specific H3K4me3 promoter peaks and the non-specific peaks, which was used to finally select the top candidate motifs.

Analysis of IRF2's binding at the differential H3K4me3 peaks defined by MAnorm
To more clearly illustrate the association between IRF2's binding and the H3K4me3 promoter peaks with stage-biased H3K4me3 levels, here we further defined stage-biased H3K4me3 peaks based on the quantitative comparison of H3K4me3 ChIP-seq data between adult and fetal proEs using MAnorm. Here, adult and fetal-biased H3K4me3 peaks were defined as the H3K4me3 peaks identified from adult and fetal proEs with fold change of H3K4me3 ChIP-seq intensities higher than 1.5 between adult and fetal proEs and P-value lower than 0.05, and unbiased H3K4me3 peaks were defined as the H3K4me3 peaks with fold change of H3K4me3 ChIP-seq intensities lower than 1.5. It can be seen that a significantly higher fraction of adult-biased H3K4me3 promoter peaks were detected to contain IRF2 motifs by Motif-Scan compared to fetal-biased and unbiased H3K4me3 promoter peaks ( Supplementary Fig. S2g), while the GATA2 motif was not found to be significantly more enriched in adult-biased H3K4me3 promoter peaks compared to fetal-biased ones (Supplementary Fig. S2h). These findings can directly support the predictions by MAmotif as shown in Figure 1b. Then, we named the adult-biased H3K4me3 promoter peaks that are detected to contain IRF2 motifs as IRF2-motif-present adult-biased H3K4me3 promoter peaks, and the others as IRF2-motif-absent adult-biased H3K4me3 promoter peaks. It could be found that the IRF2-motif-present adult-biased H3K4me3 promoter peaks are much more likely to directly co-localize with the IRF2 ChIP-seq peaks of adult proEs than IRF2-motif-absent adult-biased H3K4me3 promoter peaks ( Supplementary Fig.   S2i), suggesting IRF2's binding at adult-biased H3K4me3 promoter peaks is largely mediated by it's known binding motif.
Next, we defined the adult and fetal-biased H3K4me3-associated genes as genes with adult and fetal-biased H3K4me3 peaks at promoters, respectively, and the unbiased H3K4me3-associated genes as genes that only have unbiased H3K4me3 peaks at their promoters. Consistently, we found a much higher fraction of adult-biased H3K4me3-associated genes are directly bound by IRF2 at promoters in adult proEs than that of fetal-biased H3K4me3-associated genes (Figure 1d). Besides, the adult-high genes are also found to be more likely to have IRF2 ChIP-seq peaks at their promoters in adult proEs compared to those fetal-high genes. These findings clearly indicate that IRF2 preferentially binds at the promoters of genes with adult biased promoter H3K4me3 and expression levels, as predicted by MAmotif.

Association between IRF2's binding and other histone modifications at gene promoters
We additionally checked the association between IRF2 and adult-biased H3K4me1 and H3K27ac peaks at gene promoters. First, we applied MAnorm to re-analyze the ChIP-seq data of H3K4me1 and H3K27ac in adult and fetal proEs, and defined the adult/fetal-biased H3K4me1 and H3K27ac peaks as those with fold change of ChIP-seq intensities higher than 1.5 and P-value lower than 0.05. Interestingly, the fraction of adult-biased H3K27ac and H3K4me1 promoter peaks with the presence of IRF2 motif was not found to be significantly higher than that of other H3K27ac and H3K4me1 peaks ( Supplementary Fig. S3a-b). Consistently, the fraction of adult-biased H3K27ac/H3K4me1-associated genes co-occupied by IRF2 ChIP-Seq peaks at promoters were also not found to be significantly higher than that of other H3K27ac/H3K4me1-associated genes ( Supplementary Fig. S3c-d). Besides, we also tried mapping the IRF2 peaks located at annotated gene promoters to the second nearest genes of them, and found these genes showed a much weaker enrichment in the adult-high genes ( Supplementary Fig. S3e). Especially, these genes failed to show any significant enrichment in the IRF2-activated genes ( Supplementary Fig. S3f), suggesting IRF2's promoter binding mainly regulates the activity of immediate downstream genes.

Discussion of the performance of different approaches in identifying differential peaks between ChIP-seq samples and detecting co-factors associated with differential binding
In the comparison of H3K4me3 ChIP-seq data between adult and fetal proEs, 97% of the H3K4me3-associated genes are shared between two stages, covering more than 90% of the differentially expressed genes (Supplementary Fig. S2a). However, by using MAnorm to perform a quantitative comparison of ChIP-seq data, we still found many genes have stage-biased H3K4me3 levels at their promoters, and the changes of their H3K4me3 levels obviously are correlated with their expression changes ( Supplementary Fig. S2b-d). Thus, it's reasonable to find that traditional overlap-based analysis did not achieve a plausible performance in identifying co-factors associated with the differential H3K4me3 promoter peaks, as most of the real differential peaks were actually missed by this approach.
But, this finding does not mean the traditional overlap-based approach is useless. When applied to compare more distinct cell types, it may have an improved performance. Here we again use the comparison of H3K27ac ChIP-seq data between H1 hESCs and K562 cells as an example. The majority of the H3K27ac peaks of two cell types were labeled as cell type-specific ones based on peak overlap [5], leading to a large number of cell type-specific H3K27ac-associated genes (genes having H3K27ac peaks at their promoters). These genes cover a substantial part of the genes differentially expressed between two cell types ( Supplementary Fig. S4a). After using MAnorm to compare the two ChIP-seq samples, it can be easily seen that both the promoter H3K27ac level and the expression level of the cell type-specific H3K27ac-associated genes change dramatically between two cell types ( Supplementary Fig. S4c). Then, we applied both traditional overlap-based approach and MAmotif to identify the co-factors associated with H1-specific/biased H3K27ac peaks.
Interestingly, the top 5 JASPAR motifs detected by traditional overlap-based approach are exactly the same as those predicted by MAmotif (some of them may have different ranks, Supplementary Fig. S4d), which cover several known pluripotency related TFs including Pou5f1, TCF3 and Sox family motifs (its family member Sox2 is widely known as a core pluripotency TF).
This finding indicates that when most of the cell type-specific peaks (or associated genes) defined by overlapping analysis are true differential peaks, they can still be used as the basis for the following analysis (but as suggested in our previous studies [2,5], we recommend to use a quantitative/statistical comparison of the corresponding ChIP-seq samples to validate the definition of differential and non-differential peaks from overlapping analysis and/or further filter out those unreliable ones, in order to achieve a better performance).
In some recent studies, DESeq and DESeq2 [13,14], which were originally developed for differential expression analysis of RNA-seq data, were also used to perform differential binding analysis with ChIP-seq data of histone modifications. Here we also tried using DEseq2 and MEME [15], a very famous motif analysis suite, to re-perform the comparison of H3K4me3 ChIP-seq data between adult and fetal proEs. First, we found DEseq2 failed to detect any H3K4me3 peak with significant ChIP-seq intensity change using P<0.01 as cutoff ( Supplementary Fig. S5b), while MAnorm can still detect more than two thousand differential H3K4me3 peaks with the same cutoff, and several hundred of them are located at gene promoters, covering a considerable fraction of genes differentially expressed between adult and fetal proEs (Supplementary Fig. S5b). Then, we switched to use a less stringent cutoff P-value 0.05 as cutoff, and obtained 86/119 adult/fetal-biased H3K4me3 promoter peaks from DEseq2, respectively. We applied the AME tool in MEME suite to analyze the relative enrichment of the 196 JASPAR vertebrate motifs used in our study between the adult and fetal-biased H3K4me3 promoter peaks detected by DEseq2, and none of these motifs were found to be significantly more enriched in adult-biased H3K4me3 promoter peaks compared to the fetal-biased ones (P<0.01 after correction for multiple testing). On the other hand, AME found IRF family motifs are significantly over-represented in the adult-biased H3K4me3 promoter peaks detected by MAnorm compared to the corresponding fetal-biased peaks (P=2E-7 for IRF2 motif), indicating AME has the ability to detect motifs truly associated with adult-biased H3K4me3 peaks and our finding can be reproduced by other motif analysis tools.
Of note, the developers of DEseq and DEseq2 have claimed they were originally designed to compare RNA-seq data with biological replicates, so it's not too surprising to find they do not have a good performance on the ChIP-seq data used in this study. Besides, it has been mentioned that it's better to use computational tools specifically designed for ChIP-seq data to perform differential binding analysis [16]. For example, the signal-to-noise (S/N) ratio can vary dramatically across ChIP-seq samples, and it could be an important issue for the comparison of ChIP-seq data. Here, we still use the comparison of H3K27ac ChIP-seq data between H1 hESCs and K562 cells as an example. We applied both DEseq2 and MAnorm to compare the H3K27ac ChIP-seq data, and found DEseq identified significantly more H1-biased H3K27ac peaks than K562-biased ones (2669 vs 357, Supplementary Fig. S5e) using fold change higher than 2 and P-value lower than 0.01 as cutoffs, while the numbers of the H1-biased and K562-biased H3K27ac peaks detected by MAnorm are still quite comparable with each other (15572 vs 13562, Supplementary Fig. S5e), covering a much higher proportion of the genes differentially expressed between H1 and K562 cells (38.6% and 54.1% of H1 and K562-high genes, respectively, compared to 0.7% and 0.1% by DEseq2, Supplementary Fig. S5e). Interestingly, we previously found that DEseq2 detected similar numbers of adult and fetal-biased H3K4me3 peaks between adult and fetal proEs. Then, we used the fraction of aligned reads covered by the top 15,000 peaks as a simple estimation of the S/N ratio of each ChIP-seq sample, and found the H3K27ac ChIP-seq samples of H1 and K562 cells do have quite different S/N ratios, while the H3K4me3 samples of adult and fetal proEs showed similar S/N ratios with each other ( Supplementary Fig. S5f), indicating the unbalanced numbers of the differential peaks detected by DEseq are very likely to be due to the variation of S/N ratio between these samples.