ATAC-seq footprinting unravels kinetics of transcription factor binding during zygotic genome activation

While footprinting analysis of ATAC-seq data can theoretically enable investigation of transcription factor (TF) binding, the lack of a computational tool able to conduct different levels of footprinting analysis has so-far hindered the widespread application of this method. Here we present TOBIAS, a comprehensive, accurate, and fast footprinting framework enabling genome-wide investigation of TF binding dynamics for hundreds of TFs simultaneously. We validate TOBIAS using paired ATAC-seq and ChIP-seq data, and find that TOBIAS outperforms existing methods for bias correction and footprinting. As a proof-of-concept, we illustrate how TOBIAS can unveil complex TF dynamics during zygotic genome activation in both humans and mice, and propose how zygotic Dux activates cascades of TFs, binds to repeat elements and induces expression of novel genetic elements.


Supplementary Figure 5: Dux binding is visible as footprints and correlate with ChIP-signal
(a) Correction of the Dux footprint using different bias correction methods. The aggregate footprints for N=12095 Dux binding sites (within ATAC-seq peaks) are shown between Control and DuxOE conditions. The top three panels depict the uncorrected, expected and corrected signals as calculated by TOBIAS. The bottom panels depict the same sites corrected by either HINT-ATAC or SeqOutBias methods. (b) A view of the footprinting scores in the promoter of Tdpoz1. Genomic tracks show corrected ATAC-seq cutsites at 1bp resolution (blue), footprint scores as calculated by TOBIAS (red), and pileup of reads from Dux ChIP-seq from Hendrickson et al. 2017 (green). Potential Dux binding sites are highlighted in blue. (c-d) Footprinting correlates with ChIP-signal at multiple genomic loci. Genomic tracks are the same as described for (b). 9

The TOBIAS framework tools
In developing TOBIAS, we found that there were six main areas of DGF that had not been comprehensively addressed in the context of ATAC-seq footprinting analysis: 1. All-in-one framework including bias correction, footprinting, quantification of protein binding and visualization 2. Investigation of TF binding on a global level (which TFs are more bound globally) as well as the locusspecific level (which TF binds to which genomic locations including statistics on differential binding) 3. Consideration of the redundancy and similarity of known TF binding motifs in the context of footprinting 4. A scoring model for TF-DNA binding taking into account the potential lack of a canonical footprint effect 5. Comparison and quantification of TF binding activity within complex experimental settings (multiple conditions or time series) 6. Automated workflows for recurring analysis tasks Modules enabling these individual analysis steps are included in the TOBIAS package, which is publicly available at Github (https://github.com/loosolab/TOBIAS) as well as on PyPI and Bioconda. Besides the examples given in the repository README, we provide a Wiki (https://github.com/loosolab/TOBIAS/wiki) which introduces some of the individual software modules. We used the pre-defined workflows in Snakemake and NextFlow to run the full analysis (see section 2). The single modules, and how they were applied in this investigation, are explained in more detail below.

Bias correction (TOBIAS ATACorrect module)
Each Tn5-cut site is defined as the 5' end of a read shifted by +5 at the plus strand and -4 at the minus strand to center the transposase event. Using the mapped reads from closed chromatin, ATACorrect builds a dinucleotide weight matrix 2 representing the preference of Tn5 insertion. In contrast to the classical position weight matrix (PWM), the dinucleotide weight matrix (DWM) captures the inter-base relationships which arise due to the palindromic nature of the bias. A background model is similarly built by shifting all reads +100bp as described by Koohy et al. 3 .
Reads within open chromatin peaks are then corrected by estimating the expected number of cuts per base pair and subtracting this from the observed cut sites as follows (modified from 4 ): 10 Where: • is the corrected number of cuts • is the observed number of cuts • is the expected number of cuts • is the calculated bias level • i is the respective position in the genome.
To limit the influence of low-bias positions in the calculation of , a lower limit is set for by calculating the fit of cutsites vs. bias to a rectified linear unit function (ReLu) in moving 100bp-windows and setting every below the linear fit to 0. This calculation is performed for all base pairs within open chromatin, setting all other positions in closed chromatin to 0. Lastly, each is rescaled to fit the original sum of cuts for each window.
The default ranges for the widths are $ = [10,30] and $ " = [20,50]. For each i, multiple footprint scores F are calculated for different combinations of $ and $ " . For each i, the combination of $ and $ " with the highest score is taken as .
This footprint score is a combination of depletion and accessibility as a measure of protein binding. The term ̅ " will be negative and will therefore raise the score if there is a high depletion of cuts in the footprint. If there is no depletion, the score will simplify to the mean of cuts in the background regions, representing accessibility. It is therefore not essential to visibly see a canonical footprint shape for the footprint score to be high. The footprint score can be interpreted as higher scores giving more evidence that a protein was bound at a given position.

Estimation of transcription factor states and pairwise comparison between conditions (TOBIAS BINDetect module)
To match the calculated footprint scores to potential binding sites, TOBIAS BINDetect integrates genomic sequence, footprint scores from several conditions, as well as motifs, to identify up-and down regulated TFs based on footprint scores.
In the first step of the algorithm, the MOODS library (https://github.com/jhkorhonen/MOODS) 5 is used to detect TF binding sites (within peaks) with a p-value threshold of 1e-4. Background base pair probabilities are estimated from the input peak set. Subsequently, each binding site is matched to footprint scores for each condition. Simultaneously, a background distribution of values is built by randomly subsetting peak regions at ~200bp intervals, and the scores from each condition are normalized to each other using quantile normalization. These values are used to calculate a distribution of background log2FCs for each pairwise comparison of conditions.
Overlaps between the TFBS identified in the first step are quantified by creating a distance matrix of TFs. The distance d between a TF pair (TF1;TF2) is calculated as: Where: • D 789 and D 78. are the number of base pairs covered by all TF1 and TF2 sites respectively • E 789;78. is the amount of base pairs overlapping between TF1 and TF2 sites.
In the second step of the algorithm, every TF binding site found (for each motif given as input) is split into bound and unbound sites based on a score threshold per condition. The threshold is set at the level of significance of a normal-distribution fit to the background distribution of scores (user-defined p-value). As well as the per-condition split, each site is assigned a log2FC (fold change) per comparison, which represents whether the binding site has larger/smaller footprint scores in comparison. The global distribution of log2FC's per TF is compared to the background distributions to calculate a differential binding score, which is calculated as: • FOS footprint score The Footprint Occupancy Score (FOS) was calculated as described in 19 and pictured in Supplementary Figure 3g. To enable the calculation of FOS for TOBIAS corrected data, all negative signals were set to 0 before calculation.
Due to high computational times for some tools, the validation was limited to binding sites on human chromosome 1. With increasing footprint score thresholds, each TFBS was labeled as "False positive" and "True positive" on the basis of the respective ChIP-seq labels (as explained above). The area under the ROC curve (auROC) was used to evaluate the predictive power of each method.
Higher scores in 2C Higher scores in mESC BINDetect volcano plot Higher scores in 4C Higher scores in 8C BINDetect volcano plot Higher scores in 4C Higher scores in ICM Higher scores in 4C Higher scores in mESC BINDetect volcano plot Higher scores in 8C Higher scores in ICM