Global reference mapping and dynamics of human transcription factor footprints

Combinatorial binding of transcription factors to regulatory DNA underpins gene regulation in all organisms. Genetic variation in regulatory regions has been connected with diseases and diverse phenotypic traits1, yet it remains challenging to distinguish variants that impact regulatory function2. Genomic DNase I footprinting enables quantitative, nucleotide-resolution delineation of sites of transcription factor occupancy within native chromatin3–5. However, to date only a small fraction of such sites have been precisely resolved on the human genome sequence5. To enable comprehensive mapping of transcription factor footprints, we produced high-density DNase I cleavage maps from 243 human cell and tissue types and states and integrated these data to delineate at nucleotide resolution ~4.5 million compact genomic elements encoding transcription factor occupancy. We map the fine-scale structure of ~1.6 million DHS and show that the overwhelming majority is populated by well-spaced sites of single transcription factor:DNA interaction. Cell context-dependent cis-regulation is chiefly executed by wholesale actuation of accessibility at regulatory DNA versus by differential transcription factor occupancy within accessible elements. We show further that the well-described enrichment of disease- and phenotypic trait-associated genetic variants in regulatory regions1,6 is almost entirely attributable to variants localizing within footprints, and that functional variants impacting transcription factor occupancy are nearly evenly partitioned between loss- and gain-of-function alleles. Unexpectedly, we find that the global density of human genetic variation is markedly increased within transcription factor footprints, revealing an unappreciated driver of cis-regulatory evolution. Our results provide a new framework for both global and nucleotide-precision analyses of gene regulatory mechanisms and functional genetic variation.


Introduction 1
Genome-encoded recognition sites for sequence-specific DNA binding proteins are the 2 atomic units of eukaryotic gene regulation. Binding of regulatory factors to their cognate 3 elements in vivo shields them from nuclease attack, giving rise to protected single nucleotide-4 resolution DNA 'footprints'. The advent of DNA footprinting using the non-specific nuclease 5 DNase I 7 marked a major turning point in analysis of gene regulation, and facilitated the 6 identification of the first mammalian sequence-specific DNA binding proteins 8 . Genomic DNase 7 I footprinting 3 enables genome-wide detection of DNA footprints chiefly within regulatory DNA 8 regions, but also over other genomic elements where DNase I cleavage is sufficiently dense. 9 DNase I footprints define sites of direct regulatory factor occupancy on DNA and can be 10 used to discriminate sites of direct vs. indirect occupancy within orthogonal data from chromatin 11 immunoprecipitation and sequencing (ChIP-seq) experiments. Cognate transcription factors 12 (TFs) can be reliably assigned to DNase I footprints based on matching to consensus 13 sequences, enabling TF-focused analysis of gene regulation and regulatory networks 9 , and the 14 evolution of regulatory factor binding patterns within regulatory DNA 10 . DNase I is a small 15 enzyme, roughly the size of a typical transcription factor that recognizes the minor groove of 16 DNA and hydrolyzes single-stranded cleavages that, in principle, reflect both the topology and 17 the kinetics of DNA-protein interaction. Previous efforts to exploit this feature 4 were complicated 18 by slight sequence-driven variation in cleavage preferences; however, these have now been 19 exhaustively determined 11 , setting the stage for fully resolved tracing of DNA-protein interactions 20 within regulatory DNA. 21 Currently we lack a comprehensive, nucleotide-resolution annotation of small DNA 22 elements encoding regulatory factor recognition sites that are selectively occupied in different 23 cell types. Such a reference is essential both for analysis of cell-selective occupancy patterns, 24 and for systematic integration with genetic variation, particularly that associated with diseases 25 and phenotypic traits. Here we combine sampling of >67 billion DNase I cleavages from >240 26 human cell types and states to index, with unprecedented accuracy and resolution, human 27 genomic footprints and thereby the sequence elements that encode transcription factor 28 recognition sites. We leverage this index to comprehensively assign footprints to transcription 29 factor archetypes, define patterns of cell-selective occupancy, and analyze the distribution and 30 impact of human genetic variation on regulatory factor occupancy and the genetics of common 31 diseases and traits. 32

Comprehensive mapping of human TF footprints 33
To create comprehensive maps of TF occupancy, we selected and deeply-sequenced 34 high-quality DNase I libraries from 243 cell and tissues types derived from diverse primary cells 35 and tissues (n=151), primary cells in culture (n=22), immortalized cell lines (n=10) and cancer 36 cell lines and primary samples (n=60) (Extended Data  To identify DNase I footprints genome-wide, we developed a novel computational  41  approach incorporating both chromatin architecture and exhaustively determined empirical  42  DNase I sequence preferences to determine expected per-nucleotide cleavage rates across the  43  genome, and to derive, for each biosample, a statistical model for testing whether observed  44 cleavage rates at individual nucleotides deviated significantly from expectation (Extended Data 45 Fig. 1, Extend Data Fig. 2

A unified index of human genomic footprints 61
Comparative footprinting across many cell types has the potential to illuminate both the 62 structure and function of regulatory DNA, yet a systematic approach for joint analysis of 63 genomic footprinting data has been lacking. Given the scale and diversity of the cell types and 64 tissues surveyed, we sought to develop a framework that could integrate hundreds of available 65 genomic footprinting datasets to increase the precision and resolution of footprint detection and, 66 furthermore, serve as a scaffold to build a common reference index of TF-contacted DNA 67 genome-wide. 68 To accomplish this, we implemented an empirical Bayes framework that estimates the 69 posterior probability that a given nucleotide is footprinted by incorporating a prior on the 70 presence of a footprint (determined by footprints independently identified within individual 71 datasets) and a likelihood model of cleavage rates for both occupied and unoccupied sites ( Fig.  72 1a and Methods). Fig. 1b  all samples precisely demarcates the core recognition sequences of diverse TFs (Fig. 1b,  78 bottom). 79 To systematically create a reference set of TF-occupied DNA elements genome-wide, 80 we applied the Bayesian approach to all DHSs detected within one or more of the 243 81 biosamples, and applied the same consensus approach used to establish the consensus DHS 82 index 12 to collate overlapping footprinted regions across individual biosamples into distinct high-83 resolution consensus footprints (Methods). Collectively, we delineated ~4.46 million consensus 84 footprints within ~1.6 million individual DHS (Fig. 1c). 82.6% of consensus footprints localize 85 directly within the core of a DHS peak (avg. width 203bp) with virtually all residing within 250bp 86 of a DHS peak summit (Figure 1d). As expected, consensus-defined footprints were markedly 87 more reproducible than footprints detected independently within a given biosample (avg. 88 Jaccard distance 0.43 vs 0.29, respectively) (Extended Data Fig. 4d- binding domains such as the paired-box containing TF PAX6 24 (Fig. 2b) and other TFs with 125 extant co-crystal structures (not shown). Critically, these topological features are evident at the 126 level of individual TF footprints on the genome ( Fig. 2a-b and Extended Data Fig. 6), indicating 127 that the extended profile of corrected per-nucleotide DNase I cleavage across entire regulatory 128 regions should, in principle, provide a snapshot of the primary structure of active regulatory 129 DNA. 130

Distinguishing independent vs. cooperative modes of TF occupancy 131
Transcription factors compete cooperatively with nucleosomes for access to regulatory 132 DNA 25,26 . Although fundamental to eukaryotic gene regulation, it is currently unknown whether 133 nucleosome-enforced TF cooperativity derives primarily from local protein-protein interactions or 134 results from the synergistic effect of independent TF:DNA binding events 26 . We reasoned that 135 the number, relative spacing, and morphology of TF binding events within individual regulatory 136 elements could be used to gain insight into the mechanistic basis of TF cooperativity. We 137 observed that the average footprint width for diverse TFs tightly corresponded to the total width 138 of its recognition sequence (Spearman's ρ=0.82, p-value=0.001) indicating that DNase I 139 cleavage precisely delineates the boundaries between occupied and unoccupied DNA at 140 nucleotide resolution (Fig. 3a). 141 Since the width of genomic footprints tightly tracks the physical structure of individual 142 TFs bound to DNA ( Fig. 2a-b), and direct TF:TF interactions are dependent on close 143 proximity 19 , as such interactions should be reflected in larger footprints that harbor multiple TF 144 recognition sites. Conversely, independent TF:DNA interaction events should be reflected by 145 compact and widely-spaced footprints harboring single TF recognition sites. As such, the 146 prevalence of cooperativity mediated by direct TF:TF interactions vs. synergy of independent 147 binding events should be reflected in relative proportion of wide multi-motif footprints vs. well-148 spaced single footprints. Larger footprints are overwhelmingly associated with two (or more) 149 recognition sequences (Fig. 3b), yet such footprints represent only 8% of the global footprint 150 landscape. By contrast, 92% of footprints harbor a single TF recognition site (Fig. 3c). 151 Transcription factors can distort DNA upon engagement; as such, the spacing of 152 transcription factors can be critical for establishing an active regulatory structure. To quantify 153 global footprint spacing patterns, we first binned each DHS by its average accessibility across 154 all samples (because footprint discovery depends on total DNase I cleavage; Extended Data 155 Fig. 1b), and for each bin we computed the mean number of footprints present per element and 156 their relative edge-to-edge spacing. The density of footprints within the most deeply sampled 157 DHSs genome-wide plateaued at an average of 5.5 per 200 bp (average width of a DHS peak) 158 (Fig. 3d), which is in remarkable agreement with a theoretical prediction of the number of 159 human TFs required to destabilize a canonical nucleosome 26 and to encode specificity 27 . Within 160 DHSs, footprints exhibit an average edge-to-edge spacing of ~21bp (middle 50%, 12-35bp) 161 ( Fig. 3e). Taken together, these results are compatible with the observed lack of evolutionary 162 constraint on the spacing and orientation of TF motifs 28 and provide strong evidence that 163 regulatory DNA marked by DHSs is chiefly instantiated by independent but synergistic TF 164 binding modes (Fig. 3f). 165

Cell-selective TF occupancy patterns 166
Analysis of footprint occupancy across all biosamples revealed strong enrichment for the 167 recognition sequences of key regulatory TFs in their cognate lineages (Extended Fig. 7a and  168 Methods; for example: HNF4A in fetal intestine and liver; GATA factors in erythroid and 169 placental/trophoblast cells and tissues; NEUROG1 in brain; myogenic regulatory factors (e.g, 170 MYF6, MYOD, etc.) in muscle and lung; MEIS1 in developing eye, brain, and muscle tissues; 171 and PAX6 in fetal eye). In total, we identified 609 motif models matching footprinted sequences 172 (Methods); these models encompassed 64 distinct archetypal transcription factor recognition 173 codes (Extended Data Table 2), representing virtually all major DNA-binding domain families. 174 For degenerate motifs where the same sequence is recognized by many distinct TFs, we 175 observed highly cell-selective occupancy patterns that could be decomposed into coherent 176 groups corresponding with cell type and function (Extended Data Fig. 7b-d). 177

Most DHSs encode a single regulatory topology 178
Given that a significant fraction of DHSs are shared across two or more cell/tissue 179 types 12,29 , we next asked whether differential TF occupancy within the same regulatory DNA 180 region (vs. differential actuation of entire DHSs) could be a major driver of cell-selective 181 regulation. Nucleotide-resolution DNase I cleavage provides a topological fingerprint of each 182 DHS, reflecting its unique combination and ordering of occupying TFs. Although detectable on 183 manual inspection 4 , systematic analysis of differential TF occupancy has previously not been 184 possible due to dominance of intrinsic cleavage propensities when many data sets are 185 combined. To enable unbiased detection of differential footprint occupancy, we developed a 186 statistical framework to test for differences in relative DNase I cleavage rates at individual 187 nucleotides across many samples, analogous with methods developed for the identification of 188 differentially expressed genes (Methods). To estimate the proportion of differentially regulated 189 footprints within DHSs of a given cell/tissue type, we compared footprint occupancy within DHSs 190 broadly accessible in both nervous-system derived samples (n=31) with non-nervous-system 191 derived samples (n=212). We selected 67,368 DHSs that were highly accessible in at least 10 192 nervous and non-nervous derived samples, and for each DHS, performed a per-nucleotide 193 differential test (Fig. 4a,b and Extended Data Fig. 8). This analysis identified only a small 194 proportion of DHSs (1,720 DHSs; 2.5%) containing a differentially footprinted element (Fig. 4c).

195
Most of these DHSs harbored a single differentially regulated footprint, while a small fraction 196 contained 2-4 differentially occupied elements (Fig. 4c). Nonetheless, differentially occupied 197 footprints were significantly enriched recognition sites for known nervous system regulators 198 such as REST, NFIB, ZIC1, and EBF1 (Fig. 4c, bottom right and Extended Data Fig. 9) and 199 tissue-selective occupancy events paralleled expression of nearby genes (in the case of REST 200 occupancy) (Extended Data Fig. 10 217 5a, Extended Data Fig. 11a and Methods). Within DHSs, CAVs were markedly enriched in 218 core consensus footprints, even after controlling for the increased detection power within this 219 compartment ( Fig. 5b and Extended Data Fig. 12). 220 In protein-coding regions, most functional genetic variation is expected to be deleterious, 221 with rare gain-of-function alleles 30 . Protein-DNA recognition interfaces are likewise presumed to 222 be susceptible to disruption at critical nucleotides, predisposing to loss-of-function alleles. 223 Strikingly, we found CAVs to be nearly evenly partitioned between loss-(disruption of binding) 224 and gain-of-function (increased or de novo binding) alleles (Fig. 5c-d and Extended Data Fig.  225  10c). Homozygosity for either the reference or alternative allele paralleled results from 226 heterozygotes and further revealed that structural changes due to TF occupancy were precisely 227 confined to the DNA sequence recognition interface (Fig. 5c, bottom). In many cases, SNVs 228 detected in both heterozygous and homozygous configurations showed strong agreement 229 between allelic ratios and relative footprint strength ( Fig. 5e; Spearman's ρ=0.9, p-value < 10 -5 ). 230 Variants residing within core recognition motifs in footprints were markedly enriched for 231 imbalance vs. non-footprinted motifs; were localized to high-information content positions within 232 the recognition interface (Fig. 5c, bottom and Extended Data Fig. 13); and paralleled the 233 predicted energetic effect of the variant on the TF binding site ( Fig. 5f and Extended Data Fig.  234  14), thus providing a direct quantitative readout of functional variation impacting TF occupancy. 235

DNA elements encoding footprints are hypermutable 236
We next explored the global distribution of human genetic variation relative to consensus 237 footprints. Transcription factor binding sites appear to be gradually remodeled over evolutionary 238 time via sequential small mutations 31 that could ultimately affect function and phenotype 32 . 239 However, patterns of genetic variation within regulatory DNA have not been characterized with 240 high precision. To quantify these, we calculated nucleotide diversity (π) within and around 241 consensus genomic footprints using whole-genome sequencing data compiled from >65,000 242 individual under the TOPMED project 33 (Methods). Canonically, reduced levels of π reflect the 243 elimination of deleterious alleles from the population by natural selection, and hence are 244 typically indicative of functional constraint 34 . Surprisingly, we found a dramatic increase in 245 nucleotide diversity centered precisely within the core of genomic footprints (Fig. 5a), and thus 246 that these elements are highly polymorphic in human populations. This result eclipses prior 247 global analyses indicating that transcription factor occupancy sites are generally not under 248 substantial purifying selection 4 both in the magnitude of the observed effect, and in its 249 nucleotide-precise localization within the core of genomic footprints. 250 The focal increase in genetic diversity within footprints indicated that the nucleotides 251 encoding footprinted elements may have an increased mutational load vs. immediately adjacent 252 sequences. To explore this possibility, we focused on variants with extremely low allele 253 frequencies in human populations (minor allele frequency < 10 -4 ); such variants are assumed to 254 result from de novo germline (ie., non-segregating) mutation and are often used as a surrogate 255 for mutation rate in humans 35 . We found that the distribution of extremely rare variants around 256 and within genomic footprints mirrored that of nucleotide diversity, compatible with context-257 driven increased mutation rate in the sequences underlying footprints (Fig. 5b). Of note, many 258 transcription factors favor recognition of dinucleotide combinations such as CpGs that are 259 intrinsically hypermutable. Conversely, de novo mutations have been implicated in the genesis 260 of TF recognition sites 36,37 . Thus, hypermutation within genomic footprints may fill a key 261 evolutionary role by favoring variability in TF occupancy and hence natural variation in gene 262 regulation. 263

GWAS variants are enriched in TF footprints 264
Given the above, genetic variation within genomic footprints should, in principle, be a 265 key contributor phenotypic variation; however, to date this has defied accurate quantification. 266 We therefore next resolved the large set of variants strongly associated (nominal p-value < 267 5x10 -8 ) with diverse diseases and phenotypic traits from the NHGRI/EBI GWAS Catalogue 38 to 268 consensus genomic footprints. To account for the baseline increase in genetic variation present 269 within genomic footprints described above, we performed exhaustive (1,000x) sampling 270 matched variants (by minor allele frequency, linkage-disequilibrium (LD) structure, and distance 271 to the nearest gene) from the 1,000 Genome Project 39 (Methods). Additionally, we expanded 272 both GWAS catalogue and matched sampled variants to include variants in perfect LD (r 2 =1). 273 Within DHSs, aggregated GWAS catalogue SNPs were enriched within footprints but not in non-274 footprinted subregions, and enrichment within footprints increased monotonically with footprint 275 strength ( Fig. 6c and Extended Data Fig. 15). 276 The GWAS catalogue aggregates hundreds of traits, with corresponding expected 277 diversity in cognate cell/tissue types. To gain a more accurate view of the enrichment of trait-278 associated variants in footprints, we compared SNP-based trait heritability of individual 279 traits 40,41 . Using summary statistic data from individual GWAS studies from the UK BioBank, we 280 applied partitioned LD-score regression to compute the relative heritability contribution of 281 variants within all DHSs and footprints collectively vs. DHSs and footprints therein from the 282 expected cognate cell type for the trait (Fig. 6d-e). This analysis revealed striking enrichment of 283 variants that account for trait heritability in footprints generally (>5-fold) and most prominently in 284 footprints from the cognate cell type (up to >45-fold) (Fig. 6d-e). Collectively, we conclude that 285 the genetic signals from disease-and trait-associated variants within DHSs emanate from 286 transcription factor footprints, and that variants within footprints are major contributors to trait 287 heritability. 288

Discussion 289
Our report details the highest resolution view to date of regulatory factor occupancy 290 patterns on the human genome measured across an expansive range of cell and tissue 291 contexts sampled from >140 genotype backgrounds. The scale and breadth of the data have 292 enabled the delineation of a reference set of ~4.5 million genomic sequence elements that 293 collectively define nucleotides critical for genome regulation and function and form the building 294 blocks of regulatory DNA. These footprints now provide a ready and extensible nucleotide-295 precise reference for diverse analyses, particularly those involving genetic variation. 296 The preferential localization of disease-and trait-associated variation within regulatory 297 DNA has heretofore been described in terms of entire regulatory regions demarcated by DHSs 298 or clusters thereof. Our results now show that genetic association and heritability signals from 299 regulatory DNA overwhelmingly emanate from indexed transcription factor footprints, which 300 should greatly facilitate the connection of disease-and trait-associated genetic variation with 301 genome function. 302 Perhaps most strikingly, we report that human genetic variation is itself concentrated 303 within transcription factor footprints, owing apparently to a combination of mutation propensity 304 and the evolved sequence recognition repertoire of human transcription factors, which favors 305 hypermutable nucleotide combinations (e.g., CG dinucleotides). Given that human and mouse 306 TFs share the large majority of their recognition landscapes, concentration of variation within TF 307 occupancy sites has likely played a considerable role in shaping human -and indeed all 308 mammalian -gene regulation. It implies, furthermore, that the genome is heavily primed for 309 regulatory evolution, providing a possible mechanism underlying facilitated phenotypic 310 evolution 42 311 Author contributions: J.V. designed experiments and performed analysis. J.L. aided in the design of statistical methods for footprint detection. R.S. organized and aided in the processing of raw data. W.M. aided in the creation of the consensus footprint index. J.S. and J.V. wrote the paper. All others participated in data generation and organization.     a, Comparative footprinting within the SCAMP5 promoter identifies 3 footprints differentially occupied in nervous cell and tissue types. Top, DNase I cleavage in two exemplar nervous and non-nervous cell types. Bottom, mean differential per nucleotide cleavage(log2) between nervous-system derived (n=26) and non-nervous samples (n=12). The color of each bar indicates the statistical significance (-log10 p) of the per-nucleotide differential test. b, Differential footprint testing within thousands of DHS accessible in between nervous and nonnervous related biosamples. c, The vast majority of tested DHSs encode a single TF binding topology. Top, percentage of the DHSs tested that containing one or more differentially occupied element. Bottom left, distribution of differentially footprinted elements per DHS. Bottom right, selected TF recognition sequences significantly enriched in differentially occupied footprints (binomial test p<0.01). Indicated in parenthesis is the fold-enrichment vs. expected (based on prevalence of footprinted motif in tested regions). Top, allelically resolved per-nucleotide DNase I cleavage aggregated from 56 heterozygous samples. Middle, DNase I cleavage in two selected samples homozygous for either reference or alternative alleles. Bottom, mean differential per nucleotide cleavage (log2) between homozygous reference (n=74) and alternative samples (n=12). The color of each bar indicates the statistical significance (-log10 p) of the per-nucleotide differential test (Methods). The variant and differentially footprinted nucleotides precisely colocalize to a NFIX recognition element. d, Density histogram of allelic ratios for variants overlapping a footprinted NFIX recognition sequence. Grey, all variants tested for imbalance (n=7,110). Blue, all variants significantly (n=1,889) imbalanced. g, Scatterplot of allelic imbalance computed from heterozygous individuals (x-axis) vs. the relative difference in footprint depth between homozygous individuals at variants overlapping an NFIX footprint. Each point pertains to a SNV within a footprinted NFIX binding site imbalanced in heterozygotes and differentially footprinted in homozygotes. Grey line indicates fit linear model. e, Allelic imbalance measurements parallels predicted energetic effects of variants within NFIX footprints. Shown is the mean log-odds motif score (reference vs. alternate allele) of all tested variants within footprinted motifs binned by allelic ratios.
Vierstra et al. 15 Figure 6: Human genetic variation is broadly enriched within genomic footprints a, Distribution of genetic variation with respect to consensus footprints. Plotted is the mean per nucleotide diversity determined from whole genome sequencing of 62,784 individuals (TOPMED project). b, Histogram of the distribution of rare variation (minor allele frequency<0.0001) within and surrounding genomic footprints. c, Enrichment of GWAS variation within or outside consensus genomic footprints over randomly sampled variants from 1,000 Genome Project. Enrichment was computed after expanding both GWAS and sampled variants with those in perfect LD (r 2 =1.0, central European population). d, Enrichment of SNP-based trait heritability using LD-score regression for UK BioBank GWAS traits lymphocyte counts (c) and red blood cell counts (d). Asterisk denotes statistically significant enrichments (* indicates p-value<0.01).