Large-scale identification of sequence variants influencing human transcription factor occupancy in vivo

Journal name:
Nature Genetics


The function of human regulatory regions depends exquisitely on their local genomic environment and on cellular context, complicating experimental analysis of common disease- and trait-associated variants that localize within regulatory DNA. We use allelically resolved genomic DNase I footprinting data encompassing 166 individuals and 114 cell types to identify >60,000 common variants that directly influence transcription factor occupancy and regulatory DNA accessibility in vivo. The unprecedented scale of these data enables systematic analysis of the impact of sequence variation on transcription factor occupancy in vivo. We leverage this analysis to develop accurate models of variation affecting the recognition sites for diverse transcription factors and apply these models to discriminate nearly 500,000 common regulatory variants likely to affect transcription factor occupancy across the human genome. The approach and results provide a new foundation for the analysis and interpretation of noncoding variation in complete human genomes and for systems-level investigation of disease-associated variants.

At a glance


  1. Identification of regulatory variants influencing DNA accessibility.
    Figure 1: Identification of regulatory variants influencing DNA accessibility.

    (a) Outline of the experimental procedure and data set. (b) Allelic analysis of DNA accessibility at heterozygous sites. Imbalance manifests as a deviation from a 50:50 ratio in the fraction of reads mapping to the two homologous chromosomes, potentially due to alteration of transcription factor binding by the sequence variant itself. (c) The extent of imbalanced variants discovered. A strict set of imbalanced variants was identified at 0.1% FDR and with >70% imbalance (blue). (d) Allelic ratios of sequencing reads relative to the reference allele. A ratio of 70% represents a 2.3-fold difference in accessibility between the two alleles. (e) The Pearson correlation of allelic ratios at adjacent SNPs broken down by distance to the next SNP. The dashed line represents the median width of DHS hotspots overlapping SNPs in this study. Shown are SNPs in high LD (r² > 0.8) in our samples (Supplementary Fig. 5).

  2. Effect of sampling depth on the detection of imbalance.
    Figure 2: Effect of sampling depth on the detection of imbalance.

    (a–c) The discovery of imbalanced variants is depicted when considering additional samples (a), additional individuals for a given cell type (b) or additional cell types for a given individual (c). Imbalanced SNPs were identified in an increasing subset of the data, when adding one sample at a time (starting with the most deeply sequenced). Proportions were computed as the number of SNPs identified at intermediate data points divided by the total number of SNPs from the full data set for that series. Imbalance was established using the P-value cutoff corresponding to 0.1% FDR in the total data set and required at least 70% imbalance. Sequencing coverage was measured as the total reads over all SNPs passing filters. Shown in b and c are subsets of highly sampled cell types and individuals, respectively.

  3. Cross–cell type analysis of imbalance.
    Figure 3: Cross–cell type analysis of imbalance.

    (a) Pairwise Pearson correlations of allelic ratios between samples. Note the increased correlation among samples from the same individual or cell type in comparison to all pairwise samples. Error bars, s.d. P values were derived from the Mann-Whitney U test. (b) Sites were classified as context independent or dependent by the presence (+) or absence (–) of cell type–specific and/or overall imbalance; NA, absence of a DHS. (c) Analysis of the relationship between imbalance in one or more cell types and overall imbalance at the same site. The 29,889 sites without any imbalance are not shown. (d) Allelic ratios per cell type, oriented such that 1.0 represents the direction of overall imbalance at each site. Allelic ratios deviate from 0.5 at context-independent sites even in cell types without significant imbalance (gray arrow). In contrast, context-dependent sites are characterized by strong imbalance only in a subset of cell types. A minority of context-dependent sites display discordant imbalance between samples (blue asterisk). Sites without overall imbalance are shown in Supplementary Figure 8d. Imbalance was considered significant at 5% FDR and >60% allelic ratio (dashed gray lines). (e) Model of context-dependent imbalance at a composite regulatory element bound by both cell type–specific and constitutive transcription factors (TFs).

  4. Imbalance in CTCF occupancy and H3K4me3.
    Figure 4: Imbalance in CTCF occupancy and H3K4me3.

    (a,b) The extent of imbalance (5% FDR) in CTCF occupancy (a) and H3K4me3 (b). (c,d) Allelic consistency for DNase I sensitivity with CTCF occupancy (c) and H3K4me3 (d); shown are sites imbalanced for both features considered. The r value is the Pearson correlation of the allelic ratios. SNPs for DNase I sensitivity imbalanced at 5% FDR were used.

  5. Profiles of transcription factor sensitivity to sequence variation.
    Figure 5: Profiles of transcription factor sensitivity to sequence variation.

    (a–c) Concentration of imbalanced SNPs within recognition sequences for AP-1 (left) and CTF/NF-I (right) transcription factors. Shown are all SNPs tested for imbalance (a), significantly imbalanced variants (0.1% FDR) (b) and the proportion of imbalanced SNPs per position (c). Color indicates sites where the allele with higher accessibility has higher information content according to the transcription factor motif. The white background denotes the width of the motif. (d) Survey of the transcription factor motifs analyzed for profiles of imbalance. Similar motifs were grouped into a non-redundant transcription factor cluster (Supplementary Fig. 10). Transcription factors with insufficient SNPs overlapping their motifs were not analyzed (Online Methods). (e–h) Transcription factor clusters with enrichment of imbalanced SNPs. Each point represents an individual motif. Shown are the number of SNPs overlapping recognition sites (e), the number of imbalanced SNPs (f), the frequency of substitutions resulting in imbalance (g) and the log2-transformed enrichment of the proportion of imbalanced SNPs lying in transcription factor recognition sequences relative to non-imbalanced SNPs (h). Green asterisks in e mark transcription factor clusters highlighted in the main text. The significance of the enrichment for significant SNPs in motifs in h was assessed by permutation.

  6. Buffering of regulatory variation.
    Figure 6: Buffering of regulatory variation.

    (a) Schematic of the chromatin environment at promoter DHSs indicating increased DHS size, accessibility and density of transcription factor binding sites relative to distal DHSs. (b) Threshold model of transcription factor occupancy explaining the buffering of point changes at strong sites. (c) The frequency of imbalance relative to variant position with respect to the TSS demonstrates buffering within the promoter region. Buffering is strongest between −2.5 kb and +5.0 kb with respect to the TSS. Bins are labeled by the endpoint furthest from the TSS. (d) The frequency of imbalance, broken down by site strength as measured by DNase I accessibility across all cell types having a DHS.

  7. Recognition of variation affecting transcription factor occupancy across the genome.
    Figure 7: Recognition of variation affecting transcription factor occupancy across the genome.

    (a) Scores for noncoding variants in a DHS were calculated as the maximum score from all overlapping transcription factor–specific models. PWMs, position weight matrices. (b,c) Measurement of performance versus experimentally determined imbalanced variants (Online Methods). (b) Positive predictive value (PPV; the proportion of predicted variants that are true positives, also known as precision) is plotted for increasing score cutoffs. At a score cutoff of 0.1 (dotted line), 51% of predictions are true positives. The red line measures performance on the held-out FL_E validation data set. (c) Precision (as in b) versus recall (the overall proportion of imbalanced SNPs that are correctly predicted). A higher area under the curve represents better model performance. (d) Identification of common human sequence variants affecting transcription factor occupancy. The cumulative distribution shows the number of SNPs exceeding a given score cutoff. PPVs at selected cutoffs are transcribed from the data in b.
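
The scoring and evaluation summarized in the Figure 7 caption can be illustrated with a minimal Python sketch. It takes each variant's score as the maximum over its overlapping transcription factor–specific models and computes PPV and recall at a chosen cutoff. The function names, the toy scores and the toy labels are assumptions made for illustration only; they are not the authors' models or data.

```python
from typing import Dict, List, Tuple


def variant_score(model_scores: Dict[str, float]) -> float:
    """Maximum score over all overlapping TF-specific models (0.0 if none overlap)."""
    return max(model_scores.values(), default=0.0)


def ppv_and_recall(scores: List[float], imbalanced: List[bool],
                   cutoff: float) -> Tuple[float, float]:
    """PPV (precision) and recall for variants whose score exceeds the cutoff."""
    predicted = [s > cutoff for s in scores]
    true_pos = sum(p and y for p, y in zip(predicted, imbalanced))
    n_predicted = sum(predicted)
    n_imbalanced = sum(imbalanced)
    ppv = true_pos / n_predicted if n_predicted else float("nan")
    recall = true_pos / n_imbalanced if n_imbalanced else float("nan")
    return ppv, recall


# Hypothetical per-variant scores from overlapping models (illustrative values only).
per_model = [
    {"CTCF": 0.30, "AP-1": 0.05},
    {"AP-1": 0.12},
    {"NF-I": 0.02},
    {"CTCF": 0.08, "NF-I": 0.15},
]
# Hypothetical labels: True where imbalance was observed experimentally.
labels = [True, False, False, True]

scores = [variant_score(m) for m in per_model]
ppv, recall = ppv_and_recall(scores, labels, cutoff=0.1)
print(f"PPV={ppv:.2f} recall={recall:.2f}")
```

Sweeping the cutoff over a range of values and recording each (recall, PPV) pair would trace a curve of the kind shown in panel c.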

Change history

Corrected online 17 November 2015
In the version of this article initially published online, the Online Methods incorrectly abbreviated mapping quality as MAQ rather than MAPQ. Also in the Online Methods, the procedure for downsampling allele counts for cross–cell type analysis of imbalance was incorrectly written as "we subsampled each site to three cell types and further downsampled the allele counts to mapping quality for the lowest of the three cell types." The sentence should read "we subsampled each site to three cell types and further downsampled the allele counts to match the lowest of the three cell types." The errors have been corrected for the print, PDF and HTML versions of this article.
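
One way to read the corrected downsampling procedure is sketched below in Python. The sketch makes several assumptions that the text above does not confirm: the three cell types are chosen at random, "the lowest" refers to the lowest total read depth among them, and reads are drawn without replacement. The function name, cell-type labels and counts are hypothetical.

```python
import random
from typing import Dict, Tuple

Counts = Tuple[int, int]  # (reference-allele reads, alternate-allele reads)


def downsample_site(counts_by_cell_type: Dict[str, Counts],
                    n_cell_types: int = 3,
                    seed: int = 0) -> Dict[str, Counts]:
    """Pick n cell types for one site and equalize their read depth."""
    rng = random.Random(seed)
    # Assumption: cell types are subsampled uniformly at random.
    chosen = rng.sample(sorted(counts_by_cell_type), n_cell_types)
    # Assumption: the target depth is the lowest total among the chosen cell types.
    target = min(sum(counts_by_cell_type[c]) for c in chosen)
    result = {}
    for cell_type in chosen:
        ref, alt = counts_by_cell_type[cell_type]
        reads = ["ref"] * ref + ["alt"] * alt
        kept = rng.sample(reads, target)  # draw reads without replacement
        n_ref = kept.count("ref")
        result[cell_type] = (n_ref, target - n_ref)
    return result


# Hypothetical allele counts for a single heterozygous site.
example = {"cellA": (40, 20), "cellB": (15, 15), "cellC": (70, 30), "cellD": (25, 5)}
result = downsample_site(example, seed=1)
print(result)
```

After downsampling, every retained cell type contributes the same number of reads, so differences in allelic ratio between cell types are not confounded by differences in sequencing depth.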


  1. Gross, D.S. & Garrard, W.T. Nuclease hypersensitive sites in chromatin. Annu. Rev. Biochem. 57, 159–197 (1988).
  2. Thurman, R.E. et al. The accessible chromatin landscape of the human genome. Nature 489, 75–82 (2012).
  3. Maurano, M.T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).
  4. Degner, J.F. et al. DNase I sensitivity QTLs are a major determinant of human expression variation. Nature 482, 390–394 (2012).
  5. Palmiter, R.D. & Brinster, R.L. Germ-line transformation of mice. Annu. Rev. Genet. 20, 465–499 (1986).
  6. Sanyal, A., Lajoie, B.R., Jain, G. & Dekker, J. The long-range interaction landscape of gene promoters. Nature 489, 109–113 (2012).
  7. Peterson, K.R. & Stamatoyannopoulos, G. Role of gene order in developmental control of human γ- and β-globin gene expression. Mol. Cell. Biol. 13, 4836–4843 (1993).
  8. Thanos, D. & Maniatis, T. Virus induction of human IFN β gene expression requires the assembly of an enhanceosome. Cell 83, 1091–1100 (1995).
  9. Archer, T.K., Lefebvre, P., Wolford, R.G. & Hager, G.L. Transcription factor loading on the MMTV promoter: a bimodal mechanism for promoter activation. Science 255, 1573–1576 (1992).
  10. Mendenhall, E.M. et al. Locus-specific editing of histone modifications at endogenous enhancers. Nat. Biotechnol. 31, 1133–1136 (2013).
  11. Aalfs, J.D. & Kingston, R.E. What does 'chromatin remodeling' mean? Trends Biochem. Sci. 25, 548–555 (2000).
  12. Ronald, J. et al. Simultaneous genotyping, gene-expression measurement, and detection of allele-specific expression with oligonucleotide arrays. Genome Res. 15, 284–291 (2005).
  13. Ni, Y., Hall, A.W., Battenhouse, A. & Iyer, V.R. Simultaneous SNP identification and assessment of allele-specific bias from ChIP-seq data. BMC Genet. 13, 46 (2012).
  14. Knight, J.C., Keating, B.J., Rockett, K.A. & Kwiatkowski, D.P. In vivo characterization of regulatory polymorphisms by allele-specific quantification of RNA polymerase loading. Nat. Genet. 33, 469–475 (2003).
  15. McDaniell, R. et al. Heritable individual-specific and allele-specific chromatin signatures in humans. Science 328, 235–239 (2010).
  16. Kasowski, M. et al. Variation in transcription factor binding among humans. Science 328, 232–235 (2010).
  17. Maurano, M.T., Wang, H., Kutyavin, T. & Stamatoyannopoulos, J.A. Widespread site-dependent buffering of human regulatory polymorphism. PLoS Genet. 8, e1002599 (2012).
  18. Kilpinen, H. et al. Coordinated effects of sequence variation on DNA binding, chromatin structure, and transcription. Science 342, 744–747 (2013).
  19. Reddy, T.E. et al. Effects of sequence variation on differential allelic transcription factor occupancy and gene expression. Genome Res. 22, 860–869 (2012).
  20. McVicker, G. et al. Identification of genetic variants that affect histone modifications in human cells. Science 342, 747–749 (2013).
  21. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
  22. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
  23. Heap, G.A. et al. Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome resequencing. Hum. Mol. Genet. 19, 122–134 (2010).
  24. Stergachis, A.B. et al. Exonic transcription factor binding directs codon choice and affects protein evolution. Science 342, 1367–1372 (2013).
  25. Zhang, K. et al. Digital RNA allelotyping reveals tissue-specific and allele-specific gene expression in human. Nat. Methods 6, 613–618 (2009).
  26. Henikoff, S. & Shilatifard, A. Histone modification: cause or cog? Trends Genet. 27, 389–396 (2011).
  27. Neph, S. et al. An expansive human regulatory lexicon encoded in transcription factor footprints. Nature 489, 83–90 (2012).
  28. Spivakov, M. et al. Analysis of variation at transcription factor binding sites in Drosophila and humans. Genome Biol. 13, R49 (2012).
  29. Biddie, S.C. et al. Transcription factor AP1 potentiates chromatin accessibility and glucocorticoid receptor binding. Mol. Cell 43, 145–155 (2011).
  30. Sherry, S.T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).
  31. Badis, G. et al. Diversity and complexity in DNA recognition by transcription factors. Science 324, 1720–1723 (2009).
  32. Zhao, Y. & Stormo, G.D. Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nat. Biotechnol. 29, 480–483 (2011).
  33. Rohs, R. et al. The role of DNA shape in protein-DNA recognition. Nature 461, 1248–1253 (2009).
  34. Meijsing, S.H. et al. DNA binding site sequence directs glucocorticoid receptor structure and activity. Science 324, 407–410 (2009).
  35. Jolma, A. et al. DNA-binding specificities of human transcription factors. Cell 152, 327–339 (2013).
  36. Lee, J.-H. et al. A robust approach to identifying tissue-specific gene expression regulatory variants using personalized human induced pluripotent stem cells. PLoS Genet. 5, e1000718 (2009).
  37. Ding, J. et al. Gene expression in skin and lymphoblastoid cells: refined statistical method reveals extensive overlap in cis-eQTL signals. Am. J. Hum. Genet. 87, 779–789 (2010).
  38. Price, A.L. et al. Single-tissue and cross-tissue heritability of gene expression via identity-by-descent in related or unrelated individuals. PLoS Genet. 7, e1001317 (2011).
  39. Grundberg, E. et al. Mapping cis- and trans-regulatory effects across multiple tissues in twins. Nat. Genet. 44, 1084–1089 (2012).
  40. Flutre, T., Wen, X., Pritchard, J. & Stephens, M. A statistical framework for joint eQTL analysis in multiple tissues. PLoS Genet. 9, e1003486 (2013).
  41. Veyrieras, J.-B. et al. High-resolution mapping of expression-QTLs yields insight into human gene regulation. PLoS Genet. 4, e1000214 (2008).
  42. John, S. et al. Chromatin accessibility pre-determines glucocorticoid receptor binding patterns. Nat. Genet. 43, 264–268 (2011).
  43. John, S. et al. Genome-scale mapping of DNase I hypersensitivity. Curr. Protoc. Mol. Biol. Chapter 27, Unit 21.27 (2013).
  44. Wang, H. et al. Widespread plasticity in CTCF occupancy linked to DNA methylation. Genome Res. 22, 1680–1688 (2012).
  45. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
  46. Neph, S. et al. BEDOPS: high-performance genomic feature operations. Bioinformatics 28, 1919–1920 (2012).
  47. Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
  48. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
  49. Lazarovici, A. et al. Probing DNA shape and methylation state on a genomic scale with DNase I. Proc. Natl. Acad. Sci. USA 110, 6376–6381 (2013).
  50. 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).
  51. Le Novère, N. MELTING, computing the melting temperature of nucleic acid duplex. Bioinformatics 17, 1226–1227 (2001).
  52. Matys, V. et al. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 34, D108–D110 (2006).
  53. Portales-Casamar, E. et al. JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res. 38, D105–D110 (2010).
  54. Newburger, D.E. & Bulyk, M.L. UniPROBE: an online database of protein binding microarray data on protein-DNA interactions. Nucleic Acids Res. 37, D77–D82 (2009).
  55. Grant, C.E., Bailey, T.L. & Noble, W.S. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).
  56. Gupta, S., Stamatoyannopoulos, J.A., Bailey, T.L. & Noble, W.S. Quantifying similarity between motifs. Genome Biol. 8, R24 (2007).
  57. Galas, D.J. & Schmitz, A. DNAse footprinting: a simple method for the detection of protein-DNA binding specificity. Nucleic Acids Res. 5, 3157–3170 (1978).
  58. Sing, T., Sander, O., Beerenwinkel, N. & Lengauer, T. ROCR: visualizing classifier performance in R. Bioinformatics 21, 3940–3941 (2005).
  59. Cooper, G.M. et al. Characterization of evolutionary rates and constraints in three Mammalian genomes. Genome Res. 14, 539–548 (2004).
  60. Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
  61. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
  62. Gulko, B., Hubisz, M.J., Gronau, I. & Siepel, A. A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat. Genet. 47, 276–283 (2015).
  63. Lee, D. et al. A method to predict the impact of regulatory variants from DNA sequence. Nat. Genet. 47, 955–961 (2015).

Author information

  1. Present address: Institute for Systems Genetics, New York University Langone Medical Center, New York, New York, USA.

    • Matthew T Maurano


  1. Department of Genome Sciences, University of Washington, Seattle, Washington, USA.

    • Matthew T Maurano,
    • Eric Haugen,
    • Richard Sandstrom,
    • Jeff Vierstra,
    • Anthony Shafer,
    • Rajinder Kaul &
    • John A Stamatoyannopoulos
  2. Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, Washington, USA.

    • Rajinder Kaul
  3. Division of Oncology, Department of Medicine, University of Washington, Seattle, Washington, USA.

    • John A Stamatoyannopoulos
  4. Altius Institute for Biomedical Sciences, Seattle, Washington, USA.

    • John A Stamatoyannopoulos


M.T.M., E.H. and J.A.S. conceived and designed the experiments. M.T.M. and E.H. analyzed the data. J.V. and M.T.M. performed transcription factor cluster analysis. R.S. provided bioinformatics support. A.S. generated targeted footprinting data. R.K. assisted with data collection. M.T.M. and J.A.S. wrote the manuscript. M.T.M. and J.A.S. jointly supervised research.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Text and Figures (11,580 KB)

    Supplementary Figures 1–13 and Supplementary Tables 3–13 and 15–17.

  2. Supplementary Data Set 2. TF clusters of similar motifs. (23,336 KB)

    Motif weblogos from the JASPAR, UniPROBE and Jolma et al. (ref. 35) databases grouped by TF cluster. Motifs from TRANSFAC are listed by name without showing a weblogo.

Text files

  1. Supplementary Table 1. Overview of the DNase I data used in this study. (36 KB)

    DNase I mapping of 116 cell types and tissues used in the study, including the shorthand name for the tissue. Signal portion of tags (SPOT) scores are a measure of enrichment and refer to the proportion of reads mapping within a DHS. Read counts include reads mapped uniquely with ≤2 mismatches to an autosomal chromosome; paired-end reads were required to both properly map to the same chromosome. Read counts are in millions. *, FL_E was excluded from the primary analysis and used for independent validation of the predictions in Figure 7. Previously published data sets are labeled by publication (refs. 2,3,24,27,64–67).

  2. Supplementary Table 2. Overview of the ChIP-seq data used in this study. (12 KB)

    ChIP-seq mapping of CTCF and H3K4me3 in 77 cell types and tissues used in the study. Signal portion of tags (SPOT) scores are a measure of enrichment and refer to the proportion of reads mapping within a DHS. Read counts include reads mapped uniquely with ≤2 mismatches to an autosomal chromosome; paired-end reads were required to both properly map to the same chromosome. Read counts are in millions. Previously published data sets are labeled by publication (refs. 2,17,44,68).

  3. Supplementary Table 14. Clustering of motifs into TF families. (34 KB)

    Clustering of motifs from the JASPAR, UniPROBE, TRANSFAC and Jolma et al. (ref. 35) databases. Each TF cluster is listed along with the names of constituent motifs.

  4. Supplementary Data Set 1. SNPs tested for imbalance in DNA accessibility. (28 MB)

    SNPs are listed by their hg19 coordinates. The rsID is used for SNPs in dbSNP 138. SNPs are classified as imbalanced as in Figure 1c. PctRef refers to the proportion of reads mapping to the reference allele (Fig. 1d).
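
The PctRef quantity and the notion of imbalance can be made concrete with a short Python sketch. It computes the reference-allele proportion, converts an allelic ratio into a fold difference between alleles (a 70% ratio gives about 2.3-fold, as noted for Fig. 1d) and tests for deviation from a 50:50 ratio. The exact two-sided binomial test is an assumption chosen for illustration; this page does not state which test the authors used, and the FDR correction applied in the study is not reproduced here. All function names are hypothetical.

```python
from math import comb


def pct_ref(ref_reads: int, alt_reads: int) -> float:
    """Proportion of reads mapping to the reference allele."""
    return ref_reads / (ref_reads + alt_reads)


def fold_difference(ratio: float) -> float:
    """Fold difference in accessibility between the stronger and weaker allele."""
    major = max(ratio, 1.0 - ratio)
    if major == 1.0:
        return float("inf")
    return major / (1.0 - major)


def binomial_two_sided_p(ref_reads: int, alt_reads: int) -> float:
    """Exact two-sided binomial P value against a 50:50 expectation.

    Assumption: suitable for modest read depths only; for very deep coverage
    a library routine such as scipy.stats.binomtest would be preferable.
    """
    n = ref_reads + alt_reads
    probs = [comb(n, i) * 0.5 ** n for i in range(n + 1)]
    observed = probs[ref_reads]
    # Sum the probabilities of all outcomes at least as extreme as the observed one.
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))


# Hypothetical site with 70 reference and 30 alternate reads.
ratio = pct_ref(70, 30)
print(round(ratio, 2), round(fold_difference(ratio), 1), binomial_two_sided_p(70, 30))
```

A site would be called imbalanced only if it passed both a significance threshold and an effect-size threshold, mirroring the pairing of an FDR cutoff with a minimum allelic ratio in Figure 1c.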

Zip files

  1. Supplementary Data Set 3. SNVs predicted to affect DNA accessibility. (9,587 KB)

    List of SNVs from dbSNP 138 overlapping a TF recognition sequence in a DHS hotspot predicted to affect accessibility with a score greater than 0.10. The file is in extended bed format using hg19 coordinates and includes a header line. Each row contains the SNP coordinates and dbSNP ID, a score scaled as the probability of imbalance, the PWM name and strand, the position of the SNP relative to the PWM match and the two alleles of the SNP.
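
A file of this kind could be read with a few lines of Python, sketched below under stated assumptions. The description above says only that the file is tab-delimited extended BED with a header line; the column names used here (`chrom`, `start`, `end`, `rsid`, `score`, `pwm`, `strand`, `pos`, `ref`, `alt`), their order and the toy rows are hypothetical and would need to be replaced with the names in the actual header.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def read_predictions(handle: TextIO,
                     score_column: str = "score",
                     min_score: float = 0.10) -> Iterator[Dict[str, str]]:
    """Yield rows of a headered, tab-delimited file whose score exceeds min_score."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if float(row[score_column]) > min_score:
            yield row


# Hypothetical two-row example; the column names and values are made up.
toy = (
    "chrom\tstart\tend\trsid\tscore\tpwm\tstrand\tpos\tref\talt\n"
    "chr1\t1000\t1001\trsEXAMPLE1\t0.12\tMOTIF_A\t+\t3\tA\tG\n"
    "chr2\t2000\t2001\trsEXAMPLE2\t0.45\tMOTIF_B\t-\t7\tC\tT\n"
)

high_confidence = list(read_predictions(io.StringIO(toy), min_score=0.2))
print([row["rsid"] for row in high_confidence])
```

Raising `min_score` trades the number of retained variants against the expected proportion of true positives, in line with the cumulative distribution described for Figure 7d.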
