The function of human regulatory regions depends exquisitely on their local genomic environment and on cellular context, complicating experimental analysis of common disease- and trait-associated variants that localize within regulatory DNA. We use allelically resolved genomic DNase I footprinting data encompassing 166 individuals and 114 cell types to identify >60,000 common variants that directly influence transcription factor occupancy and regulatory DNA accessibility in vivo. The unprecedented scale of these data enables systematic analysis of the impact of sequence variation on transcription factor occupancy in vivo. We leverage this analysis to develop accurate models of variation affecting the recognition sites for diverse transcription factors and apply these models to discriminate nearly 500,000 common regulatory variants likely to affect transcription factor occupancy across the human genome. The approach and results provide a new foundation for the analysis and interpretation of noncoding variation in complete human genomes and for systems-level investigation of disease-associated variants.
References
- Nuclease hypersensitive sites in chromatin. Annu. Rev. Biochem. 57, 159–197 (1988).
- The accessible chromatin landscape of the human genome. Nature 489, 75–82 (2012).
- Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).
- DNase I sensitivity QTLs are a major determinant of human expression variation. Nature 482, 390–394 (2012).
- Germ-line transformation of mice. Annu. Rev. Genet. 20, 465–499 (1986).
- The long-range interaction landscape of gene promoters. Nature 489, 109–113 (2012).
- Role of gene order in developmental control of human γ- and β-globin gene expression. Mol. Cell. Biol. 13, 4836–4843 (1993).
- Virus induction of human IFN β gene expression requires the assembly of an enhanceosome. Cell 83, 1091–1100 (1995).
- Transcription factor loading on the MMTV promoter: a bimodal mechanism for promoter activation. Science 255, 1573–1576 (1992).
- Locus-specific editing of histone modifications at endogenous enhancers. Nat. Biotechnol. 31, 1133–1136 (2013).
- What does 'chromatin remodeling' mean? Trends Biochem. Sci. 25, 548–555 (2000).
- Simultaneous genotyping, gene-expression measurement, and detection of allele-specific expression with oligonucleotide arrays. Genome Res. 15, 284–291 (2005).
- Simultaneous SNP identification and assessment of allele-specific bias from ChIP-seq data. BMC Genet. 13, 46 (2012).
- In vivo characterization of regulatory polymorphisms by allele-specific quantification of RNA polymerase loading. Nat. Genet. 33, 469–475 (2003).
- Heritable individual-specific and allele-specific chromatin signatures in humans. Science 328, 235–239 (2010).
- Variation in transcription factor binding among humans. Science 328, 232–235 (2010).
- Widespread site-dependent buffering of human regulatory polymorphism. PLoS Genet. 8, e1002599 (2012).
- Coordinated effects of sequence variation on DNA binding, chromatin structure, and transcription. Science 342, 744–747 (2013).
- Effects of sequence variation on differential allelic transcription factor occupancy and gene expression. Genome Res. 22, 860–869 (2012).
- Identification of genetic variants that affect histone modifications in human cells. Science 342, 747–749 (2013).
- The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
- ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
- Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome resequencing. Hum. Mol. Genet. 19, 122–134 (2010).
- Exonic transcription factor binding directs codon choice and affects protein evolution. Science 342, 1367–1372 (2013).
- Digital RNA allelotyping reveals tissue-specific and allele-specific gene expression in human. Nat. Methods 6, 613–618 (2009).
- Histone modification: cause or cog? Trends Genet. 27, 389–396 (2011).
- An expansive human regulatory lexicon encoded in transcription factor footprints. Nature 489, 83–90 (2012).
- Analysis of variation at transcription factor binding sites in Drosophila and humans. Genome Biol. 13, R49 (2012).
- Transcription factor AP1 potentiates chromatin accessibility and glucocorticoid receptor binding. Mol. Cell 43, 145–155 (2011).
- dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).
- Diversity and complexity in DNA recognition by transcription factors. Science 324, 1720–1723 (2009).
- Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nat. Biotechnol. 29, 480–483 (2011).
- The role of DNA shape in protein-DNA recognition. Nature 461, 1248–1253 (2009).
- DNA binding site sequence directs glucocorticoid receptor structure and activity. Science 324, 407–410 (2009).
- DNA-binding specificities of human transcription factors. Cell 152, 327–339 (2013).
- A robust approach to identifying tissue-specific gene expression regulatory variants using personalized human induced pluripotent stem cells. PLoS Genet. 5, e1000718 (2009).
- Gene expression in skin and lymphoblastoid cells: refined statistical method reveals extensive overlap in cis-eQTL signals. Am. J. Hum. Genet. 87, 779–789 (2010).
- Single-tissue and cross-tissue heritability of gene expression via identity-by-descent in related or unrelated individuals. PLoS Genet. 7, e1001317 (2011).
- Mapping cis- and trans-regulatory effects across multiple tissues in twins. Nat. Genet. 44, 1084–1089 (2012).
- A statistical framework for joint eQTL analysis in multiple tissues. PLoS Genet. 9, e1003486 (2013).
- High-resolution mapping of expression-QTLs yields insight into human gene regulation. PLoS Genet. 4, e1000214 (2008).
- Chromatin accessibility pre-determines glucocorticoid receptor binding patterns. Nat. Genet. 43, 264–268 (2011).
- Genome-scale mapping of DNase I hypersensitivity. Curr. Protoc. Mol. Biol. Chapter 27, Unit 21.27 (2013).
- Widespread plasticity in CTCF occupancy linked to DNA methylation. Genome Res. 22, 1680–1688 (2012).
- Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
- BEDOPS: high-performance genomic feature operations. Bioinformatics 28, 1919–1920 (2012).
- The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
- Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
- Probing DNA shape and methylation state on a genomic scale with DNase I. Proc. Natl. Acad. Sci. USA 110, 6376–6381 (2013).
- 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).
- MELTING, computing the melting temperature of nucleic acid duplex. Bioinformatics 17, 1226–1227 (2001).
- TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 34, D108–D110 (2006).
- JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res. 38, D105–D110 (2010).
- UniPROBE: an online database of protein binding microarray data on protein-DNA interactions. Nucleic Acids Res. 37, D77–D82 (2009).
- FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).
- Quantifying similarity between motifs. Genome Biol. 8, R24 (2007).
- DNAse footprinting: a simple method for the detection of protein-DNA binding specificity. Nucleic Acids Res. 5, 3157–3170 (1978).
- ROCR: visualizing classifier performance in R. Bioinformatics 21, 3940–3941 (2005).
- Characterization of evolutionary rates and constraints in three Mammalian genomes. Genome Res. 14, 539–548 (2004).
- Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
- A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
- A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat. Genet. 47, 276–283 (2015).
- A method to predict the impact of regulatory variants from DNA sequence. Nat. Genet. 47, 955–961 (2015).
- Supplementary Text and Figures (11,580 KB)
Supplementary Figures 1–13 and Supplementary Tables 3–13 and 15–17.
- Supplementary Data Set 2. TF clusters of similar motifs. (23,336 KB)
Motif weblogos from the JASPAR, UniPROBE and Jolma et al. (ref. 35) databases, grouped by TF cluster. Motifs from TRANSFAC are listed by name without a weblogo.
- Supplementary Table 1. Overview of the DNase I data used in this study. (36 KB)
DNase I mapping of 116 cell types and tissues used in the study, including the shorthand name for each tissue. Signal portion of tags (SPOT) scores are a measure of enrichment and refer to the proportion of reads mapping within a DHS. Read counts include reads mapped uniquely with ≤2 mismatches to an autosomal chromosome; for paired-end data, both reads of a pair were required to map properly to the same chromosome. Read counts are in millions. *, FL_E was excluded from the primary analysis and used for independent validation of the predictions in Figure 7. Previously published data sets are labeled by publication (refs. 2,3,24,27,64–67).
- Supplementary Table 2. Overview of the ChIP-seq data used in this study. (12 KB)
ChIP-seq mapping of CTCF and H3K4me3 in 77 cell types and tissues used in the study. Signal portion of tags (SPOT) scores are a measure of enrichment and refer to the proportion of reads mapping within a DHS. Read counts include reads mapped uniquely with ≤2 mismatches to an autosomal chromosome; for paired-end data, both reads of a pair were required to map properly to the same chromosome. Read counts are in millions. Previously published data sets are labeled by publication (refs. 2,17,44,68).
- Supplementary Table 14. Clustering of motifs into TF families. (34 KB)
Clustering of motifs from the JASPAR, UniPROBE, TRANSFAC and Jolma et al. (ref. 35) databases. Each TF cluster is listed along with the names of its constituent motifs.
- Supplementary Data Set 1. SNPs tested for imbalance in DNA accessibility. (28 MB)
SNPs are listed by their hg19 coordinates. The rsID is used for SNPs in dbSNP 138. SNPs are classified as imbalanced as in Figure 1c. PctRef refers to the proportion of reads mapping to the reference allele (Fig. 1d).
- Supplementary Data Set 3. SNVs predicted to affect DNA accessibility. (9,587 KB)
List of SNVs from dbSNP 138 overlapping a TF recognition sequence in a DHS hotspot predicted to affect accessibility with a score greater than 0.10. The file is in extended BED format using hg19 coordinates and includes a header line. Each row contains the SNP coordinates and dbSNP ID, a score scaled as the probability of imbalance, the PWM name and strand, the position of the SNP relative to the PWM match and the two alleles of the SNP.
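
For readers who want to work with a file laid out as described for Supplementary Data Set 3, the following Python sketch reads a tab-delimited, extended BED-style table with a header line and keeps rows above a chosen score threshold. It is an illustration under stated assumptions, not the authors' code: the column names in `ASSUMED_COLUMNS`, their order, the function name `read_predicted_snvs` and the example rows (including the placeholder IDs) are all hypothetical and should be checked against the actual header line of the file.

```python
import csv
import io
from typing import Dict, Iterator, TextIO

# Hypothetical column names and order, inferred from the prose description only;
# the real header line of the data set may differ.
ASSUMED_COLUMNS = [
    "chrom", "start", "end", "rsid", "score",
    "pwm", "strand", "snp_offset", "allele1", "allele2",
]


def read_predicted_snvs(handle: TextIO, min_score: float = 0.10) -> Iterator[Dict[str, object]]:
    """Yield rows whose score is strictly greater than min_score.

    Expects tab-delimited text whose first line is a header.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line
    for fields in reader:
        if len(fields) < len(ASSUMED_COLUMNS):
            continue  # ignore blank or truncated lines
        row: Dict[str, object] = dict(zip(ASSUMED_COLUMNS, fields))
        row["start"] = int(fields[1])
        row["end"] = int(fields[2])
        row["score"] = float(fields[4])
        row["snp_offset"] = int(fields[7])
        if row["score"] > min_score:
            yield row


# Made-up example rows (not real data), built with explicit tabs.
_EXAMPLE_ROWS = [
    ASSUMED_COLUMNS,
    ["chr1", "1000", "1001", "rs_example_1", "0.82", "PWM_A", "+", "3", "A", "G"],
    ["chr2", "5000", "5001", "rs_example_2", "0.15", "PWM_B", "-", "7", "C", "T"],
    ["chr3", "750", "751", "rs_example_3", "0.55", "PWM_A", "+", "1", "G", "T"],
]
EXAMPLE_TEXT = "\n".join("\t".join(r) for r in _EXAMPLE_ROWS) + "\n"

if __name__ == "__main__":
    for snv in read_predicted_snvs(io.StringIO(EXAMPLE_TEXT), min_score=0.5):
        print(snv["chrom"], snv["start"], snv["rsid"], snv["score"])
```

Raising `min_score` above the file's 0.10 floor is one way to trade the number of candidate variants against the predicted probability of imbalance; the appropriate cutoff depends on the downstream analysis.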