Large-scale identification of sequence variants influencing human transcription factor occupancy in vivo

An Erratum to this article was published on 01 January 2016

This article has been updated

Abstract

The function of human regulatory regions depends exquisitely on their local genomic environment and on cellular context, complicating experimental analysis of common disease- and trait-associated variants that localize within regulatory DNA. We use allelically resolved genomic DNase I footprinting data encompassing 166 individuals and 114 cell types to identify >60,000 common variants that directly influence transcription factor occupancy and regulatory DNA accessibility in vivo. The unprecedented scale of these data enables systematic analysis of the impact of sequence variation on transcription factor occupancy in vivo. We leverage this analysis to develop accurate models of variation affecting the recognition sites for diverse transcription factors and apply these models to discriminate nearly 500,000 common regulatory variants likely to affect transcription factor occupancy across the human genome. The approach and results provide a new foundation for the analysis and interpretation of noncoding variation in complete human genomes and for systems-level investigation of disease-associated variants.


Figure 1: Identification of regulatory variants influencing DNA accessibility.
Figure 2: Effect of sampling depth on the detection of imbalance.
Figure 3: Cross–cell type analysis of imbalance.
Figure 4: Imbalance in CTCF occupancy and H3K4me3.
Figure 5: Profiles of transcription factor sensitivity to sequence variation.
Figure 6: Buffering of regulatory variation.
Figure 7: Recognition of variation affecting transcription factor occupancy across the genome.

Accession codes

Primary accessions

Gene Expression Omnibus

Change history

  • 17 November 2015

    In the version of this article initially published online, the Online Methods incorrectly abbreviated mapping quality as MAQ rather than MAPQ. Also in the Online Methods, the procedure for downsampling allele counts for cross–cell type analysis of imbalance was incorrectly written as "we subsampled each site to three cell types and further downsampled the allele counts to mapping quality for the lowest of the three cell types." The sentence should read "we subsampled each site to three cell types and further downsampled to the allele counts to match the lowest of the three cell types." The errors have been corrected for the print, PDF and HTML versions of this article.

References

  1. Gross, D.S. & Garrard, W.T. Nuclease hypersensitive sites in chromatin. Annu. Rev. Biochem. 57, 159–197 (1988).

  2. Thurman, R.E. et al. The accessible chromatin landscape of the human genome. Nature 489, 75–82 (2012).

  3. Maurano, M.T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).

  4. Degner, J.F. et al. DNase I sensitivity QTLs are a major determinant of human expression variation. Nature 482, 390–394 (2012).

  5. Palmiter, R.D. & Brinster, R.L. Germ-line transformation of mice. Annu. Rev. Genet. 20, 465–499 (1986).

  6. Sanyal, A., Lajoie, B.R., Jain, G. & Dekker, J. The long-range interaction landscape of gene promoters. Nature 489, 109–113 (2012).

  7. Peterson, K.R. & Stamatoyannopoulos, G. Role of gene order in developmental control of human γ- and β-globin gene expression. Mol. Cell. Biol. 13, 4836–4843 (1993).

  8. Thanos, D. & Maniatis, T. Virus induction of human IFN β gene expression requires the assembly of an enhanceosome. Cell 83, 1091–1100 (1995).

  9. Archer, T.K., Lefebvre, P., Wolford, R.G. & Hager, G.L. Transcription factor loading on the MMTV promoter: a bimodal mechanism for promoter activation. Science 255, 1573–1576 (1992).

  10. Mendenhall, E.M. et al. Locus-specific editing of histone modifications at endogenous enhancers. Nat. Biotechnol. 31, 1133–1136 (2013).

  11. Aalfs, J.D. & Kingston, R.E. What does 'chromatin remodeling' mean? Trends Biochem. Sci. 25, 548–555 (2000).

  12. Ronald, J. et al. Simultaneous genotyping, gene-expression measurement, and detection of allele-specific expression with oligonucleotide arrays. Genome Res. 15, 284–291 (2005).

  13. Ni, Y., Hall, A.W., Battenhouse, A. & Iyer, V.R. Simultaneous SNP identification and assessment of allele-specific bias from ChIP-seq data. BMC Genet. 13, 46 (2012).

  14. Knight, J.C., Keating, B.J., Rockett, K.A. & Kwiatkowski, D.P. In vivo characterization of regulatory polymorphisms by allele-specific quantification of RNA polymerase loading. Nat. Genet. 33, 469–475 (2003).

  15. McDaniell, R. et al. Heritable individual-specific and allele-specific chromatin signatures in humans. Science 328, 235–239 (2010).

  16. Kasowski, M. et al. Variation in transcription factor binding among humans. Science 328, 232–235 (2010).

  17. Maurano, M.T., Wang, H., Kutyavin, T. & Stamatoyannopoulos, J.A. Widespread site-dependent buffering of human regulatory polymorphism. PLoS Genet. 8, e1002599 (2012).

  18. Kilpinen, H. et al. Coordinated effects of sequence variation on DNA binding, chromatin structure, and transcription. Science 342, 744–747 (2013).

  19. Reddy, T.E. et al. Effects of sequence variation on differential allelic transcription factor occupancy and gene expression. Genome Res. 22, 860–869 (2012).

  20. McVicker, G. et al. Identification of genetic variants that affect histone modifications in human cells. Science 342, 747–749 (2013).

  21. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

  22. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).

  23. Heap, G.A. et al. Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome resequencing. Hum. Mol. Genet. 19, 122–134 (2010).

  24. Stergachis, A.B. et al. Exonic transcription factor binding directs codon choice and affects protein evolution. Science 342, 1367–1372 (2013).

  25. Zhang, K. et al. Digital RNA allelotyping reveals tissue-specific and allele-specific gene expression in human. Nat. Methods 6, 613–618 (2009).

  26. Henikoff, S. & Shilatifard, A. Histone modification: cause or cog? Trends Genet. 27, 389–396 (2011).

  27. Neph, S. et al. An expansive human regulatory lexicon encoded in transcription factor footprints. Nature 489, 83–90 (2012).

  28. Spivakov, M. et al. Analysis of variation at transcription factor binding sites in Drosophila and humans. Genome Biol. 13, R49 (2012).

  29. Biddie, S.C. et al. Transcription factor AP1 potentiates chromatin accessibility and glucocorticoid receptor binding. Mol. Cell 43, 145–155 (2011).

  30. Sherry, S.T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).

  31. Badis, G. et al. Diversity and complexity in DNA recognition by transcription factors. Science 324, 1720–1723 (2009).

  32. Zhao, Y. & Stormo, G.D. Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nat. Biotechnol. 29, 480–483 (2011).

  33. Rohs, R. et al. The role of DNA shape in protein-DNA recognition. Nature 461, 1248–1253 (2009).

  34. Meijsing, S.H. et al. DNA binding site sequence directs glucocorticoid receptor structure and activity. Science 324, 407–410 (2009).

  35. Jolma, A. et al. DNA-binding specificities of human transcription factors. Cell 152, 327–339 (2013).

  36. Lee, J.-H. et al. A robust approach to identifying tissue-specific gene expression regulatory variants using personalized human induced pluripotent stem cells. PLoS Genet. 5, e1000718 (2009).

  37. Ding, J. et al. Gene expression in skin and lymphoblastoid cells: refined statistical method reveals extensive overlap in cis-eQTL signals. Am. J. Hum. Genet. 87, 779–789 (2010).

  38. Price, A.L. et al. Single-tissue and cross-tissue heritability of gene expression via identity-by-descent in related or unrelated individuals. PLoS Genet. 7, e1001317 (2011).

  39. Grundberg, E. et al. Mapping cis- and trans-regulatory effects across multiple tissues in twins. Nat. Genet. 44, 1084–1089 (2012).

  40. Flutre, T., Wen, X., Pritchard, J. & Stephens, M. A statistical framework for joint eQTL analysis in multiple tissues. PLoS Genet. 9, e1003486 (2013).

  41. Veyrieras, J.-B. et al. High-resolution mapping of expression-QTLs yields insight into human gene regulation. PLoS Genet. 4, e1000214 (2008).

  42. John, S. et al. Chromatin accessibility pre-determines glucocorticoid receptor binding patterns. Nat. Genet. 43, 264–268 (2011).

  43. John, S. et al. Genome-scale mapping of DNase I hypersensitivity. Curr. Protoc. Mol. Biol. Chapter 27, Unit 21.27 (2013).

  44. Wang, H. et al. Widespread plasticity in CTCF occupancy linked to DNA methylation. Genome Res. 22, 1680–1688 (2012).

  45. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).

  46. Neph, S. et al. BEDOPS: high-performance genomic feature operations. Bioinformatics 28, 1919–1920 (2012).

  47. Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).

  48. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

  49. Lazarovici, A. et al. Probing DNA shape and methylation state on a genomic scale with DNase I. Proc. Natl. Acad. Sci. USA 110, 6376–6381 (2013).

  50. 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).

  51. Le Novère, N. MELTING, computing the melting temperature of nucleic acid duplex. Bioinformatics 17, 1226–1227 (2001).

  52. Matys, V. et al. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 34, D108–D110 (2006).

  53. Portales-Casamar, E. et al. JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res. 38, D105–D110 (2010).

  54. Newburger, D.E. & Bulyk, M.L. UniPROBE: an online database of protein binding microarray data on protein-DNA interactions. Nucleic Acids Res. 37, D77–D82 (2009).

  55. Grant, C.E., Bailey, T.L. & Noble, W.S. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).

  56. Gupta, S., Stamatoyannopoulos, J.A., Bailey, T.L. & Noble, W.S. Quantifying similarity between motifs. Genome Biol. 8, R24 (2007).

  57. Galas, D.J. & Schmitz, A. DNAse footprinting: a simple method for the detection of protein-DNA binding specificity. Nucleic Acids Res. 5, 3157–3170 (1978).

  58. Sing, T., Sander, O., Beerenwinkel, N. & Lengauer, T. ROCR: visualizing classifier performance in R. Bioinformatics 21, 3940–3941 (2005).

  59. Cooper, G.M. et al. Characterization of evolutionary rates and constraints in three Mammalian genomes. Genome Res. 14, 539–548 (2004).

  60. Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).

  61. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).

  62. Gulko, B., Hubisz, M.J., Gronau, I. & Siepel, A. A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat. Genet. 47, 276–283 (2015).

  63. Lee, D. et al. A method to predict the impact of regulatory variants from DNA sequence. Nat. Genet. 47, 955–961 (2015).


Acknowledgements

This work was supported by US National Institutes of Health grants U54HG004592, U54HG007010, U01ES01156, 1S10RR026770 and 1S10OD017999 to J.A.S. and National Institute of Mental Health fellowship F31MH094073 to M.T.M. J.V. was supported by a National Science Foundation Graduate Research Fellowship under grant DGE-071824.

Author information

Contributions

M.T.M., E.H. and J.A.S. conceived and designed the experiments. M.T.M. and E.H. analyzed the data. J.V. and M.T.M. performed transcription factor cluster analysis. R.S. provided bioinformatics support. A.S. generated targeted footprinting data. R.K. assisted with data collection. M.T.M. and J.A.S. wrote the manuscript. M.T.M. and J.A.S. jointly supervised research.

Corresponding authors

Correspondence to Matthew T Maurano or John A Stamatoyannopoulos.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–13 and Supplementary Tables 3–13 and 15–17. (PDF 11308 kb)

Supplementary Table 1. Overview of the DNase I data used in this study.

DNase I mapping of 116 cell types and tissues used in the study, including the shorthand name for the tissue. Signal portion of tags (SPOT) scores are a measure of enrichment and refer to the proportion of reads mapping within a DHS. Read counts include reads mapped uniquely with ≤2 mismatches to an autosomal chromosome; paired-end reads were required to both properly map to the same chromosome. Read counts are in millions. *, FL_E was excluded from the primary analysis and used for independent validation of the predictions in Figure 7. Previously published data sets are labeled by publication (refs. 2,3,24,27,64–67). (TXT 35 kb)

Supplementary Table 2. Overview of the ChIP-seq data used in this study.

ChIP-seq mapping of CTCF and H3K4me3 in 77 cell types and tissues used in the study. Signal portion of tags (SPOT) scores are a measure of enrichment and refer to the proportion of reads mapping within a DHS. Read counts include reads mapped uniquely with ≤2 mismatches to an autosomal chromosome; paired-end reads were required to both properly map to the same chromosome. Read counts are in millions. Previously published data sets are labeled by publication (refs. 2,17,44,68). (TXT 12 kb)

Supplementary Table 14. Clustering of motifs into TF families.

Clustering of motifs from the JASPAR, UniPROBE, TRANSFAC and Jolma et al. (ref. 35) databases. Each TF cluster is listed along with the names of constituent motifs. (TXT 34 kb)

Supplementary Data Set 1. SNPs tested for imbalance in DNA accessibility.

SNPs are listed by their hg19 coordinates. The rsID is used for SNPs in dbSNP 138. SNPs are classified as imbalanced as in Figure 1c. PctRef refers to the proportion of reads mapping to the reference allele (Fig. 1d). (TXT 27676 kb)
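
As an illustration of the PctRef quantity described above, the following minimal Python sketch computes the proportion of reads mapping to the reference allele from a pair of allele counts. The function name `pct_ref`, the 0-to-1 scale and the example counts are assumptions made for illustration; they are not taken from the data set, and the classification of SNPs as imbalanced (Figure 1c) is not reproduced here.

```python
def pct_ref(ref_reads: int, alt_reads: int) -> float:
    """Return the proportion of reads mapping to the reference allele.

    Assumes the value is reported on a 0-to-1 scale; the data set itself
    may use a different scale (for example, a percentage).
    """
    total = ref_reads + alt_reads
    if total == 0:
        # A proportion is undefined when no reads overlap the SNP.
        raise ValueError("no reads overlap this SNP")
    return ref_reads / total


# Hypothetical counts: 30 reads on the reference allele, 10 on the alternate.
example = pct_ref(30, 10)
```

A value near 0.5 corresponds to balanced allele counts, while values toward 0 or 1 indicate that reads favor the alternate or the reference allele, respectively.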

Supplementary Data Set 2. TF clusters of similar motifs.

Motif weblogos from the JASPAR, UniPROBE and Jolma et al. (ref. 35) databases grouped by TF cluster. Motifs from TRANSFAC are listed by name without showing a weblogo. (PDF 23365 kb)

Supplementary Data Set 3. SNVs predicted to affect DNA accessibility.

List of SNVs from dbSNP 138 overlapping a TF recognition sequence in a DHS hotspot predicted to affect accessibility with a score greater than 0.10. The file is in extended bed format using hg19 coordinates and includes a header line. Each row contains the SNP coordinates and dbSNP ID, a score scaled as the probability of imbalance, the PWM name and strand, the position of the SNP relative to the PWM match and the two alleles of the SNP. (ZIP 9362 kb)


Cite this article

Maurano, M., Haugen, E., Sandstrom, R. et al. Large-scale identification of sequence variants influencing human transcription factor occupancy in vivo. Nat Genet 47, 1393–1401 (2015). https://doi.org/10.1038/ng.3432
