Large-scale identification of sequence variants influencing human transcription factor occupancy in vivo

Maurano, Matthew T; Haugen, Eric; Sandstrom, Richard; Vierstra, Jeff; Shafer, Anthony; Kaul, Rajinder; Stamatoyannopoulos, John A

doi:10.1038/ng.3432

Analysis
Published: 26 October 2015

Nature Genetics volume 47, pages 1393–1401 (2015)

An Erratum to this article was published on 01 January 2016

This article has been updated

Abstract

The function of human regulatory regions depends exquisitely on their local genomic environment and on cellular context, complicating experimental analysis of common disease- and trait-associated variants that localize within regulatory DNA. We use allelically resolved genomic DNase I footprinting data encompassing 166 individuals and 114 cell types to identify >60,000 common variants that directly influence transcription factor occupancy and regulatory DNA accessibility in vivo. The unprecedented scale of these data enables systematic analysis of the impact of sequence variation on transcription factor occupancy in vivo. We leverage this analysis to develop accurate models of variation affecting the recognition sites for diverse transcription factors and apply these models to discriminate nearly 500,000 common regulatory variants likely to affect transcription factor occupancy across the human genome. The approach and results provide a new foundation for the analysis and interpretation of noncoding variation in complete human genomes and for systems-level investigation of disease-associated variants.

**Figure 1: Identification of regulatory variants influencing DNA accessibility.**

**Figure 2: Effect of sampling depth on the detection of imbalance.**

**Figure 3: Cross–cell type analysis of imbalance.**

**Figure 4: Imbalance in CTCF occupancy and H3K4me3.**

**Figure 5: Profiles of transcription factor sensitivity to sequence variation.**

**Figure 6: Buffering of regulatory variation.**

**Figure 7: Recognition of variation affecting transcription factor occupancy across the genome.**

Accession codes

Primary accessions

Gene Expression Omnibus

Change history

17 November 2015
In the version of this article initially published online, the Online Methods incorrectly abbreviated mapping quality as MAQ rather than MAPQ. Also in the Online Methods, the procedure for downsampling allele counts for cross–cell type analysis of imbalance was incorrectly written as "we subsampled each site to three cell types and further downsampled the allele counts to mapping quality for the lowest of the three cell types." The sentence should read "we subsampled each site to three cell types and further downsampled to the allele counts to match the lowest of the three cell types." The errors have been corrected for the print, PDF and HTML versions of this article.
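
The corrected downsampling sentence can be illustrated with a minimal sketch. The correction does not say how reads were drawn, so random sampling without replacement is an assumption here, and the names `downsample_counts` and `subsample_site` and the `(ref, alt)` count layout are hypothetical rather than taken from the authors' code.

```python
import random


def downsample_counts(ref, alt, target, rng):
    """Draw `target` reads without replacement from ref + alt reads.

    Assumption: reads are sampled uniformly at random; the article does
    not specify the sampling scheme.
    """
    total = ref + alt
    if target >= total:
        return ref, alt
    # Indices below `ref` stand for reference-allele reads.
    picked = rng.sample(range(total), target)
    new_ref = sum(1 for i in picked if i < ref)
    return new_ref, target - new_ref


def subsample_site(counts_by_cell_type, n_cell_types=3, seed=0):
    """Pick `n_cell_types` cell types for one site and equalize depth.

    counts_by_cell_type maps a cell-type name to a (ref, alt) tuple.
    Every chosen cell type is downsampled to the lowest total depth
    among the chosen cell types.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(counts_by_cell_type), n_cell_types)
    depth = min(sum(counts_by_cell_type[c]) for c in chosen)
    return {
        c: downsample_counts(*counts_by_cell_type[c], depth, rng)
        for c in chosen
    }
```

Equalizing depth in this way means that differences in imbalance calls between cell types cannot be attributed simply to one cell type having more reads at the site.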

References

1. Gross, D.S. & Garrard, W.T. Nuclease hypersensitive sites in chromatin. Annu. Rev. Biochem. 57, 159–197 (1988).
2. Thurman, R.E. et al. The accessible chromatin landscape of the human genome. Nature 489, 75–82 (2012).
3. Maurano, M.T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).
4. Degner, J.F. et al. DNase I sensitivity QTLs are a major determinant of human expression variation. Nature 482, 390–394 (2012).
5. Palmiter, R.D. & Brinster, R.L. Germ-line transformation of mice. Annu. Rev. Genet. 20, 465–499 (1986).
6. Sanyal, A., Lajoie, B.R., Jain, G. & Dekker, J. The long-range interaction landscape of gene promoters. Nature 489, 109–113 (2012).
7. Peterson, K.R. & Stamatoyannopoulos, G. Role of gene order in developmental control of human γ- and β-globin gene expression. Mol. Cell. Biol. 13, 4836–4843 (1993).
8. Thanos, D. & Maniatis, T. Virus induction of human IFN β gene expression requires the assembly of an enhanceosome. Cell 83, 1091–1100 (1995).
9. Archer, T.K., Lefebvre, P., Wolford, R.G. & Hager, G.L. Transcription factor loading on the MMTV promoter: a bimodal mechanism for promoter activation. Science 255, 1573–1576 (1992).
10. Mendenhall, E.M. et al. Locus-specific editing of histone modifications at endogenous enhancers. Nat. Biotechnol. 31, 1133–1136 (2013).
11. Aalfs, J.D. & Kingston, R.E. What does 'chromatin remodeling' mean? Trends Biochem. Sci. 25, 548–555 (2000).
12. Ronald, J. et al. Simultaneous genotyping, gene-expression measurement, and detection of allele-specific expression with oligonucleotide arrays. Genome Res. 15, 284–291 (2005).
13. Ni, Y., Hall, A.W., Battenhouse, A. & Iyer, V.R. Simultaneous SNP identification and assessment of allele-specific bias from ChIP-seq data. BMC Genet. 13, 46 (2012).
14. Knight, J.C., Keating, B.J., Rockett, K.A. & Kwiatkowski, D.P. In vivo characterization of regulatory polymorphisms by allele-specific quantification of RNA polymerase loading. Nat. Genet. 33, 469–475 (2003).
15. McDaniell, R. et al. Heritable individual-specific and allele-specific chromatin signatures in humans. Science 328, 235–239 (2010).
16. Kasowski, M. et al. Variation in transcription factor binding among humans. Science 328, 232–235 (2010).
17. Maurano, M.T., Wang, H., Kutyavin, T. & Stamatoyannopoulos, J.A. Widespread site-dependent buffering of human regulatory polymorphism. PLoS Genet. 8, e1002599 (2012).
18. Kilpinen, H. et al. Coordinated effects of sequence variation on DNA binding, chromatin structure, and transcription. Science 342, 744–747 (2013).
19. Reddy, T.E. et al. Effects of sequence variation on differential allelic transcription factor occupancy and gene expression. Genome Res. 22, 860–869 (2012).
20. McVicker, G. et al. Identification of genetic variants that affect histone modifications in human cells. Science 342, 747–749 (2013).
21. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
22. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
23. Heap, G.A. et al. Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome resequencing. Hum. Mol. Genet. 19, 122–134 (2010).
24. Stergachis, A.B. et al. Exonic transcription factor binding directs codon choice and affects protein evolution. Science 342, 1367–1372 (2013).
25. Zhang, K. et al. Digital RNA allelotyping reveals tissue-specific and allele-specific gene expression in human. Nat. Methods 6, 613–618 (2009).
26. Henikoff, S. & Shilatifard, A. Histone modification: cause or cog? Trends Genet. 27, 389–396 (2011).
27. Neph, S. et al. An expansive human regulatory lexicon encoded in transcription factor footprints. Nature 489, 83–90 (2012).
28. Spivakov, M. et al. Analysis of variation at transcription factor binding sites in Drosophila and humans. Genome Biol. 13, R49 (2012).
29. Biddie, S.C. et al. Transcription factor AP1 potentiates chromatin accessibility and glucocorticoid receptor binding. Mol. Cell 43, 145–155 (2011).
30. Sherry, S.T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).
31. Badis, G. et al. Diversity and complexity in DNA recognition by transcription factors. Science 324, 1720–1723 (2009).
32. Zhao, Y. & Stormo, G.D. Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nat. Biotechnol. 29, 480–483 (2011).
33. Rohs, R. et al. The role of DNA shape in protein-DNA recognition. Nature 461, 1248–1253 (2009).
34. Meijsing, S.H. et al. DNA binding site sequence directs glucocorticoid receptor structure and activity. Science 324, 407–410 (2009).
35. Jolma, A. et al. DNA-binding specificities of human transcription factors. Cell 152, 327–339 (2013).
36. Lee, J.-H. et al. A robust approach to identifying tissue-specific gene expression regulatory variants using personalized human induced pluripotent stem cells. PLoS Genet. 5, e1000718 (2009).
37. Ding, J. et al. Gene expression in skin and lymphoblastoid cells: refined statistical method reveals extensive overlap in cis-eQTL signals. Am. J. Hum. Genet. 87, 779–789 (2010).
38. Price, A.L. et al. Single-tissue and cross-tissue heritability of gene expression via identity-by-descent in related or unrelated individuals. PLoS Genet. 7, e1001317 (2011).
39. Grundberg, E. et al. Mapping cis- and trans-regulatory effects across multiple tissues in twins. Nat. Genet. 44, 1084–1089 (2012).
40. Flutre, T., Wen, X., Pritchard, J. & Stephens, M. A statistical framework for joint eQTL analysis in multiple tissues. PLoS Genet. 9, e1003486 (2013).
41. Veyrieras, J.-B. et al. High-resolution mapping of expression-QTLs yields insight into human gene regulation. PLoS Genet. 4, e1000214 (2008).
42. John, S. et al. Chromatin accessibility pre-determines glucocorticoid receptor binding patterns. Nat. Genet. 43, 264–268 (2011).
43. John, S. et al. Genome-scale mapping of DNase I hypersensitivity. Curr. Protoc. Mol. Biol. Chapter 27, Unit 21.27 (2013).
44. Wang, H. et al. Widespread plasticity in CTCF occupancy linked to DNA methylation. Genome Res. 22, 1680–1688 (2012).
45. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
46. Neph, S. et al. BEDOPS: high-performance genomic feature operations. Bioinformatics 28, 1919–1920 (2012).
47. Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
48. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
49. Lazarovici, A. et al. Probing DNA shape and methylation state on a genomic scale with DNase I. Proc. Natl. Acad. Sci. USA 110, 6376–6381 (2013).
50. 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).
51. Le Novère, N. MELTING, computing the melting temperature of nucleic acid duplex. Bioinformatics 17, 1226–1227 (2001).
52. Matys, V. et al. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 34, D108–D110 (2006).
53. Portales-Casamar, E. et al. JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res. 38, D105–D110 (2010).
54. Newburger, D.E. & Bulyk, M.L. UniPROBE: an online database of protein binding microarray data on protein-DNA interactions. Nucleic Acids Res. 37, D77–D82 (2009).
55. Grant, C.E., Bailey, T.L. & Noble, W.S. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).
56. Gupta, S., Stamatoyannopoulos, J.A., Bailey, T.L. & Noble, W.S. Quantifying similarity between motifs. Genome Biol. 8, R24 (2007).
57. Galas, D.J. & Schmitz, A. DNAse footprinting: a simple method for the detection of protein-DNA binding specificity. Nucleic Acids Res. 5, 3157–3170 (1978).
58. Sing, T., Sander, O., Beerenwinkel, N. & Lengauer, T. ROCR: visualizing classifier performance in R. Bioinformatics 21, 3940–3941 (2005).
59. Cooper, G.M. et al. Characterization of evolutionary rates and constraints in three Mammalian genomes. Genome Res. 14, 539–548 (2004).
60. Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
61. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
62. Gulko, B., Hubisz, M.J., Gronau, I. & Siepel, A. A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat. Genet. 47, 276–283 (2015).
63. Lee, D. et al. A method to predict the impact of regulatory variants from DNA sequence. Nat. Genet. 47, 955–961 (2015).

Acknowledgements

This work was supported by US National Institutes of Health grants U54HG004592, U54HG007010, U01ES01156, 1S10RR026770 and 1S10OD017999 to J.A.S. and National Institute of Mental Health fellowship F31MH094073 to M.T.M. J.V. was supported by a National Science Foundation Graduate Research Fellowship under grant DGE-071824.

Author information

Matthew T Maurano
Present address: Institute for Systems Genetics, New York University Langone Medical Center, New York, New York, USA.

Authors and Affiliations

Department of Genome Sciences, University of Washington, Seattle, Washington, USA
Matthew T Maurano, Eric Haugen, Richard Sandstrom, Jeff Vierstra, Anthony Shafer, Rajinder Kaul & John A Stamatoyannopoulos
Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, Washington, USA
Rajinder Kaul
Division of Oncology, Department of Medicine, University of Washington, Seattle, Washington, USA
John A Stamatoyannopoulos
Altius Institute for Biomedical Sciences, Seattle, Washington, USA
John A Stamatoyannopoulos

Contributions

M.T.M., E.H. and J.A.S. conceived and designed the experiments. M.T.M. and E.H. analyzed the data. J.V. and M.T.M. performed transcription factor cluster analysis. R.S. provided bioinformatics support. A.S. generated targeted footprinting data. R.K. assisted with data collection. M.T.M. and J.A.S. wrote the manuscript. M.T.M. and J.A.S. jointly supervised research.

Corresponding authors

Correspondence to Matthew T Maurano or John A Stamatoyannopoulos.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–13 and Supplementary Tables 3–13 and 15–17. (PDF 11308 kb)

Supplementary Table 1. Overview of the DNase I data used in this study.

DNase I mapping of 116 cell types and tissues used in the study, including the shorthand name for the tissue. Signal portion of tags (SPOT) scores are a measure of enrichment and refer to the proportion of reads mapping within a DHS. Read counts include reads mapped uniquely with ≤2 mismatches to an autosomal chromosome; paired-end reads were required to both properly map to the same chromosome. Read counts are in millions. *, FL_E was excluded from the primary analysis and used for independent validation of the predictions in Figure 7. Previously published data sets are labeled by publication (refs. 2,3,24,27,64–67). (TXT 35 kb)

Supplementary Table 2. Overview of the ChIP-seq data used in this study.

ChIP-seq mapping of CTCF and H3K4me3 in 77 cell types and tissues used in the study. Signal portion of tags (SPOT) scores are a measure of enrichment and refer to the proportion of reads mapping within a DHS. Read counts include reads mapped uniquely with ≤2 mismatches to an autosomal chromosome; paired-end reads were required to both properly map to the same chromosome. Read counts are in millions. Previously published data sets are labeled by publication (refs. 2,17,44,68). (TXT 12 kb)

Supplementary Table 14. Clustering of motifs into TF families.

Clustering of motifs from the JASPAR, UniPROBE, TRANSFAC and Jolma et al.³⁵ databases. Each TF cluster is listed along with the names of constituent motifs. (TXT 34 kb)

Supplementary Data Set 1. SNPs tested for imbalance in DNA accessibility.

SNPs are listed by their hg19 coordinates. The rsID is used for SNPs in dbSNP 138. SNPs are classified as imbalanced as in Figure 1c. PctRef refers to the proportion of reads mapping to the reference allele (Fig. 1d). (TXT 27676 kb)
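
As a rough illustration of the PctRef quantity, the sketch below computes the reference-allele proportion from a pair of read counts. The accompanying two-sided binomial test against an expected proportion of 0.5 is an assumption added for illustration only; the classification criteria actually used are those of Figure 1c, which this page does not reproduce. The function names are hypothetical.

```python
from math import comb


def pct_ref(ref_reads, alt_reads):
    """Proportion of reads mapping to the reference allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("site has no reads")
    return ref_reads / total


def binomial_two_sided_p(ref_reads, alt_reads, p=0.5):
    """Exact two-sided binomial P value (assumed test, not the paper's).

    Sums the probabilities of all outcomes that are no more likely than
    the observed one. Intended for modest read depths only, since the
    full probability mass function is enumerated.
    """
    n = ref_reads + alt_reads
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Small tolerance so outcomes tied with the observed one are included.
    return min(1.0, sum(x for x in pmf if x <= observed * (1 + 1e-9)))
```

For example, `pct_ref(30, 10)` is 0.75, and a small value from `binomial_two_sided_p` would suggest that such a skew is unlikely under equal representation of the two alleles.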

Supplementary Data Set 2. TF clusters of similar motifs.

Motif weblogos from the JASPAR, UniPROBE and Jolma et al.³⁵ databases grouped by TF cluster. Motifs from TRANSFAC are listed by name without showing a weblogo. (PDF 23365 kb)

Supplementary Data Set 3. SNVs predicted to affect DNA accessibility.

List of SNVs from dbSNP 138 overlapping a TF recognition sequence in a DHS hotspot predicted to affect accessibility with a score greater than 0.10. The file is in extended bed format using hg19 coordinates and includes a header line. Each row contains the SNP coordinates and dbSNP ID, a score scaled as the probability of imbalance, the PWM name and strand, the position of the SNP relative to the PWM match and the two alleles of the SNP. (ZIP 9362 kb)
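
A file laid out as described could be read along the following lines. This is a sketch under stated assumptions: the file is taken to be tab-delimited, the column order is assumed to follow the order of the description above, and the two alleles are assumed to occupy separate columns. None of this is confirmed by the description, and the `PredictedSnv` type and `read_predictions` function are hypothetical names.

```python
import csv
from typing import Iterator, NamedTuple, TextIO


class PredictedSnv(NamedTuple):
    chrom: str
    start: int        # hg19, BED-style start
    end: int
    snp_id: str       # dbSNP ID
    score: float      # scaled as the probability of imbalance
    pwm: str          # PWM name
    strand: str
    pwm_offset: int   # SNP position relative to the PWM match
    allele_a: str
    allele_b: str


def read_predictions(handle: TextIO, min_score: float = 0.10) -> Iterator[PredictedSnv]:
    """Yield records with score >= min_score from an open text handle.

    Assumed column order (not confirmed by the source):
    chrom, start, end, ID, score, PWM, strand, offset, allele, allele.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # the file is described as having a header line
    for row in reader:
        if not row:
            continue
        chrom, start, end, snp_id, score, pwm, strand, offset, a, b = row[:10]
        record = PredictedSnv(chrom, int(start), int(end), snp_id,
                              float(score), pwm, strand, int(offset), a, b)
        if record.score >= min_score:
            yield record
```

Passing a higher `min_score`, for example `read_predictions(handle, min_score=0.5)`, would restrict the output to variants with stronger predicted effects; the actual header of the file should be checked before relying on the assumed column order.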

About this article

Cite this article

Maurano, M., Haugen, E., Sandstrom, R. et al. Large-scale identification of sequence variants influencing human transcription factor occupancy in vivo. Nat Genet 47, 1393–1401 (2015). https://doi.org/10.1038/ng.3432

Received: 17 July 2015
Accepted: 02 October 2015
Published: 26 October 2015
Issue Date: December 2015
DOI: https://doi.org/10.1038/ng.3432
