Systematic analysis of binding of transcription factors to noncoding variants

Yan, Jian; Qiu, Yunjiang; Ribeiro dos Santos, André M.; Yin, Yimeng; Li, Yang E.; Vinckier, Nick; Nariai, Naoki; Benaglio, Paola; Raman, Anugraha; Li, Xiaoyu; Fan, Shicai; Chiou, Joshua; Chen, Fulin; Frazer, Kelly A.; Gaulton, Kyle J.; Sander, Maike; Taipale, Jussi; Ren, Bing

doi:10.1038/s41586-021-03211-0

Article
Published: 27 January 2021

Nature volume 591, pages 147–151 (2021)

Abstract

Many sequence variants have been linked to complex human traits and diseases¹, but deciphering their biological functions remains challenging, as most of them reside in noncoding DNA. Here we have systematically assessed the binding of 270 human transcription factors to 95,886 noncoding variants in the human genome using an ultra-high-throughput multiplex protein–DNA binding assay, termed single-nucleotide polymorphism evaluation by systematic evolution of ligands by exponential enrichment (SNP-SELEX). The resulting 828 million measurements of transcription factor–DNA interactions enable estimation of the relative affinity of these transcription factors to each variant in vitro and evaluation of the current methods to predict the effects of noncoding variants on transcription factor binding. We show that the position weight matrices of most transcription factors lack sufficient predictive power, whereas the support vector machine combined with the gapped k-mer representation shows much improved performance when assessed on results from independent SNP-SELEX experiments involving a new set of 61,020 sequence variants. We report highly predictive models for 94 human transcription factors and demonstrate their utility in genome-wide association studies and understanding of the molecular pathways involved in diverse human traits and diseases.

**Fig. 1: High-throughput analysis of the binding of human transcription factors to common sequence variants by SNP-SELEX.**

**Fig. 2: Evaluation of the current PWM models using the SNP-SELEX data.**

**Fig. 3: DeltaSVM models outperform ΔPWM in predicting differential transcription factor binding to noncoding variants in vitro and in vivo.**

**Fig. 4: deltaSVM models predict TFs probably involved in complex traits and diseases.**

Data availability

Sequencing data generated in this study can be accessed via Gene Expression Omnibus (GEO) under accession number GSE118725. The raw sequencing data for transcription factor ChIP–seq of GM12878 is extracted from the ENCODE portal (https://www.encodeproject.org). The specific transcription factor data can be accessed by searching the accession numbers listed in Supplementary Table 4. The web portal (http://renlab.sdsc.edu/GVATdb/) provides a searchable interface for SNPs and transcription factors tested in the current study. Enriched motifs for SNP-SELEX experiments using Homer are presented in Supplementary File 1. Scores for all tested SNP–transcription factor pairs from SNP-SELEX experiments are shown in Supplementary File 2. The data for high-confidence allelic binding of 94 transcription factors to all common SNPs in the human genome predicted by deltaSVM models are presented in Supplementary File 3.

Code availability

Custom code used to process the data and generate the results described in the current study has been deposited on GitHub (https://github.com/ren-lab/snp-selex).

References

Buniello, A. et al. The NHGRI-EBI GWAS catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47 (D1), D1005–D1012 (2019).
Weirauch, M. T. et al. Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 31, 126–134 (2013).
Yin, Y. et al. Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science 356, eaaj2239 (2017).
Jolma, A. et al. DNA-binding specificities of human transcription factors. Cell 152, 327–339 (2013).
Ruan, S., Swamidass, S. J. & Stormo, G. D. BEESEM: estimation of binding energy models using HT-SELEX data. Bioinformatics 33, 2288–2295 (2017). https://doi.org/10.1093/bioinformatics/btx191.
Farley, E. K. et al. Suboptimization of developmental enhancers. Science 350, 325–328 (2015).
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Arnold, C. D. et al. Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science 339, 1074–1077 (2013).
Altshuler, D. M. et al. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010).
Lee, D. et al. A method to predict the impact of regulatory variants from DNA sequence. Nat. Genet. 47, 955–961 (2015).
Rohs, R. et al. The role of DNA shape in protein–DNA recognition. Nature 461, 1248–1253 (2009).
Morgunova, E. et al. Two distinct DNA sequences recognized by transcription factors represent enthalpy and entropy optima. eLife 7, e32963 (2018).
Mahajan, A. et al. Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps. Nat. Genet. 50, 1505–1513 (2018).
Greenwald, W. W. et al. Pancreatic islet chromatin accessibility and conformation reveals distal enhancer networks of type 2 diabetes risk. Nat. Commun. 10, 2078 (2019).
Battle, A., Brown, C. D., Engelhardt, B. E. & Montgomery, S. B. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
Olefsky, J., Farquhar, J. W. & Reaven, G. Relationship between fasting plasma insulin level and resistance to insulin-mediated glucose uptake in normal and diabetic subjects. Diabetes 22, 507–513 (1973).
Soyal, S. M. et al. Associations of haplotypes upstream of IRS1 with insulin resistance, type 2 diabetes, dyslipidemia, preclinical atherosclerosis, and skeletal muscle LOC646736 mRNA levels. J. Diabetes Res. 2015, 405371 (2015).
Manning, A. K. et al. A genome-wide approach accounting for body mass index identifies genetic variants influencing fasting glycemic traits and insulin resistance. Nat. Genet. 44, 659–669 (2012).
Scott, R. A. et al. Large-scale association analyses identify new loci influencing glycemic traits and provide insight into the underlying biological pathways. Nat. Genet. 44, 991–1005 (2012).
Nordquist, N. et al. The transcription factor TFAP2B is associated with insulin resistance and adiposity in healthy adolescents. Obesity 17, 1762–1767 (2009).
Apazoglou, K. et al. Antidepressive effects of targeting ELK-1 signal transduction. Nat. Med. 24, 591–597 (2018).
Leonardini, A., Laviola, L., Perrini, S., Natalicchio, A. & Giorgino, F. Cross-talk between PPARγ and insulin signaling and modulation of insulin sensitivity. PPAR Res. 2009, 818945 (2009).
Fruchart, J. C., Duriez, P. & Staels, B. Peroxisome proliferator-activated receptor-alpha activators regulate genes governing lipoprotein metabolism, vascular inflammation and atherosclerosis. Curr. Opin. Lipidol. 10, 245–257 (1999).
Shachter, N. S. Apolipoproteins C-I and C-III as important modulators of lipoprotein metabolism. Curr. Opin. Lipidol. 12, 297–304 (2001).
Crosby, J. et al. Loss-of-function mutations in APOC3, triglycerides, and coronary disease. N. Engl. J. Med. 371, 22–31 (2014).
Gotto, A. M., Jr. Triglyceride as a risk factor for coronary artery disease. Am. J. Cardiol. 82 (9A), 22Q–25Q (1998).
Khetarpal, S. A., Qamar, A., Millar, J. S. & Rader, D. J. Targeting ApoC-III to reduce coronary disease risk. Curr. Atheroscler. Rep. 18, 54 (2016).
Jolma, A. et al. DNA-dependent formation of transcription factor pairs alters their binding specificity. Nature 527, 384–388 (2015).
Kato, N. Insights into the genetic basis of type 2 diabetes. J. Diabetes Investig. 4, 233–244 (2013).
Mahajan, A. et al. Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. Nat. Genet. 46, 234–244 (2014).
Johnson, A. D. et al. SNAP: a web-based tool for identification and annotation of proxy SNPs using HapMap. Bioinformatics 24, 2938–2939 (2008).
Thurman, R. E. et al. The accessible chromatin landscape of the human genome. Nature 489, 75–82 (2012).
Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014).
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).
Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).
Mathelier, A. et al. JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 44 (D1), D110–D115 (2016).
Nitta, K. R. et al. Conservation of transcription factor binding specificities across 600 million years of bilateria evolution. eLife 4, (2015).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Zhou, X., Lindsay, H. & Robinson, M. D. Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic Acids Res. 42, e91 (2014).
Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
Rao, S. S. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
Dixon, J. R. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012).
Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst. 3, 99–101 (2016).
Yan, J. et al. Transcription factor binding in human cells occurs in dense clusters formed around cohesin anchor sites. Cell 154, 801–813 (2013).
Zhang, Y. et al. Model-based analysis of ChIP–seq (MACS). Genome Biol. 9, R137 (2008).
van de Geijn, B., McVicker, G., Gilad, Y. & Pritchard, J. K. WASP: allele-specific software for robust molecular quantitative trait locus discovery. Nat. Methods 12, 1061–1063 (2015).
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
Van der Auwera, G. A. et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr. Protoc. Bioinformatics 43, 11.10.1–11.10.3 (2013).
Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. 27, 801–812 (2017).
Browning, S. R. & Browning, B. L. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81, 1084–1097 (2007).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).
Anders, S., Pyl, P. T. & Huber, W. HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169 (2015).
Harrow, J. et al. GENCODE: the reference human genome annotation for the ENCODE project. Genome Res. 22, 1760–1774 (2012).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Dennis, G., Jr et al. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 4, 3 (2003).
Cock, P. J. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009).
Zuo, C., Shin, S. & Keleş, S. atSNP: transcription factor binding affinity testing for regulatory SNP detection. Bioinformatics 31, 3353–3355 (2015).
Lee, D. LS-GKM: a new gkm-SVM for large-scale datasets. Bioinformatics 32, 2196–2198 (2016).
Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12, 77 (2011).
Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
Dubois, P. C. et al. Multiple common variants for celiac disease influencing immune gene expression. Nat. Genet. 42, 295–302 (2010).
Willer, C. J. et al. Discovery and refinement of loci associated with lipid levels. Nat. Genet. 45, 1274–1283 (2013).
Lambert, J. P. et al. Mapping differential interactomes by affinity purification coupled with data-independent mass spectrometry acquisition. Nat. Methods 10, 1239–1245 (2013).
Okada, Y. et al. Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature 506, 376–381 (2014).
Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature 511, 421–427 (2014).
Bentham, J. et al. Genetic association analyses implicate aberrant regulation of innate and adaptive immunity genes in the pathogenesis of systemic lupus erythematosus. Nat. Genet. 47, 1457–1464 (2015).
de Lange, K. M. et al. Genome-wide association study implicates immune activation of multiple integrin genes in inflammatory bowel disease. Nat. Genet. 49, 256–261 (2017).
Nelson, C. P. et al. Association analyses based on false discovery rate implicate new loci for coronary artery disease. Nat. Genet. 49, 1385–1391 (2017).
Wray, N. R. et al. Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nat. Genet. 50, 668–681 (2018).

Acknowledgements

We thank S. Preissl and S. A. Chen for insightful comments during manuscript preparation; and S. Kuan, Z. Liu and B. Li for technical assistance. This work was supported by the Ludwig Institute for Cancer Research (B.R.), NIDDK (U01 DK105541 to B.R., M.S., and K.F.), Vetenskapsrådet Sweden (537-2014-6796 to J.Y.), and a CAPES foundation fellowship (BEX 5304/15-6 to A.M.R.S.).

Author information

These authors contributed equally: Jian Yan, Yunjiang Qiu, André M. Ribeiro dos Santos

Authors and Affiliations

School of Medicine, Northwest University, Xi’an, China
Jian Yan, Xiaoyu Li & Fulin Chen
Ludwig Institute for Cancer Research, La Jolla, CA, USA
Jian Yan, Yunjiang Qiu, André M. Ribeiro dos Santos, Yang E. Li, Anugraha Raman & Bing Ren
Department of Biomedical Sciences, City University of Hong Kong, Hong Kong SAR, China
Jian Yan & Xiaoyu Li
Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Solna, Sweden
Jian Yan, Yimeng Yin & Jussi Taipale
Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, USA
Yunjiang Qiu & Anugraha Raman
Universidade Federal do Pará, Institute of Biological Sciences, Belém, Brazil
André M. Ribeiro dos Santos
Department of Biochemistry, University of Cambridge, Cambridge, UK
Yimeng Yin & Jussi Taipale
Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
Yang E. Li, Maike Sander & Bing Ren
Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
Nick Vinckier, Naoki Nariai, Paola Benaglio, Shicai Fan, Joshua Chiou, Kelly A. Frazer, Kyle J. Gaulton & Maike Sander
Genome-Scale Biology Program, University of Helsinki, Helsinki, Finland
Jussi Taipale
Center for Epigenomics, University of California San Diego, La Jolla, CA, USA
Bing Ren

Contributions

B.R., M.S., K.J.G., K.A.F., J.T. and J.Y. conceived the project. J.Y., Y.Y., X.L., N.N., and N.V. carried out experiments. Y.Q., A.M.R.d.S., Y.E.L., A.R., S.F., P.B., F.C. and J.C. performed data analysis. J.Y., Y.Q., A.M.R.d.S., J.T. and B.R. wrote the manuscript with input from all co-authors.

Corresponding authors

Correspondence to Jian Yan, Jussi Taipale or Bing Ren.

Ethics declarations

Competing interests

B.R. is a co-founder and consultant for Arima Genomics and a co-founder of Epigenome Technologies.

Additional information

Peer review information Nature thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 The sequence features of input oligonucleotides.

a, An example of the oligo design for SNP-SELEX. Two random nucleotides were added to each end of the oligos as unique molecular identifiers (UMIs) to remove over-represented PCR duplicates. The Illumina TruSeq dual-index system was adapted for oligo design. b, The GC content (left) and CpG frequency (right) of SNP-SELEX input were more similar to those of TF binding sites in the human genome (TFBS), open chromatin (DHS) and the entire human genome in general (hg19) than were those of the random sequences used in HT-SELEX. c, Comparison of k-mer coverage (left) and sequencing depth (right) of libraries between SNP-SELEX and HT-SELEX.
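As a rough illustration of the UMI idea in panel a, the sketch below collapses reads that share both the same oligo and the same pair of 2-nt end tags. The read layout (2-nt tag, oligo, 2-nt tag, adapters already trimmed), the function name `dedup_by_umi` and the toy reads are assumptions made for illustration; this is not the authors' pipeline.

```python
from collections import Counter

UMI_LEN = 2  # two random nucleotides at each end, as described in panel a


def dedup_by_umi(reads):
    """Count distinct (UMI, oligo) combinations per oligo.

    Assumes each read is laid out as 5' tag + oligo + 3' tag with adapters
    already removed (a hypothetical layout chosen for this sketch).
    """
    seen = set()
    counts = Counter()
    for read in reads:
        umi = read[:UMI_LEN] + read[-UMI_LEN:]
        oligo = read[UMI_LEN:-UMI_LEN]
        if (umi, oligo) not in seen:
            # First time this molecule is observed: count it once.
            seen.add((umi, oligo))
            counts[oligo] += 1
    return counts


# Toy reads: the first two are exact PCR duplicates of one molecule.
reads = ["ACGGATTACATG", "ACGGATTACATG", "TTGGATTACACA", "ACCCCTTTTTTG"]
dedup_counts = dedup_by_umi(reads)
```

With only four random positions there are at most 256 distinct tags per oligo, so a scheme like this can flag strongly over-represented duplicates but cannot resolve every molecule at high depth.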

Extended Data Fig. 2 Derivation of OBS and PBS.

a, Equations demonstrate the relationships between OBS and the association constant (Ka) of TF-DNA interactions. b, An example of how oligonucleotides were evolutionarily selected during SNP-SELEX. The table of counts for oligonucleotide chr6:31171810-31171850 is shown on the left and the OBS curve is shown on the right. c, d, Histograms show the number of oligonucleotide sequences bound by each TF (c) and the number of binding TFs for each oligonucleotide sequence (d). e, An example of how the abundance of SNPs varies in the course of a SNP-SELEX experiment. The table of counts for SNP rs9263880 is shown on the left and the PBS curve is shown on the right. The orange line inside the black boxes indicates the reads of the T-allele-containing fragment and the blue line shows the reads of the C-allele-containing fragment.

Extended Data Fig. 3 Reproducibility of SNP-SELEX data.

a, Density plots show an example of the distribution of OBS of all oligos assayed in ELK SNP-SELEX replicate experiments. Vertical dashed lines indicate the cut-off for significant binding sequences (P = 0.05 by Monte Carlo randomization). The 40-bp genomic sequences with OBS over the indicated values are recognized as significant binding sites of ELK1 or ELK4. DBD: DNA binding domain. FL: full-length protein. b, Density plots show an example of the distribution of PBS of all oligos assayed in ELK SNP-SELEX replicate experiments. Vertical dashed lines indicate the cut-off for significantly differential binding (P = 0.01 by Monte Carlo randomization). The 40-bp SNP-containing genomic sequences with PBS over the indicated values are recognized as significantly differential (allelic) binding sites of ELK1 or ELK4. DBD: DNA binding domain. FL: full-length protein. c, An example illustrating differential DNA binding at six SNPs in four SNP-SELEX experiments: (i) two full-length ELK1 replicates, on the first two lines; (ii) one DNA binding domain (DBD) ELK1, on the third line; and (iii) one full-length ELK4 TF, which belongs to the same structural family, on the last line. Each panel represents the logarithmic odds ratio (y-axis) of observing the reference allele (REF), represented by a triangle, and the alternative allele (ALT), represented by a circle, over SNP-SELEX cycles (x-axis). The two alleles of each SNP are coloured according to their nucleotides, where A is red, C is green, G is blue, and T is yellow. The figure shows that replicate experiments, full-length and DBD constructs, and TFs of the same structural family present the same allelic preference. d, e, Comparison of oligonucleotide enrichment (d) and allele preference (e) between different biological replicates (replicates), full-length (FL) and DNA binding domain (DBD) constructs, members of the same structural family (family), and random pairs (others). For each pair of experiments, oligonucleotide enrichment was compared using the oligonucleotides that displayed binding in both experiments, and allele preference was compared using the Pearson correlation coefficient (PCC) between the PBS values from each experiment. Horizontal line is median; hinges are 25th and 75th percentiles; whiskers extend to the most extreme value no further than 1.5× interquartile range (IQR) from the hinge.
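The box plot convention repeated in these captions (median line, hinges at the 25th and 75th percentiles, whiskers ending at the most extreme value within 1.5× IQR of a hinge) can be sketched as follows. The quartile interpolation rule used here (`method="inclusive"`) and the toy values are assumptions; the plotting software used for the figures may compute hinges slightly differently.

```python
import statistics


def boxplot_summary(values):
    """Median, hinges, whisker ends and outliers for one group of values."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    lower_whisker = min(v for v in data if v >= lower_fence)
    upper_whisker = max(v for v in data if v <= upper_fence)
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "median": median,
        "hinges": (q1, q3),
        "whiskers": (lower_whisker, upper_whisker),
        "outliers": outliers,
    }


# Toy values with one extreme point that falls outside the upper fence.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

Note that the whiskers end at observed data points rather than at the fences themselves, which is why an isolated extreme value is drawn separately.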

Extended Data Fig. 4 SNP-SELEX results are correlated with TF binding in vitro and in vivo.

a, Comparison of the SNPs with differential TF binding determined by SNP-SELEX and ΔPWM. An error matrix table showing the number of SNPs for which the same allele was identified as the preferred allele by both methods (Agreed), SNPs for which one allele was determined as the preferential substrate by one method but no allele was called by the other (PWM+/SNP-SELEX– and PWM–/SNP-SELEX+), and SNPs where different alleles were called as preferentially bound by each method (Contradictory). Note that the vast majority of the results agreed, with most of the disagreement coming from PWM+/SNP-SELEX–. b, Comparison of the PWM scores (left) and the OBS scores (right) between SNPs with concordant and discordant predictions. Note that discordant predictions mostly come from weak binding sites with low PWM scores and low OBS scores. Two-sided Mann–Whitney U test P value is shown on the top. Horizontal line is median; hinges are 25th and 75th percentiles; whiskers extend to the most extreme value no further than 1.5× IQR from the hinge. c, Box plots show performance of ΔPWM in predicting pbSNPs grouped by DNA binding domain structural families (left) and information content of motifs for each corresponding TF family (right). AUPRC is used to evaluate the performance of ΔPWM. Horizontal line is median; hinges are 25th and 75th percentiles; whiskers extend to the most extreme value no further than 1.5× IQR from the hinge. d, Box plots show PCC between PBS and ΔPWM (left) and information content (right) for each TF family. PCCs for some TF families are higher than others, independent of the information content (IC) of corresponding PWM models. Horizontal line is median; hinges are 25th and 75th percentiles; whiskers extend to the most extreme value no further than 1.5× IQR from the hinge. e, A scatter plot shows the correlation of PBS and allelic binding ratio derived from SNP-SELEX and ChIP–seq in GM12878 cells, respectively. The PCCs and P values calculated based on t-test are shown in the lower right corner. The allelic binding ratio is computed as the log₁₀ odds ratio over input (see Methods for details). In total, 341 TF-SNP pairs including 269 unique SNPs and six TFs were plotted. TFs used include ATF2, PKNOX1, IRF3, NR2F1, YBX1, and TBX21.
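The four categories of the error matrix in panel a can be expressed as a small classification rule. The sketch below is only illustrative: the representation of a call as `"ref"`, `"alt"` or `None`, the extra "No call" label and the toy pairs are assumptions, and ASCII hyphens stand in for the minus signs in the category names.

```python
from collections import Counter


def concordance_category(pwm_call, selex_call):
    """Classify one SNP-TF pair by comparing the two methods' calls.

    Each call is the preferred allele ("ref" or "alt"), or None when the
    method calls no allelic preference.
    """
    if pwm_call is None and selex_call is None:
        return "No call"
    if selex_call is None:
        return "PWM+/SNP-SELEX-"
    if pwm_call is None:
        return "PWM-/SNP-SELEX+"
    return "Agreed" if pwm_call == selex_call else "Contradictory"


# Hypothetical (ΔPWM call, SNP-SELEX call) pairs.
pairs = [("ref", "ref"), ("alt", None), (None, "alt"), ("ref", "alt"), ("alt", "alt")]
table = Counter(concordance_category(p, s) for p, s in pairs)
```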

Extended Data Fig. 5 SNP-SELEX results are correlated with allelic enhancer activities detected using high-throughput reporter assays.

a, A schematic diagram shows the strategy of using STARR-seq to assess the effect of SNPs on enhancer activity in HepG2 and HEK 293T cells. b, Heat map shows pair-wise PCCs calculated among STARR-seq datasets. The read counts of each SNP in the starting reporter library, in the mRNA pools in three HepG2 replicates, and three HEK 293T replicates were used for PCC calculation. c, MA plot of the logarithmic fold-change (y-axis) of read counts of SNP-containing mRNA over that of the input library expressed as logarithmic counts per million (CPM) (x-axis) for HEK 293T, on the top panel, and HepG2, on the bottom panel. Each dot on the plot corresponds to an oligonucleotide, and the oligonucleotides showing enrichment (empirical FDR < 0.05) are coloured in red. d, Bar plots comparing the fractions of paSNPs determined using STARR-seq in pbSNPs and non-pbSNPs by SNP-SELEX. Odds ratio (OR) is shown between imbalanced and balanced SNPs, and the P value is calculated by Fisher's exact test. Error bars denote the 95% confidence interval calculated by the Wilson method (Methods). Only pbSNPs corresponding to the highly expressed TFs (RPKM > 3) in the cell lines are considered for the analysis. n = 167 SNP-cell pairs for pbSNPs; n = 509 SNP-cell pairs for non-pbSNPs. e, Bar plots comparing the fractions of paSNPs determined using STARR-seq in pbSNPs and non-pbSNPs predicted by ΔPWM. SNPs with P < 0.01 by atSNP were considered as pbSNPs. Odds ratio (OR) is shown between imbalanced and balanced SNPs, and the P value is calculated by Fisher's exact test. Error bars denote the 95% confidence interval calculated by the Wilson method (Methods). Only pbSNPs by highly expressed TFs (RPKM > 3) in the corresponding cell lines are considered for the analysis. n = 564 SNP-cell pairs for pbSNPs; n = 112 SNP-cell pairs for non-pbSNPs.
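For readers unfamiliar with the Wilson method named for the error bars, a minimal sketch of the standard Wilson score interval for a proportion follows. The group size of 167 is taken from panel d, but the count of 50 "successes" is a made-up placeholder, and the function is a textbook formula rather than the authors' code.

```python
import math


def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (z defaults to 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


# n = 167 comes from the caption; 50 is a hypothetical count for illustration.
low, high = wilson_interval(50, 167)
```

Unlike the simple normal approximation, this interval stays inside [0, 1] and is not symmetric around the observed fraction, which matters for the small groups plotted here.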

Extended Data Fig. 6 deltaSVM more accurately predicts effects of noncoding variants on TF binding in vivo than ΔPWM.

a, A schematic graph for the training of deltaSVM models for 533 TFs. Data from previously reported HT-SELEX experiments using random DNA oligonucleotide sequences were used to derive these models. To develop deltaSVM models for each TF, the reads in each HT-SELEX cycle beyond cycle 0 were used as positive training sets, and the reads not enriched were used as negative training sets. All unique 10-mers were scored using gapped k-mer models to compute weights for deltaSVM. The two alleles of the 40-bp SELEX oligos were then scored using deltaSVM models to generate deltaSVM scores. b, Box plots compare the performance of deltaSVM and of PWMs derived from HT-SELEX with the multinomial or BEESEM algorithm in predicting pbSNPs for 129 TFs. The results from fivefold cross-validation are shown. Two statistical evaluations were used: AUROC (left) and AUPRC (right). P values by two-sided Mann–Whitney U test are shown on the top. Horizontal line is median; hinges are 25th and 75th percentiles; whiskers extend to the most extreme value no further than 1.5× IQR from the hinge. c, d, Scatter plots compare the performance between deltaSVM (y-axis) and ΔPWM (x-axis) derived by multinomial models (c) and BEESEM models (d) in predicting allelic binding of 129 TFs for which both models were available. Results from fivefold cross-validation are shown. The values on both axes are AUPRC. e, An overview of the SNP-SELEX experimental procedure describing the novel batch of SNP-SELEX. f, A scatter plot compares the performance between deltaSVM (y-axis) and BEESEM-generated ΔPWM (x-axis) in predicting allelic binding of 87 TFs for which both models are available, using the novel batch of SNP-SELEX. The values on both axes are AUPRC. g, The logo describes the PWM model of a homodimeric binding pattern of TF HLF, with the monomeric half-site indicated by the purple arrows. The red boxes indicate the positions at which the SNP rs79124498 is located (left) and its co-dependent base position (right). The y-axis corresponds to the information content at each position of the PWM (x-axis).
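The scoring step in panel a (per-k-mer weights applied to the two alleles of an oligo) can be sketched as a sum of weight differences over every window that covers the variant. The alt-minus-ref sign convention, the default weight of zero for unseen k-mers, the omission of reverse-complement handling, and the toy 3-mer weights are all assumptions for illustration; real models use 10-mer weights produced by a trained gapped k-mer SVM.

```python
def delta_svm(ref_seq, alt_seq, snp_pos, weights, k=10):
    """Sum weight(alt k-mer) - weight(ref k-mer) over windows covering snp_pos.

    ref_seq and alt_seq are equal-length strings that differ at the 0-based
    position snp_pos; weights maps k-mers to model weights.
    """
    if len(ref_seq) != len(alt_seq):
        raise ValueError("alleles must have the same length")
    first = max(0, snp_pos - k + 1)          # leftmost window containing the SNP
    last = min(snp_pos, len(ref_seq) - k)    # rightmost window containing the SNP
    total = 0.0
    for start in range(first, last + 1):
        ref_kmer = ref_seq[start:start + k]
        alt_kmer = alt_seq[start:start + k]
        total += weights.get(alt_kmer, 0.0) - weights.get(ref_kmer, 0.0)
    return total


# Toy example with k = 3 so the weight table stays readable (hypothetical values).
toy_weights = {"ACG": 1.0, "CGT": 0.5, "CTT": 2.0}
score = delta_svm("ACGTA", "ACTTA", snp_pos=2, weights=toy_weights, k=3)
```

Because every overlapping window contributes, a variant can change the score through k-mers in which it sits at the edge as well as the centre, which is one way such a model can capture dependencies that a single PWM column cannot.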

Extended Data Fig. 7 Comparison of deltaSVM models and ΔPWM in predicting allelic TF binding in weak and strong TF binding sites.

SNPs are categorized into five quantiles based on the OBS of the 40-bp DNA fragments. The performance of ΔPWM (green) and deltaSVM (orange) in predicting allelic binding of TFs was evaluated for SNPs in each category. a, b, Two batches of pbSNPs were used as gold standards for performance assessment: the pbSNPs from the initial SNP-SELEX experiments, with fivefold cross-validation (a) and the novel batch SNP-SELEX data (b). Both AUROC (upper) and AUPRC (lower) are shown for statistical assessment of the model performance. The first quantile represents SNPs with the weakest binding strength and the fifth quantile represents SNPs with the strongest binding strength. Horizontal line is median; hinges are 25th and 75th percentile; whiskers are most extreme value no further than 1.5× IQR.
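
The box plot convention stated in these legends (median line, hinges at the 25th and 75th percentiles, whiskers at the most extreme values within 1.5× IQR of the hinges) can be sketched as follows. The function name `box_stats`, the choice of quartile interpolation method and the toy values are assumptions for illustration; the plotting software used for the figures may compute quartiles slightly differently.

```python
import statistics
from typing import Dict, List, Sequence, Union


def box_stats(values: Sequence[float]) -> Dict[str, Union[float, List[float]]]:
    """Return median, hinges, whisker ends and outliers for one box."""
    data = sorted(values)
    # Quartiles by linear interpolation between data points (an assumption).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations inside the fences.
    lower_whisker = min(v for v in data if v >= low_fence)
    upper_whisker = max(v for v in data if v <= high_fence)
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "median": median,
        "lower_hinge": q1,
        "upper_hinge": q3,
        "lower_whisker": lower_whisker,
        "upper_whisker": upper_whisker,
        "outliers": outliers,
    }


# Toy values only; these are not data from the study.
toy_box = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(toy_box)
```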

Extended Data Fig. 8 DeltaSVM models predict more accurately the noncoding variants affecting TF binding in vivo than ΔPWM.

a, DeltaSVM models outperform ΔPWM in predicting differential DNA binding in vitro. Precision-Recall curves were used to assess the performance of either model in predicting allelic binding events identified in SNP-SELEX for three TFs: ATF2, HLF and MAFG. In all three cases, the performance of deltaSVM models (purple) was much better than that of ΔPWM (yellow). The AUC used for quantitative comparison is shown within each plot. b, Bar plots show the fractions of pbSNPs exhibiting allelic imbalance in TF ChIP–seq assays in HepG2 cells among all SNPs that were predicted to be differentially bound by a TF according to the deltaSVM models (purple) or the ΔPWM (yellow). The same datasets as in Fig. 3e were used. Only SNPs that were predicted to be bound by the TF were used in the comparison. The threshold for oligonucleotide binding and for the predicted pbSNPs was determined as the median score for the bound oligonucleotides and pbSNPs, respectively. Error bars centred with mean percentage denote the 95% confidence interval calculated by Wilson method (Methods). For ΔPWM, n = 2,872 (ATF2); n = 4,134 (HLF); n = 100 (MAFG). For deltaSVM, n = 115 (ATF2); n = 355 (HLF); n = 16 (MAFG). c, Bar plots show the fractions of pbSNPs exhibiting allelic imbalance in TF ChIP–seq assays in GM12878 cells among all SNPs that were predicted as differentially bound by a TF according to the deltaSVM models (purple) or the ΔPWM (yellow). Three TFs were included in the analyses: ATF2, NR2F1 and PKNOX1. Only SNPs that were predicted to be bound by the TF were used in the comparison. The threshold for oligonucleotide binding and for the predicted pbSNPs was determined as the median score for the bound oligonucleotides and pbSNPs, respectively. Error bars centred with mean percentage denote the 95% confidence interval calculated by Wilson method (Methods). For ΔPWM, n = 4,318 (ATF2); n = 673 (NR2F1); n = 225 (PKNOX1). For deltaSVM, n = 142 (ATF2); n = 229 (NR2F1); n = 142 (PKNOX1). d, Similar to Fig. 3e, deltaSVM models outperform ΔPWM in predicting differential DNA binding in vivo. Three TF ChIP–seq datasets from GM12878 cells were used for the comparison, including the same dataset as shown in b. Elbow plots show that for each TF, the top-ranked allelic SNPs predicted by deltaSVM models were found to have allelic imbalance in ChIP–seq assays performed in GM12878 cells (purple). By contrast, for allelic SNPs predicted by ΔPWM, only a small fraction showed allelic imbalance in vivo (yellow).
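
The Wilson confidence interval used for the error bars in b and c can be computed as in the sketch below, which uses the standard Wilson score formula for a binomial proportion. The function name `wilson_interval` and the example counts are assumptions for illustration; the counts are not taken from the study.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wilson_interval(successes: int, n: int,
                    confidence: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion successes / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts: 50 SNPs with allelic imbalance out of 100 predicted.
low, high = wilson_interval(50, 100)
print(f"{low:.3f} to {high:.3f}")
```

Unlike the simple normal approximation, the Wilson interval stays within 0 to 1 and behaves sensibly for small n, which matters for groups as small as n = 16.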

Extended Data Fig. 9 T2D risk SNPs are enriched for pbSNPs.

a, Bar plots show the enrichment of pbSNPs in T2D risk SNPs identified from an independent study¹⁴. The levels of enrichment are displayed for different groups of risk SNPs categorized based on the PPA (Posterior Probability of Association). Note that SNPs with stronger PPAs and thus higher likelihood of being causal for T2D are more likely to be pbSNPs. b, Bar plots show the enrichment of T2D risk SNPs in allelic TF binding SNPs predicted by PWM models using the same credible sets as Fig. 4a (ref. ¹³). Specifically, SNPs with P < 0.01 by atSNP were used as allelic TF binding SNPs. The level of association is categorized according to PPA as in a. Note that the likely causal SNPs with stronger T2D risk association no longer display higher enrichment for ΔPWM-predicted allelic SNPs. c, The T2D GWAS leading SNP rs7578326, a pbSNP differentially bound by the TFs CEBPB, CEBPE, MYBL2 and NFE2, is predicted to target the IRS1 gene based on Hi-C analysis (circled in blue in bottom panel) in HepG2 cells. The locus around the SNP is enriched for H3K27ac and H3K4me1. d, CRISPRi using dCas9 fused with the repressive KRAB domain and guide RNA targeting the locus of SNP rs7578326 (upper) leads to reduced expression of the IRS1 gene in HepG2 but not in HEK 293T cells. qPCR results from three biological replicates in HepG2 (left) and HEK 293 (right) cells are plotted in the bottom panel. The y-axis shows the power-transformed expression values, presented as mean ± s.d. Raw data are shown as small black circles for clarity. P values computed using two-sided t-test are noted on the top. e, SNP rs7578326 is an eQTL in liver and adipose tissues. Normalized expression values from the GTEx project for the IRS1 gene are grouped based on individuals’ genotype of SNP rs7578326. Linear regression P values and effect sizes are noted on the top. Horizontal line is median; hinges are 25th and 75th percentile; whiskers are most extreme value no further than 1.5× IQR.

Extended Data Fig. 10 Candidate TFs involved in complex traits and diseases identified by enrichment of TF binding alone.

a, A heat map shows the significant enrichment of SNPs predicted to be located within TF–DNA binding sites among trait- or disease-associated SNPs. The colour key is shown, and the value represents the −log₁₀ P value. TF–trait pairs mentioned in the text are marked with *. Note that the SNPs here do not necessarily affect TF binding affinity. The candidate regulator we observed and validated (Fig. 4b) could not be identified here if we only used the presence of SNPs at the binding sites without taking into account the effect of the SNP on binding affinity. b, d, qPCR results from three biological replicates of MAFG (b) and HLF (d) in WT (HepG2), Control (Negative and HiPerfect), and cells treated with different siRNAs. Expression values are presented as mean ± s.d. c, e, MA-plots showing differentially expressed genes comparing MAFG knockdown (c) and HLF knockdown (e) versus controls. Significant differentially expressed genes (FDR < 0.2) are marked in red.

Supplementary information

Reporting Summary

Supplementary Table 1

List of variants tested by SNP-SELEX.

Supplementary Table 2

List of TFs tested by SNP-SELEX.

Supplementary Table 3

List of pbSNPs.

Supplementary Table 4

List of TF ChIP-seq data.

Supplementary Table 5

List of STARR-seq oligos.

Supplementary Table 6

List of paSNPs.

Supplementary Table 7

Summary statistics of performance of deltaSVM models and ΔPWM models in predicting pbSNPs.

Supplementary Table 8

List of pbSNPs in the novel batch.

Supplementary Table 9

List of GWAS summary statistics used in Fig. 4b.

Supplementary Table 10

Summary statistics of HepG2 haplotype phasing.

Supplementary Data 1

Enriched motifs for SNP-SELEX experiments.

Supplementary Data 2

Scores for all tested SNP-TF pairs by SNP-SELEX experiments.

Supplementary Data 3

Allelic binding of all common SNPs in the human genome, as predicted by the 94 high-confidence deltaSVM models.

About this article

Cite this article

Yan, J., Qiu, Y., Ribeiro dos Santos, A.M. et al. Systematic analysis of binding of transcription factors to noncoding variants. Nature 591, 147–151 (2021). https://doi.org/10.1038/s41586-021-03211-0

Received: 28 November 2018
Accepted: 11 December 2020
Published: 27 January 2021
Issue Date: 04 March 2021
DOI: https://doi.org/10.1038/s41586-021-03211-0