Many sequence variants have been linked to complex human traits and diseases (ref. 1), but deciphering their biological functions remains challenging, as most of them reside in noncoding DNA. Here we have systematically assessed the binding of 270 human transcription factors to 95,886 noncoding variants in the human genome using an ultra-high-throughput multiplex protein–DNA binding assay, termed single-nucleotide polymorphism evaluation by systematic evolution of ligands by exponential enrichment (SNP-SELEX). The resulting 828 million measurements of transcription factor–DNA interactions enable estimation of the relative affinity of these transcription factors to each variant in vitro and evaluation of the current methods to predict the effects of noncoding variants on transcription factor binding. We show that the position weight matrices of most transcription factors lack sufficient predictive power, whereas the support vector machine combined with the gapped k-mer representation shows much improved performance when assessed on results from independent SNP-SELEX experiments involving a new set of 61,020 sequence variants. We report highly predictive models for 94 human transcription factors and demonstrate their utility in genome-wide association studies and in understanding the molecular pathways involved in diverse human traits and diseases.
Sequencing data generated in this study can be accessed via Gene Expression Omnibus (GEO) under accession number GSE118725. The raw sequencing data for transcription factor ChIP–seq of GM12878 is extracted from the ENCODE portal (https://www.encodeproject.org). The specific transcription factor data can be accessed by searching the accession numbers listed in Supplementary Table 4. The web portal (http://renlab.sdsc.edu/GVATdb/) provides a searchable interface for SNPs and transcription factors tested in the current study. Enriched motifs for SNP-SELEX experiments using Homer are presented in Supplementary File 1. Scores for all tested SNP–transcription factor pairs from SNP-SELEX experiments are shown in Supplementary File 2. The data for high-confidence allelic binding of 94 transcription factors to all common SNPs in the human genome predicted by deltaSVM models are presented in Supplementary File 3.
Custom code used to process the data and generate the results described in the current study has been deposited on GitHub (https://github.com/ren-lab/snp-selex).
Buniello, A. et al. The NHGRI-EBI GWAS catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47 (D1), D1005–D1012 (2019).
Weirauch, M. T. et al. Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 31, 126–134 (2013).
Yin, Y. et al. Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science 356, eaaj2239 (2017).
Jolma, A. et al. DNA-binding specificities of human transcription factors. Cell 152, 327–339 (2013).
Ruan, S., Swamidass, S. J. & Stormo, G. D. BEESEM: estimation of binding energy models using HT-SELEX data. Bioinformatics 33, 2288–2295 (2017). https://doi.org/10.1093/bioinformatics/btx191.
Farley, E. K. et al. Suboptimization of developmental enhancers. Science 350, 325–328 (2015).
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Arnold, C. D. et al. Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science 339, 1074–1077 (2013).
Altshuler, D. M. et al. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010).
Lee, D. et al. A method to predict the impact of regulatory variants from DNA sequence. Nat. Genet. 47, 955–961 (2015).
Rohs, R. et al. The role of DNA shape in protein–DNA recognition. Nature 461, 1248–1253 (2009).
Morgunova, E. et al. Two distinct DNA sequences recognized by transcription factors represent enthalpy and entropy optima. eLife 7, e32963 (2018).
Mahajan, A. et al. Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps. Nat. Genet. 50, 1505–1513 (2018).
Greenwald, W. W. et al. Pancreatic islet chromatin accessibility and conformation reveals distal enhancer networks of type 2 diabetes risk. Nat. Commun. 10, 2078 (2019).
Battle, A., Brown, C. D., Engelhardt, B. E. & Montgomery, S. B. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
Olefsky, J., Farquhar, J. W. & Reaven, G. Relationship between fasting plasma insulin level and resistance to insulin-mediated glucose uptake in normal and diabetic subjects. Diabetes 22, 507–513 (1973).
Soyal, S. M. et al. Associations of haplotypes upstream of IRS1 with insulin resistance, type 2 diabetes, dyslipidemia, preclinical atherosclerosis, and skeletal muscle LOC646736 mRNA levels. J. Diabetes Res. 2015, 405371 (2015).
Manning, A. K. et al. A genome-wide approach accounting for body mass index identifies genetic variants influencing fasting glycemic traits and insulin resistance. Nat. Genet. 44, 659–669 (2012).
Scott, R. A. et al. Large-scale association analyses identify new loci influencing glycemic traits and provide insight into the underlying biological pathways. Nat. Genet. 44, 991–1005 (2012).
Nordquist, N. et al. The transcription factor TFAP2B is associated with insulin resistance and adiposity in healthy adolescents. Obesity 17, 1762–1767 (2009).
Apazoglou, K. et al. Antidepressive effects of targeting ELK-1 signal transduction. Nat. Med. 24, 591–597 (2018).
Leonardini, A., Laviola, L., Perrini, S., Natalicchio, A. & Giorgino, F. Cross-talk between PPARγ and insulin signaling and modulation of insulin sensitivity. PPAR Res. 2009, 818945 (2009).
Fruchart, J. C., Duriez, P. & Staels, B. Peroxisome proliferator-activated receptor-alpha activators regulate genes governing lipoprotein metabolism, vascular inflammation and atherosclerosis. Curr. Opin. Lipidol. 10, 245–257 (1999).
Shachter, N. S. Apolipoproteins C-I and C-III as important modulators of lipoprotein metabolism. Curr. Opin. Lipidol. 12, 297–304 (2001).
Crosby, J. et al. Loss-of-function mutations in APOC3, triglycerides, and coronary disease. N. Engl. J. Med. 371, 22–31 (2014).
Gotto, A. M., Jr. Triglyceride as a risk factor for coronary artery disease. Am. J. Cardiol. 82 (9A), 22Q–25Q (1998).
Khetarpal, S. A., Qamar, A., Millar, J. S. & Rader, D. J. Targeting ApoC-III to reduce coronary disease risk. Curr. Atheroscler. Rep. 18, 54 (2016).
Jolma, A. et al. DNA-dependent formation of transcription factor pairs alters their binding specificity. Nature 527, 384–388 (2015).
Kato, N. Insights into the genetic basis of type 2 diabetes. J. Diabetes Investig. 4, 233–244 (2013).
Mahajan, A. et al. Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. Nat. Genet. 46, 234–244 (2014).
Johnson, A. D. et al. SNAP: a web-based tool for identification and annotation of proxy SNPs using HapMap. Bioinformatics 24, 2938–2939 (2008).
Thurman, R. E. et al. The accessible chromatin landscape of the human genome. Nature 489, 75–82 (2012).
Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014).
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).
Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).
Mathelier, A. et al. JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 44 (D1), D110–D115 (2016).
Nitta, K. R. et al. Conservation of transcription factor binding specificities across 600 million years of bilateria evolution. eLife 4, (2015).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Zhou, X., Lindsay, H. & Robinson, M. D. Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic Acids Res. 42, e91 (2014).
Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
Rao, S. S. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
Dixon, J. R. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012).
Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst. 3, 99–101 (2016).
Yan, J. et al. Transcription factor binding in human cells occurs in dense clusters formed around cohesin anchor sites. Cell 154, 801–813 (2013).
Zhang, Y. et al. Model-based analysis of ChIP–seq (MACS). Genome Biol. 9, R137 (2008).
van de Geijn, B., McVicker, G., Gilad, Y. & Pritchard, J. K. WASP: allele-specific software for robust molecular quantitative trait locus discovery. Nat. Methods 12, 1061–1063 (2015).
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
Van der Auwera, G. A. et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr. Protoc. Bioinformatics 43, 11.10.1–11.10.3 (2013).
Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. 27, 801–812 (2017).
Browning, S. R. & Browning, B. L. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81, 1084–1097 (2007).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).
Anders, S., Pyl, P. T. & Huber, W. HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169 (2015).
Harrow, J. et al. GENCODE: the reference human genome annotation for the ENCODE project. Genome Res. 22, 1760–1774 (2012).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Dennis, G., Jr et al. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 4, 3 (2003).
Cock, P. J. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009).
Zuo, C., Shin, S. & Keleş, S. atSNP: transcription factor binding affinity testing for regulatory SNP detection. Bioinformatics 31, 3353–3355 (2015).
Lee, D. LS-GKM: a new gkm-SVM for large-scale datasets. Bioinformatics 32, 2196–2198 (2016).
Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12, 77 (2011).
Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
Dubois, P. C. et al. Multiple common variants for celiac disease influencing immune gene expression. Nat. Genet. 42, 295–302 (2010).
Willer, C. J. et al. Discovery and refinement of loci associated with lipid levels. Nat. Genet. 45, 1274–1283 (2013).
Lambert, J. P. et al. Mapping differential interactomes by affinity purification coupled with data-independent mass spectrometry acquisition. Nat. Methods 10, 1239–1245 (2013).
Okada, Y. et al. Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature 506, 376–381 (2014).
Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature 511, 421–427 (2014).
Bentham, J. et al. Genetic association analyses implicate aberrant regulation of innate and adaptive immunity genes in the pathogenesis of systemic lupus erythematosus. Nat. Genet. 47, 1457–1464 (2015).
de Lange, K. M. et al. Genome-wide association study implicates immune activation of multiple integrin genes in inflammatory bowel disease. Nat. Genet. 49, 256–261 (2017).
Nelson, C. P. et al. Association analyses based on false discovery rate implicate new loci for coronary artery disease. Nat. Genet. 49, 1385–1391 (2017).
Wray, N. R. et al. Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nat. Genet. 50, 668–681 (2018).
We thank S. Preissl and S. A. Chen for insightful comments during manuscript preparation; and S. Kuan, Z. Liu and B. Li for technical assistance. This work was supported by the Ludwig Institute for Cancer Research (B.R.), NIDDK (U01 DK105541 to B.R., M.S., and K.F.), Vetenskapsrådet Sweden (537-2014-6796 to J.Y.), and a CAPES foundation fellowship (BEX 5304/15-6 to A.M.R.S.).
B.R. is a co-founder and consultant for Arima Genomics and a co-founder of Epigenome Technologies.
Peer review information Nature thanks the anonymous reviewers for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
a, An example of the oligo design for SNP-SELEX. Two random nucleotides were added to each end of the oligos as unique molecule identifiers (UMIs) to remove over-represented PCR duplicates. The Illumina TruSeq dual-index system was adapted for the oligo design. b, The GC content (left) and CpG frequency (right) of the SNP-SELEX input were more similar to those of TF binding sites in the human genome (TFBS), open chromatin (DHS) and the entire human genome in general (hg19) than were those of the random sequences used in HT-SELEX. c, Comparison of k-mer coverage (left) and sequencing depth (right) of libraries between SNP-SELEX and HT-SELEX.
a, Equations demonstrate the relationships between OBS and the association constant (Ka) of TF-DNA interactions. b, An example of how oligonucleotides were evolutionarily selected during SNP-SELEX. The table of counts for oligonucleotide chr6:31171810-31171850 is shown on the left and the OBS curve is shown on the right. c, d, Histograms show the number of oligonucleotide sequences bound by each TF (c) and the number of binding TFs for each oligonucleotide sequence (d). e, An example of how the abundance of SNPs varies in the course of a SNP-SELEX experiment. The table of counts for SNP rs9263880 is shown on the left and the PBS curve is shown on the right. The orange line inside the black boxes indicates the reads of the T-allele-containing fragment and the blue line shows the reads of the C-allele-containing fragment.
a, Density plots show an example of the distribution of OBS of all oligos assayed in ELK SNP-SELEX replicative experiments. Vertical dashed lines indicate the cut-off for significant binding sequences (P = 0.05 by Monte Carlo randomization). The 40-bp genomic sequences with OBS that is over the indicated values are recognized as significant binding sites of ELK1 or ELK4. DBD: DNA binding domain. FL: full-length protein. b, Density plots show an example of the distribution of PBS of all oligos assayed in ELK SNP-SELEX replicative experiments. Vertical dashed lines indicate the cut-off for significantly differential binding (P = 0.01 by Monte Carlo randomization). The 40-bp SNP-containing genomic sequences with PBS over the indicated values are recognized as significantly differential (allelic) binding sites of ELK1 or ELK4. DBD: DNA binding domain. FL: full-length protein. c, An example illustrating differential DNA binding at six SNPs, in four SNP-SELEX experiments, including (i) two full-length ELK1 replicates, on the first two lines; (ii) one DNA binding domain (DBD) ELK1, on the third line; and one full-length ELK4 TF which belongs to the same structure family, on the last line. Each panel represents the logarithmic odds-ratio (y-axis) of observing the reference allele (REF), represented by a triangle, and the alternative allele (ALT), represented by a circle, over SNP-SELEX cycles (x-axis). The two alleles of each SNP are coloured according to their nucleotides, where A is red, C is green, G is blue, and T is yellow. The figure shows that SNP-SELEX experiments of both replicates, full-length, DBD, and same structure TF family presents the same allelic preference. d, e, Comparison of oligonucleotide enrichment (d) and allele preference (e) between different biological replicates (replicates), full-length (FL), and DNA Binding Domain (DBD), members of the same structural family (family), and random pairs (others). 
For each pair of experiments, we compared the oligonucleotides that displayed binding in both experiments (for oligonucleotide enrichment) and computed the PCC between the PBS values from each experiment (for allele preference). Horizontal line is the median; hinges are the 25th and 75th percentiles; whiskers extend to the most extreme values no further than 1.5× the interquartile range (IQR).
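The box plot convention described in this legend (median, 25th and 75th percentile hinges, whiskers limited to 1.5× IQR) can be sketched as follows. This is a minimal illustration, not the plotting code used in the study; the function name `tukey_box_stats` and the choice of the "inclusive" quartile estimator are assumptions, and other quartile estimators give slightly different hinges.

```python
import statistics


def tukey_box_stats(values):
    """Return median, hinges and whisker ends for a Tukey-style box plot.

    Hypothetical helper: the quartile method ("inclusive") is an assumption.
    """
    data = sorted(values)
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations still inside the fences.
    lower_whisker = min(x for x in data if x >= low_fence)
    upper_whisker = max(x for x in data if x <= high_fence)
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "lower_whisker": lower_whisker,
        "upper_whisker": upper_whisker,
    }


# Toy data with one outlier (100), which falls beyond the upper whisker.
box_stats = tukey_box_stats([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

With these toy values the upper whisker ends at 8, and the value 100 would be drawn as an individual outlier point.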
a, Comparison of the SNPs with differential TF binding determined by SNP-SELEX and ΔPWM. An error matrix table showing the number of SNPs for which the same allele was identified as the preferred allele by both methods (Agreed), SNPs for which one allele was determined as the preferred substrate by one method but no allele was called by the other (PWM+/ SNP-SELEX– and PWM–/ SNP-SELEX+), and SNPs where different alleles were called as preferentially bound by each method (Contradictory). Note that the vast majority of the results agreed, with the most disagreement coming from PWM+/ SNP-SELEX–. b, Comparison of the PWM scores (left) and the OBS scores (right) between SNPs with concordant and discordant predictions. Note that discordant predictions mostly come from weak binding sites with low PWM scores and low OBS scores. The two-sided Mann–Whitney U test P value is shown on the top. Horizontal line is the median; hinges are the 25th and 75th percentiles; whiskers extend to the most extreme values no further than 1.5× IQR. c, Box plots show the performance of ΔPWM in predicting pbSNPs grouped by DNA binding domain structural families (left) and the information content of motifs for each corresponding TF family (right). AUPRC is used to evaluate the performance of ΔPWM. Horizontal line is the median; hinges are the 25th and 75th percentiles; whiskers extend to the most extreme values no further than 1.5× IQR. d, Box plots show the PCC between PBS and ΔPWM (left) and the information content (right) for each TF family. PCCs for some TF families are higher than others, independent of the information content (IC) of the corresponding PWM models. Horizontal line is the median; hinges are the 25th and 75th percentiles; whiskers extend to the most extreme values no further than 1.5× IQR. e, A scatter plot shows the correlation between the PBS derived from SNP-SELEX and the allelic binding ratio derived from ChIP–seq in GM12878 cells. The PCCs and P values calculated based on a t-test are shown in the lower right corner.
The allelic binding ratio is computed as the log10 odds ratio over input (see Methods for details). In total, 341 TF-SNP pairs including 269 unique SNPs and six TFs were plotted. TFs used include ATF2, PKNOX1, IRF3, NR2F1, YBX1, and TBX21.
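As a rough illustration of a log10 odds ratio over input, the sketch below compares allele read counts in a ChIP–seq sample against the same counts in the input. The exact definition is given in the Methods, which are not reproduced here, so the orientation (reference over alternative), the pseudocount, and the function name `allelic_binding_ratio` are all assumptions.

```python
import math


def allelic_binding_ratio(chip_ref, chip_alt, input_ref, input_alt, pseudocount=0.5):
    """Log10 odds ratio of REF vs ALT reads in ChIP relative to input.

    Hypothetical sketch: the pseudocount guards against zero counts and is
    not taken from the paper.
    """
    chip_odds = (chip_ref + pseudocount) / (chip_alt + pseudocount)
    input_odds = (input_ref + pseudocount) / (input_alt + pseudocount)
    return math.log10(chip_odds / input_odds)


# Toy counts: the ChIP sample favours the reference allele tenfold,
# while the input is balanced between the two alleles.
ratio = allelic_binding_ratio(100, 10, 50, 50, pseudocount=0.0)
```

A positive value indicates that the reference allele is over-represented in the ChIP sample relative to input; swapping the two alleles flips the sign.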
Extended Data Fig. 5 SNP-SELEX results are correlated with allelic enhancer activities detected using high-throughput reporter assays.
a, A schematic diagram shows the strategy of using STARR-seq to assess the effect of SNPs in enhancer activity in HepG2 and HEK 293T cells. b, Heat map shows pair-wise PCCs calculated among STARR-seq datasets. The read counts of each SNP in the starting reporter library, in the mRNA pools in three HepG2 replicates, and three HEK 293T replicates were used for PCC calculation. c, MA plot of the logarithmic fold-change (y-axis) of read counts of SNP-containing mRNA over that of the input library expressed as logarithmic counts per million (CPM) (x-axis) for HEK 293T, on the top panel, and HepG2, on the bottom panel. Each dot on the plot corresponds to an oligonucleotide, and the oligonucleotides showing enrichment (empirical FDR <0.05) are colored in red. d, Bar plots comparing the fractions of paSNPs determined using STARR-seq in pbSNPs and non-pbSNPs by SNP-SELEX. Odds Ratio (OR) is shown between imbalanced and balanced SNPs, and the P value is calculated by Fisher exact test. Error bars denote the 95% confidence interval calculated by Wilson method (Methods). Only pbSNPs corresponding to the highly expressed TFs (RPKM >3) in the cell lines are considered for the analysis. n = 167 SNP-cell pairs for pbSNPs; n = 509 SNP-cell pairs for non-pbSNPs. e, Bar plots comparing the fractions of paSNPs determined using STARR-seq in pbSNPs and non-pbSNPs predicted by ΔPWM. SNPs with P <0.01 by atSNP were considered as pbSNPs. Odds Ratio (OR) is shown between imbalanced and balanced SNPs, and the P value is calculated by Fisher exact test. Error bars denote the 95% confidence interval calculated by Wilson method (Methods). Only pbSNPs by highly expressed TFs (RPKM >3) in the corresponding cell lines are considered for the analysis. n = 564 SNP-cell pairs for pbSNPs; n = 112 SNP-cell pairs for non-pbSNPs.
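The error bars in panels d and e are 95% confidence intervals from the Wilson method. A minimal sketch of the standard Wilson score interval for a binomial proportion is given below; the function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def wilson_interval(successes, total, z=1.959963984540054):
    """Wilson score interval for a binomial proportion.

    z defaults to the two-sided 95% normal quantile.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = successes / total
    denom = 1.0 + z * z / total
    centre = (p_hat + z * z / (2.0 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / total + z * z / (4.0 * total * total))
    ) / denom
    return centre - half_width, centre + half_width


# Hypothetical example: 5 of 10 SNP-cell pairs show allelic activity.
low, high = wilson_interval(5, 10)
```

Unlike the simple normal approximation, the Wilson interval stays within [0, 1] and behaves sensibly for small n or for fractions near 0 or 1, which matters when a bar is based on only a few SNP-cell pairs.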
Extended Data Fig. 6 deltaSVM more accurately predicts effects of noncoding variants on TF binding in vivo than ΔPWM.
a, A schematic graph for the training of deltaSVM models for 533 TFs. Data from previously reported HT-SELEX experiments using random DNA oligonucleotide sequences were used to derive these models. To develop deltaSVM models for each TF, the reads in each HT-SELEX cycle beyond cycle 0 were used as positive training sets, and the reads not enriched were used as negative training sets. All unique 10-mers were scored using gapped k-mer models to compute weights for deltaSVM. The two alleles of the 40-bp SELEX oligos were then scored using deltaSVM models to generate deltaSVM scores. b, Box plots compare the performance of deltaSVM and of PWMs derived from HT-SELEX with the multinomial or BEESEM algorithms in predicting pbSNPs for 129 TFs. The results from fivefold cross-validation are shown. Two statistical evaluations were used, AUROC (left) and AUPRC (right). P values by two-sided Mann–Whitney U test are shown on the top. Horizontal line is the median; hinges are the 25th and 75th percentiles; whiskers extend to the most extreme values no further than 1.5× IQR. c, d, Scatter plots compare the performance between deltaSVM (y-axis) and ΔPWM (x-axis) derived by multinomial models (c) and BEESEM models (d) in predicting allelic binding of 129 TFs for which both models were available. Results from fivefold cross-validation are shown. The values on both axes are AUPRC. e, An overview of the SNP-SELEX experimental procedure describing the novel batch of SNP-SELEX. f, A scatter plot compares the performance between deltaSVM (y-axis) and BEESEM-generated ΔPWM (x-axis) in predicting allelic binding of 87 TFs for which both models are available, as assessed with the novel batch of SNP-SELEX. The values on both axes are AUPRC. g, The logo describes the PWM model of a homodimeric binding pattern of TF HLF, with the monomeric half-site indicated by the purple arrows. The red boxes indicate the positions at which the SNP rs79124498 is located (left) and its co-dependent base position (right).
The y-axis corresponds to the information content at each position of the PWM (x-axis).
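The scoring step in panel a (10-mer weights applied to the two alleles of an oligo) can be sketched as below. It follows the general deltaSVM idea of summing, over every k-mer that overlaps the variant, the weight of the alternative-allele k-mer minus the weight of the reference-allele k-mer. The function name, the dictionary-based weight table, the default weight of zero for unseen k-mers, and the absence of reverse-complement handling are simplifying assumptions and do not describe the authors' pipeline.

```python
def delta_svm(ref_seq, alt_seq, weights, k=10):
    """Sum k-mer weight differences (ALT minus REF) across the variant.

    Hypothetical sketch: `weights` maps k-mer strings to SVM-derived
    weights; k-mers missing from the table contribute 0.
    """
    if len(ref_seq) != len(alt_seq):
        raise ValueError("alleles must have equal length (substitution only)")
    mismatches = [i for i, (a, b) in enumerate(zip(ref_seq, alt_seq)) if a != b]
    if len(mismatches) != 1:
        raise ValueError("expected exactly one substituted position")
    pos = mismatches[0]
    # Only k-mers whose window covers the variant position can differ.
    first_start = max(0, pos - k + 1)
    last_start = min(pos, len(ref_seq) - k)
    score = 0.0
    for start in range(first_start, last_start + 1):
        alt_kmer = alt_seq[start:start + k]
        ref_kmer = ref_seq[start:start + k]
        score += weights.get(alt_kmer, 0.0) - weights.get(ref_kmer, 0.0)
    return score


# Toy example with k=3 and made-up weights.
toy_weights = {"AAC": 1.0, "ATG": 0.5, "TGT": -0.25}
toy_score = delta_svm("AACGT", "AATGT", toy_weights, k=3)
```

In this toy case the score is negative, which under this sign convention means the alternative allele is predicted to bind less well than the reference allele.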
Extended Data Fig. 7 Comparison of deltaSVM models and ΔPWM in predicting allelic TF binding in weak and strong TF binding sites.
SNPs are categorized into five quantiles based on the OBS of the 40-bp DNA fragments. The performance of ΔPWM (green) and deltaSVM (orange) in predicting allelic binding of TFs was evaluated for SNPs in each category. a, b, Two batches of pbSNPs were used as gold standards for performance assessment: the pbSNPs from the initial SNP-SELEX experiments, with fivefold cross-validation (a), and the novel batch of SNP-SELEX data (b). Both AUROC (upper) and AUPRC (lower) are shown for statistical assessment of the model performance. The first quantile represents SNPs with the weakest binding strength and the fifth quantile represents SNPs with the strongest binding strength. Horizontal line is the median; hinges are the 25th and 75th percentiles; whiskers extend to the most extreme values no further than 1.5× IQR.
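The grouping of SNPs into five quantiles by OBS can be illustrated with a simple rank-based binning. This is a sketch under assumptions: the function name, the SNP identifiers and the scores are hypothetical, and ties are broken by input order rather than by any rule stated in the source.

```python
def assign_quantile_bins(scores, n_bins=5):
    """Assign each item to a bin from 1 (lowest scores) to n_bins (highest).

    `scores` maps an identifier (for example a SNP ID) to its OBS value.
    Bins are as equal in size as the item count allows.
    """
    ordered = sorted(scores, key=scores.get)
    n_items = len(ordered)
    bins = {}
    for rank, name in enumerate(ordered):
        # Integer arithmetic maps rank 0..n_items-1 onto bins 1..n_bins.
        bins[name] = rank * n_bins // n_items + 1
    return bins


# Hypothetical OBS values for ten SNPs.
toy_obs = {f"snp{i}": float(i) for i in range(10)}
toy_bins = assign_quantile_bins(toy_obs)
```

Prediction metrics such as AUROC and AUPRC can then be computed separately within each bin to compare models on weak and strong binding sites.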
Extended Data Fig. 8 DeltaSVM models predict more accurately the noncoding variants affecting TF binding in vivo than ΔPWM.
a, DeltaSVM models outperform ΔPWM in predicting differential DNA binding in vitro. Precision-Recall curves were used to assess the performance of either model in predicting allelic binding events identified in SNP-SELEX for three TFs, including ATF2, HLF, and MAFG. In all three cases, the performance of deltaSVM models (purple) was much better than that of ΔPWM (yellow). The AUC used for quantitative comparison was shown within each plot. b, Bar plots show the fractions of pbSNPs exhibiting allelic imbalance in TF ChIP–seq assays in HepG2 cells among all SNPs that were predicted to be differentially bound by a TF according to the deltaSVM models (purple) or the ΔPWM (yellow). The same datasets as in Fig. 3e were used. Only SNPs that were predicted to be bound by the TF were used in the comparison. The threshold for oligonucleotide binding and for the predicted pbSNPs was determined as the median score for the bound oligonucleotides and pbSNPs respectively. Error bars centred with mean percentage denote the 95% confidence interval calculated by Wilson method (Methods). For ΔPWM, n = 2872 (ATF2); n = 4134 (HLF); n = 100 (MAFG). For deltaSVM, n = 115 (ATF2); n = 355 (HLF); n = 16 (MAFG). c, Bar plots show the fractions of pbSNPs exhibiting allelic imbalance in TF ChIP–seq assays in GM12878 cells among all SNPs that were predicted as differentially bound by a TF according to the deltaSVM models (purple) or the ΔPWM (yellow). Three TFs were included in the analyses, ATF2, NR2F1, and PKNOX1. Only SNPs that were predicted to be bound by the TF were used in the comparison. The threshold for oligonucleotide binding and the predicted pbSNPs was determined as the median scores for the bound oligos and pbSNPs respectively. Error bars centred with mean percentage denote the 95% confidence interval calculated by Wilson method (Methods). For ΔPWM, n = 4318 (ATF2); n = 673 (NR2F1); n = 225 (PKNOX1). For deltaSVM, n = 142 (ATF2); n = 229 (NR2F1); n = 142 (PKNOX1). d, Similar to Fig. 3e, deltaSVM models outperform ΔPWM in predicting differential DNA binding in vivo. Three TF ChIP–seq datasets from GM12878 cells were used for the comparison, including the same dataset as shown in b. Elbow plots show that for each TF, the top-ranked allelic SNPs predicted by deltaSVM models were found to have allelic imbalance in ChIP–seq assays performed in GM12878 cells (purple). By contrast, for allelic SNPs predicted by ΔPWM, only a small fraction showed allelic imbalance in vivo (yellow).
a, Bar plots show the enrichment of pbSNPs in T2D risk SNPs identified from an independent study (ref. 14). The levels of enrichment are displayed for different groups of risk SNPs categorized based on the PPA (Posterior Probability of Association). Note that SNPs with stronger PPAs, and thus a higher likelihood of being causal for T2D, are more likely to be pbSNPs. b, Bar plots show the enrichment of T2D risk SNPs in allelic TF binding SNPs predicted by PWM models using the same credible sets as Fig. 4a (ref. 13). Specifically, SNPs with P <0.01 by atSNP were used as allelic TF binding SNPs. The level of association is categorized according to PPA as in a. Note that the likely causal SNPs with stronger T2D risk association no longer display higher enrichment for ΔPWM-predicted allelic SNPs. c, The T2D GWAS lead SNP rs7578326, a pbSNP differentially bound by TFs CEBPB, CEBPE, MYBL2, and NFE2, is predicted to target the IRS1 gene based on Hi-C analysis (circled in blue in bottom panel) in HepG2 cells. The locus around the SNP is enriched for H3K27ac and H3K4me1. d, CRISPRi using dCas9 fused with the repressive KRAB domain and guide RNA targeting the locus of SNP rs7578326 (upper) leads to reduced expression of the IRS1 gene in HepG2 but not in HEK 293T cells. qPCR results from three biological replicates in HepG2 (left) and HEK 293T (right) cells are plotted in the bottom panel. The y-axis shows the power-transformed values of expression presented as mean ± s.d. Raw data are shown as small black circles for clarity. P values computed using a two-sided t-test are noted on the top. e, SNP rs7578326 is an eQTL in liver and adipose tissues. Normalized expression values from the GTEx project for the IRS1 gene are grouped based on individuals’ genotype at SNP rs7578326. Linear regression P values and effect sizes are noted on the top. Horizontal line is the median; hinges are the 25th and 75th percentiles; whiskers extend to the most extreme values no further than 1.5× IQR.
Extended Data Fig. 10 Candidate TFs involved in complex traits and diseases identified by enrichment of TF binding alone.
a, A heat map shows the significant enrichment of SNPs predicted to be located within TF-DNA binding sites among trait- or disease-associated SNPs. The colour key is shown, and the value represents the -log10 P value. TF-trait pairs mentioned in the text are marked with *. Note that the SNPs here do not necessarily affect TF binding affinity. The candidate regulator we observed and validated (Fig. 4b) could not be identified here if we used only the presence of SNPs at the binding sites without taking into account the effect of each SNP on binding affinity. b, d, qPCR results from three biological replicates of MAFG (b) and HLF (d) in WT (HepG2), Control (Negative and HiPerfect), and cells treated with different siRNAs. Expression values are presented as mean ± s.d. c, e, MA plots showing differentially expressed genes comparing MAFG knockdown (c) and HLF knockdown (e) versus controls. Significantly differentially expressed genes (FDR <0.2) are marked in red.
List of variants tested by SNP-SELEX.
List of TF tested by SNP-SELEX.
List of pbSNPs.
List of TF ChIP-seq data.
List of STARR-seq oligos.
List of paSNPs.
Summary statistics of performance of deltaSVM models and ΔPWM models in predicting pbSNPs.
List of pbSNPs in the novel batch.
List of GWAS summary statistics used in Fig. 4b.
Summary statistics of HepG2 haplotype phasing.
Enriched motifs for SNP-SELEX experiments.
Scores for all tested SNP-TF pairs by SNP-SELEX experiments.
The 94 high-confidence deltaSVM models predicted allelic binding of all common SNPs in the human genome.
About this article
Cite this article
Yan, J., Qiu, Y., Ribeiro dos Santos, A.M. et al. Systematic analysis of binding of transcription factors to noncoding variants. Nature 591, 147–151 (2021). https://doi.org/10.1038/s41586-021-03211-0