Systematic analysis of binding of transcription factors to noncoding variants

Abstract

Many sequence variants have been linked to complex human traits and diseases (ref. 1), but deciphering their biological functions remains challenging, as most of them reside in noncoding DNA. Here we have systematically assessed the binding of 270 human transcription factors to 95,886 noncoding variants in the human genome using an ultra-high-throughput multiplex protein–DNA binding assay, termed single-nucleotide polymorphism evaluation by systematic evolution of ligands by exponential enrichment (SNP-SELEX). The resulting 828 million measurements of transcription factor–DNA interactions enable estimation of the relative affinity of these transcription factors to each variant in vitro and evaluation of the current methods to predict the effects of noncoding variants on transcription factor binding. We show that the position weight matrices of most transcription factors lack sufficient predictive power, whereas the support vector machine combined with the gapped k-mer representation shows much improved performance when assessed on results from independent SNP-SELEX experiments involving a new set of 61,020 sequence variants. We report highly predictive models for 94 human transcription factors and demonstrate their utility in genome-wide association studies and understanding of the molecular pathways involved in diverse human traits and diseases.

Fig. 1: High-throughput analysis of the binding of human transcription factors to common sequence variants by SNP-SELEX.
Fig. 2: Evaluation of the current PWM models using the SNP-SELEX data.
Fig. 3: DeltaSVM models outperform ΔPWM in predicting differential transcription factor binding to noncoding variants in vitro and in vivo.
Fig. 4: deltaSVM models predict TFs probably involved in complex traits and diseases.

Data availability

Sequencing data generated in this study can be accessed via Gene Expression Omnibus (GEO) under accession number GSE118725. The raw sequencing data for transcription factor ChIP–seq in GM12878 cells were obtained from the ENCODE portal (https://www.encodeproject.org); the specific transcription factor datasets can be found by searching the accession numbers listed in Supplementary Table 4. The web portal (http://renlab.sdsc.edu/GVATdb/) provides a searchable interface for the SNPs and transcription factors tested in the current study. Motifs enriched in SNP-SELEX experiments, identified using HOMER, are presented in Supplementary File 1. Scores for all tested SNP–transcription factor pairs from SNP-SELEX experiments are provided in Supplementary File 2. High-confidence allelic binding of 94 transcription factors to all common SNPs in the human genome, as predicted by deltaSVM models, is presented in Supplementary File 3.

Code availability

Custom code used to process the data and generate the results described in the current study has been deposited on GitHub (https://github.com/ren-lab/snp-selex).

References

  1. Buniello, A. et al. The NHGRI-EBI GWAS catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47 (D1), D1005–D1012 (2019).
  2. Weirauch, M. T. et al. Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 31, 126–134 (2013).
  3. Yin, Y. et al. Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science 356, eaaj2239 (2017).
  4. Jolma, A. et al. DNA-binding specificities of human transcription factors. Cell 152, 327–339 (2013).
  5. Ruan, S., Swamidass, S. J. & Stormo, G. D. BEESEM: estimation of binding energy models using HT-SELEX data. Bioinformatics 33, 2288–2295 (2017). https://doi.org/10.1093/bioinformatics/btx191.
  6. Farley, E. K. et al. Suboptimization of developmental enhancers. Science 350, 325–328 (2015).
  7. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
  8. Arnold, C. D. et al. Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science 339, 1074–1077 (2013).
  9. Altshuler, D. M. et al. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010).
  10. Lee, D. et al. A method to predict the impact of regulatory variants from DNA sequence. Nat. Genet. 47, 955–961 (2015).
  11. Rohs, R. et al. The role of DNA shape in protein–DNA recognition. Nature 461, 1248–1253 (2009).
  12. Morgunova, E. et al. Two distinct DNA sequences recognized by transcription factors represent enthalpy and entropy optima. eLife 7, e32963 (2018).
  13. Mahajan, A. et al. Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps. Nat. Genet. 50, 1505–1513 (2018).
  14. Greenwald, W. W. et al. Pancreatic islet chromatin accessibility and conformation reveals distal enhancer networks of type 2 diabetes risk. Nat. Commun. 10, 2078 (2019).
  15. Battle, A., Brown, C. D., Engelhardt, B. E. & Montgomery, S. B. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
  16. Olefsky, J., Farquhar, J. W. & Reaven, G. Relationship between fasting plasma insulin level and resistance to insulin-mediated glucose uptake in normal and diabetic subjects. Diabetes 22, 507–513 (1973).
  17. Soyal, S. M. et al. Associations of haplotypes upstream of IRS1 with insulin resistance, type 2 diabetes, dyslipidemia, preclinical atherosclerosis, and skeletal muscle LOC646736 mRNA levels. J. Diabetes Res. 2015, 405371 (2015).
  18. Manning, A. K. et al. A genome-wide approach accounting for body mass index identifies genetic variants influencing fasting glycemic traits and insulin resistance. Nat. Genet. 44, 659–669 (2012).
  19. Scott, R. A. et al. Large-scale association analyses identify new loci influencing glycemic traits and provide insight into the underlying biological pathways. Nat. Genet. 44, 991–1005 (2012).
  20. Nordquist, N. et al. The transcription factor TFAP2B is associated with insulin resistance and adiposity in healthy adolescents. Obesity 17, 1762–1767 (2009).
  21. Apazoglou, K. et al. Antidepressive effects of targeting ELK-1 signal transduction. Nat. Med. 24, 591–597 (2018).
  22. Leonardini, A., Laviola, L., Perrini, S., Natalicchio, A. & Giorgino, F. Cross-talk between PPARγ and insulin signaling and modulation of insulin sensitivity. PPAR Res. 2009, 818945 (2009).
  23. Fruchart, J. C., Duriez, P. & Staels, B. Peroxisome proliferator-activated receptor-alpha activators regulate genes governing lipoprotein metabolism, vascular inflammation and atherosclerosis. Curr. Opin. Lipidol. 10, 245–257 (1999).
  24. Shachter, N. S. Apolipoproteins C-I and C-III as important modulators of lipoprotein metabolism. Curr. Opin. Lipidol. 12, 297–304 (2001).
  25. Crosby, J. et al. Loss-of-function mutations in APOC3, triglycerides, and coronary disease. N. Engl. J. Med. 371, 22–31 (2014).
  26. Gotto, A. M., Jr. Triglyceride as a risk factor for coronary artery disease. Am. J. Cardiol. 82 (9A), 22Q–25Q (1998).
  27. Khetarpal, S. A., Qamar, A., Millar, J. S. & Rader, D. J. Targeting ApoC-III to reduce coronary disease risk. Curr. Atheroscler. Rep. 18, 54 (2016).
  28. Jolma, A. et al. DNA-dependent formation of transcription factor pairs alters their binding specificity. Nature 527, 384–388 (2015).
  29. Kato, N. Insights into the genetic basis of type 2 diabetes. J. Diabetes Investig. 4, 233–244 (2013).
  30. Mahajan, A. et al. Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. Nat. Genet. 46, 234–244 (2014).
  31. Johnson, A. D. et al. SNAP: a web-based tool for identification and annotation of proxy SNPs using HapMap. Bioinformatics 24, 2938–2939 (2008).
  32. Thurman, R. E. et al. The accessible chromatin landscape of the human genome. Nature 489, 75–82 (2012).
  33. Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014).
  34. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).
  35. Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).
  36. Mathelier, A. et al. JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 44 (D1), D110–D115 (2016).
  37. Nitta, K. R. et al. Conservation of transcription factor binding specificities across 600 million years of bilateria evolution. eLife 4, (2015).
  38. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
  39. Zhou, X., Lindsay, H. & Robinson, M. D. Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic Acids Res. 42, e91 (2014).
  40. Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
  41. Rao, S. S. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
  42. Dixon, J. R. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012).
  43. Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst. 3, 99–101 (2016).
  44. Yan, J. et al. Transcription factor binding in human cells occurs in dense clusters formed around cohesin anchor sites. Cell 154, 801–813 (2013).
  45. Zhang, Y. et al. Model-based analysis of ChIP–seq (MACS). Genome Biol. 9, R137 (2008).
  46. van de Geijn, B., McVicker, G., Gilad, Y. & Pritchard, J. K. WASP: allele-specific software for robust molecular quantitative trait locus discovery. Nat. Methods 12, 1061–1063 (2015).
  47. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
  48. DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
  49. Van der Auwera, G. A. et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr. Protoc. Bioinformatics 43, 11.10.1–11.10.3 (2013).
  50. Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. 27, 801–812 (2017).
  51. Browning, S. R. & Browning, B. L. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81, 1084–1097 (2007).
  52. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
  53. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).
  54. Anders, S., Pyl, P. T. & Huber, W. HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169 (2015).
  55. Harrow, J. et al. GENCODE: the reference human genome annotation for the ENCODE project. Genome Res. 22, 1760–1774 (2012).
  56. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
  57. Dennis, G., Jr et al. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 4, 3 (2003).
  58. Cock, P. J. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009).
  59. Zuo, C., Shin, S. & Keleş, S. atSNP: transcription factor binding affinity testing for regulatory SNP detection. Bioinformatics 31, 3353–3355 (2015).
  60. Lee, D. LS-GKM: a new gkm-SVM for large-scale datasets. Bioinformatics 32, 2196–2198 (2016).
  61. Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12, 77 (2011).
  62. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
  63. Dubois, P. C. et al. Multiple common variants for celiac disease influencing immune gene expression. Nat. Genet. 42, 295–302 (2010).
  64. Willer, C. J. et al. Discovery and refinement of loci associated with lipid levels. Nat. Genet. 45, 1274–1283 (2013).
  65. Lambert, J. P. et al. Mapping differential interactomes by affinity purification coupled with data-independent mass spectrometry acquisition. Nat. Methods 10, 1239–1245 (2013).
  66. Okada, Y. et al. Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature 506, 376–381 (2014).
  67. Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature 511, 421–427 (2014).
  68. Bentham, J. et al. Genetic association analyses implicate aberrant regulation of innate and adaptive immunity genes in the pathogenesis of systemic lupus erythematosus. Nat. Genet. 47, 1457–1464 (2015).
  69. de Lange, K. M. et al. Genome-wide association study implicates immune activation of multiple integrin genes in inflammatory bowel disease. Nat. Genet. 49, 256–261 (2017).
  70. Nelson, C. P. et al. Association analyses based on false discovery rate implicate new loci for coronary artery disease. Nat. Genet. 49, 1385–1391 (2017).
  71. Wray, N. R. et al. Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nat. Genet. 50, 668–681 (2018).

Acknowledgements

We thank S. Preissl and S. A. Chen for insightful comments during manuscript preparation; and S. Kuan, Z. Liu and B. Li for technical assistance. This work was supported by the Ludwig Institute for Cancer Research (B.R.), NIDDK (U01 DK105541 to B.R., M.S., and K.F.), Vetenskapsrådet Sweden (537-2014-6796 to J.Y.), and a CAPES foundation fellowship (BEX 5304/15-6 to A.M.R.S.).

Author information

Contributions

B.R., M.S., K.J.G., K.A.F., J.T. and J.Y. conceived the project. J.Y., Y.Y., X.L., N.N., and N.V. carried out experiments. Y.Q., A.M.R.d.S., Y.E.L., A.R., S.F., P.B., F.C. and J.C. performed data analysis. J.Y., Y.Q., A.M.R.d.S., J.T. and B.R. wrote the manuscript with input from all co-authors.

Corresponding authors

Correspondence to Jian Yan, Jussi Taipale or Bing Ren.

Ethics declarations

Competing interests

B.R. is a co-founder and consultant for Arima Genomics and a co-founder of Epigenome Technologies.

Additional information

Peer review information Nature thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 The sequence features of input oligonucleotides.

a, An example of the oligo design for SNP-SELEX. Two random nucleotides were added to each end of the oligos as unique molecular identifiers (UMIs) to remove over-represented PCR duplicates. The Illumina TruSeq dual-index system was adapted for oligo design. b, The GC content (left) and CpG frequency (right) of the SNP-SELEX input were more similar to those of TF binding sites in the human genome (TFBS), open chromatin (DHS) and the human genome as a whole (hg19) than to those of the random sequences used in HT-SELEX. c, Comparison of k-mer coverage (left) and sequencing depth (right) of libraries between SNP-SELEX and HT-SELEX.

Extended Data Fig. 2 Derivation of OBS and PBS.

a, Equations demonstrating the relationship between OBS and the association constant (Ka) of TF–DNA interactions. b, An example of how oligonucleotides were selected over the course of SNP-SELEX. The table of counts for oligonucleotide chr6:31171810-31171850 is shown on the left and the OBS curve is shown on the right. c, d, Histograms show the number of oligonucleotide sequences bound by each TF (c) and the number of TFs binding each oligonucleotide sequence (d). e, An example of how the abundance of the two alleles of a SNP varies in the course of a SNP-SELEX experiment. The table of counts for SNP rs9263880 is shown on the left and the PBS curve is shown on the right. The orange line inside the black boxes indicates the reads of the T-allele-containing fragment and the blue line shows the reads of the C-allele-containing fragment.

Extended Data Fig. 3 Reproducibility of SNP-SELEX data.

a, Density plots show an example of the distribution of OBS of all oligos assayed in ELK SNP-SELEX replicate experiments. Vertical dashed lines indicate the cut-off for significantly bound sequences (P = 0.05 by Monte Carlo randomization). The 40-bp genomic sequences with OBS above the indicated values are recognized as significant binding sites of ELK1 or ELK4. DBD, DNA-binding domain; FL, full-length protein. b, Density plots show an example of the distribution of PBS of all oligos assayed in ELK SNP-SELEX replicate experiments. Vertical dashed lines indicate the cut-off for significantly differential binding (P = 0.01 by Monte Carlo randomization). The 40-bp SNP-containing genomic sequences with PBS above the indicated values are recognized as significantly differential (allelic) binding sites of ELK1 or ELK4. c, An example illustrating differential DNA binding at six SNPs in four SNP-SELEX experiments: (i) two full-length ELK1 replicates, on the first two lines; (ii) one ELK1 DNA-binding domain (DBD), on the third line; and (iii) one full-length ELK4, a TF that belongs to the same structural family, on the last line. Each panel represents the logarithmic odds ratio (y-axis) of observing the reference allele (REF), represented by a triangle, and the alternative allele (ALT), represented by a circle, over SNP-SELEX cycles (x-axis). The two alleles of each SNP are coloured according to their nucleotides, where A is red, C is green, G is blue and T is yellow. The figure shows that SNP-SELEX experiments with replicates, the full-length protein, the DBD and a TF of the same structural family present the same allelic preference. d, e, Comparison of oligonucleotide enrichment (d) and allele preference (e) between biological replicates (replicates), full-length (FL) and DNA-binding domain (DBD) constructs, members of the same structural family (family) and random pairs (others). For each pair of experiments, we compared the oligonucleotides that display binding in both experiments (d) and computed the Pearson correlation coefficient (PCC) between the PBS from each experiment (e). Horizontal line is median; hinges are 25th and 75th percentile; whiskers are most extreme value no further than 1.5× interquartile range (IQR).
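The allele-preference comparison in e can be illustrated with a minimal sketch, assuming that the PBS values of each experiment are held in a dictionary keyed by SNP identifier. The function name `pbs_correlation`, the dictionary layout and the toy values are hypothetical and are not taken from the study's code; the sketch only computes the Pearson correlation coefficient over the SNPs scored in both experiments.

```python
import math


def pbs_correlation(pbs_a, pbs_b):
    """Pearson correlation of PBS over SNPs scored in both experiments."""
    shared = sorted(set(pbs_a) & set(pbs_b))
    if len(shared) < 2:
        raise ValueError("need at least two shared SNPs")
    xs = [pbs_a[snp] for snp in shared]
    ys = [pbs_b[snp] for snp in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("PBS values are constant in one experiment")
    return cov / math.sqrt(var_x * var_y)


# Toy PBS values (hypothetical, for illustration only).
rep1 = {"snpA": 0.8, "snpB": -0.4, "snpC": 0.1, "snpD": 1.2}
rep2 = {"snpA": 0.7, "snpB": -0.5, "snpC": 0.2, "snpE": 0.9}
r = pbs_correlation(rep1, rep2)  # uses snpA, snpB and snpC only
```

Restricting the calculation to shared SNPs avoids treating a missing measurement as a PBS of zero, which would otherwise bias the coefficient.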

Extended Data Fig. 4 SNP-SELEX results are correlated with TF binding in vitro and in vivo.

a, Comparison of the SNPs with differential TF binding determined by SNP-SELEX and ΔPWM. An error matrix table shows the number of SNPs for which the same allele was identified as the preferred allele by both methods (Agreed), SNPs for which one allele was determined to be the preferred substrate by one method but no allele was called by the other (PWM+/SNP-SELEX– and PWM–/SNP-SELEX+), and SNPs for which different alleles were called as preferentially bound by each method (Contradictory). Note that the vast majority of the results agreed, with most disagreement coming from PWM+/SNP-SELEX–. b, Comparison of the PWM scores (left) and the OBS scores (right) between SNPs with concordant and discordant predictions. Note that discordant predictions mostly come from weak binding sites with low PWM scores and low OBS scores. The two-sided Mann–Whitney U test P value is shown on the top. Horizontal line is median; hinges are 25th and 75th percentile; whiskers are most extreme value no further than 1.5× IQR. c, Box plots show the performance of ΔPWM in predicting pbSNPs grouped by DNA-binding domain structural families (left) and the information content of motifs for each corresponding TF family (right). AUPRC is used to evaluate the performance of ΔPWM. Horizontal line is median; hinges are 25th and 75th percentile; whiskers are most extreme value no further than 1.5× IQR. d, Box plots show the PCC between PBS and ΔPWM (left) and the information content (right) for each TF family. PCCs for some TF families are higher than for others, independent of the information content (IC) of the corresponding PWM models. Horizontal line is median; hinges are 25th and 75th percentile; whiskers are most extreme value no further than 1.5× IQR. e, A scatter plot shows the correlation between the PBS derived from SNP-SELEX and the allelic binding ratio derived from ChIP–seq in GM12878 cells. The PCCs and P values calculated based on t-test are shown in the lower right corner. The allelic binding ratio is computed as the log10 odds ratio over input (see Methods for details). In total, 341 TF–SNP pairs comprising 269 unique SNPs and six TFs were plotted. The TFs used are ATF2, PKNOX1, IRF3, NR2F1, YBX1 and TBX21.
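The four categories of the error matrix in a can be sketched as follows, assuming that each method reports its preferred allele per SNP as a string, or `None` when no allele is called. The function names, the input layout and the extra "Neither" category for SNPs called by neither method are assumptions made for illustration.

```python
from collections import Counter


def classify_call(pwm_allele, selex_allele):
    """Assign one SNP to an error-matrix category from the two allele calls."""
    if pwm_allele is None and selex_allele is None:
        return "Neither"  # assumption: not part of the four reported categories
    if pwm_allele is None:
        return "PWM-/SNP-SELEX+"
    if selex_allele is None:
        return "PWM+/SNP-SELEX-"
    return "Agreed" if pwm_allele == selex_allele else "Contradictory"


def tabulate_calls(calls):
    """Count categories over (pwm_allele, selex_allele) pairs."""
    return Counter(classify_call(pwm, selex) for pwm, selex in calls)


# Toy calls (hypothetical): preferred allele per method, None if not called.
toy_calls = [("A", "A"), ("C", None), (None, "T"), ("G", "A"), ("T", "T")]
counts = tabulate_calls(toy_calls)
```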

Extended Data Fig. 5 SNP-SELEX results are correlated with allelic enhancer activities detected using high-throughput reporter assays.

a, A schematic diagram shows the strategy of using STARR-seq to assess the effect of SNPs on enhancer activity in HepG2 and HEK 293T cells. b, Heat map shows pair-wise PCCs calculated among STARR-seq datasets. The read counts of each SNP in the starting reporter library, in the mRNA pools of three HepG2 replicates and in those of three HEK 293T replicates were used for PCC calculation. c, MA plot of the logarithmic fold-change (y-axis) of read counts of SNP-containing mRNA over that of the input library, against abundance expressed as logarithmic counts per million (CPM) (x-axis), for HEK 293T (top panel) and HepG2 (bottom panel). Each dot on the plot corresponds to an oligonucleotide, and the oligonucleotides showing enrichment (empirical FDR <0.05) are coloured in red. d, Bar plots comparing the fractions of paSNPs determined using STARR-seq in pbSNPs and non-pbSNPs by SNP-SELEX. The odds ratio (OR) is shown between imbalanced and balanced SNPs, and the P value is calculated by Fisher's exact test. Error bars denote the 95% confidence interval calculated by the Wilson method (Methods). Only pbSNPs corresponding to the highly expressed TFs (RPKM >3) in the cell lines are considered for the analysis. n = 167 SNP–cell pairs for pbSNPs; n = 509 SNP–cell pairs for non-pbSNPs. e, Bar plots comparing the fractions of paSNPs determined using STARR-seq in pbSNPs and non-pbSNPs predicted by ΔPWM. SNPs with P <0.01 by atSNP were considered as pbSNPs. The odds ratio (OR) is shown between imbalanced and balanced SNPs, and the P value is calculated by Fisher's exact test. Error bars denote the 95% confidence interval calculated by the Wilson method (Methods). Only pbSNPs of highly expressed TFs (RPKM >3) in the corresponding cell lines are considered for the analysis. n = 564 SNP–cell pairs for pbSNPs; n = 112 SNP–cell pairs for non-pbSNPs.
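The two summary statistics named in d and e, the Wilson confidence interval for a fraction and the odds ratio of a 2×2 table, can be sketched as below. This is a minimal illustration using the standard Wilson score formula with z = 1.96 for a 95% interval; the function names and the counts are hypothetical and do not reproduce values from the study.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (default: 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 table.

    a: pbSNPs that are paSNPs      b: pbSNPs that are not paSNPs
    c: non-pbSNPs that are paSNPs  d: non-pbSNPs that are not paSNPs
    """
    if b == 0 or c == 0:
        raise ValueError("odds ratio undefined when b or c is zero")
    return (a * d) / (b * c)


# Hypothetical counts: 30 of 100 pbSNPs and 10 of 100 non-pbSNPs are paSNPs.
low, high = wilson_interval(30, 100)
ratio = odds_ratio(30, 70, 10, 90)
```

Unlike the simple normal approximation, the Wilson interval stays within [0, 1] and behaves sensibly for small n or fractions near 0 or 1, which is relevant for groups with few SNPs.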

Extended Data Fig. 6 deltaSVM more accurately predicts effects of noncoding variants on TF binding in vivo than ΔPWM.

a, A schematic graph for the training of deltaSVM models for 533 TFs. Data from previously reported HT-SELEX experiments using random DNA oligonucleotide sequences were used to derive these models. To develop deltaSVM models for each TF, the reads from each HT-SELEX cycle after cycle 0 were used as positive training sets, and the reads not enriched were used as negative training sets. All unique 10-mers were scored using gapped k-mer models to compute weights for deltaSVM. The two alleles of the 40-bp SELEX oligos were then scored using deltaSVM models to generate deltaSVM scores. b, Box plots compare the performance of deltaSVM and of PWMs derived from HT-SELEX with the multinomial or BEESEM algorithms in predicting pbSNPs for 129 TFs. The results from fivefold cross-validation are shown. Two statistical evaluations were used: AUROC (left) and AUPRC (right). P values by two-sided Mann–Whitney U test are shown on the top. Horizontal line is median; hinges are 25th and 75th percentile; whiskers are most extreme value no further than 1.5× IQR. c, d, Scatter plots compare the performance of deltaSVM (y-axis) and ΔPWM (x-axis) derived by multinomial models (c) and BEESEM models (d) in predicting allelic binding of 129 TFs for which both models were available. Results from fivefold cross-validation are shown. The values on both axes are AUPRC. e, An overview of the SNP-SELEX experimental procedure describing the novel batch of SNP-SELEX. f, A scatter plot compares the performance of deltaSVM (y-axis) and BEESEM-generated ΔPWM (x-axis) in predicting allelic binding of 87 TFs for which both models are available, using the novel batch of SNP-SELEX. The values on both axes are AUPRC. g, The logo describes the PWM model of a homodimeric binding pattern of the TF HLF, with the monomeric half-sites indicated by the purple arrows. The red boxes indicate the position at which the SNP rs79124498 is located (left) and its co-dependent base position (right). The y-axis corresponds to the information content at each position of the PWM (x-axis).
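The allele-scoring step described in a can be sketched as follows, assuming a precomputed table of k-mer weights. The sketch sums the weights of all k-mers that overlap the variant position on each allele and takes the difference. The function names, the alt-minus-ref sign convention, the toy 3-mer weights (the legend refers to 10-mers) and the treatment of unlisted k-mers as weight zero are illustrative assumptions; reverse-complement handling is omitted for brevity.

```python
def kmers_overlapping(seq, pos, k):
    """All k-mers of seq that cover the 0-based position pos."""
    start = max(0, pos - k + 1)
    end = min(pos, len(seq) - k)
    return [seq[i:i + k] for i in range(start, end + 1)]


def delta_svm(ref_seq, alt_seq, pos, weights, k):
    """Difference in summed k-mer weights between alleles (alt minus ref)."""
    if len(ref_seq) != len(alt_seq):
        raise ValueError("alleles must be substitutions of equal length")
    ref_score = sum(weights.get(km, 0.0) for km in kmers_overlapping(ref_seq, pos, k))
    alt_score = sum(weights.get(km, 0.0) for km in kmers_overlapping(alt_seq, pos, k))
    return alt_score - ref_score


# Toy weights for 3-mers (hypothetical; unlisted k-mers count as 0).
toy_weights = {"ACG": 1.0, "CGT": 0.5, "CCT": -2.0}
score = delta_svm("ACGTA", "ACCTA", pos=2, weights=toy_weights, k=3)
```

Only k-mers covering the variant contribute, because every other k-mer is shared by the two alleles and cancels in the difference.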

Extended Data Fig. 7 Comparison of deltaSVM models and ΔPWM in predicting allelic TF binding in weak and strong TF binding sites.

SNPs are categorized into five quantiles based on the OBS of the 40-bp DNA fragments. The performance of ΔPWM (green) and deltaSVM (orange) in predicting allelic binding of TFs was evaluated for SNPs in each category. a, b, Two batches of pbSNPs were used as gold standards for performance assessment: the pbSNPs from the initial SNP-SELEX experiments, with fivefold cross-validation (a), and the novel batch of SNP-SELEX data (b). Both AUROC (upper) and AUPRC (lower) are shown for statistical assessment of the model performance. The first quantile represents SNPs with the weakest binding strength and the fifth quantile represents SNPs with the strongest binding strength. Horizontal line is median; hinges are 25th and 75th percentile; whiskers are most extreme value no further than 1.5× IQR.
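The grouping of SNPs by binding strength can be illustrated with a minimal rank-based sketch, assuming the OBS of each 40-bp fragment is available in a dictionary keyed by SNP. The function name, the tie handling (ties keep input order) and the toy values are assumptions; the exact binning procedure used in the study is not specified here.

```python
from collections import Counter


def assign_quantile_bins(obs_by_snp, n_bins=5):
    """Map each SNP to a bin 1..n_bins by the rank of its OBS (1 = weakest)."""
    ranked = sorted(obs_by_snp, key=obs_by_snp.get)
    n = len(ranked)
    return {snp: rank * n_bins // n + 1 for rank, snp in enumerate(ranked)}


# Toy OBS values (hypothetical): ten SNPs with increasing binding strength.
toy_obs = {f"snp{i}": 0.1 * i for i in range(10)}
bins = assign_quantile_bins(toy_obs)
bin_sizes = Counter(bins.values())
```

Model performance (for example AUROC or AUPRC) would then be computed separately within each bin.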

Extended Data Fig. 8 DeltaSVM models predict more accurately the noncoding variants affecting TF binding in vivo than ΔPWM.

a, DeltaSVM models outperform ΔPWM in predicting differential DNA binding in vitro. Precision-recall curves were used to assess the performance of either model in predicting allelic binding events identified in SNP-SELEX for three TFs: ATF2, HLF and MAFG. In all three cases, the performance of deltaSVM models (purple) was much better than that of ΔPWM (yellow). The AUC used for quantitative comparison is shown within each plot. b, Bar plots show the fractions of pbSNPs exhibiting allelic imbalance in TF ChIP–seq assays in HepG2 cells among all SNPs that were predicted to be differentially bound by a TF according to the deltaSVM models (purple) or the ΔPWM (yellow). The same datasets as in Fig. 3e were used. Only SNPs that were predicted to be bound by the TF were used in the comparison. The thresholds for oligonucleotide binding and for the predicted pbSNPs were determined as the median scores for the bound oligonucleotides and pbSNPs, respectively. Error bars, centred on the mean percentage, denote the 95% confidence interval calculated by the Wilson method (Methods). For ΔPWM, n = 2,872 (ATF2); n = 4,134 (HLF); n = 100 (MAFG). For deltaSVM, n = 115 (ATF2); n = 355 (HLF); n = 16 (MAFG). c, Bar plots show the fractions of pbSNPs exhibiting allelic imbalance in TF ChIP–seq assays in GM12878 cells among all SNPs that were predicted to be differentially bound by a TF according to the deltaSVM models (purple) or the ΔPWM (yellow). Three TFs were included in the analyses: ATF2, NR2F1 and PKNOX1. Only SNPs that were predicted to be bound by the TF were used in the comparison. The thresholds for oligonucleotide binding and for the predicted pbSNPs were determined as the median scores for the bound oligonucleotides and pbSNPs, respectively. Error bars, centred on the mean percentage, denote the 95% confidence interval calculated by the Wilson method (Methods). For ΔPWM, n = 4,318 (ATF2); n = 673 (NR2F1); n = 225 (PKNOX1). For deltaSVM, n = 142 (ATF2); n = 229 (NR2F1); n = 142 (PKNOX1). d, Similar to Fig. 3e, deltaSVM models outperform ΔPWM in predicting differential DNA binding in vivo. Three TF ChIP–seq datasets from GM12878 cells were used for the comparison, including the same dataset as shown in b. Elbow plots show that for each TF, the top-ranked allelic SNPs predicted by deltaSVM models were found to have allelic imbalance in ChIP–seq assays performed in GM12878 cells (purple). By contrast, for allelic SNPs predicted by ΔPWM, only a small fraction showed allelic imbalance in vivo (yellow).
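
The error bars in b and c are described as 95% confidence intervals calculated by the Wilson method. As an illustration only, the following minimal Python sketch computes a Wilson score interval for a binomial proportion. The function name `wilson_interval` and the example count of 8 SNPs with allelic imbalance are hypothetical; only the sample size n = 16 (MAFG, deltaSVM) is taken from the legend, and the authors' actual implementation is described in Methods and may differ.

```python
import math


def wilson_interval(successes, n, z=1.959963984540054):
    """Return (lower, upper) bounds of the Wilson score interval.

    successes: number of SNPs showing allelic imbalance (hypothetical input)
    n:         number of SNPs tested
    z:         standard normal quantile; the default corresponds to 95%
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    )
    return centre - half_width, centre + half_width


# Hypothetical example: 8 of 16 predicted pbSNPs show allelic imbalance.
lower, upper = wilson_interval(8, 16)
print(f"fraction = {8 / 16:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

With sample sizes as small as n = 16, the Wilson interval stays within [0, 1], which is one common reason for preferring it over the normal approximation for proportions.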

Extended Data Fig. 9 T2D risk SNPs are enriched for pbSNPs.

a, Bar plots show the enrichment of pbSNPs in T2D risk SNPs identified from an independent study (ref. 14). The levels of enrichment are displayed for different groups of risk SNPs categorized by PPA (posterior probability of association). Note that SNPs with stronger PPAs, and thus a higher likelihood of being causal for T2D, are more likely to be pbSNPs. b, Bar plots show the enrichment of T2D risk SNPs in allelic TF binding SNPs predicted by PWM models using the same credible sets as in Fig. 4a (ref. 13). Specifically, SNPs with P < 0.01 by atSNP were used as allelic TF binding SNPs. The level of association is categorized according to PPA as in a. Note that the likely causal SNPs with stronger T2D risk association no longer display higher enrichment for ΔPWM-predicted allelic SNPs. c, rs7578326, a T2D GWAS lead SNP and a pbSNP differentially bound by the TFs CEBPB, CEBPE, MYBL2 and NFE2, is predicted to target the IRS1 gene on the basis of Hi-C analysis (circled in blue in bottom panel) in HepG2 cells. The locus around the SNP is enriched for H3K27ac and H3K4me1. d, CRISPRi using dCas9 fused with the repressive KRAB domain and a guide RNA targeting the locus of SNP rs7578326 (upper) leads to reduced expression of the IRS1 gene in HepG2 but not in HEK 293T cells. qPCR results from three biological replicates in HepG2 (left) and HEK 293 (right) cells are plotted in the bottom panel. The y axis shows the power-transformed values of expression, presented as mean ± s.d. Raw data are shown as small black circles for clarity. P values computed using a two-sided t-test are noted at the top. e, SNP rs7578326 is an eQTL in liver and adipose tissues. Normalized expression values from the GTEx project for the IRS1 gene are grouped by individuals’ genotype at SNP rs7578326. Linear regression P values and effect sizes are noted at the top. Horizontal line is the median; hinges are the 25th and 75th percentiles; whiskers extend to the most extreme values no further than 1.5× IQR from the hinges.
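
The box plot convention stated for e (median line, hinges at the 25th and 75th percentiles, whiskers at the most extreme values within 1.5× IQR) can be sketched as follows. This is an illustrative Python example under stated assumptions: the helper names `percentile` and `box_stats` and the input values are hypothetical, linear interpolation between order statistics is assumed for the percentiles, and the whisker limits are measured from the hinges. Plotting software used for the figure may compute quantiles slightly differently.

```python
import math


def percentile(sorted_vals, q):
    """Percentile by linear interpolation between order statistics (assumed rule)."""
    pos = (len(sorted_vals) - 1) * q
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def box_stats(values):
    """Return median, hinges and whisker ends for a Tukey-style box plot."""
    s = sorted(values)
    q1, median, q3 = percentile(s, 0.25), percentile(s, 0.5), percentile(s, 0.75)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observed values inside the fences.
    whisker_low = min(v for v in s if v >= lower_fence)
    whisker_high = max(v for v in s if v <= upper_fence)
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": whisker_low,
        "whisker_high": whisker_high,
    }


# Hypothetical expression values; 30 lies beyond the upper fence.
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

In this hypothetical input, the value 30 falls outside the upper fence, so the upper whisker stops at 9 and 30 would be drawn as an individual outlying point.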

Extended Data Fig. 10 Candidate TFs involved in complex traits and diseases identified by enrichment of TF binding alone.

a, A heat map shows the significant enrichment of SNPs predicted to be located within TF–DNA binding sites among trait- or disease-associated SNPs. The colour key is shown, and the value represents the −log10 P value. TF–trait pairs mentioned in the text are marked with *. Note that the SNPs here do not necessarily affect TF binding affinity. The candidate regulator that we observed and validated (Fig. 4b) could not be identified here if we used only the presence of SNPs at the binding sites without taking into account the effect of the SNP on binding affinity. b, d, qPCR results from three biological replicates of MAFG (b) and HLF (d) in WT (HepG2), control (Negative and HiPerfect) and cells treated with different siRNAs. Expression values are presented as mean ± s.d. c, e, MA plots showing differentially expressed genes comparing MAFG knockdown (c) and HLF knockdown (e) versus controls. Significantly differentially expressed genes (FDR < 0.2) are marked in red.

Supplementary information

Reporting Summary

Supplementary Table 1

List of variants tested by SNP-SELEX.

Supplementary Table 2

List of TFs tested by SNP-SELEX.

Supplementary Table 3

List of pbSNPs.

Supplementary Table 4

List of TF ChIP-seq data.

Supplementary Table 5

List of STARR-seq oligos.

Supplementary Table 6

List of paSNPs.

Supplementary Table 7

Summary statistics of performance of deltaSVM models and ΔPWM models in predicting pbSNPs.

Supplementary Table 8

List of pbSNPs in the novel batch.

Supplementary Table 9

List of GWAS summary statistics used in Fig. 4b.

Supplementary Table 10

Summary statistics of HepG2 haplotype phasing.

Supplementary Data 1

Enriched motifs for SNP-SELEX experiments.

Supplementary Data 2

Scores for all tested SNP-TF pairs by SNP-SELEX experiments.

Supplementary Data 3

Allelic binding of all common SNPs in the human genome, as predicted by the 94 high-confidence deltaSVM models.

About this article

Cite this article

Yan, J., Qiu, Y., Ribeiro dos Santos, A.M. et al. Systematic analysis of binding of transcription factors to noncoding variants. Nature 591, 147–151 (2021). https://doi.org/10.1038/s41586-021-03211-0
