Article | Published:

High-throughput identification of human SNPs affecting regulatory element activity

Abstract

Most of the millions of SNPs in the human genome are non-coding, and many overlap with putative regulatory elements. Genome-wide association studies (GWAS) have linked many of these SNPs to human traits or to gene expression levels, but rarely with sufficient resolution to identify the causal SNPs. Functional screens based on reporter assays have previously been of insufficient throughput to test the vast space of SNPs for possible effects on regulatory element activity. Here we leveraged the throughput and resolution of the survey of regulatory elements (SuRE) reporter technology to survey the effect of 5.9 million SNPs, including 57% of the known common SNPs, on enhancer and promoter activity. We identified more than 30,000 SNPs that alter the activity of putative regulatory elements, partially in a cell-type-specific manner. Integration of this dataset with GWAS results may help to pinpoint SNPs that underlie human traits.

Data availability

Raw SuRE sequencing data are available at GEO (https://www.ncbi.nlm.nih.gov/geo/) under accession GSE128325. SuRE count tables, BigWig files for visualization of SuRE data tracks in genome browsers, lists of raQTLs and a table with SuRE data for all 5.9 million SNPs are available from the Open Science Framework (https://osf.io/w5bzq/wiki/home/?view). SuRE data can also be queried and visualized at https://sure.nki.nl. URLs to external data sources are listed in Supplementary Table 3.

Code availability

Scripts are available at https://github.com/vansteensellab/SuRE-SNV-code.

Software used is described in the relevant methods section and in the Nature Research Reporting Summary.

References

  1. 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
  2. Gusev, A. et al. Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am. J. Hum. Genet. 95, 535–552 (2014).
  3. Albert, F. W. & Kruglyak, L. The role of regulatory variation in complex traits and disease. Nat. Rev. Genet. 16, 197–212 (2015).
  4. Miguel-Escalada, I., Pasquali, L. & Ferrer, J. Transcriptional enhancers: functional insights and role in human disease. Curr. Opin. Genet. Dev. 33, 71–76 (2015).
  5. Deplancke, B., Alpern, D. & Gardeux, V. The genetics of transcription factor DNA binding variation. Cell 166, 538–554 (2016).
  6. MacArthur, J. et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 45, D896–D901 (2017).
  7. Gallagher, M. D. & Chen-Plotkin, A. S. The post-GWAS era: from association to function. Am. J. Hum. Genet. 102, 717–730 (2018).
  8. GTEx Consortium. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
  9. Schaid, D. J., Chen, W. & Larson, N. B. From genome-wide associations to candidate causal variants by statistical fine-mapping. Nat. Rev. Genet. 19, 491–504 (2018).
  10. Nishizaki, S. S. & Boyle, A. P. Mining the unknown: assigning function to noncoding single nucleotide polymorphisms. Trends Genet. 33, 34–45 (2017).
  11. Farh, K. K. et al. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature 518, 337–343 (2015).
  12. Tak, Y. G. & Farnham, P. J. Making sense of GWAS: using epigenomics and genome engineering to understand the functional relevance of SNPs in non-coding regions of the human genome. Epigenetics Chromatin 8, 57 (2015).
  13. Zhou, J. et al. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. 50, 1171–1179 (2018).
  14. Degner, J. F. et al. DNase I sensitivity QTLs are a major determinant of human expression variation. Nature 482, 390–394 (2012).
  15. McVicker, G. et al. Identification of genetic variants that affect histone modifications in human cells. Science 342, 747–749 (2013).
  16. Kilpinen, H. et al. Coordinated effects of sequence variation on DNA binding, chromatin structure, and transcription. Science 342, 744–747 (2013).
  17. Kasowski, M. et al. Extensive variation in chromatin states across humans. Science 342, 750–752 (2013).
  18. Waszak, S. M. et al. Population variation and genetic control of modular chromatin architecture in humans. Cell 162, 1039–1050 (2015).
  19. Grubert, F. et al. Genetic control of chromatin states in humans involves local and distal chromosomal interactions. Cell 162, 1051–1065 (2015).
  20. Gate, R. E. et al. Genetic determinants of co-accessible chromatin regions in activated T cells across humans. Nat. Genet. 50, 1140–1150 (2018).
  21. Vockley, C. M. et al. Massively parallel quantification of the regulatory effects of noncoding genetic variation in a human cohort. Genome Res. 25, 1206–1214 (2015).
  22. Tewhey, R. et al. Direct identification of hundreds of expression-modulating variants using a multiplexed reporter assay. Cell 165, 1519–1529 (2016).
  23. Ulirsch, J. C. et al. Systematic functional dissection of common genetic variation affecting red blood cell traits. Cell 165, 1530–1545 (2016).
  24. Liu, S. et al. Systematic identification of regulatory variants associated with cancer risk. Genome Biol. 18, 194 (2017).
  25. Zhang, P. et al. High-throughput screening of prostate cancer risk loci by single nucleotide polymorphisms sequencing. Nat. Commun. 9, 2022 (2018).
  26. van Arensbergen, J. et al. Genome-wide mapping of autonomous promoter activity in human cells. Nat. Biotechnol. 35, 145–153 (2017).
  27. Nakamura, M. et al. Genome-wide association study identifies TNFSF15 and POU2AF1 as susceptibility loci for primary biliary cirrhosis in the Japanese population. Am. J. Hum. Genet. 91, 721–728 (2012).
  28. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
  29. Liu, X. et al. In situ capture of chromatin interactions by biotinylated dCas9. Cell 170, 1028–1043.e19 (2017).
  30. Kumar, S., Ambrosini, G. & Bucher, P. SNP2TFBS - a database of regulatory SNPs affecting predicted transcription factor binding site affinity. Nucleic Acids Res. 45, D139–D144 (2017).
  31. Law, J. C., Ritke, M. K., Yalowich, J. C., Leder, G. H. & Ferrell, R. E. Mutational inactivation of the p53 gene in the human erythroid leukemic K562 cell line. Leuk. Res. 17, 1045–1050 (1993).
  32. Westerink, W. M., Stevenson, J. C., Horbach, G. J. & Schoonen, W. G. The development of RAD51C, Cystatin A, p53 and Nrf2 luciferase-reporter assays in metabolically competent HepG2 cells for the assessment of mechanism-based genotoxicity and of oxidative stress in the early research phase of drug development. Mutat. Res. 696, 21–40 (2010).
  33. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
  34. Cilia La Corte, A. L. et al. A functional XPNPEP2 promoter haplotype leads to reduced plasma aminopeptidase P and increased risk of ACE inhibitor-induced angioedema. Hum. Mutat. 32, 1326–1331 (2011).
  35. Chen, Z. S., Guo, Y., Belinsky, M. G., Kotova, E. & Kruh, G. D. Transport of bile acids, sulfated steroids, estradiol 17-β-d-glucuronide, and leukotriene C4 by human multidrug resistance protein 8 (ABCC11). Mol. Pharmacol. 67, 545–557 (2005).
  36. Makowski, M. M. et al. An interaction proteomics survey of transcription factor binding at recurrent TERT promoter mutations. Proteomics 16, 417–426 (2016).
  37. Makowski, M. M. et al. Global profiling of protein–DNA and protein–nucleosome binding affinities using quantitative mass spectrometry. Nat. Commun. 9, 1653 (2018).
  38. Boersema, P. J., Raijmakers, R., Lemeer, S., Mohammed, S. & Heck, A. J. Multiplex peptide stable isotope dimethyl labeling for quantitative proteomics. Nat. Protoc. 4, 484–494 (2009).
  39. Astle, W. J. et al. The allelic landscape of human blood cell trait variation and links to common complex disease. Cell 167, 1415–1429.e19 (2016).
  40. Maslah, N., Cassinat, B., Verger, E., Kiladjian, J. J. & Velazquez, L. The role of LNK/SH2B3 genetic alterations in myeloproliferative neoplasms and other hematological disorders. Leukemia 31, 1661–1670 (2017).
  41. Cong, L. et al. Multiplex genome engineering using CRISPR/Cas systems. Science 339, 819–823 (2013).
  42. Sawai, H. et al. Genome-wide association study identified new susceptible genetic variants in HLA class I region for hepatitis B virus-related hepatocellular carcinoma. Sci. Rep. 8, 7958 (2018).
  43. Wen, J. et al. Expression quantitative trait loci in long non-coding RNA ZNRD1-AS1 influence both HBV infection and hepatocellular carcinoma development. Mol. Carcinog. 54, 1275–1282 (2015).
  44. Nguyen, T. A. et al. High-throughput functional comparison of promoter and enhancer activities. Genome Res. 26, 1023–1033 (2016).
  45. Ward, L. D. & Kellis, M. HaploReg v4: systematic mining of putative causal variants, cell types, regulators and target genes for human complex traits and disease. Nucleic Acids Res. 44, D877–D881 (2016).
  46. Khan, A. et al. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 46, D260–D266 (2018).
  47. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 17, 10–12 (2011).
  48. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
  49. van de Geijn, B., McVicker, G., Gilad, Y. & Pritchard, J. K. WASP: allele-specific software for robust molecular quantitative trait locus discovery. Nat. Methods 12, 1061–1063 (2015).
  50. Huber, W. et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12, 115–121 (2015).
  51. Brinkman, E. K., Chen, T., Amendola, M. & van Steensel, B. Easy quantitative assessment of genome editing by sequence trace decomposition. Nucleic Acids Res. 42, e168 (2014).
  52. de Vree, P. J. et al. Targeted sequencing by proximity ligation for comprehensive variant detection and local haplotyping. Nat. Biotechnol. 32, 1019–1025 (2014).
  53. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).
  54. Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. 27, 801–812 (2017).
  55. Dignam, J. D., Lebovitz, R. M. & Roeder, R. G. Accurate transcription initiation by RNA polymerase II in a soluble extract from isolated mammalian nuclei. Nucleic Acids Res. 11, 1475–1489 (1983).
  56. Cox, J. & Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 26, 1367–1372 (2008).

Acknowledgements

We thank the NKI Genomics Core Facility and Research IT team for technical support, the RHPC facility of the Netherlands Cancer Institute for providing computational resources and members of our laboratories for helpful discussions. Supported by ERC Advanced Grant no. 694466 (to B.v.S.); ERC Starting Grant no. 637587 (to E.d.W.); NIH grant no. R01HG003008 and Columbia University’s Vagelos Precision Medicine Pilot Program (to H.J.B.). F.C. was supported by a Swiss National Science Foundation postdoctoral fellowship (no. P2EZP3_165206). J.v.A., L.P., M.d.H., M.P.B., F.C., R.H.v.d.W., H.T., E.d.W., M.V. and B.v.S. are part of the Oncode Institute, which is partly funded by the Dutch Cancer Society (KWF).

Author information

J.v.A. designed and performed experiments, analyzed data and wrote the manuscript. L.P., V.D.F. and H.J.B. developed algorithms and analyzed data. M.d.H., M.P.B., M.V., R.H.v.d.W., H.T., F.C., U.V., E.d.W. and L.F. generated and/or analyzed data. F.C. developed the web application. B.v.S. designed experiments, analyzed data and wrote the manuscript.

Correspondence to Joris van Arensbergen or Bas van Steensel.

Ethics declarations

Competing interests

J.v.A. is founder of Gen-X B.V. (http://www.gen-x.bio/). E.d.W. is co-founder and shareholder of Cergentis B.V. F.C. is a co-founder of enGene Statistics GmbH.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Characterization of SuRE libraries and SuRE data.

a. Inserted fragment size distribution for each SuRE library (bin size 25 bp). b. Histogram showing the coverage of each SNP position in the combined SuRE libraries. c. Same as (b) but now for each SNP allele. d. Representative ~0.5-Mb genomic region showing SuRE signals of HG02601, SuRE library 1 in K562 cells, together with DNase-seq and H3K27ac signals in K562 cells (ENCODE Project Consortium, 2012, Nature 489, 57–74). e. Q–Q plot showing the distribution of Wilcoxon rank-sum test P-values for SNPs in SuRE in K562 (y-axis) compared to the distribution of Wilcoxon rank-sum test P-values obtained after randomly shuffling the SuRE expression values for each SNP (x-axis). Shown is a random subset of 100,000 SNPs. Gray line indicates the y=x diagonal. f. Same as (e) but for HepG2. g. Volcano plot showing for all raQTLs in K562 (n=19,237) the log2 difference in SuRE signals for the REF and the ALT allele (x-axis) and the associated Wilcoxon rank-sum test P-values (y-axis). h. Same as (g) but for HepG2 (n=14,183). i. Histogram showing for all raQTLs in K562 the probability of the nearest neighbor SNP also being a raQTL, as a function of their distance. The dotted gray line indicates probability 0.5. j. SuRE Wilcoxon rank-sum test P-values in K562 and HepG2 cells for all SNPs that are raQTLs in at least one of the two cell types (see Fig. 1e). Gray lines indicate the P-value cut-offs for each cell type. k. Histogram showing for all K562-specific raQTLs the SuRE signal of the strongest allele in K562 (blue) and in HepG2 (yellow). l. Same as (k) but for all HepG2-specific raQTLs.
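The comparison in panels e and f, observed rank-sum P-values against P-values obtained after shuffling, can be illustrated with a minimal sketch. It is not the authors' pipeline (their scripts are in the repository listed under Code availability). It assumes a two-sided Wilcoxon rank-sum test using the normal approximation with tie correction and no continuity correction. The function names and per-fragment signal values are hypothetical.

```python
import math
import random

def rank_sum_pvalue(ref, alt):
    """Two-sided Wilcoxon rank-sum P-value (normal approximation, tie-corrected)."""
    n1, n2 = len(ref), len(alt)
    n = n1 + n2
    # Pool both groups, remembering group membership (0 = REF, 1 = ALT).
    pooled = sorted([(v, 0) for v in ref] + [(v, 1) for v in alt])
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_ref = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u_ref = rank_sum_ref - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return 1.0  # all values identical: no evidence of a difference
    z = (u_ref - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2))

def shuffled_pvalue(ref, alt, rng):
    """P-value after randomly reassigning signal values to the two alleles."""
    values = list(ref) + list(alt)
    rng.shuffle(values)
    return rank_sum_pvalue(values[:len(ref)], values[len(ref):])

# Hypothetical per-fragment SuRE signals for one SNP (illustrative numbers only).
ref_signal = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 1.05, 0.95]
alt_signal = [2.0, 2.2, 1.9, 2.1, 2.3, 1.8, 2.05, 1.95]
rng = random.Random(0)
print("observed P:", rank_sum_pvalue(ref_signal, alt_signal))
print("shuffled P:", shuffled_pvalue(ref_signal, alt_signal, rng))
```

Repeating both calculations for many SNPs and plotting the sorted observed P-values against the sorted shuffled P-values gives a Q–Q plot of the kind described in panel e.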

Supplementary Figure 2 Comparison of allelic imbalances in SuRE, H3K27ac and ATAC-seq.

a. Comparison of allelic imbalance of SuRE signals and H3K27ac ChIP-seq signals (normalized for genomic DNA allelic read counts) for raQTLs for which K562 cells are heterozygous and at least 10 H3K27ac reads covered the raQTL. REF: reference allele; ALT: alternative allele; OR: odds ratio. b. Same as (a) but for ATAC-seq. c-e. Average profile of DNase-seq (c), ATAC-seq (d) and H3K27ac (e) signal for all raQTLs that are heterozygous in K562 cells. The vertical gray lines indicate the raQTL position and the horizontal gray lines indicate the approximate background signal. Note that the signal relative to background is much better for DNase-seq than for these ATAC-seq or H3K27ac signals.
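As an illustration of what "normalized for genomic DNA allelic read counts" could mean in practice, the sketch below divides the ALT/REF ratio of signal reads by the ALT/REF ratio of genomic DNA reads and applies the 10-read coverage threshold from panel a. The exact normalization used by the authors is not given here, so the formula, the pseudocount and the function name are assumptions made for illustration.

```python
import math

MIN_READS = 10  # minimum signal reads covering the SNP, as stated in the caption

def normalized_log2_imbalance(ref_signal, alt_signal, ref_gdna, alt_gdna,
                              pseudocount=0.5):
    """log2(ALT/REF) of signal reads, corrected for the genomic DNA allele ratio.

    Returns None when the SNP is covered by fewer than MIN_READS signal reads.
    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    if ref_signal + alt_signal < MIN_READS:
        return None
    signal_ratio = (alt_signal + pseudocount) / (ref_signal + pseudocount)
    gdna_ratio = (alt_gdna + pseudocount) / (ref_gdna + pseudocount)
    return math.log2(signal_ratio / gdna_ratio)

# Hypothetical counts: a 2:1 REF:ALT copy-number imbalance in genomic DNA
# would otherwise look like allelic imbalance in the ChIP-seq reads.
print(normalized_log2_imbalance(40, 20, 200, 100))
```

Correcting by the genomic DNA ratio matters in a cell line such as K562, where copy number differs between alleles at some loci, so raw read fractions alone can suggest an imbalance that is purely genetic dosage.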

Supplementary Figure 3 Genomic distributions and minor allele frequencies of raQTLs.

a. Frequencies of raQTLs in K562 cells (n=19,237; dark color) or matching control SNPs (n=19,237; pale color) among all non-exonic SNPs within 100 kb of TSSs of loss-of-function tolerant genes or loss-of-function intolerant genes (Lek et al., 2016, Nature 536, 285–291). b. Same as (a) but for HepG2 cells (n=14,183). c. Distributions of minor allele frequencies according to the 1000 Genomes Project (1000 Genomes Project Consortium, 2015, Nature 526, 68–74) for raQTLs (dark color) and matched control SNPs (pale color) in K562 cells. P-values were obtained with a Wilcoxon rank-sum test. d. Same as (c) but for HepG2.

Supplementary Figure 4 Additional data related to eQTL - SuRE comparisons.

a. Odds ratios of concordance of whole blood eQTL and K562 raQTL SNPs, as a function of maximum distance to the TSS of the associated eGene. All odds ratios are significantly larger than 1 (one-sided Fisher’s exact test, P<3.6e-3 for all distance cutoffs). Analysis based on 623,210 SNPs that overlap between SuRE and eQTL datasets. b. Same as (a) but for liver eQTL SNPs that are raQTLs in HepG2. All odds ratios are significantly larger than 1 (P<2.3e-4 for all distance cutoffs). Analysis based on 186,613 SNPs that overlap between SuRE and eQTL datasets. c. Genome track plot combining SuRE data and eQTL mapping data for LYZ in whole blood, similar to main Fig. 4e. d. Protein binding analysis for rs554591, similar to main Fig. 4f. e. Bar plots indicating the fraction of reads containing each of the two alleles for rs623853 in K562 genomic DNA (left) and K562 DNase-seq reads (ENCODE Project Consortium, 2012, Nature 489, 57–74) (right). P-values are from a one-sided Fisher’s exact test. f. Same as (e) but for rs554591.
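A one-sided Fisher's exact test on a 2×2 table, as used for the odds ratios in panels a and b and the allelic read counts in panels e and f, can be sketched from the hypergeometric distribution. This is a generic illustration, not the authors' code. The function name and the read counts are hypothetical, and the odds ratio reported is the simple sample odds ratio.

```python
import math

def fisher_one_sided(a, b, c, d):
    """Sample odds ratio and one-sided P-value for the table [[a, b], [c, d]].

    The P-value is P(X >= a) under the hypergeometric null, i.e. the
    probability of a top-left count at least as large as observed when
    row and column totals are fixed.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    total = math.comb(n, col1)
    upper = min(row1, col1)
    tail = sum(math.comb(row1, k) * math.comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    odds_ratio = (a * d) / (b * c) if b * c else math.inf
    return odds_ratio, tail / total

# Hypothetical read counts (illustrative only):
#                allele 1   allele 2
# DNase-seq         10          2
# genomic DNA        3         15
odds_ratio, p_value = fisher_one_sided(10, 2, 3, 15)
print("odds ratio:", odds_ratio, "one-sided P:", p_value)
```

The one-sided form tests only for enrichment in the expected direction, which matches the captions' statement that the odds ratios are significantly larger than 1.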

Supplementary Figure 5 Unexplained allele-specific variation of NR_125431 expression before editing of rs3748136.

a. The A and G alleles of rs1053036 in NR_125431 are cis-linked to the A and G alleles of rs3748136, respectively. The linkage model is based on TLA mapping. This locus in K562 cells is most likely triploid. b. Fraction of reads containing each of the two alleles of SNP rs1053036 in NR_125431 in K562 genomic DNA (left) and K562 RNA-seq reads (right). P-value was obtained by a one-sided Fisher’s exact test. The complete lack of expression of the A allele of NR_125431 is unexpected and may point to a genetic defect of the A allele in K562 cells. c. Clonal lines derived from K562 cells show extreme expression variation of NR_125431. For CRISPR-based editing we proceeded with clone BL_2.

Supplementary information

Supplementary Information

Supplementary Figs. 1–5 and Supplementary Table 3

Reporting Summary

Supplementary Table 1

Overview complexities and sequencing depth for all SuRE libraries.

Supplementary Table 2

Oligonucleotide sequences.

Fig. 1: Identification of raQTLs by SuRE.
Fig. 2: Correlation of SuRE signals with local chromatin states.
Fig. 3: Concordance of SuRE data and predictions based on TF binding motifs.
Fig. 4: Candidate causal SNPs identified by SuRE among large sets of eQTL SNPs.
Fig. 5: Candidate causal SNPs identified by SuRE among large sets of GWAS SNPs.
Fig. 6: Candidate causal SNPs identified by SuRE among GWAS SNPs for HCC.