High-throughput identification of human SNPs affecting regulatory element activity

Abstract

Most of the millions of SNPs in the human genome are non-coding, and many overlap with putative regulatory elements. Genome-wide association studies (GWAS) have linked many of these SNPs to human traits or to gene expression levels, but rarely with sufficient resolution to identify the causal SNPs. Functional screens based on reporter assays have previously been of insufficient throughput to test the vast space of SNPs for possible effects on regulatory element activity. Here we leveraged the throughput and resolution of the survey of regulatory elements (SuRE) reporter technology to survey the effect of 5.9 million SNPs, including 57% of the known common SNPs, on enhancer and promoter activity. We identified more than 30,000 SNPs that alter the activity of putative regulatory elements, partially in a cell-type-specific manner. Integration of this dataset with GWAS results may help to pinpoint SNPs that underlie human traits.

Fig. 1: Identification of raQTLs by SuRE.
Fig. 2: Correlation of SuRE signals with local chromatin states.
Fig. 3: Concordance of SuRE data and predictions based on TF binding motifs.
Fig. 4: Candidate causal SNPs identified by SuRE among large sets of eQTL SNPs.
Fig. 5: Candidate causal SNPs identified by SuRE among large sets of GWAS SNPs.
Fig. 6: Candidate causal SNPs identified by SuRE among GWAS SNPs for HCC.

Data availability

Raw SuRE sequencing data are available at GEO (https://www.ncbi.nlm.nih.gov/geo/) under accession GSE128325. SuRE count tables, BigWig files for visualization of SuRE data tracks in genome browsers, lists of raQTLs and a table with SuRE data for all 5.9 million SNPs are available from the Open Science Framework (https://osf.io/w5bzq/wiki/home/?view). SuRE data can also be queried and visualized at https://sure.nki.nl. URLs to external data sources are listed in Supplementary Table 3.

Code availability

Scripts are available at https://github.com/vansteensellab/SuRE-SNV-code.

Software used is described in the relevant methods section and in the Nature Research Reporting Summary.

References

  1. 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
  2. Gusev, A. et al. Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am. J. Hum. Genet. 95, 535–552 (2014).
  3. Albert, F. W. & Kruglyak, L. The role of regulatory variation in complex traits and disease. Nat. Rev. Genet. 16, 197–212 (2015).
  4. Miguel-Escalada, I., Pasquali, L. & Ferrer, J. Transcriptional enhancers: functional insights and role in human disease. Curr. Opin. Genet. Dev. 33, 71–76 (2015).
  5. Deplancke, B., Alpern, D. & Gardeux, V. The genetics of transcription factor DNA binding variation. Cell 166, 538–554 (2016).
  6. MacArthur, J. et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 45, D896–D901 (2017).
  7. Gallagher, M. D. & Chen-Plotkin, A. S. The post-GWAS era: from association to function. Am. J. Hum. Genet. 102, 717–730 (2018).
  8. GTEx Consortium. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
  9. Schaid, D. J., Chen, W. & Larson, N. B. From genome-wide associations to candidate causal variants by statistical fine-mapping. Nat. Rev. Genet. 19, 491–504 (2018).
  10. Nishizaki, S. S. & Boyle, A. P. Mining the unknown: assigning function to noncoding single nucleotide polymorphisms. Trends Genet. 33, 34–45 (2017).
  11. Farh, K. K. et al. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature 518, 337–343 (2015).
  12. Tak, Y. G. & Farnham, P. J. Making sense of GWAS: using epigenomics and genome engineering to understand the functional relevance of SNPs in non-coding regions of the human genome. Epigenetics Chromatin 8, 57 (2015).
  13. Zhou, J. et al. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. 50, 1171–1179 (2018).
  14. Degner, J. F. et al. DNase I sensitivity QTLs are a major determinant of human expression variation. Nature 482, 390–394 (2012).
  15. McVicker, G. et al. Identification of genetic variants that affect histone modifications in human cells. Science 342, 747–749 (2013).
  16. Kilpinen, H. et al. Coordinated effects of sequence variation on DNA binding, chromatin structure, and transcription. Science 342, 744–747 (2013).
  17. Kasowski, M. et al. Extensive variation in chromatin states across humans. Science 342, 750–752 (2013).
  18. Waszak, S. M. et al. Population variation and genetic control of modular chromatin architecture in humans. Cell 162, 1039–1050 (2015).
  19. Grubert, F. et al. Genetic control of chromatin states in humans involves local and distal chromosomal interactions. Cell 162, 1051–1065 (2015).
  20. Gate, R. E. et al. Genetic determinants of co-accessible chromatin regions in activated T cells across humans. Nat. Genet. 50, 1140–1150 (2018).
  21. Vockley, C. M. et al. Massively parallel quantification of the regulatory effects of noncoding genetic variation in a human cohort. Genome Res. 25, 1206–1214 (2015).
  22. Tewhey, R. et al. Direct identification of hundreds of expression-modulating variants using a multiplexed reporter assay. Cell 165, 1519–1529 (2016).
  23. Ulirsch, J. C. et al. Systematic functional dissection of common genetic variation affecting red blood cell traits. Cell 165, 1530–1545 (2016).
  24. Liu, S. et al. Systematic identification of regulatory variants associated with cancer risk. Genome Biol. 18, 194 (2017).
  25. Zhang, P. et al. High-throughput screening of prostate cancer risk loci by single nucleotide polymorphisms sequencing. Nat. Commun. 9, 2022 (2018).
  26. van Arensbergen, J. et al. Genome-wide mapping of autonomous promoter activity in human cells. Nat. Biotechnol. 35, 145–153 (2017).
  27. Nakamura, M. et al. Genome-wide association study identifies TNFSF15 and POU2AF1 as susceptibility loci for primary biliary cirrhosis in the Japanese population. Am. J. Hum. Genet. 91, 721–728 (2012).
  28. Encode Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
  29. Liu, X. et al. In situ capture of chromatin interactions by biotinylated dCas9. Cell 170, 1028–1043.e19 (2017).
  30. Kumar, S., Ambrosini, G. & Bucher, P. SNP2TFBS - a database of regulatory SNPs affecting predicted transcription factor binding site affinity. Nucleic Acids Res. 45, D139–D144 (2017).
  31. Law, J. C., Ritke, M. K., Yalowich, J. C., Leder, G. H. & Ferrell, R. E. Mutational inactivation of the p53 gene in the human erythroid leukemic K562 cell line. Leuk. Res. 17, 1045–1050 (1993).
  32. Westerink, W. M., Stevenson, J. C., Horbach, G. J. & Schoonen, W. G. The development of RAD51C, Cystatin A, p53 and Nrf2 luciferase-reporter assays in metabolically competent HepG2 cells for the assessment of mechanism-based genotoxicity and of oxidative stress in the early research phase of drug development. Mutat. Res. 696, 21–40 (2010).
  33. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
  34. Cilia La Corte, A. L. et al. A functional XPNPEP2 promoter haplotype leads to reduced plasma aminopeptidase P and increased risk of ACE inhibitor-induced angioedema. Hum. Mutat. 32, 1326–1331 (2011).
  35. Chen, Z. S., Guo, Y., Belinsky, M. G., Kotova, E. & Kruh, G. D. Transport of bile acids, sulfated steroids, estradiol 17-β-d-glucuronide, and leukotriene C4 by human multidrug resistance protein 8 (ABCC11). Mol. Pharmacol. 67, 545–557 (2005).
  36. Makowski, M. M. et al. An interaction proteomics survey of transcription factor binding at recurrent TERT promoter mutations. Proteomics 16, 417–426 (2016).
  37. Makowski, M. M. et al. Global profiling of protein–DNA and protein–nucleosome binding affinities using quantitative mass spectrometry. Nat. Commun. 9, 1653 (2018).
  38. Boersema, P. J., Raijmakers, R., Lemeer, S., Mohammed, S. & Heck, A. J. Multiplex peptide stable isotope dimethyl labeling for quantitative proteomics. Nat. Protoc. 4, 484–494 (2009).
  39. Astle, W. J. et al. The allelic landscape of human blood cell trait variation and links to common complex disease. Cell 167, 1415–1429.e19 (2016).
  40. Maslah, N., Cassinat, B., Verger, E., Kiladjian, J. J. & Velazquez, L. The role of LNK/SH2B3 genetic alterations in myeloproliferative neoplasms and other hematological disorders. Leukemia 31, 1661–1670 (2017).
  41. Cong, L. et al. Multiplex genome engineering using CRISPR/Cas systems. Science 339, 819–823 (2013).
  42. Sawai, H. et al. Genome-wide association study identified new susceptible genetic variants in HLA class I region for hepatitis B virus-related hepatocellular carcinoma. Sci. Rep. 8, 7958 (2018).
  43. Wen, J. et al. Expression quantitative trait loci in long non-coding RNA ZNRD1-AS1 influence both HBV infection and hepatocellular carcinoma development. Mol. Carcinog. 54, 1275–1282 (2015).
  44. Nguyen, T. A. et al. High-throughput functional comparison of promoter and enhancer activities. Genome Res. 26, 1023–1033 (2016).
  45. Ward, L. D. & Kellis, M. HaploReg v4: systematic mining of putative causal variants, cell types, regulators and target genes for human complex traits and disease. Nucleic Acids Res. 44, D877–D881 (2016).
  46. Khan, A. et al. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 46, D260–D266 (2018).
  47. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 17, 10–12 (2011).
  48. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
  49. van de Geijn, B., McVicker, G., Gilad, Y. & Pritchard, J. K. WASP: allele-specific software for robust molecular quantitative trait locus discovery. Nat. Methods 12, 1061–1063 (2015).
  50. Huber, W. et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12, 115–121 (2015).
  51. Brinkman, E. K., Chen, T., Amendola, M. & van Steensel, B. Easy quantitative assessment of genome editing by sequence trace decomposition. Nucleic Acids Res. 42, e168 (2014).
  52. de Vree, P. J. et al. Targeted sequencing by proximity ligation for comprehensive variant detection and local haplotyping. Nat. Biotechnol. 32, 1019–1025 (2014).
  53. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).
  54. Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. 27, 801–812 (2017).
  55. Dignam, J. D., Lebovitz, R. M. & Roeder, R. G. Accurate transcription initiation by RNA polymerase II in a soluble extract from isolated mammalian nuclei. Nucleic Acids Res. 11, 1475–1489 (1983).
  56. Cox, J. & Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 26, 1367–1372 (2008).

Acknowledgements

We thank the NKI Genomics Core Facility and Research IT team for technical support, the RHPC facility of the Netherlands Cancer Institute for providing computational resources and members of our laboratories for helpful discussions. Supported by ERC Advanced Grant no. 694466 (to B.v.S.); ERC Starting Grant no. 637587 (to E.d.W.); NIH grant no. R01HG003008 and Columbia University’s Vagelos Precision Medicine Pilot Program (to H.J.B.). F.C. was supported by a Swiss National Science Foundation postdoctoral fellowship (no. P2EZP3_165206). J.v.A., L.P., M.d.H., M.P.B., F.C., R.H.v.d.W., H.T., E.d.W., M.V. and B.v.S. are part of the Oncode Institute, which is partly funded by the Dutch Cancer Society (KWF).

Author information

Contributions

J.v.A. designed and performed experiments, analyzed data and wrote the manuscript. L.P., V.D.F. and H.J.B. developed algorithms and analyzed data. M.d.H., M.P.B., M.V., R.H.v.d.W., H.T., F.C., U.V., E.d.W. and L.F. generated and/or analyzed data. F.C. developed the web application. B.v.S. designed experiments, analyzed data and wrote the manuscript.

Corresponding authors

Correspondence to Joris van Arensbergen or Bas van Steensel.

Ethics declarations

Competing interests

J.v.A. is founder of Gen-X B.V. (http://www.gen-x.bio/). E.d.W. is co-founder and shareholder of Cergentis B.V. F.C. is a co-founder of enGene Statistics GmbH.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Characterization of SuRE libraries and SuRE data.

a. Inserted fragment size distribution for each SuRE library (bin size 25 bp). b. Histogram showing the coverage of each SNP position in the combined SuRE libraries. c. Same as (b) but for each SNP allele. d. Representative ~0.5-Mb genomic region showing SuRE signals of HG02601, SuRE library 1 in K562 cells, together with DNase-seq and H3K27ac signals in K562 cells (Encode Project Consortium, 2012, Nature 489, 57-74). e. QQ plot showing the distribution of Wilcoxon rank-sum test P-values for SNPs in SuRE in K562 (y-axis) compared to the distribution of Wilcoxon rank-sum test P-values obtained after randomly shuffling the SuRE expression values for each SNP (x-axis). Shown is a random subset of 100,000 SNPs. Gray line indicates the y=x diagonal. f. Same as (e) but for HepG2. g. Volcano plot showing for all raQTLs in K562 (n=19,237) the log2 difference in SuRE signals for the REF and the ALT allele (x-axis) and the associated Wilcoxon rank-sum test P-values (y-axis). h. Same as (g) but for HepG2 (n=14,183). i. Histogram showing for all raQTLs in K562 the probability of the nearest neighbor SNP also being a raQTL, as a function of their distance. The dotted gray line indicates probability 0.5. j. SuRE Wilcoxon rank-sum test P-values in K562 and HepG2 cells for all SNPs that are raQTLs in at least one of the two cell types (see Fig. 1e). Gray lines indicate the P-value cut-offs for each cell type. k. Histogram showing for all K562-specific raQTLs the SuRE signal of the strongest allele in K562 (blue) and in HepG2 (yellow). l. Same as (k) but for all HepG2-specific raQTLs.
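The comparison in panels e and f can be illustrated with a minimal Python sketch. It is not the authors' pipeline (their scripts are in the repository listed under Code availability). The function names, the toy per-fragment signals and the use of a normal approximation without continuity correction are all assumptions made for illustration. For one SNP, the sketch computes a two-sided Wilcoxon rank-sum P-value between the signals of fragments carrying the REF and the ALT allele, and then a control P-value after shuffling the allele labels.

```python
import math
import random


def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum P-value (normal approximation, tie-corrected).

    Illustrative only: no continuity correction and no exact small-sample test.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool the values and remember which group each came from (0 = x, 1 = y).
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0      # mean of ranks i+1 .. j for tied values
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0  # Mann-Whitney U statistic for x
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2))


def shuffled_pvalue(x, y, rng):
    """P-value after randomly reassigning the pooled values to the two groups."""
    pooled = list(x) + list(y)
    rng.shuffle(pooled)
    return rank_sum_pvalue(pooled[:len(x)], pooled[len(x):])


# Hypothetical per-fragment signals for one SNP (not real SuRE data).
ref_signal = [1, 2, 3, 4, 5, 6, 7, 8]
alt_signal = [11, 12, 13, 14, 15, 16, 17, 18]
rng = random.Random(0)
print("observed P:", rank_sum_pvalue(ref_signal, alt_signal))
print("shuffled P:", shuffled_pvalue(ref_signal, alt_signal, rng))
```

Repeating both calculations over many SNPs and plotting the sorted observed P-values against the sorted shuffled P-values would give a QQ plot of the kind described in the legend.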

Supplementary Figure 2 Comparison of allelic imbalances in SuRE, H3K27ac and ATAC-seq.

a. Comparison of allelic imbalance of SuRE signals and H3K27ac ChIP-seq signals (normalized for genomic DNA allelic read counts) for raQTLs for which K562 cells are heterozygous and at least 10 H3K27ac reads covered the raQTL. REF: reference allele; ALT: alternative allele; OR: odds ratio. b. Same as (a) but for ATAC-seq. c–e. Average profile of DNase-seq (c), ATAC-seq (d) and H3K27ac (e) signal for all raQTLs that are heterozygous in K562 cells. The vertical gray lines indicate the raQTL position and the horizontal gray lines indicate the approximate background signal. Note that the signal relative to background is considerably higher for DNase-seq than for ATAC-seq or H3K27ac.

Supplementary Figure 3 Genomic distributions and minor allele frequencies of raQTLs.

a. Frequencies of raQTLs in K562 cells (n=19,237; dark color) or matching control SNPs (n=19,237; pale color) among all non-exonic SNPs within 100 kb of TSSs of loss-of-function tolerant genes or loss-of-function intolerant genes (Lek et al., 2016, Nature 536, 285-291). b. Same as (a) but for HepG2 cells (n=14,183). c. Distributions of minor allele frequencies according to the 1000 Genomes Project (1000 Genomes Project Consortium, 2015, Nature 526, 68-74) for raQTLs (dark color) and matched control SNPs (pale color) in K562 cells. P-values were obtained with a Wilcoxon rank-sum test. d. Same as (c) but for HepG2.

Supplementary Figure 4 Additional data related to eQTL - SuRE comparisons.

a. Odds ratios of concordance of whole blood eQTL and K562 raQTL SNPs, as a function of maximum distance to the TSS of the associated eGene. All odds ratios are significantly larger than 1 (one-sided Fisher's exact test, P<3.6e-3 for all distance cutoffs). Analysis based on 623,210 SNPs that overlap between SuRE and eQTL datasets. b. Same as (a) but for liver eQTL SNPs that are raQTLs in HepG2. All odds ratios are significantly larger than 1 (P<2.3e-4 for all distance cutoffs). Analysis based on 186,613 SNPs that overlap between SuRE and eQTL datasets. c. Genome track plot combining SuRE data and eQTL mapping data for LYZ in whole blood, similar to main Fig. 4e. d. Protein binding analysis for rs554591, similar to main Fig. 4f. e. Barplots indicating the fraction of reads containing each of the two alleles for rs623853 in K562 genomic DNA (left) and K562 DNase-seq reads (Encode Project Consortium, 2012, Nature 489, 57-74) (right). P-values are from a one-sided Fisher's exact test. f. Same as (e) but for rs554591.
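For readers unfamiliar with the statistics in panels a and b, the following minimal Python sketch shows how an odds ratio and a one-sided Fisher's exact test P-value can be computed from a 2×2 contingency table. The table layout, the function names and the toy counts are assumptions for illustration, not the authors' code or data. The exact summation is only practical for small counts; tables with hundreds of thousands of SNPs would call for a statistics library.

```python
import math


def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c) for the 2x2 table [[a, b], [c, d]]."""
    if b * c == 0:
        return math.inf
    return (a * d) / (b * c)


def fisher_one_sided(a, b, c, d):
    """One-sided (enrichment) Fisher's exact test P-value.

    Probability of observing a top-left count of at least `a` under the
    hypergeometric null with all row and column totals held fixed.
    """
    row1 = a + b          # total of the first row
    col1 = a + c          # total of the first column
    n = a + b + c + d     # grand total
    upper = min(row1, col1)
    tail = sum(
        math.comb(row1, k) * math.comb(n - row1, col1 - k)
        for k in range(a, upper + 1)
    )
    return tail / math.comb(n, col1)


# Hypothetical counts (assumed layout: rows = raQTL yes/no, columns = eQTL yes/no).
a, b, c, d = 3, 1, 1, 3
print("odds ratio:", odds_ratio(a, b, c, d))
print("one-sided P:", fisher_one_sided(a, b, c, d))
```

An odds ratio above 1 together with a small one-sided P-value indicates that the two SNP classes overlap more often than expected by chance, which is the sense in which the legend reports odds ratios "significantly larger than 1".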

Supplementary Figure 5 Unexplained allele-specific variation of NR_125431 expression before editing of rs3748136.

a. The A and G alleles of rs1053036 in NR_125431 are cis-linked to the A and G alleles of rs3748136, respectively. The linkage model is based on TLA mapping. This locus in K562 cells is most likely triploid. b. Fraction of reads containing each of the two alleles of SNP rs1053036 in NR_125431 in K562 genomic DNA (left) and K562 RNA-seq reads (right). The P-value was obtained by a one-sided Fisher's exact test. The complete lack of expression of the A allele of NR_125431 is unexpected and may point to a genetic defect of the A allele in K562 cells. c. Clonal lines derived from K562 cells show extreme expression variation of NR_125431. For CRISPR-based editing we proceeded with clone BL_2.

Supplementary information

Supplementary Information

Supplementary Figs. 1–5 and Supplementary Table 3

Reporting Summary

Supplementary Table 1

Overview of complexities and sequencing depth for all SuRE libraries.

Supplementary Table 2

Oligonucleotide sequences.

About this article

Cite this article

van Arensbergen, J., Pagie, L., FitzPatrick, V.D. et al. High-throughput identification of human SNPs affecting regulatory element activity. Nat Genet 51, 1160–1169 (2019). https://doi.org/10.1038/s41588-019-0455-2
