Saturation mutagenesis of twenty disease-associated regulatory elements at single base-pair resolution

The majority of common variants associated with common diseases, as well as an unknown proportion of causal mutations for rare diseases, fall in noncoding regions of the genome. Although catalogs of noncoding regulatory elements are steadily improving, we have a limited understanding of the functional effects of mutations within them. Here, we perform saturation mutagenesis in conjunction with massively parallel reporter assays on 20 disease-associated gene promoters and enhancers, generating functional measurements for over 30,000 single nucleotide substitutions and deletions. We find that the density of putative transcription factor binding sites varies widely between regulatory elements, as does the extent to which evolutionary conservation or integrative scores predict functional effects. These data provide a powerful resource for interpreting the pathogenicity of clinically observed mutations in these disease-associated regulatory elements, and comprise a rich dataset for the further development of algorithms that aim to predict the regulatory effects of noncoding mutations.

that the cancer-associated allele (A) led to a significant increase in promoter activity (~8 fold versus empty vector) compared to the unassociated allele (~3.5 fold versus empty vector) 4 . Our MPRA for this promoter was carried out in HeLa cells along with the cotransfection of USF1 and USF2. We did not observe a significant effect on promoter activity for rs1867277 (Supplementary Table 10). Overall, we only observed that 6 out of 1923 variants with a minimum of 10 associated tags showed a significant effect (p-value < 10 -5 ) on promoter activity, but with low expression effects. Our three technical replicates also had the lowest correlation of all MPRA experiments (0.16 Pearson correlation; Supplementary Table 7), precluding our ability to provide definitively interpretable results for this promoter.

GP1BB
Bernard-Soulier syndrome (BSS) is a rare autosomal recessive bleeding disorder characterized by defects of the GPIb-IX-V complex, a platelet receptor for von Willebrand factor (VWF). Most of the mutations identified in the genes encoding for the GP1BB (GPIbα), GP1BB (GPIbβ), and GP9 (GPIX) subunits prevent expression of the complex at the platelet membrane or its interaction with VWF. A mutation at the GP1BB promoter, NM_000407.4:c.-160C>G, is thought to disrupt the second GATA consensus binding site, and was shown to decrease promoter activity by 84% (Supplementary Table 10) 5 . In our MPRA, which was carried out in HEL 92.1.7 cell lines (erythroblasts originating from human bone marrow), we also observed a reduction in promoter activity for this mutation (52% promoter activity compared to baseline). We also saw a significant reduction across the whole GATA binding site. Several additional regions led to reduced promoter activity. For example, chr22: 19,710,710,900 (GRCh37), which is annotated by JASPAR to have binding sites for Ets-related factors 6 , showed significant reduction of promoter activity due to various mutations. Additional regions such as chr22: 19,710,912-19,710,923, chr22:19,710,940-19,710,949 or chr22:19,711,001-19,711,032 (all GRCh37) also encompassed several mutations that led to significant reduction in promoter activity. Overall, we observed a good correlation (0.88 Pearson correlation; Supplementary Table 7) between replicates for this MPRA.

HBB
Mutations in the the Hemoglobin Subunit Beta (HBB) gene cause beta thalassemias (β0 or β+), characterized by reduced hemoglobin levels that lead to lower oxygen levels in the body. The severity of the disease depends on the nature of the mutation. There are 31 clinical variants in the HBB promoter that were reported as disease causing, based on the Leiden Open Variation Database (https://lovd.bx.psu.edu/home.php?select_db=HBB) 7 . In our MPRA, which was carried out in in HEL 92.1.7 cells (erythroblasts originating from human bone marrow), 17 out of 31 clinical variants were found to have a significant effect on promoter activity (p-value < 10 -5 ) (Supplementary Table 10). We observed several blocks where mutations led to reduction in activity, possibly due to transcription factor binding sites. For example, in positions c.-142 to c.-136 there is a known binding site for the erythroid Krüppel factor (EKLF), a zinc-finger transcription factor that plays a critical role in erythropoiesis and regulation of β-globin switching 8 . We observed a decrease in expression of about 12%-19% for previously reported nucleotide variants within this binding site (Supplementary  Table 10). In contrast, the promoter mutation NM_000518.4:c.-18C>G, which has been previously shown to decrease steady state levels of mRNA to 85.2% 9 and is thought to disrupt the binding and transactivation of EKLF (causing mild β+ thalassemia 8 ), did not show a significant effect in our study (Supplementary Table 10). We identified 12 promoter activating variants (showing greater than 20% upregulation in promoter activity), for example variant c.-99C>G showed an 128% increase in activity (Supplementary Table  10) and could potentially have a strong effect on HBB expression. Overall, we observed a good correlation between technical replicates (0.77 Pearson correlation; Supplementary Table 7). There could be several reasons why some of our MPRA results don't completely match previous results for these clinically associated variants. Our pvalue minimum threshold might be too stringent to allow us to characterize these mutations (e.g. if we use a p-value threshold of 0.01, 23 out of the 31 variants show a significant effect on promoter activity), the clinical association might be incorrect and/or experimental differences could be contributing to these differences.  Table 10). Work carried out in human erythroleukemia cells (K562 and HEL) has identified several proteins that may bind to the HBG1 promoter and either activate or repress its activity 10  255C>T decreases the similarity to the SP1 recognition site and is associated with a more moderate increase in HBG1 expression 10,11 .
We carried out MPRA on the HBG1 promoter in HEL 92.1.7 cells (erythroblasts originating from human bone marrow) obtaining high reproducibility between technical replicates (0.92 Pearson correlation; Supplementary Table 7). We detected multiple blocks where mutations lead to reduced promoter activity, suggesting they serve as binding sites for activating transcription factors. We saw a significant repressive activity for mutations within the CAAT, CCAAT and CACCC blocks. However, we did not observe a continuous block of significant variants in our MPRA for the ATGCAAAT octamer sequence (c.  Table 10). For example, c.-228T>C showed a strong increase in expression of over 50% and c.-167C>T led to a 65% reduction in promoter activity. Similar to HBB, these results could be due to: 1) Our pvalue minimum threshold being too stringent (if we use a p-value threshold of 0.01, 8 out of the 14 variants show a significant effect on promoter activity); 2) The clinical association might be incorrect; 3) Experimental differences could all be contributing to these differences 4) Another limitation of our results for this promoter could be due to its regulation being driven by a complex pattern of repressor and activator binding which are temporally expressed.

HNF4A P2 promoter
Mutations in the Hepatocyte Nuclear Factor 4 Alpha (HNF4A) gene lead to maturity-onset diabetes of the young (MODY), a monogenic autosomal dominant subtype of early-onset diabetes mellitus caused by defective insulin secretion of pancreatic beta cells 12,13 . HNF4A has two characterized promoters, P1 and P2. We focused on the distant upstream P2 promoter, which is 46 kb 5' to the P1 promoter and is thought to be the major promoter regulating HNF4A pancreatic beta and hepatic cell expression 12  have been associated with MODY 80,81 and are thought to affect the binding of various transcription factors and reduce promoter activity. Transfection assays with various mutations have identified functional binding sites for HNF1A, HNF1B, pancreatic and duodenal homeobox 1 (PDX1) and other transcription factors known to be associated with MODY 12 . They also showed that mutations c.-136A>G and c.-169C>T lead to reduced promoter activity compared to the wild-type sequence in HEK293 cells 12,13 .
We carried out MPRA in HEK293T cells and observed an average Pearson correlation of 0.89 across replicates (Supplemental Table 7). Out of the five MODY-associated mutations, we only observed a significant effect for c.-192C>G, which increased promoter activity by 20% (Supplementary Table 10 Figure 8).

MSMB
A SNP, rs10993994, in the promoter (-57 bp from TSS) of the Microseminoprotein Beta (MSMB) gene was found to be associated with prostate cancer risk in two separate GWAS 14,15 and showed differential reporter activity 16,17 . The risk allele, T, of this SNP reduced promoter activity to 30% and 13% compared to the unassociated allele in HEK293T human embryonic kidney cells and prostate cancer cell lines (LNCaP), respectively 18,19 . We carried out MPRA for this promoter in HEK293T cells line, and observed a good correlation between replicates (0.88 Pearson correlation; Supplementary Table 7). For rs10993994, we obtained a 15% increase in promoter activity for the risk allele (T) instead of reduction in promoter activity as observed in LNCaP cells (Supplementary Table 10). This is likely due to the difference in cell lines used. The MSMB promoter has two potential androgen response element half sites (TGTTCT) located -113 to -118 bp 5' to the TSS and another is 23 bp upstream of SNP rs10993994 18 (chr10:51,549,467-51,549,472; GRCh37), which are likely to only get activated in LNCap cells but not in HEK293T cells. We observe multiple blocks that showed significant effects on promoter activity, which align with ENCODE motifs and conservation at the 3' half, but not on the other half (Supplementary Figure 8).

PKLR
Three different single nucleotide mutations (NM_000298.5:c.-324T>A, c.-83G>C, c.-72A>G) and a polymorphic deletion c.-248delT in the promoter of the Pyruvate Kinase L/R (PKLR) gene have been found in patients with pyruvate kinase (PK) deficiency 20,21 , which leads to anemia. Functional analysis for both rat and human PKLR promoters revealed four conserved motifs (CAC/SP1, PKR-RE, GATA1) within the first 250 bp upstream region 21 . Both c.-72A>G and c.-83G>C significantly reduced promoter activity. c.-72A>G has been attributed to disruption of the consensus binding motif for GATA-1 at c.-69 to c.-74. c.-83G>C was shown to be a part of a core binding motif (CTCTG) of a novel regulatory element (PKR-RE1) in the erythroid-specific promoter of PKLR, in close proximity to a regulatory GATA-1 binding site. c.-324T>A did not show an effect on promoter activity, c.-248delT turned out to be a non-functional polymorphism 20,21 .
We carried out MPRA for this promoter using two different post-transfection time points (24 and 48 hours) in K562 cells and observed a good correlation between replicates (0.86 and 0.91 Pearson correlation for 24hr and 48hr respectively; Supplementary Table 7). We observed a significant reduction in promoter activity for two of the disease causing mutations, c.-72A>G (24h=-1.77, 48h=-2.40) and c.-83G>C (24h=-1.25, 48h=-2. 19). For c.-324T>A and c.-248delT we did not observe a significant effect which is consistent with the previous functional study 20,21 . Around 15% of all tested variants showed a significant effect, with the majority of these variants being located either near the 3' or 5' end, suggesting that these two regions are important for PKLR promoter activity.  Table 12, 13), compared to all tested promoter elements.

Enhancers (in alphabetical order)
BCL11A B cell CLL/lymphoma 11A (BCL11A) is a transcription factor that regulates hemoglobin switching and has a characterized erythroid enhancer where common genetic variation has been associated with fetal hemoglobin (HbF) levels 22 . Saturation mutagenesis of the BCL11A enhancer via CRISPR-Cas9 guide RNA libraries uncovered key functional regions in this enhancer and implicated DNase hypersensitive site (DHS) +58 to have the most significant effect on BCL11A expression and HbF levels 23 . We thus used DHS +58 for our MPRA, which was carried out in the HEL92.1.7 erythroblast cell line. For the BCL11A experiments, we did not observe high correlations between replicates (0.38 Pearson correlation; Supplementary Table 7), possibly due to low fold activation by this enhancer in our assay/cell line (2.5 fold compared to empty vector) and a high mutation rate from the error-prone PCR step (creating an average of 6 variants per construct and very few wild-type sequences; factors that negatively impact model fit). Only 3% of the tested variants had a significant effect (p-value < 10 -5 ). The average log2 expression effect of all variants was only 0.064 (4.5%) and 0.21 (16%) for the subset of significant variants. The low activation we observed for this enhancer could be due to the use of different cells lines. HEL92.1.7 is an established erythoblast cell line collected from bone marrow. HUDEP-2 24 , which was used for the CRISPR-Cas9 assays, are human erythroid progenitors derived from the umbilical cord and are thought to be more similar to adult erythroid cells. It could also be due to differences in experimental methodologies, i.e. endogenous mutations via CRISPR-Cas9 that assay phenotypic changes versus MPRAs that test the ability of the sequence to drive regulatory activity.

IRF4
A SNP, rs12203592, within an intron of the gene encoding the Interferon Regulatory Factor 4 (IRF4), a transcription factor involved in pigmentation, is strongly associated with sensitivity of skin to sun exposure, freckles, blue eyes and brown hair color 25 . IRF4 is activated by the melanocyte master regulator melanocyte inducing transcription factor (MITF) along with the transcription factor AP-2 alpha (TFAP2A). The pigmentationassociated allele, rs12203592-T, is thought to disrupt a TFAP2A binding site and was shown to reduce enhancer activity 25 . We carried out saturation-based MPRA for this enhancer in human melanoma SK-MEL-28 cells and observed strong correlations between technical replicates (0.99 Pearson correlation; Supplementary Table 7). The rs12203592-T allele led to a significant reduction in enhancer activity, 36% compared to wild-type (Supplementary Table 11). Overall, we observed that 37% of the variants had a significant effect on enhancer activity. Conservation seems to be an informative measurement for activity and IRF4 has one of the best agreement between different annotations and absolute expression effects (see Results and Supplementary Table 12,13).

IRF6
A SNP, rs642961, in an Interferon Regulatory Factor (IRF6) associated-enhancer was shown to confer an 18% attributable risk for isolated cleft lip 26 . This variant is part of a common haplotype, rs76145088 (G>A), rs642961 (G>A) and rs77542756 (A>G) and the AAG associated-haplotype is thought to disrupt AP-2α binding sites and increase enhancer activity compared to the unassociated haplotypes in human foreskin keratinocyte HFK cells 26 . However, this difference did not reach statistical significance. Saturation based MPRA for this IRF6 enhancer was done in the human keratinocyte HaCaT cells and obtained a strong correlation between technical replicates (0.96 Pearson correlation; Supplementary Table 7). We observed that around 19% of all variants show a significant effect on enhancer activity. We did not observe a significant effect on enhancer activity for rs642961 and rs77542756 by themselves (Supplementary Table 11). However, the region where rs642961 and rs77542756 were located (chr1:209,989,152-209,989,314; GRCh37) had several activating mutations. For SNP rs76145088 our MPRA detected a significant repression by 18% (Supplementary Table 11). We could not detect an enrichment of significant variants at the AP-2α binding sites and because of the design of our MPRA, the associated-haplotype AAG is not covered. IRF6 had several conserved sites, but most of them do not align with blocks of significant expression effects (Supplementary Figure 9), leading to low correlation of conservation and functional scores (Supplementary Table 12 , 13). Of note, although rs642961 shows minor effects on reporter gene expression in cell culture, it is possible that it has significantly larger or even opposing effects on IRF6 expression in vivo in its native chromosomal context and/or in the presence of trans-acting variants 26 .

MYC rs6983267 and rs11986220
Activation of the MYC proto-oncogene, bHLH transcription factor (MYC) gene is a hallmark of cancer initiation and maintenance. Several GWAS reported association between the G-allele of rs6983267 and increased risk for various types of cancers such as colorectal and prostate cancer [27][28][29] . Studies have also shown that a region including rs6983267 has enhancer activity and interacts with the proto-oncogene MYC during early stages of colorectal cancer development 30,31 . rs11986220 resides ~100 kb away from rs6983267, and is associated with prostate cancer risk independent of rs6983267 32 . rs11986220 resides within a FOXA1 binding site, and the prostate cancer risk allele (A) is thought to facilitate both stronger FOXA1 binding and androgen responsiveness 32 . To study the individual variant effects of these SNPs and their surrounding sequences, we generated saturation mutagenesis libraries of 600 bp around rs6983267 and 464 bp around rs11986220. We tested the MYC rs6983267 MPRA library in human embryonic kidney cells, HEK293T, adding 20mM LiCl to activate the Wnt pathway and the MYC rs11986220 in human prostate adenocarcinoma LNCaP cells grown in a medium with 100nM Dihydrotestosterone to stimulate androgen activity.
For MYC rs6983267, we observed a strong correlation between replicates (0.75 Pearson correlation; Supplementary Table 7). We did not see a change in activity (p-value of coefficient 0.05; log2 expression effect -0.08 / 95% activity) due to the cancer-associated rs6983267 allele (T) (Supplementary Table 11). We observed a stretch of nucleotides that led to a reduction in enhancer activity at chr8:128,413,089-128,413,107 (GRCh37). This region correlates with an annotated CTCF binding site as determined by ERB v90 33 . Of note, two variants, chr8:128,413,279C>G and chr8:128,413,289G>T (GRCh37) led to a strong increase in enhancer activity by 86% and 84%.
For MYC rs11986220, we did not obtain a good correlation between replicates (0.31 Pearson correlation; Supplementary Table 7). rs11986220 showed a 14% repressing activity, but was not statistically significant (Supplementary Table 11). In total, only 11 out of 1685 variants had an expression effect on enhancer activity at a significance level of 10 -5 .

RET
A common non-coding variant, rs2435357, within an enhancer in intron 1 of the gene Rearranged During Transfection Proto-oncogene (RET) is significantly associated with Hirschsprung disease (HSCR) susceptibility 34 . The disease-associated variant leads to a significant reduction in reporter activity (6-8 fold) compared to the unassociated allele in Neuro-2a cells 34 . Saturation mutagenesis based MPRA was carried out for this enhancer in Neuro-2a cells, with the library cloned into the pGL3 vector, similar to the experiment that originally tested this enhancer 34 . Overall, we observed a reasonable correlation between technical replicates (0.71 Pearson correlation; Supplementary Table 7). We did not detect a significant effect on enhancer activity for rs2435357 (Supplementary Table  11). However, this SNP is located within a sequence block that has multiple variants with significant activating and repressing effects (chr10:43,582,096-43,582,232; GRCh37). Both rs2506005 and rs2506004, which reside 76 bp upstream and 217 bp downstream of rs2435357 respectively, and are in linkage disequilibrium with this SNP, also showed no significant effect on enhancer activity. A region located towards the 3'-end of the sequence (chr10:43,582,390-43,582,526; GRCh37) is enriched in variants that led to differential enhancer activity.

TCF7L2
GWAS for type-2 diabetes (T2D) found a significant association with SNP rs7903146, located within an intron of the gene Transcription Factor 7 Like 2 (TCF7L2). The rs7903146 risk T allele is associated with impaired insulin secretion, incretin effects and enhanced rate of hepatic glucose production 35 . Luciferase assays of both alleles in mouse pancreatic beta MIN6 cells exhibited significant allele-specific differences with the T allele leading to a three fold stronger enhancer activity compared to the non-risk allele (Supplementary Table 11) 102 . We carried out mutation saturation based MPRA for this enhancer in MIN6 cells, which covered over 99% of all mutations in the tested region (Supplementary Table 3) and obtained high variant effect correlations between replicates (0.76 Pearson correlation; Supplementary Table 7). Functional and conservation scores do not correlate well with the observed variant effects, but we observed some correlation with JASPAR motifs (Supplementary Table 12, 13). We do observe a significant effect on enhancer activity for rs7903146, even though only a 19% expression increase.

UC88
Ultraconserved elements (UCEs) are sequences greater than 200 bp in length that are identical between human, mouse and rat 36 . Half of the UCEs were shown to be functional enhancers in developing mouse embryos, the majority of which are active in the central nervous system 37 . UC88 is located on chr2:162,094,920-162,095,508 (GRCh37) and was shown to be a robust forebrain enhancer at mouse embryonic day E11.5 (hs416; VISTA enhancer browser 38 ). UC88 was tested for enhancer activity in mouse neuroblastoma Neuro-2a cells and showed significant enhancer activity (9.3 fold compared to empty vector). Mutation saturation based MPRA was performed on UC88 in Neuro-2a cells. Replicates had a lower replicate correlation (0.64 average Pearson coefficient); however, this increased to 0.9 for significant variants (using a lenient p-value threshold of < 0.01) (Supplementary Table 7). This limited our interpretability to 10% (196 out of 1964, p-value < 0.01) of all measured variants in UC88, not allowing us to see clear patterns regarding known motifs or other annotations (Supplementary Figure 9). Overall, we obtained a low correlation of functional, conservation and motif scores (Supplementary Table 12, 13).

ZFAND3
A Type 2 Diabetes GWAS found an associated with rs58692659, which is ~10 kb upstream to the zinc finger AN1-type containing 3 (ZFAND3) gene 39 . The rs58692659 SNP maps to an open chromatin region, identified previously by both islet FAIRE-seq and ChIP-seq for H2A.Z, and is thought to be bound by several beta cell transcription factors, including PDX1, FOXA2, NKX2.2 NKX6.1 and NEUROD1. Further TFBS analyses and EMSA suggest that the T allele alters the binding of NEUROD1 39 . Luciferase assays showed reduced enhancer activity (~5 fold) for the rs58692659 risk allele (T) compared to the C allele in mouse MIN6 beta cells 39 . We carried out mutation saturation based MPRA on this enhancer in MIN6 cells and obtained high variant effect correlations between replicates (0.89 Pearson correlation; Supplementary Table 7). For rs58692659, we observed a significant reduction in enhancer activity (70% residual activity) for the disease-associated allele (T) which is consistent with the previous reported luciferase results. Furthermore, we found that rs58692659 resides in a region (chr6:37,775,451-37,775,758; GRCh37) that with numerous mutations that lead to either a reduction or increase in enhancer activity (Supplementary Table 11). This region is highly conserved in vertebrates and contains predicted binding sites for the islet-enriched transcription factors RFX4, MAX and NHLH1 39 as annotated by JASPAR 2018 6 . Due to the high correlation of conservation, the tested functional scores result in high correlations as well (Spearman correlation ~0.4; Supplementary Table 12, 13).

ZRS
Mutations in the Sonic Hedgehog (SHH) limb enhancer, called the zone of polarizing activity regulatory sequence (ZRS), lead to a wide array of limb malformations 40 . These include polydactyly, triphalangeal thumb (TPT), syndactyly, tibial hypoplasia and Werner mesomelic syndrome (OMIM #188770) [41][42][43][44][45][46] . Several of these mutations were tested for enhancer activity using mouse transgenic assays, with the majority showing ectopic expression compared to the wild-type sequence 40 . In vitro assays were also done for a 1.7 kb version of this enhancer in NIH3T3 mouse fibroblasts which showed higher enhancer activity upon co-transfection of homeobox d13 (Hoxd13) and heart and neural crest derivatives expressed 2 (Hand2). As we were limited in length due to technical reasons, we first tested a 485 bp ZRS sequence that encompasses the majority of disease causing mutations in this in vitro system. We observed a significant fold change for co-transfections with Hoxd13 by itself (4.2 fold) or along with Hand2 (2.0 fold) compared to the empty vector. We thus set out to do saturation mutagenesis MPRA in NIH3T3 cells using both conditions (i.e. Hoxd13 or Hoxd13+Hand2).
We observed reasonable correlations between technical replicates (0.76 Pearson correlation for both Hoxd13 or Hoxd13+Hand2; Supplementary Table 7). Overall, we did not detect a high fraction of significant variant effects and those that were significant were distributed over the entire region with stronger effects towards the 3' end. For most disease-causing mutations, we did not observe a significant effect, likely due to technical reasons. These could be due to this enhancer having important spatial and temporal activities in the developing limb that are likely not represented in an in vitro setting that uses a single cell type along with the co-transfection of additional factors. However, we did observe an effect for some known mutations. These include chr7:156,584,153T>C (GRCh37; ZRS417), which resulted in 16% and 9% increase with Hoxd13 and Hoxd13+Hand2 respectively (Supplementary Table 11), and is known to cause a more severe limb phenotype which includes polydactyly of the four extremities and bilateral tibial deficiency 44 . Another interesting mutation is chr7:156,584,285C>A (GRCh37; ZRS285) that led to 17% and 13% higher enhancer activity with Hoxd13 and Hoxd13+Hand2 respectively (Supplementary Table 11), and is known to cause polydactyly in Silkie chickens 47 .

Supplementary Figures
Supplementary Figure 1 A B Promoter and enhancer luciferase assays. Bar charts representing the relative luciferase activity of the 10 selected promoters (A) and 10 selected enhancers (B) for either the wild-type (blue) or the saturation mutagenesis library (mutant library; red) normalized to the empty vector. In (A), for FOXE1, 7.5ug of upstream transcription factor (USF) 1 and 2 were co-transfected along with the vector/library. All promoters activities were measured after 24hr transfection except for PKLR which was measured after 48hr. In (B), for ZRS, 3.75ug HOXD13 or HOXD13 plus HAND2, were co-transfected along with the vector/library. For MYC (rs11986220), 10nM dihydrotestosterone (DHT) were also co-transfected. For MYC (rs6983267), 20mM of LiCl was added after 24hr to induce Wnt pathway activity and luciferase activity was measured at 32 hours. For all other enhancers, luciferase activity was measured 24 hours after transfection. All results are mean ± standard deviation of at least 3 independent experiments. SORT1 enhancer assays. Bar charts representing the relative luciferase activity of the SORT1 enhancer wild-type sequence (blue), mutant library (red) and flip library (light red). Luciferase levels were measured following 24 hour and results are plotted with mean ± standard deviation of at least 3 independent experiments.
Tags assigned from saturation mutagenesis libraries. Plots for all 24 assignments described in Supplementary Table 3. Variant are sampled evenly across all the regions. The number of tags per variant follows previously characterized biases in error-prone PCR using Taq polymerase, with a preference of transitions over transversions and T-A preference among transversions. Insertions were rare, while short deletions occur at rates similar to those of the rare transversions. Shape denotes the reference and the color the alternative allele of the variant. The "-" in the legend stands for a 1 bp deletion.

Quantitative RT-PCR validation results of TERT-GBM short-interfering RNA experiments.
To measure siRNA knockdown efficiency, qPCR was performed using the Ambion Power SYBR Green Cells-to-Ct kit to measure mRNA abundance of GABPA and TERT with primer sequences previously used (see Methods). Relative expression levels were calculated using the deltaCT method against the housekeeping gene GUSB.

51
Overlay of inferred variant effects for SNVs from promoter elements with available annotation data. Motif predictions overlayed with biochemical evidence from ChIP-seq experiments were obtained from Ensembl Regulatory Build (ERB) 33 and ENCODE 3 annotations. JASPAR motif predictions were obtained from the JASPAR 2018 6 release.

Overlay of inferred variant effects for SNVs with available annotation data for all enhancer elements.
Motif predictions overlayed with biochemical evidence from ChIPseq experiments were obtained from Ensembl Regulatory Build (ERB) 33 and ENCODE 3 annotations. JASPAR motif predictions were obtained from the JASPAR 2018 6 release.

Supplementary Figure 10
Receiver operating characteristic curves of computational scores used for the binary classification of promoter variants. Plots are divided into the classification of the top 200, 500 and 1000 high-effect promoter variants (p-value < 10 -5 , min tags 10) versus the same number of random variants (per element) with no expression effect (log2 value < 0.05, min tags 10). Curves show the average performance across 100 classification rounds (see Methods). If multiple scores exist per method, only the best score is plotted (see Supplementary Table 16).

Supplementary Figure 11
Precision-recall curves of computational scores used for the binary classification of promoter variants. Plots are divided into the classification of the top 200, 500 and 1000 high-effect promoter variants (p-value < 10 -5 , min tags 10) versus the same number of random variants (per element) with no expression effect (log2 value < 0.05, min tags 10). Curves show the average across 100 classification rounds (see Methods). If multiple scores exist per method, only the best score is plotted (see Supplementary Table 16).

Supplementary Figure 12
Receiver operating characteristic curves of computational scores used for the binary classification of enhancer variants. Plots are divided into the classification of the top 200, 500 and 1000 high-effect enhancer variants (p-value < 10 -5 , min tags 10) versus the same number of random variants (per element) with no expression effect (log2 expression effect < 0.05, min tags 10). Curves show the average across 100 classification rounds (see Methods). If multiple scores exist per method, only the best score is plotted (see Supplementary Table 17).

Supplementary Figure 13
Precision-recall curves of computational scores used for the binary classification of enhancer variants. Plots are divided into the classification of the top 200, 500, and 1000 high-effect enhancer variants (p-value < 10 -5 , min tags 10) versus the same number of random variants (per element) with no expression effect (log2 expression effect < 0.05, min tags 10). Curves show the average across 100 classification rounds (see Methods). If multiple scores exist per method only the best score is plotted (see Supplementary  Table 17).

Supplementary Tables
Supplementary Table 1 Tested promoter sequences. Note that the construct length of MSMB is 2 bp longer than the described reference genome coordinate range, due to an insertion polymorphism present in the construct. Vector sequences for pGL4.11b and pGL4.11c are available for download from NCBI GenBank (accessions MK484103 and MK484104, respectively) and from our OSF project 75B2M.  Table 2 Tested enhancer sequences. Note that the construct length of MYC (rs11986220) is 1 bp longer than the described reference genome coordinate range, due to an insertion polymorphism present in the construct. Human ZRS (hZRS, a limb-specific enhancer of the Sonic hedgehog gene) was co-transfected with HOXD13 and HOXD13+HAND2. UC88 is an ultraconserved element, which has not been previously associated with a phenotype. Vector sequences for pGL3c, pGL4.23c/d, and pGL4Zc are available for download from NCBI GenBank (accessions MK484107, MK484105/MK484106, and MK484108, respectively) and our OSF project 75B2M.

Supplementary Table 7
Pearson correlation across different transfection replicates. Correlation from the three transfection replicates of fitted SNV and 1 bp deletion variant effects divided by their standard deviation (requiring at least 10 tags in each replicate). MYC (rs11986220) is abbreviated to MYCs1 and MYC (rs6983267) abbreviated to MYCs2. TERT-HEK293T, TER-GBM, TERT-GBM-SiGABPA and TERT-GBM-SiScramble-2 are abbreviated to TERT-H, TERT-G, TERT-GAa and TERT-GSc, respectively. The left side considers all estimates obtained, the right side tries to exclude nonsignificant allele effects by requiring a lenient p-value threshold (<0.01) in at least one of the replicates. We use the reduction in included alleles as a proxy for the proportion of alleles with regulatory activity for each element (Frac. Supplementary Table 9 Effects of specific substitution or transition (ts) vs. transversion (tv) effects for significant readouts across a representative experiment of each element. The number of mutations with a significant p-value (< 10 -5 ) is reported from models combining transfection replicates. The overlap with single nucleotide variants present in the gnomAD r2.1 is reported. Elements from Supplementary  Total  SNVs  28937  4601  2756  602  689  144  75  13   Total  Tv  19113  2411  1755  419  244  30  21  8   Total  Ts  9824  2190  1001  183  445  114  54  5 Previously described variants of promoters . Clinical variants of promoters (HBB,  HBG1, PKLR, F9, GP1BB, TERT, HNF4A, MSMB, and FOXE1) measured in this MPRA study, as well as additional described variants in Supplementary Note 1. Corresponding RefSeq transcripts of the HGVS annotation are listed in Supplementary Table 1. Fold changes are reported as "non significant" (n.s), if the associated p-value is higher than 10 -5 . Variants that were not available from our experiments are marked as "not covered" (n.c.). We are including effects for the PKLR_24h and PKLR_48h (PKLR_24h/PKLR_48h) as well as TERT-GBM and TERT-HEK experiments (TERT-GBM/TERT-HEK). Previous reported phenotypes in literature are summarized in the Phenotype column.

Supplementary Table 16
Performance of computational scores for binary classification of promoter variants with high and low expression effect. The top 200, 500, and 1000 highest expression effect promoter variants (p-value < 10 -5 , min tags 10) were selected and matched with no effect variants (log2 expression effect < 0.05: min tags 10). The same number of effect/no effect variants was sampled for each element, without normalizing each elements' contribution to the overall set.

Supplementary Table 17
Performance of computational scores for binary classification of enhancers variants with high and low expression effect. The top 200, 500, and 1000 highest expression effect enhancer variants (p-value < 10 -5 , min tags 10) were selected and matched with no effect variants (log2 expression effect < 0.05: min tags 10). The same number of effect/no effect variants was sampled for each element, without normalizing each elements' contribution to the overall set. In fact, the three elements contributing the  Table 14) for which cell-types where publically available (HEK293T, HeLa S3, HepG2, K562, and LNCaP). In cases where an annotation is based on positions rather than alleles, we assumed the same value for all substitutions at each position. Bold text marks the best performing method.

Supplementary Table 21
Primers for preparing association libraries for short enhancers/promoters. These primers are also used to amplify longer enhancers and then those products are fragmented for subassembly (Supplementary Table 22). Plasmid DNA libraries are amplified with sequencing adaptor primers capturing the cloned sequence with its tag and adding the P5/P7 Illumina flow cell sequences. All sequences are provided 5' to 3'.