# Saturation mutagenesis of twenty disease-associated regulatory elements at single base-pair resolution

## Abstract

The majority of common variants associated with common diseases, as well as an unknown proportion of causal mutations for rare diseases, fall in noncoding regions of the genome. Although catalogs of noncoding regulatory elements are steadily improving, we have a limited understanding of the functional effects of mutations within them. Here, we perform saturation mutagenesis in conjunction with massively parallel reporter assays on 20 disease-associated gene promoters and enhancers, generating functional measurements for over 30,000 single nucleotide substitutions and deletions. We find that the density of putative transcription factor binding sites varies widely between regulatory elements, as does the extent to which evolutionary conservation or integrative scores predict functional effects. These data provide a powerful resource for interpreting the pathogenicity of clinically observed mutations in these disease-associated regulatory elements, and comprise a rich dataset for the further development of algorithms that aim to predict the regulatory effects of noncoding mutations.

## Introduction

The vast majority of the human genome is noncoding. Nonetheless, even as the cost of DNA sequencing plummeted over the past decade, the primary focus of sequencing-based studies of human disease has been on the ~1% that is protein-coding, i.e., the exome. However, it is clear that disease-contributory variation can and does occur within the noncoding regions of the genome1,2. For example, there are many Mendelian diseases for which specific mutations in promoters and enhancers are unequivocally causal3. Furthermore, for most Mendelian diseases, not all cases are explained by coding mutations, suggesting that regulatory mutations may explain some proportion of the remainder. For common diseases, although coding regions may be the most enriched subset of the genome, the vast majority of signal maps to the noncoding genome, and in particular to accessible chromatin in disease-relevant cell types4,5.

Nonetheless, the pinpointing of disease-contributory noncoding variants among the millions of variants present in any single individual6, or the hundreds of millions of variants observed in human populations7, remains a daunting challenge. To advance our understanding of disease as well as the clinical utility of genetic information, it is clear that we need to develop scalable multiplex assays of variant effect (MAVEs)8, specifically methods for accurately assessing the functional consequences of noncoding variants.

While our mechanistic understanding of regulatory sequences remains limited9, several groups, including us, have developed tools that summarize large amounts of functional genomic data (e.g., evolutionary conservation, gene model information, histone or TF ChIP-seq peaks, transcription factor binding site (TFBS) predictions) into scores that can be used to predict noncoding variant effects (e.g., CADD10, DeepSEA11, Eigen12, FATHMM-MKL13, FunSeq214, GWAVA15, LINSIGHT16, and ReMM17), segment annotations (e.g., chromHMM18, Segway19, and fitCons20), or sequence-based models (deltaSVM21). However, although these scores are widely used, it remains unclear how well they work.

A major bottleneck in the development of any interpretive method for noncoding variants is the assessment of prediction quality, as there are relatively few known pathogenic noncoding variants, nor consistently ascertained sets of functional measurements of noncoding variants. A recent study by Smedley et al.17 cataloged a total of 453 known disease-associated noncoding single nucleotide variants (SNVs) and used those to derive a score (ReMM). However, many of these variants fall within a small number of promoters that have been extensively studied. This leaves in question how generalizable the resulting scores are. Furthermore, catalogs of disease-associated variants like the one used by Smedley et al.17, or available from ClinVar22 or HGMD23, provide only qualitative labels for SNVs (e.g., likely pathogenic), rather than quantitative information on the magnitude of the effect. In sum, the qualitative nature, possible ascertainment biases, and relative paucity of “known” functional noncoding variants severely limit the assessment of available methods. While massively parallel reporter assays (MPRAs) that can assess thousands of sequences and their variants for their activity have been used to test the effect of a large number of variants, these have been primarily focused on common variants24,25 or less than a handful of disease-associated regulatory elements26,27.

To address this gap, we set out to generate variant-specific activity maps for 20 disease-associated regulatory elements, including ten promoters (of TERT, LDLR, HBB, HBG, HNF4A, MSMB, PKLR, F9, FOXE1, and GP1BB) and ten enhancers (of SORT1, ZRS, BCL11A, IRF4, IRF6, MYC (2×), RET, TCF7L2, and ZFAND3), together with one ultraconserved enhancer (UC88)28,29. Specifically, we use MPRAs to perform saturation mutagenesis26,27 on each of these regulatory elements, spanning 28–73% in GC content and 187–601 base pairs (bp) in length. Altogether, we empirically measure the functional effects of over 30,000 SNVs or single nucleotide deletions. We observe that the density of putative TFBS varies widely across the elements tested, as does the performance of various predictive strategies. These data comprise a comprehensive resource for the benchmarking and further development of noncoding variant effect scores, as well as an empirical database for the interpretation of the disease-causing potential of nearly any possible SNV in these regulatory elements.

## Results

### Selection of disease-associated elements

We selected 21 regulatory elements, including 20 commonly studied, disease-relevant promoter and enhancer sequences from the literature (Supplementary Tables 1 and 2), and one ultraconserved enhancer (UC88). For the former, we focused primarily on regulatory sequences in which specific mutations are known to cause disease, both for their clinical relevance and to provide for positive control variants. Selected elements were limited to 600 bp for technical reasons related to the mapping of variants to barcodes by subassembly30. In addition, we selected only sequences where cell line-based reporter assays were previously established.

For example, we selected the low-density lipoprotein receptor (LDLR) promoter, where mutations have been shown to cause familial hypercholesterolemia (FH), a disorder that results in accelerated atherosclerosis and increased risk for coronary heart disease31,32,33. We also tested the core promoter region (–200 to +57) of the telomerase reverse transcriptase (TERT) gene which is associated with oncogenic mutations34. In particular, NM_198253.2:c.-124C>T or c.-146C>T are frequently found in several cancer types, including glioblastoma34,35,36,37.

We also selected a sortilin 1 (SORT1) enhancer. A series of genome-wide association studies showed that the minor allele of a common noncoding polymorphism at the 1p13 locus (rs12740374) creates a CCAAT/enhancer binding protein (C/EBP) TFBS and increases the hepatic expression of the SORT1 gene, reducing LDL-C levels and risk for myocardial infarction in humans38. We cloned a ~600 bp region that includes rs12740374 as well as most nearby annotated TFBS according to ENCODE data (wgEncodeRegTfbsClusteredV3 track, UCSC Genome Browser), to identify additional functional variants in the enhancer and surrounding region. For this enhancer, we also conducted MPRA experiments in both forward and reverse orientations, with the goal of testing for any directionality dependence of variant effects.

All 21 selected promoter and enhancer regions were individually validated for functional activity in the appropriate cell lines (Supplementary Figs. 1 and 2; Supplementary Tables 1 and 2). This initial validation allowed us to optimize reporter assay conditions and to confirm that the cloned subsequences of the candidate regulatory elements resulted in measurable activities in the appropriate cell types. The validated luciferase expression levels ranged from 2- to 200-fold over empty vector (Supplementary Tables 1 and 2).

### Construction of saturation mutagenesis libraries

In order to test the functional effects of thousands of mutations in these selected disease-associated regulatory elements, we first developed a scalable protocol for saturation mutagenesis-based MPRAs26,27 (Fig. 1). For each of the 21 regulatory elements (Supplementary Tables 1 and 2), we used error-prone PCR to introduce sequence variation at a frequency of less than 1 change per 100 bp. While error-prone PCR is known to be biased in the types of mutations that are generated (e.g., a preference for transitions and T/A transversions)39, high library complexities (50k–2M constructs per target) allowed us to capture nearly all possible SNVs as well as many 1-bp deletions with multiple independent constructs per variant (Supplementary Table 3). To distinguish the individual amplification products, we incorporated 15 or 20 bp random sequence tags 3′ of the target region using overhanging primers during the error-prone PCR.

We then cloned promoters and all, but two, enhancers into the backbones of slightly modified pGL4.11 (Promega, promoter) or pGL4.23 (Promega, enhancer) vectors (see Supplementary Tables 1 and 2), respectively, from which the reporter gene (as well as the minimal promoter in the case of enhancers) had been removed. For each of the 21 regulatory elements, we determined which variants were linked to which random tag sequences by deeply sequencing the corresponding library (see the “Methods” section). In the final step, we inserted the luciferase reporter gene (as well as the minimal promoter in the case of enhancers) in between the regulatory element and the tag sequence, and transformed the MPRA library into E.coli. With this insertion of the luciferase reporter gene, the above-introduced random tag sequence becomes part of its 3′ untranslated region (3′ UTR).

We obtained tag assignments (i.e., variant-tag associations) for a total of 24 saturation mutagenesis libraries (see the “Methods” section). This included the 21 selected regions listed in Supplementary Tables 1 and 2, as well as an additional full replicate for the LDLR and SORT1 enhancer libraries, and an additional SORT1 library with reversed sequence orientation of the enhancer. Supplementary Fig. 3 plots the number of tags associated with substitutions and 1-bp deletions along the target sequences for each library. The representation of tags associated with specific variants follows previously characterized biases in error-prone PCR using Taq polymerase40, with a preference of transitions (exchange of purine for pyrimidine base) over transversions (exchange between two-ring purines A/G to one-ring pyrimidines C/T) and T-A preference among transversions. Insertions were rare, while short deletions occured at rates similar to those of the rare transversions. For all libraries, we observed complete or near-complete coverage of all potential SNVs (Supplementary Table 3) as well as partial 1-bp deletion coverage. On average, 99.9% [99.1%, 100%] of all potential SNVs in the targeted regions are associated with at least one tag, while on average 55.4% [31.4%, 71.1%] of 1-bp deletions are associated with at least one tag.

For each MPRA experiment, around 5 million cells (Supplementary Tables 1 and 2) were plated and incubated for 24 h before transfection with the libraries. In each experiment, three independent cultures (replicates) were transfected with the same library. In addition, for LDLR and SORT1, independent MPRA libraries were created, as outlined above, and cells were transfected from a different culture and on a different day. In one case (TERT), the same MPRA library was used for experiments in two different cell types (HEK293T and a glioblastoma cell line).

We then used our published protocol for quantifying effects from RNA and DNA tag-sequencing readouts, including the previously suggested modification of using unique molecular identifiers (UMIs) during targeted amplification41 (see the “Methods” section). More specifically, the relative abundance of reporter gene transcripts driven by each promoter or enhancer variant was measured by counting associated 3′ UTR tags in amplicons derived from RNA (obtained by targeted RT-PCR), and normalized to its relative abundance in plasmid DNA (obtained by targeted PCR). For all experiments, we excluded tags not matching the assignment and determined the frequency of a tag in RNA or DNA from high-throughput sequencing experiments based on the number of unique UMIs. We only considered tag sequences observed in both RNA and DNA. Supplementary Table 4 summarizes the number of RNA and DNA counts obtained in each experiment. From individual tag counts in RNA and DNA, we fit a multiple linear regression model to infer individual variant effects (see the “Methods” section).

For data quality reasons, we introduced a minimum threshold on the number of associated tags per variant used in model fitting (various quality measures for fitted variant effects versus the number of tags are plotted in Supplementary Fig. 4). We picked this threshold based on the correlation of variant effects obtained when comparing between the independent libraries of LDLR (Fig. 1b and Supplementary Fig. 5) and SORT1 (Fig. 1c and Supplementary Fig. 6). Using all SNVs and 1-bp deletions with at least one associated tag in each transfection replicate, variant effects show a Pearson correlation of 0.93 (LDLR) and 0.94 (SORT1). When requiring a minimum of ten tag measurements after combining all three transfection replicates, correlations increase to 0.97 (LDLR) and 0.96 (SORT1). Requiring even higher thresholds (Supplementary Table 5) further improves replicate correlation up to 0.98 for both experiments (min. 50 tags), but reduces SNV coverage to 86.3% and 1-bp deletion coverage to 15.6% across all datasets (Supplementary Table 6). We therefore used a minimum of ten tags, reducing average coverage from 99.8% to 98.4% [93.0%, 100.0%] for all putative SNVs and from 44.4% to 25.3% [10.5%, 41.7%] for 1-bp deletions.

To assure high quality of our complete dataset, we evaluated the Pearson correlation of variant effects divided by their standard deviation among pairs of transfection replicates (Supplementary Table 7). We observed the lowest replicate correlation for BCL11A, FOXE1, and one of the MYC elements (rs1198622). In contrast, experiments for HBG1, IRF4, LDLR, SORT1, and TERT exhibited high reproducibility among transfection replicates (Pearson correlation > 0.9). Exploring differences in the proportion of alleles with significant regulatory activity, we observed a wide range of values across elements (3–52% of variants using a lenient p-value threshold of <0.01 for the fit; average 22%; Supplementary Table 7). We find that this proportion is strongly correlated with the performance of transfection replicates (Pearson correlation of 0.78), but we also note a circular relation for the significance of results, low experimental noise, and high reproducibility.

We sought to explore whether factors like target length, wild-type activity in the luciferase assay (Supplementary Fig. 1, Supplementary Tables 1 and 2), measures of assignment complexity (Supplementary Table 3), as well as DNA and RNA sequencing depth (Supplementary Table 4) contribute to technical reproducibility. Linear models of up to three features fit in a leave-one-out setup explain up to 33% of variance (luciferase wild-type activity, average number of tags per SNVs, and the proportion of wild-type haplotypes) or 29% of the variance (luciferase wild-type activity, average number of variants per haplotype, and average number of DNA counts obtained) in reproducibility between transfection replicates. Overall, these analyses emphasize the baseline activity of a regulatory element as the largest factor (i.e., highly active elements are associated with greater technical reproducibility).

### General properties of observed regulatory mutations

Altogether, our MPRAs quantified the regulatory effects of 31,243 potential mutations (min. ten tags) at 9834 unique positions (Supplementary Table 8) and we setup an interactive website for exploring this dataset (https://mpra.gs.washington.edu/satMutMPRA/). Of the unique mutations, 4830 (15%) were identified as causing significant changes relative to the wild-type promoter or enhancer sequence (p-value of fit <10−5). Of those causing significant changes, 1789 (37%) increased expression (by a median of 20%) and 3041 (63%) decreased expression (by a median of 24%). The majority of significant effects were of small magnitude. If we require a minimum two-fold change, we identify a total of 83 activating and 559 repressing mutations. The significant shift toward repressing mutations (binomial test, p-value < 10−42) is consistent with a model where most transcriptional regulators are activators and binding is more easily lost than gained with single nucleotide changes.

Out of the 31,243 successfully assayed mutations, 2306 are 1-bp deletions, of which 229 meet the significance threshold (p-value of fit <10−5). This is a lower proportion than observed for SNVs (10% vs. 16%), most likely due to the lower rates at which 1-bp deletions are created by error-prone PCR, resulting in representation by fewer tags. Supporting this notion, 1-bp deletions tend to be associated with larger absolute effect sizes than SNVs (Wilcoxon Rank Sum test with continuity correction, p-value < 0.05, location shift 0.04). Similarly, we observe a large shift toward repressive effects with 1-bp deletions (significant effects: 33% activating [27%, 40%], 76 activating and 153 repressing; min. two-fold change 5% activating [1%, 17%], 2 activating and 37 repressing), but due to the low number of observations, this shift is not significantly different than that observed for SNVs (significant effects: 37% activating [36%, 39%], 1719 activating and 2882 repressing; min. two-fold change 14% activating [11%, 17%], 80 activating and 514 repressing).

We have greater power to detect significant effects for transitions than transversions, likely consequent to the higher sampling by error-prone PCR (Supplementary Table 9; binomial test comparing the proportion of significant transitions (2190/9824) vs. transversions (2411/19,113); p-value < 10−16). This is supported by specific transversions (A-T, T-A) that are also created more frequently by error-prone PCR (Supplementary Fig. 3) and represent a higher proportion of significant observations (A-T 374/2372 and T-A 422/2289, combined binomial test vs. all transversions, p-value < 10−16). Despite our greater power for assaying transitions, transversions had larger absolute effect sizes (Wilcoxon Rank Sum test with continuity correction, p-value < 10−16, location shift 0.14). This observation supports a model where regulatory elements evolved some level of robustness to the more frequent transitional changes (as is the case for coding sequences42), and is consistent with previous research showing that transversions have a larger impact on TF motifs and allele-specific TF binding43.

Our increased power to measure the effects of transitions resulted in an artifactual enrichment for significant effects among SNVs previously observed in gnomAD r2.17 (binomial test; overlap of tested SNVs with gnomAD, n = 689/31,243; of those with significant effects, n = 144/689; p-value < 0.001), where 64% of SNVs are transitions compared with 34% of mutations created in our libraries. However, testing separately for transitions and transversions, there is no enrichment of significant effects among SNVs previously observed in gnomAD. In fact, we observed a smaller absolute effect size for previously observed SNVs (Wilcoxon Rank Sum test with continuity correction; p-value = 0.06, location shift 0.03). This effect is significant if we exclude singletons (excl. 82/144 significant variants; Wilcoxon Rank Sum test with continuity correction; p-value < 0.01, location shift 0.07), consistent with purifying selection acting on standing regulatory variation.

The most obvious pattern upon visual inspection of the data is a strong clustering of positions associated with significant mutations (e.g., Fig. 1d). This clustering was non-random for all but the F9 and FOXE1 experiments (Wilcoxon Rank Sum tests with continuity correction vs. 1000 data shuffles; p-value < 0.01; for 16/21 elements, p-value < 10−5), as determined from comparing run lengths for significant changes including directionality of the change. While FOXE1 is one of the experiments mentioned above with low experimental reproducibility, a non-random clustering of significant regulatory changes was observed in F9 when additionally requiring a minimum effect size of 20% (Wilcoxon Rank Sum tests with continuity correction vs. 1000 data shuffles; p-value < 10-9). These results are consistent with expectations for TFBS (specific examples are discussed below).

In the following sections, we describe the results of saturation mutagenesis of three of the regulatory elements in greater detail: the LDLR promoter, the TERT promoter, and a SORT1-associated enhancer. Similar expositions on the remaining 18 elements are provided as Supplementary Note 1 and Supplementary Tables 10 and 11. Finally, we compare the relative performance of various computational tools for predicting these empirical measurements of regulatory effects.

### LDLR promoter

FH is an autosomal dominant disorder of low-density lipoprotein (LDL) metabolism, which results in accelerated atherosclerosis and increased risk of coronary heart disease44. With a prevalence of about 1 in 500 individuals, FH is the most common monogenic disorder of lipoprotein metabolism. It is mainly due to mutations in the LDL receptor (LDLR) gene that lead to the accumulation of LDL particles in the plasma45. Several studies have shown that variants in the LDLR promoter can alter the transcriptional activity of the gene and also cause FH31,32,33 (full reference list in Supplementary Table 12). While in some cases, mutations in the promoter were identified in patients, a functional follow-up, like testing the regulatory effect of the variants by means of a luciferase assay, was not always conducted46,47,48. To decipher the functional activity of these previously identified, as well as essentially all potential SNVs in the LDLR (NM_000527.4) promoter, we performed saturation mutagenesis MPRA in HepG2 cells, a commonly used cell line for LDLR functional studies33. Experiments for this promoter were performed using two full replicates (i.e., two independently constructed saturation mutagenesis libraries), referred to below as LDLR and LDLR.2 (Fig. 1b).

We observed strong concordance between our MPRA-based measurements and variants with previous luciferase activity results (Supplementary Table 12). For example, a c.-152C>T mutation was previously reported to reduce promoter activity (to 40% of normal activity), while a c.-217C>T variant was shown to increase transcription (to 160% of normal activity)31. We observe a reduction of 32%/39% (LDLR/LDLR.2) and activation of 273%/263% (LDLR/LDLR.2) for these variants, respectively. c.-142C>T reduced promoter activity (to 20% of normal activity) in transient transfection assays in HepG2 cells32, and we observed 20%/11% (LDLR/LDLR.2) residual activity. Mutations located in regulatory elements R2 and R3 (c.-136C>G, c.-140C>G, and c.-140C>T) resulted in 6–15% residual activity;33 our MPRA results confirm these findings in both replicates (10–22% residual activity). We also observed no significant changes in promoter activity for c.-36T>G and c.-88G>A, consistent with a previous study of these variants33.

Overall, we observe that variants located in close proximity and overlapping the same TFBS tend to show similar deactivating effects (e.g., SP1 and SREBP1/SREBP2 sites in Supplementary Table 12 and Fig. 1d). Previously reported variants located in the 5′ UTR of LDLR generally did not affect promoter activity. The high concordance between full replicates (Pearson correlation of 0.97) as well as the agreement with previous studies give us confidence in the potential of our MPRA results to be useful for the clinical interpretation of LDLR promoter mutations. It also reinforces the value of functional assays covering all possible variants of a regulatory sequence of interest, as this provides consistent and comparable readouts together with a distribution of effect sizes.

### TERT promoter

Mutations in the telomerase reverse transcriptase (TERT) promoter (NM_198253.2), in particular c.-124C>T or c.-146C>T, increase telomerase activity and are among the most common somatic mutations observed in cancer34,35,36,37. Previous luciferase assay studies showed that these mutations increase promoter activity in human embryonic kidney (HEK) 293 cells, glioblastoma, melanoma, bladder cancer, and hepatocellular carcinoma (HepG2) cells34,49,50,51. In glioblastoma cells, c.-124C>T or c.-146C>T mutations result in a 2–4 log2 fold increase in promoter activity51.

Here, we tested the TERT promoter MPRA library in two different cell types, HEK293T and glioblastoma SF7996 cells52, referred to here as GBM (Fig. 2a), observing a log2 fold increase in promoter activity of 2.00/2.86-fold for c.-124C>T and 1.42/2.42-fold for c.-146C>T in HEK293T and GBM cells, respectively. We also identified additional activating mutations, some of which were previously identified in cancer studies49,53,54 and are annotated in COSMIC55. These include c.-45G>T and c.-54C>A, previously identified as somatic mutations in bladder cancers53,54, and c.-57A>C, previously associated with both melanoma49 and bladder cancer54. We observed activating effects for c.-45G>T and c.-54C>A with a log2 increase of 0.81/1.65-fold and 0.45/1.03-fold for HEK293T and GBM cells, respectively. For c.-57A>C, we observed a 0.65/1.14-fold log2 increase in HEK293T and GBM cells, respectively, similar to previous reporter assays that obtained increased expression of 0.6-fold (152%) and 0.3-fold (123%) on a log2-scale over the wild-type construct in Ma-Mel-86a and HEK293T cells49, respectively.

A common single-nucleotide polymorphism (SNP) in the TERT promoter, rs2853669 (c.-245A>G), previously studied in several cell lines and cancer cohorts, has been suggested to alter promoter-mediated TERT expression by impacting E2F1 or ETS/TCF binding56,57. However, studies in both breast cancer58 and glioblastomas59 failed to find any impact on risk or prognosis of this polymorphism. The epidemiologic findings are in line with our results, as we did not observe a significant effect of this variant on promoter activity in either cell type.

We next sought to assess whether there are differences in mutational effects on the TERT promoter between HEK293T and GBM cells that could be driven by the trans environment. Overall, variant effects were highly concordant between the two cell types (Fig. 2a). However, we did observe significant differences at several specific positions. In particular, variants between c.-62 and c.-70, which corresponds to an E2F repressor site, were found to increase promoter activity in GBM cells, likely due to the disruption of this motif (Fig. 2a). None of these effects were observed in HEK293T cells, suggesting that different E2F family protein abundances could be driving the differences in promoter activity between these cell types, and potentially between the corresponding cancer types.

Previous work has shown that the commonly observed cancer-associated activating somatic mutations, c.-146C>T and c.-124C>T, lead to the formation of an ETS binding site that is bound by the multimeric GABP transcription factor in GBM cells51. To evaluate the relevance of GABP binding on TERT promoter activity more globally, we retested our TERT MPRA library in GBM cells with a short interfering RNA (siRNA) targeting GABPA. We first optimized GABPA knockdown conditions using qPCR, such that it reduced GABPA expression by 68% ± 7% and TERT promoter activity by 58% ± 12%, compared with a scrambled siRNA control (Supplementary Fig. 7). We then tested our MPRA library in GBM cells using either the GABPA siRNA or the scrambled control. A total of 63 variants were identified as significantly different (see “Methods” and Supplementary Table 13), 59 leading to a reduction, and 4 to an increase in activity. Both c.-146C>T and c.-124C>T, previously reported to create GABP binding sites, showed significantly reduced activity in the siGABPA knockdown compared with the scrambled control (Fig. 2b). Variants with significantly different activity were 2.8-fold enriched for ETS-related factor motifs (ETS1, ELK4, ETV1, ETV4-6, and GAPBA) annotated from JASPAR 201860 as compared with all 908 other variants present in both experiments (one-sided Fisher’s exact test; p-value < 0.001).

Apart from the two variants (c.-146C>T and c.-124C>T) known to create GABPA binding sites, we identified nine additional variants with the potential to create new ETS family motifs from the siGABPA knockdown (Supplementary Table 13). To include such instances as part of a global analysis, we computed score differences of the corresponding position weight matrices (PWMs) from activating and repressing variants that overlap either a reference or newly created ETS family motif (reference or alternative sequence larger than the 80th percentile of the motifs’ best matches; 12 activating and 26 repressive variants). Figure 2c shows that activating variants create new ETS-related factor motifs and repressing variants disrupt them. The PWM score changes are highly significant between activating and repressing variants (one-sided Wilcoxon Rank Sum test, p-value < 10−11). Overall, the average expression reduction for the motif-disruptive allele in cases where variants disrupt or create ETS family motifs is 29.5% ± 14%, which is in concordance with the 58% ± 12% reduced qPCR expression of TERT (Supplementary Fig. 7).

### SORT1-associated enhancer

The Sortilin 1 (SORT1)-associated enhancer was identified via a common SNP, rs12740374 (GRCh37 chr1:109,817,590G>T), which is associated with myocardial infarction61. The minor allele T creates a potential C/EBP binding site, leading to ~4-fold greater luciferase activity compared with the major allele G in a reporter assay38. This is thought to alter the expression of the SORT1 and proline and serine rich coiled-coil 1 (PSRC1) genes, leading to changes in LDL and VLDL plasma levels38. The major allele is thus associated with higher LDL-C levels and increased risk for myocardial infarction. These results are also consistent with prior human lipoprotein QTL analyses62,63.

We carried out saturation-based MPRA in HepG2 cells using a 600 bp region encompassing rs12740374 with two full replicates (i.e., two independently constructed saturation mutagenesis libraries transfected at different days with three technical replicates each, SORT1 and SORT1.2, Fig. 1c). Consistent with the literature, our MPRA results show a significant effect for rs12740374, leading to a 2.92/2.74 log2-fold increase in expression. Furthermore, we observe many other substitutions of large effect, with a disproportionate number of >2-fold expression changes in our overall dataset (144/645) occurring in the SORT1 enhancer (Supplementary Table 8). The locations of these variants are strongly clustered (Fig. 3), indicative of several TFBS in this region.

The directionality independence of enhancers is inherent to their definition but is not often tested. To evaluate whether the orientation of an enhancer could bias the effects of mutations within it, we generated a third SORT1 enhancer library where the orientation of the enhancer was flipped, termed SORT1.flip. Comparison of all variants showing a difference in activity compared with the reference sequence (p-value of fit <10−5) in the SORT1.flip with the other two libraries in the opposite orientation (SORT1 and SORT1.2), we observe a very strong correlation (0.97 and 0.96 for SORT1 and SORT1.2, respectively; Pearson’s correlation), on par with the two biological replicates in the same orientation, SORT1 and SORT1.2 (0.98; Pearson’s correlation). This result supports the directionality-independence of enhancers64 as well as of the effects of variants within them.

However, we did observe a few significant differences between SORT1/SORT1.2 and SORT1.flip near the 3′ region of the forward orientation (GRCh37 chr1:109,817,859-109,817,872), i.e., adjacent to the minimal promoter on the reporter construct. This block of variants led to a significant increase in activity in the forward orientation (SORT1/SORT1.2) but not in the opposite orientation (SORT1.flip) (Fig. 3). Analysis of this region for TFBS using JASPAR 201860 found a potential EBF1 motif (MA0154.3), suggesting that this factor might lead to the orientation activity differences observed in our assays. EBF1 is known to act as an activator or repressor of gene expression65,66 and mutations to its core motif sequences (GGG and CCC) had the strongest effect in our SORT1 assay. Nonetheless, its 3′ location and orientation dependence suggests that it is likely contributing to minimal promoter rather than enhancer activity.

### Computational tools predict expression effects poorly

Altogether, our MPRAs analyzed over 30,000 different mutations for their effects on regulatory function. We next set out to assess the performance of the available computational tools and annotations for predicting the regulatory effects of individual variants. For this purpose, we examined various measures of conservation (PhyloP67, PhastCons68, and GERP++69) as well as a number of computational tools that integrate large sets of functional genomics data into combined scores (CADD70, DeepSEA11, Eigen12, FATHMM-MKL13, FunSeq214, GWAVA15, LINSIGHT16, and ReMM17). In addition, we previously identified the number of overlapping TFBS as a significant predictive measure of the activity of a specific region41. We therefore also analyzed TFBS annotations resulting from motif predictions overlayed with biochemical evidence from ChIP-seq experiments (available as Ensembl Regulatory Build (ERB)71 and ENCODE72 annotations) as well as pure motif predictions from JASPAR 201860. Using JASPAR predictions, we extended this analysis to individual positions and explored different score thresholds or just the factors predicted most frequently across the region (see “Methods”). All these annotations and scores are agnostic to the cell type(s) in which we studied each sequence. Therefore, we also compared our results to sequence-based models (deltaSVM21) for 10 of 21 MPRAs, i.e., where a model was publicly available for the corresponding cell type (HEK293T, HeLa S3, HepG2, K562, and LNCaP). In cases where an annotation is based on positions rather than alleles, we assumed the same value for all substitutions at each position. We did not include the 1-bp deletions in this analysis, as most annotations are not defined for deletions.

Supplementary Tables 14 and 15 report the Pearson and Spearman correlation of the obtained expression effect readouts with conservation metrics, combined annotation scores, and overlapping TF predictions, respectively. Figure 4 and Supplementary Figs. 8 and 9 visually contrast expression effects as well as a subset of these annotations (including ENCODE72 and Ensembl71 motif annotations). By correlating absolute expression effects with functional scores, we identified species conservation as a major driver of the currently available combined scores in these regions. For example, the correlation results of CADD v1.3/v1.4, Eigen, FATHMM-MKL and LINSIGHT show more than 90% Pearson correlation across elements with the results of PhastCons scores calculated from the alignment of mammalian genome sequences. However, conservation seems only informative for a subset of the studied noncoding regulatory elements (e.g., LDLR, ZFAND3, and IRF4, but not SORT1, F9, or GP1BB). We observe that repressive effects can at least be partially aligned with available motif data (e.g., F9, GP1BB, IRF4, LDLR, and SORT1). However, experimentally supported motif annotations are frequently incomplete (for an example see Fig. 4a, around c.-215 where motifs are predicted in JASPAR but absent from ENCODE and ERB). In several cases, motif annotation is also missing completely. For 13 of the 21 elements studied here, no motif annotation was available from ERB; for 2 of the 21 elements no motif annotation was available from ENCODE. Furthermore, the gain-of-binding motifs, e.g., the motifs underlying activating mutations in TERT, are currently not at all or insufficiently modeled in available scores, as these motifs are frequently missed by scans of the reference genome.

Looking at average Spearman correlations across our 21 regions (Fig. 5, Supplementary Table 15), DeepSEA (0.22) performed best, followed by FunSeq2 (0.14), Eigen (0.14), high-scoring (top 10th percentile) JASPAR predictions (0.14), and FATHMM (0.14). However, the average is a poor measure here. Spearman correlations for absolute expression effects of some elements were reasonably high for several methods (0.3–0.6), while for other elements no or negative correlations were detected for most or all methods. We saw the best performance in predicting an individual element (LDLR) for FATHMM-MKL (0.59), followed by GerpN (0.59), Eigen (0.58), and vertebrate PhastCons (0.58). Besides LDLR, the next best agreement between annotations and absolute expression effects was observed for F9 (top 10th percentile JASPAR 0.52), IRF4 (Eigen 0.48), ZFAND3 (DeepSEA 0.44), and PKLR (LINSIGHT 0.41). The lowest agreement was observed for FOXE1 (DeepSEA 0.05) and BCL11A (DeepSEA 0.03); the two elements observed with the lowest replicate correlation (Supplementary Table 7). However, high replicate correlations did also not indicate high correlations with existing scores or annotations. For example, SORT1 (replicate correlations of 0.99) and TERT (replicate correlations of >0.96 for GBM experiment) have highest correlations of 0.33 (top 25th percentile JASPAR) and 0.37 (DeepSEA), respectively.

Among the ten elements for which sequence-based models were available from deltaSVM, DeepSEA, the top 10th percentile JASPAR predictions and FunSeq2 scores still performed best in predicting absolute effect size. The absolute deltaSVM models ranked 7th out of 25 measures and were most similar to the number of overlapping top 25th percentile JASPAR predicted motifs (0.80 Pearson correlation). However, deltaSVM models showed improved performance when correlating expression effects including their directionality (0.53 Spearman correlation for SORT1, 0.43 Spearman correlation for F9, Supplementary Table 15). The directionality available from deltaSVM, i.e., predicted expression gain and loss, illustrates how sequence-based models can overcome missing gain-of-motif annotations from reference sequence-based predictions.

In these comparisons, we focused on the relative ranking of variant effects and included a large number of close-to-zero effect estimates affected by experimental noise. This might be conservative for estimating the predictive power in classification settings, e.g., separating pathogenic from benign variants. Testing the separation of no/little effect SNVs from the top 200/500/1000 highest effect promoter (Supplementary Table 16; Supplementary Figs. 10 and 11) or enhancer (Supplementary Table 17; Supplementary Figs. 12 and 13) variants, we also find DeepSEA with the best average performance in both groups (ave. AUROC 0.831 promoter and 0.709 enhancer). We see better performance on the top 200 vs. top 500 vs. the top 1000 variants for both enhancers and promoters. While this difference is more pronounced for promoters, it goes along a shift toward certain elements contributing the majority of selected variants (see Supplementary Tables 16 and 17). Considering that certain elements dominate our classification results (promoters: LDLR, PKLR, and TERT; enhancers: SORT1, IRF4, and IRF6), we caution to overinterpret differences between promoters and enhancers (e.g., better performance of conservation scores for promoters vs. better performance of motif predictions for enhancers). A much larger number of elements in each group (exceeding the scope of this study) and eventually a balanced sampling of elements will be required for that.

To summarize, we observe that even the best performing computational tools or annotations are less predictive of regulatory variants compared with the classification of coding variants70 and can explain only a small proportion of the expression effects observed in our data. The highest variance explained based on Pearson R2 is 0.41 (mammalian PhastCons for LDLR, Supplementary Table 14), the average across all elements is just 0.03.

## Discussion

Although limited in their naturalness, MPRAs enable a rigorous, quantitative ascertainment of the regulatory consequences of genetic variants. Previous studies applied MPRAs to study common genetic variants in various regions, with the goal of fine-mapping the causal regulatory determinants of GWAS or eQTL associations24,25. In contrast, here we selected sequences with known regulatory potential—and moreover, sequences previously implicated in human disease—and sought to quantify the consequences of all possible SNVs on that potential.

Saturating MPRAs uniquely facilitates several kinds of analysis. First, we are able to formally evaluate the distribution of effect sizes in regulatory elements. For example, what proportion of variants are inert, activating or repressing? How typical or atypical is a regulatory variant that results in a two-fold expression change? Although the answers to such questions undoubtedly differs between regulatory elements, the number of elements and variants that were studied allows us to begin to make generalizations. Second, saturating MPRAs facilitate the fine-scale identification of TFBSs, including ones that may correspond to transcriptional regulators for which ChIP-seq data are not available or that are not well represented in motif databases. Importantly, it also allows the discovery of binding sites that are created by genetic variants, i.e., through activating mutations. Third, we show that the intersection of saturation mutagenesis MPRAs and TF perturbation, i.e., our siRNA-based knockdown of GABPA, enables confirmation that a particular TF binds to a particular TFBS. Larger-scale implementations of this approach may facilitate the routine identification of the specific trans-acting factors that are responsible for the regulatory potential of each cis-acting regulatory element.

Our study and these data have several limitations that merit highlighting. First, we are limited with respect to context, both cis and trans. To address the former, we used longer sequences than are typical for MPRAs, up to 600 bp, but it remains the case that these are studied on episomal vectors rather than in their native locations. To address the latter, we selected cell lines in which these elements were previously shown to be active, and moreover relevant to the diseases in which these elements were implicated. Nonetheless, previous studies have demonstrated that some regulatory polymorphisms do not always reflect their in vivo effects in cell culture-based assays73, particularly for developmental genes that show temporal and tissue-specific expression patterns (e.g., see results for the IRF6 and ZRS in Supplementary Note 1). A second limitation relates to the reproducibility of measurements for some of the elements studied, and in particular for those with lower basal activity (which we found to be the largest factor impacting reproducibility). Potential approaches to address this in future work include using a stronger minimal promoter (for enhancers) or simply using more complex libraries to further reduce noise (for all elements).

A clear result of our analyses is that although myriad annotations and integrative scores are available, and although some annotations/scores are surprisingly successful in specific cases, no current score consistently performs well in predicting the regulatory consequences of SNVs in the human genome. It is our hope that this dataset of functional measurements for over 30,000 single-nucleotide substitution and deletion regulatory mutations in disease-associated regulatory elements will be useful for the field for studying the shortcomings of current tools, and hopefully inspiring their improvement.

In summary, we successfully scaled saturation mutagenesis-based MPRAs to measure the regulatory consequences of tens-of-thousands of sequence variants in promoter and enhancer sequences previously associated with clinically relevant phenotypes. We believe that our experiments provide a rich dataset for benchmarking predictive models of variant effects, an unprecedented database for the interpretation of potentially disease-causing regulatory mutations, and the potential for critical insight for the development of improved computational tools.

## Methods

### Selection of target sequences and luciferase assays

Promoter and enhancer sequences of interest (Supplementary Tables 1 and 2) were amplified from human genomic DNA (Roche 11691112001). The targets were amplified with overhanging primers (Supplementary Table 18) to add shared sequences for cloning. All promoters were cloned into pGL4.11b vector [modified from pGL4.11 (Promega) by Dr Richard M. Myers lab] and most of the enhancers were cloned into the pGL4.23 vector (Promega) that contains a minimal promoter followed by the luciferase reporter gene (Supplementary Tables 1 and 2). All inserts were confirmed by Sanger sequencing. We measured the relative luciferase activity of the selected promoters and enhancers for the wild-type as well as the saturation mutagenesis library, normalized to the empty vector. For this purpose, HepG2 (HB-8065), HEK293T (CRL-11268), HeLa (CCL-2), HaCaT (CRL-2404), Neuro-2a (CCL-131), LNCaP (CRL-1740), and SK-MEL-28 (HTB-72) were obtained from American Type Culture Collection (ATCC), the primary glioblastoma cell line SF7996 (GBM)52 was obtained from UCSF (Dr Costello’s lab), and Min6 was a gift by Dr Feroz Papa’s lab at UCSF. A total of 2.0 × 105 cells were cultured in 96-well plates overnight using standard protocols and were transfected with 100 ng of plasmid bearing the promoter/enhancer sequence (Supplementary Tables 1 and 2) (along with 10 ng of the Renilla vector to facilitate normalization for transfection efficiency) using X-tremeGENE HP DNA transfection reagent (Roche 06366236001) according to the manufacturer’s protocol. K562 (ATCC CCL-243), HEL92.1.7 (ATCC TIB-180), and NIH/3T3 cells (ATCC CRL-1658) were cultured in 24-well plates and transfected using X-tremeGENE HP with 500 ng of the constructed plasmid, along with 50 ng of the Renilla vector. The promoter/enhancer activity was measured using the Dual-Luciferase reporter assay (Promega E1910) on a Synergy 2 microplate reader (BioTek Instruments) following a post-transfection interval that varied by experiment (Supplementary Tables 1 and 2).

### Construction of MPRA libraries

The cloned enhancer or promoter sequences were amplified in two rounds of PCR. After amplifying the cloned sequences once to append universal adaptors for subsequent steps (Supplementary Table 19), a second error-prone PCR with Mutazyme II (GeneMorph II Random Mutagenesis Kit, Agilent 200550) was used to introduce sequence variation. This second round also added a 15 or 20 bp random tag (i.e., barcode) contained within an overhanging primer oligo (Supplementary Table 20) to each construct.

The resulting PCR products were cloned into the respective vector backbone (Supplementary Tables 1 and 2) without the luciferase reporter via NEBuilder HiFi DNA Assembly (NEB E2621) and transformed into 10-Beta Electrocompetent cells (NEB C3020K). As needed, multiple transformations were pooled and midi-prepped together (Chargeswitch Pro Filter Plasmid Midi Kit, Invitrogen CS31104). Using the vector backbone without the luciferase gene allowed for association of each sequence variant with its newly added tag via sequencing (see below). Once this interim library was determined to have a sufficient complexity and representation of variants, the luciferase gene was inserted between the enhancer/promoter and its tag via a sticky end ligation and transformed again to make the final library for transfection.

### Assignment of tags to sequence variants

The associations between tags and sequence variants created by error-prone PCR were learned by amplifying and deep sequencing of this region of the plasmid library before the luciferase gene was added, i.e., while the enhancer/promoters and tags are in close proximity. For short enhancers/promoters, the libraries were amplified with sequencing adaptor primers that captured the cloned sequence with its tag and added the P5/P7 Illumina flow cell sequences (Supplementary Table 21). For long enhancers/promoters, a custom subassembly sequencing approach30 was used to obtain associated tags and sequences along the targets. Here, libraries were also first amplified with sequencing adaptors, but then some of the full-length products were subjected to tagmentation via a Nextera library prep (Illumina FC-121-1031). The tagmented products were amplified with a Nextera-specific primer on the P5 end, and a primer containing only the P7 flow cell sequence on the other end (Supplementary Table 22). These PCR fragments were size-selected on a 1%-agarose gel into two size bins. Full-length and fragmented libraries were quantitated with a Kapa Library Quantification Kit (Roche 07960140001). Products were run on either an Illumina MiSeq or NextSeq instrument (Supplementary Table 23). Full-length and large-size fragment bins were loaded with increased DNA concentration, as these are less efficiently amplified during the cluster generation process. Sequence reads were aligned using BWA-mem v0.7.10-r78974 with an increased penalty against local alignments (-L 80) to the Sanger determined references. A minimum coverage of three reads along the whole target was required to include variant calls from bcftools v1.275 for each identified tag. Summary statistics for these assignments are available in Supplementary Table 3.

### Expression of libraries and nucleic acid extraction

For each experiment, about 5 million cells were plated in 15-cm plates and incubated for 24 h before transfection. Each of three independent cultures (replicates) were transfected with 15 μg of the constructed MPRA libraries using X-tremeGENE HP (Roche 06366236001). After indicated hours (Supplementary Tables 1 and 2), cells were harvested, genomic DNA and total RNA were extracted using AllPrep DNA/RNA mini kit (Qiagen 80204). Total RNA was subjected to mRNA selection (Oligotex, Qiagen 72022) and treated with Turbo DNase (Thermo Fisher Scientific AM2238) to remove contaminating DNA.

### RNA interference (TERT promoter)

Following the protocol outlined in Bell et al.51, siRNAs were transfected into GBM cells (SF7996) using DharmaFECT 1 following the manufacturer’s protocol. Briefly, cells were seeded at a density of 30,000 cells/mL in a 96-well plate and 5 million cells in 15-cm plates in parallel. Twenty-four hours post seeding, cells were transfected with 50 nM of siRNA and 0.3 μL of DharmaFECT 1 reagent (Dharmacon T-2001). At 48 h post transfection with siRNA, cells were transfected again with the TERT saturation mutagenesis library for another 24 h before harvesting for genomic DNA and total RNA as described above. To measure siRNA knockdown efficiency, cDNA was generated from the 96-well plate, and qPCR performed (Power SYBR Green Cells-to-Ct kit, Ambion 4402953) to measure mRNA abundance of GABPA and TERT with primer sequences previously used (Supplementary Table 24). Relative expression levels (Supplementary Fig. 7) were calculated using the deltaCT method against housekeeping gene GUSB51.

### RNA and DNA library preparation

For each replicate, RNA was reverse transcribed with Superscript II (Invitrogen 18064-014) using a primer downstream of the tag. The resulting cDNA was first pooled and then split into multiple reactions to reduce PCR jackpotting effects. Amplification was performed with Kapa Robust polymerase (Roche KK5024) for three cycles, incorporating UMIs 10 bp in length, a sample identifier and the Illumina P7 adapter (Supplementary Table 25). PCR products were cleaned up with AMPure XP beads (Beckman Coulter A63880) to remove the primers and concentrate the products. These products underwent a second round of amplification in eight reactions per replicate for 15 cycles, with a reverse primer containing only P7 (Supplementary Table 25). All reactions were pooled, run on an agarose gel for size selection, and then sequenced. For the DNA, each replicate was amplified for three cycles with UMI-incorporating primers, just as the RNA. First round products were then cleaned up with AMPure XP beads, and amplified in split reactions, each for 20 cycles. Reactions were pooled, gel-purified, and sequenced.

### Sequencing and primary processing

RNA and DNA for each of the three replicates were sequenced on an Illumina NextSeq instrument (2 × 15 or 2 × 20 bp + 10 bp UMI + 10 bp sample index; primers available in Supplementary Table 26). Paired-end reads each sequenced the tags from the forward and reverse direction and allowed for adapter trimming and consensus calling of tags76. Tag or UMI reads containing unresolved bases (N) or those not matching the designed length were excluded. In the downstream analysis, each tag × UMI pair is counted only once and only tags matching the above obtained assignment were considered (Supplementary Table 4).

### Inferring SNV effects

RNA and DNA counts for each replicate were combined by tag sequence, excluding tags not observed in both RNA and DNA of the same experimental replicate. All tags (T) not associated with insertions or multiple bp deletions were included in a matrix of RNA count, DNA count, and N binary columns indicating whether a specific sequences variant was associated with the tag. We then fit a multiple linear regression model of log2(RNA$${j}$$) ~ log2(DNA$${j}$$) + N + offset (jT) and report the coefficients of N as effects for each variant. Further, we fit a combined model (Equation 1) across all three experimental replicates ($$i \in \{ 1,2,3\}$$), where we combine all RNA measures in one column and keep the DNA readouts separated by replicate (i.e., filling missing values with 0).

$${\mathrm{log}}_2\left( {{\mathrm{RNA}}_{i,{{j}}}} \right) \sim \left\{ {\begin{array}{*{20}{c}} {{\mathrm{log}}_2\left( {{\mathrm{DNA}}_1,j}\right)|i = 1} \\ {0|{\mathrm{else}}} \end{array}} \right\} + \left\{ {\begin{array}{*{20}{c}} {{\mathrm{log}}_2\left( {{\mathrm{DNA}}_2,j} \right)|i = 2} \\ {0|{\mathrm{else}}} \end{array}} \right\} \\ \hskip 12pt \hskip 12pt + \, \left\{ {\begin{array}{*{20}{c}} {{\mathrm{log}}_2\left( {{\mathrm{DNA}}_3,j} \right)|i = 3} \\ {0|{\mathrm{else}}} \end{array}} \right\} + N + {\mathrm{offset}}$$
(1)

We required a minimum of ten tags per variant, before considering variant effects in downstream analyses (Supplementary Tables 5 and 6). While statistically inflated (assumption of independence between tags, correlated counts due to saturation mutagenesis process and no modeling of epistasis effects), we used the p-value for the obtained coefficients as a proxy for the support for each individual nonzero variant effect.

We used the correlation between transfection replication of fitted variant effects divided by the inferred standard deviation as a measure of reproducibility (Supplementary Table 7). We explored major contributors to the reproducibility by leave-one-out regression models using putative predictors from Supplementary Tables 14 (target length, Luciferase fold-change, number of associated tags, proportion of wild type in the library, average mutation rate, number of variants per insert, tags per SNV, tags per 1-bp deletion, assigned DNA and RNA counts, number of wild-type tags, number of variants across tags, and average number of variants per tag).

To identify significantly different variants in the siRNA knockdown experiment, 95% confidence intervals were generated from the generalized linear model using the confint.lm method of the stats package in R. Variants with non-overlapping confidence intervals between the experiments TERT-GBM-SiGABPA and TERT-GBM-SiScramble were considered significantly different.

### Annotation and downstream analysis

BP positions and SNVs along the targeted regions were annotated based on GRCh37/hg19 coordinates. CADD v1.0, v1.3, and v1.4 were downloaded [http://cadd.gs.washington.edu/download], and phastCons, phyloP, and GERP++ scores were used as provided with the CADD v1.3 annotations. Functional effect scores for FunSeq v2.16 [http://archive.gersteinlab.org/funseq2.1.2/hg19_NCscore_funseq216.tsv.bgz], LINSIGHT [http://compgen.cshl.edu/~yihuang/tracks/LINSIGHT.bw], and ReMM v0.3.1 [https://charite.github.io/software-remm-score.html] were downloaded for the whole genome and regions of interest extracted. GWAVA scores for the unknown region and TSS models were calculated for all positions along the targeted regions using available software [ftp://ftp.sanger.ac.uk/pub/resources/software/gwava/v1.0/]. DeepSEA [http://deepsea.princeton.edu/job/analysis/create/] and fathmm-MKL [http://fathmm.biocompute.org.uk/fathmmMKL.htm] scores were retrieved using the respective online web interfaces.

DeltaSVM scores were computed from the precomputed k-mer weights [http://www.beerlab.org/deltasvm] of the initial deltaSVM publication21. For each variant the average of all possible k-mer scores of the alternative allele is subtracted from the average of all possible k-mer scores of the reference allele. Not available k-mers in the files are treated as zero. Only precomputed k-mer weights of HEK293T, HeLa S3, HepG2, K562, and LNCaP were available. The model DukeDnase was used for experiments in HEK293T (HNF4A, MSMB, MYC rs6983267, and TERT). For cell types with multiple available k-mer models, we selected the best performing model based on our data. This was DHS_H3Kme1 for HepG2 (SORT1, F9, and LDLR) and K562 (PKLR), UwDnase for LNCaP (MYC rs11986220) and HeLa S3 (FOXE1). We applied the HEK293T model to our TERT experimental results for the matching cell line as well as for GBM cells, based on the high correlation observed for these experiments. We obtained higher correlations in GBM cells, probably due to better experimental performance and are using these values in the comparison with other functional scores described above.

We used predicted TFBS available from JASPAR 2018 [http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2018/hg19/JASPAR2018_hg19_all_chr.bed.gz]60. Scores reported for each motif match were divided by the length of the match. These motif scores of all binding site predictions were combined across the 21 genomic regions to identify thresholds for the 90th (38), 75th (32.5), 50th (27.8182), 25th (24.4167), and 10th (21.4) percentile. To identify factors with the highest number of motifs in each region, we identified the five most frequent factors and included additional factors with the same number of motif matches in the region. For visualization, overlapping matches of the same motif were combined and matches on both strands considered only once.

TFBS predictions overlapping respective ChIP-peaks in ENCODE experiments were downloaded from http://compbio.mit.edu/encode-motifs/. TFBSs annotated in the ERB v90 were downloaded from [ftp://ftp.ensembl.org/pub/release-90/regulation/homo_sapiens] and coordinate converted to GRCh37 using the UCSC liftover program.

To test the performance of computational scores in a classification setting, i.e., distinguishing between large effect SNVs and those with no effect, we selected the top 200, 500, and 1000 variants with the largest expression effects across all elements (min. ten tags required; p-value of fit < 10−5) and randomly sampled the same number of variants with an absolute log2 expression effect lower than 0.05 (min. tags ten required), preserving the contribution of each element in both classes. In the top 1000 analysis, the number of positives exceeded the number of negatives and we downsampled the positives, respectively. If a region was studied in multiple experiments, we selected the one with the largest replicate correlation (Supplementary Table 7). We used annotations and scores as described above. In cases where an annotation is based on positions rather than alleles, we assumed the same value for all substitutions at that position. The analysis was repeated for 100 samples of the no-effect SNVs and the area under the receiver operating characteristic as well as the area under the precision-recall curve was computed.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

## Data availability

Plasmid construct sequences were deposited in NCBI GenBank (accessions pGL4.11b/MK484103.1 [https://www.ncbi.nlm.nih.gov/nuccore/MK484103.1], pGL4.11c/MK484104.1 [https://www.ncbi.nlm.nih.gov/nuccore/MK484104.1], pGL4.23c/MK484105.1 [https://www.ncbi.nlm.nih.gov/nuccore/MK484105.1], pGL4.23d/MK484106.1 [https://www.ncbi.nlm.nih.gov/nuccore/MK484106.1], pGL3c/MK484107.1 [https://www.ncbi.nlm.nih.gov/nuccore/MK484107.1], and pGL4Zc/MK484108.1 [https://www.ncbi.nlm.nih.gov/nuccore/MK484108.1]). The raw sequencing data, obtained tag-to-variant assignments, and processed RNA/DNA data have been submitted to NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) under accession number GSE126550 [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE126550]. The expression effect estimates, a tool for interactive visualization, and further information is available at https://doi.org/10.17605/OSF.IO/75B2M.

## References

1. 1.

Shendure, J. & Akey, J. M. The origins, determinants, and consequences of human mutations. Science 349, 1478–1483 (2015).

2. 2.

Li, X. et al. The impact of rare variation on gene expression across tissues. Nature 550, 239–243 (2017).

3. 3.

Chatterjee, S. & Ahituv, N. Gene regulatory elements, major drivers of human disease. Annu. Rev. Genom. Hum. Genet. 18, 45–63 (2017).

4. 4.

Maurano, M. T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).

5. 5.

Cusanovich, D. A. et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 174, 1309–1324.e18 (2018).

6. 6.

1000 Genomes Project Consortium. et al.A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010)..

7. 7.

Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).

8. 8.

Starita, L. M. et al. Variant interpretation: functional assays to the rescue. Am. J. Hum. Genet. 101, 315–325 (2017).

9. 9.

Levo, M. & Segal, E. In pursuit of design principles of regulatory sequences. Nat. Rev. Genet. 15, 453–468 (2014).

10. 10.

Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).

11. 11.

Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).

12. 12.

Ionita-Laza, I., McCallum, K., Xu, B. & Buxbaum, J. D. A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat. Genet. 48, 214–220 (2016).

13. 13.

Shihab, H. A. et al. An integrative approach to predicting the functional effects of non-coding and coding sequence variation. Bioinformatics 31, 1536–1543 (2015).

14. 14.

Fu, Y. et al. FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biol. 15, 480 (2014).

15. 15.

Ritchie, G. R. S., Dunham, I., Zeggini, E. & Flicek, P. Functional annotation of noncoding sequence variants. Nat. Methods 11, 294–296 (2014).

16. 16.

Huang, Y.-F., Gulko, B. & Siepel, A. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nat. Genet. 49, 618–624 (2017).

17. 17.

Smedley, D. et al. A whole-genome analysis framework for effective identification of pathogenic regulatory variants in Mendelian disease. Am. J. Hum. Genet. 99, 595–606 (2016).

18. 18.

Ernst, J. & Kellis, M. Chromatin-state discovery and genome annotation with ChromHMM. Nat. Protoc. 12, 2478–2492 (2017).

19. 19.

Chan, R. C. W. et al. Segway 2.0: Gaussian mixture models and minibatch training. Bioinformatics 34, 669–671 (2018).

20. 20.

Gulko, B., Hubisz, M. J., Gronau, I. & Siepel, A. A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat. Genet. 47, 276–283 (2015).

21. 21.

Lee, D. et al. A method to predict the impact of regulatory variants from DNA sequence. Nat. Genet. 47, 955–961 (2015).

22. 22.

Landrum, M. J. et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 44, D862–D868 (2016).

23. 23.

Stenson, P. D. et al. The human gene mutation database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies. Hum. Genet. 136, 665–677 (2017).

24. 24.

Tewhey, R. et al. Direct identification of hundreds of expression-modulating variants using a multiplexed reporter assay. Cell 172, 1132–1134 (2018).

25. 25.

Ulirsch, J. C. et al. Systematic Functional dissection of common genetic variation affecting red blood cell traits. Cell 165, 1530–1545 (2016).

26. 26.

Patwardhan, R. P. et al. High-resolution analysis of DNA regulatory elements by synthetic saturation mutagenesis. Nat. Biotechnol. 27, 1173–1175 (2009).

27. 27.

Patwardhan, R. P. et al. Massively parallel functional dissection of mammalian enhancers in vivo. Nat. Biotechnol. 30, 265–270 (2012).

28. 28.

Bejerano, G. et al. Ultraconserved elements in the human genome. Science 304, 1321–1325 (2004).

29. 29.

Visel, A. et al. Ultraconservation identifies a small subset of extremely constrained developmental enhancers. Nat. Genet. 40, 158–160 (2008).

30. 30.

Hiatt, J. B., Patwardhan, R. P., Turner, E. H., Lee, C. & Shendure, J. Parallel, tag-directed assembly of locally derived short sequence reads. Nat. Methods 7, 119–122 (2010).

31. 31.

Scholtz, C. L. et al. Mutation -59c–>t in repeat 2 of the LDL receptor promoter: reduction in transcriptional activity and possible allelic interaction in a South African family with familial hypercholesterolaemia. Hum. Mol. Genet. 8, 2025–2030 (1999).

32. 32.

Mozas, P. et al. A mutation (-49C>T) in the promoter of the low density lipoprotein receptor gene associated with familial hypercholesterolemia. J. Lipid Res. 43, 13–18 (2002).

33. 33.

De Castro-Orós, I. et al. Functional analysis of LDLR promoter and 5’ UTR mutations in subjects with clinical diagnosis of familial hypercholesterolemia. Hum. Mutat. 32, 868–872 (2011).

34. 34.

Huang, F. W. et al. Highly recurrent TERT promoter mutations in human melanoma. Science 339, 957–959 (2013).

35. 35.

Liu, T., Yuan, X. & Xu, D. Cancer-specific telomerase reverse transcriptase (TERT) promoter mutations: biological and clinical implications. Genes 7, E38 (2016).

36. 36.

Killela, P. J. et al. TERT promoter mutations occur frequently in gliomas and a subset of tumors derived from cells with low rates of self-renewal. Proc. Natl Acad. Sci. USA 110, 6021–6026 (2013).

37. 37.

Vinagre, J. et al. Frequency of TERT promoter mutations in human cancers. Nat. Commun. 4, 2185 (2013).

38. 38.

Musunuru, K. et al. From noncoding variant to phenotype via SORT1 at the 1p13 cholesterol locus. Nature 466, 714–719 (2010).

39. 39.

Copp, J. N., Hanson-Manful, P., Ackerley, D. F. & Patrick, W. M. Error-prone PCR and effective generation of gene variant libraries for directed evolution. Methods Mol. Biol. 1179, 3–22 (2014).

40. 40.

McCullum, E. O., Williams, B. A. R., Zhang, J. & Chaput, J. C. Random mutagenesis by error-prone PCR. Methods Mol. Biol. 634, 103–109 (2010).

41. 41.

Inoue, F. et al. A systematic comparison reveals substantial differences in chromosomal versus episomal encoding of enhancer activity. Genome Res. 27, 38–52 (2017).

42. 42.

Li, W. H., Wu, C. I. & Luo, C. C. A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol. 2, 150–174 (1985).

43. 43.

Guo, C. et al. Transversions have larger regulatory effects than transitions. BMC Genom. 18, 394 (2017).

44. 44.

Leigh, S. E. A., Foster, A. H., Whittall, R. A., Hubbart, C. S. & Humphries, S. E. Update and analysis of the University College London low density lipoprotein receptor familial hypercholesterolemia database. Ann. Hum. Genet. 72, 485–498 (2008).

45. 45.

Liyanage, K. E., Burnett, J. R., Hooper, A. J. & van Bockxmeer, F. M. Familial hypercholesterolemia: epidemiology, Neolithic origins and modern geographic distribution. Crit. Rev. Clin. Lab. Sci. 48, 1–18 (2011).

46. 46.

Lind, S. et al. Genetic characterization of Swedish patients with familial hypercholesterolemia: a heterogeneous pattern of mutations in the LDL receptor gene. Atherosclerosis 163, 399–407 (2002).

47. 47.

Fouchier, S. W., Kastelein, J. J. P. & Defesche, J. C. Update of the molecular basis of familial hypercholesterolemia in The Netherlands. Hum. Mutat. 26, 550–556 (2005).

48. 48.

Dedoussis, G. V. Z. et al. Molecular characterization of familial hypercholesterolemia in German and Greek patients. Hum. Mutat. 23, 285–286 (2004).

49. 49.

Horn, S. et al. TERT promoter mutations in familial and sporadic melanoma. Science 339, 959–961 (2013).

50. 50.

Rachakonda, P. S. et al. TERT promoter mutations in bladder cancer affect patient survival and disease recurrence through modification by a common polymorphism. Proc. Natl Acad. Sci. USA 110, 17426–17431 (2013).

51. 51.

Bell, R. J. A. et al. Cancer. The transcription factor GABP selectively binds and activates the mutant TERT promoter in cancer. Science 348, 1036–1039 (2015).

52. 52.

Mancini, A. et al. Disruption of the β1L isoform of GABP reverses glioblastoma replicative immortality in a TERT promoter mutation-dependent manner. Cancer Cell 34, 513–528.e8 (2018).

53. 53.

Huang, D.-S. et al. Recurrent TERT promoter mutations identified in a large-scale study of multiple tumour types are associated with increased TERT expression and telomerase activation. Eur. J. Cancer 51, 969–976 (2015).

54. 54.

Hurst, C. D., Platt, F. M. & Knowles, M. A. Comprehensive mutation analysis of the TERT promoter in bladder cancer and detection of mutations in voided urine. Eur. Urol. 65, 367–369 (2014).

55. 55.

Forbes, S. A. et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res. 45, D777–D783 (2017).

56. 56.

Labussière, M. et al. TERT promoter mutations in gliomas, genetic associations and clinico-pathological correlations. Br. J. Cancer 111, 2024–2032 (2014).

57. 57.

Spiegl-Kreinecker, S. et al. Prognostic quality of activating TERT promoter mutations in glioblastoma: interaction with the rs2853669 polymorphism and patient age at diagnosis. Neuro. Oncol. 17, 1231–1240 (2015).

58. 58.

Varadi, V. et al. A functional promoter polymorphism in the TERT gene does not affect inherited susceptibility to breast cancer. Cancer Genet. Cytogenet. 190, 71–74 (2009).

59. 59.

Nencha, U. et al. TERT promoter mutations and rs2853669 polymorphism: prognostic impact and interactions with common alterations in glioblastomas. J. Neurooncol. 126, 441–446 (2016).

60. 60.

Khan, A. et al. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 46, D260–D266 (2018).

61. 61.

Myocardial Infarction Genetics Consortium. et al.Genome-wide association of early-onset myocardial infarction with single nucleotide polymorphisms and copy number variants. Nat. Genet. 41, 334–341 (2009).

62. 62.

Kathiresan, S. et al. Six new loci associated with blood low-density lipoprotein cholesterol, high-density lipoprotein cholesterol or triglycerides in humans. Nat. Genet. 40, 189–197 (2008).

63. 63.

Teslovich, T. M. et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466, 707–713 (2010).

64. 64.

Melgar, M. F., Collins, F. S. & Sethupathy, P. Discovery of active enhancers through bidirectional expression of short transcripts. Genome Biol. 12, R113 (2011).

65. 65.

Treiber, T. et al. Early B cell factor 1 regulates B cell gene networks by activation, repression, and transcription- independent poising of chromatin. Immunity 32, 714–725 (2010).

66. 66.

Li, R. et al. Dynamic EBF1 occupancy directs sequential epigenetic and transcriptional events in B-cell programming. Genes Dev. 32, 96–111 (2018).

67. 67.

Pollard, K. S., Hubisz, M. J., Rosenbloom, K. R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010).

68. 68.

Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).

69. 69.

Davydov, E. V. et al. Identifying a high fraction of the human genome to be under selective constraint using GERP. PLoS Comput. Biol. 6, e1001025 (2010).

70. 70.

Rentzsch, P., Witten, D., Cooper, G. M., Shendure, J. & Kircher, M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 47, D886–D894 (2019).

71. 71.

Zerbino, D. R., Wilder, S. P., Johnson, N., Juettemann, T. & Flicek, P. R. The ensembl regulatory build. Genome Biol. 16, 56 (2015).

72. 72.

Kheradpour, P. & Kellis, M. Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments. Nucleic Acids Res. 42, 2976–2987 (2014).

73. 73.

Cirulli, E. T. & Goldstein, D. B. In vitro assays fail to predict in vivo effects of regulatory polymorphisms. Hum. Mol. Genet. 16, 1931–1939 (2007).

74. 74.

Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).

75. 75.

Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).

76. 76.

Kircher, M. Analysis of high-throughput ancient DNA sequencing data. Methods Mol. Biol. 840, 197–228 (2012).

## Acknowledgements

We thank current and previous members of the Ahituv, Kircher, and Shendure laboratories, as well as Daniela Witten, Greg Cooper, and Franklin Huang, for valuable discussions and feedback. Our work was supported by the National Human Genome Research Institute (NHGRI) grant numbers 1R01HG006768 and 1UM1HG009408 (NA & JS), 1R01HG009136 (JS), NHGRI and National Cancer Institute grant number 1R01CA197139 (NA & JS), National Institute of Mental Health grant numbers 1R01MH109907 and 1U01MH116438 (NA), and the Berlin Institute of Health (MK). J.S. is an investigator of the Howard Hughes Medical Institute.

## Author information

All authors designed the experiments; B.M., C.X. and F.I. performed all wet lab experiments; C.X., J.S., M.K., M.S. and N.A. outlined data analysis; M.K. processed and analyzed the high-throughput sequencing data; C.X., M.K., and M.S. performed data analysis; and all authors interpreted the experimental results and wrote the paper.

Correspondence to Martin Kircher or Jay Shendure or Nadav Ahituv.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

Peer review information: Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

Reprints and Permissions

Kircher, M., Xiong, C., Martin, B. et al. Saturation mutagenesis of twenty disease-associated regulatory elements at single base-pair resolution. Nat Commun 10, 3583 (2019) doi:10.1038/s41467-019-11526-w

• ### An in silico analysis of robust but fragile gene regulation links enhancer length to robustness

• Kenneth Barr
• , John Reinitz
•  & Ilya Ioshikhes

PLOS Computational Biology (2019)

• ### MaveDB: an open-source platform to distribute and interpret data from multiplexed assays of variant effect

• Daniel Esposito
• , Jochen Weile
• , Jay Shendure
• , Lea M. Starita
• , Anthony T. Papenfuss
• , Frederick P. Roth
• , Douglas M. Fowler
•  & Alan F. Rubin

Genome Biology (2019)