Introduction

The vast majority of the human genome is noncoding. Nonetheless, even as the cost of DNA sequencing plummeted over the past decade, the primary focus of sequencing-based studies of human disease has been on the ~1% that is protein-coding, i.e., the exome. However, it is clear that disease-contributory variation can and does occur within the noncoding regions of the genome1,2. For example, there are many Mendelian diseases for which specific mutations in promoters and enhancers are unequivocally causal3. Furthermore, for most Mendelian diseases, not all cases are explained by coding mutations, suggesting that regulatory mutations may explain some proportion of the remainder. For common diseases, although coding regions may be the most enriched subset of the genome, the vast majority of signal maps to the noncoding genome, and in particular to accessible chromatin in disease-relevant cell types4,5.

Nonetheless, the pinpointing of disease-contributory noncoding variants among the millions of variants present in any single individual6, or the hundreds of millions of variants observed in human populations7, remains a daunting challenge. To advance our understanding of disease as well as the clinical utility of genetic information, it is clear that we need to develop scalable multiplex assays of variant effect (MAVEs)8, specifically methods for accurately assessing the functional consequences of noncoding variants.

While our mechanistic understanding of regulatory sequences remains limited9, several groups, including us, have developed tools that summarize large amounts of functional genomic data (e.g., evolutionary conservation, gene model information, histone or TF ChIP-seq peaks, transcription factor binding site (TFBS) predictions) into scores that can be used to predict noncoding variant effects (e.g., CADD10, DeepSEA11, Eigen12, FATHMM-MKL13, FunSeq214, GWAVA15, LINSIGHT16, and ReMM17), segment annotations (e.g., chromHMM18, Segway19, and fitCons20), or sequence-based models (deltaSVM21). However, although these scores are widely used, it remains unclear how well they work.

A major bottleneck in the development of any interpretive method for noncoding variants is the assessment of prediction quality, as there are relatively few known pathogenic noncoding variants, nor consistently ascertained sets of functional measurements of noncoding variants. A recent study by Smedley et al.17 cataloged a total of 453 known disease-associated noncoding single nucleotide variants (SNVs) and used those to derive a score (ReMM). However, many of these variants fall within a small number of promoters that have been extensively studied. This leaves in question how generalizable the resulting scores are. Furthermore, catalogs of disease-associated variants like the one used by Smedley et al.17, or available from ClinVar22 or HGMD23, provide only qualitative labels for SNVs (e.g., likely pathogenic), rather than quantitative information on the magnitude of the effect. In sum, the qualitative nature, possible ascertainment biases, and relative paucity of “known” functional noncoding variants severely limit the assessment of available methods. While massively parallel reporter assays (MPRAs) that can assess thousands of sequences and their variants for their activity have been used to test the effect of a large number of variants, these have been primarily focused on common variants24,25 or less than a handful of disease-associated regulatory elements26,27.

To address this gap, we set out to generate variant-specific activity maps for 20 disease-associated regulatory elements, including ten promoters (of TERT, LDLR, HBB, HBG, HNF4A, MSMB, PKLR, F9, FOXE1, and GP1BB) and ten enhancers (of SORT1, ZRS, BCL11A, IRF4, IRF6, MYC (2×), RET, TCF7L2, and ZFAND3), together with one ultraconserved enhancer (UC88)28,29. Specifically, we use MPRAs to perform saturation mutagenesis26,27 on each of these regulatory elements, spanning 28–73% in GC content and 187–601 base pairs (bp) in length. Altogether, we empirically measure the functional effects of over 30,000 SNVs or single nucleotide deletions. We observe that the density of putative TFBS varies widely across the elements tested, as does the performance of various predictive strategies. These data comprise a comprehensive resource for the benchmarking and further development of noncoding variant effect scores, as well as an empirical database for the interpretation of the disease-causing potential of nearly any possible SNV in these regulatory elements.

Results

Selection of disease-associated elements

We selected 21 regulatory elements, including 20 commonly studied, disease-relevant promoter and enhancer sequences from the literature (Supplementary Tables 1 and 2), and one ultraconserved enhancer (UC88). For the former, we focused primarily on regulatory sequences in which specific mutations are known to cause disease, both for their clinical relevance and to provide for positive control variants. Selected elements were limited to 600 bp for technical reasons related to the mapping of variants to barcodes by subassembly30. In addition, we selected only sequences where cell line-based reporter assays were previously established.

For example, we selected the low-density lipoprotein receptor (LDLR) promoter, where mutations have been shown to cause familial hypercholesterolemia (FH), a disorder that results in accelerated atherosclerosis and increased risk for coronary heart disease31,32,33. We also tested the core promoter region (–200 to +57) of the telomerase reverse transcriptase (TERT) gene which is associated with oncogenic mutations34. In particular, NM_198253.2:c.-124C>T or c.-146C>T are frequently found in several cancer types, including glioblastoma34,35,36,37.

We also selected a sortilin 1 (SORT1) enhancer. A series of genome-wide association studies showed that the minor allele of a common noncoding polymorphism at the 1p13 locus (rs12740374) creates a CCAAT/enhancer binding protein (C/EBP) TFBS and increases the hepatic expression of the SORT1 gene, reducing LDL-C levels and risk for myocardial infarction in humans38. We cloned a ~600 bp region that includes rs12740374 as well as most nearby annotated TFBS according to ENCODE data (wgEncodeRegTfbsClusteredV3 track, UCSC Genome Browser), to identify additional functional variants in the enhancer and surrounding region. For this enhancer, we also conducted MPRA experiments in both forward and reverse orientations, with the goal of testing for any directionality dependence of variant effects.

All 21 selected promoter and enhancer regions were individually validated for functional activity in the appropriate cell lines (Supplementary Figs. 1 and 2; Supplementary Tables 1 and 2). This initial validation allowed us to optimize reporter assay conditions and to confirm that the cloned subsequences of the candidate regulatory elements resulted in measurable activities in the appropriate cell types. The validated luciferase expression levels ranged from 2- to 200-fold over empty vector (Supplementary Tables 1 and 2).

Construction of saturation mutagenesis libraries

In order to test the functional effects of thousands of mutations in these selected disease-associated regulatory elements, we first developed a scalable protocol for saturation mutagenesis-based MPRAs26,27 (Fig. 1). For each of the 21 regulatory elements (Supplementary Tables 1 and 2), we used error-prone PCR to introduce sequence variation at a frequency of less than 1 change per 100 bp. While error-prone PCR is known to be biased in the types of mutations that are generated (e.g., a preference for transitions and T/A transversions)39, high library complexities (50k–2M constructs per target) allowed us to capture nearly all possible SNVs as well as many 1-bp deletions with multiple independent constructs per variant (Supplementary Table 3). To distinguish the individual amplification products, we incorporated 15 or 20 bp random sequence tags 3′ of the target region using overhanging primers during the error-prone PCR.

Fig. 1
figure 1

Saturation mutagenesis MPRA of disease-associated regulatory elements. a Saturation mutagenesis MPRA. Error-prone PCR is used to generate sequence variants in a regulatory region of interest. The resulting PCR products with ~1/100 changes compared with the template region are integrated in a plasmid library containing random tag sequences in the 3′ UTR of a reporter gene. Associations between tags and sequence variants are learned through high-throughput sequencing. High complexity MPRA libraries (50k–2M) are transfected as plasmids into cell lines of interest. RNA and DNA is collected and sequence tags are used as a readout. Variant expression correlation (min. ten tags required) between full replicates of b LDLR (LDLR; LDLR.2) and c SORT1 (SORT1; SORT1.2). d Log2 variant effect of all SNVs (min. required tags ten) ordered by their RefSeq transcript position in NM_000527.4 of the hypercholesterolemia-associated LDLR promoter. Upper part shows the LDLR experiment, lower the full replicate LDLR.2. Significance level (red/green lines) is 10−5 in both expression profiles

We then cloned promoters and all, but two, enhancers into the backbones of slightly modified pGL4.11 (Promega, promoter) or pGL4.23 (Promega, enhancer) vectors (see Supplementary Tables 1 and 2), respectively, from which the reporter gene (as well as the minimal promoter in the case of enhancers) had been removed. For each of the 21 regulatory elements, we determined which variants were linked to which random tag sequences by deeply sequencing the corresponding library (see the “Methods” section). In the final step, we inserted the luciferase reporter gene (as well as the minimal promoter in the case of enhancers) in between the regulatory element and the tag sequence, and transformed the MPRA library into E.coli. With this insertion of the luciferase reporter gene, the above-introduced random tag sequence becomes part of its 3′ untranslated region (3′ UTR).

We obtained tag assignments (i.e., variant-tag associations) for a total of 24 saturation mutagenesis libraries (see the “Methods” section). This included the 21 selected regions listed in Supplementary Tables 1 and 2, as well as an additional full replicate for the LDLR and SORT1 enhancer libraries, and an additional SORT1 library with reversed sequence orientation of the enhancer. Supplementary Fig. 3 plots the number of tags associated with substitutions and 1-bp deletions along the target sequences for each library. The representation of tags associated with specific variants follows previously characterized biases in error-prone PCR using Taq polymerase40, with a preference of transitions (exchange of purine for pyrimidine base) over transversions (exchange between two-ring purines A/G to one-ring pyrimidines C/T) and T-A preference among transversions. Insertions were rare, while short deletions occured at rates similar to those of the rare transversions. For all libraries, we observed complete or near-complete coverage of all potential SNVs (Supplementary Table 3) as well as partial 1-bp deletion coverage. On average, 99.9% [99.1%, 100%] of all potential SNVs in the targeted regions are associated with at least one tag, while on average 55.4% [31.4%, 71.1%] of 1-bp deletions are associated with at least one tag.

Readout of disease-associated elements

For each MPRA experiment, around 5 million cells (Supplementary Tables 1 and 2) were plated and incubated for 24 h before transfection with the libraries. In each experiment, three independent cultures (replicates) were transfected with the same library. In addition, for LDLR and SORT1, independent MPRA libraries were created, as outlined above, and cells were transfected from a different culture and on a different day. In one case (TERT), the same MPRA library was used for experiments in two different cell types (HEK293T and a glioblastoma cell line).

We then used our published protocol for quantifying effects from RNA and DNA tag-sequencing readouts, including the previously suggested modification of using unique molecular identifiers (UMIs) during targeted amplification41 (see the “Methods” section). More specifically, the relative abundance of reporter gene transcripts driven by each promoter or enhancer variant was measured by counting associated 3′ UTR tags in amplicons derived from RNA (obtained by targeted RT-PCR), and normalized to its relative abundance in plasmid DNA (obtained by targeted PCR). For all experiments, we excluded tags not matching the assignment and determined the frequency of a tag in RNA or DNA from high-throughput sequencing experiments based on the number of unique UMIs. We only considered tag sequences observed in both RNA and DNA. Supplementary Table 4 summarizes the number of RNA and DNA counts obtained in each experiment. From individual tag counts in RNA and DNA, we fit a multiple linear regression model to infer individual variant effects (see the “Methods” section).

For data quality reasons, we introduced a minimum threshold on the number of associated tags per variant used in model fitting (various quality measures for fitted variant effects versus the number of tags are plotted in Supplementary Fig. 4). We picked this threshold based on the correlation of variant effects obtained when comparing between the independent libraries of LDLR (Fig. 1b and Supplementary Fig. 5) and SORT1 (Fig. 1c and Supplementary Fig. 6). Using all SNVs and 1-bp deletions with at least one associated tag in each transfection replicate, variant effects show a Pearson correlation of 0.93 (LDLR) and 0.94 (SORT1). When requiring a minimum of ten tag measurements after combining all three transfection replicates, correlations increase to 0.97 (LDLR) and 0.96 (SORT1). Requiring even higher thresholds (Supplementary Table 5) further improves replicate correlation up to 0.98 for both experiments (min. 50 tags), but reduces SNV coverage to 86.3% and 1-bp deletion coverage to 15.6% across all datasets (Supplementary Table 6). We therefore used a minimum of ten tags, reducing average coverage from 99.8% to 98.4% [93.0%, 100.0%] for all putative SNVs and from 44.4% to 25.3% [10.5%, 41.7%] for 1-bp deletions.

To assure high quality of our complete dataset, we evaluated the Pearson correlation of variant effects divided by their standard deviation among pairs of transfection replicates (Supplementary Table 7). We observed the lowest replicate correlation for BCL11A, FOXE1, and one of the MYC elements (rs1198622). In contrast, experiments for HBG1, IRF4, LDLR, SORT1, and TERT exhibited high reproducibility among transfection replicates (Pearson correlation > 0.9). Exploring differences in the proportion of alleles with significant regulatory activity, we observed a wide range of values across elements (3–52% of variants using a lenient p-value threshold of <0.01 for the fit; average 22%; Supplementary Table 7). We find that this proportion is strongly correlated with the performance of transfection replicates (Pearson correlation of 0.78), but we also note a circular relation for the significance of results, low experimental noise, and high reproducibility.

We sought to explore whether factors like target length, wild-type activity in the luciferase assay (Supplementary Fig. 1, Supplementary Tables 1 and 2), measures of assignment complexity (Supplementary Table 3), as well as DNA and RNA sequencing depth (Supplementary Table 4) contribute to technical reproducibility. Linear models of up to three features fit in a leave-one-out setup explain up to 33% of variance (luciferase wild-type activity, average number of tags per SNVs, and the proportion of wild-type haplotypes) or 29% of the variance (luciferase wild-type activity, average number of variants per haplotype, and average number of DNA counts obtained) in reproducibility between transfection replicates. Overall, these analyses emphasize the baseline activity of a regulatory element as the largest factor (i.e., highly active elements are associated with greater technical reproducibility).

General properties of observed regulatory mutations

Altogether, our MPRAs quantified the regulatory effects of 31,243 potential mutations (min. ten tags) at 9834 unique positions (Supplementary Table 8) and we setup an interactive website for exploring this dataset (https://mpra.gs.washington.edu/satMutMPRA/). Of the unique mutations, 4830 (15%) were identified as causing significant changes relative to the wild-type promoter or enhancer sequence (p-value of fit <10−5). Of those causing significant changes, 1789 (37%) increased expression (by a median of 20%) and 3041 (63%) decreased expression (by a median of 24%). The majority of significant effects were of small magnitude. If we require a minimum two-fold change, we identify a total of 83 activating and 559 repressing mutations. The significant shift toward repressing mutations (binomial test, p-value < 10−42) is consistent with a model where most transcriptional regulators are activators and binding is more easily lost than gained with single nucleotide changes.

Out of the 31,243 successfully assayed mutations, 2306 are 1-bp deletions, of which 229 meet the significance threshold (p-value of fit <10−5). This is a lower proportion than observed for SNVs (10% vs. 16%), most likely due to the lower rates at which 1-bp deletions are created by error-prone PCR, resulting in representation by fewer tags. Supporting this notion, 1-bp deletions tend to be associated with larger absolute effect sizes than SNVs (Wilcoxon Rank Sum test with continuity correction, p-value < 0.05, location shift 0.04). Similarly, we observe a large shift toward repressive effects with 1-bp deletions (significant effects: 33% activating [27%, 40%], 76 activating and 153 repressing; min. two-fold change 5% activating [1%, 17%], 2 activating and 37 repressing), but due to the low number of observations, this shift is not significantly different than that observed for SNVs (significant effects: 37% activating [36%, 39%], 1719 activating and 2882 repressing; min. two-fold change 14% activating [11%, 17%], 80 activating and 514 repressing).

We have greater power to detect significant effects for transitions than transversions, likely consequent to the higher sampling by error-prone PCR (Supplementary Table 9; binomial test comparing the proportion of significant transitions (2190/9824) vs. transversions (2411/19,113); p-value < 10−16). This is supported by specific transversions (A-T, T-A) that are also created more frequently by error-prone PCR (Supplementary Fig. 3) and represent a higher proportion of significant observations (A-T 374/2372 and T-A 422/2289, combined binomial test vs. all transversions, p-value < 10−16). Despite our greater power for assaying transitions, transversions had larger absolute effect sizes (Wilcoxon Rank Sum test with continuity correction, p-value < 10−16, location shift 0.14). This observation supports a model where regulatory elements evolved some level of robustness to the more frequent transitional changes (as is the case for coding sequences42), and is consistent with previous research showing that transversions have a larger impact on TF motifs and allele-specific TF binding43.

Our increased power to measure the effects of transitions resulted in an artifactual enrichment for significant effects among SNVs previously observed in gnomAD r2.17 (binomial test; overlap of tested SNVs with gnomAD, n = 689/31,243; of those with significant effects, n = 144/689; p-value < 0.001), where 64% of SNVs are transitions compared with 34% of mutations created in our libraries. However, testing separately for transitions and transversions, there is no enrichment of significant effects among SNVs previously observed in gnomAD. In fact, we observed a smaller absolute effect size for previously observed SNVs (Wilcoxon Rank Sum test with continuity correction; p-value = 0.06, location shift 0.03). This effect is significant if we exclude singletons (excl. 82/144 significant variants; Wilcoxon Rank Sum test with continuity correction; p-value < 0.01, location shift 0.07), consistent with purifying selection acting on standing regulatory variation.

The most obvious pattern upon visual inspection of the data is a strong clustering of positions associated with significant mutations (e.g., Fig. 1d). This clustering was non-random for all but the F9 and FOXE1 experiments (Wilcoxon Rank Sum tests with continuity correction vs. 1000 data shuffles; p-value < 0.01; for 16/21 elements, p-value < 10−5), as determined from comparing run lengths for significant changes including directionality of the change. While FOXE1 is one of the experiments mentioned above with low experimental reproducibility, a non-random clustering of significant regulatory changes was observed in F9 when additionally requiring a minimum effect size of 20% (Wilcoxon Rank Sum tests with continuity correction vs. 1000 data shuffles; p-value < 10-9). These results are consistent with expectations for TFBS (specific examples are discussed below).

In the following sections, we describe the results of saturation mutagenesis of three of the regulatory elements in greater detail: the LDLR promoter, the TERT promoter, and a SORT1-associated enhancer. Similar expositions on the remaining 18 elements are provided as Supplementary Note 1 and Supplementary Tables 10 and 11. Finally, we compare the relative performance of various computational tools for predicting these empirical measurements of regulatory effects.

LDLR promoter

FH is an autosomal dominant disorder of low-density lipoprotein (LDL) metabolism, which results in accelerated atherosclerosis and increased risk of coronary heart disease44. With a prevalence of about 1 in 500 individuals, FH is the most common monogenic disorder of lipoprotein metabolism. It is mainly due to mutations in the LDL receptor (LDLR) gene that lead to the accumulation of LDL particles in the plasma45. Several studies have shown that variants in the LDLR promoter can alter the transcriptional activity of the gene and also cause FH31,32,33 (full reference list in Supplementary Table 12). While in some cases, mutations in the promoter were identified in patients, a functional follow-up, like testing the regulatory effect of the variants by means of a luciferase assay, was not always conducted46,47,48. To decipher the functional activity of these previously identified, as well as essentially all potential SNVs in the LDLR (NM_000527.4) promoter, we performed saturation mutagenesis MPRA in HepG2 cells, a commonly used cell line for LDLR functional studies33. Experiments for this promoter were performed using two full replicates (i.e., two independently constructed saturation mutagenesis libraries), referred to below as LDLR and LDLR.2 (Fig. 1b).

We observed strong concordance between our MPRA-based measurements and variants with previous luciferase activity results (Supplementary Table 12). For example, a c.-152C>T mutation was previously reported to reduce promoter activity (to 40% of normal activity), while a c.-217C>T variant was shown to increase transcription (to 160% of normal activity)31. We observe a reduction of 32%/39% (LDLR/LDLR.2) and activation of 273%/263% (LDLR/LDLR.2) for these variants, respectively. c.-142C>T reduced promoter activity (to 20% of normal activity) in transient transfection assays in HepG2 cells32, and we observed 20%/11% (LDLR/LDLR.2) residual activity. Mutations located in regulatory elements R2 and R3 (c.-136C>G, c.-140C>G, and c.-140C>T) resulted in 6–15% residual activity;33 our MPRA results confirm these findings in both replicates (10–22% residual activity). We also observed no significant changes in promoter activity for c.-36T>G and c.-88G>A, consistent with a previous study of these variants33.

Overall, we observe that variants located in close proximity and overlapping the same TFBS tend to show similar deactivating effects (e.g., SP1 and SREBP1/SREBP2 sites in Supplementary Table 12 and Fig. 1d). Previously reported variants located in the 5′ UTR of LDLR generally did not affect promoter activity. The high concordance between full replicates (Pearson correlation of 0.97) as well as the agreement with previous studies give us confidence in the potential of our MPRA results to be useful for the clinical interpretation of LDLR promoter mutations. It also reinforces the value of functional assays covering all possible variants of a regulatory sequence of interest, as this provides consistent and comparable readouts together with a distribution of effect sizes.

TERT promoter

Mutations in the telomerase reverse transcriptase (TERT) promoter (NM_198253.2), in particular c.-124C>T or c.-146C>T, increase telomerase activity and are among the most common somatic mutations observed in cancer34,35,36,37. Previous luciferase assay studies showed that these mutations increase promoter activity in human embryonic kidney (HEK) 293 cells, glioblastoma, melanoma, bladder cancer, and hepatocellular carcinoma (HepG2) cells34,49,50,51. In glioblastoma cells, c.-124C>T or c.-146C>T mutations result in a 2–4 log2 fold increase in promoter activity51.

Here, we tested the TERT promoter MPRA library in two different cell types, HEK293T and glioblastoma SF7996 cells52, referred to here as GBM (Fig. 2a), observing a log2 fold increase in promoter activity of 2.00/2.86-fold for c.-124C>T and 1.42/2.42-fold for c.-146C>T in HEK293T and GBM cells, respectively. We also identified additional activating mutations, some of which were previously identified in cancer studies49,53,54 and are annotated in COSMIC55. These include c.-45G>T and c.-54C>A, previously identified as somatic mutations in bladder cancers53,54, and c.-57A>C, previously associated with both melanoma49 and bladder cancer54. We observed activating effects for c.-45G>T and c.-54C>A with a log2 increase of 0.81/1.65-fold and 0.45/1.03-fold for HEK293T and GBM cells, respectively. For c.-57A>C, we observed a 0.65/1.14-fold log2 increase in HEK293T and GBM cells, respectively, similar to previous reporter assays that obtained increased expression of 0.6-fold (152%) and 0.3-fold (123%) on a log2-scale over the wild-type construct in Ma-Mel-86a and HEK293T cells49, respectively.

Fig. 2
figure 2

Saturation mutagenesis MPRA of the cancer-associated TERT promoter. a Log2 variant effect of all SNVs (min. ten tags required) ordered by their RefSeq transcript position in NM_198253.2 of TERT. Upper panel shows the TERT experiment in HEK293T cells and the lower in GBM (SF7996) cells, where the E2F repressor site is marked. A significance threshold of 10−5 was used (red vs. green vertical lines). b Expression profile of TERT-GBM-siScramble (gray). Ninety five percent confidence intervals of variants from TERT-GBM-siScramble (green) and TERT-GBM-siGABPA (red), that were significantly different between the two experiments, are overlaid. In addition, predicted ETS-related motifs in the reference sequence (green) or variant induced ETS-related motifs (blue) are marked. c Position weight matrix (PWM) score change of variants that show a significant difference between siGABPA and the scramble siRNA experiment. Motif scores are plotted as boxplots with median center line, upper and lower quartiles box limits, and 1.5× interquartile range whiskers. Variants were only used if they overlapped an ETS-related factor motif (GABPA, ETS1, ELK4, ETV1, and ETV4-6) with a score (reference or alternative sequence) larger than the 80th percentile of the best possible motif match to the PWM. TERT-GBM-siGABPA variant effects were divided by the effect measured in the siRNA scramble experiment. Three asterisks mark a significance level of 10−9 by the two-sided Wilcoxon Rank Sum test (activating n = 34, repressing n = 162)

A common single-nucleotide polymorphism (SNP) in the TERT promoter, rs2853669 (c.-245A>G), previously studied in several cell lines and cancer cohorts, has been suggested to alter promoter-mediated TERT expression by impacting E2F1 or ETS/TCF binding56,57. However, studies in both breast cancer58 and glioblastomas59 failed to find any impact on risk or prognosis of this polymorphism. The epidemiologic findings are in line with our results, as we did not observe a significant effect of this variant on promoter activity in either cell type.

We next sought to assess whether there are differences in mutational effects on the TERT promoter between HEK293T and GBM cells that could be driven by the trans environment. Overall, variant effects were highly concordant between the two cell types (Fig. 2a). However, we did observe significant differences at several specific positions. In particular, variants between c.-62 and c.-70, which corresponds to an E2F repressor site, were found to increase promoter activity in GBM cells, likely due to the disruption of this motif (Fig. 2a). None of these effects were observed in HEK293T cells, suggesting that different E2F family protein abundances could be driving the differences in promoter activity between these cell types, and potentially between the corresponding cancer types.

Previous work has shown that the commonly observed cancer-associated activating somatic mutations, c.-146C>T and c.-124C>T, lead to the formation of an ETS binding site that is bound by the multimeric GABP transcription factor in GBM cells51. To evaluate the relevance of GABP binding on TERT promoter activity more globally, we retested our TERT MPRA library in GBM cells with a short interfering RNA (siRNA) targeting GABPA. We first optimized GABPA knockdown conditions using qPCR, such that it reduced GABPA expression by 68% ± 7% and TERT promoter activity by 58% ± 12%, compared with a scrambled siRNA control (Supplementary Fig. 7). We then tested our MPRA library in GBM cells using either the GABPA siRNA or the scrambled control. A total of 63 variants were identified as significantly different (see “Methods” and Supplementary Table 13), 59 leading to a reduction, and 4 to an increase in activity. Both c.-146C>T and c.-124C>T, previously reported to create GABP binding sites, showed significantly reduced activity in the siGABPA knockdown compared with the scrambled control (Fig. 2b). Variants with significantly different activity were 2.8-fold enriched for ETS-related factor motifs (ETS1, ELK4, ETV1, ETV4-6, and GAPBA) annotated from JASPAR 201860 as compared with all 908 other variants present in both experiments (one-sided Fisher’s exact test; p-value < 0.001).

Apart from the two variants (c.-146C>T and c.-124C>T) known to create GABPA binding sites, we identified nine additional variants with the potential to create new ETS family motifs from the siGABPA knockdown (Supplementary Table 13). To include such instances as part of a global analysis, we computed score differences of the corresponding position weight matrices (PWMs) from activating and repressing variants that overlap either a reference or newly created ETS family motif (reference or alternative sequence larger than the 80th percentile of the motifs’ best matches; 12 activating and 26 repressive variants). Figure 2c shows that activating variants create new ETS-related factor motifs and repressing variants disrupt them. The PWM score changes are highly significant between activating and repressing variants (one-sided Wilcoxon Rank Sum test, p-value < 10−11). Overall, the average expression reduction for the motif-disruptive allele in cases where variants disrupt or create ETS family motifs is 29.5% ± 14%, which is in concordance with the 58% ± 12% reduced qPCR expression of TERT (Supplementary Fig. 7).

SORT1-associated enhancer

The Sortilin 1 (SORT1)-associated enhancer was identified via a common SNP, rs12740374 (GRCh37 chr1:109,817,590G>T), which is associated with myocardial infarction61. The minor allele T creates a potential C/EBP binding site, leading to ~4-fold greater luciferase activity compared with the major allele G in a reporter assay38. This is thought to alter the expression of the SORT1 and proline and serine rich coiled-coil 1 (PSRC1) genes, leading to changes in LDL and VLDL plasma levels38. The major allele is thus associated with higher LDL-C levels and increased risk for myocardial infarction. These results are also consistent with prior human lipoprotein QTL analyses62,63.

We carried out saturation-based MPRA in HepG2 cells using a 600 bp region encompassing rs12740374 with two full replicates (i.e., two independently constructed saturation mutagenesis libraries transfected at different days with three technical replicates each, SORT1 and SORT1.2, Fig. 1c). Consistent with the literature, our MPRA results show a significant effect for rs12740374, leading to a 2.92/2.74 log2-fold increase in expression. Furthermore, we observe many other substitutions of large effect, with a disproportionate number of >2-fold expression changes in our overall dataset (144/645) occurring in the SORT1 enhancer (Supplementary Table 8). The locations of these variants are strongly clustered (Fig. 3), indicative of several TFBS in this region.

Fig. 3
figure 3

Saturation mutagenesis MPRA of a myocardial infarction-associated SORT1 enhancer. Expression effects of SNVs from experiments SORT1, SORT1.2, and SORT1.flip. Direction of SORT1 and SORT1.2 was from left to right in the experiments. In the SORT1.flip experiment, the direction was reversed (right to left in the figure). Highlighted area in red, close to the experimental promoter site in SORT1 and SORT1.2, is different between the SORT1/SORT1.2 and SORT1.flip experiments. In this region, JASPAR annotates an EBF1 motif (MA0154.3). A significance threshold of 10−5 was used (red vs. green vertical lines)

The directionality independence of enhancers is inherent to their definition but is not often tested. To evaluate whether the orientation of an enhancer could bias the effects of mutations within it, we generated a third SORT1 enhancer library where the orientation of the enhancer was flipped, termed SORT1.flip. Comparison of all variants showing a difference in activity compared with the reference sequence (p-value of fit <10−5) in the SORT1.flip with the other two libraries in the opposite orientation (SORT1 and SORT1.2), we observe a very strong correlation (0.97 and 0.96 for SORT1 and SORT1.2, respectively; Pearson’s correlation), on par with the two biological replicates in the same orientation, SORT1 and SORT1.2 (0.98; Pearson’s correlation). This result supports the directionality-independence of enhancers64 as well as of the effects of variants within them.

However, we did observe a few significant differences between SORT1/SORT1.2 and SORT1.flip near the 3′ region of the forward orientation (GRCh37 chr1:109,817,859-109,817,872), i.e., adjacent to the minimal promoter on the reporter construct. This block of variants led to a significant increase in activity in the forward orientation (SORT1/SORT1.2) but not in the opposite orientation (SORT1.flip) (Fig. 3). Analysis of this region for TFBS using JASPAR 201860 found a potential EBF1 motif (MA0154.3), suggesting that this factor might lead to the orientation activity differences observed in our assays. EBF1 is known to act as an activator or repressor of gene expression65,66 and mutations to its core motif sequences (GGG and CCC) had the strongest effect in our SORT1 assay. Nonetheless, its 3′ location and orientation dependence suggests that it is likely contributing to minimal promoter rather than enhancer activity.

Computational tools predict expression effects poorly

Altogether, our MPRAs analyzed over 30,000 different mutations for their effects on regulatory function. We next set out to assess the performance of the available computational tools and annotations for predicting the regulatory effects of individual variants. For this purpose, we examined various measures of conservation (PhyloP67, PhastCons68, and GERP++69) as well as a number of computational tools that integrate large sets of functional genomics data into combined scores (CADD70, DeepSEA11, Eigen12, FATHMM-MKL13, FunSeq214, GWAVA15, LINSIGHT16, and ReMM17). In addition, we previously identified the number of overlapping TFBS as a significant predictive measure of the activity of a specific region41. We therefore also analyzed TFBS annotations resulting from motif predictions overlayed with biochemical evidence from ChIP-seq experiments (available as Ensembl Regulatory Build (ERB)71 and ENCODE72 annotations) as well as pure motif predictions from JASPAR 201860. Using JASPAR predictions, we extended this analysis to individual positions and explored different score thresholds or just the factors predicted most frequently across the region (see “Methods”). All these annotations and scores are agnostic to the cell type(s) in which we studied each sequence. Therefore, we also compared our results to sequence-based models (deltaSVM21) for 10 of 21 MPRAs, i.e., where a model was publicly available for the corresponding cell type (HEK293T, HeLa S3, HepG2, K562, and LNCaP). In cases where an annotation is based on positions rather than alleles, we assumed the same value for all substitutions at each position. We did not include the 1-bp deletions in this analysis, as most annotations are not defined for deletions.

Supplementary Tables 14 and 15 report the Pearson and Spearman correlation of the obtained expression effect readouts with conservation metrics, combined annotation scores, and overlapping TF predictions, respectively. Figure 4 and Supplementary Figs. 8 and 9 visually contrast expression effects as well as a subset of these annotations (including ENCODE72 and Ensembl71 motif annotations). By correlating absolute expression effects with functional scores, we identified species conservation as a major driver of the currently available combined scores in these regions. For example, the correlation results of CADD v1.3/v1.4, Eigen, FATHMM-MKL and LINSIGHT show more than 90% Pearson correlation across elements with the results of PhastCons scores calculated from the alignment of mammalian genome sequences. However, conservation seems only informative for a subset of the studied noncoding regulatory elements (e.g., LDLR, ZFAND3, and IRF4, but not SORT1, F9, or GP1BB). We observe that repressive effects can at least be partially aligned with available motif data (e.g., F9, GP1BB, IRF4, LDLR, and SORT1). However, experimentally supported motif annotations are frequently incomplete (for an example see Fig. 4a, around c.-215 where motifs are predicted in JASPAR but absent from ENCODE and ERB). In several cases, motif annotation is also missing completely. For 13 of the 21 elements studied here, no motif annotation was available from ERB; for 2 of the 21 elements no motif annotation was available from ENCODE. Furthermore, the gain-of-binding motifs, e.g., the motifs underlying activating mutations in TERT, are currently not at all or insufficiently modeled in available scores, as these motifs are frequently missed by scans of the reference genome.

Fig. 4
figure 4

Current computational tools are poor predictors of expression effects. Expression effects of a LDLR and b TERT (significance threshold 10−5; red vs. green vertical lines) compared with PhastCons68 conservation scores, combined scores of functional genomics data (CADD v1.470, DeepSEA11, Eigen12, FATHMM-MKL13, and number of overlapping 10th percentile scoring JASPAR60 motifs), and annotated motifs by ENCODE72 and Ensembl Regulatory Build (ERB) v9071

Looking at average Spearman correlations across our 21 regions (Fig. 5, Supplementary Table 15), DeepSEA (0.22) performed best, followed by FunSeq2 (0.14), Eigen (0.14), high-scoring (top 10th percentile) JASPAR predictions (0.14), and FATHMM (0.14). However, the average is a poor measure here. Spearman correlations for absolute expression effects of some elements were reasonably high for several methods (0.3–0.6), while for other elements no or negative correlations were detected for most or all methods. We saw the best performance in predicting an individual element (LDLR) for FATHMM-MKL (0.59), followed by GerpN (0.59), Eigen (0.58), and vertebrate PhastCons (0.58). Besides LDLR, the next best agreement between annotations and absolute expression effects was observed for F9 (top 10th percentile JASPAR 0.52), IRF4 (Eigen 0.48), ZFAND3 (DeepSEA 0.44), and PKLR (LINSIGHT 0.41). The lowest agreement was observed for FOXE1 (DeepSEA 0.05) and BCL11A (DeepSEA 0.03); the two elements observed with the lowest replicate correlation (Supplementary Table 7). However, high replicate correlations did also not indicate high correlations with existing scores or annotations. For example, SORT1 (replicate correlations of 0.99) and TERT (replicate correlations of >0.96 for GBM experiment) have highest correlations of 0.33 (top 25th percentile JASPAR) and 0.37 (DeepSEA), respectively.

Fig. 5
figure 5

Spearman correlation of computational scores with measured expression effects. The figure reports Spearman correlation coefficients (in percent) of the absolute expression effect for all SNVs with at least ten tags in each region with various measures agnostic to the cell type, like conservation (mammalian PhyloP, mammalian PhastCons, and GERP++), overlapping TFBS as predicted in JASPAR 2018 (counting those in the top 10th percentile of motif scores across all elements, all motifs, and additional percentiles are available in Supplementary Table 15), and computational tools that integrate large sets of functional genomics data in combined scores (CADD v1.4, DeepSEA, Eigen, FATHMM-MKL, FunSeq2, GWAVA region model, LINSIGHT, and ReMM). In addition, we compared a subset of experiments (10/21) to absolute deltaSVM scores available for specific cell types (HEK293T, HeLa S3, HepG2, K562, and LNCaP). In cases where an annotation is based on positions rather than alleles, we assumed the same value for all substitutions at each position. The column Type assigns each region as either enhancer (enh.), promoter (prom.), or ultraconserved element (UC). MYC (rs11986220) and MYC (rs6983267) are abbreviated to MYCs1 and MYCs2, respectively. Blue bars denote positive and red bars negative correlation

Among the ten elements for which sequence-based models were available from deltaSVM, DeepSEA, the top 10th percentile JASPAR predictions and FunSeq2 scores still performed best in predicting absolute effect size. The absolute deltaSVM models ranked 7th out of 25 measures and were most similar to the number of overlapping top 25th percentile JASPAR predicted motifs (0.80 Pearson correlation). However, deltaSVM models showed improved performance when correlating expression effects including their directionality (0.53 Spearman correlation for SORT1, 0.43 Spearman correlation for F9, Supplementary Table 15). The directionality available from deltaSVM, i.e., predicted expression gain and loss, illustrates how sequence-based models can overcome missing gain-of-motif annotations from reference sequence-based predictions.

In these comparisons, we focused on the relative ranking of variant effects and included a large number of close-to-zero effect estimates affected by experimental noise. This might be conservative for estimating the predictive power in classification settings, e.g., separating pathogenic from benign variants. Testing the separation of no/little effect SNVs from the top 200/500/1000 highest effect promoter (Supplementary Table 16; Supplementary Figs. 10 and 11) or enhancer (Supplementary Table 17; Supplementary Figs. 12 and 13) variants, we also find DeepSEA with the best average performance in both groups (ave. AUROC 0.831 promoter and 0.709 enhancer). We see better performance on the top 200 vs. top 500 vs. the top 1000 variants for both enhancers and promoters. While this difference is more pronounced for promoters, it goes along a shift toward certain elements contributing the majority of selected variants (see Supplementary Tables 16 and 17). Considering that certain elements dominate our classification results (promoters: LDLR, PKLR, and TERT; enhancers: SORT1, IRF4, and IRF6), we caution to overinterpret differences between promoters and enhancers (e.g., better performance of conservation scores for promoters vs. better performance of motif predictions for enhancers). A much larger number of elements in each group (exceeding the scope of this study) and eventually a balanced sampling of elements will be required for that.

To summarize, we observe that even the best performing computational tools or annotations are less predictive of regulatory variants compared with the classification of coding variants70 and can explain only a small proportion of the expression effects observed in our data. The highest variance explained based on Pearson R2 is 0.41 (mammalian PhastCons for LDLR, Supplementary Table 14), the average across all elements is just 0.03.

Discussion

Although limited in their naturalness, MPRAs enable a rigorous, quantitative ascertainment of the regulatory consequences of genetic variants. Previous studies applied MPRAs to study common genetic variants in various regions, with the goal of fine-mapping the causal regulatory determinants of GWAS or eQTL associations24,25. In contrast, here we selected sequences with known regulatory potential—and moreover, sequences previously implicated in human disease—and sought to quantify the consequences of all possible SNVs on that potential.

Saturating MPRAs uniquely facilitates several kinds of analysis. First, we are able to formally evaluate the distribution of effect sizes in regulatory elements. For example, what proportion of variants are inert, activating or repressing? How typical or atypical is a regulatory variant that results in a two-fold expression change? Although the answers to such questions undoubtedly differs between regulatory elements, the number of elements and variants that were studied allows us to begin to make generalizations. Second, saturating MPRAs facilitate the fine-scale identification of TFBSs, including ones that may correspond to transcriptional regulators for which ChIP-seq data are not available or that are not well represented in motif databases. Importantly, it also allows the discovery of binding sites that are created by genetic variants, i.e., through activating mutations. Third, we show that the intersection of saturation mutagenesis MPRAs and TF perturbation, i.e., our siRNA-based knockdown of GABPA, enables confirmation that a particular TF binds to a particular TFBS. Larger-scale implementations of this approach may facilitate the routine identification of the specific trans-acting factors that are responsible for the regulatory potential of each cis-acting regulatory element.

Our study and these data have several limitations that merit highlighting. First, we are limited with respect to context, both cis and trans. To address the former, we used longer sequences than are typical for MPRAs, up to 600 bp, but it remains the case that these are studied on episomal vectors rather than in their native locations. To address the latter, we selected cell lines in which these elements were previously shown to be active, and moreover relevant to the diseases in which these elements were implicated. Nonetheless, previous studies have demonstrated that some regulatory polymorphisms do not always reflect their in vivo effects in cell culture-based assays73, particularly for developmental genes that show temporal and tissue-specific expression patterns (e.g., see results for the IRF6 and ZRS in Supplementary Note 1). A second limitation relates to the reproducibility of measurements for some of the elements studied, and in particular for those with lower basal activity (which we found to be the largest factor impacting reproducibility). Potential approaches to address this in future work include using a stronger minimal promoter (for enhancers) or simply using more complex libraries to further reduce noise (for all elements).

A clear result of our analyses is that although myriad annotations and integrative scores are available, and although some annotations/scores are surprisingly successful in specific cases, no current score consistently performs well in predicting the regulatory consequences of SNVs in the human genome. It is our hope that this dataset of functional measurements for over 30,000 single-nucleotide substitution and deletion regulatory mutations in disease-associated regulatory elements will be useful for the field for studying the shortcomings of current tools, and hopefully inspiring their improvement.

In summary, we successfully scaled saturation mutagenesis-based MPRAs to measure the regulatory consequences of tens-of-thousands of sequence variants in promoter and enhancer sequences previously associated with clinically relevant phenotypes. We believe that our experiments provide a rich dataset for benchmarking predictive models of variant effects, an unprecedented database for the interpretation of potentially disease-causing regulatory mutations, and the potential for critical insight for the development of improved computational tools.

Methods

Selection of target sequences and luciferase assays

Promoter and enhancer sequences of interest (Supplementary Tables 1 and 2) were amplified from human genomic DNA (Roche 11691112001). The targets were amplified with overhanging primers (Supplementary Table 18) to add shared sequences for cloning. All promoters were cloned into pGL4.11b vector [modified from pGL4.11 (Promega) by Dr Richard M. Myers lab] and most of the enhancers were cloned into the pGL4.23 vector (Promega) that contains a minimal promoter followed by the luciferase reporter gene (Supplementary Tables 1 and 2). All inserts were confirmed by Sanger sequencing. We measured the relative luciferase activity of the selected promoters and enhancers for the wild-type as well as the saturation mutagenesis library, normalized to the empty vector. For this purpose, HepG2 (HB-8065), HEK293T (CRL-11268), HeLa (CCL-2), HaCaT (CRL-2404), Neuro-2a (CCL-131), LNCaP (CRL-1740), and SK-MEL-28 (HTB-72) were obtained from American Type Culture Collection (ATCC), the primary glioblastoma cell line SF7996 (GBM)52 was obtained from UCSF (Dr Costello’s lab), and Min6 was a gift by Dr Feroz Papa’s lab at UCSF. A total of 2.0 × 105 cells were cultured in 96-well plates overnight using standard protocols and were transfected with 100 ng of plasmid bearing the promoter/enhancer sequence (Supplementary Tables 1 and 2) (along with 10 ng of the Renilla vector to facilitate normalization for transfection efficiency) using X-tremeGENE HP DNA transfection reagent (Roche 06366236001) according to the manufacturer’s protocol. K562 (ATCC CCL-243), HEL92.1.7 (ATCC TIB-180), and NIH/3T3 cells (ATCC CRL-1658) were cultured in 24-well plates and transfected using X-tremeGENE HP with 500 ng of the constructed plasmid, along with 50 ng of the Renilla vector. The promoter/enhancer activity was measured using the Dual-Luciferase reporter assay (Promega E1910) on a Synergy 2 microplate reader (BioTek Instruments) following a post-transfection interval that varied by experiment (Supplementary Tables 1 and 2).

Construction of MPRA libraries

The cloned enhancer or promoter sequences were amplified in two rounds of PCR. After amplifying the cloned sequences once to append universal adaptors for subsequent steps (Supplementary Table 19), a second error-prone PCR with Mutazyme II (GeneMorph II Random Mutagenesis Kit, Agilent 200550) was used to introduce sequence variation. This second round also added a 15 or 20 bp random tag (i.e., barcode) contained within an overhanging primer oligo (Supplementary Table 20) to each construct.

The resulting PCR products were cloned into the respective vector backbone (Supplementary Tables 1 and 2) without the luciferase reporter via NEBuilder HiFi DNA Assembly (NEB E2621) and transformed into 10-Beta Electrocompetent cells (NEB C3020K). As needed, multiple transformations were pooled and midi-prepped together (Chargeswitch Pro Filter Plasmid Midi Kit, Invitrogen CS31104). Using the vector backbone without the luciferase gene allowed for association of each sequence variant with its newly added tag via sequencing (see below). Once this interim library was determined to have a sufficient complexity and representation of variants, the luciferase gene was inserted between the enhancer/promoter and its tag via a sticky end ligation and transformed again to make the final library for transfection.

Assignment of tags to sequence variants

The associations between tags and sequence variants created by error-prone PCR were learned by amplifying and deep sequencing of this region of the plasmid library before the luciferase gene was added, i.e., while the enhancer/promoters and tags are in close proximity. For short enhancers/promoters, the libraries were amplified with sequencing adaptor primers that captured the cloned sequence with its tag and added the P5/P7 Illumina flow cell sequences (Supplementary Table 21). For long enhancers/promoters, a custom subassembly sequencing approach30 was used to obtain associated tags and sequences along the targets. Here, libraries were also first amplified with sequencing adaptors, but then some of the full-length products were subjected to tagmentation via a Nextera library prep (Illumina FC-121-1031). The tagmented products were amplified with a Nextera-specific primer on the P5 end, and a primer containing only the P7 flow cell sequence on the other end (Supplementary Table 22). These PCR fragments were size-selected on a 1%-agarose gel into two size bins. Full-length and fragmented libraries were quantitated with a Kapa Library Quantification Kit (Roche 07960140001). Products were run on either an Illumina MiSeq or NextSeq instrument (Supplementary Table 23). Full-length and large-size fragment bins were loaded with increased DNA concentration, as these are less efficiently amplified during the cluster generation process. Sequence reads were aligned using BWA-mem v0.7.10-r78974 with an increased penalty against local alignments (-L 80) to the Sanger determined references. A minimum coverage of three reads along the whole target was required to include variant calls from bcftools v1.275 for each identified tag. Summary statistics for these assignments are available in Supplementary Table 3.

Expression of libraries and nucleic acid extraction

For each experiment, about 5 million cells were plated in 15-cm plates and incubated for 24 h before transfection. Each of three independent cultures (replicates) were transfected with 15 μg of the constructed MPRA libraries using X-tremeGENE HP (Roche 06366236001). After indicated hours (Supplementary Tables 1 and 2), cells were harvested, genomic DNA and total RNA were extracted using AllPrep DNA/RNA mini kit (Qiagen 80204). Total RNA was subjected to mRNA selection (Oligotex, Qiagen 72022) and treated with Turbo DNase (Thermo Fisher Scientific AM2238) to remove contaminating DNA.

RNA interference (TERT promoter)

Following the protocol outlined in Bell et al.51, siRNAs were transfected into GBM cells (SF7996) using DharmaFECT 1 following the manufacturer’s protocol. Briefly, cells were seeded at a density of 30,000 cells/mL in a 96-well plate and 5 million cells in 15-cm plates in parallel. Twenty-four hours post seeding, cells were transfected with 50 nM of siRNA and 0.3 μL of DharmaFECT 1 reagent (Dharmacon T-2001). At 48 h post transfection with siRNA, cells were transfected again with the TERT saturation mutagenesis library for another 24 h before harvesting for genomic DNA and total RNA as described above. To measure siRNA knockdown efficiency, cDNA was generated from the 96-well plate, and qPCR performed (Power SYBR Green Cells-to-Ct kit, Ambion 4402953) to measure mRNA abundance of GABPA and TERT with primer sequences previously used (Supplementary Table 24). Relative expression levels (Supplementary Fig. 7) were calculated using the deltaCT method against housekeeping gene GUSB51.

RNA and DNA library preparation

For each replicate, RNA was reverse transcribed with Superscript II (Invitrogen 18064-014) using a primer downstream of the tag. The resulting cDNA was first pooled and then split into multiple reactions to reduce PCR jackpotting effects. Amplification was performed with Kapa Robust polymerase (Roche KK5024) for three cycles, incorporating UMIs 10 bp in length, a sample identifier and the Illumina P7 adapter (Supplementary Table 25). PCR products were cleaned up with AMPure XP beads (Beckman Coulter A63880) to remove the primers and concentrate the products. These products underwent a second round of amplification in eight reactions per replicate for 15 cycles, with a reverse primer containing only P7 (Supplementary Table 25). All reactions were pooled, run on an agarose gel for size selection, and then sequenced. For the DNA, each replicate was amplified for three cycles with UMI-incorporating primers, just as the RNA. First round products were then cleaned up with AMPure XP beads, and amplified in split reactions, each for 20 cycles. Reactions were pooled, gel-purified, and sequenced.

Sequencing and primary processing

RNA and DNA for each of the three replicates were sequenced on an Illumina NextSeq instrument (2 × 15 or 2 × 20 bp + 10 bp UMI + 10 bp sample index; primers available in Supplementary Table 26). Paired-end reads each sequenced the tags from the forward and reverse direction and allowed for adapter trimming and consensus calling of tags76. Tag or UMI reads containing unresolved bases (N) or those not matching the designed length were excluded. In the downstream analysis, each tag × UMI pair is counted only once and only tags matching the above obtained assignment were considered (Supplementary Table 4).

Inferring SNV effects

RNA and DNA counts for each replicate were combined by tag sequence, excluding tags not observed in both RNA and DNA of the same experimental replicate. All tags (T) not associated with insertions or multiple bp deletions were included in a matrix of RNA count, DNA count, and N binary columns indicating whether a specific sequences variant was associated with the tag. We then fit a multiple linear regression model of log2(RNA\({j}\)) ~ log2(DNA\({j}\)) + N + offset (jT) and report the coefficients of N as effects for each variant. Further, we fit a combined model (Equation 1) across all three experimental replicates (\(i \in \{ 1,2,3\}\)), where we combine all RNA measures in one column and keep the DNA readouts separated by replicate (i.e., filling missing values with 0).

$${\mathrm{log}}_2\left( {{\mathrm{RNA}}_{i,{{j}}}} \right) \sim \left\{ {\begin{array}{*{20}{c}} {{\mathrm{log}}_2\left( {{\mathrm{DNA}}_1,j}\right)|i = 1} \\ {0|{\mathrm{else}}} \end{array}} \right\} + \left\{ {\begin{array}{*{20}{c}} {{\mathrm{log}}_2\left( {{\mathrm{DNA}}_2,j} \right)|i = 2} \\ {0|{\mathrm{else}}} \end{array}} \right\} \\ \hskip 12pt \hskip 12pt + \, \left\{ {\begin{array}{*{20}{c}} {{\mathrm{log}}_2\left( {{\mathrm{DNA}}_3,j} \right)|i = 3} \\ {0|{\mathrm{else}}} \end{array}} \right\} + N + {\mathrm{offset}}$$
(1)

We required a minimum of ten tags per variant, before considering variant effects in downstream analyses (Supplementary Tables 5 and 6). While statistically inflated (assumption of independence between tags, correlated counts due to saturation mutagenesis process and no modeling of epistasis effects), we used the p-value for the obtained coefficients as a proxy for the support for each individual nonzero variant effect.

We used the correlation between transfection replication of fitted variant effects divided by the inferred standard deviation as a measure of reproducibility (Supplementary Table 7). We explored major contributors to the reproducibility by leave-one-out regression models using putative predictors from Supplementary Tables 14 (target length, Luciferase fold-change, number of associated tags, proportion of wild type in the library, average mutation rate, number of variants per insert, tags per SNV, tags per 1-bp deletion, assigned DNA and RNA counts, number of wild-type tags, number of variants across tags, and average number of variants per tag).

To identify significantly different variants in the siRNA knockdown experiment, 95% confidence intervals were generated from the generalized linear model using the confint.lm method of the stats package in R. Variants with non-overlapping confidence intervals between the experiments TERT-GBM-SiGABPA and TERT-GBM-SiScramble were considered significantly different.

Annotation and downstream analysis

BP positions and SNVs along the targeted regions were annotated based on GRCh37/hg19 coordinates. CADD v1.0, v1.3, and v1.4 were downloaded [http://cadd.gs.washington.edu/download], and phastCons, phyloP, and GERP++ scores were used as provided with the CADD v1.3 annotations. Functional effect scores for FunSeq v2.16 [http://archive.gersteinlab.org/funseq2.1.2/hg19_NCscore_funseq216.tsv.bgz], LINSIGHT [http://compgen.cshl.edu/~yihuang/tracks/LINSIGHT.bw], and ReMM v0.3.1 [https://charite.github.io/software-remm-score.html] were downloaded for the whole genome and regions of interest extracted. GWAVA scores for the unknown region and TSS models were calculated for all positions along the targeted regions using available software [ftp://ftp.sanger.ac.uk/pub/resources/software/gwava/v1.0/]. DeepSEA [http://deepsea.princeton.edu/job/analysis/create/] and fathmm-MKL [http://fathmm.biocompute.org.uk/fathmmMKL.htm] scores were retrieved using the respective online web interfaces.

DeltaSVM scores were computed from the precomputed k-mer weights [http://www.beerlab.org/deltasvm] of the initial deltaSVM publication21. For each variant the average of all possible k-mer scores of the alternative allele is subtracted from the average of all possible k-mer scores of the reference allele. Not available k-mers in the files are treated as zero. Only precomputed k-mer weights of HEK293T, HeLa S3, HepG2, K562, and LNCaP were available. The model DukeDnase was used for experiments in HEK293T (HNF4A, MSMB, MYC rs6983267, and TERT). For cell types with multiple available k-mer models, we selected the best performing model based on our data. This was DHS_H3Kme1 for HepG2 (SORT1, F9, and LDLR) and K562 (PKLR), UwDnase for LNCaP (MYC rs11986220) and HeLa S3 (FOXE1). We applied the HEK293T model to our TERT experimental results for the matching cell line as well as for GBM cells, based on the high correlation observed for these experiments. We obtained higher correlations in GBM cells, probably due to better experimental performance and are using these values in the comparison with other functional scores described above.

We used predicted TFBS available from JASPAR 2018 [http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2018/hg19/JASPAR2018_hg19_all_chr.bed.gz]60. Scores reported for each motif match were divided by the length of the match. These motif scores of all binding site predictions were combined across the 21 genomic regions to identify thresholds for the 90th (38), 75th (32.5), 50th (27.8182), 25th (24.4167), and 10th (21.4) percentile. To identify factors with the highest number of motifs in each region, we identified the five most frequent factors and included additional factors with the same number of motif matches in the region. For visualization, overlapping matches of the same motif were combined and matches on both strands considered only once.

TFBS predictions overlapping respective ChIP-peaks in ENCODE experiments were downloaded from http://compbio.mit.edu/encode-motifs/. TFBSs annotated in the ERB v90 were downloaded from [ftp://ftp.ensembl.org/pub/release-90/regulation/homo_sapiens] and coordinate converted to GRCh37 using the UCSC liftover program.

To test the performance of computational scores in a classification setting, i.e., distinguishing between large effect SNVs and those with no effect, we selected the top 200, 500, and 1000 variants with the largest expression effects across all elements (min. ten tags required; p-value of fit < 10−5) and randomly sampled the same number of variants with an absolute log2 expression effect lower than 0.05 (min. tags ten required), preserving the contribution of each element in both classes. In the top 1000 analysis, the number of positives exceeded the number of negatives and we downsampled the positives, respectively. If a region was studied in multiple experiments, we selected the one with the largest replicate correlation (Supplementary Table 7). We used annotations and scores as described above. In cases where an annotation is based on positions rather than alleles, we assumed the same value for all substitutions at that position. The analysis was repeated for 100 samples of the no-effect SNVs and the area under the receiver operating characteristic as well as the area under the precision-recall curve was computed.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.