Global discovery of lupus genetic risk variant allelic enhancer activity

Genome-wide association studies of Systemic Lupus Erythematosus (SLE) nominate 3073 genetic variants at 91 risk loci. To systematically screen these variants for allelic transcriptional enhancer activity, we construct a massively parallel reporter assay (MPRA) library comprising 12,396 DNA oligonucleotides containing the genomic context around every allele of each SLE variant. Transfection into the Epstein-Barr virus-transformed B cell line GM12878 reveals 482 variants with enhancer activity, with 51 variants showing genotype-dependent (allelic) enhancer activity at 27 risk loci. Comparison of MPRA results in GM12878 and Jurkat T cell lines highlights shared and unique allelic transcriptional regulatory mechanisms at SLE risk loci. In-depth analysis of allelic transcription factor (TF) binding at and around allelic variants identifies one class of TFs whose DNA-binding motif tends to be directly altered by the risk variant and a second class of TFs that bind allelically without direct alteration of their motif by the variant. Collectively, our approach provides a blueprint for the discovery of allelic gene regulation at risk loci for any disease and offers insight into the transcriptional regulatory mechanisms underlying SLE.

S ystemic lupus erythematosus (SLE) is an autoimmune disease that can affect multiple organs, leading to debilitating inflammation and mortality 1 . Up to 150 cases are found per 100,000 individuals, and the limited treatment options contribute to considerable economic and social burden 1,2 . Epidemiological studies have established a role for both genetic and environmental factors in the development of SLE 2 . SLE has a relatively high heritability 3 . The vast majority of patients do not have a single disease-causing mutation (such as mutations in complement protein 1q); instead, genetic risk is accumulated additively through many genetic risk loci with modest effect sizes 4 .
Genome-wide association studies (GWASs) have identified 91 genetic risk loci that increase disease risk of SLE in a largely additive fashion 4 . Each SLE-risk locus is a segment of the genome containing a polymorphic "tag" variant (i.e., the variant with the most significant GWAS p-value) and the genetic variants in linkage disequilibrium with the tag variant. The majority (68%) of the established SLE-risk loci do not contain a disease-associated coding variant that changes amino acid usage 5 . Instead, variants at these loci are found in non-coding regions of the genome such as introns, promoters, enhancers, and other intergenic areas. Enrichment of these variants in enhancers and at transcription factor (TF) binding sites 6,7 implies that transcriptional perturbation may be a key to the development of SLE 8 . However, given the large number of candidate variants identified by GWASs, identification of the particular causal variant(s) remains challenging.
SLE is a complex disease that involves multiple cell types 2 . Previous systematic studies demonstrate that SLE-risk loci are enriched for B cell-specific genes 9 and regulatory regions 10 . Established biological mechanisms further highlight a key role for B cells in SLE-as the autoantibody-secreting cell type, B cells are critical to the pathoetiology of SLE, a disease characterized by autoantibody production 11 . B cells also present self-antigens to T cells in the development of an autoantigen-focused (i.e., "self") inflammatory response 12 . Meanwhile, Epstein-Barr virus (EBV)infected B cells have been implicated in SLE, with patients having a greater number of EBV-infected B cells and a higher viral load than people without SLE 13,14 . In addition, EBV infection is significantly more prevalent in SLE cases than controls 15,16 , and EBV-encoded EBNA2 interactions with the human genome are concentrated at SLE-risk loci in EBV-transformed B cell lines 10 . In vitro, EBV infection can transform B cells into a lymphoblastoid cell line (LCL) 17 . We have recently shown that histone mark and human and viral protein chromatin immunoprecipitation followed by sequencing (ChIP-seq) datasets from EBVtransformed B cell lines are highly and specifically enriched at SLE-risk loci relative to non-EBV-transformed B cell lines 9,10 . Given the above evidence, we chose the EBV-transformed B cell line GM12878 to study the effects of SLE-risk variants at SLE-risk loci.
In this work, we design and apply a massively parallel reporter assay (MPRA) [18][19][20][21][22][23][24][25][26][27] to systematically identify the SLE genetic risk variants that contribute to transcriptional dysregulation in the EBV-transformed B cell line GM12878 ( Fig. 1 and Supplementary  Fig. 1). MPRA extends standard reporter assays, replacing lowthroughput luciferase with high-throughput mRNA expression detection. We use MPRA to simultaneously screen the full set of genome-wide significant SLE-associated genetic variants for effects on gene regulation. Using this experimental approach, we nominate 51 putative causal variants that result in genotypedependent (allelic) transcriptional regulation. Comparison of MPRA results between GM12878 and the Jurkat T cell line reveals shared and cell-type-specific allelic behavior. Integration of these data with TF binding site predictions and functional genomics data reveals two distinct mechanisms whereby TFs bind risk variants in an allelic manner-directly impacted by a given variant (i.e., the variant directly alters the TF's DNA-binding site) or indirectly impacted by the variant (i.e., the variant alters the DNA binding of the TF's physical interaction partner or modulates chromatin accessibility). Collectively, these results provide an important resource for understanding SLE disease risk mechanisms and reveal an important role for groups of TFs in the mediation of allelic enhancer activity at plausibly causal SLE-risk variants in EBV-transformed B cells.

Results
MPRA library design and quality control. We first collected all SLE-associated risk loci reaching genome-wide association significance (p < 5 × 10 −8 ) published through March 2018 (Supplementary Data 1). Studies of all ancestral groups were included, and independent risk loci were defined as loci with lead (tag) variants at r 2 < 0.2. For each of these 91 risk loci, we performed linkage disequilibrium (LD) expansion (r 2 > 0. 8) in each ancestry of the initial genetic association(s), to include all possible diseaserelevant variants (Supplementary Data 2). In total, this procedure identified 3073 genetic variants. All established alleles of these variants were included, with 36 variants having three or more alleles. We also included 20 additional genetic variants from a previously published study 19 as positive and negative controls to assess the library's performance (Supplementary Data 3).
For each variant, we generated a pair of 170 base pair (bp) DNA oligonucleotides (subsequently referred to as "oligos") for each allele, with the variant located in the center and identical flanking genomic sequence across the alleles (Supplementary Data 4). A total of 12,478 oligos (3093 variants with 6239 alleles) were synthesized. For barcoding, a random 20mers were added to each oligo through PCR. Each unique barcode was matched with perfectly synthesized oligos. The number of unique barcodes per oligo had an approximately normal distribution with a median of 729 barcodes per oligo (Supplementary Fig. 2a and Supplementary Data 5). Only oligos with at least 30 unique barcodes were used for downstream analyses. A fragment containing a minimal promoter and an eGFP gene was inserted between the oligo and barcode to generate the MPRA transfection library. We note that the use of a minimal promoter allows us to effectively measure the ability of alleles to enhance, but not reduce, transcriptional activity. Three aliquots of the library were independently transfected into the EBV-transformed B cell line GM12878. We then used nucleic acid capture to enrich for eGFP mRNA and sequenced the barcode region. The normalized barcode ratio between the eGFP mRNA and the plasmid DNA was used to quantify the amount of enhancer activity driven by each oligo (Supplementary Note 1 and Supplementary Fig. 5f, g). This mRNA to DNA ratio measures the enhancing effect of an allele on eGFP expression under the control of a minimal promoter ( Fig. 1 and Supplementary Fig. 1). We observed strong correlation of enhancer activity between experimental replicates (mean pairwise Pearson correlation of 0.99) (Supplementary Fig. 2b-d). Likewise, calibration variants showed high accuracy, with 17 of the 20 variants matching the results of a previous study 19 (87.5% sensitivity and 75% specificity), collectively demonstrating a robust experimental system (Supplementary Data 3).
Hundreds of SLE-risk variants are located in genomic regions with enhancer activity in EBV-transformed B cells. Using the SLE MPRA library, we next identified genetic variants capable of driving enhancer activity in the EBV-transformed B cell line GM12878. An SLE-risk variant was considered a candidate for enhancer activity if an oligo corresponding to any allele had significantly increased transcriptional regulatory activity compared to controls (see "Methods"). Not all statistically significant changes in transcriptional activity are necessarily biologically relevant-a highly consistent, but slight change in expression levels is statistically, but not biologically, meaningful. We therefore considered an oligo to have enhancer activity only when (1) the oligo had statistically significant enhancer activity (p adj < 0.05) and (2) we observed at least a 50% increase in transcriptional activity compared to the corresponding barcode counts in the plasmid control. Based on these criteria, 16% of SLE-risk variants (482 variants, 853 alleles) demonstrated enhancer activity, henceforth referred to as "enhancer variants" (enVars) and "enhancer alleles" (enAlleles), respectively ( Fig. 2a and Supplementary Data 6).
We next explored the potential effects of enVars on gene expression. We connected each enVar to one or more genes using an approach that takes into account chromatin looping interactions, expression quantitative trait loci (eQTLs), and gene proximity (Supplementary Data 2 and 7) (see "Methods"). This approach identified 1006 genes in total, which are enriched for expected SLE-related processes such as the interferon pathway, the antigen processing and presentation pathway, and cytokinerelated pathways (Supplementary Fig. 3 and Supplementary Data 8), providing functional support for the enVars we identified.
Next, we searched for functional genomic features enriched within enVars relative to non-enVars using the RELI algorithm 10 . In brief, RELI estimates the significance of the intersection between an input set of genomic regions (e.g., enVars) and each member of a collection of functional genomics datasets (e.g., ChIP-seq for a particular histone mark or TF). For this analysis, we identified, curated, and systematically processed the 576 GM12878 ChIP-seq datasets available in the NCBI Gene Expression Omnibus (GEO) database (see "Methods"). Using RELI, we observed significant enrichment for overlap between enVars and multiple histone modification marks, including H3K4me3 (5.8-fold, p corrected < 10 −21 ) and H3K27ac (2.0-fold, p corrected < 10 −13 ) ( Fig. 2b and Supplementary Data 9). As expected, we did not identify enrichment for repressive marks such as H3K9me3 or H3K27me3 28 ( Fig. 2b and Supplementary Data 9). Altogether, the genomic features present within enVars confirm that many SLE genetic risk loci likely alter transcriptional regulation in EBV-transformed B cells.
We next asked if the genomic binding sites of particular TFs were enriched within our enVars using RELI and the TF ChIPseq datasets from GM12878. As expected, the enVars are highly enriched for ChIP-seq signal of TFs involved in regulation of the immune response, relative to variants lacking enhancer activity ( Fig. 2c and Supplementary Data 9). In particular, we found significant enrichment for all members of the NFκB TF family: REL/C-Rel (6.4-fold, p corrected < 10 −26 ), NFKB1/p50 (3.0fold, p corrected < 10 −18 ), RELA/p65 (3.1-fold, p corrected < 10 −16 ), RELB (2.7-fold, p corrected < 10 −10 ), and NFKB2/p52 (2.2-fold, p corrected < 10 −7 ). These results are consistent with our previous findings that altered binding of NFκB TFs is likely an important mechanism conferring SLE risk 10 . We also found significant enrichment for other TFs that have been previously implicated in SLE pathogenesis, such as PAX5 29 , MED1 30 , IKZF1 31 , ELF1 32 , and the EBV-encoded EBNA2 transactivator 10 ( Fig. 2c and Supplementary Data 9). As a complementary approach, we next assessed enrichment for TF binding site motifs in the enAllele DNA sequences using HOMER 33 and motifs contained in the Cis-BP database 34 (see "Methods"). This analysis also revealed enrichment of multiple TF families with known roles in SLE, including ETS, NFκB, and IRF 3 ( Fig. 2d and Supplementary Data 10). Many of these same TFs also have enriched ChIP-seq peaks at SLE-risk loci 10 . Collectively, these results indicate that particular TFs tend to not only concentrate at SLE-risk loci 10 , but also concentrate at alleles capable of driving gene expression in EBV-transformed B cells.
MPRA identifies 51 SLE-risk variants with allelic enhancer activity in EBV-transformed B cells. We next used our MPRA library to identify SLE genetic risk variants that drive alleledependent (allelic) enhancer activity. Allelic activity was assessed for each enVar by comparing enhancer activity between each pairs of alleles. We considered a SLE variant allelic if (1) at least one of its alleles is an enAllele; (2) we observed significant genotype-dependent activity using Student's t-test 19,35 (Supplementary Note 2 and Supplementary Fig. 6); and (3) the oligos had more than a 25% change between any pair of alleles. Using these  Data 11). For 31 of these 51 allelic enVars, the risk allele decreased enhancer activity relative to the non-risk allele, which is statistically indistinguishable from the 20 variants with increased risk allele activity (p = 0.1). Three of the allelic enVars can also alter the amino acid sequence of proteins-rs1059702 (IRAK1), rs1804182 (PLAT), and rs3027878 (HCFC1), consistent with previous studies identifying dual-use codons in the human genome 36 . Collectively, these 51 variants represent causal variant candidates for 27 SLE-risk loci (30% of all tested loci) (Supplementary Data 12). For these 27 risk loci, our approach reduced the number of potential causal variants with allelic activity in GM12878 from an average of 84 variants to an average of two variants per risk locus (Fig. 3b). For example, at 17q12 (marked by rs8079075), we reduce the candidate causal variant set from 249 to one, with the rs112569955 "G" risk allele showing a 36% increase in enhancer activity compared to the "A" non-risk allele.
We next used the MARIO pipeline 10 to search for allelic binding events (i.e., allelic imbalance between sequencing read counts) at SLE variants within 1058 LCL ChIP-seq datasets (576 from GM12878). By necessity, this approach is limited to the 47 allelic enVars that are heterozygous in at least one of these cell lines. In total, this procedure identified 11 variants with strong allelic imbalance (MARIO ARS value >0.4) in at least The normalized fold change of MPRA activity relative to plasmid control (X-axis) was calculated using DESeq2 (n = 3 biological replicates). Enhancer alleles (enAlleles) (blue) were identified as those alleles with significant activity relative to control (p adj < 0.05) and at least a 50% increase in activity (see "Methods"). The p-values were generated by two-sided Wald tests with Benjamini-Hochberg multiple testing correction. Full results are provided in Supplementary Data 6. b Enrichment of histone marks in GM12878 cells at enVars compared to non-enVars. p-values were estimated by one-sided z-test with Bonferroni multiple testing correction using RELI (see "Methods"). Full results are provided in Supplementary Data 9. c Enrichment of regulatory protein and transcription factor (TF) binding at enVars compared to non-enVars. p-values were estimated by one-sided z-test with Bonferroni multiple testing correction using RELI (see "Methods"). one ChIP-seq dataset (Supplementary Data 14), revealing groups of TFs and transcriptional regulators that allelically bind SLE-risk variants with genotype-dependent MPRA enhancer activity. For example, the rs3101018 variant, which is associated with SLE 38 and rheumatoid arthritis 39 in Europeans, shows 1.7-fold stronger enhancer activity for the reference/non-risk 'C' allele compared to the non-reference/risk 'T' allele ( Fig. 4a). These results are consistent with a previously established eQTL obtained from GTEx 40 , which demonstrates higher Complement C4A (C4A) expression in EBV-transformed B cell lines for the rs3101018 'C' allele than the 'T' allele ( Fig. 4b). Our MARIO allelic ChIP-seq analysis reveals 15 regulatory proteins that prefer the 'C' allele and 2 that prefer the 'T' allele ( Fig. 4c and Supplementary Fig. 4a). Among these, particularly robust signal is obtained for ATF7, with one experimental replicate in GM12878 displaying 77 vs. 18 reads ('C' vs. 'T') and another showing 66 vs. 23 reads ('C' vs. 'T') (Supplementary Data 14). Moreover, CREB1 and CREM strongly favor the 'C' allele as well (Supplementary Data 14). In agreement with these data, computational analysis of the DNA sequences surrounding this variant predicts that ATF7, CREB1, and CREM will all bind more strongly to the 'C' than the 'T' allele ( Fig. 4d). Intriguingly, ten additional proteins (FOXK2, PKNOX1, ARID3A, ZBTB40, ZNF217, ARNT, ELF1, IKZF2, MEF2B, and FOXM1) also bind allelically and have known DNA-binding motifs 41 , but none of them have binding sites altered by the variant. Further, we do not observe allelic chromatin accessibility in an available GM12878 ATAC-seq dataset (9 vs. 7 unique reads). Together, these results reveal a potentially causative SLE regulatory mechanism involving weaker direct binding of ATF7/ CREB1/CREM to the 'T' risk allele, altering the recruitment of additional proteins to the locus and lowering the expression of C4A.
We observe a similar phenomenon for the rs2069235 variant, which is associated with SLE in Asian ancestries 42 and rheumatoid arthritis in Europeans 43 . rs2069235 displays much stronger enhancer activity for the 'A' (non-reference/risk) allele compared to the 'G' (reference/non-risk) allele (Fig. 4e), consistent with the established synaptogyrin 1 (SYNGR1) eQTL in EBV-transformed B cell lines 40 (Fig. 4f). Inspection of our allelic ChIP-seq data reveals 14 proteins preferentially binding the 'A' allele, with none preferring the 'G' allele ( Fig. 4g and Supplementary Fig. 4b). Among these 14 proteins, only ELF1 has its binding site directly altered by the variant (Fig. 4h). Strikingly, 55 of the 78 available H3K27ac datasets are allelic at this variant, with all 55 preferring the 'A' allele. Likewise, 24 of 46 H3K4me1 datasets are allelic, with all of them also preferring the 'A' allele. Both of these histone marks are indicative of active chromatin 28 . Only a single histone mark dataset prefers the 'G' allele-the H3K27me3 mark, which is indicative of silenced chromatin 28 ( Fig. 4g and Supplementary Fig. 4b). Together, these data are consistent with a potentially causative SLE molecular mechanism involving an allele-dependent enhancer consisting of stronger direct binding of ELF1 to the 'A' risk allele, along with indirectly altered binding of multiple additional TFs to this locus.
Genotype-dependent binding to SLE variants with allelic enhancer activity by variant overlapping and variant adjacent TFs. As illustrated by the above examples, a particular TF can be involved in allelic mechanisms that are either directly impacted by a given variant (i.e., the variant directly alters the TF's DNAbinding site) or indirectly impacted by the variant (i.e., the variant alters the DNA binding of the TF's physical interaction partner, modulates chromatin accessibility, or affects another mechanism). At a given locus, we designate such TFs as variant overlapping and variant adjacent TFs, respectively (Fig. 5a). We next sought to discover such TFs at the 51 allelic enVars. At each allelic enVar locus, we identified variant overlapping TFs as those TFs predicted to have strong binding to one allele and weak binding to the other allele. Likewise, we identified variant adjacent TFs as those TFs with proximal strong predicted binding sites that do not directly overlap the variant (see "Methods"). We then searched for particular TFs that tend to act as variant overlapping TFs or as variant adjacent TFs at the 51 allelic eVars using a proportion test (see "Methods") and confirmed that their binding site locations are distributed relative to the variant as expected  . 5b). Consistent with our results at the C4A and SYNGR1 loci, variant overlapping TFs include members of the ETS (e.g., ELF1) and ATF-like (e.g., ATF7) families, along with other TFs whose genetic loci are associated with SLE, including IRF5 44 ( Fig. 5c and Supplementary Data 15). Variant adjacent TFs represent a distinct class, but also include several TFs with SLE genetic associations, including NFκB 45 , the Ikaros (IKZF) family 46 , and HMGA family members 47 (Fig. 5d and Supplementary Data 15). Collectively, these analyses reveal two distinct classes of TFs at a given SLE-associated locus that both likely play key roles in SLE mechanisms, along with particular TFs that tend to participate in one class or the other.
Allelic transcription regulatory mechanisms shared and unique across cell types. To explore the cell-type specificity of allelic enhancer activity, we transfected our SLE MPRA library into Jurkat cells, a T cell line, as T cells are another important cell type in SLE 2 . We identified 92 SLE-risk variants as allelic enVars in Jurkat cells, 25% of which were also found in GM12878 (Supplementary Fig. 5a, b and Supplementary Data 11). We then repeated the experiment in Jurkat cells stimulated with the inflammatory cytokine TNFα, a key cytokine in SLE development 48 , to identify stimulation-dependent allelic enVars. This resulted in the identification of 102 allelic enVars, 28 of which were specific to the stimulated Jurkat cells ( Supplementary  Fig. 5c, d and Supplementary Data 11). Altogether, our study identified a total of 145 allelic enVars across 50 independent SLErisk loci (Supplementary Fig. 5e). These results highlight allelic transcriptional regulatory mechanisms that are both cell-type and inflammatory signaling-dependent.
In summary, through the application of an allelic MPRA library to the EBV-transformed B cell line GM12878, we identified global transcriptional enhancer activity at 16% of SLE-associated genetic variants (enVars), with particular transcriptional regulatory proteins concentrated at these genomic locations. We further identified 51 SLE-risk variants with allelic enhancer activity (allelic enVars) that we now nominate as plausibly causal by acting through genotype-dependent changes to enhancer activity in GM12878. Upon comparison to allelic enhancer activity in the Jurkat T cell line, we identified shared and unique allelic transcriptional regulatory mechanisms at SLErisk loci. Using experimental TF ChIP-seq data and TF binding site motif scanning, we propose a model where the collective action of the genotype-dependent binding of particular variant overlapping and variant adjacent TFs leads to genotypedependent transcriptional activity at SLE-risk loci.

Discussion
Genome-wide association studies identify genetic loci with statistical disease associations. However, each risk locus often contains many plausibly causal variants due to linkage disequilibrium. This study is the first direct genome-wide measurement of enhancer activity at the~3000 known SLE genetic risk variants in any context. Unbiased experimental approaches such as MPRA are vital for resolving causal variants and their molecular mechanisms of action.
Our results indicate that 16% of the SLE-risk variants examined in this study have enhancer activity in the EBV-transformed B cell line GM12878. Furthermore, 51 of these enhancer variants at 27 loci have allelic enhancer activity. These findings are consistent with the theory that a large proportion of the genetic risk of SLE is mediated through transcriptional perturbation of critical B cell genes. Importantly, SLE-risk loci exhibit cell-type and inflammatory signaling-dependent allelic enhancer activity, with only 25% of allelic enVars shared between GM12878 and Jurkat cells. These results highlight both shared and unique allelic transcriptional regulatory mechanisms for SLE risk, which underlines the importance of the cell-type and cell-state in which the MPRA is performed.
In this study, we used the EBV-transformed B cell line GM12878 as a model for exploring the effects of SLE-risk variants. Previous studies have shown that B cells play a critical role in SLE development as immune cells that secrete autoantibodies driving etiology 11 . The relationship between EBV and SLE is widely appreciated. For example, EBV-infected B cells are more prevalent in SLE patients than in healthy people 13,14 and patients with SLE have a higher EBV viral load and infection rate relative to controls 15,16 . In addition, the EBV transcriptional regulator EBNA2 occupies SLE-risk loci in a genotype-dependent manner 10 . In vivo, EBV infection can convert primary B cells to activated lymphoblasts 49,50 . EBV will eventually enter latency in resting memory B cells and establish lifelong infection 51 . In vitro, EBV infection transforms B cells into immortalized lymphoblastoid cell lines (LCLs) 17 . While we chose the EBV-transformed B cell line GM12878 as the primary disease model for our study, there are limitations to the use of EBV-transformed B cell lines. For example, GM12878 is an immortalized cell line with differences in DNA methylation and gene expression levels from resting and activated B cells 52,53

. A recent study also suggests that EBV infection causes B cells to undergo a germinal center-like differentiation into cells partially resembling plasmablasts and early plasma cells 54 , which is only a transient stage in vivo.
Altogether, these limitations need to be considered when interpreting allelic transcriptional regulation of SLE-risk variants in EBV-transformed B cell lines.
A critical finding of this study is that SLE-risk variants with allelic enhancer activity likely alter the binding of many TFs. Although variants can directly affect the binding of variant overlapping TFs via disruption of a DNA-binding site, they can also simultaneously alter the binding of other variant adjacent TFs, presumably via genomic mechanisms such as altered chromatin accessibility, altered histone marks, indirect TF recruitment through physical interactions, changes in DNA shape, or changes to protein interaction partner DNA binding. This finding corroborates the previously proposed genetic variation-mediation model of motif-dependent and motif-independent TF binding [55][56][57] . In general, a given TF can be variant overlapping at one locus and variant adjacent at another, as exemplified by ATF7 (Fig. 4c,  g). Nonetheless, particular TF families tend to act as variant overlapping TFs at SLE loci (such as Ets, E-box, and ATF), whereas others tend to act as variant adjacent TFs (such as HMGA, Hox, and NFκB). Notably, many of these variant overlapping and variant adjacent TFs are themselves encoded by genetic risk loci associated with SLE (e.g., IRF5 44 , NFκB 45 , and ETS1 58 ), suggesting that there are multiple means through which a particular TF can contribute to disease-based genetic mechanisms. For example, IRF5 targets might be mis-regulated in an SLE patient due to genetic associations in the promoter of IRF5 that result in altered IRF5 protein levels 44 , or by genetic variants located within or adjacent to IRF5 binding sites at other genomic loci. It is currently unknown if these TF attributes are shared with other human diseases.
This study reveals possible causal genetic mechanisms involving altered binding of particular TFs at two important SLE-risk loci. C4A is a component of the inflammatory complement pathway that is critical for the appropriate clearance of apoptotic cells 59 . People without C4A due to rare, protein-changing mutations are at a greatly increased risk for autoimmune diseases, including type I diabetes and SLE 60 . Further, the risk of developing SLE is 2.62 times higher in subjects with low total C4A 59 . Consistent with this observation, the SLE-risk allele of rs3101018 at this genetic locus identified in our MPRA is associated with lower C4A expression (Fig. 4a). Moreover, this variant is an eQTL for C4A in EBV-transformed B cell lines and whole blood cells, with the risk allele displaying lower C4A expression 61 ( Fig. 4b and Supplementary Data 7). rs3101018 is located in the Human Leukocyte Antigen (HLA) region of the genome. While many genetic variants at this locus alter amino acid usage in major histocompatibility complex molecules and affect antigen presentation, non-coding genetic variants across the HLA region have also been demonstrated to affect gene expression independently of HLA-type [62][63][64] . In addition, the SLE-risk locus encoding SYNGR1 was recently identified in a high-density genotyping study of subjects with Asian ancestry 42 and also increases disease risk for schizophrenia 65 , primary biliary cirrhosis 66 , and rheumatoid arthritis 67 . SYNGR1 is an integral membrane protein that is most robustly expressed in neurons of the central nervous system; however, there is measurable transcription and translation of SYNGR1 in other tissues, including developing B cells 68 . eQTL data from EBV-transformed B cell lines and whole blood cells 61 (Fig. 4f and Supplementary Data 7) further support our MPRA-based findings of SLE-risk genotype-dependent enhancer activity and gene expression at this locus. Altogether, the results of our MPRA study provide mechanistic insight into these variants through the identification of allelic enVars that facilitate SLE-risk genotype-dependent gene expression.
GWAS provides important discernment of the genetic origins of disease. In conjunction with other genome-scale assays such as ATAC-seq, ChIP-seq, and HiChIP-seq, MPRA reveals likely causal variants and genes, and enables the assembly of causal mechanisms affecting gene expression. In this study, we used MPRA to uncover specific genetic variants within the risk haplotypes of a complex disease in specific cell types. Our integrative analyses reveal specific molecular mechanisms underlying genotype-dependent transcriptional regulation and SLE disease risk. We conclude that the MPRA is a robust tool for the nomination of causal genetic risk variants for any phenotype or disease with risk loci that act through genotype-dependent gene regulatory mechanisms, with this study providing a blueprint for dissecting the genetic etiology of many complex human diseases.
For single nucleotide polymorphisms, we pulled 170 base pairs (bps) of hg19flanking DNA sequences for every allele, with the variant located in the center (84 bps upstream and 85 bps downstream of the variant). For the other types of variants (indels), we designed the flanking sequences to ensure that the longest allele has 170 bps. Adapters (15 bps) were added to each sequence at either end (5′-ACTGGCCGCTTGACG -[170 bp oligo] -CACTGCGGCTCCTGC-3′) to make a 200 bp DNA sequence (Supplementary Data 4). For all resulting sequences, we created a forward and reverse complement sequence to compensate for possible DNA synthesis errors. A total of 12,478 oligos (3093 variants, 6239 alleles) were obtained from Twist Bioscience.
Library assembly. For assembly of the MPRA library, we followed the procedure described by Tewhey et al. 19 with minor modifications. In brief, we first created the empty vector pGL4.23ΔxbaΔluc from pGL4.23[luc2/minP] using primer Q5_deletion_rev and Q5_deletion_fwd following the manufacturer's instruction of the Q5 Site-Directed Mutagenesis Kit. Then, 20 bps barcodes were added to the synthesized oligos through 24X PCR with 50 μL system, each containing 1.86 ng oligo, 25 μL NEBNext® Ultra™ II Q5® Master Mix, 1 μM MPRA_v3_F, and MPRA_v3_201_R. PCR was performed under the following conditions: 98°C for 2 min, 12 cycles of (98°C for 10 s, 60°C for 15 s, 72°C for 45 s), 72°C for 5 min. Amplified product was purified and cloned into SfiI digested pGL4.23ΔxbaΔluc by Gibson assembly at 50°C for 1 h. The assembled backbone library was purified and then transformed into Escherichia coli (E. coli) through electroporation (2 kV, 200 ohm, 25 μF). Electroporated E. coli was expanded in 200 mL of LB Broth buffer supplemented with 100 μg/mL of carbenicillin at 37°C for 12 to 16 h. Plasmid was then extracted using the QIAGEN Plasmid Maxi Kit.
A miniP + eGFP fragment was amplified from pGL4.23[eGFP/miniP] through 8X PCR with 50 μL system, each containing 1 ng plasmid, 25 μL NEBNext® Ultra™ II Q5® Master Mix, 0.5 μM 200-MPRA_v3_GFP_Fusion_v2_F, and 201-MPRA_v3_GFP_Fusion_v2_R. PCR was performed under the following conditions: 98°C for 2 min, 20 cycles of (98°C for 10 s, 60°C for 15 s, 72°C for 45 s), 72°C for 5 min. The amplified product was purified and then inserted into AsiSI digested backbone library through Gibson assembly at 50°C for 1.5 h to create the transfection library. The resulting library was re-digested by RecBCD and AsiSI, purified, and then transformed into E. coli through electroporation (2 kV, 200 ohm, 25 μF). Transformed E. coli was cultured in 5 L of LB Broth buffer supplemented with 100 μg/mL of carbenicillin at 37°C for 12-16 h. The plasmid was then extracted using the QIAGEN Endo-free Plasmid Giga Kit.
Sequencing library for oligo and barcode association. The oligo and barcode regions were amplified from the backbone library through 4X PCR with a 100 μL system containing 200 ng plasmid, 50 μL NEBNext® Ultra™ II Q5® Master Mix, 0.5 μM TruSeq_Universal_Adapter_P5, and MPRA_v3_TruSeq_Amp2Sa_F_P7. PCR was performed under the following conditions: 95°C for 20 s, 6 cycles of (95°C for 20 s, 62°C for 15 s, 72°C for 30 s), 72°C for 2 min. The product was then purified, and indices were added through a 100 μl system containing all purified product, 50 μl NEBNext® Ultra™ II Q5® Master Mix, 0.5 μM TruSeq_Universal_Adapter_P5, and index primer. PCR was performed as above, except for only five cycles. Samples were purified, molar pooled, and sequenced using 2 × 125 bp on Illumina NextSeq 500.
Transfection. The GM12878 cell line was grown in RPMI medium supplemented with 10% FBS, 100 units/mL of penicillin, and 100 µg/mL of streptomycin. Cells were seeded at a density of 5 × 10 5 cells/mL the day before transfection. For triplicate transfections, we collected a total of 5 × 10 7 cells per replicate. Cells were then suspended with 50 μg transfection library plasmid in 400 μL Buffer R. Electroporation was performed with the Neon transfection system in 100 μl tips with 3 pulses of 1200 V, 20 ms each. After transfection, cells were recovered in 50 mL prewarmed RPMI medium supplemented only with 10% FBS for 24 h. Cells were then collected for preparation of the sequencing library for barcode counting.
The Jurkat cell line was grown in RPMI medium supplemented with 10% FBS, 100 units/mL of penicillin, and 100 µg/mL of streptomycin. Cells were seeded at a density of 5 × 10 5 cells/mL the day before transfection. For each experimental group, we collected a total of 5 × 10 7 cells per replicate for 5 replicates. Cells were then resuspended with 50 μg transfection library plasmid in 400 μL Buffer R. Electroporation was performed with the Neon transfection system in 100 μl tips with 3 pulses of 1350 V, 10 ms each. After transfection, cells were recovered in 50 mL pre-warmed RPMI medium supplemented only with 10% FBS for 24 h. After recovery, cells were supplemented with or without 100 ng/ml TNFα for 24 h. Cells were then collected for preparation of the sequencing library for barcode counting.
Sequencing library for barcode counting. For samples from GM12878 cells, total RNA of transfected cells was extracted by the RNeasy Midi Kit following the manufacturer's instruction. Extracted RNA was subjected to DNase treatment in a 375 μL system with 2.5 μL Turbo DNase and 37.5 μL Turbo DNase Buffer at 37°C for 1 h. 3.75 μL 10% SDS and 37.5 μL 0.5 M EDTA were added to stop DNase with 5 min of incubation at 75°C. The whole volume was used for eGFP probe hybridization in an 1800 μL system, with 450 μl 20X SSC buffer, 900 μL Formamide and 1 μL of each 100 μM Biotin-labeled GFP probe One to Three. The probe hybridization was performed through incubation at 65°C for 2.5 h. 200 μL Dynabeads™ MyOne™ Streptavidin C1 was prepared according to the manufacturer's instruction. The beads were suspended in 250 μL 20X SSC Buffer and incubated with the above probe hybridization reaction at room temperature for 15 min. Beads were then collected on a magnet and washed with 1X SSC Buffer once, and 0.1X SSC Buffer twice. eGFP mRNA was eluted first through adding 12.5 μL ddH 2 O, heating at 70°C for 2 min and collecting on a magnet, then adding another 12.5 μL ddH 2 O, heating at 80°C for 2 min and collecting on a magnet. All collected elution was performed with another DNase treatment in a 30 μL system containing 0.5 μL Turbo DNase and 3 μL Turbo DNase Buffer at 37°C for 1 h. 0.5 μL 10% SDS was added to halt DNase reaction. Eluted mRNA was purified through RNA Clean SPRI Beads. mRNA was reverse transcribed to cDNA using SuperScript™ IV First-Strand Synthesis System with gene specific primer MPRA_v3_Amp2Sc_R, following the manufacturer's instruction. cDNA and plasmid control were then used for building sequencing libraries following the Tag-seq Library Construction section in the paper of Tewhey et al. 19 . In brief, 1 μL of cDNA and plasmid control samples were used to estimate the relative concentration of eGFP in the 10 μL system containing 5 μL NEBNext® Ultra™ II Q5® Master Mix, 0.6 μL SYBR green I diluted 1:1000 (Life Technologies, S7563), and 0.5 μM TruSeq_Universal_A-dapter_P5 and MPRA_V3_Illumina_GFP_F. PCR was performed under the following conditions: 95°C for 20 s, 40 cycles of (95°C for 20 s, 65°C for 20 s, 72°C for 30 s), 72°C for 2 min. According to the cycle threshold, all cDNA and plasmid control samples were diluted to match the sample with the lowest concentration. A total of two PCRs were needed for building the sequencing library. The first PCR was performed with 10 μL of normalized samples in the 50 μL system containing 25 μL NEBNext® Ultra™ II Q5® Master Mix, 0.5 μM TruSeq_Universal_A-dapter_P5, and MPRA_V3_Illumina_GFP_F. PCR was performed under the following conditions: 95°C for 20 s, corresponding cycles of (95°C for 20 s, 65°C for 20 s, 72°C for 30 s), 72°C for 2 min. The product was then purified, and indices were added through a 100 μl system containing all purified product, 50 μl NEB-Next® Ultra™ II Q5® Master Mix, 0.5 μM TruSeq_Universal_Adapter_P5, and index primer. PCR was performed as above, except for only 6 cycles. Samples were purified, molar pooled, and sequenced using 1 × 75 bp on Illumina NextSeq 500.
For samples from Jurkat cells, total DNA and RNA of transfected cells were extracted by the Qiagen ALLPrep DNA/RNA Mini Kit following the manufacturer's instruction 23 . Extracted RNA was processed the same as above to obtain cDNA. cDNA, extracted DNA, and plasmid control were then used for building sequencing libraries with the same protocol described above. Samples were purified, molar pooled, and sequenced using 1 × 100 bp on Illumina NovaSeq 6000.
All primers used in this study are provided in Supplementary Table 1.
Oligo and barcode association. Paired-end, 125 bp reads were first quality filtered using Trimmomatic (v0.38) 83  Only reads with Levenshtein distance of 4 or less within the constant region and perfect matches to the two bases directly adjacent to the barcode were kept. Barcodes were then associated with the retained reads using the read ID. Only barcodes that met our quality threshold requirements described above in the Methods section "Oligo and barcode association" were used for downstream analysis.
Enhancer variant (EnVar) identification. We followed the procedures described in the "Identification of Regulatory Oligos" section of Tewhey et al. 19 with minor modifications. In brief, oligos (alleles) with 30 or more unique barcodes from the plasmid control were included for analysis. All barcodes were summarized at the oligo level. Barcode count totals for each oligo, including all SLE variants and the 20 control variants, were passed into DESeq2 (v1.28.1) 85 in R (v3.5.3) to estimate the fold change and significance between plasmid controls (Supplementary Note 1 and Supplementary Fig. 5f, g) and the experimental replicates. A Benjamini-Hochberg FDR adjusted p-value of <0.05 was required for significance. Only significant alleles with greater than or equal to a 1.5x fold change were identified as enhancer alleles (enAlleles). A variant was identified as an enhancer variant (enVar) if any allele of this variant was an enAllele. Results for the 20 control variants were compared to data from Tewhey et al. 19 to estimate accuracy, sensitivity, and specificity.
Allelic enVar identification. Only enVars were considered for allelic analysis. The barcode counts from every allele of each enVar were used for calculating p-values by comparing the log2 ratios of the non-reference allele vs the reference allele, normalized by plasmid controls, using Student's t-test 19,35 . p-values were adjusted with the Benjamini-Hochberg FDR-based procedure. A corrected p-value of <0.05 was required for significance. Only significant alleles with 25%-fold changes or greater were identified as allelic enVars (Supplementary Note 2 and Supplementary  Fig. 6). We have created an R package (mpraprofiler) for performing this analysis, which is available on the Weirauch lab GitHub page (https://github.com/ WeirauchLab/mpraprofiler).
Gene annotation. We annotated each SLE genetic variant with its nearest gene using the NCBI RefSeq table 86 downloaded from the UCSC table browser 82 . enVars were annotated using a combination of DNA looping interactions (GM12878 Capture Hi-C data 87,88 ) and eQTL data obtained from the eQTL Catalog, a resource that contains quality-controlled, uniformly re-computed eQTLs from 19 eQTL publications 61,89-108 , EBV-transformed B cell lines (GTEx Analysis V7 (dbGaP Accession phs000424.v7.p2)) 40 and other individual studies [109][110][111][112] . For all variants, the target genes were annotated (Supplementary Data 2) using the union of promoter interacting genes and eQTL genes from B cells with and without EBV transformation, when available. Otherwise, target genes were annotated as the nearest gene. Allelic enVar gene targets were classified into four tiers: a Tier (1) variant is both an eQTL and also loops to the promoter of the same gene; a Tier (2) variant has an eQTL for at least one gene; a Tier (3) variant only loops to the promoter of at least one gene; a Tier (4) variant is neither an eQTL nor loops to the promoter of any gene (Supplementary Data 12).
TF binding site motif enrichment analysis. To identify specific TFs whose binding might contribute to the enhancer activity observed in our MPRA experiments, we performed HOMER (v4.9) 33 Supplementary Fig. 3).
Identification and processing of publicly available LCL ChIP-seq data. 1058 ChIP-seq datasets were obtained from the Gene Expression Omnibus (GEO) 115 44 . ENCODE blacklist regions 120 were removed from the peak sets using the hg19-blacklist.bed.gz file available at https://github.com/Boyle-Lab/Blacklist/tree/ master/lists/Blacklist_v1. ChIP-seq datasets GSM1666207, GSM2748907, and GSM1599157 were removed due to the low number of cells used in the experiments.
Functional genomics dataset enrichment analysis with RELI. We used the RELI (v0.9) 10 algorithm to identify genomic features (TF binding events, histone marks, etc.) that coincide with enVars. As input, RELI takes the genomic coordinates of enVars. RELI then systematically intersects these coordinates with one of the GM12878 ChIP-seq datasets, and the number of input regions overlapping the peaks of this dataset (by at least one base) is counted. Next, a p-value describing the significance of this overlap is estimated using a simulation-based procedure. To this end, a 'negative set' is created for comparison to the input set, which in this study contains the set of non-enVars (i.e., variants with no allele having an adjusted pvalue of <0.05 and more than 10%-fold change in the DESeq2 result). A distribution of expected overlap values is then created from 2000 iterations of randomly sampling from the negative set, each time choosing a set of negative examples that match the input set in terms of the total number of genomic loci. The distribution of the expected overlap values from the randomized data resembles a normal distribution and can thus be used to generate a Z-score and corresponding p-value estimating the significance of the observed number of input regions that overlap each ChIP-seq dataset. We performed similar RELI analysis for allelic enVars. As input, we used the allelic enVar sites. For the 'negative set', we used the set of common SNPs taken from the dbSNP142 database downloaded from the UCSC table browser 82 .
Identification of allelic ChIP-seq reads using MARIO. To identify possible mechanisms underlying our allelic enVars, we applied our MARIO(v3.93) 10 method to the LCL ChIP-seq dataset collection described above. In brief, MARIO identifies common genetic variants that are (1) heterozygous in the assayed cell line and (2) located within a peak in a given ChIP-seq dataset. It then examines the sequencing reads that map to each heterozygote in each peak for imbalance between the two alleles. Results are combined across experimental replicates to produce a robust Allelic Reproducibility Score (ARS). Results with MARIO ARS value >0.4 that also pass the following three post-processing filters were considered allelic. (1) The variant must be significantly allelic for a given protein/histone mark (ARS > 0.4) in at least 50% of the datasets in which that variant was heterozygous; (2) The same allele must be significantly preferred (ARS > 0.4) in at least 75% of the datasets where that variant shows significant allelic behavior; and (3) The replicates of a given experiment must all prefer the same strong allele. These post-processing filters were applied to remove results with inconsistent allelic imbalance, extending the procedures of our previous study 10 .
Identification of variant overlapping and variant adjacent TFs. Variant overlapping TFs were identified using an algorithm that compares predicted TF binding motif scores between the different alleles of each allelic enVar. First, we padded each allele of a given allelic enVar with 25 bps of upstream and downstream DNA sequence (a sufficient length to account for any known human TF binding sites 121 ). The algorithm consists of two major components: (1) individually scoring the two alleles of a given variant with a given TF model; and (2) quantifying the difference in the binding intensity between these two alleles. DNA sequences are scored using the large collection of human TF position weight matrix (PWM) models contained in the Cis-BP database 34 and the log-likelihood PWM scoring system 122 . Since loglikelihood score distributions vary substantially (depending on the information content of a given motif), we employ a simple scaled scoring system that maps a given log-likelihood score to the percentage of the maximum achievable loglikelihood score of the given motif-we refer to this value as the "relative PWM score". We identify binding site altering events (i.e., "creating" or "breaking" a predicted binding site for a given TF motif) as cases where one allele has a relative PWM score of 70% or higher, and the other allele has a score of <40%. For a given variant, any TF with allelic ChIP-seq sequencing reads (see above) and a binding site altering event for any of its motifs was deemed a variant overlapping TF. Any TF with allelic ChIP-seq sequencing reads and a lack of a binding site altering event for any of its motifs was deemed a variant adjacent TF.
We next sought to identify particular TFs that tend to be variant overlapping TFs at SLE allelic enVars. To this end, we calculated the fraction of times each TF motif has a binding site altering event (as defined above) at SLE allelic enVars. As background, we calculated the fraction of times each TF motif has a binding site altering event at non-allelic enVars. The significance of the difference between these two fractions was then calculated using a proportions test. Results are provided in Supplementary Data 15.
We used a similar procedure to identify particular TFs that tend to be variant adjacent TFs at SLE allelic enVars. We performed HOMER(v4.9) motif enrichment analysis using the full 170 bp allelic enVar DNA sequences as input. The dinucleotide scrambled version of input sequences were used as background. The fractions of motif "hits" obtained in the foreground vs. background set were then compared, and significance was again calculated using a proportions test. Results are provided in Supplementary Data 15.