A phenotypic and genomics approach in a multi-ethnic cohort to subtype systemic lupus erythematosus

Systemic lupus erythematous (SLE) is a heterogeneous autoimmune disease in which outcomes vary among different racial groups. Here, we aim to identify SLE subgroups within a multiethnic cohort using an unsupervised clustering approach based on the American College of Rheumatology (ACR) classification criteria. We identify three patient clusters that vary according to disease severity. Methylation association analysis identifies a set of 256 differentially methylated CpGs across clusters, including 101 CpGs in genes in the Type I Interferon pathway, and we validate these associations in an external cohort. A cis-methylation quantitative trait loci analysis identifies 744 significant CpG-SNP pairs. The methylation signature is enriched for ethnic-associated CpGs suggesting that genetic and non-genetic factors may drive outcomes and ethnic-associated methylation differences. Our computational approach highlights molecular differences associated with clusters rather than single outcome measures. This work demonstrates the utility of applying integrative methods to address clinical heterogeneity in multifactorial multi-ethnic disease settings.

S ystemic lupus erythematosus (SLE) is a multifactorial autoimmune disease with heterogeneous manifestations that encompasses a wide range of disease severity. Complex diseases such as SLE involve a dynamic interplay between molecular processes, many of which are unknown. Long-term outcomes for individual patients are therefore difficult to predict, as is the scope of organ system involvement. While some patients have aggressive disease progression, others do not accrue significant damage within 5 years of SLE diagnosis [1][2][3][4] . We know little about why an affected individual might develop a particular SLE phenotype. Furthermore, a patient can be classified as having SLE if she or he fulfills any four of the 11 American College of Rheumatology (ACR) classification criteria 5 , with resultant extensive disease heterogeneity. In recent years, significant effort has been applied to better sub-classify SLE, not only to predict disease outcomes but also inform specific mechanistic pathways that could be strategically targeted according to subtype [6][7][8] .
SLE disease progression and outcomes vary significantly among different racial/ethnic groups [9][10][11] . Patients from non-European populations, such as Hispanics, African Americans, and Asians, develop SLE at a younger age and experience worse disease manifestations than patients of European descent. Even after decades of basic research and public health initiatives these health disparities remain relatively unchanged. Factors that underlie these disparities are elusive and likely derive in part from complex interactions between genetic and environmental factors, which may in part originate from social inequities. However, the majority of molecular studies to date have been carried out in European populations.
While numerous previous studies focused on either the genetics or epigenetics of SLE, a multi-omics approach coupled with deep clinical phenotyping may better elucidate the molecular basis of disease heterogeneity. By integrating different layers of molecular and clinical data, several studies have provided insight into mechanisms of complex disease such as Alzheimer's disease [23][24][25] , inflammatory bowel disease 26 , cancer 27,28 , and rheumatoid arthritis [29][30][31][32] . In this work, we initially apply unsupervised clustering of ACR classification criteria for SLE to define disease subtypes among a diverse multi-ethnic cohort of SLE patients. We then develop and apply an integrative approach leveraging human genetics and DNA methylation data to elucidate differences between these disease subtypes. We find 256 differentially methylated CpGs that varied significantly according to subtype, of which 61 were under proximal genetic control (Fig. 1).

Results
Clinical clustering identifies distinct subtypes of SLE. Clinical characteristics of the 333 patients examined from the UCSF California Lupus Epidemiology Study (CLUES) cohort are presented in Supplementary Table 1. We first stratified SLE patients into clusters based on ACR classification criteria and sub criteria using an unsupervised clustering approach. Briefly, we first applied multiple correspondence analysis (MCA) and then performed K-means clustering on the top two components chosen by a bootstrap resampling strategy (see Methods). Three clusters were identified. The clusters are labelled M (mild), S1 (severe 1) and S2 (severe 2; Fig. 2a, b). Cluster M was comprised of 101 patients (30.3%) and was characterized by a high prevalence of malar rash, photosensitivity, arthritis, and serositis, but lower prevalence of hematologic manifestations, lupus nephritis, and serologic manifestations (p < 0.001). Cluster S1 was comprised of 154 patients (46.2%) and was characterized by higher prevalence of lupus nephritis and anti-dsDNA autoantibody positivity (p < 0.001). Cluster S2 was comprised of 78 patients (28.8%) and was the most severe subtype, with a high prevalence of lupus nephritis, autoantibody production (anti-dsDNA, anti-Sm, anti-RNP and antiphospholipid antibodies), and internal organ manifestations, such as hematologic manifestations (Fisher exact test p < 0.001; Table 1). The Lupus Severity Index 33 , a validated scoring system based on the ACR classification criteria, was also significantly different between the three clusters (ANOVA test p = 2 × 10 −21 ), with cluster M the least severe, and cluster S2 the most severe (Fig. 2c).
With respect to ethnicity, we found a significant increase in the proportion of White patients in cluster M compared to clusters S1 and S2 (Kruskall-Wallis p = 4.76 × 10 −4 ), and a higher proportion of Asian patients in clusters S1 and S2 compared to cluster M (Kruskall-Wallis p = 1.4 × 10 −3 ; Table 1).
At the time of blood sampling, patients in the more severe clusters (S1 and S2) had lower levels of complement C3 (ANOVA p = 1.47 × 10 −3 ), were more likely to be RNP positive (Fisher exact test p = 3.66 × 10 −5 ) and were more likely to be receiving mycophenolate (Fisher exact test p = 2.48 × 10 −3 ) and prednisone (Fisher exact test p = 5.07 × 10 −2 ) than patients in cluster M (Supplementary Table 2). We also examined complete blood counts and proportions taken at time of blood draw from all patients and found a statistically significant decrease in leukocytes (ANOVA p = 3.31 × 10 −3 ), eosinophils (ANOVA p = 3.39 × 10 −2 ) and lymphocytes (ANOVA p = 2.84 × 10 −2 ) among the three clusters (Supplementary Table 2). This could represent a marker of disease severity or a consequence of higher immunosuppressant drug use at the time of blood draw for patients in the more severe disease clusters.
In a comparison of socioeconomic variables across clinical clusters, we did not observe a statistically significant difference in average education level or income between the three clusters (Supplementary Table 3).
Distinct methylation patterns distinguish clinical clusters. The clusters identified above, characterized by multiple comorbid phenotypes, represent a clinically relevant framework to stratify SLE patients. Using this stratification, we aimed to identify differentially methylated CpG sites associated with these clinical clusters. Using an ANOVA model, we identified 256 CpG sites in 124 genes that were differentially methylated according to clinical cluster (FDR < 0.1) after adjusting for sex, genetic ancestry principal components, cell composition, medications, alcohol use, and smoking status ( Fig. 3a; Supplementary Data 1). A quantilequantile plot is shown in Supplementary Figure 1. The observed versus expected test statistic demonstrates no evidence for inflation of the association tests (inflation factor λ = 0.99).
Upon mapping these 256 cluster-associated CpG sites to genes and performing pathway analysis, we found significant enrichment of genes associated with Type I interferon signaling, antiviral responses and inflammatory pathways (gene list enrichment analysis FDR < 0.01; Table 2).
In order to functionally classify the cluster-associated CpGs, we intersected these genomic regions with the Hallmark Interferon-Alpha Responsive gene set 34 since the IFN-alpha signaling pathway has been previously implicated in SLE pathology 18,[35][36][37][38] . We observed a significant enrichment of IFN-alpha responsive genes (hypergeometric p < 0.01) with 101 out of the 256 CpGs within this set. Notably, of the 101 IFN-alpha CpGs, 93 were hypermethylated in cluster M relative to both cluster S1 and S2. Of these CpGs, 57 were in the promoter region (TSS200, TSS1500, 5′ UTR), and 36 were in the gene body. Hypermethylation at the promoter sites suggests a role for epigenetic silencing in cluster M with respect to S1 and S2 while gene body hypermethylation suggests gene expression.
Cluster-associated CpGs with the greatest variance (5-11% methylation variance) across the clusters were in genes IFI44L, MX1, PARP9, EPSTI1, and PDE7A, all displaying hypermethylation in cluster M relative to S1 and S2 (Supplementary Data 2). With the exception of PDE7A, all of these genes are interferon responsive. PDE7A encodes a phosphodiesterase associated with T cell activation and IL-2 production 39 . Differentially methylated CpGs in IFI44L, MX1, and PARP9 map to the 5-UTR region, suggesting silencing of these genes. Differentially methylated CpGs in EPSTI1 and PDE7A are located in the gene body, where hypermethylation is associated with gene expression.
For each of the 256 CpGs identified above using the ANOVA test, we then sought to determine which pairwise comparison (cluster S2 vs M, S2 vs S1, or S1 vs M) contributed to the significant F-statistic. Using the nestedF method in the Limma R package 40 , 247 of the aforementioned associated 256 CpGs were differentially methylated between clusters S2 and M (FDR p < 0.1; Supplementary Fig. 2A, Table 3 and Supplementary Table 5). The most significant CpGs were in the promotor of IFI44L and gene body of RSAD2, with hypermethylation in cluster S1 versus M. Comparison of clusters S2 to S1 identified 18 differentially methylated CpGs (FDR < 0.1; Supplementary Fig 2B, Table 3 and Supplementary  Table 5). The percent variance between clinical clusters explained by CpG methylation varied from 0.9% for cg23002431 (COPA gene body) to 21% for cg00959259 (PARP 5'UTR).
Validation of cluster-specific methylation profiles. To determine whether the methylation signature associated with the three clinical clusters identified was reproducible, we applied our clustering method to a previously published independent cohort of 302 female SLE patients of European descent 16 . This cohort has a lower lupus severity index (6.15 ± 1.42) compared to the CLUES cohort (6.85 ± 1.63; Student's t test p < 0.001), however this difference is relatively small and not clinically significant. In order to identify methylation associations in this validation cohort, we first assigned a cluster label (M, S1, or S2) to each patient using the study cohort as a reference. Since the clusters in the study cohort were discovered using an unsupervised approach, we first trained a random forest model on the CLUES data with the ACR features as input. Model parameters were optimized by minimizing the out-of-bag error (Supplementar Fig. 3). The model achieved a minimum out-of-bag error of 12.8%. We then applied this model to the validation cohort of SLE patients of European descent and determined a cluster label for each sample. Clinical characteristics of subjects in each cluster in the validation cohort are presented in Supplementary Table 4. In comparison to the CLUES cohort where the majority of subjects were in cluster S1, the majority in the validation cohort were in cluster M, reflecting the racial differences between the cohorts.
To determine whether the methylation patterns associated with each cluster were robust and reproducible, we evaluated methylation differences in the independent cohort of 302 female SLE patients of European descent as described above 16 . Since the validation dataset was obtained using the Illumina 450k BeadChip, we restricted these analyses to the 158 cluster-associated CpGs in the CLUES cohort that were also on the 450k array. Of these 158 CpGs, 132 (84%) were significantly associated with cluster in the validation dataset (FDR < 0.1; Table 3). We observed a strong correlation (r > 0.9) between differences in methylation beta values for the CLUES cohort and validation set for all three pairwise comparisons (cluster S1 vs. M, S2 vs. S1, and S2 vs. M; Table 3, Fig. 4).
Active chromatin states in cluster-associated CpGs. In order to further characterize the epigenetic landscape of the clusterassociated CpGs, we examined CpG enrichment in genomic regions classified according to specific chromatin states based on the Epigenome Roadmap 15 state model 41 . Results for 13 peripheral blood cell types are summarized in Supplementary Fig 4. We found significant depletion (Fisher's exact FDR < 0.01; OR < 0.5) of cluster-associated CpGs in quiescent regions in 12 of the 13 cell types, and significant enrichment (FDR < 0.01; OR > 2) in enhancers and regions flanking active transcriptional start sites in all cell types. We also observed significant enrichment (Fisher's exact FDR < 0.01; OR > 2) of H3K4me3, H3K4me1, and H3K27ac histone marks specific for active enhancers in all peripheral blood cell types ( Supplementary Fig 5). Epigenetic annotation of differentially methylated CpGs in IFI44L land in enhancers and active transcription sites in peripheral blood primary B cells, T helper memory cells, Naïve T cells, Th17 cells, T memory cells and T regs, but not in regulatory or transcription sites in neutrophils or NK cells. Differentially methylated CpGs in MX1, PARP9, EPSTI1, and PDE7A are located in enhancers and transcription sites in most peripheral immune cell subtypes.
meQTL loci controlling cluster-associated CpGs. We sought to understand the sources of methylation differences in the clinically-defined clusters. Therefore, we used paired genotype data to investigate genetic drivers of methylation differences. Specifically, we conducted a methylation quantitative trait loci analysis (meQTL) to determine which cluster-associated CpGs were under proximal genetic control (distance between SNP and CpG < 1 Mb). Genetic data was first imputed and LD-pruned (r 2 < 0.8). After adjusting for population structure, sex, age, cell type composition, medication use, smoking status, and alcohol consumption, we found 744 significant cis meQTL associations (FDR < 0.05; Fig. 3b). These involved 61 unique CpGs in 41 genes, and 397 SNPs in 90 genes (Supplementary Data 3).
Of the 744 significant cis meQTL associations, 91 meQTLs in 19 unique CpGs were in interferon-alpha or interferon-gamma responsive genes. Of these, the greatest number of meQTL loci were in EPSTI1 (12 SNPs), PARP14 (nine SNPs), and PARP15 (8 SNPs) for CpGs in interferon-alpha responsive genes (Supplementary Data 3). We found 39 meQTL associations involving CpG sites in the promoter region. Notably, we found 21 associations in CpGs in PARP14, of which 20 were in the promoter region under the control of SNPs in PARP14, PARP15, and DIRC2. We also found CpGs in the promoter region of OAS3 (n = 2) under the control of SNPs in LHX5-AS1, and one CpG in USP18 under the genetic control of SNPs in LINC01634.
Of the non-interferon-responsive CpGs, we found 20 genetic variants that controlled methylation of cg07259759 located in the gene body of USP35, a ubiquitin specific peptidase 42 (methylation variance 21-25%). Ten of these 20 genetic variants were found in an intron of GAB2, a tyrosine kinase adaptor that is primarily upregulated in activated innate immune cells [43][44][45] . We also found 43 genetic variants in HLA-F, a MHC-Ib minor allele involved in NK cell self-recognition 46 , which controlled methylation at four CpG sites in the gene body of HLA-F. Fifteen CpGs were located in the promoter or 5'UTR region, with the largest methylation variance observed for cg04738877 in the promoter Although we considered all SNPs within a 1 Mb window around each CpG, the proportion of significant meQTL associations decreased as the distance between SNPs and CpGs increased ( Supplementary Fig. 6A and B). This suggests that genetically determined CpG methylation was typically driven by proximal genetic variation, rather than distal effects.
Epigenetic mediation of genetic association with clusters. One challenge of interpreting the methylation associations with clusters is that many methylation differences may represent a consequence of clinical differences between clusters rather than causal mediators. In order to identify which CpGs may mediate genetic associations with clusters and reveal novel biology, we employed an integrative causal inference method 47 .
Briefly, this method uses conditional probabilities to evaluate a causal relationship between a factor (genotype), a potential mediator (CpG methylation), and an outcome (clinical cluster).
First, from the list of 744 meQTLs identified, we selected the SNPs that were significantly associated with cluster (FDR < 0.05). For these meQTL associations, we identified the subset where methylation appears to mediate the genotype-cluster association using the causal inference test (CIT). This yielded 24 meQTLs with 21 SNPs (FDR < 0.05; Supplementary Table 5). Notably from these, we found 6 significant associations between SNPs in GAB2 and CpGs in USP35. We also found evidence for methylation mediation of SNPs in HLA-F. Figure 5 provides an example of one of these associations between a SNP in GAB2 and CpG in USP35.
Ethnicity-associated differentially methylated CpGs. As some of the methylation differences in the clinically-defined clusters could be explained by genetic variation in the meQTL analysis, we explored the effect of ethnicity, after adjusting for genetic factors. Previous work has identified patterns of differential methylation across ethnic groups due to both ancestral genetic variation and environmental influences 48 . As non-White ethnicity is associated with worse outcomes in SLE, we sought to determine whether the differentially methylated CpGs across clusters were enriched for ethnicity-associated CpGs, after adjusting for genetic ancestry. Of the 256 cluster-associated CpGs, we identified 237 CpGs that were associated with ethnicity (FDR < 0.05) after adjusting for sex, the top three genetic ancestry principal components, cell composition, medications, alcohol use, and smoking status. A permutation analysis was conducted by randomly permuting   Fig. 3 Cluster associated CpGs and meQTL associations. a Heatmap of CpGs significantly associated with clinical cluster (FDR < 0.1) b Manhattan plot shows −log 10 (p-value) for associations between cluster-associated CpGs and all SNPs within 1 Mb of each CpG. For each CpG with a significant meQTL (FDR < 0.05), the most significant variant is labelled with its corresponding gene ethnicity 1000 times and testing for association with ethnicity. Figure 6a displays the density of ethnicity-associated CpGs in 1000 random samples. This analysis revealed a significant (p < 0.001) enrichment of ethnicity-associated CpGs in the clusterassociated methylation signature. Results and data are available as an RShiny Application for use in future research: http:// comphealth.ucsf.edu/sle_clustering/.

Discussion
In the present study, we developed a stepwise multi-omics approach for identifying SLE patient subtypes defined by clinically-relevant phenotypes and molecular mechanisms among a multi-ethnic cohort. We report three lupus clinical subtypes defined by the ACR classification criteria that vary according to disease severity. We also show that patterns of differential methylation at specific CpGs are associated with the clinical subtypes. A subset of these CpGs are under genetic control, however the majority display a strong association with ethnicity after adjusting for genetic ancestry, suggesting possible molecular mediators of the ethnic-effect underlying lupus outcomes. Unlike previous studies that have largely studied SLE patients of European genetic ancestry, we studied a cohort or patients of White, African-American, Asian, and Hispanic ethnicity. Since SLE severity is known to vary widely between racial and ethnic groups, analysis of a large multiethnic cohort is crucial for understanding the genetic and non-genetic determinants of this ethnic-associated variability.
Unsupervised clustering approaches have been applied widely to high dimensional omics datasets with the aim of deriving meaningful clusters characterized by a small set of molecular features 6,49,50 . By translating this dimensionality reduction technique to the ACR clinical features in SLE, we identified three clinical subtypes each characterized by specific ACR features. Due to the strong association of DNA methylation with genetic variation, unsupervised clustering of the methylation data revealed population structure rather than lupus-relevant clinical differences. For these reasons, we did not report DNA methylation clustering and rather chose to define subtypes by ACR criteria. Since this is an ongoing cohort, with a larger sample size, we may be able to define methylation-based clusters in each racial group separately, minimizing the effect of genetic structure confounding.
Importantly, the clusters defined in this study are consistent with previous epidemiological studies describing the correlation of multiple sub-phenotypes of SLE, such as the correlation of SLE skin manifestations with arthritis, serositis with the lack of  other end-organ involvement, and anti-dsDNA with lupus nephritis 21,51,52 . The milder subtype in this study had a higher prevalence of participants of White race. This has also been previously described, as patients with European ancestry have a higher proportion of arthritis, skin manifestations and serositis and lower prevalence of lupus nephritis and autoantibody production [53][54][55] . By considering clusters defined by multiple phenotypes we preserve the multifactorial clinical nature of SLE. Training a random forest model using the cluster assignments as labels allowed us to apply this clustering scheme to an external dataset of patients of European descent. After validation in other larger multi-ethnic cohort, this model may form a clinically useful stratification method to further study disease heterogeneity in SLE.
After identifying clinically-relevant patient clusters, we found a set of 256 CpGs associated with the clusters with strong enrichment of methylation for genes in the Type I interferon pathway, cytoplasmic viral sensing pathways, and immune related pathways, with significant enrichment in enhancers and regions flanking active transcriptional start sites in all peripheral immune cell types. Several studies have implicated transcriptional upregulation and epigenetic regulation of the Type I interferon pathway in SLE 17,19,[35][36][37]56 . We and others have also previously described methylation changes in interferon responsive genes associated with individual lupus outcomes, including cutaneous rash 57 , renal involvement 19,20,58 and serologic manifestations 16,22 . However, given the striking heterogeneity of clinical features in SLE, epigenetic programs associated with single phenotypes may be less relevant in a clinical setting. By performing unsupervised clustering on a diverse SLE cohort, we can study the molecular heterogeneity in clinically-relevant subtypes driven by multiple SLE outcomes and disease severity. With this approach we found that severe subtypes, which also have higher proportions of patients of Asian and Hispanic ethnicity, have a higher degree of type I interferon dysregulation. Of these CpG sites, the greatest methylation variance across the clusters was in IFI44L, which encodes for Interferon Induced Protein 44 Like, with progressive hypomethylation from cluster M to cluster S2. Although the function of IFI44L is unknown, increased IFI44L expression is a component of the type-1 IFN response signature and also part of the cellular response to viral infections 59 . IFI44L promoter methylation has been proposed as a blood biomarker for SLE 58 .
Since genetic variation can have profound effects on DNA methylation 60-62 , we also performed an meQTL analysis to quantify the degree of proximal genetic control of the 256 CpG signature. Although the cluster-associated CpG set was strongly enriched for Type 1 IFN genes, only a subset of these Type 1 IFN CpGs (24%) had meQTL loci, suggesting that environmental factors may be contributing more to the epigenetic regulation of the Type 1 IFN pathway than genetic factors. We found interesting associations between genetic variation and methylation at CpGs in immune relevant genes, and in 24 meQTL associations we had evidence of mediation of the genetic association by methylation at the corresponding CpG. We would like to highlight variants in HLA-F, PAR14 and GAB2 controlled methylation sites in USP35. HLA-F is part of the nonclassical HLA-Ib genes, which are mono-or oligomorphic 46 . Surface expression of HLA-F has been demonstrated on activated T, B and NK cells, and serum IgG autoantibodies against HLA-F have been detected in SLE patients and correlated with disease activity [63][64][65] . PARP14 encodes for poly(ADP-ribose) polymerase (PARP) protein family 14 and is involved in cellular maintenance and cell fate decisions, such as cell-cycle progression, metabolic pathways and ribosome biogenesis 66 . Its role in SLE and autoimmune disease has not been defined but it has been shown to regulate glycolysis via IL-4 in B lymphocytes 67 and to promote survival of cancer cells [67][68][69] . Glycolysis in SLE has been found to directly influence the Th17 cell fate and survival, therefore implicating a potential mechanistic role for PARP14 in SLE 70 .
GAB2 is a member of the GRB2-associated binding protein (GAB) gene family. These genes act as adapters for transmitting various signals in response to stimuli through cytokine and growth factor receptors, and T-and B-cell antigen receptors 45 . Among its related pathways is Akt signaling, which is involved in cell proliferation and autophagy, a process that has been implicated in SLE pathogenesis 44,[71][72][73] . Variants of GAB2 influenced methylation marks in the gene body of USP35, which encodes for a member of the peptidase C19 family of ubiquitin-specific proteases 42 . This deubiquitinating enzyme has been shown to mediate the IFN-type I response upon viral infection and it has been associated with higher IFN-β and IFIT1 gene expression 74 . This is relevant to our findings as higher levels of IFN-β have been associated with SLE [75][76][77] . As with all epidemiologic studies, these results represent hypotheses that will require mechanistic validation and independent replication.
Variation in methylation can be attributed to genetic and nongenetic effects. The majority of differentially methylated CpGs among disease subtypes were not classified as under genetic control. Although the number of detected meQTL associations is likely to increase with a larger sample size, it also suggests a greater role for non-genetic or environmental effects. Since selfreported race and ethnicity refers to communality in cultural heritage, language, social practice, traditions, and geopolitical factors, it may be a proxy for a wide variety of environmental exposures not easily captured. Since ethnicity plays a role in differences in lupus severity, we were interested in the association of ethnicity and our methylation findings, after adjusting for genetic variation. After adjusting for genetic principal components, we found a significant enrichment of self-reported raceassociated CpGs in our 256 CpG signature (p < 0.001). This suggests that there may be a common set of CpGs that mediates both SLE clusters and non-genetic differences in race. Although this might raise concern for confounding by genetic structure, our analysis revealed a low genomic inflation factor. Furthermore, the CpGs are biologically relevant in SLE pathogenesis. In addition, previous work has identified differential methylation between ethnic groups due to environmental factors that is not fully explained by genetic ancestry 48 . It is well known that SLE outcomes vary according to race, and the causes behind these race disparities are a source of ongoing debate. We observed Association of DNA methylation of cg07259759 and cluster (a) or genotype of rs7104222 (b). c Association between genotype rs7104222 and clusters. d Beta coefficient represents the dependence of cluster on genotype with or without adjusting for methylation. Error bars represent the 95% confidence interval for beta coefficient estimate. After adjusting for methylation, the observed dependence reduces toward zero differences in race across the disease subtypes, with the milder subtype having a larger representation of White patients. Therefore, these methylation differences suggests the existence of molecular mediators of the non-genetic race-related clinical differences in SLE outcomes, and may reflect environmental exposures that affect races differentially. Figure 6b displays a model for the role of race-associated non-genetic factors that control both methylation of the 256 CpG signature and SLE disease subtypes.
We applied this approach to an external cohort of exclusively SLE patients of European descent. The clinical clusters were labelled by applying a random forest model trained on the original CLUES clusters. The greatest number of subjects were assigned to cluster M. Since cluster M was also enriched for patients that self-identified as White in the original clusters, we believe that the clustering model accurately captures the role of racial differences in lupus severity. We were only able to test 158 of the 256 cluster-associated CpGs in the validation cohort due to a limited number of overlapping probes. However, we validated 84% of the cluster-associated methylation sites. As we also found that the majority of cluster-associated CpGs were race-associated in the CLUES cohort, this suggests that while race plays a role in DNA methylation, these effects are secondary to the clinical differences between lupus clusters. As we could not determine the relative genetic and non-genetic contributions of the methylation differences, we cannot recreate the race-specific findings we have described for the CLUES cohort. In general, this suggests that the subtypes and associated methylation differences we have described can be applied to other cohorts.
Our study has likely been limited by the modest sample size, in particular the small number of African-American SLE patients. We have also profiled DNA methylation marks in whole blood, which even after rigorous adjustment methods may be subject to confounding due to cell composition. Additionally, since methylation changes may be cell-type specific, our data may be insensitive to these effects. For example, when comparing clusters S1 and S2, we were only able to detect 20 differentially methylated CpGs. However, since clusters S1 and S2 had significantly different rates of lymphopenia and leukopenia, methylation differences may represent both differences in cell proportion and celltype specific epigenetic remodeling that cannot be distinguished in whole blood data. Additionally, our validation cohort is limited not only by its ethnic and gender composition, but also since methylation was measured on a different chip than the present study. This meant that we were only able to test a subset (158) of the 256 cluster-associated CpGs in the validation cohort. Ideally, clustering would be performed separately in each racial group to account for possible race-specific effects. However, due to the limited sample size, this was not feasible in the current study. Future work with larger sample sizes in non-European populations may help to address this issue.
Strengths of this study include the rich phenotyping data and adjustments for major confounders, including medications at the time of blood draw, smoking history, and alcohol consumption, which are unaccounted for in most epigenome-wide association studies. This is also the largest cohort including African American, Caucasian, Asian, and Hispanic patients to be profiled for genome wide DNA methylation and genotyping, which allowed us to differentiate between genetic and non-genetic effects of race in SLE outcomes, shedding light on molecular mediators of race in disease heterogeneity. Future work will include testing these findings in other multi-ethnic cohorts. Furthermore, it will be of interest to determine whether these DNA methylation differences are predictive of future disease activity and severity.
In summary, we have identified three distinct clinical subtypes of SLE that have distinct patterns of methylation at specific CpG sites. While previous studies have defined subtypes based on transcriptomic data 6-8 , by integrating methylation and genetic data, the three subtypes identified here may reflect the influence of both genetic and non-genetic effects. We also identified potential mediation of genetic association by methylation changes not previously addressed in SLE. Furthermore, we have demonstrated the utility of studying a diverse SLE population to investigate the molecular underpinnings of race differences in SLE outcomes.

Methods
Subjects and samples. Subjects were participants in the California Lupus Epidemiology Study (CLUES), a multi-racial/ethnic cohort of individuals with physician-confirmed SLE. This study was approved by the Institutional Review Board of the University of California, San Francisco. All participants signed a written informed consent to participate in the study. Participants were recruited from the California Lupus Surveillance Project, a population-based cohort of individuals with SLE living in San Francisco County from 2007 to 2009 (2,47). Additional participants residing in the geographic region were recruited through local academic and community rheumatology clinics and through existing local research cohorts.
Study procedures involved an in-person research clinic visit, which included collection and review of medical records prior to the visit; a history and physical examination conducted by a physician specializing in lupus; collection of biospecimens, including peripheral blood for clinical and research purposes; and completion of a structured interview administered by an experienced research assistant. All SLE diagnoses were confirmed by study physicians based upon one of the following definitions: (a) meeting ≥ 4 of the 11 American College of Rheumatology (ACR) revised criteria for the classification of SLE as defined in 1982 and updated in 1997 5,78 , (b) meeting 3 of the 11 ACR criteria plus a documented rheumatologist's diagnosis of SLE, or (c) a confirmed diagnosis of lupus nephritis, defined as fulfilling the ACR renal classification criterion (>0.5 grams of proteinuria per day or 3 + protein on urine dipstick analysis) or having evidence of lupus nephritis on kidney biopsy. DNA methylation assessment. DNA methylation of genomic DNA from peripheral blood mononuclear cells was profiled using the Illumina MethylationEPIC BeadChip. This chip assesses the methylation level of~850,000 CpGs in enhancer regions, gene bodies, promoters, and CpG islands. All subsequent processing was done using the R minfi package. Signal intensities were background subtracted using the minfi noob function and then quantile normalized 79,80 . Sites with a poor detection rate (detection p value > 0.05) in more than 5% of the samples (1746 CpG sites) were removed. Sites predicted to hybridize to multiple loci (44,097 CpG sites) and sites on non-autosomal chromosomes (19,627 CpG sites) were removed, resulting in 802,579 probes for analyses.
DNA genotyping. Genotyping for genomic DNA from peripheral blood mononuclear cells was performed using the Affymetrix Axiom Genome-Wide LAT 1 Array. This genotyping array is composed of 817,810 SNP markers across the genome and was specifically designed to provide maximal coverage for diverse racial/ethnic populations, including West Africans, Europeans and Native Americans 81 . Samples were retained with Dish QC (DQC) ≥ 0.82. SNP genotypes were first filtered for high-quality cluster differentiation and 95% call rate within batches using SNPolisher. Additional quality control was performed using PLINK. SNPs having an overall call rate less than 95% or discordant calls in duplicate samples were dropped. Samples were dropped for unexpected duplicates in IBD analysis or mismatched sex between genetics and self-report; for first-degree relatives, one sample was retained. All samples had at least 95% genotyping and no evidence of excess heterozygosity (maximum < 2.5*SD). We tested for Hardy-Weinberg Equilibrium (HWE) and cross-batch association for batch effects using a subset of subjects that were of European ancestry and negative for double-stranded-DNA antibodies and renal disease to minimize genetic heterogeneity. SNPs were dropped if HWE p < 1e-5 or any cross-batch association p < 5e-5. Genetic data was imputed using the Michigan Imputation Server 82 using Minimac3. Imputation was performed using the 1000 Genomes Phase 3 reference panel. Following imputation, SNPs with minor allele frequency greater than 5% were retained, and SNPs with > 5% missing data or evidence of deviation from Hardy Weinberg equilibrium (p < 1 × 10 −4 ) were removed. SNPs were pruned so that no two SNPS were in linkage disequilibrium (r 2 > 0.8).
Phenotypic clustering analysis. To stratify SLE patients into clinically relevant clusters, we performed unsupervised clustering on patient phenotypic data. Cluster analysis was performed on the American College of Rheumatology (ACR) criteria and sub criteria. Data were dichotomized to represent absence or presence of each criterion. Multiple correspondence analysis was performed with the PCAmixdata R package 83 . The top two MCA dimensions were retained as selected by the k-fold cross validation scheme implemented in the missMDA R package 84,85 ( Figure  S7A). The number of clusters, k, was chosen by maximizing cluster stability measured by Jaccard similarity using a bootstrap resampling based method. Maximum cluster stability was achieved with k = 3 and each cluster had a Jaccard mean stability score greater than 0.82 86 (Fig. S7B).
Medication use adjustment. Since medication use can modify CpG methylation at certain sites, we aimed to include medications prescribed at the time of blood sampling as covariates in statistical analyses. We performed principal component analysis (PCA) on a dichotomized matrix of current medications at the time of blood sampling for each patient. The top three PCs were chosen using a three-fold cross validation scheme implemented in the missMDA R package 84,85 and included as covariates in subsequent statistical models.
Differential methylation analysis. In order to account for possible confounding due to cell type heterogeneity, we applied the ReFACTor algorithm 87 implemented in Glint 88 to infer peripheral blood cell composition. To identify CpG sites associated with clinical clusters, a linear model adjusted for sex, age, cell count estimates, alcohol use, smoking status, genetic ancestry components, and the top three medication principal components was fit using the nestedF mode in the Limma R package 40 . P values were adjusted using the Benjamini Hochberg procedure. All analyses were performed using R version 3.4.2 89 .
Chromatin state enrichment. 15-state chromatin model epigenome data for all human peripheral blood cell types was accessed via the NIH Roadmap Epigenomics Consortium 41 . All CpGs on the probe-set were assigned a chromatin state. For each of the 15 chromatin states, a fold statistic was computed using a Fisher's exact test for enrichment of the chromatin state within the set of cluster-associated CpGs relative to all the CpGs in the probe-set. This process was repeated for H3K4me3, H3K4me1, and H3K27ac ChIP seq peaks from the NIH Roadmap Epigenomics Consortium.
CpG race enrichment adjusted for genetic ancestry. Enrichment of raceassociated CpGs in the list of differentially methylated CpGs was determined via a permutation method. Briefly, the total number of cluster-associated CpGs (N cluster ) was obtained for a specified FDR as above. Then, a null set was created by randomly permuting the race labels 1000 times. For each permutation, from the set of cluster-associated CpGs, we computed the number of CpGs associated with the permutated race labels by fitting a linear model for each CpG adjusting for sex, age, cell count estimates, alcohol use, smoking status, the top three genetic principal components, and the top three medication principal components. We then found the number of race associated CpGs in the set of clusterassociated CpGs (N race ) using the same linear model as the null set. We defined an enrichment statistic as the proportion N race /N cluster relative to the mean of the null distribution.
Statistical meQTL analyses. Since the Illumina BeadChip EPIC platform is known to cross-react with several probes if the region contains a SNP 90 , we first removed all probes with SNPs. meQTL analysis was then performed by fitting a linear model adjusted for sex, age, cell count estimates, alcohol use, smoking status, the top three genetic principal components, and the top three medication principal components using the Matrix eQTL R package 91 .
Causal inference test. For each meQTL association, the genotype (G)-methylation (M)-cluster (Y) relationships were assessed using the Causal Inference Test (CIT). To establish a mediation relationship in which genotype acts on the clusters through methylation, four conditions must be satisfied: (1) G and Y are associated, (2) G is associated with M after adjusting for Y, (3) M is associated with Y after adjusting for G, and (4) G is independent of Y after adjusting for M. The CIT p-value is defined as the maximum of the four component test p-values.
External validation. A random forest model was trained on the cluster labels from the study data and cross validation was used to optimize parameters. Methylation data and ACR criteria were obtained from a previously published SLE cohort 12 . All ACR criteria were dichotomized in a manner identical to the study data. The random forest model was then applied to the external validation data to generate cluster labels. CpGs that were differentially methylated in the study data were validated using a linear model adjusted for cell composition, genetic principal components, sex, and age. Smoking history was negative for all patients in the validation cohort. Alcohol use and medication principal components were not included in the linear model since these data were not available.

Data availability
DNA methylation, genotype and phenotypic data that support the findings of this study have been deposited in DbGap with the primary accession code of phs001850.v1.p1. Data is available through an application to a data access committee