Introduction

SLE is a complex, multisystem autoimmune disease that continues to be a major diagnostic as well as therapeutic challenge. There are no definitive, specific diagnostic tools available to determine whether a patient has SLE, and diagnostic approaches in SLE have not changed in decades. Physicians still rely on clinical evaluation and a few laboratory tests, including measurement of autoantibodies and complement levels. Despite the wealth of genetic, epigenetic, and gene expression data that has emerged in the past few years at both the patient and cellular levels, none has been integrated to produce a predictive tool that can be used to evaluate an individual SLE patient.

In SLE, defects in central and peripheral tolerance allow for activation of self-reactive B cell clones and differentiation into plasmablasts/plasma cells (PCs) that secrete autoantibodies, which in turn mediate tissue damage1,2. Genome wide association studies (GWAS) have identified numerous polymorphisms in regions encoding genes or regulatory regions that could influence B cell function3, suggesting that a general state of B cell hyper-responsiveness could contribute to SLE pathogenesis. Autoantibody-containing immune complexes stimulate production of type 1 interferon, a hallmark of infection that is also observed in SLE patients, regardless of disease activity4,5. In addition to B cells and PCs6, various T cell populations also exert differential effects on SLE pathogenesis. T follicular helper cell subsets contribute to B cell activation and differentiation, and abnormal T cell receptor signaling is also thought to lead to hyper-responsive autoreactive T cell activity7,8,9. Furthermore, defects in regulatory T cells, partially secondary to deficient IL-2 production, result in faulty modulation of immune activity and inflammation8,9.

Myeloid cells (MC) also play a role in SLE pathogenesis10. Factors present in the local microenvironment can cause macrophages (Mϕ) to undergo extreme changes in transcriptional regulation in a process called Mϕ polarization11,12,13. Overabundance of proinflammatory M1 Mϕ and decreased expression of markers for anti-inflammatory M2 Mϕ are detected in both lupus-prone mice and SLE patients14,15, and therapeutic stimulation of M2 polarization significantly decreases disease severity in murine SLE16. Experimental intervention in M2 polarization as well as microRNA array profiling suggest that abnormalities in M2 Mϕ may contribute to SLE severity15,17. Low-density granulocytes (LDGs) are abnormal neutrophil-like cells that appear in the blood of lupus patients as well as in many other disease states18,19,20,21,22,23. Although their involvement in SLE has not been studied as extensively as that of other cell types, LDGs have already been linked to kidney disease, vascular disease, and other manifestations in lupus patients24,25,26,27,28,29.

To date, however, it has been difficult to relate gene expression profiles to SLE disease activity successfully. Numerous groups have attempted to characterize SLE patients by gene expression. For example, Jourde-Chiche et al. reported a discrete group of differentially expressed genes that might be found in subjects with SLE renal disease28, and Banchereau et al. extensively analyzed pediatric lupus samples and attempted to associate modules of expressed genes with disease manifestations in children30. Despite these advances, gene expression data has yet to provide an approach with sufficient predictive value to utilize in decision making about individual subjects with SLE. Furthermore, no cellular phenotype has been independently verified to be able to distinguish a patient with active SLE from one with inactive disease. This distinction is critical both for patient evaluation and for clinical trials, as most SLE trials are aimed at controlling disease activity.

Therefore, in order to advance personalized treatment of SLE patients, the use of big data analytical techniques, including machine learning, can be useful to understand the relationships between cell subsets, gene expression, and disease activity. Machine learning describes a wide range of computational methods which allow researchers to harness complex data and develop self-trained strategies to predict the characteristics of new samples, such as whether a given SLE patient has active or inactive disease. Machine learning techniques have already been leveraged in lupus to characterize disease risk and identify new biomarkers based on genotypic data or urine tests31,32. When applied to high-throughput transcriptomic data, machine learning algorithms could potentially be used to identify the gene expression features with the most utility to identify subjects with higher degrees of disease activity and may also provide insights into disease pathogenesis.

To address this possibility, we used conventional bioinformatics methods in conjunction with unsupervised and supervised machine learning techniques to: (1) test the potential of raw gene expression data and modules of genes to classify subjects with active and inactive SLE, (2) determine the optimum classifier or classifiers, and (3) understand the combinations of variables that best facilitate classification.

Results

Gene expression and SLE disease activity

Before employing machine learning techniques, it was necessary to first assess whether conventional bioinformatics approaches could accurately separate active SLE patient samples from those obtained from inactive patients. First, three whole blood (WB) data sets (Table 1) were filtered to include only those genes which passed quality control and filtering in all three studies. Differential expression (DE) analysis of active versus inactive patient samples with the remaining filtered 7,848 genes revealed major differences among data sets and considerable heterogeneity within data sets. GSE39088 had only 176 DE genes with a false discovery rate (FDR) less than 0.2 and none with FDR < 0.05; GSE45291 had 5850 DE genes with FDR < 0.2 and 4837 with FDR < 0.05; GSE49454 had 1710 DE genes with FDR < 0.2 and 72 with FDR < 0.05 (Supplementary Data S1). Hierarchical clustering was carried out on each study with all genes, DE genes with FDR < 0.2, and DE genes with FDR < 0.05 to determine whether active and inactive patients would separate into two clusters. The Adjusted Rand Index (ARI) was used to compare these clusterings to the known status of the patients. When using all genes, all three studies had ARIs near zero, indicating that clustering separated active and inactive patients no better than random chance (Table 2). GSE39088 and GSE49454 showed only mild improvement after filtering genes, whereas GSE45291 attained an ARI of 0.94 when using genes with FDR < 0.05.

Table 1 Data sources for active (SLEDAI ≥ 6) and inactive (SLEDAI < 6) SLE WB gene expression.
Table 2 Adjusted Rand Index of Unsupervised Hierarchical Clustering Compared to Known Disease Activity.

Next, the lists of genes were compared for commonalities. Out of 6,440 unique DE genes from the three studies, 5,170 genes were unique to one study, 1,234 were shared by two studies, and 36 were shared by all three studies. Of these 36 genes, only three had consistent fold changes across all studies (DNAJC13 and IRF4 upregulated; RPL22 downregulated). Rank-rank Hypergeometric Overlap (RRHO) was next applied as a threshold-free comparison of the studies33. All genes that were tested for differential expression were sorted by FDR from most significantly overexpressed to most significantly underexpressed and broken into 36 groups of 218 genes each. Among the three studies, the ranked gene lists failed to demonstrate significant overlap of the most overexpressed and underexpressed genes (Fig. 1). The three data sets were comprised of different patient populations and were collected on different microarray platforms (Table 1); still, the heterogeneity is striking. The lack of commonality among the genes most descriptive of active and inactive patients in each data set casts doubt on whether active and inactive patients from different data sets will separate cleanly.

Figure 1
figure 1

Heatmaps of −log10(overlap p values) from RRHO. Strongest overlaps near the center of each plot indicate weak agreement among the most significantly upregulated and downregulated genes from each data set. Strong agreement between data sets should form a diagonal from the bottom-left corner to the top-right corner.

Patients from each study were then joined to evaluate whether unsupervised techniques would separate active patients from inactive patients. Expression profiles from each study were first normalized to have zero mean and unit variance. Figure 2 shows that even these three genes (DNAJC13, IRF4, and RPL22) failed to separate active patients from inactive patients precisely. Hierarchical clustering on all genes had an ARI of 0.03 when compared to the known status of the patients, and clustering on the three consistent DE genes shared among the studies (DNAJC13, IRF4, and RPL22) had an ARI of 0.05 (Table 2). If gene expression has the potential to identify active SLE patients robustly, conventional bioinformatics techniques failed to harness that, highlighting the need for more advanced algorithms.

Figure 2
figure 2

Clustering all three studies on three consistent DE genes. DNAJC13, IRF4, and RPL22 were consistently differentially expressed in each study yet fail to fully separate active from inactive patients. Orange bars denote active patients; black bars denote inactive patients. Blue, yellow, and red bars denote patients from GSE39088, GSE45291, and GSE49454, respectively.

We hypothesized that patterns of enrichment of Weighted Gene Co-expression Network Analysis (WGCNA) modules derived from isolated cell populations that are correlated to the SLEDAI SLE disease activity index might be more useful than gene expression across studies to identify active versus inactive lupus patients. To characterize the relationships between SLE gene signatures from various peripheral cellular subsets and disease activity, WGCNA was used to generate co-expression gene modules from purified populations of cells from subjects with active SLE, which could subsequently be tested for enrichment in whole blood of other SLE subjects. WGCNA analysis of leukocyte subsets resulted in several gene modules with significant Pearson correlations to SLEDAI (all |r| > 0.47, p < 0.05). CD4, CD14, CD19, and CD33 cells yielded 3, 6, 8, and 4 modules significantly correlated to disease activity, respectively (Table 3). Two low-density granulocyte (LDG) modules were created by performing WGCNA analysis of LDGs along with either SLE neutrophils or HC neutrophils and merging the modules most strongly expressed by LDGs. Two plasma cell (PC) modules were created by using the most increased and decreased transcripts of isolated SLE plasma cells compared to SLE naïve and memory B cells2.

Table 3 Cell module correlations to disease activity and functional analysis.

Gene Ontology (GO) analysis of the genes within each module showed that some processes, such as those related to interferon signaling, RNA transcription, and protein translation, were shared among cell types, whereas other processes were unique to certain cell types (Table 3) and might be used to classify patients more effectively. The genes in each module are available in (Supplementary Data S2).

To characterize the relationships between SLE gene modules from cell subsets and disease activity in greater detail, Gene Set Variation Analysis (GSVA) enrichment was carried out using the 25 cell-specific gene modules (Fig. 3). Of the 25 cell-specific modules, 12 had enrichment scores with significant Spearman correlations to SLEDAI (p < 0.05), and 14 had enrichment scores with significant differences between active and inactive patients (Welch’s t-test, p < 0.05) (Table 4). Notably, each cell type produced at least one module with a significant correlation to SLEDAI in WB and at least one module with a significant difference in enrichment scores between active and inactive patients, demonstrating a relationship between disease activity in specific cellular subsets and overall disease activity in WB. However, the Spearman’s rho values ranged from −0.40 to +0.36, suggesting that no one module had substantial predictive value. Furthermore, the effect sizes as measured by Cohen’s d when testing active versus inactive enrichment scores ranged from −0.85 to +0.79. The CD4 Floralwhite and Orangered4 modules, which had the largest positive and negative effect sizes, respectively, showed a high degree of overlap in the enrichment scores of active and inactive patients (Fig. 4).

Figure 3
figure 3

Cellular gene modules provide the basis for machine learning predictions of SLE activity. GSVA was performed on three SLE WB datasets using 25 WGCNA modules made from purified SLE cells with correlation or published relationship to SLEDAI (See Table 2). Orange: active patient; black: inactive patient. LDG: low-density granulocyte; PC: plasma cell.

Table 4 Assessment of WGCNA module relationships with SLE disease activity in WB.
Figure 4
figure 4

Individual WGCNA modules are ineffective at separating active and inactive SLE subjects. GSVA enrichment scores for (a) CD4_Floralwhite and (b) CD4_Orangered4 in SLE WB are unable to fully separate active patients from inactive patients. Asterisks denote significant differences by Welch’s t-test. Error bars indicate mean ± standard deviation.

Analysis of individual disease activity-associated peripheral cellular subset gene modules was not sufficient to predict disease activity in unrelated WB data sets, since no single module from any cell type was able to separate active from inactive SLE patients (Fig. 3). The results emphasized the need for more advanced analysis to employ gene expression analysis to predict disease activity.

Machine learning and disease activity

To assess the effectiveness of either raw gene expression or module-based enrichment techniques, SLE patients were classified as active or inactive using generalized linear models (GLM), k-nearest neighbors (KNN), and random forest (RF) classifiers. Classifiers were validated using two different methodologies: (1) 10-fold cross-validation or (2) study-based cross-validation, in which classifiers were trained on each data set independently and tested in the other two data sets. When evaluating the performance of classifiers on the data set on which they were trained, GLM accuracy was defined as one minus the cross-validated classification error from the cv.glmnet() function, and RF accuracy was determined based on out-of-bag predictions. The accuracy of each classifier trained with either gene expression or module enrichment is shown in Fig. 5, and receiver operating characteristic (ROC) curves are plotted in Fig. 6. Classification metrics for each classifier are shown in Table 5.

Figure 5
figure 5

Performance of machine learning classifiers across three independent data sets. Classifiers were trained on the data sets listed across the top and evaluated in the data sets listed across the bottom. Data sets are listed by their GEO accession numbers. Expression (black): gene expression data. WGCNA (blue): module enrichment scores.

Figure 6
figure 6

Area under the ROC curve of machine learning classifiers across three independent data sets. Classifiers were trained on the data sets listed across the top and tested in the other two data sets. Data sets are listed by their GEO accession numbers. Expression (black): gene expression data. WGCNA (blue): module enrichment scores.

Table 5 Classification metrics of machine learning classifiers.

When performing 10-fold cross-validation, the use of gene expression values resulted in better performance from all three classifiers compared to module enrichment scores. The random forest classifier was the strongest performer with 83 percent accuracy, and its corresponding ROC curve demonstrated an excellent tradeoff between recall and fall-out (AUC 0.89). This high accuracy can likely be attributed to the presence of data from all three studies in both the training and test sets. In this case, the classifiers have the opportunity to learn patterns inherent to each data set, which proves useful during testing. To ensure that the classifiers were not disproportionately learning patterns from certain data sets at the expense of others, the classification results from the 10-fold cross-validation approach were subdivided by data set. All classifiers exhibited good performance with small differences between their highest and lowest accuracies in individual data sets, with the exception of the WGCNA-based KNN classifier (Supplementary Table S1).

When performing study-based cross-validation, classifiers trained on expression data performed better on their respective training sets than those trained on module enrichment scores in nearly all cases (Fig. 5). However, the accuracy of classifiers trained on expression values in the test sets was approximately 50 percent. This is in line with the findings of our initial bioinformatic analysis (Table 2), namely, that gene expression values have little to no utility when attempting to classify unfamiliar samples. When the training and test data come from different data sets, the classifiers learn patterns that are unhelpful for classifying test samples. Although classifiers trained on module enrichment scores did not achieve high accuracies in their training sets, they did not experience as sharp a drop in accuracy when tested on unfamiliar data sets. Remarkably, the use of module enrichment scores improved RF test accuracy to approximately 65 percent and improved KNN test accuracy to approximately 70 percent.

Overall, gene expression values provide high accuracy when performing 10-fold cross-validation but are rendered nearly useless when performing study-based cross-validation. These results indicate that disease activity classification based on raw gene expression, while more accurate, is sensitive to technical variability, whereas classification based on module enrichment better copes with variation among data sets.

Random forest consistently achieved high performance, and we hypothesized that its assessments of variable importance could be used to gain insight into directors of the identification of SLE activity. To this end, random forest classifiers were trained on all patients from all data sets in order to identify the most important genes and modules as determined by mean decrease in the Gini impurity, a measure of misclassification error. The classifier trained with gene expression data achieved an out-of-bag accuracy of 81 percent, with a sensitivity of 83 percent and a specificity of 78 percent. The classifier trained with module enrichment scores achieved an out-of-bag accuracy of 73 percent, with a sensitivity of 78 percent and a specificity of 68 percent.

The most important genes and modules identified a wide array of cell types and biological functions (Fig. 7). The most important genes encompass such diverse functions as interferon signaling, pattern recognition receptor signaling, and control of survival and proliferation (Fig. 7a). Notably, the most influential modules skewed away from B cell-derived modules and towards T cell- and myeloid cell-derived modules (Fig. 7b). As some of these modules had overlapping genes, the variable importance experiment was repeated with modules that were first scrubbed of any genes that appeared in more than one module before GSVA enrichment scoring (Supplementary Data S3). The relative variable importance scores of the deduplicated modules correlated strongly with those of the original modules (Spearman’s rho = 0.69, p = 1.94E-4), indicating that module behavior was partly driven by the overlapping genes but strongly driven by unique genes (Fig. 7c).

Figure 7
figure 7

Random forest classifier reveals variable importance of genes and modules. (a) Variable importance of top 25 individual genes as determined by mean decrease in Gini impurity. (b) Variable importance of cell modules. (c) As many modules shared genes, modules were deduplicated to determine the effects on the random forest classifier. The relative importance of the full modules and deduplicated modules was strongly correlated (Spearman’s rho = 0.69, p = 1.94E-4). LDG: low-density granulocyte; PC: plasma cell.

CD4_Floralwhite and CD14_Yellow, two interferon-related modules which maintained high importance after deduplication, were further analyzed to study the effect of unique genes on module importance. Gene lists were tested for statistical overrepresentation of Gene Ontology biological process terms with FDR correction on pantherdb.org. CD4_Floralwhite did not show any significant enrichment, but CD14_Yellow, which had the highest importance after deduplication, was highly enriched for genes with the “Immune Effector Process” designation (26/77 genes, FDR = 9.38E-11 by Fisher’s exact test). This suggests that CD14+ monocytes express unique genes that may play important roles in the initiation of SLE activity.

Discussion

Several important findings related to SLE gene expression heterogeneity within and across data sets have been elucidated by this study. First, we demonstrated that DE analysis of active versus inactive patients is insufficient for proper classification of SLE disease activity, as systematic differences between data sets render conventional bioinformatics techniques largely non-generalizable.

Next, we hypothesized that WGCNA modules created from the cellular components of WB and correlated to SLEDAI disease activity might improve classification of disease activity in SLE patients. The use of cell-specific gene modules based on a priori knowledge about their relevance to disease fared slightly better than raw gene expression, as it generated informative enrichment patterns, and many of the modules maintained significant correlations with SLEDAI in WB. However, these enrichment scores failed to separate active patients from inactive patients completely by hierarchical clustering.

We then compared raw expression data alongside the WGCNA generated modules of genes in machine learning applications. We used a supervised classification approach using elastic generalized linear modeling, k-nearest neighbors, and random forest classifiers. The trends in performance when cross-validating by study or cross-validating 10-fold speak to the potential advantages and disadvantages of diagnostic tests incorporating gene expression data or module enrichment. Cross-validating by study serves as a kind of “worst-case” scenario, whereas 10-fold cross-validation serves as a “best-case.” Attempting to classify active and inactive SLE patients from different data sets and different microarray platforms during cross-validation by study proved difficult, but module enrichment was able to smooth out much of the technical variation between data sets. 10-fold cross-validation simulated a more standardized diagnostic test. Although the data was sourced from three different microarray platforms, each cohort in the test set had many similar patients in the training set to facilitate classification by gene expression. If such a test could be reliably free from technical noise, it is likely that raw gene expression would perform very well. RNA-Seq platforms, which produce transcript counts rather than probe intensity values, may display less technical variation across data sets because they are not dependent on the binding characteristics of pre-defined probes that differ among arrays34. On the other hand, comparison of RNA-Seq and microarray samples has shown that the two methods can deliver highly consistent results35,36,37, so a microarray-based test could be feasible if it was only conducted on one platform. Further study to construct an optimal panel of genes similar to that identified by the random forest classifier could result in a simple, focused test to determine disease activity by gene expression data alone. Interestingly, module enrichment scores, which show little variation across platforms, could be used to develop diagnostic tests that leverage existing data sets, even if they are sourced from different platforms.

The strong performance of the random forest classifier indicates that nonlinear, decision tree-based methods of classification may be best suited to SLE diagnostics. This may be because decision trees ask questions about new samples sequentially and adaptively in contrast to other methods that approach variables from new samples all at once. Random forest is able to “understand” to an extent that different types of patients exist and that a one-size-fits-all approach will tend to misclassify those patients whose expression patterns make them a minority within their phenotype. To put it more simply, active patients that do not resemble the majority of active patients still have a strong chance of being properly classified by random forest.

We used the random forest classifier to assess the importance of each gene and module in patient classification. The most important genes were involved in a number of functions other than interferon signaling, such RNA processing, ubiquitylation, and mitochondrial processes. These pathways may play important roles in directing, or at least be indicative of, SLE disease activity. CD4 T cells originally contributed the most important modules, but when the modules were deduplicated, CD14 monocyte-derived modules gained importance. This suggests that unique genes expressed by CD14 monocytes in tandem with interferon genes may prove to be informative in the study of cell-specific methods of SLE pathogenesis. Futhermore, it is important to note that modules that were negatively associated with disease activity were just as important in classification as positively associated modules. Further study of underrepresented categories of transcripts should enhance our understanding of SLE activity.

One limitation of this study was the relatively small amount of data used to train and test the classifiers. Creating dedicated training and test sets is preferable to cross-validation, but it requires many samples. Although there are large numbers of publicly available gene expression profiles of SLE patients, many of these profiles are not annotated with SLEDAI data. Furthermore, some data sets which include SLEDAI data show heavy class imbalance, which impedes classification. Further work to integrate cross-platform expression data will be crucial to expanding our ability to classify active and inactive SLE patients.

The machine learning models tested here provide the basis of personalized medicine for SLE patients. Integration of our approaches with emerging high-throughput patient sampling technologies could unlock the potential to develop a simple blood test to predict SLE disease activity. Our approaches could also be generalized to predict other SLE manifestations, such as organ involvement. A better understanding of the cellular processes that drive SLE pathogenesis may eventually lead to customized therapeutic strategies based on patients’ unique patterns of cellular activation.

Methods

Compilation of gene expression data from SLE patients

Publicly available gene expression data and corresponding phenotypic data were mined from the Gene Expression Omnibus. Raw data sources for purified cell populations are as follows: GSE10325 (CD4: 8 SLE, 9 HC; CD19: 10 SLE, 8 HC; CD33: 9 SLE, 9 HC); GSE26975 (10 SLE LDG, 10 SLE Neutrophil, 9 HC Neutrophil); GSE38351 (CD14: 8 SLE, 12 HC). Raw data sources for SLE whole blood gene expression are as follows: GSE39088 (24 active, 13 inactive); GSE45291 (35 active, 257 inactive); GSE49454 (23 active, 26 inactive). 35 randomly sampled inactive patients were taken from GSE45291 to avoid a major imbalance between active and inactive SLE patients. Active SLE was defined as having an SLE Disease Activity Index (SLEDAI) of 6 or greater.

Quality control and normalization of raw data files

Statistical analysis was conducted using R and relevant Bioconductor packages. Non-normalized arrays were inspected for visual artifacts or poor hybridization using Affy QC plots. PCA plots were used to inspect the raw data files for outliers. Data sets culled of outliers were cleaned of background noise and normalized using RMA, GCRMA, or NEQC where appropriate. Data sets were then filtered to remove probes with low intensity values and probes without gene annotation data. WB gene expression data sets were filtered to only include genes that passed quality control in all data sets. At this juncture, differential expression (DE) analysis and Weighted Gene Co-expression Network Analysis (WGCNA) were carried out on data sets. WB gene expression data sets were then further processed before machine learning analysis. WB gene expression values were centered and scaled to have zero-mean and unit-variance within each data set, and the standardized expression values from each data set were joined for classification.

Differential expression analysis

Normalized expression values were variance corrected using local empirical Bayesian shrinkage, and DE was assessed using the LIMMA R package38. Resulting p-values were adjusted for multiple hypothesis testing using the Benjamini-Hochberg correction39, which resulted in a false discovery rate (FDR). Significant genes within each study were filtered to retain DE genes with an FDR < 0.2, which were considered statistically significant. The FDR was selected a priori to diminish the number of genes that might be excluded as false negatives. Rank-rank hypergeometric overlap between data sets was assessed using the RRHO R package40. Additional analyses examined differentially expressed genes with an FDR < 0.05.

Weighted gene co-expression network analysis (WGCNA) of purified cell populations

Log2-normalized microarray expression values from purified CD4, CD14, CD19, CD33, and low density granulocyte (LDG) populations were used as input to WGCNA to conduct an unsupervised clustering analysis, resulting in co-expression “modules,” or groups of densely interconnected genes which could correspond to comparably regulated biologic pathways41. For each experiment, an approximately scale-free topology matrix (TOM) was first calculated to encode the network strength between probes. Probes were clustered into WGCNA modules based on TOM distances. Resultant dendrograms of correlation networks were trimmed to isolate individual modular groups of probes by partitioning around medoids and labeled using color assignments based on module size. Expression profiles of genes within modules were summarized by a module eigengene (ME), which is analogous to the module’s first principal component. MEs act as characteristic expression values for their respective modules and can be correlated with sample traits such as SLEDAI or cell type. This was done by Pearson correlation for continuous or semi-continuous traits and by point-biserial correlation for dichotomous traits.

WGCNA modules from CD4, CD14, CD19, and CD33 cells were tested for correlation to SLEDAI. SLEDAI information was not available for the LDG modules, so the two modules provided are descriptive of LDGs compared to SLE neutrophils and HC neutrophils.

Plasma cell modules were generated by differential expression analysis and not WGCNA, but were included because of the established importance of plasma cells in SLE pathogenesis and their increase in active disease2.

Gene set variation analysis (GSVA)-based enrichment of expression data

The GSVA R package42 was used as a non-parametric method for estimating the variation of pre-defined gene sets in SLE WB gene expression data sets. Standardized expression values from WB data sets were used to test for enrichment of cell-specific WGCNA gene modules using the Single-sample Gene Set Enrichment Analysis (ssGSEA) method, which scores single samples in isolation and is thus shielded from technical variation within and among data sets. Statistical analysis of GSVA enrichment scores was done bv Spearman correlation or Welch’s unequal variances t-test, where appropriate. Effect sizes were assessed by Cohen’s d43.

Machine learning algorithms and parameters

We employed three distinct machine learning algorithms to test biased and unbiased approaches to microarray data analysis. The biased approach involved GSVA enrichment of disease-associated, cell-specific modules, and the unbiased approach employed all available gene expression data in the WB. An elastic generalized linear model (GLM), k-nearest neighbors classifier (KNN), and random forest (RF) classifier were deployed to classify active and inactive SLE patients and determine whether gene expression could serve as a general predictor of disease activity. GLM, KNN, and RF were deployed using the glmnet, caret, and randomForest R packages, respectively44,45,46.

GLM carries out logistic regression with a tunable elastic penalty term to find a balance between the L1 (lasso) and L2 (ridge) penalties and thereby facilitate variable selection. For our predictions, the elastic penalty was set to 0.9, specifying a penalty that is 90% lasso and 10% ridge in order to generate sparse solutions. KNN classifies unknown samples based on their proximity to a set number k of known samples. K was set to 5% of the size of the training set. If the initial value of k was even, 1 was added in order to avoid ties. RF generates 500 decision trees which vote on the class of each sample. The Gini impurity index, a measure of misclassification error, was used to evaluate the importance of variables47.

Validation approaches

The performance of each machine learning algorithm was evaluated by 2 different forms of cross-validation. First, a random 10-fold cross-validation was carried out by randomly assigning each patient to one of 10 groups. For each pass of cross-validation, one group was held out as a test set, and the classifiers were trained on the remaining data. Next, as the data came from three separate studies, study-based cross-validation was also done to determine the effects of systematic technical differences among data sets on classification performance. In this circumstance, the classifiers were trained on one data set and tested in the other two data sets. Accuracy was assessed as the proportion of patients correctly classified across all testing folds. Performance metrics such as sensitivity and specificity were assessed after cross-validation by agglomerating class probabilities and assignments from each fold or study. Receiver Operating Characteristic (ROC) curves were generated using the pROC R package48.