Paediatric systemic lupus erythematosus (pSLE) is a rare, clinically and genetically heterogeneous systemic autoimmune disease with a prevalence of between 3.3 and 8.8 per 100,000 children1. The disease course is unpredictable, with periods of remission and flares that lead to cumulative damage over time2. SLE is classified by the presence of at least 4 of 11 clinical criteria3, with disease activity (DA) severity calculated based on composite scores, including the Systemic Lupus Erythematosus Disease Activity Index (SLEDAI)4. Genome-wide association studies have identified more than 130 SLE-associated loci5, including those driven by interferons6, or those controlling inflammation and tissue response to injury7. Together these have been used to highlight the link between SLE and viral responses8. However, the trigger that initiates the expression of these genes and the progression of SLE disease remains poorly understood9.

Efforts to unravel the SLE gene expression pathway have been initiated. A 2016 study of paediatric disease examined the personal transcriptomic profiles of 158 patients using linear mixed models built on blood expression data from 15,386 transcripts10. The transcript panel utilised for this process considered each gene locus individually, and correlated the binary up- or down-regulation patterns with patient phenotypes. The result was the stratification of patients into distinct subclasses, with an enrichment of neutrophil expressed transcripts noted as a patient passed from the low DA1 state to the high DA3 form of disease. While the molecular pathways proposed by the study have led to a better understanding of personal disease progression10, the analysis lacked the co-predictive power of rule-based machine learning (RBML) models.

Machine learning (ML) approaches are well suited to address this process, as they can model and characterise data with very high dimensionality, such as that generated through personal transcriptomics. However, the majority of methods work as black boxes. These offer little to no explanation in terms of how, and why, a specific classification decision is made. For clinical -omics, understanding how a classification decision is made may offer insight into the underlying biological mechanisms, for example contrasting a disease state to healthy controls10. Interpretable ML methods, such as RBML models, offer classification transparency11,12. We applied RBML that is based on rough set theory13. It uses Boolean reasoning to identify the minimal sets of features (reducts) that can discern decision classes. Reducts are subsequently overlaid onto transcriptomics data samples to create IF–THEN rules. One of the main advantages of this method is co-prediction, i.e., the identification of descriptors that collaboratively correctly classify samples from the data into the outcomes. Co-prediction can provide insight into the candidate biological processes beyond what can be learnt through co-expression networks.

In the current study, we apply an RBML approach using rough sets to existing pSLE blood transcriptome data10. Here, the goal was to identify the genes and interactions that demarcate a low pSLE DA1 state from a high DA3 state. The disease sub-groups discovered were intersected with available clinical data, revealing gene sets key to the progression of disease and the involvement of the innate and acquired immune arms. These genes, and their protein products, have the potential to be translated to biomarkers, or could be suggested points for therapeutic intervention.


Minimum gene set model discerns DA1 from DA3

The two extremes of disease activity were defined by SLEDAI score (DA1: SLEDAI = 0–2; DA3: SLEDAI > 7). The initial rule-based model was built with R.ROSETTA14 using data from 629 unique patient clinical visits (observations) and the discretised gene expression value for each DA1 and DA3 patient visit (features: 33,006 probes for 629 observations; Fig. 1). This initial model had an overall prediction accuracy of 71% using tenfold cross validation (Supplementary Fig. S1 online). The observations (visits) incorrectly classified by the model (Supplementary Fig. S2 online) were pruned to achieve a better separation between DA1 and DA3, then intersected with the patient metadata in order to understand the potential reasons behind their misclassification. Observations were more likely to be pruned or removed based on patient treatment, low SLEDAI score or the number of days since diagnosis (Logistic regression p-value for all < 0.05; Supplementary Fig. S3 online; Supplementary Table S1). No significant association was observed between removed observations and clinical symptoms, and the significant associations were a reflection of observations removed from class DA3 (38%, 125/330 removed) rather than reductions from DA1 (53%, 157/299 removed).

Figure 1

Overview of the modelling process implemented to classify and interrogate gene expression relationships between DA1 and DA3.

Following Monte Carlo Feature selection (MCFS)15 on the pruned dataset, 4980 genes were available and subsequently used to build an enhanced rule-based model. Gene set enrichment analysis revealed terms connected to neutrophils (e.g., activation, mediation, degranulation) and the production and degradation of gene products (e.g., transcription initiation and nonsense-mediated decay; Supplementary Fig. S4 online). This suggests a difference in neutrophil mediated immune response between patients with DA1 and DA3, a known functional shift in SLE manifestation between disease states4.

Feature boosting was performed to identify the optimal number of genes for the model (Fig. 1). Empirical studies revealed that model accuracy was lost if more than 200 of the top 4980 MCFS ranked genes were used for this process (Supplementary Fig. S5 online). Iterative R.ROSETTA computational rounds added genes from the starting set of 200, with maximum model accuracy of 81% achieved with a minimum set of 34 genes (Fig. 2; Supplementary Table S2 online). These genes were used in 22 and 44 classifying rules for DA1 and DA3 respectively. The model mirrored the structure of the initial model (Supplementary Fig. S1 online). Figure 2 shows DA1 and DA3 were again split, however with a reduction of complexity, in terms of rules (edges) connecting the genes (nodes) and a refinement of the central hub genes. The 10 percentage point gain in model accuracy provided improvement in terms of a clearer and visible separation between the disease activities in the rule networks (RN); this gain in accuracy was too small to imply an overfitting of the model. The similarity between the network of the initial model and the enhanced model implied that removed objects were unnecessary for classification of DA1 and DA3 since their removal did not significantly impact the main network structure or the rule model.

Figure 2

The rule networks discern the disease states. DA1 is largely defined by medium gene expression, whereas DA3 includes more genes, and those that were highly expressed. For each decision class, internal node colour indicates discretised gene expression value (high, medium, low; orange, grey, blue), node size is proportional to the number of objects supporting rules associated to a node, node border thickness is proportional to the number of rules associated to a node (low, high; circle border thin, thick) and edges connecting nodes represent normalised connection values (< 55%, ≥ 85%; grey, red with increasing line thickness per support interval). The latter is the strength of the co-appearance of connected nodes in rules supporting a decision class. The network was filtered to visualise rules with minimum support of 10% and rule p-value ≤ 0.05.

In DA3, hub gene CKAP4 was surrounded by a thick blue border, indicating the importance of this gene to predicting this disease state. In fact, CKAP4 was a member of 14/44 co-prediction rules (Supplementary Table S2 online). The protein product of this gene, CKAP4, formerly CLIMP63, can act to regulate endoplasmic reticulum (ER) nanodomain homeostasis via shaping the luminal space or through interaction with other ER-resident proteins16. CKAP4 was highly expressed (orange), whereas connected gene SEC11C showed a medium level of discretised expression (grey), and RPS14 was lowly expressed (blue). In DA1, IFI35 and KLRB1 were both hub genes with medium expression levels. However, the latter had a larger number of observations supporting its membership to rules (larger node size) but contributed to slightly fewer rules than IFI35 (thinner circle border size: IFI35, 6/22 rules; KLRB1, 4/22 rules). CD161/NKR-P1A, encoded by KLRB1, is a surface receptor of natural killer (NK) cells and subtypes of T lymphocytes, whereas IFI35 encodes the Interferon-induced 35 kDa protein, a proinflammatory damage-associated molecular pattern (DAMP) molecule in the innate immune pathway17.

The membership of genes to the rule networks was not exclusive. For example, both IFITM1, which encodes interferon-induced transmembrane protein 1, and KLRB1 appeared in DA1 and DA3 although with different expression values (Supplementary Table S2 online). The sharing of genes across rules was more common in DA1 (4/12 plotted genes are unique to DA1) than DA3, where 18/26 were unique to that class. The model showed that the type 1 interferon response term was limited to DA1 (IFI35 and IFITM1) whereas B-cell activation was restricted to DA3 (CD38 and IGLL1) (Supplementary Fig. S6 online). However, while each term was enriched based on very few genes, it should be noted that these genes were present in multiple rules.

Patient subgroups reflect clinical manifestations

Hierarchical clustering of the enhanced model results (i.e., the membership of observations for each rule) revealed five subgroups largely contained within DA1 (C1 and C2) or DA3 (C3, C4 and C5) (Fig. 3A). The model and sub-groups were tested for significance to confirm that they could not be attained by using random data. The significance was tested using permutation of DA state (p-value ≤ 0.05). These sub-clusters were subsequently projected onto the RNs (Fig. 3B). Of note, the C1 and C2 sub-clusters were not restricted to the DA1 rule set; however, C4 and C5 reflected partially intersecting networks that were all included in C3 and limited to DA3 (Fig. 3B). In comparison to C3, the DA3 hub gene CKAP4 was absent from C4, whereas the two small unconnected DA3 networks were absent from C5. Due to the small number of genes available for consideration, a sub-cluster-based gene enrichment analysis was not informative for all sets. The C4 and C5 enrichments were largely based on the combination of two genes (MT1F, MT1A; 11 genes available) and suggested response to ion levels (see Supplementary Fig. S7 online), whereas C3 was again led by a small number of gene combinations (e.g., cell cycling and division: CDC20, PTTG1, PTTG3P, UBE2C; B-cell pathways: CD38, GBA, TYM) but this cluster also included the MT1F and MT1A signals.

Figure 3

Hierarchical clustering of the model rules showed the major subdivision between the DA clusters. (a) Supported rules (black) and unsupported rules (grey) distinguish five disease subgroups that were projected into the (b) RN where group (cluster) membership is indicated by pie colour.

The relationship between clinical phenotype (Supplementary Table S3) and sub-cluster was explored in two ways. First, to assess clinical association to a sub-cluster, the phenotype values supporting that sub-cluster were compared to those that did not. To interrogate which rule(s) were driving that pattern, a similar assessment was performed, this time for visits supporting a rule within the sub-cluster. The examination of continuous phenotypes showed that these measures were only significantly different between the three DA3 clusters and not between the two DA1 clusters (Tukey HSD adjusted p-value < 0.05; Supplementary Table S4 online). However, for DA1, the C1 and C2 clusters did contain the majority of low SLEDAI score visits (~ 1.7 in each, Supplementary Table S4, Supplementary Fig. S8 online), with C1 tending towards lower alanine aminotransferase (ALT) and serum creatinine (CR) values compared with the C2 cluster. As expected, the DA3 cluster contained the higher SLEDAI scores (C4 ~ 8.8, C5 ~ 12.1, C3 ~ 14.6). C4 was largely reflective of low measures for anti-dsDNA antibody, erythrocyte sedimentation (ESR) and white blood cell count (WBC). C5 presented lower ALT and aspartate aminotransferase levels (AST), while C3 was most representative of active disease, with low complement factor C3 and C4 values (Supplementary Table S4, Supplementary Fig. S8 online). Only two phenotypes, lymphocyte percent (LP) and neutrophil percent (NP), were significantly different in all pairwise DA3 cluster comparisons. LP was highest in C4 and NP, highest in C5. C3 was intermediate for both (Supplementary Table S4, Supplementary Fig. S8 online).

In terms of categorical phenotypes, no significant association was detected between sex or race for each of the five clusters. In C1, the alopecia category was enriched when compared with all others (Fisher exact test p-value = 0.04; Supplementary Fig. S8 online). In C2, the musculoskeletal term and both oral steroid and nephritis treatment groups were enriched (all Fisher exact test p-value < 0.05). Treatment could not be ruled out as the factor driving differences between this and other clusters (Supplementary Fig. S9 online).

Rules reveal which gene co-predictions drive phenotype correlation

To interrogate which genes and rules drove the phenotypic associations, a closer examination of the rules within the clusters was performed. To associate rules to the discovered clusters, a frequency distribution was built for all rules with a support set matching at least 10% of the visits assigned to each of the discovered clusters. Based on the distribution, a 20% match was used as the empirical threshold for assigning rules to each cluster (Supplementary Fig. S10 online). Figure 4 illustrates the fraction of rules from each cluster that were significantly associated with a phenotype, either continuous or categorical. Overall, rules from C1 or C3 were significantly associated with all phenotypes displayed (Fig. 4), an enrichment not seen with the other clusters. Interestingly, whilst no individual continuous phenotype was significantly different between the two DA1 clusters, or categorical phenotype different between the DA3 clusters, the graphs clearly showed that the same was not true for the proportion of rules significantly associated with a phenotype in either class.

Figure 4

Fraction of rules per cluster significantly associated with (a) continuous and (b) categorical phenotypes. See Supplementary Table S3 online, for a list of clinical variables and phenotypes abbreviations.

For example, in the continuous class, rules from both DA1 clusters were significantly associated with lymphocyte count (LC; Fig. 4a). There, three unique rules were contributed by C1 (rules 5, 44, 56), whereas the fourth rule was shared by both clusters (rule 41: KLRB1, SEC11C; Supplementary Table S5 online). Interestingly, of the seven genes contained across the four rules, only the gene encoding the signal peptidase complex catalytic subunit, SEC11C, showed decreased expression; all others had medium values. This maintenance of gene expression likely explained the overall lack of significant difference between clusters for this trait.

For the DA3 clusters, a significant difference was recorded for the complement factor C3 phenotype between the C3 cluster (mean 62.1 mg/dL) and the C5 cluster (mean 85.9 mg/dL) (Wilcoxon test p-value = 2.4 × 10⁻³; Supplementary Table S4 online). An examination of the rules associated with phenotype C3 revealed that 17 rules were significantly linked to this phenotype in cluster C3, whilst only eight were found in the C5 cluster (Supplementary Table S5 online). All C5 rules were shared with C3, and no rules were contributed from C4 (Fig. 4a). As expected from the associated RN, none of the nine rules unique to cluster C3 showed discrete gene membership; rather, they served to illustrate how, in comparison to C5, rules represented by network edges could introduce additional unique features that may serve to explain the phenotypic difference. For example, shared rule 15 (CKAP4, MT1F) can form an extended connection with rules 4 (MT1F, KLRB1), 23 (CKAP4, SEC11C) and 51 (MT1F, PTTG1), widening this network to include genes KLRB1, SEC11C and PTTG1 (Supplementary Table S6 online). Each of these genes had previously been associated with SLE, but the link was not always clear. As noted before, KLRB1, expressed by NK cells and shown to be in the medium discretised expression level here, has been implicated in the regulation of the interferon gamma immune response18. SEC11C encodes a subunit of the microsomal signal peptidase complex and was the only DA3 gene maintained within medium levels for this phenotype. This gene was previously shown to be significantly down-regulated in the T cells of adult SLE patients with low complement levels17. PTTG1 was previously linked to SLE via SNP association19, although it was later shown that the risk allele was tagging the nearby microRNA, miR-146a, and this was down-regulated in European disease20.


The use of machine learning in the current study has served to identify the key regulatory networks that underlie two disease states, DA1 and DA3, of the highly heterogeneous condition, paediatric systemic lupus erythematosus (pSLE). In doing so, the high dimensionality of data drawn from 33,006 gene expression measures across 629 paediatric patient visits has been reduced to co-predictive networks linked via genes. These genes were under-represented or down-weighted in published studies of SLE differential gene expression (DGE) profiling (Supplementary Fig. S11 online). The result here was five sub-networks; two distinguishing DA1, perhaps as a result of treatment response, and three subgroups not related to treatment, within the more severe DA3 disease state.

Two major factors underpinned the difference in the results observed here, versus those generated by others in the field. The first was the study of patient visits, rather than individuals over time via longitudinal gene expression. The second was methodological, as RNs are co-predictive and as such, are conceptually different from co-expression networks. The goal here was to delve into the co-predictive RNs based on gene expression at different stages of disease, potentially creating a set of biomarkers, which could be used to stratify patient subgroups for clinical trials or personalised medicine based on their disease state at a particular time. This contrasts to the prognostic goals of others using the same dataset10,21.

Let us set the scene. For the transcriptomic data analysed here, the nodes of an RN are genes and their discretised expression values. The edges between two nodes of an RN are formed from pairs of genes and their discretised expression values as they co-occurred in the IF-part of rules (Fig. 2). Significantly, in one outcome a gene may have one discretised value, but in the other outcome it will have a different value. It follows that each outcome has its own network. As such, co-prediction can provide insight into the candidate biological processes characteristic of the given outcome. For example, one combination of descriptors, i.e., a pair of genes and their discretised values, may be associated with the DA1 state, and another pair with DA3. This is in contrast to co-expression networks, which identify genes that are co-expressed but not necessarily co-predictive of the outcomes.

SLE is a condition that spans the axes of both autoinflammatory and autoimmune disease. In this study, three DA3 subgroups were identified. The C3 sub-group sits on the autoimmune side, and had the clinical hallmarks of hypocomplementemia (low C3 and C4 clinical measures) in combination with high anti-dsDNA values, whilst the C4 sub-group likely represented the autoinflammatory side, with normal complement levels and low anti-dsDNA values (Supplementary Table S4 online). This was reinforced by the higher SLEDAI scores observed in C3 versus C4. Cluster C5 likely represented the intermediate stage between C3 and C4, where a significant shift between neutrophil and lymphocyte involvement is observed. This could translate to an immune complex driven disease state in C5, where the type I interferon process was active (low lymphocyte percent and increased neutrophil involvement). In studies using independent patient groups, both changes in complement ratio (C3/C4)22 and the categorisation of neutrophil to lymphocyte ratio (NLR)23, have been suggested as ways to distinguish SLE patient groups. Here, network analysis and unsupervised clustering combined both C3/C4 and NLR biomarker sets and resulted in three separate groups spanning these factors. The novelty in the current study lies in linking the clusters to co-predictive RNs, and this was the second major factor differentiating this work from others.

While the application of machine learning approaches to the big data sets generated by biology -omics is not new24, the approach used here removes the ‘black box’ interpretation of both the modelling and the results. This is required in the trade-off between predictability and interpretability25. Here we accepted the potentially reduced, but still high prediction accuracy of 81%, in favour of transparent classical models that perform well when the number of features available in the dataset (i.e., genes) outnumbers the observations26. It is important to note that the rough sets approach to constructing rules is based on finding all minimal subsets of features that preserve discernibility of the decision classes from the original set. The rules will contain conjunctions of genes that may reflect different levels of gene regulation but that do not need to be co-expressed. In RNs, the genes and their regulation levels are associated to the outcome and discern the decision classes (here DA1 or DA3) based on the training data, while in co-expression networks the genes are co-expressed with other genes and may not discern the outcomes. The R.ROSETTA method used for constructing the model has been shown to outperform other existing rule-based methods14, and has the key distinction of being the only method that can compute a significance level for the rules in the model. This is useful for calculating model prediction reliability, but it is the use of a minimum set of significant rules that served to highlight the genes contributing most strongly to the separate networks.

In practice, this was illustrated by the hub genes for DA1 (e.g., IFI35, KLRB1) and DA3 (e.g., CKAP4, OTOF; Fig. 2). IFI35 expression is stimulated in response to IFN-α/γ27 and it can act intracellularly as a negative switch in the innate immune pathway via retinoic acid-inducible gene I regulation28. Extracellularly, the opposite effect has been observed, and the IFI35 molecule can act as a DAMP, and serve to activate the NF-κB pathway in macrophages via TLR4 signalling17. The end result is the release of proinflammatory cytokines, including interleukin 6 and tumour necrosis factor17. In DA1, IFI35 expression is observed within the medium range, but a change in this value could be key in driving DA1 patients back to a remissive or inactive SLE state. Likewise, the maintained medium expression of KLRB1 (encoding the surface receptor CD161) suggests a role for other cell sets, including natural killer (NK) cells and T lymphocytes in this lower disease state. The cell population expressing CD161 has been shown to be lower in SLE patients versus controls29. This is intriguing as this receptor can mark the NK cells that respond to innate cytokines and so promote innate inflammation30. Here again we see a contradiction between the promotion and reduction of the innate immune response.

While CKAP4 was shown as a highly expressed hub gene in DA3, the protein product is most often reported to have a role in cancer, for example acting with RBP1 to induce autophagy in murine models of oral squamous cell carcinoma31. Autophagy can also play into the pathogenesis of SLE in a number of ways. Dysregulated autophagy can affect the regulation of T and B cell populations32, and increased autophagy can promote the NF-κB pathway response33. Through its interaction with ER-resident proteins, CKAP4 also has the potential to regulate or reflect the current state of cellular immune signalling15. For the individuals studied here, increased levels of CKAP4 may not be driving disease, but the finding opens a potential line of anti-CKAP4 antibody drug development for SLE patients; an avenue previously only promoted for cancer treatment34. Another DA3 hub gene, OTOF, is an interferon inducible gene, and has been recognised by others as a marker for SLE disease flares35. This is in keeping with the finding of OTOF in the C3 and C5 clusters, but not in C4. Recently it was suggested that through interaction with melatonin, OTOF may have a role in proteasome inhibition36, and so could function in the downstream signal transduction pathway of NF-κB37. While that study was focused on neuronal survival driven by melatonin ubiquitin proteasome system inhibition, a protective anti-inflammatory role of melatonin in SLE pathogenesis has been reported previously38,39. Gene networks acting through the fulcrum of OTOF may help to explain this action, and suggests that further investigation of melatonin treatment in SLE flare could be warranted.

The current analysis aimed to explore the different networks that underlie pSLE disease states with the goal of developing a minimum set of rules that could discern disease states DA1 from DA3. It is worth mentioning that we did not aim to model the entire spectrum of pSLE disease activity, so we chose the objects that could optimally and clearly separate between DA1 and DA3 states and highlight their subgroups. This required pruning the misclassified objects from the initial model. A logistic regression was used to reveal the probability of object removal (Supplementary Fig. S3 online, Supplementary Table S1). More specifically, this analysis showed that the enhanced model applies to a subset of SLE patients who have had no prior treatment, do not have very low SLEDAI scores, or have not had a markedly long period of time between diagnosis and clinic visit. If the pathways are to be generalised, these factors must be accounted for prior to using the model in a clinical setting.

The enhanced model showed clearer sub-networks even though the gain in accuracy was only 10 percentage points. While the networks generated here are based on a single gene expression set, multiple lines of evidence from previous SLE studies support their value; whether that be in classifying sub-cluster patient states or indicating possible treatments based on hub genes. It will be important to test the predictive, or replicative, ability of the gene networks to classify additional SLE patient sets. This includes generalisation to known patient subcategories, such as those with nephritis, and the wider adult SLE population. However, the permutation analysis conducted here suggests that this should be possible. We believe that machine learning approaches, such as the one demonstrated here, could aid disease understanding. This applies not only to SLE, but to any complex heterogeneous syndrome.


Figure 1, an overview of the analysis pipeline was generated with resources.

Data and pre-processing

Publicly available whole blood transcriptome records (Illumina HT-12 V4 bead chip) and clinical metadata from 158 pSLE patients and 48 healthy controls were downloaded (NCBI GEO: GSE65391)9. All the procedures were performed in accordance with the relevant guidelines and regulations. From that existing data set, the values corresponding to DA1 (SLEDAI 0–2), DA3 (SLEDAI > 7) and control visits were extracted. In this analysis, the transcriptome generated per visit to the clinic, and not per patient lifetime, was considered. As such, an individual may be represented in the analysis multiple times (between 1 and 15 times) if their disease status at the time was classified as DA1 or DA3 (Supplementary Fig. S12 online). For expression data, gene loci represented by more than one probe were combined and averaged, before each gene locus was log transformed. Following a linear mixed model used to identify biological and technical confounders (Variance Partition R package40), no biological confounders were identified. Batch effects were identified (Variance Partition R package41) and corrected (SVA R package40). The batch effects identified here were limited to the reported batch replicates from the original metadata (batch 1 and 2) and not found for other phenotypes (Supplementary Fig. S13 online).

Machine learning rule-based modelling to obtain explainable classifiers for DA state

For methodological context, we applied an interpretable learning method based on rough sets that offers classification transparency11,12. Given data in the form of a decision table, where rows represent observations and columns are features with the last column being the outcome or decision, rough set algorithms find all minimal subsets of features that preserve discernibility between the outcomes for the observations. These subsets of features are called reducts, and are used to generate IF–THEN rules by overlaying them on the observations. An IF–THEN rule consists of a condition part, the IF-part, often called the left-hand side, and a THEN-part, the decision or outcome, often called the right-hand side of the rule. The elements of the IF-part are called descriptors, and each takes the form of a pair: a feature and its value. To aid interpretation, the rules generated by the model were visualised as RNs, where the nodes are descriptors. For every pair of descriptors in a rule of the RBML model, an edge connecting the corresponding nodes is added to the network.
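The structure of rules and rule networks described above can be sketched in a few lines of Python. This is an illustration only, not the R.ROSETTA or VisuNet implementation: the gene pairs are taken from rules named in the Results, but the expression levels attached to them, and the helper names `rule_network` and `matches`, are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# A descriptor is a (gene, discretised level) pair; a rule is (IF-part, decision).
# Gene pairs follow rules mentioned in the text; the levels are illustrative assumptions.
rules = [
    ((("CKAP4", "high"), ("SEC11C", "medium")), "DA3"),
    ((("CKAP4", "high"), ("MT1F", "high")), "DA3"),
    ((("KLRB1", "medium"), ("SEC11C", "low")), "DA1"),
]


def rule_network(rules, decision):
    """Count how often each pair of descriptors co-occurs in rules for one decision class."""
    edges = Counter()
    for descriptors, outcome in rules:
        if outcome != decision:
            continue
        # Every pair of descriptors in the IF-part contributes one edge.
        for a, b in combinations(sorted(descriptors), 2):
            edges[(a, b)] += 1
    return edges


def matches(rule, observation):
    """Return True when the observation satisfies every descriptor in the IF-part."""
    descriptors, _ = rule
    return all(observation.get(gene) == level for gene, level in descriptors)


da3_edges = rule_network(rules, "DA3")
```

Because a separate network is built per decision class, the same gene can appear in both networks with different discretised values, as described for KLRB1 and IFITM1.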

First, expression values were subject to data discretisation, since R.ROSETTA14 generates rules for that data form. For each gene, the control data expression mean (μ) and standard deviation (σ) were calculated, and then all DA data for that gene were projected onto this threshold frame and discretised (Low < μ − 2σ; μ − 2σ ≤ Medium < μ + 2σ; High ≥ μ + 2σ; coded by numeric values 1, 2, 3, respectively).
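A minimal sketch of this discretisation step is given below, using only the Python standard library. It assumes that the upper threshold is μ + 2σ, that values exactly on a threshold fall into the higher bin, and that σ is the sample standard deviation; the function name `discretise` and the toy control values are hypothetical.

```python
from statistics import mean, stdev


def discretise(value, control_values, k=2.0):
    """Code an expression value as 1 (low), 2 (medium) or 3 (high) relative to controls."""
    mu = mean(control_values)
    sigma = stdev(control_values)  # sample standard deviation (an assumption)
    if value < mu - k * sigma:
        return 1
    if value < mu + k * sigma:
        return 2
    return 3


# Toy control expression values for one gene: mean 10, standard deviation about 1.58.
controls = [8.0, 9.0, 10.0, 11.0, 12.0]
codes = [discretise(v, controls) for v in (6.0, 10.0, 14.0)]
```

Anchoring the thresholds to the healthy controls means that "medium" can be read as expression within the control range, rather than as a mid-point among patients.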

To generate the initial model, data was first collected into a decision table where unique visit identifiers were the objects and put in rows (n = 629), while genes (n = 33,006) were variables and constituted columns. The objects were labelled with disease activity, DA1 or DA3, accordingly. Next, Monte Carlo Feature selection (MCFS) algorithm14 was applied to obtain a ranked list of informative features with respect to classifying the objects. A significance cut-off for selecting features from the ranked list was obtained by a permutation test (p-value ≤ 0.05). Feature boosting was applied to select the optimal number of features to build the model and then the rule model was visualized with the VisuNet R package42.

The initial rule-based model defined above was used as a base to further improve classification. Data (DA1 or DA3 visits) that did not match the left-hand side of any significant rules in the previous model were removed (p-value < 0.05). The MCFS15 process was then repeated after object removal. Prior to building the enhanced rule-based model, iterative computational rounds were performed (Feature boosting in Fig. 1) in order to select the optimal number of features for building the final predictive model. The significant features from MCFS output were incrementally added to build several rule-based models. The selected features that were used to build the model with the best overall accuracy were chosen for building the final enhanced model using R.ROSETTA14 and then visualised using VisuNet42.
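The incremental selection can be sketched as a simple loop over the ranked feature list. The sketch below is not the authors' code: `evaluate` is a hypothetical callable standing in for training a rule-based model on the candidate features and returning its cross-validated accuracy, and the one-feature step size is an assumption.

```python
def feature_boosting(ranked_features, evaluate, step=1):
    """Grow the feature set along the ranking and keep the most accurate prefix."""
    best_features, best_accuracy = [], float("-inf")
    for n in range(step, len(ranked_features) + 1, step):
        candidate = ranked_features[:n]
        accuracy = evaluate(candidate)  # e.g. cross-validated model accuracy
        # Strict comparison: on ties, the smaller feature set is retained.
        if accuracy > best_accuracy:
            best_features, best_accuracy = candidate, accuracy
    return best_features, best_accuracy


# Toy scoring function whose accuracy peaks at three features.
ranked = ["g1", "g2", "g3", "g4", "g5"]
selected, accuracy = feature_boosting(
    ranked, lambda feats: 0.8 - 0.05 * abs(len(feats) - 3)
)
```

Keeping the smallest set among equally accurate candidates is a design choice that favours a minimal model, in the spirit of the minimum gene set reported in the Results.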

In order to identify patient subgroups, a matrix was constructed with maintained observations (visits) as rows and rules as columns. All cells for the observations that supported a rule were assigned 1, otherwise 0. Hierarchical clustering based on binary distance as the distance function was applied on this matrix.
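As an illustration of this step, the sketch below builds a small visit-by-rule support matrix and clusters it. Two assumptions are made: the binary distance is taken to be the proportion of positions in which exactly one vector is 1 among positions in which at least one is 1, and average linkage is used, since the linkage method is not stated in the text. The naive implementation is for exposition and would be slow on the full data set.

```python
def binary_distance(u, v):
    """Share of positions where only one vector is 1, among positions where either is 1."""
    either = sum(1 for a, b in zip(u, v) if a or b)
    differ = sum(1 for a, b in zip(u, v) if a != b)
    return differ / either if either else 0.0


def cluster(rows, n_clusters):
    """Naive average-linkage agglomerative clustering; returns lists of row indices."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between members of the two clusters.
                total = sum(
                    binary_distance(rows[a], rows[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Rows are visits, columns are rules; 1 means the visit supports the rule.
support = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
groups = cluster(support, 2)
```

In this toy matrix the first two visits support the same rules and are merged first, and the remaining two, which share one rule, form the second group.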

Correlating clusters to clinical and phenotypic data

Available metadata, including continuous and categorical clinical values (Supplementary Table S3), were accessed10. For continuous variables, a one-way ANOVA followed by a post-hoc Tukey HSD test was used to compute significance. A Fisher's exact test was used to assess the association of categorical variables with sub-clusters.

Correlating rules associated with clusters to clinical and phenotypic data

Empirical values were used to determine the minimal threshold for rule membership to clusters. Rules were considered associated with a cluster if they had a support set matching at least 10% of the cluster’s support set (i.e., observations associated with a cluster; Supplementary Fig. S14 online). The association between a cluster’s supported rules and clinical phenotypes was assessed by contrasting phenotype values for supported samples of each rule versus the non-supported samples (continuous variables, non-parametric Wilcoxon test; binary variables, Fisher’s exact test). Supplementary Fig. S15 online illustrates this process.
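For a binary phenotype, the contrast between supported and non-supported samples reduces to a 2 × 2 table. The sketch below shows one way to build that table and compute a two-sided Fisher's exact p-value with the Python standard library; the function names and the toy vectors are hypothetical, and the two-sided rule used (summing all tables no more probable than the observed one) is the common convention rather than a detail confirmed by the text.

```python
from math import comb


def contingency(supports_rule, has_phenotype):
    """Build [[supported & present, supported & absent], [unsupported & present, unsupported & absent]]."""
    table = [[0, 0], [0, 0]]
    for s, p in zip(supports_rule, has_phenotype):
        table[0 if s else 1][0 if p else 1] += 1
    return table


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact p-value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Toy data: six visits, rule support and a binary phenotype per visit.
supports_rule = [1, 1, 1, 0, 0, 0]
has_phenotype = [1, 1, 1, 0, 0, 0]
table = contingency(supports_rule, has_phenotype)
p_value = fisher_exact_two_sided(table)
```

With only six visits, even a perfect split gives a p-value of 0.1, which illustrates why rules with small support sets rarely reach significance.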

Model validation

The decision label (DA1 or DA3) was permuted 1000 times and rule-based models were created for these random sets. A normal distribution was built for the model accuracies and an alpha of 0.05 and a 95% confidence interval used to determine the significance of the p-value. The mean, standard deviation and the standard error for the normal distribution were computed. The accuracy of the original model was compared to the mean μ and standard error σ. If the accuracy of the original model was smaller than μ − σ or greater than μ + σ then the p-value in this case was < 0.05.
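A label-permutation check of this kind can be sketched as follows. This is a simplified stand-in rather than the exact procedure above: it fits a normal distribution to the null accuracies using their standard deviation and reports a two-sided tail probability, and the null accuracies are simulated here instead of being obtained by retraining rule-based models on permuted labels. The function names and all numeric values other than the 81% accuracy and 1000 permutations are assumptions.

```python
import random
from statistics import NormalDist, mean, stdev


def permute_labels(labels, rng):
    """Return a shuffled copy of the decision labels (DA1/DA3)."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


def permutation_p_value(observed, null_accuracies):
    """Two-sided p-value of the observed accuracy under a normal fit to the null accuracies."""
    mu, sigma = mean(null_accuracies), stdev(null_accuracies)
    z = (observed - mu) / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))


rng = random.Random(0)
labels = ["DA1"] * 5 + ["DA3"] * 5
shuffled = permute_labels(labels, rng)

# Stand-in for 1000 models trained on permuted labels: accuracies near chance.
null_accuracies = [rng.gauss(0.5, 0.03) for _ in range(1000)]
p_value = permutation_p_value(0.81, null_accuracies)
```

In a real analysis each null accuracy would come from refitting the model on `permute_labels(labels, rng)`, which is the computationally expensive part of the procedure.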

Gene enrichment analysis

Overrepresentation of gene sets belonging to each cluster and the gene sets belonging to rules in DA1 and DA3 was determined using the R package clusterProfiler43. The background list was set as initial set of 33,006 available loci.