Introduction

Osteoarthritis (OA) is a disease characterized by painful deterioration and destruction of articular cartilage1. It is a whole joint disease involving, in the case of knee OA, four tissues: cartilage, synovium, meniscus and subchondral bone2. OA is highly heterogeneous, which makes it difficult to characterize in terms of clear disease phenotypes3 or to fully understand its pathophysiology in terms of the responsible biological functions, disease-associated genes and risk loci4. To date there are no disease-modifying drugs, only pain-relief treatments, and compounds targeting the prototypic players in inflammation and extracellular matrix (ECM) physiology have either failed to provide significant improvements or are still in clinical trials5.

Systems-oriented approaches have been employed in many OA studies using various experimental platforms and computational methods6. One application is the use of genome-wide gene expression data (DNA microarray/RNA-seq) to identify overexpressed genes in diseased tissues and pinpoint molecular mechanisms and cellular functions related to OA7,8,9. The latter studies combined this information with other experimental platforms (mass spectrometry proteomics and DNA methylation) or used network-based approaches to find pathways regulated during the development of OA. A limitation of differential gene expression and pathway analysis is that it relies on multiple statistical tests and arbitrary cut-off thresholds that affect the results10. Another approach to processing gene expression data is to construct networks using the co-expression of genes as the connectivity measure11. The most prominent method is weighted gene co-expression network analysis (WGCNA), which allows the construction of co-expression networks and the identification of modules preserved between different datasets12. Applied to OA, the study by Mueller et al.13 used WGCNA to identify preserved gene modules by comparing human and rat studies.

When it comes to drug discovery, systematic approaches using network-based technologies and ‘omics platforms are getting increasing attention with many different methodologies developed and applied in the recent years14. The core idea is to unravel the molecular mechanisms of diseases and use this information for a systematic evaluation of pharmacological compounds. As an example, the study by Nacher et al.15 used information from 17 proteomic studies in healthy and OA chondrocytes to develop an OA-interactome and utilized network approaches to identify drugs.

Combining these two ideas, using co-expression networks to identify biological functions in OA and then, based on this information, suggesting possible pharmaceutical compounds affecting these functions seems like an interesting option to explore.

Thus, the aim of this paper is twofold. First, WGCNA is used to identify common disease mechanisms in OA joints, characterized by gene modules preserved across the relevant tissues (cartilage, synovium, meniscus and subchondral bone). Second, based on this information, drug candidates are inferred using network-based approaches.

Results

Weighted gene co-expression network analysis

Module identification

The WGCNA algorithm was run on the gene expression data of four datasets, each comprising 11461 genes, without distinction between healthy and OA (n = 88). In total, 1933 genes in 25 different modules (31–285 genes per module) were identified as co-expressed and preserved across all tissues, as seen in Fig. 1. Grey denotes non-preserved genes.

Figure 1

Hierarchical cluster dendrogram and the identification of co-expressed modules. Colours represent the preserved modules; grey denotes the non-preserved genes.

Module stability

Both approaches, re-sampling with replacement and 10% removal of the samples, deliver median values of ~72% and 78% of preserved module genes when compared to the original unmodified dataset. A boxplot of the preserved genes for each method can be found in Supplementary Fig. S7. Gene dendrograms and module colours similar to Fig. 1 for all the stability analyses are included in Supplementary Figs S6, S7.

Meta-module identification

Eigengenes for each module and each tissue were calculated and a dissimilarity consensus matrix DISCONSMEij (Eq. (6)) of the eigengene adjacency AMEij was computed. The consensus matrix is shown as a hierarchical co-clustering plot in Fig. 2a. Multi-dimensional scaling (MDS) together with k-means clustering (cluster number = 6) was applied on the DISCONSMEij in order to identify meta-modules. Figure 2b represents the MDS plot with the modules eigengenes and the meta-modules.

Figure 2

Meta-module identification. (a) Hierarchical co-clustering and heat-map of the dissimilarity consensus matrix DISCONSMEij. Red: low dissimilarity of the MEs, Blue: High dissimilarity of the MEs. (b) Multidimensional scaling with k-means clustering. Colours correspond to the meta-modules (MMs) that will be analysed further.

Preservation of meta-modules across tissues

The MM preservation across the tissues was quantified via differential eigengene network analysis (after computing eigengenes for every meta-module) according to Eqs (8–10). The results are presented in Fig. 3. The first row (A.-D.) shows hierarchical clustering dendrograms of the MM dissimilarity consensus matrix DISCONSMMij. In other words, the dendrograms show how the meta-modules are related to each other in terms of their respective co-expression. For example, MMgreen is very different from MMred in the synovium dataset (Fig. 3C). The main diagonal (E., J., O., T.) shows the adjacencies of the MM eigengenes for each tissue. The upper triangle (F., G., H., K., L., P.) shows the preservation statistics between two tissues. The height of the bars represents the scaled connectivity C (Eq. (9)) for each meta-module. The value D represents the density of the preservation network (Eq. (10)). In both cases values close to 1 indicate ideal preservation. Across all tissues a median value of D = 0.72 is observed. Pairwise comparisons show that preservation between meniscus and cartilage is almost perfect, whereas subchondral bone vs. cartilage exhibits the worst preservation with D = 0.63. The lower triangle (I., M., N., Q., R., S.) shows the adjacency heatmaps for the pairwise preservation networks of the tissues (Eq. (8)), with rows and columns corresponding to the respective meta-modules. Saturation of red indicates high preservation. Once again, meniscus and cartilage show high preservation, whereas the preservation between subchondral bone and cartilage is rather low. In summary, the identified meta-modules are preserved across tissues; however, large differences in preservation quality are observable.

Figure 3

Differential eigengene network analysis across four joint tissues meniscus, subchondral bone, synovium and cartilage. A.-D.: Hierarchical clustering dendrograms of dissimilarity of MM eigengene adjacencies. Main diagonal (E., J., O., T.): MM adjacencies for every tissue. With 1 meaning high similarity and 0 meaning low similarity. Upper triangle (F., G., H., K., L., P).: Preservation statistics for all pairwise comparisons between the tissues according to Eqs (9) and (10). Lower triangle (I., M., N., Q., R., S.): Adjacency heatmaps for the pairwise preservation networks of the tissues according to Eq. (8).

Module-trait relationship and identification of driver genes

Up to this point, six meta-modules had been identified without any relation to the phenotype or any biological information. Thus, the genes inside the modules were correlated to the OA phenotype via Eq. (11) (gene significance, GS) and their intramodular connectivity (GC) was computed. This procedure was repeated for all tissues and a consensus measure was calculated by taking the median value of GS and GC. The results are presented in Fig. 4 with the six MMs and the grey module of non-preserved genes. Two MMs, the turquoise and the red meta-module, exhibit correlations of 0.45 and 0.4 (p < 0.001 in both cases) between gene significance and intramodular connectivity. In other words, the hub genes inside these modules (driver genes) are correlated with the disease, and therefore the turquoise and red MMs should be associated with biological functions playing a role in OA. This hypothesis was tested through GSEA in the following step.
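The consensus step described above can be illustrated with a minimal sketch. It assumes that GS and GC are available per tissue as plain lists with one value per gene; the function names and the toy numbers are hypothetical and not taken from the study.

```python
from math import sqrt
from statistics import median


def consensus_median(per_tissue_values):
    """Median across tissues for every gene.

    per_tissue_values: list of equally long lists, one list per tissue.
    """
    return [median(gene_values) for gene_values in zip(*per_tissue_values)]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Invented toy values: 4 tissues x 5 genes of one meta-module.
gs_by_tissue = [
    [0.1, 0.3, 0.5, 0.6, 0.9],
    [0.2, 0.2, 0.4, 0.7, 0.8],
    [0.1, 0.4, 0.6, 0.5, 0.7],
    [0.3, 0.3, 0.5, 0.8, 0.9],
]
gc_by_tissue = [
    [0.2, 0.3, 0.4, 0.7, 0.8],
    [0.1, 0.4, 0.5, 0.6, 0.9],
    [0.3, 0.2, 0.6, 0.6, 0.7],
    [0.2, 0.3, 0.5, 0.9, 0.8],
]

gs_consensus = consensus_median(gs_by_tissue)
gc_consensus = consensus_median(gc_by_tissue)
r = pearson(gs_consensus, gc_consensus)  # GS vs. GC correlation of the module
```

A positive r for a module would correspond to the situation reported for the turquoise and red MMs, where highly connected genes also tend to be associated with the disease.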

Figure 4

Pearson correlation plots between gene significance (GS) and gene connectivity (GC) for the consensus (median) across all tissues. Colors correspond to the identified MM in Fig. 2b.

Gene set enrichment analysis

GSEA was performed on the turquoise and the red MM to see whether the preserved modules are involved in common biological functions. As input, a gene list of the respective module, sorted by decreasing absolute median t-value taken from the differential expression analysis of each tissue, was provided. The results presented in Table 1 show the top 10 pathways and biological processes sorted by adjusted p-value (using the Bonferroni correction with an experiment-wide significance level of α = 0.05) for the red and the turquoise MM. A full list of the enriched pathways is included in Supplementary Table 1.

Table 1 Results of GSEA showing the top 10 enriched gene sets for the red and the turquoise MM. Entries sorted by increasing adjusted p values (p.adj).

It can be observed that the red MM mostly represents biological functions and pathways related to the immune system as well as diseases affecting the immune system and causing immune responses. The turquoise MM includes functions related to ECM organization, skeleton and bone development as well as collagen physiology. Involvement of immune system and ECM in OA are well-known facts2,16. It was decided to focus the network based drug discovery on genes taken from the turquoise MM, as it showed the most consistent results regarding GS vs. GC correlation in all tissues (Supplementary Fig. S10).

Network based drug discovery

Genes in the 80% quantile of the gene significance (GS) and gene connectivity (GC) of the turquoise MM were chosen. To justify the choice of the threshold for the definition of the disease signature, the agglomeration measures were computed for different percentile values (0-90%) and the respective z-scores for module size S and mean shortest distance <ds> were determined. The plots of threshold vs. the agglomeration measures can be found in Supplementary Fig. S11 showing that the 80% threshold provided the best results. This choice resulted in a disease signature of 64 genes with a z-score for the module size S of 12.05 and with a z-score for the mean shortest distance <ds> of −1.75.
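The threshold-based selection of the disease signature can be sketched as follows. This is only an illustration under two assumptions that the text does not state explicitly: the GS and GC criteria are combined with a logical AND, and quantiles are computed by linear interpolation between order statistics. All names are hypothetical.

```python
def quantile_threshold(values, q):
    """Value at quantile q (0..1), linearly interpolated between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] * (1 - frac) + ordered[upper] * frac


def disease_signature(genes, gs, gc, q=0.8):
    """Genes whose GS and GC both reach the q-quantile within the meta-module.

    Assumption: both criteria must hold (logical AND).
    """
    gs_cut = quantile_threshold(gs, q)
    gc_cut = quantile_threshold(gc, q)
    return [g for g, s, c in zip(genes, gs, gc) if s >= gs_cut and c >= gc_cut]


# Invented toy module with five genes.
signature = disease_signature(
    ["a", "b", "c", "d", "e"],
    gs=[0.1, 0.2, 0.3, 0.4, 0.9],
    gc=[0.2, 0.1, 0.3, 0.5, 0.8],
)
```

Repeating the selection for several values of q and scoring each resulting gene set is the kind of scan that the threshold plots in Supplementary Fig. S11 summarise.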

The results of the drug-disease proximity based screening are shown in Table 2 with the top 10 compounds identified by the algorithm. The mean shortest distance between a drug signature and the disease signature is denoted by <dc>; the respective z-score was computed from 1000 sampling runs with random drug and disease signatures of the same size and the same degree distribution as the original signatures. As another requirement only drugs with a <dc> 1 (lowest 5% after screening the full list of 1833 drugs) were considered. The type and mechanism of action were taken from Drugbank. In addition, the relation to OA is shown. Four out of 10 drugs (Ruxolitinib, Certolizumab, Golimumab, Vedolizumab) are anti-inflammatory compounds that, although used as treatments for diseases other than OA, have been studied as a treatment option for joint diseases (mostly rheumatoid arthritis). The second finding is that the thrombolytic agent Tirofiban might be an option for the treatment of OA. Although there are no studies testing this agent in OA or arthritic joint diseases, there is a clinical study on the linkage of arthritis to local and systemic activation of coagulation and fibrinolysis pathways in a cohort of n = 161 patients. The most statistically significant result, Florbetapir, is a radiopharmaceutical agent that binds to beta amyloid plaque, a molecule playing a central role in Alzheimer’s disease (AD). A linkage between AD and OA is a hypothesis that has been posed and positively tested17. Finally, hyaluronidase and Turpentine are two compounds that lead to cartilage destruction, by degrading hyaluronan, the major constituent of the ECM (hyaluronidase), and by releasing inflammatory mediators (Turpentine). Interestingly, both compounds are used in animal disease models, with hyaluronidase used in OA18 and Turpentine used in a model of anemia of inflammation19. In summary, 9 out of 10 suggested compounds exhibit a hit, either as having been tested for an arthritic disease or as having targets that are also relevant in OA.
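The proximity measure can be illustrated with a short sketch. It assumes the "closest" variant of drug-disease proximity (mean, over all drug targets, of the shortest path length to the nearest disease gene) on an unweighted interaction network stored as an adjacency dictionary. The toy graph, the function names and the use of the population standard deviation for the z-score are assumptions made for illustration; the degree-preserving random sampling described above is not reproduced here.

```python
from collections import deque


def shortest_distances(graph, source):
    """Breadth-first search: hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def closest_distance(graph, drug_targets, disease_genes):
    """Mean over drug targets of the distance to the nearest disease gene."""
    total, counted = 0, 0
    for target in drug_targets:
        dist = shortest_distances(graph, target)
        reachable = [dist[g] for g in disease_genes if g in dist]
        if reachable:  # targets disconnected from the signature are skipped
            total += min(reachable)
            counted += 1
    return total / counted if counted else float("inf")


def z_score(observed, random_values):
    """Standardise an observed distance against distances of random signatures."""
    mean = sum(random_values) / len(random_values)
    var = sum((v - mean) ** 2 for v in random_values) / len(random_values)
    return (observed - mean) / var ** 0.5


# Toy network: a simple path A - B - C - D - E.
toy_graph = {
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"],
}
d_c = closest_distance(toy_graph, drug_targets=["A", "C"], disease_genes=["C", "E"])
z = z_score(d_c, [2.0, 3.0, 4.0])  # the three random distances are invented
```

A strongly negative z-score indicates that the drug targets lie closer to the disease signature than expected by chance.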

Table 2 Top 10 suggested compounds after network based drug screening.

In order to validate the compound suggestions, a bottom 10 list and two random 10 lists of drugs were computed. The lists can be found in Supplementary Table 3. The bottom 10 list included neither drugs tested in OA nor targets relevant for OA. Two random 10 lists were created. The first was sorted by lowest mean shortest distance <dc> and provided 3 out of 10 hits; however, none of them was statistically significant (lowest z-score was −1.3). The second was sorted by lowest z-score and provided 2 out of 10 hits. Even when relaxing the requirement of low z-scores, comparing the hits (top 10 vs. random 10) with Fisher’s exact test delivers a p-value of 0.02. Finally, the entire list of approved drugs (1833 compounds) was screened for compounds with the Drugbank curated application ‘arthritis’. In this scenario 42 out of 1833 compounds were selected. Fisher’s exact test versus 9 out of 10 hits (top 10 list) delivered a p-value of 4.5e-14.
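The comparison of hit counts can be reproduced with a small two-sided Fisher's exact test. The sketch below is a self-contained implementation based on the hypergeometric distribution; the layout of the 2x2 tables (hits and misses per list) is an assumption, as the exact tables are not given in the text.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables that are at most as
    likely as the observed one, with row and column totals held fixed.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def weight(k):
        # Unnormalised probability of observing k hits in the first row.
        return comb(col1, k) * comb(total - col1, row1 - k)

    k_min = max(0, row1 - (total - col1))
    k_max = min(row1, col1)
    observed = weight(a)
    extreme = sum(weight(k) for k in range(k_min, k_max + 1) if weight(k) <= observed)
    return extreme / comb(total, row1)


# Top 10 list (9 hits, 1 miss) vs. the random 10 list with 3 hits and 7 misses.
p_random = fisher_exact_two_sided(9, 1, 3, 7)  # roughly 0.02

# Top 10 list vs. 42 'arthritis' compounds among 1833 approved drugs
# (assumed table layout: 42 hits, 1791 misses).
p_database = fisher_exact_two_sided(9, 1, 42, 1791)  # on the order of 1e-14
```

Integer arithmetic is used for the table weights so that the comparison against the observed table is not affected by floating-point rounding.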

In summary the network based drug discovery approach confirms the role of inflammation in OA and suggests anti-inflammatory agents with various mechanisms of action. Further on, coagulation and fibrinolytic pathways seem to play a role in OA, thus thrombolytic agents might be a treatment opportunity to explore.

Discussion

OA is a multi-tissue disease, including cartilage degradation, meniscus and subchondral bone alterations and synovium inflammation. The aim of the study was to apply WGCNA to identify preserved structures of co-expressed genes, connect these findings to biological functions and include a network based drug discovery approach based on the findings obtained from the WGCNA.

The results show that structural similarities in the microarray datasets in terms of co-expressed genes describe biological functions relevant for OA. More specifically two preserved meta-modules had hub genes associated with OA and described functions related to immune system (red MM) and ECM physiology (turquoise MM). It has to be noted that the preservation quality of meta-modules between two tissues was very different (see Fig. 3). Especially meniscus and cartilage show extremely good preservation statistics (D = 0.94) which may be caused by several reasons. First of all, in both datasets the healthy samples were retrieved from patients undergoing arthroscopic partial meniscectomy whereas the OA samples were retrieved from patients undergoing total knee arthroplasty. Therefore the sample retrieval itself surely poses difficulties in terms of clear separation of the tissues. More specifically, arthroscopic retrieval of healthy looking cartilage might have caused an unwanted contamination of the sample with meniscal cells as both tissues are in proximity inside the knee joint. A second reason might be the use of the exact same platform Agilent-072363 SurePrint G3 Human GE v3 8 × 60 K Microarray 039494 for both datasets. Normally one would not expect such a strong influence on the co-expression of the genes. We tested this hypothesis by performing differential eigengene network analysis after removal of a batch effect of all datasets with the limma package, however the results were not affected. Lastly, there might really be a high overlap of biological functions and a strong similarity between meniscus and cartilage.

After establishing meta-module preservation, we were interested in which modules were relevant for OA for further downstream analysis (see Fig. 4). In order to allow for a tissue-unspecific comparison, the median values of the absolute t-values after differential expression analysis of each tissue were used.

Clearly this approach bears the risk of ignoring important biological information that is tissue specific. In particular using the GS vs. GC correlation approach for each tissue individually shows that there are significant differences between the tissues, see Supplementary Results 2. Analysis of the cartilage dataset reveals that there are no meta-modules that exhibit positive correlation between GS and GC. Looking at the differential expression analysis and the volcano plots in Supplementary Table 2 shows that very few genes (n = 32) are differentially expressed in this dataset and that most of the genes have low logFC (low spread of the eruption in the volcano plot). Further on, differential expression analysis revealed that there are no differentially expressed genes across all tissues, however 8 genes (CSN1S1, APOD, FAP, COL5A2, MXRA5, DEFA3, DEFA4, S100A8) were differentially expressed in 3 out of 4 tissues as shown in Fig. 5, a Venn diagram on tissue specific DEGs. More details on this analysis can be found in Supplementary Results 4.

Figure 5

Venn diagram of DEGs for all four datasets. The number of DEGs per individual dataset and all possible overlaps are given. Gene names mentioned if overlapping in three out of four tissues.

In the remaining datasets (Supplementary Fig. S10A–C) at least either the red or the turquoise MM exhibited a positive correlation between GS and GC. In the synovium dataset the yellow MM seems to be of interest as well. Performing GSEA with g:Profiler on the genes of the yellow MM reveals next to rather generic functions (gene expression, cellular and RNA metabolism) the enrichment of the HIF-1 signaling pathway. Comparing with literature reveals many studies proving the role of the hypoxia inducible factor in OA20,21.

In addition, we ran GSEA for the red, turquoise and yellow MM without any information on differential expression (providing only an unsorted list of genes). This approach gave essentially the same results in terms of the overall functions of the MMs; however, the statistical significance was lower in the unsorted case. Finally, it has to be added that there are more sophisticated methods of performing GSEA. Notably, the piano22 package allows directionality to be considered during pathway enrichment, thus identifying which pathways are distinctively up- or down-regulated and how this information relates to the t-values of the differential expression analysis. We provide code that offers GSEA with the piano package; it is stored in the repository mentioned in the Materials and Methods section. The aim of this study was to provide a pipeline for identifying pharmaceutical compounds using WGCNA and network-based drug discovery approaches. It has to be emphasized that the pathway analysis presented herein was used to demonstrate a biological role of the genes inside the identified meta-modules (red, turquoise and yellow MMs), and not to provide an in-depth analysis of molecular functions that unravels mechanisms and genes playing a role across the tissues in OA knee joints. Hence, for the sake of conciseness, the results of the piano package are neither included nor discussed herein.

The network-based drug discovery approach suggested four compounds with anti-inflammatory potential acting along the JAK/STAT pathway, the TNF-α pathway and the integrin pathway. This is an interesting observation, as the genes of the disease signature were enriched in pathways related to ECM physiology and not to inflammatory processes. Strikingly, Vedolizumab, a drug for inflammatory bowel disease, ameliorated joint pain and delayed the onset of new cases of joint diseases in a post-hoc analysis of the GEMINI 2 trial23. Furthermore, it was suggested that anti-coagulants might have an effect on osteoarthritis, which is supported by the fact that the coagulation and fibrinolysis pathways play a role in arthritis24. The suggestion of two compounds (Hyaluronidase and Turpentine) that would worsen OA reveals the first intrinsic limitation of the drug-disease proximity approach: it carries no information on positive or negative interactions between target and signature, but solely a distance measure between the two groups. Alternative drug screening approaches, such as a reversal of the disease signature (in terms of measured gene expression) as proposed by the L1000CDS2 platform, might be an interesting alternative25. A drawback of such an approach (for our scenario) is that gene expression is very different across the joint tissues, so it would be difficult to consider all tissues in parallel. The validation approach classified the drug suggestions as hits or misses based on literature research and compared them with a bottom 10 list (highest distance) and two random 10 lists (10 compounds with lowest <dc> and 10 compounds with lowest z-score after randomly drawing from the gene list of 11461 genes). In the first case no compounds related to OA were identified. In the second scenario the random 10 lists gave 3 out of 10 hits (without statistically significant z-scores) and 2 out of 10 hits. Lastly, the Drugbank database was screened for compounds including ‘arthritis’ or ‘osteoarthritis’ as a curated description, since random selection from the database without any of the presented analysis steps might also be an option. In this case 42 out of 1833 compounds were selected, delivering a p-value of 4.5e-14 (Fisher’s exact test, compared to 9 out of 10 hits). As the curated description might not be complete, we computed the number of potential arthritis drugs the Drugbank database would have to include in order not to be outperformed by the top 10 list. As a result, at least 893 out of 1833 compounds would need to have a relation to osteoarthritis in order to deliver a p-value > 0.01. As such a scenario is highly unlikely, the following conclusions were drawn: (i) the Drugbank database is not biased towards osteoarthritis drugs; (ii) drug-disease proximity seems to be an important measure to include in drug screening; (iii) the analysis performed with WGCNA seems to be necessary in order to prioritize genes of interest and define a disease signature. In the case of OA such a signature is not trivial to define. The publications of Menche et al.26 and Guney et al.27 based their work on disease signatures obtained from various databases (299 diseases); unfortunately, OA is not included in their dataset, which prevents a cross-check of our results. We tried to overcome this obstacle by choosing a cut-off threshold that produced the lowest z-scores for S and <ds>, thus assuming that the disease signature should be as agglomerated as possible. So far the screening has been applied to a list of approved drugs in order to facilitate comparison with the literature. It can, however, easily be expanded to include investigational compounds, as only the target genes need to be known.

Limitations

The first limitation in using WGCNA is the requirement of having the exact same list of expressed genes for each tissue, thus it is favourable if the same experimental platform can be used. In our case, the synovium dataset was collected with the Affymetrix platform, whereas the remaining tissues were processed with the Agilent platform. Therefore, in the end, around 11000 genes were used as an input for WGCNA and some information could have gotten lost due to the differences in the experimental platforms. The methodology presented herein allows the exclusion of certain datasets and it would be an interesting option to explore functional overlaps in other combinations of joint tissues (e.g. only cartilage and meniscus). Secondly, although WGCNA tries to reduce the influence of arbitrary cut-off thresholds, the parameter β (Eq. 3) has to be chosen based on the a priori requirement of scale-free network topology. This assumption might not be correct, as a recent study showed that only a small fraction of biological networks do really exhibit scale-free network properties28. WGCNA does not rely on statistical measures, therefore it is hardly possible to perform a priori analyses for further experimental design and the performed module stability approach does not give any information on desired sample sizes for further studies.

As mentioned above, the GSEA performed in the study ignored tissue specificity and directionality measures of the enriched pathways and biological functions.

In terms of validation, our approach relied on comparison with the literature without in vitro testing. It has to be mentioned that in vitro models of OA are rather diverse in terms of model structure, disease induction and model outcome. It is therefore not easy to define whether a drug is really working, in contrast to e.g. the IC50 in cancer drug testing. Furthermore, the drug discovery approach was based on molecular profiles of four joint tissues, and to the best of our knowledge there are no in vitro models considering the influence of all these tissues. Lastly, the drug discovery approach currently does not consider toxicity or side effects as additional measures for compound prioritization.

Despite these limitations we believe that the methodology presented in this work is a viable way to guide in silico drug discovery in OA or other multi-tissue diseases. Having a modular structure, the identification of target genes or the network based drug discovery part can be extended and improved to tackle the abovementioned limitations.

Overall, WGCNA was used to identify target genes with preserved co-expression across tissues, association with the disease and high intramodular connectivity. The output was used to suggest drugs based on drug-disease proximity measures in a PPI network. Anti-inflammatory compounds with different mechanisms of action such as JAK/STAT inhibitors, TNF-α inhibitors and integrin pathway inhibitors were suggested. Finally, compounds affecting the coagulation pathways might be interesting for OA treatment.

Materials and Methods

Datasets

Publicly available genome-wide microarray datasets for each tissue involved in knee OA were acquired from the Gene Expression Omnibus (GEO)29. These included cartilage, synovium, meniscus and subchondral bone. The tissue sources with the GEO accession numbers, the platform and the sample numbers are shown in Table 3.

Table 3 Tissues, GEO accession numbers, experimental platforms and sample numbers.

The cartilage dataset (GSE117999) included 24 samples. Healthy-appearing cartilage was taken from the non-weight bearing site of the medial intercondylar notch of 12 patients undergoing arthroscopic partial meniscectomy without any evidence of OA. OA cartilage was taken from 12 patients undergoing total knee arthroplasty due to end-stage OA. The synovium dataset (GSE55235)30 included 20 samples. Healthy tissue was taken from 10 individuals from post-mortem joints or after traumatic injury and OA tissue was taken from 10 patients undergoing total knee arthroplasty. The meniscus dataset (GSE98918)31 included 12 patients undergoing arthroscopic partial meniscectomy (healthy) and 12 patients undergoing joint replacement due to end-stage OA. The subchondral bone dataset (GSE51588)32 included tissue taken from the knee lateral and medial tibial plateaus (LT and MT) of 5 non-OA individuals (post-mortem or after amputation surgery) and 20 OA patients undergoing joint replacement surgery. Preliminary analysis of LT vs. MT from the same group showed significant differences in gene expression; thus mixing tissue from both sites would have resulted in a loss of biological information. The MT plateau proved to be more affected by OA, so the OA and control groups were both taken from the MT plateau.

Data pre-processing and differential expression analysis

The R package limma33 was chosen for background correction and normalisation of the data as well as for the differential expression analysis. RMA and quantile normalisation were used for all datasets, as these methods produced MA plots34 (log-intensity ratio M vs. mean log-intensity A) that were scattered around the zero line, see Supplementary Fig. S1. Before performing differential expression analysis, the gene expression values of normal and OA samples were hierarchically clustered to remove outliers in the respective datasets, see Supplementary Figs S2–S5 in the Supplementary Methods section. P11 and P12 were removed from the healthy meniscus group, P11 was removed from the healthy cartilage group and P18 and P19 were removed from the OA cartilage group. Once the outliers were removed, DEGs in each dataset were identified as those satisfying the following conditions (Eqs 1 and 2):

$$\log_{2}FC \ge 1.5$$
(1)
$$adj.p \le 0.05$$
(2)

with FC being the fold change between the average expression of the healthy and the OA samples and adj.p being the FDR adjusted p-value using Benjamini-Hochberg correction.
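The two criteria can be illustrated with a small sketch. In the study the adjusted p-values come from limma; here a plain Benjamini-Hochberg adjustment of raw p-values is implemented for illustration only, and the gene names and numbers are invented. The fold-change condition follows Eq. (1) as written, i.e. without taking the absolute value.

```python
def benjamini_hochberg(p_values):
    """FDR-adjusted p-values (Benjamini-Hochberg step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down and enforce monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(genes, log2_fc, p_values, fc_cut=1.5, p_cut=0.05):
    """Genes satisfying Eq. (1) and Eq. (2)."""
    adj_p = benjamini_hochberg(p_values)
    return [
        gene
        for gene, fc, q in zip(genes, log2_fc, adj_p)
        if fc >= fc_cut and q <= p_cut
    ]


# Invented toy input: four genes with log2 fold changes and raw p-values.
degs = select_degs(
    ["g1", "g2", "g3", "g4"],
    log2_fc=[2.0, 1.8, 0.5, 3.0],
    p_values=[0.01, 0.04, 0.03, 0.5],
)
```

In this toy input only g1 passes both thresholds: g2 fails the adjusted p-value cut-off, g3 the fold-change cut-off and g4 the p-value cut-off.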

WGCNA requires expression values for exactly the same genes across all datasets as input. Gene names for every tissue were therefore annotated with HUGO gene nomenclature symbols and the intersection of all datasets was taken.

Weighted gene co-expression network analysis

WGCNA is a methodology to identify clusters of genes calculated from a network described by the connectivity of the pairwise correlation between the genes. Further on, it can be used to identify if a module from one dataset is preserved in another dataset by using topological measures of the network12. Detailed information on the methodology can be found in Zhang et al.12, therefore just a brief description of the algorithm is presented herein. All computations were performed using the R package WGCNA35.

Network construction and module identification

At first, a signed weighted adjacency matrix Aij was computed according to Eq. 3:

$$A_{ij} = \left(0.5 + 0.5\,\mathrm{cor}(x_{i}, x_{j})\right)^{\beta}$$
(3)

with cor(xi,xj) being the pairwise Pearson correlation matrix (NxN) with xi and xj (i, j = 1…N) being the vectors containing the gene expression levels across the different samples of genes i and j respectively and N being the total number of genes. The power β is used to reduce the influence of low absolute correlation values on the network topology. Further on β is chosen to lead to an approximate (R2 ≥ 0.8) scale-free topology of the network. As seen in Supplementary Fig. S6 a choice of β = 20 leads to an approximate scale-free topology and reduces the connectivity of the nodes. Further on, the connectivity ki of a node i is defined as in Eq. 4 and describes the sum of all weighted connections of a node i:

$${k}_{i}=\sum _{u}{a}_{iu}$$
(4)
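Eqs (3) and (4) can be sketched in a few lines. The study used the R package WGCNA for these steps; the stand-alone version below is only meant to make the formulas concrete. The function names and toy data are hypothetical, and the self-connection is excluded from the connectivity sum, which is an assumption about the summation index in Eq. (4).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def signed_adjacency(expression, beta):
    """Eq. (3): expression holds one vector per gene (one value per sample)."""
    n = len(expression)
    return [
        [(0.5 + 0.5 * pearson(expression[i], expression[j])) ** beta for j in range(n)]
        for i in range(n)
    ]


def connectivity(adjacency):
    """Eq. (4): sum of weighted connections of each node (self-link excluded)."""
    n = len(adjacency)
    return [sum(adjacency[i][u] for u in range(n) if u != i) for i in range(n)]


# Toy data: gene 0 and gene 1 are perfectly correlated, gene 2 is anti-correlated.
expression = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
adjacency = signed_adjacency(expression, beta=2)
k = connectivity(adjacency)
```

In the signed network an anti-correlated gene pair receives an adjacency of 0, so gene 2 ends up with zero connectivity in this toy example.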

In the next step Aij was transformed into a topological overlap matrix (TOMij) according to Eq. 5:

$$TO{M}_{ij}=\frac{{\sum }_{u}{a}_{iu}{a}_{ju}+{a}_{ij}}{\min \{{k}_{i},\,{k}_{j}\}-{a}_{ij}+1\,}$$
(5)

with aij denoting the entries of the adjacency matrix Aij (in an unweighted network, aij = 1 if a direct link between node i and node j exists and 0 otherwise). In other words, TOMij relates the set of common neighbours to the smallest set of neighbours of i excluding j and vice versa. The dissimilarity matrix that was used for module identification with WGCNA is defined in Eq. 6:

$$DIS(TO{M}_{ij})=1-TO{M}_{ij}$$
(6)
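A direct transcription of Eqs (5) and (6) may help to see how the topological overlap behaves. The sketch assumes that the sums run over all nodes u other than i and j and that the diagonal of TOM is set to 1; both are conventions that are not spelled out in the equations above. Function names are hypothetical.

```python
def topological_overlap(adjacency):
    """Eq. (5) for a symmetric adjacency matrix given as a list of lists."""
    n = len(adjacency)
    # Connectivity of every node, Eq. (4), without the self-connection.
    k = [sum(adjacency[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            a_ij = adjacency[i][j]
            # Connection strength shared through common neighbours u.
            shared = sum(
                adjacency[i][u] * adjacency[j][u] for u in range(n) if u not in (i, j)
            )
            tom[i][j] = (shared + a_ij) / (min(k[i], k[j]) - a_ij + 1)
    return tom


def tom_dissimilarity(adjacency):
    """Eq. (6): 1 - TOM, used as the distance for module detection."""
    return [[1.0 - value for value in row] for row in topological_overlap(adjacency)]


# Unweighted toy network: a path 0 - 1 - 2.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
tom_path = topological_overlap(path)
dis_path = tom_dissimilarity(path)
```

In the path example nodes 0 and 2 are not directly linked but share their only neighbour, which yields a topological overlap of 0.5 and hence a dissimilarity of 0.5.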

The procedure of Eqs (3–6) was performed for the four datasets and a consensus transformation for the dissimilarity matrices according to Eq. (7) was computed:

$$Consensu{s}_{ij}({A}^{(1)},\,{A}^{(2)},\ldots )=mi{n}_{ij}({A}^{(1)},\,{A}^{(2)},\ldots )$$
(7)

Operators other than the min operator (10th quantile, median, mean etc.) can also be used, depending on how strictly the consensus criterion is formulated.
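The consensus transformation of Eq. (7) and its alternative operators amount to an element-wise reduction over the stacked matrices. The sketch below assumes all matrices share the same gene order; the function name and the `operator` argument are hypothetical.

```python
import numpy as np

def consensus(matrices, operator="min", q=0.10):
    """Element-wise consensus of equally shaped matrices (Eq. 7 for operator='min')."""
    stack = np.stack([np.asarray(m, dtype=float) for m in matrices])
    if operator == "min":
        return stack.min(axis=0)
    if operator == "quantile":
        return np.quantile(stack, q, axis=0)   # e.g. q=0.10 for the 10th quantile
    if operator == "median":
        return np.median(stack, axis=0)
    if operator == "mean":
        return stack.mean(axis=0)
    raise ValueError(f"unknown operator: {operator}")

# Hypothetical matrices from two datasets
m1 = np.array([[1.0, 0.8], [0.8, 1.0]])
m2 = np.array([[1.0, 0.3], [0.3, 1.0]])
cons_min = consensus([m1, m2])
cons_mean = consensus([m1, m2], operator="mean")
```

The min operator keeps a connection only as strong as its weakest occurrence across the datasets, which is why it is the strictest of the listed choices.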

Finally, clusters of genes were identified using a hybrid method that combines hierarchical clustering and partitioning-around-medoids clustering, with the consensus matrix of Eq. (7) as the distance matrix36.

Module stability

Two methods were implemented to assess the stability of the module identification by the WGCNA algorithm. The first randomly removed 10% of the samples of each microarray dataset and applied the same processing and module identification as for the original datasets. The second used resampling with replacement to create new artificial datasets. Both approaches were repeated 50 times, each time comparing the new set of modules with the original set.
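The two resampling schemes can be sketched as follows. Only the generation of the perturbed datasets is shown; the module identification and the comparison with the original modules are omitted. The function names, the toy matrix and the fixed random seed are assumptions.

```python
import numpy as np

def drop_samples(expr, frac, rng):
    """Randomly remove a fraction of the samples (rows) without replacement."""
    n = expr.shape[0]
    n_keep = n - int(round(frac * n))
    keep = np.sort(rng.choice(n, size=n_keep, replace=False))
    return expr[keep]

def bootstrap_samples(expr, rng):
    """Draw as many samples as in the original dataset, with replacement."""
    n = expr.shape[0]
    idx = rng.integers(0, n, size=n)
    return expr[idx]

rng = np.random.default_rng(0)                        # fixed seed for reproducibility
expr = np.arange(40, dtype=float).reshape(10, 4)      # hypothetical 10 samples x 4 genes
reduced = [drop_samples(expr, 0.10, rng) for _ in range(50)]
boot = [bootstrap_samples(expr, rng) for _ in range(50)]
```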

Differential eigengene network analysis

For each module, an eigengene (the first principal component of the gene expression data underlying this module) was computed in order to reduce the network and allow a meta-analysis of the data37. For every tissue, the eigengenes were represented in an eigengene co-expression network AMEij according to Eq. (3) with β = 1. Then the consensus matrix (Eq. (7)) and the dissimilarity of the consensus matrix DISCONSMEij (Eq. (6)) were calculated.
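A module eigengene in the sense used here can be obtained from a singular value decomposition of the standardized expression data. The sketch below is one possible way to compute it; the function name and the sign convention (aligning the eigengene with the average expression) are assumptions.

```python
import numpy as np

def module_eigengene(expr):
    """First principal component of a (samples x genes) module expression matrix."""
    z = (expr - expr.mean(axis=0)) / expr.std(axis=0, ddof=1)   # standardize each gene
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    me = u[:, 0]                                                # first left singular vector
    # Assumption: orient the eigengene along the average standardized expression
    if np.corrcoef(me, z.mean(axis=1))[0, 1] < 0:
        me = -me
    return me

# Hypothetical module in which all genes follow the same profile
profile = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
expr = np.column_stack([profile, 2 * profile + 1, 3 * profile - 2])
me = module_eigengene(expr)
```

Applying Eq. (3) with β = 1 to a matrix whose columns are such eigengenes then yields the eigengene network of a tissue.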

Multi-dimensional scaling38 with subsequent k-means clustering39 was performed on DISCONSMEij to identify clusters of module eigengenes (MEs), so-called meta-modules (MMs), that were analysed further down the pipeline. Every MM was in turn represented by a meta-module eigengene.
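To illustrate the scaling step, the sketch below embeds a dissimilarity matrix in a low-dimensional space with classical (Torgerson) multi-dimensional scaling. The use of the classical variant, the function name and the number of dimensions are assumptions; the resulting coordinates would then be passed to a k-means routine, which is not shown.

```python
import numpy as np

def classical_mds(dis, n_dims=2):
    """Embed a symmetric dissimilarity matrix via classical multi-dimensional scaling."""
    d2 = np.asarray(dis, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ d2 @ centering          # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:n_dims]        # largest eigenvalues first
    vals = np.clip(vals[order], 0.0, None)         # guard against small negative values
    return vecs[:, order] * np.sqrt(vals)

# Hypothetical example: three points on a line at positions 0, 1 and 3
x = np.array([0.0, 1.0, 3.0])
dis = np.abs(np.subtract.outer(x, x))
coords = classical_mds(dis, n_dims=1)
recovered = np.abs(coords - coords.T)
```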

First, the degree to which the meta-modules were preserved across the datasets was assessed. To this end, a preservation transformation of the meta-module adjacency matrices AMMij (using Eq. (3) with β = 1) of all four tissues was performed according to Eq. (8), hereafter referred to as the preservation network:

$$Preser{v}_{ij}({A}^{(1)},\,{A}^{(2)},\ldots )=1-[Ma{x}_{ij}({A}^{(1)},\,{A}^{(2)},\ldots )-Mi{n}_{ij}({A}^{(1)},\,{A}^{(2)},\ldots )]$$
(8)

Two measures, the scaled connectivity C and the density D of the preservation network, were computed according to Eqs (9) and (10) to quantify the preservation between networks A(1) and A(2) of dimension n × n.

$${C}_{i}(Preser{v}^{(1,2)})=1-\frac{{\sum }_{j\ne i}|{a}_{ij}^{(1)}-{a}_{ij}^{(2)}|}{n-1}$$
(9)
$$D(Preser{v}^{(1,2)})=1-\frac{{\sum }_{i}{\sum }_{j\ne i}|{a}_{ij}^{(1)}-{a}_{ij}^{(2)}|}{n(n-1)}$$
(10)
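Eqs (8) to (10) translate directly into element-wise operations. The following sketch uses hypothetical function names and two toy adjacency matrices.

```python
import numpy as np

def preservation_network(adjs):
    """Eq. (8): 1 - (element-wise max - element-wise min) over several networks."""
    stack = np.stack([np.asarray(a, dtype=float) for a in adjs])
    return 1.0 - (stack.max(axis=0) - stack.min(axis=0))

def scaled_connectivity(a1, a2):
    """Eq. (9): per-node preservation between two n x n networks."""
    diff = np.abs(np.asarray(a1, dtype=float) - np.asarray(a2, dtype=float))
    np.fill_diagonal(diff, 0.0)                    # the sum runs over j != i
    n = diff.shape[0]
    return 1.0 - diff.sum(axis=1) / (n - 1)

def density(a1, a2):
    """Eq. (10): overall preservation between two n x n networks."""
    diff = np.abs(np.asarray(a1, dtype=float) - np.asarray(a2, dtype=float))
    np.fill_diagonal(diff, 0.0)
    n = diff.shape[0]
    return 1.0 - diff.sum() / (n * (n - 1))

a1 = np.array([[1.0, 0.8, 0.2], [0.8, 1.0, 0.4], [0.2, 0.4, 1.0]])
a2 = np.array([[1.0, 0.6, 0.2], [0.6, 1.0, 0.9], [0.2, 0.9, 1.0]])
preserv = preservation_network([a1, a2])
c = scaled_connectivity(a1, a2)
d = density(a1, a2)
```

With these definitions, D equals the mean of the Ci over all nodes, so the density summarizes the node-wise preservation in a single number.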

For more detailed information on preservation statistics and differential eigengene network analysis, the reader is referred to Langfelder et al.37.

Module-trait relationship and identification of driver genes

Up to this point, the identified MMs represented genes that were co-expressed and preserved across all tissues without considering the phenotype (healthy vs. OA). The next step was to identify MMs that contain disease-related genes. Furthermore, the connectivity of the genes inside the MMs was of interest, as hub genes might be influential for the corresponding meta-module.

Thus, the overall gene expression datExpr was correlated with the disease (trait) by computing the gene significance GS with Eq. (11):

$$GS=abs(cor(trait,\,datExpr))$$
(11)

Additionally, the gene connectivity GC was calculated as the weighted within-module connectivity (edge-weighted degree).
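Gene significance (Eq. (11)) and within-module connectivity can be sketched as below. The function names, the binary coding of the trait (0 = healthy, 1 = OA) and the toy data are assumptions made for the example.

```python
import numpy as np

def gene_significance(trait, dat_expr):
    """Eq. (11): absolute Pearson correlation of each gene (column) with the trait."""
    t = np.asarray(trait, dtype=float)
    t = t - t.mean()
    x = dat_expr - dat_expr.mean(axis=0)
    cor = (t @ x) / (np.linalg.norm(t) * np.linalg.norm(x, axis=0))
    return np.abs(cor)

def within_module_connectivity(adj, members):
    """Sum of adjacency weights to the other genes of the same module."""
    sub = adj[np.ix_(members, members)]
    return sub.sum(axis=1) - np.diag(sub)

trait = np.array([0, 0, 1, 1])                     # assumed coding: 0 = healthy, 1 = OA
dat_expr = np.array([[1.0, 2.0, 1.0],
                     [1.0, 2.0, 2.0],
                     [2.0, 1.0, 1.0],
                     [2.0, 1.0, 2.0]])
gs = gene_significance(trait, dat_expr)

adj = np.array([[1.0, 0.7, 0.1],
                [0.7, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
gc = within_module_connectivity(adj, [0, 1])       # module made of genes 0 and 1
```

Genes with both a high GS and a high GC are the hub genes of a disease-related module.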

Functional enrichment and pathway analysis

The outcome of the WGCNA analysis is a set of modules of co-expressed genes that are preserved across knee joint tissues and simultaneously contain genes correlated with the disease state. These modules were connected to biological functions and pathways through gene set enrichment analysis (GSEA) using the g:Profiler web-service40. g:Profiler takes as input a list of gene names (sorted or unsorted) and provides an enrichment score indicating whether the set of genes is enriched in a biological function or pathway. Enrichment was performed using Gene Ontology (GO) biological processes41,42 as well as KEGG43 and REACTOME44 pathways.

Network based drug discovery

In order to suggest compounds for treatment of OA, the network-based approach suggested by Guney et al.27 was used. This approach represents diseases with signatures (lists of proteins or protein encoding genes) that are located in a background protein-protein interaction (PPI) network, called the interactome. Drugs are represented by their respective protein targets (drug signatures) and network-based distances between the disease and drug signatures are used to suggest drugs with therapeutic potential.

The disease signature was chosen from the meta-modules of the WGCNA analysis that contained genes significantly correlated with the disease state (high GS) and with a high gene connectivity GC. The disease signature therefore met the following requirements: (1) genes were co-expressed and the co-expression was preserved across tissues; (2) genes were correlated with the disease state; (3) genes were the hub genes of the disease-related meta-modules.

The PPI network presented by Menche et al.26, consisting of 13460 proteins and 141296 interactions, was selected as the background network. First, it was determined whether the disease gene list forms a module in the background network. Two measures were chosen that quantify the degree to which disease proteins agglomerate in the interactome neighbourhood26. The first was the module size S, defined as the largest number of disease proteins directly connected to each other. The second was the shortest distance ds from each of the N disease proteins to the closest other disease-associated protein in the interactome. The average value <ds> over all N disease proteins, describing the diameter of the disease on the interactome, was then calculated. Detailed explanations can be found in the Supplementary Material of Menche et al.26.
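Both localization measures can be computed with a breadth-first search on the interactome. The sketch below is a minimal illustration and not the implementation used in this study; the adjacency-dictionary representation of the network, the function names and the toy graph are assumptions.

```python
from collections import deque

def module_size(graph, disease):
    """S: size of the largest connected component among the disease proteins."""
    disease = set(disease) & set(graph)
    seen, best = set(), 0
    for start in disease:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for nb in graph[node]:
                if nb in disease and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        best = max(best, size)
    return best

def mean_shortest_distance(graph, disease):
    """<ds>: mean distance from each disease protein to its closest other disease protein."""
    disease = set(disease) & set(graph)
    dists = []
    for source in disease:
        seen, queue, found = {source}, deque([(source, 0)]), None
        while queue and found is None:
            node, d = queue.popleft()
            for nb in graph[node]:
                if nb in seen:
                    continue
                if nb in disease:
                    found = d + 1          # first disease protein reached is the closest
                    break
                seen.add(nb)
                queue.append((nb, d + 1))
        if found is not None:
            dists.append(found)
    return sum(dists) / len(dists) if dists else float("nan")

# Hypothetical toy interactome: path A - B - C - D - E - F
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
         "D": ["C", "E"], "E": ["D", "F"], "F": ["E"]}
disease = {"A", "B", "E"}
S = module_size(graph, disease)
ds_mean = mean_shortest_distance(graph, disease)
```

In the toy network, A and B are directly connected (S = 2), while E is three steps away from its closest disease protein, giving <ds> = (1 + 1 + 3)/3.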

Random controls for both measures S and <ds> were created from sets with the same number of proteins as the disease signature by sampling without replacement from the background interactome while preserving the degree distribution. This procedure was repeated 10,000 times and z-scores and p-values for S and <ds> were calculated according to Eq. (12):

$$z=\frac{X-\mu ({X}_{rand})}{\sigma ({X}_{rand})}$$
(12)

with X being S or <ds> respectively.
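The random controls and the z-score of Eq. (12) can be sketched as follows. Matching each protein with a random protein of exactly the same degree is a simplifying assumption of this example, as are the function names, the adjacency-dictionary representation, the toy graph and the use of the population standard deviation.

```python
import random
import statistics

def degree_matched_sample(graph, nodes, rng):
    """Draw a random node set with the same size and degree sequence as `nodes`."""
    by_degree = {}
    for node, nbs in graph.items():
        by_degree.setdefault(len(nbs), []).append(node)
    chosen = set()
    for node in sorted(nodes):
        # Assumption: exact degree matching, sampling without replacement
        pool = [c for c in by_degree[len(graph[node])] if c not in chosen]
        chosen.add(rng.choice(pool))
    return chosen

def random_controls(graph, signature, measure, n_rand, seed=0):
    """Evaluate `measure(graph, node_set)` on n_rand degree-matched random sets."""
    rng = random.Random(seed)
    return [measure(graph, degree_matched_sample(graph, signature, rng))
            for _ in range(n_rand)]

def z_score(observed, random_values):
    """Eq. (12); undefined if all random values are identical."""
    mu = statistics.mean(random_values)
    sigma = statistics.pstdev(random_values)
    return (observed - mu) / sigma

# Hypothetical toy interactome and a simple measure (number of internal links)
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
         "D": ["C", "E"], "E": ["D", "F"], "F": ["E"]}

def internal_links(g, nodes):
    return sum(1 for n in nodes for nb in g[n] if nb in nodes) // 2

signature = {"A", "B", "E"}
sample = degree_matched_sample(graph, signature, random.Random(1))
controls = random_controls(graph, signature, internal_links, n_rand=100)
z_example = z_score(5, [1, 2, 3])
```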

To obtain drug signatures, Drugbank v. 5.1.345 was parsed and all approved drugs together with their target genes were retrieved, resulting in 1833 drugs and small-molecule compounds. The drug-disease proximity <dc> was calculated as the average, over all drug targets T, of the shortest distance to any of the disease proteins S27. The statistical significance of the drug-disease proximity of every drug was computed according to Eq. (12) with 1000 sampling repetitions.
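The drug-disease proximity described above can be illustrated with a breadth-first search from each drug target. This is a minimal sketch; the adjacency-dictionary representation, the function names, the toy network and the convention that a target which is itself a disease protein has distance 0 are assumptions.

```python
from collections import deque

def distance_to_set(graph, source, targets):
    """Shortest path length from `source` to the closest node in `targets`."""
    if source in targets:
        return 0                                   # assumption: overlap counts as 0
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, d = queue.popleft()
        for nb in graph[node]:
            if nb in targets:
                return d + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, d + 1))
    return None                                    # no disease protein reachable

def drug_disease_proximity(graph, drug_targets, disease):
    """<dc>: mean over drug targets of the distance to the closest disease protein."""
    disease = set(disease)
    dists = [distance_to_set(graph, t, disease) for t in drug_targets if t in graph]
    dists = [d for d in dists if d is not None]
    return sum(dists) / len(dists) if dists else float("nan")

# Hypothetical toy interactome: path A - B - C - D - E - F
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
         "D": ["C", "E"], "E": ["D", "F"], "F": ["E"]}
dc = drug_disease_proximity(graph, drug_targets={"C", "E"}, disease={"A", "B"})
```

In the toy network, target C is one step and target E three steps away from the closest disease protein, so <dc> = 2.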

Validation of the network based method

Finally, a list of the top 10 drugs with the lowest drug-disease proximity and highest significance was derived. To validate the findings, the function of each compound and its relationship to joint diseases/OA was characterized by literature research and classified as either a hit (the compound is related to OA through existing studies or through pathways/targets relevant for OA) or a miss (no interaction between the compound and OA/joint diseases). The number of hits was compared to that of a bottom 10 list, i.e. the drugs with the highest drug-disease proximity and highest statistical significance. Additionally, a random 10 list was generated by creating a disease signature through sampling without replacement from the genes of the microarray datasets (11641 overlapping genes) with the same size and degree distribution as S, followed by drug-disease proximity computation with significance assessed according to Eq. (12). These two lists serve distinct purposes: the bottom 10 list shows the influence of drug-disease proximity on the chosen compounds, whereas the random 10 list shows the influence of using WGCNA to select an appropriate disease signature. Lastly, the Drugbank dataset was screened for drugs with a curated association to ‘arthritis’ or ‘osteoarthritis’ in order to check how a random drug selection from such a list would perform.