Finding disease modules for cancer and COVID-19 in gene co-expression networks with the Core&Peel method

Genes are organized in functional modules (or pathways), thus their action and their dysregulation in diseases may be better understood by the identification of the modules most affected by the disease (aka disease modules, or active subnetworks). We describe how an algorithm based on the Core&Peel method is used to detect disease modules in co-expression networks of genes. We first validate Core&Peel for the general task of functional module detection by comparison with 42 methods participating in the Disease Module Identification DREAM challenge. Next, we use four specific disease test cases (colorectal cancer, prostate cancer, asthma, and rheumatoid arthritis), four state-of-the-art algorithms (ModuleDiscoverer, Degas, KeyPathwayMiner, and ClustEx), and several pathway databases to validate the proposed algorithm. Core&Peel is the only method able to find significant associations of the predicted disease module with known validated relevant pathways for all four diseases. Moreover, for the two cancer datasets, Core&Peel detects further eight relevant pathways not discovered by the other methods used in the comparative analysis. Finally, we apply Core&Peel and other methods to explore the transcriptional response of human cells to SARS-CoV-2 infection, finding supporting evidence for drug repositioning efforts at a pre-clinical level.


Supplementary Figures
study. The pathways enrichment analysis was performed for each subnetwork detected by Core&Peel and by the four competitive methods (ClustEx, Degas, KPM and ModuleDiscoverer). Both Core&Peel configurations with density equals to 0.5 and 0.7 were tested. Consequently, two versions of ClustEx (ClustEx 1940 and ClustEx 3800) have been generated in order to get modules with size comparable with Core&Peel. The pathway analysis using the differentially expressed genes (DEGs) was conducted as baseline. Only pathways with an adjusted p-value < 0.05 were selected. The bars on the bottom-left represent the number of enriched pathways in each method. The bars on the main plot represent the number of enriched pathways in common between methods marked with black points on the panel below. The plot was generated by the R package UpSetR. case study. The pathways enrichment analysis was performed for each subnetwork detected by Core&Peel and by the four competitive methods (ClustEx, Degas, KPM and ModuleDiscoverer). Both Core&Peel configurations with density equals to 0.5 and 0.7 were tested. Consequently, two versions of ClustEx (ClustEx 400 and ClustEx 900) have been generated in order to get modules with size comparable with Core&Peel. The pathway analysis using the differentially expressed genes (DEGs) was conducted as baseline. Only pathways with an adjusted p-value < 0.05 were selected. The bars on the bottom-left represent the number of enriched pathways in each method. The bars on the main plot represent the number of enriched pathways in common between methods marked with black points on the panel below. The plot was generated by the R package UpSetR. M6  132  103  64  56  47  43  M6  132  132  107  72  61 54 Core&Peel-r1-nl20-d0.7-f1-j0. 8   We tested four existing active subnetwork identification algorithms. We selected the methods that take gene expression matrix and/or list of differentially expressed genes (DEGs) as their inputs [1], a part from the network (generally it is a protein-protein interaction network). In this case, we used the co-expression network by the DREAM challenge as input. The algorithms we used in the comparison are summarized as follows.

Supplementary Tables
ModuleDiscoverer [2]: Starting from a network, the algorithm first approximates the underlying community structure by iterative enumeration of cliques from random seed nodes (one or more) in the network. Secondly, all the cliques are tested for their enrichment with the DEGs obtained from the gene expression data. Finally, significantly enriched cliques are assembled in a unique regulatory module taking each gene once. We used the single-seed approach, which identifies cliques using only one seed node and the p-value cutoff was 0.01 (both default parameters). The p-value for each clique is calculated using a permutationbased test. To calculate this p-value we used the parameter of 20000 gene-sets.
KeyPathwayMiner (KPM) [3]: it detects highly connected subnetworks in which most genes show similar patterns of expression. Specifically, given network and gene expression data, KeyPathwayMiner identifies those maximal subgraphs where all but k nodes of the subnetwork are expressed similarly in all but l cases in the gene expression data. We provided an indicator flag to mark differentially expressed genes (0/1). We set k= 2 and l = 0 and INES strategy with GREEDY algorithm was performed. More modules were detected but most of them differs from one or two genes, so the best-scoring module was selected for further analysis.
DEGAS [4]: it identifies connected gene subnetworks significantly enriched for genes that are dysregulated in samples with disease. It takes the network and the gene expression matrix as inputs and each sample is labeled as case or control. We used default parameters a part from l (the number of outlier cases) which was set as 20% of the case number (as suggested by the authors). The algorithm identified one regulatory module for case study, which were used for further analysis. Only in BALF and PBMC cases, it detected more than one module so we merged them together. Beside it was not possible to run Degas in COVID-cells case since the expression matrix was not available.
ClustEx [5]: the algorithm takes DEGs and network as input. The DEGs are clustered and partitioned into different groups by average linkage hierarchical clustering according to their distances in the network.
In the extending step, the final modules are formed by adding intermediate genes on the k-shortest paths between the DEGs. The size of the largest module has to be chosen. The output consists of some small modules (about 2 or 3 genes for each) and one big module. We took the biggest module for the downstream analyses. The size of the largest module was chosen in order to be comparable with the module size of Core&Peel.

Core&Peel: detection of active sub-network when nl = 20
Here we show the Core&Peel performance in detecting active subnetwork when nl = 20. We chose the Core&Peel-r1-nl20-d0.8-f1-j0.8 for the prostate and colorectal cancer cases, instead Core&Peel-r1-nl20-d0.5-f1-j0.8 in asthma and rheumatoid arthritis case. Core&Peel detected 743, 120, 33 and 837 modules in the prostate cancer, asthma, rheumatoid arthritis and colorectal cancer datasets, respectively. Since the significant modules were merged in one regulatory module, we checked if these modules were still significantly enriched with the DEGs; all the four modules got a p-value < 10 −12 .

Comparison with other active sub-network detection methods
We compared the Core&Peel modules with those detected by the four existing active subnetwork identification methods, namely ModuleDiscoverer (MD), KeyPathwayMiner (KPM), ClustEx and Degas. We also tested MD using the hypergeometric exact test (MD hyperg) instead of the permutation-based test.
The number of genes in each module is reported in Table 9. Also in this case, regulatory modules of asthma and rheumatoid arthritis include a smaller number of genes than the prostate and colorectal cancer modules. Besides when we apply the hypergeometric test to MD, smaller modules have been identified than classic MD. Next, we compared the enriched pathways (Reactome-based) and plotted the overlap among all the method combinations ( Figure 10). In this comparison we took into account the MD hyperg. Globally, MD hyperg, ClustEx and Core&Peel detected higher number of pathways than Degas, KPM and DEGs. In both cancer cases, Core&Peel and MD hyperg share almost half of their pathways (around 60) and a large number of pathways was identified only by ClustEx ( Figures 10A, 10B). In rheumatoid arthritis case, more than half of the pathways were shared by Core&Peel, ClustEx and MD hyperg ( Figure 10D) and there is a small overlap for the other combinations. Besides, a smaller number of enriched pathways were detected in asthma than the other three case studies and we can also notice there is a few overlap. In fact almost the entire set of enriched pathways of ClustEx is not in common with any other methods. Similarly, more than half of the pathways identified by Core&Peel, were not detected by the others (Figure 10C).

Detection of disease-associated pathways
We calculated the number of enriched pathways with at least two genes associated to the specific disease (annotated on DisGeNET database). Also in this case we took into account MD hyperg. The results are showed in Figure 11. Generally Core&Peel is able to identify a substantial number of pathways. In particular, it is able to detect the major number of pathways with disease-associated genes in colorectal cancer and asthma. Generally applying the hypergeometric test reduces the number of pathways with disease-associated-genes, more evident in the rheumatoid arthritis case ( Figure 11B).
Finally, we conducted further enrichment analyses using different databases in order to investigate the enrichment with specific-disease pathways. The results are showed in Figure 12. The results for carcinoma and asthma pathways are similar to those obtained for nl = 10 (compare Figure 12 A and B with the corresponding Figure in the main text). The Core&Peel modules are the unique ones to be significantly enriched for colon and prostate cancer specific pathways such as neoplasm of the colon ( Figure 12C) and abnormal prostate morphology, prostate cancer and prostate neoplasm ( Figure 12D). This highlights how Core&Peel (with nl = 20) was able to find more cancer type-specific modules than the others algorithms. Again no significantly enriched pathways were found for rheumatoid arthritis case.

ClueGO analysis
In order to analyze the pathways detected only by Core&Peel in the active subnetwork identification problem, we used the ClueGO [6], a Cytoscape plug-in which creates a functionally organized pathway term network. This enables us to study the most representative pathways identified by only Core&Peel. For each case study, we selected the Reactome enriched pathways detected only by Core&Peel (no overlap with any other methods) as ClueGO input. Basically a Reactome term-term similarity is calculated (for each pair of terms) using the kappa score, which takes into account the shared genes between the terms. Finally, the created network represents the terms as nodes which are linked based on a predefined kappa score level (default: 0.1). All the terms are iteratively compared and functional groups are defined. Each functional group is represented by its most significant term and the size of the nodes reflects the enrichment significance of the terms. The following networks show the Reactome-pathways identified only by Core&Peel across the four case studies and the tables summarize the representative terms of each group.

Prostate cancer
The majority of the pathways detected in prostate cancer are related to the regulation of the cell cycle and DNA repair, which are the main deregulated processes in cancer. In particular, one of the pathways that plays a role in the cell cycle is associated to the kinetochore, a protein that links the chromosome to microtubule and it is essential for proper chromosome segregation during mitosis. A study [7] showed that the expression of one kinetochore-associated protein was remarkably upregulated in prostate cancer. Knockdown of this protein repressed the ability of cell proliferation, migration, and invasion of prostate cancer cells. Another pathway with a role in the mitosis control is called AURKA Activation by TPX2. The protein TPX2 activates Aurora A kinase (AURKA) which contributes to the regulation of cell cycle progression. It has been demonstrated [8] that overexpression of TPX2 improved proliferative, invasive and migratory abilities, and repressed apoptosis of the prostate cancer cells. Pathways involved in the DNA repair processes are for example Resolution of D-loop Structures through Synthesis-Dependent Strand Annealing (SDSA) and HDR through Homologous Recombination (HRR). The SDSA and the Homology directed repair (HDR) are mechanisms in cells to repair double-strand DNA lesions. The SDSA is promoted by the DNA helicase and inactivating mutations in DNA helicase genes are frequently associated with various cancers. However, the overexpression of many DNA helicases is required for cancer cell proliferation or resistance to DNA damage [9]. In addition, mutations in genes that promote HDR are frequently observed in several cancers, including the prostate cancer [10]. Cell-cell junction organization and Elastic fibre formation pathways are also enriched. Elastic fiber are bundles of proteins found in the extracellular matrix. The extracellular matrix is commonly deregulated in cancer and affects the cancer progression, promoting metastasis [11]. Finally, Fatty acids pathway has been found to be associated to prostate and in fact several studies demonstrated the association of some fatty acids with prostate cancer risk [12,13]. Moreover all the pathways enriched only in Core&Peel module include at least one gene with a disease-association in the DisGeNET database and/or DEGs. This highlights how these pathways can have a relevance in the prostate cancer.

Colorectal cancer
Similarly, all the pathways in colorectal cancer module include genes with disease-association and/or DEGs. Like prostate cancer case, pathways associated to the cell cycle have been found (G2/M DNA replication checkpoint and Telomere C-strand synthesis initiation). They are mainly involved in DNA replication which represents a crucial point for the genome integrity. A deregulation of this process can lead an accumulation of genetic aberrations that promote diseases such as cancer [14]. Telomeres maintain genome integrity by protecting the end of the chromosome from deterioration. Telomere crisis can cause a wide array of genomic aberrations that can promote cancer progression [15]. Another characteristic of cancer is an altered signal transduction which can lead to uninhibited growth. The purinergic receptors pathway is enriched in colorectal module, in fact it has been demonstrated that these receptors, after the binding with the ATP, affect tumor cell growth. In particular, it has been noticed an overexpression of some purienergic receptors in colon cancer [16], along with many malignant tumors. Another enriched pathway involved in the signal transduction is G alpha signalling events. Recent findings suggest that the prostaglandin E2, a proinflammatory product, stimulates colon cancer cell growth through a G protein-dependent signaling pathway [17]. Similarly to prostate cancer, fatty acids metabolism is also involved in colorectal cancer. More precisely a rapid metabolism of arachidonic acid was reported in various stages of the malignancy, suggesting a possible link between dietary lipids and the incidence of colorectal cancer [18].

Rheumatoid arthritis (RA)
Rheumatoid arthritis is a chronic autoimmune disease that primarily affects the lining of the synovial joints. Among all the enriched pathways, two of them (Neutrophil degranulation and Signaling by Interleukins) are involved in the immune system and include both genes with disease-association ( Figure 11b). Neutrophils are the most abundant leukocytes (white blood cells) in mammals and are one of the first-responders of inflammatory cells to migrate towards the site of inflammation. They contribute to RA pathology through the release of cytotoxic and immunoregulatory molecules, promoting the autoimmune processes that underly this disease [19]. The interleukins are proteins expressed by the immune system cells during an immune response. Different studies have demonstrated the involvement of interleukins in RA [20,21]. In addition, several splice variants of interleukins have been discovered [22], highlighting the involvement of mRNA splicing pathway in RA. Other pathways detected only by Core&Peel have a role in immunity and inflammation such as prefoldin and SUMO proteins. Prefoldin is a family of proteins used in protein folding complexes and seems to function as proinflammatory signals [23]. SUMO is a small ubiquitin-like modifier involved in protein sumoylation, a post-translational-modification event. Sumoylation has been suggested to regulate multiple cellular processes, including inflammation [24].

Asthma
Asthma is a common long-term inflammatory disease of the airways of the lungs. In fact most of the enriched pathways identified only by Core&Peel are involved in the immune system. For example, the interaction between the Toll-like receptors (TLRs) and environmental allergens leads to release of various pro-inflammatory mediators from innate cells supporting asthma development [25]. Like in RA, different interleukins have a role in asthma pathogenesis [26,27]. In addition, butyrophilins [28], leukotrienes and eoxins are molecules formed in response to inflammatory stimuli. In particular leukotrienes have a role in the pathophysiology of asthma including increased airway smooth muscle activity, microvascular permeability, and airway mucus secretion [29]. Another pathway enriched in asthma module and involved in the immune system is called TNFs bind their physiological receptors. TNFs function as cytokine and the binding with their specific receptors has crucial roles in both innate and adaptive immunity. In particular, TNF-α is a proinflammatory cytokine that has been implicated in many aspects of the airway pathology in asthma.
Besides, preliminary studies have demonstrated an improvement in asthma quality of life and lung function in patients treated with anti-TNF-α therapy [30]. To conclude, a pathway identified only by Core&Peel which has four genes with disease-association is G alpha signalling events. G protein-coupled receptors (GPCRs) regulate numerous airway cell functions, and signaling events transduced by GPCRs are important in both asthma pathogenesis and therapy [31].