Gene expression analysis method integration and co-expression module detection applied to rare glucide metabolism disorders using ExpHunterSuite

High-throughput gene expression analysis is widely used. However, analysis is not straightforward. Multiple approaches should be applied and methods to combine their results implemented and investigated. We present methodology for the comprehensive analysis of expression data, including co-expression module detection and result integration via data-fusion, threshold based methods, and a Naïve Bayes classifier trained on simulated data. Application to rare-disease model datasets confirms existing knowledge related to immune cell infiltration and suggest novel hypotheses including the role of calcium channels. Application to simulated and spike-in experiments shows that combining multiple methods using consensus and classifiers leads to optimal results. ExpHunter Suite is implemented as an R/Bioconductor package available from https://bioconductor.org/packages/ExpHunterSuite. It can be applied to model and non-model organisms and can be run modularly in R; it can also be run from the command line, allowing scalability with large datasets. Code and reports for the studies are available from https://github.com/fmjabato/ExpHunterSuiteExamples.

: Results for the analysis of the PMM2-CDG dataset. A) PCA showing the separation of samples along the first two principal components. The greatest amount of variance was found between control and patient samples, although the Ctrl 2 and pat 1 samples were found close together. B) Venn diagram showing the results of the three different DEG detection methods employed. C) Representative functional enrichment plot for GO Biological Processes. All figures taken directly from the ExpHunter Suite output report. to identify such individual-specific genes is important as they allow the user to find groups of genes that can be potentially excluded from downstream analysis. Figure 3: Co-expression results for the analysis of the PMM2-CDG dataset. A) Dendrogram based on correlation between the co-expression modules and the categorical vectors. B) Function-gene network representation of the GO Biological Process enrichment for the co-expression module 60, the closest module to the treat-control vectors C) Barplot representing enrichment for the co-expression module 6, the closest to the chaperone vectors. The data is shown in this format as the network representation was too dense to interpret due to the large numbers of categories and genes.
In terms of correlation with the treated samples, the vectors encoding this information are close to module 60 (correlation=0.97, p-value=4 × 10 −7 ) This module contains 58 genes of which 12 are prevalent DEGs. The genes in this module show higher expression in the patient samples ( Figure 2B). Enrichment analysis shows reduced expression for genes related to cell secretion, mitochondrial membrane permeability, and apoptotic processes ( Figure  3B).
CACNA1H is underexpressed by 4.6 logFC. This gene encodes the voltage-gated calcium channel Cav3.2. Notably, another calcium channel, Cav2.1, Figure 4: GO Molecular Function network for co-expression module 6. Green points represent functional categories, blue/red points represent genes belonging to these categories, with the colour indicating fold-change between patients and controls. LogFC values range from -0.5 to 0.5, the most underexpressed genes are TOP2A, AURKA and CCNB1 whereas NEIL3, REXO5 and DNA2 are the most overexpressed. There are also several Kinesins genes with low expression. encoded by CACNA1A, has been reported to be involved in PMM2-CDG cerebellar syndrome and has been proposed as a therapeutic target [6]. CACNA1H loss of function mutations and the consequent decreased activity of the Cav3.2 channel have been related to development of autism spectrum disorder [11], and its decreased expression has been linked to skeletal muscle atrophy involving endoplasmic reticulum stress.
Regarding the addition of a chaperone to the samples, module 6 was closest to the vector encoding this information ( Figure 3A; correlation=0.79, p-value=0.002). The genes in this module tend to show higher expression in the chaperone receiving samples than controls ( Figure 2C), moreover the pattern is much clearer for patient samples, as would be expected given that the chaperone is intended to rescue the function of the mutated PMM2 gene. The genes in this module are enriched for various GO terms related to the cell cycle and DNA processing ( Figures 3C and 4). Taken together, these results suggest that the chaperone may be helping to restore these processes.

Lafora disease
For the Lafora disease dataset, PCA analysis revealed separation between control and knock-out samples (Epm2a and Epm2b) along the first principal component, and between Epm2a and Epm2b samples along the second, suggesting that the different mutants lead to a similar expression profile, with some differences ( Figure 5A). Of the genes detected by at least one of the DEG detection methods, 179 were detected by all three packages ( Figure 5B). In contrast to the results obtained for the PMM2-CDG dataset, DESeq2 did not detect any genes not detected by at least one other method (Supplementary Report 5).
Functional analysis identified genes related to the innate and adaptive immune responses and inflammation ( Figure 5C), supporting the idea of microglia-astrocyte cross talk in neurodegeneration [5]. Many genes underlying these functions are highly overexpressed ( Figure 5C), including Lcn2, that can protect the nervous system in response to inflammatory processes [3]. The up-regulated genes are largely expressed by astrocytes and microglia, two cell types that accumulate polyglucosan inclusions [8]. As such, one can speculate that this accumulation triggers pro-inflammatory mediator production in these cells [5].
Correlation analysis was first performed to find modules correlated with all mutant mice, and then focussing on specific knock-out groups. The full dendrogram can be visualised in Figure 6.
Module 1 is correlated with the mutant vs. control vectors (correla-tion=0.91, p-value=6.25 × 10 −5 , Figure 8A). The expression values of the genes in this module can be seen in Figure 7A, tending towards higher expression in the mutant samples. These genes show enrichment for the regulation of immune responses, including negative regulation of immune effector process, positive regulation of cytokine production and T cell activation ( Figure  8B). Full details are included in Supplementary Report 6.
The gene-function network for this module can not be easily visualised due to its large size. Module 13, which is a similar distance from the mutant/control vector (correlation=0.95, p-value=5.2 × 10 −6 ), is shown instead (Figure 9). It shows immune system and inflammatory processes, including response to virus, cellular response to lipopolysaccharide and biotic stimulus, and I-κB kinase/NF-κB signaling. All genes are overexpressed in this group, interestingly three of the most highly expressed are Cxcl10, Ccl5 and Ccl12, also identified in the initial functional analysis.
Regarding correlation with the different knock-out mice, the Epm2a vector was close to module 23 (correlation=0.93, p-value=3.2 × 10 −5 ). This module shows higher gene expression in Epm2a knock-out samples ( Figure  7B) and enrichment for the stress-activated MAPK cascade, JNK cascade and immune system processes ( Figure 8B), consistent with glial signaling in this disease [9]. Module 23 contains Tlr4 and Traf6, two components of the key inflammation pathway that triggers NF-κB and MAPK processes, important in Lafora disease [9]. Epm2b was highly correlated with module 11 (correlation value=0.92; p-value=5.3 × 10 −5 ), which contains 138 genes, none of which are DE, despite higher expression in the Epm2a samples (Fig-Figure 6: Full dendogram for the Lafora disease co-expression analysis. 8  ure 7C). We also investigated the correlation between gene modules and gene and protein expression values, chosen based on the results of the initial DE analysis and measured using western blot (WB) and real time quantitative PCR (qPCR) ( Table 1). Full details in [5]. All expressed genes in this table were detected as differentially expressed between knock-out and control animals according to all DE gene detection and combination methods, including the Naïve Bayes approach. Figure 9: Function-gene network that represents the GO Biological Process enrichment analysis for the co-expression module 13, the next closest to the categorical vectors, for Lafora disease.  Figure 8A. Figure 10, shows how these external measures tend to increase for the mutant samples but stay low for the controls. Wisp2 and Mmp3 gene expression values are most highly correlated with module 192, which shows enrichment for functional categories such as the Toll-like receptor signaling pathway, also involved in NF-κB and MAPK activation. LCN2 protein levels correlate with module 42, which is enriched for the inner mitochondrial membrane protein complex. Mitochondrial activity has been shown to be altered in Lafora disease, as such the genes in this module are of interest for future study [4,12,7]. It should be noted that the protein levels for Ccl5, Lcn2, and Cxcl10 correlate with different modules to the mRNA levels of their genes.