Identification of diagnostic biomarkers via weighted correlation network analysis in colorectal cancer using a system biology approach

Colorectal cancer (CRC) is the third most frequent cancer to be diagnosed in both females and males necessitating identification of effective biomarkers. An in-silico system biology approach called weighted gene co-expression network analysis (WGCNA) can be used to examine gene expression in a complicated network of regulatory genes. In the current study, the co-expression network of DEGs connected to CRC and their target genes was built using the WGCNA algorithm. GO and KEGG pathway analysis were carried out to learn more about the biological role of the DEmRNAs. These findings revealed that the genes were mostly enriched in the biological processes that were involved in the regulation of hormone levels, extracellular matrix organization, and extracellular structure organization. The intersection of genes between hub genes and DEmRNAs showed that DKC1, PA2G4, LYAR and NOLC1 were the clinically final hub genes of CRC.


Construction of module-trait relationships and identification of modules hub genes. Module
Eigengene (ME) was used to describe expression profiles of each module as the eigenvector related to the first principal component of the expression matrix in order to discover modules that were strongly linked to the clinical variables that were being examined.Also, the correlations between specific genes and the CRC trait were assessed using the estimates of gene significance (GS).Additionally, the ME correlation and the gene expression profile for each module were defined as module membership (MM).The most significant (central) components of the modules can be said to be closely connected to the trait if the GS and MM have a substantial correlation.Lastly, we chose the GS and MM genes that showed a correlation of 0.7 or higher, and we considered the identical genes between the two groups as the most important genes.

Protein-protein interaction network construction and hub clusters identification. The Search
Tool for the Retrieval of Interacting Genes (STRING, https:// string-db.org/) database was used to build the protein-protein interaction (PPI) network of the genes with highest GS and MM.Confidence score > 0.7 was set as significant.The hub clusters were chosen, and functional annotation was carried out, using the Cytoscape software plug-in molecular complex detection (MCODE) 13 .
Gene ontology (GO) and KEGG pathway enrichment analysis of hub clusters.The biological process (BP), molecular function (MF), cellular component (CC), and KEGG pathway [14][15][16] enrichment analysis of the mRNAs in the hub clusters were obtained utilizing ClusterProfiler R package (version 4.4.4) 17 .A criterion of an adjusted p value of 0.05 or less was established for the functional category.
Identification of hub genes in PPI network using multiple centralities.By using maximal clique centrality (MCC) algorithm via cytohubba 18 plugin in cytoscape, we discovered hub genes in the PPI network.Finally, the intersection of genes between hub genes and DEGs were identified as final hub genes.

MiRNA-and TF-final hub genes regulatory networks.
Using the NetworkAnalyst database 19 , connections between the final hub genes, transcription factors (TFs), and microRNAs (miRNAs) were established.The greatest degree in the networks was then found for each TF and miRNA.The final hub genes' genetic alterations in CRC patients.Using the cBioPortal for Cancer Genomics 20 , we examined the genetic changes of the final hub genes in CRC patients.We selected CRC datasets from the TCGA, which comprise 1510 CRC samples.

Verification of final hub genes via expression values.
The UALCAN 21 and GSCALite databases 22 , a Web server for Gene Set Cancer Analysis, were used to confirm the mRNA expression patterns of the final hub genes in colorectal adenocarcinoma (COAD) and normal samples.Immunohistochemistry (IHC) from the Human Protein Atlas (HPA) 23 was used to compare the protein expression of the final hub genes between COAD and normal tissues.

Results
Microarray data processing, integrative meta-analysis and assessment of data quality .The role of mRNAs in CRC was examined utilizing three expression datasets that included samples from CRC and healthy control individuals.Using the normalizeQuantiles function in the preprocessCore package, quantile normalization was used to independently normalize each dataset.Then, gene symbols were used to combine three datasets.Next, the ComBat function from the R Package Surrogate Variable Analysis (SVA) was employed to rule out batch effects (non-biological differences).The raw data and normalized data following batch effect removal boxplots are displayed in Figure S1.These boxplots show a consistent level of quality in the expression data.Also, the boxplot of the preprocessed data showed satisfactory normalization.The PCA plot of all samples in the 2D plane included by their first two main components (PC1 and PC2) is shown in Figure S2 (PC1 and PC2).After batch effect reduction, the samples exhibited a fair dispersion, as seen by this plot.
Identification of DEmRNAs for CRC .We analyzed the expression matrix between CRC and normal samples using the Limma package (version 3.52.3) 24and discovered 792 DEmRNAs, comprising 466 downregulated and 326 upregulated DEmRNAs (Table S1).Table 2 displays the top 10 DEmRNAs.The volcano plot was made using the EnhancedVolcano package (https:// github.com/ kevin blighe/ Enhan cedVo lcano) in R, allowing us to examine the changes in mRNA expression between CRC and normal samples (Fig. 1).Also, we utilized the R pheatmap package (version 1.0.12)(https:// cran.r-project.org/ packa ge= pheat map) to show the expression of the top up-and down-regulated DEmRNAs (Fig. 2A,B).

Gene Ontology (GO) and KEGG pathway enrichment analysis of DEmRNAs. GO and KEGG
pathway analysis were carried out to learn more about the biological role of the DEmRNAs.These findings revealed that the genes were mostly enriched in the BPs that were involved in the regulation of hormone levels, extracellular matrix organization, and extracellular structure organization (Fig. 3A).According the KEGG pathway, the AGE-RAGE signaling pathway in diabetic complications, protein digestion and absorption, and PPAR signaling pathway were the three main categories where the genes were enriched (Fig. 3B).

Weighted correlation network analysis (WGCNA) and functional annotation.
A total of 39 samples from merged dataset were used for WGCNA using WGCNA package in R.Then, we used principal component analysis and samples hierarchical clustering to eliminate outlier samples (Fig. 4A,B).We eliminated 4 samples because of clustering and carried on with the analysis using the remaining data.Soft threshold value for the dataset was chosen with a cutoff R2 value of 0.9.As the network follows the power law distribution at this value (Fig. 4C), it is more similar to the condition of a true biological network.In order to merge the similar modules, the minimum module size was set to 30 with a 025 height cut (Fig. 4D). Figure 4E depicts every gene co-expression module.Twenty co-expressed gene modules were identified, with gray modules denoting genes that were not co-expressed.Each module was given a color name.With 24 genes, the royalblue module is the smallest.The turquoise module, which has 2434 genes, is also the biggest module at the moment.In addition, the 7659 genes that are not associated with any module are represented by the background's grey color (Table 3).For PPI network hub genes detection.We identified hub genes of PPI network using cytohubba plugin.In this case, we selected 25 hub genes with highest maximal clique centrality (MCC) (Fig. 6A).Then, the intersection of genes between hub genes and DEmRNAs (Fig. 6B) showed that DKC1, PA2G4, LYAR and NOLC1 were the clinically final hub genes of CRC.
Analyzing the regulatory networks of the TF-final hub genes and miRNAs.There are several ways that miRNAs can control how genes are expressed.Networkanalyst online database was utilized to collect miRNAs that target hub genes (Fig. 7A), and in this case, we used the miRTarBase v8.0 database to discover hub miRNA interactions.Hsa-miR-16-5p was regarded as a key miRNA for the development of CRC because it interacted with the three final hub genes (degree 3).Also, we got TFs that target final hub genes from the NetworkAnalyst database (Fig. 7B), and we selected the ChEA database to find the TFs that do so.According to the TF-hub gene network, the E2F1 controls each of the four hub genes and may be important for the development of CRC.
The genetic alterations of final hub genes in CRC patients.In colorectal adenocarcinoma TCGA datasets, we looked at the final hub genes mutations using the cBioPortal database.These datasets contained 1510 samples from 3 studies.As a result, out of a total of 981 samples, gene DKC1 was altered in 9 samples,  www.nature.com/scientificreports/PA2G4 in 4, NOLC1 in 21, and LYAR in 17 samples (Fig. 8).Also, it was discovered that the LYAR and DKC1 co-occur with mutations, with a q value of less than 0.001 (Table 4).

Validation of the expression of the final hub genes.
We looked up the expression patterns of the final five hub genes (DKC1, PA2G4, NOLC1, LYAR, and E2F1) in several databases to check their reliability.
According to the UALCAN database, the expression levels of every gene were noticeably greater in COAD than in normal samples (Fig. 9A; Table 5).According to the Gene Set Cancer Analysis Lite (GSCALite) database, all of these hub genes were also considerably enhanced in COAD (Fig. 9B).The methylation of these genes was also examined using the GSCALite database, and we discovered that NOLC1 was hypomethylated based on methylation difference between tumor and normal samples in COAD (Fig. 9C), which may be connected to the high level of this gene expression in COAD.The Human Protein Atlas (HPA) database also revealed that protein levels of these five genes considerably greater in tumor samples compared to normal samples (Fig. 9D).

Discussion
CRC is regarded as a cancer with high burden in the world needing urgent identification of appropriate diagnostic biomarkers.High throughput techniques are valuable tools for comparison of expression profiles of normal and cancerous cells to find biomarkers.WGCNA is a method for identification of interconnections between genes and recognition of differentially expressed genes between two sets of samples 25 .In the current study, we used this method and find final five hub genes, namely DKC1, PA2G4, NOLC1, LYAR, and E2F1.Expression levels of these gene were noticeably greater in COAD than in normal samples.Notably, NOLC1 was hypomethylated in COAD.Therefore, overexpression of this gene can be explained by the phenomenon.Functionally, DKC1 has been shown to enhance angiogenesis through increasing HIF-1α transcription.Moreover, DKC1 can facilitate metastasis in CRC 26 .PA2G4 has also been shown to be highly expressed in a variety of cancers, including cervical cancer, CRC, nasopharyngeal carcinoma and salivary carcinoma [27][28][29][30] .In hepatocellular carcinoma, PA2G4 can promote the metastasis through increasing stability of FYN transcript in a YTHDF2-dependent manner 31 .NOLC1 is involved in determination of multidrug resistance phenotype in non-small cell lung cancer 32 .LYAR has been found to promote CRC cell mobility through activating galectin-1 expression 33 .Finally, E2F1 is a transcription factor that binds to DNA with dimerization partner proteins.This transcription factor has fundamental roles in the development of CRC 34 .
Notably, these genes have been found to be mutated in CRC samples.Thus, several lines of evidence show fundamental roles of DKC1, PA2G4, NOLC1, LYAR, and E2F1 in CRC.
To sum up, the bioinformatics strategy used in the current study revealed important roles of DKC1, PA2G4, NOLC1, LYAR, and E2F1 in the CRC carcinogenesis and potentiates these genes as biomarkers for detection of CRC and therapeutic targets for this cancer.

Figure 1 .
Figure 1.Volcano plot of DEGs; the horizontal axis represents the value of LogFC, while the vertical axis represents the mean value of -log 10 (false discovery rate).The significant dysregulated genes that fit the criteria are shown as red dots.

Figure 2 .
Figure 2. Differentially expressed mRNAs heatmaps.(A) Heatmap of upregulated DEmRNAs in CRC samples related to normal samples, (B) heatmap of downregulated DEmRNAs in CRC samples related to normal samples.

Figure 3 .
Figure 3. Results of DEmRNA analysis in Kyoto Encyclopedia of Genes and Genomes (KEGG) and the Gene ontology (GO).(A) Results of the investigation of DEmRNAs GO enrichment.The size of the bars corresponds to the gene number, while the color denotes the p value.(B) Results of DEmRNA KEGG pathway analysis.The size of the bars corresponds to the number of genes, while the color denotes the p value.

Figure 4 .Figure 4 .
Figure 4. Weighted correlation network analysis.(A) Hierarchical clustering of samples to detect the outlier samples.(B) Principal component analysis to detect the outlier samples.(C) Scale independence (left) and mean connectivity (right).(D) Using a hierarchical clustering of genes based on the 1-TOM matrix, the co-expression network modules Cluster dendrogram is arranged in a certain order.Various modules are represented by various colors.(E) Hierarchical clustering of modules (above) and heatmap of trait and modules (below).(F) Module-trait relationships.Each column is a clinical feature (CRC and normal), and each row denotes a color module.The correlation coefficient is displayed in each cell.(G) intersection of GS and MM.

Figure 5 .
Figure 5. PPI network of most significant genes, two cluster modules extracted by MCODE and GO and KEGG pathway enrichment.(A) There were 294 nodes and 1416 edges in the protein-protein interaction network created by the most significant genes.A protein is represented by each node, and a protein-protein interaction is represented by each edge.Also, the size of the nodes corresponds to the degree centrality, while the color denotes the neighborhood connectivity centrality.(B) Display of clusters in the PPI network.(C) Two clusters extracted by MCODE.(D) GO enrichment of two clusters.(E) KEGG pathway enrichment of two clusters.

Figure 7 .
Figure 7. Construction of miRNA-and TF-final hub genes regulatory networks.(A) In the miRNA-final hub genes regulatory network, it is shown that has-miR-16-5p is related to the DKC1, LYAR and PA2G4.(B) In the TF-final hub genes regulatory network, it is shown that E2F1 is related to the DKC1, LYAR, PA2G4 and NOLC1.

Figure 8 .
Figure 8.The genetic changes of the last hub genes in individuals with COAD.An asterisk (*) indicates that the gene has undergone mutations.

Figure 9 .
Figure 9. Validation of the final hub expression pattern in COAD.(A) DKC1, PA2G4, NOLC1, LYAR, and E2F1 expression patterns in COAD and normal samples were taken from the UALCAN database.(B) DKC1, PA2G4, NOLC1, LYAR, and E2F1 expression in COAD and normal samples from the GSCALite database.(C) Based on methylation difference between COAD and normal samples from the GSCALite database, there is a difference in the methylation of DKC1 and NOLC1; as the circle size increases, the level of significance becomes greater.Moreover, as the circle color shifts towards dark blue, it indicates a greater reduction in methylation in tumor samples.(D) Immunohistochemistry (IHC) of the DKC1 (DKC1 normal sample from patient 634; DKC1 COAD sample from patient 192), PA2G4 (PA2G4 normal sample from patient 1423; PA2G4 COAD sample from patient 2106), NOLC1 (NOLC1 normal sample from patient 1958; NOLC1 COAD sample from patient 4721), LYAR (LYAR normal sample from patient 2040; LYAR COAD sample from patient 4724), and E2F1 (E2F1 normal sample from patient 1423; E2F1 COAD sample from patient 2948) in COAD and normal samples from the HPA database.

Table 1 .
Details of included datasets.

Table 2 .
The top 10 up-and downregulated DEmRNAs between CRC and healthy control samples.

Table 3 .
Identified modules and their gene numbers.

Table 4 .
Mutually exclusive mutation pattern between the final hub gene pairs.