Introduction

Colorectal cancer (CRC) is one of the most prevalent digestive system cancers worldwide. After breast, lung, and prostate cancers, it is the fourth most prevalent cancer in the United States1. Nearly two million new CRC cases and 700,000 deaths from this cancer are reported every year2. Its incidence and mortality rates are 25% higher in men than in women, and most reported CRC cases arise in the colon, with fewer in the rectum3. Thanks to novel diagnostic tools and therapeutic interventions, the mortality rate decreased significantly from 1999 to 20154. Although most reported cases of CRC are sporadic (70%), a significant proportion occur in patients with a family history of CRC (25%) or hereditary colorectal cancer syndromes (~ 10%)5. Therefore, further studies are still required to clarify the molecular mechanisms involved in CRC.

Non-coding RNAs (ncRNAs) are RNA molecules that are not translated into proteins and participate in various regulatory functions of the cell. They comprise different families, including microRNAs (miRNAs), small nuclear RNAs (snRNAs), PIWI-interacting RNAs (piRNAs), and long non-coding RNAs (lncRNAs)6. LncRNAs were recently identified as regulatory molecules with a length of more than 200 nucleotides. They resemble mRNAs in several respects: they carry a cap at the 5′ end, have more than one exon, are transcribed by RNA polymerase II (RNA pol II), and are located in the cytoplasm or the nucleus. However, this class also differs from mRNAs in its lower expression level, poorer conservation across species, inability to be translated into a protein, and tissue/stage-specific expression7. It has been reported that lncRNAs may be up- or down-regulated in cancerous cells compared to healthy ones, indicating a possible role for these molecules as oncogenes or tumor suppressors8.

LncRNAs are involved in carcinogenesis and progression of CRC9,10,11. It has been demonstrated that lncRNAs regulate various cellular functions related to CRC pathogenesis, including cell proliferation, apoptosis, migration, invasion, metastasis, differentiation, DNA damage, drug resistance, epithelial-mesenchymal transition (EMT), development, control of cancer stem cells, and the cell cycle12. Thanks to high-throughput methods such as microarray and RNA-seq, together with novel bioinformatics approaches, many lncRNAs with altered expression in CRC cells have been identified13. Forrest et al. identified more than 200 differentially expressed lncRNAs by analyzing RNA sequencing data from The Cancer Genome Atlas (TCGA) dataset. Moreover, they concluded that these lncRNAs regulate cell cycle genes and increase resistance to apoptosis14. Studies have reported that some lncRNAs are significantly overexpressed in CRC cells and tissues, correlating with metastasis and poor patient prognosis15. In addition to CRC cells, lncRNAs with altered expression levels have been reported in peripheral blood components such as serum or plasma16.

Weighted gene co-expression network analysis (WGCNA) is an in silico systems biology tool for analyzing gene expression in a complex network of regulatory genes. This R-based tool can identify clusters of highly correlated genes (modules) based on correlations between gene expression profiles17. Therefore, it is helpful for identifying novel diagnostic and prognostic biomarkers for cancer. Zhou et al. reported a number of hub genes and miRNAs that were associated with stages of CRC18. In the current study, the WGCNA algorithm was employed to construct a co-expression network of lncRNAs associated with CRC and their target genes. This study may help identify possible new biomarkers for CRC and provide a better understanding of the molecular pathways contributing to this disease.

Methods

Data acquisition and processing of lncRNA expression profiles

Microarray gene expression data of colorectal cancer with the series number GSE106582 were obtained from the publicly available Gene Expression Omnibus (GEO) database to identify lncRNA candidates (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse106582). GSE106582 was provided by the University clinics Freiburg (Freiburg, Germany), and CRC patients were recruited at the University Hospital of Heidelberg. Total RNA from 77 tumor and 117 mucosa samples, including 68 tumor-mucosa pairs, was analyzed using Illumina HumanHT12v4 gene chips. Next, the expression profile data were downloaded and read in the R environment using the GEOquery package19. Differentially expressed gene (DEG) analysis was performed with the Limma (linear models for microarray data) package20,21.

Weighted gene correlation network analysis and the identification of modules

Because the functions of most lncRNAs are unknown, predicting their functions depends principally on examining their co-expressed genes. Network analysis was conducted using the WGCNA package in R22 to evaluate the relative significance of lncRNAs and their module membership. Briefly, WGCNA was performed on the GSE106582 dataset obtained from 77 CRC and 117 mucosa tissues. To distinguish modules with different expression patterns, a soft threshold power was selected to create co-expression networks. Next, the Pearson correlation coefficient was used to evaluate the weighted co-expression relationships in the adjacency matrix. Then, a topological overlap matrix (TOM) similarity function was applied to transform the adjacency matrix into a TOM, which was used to estimate the co-expression relationships between genes. The networks were established by merging genes with highly similar co-expression patterns into modules. In this way, the modules containing the key lncRNAs and their co-expressed genes were obtained. The reconstructed co-expression network was visualized using the Cytoscape software (version 3.7.0) and the cytoHubba plugin (version 0.1)23.
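
The adjacency and TOM steps described above can be illustrated with a minimal sketch. It is not the WGCNA implementation used in the study: it assumes an unsigned network (the network type is not stated in the text), uses hypothetical toy data, and the function names are illustrative only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def soft_threshold_adjacency(expr, beta):
    """Unsigned adjacency a_ij = |cor(x_i, x_j)| ** beta with a zero diagonal.

    expr holds one expression vector per gene (assumed layout).
    """
    n = len(expr)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            adj[i][j] = adj[j][i] = abs(pearson(expr[i], expr[j])) ** beta
    return adj


def topological_overlap(adj):
    """TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(adj)
    k = [sum(row) for row in adj]  # connectivity of each gene
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # The zero diagonal means u = i and u = j contribute nothing.
            shared = sum(adj[i][u] * adj[u][j] for u in range(n))
            denom = min(k[i], k[j]) + 1.0 - adj[i][j]
            tom[i][j] = tom[j][i] = (shared + adj[i][j]) / denom
    return tom


# Hypothetical toy data: three "genes" measured in four samples.
toy_expr = [[1, 2, 3, 4], [2, 4, 6, 8], [1, -1, 1, -1]]
toy_adj = soft_threshold_adjacency(toy_expr, beta=2)  # toy power, not the study's
toy_tom = topological_overlap(toy_adj)
```

Raising the absolute correlation to a power suppresses weak correlations while keeping strong ones, which is why a larger power yields a sparser, more scale-free-like network. Module detection then clusters genes on the dissimilarity 1 - TOM.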

Functional annotation of the co-expressed genes in the module

Gene Ontology (GO) is a widely used resource for annotating large numbers of genes, defining the attributes of gene products in three non-overlapping domains of molecular biology: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC)24. To relate genes to their corresponding functionalities, the Kyoto Encyclopedia of Genes and Genomes (KEGG) was employed to systematically analyze gene functions (www.kegg.jp/kegg/kegg1.html)25. To determine the potential functions of the novel lncRNAs and their associated biological pathways, a functional enrichment analysis of their co-expressed genes was performed using the FunRich software (version 3.1.3).
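
Over-representation of a GO term or pathway in a gene module is commonly scored with a hypergeometric test. The sketch below shows that calculation only. Whether FunRich was run with this exact test and background set is an assumption here, and the function and parameter names are illustrative.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_annotated, n_selected, n_overlap):
    """Upper-tail probability P(X >= n_overlap) of the hypergeometric distribution.

    n_background: genes in the background set
    n_annotated:  background genes annotated with the term
    n_selected:   genes in the module being tested
    n_overlap:    module genes annotated with the term
    """
    upper = min(n_annotated, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

For example, with a hypothetical background of 10 genes, 4 of them annotated, and a module of 3 genes of which 2 are annotated, the call `hypergeom_enrichment_p(10, 4, 3, 2)` gives 40/120. In practice the resulting p-values would still need multiple-testing correction across all tested terms.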

Statistical analysis

The GSE106582 expression profile data were downloaded and read in the R environment using the GEOquery package19. Differentially expressed genes (DEGs) between CRC and mucosal samples were assessed with the empirical Bayes moderated t-test implemented in the Limma (linear models for microarray data) package20,21. An adjusted p-value (FDR) < 0.05 and |logFC| ≥ 0.5 were used as the significance thresholds to extract DEGs and differentially expressed lncRNAs (DELs) among 48,107 probe sets. The top 2450 genes meeting these criteria were selected for further analysis. Subsequently, the DEGs were filtered against 5023 lncRNAs retrieved from HGNC BioMart (https://biomart.genenames.org/) to detect DELs.
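
The thresholding and lncRNA filtering described above can be sketched as follows. This is an illustration rather than the limma workflow itself: it assumes the FDR adjustment is Benjamini-Hochberg (the method is not named in the text), and all function names and inputs are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_degs(genes, logfc, pvalues, fdr_cutoff=0.05, lfc_cutoff=0.5):
    """Keep genes with adjusted p < fdr_cutoff and |logFC| >= lfc_cutoff."""
    adjusted = benjamini_hochberg(pvalues)
    return [
        g for g, lfc, q in zip(genes, logfc, adjusted)
        if q < fdr_cutoff and abs(lfc) >= lfc_cutoff
    ]


def select_dels(degs, lncrna_symbols):
    """Keep only DEGs whose symbol appears in a reference lncRNA list."""
    reference = set(lncrna_symbols)
    return [g for g in degs if g in reference]
```

Applying the fold-change cutoff to the absolute value keeps both up- and down-regulated genes. The lncRNA filter is a plain intersection with the reference symbol list.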

Results

Identification of LncRNA candidates associated with colorectal cancer

GSE106582 gene expression profiles were selected for this study. This dataset contained 117 normal samples and 77 CRC samples. Following analysis of the dataset with the Limma package, the differences between CRC and normal tissues were presented in volcano plots (Fig. 1). Based on the criteria of adjusted P-value < 0.05 and |logFC| ≥ 0.5, a total of 2449 DEGs were screened from GSE106582, including 1170 upregulated genes and 1279 downregulated genes. LncRNA expression data analysis of the GSE106582 dataset identified 32 DELs (Supplementary File 1), among which nine were detected as novel lncRNAs with no previous association with CRC development: LINC01018, SNHG32, ITCH-IT1, ITPK1-AS1, FOXP1-IT1, FAM238B, PAXIP1-DT, ATP2B1-AS1, and MIR29B2CHG (Table 1). Of these, eight lncRNAs were down-regulated (p-value < 0.05) and one lncRNA (SNHG32) was up-regulated (p-value < 0.05) in CRC tissues compared to normal tissues.

Figure 1

Identifying differentially expressed genes between colorectal cancer and normal tissues. Heatmap of the difference between tumoral and normal samples of the GSE106582 dataset with R. Box-Scatter plot of the expression data of the lncRNAs in tumor tissues vs normal of the GSE106582 dataset for LINC01018.

Table 1 Differentially expressed lncRNAs in colorectal cancer based on analyses of GSE106582 Dataset. Log2FC < 0: down-regulated, *From NCBI RefSeqGene.

Construction of the weighted gene co-expression network

In this study, a co-expression network was constructed from GSE106582: the expression values of the 2449 DEGs were analyzed with the “WGCNA” package. First, outlier samples were removed, and hierarchical clustering analysis was performed with the “hclust” R function (Supplementary File 2). The pickSoftThreshold function was then used to assess scale independence and mean connectivity over several power values. To guarantee a scale-free network, we picked β = 14 as the soft-thresholding power (Fig. 2A) and double-checked the scale-free topology R² with a linear regression plot (scale-free R² = 0.93) (Fig. 2B). β = 14 was therefore used to produce a hierarchical clustering tree in which different colors signify different modules. As demonstrated in Fig. 2C, six co-expressed gene modules were identified, each marked by a color, with the grey module representing non-co-expressed genes. The green module is the smallest module, with 128 genes, while the blue module is the largest, with 780 genes. The grey background color represents the 289 genes not attributed to any module (Table 2). Finally, we examined the interactive connections among the six modules, plotted the heatmap of the network to show the relative independence of each module (Fig. 2D), and presented the multi-dimensional scaling (MDS) plot in Fig. 2E.
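
The scale-free check mentioned above amounts to regressing log10 p(k) on log10 k, where k is gene connectivity, and reading off the R² of the fit. The sketch below is a simplified illustration and not the pickSoftThreshold code: it bins connectivity into equal-width bins and drops empty bins, it assumes connectivity values are positive and not all identical, and the function name is hypothetical.

```python
import math


def scale_free_fit(connectivity, n_bins=10):
    """Return (slope, R^2) of the regression of log10 p(k) on log10 k.

    connectivity: per-gene connectivity values (positive, not all equal).
    Empty bins are skipped, which is a simplification.
    """
    lo, hi = min(connectivity), max(connectivity)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    totals = [0.0] * n_bins
    for k in connectivity:
        b = min(int((k - lo) / width), n_bins - 1)  # clamp the maximum value
        counts[b] += 1
        totals[b] += k
    n = len(connectivity)
    xs = [math.log10(t / c) for t, c in zip(totals, counts) if c > 0]  # mean k per bin
    ys = [math.log10(c / n) for c in counts if c > 0]                  # bin frequency
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, sxy ** 2 / (sxx * syy)
```

A negative slope with R² close to 1 indicates that the connectivity distribution approximately follows a power law. The soft-thresholding power is usually chosen as the smallest value at which this fit becomes acceptably high.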

Figure 2

Network visualization plots. (A) Scale independence and mean connectivity analysis. The proper soft threshold power = 14 was selected. (B) The histogram of connectivity distribution and the scale-free topology panels. (C) Clustering dendrogram of genes, with dissimilarity based on the topological overlap. (D) Heatmap plot to represent the TOM among the genes in different modules. (E) Multidimensional scaling (MDS) plots to describe the entire gene expression network.

Table 2 Identified gene modules and their gene numbers.

Identification of novel lncRNA modules

We investigated the modules containing the novel lncRNAs to predict their functions through their co-expressed genes and to construct a regulatory network between lncRNAs and protein-coding genes. We found LINC01018, ITCH-IT1, ITPK1-AS1, FOXP1-IT1, FAM238B, and PAXIP1-DT in the black module, ATP2B1-AS1 and MIR29B2CHG in the turquoise module, and SNHG32 in the blue module. The list of genes for each module is detailed in Supplementary File 3.

Gene co-expression modules correspond to CRC

In addition, we examined the associations between gene modules and cancer phenotype, based on the correlation between module eigengenes and clinical traits. The results revealed that two of the six gene modules were strongly correlated with tumoral status: turquoise (R = 0.91, P = 3E−69) and grey (R = 0.96, P = 1E−99). Because the grey module contains non-co-expressed genes, it was not considered for further study (Fig. 3A). In addition, the blue gene module (R = 0.91, P = 3E−69) was significantly negatively correlated with tumoral status. Other clinical traits, including age and gender, were not correlated with the gene modules. Furthermore, the eigengene dendrogram and heatmap were generated to distinguish groups of correlated eigengenes associated with tumoral status. The results re-validated the correlations of the turquoise and blue gene modules with tumoral status (Fig. 3B). Finally, the plots of module membership vs. gene significance showed that the turquoise and blue gene modules are significantly correlated with CRC, indicating that these gene modules are associated with the disease (Fig. 3C, D).
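
A module-trait relationship of this kind is a Pearson correlation between a module eigengene and a numerically coded trait. Its significance is usually obtained from a Student t statistic with n - 2 degrees of freedom. The sketch below computes the correlation and the t statistic only. The 0/1 coding of tumoral status and the function names are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def module_trait_association(eigengene, trait):
    """Return (r, t): correlation and its t statistic with n - 2 degrees of freedom.

    eigengene: one value per sample for a module
    trait:     one value per sample, e.g. 0 = normal, 1 = tumor (assumed coding)
    Requires |r| < 1, otherwise the t statistic is undefined.
    """
    r = pearson(eigengene, trait)
    n = len(trait)
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return r, t
```

A positive r means the module is more highly expressed in tumor samples under this coding, and a negative r means it is more highly expressed in normal mucosa. The same correlation applied gene by gene gives gene significance (gene vs. trait) and module membership (gene vs. eigengene).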

Figure 3

Gene co-expression modules correlated with colorectal cancer. (A) Module–trait relationships. Each cell includes the corresponding correlation and P-value. (B) The eigengene dendrogram and heatmap classify groups of correlated eigengenes. A scatter plot of the gene significance for Tumoral versus the module membership in the Turquoise (C) and Blue (D) modules.

Construction of PPI network

The cytoHubba plugin was used to construct interaction networks and identify hub genes in each module. Networks for the black, blue, and turquoise modules are depicted in Figs. 4, 5 and 6, respectively. The plugin identified thirty hub genes based on their degrees, ranks, and scores, which are summarized in Table 3.
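
Degree-based hub ranking, one of several scoring options in tools of this kind, can be sketched as below. The text does not state which scoring method produced the reported scores, so degree is used here as an assumption, and the function name and node labels are hypothetical.

```python
from collections import Counter


def top_hubs_by_degree(edges, top_n=10):
    """Rank nodes of an undirected network by degree.

    edges: iterable of (node_a, node_b) pairs. Duplicate edges and
    self-loops are ignored. Ties are broken alphabetically.
    """
    unique_edges = {frozenset(e) for e in edges if e[0] != e[1]}
    degree = Counter()
    for edge in unique_edges:
        for node in edge:
            degree[node] += 1
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]
```

With the hypothetical edge list `[("g1", "g2"), ("g1", "g3"), ("g1", "g4"), ("g2", "g3")]`, node g1 has the highest degree and would be reported first.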

Figure 4

Black module network. Network of the top 30 genes, that is, the most influential and most closely related genes of the black module. Red, orange, yellow, and blue nodes indicate the highest to lowest scores among these 30 genes, respectively.

Figure 5

Blue module network. This module network analysis indicates that the highest score belongs to the CLCA1 gene.

Figure 6

Turquoise module network. The top 10 genes in this module all have the same score and are equally involved in the respective pathways listed in Table 6.

Table 3 Identified hub genes for each module along with their ranks and scores.

Functional enrichment analysis

The GO enrichment and KEGG pathway analyses were carried out to understand the biological characteristics of all modules. The involved cellular components, molecular functions, biological processes, and biological pathways for each module are summarized in Tables 4, 5 and 6.

Table 4 Black module functional enrichment results. www.kegg.jp/kegg/kegg1.html.
Table 5 Blue module functional enrichment results. www.kegg.jp/kegg/kegg1.html.
Table 6 Turquoise module functional enrichment results. www.kegg.jp/kegg/kegg1.html.

Discussion

CRC is a global concern due to its high mortality and morbidity rates. Medical systems worldwide have endeavored to reduce the CRC rate using novel diagnostic and prognostic methods26. However, it is still one of the leading medical burdens across the globe27. High-throughput technologies such as microarrays have been valuable tools for comparing the expression profiles of normal and tumor cells. The resulting data have been invaluable for better understanding expression alterations in cancers28. Using tools like WGCNA, we can study the interconnections between genes and obtain differentially expressed genes29.

In recent years, WGCNA has been used to understand the role of lncRNAs in cancers. Using this method, Giulietti et al. identified eleven lncRNAs as key regulators in pancreatic cancer, which could be used as novel diagnostic/prognostic markers30. Jiang et al. used the WGCNA method and found four lncRNAs associated with the carcinogenesis and progression of colon adenocarcinoma. In the current study, nine differentially expressed lncRNAs (LINC01018, ITCH-IT1, ITPK1-AS1, FOXP1-IT1, FAM238B, PAXIP1-DT, ATP2B1-AS1, MIR29B2CHG, and SNHG32) were identified using microarray data analysis (GSE106582). Afterward, WGCNA was performed on the lncRNAs and their target genes, which resulted in three significant modules. Further bioinformatics analysis of the hub genes of each module showed that they are involved in specific pathways and biological processes.

The role of some of the identified DELs in cancer pathogenesis has been previously reported in the literature. Miao et al. identified LINC01018 as a prognostic marker for gastric cancer31. It also has a tumor suppressor role in hepatocellular carcinoma, where it upregulates FOXO1 by sponging miR-182-5p32. In a recent study, Liting Wang et al. (2021) found that LINC01018, hsa-miR-182-5p, and ADH4 were strongly correlated. Moreover, by regulating the expression of key proteins in important signaling pathways, this ceRNA regulatory axis may act as a checkpoint inhibitor and influence the incidence of liver cancer33.

In the study of Hu et al., a combination of five lncRNAs, including ITPK1-AS1, was introduced as a useful prognostic marker for gastric cancer. A 2020 study also found that ITPK1-AS1, an antisense ncRNA, was the gene most strongly regulated (0.56-fold induction) by e-cigarettes compared to traditional cigarettes in active bronchial epithelial cells in smokers34,35. It is also a potential prognostic biomarker of colon adenocarcinoma36. Using bioinformatics tools, FOXP1-IT1 and other lncRNAs have been recognized as useful prognostic markers for colon adenocarcinoma37. A study examining the expression pattern of different lncRNAs induced by TGF β1 predicts that FOXP1-IT1 is highly regulated by RAD21 and is possibly involved in oncogenic conversion and tumorigenesis in response to DNA repair and induction of genomic instability38. PAXIP1-AS1 is located in the glioma cell nucleus, and its overexpression increases migration, invasion, and angiogenesis of human umbilical vein endothelial cells in glioma39. In a bioinformatics study, Zhou et al. identified MIR29B2CHG as a useful prognostic marker for adrenocortical carcinoma40. A significant decrease of MIR29B2CHG was observed in triple-negative breast cancer. It is the host gene of miR-29b2 and miR-29c, which play a suppressive role in the progression of breast cancer41.

A co-expression network analysis using WGCNA between lncRNAs and their target genes resulted in black, blue, and turquoise modules. Enrichment and functional analysis using the Cytoscape plugin cytoHubba indicated that the modules are involved in fundamental cellular pathways. The black module was involved in the metabolism of RNA, the FGF signaling pathway, and the regulation of gene expression in beta cells; the blue module was mainly engaged in mesenchymal-to-epithelial transition, fatty acid and triglyceride metabolism, and fatty acid beta-oxidation; and the turquoise module played a critical role in the mitotic cell cycle, DNA replication, and the mitotic M-M/G1 phases. The gene ontology study revealed that the modules participate in essential biological processes and molecular functions. For example, the genes of the black and turquoise modules participate in regulating nucleic acid metabolism. Moreover, they act in transcription regulation and DNA/RNA binding molecular functions. In the blue module, genes are involved in metabolism, energy pathways, and transport biological processes. In addition, they function in catalytic, transporter, and hormone activities. The genes of the black and blue modules operate in lysosomes and exosomes, whereas the turquoise module genes are located in the nucleus and mitochondrion.

Based on the identified biological pathways, the modules and hub genes might have an essential role in the development and malignancy of CRC. Regarding the black module pathways, Otte et al. detected elevated expression of several self-renewal and stemness-associated genes in cultures with active FGF2 signaling42. The p38 MAPKs are a family of serine/threonine kinases that mainly respond to external stresses43. They participate in significant cancer progression-related mechanisms, including cell metabolism, invasion, inflammation, and angiogenesis44. Glucagon-like peptide 1 (GLP-1) is secreted from intestinal L-cells and participates in insulin secretion and β-cell growth. It has been suggested that sustained activation of the GLP-1 receptor may indirectly result in colon cancer through hyperinsulinemia45. As other recognized pathways in the black module regulate gene expression in pancreatic beta cells and the synthesis/secretion of incretin, there may be an association between insulin secretion, colon cancer, and the genes in this module, which needs to be studied further.

The genes in the blue module participate in the EMT process. During EMT, colon cells lose their epithelial traits and gain mesenchymal characteristics that help them migrate to other parts of the body. This process is the main reason for the liver metastasis that occurs in CRC patients46. In addition, it has been reported that lipids are required for cancer cells to proliferate. For this reason, pathways that participate in lipid synthesis would be suitable targets for the design of therapeutic agents47. As summarized in Table 5, genes in the blue module participate in fatty acid, triacylglycerol, and ketone body metabolism. Fatty acid beta-oxidation is another identified pathway that cancer cells rely on for survival, stemness, metastasis, immune suppression, and drug resistance48. Considering these findings, further study is needed to elucidate the exact role of this module and of lipid involvement in cancer.

The cytoHubba plugin pointed out several hub genes for the black (LMOD3, CDKN2AIPNL, EXO5, ZNF69, BMS1P5, METTL21A, IL17RD, MIGA1, CEP19, FKBP14), blue (CLCA1, GUCA2A, UGT2B17, DSC2, CA1, AQP8, ITLN1, BEST4, KLF4, IQCF6) and turquoise (PAFAH1B1, LMNB1, CACYBP, GLO1, PUM3, POC1A, ASF1B, SDCCAG3, ASNS, PDCD2L) modules. A number of the identified hub genes have previously been reported to participate in colorectal cancer pathogenesis. Downregulation of miR-193a-3p results in upregulation of IL17RD, which promotes colon cancer through inflammation49. This protein also promotes cancer by concealing cancer cells from immune surveillance50. Yang et al. indicated that FKBP14, a pro-proliferation factor, was upregulated in CRC tissues, which was associated with the poor prognosis of CRC patients51. CLCA1 has a tumor suppressor role, inhibiting the Wnt/beta-catenin signaling pathway and the EMT process. A study on human CRC samples indicates that its expression is significantly decreased52. Polymorphisms in the UGT2B17 gene have been associated with CRC risk in the Caucasian population53. DSC2 is the only member of the desmocollin family expressed in normal colorectal cells. A study on CRC cells shows that DSC2 is switched to DSC1 and DSC3 during cancer development54. Overexpression of AQP8 significantly decreased CRC cell growth and metastasis55. Aleksandrova et al. indicated that circulating ITLN1 concentration is correlated with CRC risk56. KLF4 as a tumor suppressor inhibits colorectal cancer cell growth and is associated with poor overall survival57. CACYBP, a component of the ubiquitin pathway, is overexpressed in CRC patients and increases cancer cell proliferation58.

Conclusion

In the current study, with the help of bioinformatics tools, the black, blue, and turquoise modules were identified as the most critical modules in the development and progression of CRC. Moreover, thirty genes were recognized as hub genes that could be possible biomarkers for the diagnosis and prognosis of CRC. In addition, nine lncRNAs with no previous association with CRC development were identified, namely LINC01018, SNHG32, ITCH-IT1, ITPK1-AS1, FOXP1-IT1, FAM238B, PAXIP1-DT, ATP2B1-AS1, and MIR29B2CHG, which may play important roles in the pathogenesis of CRC.