Gene expression network analysis of lymph node involvement in colon cancer identifies AHSA2, CDK10, and CWC22 as possible prognostic markers

Colon cancer has been well studied using a variety of molecular techniques, including whole genome sequencing. However, genetic markers that could be used to predict lymph node (LN) involvement, which is the most important prognostic factor for colon cancer, have not been identified. In the present study, we compared LN(+) and LN(−) colon cancer patients using differential gene expression and network analysis. Colon cancer gene expression data were obtained from the Cancer Genome Atlas and divided into two groups, LN(+) and LN(−). Gene expression networks were constructed using LASSO (Least Absolute Shrinkage and Selection Operator) regression. We identified hub genes, such as APBB1, AHSA2, ZNF767, and JAK2, that were highly differentially expressed. Survival analysis using selected hub genes, such as AHSA2, CDK10, and CWC22, showed that their expression levels were significantly associated with the survival rate of colon cancer patients, which indicates their possible use as prognostic markers. In addition, protein-protein interaction network, GO enrichment, and KEGG pathway analysis were performed with selected hub genes from each group to investigate the regulatory relationships between hub genes and LN involvement in colon cancer; these analyses revealed differences between the LN(−) and LN(+) groups. Our network analysis may help narrow down the search for novel candidate genes for the treatment of colon cancer, in addition to improving our understanding of the biological processes underlying LN involvement. All R implementation codes are available at journal website as Supplementary Materials.

II and III cancers are mainly differentiated based on nodal (N) stage, indicating the importance of lymph node (LN) involvement in prognosis. Currently, the N stage is decided by the pathologist after examination of LNs removed during surgery. However, sometimes patients are under-staged because of an inadequate number of LNs retrieved during surgery; these under-staged patients lose their opportunity for adjuvant chemotherapy resulting in a higher risk of tumor recurrence 7 . This makes the prediction or diagnosis of lymph node involvement extremely important for patient care.
To assess the diagnostic and/or prognostic possibilities regarding LN involvement in colon cancer, we analyzed and compared gene network of the gene expression in LN (+) colon cancer and LN (−) colon cancer, and identify significantly differing gene(s) from the gene networks using the Cancer Genome Atlas (TCGA) database (https://cancergenome.nih.gov/). A conventional and widely used method of gene expression profiling is the differential expression of genes (DEG). However, a DEG analysis has the evident limitation of being unable to identify interactions between multiple genes, and also the inability to ensure the involvement of the most significantly differentially expressed genes with the disease 8,9 . To overcome these limitations, we combined a network analysis referred to as the degree of centrality method with the DEG analysis 10 . The degree of centrality method is one of the simplest methods to measure the degree of the edge between a hub gene constituting a network and other genes directly connected to the hub using the number of adjacent hub genes. It is possible to identify very important hub genes or connector genes in terms of degree on the network by a degree centrality analysis, which detects how far genes are located from the center or genes acting as connectors or hubs in a network.
In addition, the protein-protein interaction network, GO enrichment, and KEGG pathways were searched using the selected hub genes from each group to better understand the regulatory relationship between hub genes and the biological events driving LN involvement in colon cancer.

Methods
Data collection and characterization. The RNA sequencing data set from colon cancer patients was obtained from Fire Browse (Version 1.1.35), which provides the TCGA data sets (accessed in Feb 2017) 11 . These data provide RNA sequencing V2 expression levels values for each gene. We reviewed and characterized the clinical information from the collected data set and divided them into two groups designated as LN (−) or (+) ( Table 1). The clinical data include age, gender, TNM stage, and radiation therapy status etc. Out of a total of 395 colorectal cancer patients, there were 179 LN(+) samples, and 216 LN(−) samples. The average age at diagnosis was 64.54 years-old in the LN(+) colorectal cancer group, whereas it was 68.29 years-old in the LN(−) colorectal cancer group.
DEG and network analysis. The RNA sequencing data from the LN (+) and (−) groups were pre-processed as follows: A total of 17,009 genes was selected after removing genes where the expression value were assigned as "0" in more than half of the samples. The expression value of each gene was converted to log2 scale and standardized for DEG analysis. Statistically significant differences in the gene expression levels of the LN(+) and LN(−) colorectal cancer samples were analyzed using a t-test. A total of 17,009 selected genes, which were used for the DEG analysis, were used to evaluate the gene networks in the LN (−) and LN (+) colorectal cancer groups based on a network estimation method. The network estimation method finds probabilistic neighbors (the edge gene in a network) for each gene (the node within the network) using a LASSO regression. The penalty parameter value in LASSO was obtained using the formula proposed by Meinshausen and Buhlmann, and it satisfied the asymptotic property 12 . The LASSO regression was performed using the R package glmnet. The hub genes in the network of LN(+) and LN(−) colorectal cancer groups were analyzed by the degree of centrality using R programming. For further network analysis, hub genes with a less than 20% coefficient of variation (CV) were selected from both groups. A CV cutoff of 20% was chosen because in general a CV of less than 20%, and not more than 30%, is considered to be an indicator of the reliability or measurement error in any analysis 13 . Survival analysis. Kaplan-Meier plots were used to estimate survival rates 14 . A multivariate analysis was used to evaluate whether the groups clustered by the expression levels of selected genes were an independent prognostic factor for overall survival. A P value less than 0.05 was considered statistically significant.

Protein-protein interaction network, Gene ontology (GO), and KEGG pathway enrichment analysis.
The protein-protein interaction network, gene ontology (GO), and KEGG pathway enrichment were searched using 353 hub genes from the LN(−) group and 240 hub genes from the LN(−) group. The protein-protein interaction network was analyzed using the Tool for the Retrieval of Interacting Genes/Proteins (STRING), and interactions with a confidence score of more than 0.95 were selected (https://string-db.org/). GO enrichment analysis was performed using DAVID Bioinformatics Resources (version 7.0). KEGG pathway enrichment analysis was performed using KEGG Mapper (https://www.genome.jp/kegg/tool/map_pathway2.html).

DEG analysis.
To analyze the DEG levels between the LN(+) and LN(−) groups, we extracted the 1918 genes with p-value < 0.005 and calculated the median gene expression levels using a Wilcoxon-test. The relative gene expression levels between the LN(+) to LN(−) groups were subdivided into upregulated and downregulated (Supplementary Table 1), which were plotted into heat map ( Supplementary Fig. 1). The genes INTS10, AGPAT5, NAT1, MINPPP1, EFR1, and PBK etc. were downregulated in the LN(+) group, which reflects an upregulation in LN(−) group. The genes TEAD3, RGL2, ITFG3, BAT3, ATF6B, and RARA etc. were upregulated in the LN(+) group, which reflects a down-regulation in the LN(−) group. www.nature.com/scientificreports www.nature.com/scientificreports/ Degree of centrality analysis. If the relationship between log degree and log number of a gene is linear, the topology suggests there is a scale-free network, which refers to a network that appears in many natural phenomena in network analyses. In a scale-free network, the degree of each gene is uneven and is concentrated at a specific hub gene. Therefore, the number of hub gene degrees in a scale-free network follows a power-law distribution. Both networks in the LN(+) and LN(−) colorectal cancer groups showed a scale-free topology ( Supplementary Fig. 2).
In the degree of centrality analysis, we calculated the number of hub genes from the pre-processed set of 17,009 genes, and as a result, a total of 16,579 hub genes with at least one edge (neighboring) gene were identified, with the degree of centrality of the edge genes sorted by degree (i.e. the number of edge genes) in each group (Supplementary Tables 2A and 3A). The mean degree per hub gene was ~7.5 with a range of 0-72 in the LN(+) group and ~7.7 with a range of 0-70 in LN(−) group, which was similar in both groups. Hub genes over 26 degrees (i.e. CV ≤ 20%) were selected, with 240 being identified in the LN(+) group and 353 in the LN(−) group, and analyzed further. As a result, 127 genes were identified as the hub (a common hub) in both groups (Supplementary Tables 2A, 3A,   www.nature.com/scientificreports www.nature.com/scientificreports/ and 12.5 (35.3%) common edge genes with the LN(−) group, with a range of 4-26, but did not share 20.5 (62%) common edge genes, with a range of 11-46, for the LN(+) group, and 23.3 (64.7%) common edge genes, with a range of 10-54, for the LN(−) group as different edges genes from each group. The representative network of hub and edge genes with high degrees in each group are shown and compared in Fig. 1 and Supplementary Fig. 3. This result indicates that there are gene network differences between the LN(−) and LN(+) groups.
A degree of centrality analysis was performed using the 240 selected hub genes in the LN(+) group with the LN(−) group, to investigate and compare how edge genes and degree are changed/altered with the same hub genes in each group (Supplementary Table 2B Table 5B). A mean degree of 4.2 (52.3%) was found in the LN(−) group, and a mean degree of 4.2 (52.3%) was found in the LN(+) group, with a range of 0-12 per hub of hub gene, with common edge genes in both group, when hub genes in the LN(−) group were applied to the LN(+) group. This result indicates that approximately 38-48% of the edge genes for the same hub genes are different in each group confirming that high degree hub genes, which may play an important role in the gene network, still exist as a hub gene network in both groups.
Furthermore, the network between hub genes was determined and the hub genes from one group were applied to the other group. Data showed that the network of hub genes from the LN(+) group changed when the same hub genes were applied to the LN(−) group (Fig. 3). The opposite analysis showed a consistent result, indicating once again that the network between the LN(−) and LN(+) groups had changed ( Supplementary Fig. 5). In addition, the networks of common hubs from both groups were compared and found to be very different, confirming that the relationship between hub genes differed between the LN(−) group and the LN(+) group ( Supplementary  Fig. 6).

Comparison of DEG sets and hub genes from each group. Hub genes from the LN(+) and LN(−)
groups were compared with the DEG set (1918 genes) to select hub genes which were highly differentially expressed between the two groups, and which may be serve as a marker to distinguish LN(−) for LN(+), and which could potentially be used as a prognostic markers (Supplementary Table 7). The analysis revealed that as the hub genes in both the LN(+) and LN(−) group, five genes were downregulated and 21 genes were upregulated in LN(+) group compare to LN(−) groups. As hub genes in the LN(+) group, three genes were downregulated and eight genes were upregulated and in the LN(−) group, thirteen genes were downregulated and 14 genes were upregulated in LN(+) group compare to LN(−) group ( Fig. 4 and Table 2). Furthermore, the expression level differences between hub genes in the LN(+) and LN(−) groups were examined, and it was shown that 155 hub genes showed significantly (p ≤ 0.05) different expression levels between the LN(+) group and the LN(−) group, even if these genes were not included in 1918 DEG set (Supplementary Table 7).

Survival analysis with selected hub genes.
To understand the potential use of these selected genes for prognosis we compared the DEG set, and the hub genes as well as the hub of hub genes with high degree, a survival analysis using a Kaplan-Meier estimation was performed (Fig. 5, Supplementary Fig. 6, and Supplementary Table 8). When the survival rate was compared with the expression levels of hub genes, the results were consistent with our expectation. Hub genes that were upregulated by the DEG analysis in the LN(+) group, such as AHSA2, ZNF767, and CDK10 showed a significantly (p ≤ 0.05) reduced survival rate in the up-regulated group compared to the down-regulated group. However, the hub genes selected as downregulated by the DEG analysis in the   LN(+) group showed a tendency, but not significantly, toward a reduced survival rate in the down-regulated group compared with the up-regulated group. CWC22, a hub gene which at the same time functions as the hub of hub genes with a high degree, and which was not a significant DEG, also showed significant survival rate differences. This result indicates the possibility of using these selected hub genes identified from a network analysis as prognostic markers. Protein-protein interaction network, GO and KEGG pathway enrichment analysis with selected hub genes. The results of STRING analysis showed a protein-protein interaction network of 41 hub genes (17.08%) in the LN (+) group and 66 hub genes (18.70%) in the LN(−) group, with >0.95 confidence score. Of these hub genes, 16 were shared among both groups, while 50 hub genes from LN(−) replaced 25 different hub genes from LN(+), resulting in protein-protein network differences between the LN(−) and LN(+) groups ( Fig. 6 and Supplementary Table 9). However, interactions between the shared hub genes did not differ between groups, which retained from LN(+) to LN(−) group.
To examine the characteristics of the hub genes from each group, the functional classifications of the hub genes were searched using the GO tool. The top 100 most significantly enriched GO terms for biological process from each group were determined (Supplementary Table 10). Of these 100 enriched GO terms, 54 were common among both groups and 46 differed between groups. The top ten most highly enriched GO terms from each group were selected and compared (Fig. 7A). Results showed that hub genes from the LN(+) group were enriched for cell motility and cell locomotion relative to that in the LN(−) group.
KEGG pathway analysis was also performed to further understand the biological functions of the genes. A total of 150 pathways from the LN(+) and 213 pathways from the LN(−) groups were enriched (Supplementary  Table 11). Of these pathways, 142 pathways were common between both groups, eight pathways (5.3%) were enriched only in the LN(+) group, and 71 pathways (33.3%) were enriched only in the LN(−) group. Pathways where more than 0.025% of hub genes were involved were compared between the LN(−) and LN(+) groups (Fig. 7B); this analysis highlighted the differences at the level of the metabolic and cell adhesion molecule pathways.

Discussion
The genetic basis of the development of colon cancer is well understood, however prognostic factors related to the LN involvement are still under investigation. Here, we have attempted to understand the pathophysiology of colon cancer and how the gene changes from LN(−) to LN(+) using a network analysis and by comparing DEGs in LN(+) and LN(−) groups of colon cancer patients using the TCGA data set. Our data showed that the gene network differs from LN(−) to LN(+), since 62-64.7% of edge genes for the same hub genes from both group differed between the two groups. However, main hub genes, such as PNCP, HEG1, SECISBP2L, and CWC22 etc. were still present, even though the edge genes changed from the LN(−) to the LN(+) group. Furthermore, the hub of hub genes with a high degree, such as HEG1, SECISBP2L, TCF4, CLIP3, MSRB3, PCNP, VIM, and ATP8B2, were hub genes with a high degree in the LN(+) group, and the hub of hub genes with a high degree, such as PCNP, XPO1, TNS1, PIKFYVE, VPS26A, CWC22, CCDC80, MATR3, ZEB1, C1S, AFF4, ZEB2, LY6G6D, and SSB, were hub genes with a high degree in the LN(−) group. These results imply that the important hub genes are not altered even if their edge genes are changed from the LN(−) to LN(+) groups. Of these selected hub genes, or hub of hub genes, PCNP, which selected as both a hub gene, as well as a hub of hub gene with a high degree in both groups. It is known that PCNP is related to control of the cell cycle and may be involved in tumorigenesis 15,16 , however, to date there have been no reports suggesting a role for PCNP in colon cancer. Another interesting hub gene observed was HEG1, a heart development protein with EGF-like domains 1, which is known Scientific RepoRtS | (2020) 10:7170 | https://doi.org/10.1038/s41598-020-63806-x www.nature.com/scientificreports www.nature.com/scientificreports/ to be associated with the stabilization of cell-cell junctions 17 and has been suggested as a tumor marker and a therapeutic target in malignant mesothelioma 18 .
To investigate possible markers to distinguish LN(−) to LN(+), hub genes from the LN(+) and LN(−) groups were compared with the DEG set to select hub genes that were highly differentially expressed between the two groups. Hub genes which were highly differentially expressed, such as APBB1, AHSA2, ZNF767, and JAK2 etc., were included within the 1918 DEGs set. A survival analysis using selected hub genes, such as AHSA2, ZNF767, SECISBP2L, CDK10 and CWC22, showed that their expression levels were significantly associated with survival rate, indicating the possibility that they could be useful as prognostic markers; these genes could not have been identified by a DEG analysis alone. AHSA2, as a hub gene, was found to be upregulated in the LN(+) group compared to the LN(−) group and was significantly associated with survival. AHSA2(AHA1) is an activator of the heat shock 90 kDa protein ATPase homolog 2, and belongs to the AHA family, which encodes proteins that can activate the ATPase activity of Hsp90 as co-chaperones 19 . The basal level of expression of AHA1 is different across a panel of different human cancer cell lines, however HCT116 cells, which is known to be a highly aggressive colon cell line, showed increased expression levels of AHA1 compared to HT29 cells, which is a less aggressive colon cancer cell line 20 . Thus, modulation of AHA1 has been suggested as a potential therapeutic strategy to increase the sensitivity to HSP90 inhibitors, since treatment with 17-AAG results in the sustained up-regulation of AHA1, and in addition the silencing of AHA1 expression increases cellular sensitivity to an HSP90 inhibitor 21 . Function of ZNF767, which is also edge gene of AHSA2 in our data, and SECISBP2L has not been studied yet. CDK10, cyclin dependent kinase 10, has been reported high expression in colon cancer and inactivation of its kinase domain showed prevention of tumor growth lately 22 . CWC22, the other upregulated hub genes in the LN(+) group, is a CWC22 spliceosome associated protein and has been suggested to be an unfavorable prognostic marker in renal and liver cancer (https://www.proteinatlas.org/ENSG00000163510-CWC22/pathology), although its function still needs to be investigated. However, hub genes, such as PCNP and HEG1, were not identified as DEGs between the LN(+) vs, LN(−) groups, even if their edge genes were changed. It is possible that there are other mechanisms, not expression differences, which need to be further explored.
In addition, the protein-protein interaction network, GO enrichment, and KEGG pathway were searched using the selected hub genes from each group. A STRING analysis was performed to further explore the physical and functional protein interaction networks among the hub genes from each group, and the results showed changes in the protein-protein interactions among the hub genes, as 50 hub genes from the LN(−) group were replaced by 25 different hub genes in the LN(+) group. Four hub genes (MYH11, MRVI1, LMOD1, and JAM3) from the LN(+) group, seven hub genes (UBA3, SETD1A, NUMA1, MRPL50, JAK2, COPS4, and BOC) from the LN (−) group, and three hub genes (PKD1, CDK1, and ABCE1) from both groups were included in the 1918 DEG (p < 0.005) set, indicating differential expression between the LN(−) and LN(+) groups (Table 2). However, survival analysis using a Kaplan-Meier estimation of these genes was not significant between LN(+) and LN(−) (Supplementary Fig. 7). In the GO enrichment analysis, cell motility enrichment was only shown in the LN(+) group, and cell locomotion enrichment was higher in the LN(+) group than that in the LN(−) group. The results of the KEGG pathway analysis showed differences at the level of the metabolic and cell adhesion molecule pathways. The cell adhesion molecule pathway is known to be associated with cell motility. Taken together, the GO Figure 6. Protein-protein interaction network among the hub genes from LN(−) and LN(+) with more than a 0.95 confidence score as analyzed by STRING. Balls represent proteins, and lines represent interactions between proteins. A red circle around a ball indicates genes shared among both groups. Red arrow indicates upregulation. Green arrow indicates downregulation. (2020) 10:7170 | https://doi.org/10.1038/s41598-020-63806-x www.nature.com/scientificreports www.nature.com/scientificreports/ and KEGG results implied that the hub genes from the LN(+) group are more related with cell movement and metastatic ability.
In conclusion, using a gene expression network analysis, we have identified hub genes, such as AHSA2, CDK10, and CWC22, as being possible prognostic markers, that were not previously known to be associated with colon cancer. Additionally, the regulatory relationships among the hub genes with respect to biological processes, and the LN involvement in colon cancer were different.Since we only used gene expression data for network construction, further research is needed to confirm the role of these genes in colon cancer. The results of this network analysis may help narrow down the search for novel candidate genes for the treatment of colon cancer, in addition to improving our understanding of the biological events underlying LN involvement in colon cancer.