Construction and validation of a metabolic risk model predicting prognosis of colon cancer

Metabolic genes have played a significant role in tumor development and prognosis. In this study, we constructed a metabolic risk model to predict the prognosis of colon cancer based on The Cancer Genome Atlas (TCGA) and validated the model by Gene Expression Omnibus (GEO). We extracted 753 metabolic genes and identified 139 differentially expressed genes (DEGs) from TCGA database. Then we conducted univariate cox regression analysis and Least Absolute Shrinkage and Selection Operator Cox regression analysis to identify prognosis-related genes and construct the metabolic risk model. An eleven-gene prognostic model was constructed after 1000 resamples. The gene signature has been proved to have an excellent ability to predict prognosis by Kaplan–Meier analysis, time-dependent receiver operating characteristic, risk score, univariate and multivariate cox regression analysis based on TCGA. Then we validated the model by Kaplan–Meier analysis and risk score based on GEO database. Finally, we performed a weighted gene co-expression network analysis and protein–protein interaction network on DEGs, and Kyoto Encyclopedia of Genes and Genomes pathways and Gene Ontology enrichment analyses were conducted. The results of functional analyses showed that most significantly enriched pathways focused on metabolism, especially glucose and lipid metabolism pathways.

The performance of gene signature. We applied the metabolic risk model to TCGA datasets. Patients were grouped into high-and low-risk groups according to the median risk score. Kaplan-Meier curves were generated by survival analysis through the 'survival' package. Risk score curves were drawn using the 'pheatmap' package. Next, univariate and multivariate cox proportional hazards analysis were performed. Time-dependent receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) were used to evaluate the sensitivity and specificity of the model using the 'survival ROC' package. For evaluation of the performance of the model, we compared the prediction model with the existing model (TNM stage) by the time-dependent ROC curve 16 . The 1-, 3-, and 5-year ROC curves of the model were plotted and the values of AUC were calculated with packages 'survival' , 'survminer' , and 'survival ROC' in R. A nomogram was built to visualize the risk model and classic independent risk factors, including age, gender, tumor grade, and TMN stage, to calculate survival rate of cancer patients.
Validation of metabolic risk model. We investigated extensively in GEO datasets and collected all appropriate datasets to validate the predictive ability of the model. Patients were grouped into high-and lowrisk groups and Kaplan-Meier curves were generated by survival analysis through the "survival" package. Risk score curves were drawn using the 'pheatmap' package.
Gene set enrichment analyses. GSEA software (version 4.0.3) for Windows was downloaded from website for functional analyses. KEGG and GO (Gene Ontology) enrichment analyses of the DEGs were performed to identify potential pathways and functions by GSEA.

WGCNA of DEGs.
To assess the inter-correlation of the intensities of the DEGs and the relationship between them and feature data, a weighted gene co-expression network analysis (WGCNA) was performed using the WGCNA R package. Firstly, we conducted quality testing and preprocessing of the data 17,18 . Normalization was performed using 'preprocessCore' package, while outliers were removed and missing values were treated by using the 'impute' package. After the procedure of preprocessing, we got a reasonable and useful sample set. Secondly, a matrix of adjacencies was built according to the calculated absolute value of the Pearson's correlation coefficients between each of the gene pairs. Thirdly, the matrix was applied to a network construction and module detection function by building a weighted matrix with a scaling factor-β. Then, the adjacencies were transformed into topological overlap matrix (TOM). Similar modules were merged to trim genes whose correlation with module eigengene was less than the defined threshold (min Module Size of 30 and merge Cut Height of 0.25). Fourthly, we calculated the correlation between clinical information and module eigengenes (MEs) to identify the clinically significant modules. Finally, we calculated the correlation between genes and clinical traits, as well as the correlation between genes and MEs. WGCNA determines highly inter-connected nodes as modules designated with different colors.
PPI network construction and functional analyses. Protein-protein interaction (PPI) network was constructed by Search Tool for the Retrieval of Interacting Genes (STRING, https:// string-db. org/) and the CytoHubba plug-in in the Cytoscape software (version 3.7.1). Then we analyzed the relationships between each core genes and clinical characteristics. The 'clusterProfiler' , 'enrichplot' , and 'ggplot2' software packages in R were used to perform Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses. GO has three independent branches: molecular function (MF), biological process (BP), and cellular component (CC). An FDR of < 0.05 was defined as statistically significant.

Results
Data extraction. We extracted 452 cases and 753 metabolic genes from TCGA database, including 361 up-regulated and 392 down-regulated genes (Supplementary Table 1). Then we downloaded and combined gene expression matrix of the GSE17536, GSE17537, and GSE29621 series and obtained 177 cases. After screening, we identified 139 differentially expressed metabolic genes, including 62 up-regulated and 77 down-regulated genes, which were shown in volcano plot (Fig. 1A) and heatmap (Fig. 1B).
Construction of metabolic signature. A total of 15 prognostic genes were dug out by univariate cox regression analysis. Among them, NAT2, ENOPH1, ACAA2, UGT2A3, PAFAH1B3, SUCLG2, CPT2, and ACOX1 were associated with better overall survival outcomes, while PKM, GPX3, LPCAT1, ADCY5, ADH1B, SPHK1, and PTGDS were associated with worse overall survival outcomes (Fig. 2). Then an eleven-gene prognostic model was constructed after 1000 resamples by LASSO penalized Cox regression analysis. Risk score = 0.0011 × expression of PKM + 0.0076 × expression of GPX3 + 0.0052 × expression of LPCAT1 − 0.0060 × expression of ENOPH1 − 0.0044 × expression of ACAA2 − 0.0041 × expression  The performance of gene signature. Patients were grouped into high-and low-risk groups according to the median risk score. In the high, patients had a lower OS according to Kaplan-Meier curves (P < 0.05, Fig. 3A) and risk score distribution than in the low (Fig. 4A). The differential expression of metabolic genes in high-and low-risk groups were shown in heatmaps (Fig. 4C). Univariate Cox analysis showed that the risk of poor progno-  www.nature.com/scientificreports/ sis elevated with the risk score increasing (Fig. 5A). Multivariate Cox regression analysis demonstrated that the risk score could be an independent risk factor for OS (Fig. 5B). The AUC was 0.697 according to the ROC, which was higher than age (AUC = 0.561), gender (AUC = 0.439), and T stage (AUC = 0.672) (Fig. 6). We compared the Validation of metabolic signature. We adjusted and combined GSE17536, GSE17537, and GSE29621 as validation dataset to reanalyze. Patients were grouped into high-and low-risk groups according to the median risk score. In the high, patients had a lower OS according to Kaplan-Meier curves (P < 0.05, Fig. 3B) and risk score distribution than in the low (Fig. 4B). The differential expression of metabolic genes in high-and low-risk groups were shown in heatmaps (Fig. 4D).
GSEA. The results of GSEA showed that significantly metabolism-related pathways were enriched in the low-risk group. KEGG analysis demonstrated that most of the enriched pathways focused on fatty acids (FAs) metabolism, pyruvate metabolism, propanoate metabolism, and butanoate metabolism pathways (Fig. 9A). GO analysis showed that genes were mainly enriched in nucleoside bisphosphate metabolic process, and thioester metabolic process (Fig. 9B). In addition, there were several famous cancer-related pathways, such as cell adhesion molecules (CAMs), cytokine-cytokine receptor interaction, peroxisome, peroxisomal transport, peroxisome organization.     Figure 10A shows the determination of β parameter based on the description in the WGCNA manual. We established two modules based on genes expression pattern via average linkage clustering (Fig. 10B). Modules in gene dendrogram are shown in different colors and based on dynamic branch cutting algorithm underneath row color assigns the modules membership. Figure 10C shows a highly significant correlation between gene significant (GS) versus module membership (MM) in the turquoise module with tumor, and Fig. 10D shows the most significant and correlated module with tumor was blue module.

PPI network construction and functional analyses.
We selected the genes with high GS and MM from blue module based on the co-expression network to constructed PPI networks and highlighted the high connectivity genes. The PPI network showed the top 10 central metabolic genes in colon cancer (Fig. 11). Of all the central genes, PYGM was correlated with gender, T stage, and N stage, while ACOX1, ADH1B, and MAOB were significantly correlated with N stage (Supplementary Fig. 1). GO enrichment results show the top 10 BP terms, 10 CC terms and 10 MF terms, The KEGG enrichment results show the 30 paths ( Supplementary Fig. 2).

Discussion
In this study, we established a metabolic risk model to predict the prognosis of colon cancer based on TCGA and verified by GEO. Firstly, we identified differentially expressed metabolic genes from TCGA and constructed a prognostic model by LASSO. The gene signature has been proved to have an excellent ability to predict prognosis by Kaplan-Meier analysis, time-dependent receiver operating characteristic (ROC), risk score, univariate and multivariate cox regression analysis based on TCGA. Then, we verified the model by Kaplan-Meier curves and risk score based on GEO. Finally, we performed WGCNA, PPI network and functional analyses of DEGs.  23 , lung adenocarcinoma 24 , hepatocellular carcinoma 25 , and acute myelogenous leukemia 26 . But it has not been used in colon cancer. In this study, we identified eleven prognostic genes involved in metabolism, which revealed the metabolism alterations in colon cancer. Theses metabolic genes can be used as potential biomarkers and therapeutic target for colon cancer patients.
In this model, the expression of ENOPH1, ACAA2, UGT2A3, PAFAH1B3, CPT2, and ACOX1 predicted better prognosis, whereas the expression of PKM, GPX3, LPCAT1, ADCY5 and PTGDS predicted worse prognosis. Most of the genes have been proved to take part in pathogenesis, progression and prognosis of cancers. We compared this metabolic risk model with other metabolic signatures in previous studies and found that PAFAH1B3 is a good prognostic gene in myelogenous leukemia 26 while LPCAT1 is a risky prognostic gene in hepatocellular carcinoma 27 , which has the same function in colon cancer. Platelet activating factor acetyl hydrolase 1B3 (PAFAH1B3) is involved in ether lipid metabolism, and was reported to play an important role in tumorigenesis and aggressiveness in many different cancers 28 . Blockers of PAFAH1B3 could heightened levels of tumor-suppressing lipids, then impairs pathogenicity of different cancers, including breast, ovarian, melanoma, and prostate cancer 29 . LPCAT1, an important subtype belongs to the LPCAT subtypes, is a cytosolic enzyme which converts lysophosphatidylcholine (LPC) to phosphatidylcholine (PC). To date, the over-expression of LPCAT1 has been reported in the contribution of the progression, metastasis, and recurrence of multiple types of cancers, including cell renal cell carcinoma 30 , gastric cancer 31 , breast cancer 32 , oral squamous cell carcinoma 33 , and hepatocellular carcinoma 34 .
Enolase-phosphatase 1 (ENOPH1), a bifunctional enolase-dephosphorylase enzyme 35 , is required for polyamine biosynthesis 36 . Previous studies showed that overexpression of ENOPH1 remarkably promoted cell migration and invasiveness, whereas the downregulation of ENOPH1 significantly impaired cell migration and invasiveness 37 . Acetyl-Coenzyme A acyltransferase 2 (ACAA2) exerts an effect on β-oxidation of fatty acid, which then provide energy 38 . Previous showed that ACAA2 could abolished the apoptosis in human hepatocellular carcinoma 39 , and played a vital role in the metabolism processes in gliomas. The higher expression of ACAA2 was associated with better prognosis of colorectal cancer 40 . Carnitine palmitoyl transferase 2 (CPT2), a ratelimiting enzyme for mitochondrial fatty acid transportation, had been proved to be a protective prognostic gene for colorectal cancer 41 , which is in accordance with our results. Peroxisomal Acyl-Coa Oxidase 1 (ACOX1) is the rate-limiting enzyme in fatty acid β-oxidation. The inhibitor of ACOX1 is SIRT1, which has been proved to prevent oxidative damage and is downregulated in liver cancerr 42,43 , suppresses colorectal cancer metastasis by transcriptional repression of miR-15b-5p 44 .
Pyruvate kinase M (PKM), a metabolic regulator, participated in both glycolytic and non-glycolytic pathways. PKM also acts as protein kinase, which shifts the glucose metabolism from the respiratory chain to aerobic glycolysis in tumor cells. PKM2 is upregulated in most cancer types, and contributes to tumorigenesis 45 , which suggested that it could act as a remarkable therapeutic target 46 . The glutathione peroxidases-3 (GPX3), a selenocysteine-containing redox enzyme, took part in reactive oxygen species signaling and immunomodulatory 47 . Previous studies suggested that GPX3 prevented the colitis-associated carcinoma by immunomodulation 48 . But the correlation between GPX and prognosis of colon cancer is not clear. The tumor suppressive effect of www.nature.com/scientificreports/ prostaglandin D2 (PGD2) on testicular cancer and gastric cancer has been confirmed 49 . However, the correlation between PGD2 and colon cancer is unclear.
The gene signature has been proved to have an excellent ability to predict prognosis by Kaplan-Meier analysis, ROC and AUC, univariate and multivariate cox regression analysis based on TCGA. In addition, we compared the model with TNM stage by the ROC curve and the values of AUC. The results showed that our predict model was more accurate in the prediction of the 5-year survival rate (AUC = 0.775 vs 0.758). Besides, genetic testing is more accurate and convenient, without the need for surgery, so the prediction model is much more advanced for clinicians and researchers.
It is acknowledged that diverse factors affect tumor development and prognosis, including clinic-pathological diagnosis, immunology regulations, gut microbiome, and metabolism of glucose and lipids 50 . The TNM classification system is the major pathological staging method, which is the most widely used staging system for clinical decision making 51,52 . Increasing evidence indicates the key role of systemic immunity in tumor progression and outcome [53][54][55] . Recent studies highlight the role of the microbiota composition in intestinal inflammation in mice and humans, which has been linked to colorectal cancer 56 . There have been many publications and studies on the risk prediction model for the prognosis of colon cancer based on immune genes [57][58][59][60][61][62] . The abnormal metabolism of tumor glucose and lipids is an important part of tumor metabolic reprogramming, which is closely related to tumor occurrence, development, and prognois 50 . In this study, the functional analyses revealed many metabolic pathways related to prognosis of colon cancer. In one hand, the results validated the close association between metabolic systems and colon cancer. GSEA analyses showed most significantly enriched pathways focused on fatty acid metabolism, butanoate metabolism, propanoate metabolism and pyruvate metabolism pathways. Most of the metabolic genes in the model are fatty metabolism related genes, including PAFAH1B3, LPCAT1, ACAA2, ACOX1, and CPT2.The top four significant genes in PPI network ACOX1, CPT2, ACAA2, and PKM were involved in lipid metabolism or glycolysis. Changes of glucose and lipid metabolism is one of the key features of cancer cells because cell proliferation requires increased lipid biosynthesis, and bioactive molecules produced by lipid catabolism will act as signal molecules to regulate cancer metastasis 63,64 . Increased glycolysis is the main source of energy supply in cancer cells that use this metabolic pathway for ATP generation. Fatty acids are the preferred substrates for energy storage, mainly in the form of triglyceride. Availability of fatty acids in the blood and peripheral tissues depends on lifestyle and diet and is likely to be altered in patients who are obese or have metabolic syndrome. It is important to keep a healthy diet, including less high saturated fatty acids and lose weight.
In the other hand, most of the metabolic pathways were enriched in the low-risk patients, while the nonmetabolic pathways were enriched in the high-risk patients. As a result, metabolic therapy may be more suitable for low-risk patients, while high-risk patients might benefit more from nonmetabolic therapy. Previous clinical studies have shown the effectiveness of glycolytic therapy to suppress cancer progression [65][66][67] . So far, the application of metabolism in malignant tumors includes diagnosis, identification of biomarkers and tumor metabolic processes. In the future, colon cancer patients might benefit from metabolic-related therapy and management.
However, there were some limitations in this study. Firstly, there was no functional experiment in the real world. Secondly, deep learning is widely used for intelligence medicine to assistant disease diagnosis and risk prediction, like imaging data, genomic or transcriptomic data 68 , but we adopted Lasso and multivariate Cox regression to build a risk model which may result in some limitations. Thirdly, we didn't test disease-associated modules for preservation by use of an independent dataset, which may lead to reliability decreases 17,18 . Moreover, it is relatively weak to take only metabolic genes into prognostic model because much more complicated mechanisms contributed to colon cancer.

Conclusions
We constructed and validated a metabolic risk model to predict the prognosis of colon cancer based on TCGA and GEO database. Metabolic-related therapied and managements could be considered for colon cancer patients. However, validation and functional experiments of the metabolic risk model are indispensable.

Data availability
The datasets and analyses during the current study are available in TCGA and GEO database.