Prognostic model development for classification of colorectal adenocarcinoma by using machine learning model based on feature selection technique boruta

Colorectal cancer (CRC) is the third most prevalent cancer type and accounts for nearly one million deaths worldwide. The CRC mRNA gene expression datasets from TCGA and GEO (GSE144259, GSE50760, and GSE87096) were analyzed to find the significant differentially expressed genes (DEGs). These significant genes were further processed for feature selection through boruta and the confirmed features of importance (genes) were subsequently used for ML-based prognostic classification model development. These genes were analyzed for survival and correlation analysis between final genes and infiltrated immunocytes. A total of 770 CRC samples were included having 78 normal and 692 tumor tissue samples. 170 significant DEGs were identified after DESeq2 analysis along with the topconfects R package. The 33 confirmed features of importance-based RF prognostic classification model have given accuracy, precision, recall, and f1-score of 100% with 0% standard deviation. The overall survival analysis had finalized GLP2R and VSTM2A genes that were significantly downregulated in tumor samples and had a strong correlation with immunocyte infiltration. The involvement of these genes in CRC prognosis was further confirmed on the basis of their biological function and literature analysis. The current findings indicate that GLP2R and VSTM2A may play a significant role in CRC progression and immune response suppression.

Feature selection and prognostic model development. The boruta provided 33 confirmed features for the classification of the TCGA-CRC data between normal and tumor class based on the gene expression data after 10 iterations. A total of 43 (ranked as 2) tentative and 94 rejected and 33 (ranked as 1) features were confirmed to be used for further analysis as shown in Table 3.
The Random Forest (RF) classifier was implemented on the gene expression data of these 33 confirmed features. An accuracy score of 100% was obtained for this RF-based prognostic model. The performance metrics for the RF-based prognostic model is provided in Table 4 which shows the 100% score for both the sensitivity and positive predictive value. The confusion matrix of the training and testing dataset is provided in Fig. 4a,b, respectively and the ROC analysis for the model with an AUC curve is shown in Fig. 4c which shows the predicted and truth class labels with their classification values for tumor and normal sample classes.
Prognostic model cross-validation and survival analysis. The stratified K-Fold method has provided the list of possible accuracy, maximum accuracy, minimum accuracy, and standard deviation of the developed prognostic model for 33 confirmed features. The model has achieved 100% accuracy (maximum and minimum) with 0% standard deviation. The overall survival analysis was performed for the 33 confirmed features and only 2 genes namely GLP2R, and V-set and transmembrane domain containing 2A (VSTM2A) were identified which had log-rank p < 0.05. The GLP2R had log-rank p = 0.02 and VSTM2A had log-rank p = 0.014, respectively as shown in Fig. 5a. The gene expression analysis based on count data information shows significant downregulation of tumor class for GLP2R and VTSM2A genes when compared with normal class analyzing TCGA-CRC dataset as shown in Fig. 5b.
The gene expression profile of GLP2R and VSTMA genes across TCGA repository for different cancer types is shown in Fig. 6. It is visible that the expression of GLP2R and VSTM2A is downregulated in CRC datasets while the expression in normal samples is upregulated.
Correlation analysis between the final set of genes and immunocytes. The correlation analysis between the final set of genes GLP2R and VSTM2A has shown a positive correlation for TCGA-CRC datasets through TIMER. The correlation value for COAD and READ datasets are 0.435 and 0.411, respectively as shown in Fig. 7.
The correlation analysis for the GLP2R and VSTM2A genes with immunocytes was also analyzed and was found that there is a strong correlation between TIIC with CRC as shown in Fig. 8a,b. The maximum correlation for GLP2R (TCGA-COAD cor = 0.423, TCGA-READ cor = 0.355) and VSTM2A (TCGA-COAD cor = 0.26, TCGA-READ cor = 0.14) gene expression with immunocytes infiltration level was found with CD4 + T-cells and www.nature.com/scientificreports/      www.nature.com/scientificreports/ model as shown in Fig. 9. Since the GLP2R and VSTM2A had shown significant downregulation in tumor cells their biological functions which would get affected are mentioned in Table 5. The brainstem, hypothalamus, stomach, colon, ileum, jejunum, lung, and duodenum widely express GLP2R and its ligand GLP2 helps in stimulating the GLP2R activity in the intestinal subepithelial fibroblasts (ISEMFs).   Table 3. List of rejected, tentative and confirmed features selected through boruta algorithm. www.nature.com/scientificreports/ The release of growth factors after GLP2R activation promotes colonic epithelial proliferation 24 . The downregulation of VSTM2A is linked with poor survival of CRC patients because it acts as a novel antagonist of the Wnt signaling pathway. The direct binding of VSTM2A with LDL receptor related protein 6 (LRP6) induces lysosomemediated degradation and endocytosis of LRP6 25 .

Discussion
Colorectal cancer is one of the leading causes of morbidity worldwide. The early diagnosis of CRC can be helpful in improving the survival rate of patients rather than diagnosing at a later stage 26 . Therefore, studies for identification of diagnostic and prognostic biomarkers with their predictive efficacy can be of great significance [27][28][29] . A study conducted by sun et.al. identified a set of immune-related genes (IRGs) which were used to develop the prognostic model for the classification of colon adenocarcinoma 30 . The CTHRC1 was identified as a prognostic predictor for CRC which is a peritoneal metastasis-related gene 31 .
The current study used TCGA-CRC and 3 GEO CRC datasets for the identification of significant DEGs which will further help in the identification of immune-related gene signatures for CRC prognosis. A list of 170 significant common DEGs were identified through topconfects analysis along with the DESeq2 R package.
The 33 features of importance were obtained after the implementation of the feature selection technique boruta algorithm which is based on the working principle of RF although it can be used with other ML algorithms also. These 33 confirmed features of importance were utilized to develop a prognostic classification model through RF algorithm on the TCGA-CRC dataset for classifying the tumor and normal data. These 33 features gene expression count data for 695 CRC samples were used to divide for training and testing dataset in a 7:3 ratio. The gene expression dataset of CRC with 33 confirmed features have scored 100% accuracy with the test dataset containing 209 samples. The accuracy obtained through RF prognostic classification model was cross-validated through the stratified k-fold technique and the observed standard deviation was 0 which shows the robustness of the prognostic ML model. The overall survival analysis has provided 2 genes (out of 33 genes) GLP2R (log-rank p-value = 0.02) and VSTM2A (log-rank p-value = 0.014) which had passed the threshold of log-rank p-value < 0.05. The gene expression analysis has shown significant downregulation of both the genes and both GLP2R and VSTM2A genes have shown a positive correlation average value of 0.42 (TCGA-COAD and TCGA-READ). The ROC curve analysis with the area under the curve (AUC) for GLP2R and VSTM2A gene has shown the value of 99.99% and 99.68%, respectively.
Immunocyte infiltration analysis for GLP2R and VSTM2A has shown a positive correlation with TIIC. GLP2R is predominantly found in the gastrointestinal tract and is important for maintaining the integrity of the colonic www.nature.com/scientificreports/ epithelial cells. Studies have shown that GLP2R expression stimulates colonic epithelial cell proliferation and inhibits apoptotic cell death in the crypt compartment [32][33][34][35] . Gene expression analysis through TCGA-CRC data, GEPIA, and TIMER has shown the downregulation of GLP2R in the tumor samples as compared to the normal samples.
There are some known possible molecular mechanisms that promotes CRC tumorigenesis by binding GLP-2 to GLP2R. One of the mechanisms involves GLP-2 induced colon tumor promotion by stimulating insulin-like growth factor 1 (IGF-1) synthesis in intestinal subepithelial fibroblasts and its (GLP-2) own synthesis [36][37][38] . The GLP2 (secreted through colon tumor cells) binds to GLP2R expressed by carcinoma-related fibroblasts. CAFs are the cell phenotypes that composes the tumor microenvironment and play a significant role in the communication between the compartments of the epithelium and stroma. This leads to angiogenic enhancement, cytokine and growth factor secretion, and immune response suppression which enables tumor cells to expand, differentiate and invade 39,40 .
Another mechanism that stimulates CRC proliferation involves the phosphatidylinositol 3 kinase/protein kinase B (PI3-K/Akt) activation through Fibroblast-expressed IGF-1. The expression of IGF-1 is stimulated through Akt and subsequently, IGF-1 is released from the fibroblasts 41,42 . Following this, IGF-1 binds to colonocyte receptors and transmits signals via the PI3-K/Akt/GSK-3β pathway 38,43 . A study conducted on HT29 colon cancer cells has shown that IGF-1 stimulates the PI3-K activity which further induces Akt phosphorylation 44 . To increase the proliferation of tumor cells, GLP2 might boost the production of IGF-1, which, upon binding with IGF-1r in colon tumor cells, would activate Akt signalling.
VSTM2A regulates preadipocyte cell differentiation. The gene expression analysis has shown a significant reduction in tumor samples as compared to the normal CRC samples. According to studies, VSTM2A downregulation and VSTM2A DNA promoter hypermethylation are linked to poor prognosis of CRC patients and colorectal tumorigenesis is affected by the hyperactivation of the Wnt/β-catenin signalling pathway. The interaction between VSTM2A and LRP6 inhibits the Wnt signalling intracellularly. It occurs through suppressing LRP6 protein expression and inhibiting LRP6 phosphorylation, both of which are dose-dependently increased by the presence of VSTM2A protein 45 . According to studies, the binding of ligands to their receptor often www.nature.com/scientificreports/ induces receptor endocytosis 46 . Studies have shown that the interaction between VSTM2A and LRP6 induces the lysosome-mediated degradation and endocytosis of VSTM2A. This study has certain limitations as it has been performed on the publicly available dataset and external realtime dataset validation will be required for the developed prognostic classification model.

Conclusion
In recent years several studies have been designed for the early diagnosis of CRC with the assistance of technological advancements. This study utilizes the significant DEGs to further develop the ML-based prognostic classification model. The developed ML model has shown consistent performance with the cross-validation algorithm and has a 0% standard deviation. The finalized immune-related genes GLP2R and VSTM2A had shown a positive correlation with the immunocyte infiltration and have a role in the suppression of immune response. The biological, functional, and gene expression analysis has also proved the role of these genes in CRC progression.  www.nature.com/scientificreports/ Identification of significant DEGs. TCGAbiolinks package was used to pre-process the TCGA-CRC mRNA gene expression data. The TCGA-CRC dataset genes which had correlation cutoff value less than 0.6 were removed for the DEG analysis by utilizing the command "dataPrep <-TCGAanalyze_Preprocessing(object = data-Prep, cor.cut = 0.6, datatype = "HTSeq -Counts")" and the normalization was performed through data-Norm <-TCGAanalyze_Normalization(tabDF = dataPrep, geneInfo = geneInfoHT, method = "gcContent"). Thereafter, the threshold implemented to identify the DEGs between normal and CRC tumor samples using the edgeR by glmRT method was FDR cut-off of 0.01 and |log 2-FC|> 1.5. The gene expression count data for GEO datasets were processed through DESeq2 17 R Bioconductor package. The initial pre-filtering of the low count reads were performed through the rowsums count parameter function of the DESeq2 package by using the command : dds <-rowSums(counts(dds)) > = 100. It filters the genes which had row sum value less than 100 in terms of gene counts. The GEO datasets normalization was performed through counts function by applying the R command: normalized_counts <-counts(dds, normalized = TRUE). Further, the DEGs for GEO datasets were identified by applying a threshold of adj. p-value < 0.05 and |log 2-FC|> 1.5.

Methods
To obtain a significant list of DEGs from GEO datasets "topconfects" 18 Bioconductor R package was employed which ranks the genes based on the confidence score bound to log fold change value. The deseq2_confects function was used on all GEO datasets with a step value of 0.5. The common shared significant genes list between TCGA-CRC and all GEO datasets after topconfects implementation were used for further analysis. The common genes list was analyzed by the venny tool 19 .
Feature selection and prognostic model development. The gene expression data of 170 common features (genes) was used for developing a prognostic model by utilizing TCGA-CRC dataset. The jupyter-notebook platform was used for model implementation. The features of importance were identified through boruta feature selection 20 algorithm, which was implemented through python library "BorutaPy" based on Random Forest 21 which selects features by generating shadow attributes for the original set of features. The parameters which were used by boruta were number of iterations = 10 and number of estimators = 'auto' . The confirmed identified features were further used for developing a prognostic classification model by using RF algorithm utilizing the python library "RandomForestClassifier". The training and testing dataset of TCGA-CRC was divided into a ratio of 7:3, respectively. To evaluate the performance of the prognostic model accuracy score, precision, recall, and f1-score were calculated by using confusion matrix information by utilizing "sklearn.metrics" python library.
Prognostic model cross-validation and survival analysis. The prognostic ML classification model was cross-validated through the stratified K-Fold method utilizing the "StratifiedKFold" python library (number of splits = 4) to observe the performance consistency in terms of accuracy and statistical significance for TCGA-CRC dataset classification. The confirmed features were further analyzed for survival analysis through GEPIA2 online server (http:// gepia2. cancer-pku. cn/) 22 based on the threshold of log-rank p value < 0.05. The features that had passed the before-mentioned threshold were analyzed for their gene expression count data.
Correlation analysis between the final set of genes and immunocytes. Tumor Immune Estimation Resource (TIMER) is a publicly available web resource for analyzing tumor-infiltrating immune cells (TIIC)  ROC Curve and literature analysis. The identified significant features for CRC classification were further analyzed for their receiver-operating characteristic (ROC) curve analysis by using the TCGA-CRC dataset gene   www.nature.com/scientificreports/ Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.