Identification of gene signatures for COAD using feature selection and Bayesian network approaches


Combining the TCGA and GTEx databases provides more comprehensive information for characterizing the human genome in health and disease, especially for understanding the genetic alterations underlying cancer. Here we analyzed the gene expression profiles of COAD in tumor samples from TCGA and normal colon tissues from GTEx. Using the SNR-PPFS feature selection algorithm, we discovered a 38-gene signature that performed well in distinguishing COAD tumors from normal samples. A Bayesian network of the 38 genes revealed that differentially expressed genes (DEGs) with similar expression patterns or functions interacted more closely. We identified 14 up-regulated DEGs (up-DEGs) that were significantly correlated with tumor stage. Cox regression analysis demonstrated that tumor stage and dysregulation of STMN4 and FAM135B were independent prognostic factors for COAD survival outcomes. Overall, this study indicates that using feature selection approaches to select key gene signatures from high-dimensional datasets can be an effective way to study cancer genomic characteristics.


Cancer is a major public health burden around the world, and it is the second leading cause of death in the United States1. According to the most recent American Cancer Society statistics for 2021, colon and rectum cancer (CRC) ranks third in incidence and is the third leading cause of cancer-related death worldwide. CRC remains one of the most common malignant tumors of the digestive system, and colon adenocarcinoma (COAD) accounts for 95% of all cases of colon cancer2.

Cancers are well understood to be caused by genetic abnormalities in the target cells. In general, acquired mutations and epigenetic changes can influence tumor cell chromatin architecture and gene expression levels. As a result, identifying specific genetic markers that promote molecular diagnosis and precision medicine is one of the most important aspects of cancer research. The Cancer Genome Atlas (TCGA) program, an invaluable resource for cancer genomics, provides publicly available datasets for the development of improved methods for cancer diagnosis, treatment, and prevention3,4. The TCGA program molecularly characterizes over 20,000 primary cancer and matched normal samples spanning 33 cancer types, including COAD. Another human genomics project, the Genotype-Tissue Expression (GTEx) project, establishes a reference resource of gene expression from ‘normal’, disease-free tissues5,6. The GTEx project was established to characterize human transcriptomes within and across individuals for a wide range of primary tissues and cell types, including colon tissue6. Thus, combining the datasets from TCGA as a tumor resource and GTEx as a normal sample resource expands opportunities for data mining and a deeper understanding of gene signatures in cancer research7,8,9.

Clinical diagnosis or prognosis prediction for cancer patients based on high-throughput gene expression data depends greatly on the accuracy of disease classification. This necessitates classification models for cancer samples with high accuracy and a low risk of misclassification. Gene expression data, such as RNA-sequencing or microarray data, usually suffer from the dimensionality problem: too many gene features and relatively few samples. It is usually impractical to examine all of the features during gene expression analysis. As a result, feature selection has become a prominent approach for disease classification, especially in datasets with a large number of features. It can eliminate relatively unimportant variables and improve classification accuracy and performance10. Wu et al.11 selected 300 biomarkers from 13,990 features with a combination of seven algorithms, including logistic regression and feature selection methods. A hybrid feature selection algorithm has also been used to search for optimal tumor biomarkers that distinguish tumor from normal samples12. Wavelet kernel ridge and radial basis kernel ridge regression were proposed to select the most relevant features for classification of microarray medical datasets13. Using a random forest model for feature selection, researchers identified a six-gene signature for predicting survival status in patients with head and neck squamous cell carcinoma (HNSCC) from the TCGA-HNSCC dataset14. A five-gene signature (RGS11, RGS10, RGS13, RGS4, and RGS3) has been identified as an independent prognostic factor for ovarian cancer patients using LASSO Cox analysis15. In a study of melanoma, a feature selection approach was applied to discover and validate metastasis-related biomarkers based on single-cell gene expression datasets16.

The current study aimed to identify gene signatures that could be used to classify COAD samples and normal colon tissues. Specifically, we established a feature selection model, SNR-PPFS, by combining the signal-to-noise ratio (SNR) ranking algorithm17,18 with the predictive permutation feature selection (PPFS) algorithm, a Markov blanket (MB) based feature subset selection method. The PPFS algorithm considers features both individually and collectively in order to provide the best set of features. Bioinformatic and biological analyses were also carried out to investigate the potential biological significance of the candidate genes identified through feature selection. We anticipate that our research will provide a novel methodological foundation for the identification of biomarkers for COAD as well as for other cancer types.

Methods and materials

Data acquisition

Figure 1 depicts an overview of the study design. The datasets for a combined cohort of TCGA, TARGET, and GTEx samples were obtained from the UCSC Xena website19. First, the total RSEM expected_count (DESeq2 standardized) dataset was downloaded as the overall gene expression profile, which contains 19,039 samples from both tumors and normal tissues. We then selected samples of COAD tumor and normal colon tissue (selection criteria: for tumor tissue, primary_disease_or_tissue = “Colon Adenocarcinoma”; for normal tissue, primary_site = “Colon”) from the total gene expression dataset for the current study. In total, 637 samples were included, comprising 289 COAD tumor samples (from TCGA) and 348 normal samples. The 348 normal samples comprised 41 normal samples from the TCGA-COAD cohort and 307 normal colon tissues from GTEx. We also downloaded the TCGA-COAD cohort’s phenotype and survival data for bioinformatic and biological analysis. The clinicopathological characteristics of the 289 COAD tumor samples are summarized in Table 1.

Figure 1

Overview of the study design.

Table 1 Clinical characteristics of COAD cases (data from the TCGA database), as well as Cox regression analysis of the clinical parameters.

Gene feature selection using SNR-PPFS algorithms

After obtaining the gene expression dataset of 637 samples, we performed feature selection to identify gene signatures that classify samples into tumor and normal groups. As shown in Fig. 1, the gene feature selection process consisted of two main steps: gene screening using the SNR algorithm, followed by selection of relevant genes using the PPFS method. All steps were performed in Python 3.8.

Screening genes using the SNR algorithm

SNR is an effective screening method that can quickly filter out genes that are unrelated to the class label. It is defined as follows: the numerator is the absolute difference between the mean expression values of gene gi in the two groups (tumor and normal, denoted + and −), and the denominator is the sum of the standard deviations of gene gi in the two groups. The higher the signal-to-noise ratio, the more important the gene is for classification.

$$SNR(g_{i} ) = \frac{{|u_{ + } (g_{i} ) - u_{ - } (g_{i} )|}}{{\delta_{ + } (g_{i} ) + \delta_{ - } (g_{i} )}}$$
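The SNR score above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code used in the study: the function names, the toy gene names, and the choice of the population standard deviation are assumptions made here for the example.

```python
from statistics import mean, pstdev

def snr(tumor_values, normal_values):
    """SNR of one gene: |mean_tumor - mean_normal| / (sd_tumor + sd_normal).

    Assumption: population standard deviation (pstdev); the text does not
    say whether the population or the sample estimator was used.
    """
    spread = pstdev(tumor_values) + pstdev(normal_values)
    difference = abs(mean(tumor_values) - mean(normal_values))
    if spread == 0:
        # Both groups are constant: undefined ratio, handled explicitly.
        return 0.0 if difference == 0 else float("inf")
    return difference / spread

def rank_genes_by_snr(expression, labels, top_k):
    """Return the top_k gene names with the highest SNR.

    expression: dict mapping gene name -> list of expression values
    labels:     list with 1 for tumor and 0 for normal, one per sample
    """
    scores = {}
    for gene, values in expression.items():
        tumor = [v for v, y in zip(values, labels) if y == 1]
        normal = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = snr(tumor, normal)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy data (hypothetical gene names): three tumor and three normal samples.
labels = [1, 1, 1, 0, 0, 0]
expression = {
    "GENE_A": [9, 10, 11, 1, 2, 3],  # well separated between groups
    "GENE_B": [5, 6, 4, 5, 4, 6],    # same mean in both groups
    "GENE_C": [8, 2, 5, 1, 9, 5],    # same mean, large spread
}
top_genes = rank_genes_by_snr(expression, labels, top_k=1)
```

In a real pipeline the same calculation would normally be vectorized over an expression matrix; the loop form is used here only to keep the correspondence with the formula visible.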

Obtaining the Markov blanket genes using PPFS

The definition of Markov blanket

The Markov blanket is a concept widely used in feature selection; it is described by the following definitions and illustrated in Fig. 2. The Markov blanket of a target node contains all the information relevant to that node, so nodes outside the Markov blanket can be safely discarded for the purpose of feature selection.

Figure 2

Diagram of an example Markov blanket in a causal network. The yellow T node inside the red rectangle is the target node, the other nodes in the rectangle form the Markov blanket of T, and T is independent of any node outside the rectangle given its Markov blanket.

Definition 1

(Markov condition) Any variable (node) in a Bayesian network is independent of its non-descendants given its parents.

Definition 2

(Faithfulness) Let G denote a Bayesian network and let P denote a joint probability distribution. G and P are said to be faithful to one another if all the conditional independencies entailed by G and the Markov condition are present in P.

Definition 3

(Markov blanket) Under the faithfulness condition, MB(Y) is the minimal set conditioned on which all other variables are independent of Y, i.e., (X \ MB(Y)) ⊥ Y | MB(Y).
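When the network structure is known and faithfulness holds, the Markov blanket of a node consists of its parents, its children, and the other parents of its children (its spouses). The sketch below reads this set off a small hand-written directed acyclic graph; the graph, the node names, and the function name are hypothetical and serve only to illustrate Definition 3.

```python
def markov_blanket(parents_of, target):
    """Markov blanket of `target` in a DAG.

    parents_of: dict mapping each node to the set of its parent nodes.
    Returns parents, children, and spouses (other parents of the children).
    """
    parents = set(parents_of.get(target, set()))
    children = {node for node, ps in parents_of.items() if target in ps}
    spouses = set()
    for child in children:
        spouses |= set(parents_of[child])
    spouses.discard(target)  # the target is not part of its own blanket
    return parents | children | spouses

# Hypothetical DAG: F -> A -> T <- B,  T -> C <- D,  C -> E
dag = {
    "T": {"A", "B"},
    "C": {"T", "D"},
    "E": {"C"},
    "A": {"F"},
    "B": set(),
    "D": set(),
    "F": set(),
}
blanket = markov_blanket(dag, "T")  # A and B (parents), C (child), D (spouse)
```

Nodes E and F are connected to T only through blanket members, so conditioning on the blanket makes them uninformative about T, which is why they can be dropped in feature selection.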

Predictive permutation feature selection

PPFS20 is a Markov blanket theory-based feature selection algorithm that selects a subset of features based on their performance both individually and as a group. It automatically decides how many features to keep while searching for the optimal combination of features, and it performs particularly well on high-dimensional data. In this study, we combined SNR and PPFS to obtain the final gene signatures for classifying tumor and normal samples; the procedure is detailed in Algorithm 1.

Algorithm 1
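The core idea behind permutation-based selection can be illustrated with a much simplified sketch: a feature is kept only if shuffling its values degrades a classifier's accuracy. The code below is not the published PPFS procedure and is not the code used in this study. The nearest-centroid classifier, the in-sample accuracy, the fixed threshold of 0.05, and all names are assumptions chosen to keep the example short and dependency-free.

```python
import random

def fit_centroids(rows, labels):
    """Nearest-centroid classifier: one mean feature vector per class."""
    centroids = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        centroids[cls] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def predict(centroids, row):
    """Assign the class whose centroid is closest in squared distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda cls: sq_dist(centroids[cls]))

def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)

def permutation_importance(rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when one feature column is shuffled."""
    rng = random.Random(seed)
    centroids = fit_centroids(rows, labels)
    baseline = accuracy(centroids, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(centroids, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data: feature 0 separates the classes, feature 1 carries no signal.
rows = [[9, 5], [10, 6], [11, 5], [10, 6],   # tumor
        [0, 6], [1, 5], [2, 6], [1, 5]]      # normal
labels = [1, 1, 1, 1, 0, 0, 0, 0]
importances = permutation_importance(rows, labels)
selected = [j for j, imp in enumerate(importances) if imp > 0.05]
```

A practical implementation would measure the accuracy drop on held-out data and decide membership with a statistical test rather than a fixed threshold; the sketch only shows why an uninformative feature receives an importance of zero.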

Bioinformatic and biological analysis

Bayesian network and gene functional annotation

Feature selection yielded a set of candidate genes. We then used the PC-stable Bayesian structure learning algorithm21 to construct a gene regulatory network based on the expression profiles of the candidate genes. Furthermore, protein–protein interaction (PPI) network analysis and functional annotation were carried out using the online STRING platform for functional protein association networks.
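Constraint-based structure learning algorithms such as PC-stable build the network from repeated conditional independence tests. The text does not state which test was used, so the sketch below assumes a common choice for continuous data: a first-order partial correlation with a Fisher z-test. The function names and the toy vectors are hypothetical.

```python
import math
from statistics import NormalDist

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

def independent_given(x, y, z, alpha=0.05):
    """Fisher z-test: True if x and y look independent given z."""
    r = partial_corr(x, y, z)
    r = max(min(r, 0.999999), -0.999999)  # keep the logarithm finite
    n = len(x)
    # One conditioning variable, hence n - 1 - 3 degrees in the scaling term.
    z_stat = math.sqrt(n - 1 - 3) * 0.5 * math.log((1 + r) / (1 - r))
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return p_value > alpha

# Toy example: x and y are both driven by z plus unrelated noise terms.
z = [1, 2, 3, 4, 5, 6, 7, 8]
noise_x = [1, -1, -1, 1, 1, -1, -1, 1]
noise_y = [1, -1, 1, -1, -1, 1, -1, 1]
x = [a + b for a, b in zip(z, noise_x)]
y = [a + b for a, b in zip(z, noise_y)]
marginal = pearson(x, y)            # strongly correlated
conditional = partial_corr(x, y, z) # correlation vanishes given z
```

In this toy case the edge between x and y would be removed once z is conditioned on, which is the kind of decision the structure learning step makes for every pair of genes.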

PCA, PLS-DA and heatmap analysis

We conducted principal component analysis (PCA), partial least squares discriminant analysis (PLS-DA) and heatmap analysis to illustrate the performance of classification between tumor and normal groups. Specifically, the PCA, PLS-DA and heatmap analyses were carried out in R using the PCA function in the “FactoMineR” package, the plsda function in the “mixOmics” package, and the pheatmap function in the “pheatmap” package, respectively, based on the candidate gene expression profiles of the 637 samples.

Differential expression and ROC analysis of candidate genes

The R package “limma” was used to compare the expression of candidate genes in tumor and normal samples. To evaluate the performance of candidate genes in the diagnosis of COAD, the specificity, sensitivity, and area under the curve (AUC) values were obtained using receiver operating characteristic (ROC) analysis in MedCalc software.
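The quantities reported by ROC analysis can be illustrated with a short sketch. It computes the AUC as the probability that a randomly chosen tumor sample scores higher than a randomly chosen normal sample, and picks a cutoff by Youden's J statistic. This is not the MedCalc procedure itself; the cutoff rule, the function names, and the toy scores are assumptions, and the code presumes that higher values indicate tumor (for a down-regulated gene the scores would be negated first).

```python
def auc(scores, labels):
    """AUC via pairwise comparison of tumor (1) and normal (0) scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A tie between a tumor and a normal score counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_cutoff(scores, labels):
    """Cutoff maximizing Youden's J = sensitivity + specificity - 1.

    A sample is called tumor when its score is >= cutoff.
    Returns (cutoff, sensitivity, specificity); the lowest cutoff wins ties.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(s >= cutoff and y == 1 for s, y in zip(scores, labels))
        tn = sum(s < cutoff and y == 0 for s, y in zip(scores, labels))
        sens, spec = tp / n_pos, tn / n_neg
        if best is None or sens + spec - 1 > best[0]:
            best = (sens + spec - 1, cutoff, sens, spec)
    return best[1:]

# Toy expression scores for four tumor and four normal samples.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
area = auc(scores, labels)
cutoff, sensitivity, specificity = best_cutoff(scores, labels)
```

With these toy values one tumor sample scores below one normal sample, so 15 of the 16 tumor and normal pairs are ordered correctly.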

Correlation analysis of candidate genes with the clinicopathological characteristics of COAD patients

We used Pearson correlation in R to examine the relationship between gene expression and the clinicopathological characteristics of COAD patients, particularly tumor stage. For survival analysis, the R packages “survival” and “survminer” were used. Both univariate and multivariate Cox regression analyses were performed to estimate the effects of the clinical parameters and the candidate gene expression signature on survival in COAD patients, with P < 0.05 considered statistically significant. Kaplan–Meier survival curves of the candidate genes were also plotted.
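For readers unfamiliar with Kaplan–Meier curves, the product-limit estimator behind them can be written compactly. The study used the R “survival” package; the Python sketch below is only an illustration of the estimator, with a hypothetical function name and toy follow-up data.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times:  follow-up time of each patient
    events: 1 if death was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each death time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Toy cohort: five patients, two of them censored.
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

To compare high and low expression groups, the estimator would be applied separately to the patients above and below the chosen expression cutoff, and the two curves compared with a log-rank test.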


Feature selection identified a 38-gene signature for classifying COAD tumor and normal samples

During pre-processing, we found some genes with an expression value of “0”, and we filtered out genes whose expression was “0” in more than two-thirds of the 637 samples to reduce noise. Following data pre-processing, we obtained expression profiles of over 50,000 gene symbols for each of the 637 samples. We then conducted feature selection to determine the most valuable gene features for classifying tumor and normal groups. The SNR approach identifies expression patterns with the greatest difference in average expression between two groups and the least variation in expression within each group; genes can then be ranked by their SNR statistic. In this study, we first screened a total of 430 genes using the SNR method. The 430 genes were then further filtered by the PPFS algorithm. Finally, the best set of gene features, containing 38 genes, was identified for classification.
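The zero-expression filter described above amounts to a single pass over the genes. The sketch below is an illustration with hypothetical names and toy values; integer arithmetic is used for the two-thirds comparison so that a gene with exactly two-thirds zeros is kept, matching the “more than two-thirds” wording.

```python
def filter_sparse_genes(expression):
    """Drop genes whose expression is 0 in more than two-thirds of samples.

    expression: dict mapping gene name -> list of expression values
    """
    kept = {}
    for gene, values in expression.items():
        zeros = sum(1 for v in values if v == 0)
        if 3 * zeros <= 2 * len(values):  # at most two-thirds zeros
            kept[gene] = values
    return kept

# Toy data with six samples per gene (hypothetical gene names).
raw = {
    "GENE_A": [0, 0, 0, 5, 0, 0],  # 5/6 zeros: removed
    "GENE_B": [0, 0, 3, 4, 0, 0],  # exactly 4/6 zeros: kept
    "GENE_C": [1, 2, 3, 4, 5, 6],  # no zeros: kept
}
filtered = filter_sparse_genes(raw)
```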

Expression profiling analysis of the candidate 38 genes

As described above, 38 genes were identified as classifiers between tumor and normal samples through the feature selection approach. To investigate the expression patterns of these 38 genes in COAD tumors and normal samples, differential expression analysis was performed using the limma method. Table 2 displays the fold change and statistical significance of the candidate genes in tumor versus normal groups, as well as the specificity, sensitivity, and AUC values from ROC analysis. The majority (30 out of 38 genes) of the differentially expressed genes (DEGs) were up-regulated in tumors, as shown in Table 2 and the heatmap in Fig. 3. In particular, all 38 genes demonstrated promising power in discriminating tumors from normal samples (specificity range: 90.5–99.7, sensitivity range: 90.0–99.7, AUC range: 0.954–0.998).

Table 2 Differential expression and ROC analysis of the 38 candidate DEGs.
Figure 3

Bi-clustering analysis of the 38 genes that were screened using feature selection. The analysis was carried out in R using the “pheatmap” package. All of the samples were mainly divided into two groups: tumor and normal, with the latter including normal_TCGA and normal_GTEx subgroups. The samples and genes were represented by the horizontal and vertical axis, respectively.

The heatmap, PCA and PLS-DA models of samples based on the 38-gene signature were used to visualize the clustering performance. As expected, tumor and normal samples were fully separated in both the PCA and PLS-DA models (Fig. 4). In the current study, the normal group was further subdivided into two subgroups according to the source database: normal-TCGA and normal-GTEx. We therefore also took the subgroup information into account when performing the clustering analysis. As shown in Figs. 3 and 4, the two subgroups of normal samples overlapped to a small extent, and both sets of normal samples could be completely separated from tumor samples.

Figure 4

PCA and PLS-DA plot based on the expression pattern of the 38 genes. The analysis was performed using the “FactoMineR” package for PCA and the “mixOmics” package for PLS-DA in R. Each dot, triangle, and square represent a sample.

Constructing a gene regulatory network using a Bayesian network

In this study, we used a Bayesian network to construct a gene regulatory network for the 38 candidate genes based on their expression profiles. The 38 DEGs interacted with each other to some extent (Fig. 5A). Specifically, in this connected network, the eight down-DEGs and the up-DEGs formed relatively separate groups. Furthermore, we found that the Bayesian network aids in the discovery of biological gene-regulatory interactions. For instance, we identified seven up-DEGs interacting with each other in the Bayesian network (circled in Fig. 5A). A complete protein–protein interaction (PPI) network was then obtained from the STRING online platform based on the seven up-DEGs (Fig. 5B). Functional annotation showed that the PPI network was primarily involved in biological processes related to the cell cycle and nuclear division, as well as in gastric cancer (Fig. 5C). These findings indicate that in a Bayesian network, genes with similar expression patterns and functions tend to be more closely connected, which helps bridge the gap between individual genes and a systems-level biological interpretation in high-throughput bioinformatics research.

Figure 5

Bayesian network of the 38 candidate genes and PPI network analysis. (A) Bayesian network of the 38 candidate genes. The red and green ellipses represent the up-regulated and down-regulated DEGs in COAD tumors, respectively. (B) PPI network of the seven up-DEGs [the seven circled genes in (A)] from the STRING functional database. (C) Functional annotation of the genes in the PPI network.

Correlation analysis of candidate genes and clinicopathological characteristics of COAD patients

The TCGA database contains relatively comprehensive clinicopathological information on tumor samples. We therefore investigated whether the candidate genes were related to the clinicopathological characteristics of COAD patients. As summarized in Table 1, the tumor samples could be divided into different subgroups based on basic clinical information such as age, gender, race, and body mass index (BMI). According to the PLS-DA model (Fig. S1), the 38-gene expression signature could not clearly distinguish the subgroups of tumor samples defined by this basic clinical information. However, among the 38 signature genes, we identified 14 candidate genes that were positively correlated with tumor stage (P < 0.05 in Pearson correlation). Figure 6 illustrates the relative expression of the 14 stage-related genes in tumor samples of different stages, and Table S1 and Fig. S2 summarize the correlation scatter plots, coefficient values and statistical significance of the Pearson correlation. Moreover, we found that the 14 DEGs were up-regulated in tumors compared to normal samples (Tables 1 and S1), implying that the stage-related genes may help reflect the tumor progression of COAD.

Figure 6

Relative gene expression plot of the 14 stage-positive related DEGs. GraphPad Prism was used to create the scatter plot, and each dot represents a sample.

To investigate the prognostic factors for COAD, Cox regression models were used for survival analysis. Tumor stage was significantly associated with increased risk (HR > 1) in both univariate and multivariate Cox regression analysis, indicating that it can be regarded as an independent risk factor for patients’ prognosis (Table 1 and Fig. S3). We also evaluated the effects of the 38 DEGs on survival outcomes. Overall, the expression of the 38 DEGs was not significantly correlated with survival outcomes (P > 0.05) in univariate Cox regression analysis, as shown in Table S2. When we relaxed the screening criterion to 0.05 < P < 0.1 as indicating a trend, TRIB3, STMN4 and FAM135B showed an association with survival in univariate Cox regression analysis. In multivariate Cox regression analysis of these three candidate genes, risk was significantly associated with the expression of STMN4 (HR > 1, P < 0.05) and FAM135B (HR < 1, P < 0.05), as summarized in Table S2 and Fig. 7A. The Kaplan–Meier survival curves also revealed that high TRIB3 and STMN4 expression was associated with a lower overall survival probability, whereas high FAM135B expression was associated with a better survival outcome (Fig. 7B–D). Taken together, our results suggest that STMN4 and FAM135B dysregulation are independent prognostic factors for COAD patients.

Figure 7

Multivariate Cox regression and Kaplan–Meier survival curves of three candidate DEGs. The analysis was carried out in R using the “survival” and “survminer” packages. (A) Multivariate Cox regression forest plot of the three candidate genes. HR: hazard ratio; CI: confidence interval. (B–D) Kaplan–Meier survival curves for TRIB3, STMN4, and FAM135B, respectively. The cut-off points divided gene expression values into high and low groups.


With the development of high-throughput techniques in biology and the life sciences, more and more omics datasets are being generated, particularly in the field of cancer research. In recent years, the GTEx project has greatly improved the ability to study the genomics of normal tissues or cell lines22,23, providing invaluable reference data for cancer studies of the corresponding tissues and organs. Feature selection helps to locate important and representative indicators in high-dimensional datasets, which is important for the advancement of precision medicine, including cancer diagnosis and treatment. In our study, we applied the SNR and PPFS methods sequentially and discovered a set of 38 genes with promising performance in distinguishing COAD tumors from normal colon tissues, based on the combined dataset from the TCGA-COAD cohort and GTEx normal colon samples.

The Bayesian network (also known as a causal network) is a directed acyclic graphical model developed in the late 1970’s. In a Bayesian network, the nodes represent the variables and the edges represent informational or causal dependencies among the variables. Bayesian networks are widely used for modeling and inferring gene regulatory networks in biological applications, providing an efficient way to study functional genomics. Here we constructed a Bayesian network based on the 38-gene expression profiles and classification labels (tumor or normal). The differential analysis revealed that the majority of the 38 DEGs were up-regulated, with only eight DEGs being down-regulated in COAD tumors compared to normal colon tissues. Interestingly, the gene nodes in the Bayesian network tended to cluster according to their expression pattern. Based on this observation, we may be able to predict the expression changes of novel genes, since DEGs with similar expression patterns tend to cluster together in a Bayesian network. It is also worth mentioning that Bayesian networks have been applied to infer the structure of biological modules that reflect causal molecular mechanisms or statistical associations of the underlying system24. In this study, for example, a biologically meaningful STRING PPI network involving seven up-DEGs was identified in the 38-gene Bayesian network. The seven DEGs in the PPI network were all up-regulated in COAD tumor samples and were mainly enriched in cell cycle and division-related functions. Cell cycle deregulation is well known to be one of the most frequent alterations during tumorigenesis and development25,26. Thus, the findings above support the view that Bayesian networks not only provide useful information for disease classification, diagnosis and prediction, but also help in inferring the structure of biologically meaningful modules.
However, the Bayesian network is not a perfect model of a gene regulatory network. Gene regulation can be bidirectional, since two genes can regulate each other in a network. Because a Bayesian network is a directed acyclic graph, it can only represent such a relationship in one direction, which does not accurately reflect actual gene regulation. Moreover, when the number of features (for example, genes) is relatively large, it is difficult to construct a Bayesian network, which further supports the importance of gene feature selection when studying high-throughput datasets.

ROC analysis of the 38 DEGs showed high diagnostic accuracy, specificity, and sensitivity for COAD tumor samples, supporting our hypothesis that feature selection aids in obtaining effective gene features in cancer research. More importantly, some of the candidate genes were found to be significantly correlated with tumor stage and survival outcomes in COAD patients. Studies have shown that TOP2A plays important roles in the tumorigenesis of many types of cancer, including colon cancer, and that knockdown of TOP2A suppresses the proliferation and invasion of colon cancer cells27. Previously, DNA microarray and two-color FISH detection revealed that the ubiquitin-conjugating enzyme E2C gene (UBE2C) was significantly overexpressed in both primary tumors and liver metastases of colon cancer28. TOP2A and UBE2C were also found to be up-regulated in COAD tumors compared to normal tissues in this study. Meanwhile, the two genes were positively correlated with tumor stage and functionally enriched in the gastric cancer network, implying that they may function as oncogenes in gastrointestinal tumors. Similarly, other stage-related up-DEGs discovered in our study have also been reported in colon cancer research. A recent bioinformatic analysis, for example, revealed that key genes such as GRIND, KRT80, and SPTBN2 have high diagnostic value in CRC patients29. Furthermore, high levels of KRT80 mRNA were also observed in CRC cell lines30. INHBA promoted the proliferation, migration, and invasion of colon cancer cells31, and has been shown to be a prognostic predictor for COAD patients32. SALL4 mRNA has been identified as a marker for the diagnosis of several cancers33,34. The anti-cancer effects of chrysin on colon cancer cells included induction of apoptosis and attenuation of SALL4 expression35.
It has also been proposed that SERPINB5 in CRC is associated with tumor location, poor histological differentiation, microsatellite instability, and poor prognosis36. TMEM206 was demonstrated to promote CRC malignancy by interacting with the AKT and extracellular signal-regulated kinase signaling pathways37. One study showed that TOMM34 expression was elevated in the majority of human colon cancer samples, and that siRNA against TOMM34 effectively suppressed gene expression and significantly inhibited cell growth in colon cancer HCT116 cells38. Researchers identified several candidate cancer driver genes, including TOMM34, at both the mRNA and protein levels in a proteogenomic study of human CRC samples39. NOTUM, one of the Wnt target genes, was found to be up-regulated in clinical specimens of colon cancer40. Similarly, immunohistochemistry confirmed WDR43 overexpression in CRC patient specimens41. Moreover, several studies have reported an oncogenic role of TRIB3 in CRC42. In intestinal cells, TRIB3 interacts with β-catenin and TCF4 to increase the expression of genes associated with cancer stem cells and promote CRC tumorigenesis43. Approaches to inhibiting TRIB3 activity may be developed for cancer therapy43. In this research, we discovered a positive relationship between TRIB3 expression and tumor stage, and high levels of TRIB3 indicated poorer survival. Furthermore, we discovered that the gene FAM135B, which had not previously been described in colon cancer, was down-regulated and served as a prognostic factor for COAD. Overexpression of FAM135B has been reported in esophageal squamous cell cancer (ESCC)44. The FAM135B/AKT/mTOR feedforward loop promoted ESCC progression45, and silencing FAM135B improved the radiosensitivity of esophageal carcinoma cells46. These reports contrast with our finding that FAM135B was significantly down-regulated in COAD samples, which needs to be confirmed further.
Despite this, we may conclude that feature selection can greatly help to identify key candidate genes in cancer research. The majority of the candidate genes have previously been reported, with the same direction of alteration as in our findings. In addition, relatively novel gene features can be obtained for specific cancer types, which will broaden biomarker discovery for tumor diagnosis and treatment, both technically and theoretically.


In summary, using feature selection methods, we identified a 38-gene signature that performed well in classifying COAD tumors and normal samples. The majority of the 38 DEGs were significantly up-regulated in tumor samples compared to normal samples. In the Bayesian network, we found that genes with similar expression patterns or functions interacted more closely. Moreover, some of the candidate genes, such as TRIB3, KRT80, and FAM135B, were found to be correlated with tumor stage or survival outcomes, implying that these candidate genes could serve as promising prognostic biomarkers for COAD patients. Taken together, our study highlights the importance of feature selection approaches in cancer research, especially for high-dimensional datasets, which can advance the development of precision medicine.

Data availability

The raw data of this study have been deposited in FigShare.


  1. Siegel, R. L., Miller, K. D., Fuchs, H. E. & Jemal, A. Cancer statistics, 2021. CA Cancer J. Clin. 71, 7–33. (2021).

    Article  PubMed  Google Scholar 

  2. Bajramagic, S. et al. Usage of probiotics and its clinical significance at surgically treated patients sufferig from colorectal carcinoma. Med. Arch. 73, 316–320. (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  3. Linehan, W. M. & Ricketts, C. J. The cancer genome atlas of renal cell carcinoma: Findings and clinical implications. Nat. Rev. Urol. 16, 539–552. (2019).

    CAS  Article  PubMed  Google Scholar 

  4. Tomczak, K., Czerwinska, P. & Wiznerowicz, M. The cancer genome atlas (TCGA): An immeasurable source of knowledge. Contemp. Oncol. 19, A68-77. (2015).

    Article  Google Scholar 

  5. Consortium, G. T. Human genomics: The genotype-tissue expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science 348(648–660), 2015. (2015).

    CAS  Article  Google Scholar 

  6. Consortium, G. T. et al. Genetic effects on gene expression across human tissues. Nature 550, 204–213. (2017).

    Article  Google Scholar 

  7. Barger, C. J., Branick, C., Chee, L. & Karpf, A. R. Pan-cancer analyses reveal genomic features of FOXM1 overexpression in cancer. Cancers (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  8. Tang, Z., Kang, B., Li, C., Chen, T. & Zhang, Z. GEPIA2: an enhanced web server for large-scale expression profiling and interactive analysis. Nucleic Acids Res 47, W556–W560. (2019).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  9. Wu, M., Shang, X., Sun, Y., Wu, J. & Liu, G. Integrated analysis of lymphocyte infiltration-associated lncRNA for ovarian cancer via TCGA GTEx and GEO datasets. PeerJ 8, e8961. (2020).

    Article  PubMed  PubMed Central  Google Scholar 

  10. Chen, R.-C., Dewi, C., Huang, S.-W. & Caraka, R. E. Selecting critical features for data classification based on machine learning methods. J. Big Data 7, 1–26 (2020).

    Article  Google Scholar 

  11. Marcos-Zambrano, L. J. et al. Applications of Machine learning in human microbiome studies: A review on feature selection, biomarker identification disease prediction and treatment. Front. Microbiol. 12, 634511. (2021).

    Article  PubMed  PubMed Central  Google Scholar 

  12. Wang, Y., Gao, X., Ru, X., Sun, P. & Wang, J. A hybrid feature selection algorithm and its application in bioinformatics. PeerJ Comput. Sci. 8, e933. (2022).

    Article  PubMed  PubMed Central  Google Scholar 

  13. Wang, L., Wang, Y. & Chang, Q. Feature selection methods for big data bioinformatics: A survey from the search perspective. Methods 111, 21–31. (2016).

    CAS  Article  PubMed  Google Scholar 

  14. Wang, J. et al. Six-gene signature for predicting survival in patients with head and neck squamous cell carcinoma. Aging 12, 767–783 (2020).

  15. Hu, Y. et al. Identification of a five-gene signature of the RGS gene family with prognostic value in ovarian cancer. Genomics 113, 2134–2144 (2021).

  16. Wan, Q. et al. Discovery and validation of a metastasis-related prognostic and diagnostic biomarker for melanoma based on single cell and gene expression datasets. Front. Oncol. 10, 585980 (2020).

  17. Mathias, C. et al. Unraveling immune-related lncRNAs in breast cancer molecular subtypes. Front. Oncol. 11, 692170 (2021).

  18. Mishra, D. & Sahu, B. Feature selection for cancer classification: A signal-to-noise ratio approach. Int. J. Sci. Eng. Res. 2, 1–7 (2011).

  19. Goldman, M. J. et al. Visualizing and interpreting cancer genomics data via the Xena platform. Nat. Biotechnol. 38, 675–678 (2020).

  20. Hassan, A., Paik, J. H., Khare, S. & Hassan, S. A. PPFS: Predictive permutation feature selection. arXiv preprint (2021).

  21. Colombo, D. & Maathuis, M. H. Order-independent constraint-based causal structure learning. J. Mach. Learn. Res. 15, 3741–3782 (2014).

  22. Kahles, A. et al. Comprehensive analysis of alternative splicing across tumors from 8705 patients. Cancer Cell 34, 211–224 (2018).

  23. Weng, W. et al. Identification of a competing endogenous RNA network associated with prognosis of pancreatic adenocarcinoma. Cancer Cell Int. 20, 1–14 (2020).

  24. Larjo, A., Shmulevich, I. & Lahdesmaki, H. Structure learning for Bayesian networks as models of biological networks. Methods Mol. Biol. 939, 35–45 (2013).

  25. Engeland, K. Cell cycle arrest through indirect transcriptional repression by p53: I have a DREAM. Cell Death Differ. 25, 114–132 (2018).

  26. Phan, T. G. & Croucher, P. I. The dormant cancer cell life cycle. Nat. Rev. Cancer 20, 398–411 (2020).

  27. Zhang, R., Xu, J., Zhao, J. & Bai, J. H. Proliferation and invasion of colon cancer cells are suppressed by knockdown of TOP2A. J. Cell. Biochem. 119, 7256–7263 (2018).

  28. Takahashi, Y. et al. Detection of aberrations of ubiquitin-conjugating enzyme E2C gene (UBE2C) in advanced colon cancer with liver metastases by DNA microarray and two-color FISH. Cancer Genet. Cytogenet. 168, 30–35 (2006).

  29. Ma, J., Wang, P., Huang, L., Qiao, J. & Li, J. Bioinformatic analysis reveals an exosomal miRNA-mRNA network in colorectal cancer. BMC Med. Genom. 14, 60 (2021).

  30. Wu, Y. F. et al. Expression profile and prognostic value of Wnt signaling pathway molecules in colorectal cancer. Biomedicines (2021).

  31. Guo, J. & Liu, Y. INHBA promotes the proliferation, migration and invasion of colon cancer cells through the upregulation of VCAN. J. Int. Med. Res. 49, 3000605211014998 (2021).

  32. Li, X. et al. INHBA is a prognostic predictor for patients with colon adenocarcinoma. BMC Cancer 20, 305 (2020).

  33. Ardalan Khales, S. et al. SALL4 as a new biomarker for early colorectal cancers. J. Cancer Res. Clin. Oncol. 141, 229–235 (2015).

  34. Kobayashi, D., Kuribayshi, K., Tanaka, M. & Watanabe, N. SALL4 is essential for cancer cell proliferation and is overexpressed at early clinical stages in breast cancer. Int. J. Oncol. 38, 933–939 (2011).

  35. Bahadori, M., Baharara, J. & Amini, E. Anticancer properties of chrysin on colon cancer cells, in vitro and in vivo with modulation of caspase-3, -9, bax and sall4. Iran. J. Biotechnol. 14, 177–184 (2016).

  36. Kim, J. H. et al. Nuclear maspin expression correlates with the CpG island methylator phenotype and tumor aggressiveness in colorectal cancer. Int. J. Clin. Exp. Pathol. 8, 1920–1928 (2015).

  37. Zhao, J. et al. TMEM206 promotes the malignancy of colorectal cancer cells by interacting with AKT and extracellular signal-regulated kinase signaling pathways. J. Cell. Physiol. 234, 10888–10898 (2019).

  38. Shimokawa, T. et al. Identification of TOMM34, which shows elevated expression in the majority of human colon cancers, as a novel drug target. Int. J. Oncol. 29, 381–386 (2006).

  39. Zhang, B. et al. Proteogenomic characterization of human colon and rectal cancer. Nature 513, 382–387 (2014).

  40. Shiokawa, D. et al. Slow-cycling cancer stem cells regulate progression and chemoresistance in colon cancer. Cancer Res. 80, 4451–4464 (2020).

  41. Li, Z. et al. WD40 repeat 43 mediates cell survival, proliferation, migration and invasion via vimentin in colorectal cancer. Cancer Cell Int. 21, 418 (2021).

  42. Makino, S. et al. DCLK1 integrates induction of TRIB3, EMT, drug resistance and poor prognosis in colorectal cancer. Carcinogenesis 41, 303–312 (2020).

  43. Hua, F. et al. TRIB3 interacts with beta-catenin and TCF4 to increase stem cell features of colorectal cancer stem cells and tumorigenesis. Gastroenterology 156, 708–721.e15 (2019).

  44. Song, Y. et al. Identification of genomic alterations in oesophageal squamous cell cancer. Nature 509, 91–95 (2014).

  45. Dong, D. et al. A GRN autocrine-dependent FAM135B/AKT/mTOR feedforward loop promotes esophageal squamous cell carcinoma progression. Cancer Res. 81, 910–922 (2021).

  46. Bi, L., Wang, H. & Tian, Y. Silencing FAM135B enhances radiosensitivity of esophageal carcinoma cell. Gene 772, 145358 (2021).

This work was supported by the National Natural Science Foundation of China (No. 61573285).

Author information

Y.W. wrote the main manuscript text. X.R. and P.S. prepared the datasets and constructed the models. X.G. and J.W. revised the paper. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Xiaoguang Gao or Jihan Wang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Wang, Y., Gao, X., Ru, X. et al. Identification of gene signatures for COAD using feature selection and Bayesian network approaches. Sci Rep 12, 8761 (2022).
