Multimetric feature selection for analyzing multicategory outcomes of colorectal cancer: random forest and multinomial logistic regression models


Colorectal cancer (CRC) is one of the most common cancers worldwide and a leading cause of cancer deaths. Better classification of multicategory outcomes of CRC with clinical and omic data may help tailor treatment regimens to an individual’s risk. Here, we selected the features that were useful for classifying the four-category survival outcome of CRC using either the clinical and transcriptomic data, or the clinical, transcriptomic, microsatellite instability and selected oncogenic-driver data (all data) of TCGA. We also optimized multimetric feature selection to develop the multinomial logistic regression (MLR) and random forest (RF) models with the highest accuracy, precision, recall and F1 score, respectively. We identified 2073 differentially expressed genes in the TCGA RNASeq dataset. Overall, MLR outperformed RF in the multimetric feature selection. In both RF and MLR models, precision, recall and F1 score increased with the number of features and peaked at 600–1000 features, while the models’ accuracy remained stable. The best model was the MLR model with 825 features selected by the sum of squared coefficients using all data; it attained the best accuracy of 0.855, F1 of 0.738 and precision of 0.832, all higher than those of the models using clinical and transcriptomic data. The top-ranked features in the best-performing MLR model using clinical and transcriptomic data differed from those using all data. However, pathologic staging, HBS1L, TSPYL4 and TP53TG3B were among the top-20 ranked features in the best models using either clinical and transcriptomic data or all data. Thus, we developed a multimetric feature-selection-based MLR model that outperformed RF models in classifying the four-category outcome of CRC patients. Interestingly, adding microsatellite instability and oncogenic-driver data to clinical and transcriptomic data improved the models’ performance.
Precision and recall of tuned algorithms may change significantly as the number of features changes, but accuracy appears insensitive to these changes.
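The abstract names the sum of squared coefficients as the ranking criterion behind the best MLR model. The sketch below shows one plausible reading of that criterion: for each feature, square its coefficient in every outcome class and add the squares, then keep the top-ranked features. The function names, the feature names and the coefficient values are all hypothetical and are not taken from the study.

```python
def rank_by_sum_of_squared_coefficients(coef, feature_names):
    """Rank features by the sum of their squared coefficients across classes.

    coef holds one row per outcome class and one column per feature,
    which is the usual layout of a fitted multinomial model's coefficients.
    """
    n_features = len(feature_names)
    scores = [sum(row[j] ** 2 for row in coef) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return [(feature_names[j], scores[j]) for j in order]


def top_k_features(ranking, k):
    """Return the names of the k highest-ranked features."""
    return [name for name, _ in ranking[:k]]


# Hypothetical coefficients: 4 outcome classes x 3 features.
coef = [
    [0.5, -0.1, 0.0],
    [-0.5, 0.2, 0.1],
    [1.0, 0.0, -0.1],
    [-1.0, -0.1, 0.0],
]
names = ["feature_a", "feature_b", "feature_c"]  # placeholder names

ranking = rank_by_sum_of_squared_coefficients(coef, names)
selected = top_k_features(ranking, 2)
```

In practice the coefficient matrix would come from a fitted model, and the cutoff `k` would be tuned against the chosen metrics rather than fixed in advance.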
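One way to see why accuracy can stay flat while precision and recall move is to compute all four metrics on an imbalanced four-category outcome. The sketch below assumes macro averaging (an unweighted mean over classes), which the text does not specify, and uses made-up labels and predictions purely for illustration.

```python
def multiclass_metrics(y_true, y_pred, labels):
    """Return accuracy and macro-averaged precision, recall and F1."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Undefined ratios (no predictions or no cases) are scored as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Made-up data: class 0 dominates, classes 1-3 are rare.
labels = [0, 1, 2, 3]
y_true = [0] * 7 + [1, 2, 3]
pred_a = [0] * 10                        # always predicts the majority class
pred_b = [0] * 4 + [1, 2, 3] + [1, 2, 3] # finds the rare classes, misses some 0s

metrics_a = multiclass_metrics(y_true, pred_a, labels)
metrics_b = multiclass_metrics(y_true, pred_b, labels)
```

Both sets of predictions get 7 of 10 cases right, so accuracy is identical, yet the second recovers every rare class and therefore has much higher macro recall and precision. This is a toy illustration of the pattern, not a reproduction of the reported results.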

Fig. 1: Study flow.
Fig. 2: Tuning the accuracy of the RF model.
Fig. 3: Tuning the RF model’s temporal efficiency.
Fig. 4: The accuracy remained relatively stable in some models using clinical and transcriptomic data, but showed peaks in others.
Fig. 5: The accuracy remained relatively stable in some models using all data, but showed peaks in others.




This work was supported in part by the Ramzi S. Cotran Young Investigator Award (to LZ) from the U.S. and Canadian Academy of Pathology. The funder played no role in the study design, data analysis or manuscript preparation.

Author information




C.H.F., C.C. and L.Z. designed the study; C.H.F. and L.Z. conducted the study and drafted the manuscript; all authors discussed, revised and edited the manuscript; and L.Z. supervised the work.

Corresponding author

Correspondence to Lanjing Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Feng, C.H., Disis, M.L., Cheng, C. et al. Multimetric feature selection for analyzing multicategory outcomes of colorectal cancer: random forest and multinomial logistic regression models. Lab Invest (2021).
