Introduction

Colorectal cancer (CRC) is the third most common cancer in both U.S. men and women when excluding melanomas1. Genomic medicine and predictive markers have greatly helped treat CRC2,3. We have shown that pathologic staging is critical for prognostication of CRC4. However, the 5-year overall survival of CRC was still only 64%5, and warrants identification of additional markers to better classify treatment responses and guide CRC treatments.

Transcriptomic and genomic data have been increasingly produced and used in recent decades6. However, there are several major challenges in integrating clinicopathologic and ‘omics data for predicting clinical outcomes. Adding clinical data to transcriptomic data seems not to improve accuracy in classifying multicategory cancer outcomes7, but it is unknown whether adding genomic data of oncogenic drivers would help improve classification performance of statistical models. Moreover, limited by the outcome data in datasets, a vast majority of previous transcriptomic studies on CRC used binary or time-event type outcomes8,9,10,11,12,13,14,15,16, while clinicians need to predict clinical outcomes in much more details to better treat cancer patients and inform them of the disease prognosis. Furthermore, feature selection has greatly improved classification performances of various models on transcriptomic and microarray data17,18,19,20. However, it is largely unclear whether addition of other ‘omic data change the number of selected features. To explore these questions, we examined the performance metrics and optimal feature numbers of the best models in classifying four-category clinical outcomes of CRC before and after adding selected-genomic and microsatellite instability (MSI) data to clinical and transcriptomic data.

Machine learning (ML) algorithms produce a model that can perform classification, regression, and other similar tasks based on a given dataset, which can be used to predict the output of another system or dataset21. We have shown that transcriptomic data and clinical features can be used to accurately classify outcomes of lung, prostate, and breast cancers7,22,23,24. However, most of the model tuning was focused on accuracy, which is nonetheless very important. Additional metrics may be important to better understand the performance of an algorithm, particularly in imbalanced datasets25,26. Hence, we aimed to understand the multimetric performance of random forests (RF) and multinomial logistic regression (MLR) in classifying multicategory outcomes of CRC patients, and how to select features for the best multimetric performances such as accuracy, precision, recall, and F1. We also compared the performance metrics of MLR and RF models and obtained the range of clinical and molecular features that are needed for best performance of each model.

Methods

Data extraction

We obtained individual-level data of colorectal adenocarcinomas (pancancer atlas) from The Cancer Genome Atlas Program (TCGA) through cBioPortal repository (https://www.cbioportal.org/study/summary?id=coadread_tcga_pan_can_atlas_2018, Fig. 1)27. The TCGA data were de-identified and publicly available. Therefore, this study was categorized as an exempt study (category 4) and did not require an Institutional Review Board (IRB) review. The transcriptomic data (RNAseq V2) were processed and normalized using RSEM (rsem.genes.normalized_results of the TCGA dataset), calculated into z-score and dichotomized using the median z-score of all samples27.

Fig. 1: Study flow.
figure 1

We extracted the colorectal adeno-carcinoma cases in the cancer genome atlas (TCGA), and classified the patient survival into four categories, including alive with no progression, alive with progression, dead with no known progression, and dead with progression. We used random forests (RF) and multinomial logistic regression (MLR) to classify the four-category outcomes. The 10-fold cross-validation approach was used during the modeling of the MLR model.

We extracted the MSI statuses from a previous study on TCGA CRC that used a nonnegative matrix factorization-based consensus clustering model28. We also extracted the genomic data of 16 clinically-significant oncogenic-driver or MSI-related genes, including KRAS, NRAS, BRAF, NTRK1, NTRK2, NTRK3, ERBB2, POLE, MSH2, MSH6, PMS2, MLH1, APC, TP53, PIK3CA, and SMAD429,30,31,32. The genomic data were standardized and annotated using Genome Nexus (which utilizes VEP with the canonical UniProt transcript, https://github.com/mskcc/vcf2maf/blob/master/data/isoform_overrides_uniprot). Any driver alternations (amplifications, mutations, copy-number variation, and deletions) of these oncogenic driver genes were determined using GISTIC 2.033.

We performed the differentially expressed genes (DEG) analysis to remove the less relevant transcriptomes using Chi-square test7. The DEG and clinico-pathological features were subject to the modeling alone or with addition of the genomic and MSI data. The outcomes of the classification models were the patients’ four-category survival, including alive with no progression, alive with progression, dead with no known progression, and dead with progression. All processes were conducted using Python 3.6.9.

RF modeling

Tuning

We used the RandomForestClassifier from the Python Scikit-learn package for RF modeling34. We used the Gini index as the split criterion and had 20 iterations for each run.

During the tuning of the RF model, parameters including n_estimator (the number of trees in the forest), min_sample_splits (the minimum number of samples required to split an internal node), min_samples_leaf (the minimum number of samples required to be at a leaf node), and n_jobs (the number of jobs to run in parallel) were tuned.

To measure the accuracy of the model, we used cross_val_score imported from the Scikit-learn package, which evaluated the accuracy of a model using cross validation. We used a 5-fold cross validation, which produced five separate accuracy values for each iteration of our tuning, from which we took the average value.

We automated the tuning process. The pipeline first tuned n_estimator (range of 5–195, with increments of 5) and min_sample_splits (range of 2–14). This was repeated twice. We then tuned min_sample_leaf (range of 1–25) and n_jobs (range of 1–14) three times. The pipeline automatically selected the parameters which produced the highest average accuracy. To obtain the feature importance from our model, we used the feature_importances_ property, which measured the Gini importance, the impurity-based feature importance. Feature_importances was part of RandomForestClassifier from the Scikit-learn Package. We ran the tuned RF model 20 times to obtain the mean of the feature importance. We obtained performance metrics, including accuracy, recall, precision, and F1 score, using the cross_validate function imported from the sklearn.model_selection package from Scikit-learn34.

Feature selection

Ranked approach, feature importance approach, and preset feature selection were used for feature selection, with each approach described in detail as following. The whole process of each approach was repeated 20 times. We then compared the performance metrics of the optimized models based on all data versus clinical and transcriptomic data that reached the best accuracy and precision, using Student t-test.

Ranked approach

We performed feature selection on the RF model using our pipeline based on ranking. After sorting the features based on their average feature importance, we created a number of reduced datasets. The number of features in these datasets increased by an increment of 185 features. Thus, we created 11 reduced datasets (Supplementary Table 1). During the process of feature selection, we tuned each reduced dataset individually using the aforementioned tuning process. After tuning, we ran the tuned model 20 times for each reduced dataset and collected the performance metrics of the model.

Feature importance approach

We performed feature selection using our pipeline based on the sum of the squared coefficients. Using the average feature importance which we previously obtained for each individual feature, we selected a number of sets of features using their feature importance values as a cut-off. We created 11 selected datasets based on the cut-off values (Supplementary Table 1). During the process of feature selection, we tuned each reduced dataset individually using the tuning process as described above. After tuning, we ran the tuned model 20 times for each reduced dataset and collected the performance metrics of the model.

Preset feature selection

We also performed feature selection using preset parameters for the RF model. Using the reduced datasets created for the feature importance approach, we set the values of n_estimator (range of 20–100, with increments of 20) and min_sample_split (range of 2–6, with increments of 2) of the RF model uniformly for all datasets to investigate the nature and trends of dimension reduction in relation to tuning. The model was run with each combination of the n_estimator and min_sample_split values 20 times for all reduced datasets, and collected the performance metrics of the model.

MLR modeling

Model setup

We performed MLR to compare the results from our RF model with those from a more conventional model. We used LogisticRegression from the sklearn.linear_model package of scikit-learn to create the MLR model34. MLR is a conventional statistical model, which doesn’t need to be tuned. We used train_test_split, cross_validate, and KFold from the sklearn.model_selection package to split the dataset into the test and train sets and perform a 10-fold cross validation. To extract the coefficient values for the features that built the MLR model, we used.coef_ from the LogisticRegression package.

Feature selection

Ranked approach

The pipeline using the ranked method was employed to perform the feature selection for the MLR model. Using the coefficients we obtained for each feature in a MLR, the pipeline sorted the features by their coefficients for each of the four classes of clinical outcomes in descending order. The pipeline then extracted the top and bottom features from every class to create reduced datasets and perform feature selection. We removed the duplicated features in each reduced dataset. To this end, we created 14 reduced datasets (Supplementary Table 1). During the process of performing Dimension Reduction, we ran the model 20 times for each reduced dataset and collected the performance metrics of the model.

Squared coefficient approach

Using the coefficient values, we obtained for each feature, the pipeline calculated the squared-coefficient of each feature by adding together the squared values of the feature’s coefficient for each class. Then, these features were sorted in descending order based on their square-coefficient values. When creating the datasets, we selected square-coefficient values to create reduced datasets containing the features, whose square-coefficients were greater than the selected threshold. Using these thresholds, we created 10 reduced datasets (Supplementary Table 1). During the process of performing dimension reduction, we ran the model 20 times for each reduced dataset and collected the performance metrics of the model

Results

Baseline characteristics

There were 2034 DEGs among the 17,501 genes that were subject to RNAseq as shown by Chi-square test. The dataset had 589 cases, including 406 (68.7%) alive no disease, 65 (11.0%) alive with disease, 35 (5.9%) dead no disease, and 85 (14.4%) dead with disease (Table 1). There were 58 (9.9%) MSI-high CRC. The basic characteristics of these cases are summarized in Table 1.

Table 1 Baseline characteristics of the included colorectal cancer cases by outcome.

Genomic alterations of CRC driver genes

Among the 16 most frequently mutated or clinically significant CRC driver genes, APC, TP53 and KRAS were the most frequently seen, namely in 67% (393/589), 53% (312/589), and 38% (221/589) of the tumors, respectively (Supplementary Fig. 1 and Supplementary Table 2). The MSI related genes such as MSH2, MSH6, PMS2, MLH1, and POLE each had a mutation frequency of 3–6%, together accounting for about 8% of all tumors. The frequencies of driver-mutations in KRAS G13D, KRAS Non-G13D, NRAS, BRAF V600E, PKI3A, APC, and TP53 were 9.3% (55/589), 28.2 (166/589), 5.1% (30/589), 8.2% (48/589), 22.9% (155/589), 64.9% (382/589), and 52.6% (310/589), respectively. Interestingly, NTRK1, NTRK2, and NTRK3 driver mutations were seen in only 0.5% (3/589), 0% and 0% of the tumors, despite their 2–5% of overall mutation-frequencies. The most frequent genomic alteration of ERBB2, which is also clinically known as HER2/Neu, was amplification (3.4%, 20/589), including 2.9% (17/589) with amplification only and 0.5% (3/589) with amplification and mutation(s).

Tuning RF models

The RF model was tuned to find the best parameters to reach the highest possible accuracy. Through tuning, we found that the accuracy of the model can reach 69 or 70% when the parameters n_estimator, min_sample_splits, and n_jobs were not too few (i.e., n_estimator > 15, min_sample_splits > 2, n_jobs > 1), and the effect of min_samples_leaf to be less prominent when tuning (Fig. 2A, B).

Fig. 2: Tuning the accuracy of the RF model.
figure 2

A Tuning the parameters n_estimators (from 5 to 195, in increments of 5) and min_sample_splits (from 2 to 14, in increments of 1). B Tuning the min_samples_leaf (from 1 to 25, in increments of 1) and n_jobs (from 2 to 14, in increments of 1).

The model was also tuned for its temporal efficiency. When the value of n_estimator was not too great (fewer than 130), the temporal efficiency of the model can be within 4 s (Fig. 3A). The effect of min_sample_splits was less prominent when tuning. Results also showed that when the value of n_jobs was sufficiently small (fewer than 10) and that of n_samples_leaf was sufficiently large (greater than 5), the efficiency of the model can range from 1.8 to 2.6 s (Fig. 3B).

Fig. 3: Tuning the RF model’s temporal efficiency.
figure 3

A Tuning the number of estimators and sample split. B Tuning the number of sample leaf and jobs without using graphic processing unit. The time of each individual run was measured in seconds.

Based on tuning results, the RF model was tuned with parameters n_estimator at 45, min_sample_splits at 4, min_samples_leaf at 1, and n_jobs at 4.

We performed 20 iterations of the tuned RF model, which appeared to have similar performances according to the confusion matrices and classification reports. Among these 20 iterations, the best classification accuracy of the model was 76% and it had a temporal efficiency of 2.57 s.

To evaluate the performance of a model, just looking at the accuracy and temporal efficiency is not enough. The values of precision, recall, and F1 all reflect the performance of the model in classifying each category. For our RF model, the precision, recall, and F1 in classifying alive with no progression were 76%, 99%, and 86%, respectively. However, the classification report also indicated that our model cannot effectively classify the other three categories (alive with progression, dead with no progression, and dead with progression).

The datasets used in this study had 2034 and 2051 features (also including MSI and 16 oncogenic-driver statuses), respectively. The individual importance values of these features were extracted and measured. The Gini importance of each individual feature was measured and averaged after 20 consecutive repeats using the tuned model

Using the average feature importance values, we obtained the following important features from the clinical and transcriptomic data that had the top feauture-importance values: path_stage 4, path_n_stage 3, path_n_stage 1, COL11A2, SSX2IP, GSR, CYP46A1, CCDC114, PSRC1, GABRD, GLT25D2, ASPDH, path_t_stage 4, ARHGAP4, CLK2, FLJ43663, TLE6, age_3grp3, HIST1H2BE, C20orf108, C15orf58, TP53TG3B, RBKS, and CD101; and obtained the following from all data: path_m_stage2, path_stage4, path_n_stage3, COL11A2, path_m_stage1, age_3grp3, C12orf69, path_n_stage1, CCDC114, CLK2P, FLJ12825, ASPDH, EFNA2, HIST1H2BE, path_t_stage4, ARHGAP4, CYP46A1, EVI5, CDHR2, and MMAA (Supplementary Table 3).

During our manual tuning of the RF feature selection datasets based upon feature importance, we found that the number of features affected the parameters that led to the highest performance (Supplementary Fig. 2). For each of the manually tuned models, the datasets that underwent feature-selection performed better than the datasets with the highest number of features, which can be most clearly seen in the values of precision (Supplementary Fig. 2). These observations underscore the importance of feature selection before analysis to ensure its best performance.

The MLR using clinical and transcriptomic data (Supplementary Table 4) and all data (Supplementary Table 5) both showed that the features with higher association with class 1 (alive with no disease) usually had higher inversion association with classes 4 (dead with disease). Interestingly, some of the features were important in both datasets including path_n_stage3, path_m_stage1, DCDC1, BRE, and TP53TG3B.

Using the sum of squared coefficients, we obtained the following important features from the clinical and transcriptomic data: path_stage 4, FLJ12825, KLF10, SLC5A5, path_n_stage1, GSR, SLC38A7, TUBA1C, DPCR1, TSPYL4, ERLIN2, HBS1L, path_stage2, LOC100130331, ZNF500, STOML1, TP53TG3B, TEX2, PPCDC, path_n_stage3, FAM124A, SLC20A2, EIF4E3, and SLC22A4; and obtained the following from all data: path_n_stage3, path_m_stage1, RRN3P2, GPR88, path_stage4, COMT, TP53TG3B, WBSCR17, BRE, GAS6, path_m_stage2, MSMP, FAM195B, GPRIN3, HBS1L, ZNF101, MOSC1, TSPYL4, TAS1R1, and KCNJ13 (Supplementary Table 6).

During our multimetric feature selection process on clinical and transcriptomic data (Fig. 4) and all data (Fig. 5), we found that accuracy remained relatively stable in RF models, while there were clear accuracy peaks in the MLR models. Interestingly, the number of features, that produced the best accuracy based on sum of squared coefficients using clinical and transcriptomic data, also produced the best precision and recall (Fig. 4D). Strikingly, the highest precision and recalls of MLR models using clinical and transcriptomic data after feature selection was 0.82 and 0.71, respectively, which were much higher than those in the RF model, while its highest accuracy was also higher than that of the RF model (0.85 versus 0.72) (Fig. 4). Similar findings were also noted in feature selection using all data (Fig. 5). These results indicate that the MLR model has better performance than the RF model in analyzing the datasets used in this study.

Fig. 4: The accuracy remained relatively stable in some models using clinical and transcriptomic data, but showd peaks in others.
figure 4

Multi-metric feature selection for classifying four category outcomes of the colorectal cancers based on the random forest (RF) model (A based on ranking of feature importance; B based on feature importance) and multinomial logistic regression (MLR) model (C based on ranking of features’ coefficients; D based on features’ sum of squared coefficients) using clinical and transcriptomic data.

Fig. 5: The accuracy remained relatively stable in some models using all data, but showd peaks in others.
figure 5

Multi-metric feature selection for classifying four category outcomes of the colorectal cancers based on the random forest (RF) model (A based on ranking of feature importance; B based on feature importance) and multinomial logistic regression (MLR) model (C based on ranking of features’ coefficients; D based on features’ sum of squared coefficients), using clinical, transcriptomic, microsatellite instability and selected genomic/oncogenic-driver data.

The best accuracy of all models and all datasets was achieved using MLR based on ranking of coefficients and 825 features of all data (0.855, Table 2), that also produced the highest precision (0.832), F1 (0.738), and recall (0.698, tied with that of clinical and transcriptomic data). Adding selected genomic and MSI data slightly increased the best accuracy of each type of model and feature selection approach except MLR models based on sum of squared coefficients, as well as the best accuracy among all models (0.855 for all data versus 0.844 for clinical and transcriptomic data, P < 0.001, Table 2). Interestingly, the optimal feature numbers before and after adding selected genomic and MSI data seemed to be similar in all models except the RF model based on importance values.

Table 2 Performance metrics of the random forest and multinomial logistic regression models with the highest accuracy and precision that were identified using the multimetric feature selection approach.

Discussion

In this study, we examined the clinical and molecular features that may be useful for classifying four-category survival outcome of CRC using the TCGA data. We created four pipelines which automated and optimized multimetric feature selection to identify the best model with the highest accuracy and precision. Our data show that the MLR models overall outperformed RF models in the multimetric feature selection. The best model was the MLR one with approximately 825 features based on sum of squared coefficients using all data, and attained the best accuracy of 0.855, F1 of 0.738 and precision of 0.832, which were better than those using clinical and transcriptomic data. A host of top features have been identified (Supplementary Tables 2 and 3). The top-ranked features in the MLR model of the best performance using clinical and transcriptomic data were different from those using all data. The pathologic stages, HBS1L, TSPYL4, and TP53TG3B were the overlapping top-20 ranked features in the best performing models using clinical and transcriptomic or all data.

We previously showed that, compared with transcriptomic data alone, combination of clinical and transcriptomic data did not significantly improve the accuracy of RF or MLR models for classifying multi-category lung adenocarcinomas7. However, adding MSI and selected genomic/oncogenic-driver data in this study significantly increased accuracy, precision, and F1 of the best feature-selection based MLR models, but not recall. The importance of alternations in MSI, TP53, PIK3CA, BRAF, and SMAD4 has been shown before35,36,37,38,39,40,41,42, and may explain the increase in model performance metrics. Indeed, the coefficients of PMS2L4 (inverse correlation), TP53TG3B (inverse correlation), PIK3CA_DRIVER, BRAF_DRIVER, and SMAD4_DRIVER were among the important features as assessed by the sum of squared coefficients. On the other hand, driver-alternations of KRAS, NRAS, NTRK1, NTRK2, NTRK3, ERBB2, and APC were not among the top- or bottom-500 features, despite their clinical values29,30,31,32. One possible explanation is that no targeted therapies of these alterations were approved or widely used at the time when these TCGA CRC were treated. Such a time-difference was also noted for the utilization of radiotherapy. Interestingly, addition of MSI and selected oncogenic-driver data significantly increased accuracy, recall, and F1 of the best feature-selection based RF models, but not precision. This difference may be attributable to the different sensitivity of RF and MLR models to including more features in imbalanced datasets. However, future studies are warranted to examine the causes of these differences.

Despite the correlations in some top-ranked and bottom-ranked features, the top-ranked features in an optimized, tuned model may vary considerably by the databases and models’ parameters7,22,23,24. The same phenomenon was also observed in this study. Interestingly, pathologic staging, HBS1L, TSPYL4, and TP53TG3B have been ranked top 20 in the best models using clinical + transcriptomic and all datasets, respectively. They thus may be the truly important features for classifying clinical outcomes in CRC. This observation is partially supported by the known prognostic values of pathologic staging for CRC survivals4,43,44. However, the literature review shows that TSPYL4 is only reportedly important for prognosis of pancreatic and head and neck cancers45,46, HBS1L(-MYB) is only found important for risks of hepatocellular carcinoma and development of myeloproliferative neoplasms47,48, and TP53TG3B seems important only for the prognosis of uterine corpus endometrial carcinoma and hepatocellular carcinoma49,50. Therefore, the prognostic values of TP53TG3B, TSPYL4 and HBS1L for CRC appear novel. Strikingly, TSPYL4 and HBS1L are both located on chromosome 6 (6q22.1 and 6q23.3, respectively, www.genecards.org) and may be closely related, while TP53TG3B is separately on 16p11.2. Future works may be focused on TSPYL4 and HBS1L related drug development and prognostic markers.

Most of the published works on feature selection were based on LASSO51,52,53, RF17,54,55,56,57,58, support vector machine59,60,61,62,63, and incremental recursive feature elimination64,65,66,67. In this study, we for the first time compared the multimetric performances of RF and MLR models. Despite the possible overfitting, MLR clearly outperformed RF after proper feature selection in all four examined performance metrics. This will greatly enable us to rigorously classify patients with CRC or other cancer for proper treatments. It is particularly useful in the imbalanced datasets25,26.

Another strength of this study is to analyze outcomes in four categories. Most, if not all, of the previous transcriptomic studies on CRC focused on binary outcomes or time-events8,9,10,11,12,13,14,15,16. Thus, the multicategory survival outcomes are poorly understood in CRC patients, particularly regarding their associated clinical and transcriptomic features. Our findings will shed light on how to better manage and prognosticate CRC patients based on these features. Indeed, it is possible that CRC patients who died with disease may be under-treated or non-responsive to current treatment regimens, while the ones who died without disease may be over-treated or suffer other non-cancer causes. Therefore, our optimized models for four-category survival outcome will help stratify these patients and proper adjust their treatment regimens according to the predicted outcome. The specific applications of ML models, that are based on clinical and ‘omic data, may include: 1. To choose the treatment option(s) so that the patient would stay alive without disease, or less preferably alive with disease; 2. To deescalate current treatment-regimen and increase dosages or treatment modalities for the patients who died without disease and those who died with disease, respectively, in future studies or practice; 3. To prepare the patients and their families with the predicted four-category outcomes, such as death with disease or alive with disease.

It should be mentioned that there are several limitations of this study. First, we do not have an external test set, although we rigorously examined the models using 5-fold cross-validation and repeated the run for 20 times. Such a limitation is inevitable due to the lack of high-quality large-scale ‘omic studies with detailed patient outcomes. Second, the four-category outcome in this cohort is not very balanced since nearly two thirds of the patients were alive no disease at end of the follow-up period. Thus, the model may be biased to better predict these patients, and less powerful and less useful for the other categories of outcome (i.e., dead with or without disease, or alive with disease). Future large-scale studies with balanced-outcomes are warranted to examine our models and findings. Third, only one type of cancer is analyzed here. It will be interesting to test whether our findings hold true in other types of cancers. Fourth, adjuvant therapy data were not available in the TCGA, and thus not included in the analyses despite their clinical values and our desire of including them. It is also noteworthy that radiotherapy in this study (TCGA cohort) was not used as often as nowadays. The main reason is that the included CRC were treated in the 2000s31, when radiotherapy was not used as widely as nowadays68, possibly due to the lack of sufficient awareness, expertize, guidelines, and resources. Fifth, the outcomes of cancer recurrence, metastasis and specific causes of death will be more helpful than the four-category outcomes of this dataset, but not available in any large-scale ‘omic datasets. Future works should focus on generating this kind of ‘omic datasets. The TCGA dataset also has the limitation that scholars cannot assess its accuracy or explore further to fit special needs. Finally, sample size of the TCGA cohort may be too small despite its great values and wide use.

In summary, we here report a multimetric feature selection-based MLR model using all data that outperformed RF models in classifying four-category outcome of CRC patients. Adding MSI and selected genomic/oncogenic-driver data increased the performance metrics of RF and MLR models. We also showed the association of feature number with models’ performance. Approximately 600–1000 clinical and features are needed to reach an optimal model with the highest accuracy, precision, and recall, while more or fewer features would lead to lower accuracy, precision, and recall. The model developed in this study has the potential to be used to analyze other types of cancers. The top important feature identified by the models we developed could help understand colorectal tumorigenesis and guide CRC treatments.