Machine Learning Algorithms for Predicting the Recurrence of Stage IV Colorectal Cancer After Tumor Resection.

The aim of this study was to explore the feasibility of using machine learning (ML) to predict postoperative recurrence risk among patients with stage IV colorectal cancer. Four basic ML algorithms were used for prediction: logistic regression, decision tree, GradientBoosting and LightGBM. A total of 999 patients with stage IV colorectal cancer were included, and the samples were randomly divided into a training group and a testing group at a ratio of 8:2. In the training group, the GradientBoosting model had the highest AUC (0.881) and the highest F1_score (0.912), while the logistic regression model had the lowest AUC (0.734). In the test group, the logistic regression model again had the lowest AUC (0.692). The GradientBoosting model's AUC was 0.734, which can still predict cancer progression, while the LightGBM model had both the highest AUC (0.761) and the highest F1_score (0.974). The GradientBoosting and LightGBM models performed better than the other two algorithms. The weight matrix diagram of the GradientBoosting algorithm shows that chemotherapy, age, LogCEA, CEA and anesthesia time were the five most influential risk factors for tumor recurrence. Each of the four ML algorithms can predict the risk of tumor recurrence in patients with stage IV colorectal cancer after surgery, with GradientBoosting and LightGBM performing best.

www.nature.com/scientificreports

detect early colorectal cancer. They are also more suitable than non-metastatic models for predicting the survival of non-metastatic colorectal cancer.
Therefore, this study was conducted to explore whether ML algorithms can predict postoperative cancer progression in patients with stage IV colorectal cancer.

Materials and Methods
Patients and features.
This research is a secondary analysis of data from the BioStudies (public) database (https://www.ebi.ac.uk/biostudies/studies/S-EPMC6054421). According to BioStudies's instructions, these data have been approved by the author and can be provided to interested researchers around the world. Therefore, using the database in research does not need the approval of a secondary ethics committee. Thus, our institutional review committee also waived the requirement of written informed consent.
Patients with stage IV colorectal adenocarcinoma who had undergone primary and metastatic tumor resection surgery between January 1, 2005 and December 31, 2014 were selected from the hospital's electronic medical database. Patients lacking demographic and pathological details or postoperative analgesia data were excluded. A total of 999 patients with stage IV colorectal adenocarcinoma were included in the training data and test data. The information collected included demographic characteristics, pre-treatment CEA levels, pathologic features, and whether preoperative or postoperative adjuvant chemotherapy or radiation therapy had been used. The current status of each patient was determined from follow-up records in the outpatient clinic or from subsequent admission information. The radiologists and colorectal surgeons in the hospital determined whether the cancer was progressing. This was primarily based on imaging studies (e.g., CT, magnetic resonance imaging, bone scans) and was defined by the Response Evaluation Criteria in Solid Tumors (RECIST) guidelines. The date of death was determined from medical records or death certificates.
Data were extracted by professional anesthesiologists who did not participate in the data analysis. The quality of the extracted data was verified by random sampling, and the data were collected up through August 2016. The primary endpoint was progression.
Patient demographic and baseline characteristics were presented with descriptive statistical methods. Continuous variables were described with mean and standard deviation (SD), or median and interquartile range, while categorical variables were described with counts and percentages. For continuous variables with normal or skewed distributions, a Student's t-test or Mann-Whitney U-test was used, respectively, to test for differences between the groups with and without tumor recurrence. The research samples were randomly divided into a training group and a testing group at a ratio of 8:2. Multiple imputation was used to fill in missing variable values.
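The random 8:2 division described above can be illustrated with a minimal sketch. It uses only the Python standard library and placeholder patient IDs; the function name `split_train_test`, the seed and the IDs are assumptions made for illustration and are not taken from the study, which does not report how the split was implemented.

```python
import random

def split_train_test(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of `samples` and split it into (train, test)."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(samples)           # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical placeholder IDs standing in for the 999 patients.
patient_ids = list(range(999))
train_ids, test_ids = split_train_test(patient_ids)
print(len(train_ids), len(test_ids))
```

With 999 samples, a test fraction of 0.2 rounds to 200 test samples and leaves 799 for training. In scikit-learn, the same kind of split is usually obtained with `sklearn.model_selection.train_test_split`.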

ML algorithms.
In the present study, four basic ML algorithms were implemented 13,14: logistic regression 9, decision tree 10, GradientBoosting 11 and LightGBM 4,12. Logistic regression is a classical classification method in statistical learning. It can be divided into binomial logistic regression and multinomial logistic regression. A decision tree is an ML method for solving classification problems. It consists of a root node, several internal nodes and several leaf nodes. The leaf nodes correspond to decision results, and each of the other nodes corresponds to a feature test. The root node contains the full set of samples, and the sample set contained in each node is divided among its child nodes according to the feature values. The path from the root node to each leaf node corresponds to a sequence of decision tests.
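The tree structure described above (feature tests at the root and internal nodes, decisions at the leaves) can be sketched as a small hand-built example. The tree below is hypothetical: its features, thresholds and labels are invented for illustration and are not derived from the study data or from any fitted model.

```python
def predict_with_path(node, sample):
    """Walk from the root to a leaf, recording each feature test on the way."""
    path = []
    while "label" not in node:                 # root or internal node: apply its test
        went_left = sample[node["feature"]] <= node["threshold"]
        path.append((node["feature"], node["threshold"], went_left))
        node = node["left"] if went_left else node["right"]
    return node["label"], path                 # leaf node: the decision result

# Hypothetical tree; thresholds and labels are made up, not learned from data.
toy_tree = {
    "feature": "chemotherapy", "threshold": 0.5,
    "left": {"label": 0},
    "right": {
        "feature": "age", "threshold": 60,
        "left": {"label": 1},
        "right": {"label": 0},
    },
}

label, path = predict_with_path(toy_tree, {"chemotherapy": 1, "age": 52})
print(label, path)
```

The returned `path` is the sequence of decision tests from the root to the leaf that produced the prediction.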
Boosting is an ML technique that can be used for regression and classification problems. It produces a weak prediction model (such as a decision tree) at each step and adds it, with a weight, to the overall model. If the weak model at each step is fitted along the negative gradient direction of the loss function, the method is called gradient boosting.
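The boosting procedure above can be sketched for the simplest case, squared-error loss, where the negative gradient is just the residual between the target and the current prediction. This is a toy, single-feature illustration in plain Python with invented data. It is not the GradientBoosting implementation used in the study, and the names `fit_stump` and `gradient_boost` are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the one-split regression stump that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:     # candidate split points
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Additive model: mean of ys plus a weighted sum of stumps."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is the residual.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

def mean_squared_error(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Invented toy data: a step between x = 3 and x = 4 plus small noise.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)
baseline_mse = mean_squared_error(lambda x: sum(ys) / len(ys), xs, ys)
boosted_mse = mean_squared_error(model, xs, ys)
print(baseline_mse, boosted_mse)
```

Each round adds one weak model weighted by `learning_rate`, so the training error on this toy data falls well below that of the constant baseline.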
LightGBM is a distributed gradient boosting framework based on decision tree algorithms. It applies a histogram algorithm, which has low memory usage and low data-splitting complexity. LightGBM uses a leaf-wise growth strategy: among all current leaves it finds the one with the largest split gain (generally the one holding the most data), splits it, and repeats the cycle. However, leaf-wise growth may produce deeper decision trees and thus overfitting. LightGBM therefore adds a maximum depth limit on top of leaf-wise growth to prevent overfitting while maintaining high efficiency.
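The histogram idea mentioned above can be illustrated with a simplified sketch: feature values are bucketed into a few bins, gradient sums are accumulated per bin, and only the bin edges are scanned as candidate splits. This is a simplification, not LightGBM's actual code. It uses equal-width bins and a gain based on gradient sums and counts only, and the function name and toy data are assumptions.

```python
def histogram_best_split(values, gradients, n_bins=4):
    """Bucket one feature into equal-width bins, then scan bin edges for the best split."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins                 # assumes hi > lo
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), n_bins - 1)   # clamp the maximum into the last bin
        grad_sum[b] += g
        count[b] += 1

    total_g, total_n = sum(grad_sum), sum(count)
    best_gain, best_edge = 0.0, None
    left_g, left_n = 0.0, 0
    for b in range(n_bins - 1):                # candidate split after bin b
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        gain = left_g ** 2 / left_n + right_g ** 2 / right_n - total_g ** 2 / total_n
        if gain > best_gain:
            best_gain, best_edge = gain, lo + (b + 1) * width
    return best_edge, best_gain

# Invented toy data: gradients change sign between the values 4 and 5.
values = [1, 2, 3, 4, 5, 6, 7, 8]
gradients = [-1, -1, -1, -1, 1, 1, 1, 1]
edge, gain = histogram_best_split(values, gradients)
print(edge, gain)
```

Scanning `n_bins - 1` edges instead of every distinct value is what keeps the memory and split-search cost low. In the LightGBM package itself, tree size under leaf-wise growth is controlled with the `num_leaves` and `max_depth` parameters.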
Hyperparameter initialization and optimization.
ML algorithms involve many hyperparameters that must be set before the algorithms are run. In contrast to the parameters learned through training, the hyperparameters determine the structure of the ML algorithm and how it is trained. The initial value of the hyperparameters for each ML algorithm used in this study was the default value specified in the package based on recommendations or experience 15 . For detailed parameterization of the algorithms, please refer to the scikit-learn user manual at http://scikit-learn.org/stable/supervised_learning.html 16 .
Performance index.
Accuracy, sensitivity, specificity and the area under the receiver operating characteristic (ROC) curve were used to evaluate the performance of the machine learning algorithms. The ROC curve shows the trade-off between sensitivity and specificity as the threshold on the predicted posterior probability varies. Precision is the proportion of correctly predicted positive cases among all cases predicted as positive. Recall is the proportion of correctly predicted positive cases among all actual positive cases. Accuracy is defined as the ratio of the number of samples correctly classified by the classifier to the total number of samples for a given test data set: Accuracy = (number of correctly classified samples) / (total number of samples).
Software.
Descriptive and inferential statistical analysis was conducted with R. The machine learning algorithms were implemented in Python 3.6 (Python Software Foundation, https://www.python.org/) using the scikit-learn 0.19.1 software package (scikit-learn, http://scikit-learn.org/).
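The performance indices defined in this section can be written out directly from the confusion-matrix counts. The sketch below is a plain-Python illustration rather than the study's code: the labels and scores are invented toy values, the 0.5 threshold is an assumption, and the AUC is computed through its rank interpretation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case).

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall (sensitivity), specificity and F1 from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented toy labels (1 = progression) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.55]
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # assumed threshold of 0.5

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

scikit-learn provides equivalent functions in `sklearn.metrics`, such as `accuracy_score`, `f1_score` and `roc_auc_score`.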

Results
A total of 999 patients who met the inclusion criteria were included in this study, of whom 778 were in the relapse group and 221 did not relapse. The CEA value was 269.8 ± 1053.6 in the advanced cancer group and 219.6 ± 719.3 in the non-advanced cancer group (P = 0.434). Anesthesia time was 341.9 ± 120.6 in the advanced cancer group and 326.2 ± 122.1 in the non-advanced cancer group (P = 0.050). ASA scores differed between the advanced cancer group and the non-advanced cancer group, and this difference was statistically significant (P < 0.001). Similarly, there were significant differences in chemoradiotherapy between the advanced cancer group and the non-advanced cancer group (P < 0.001) (see Table 1). Figure 1 shows the correlation between the variables. Age is negatively correlated with ASA and cancer progression. Chemotherapy and CEA are both positively correlated with cancer progression. Anesthesia time is also weakly positively correlated with cancer progression. Additionally, anesthesia time is weakly negatively correlated with both age and CEA. Figure 2 shows the importance of each covariate in GradientBoosting's final model. The five most influential covariates are chemotherapy, age, LogCEA, CEA and anesthesia time.
The four machine learning algorithms are compared in Table 2 and Fig. 4.

Discussion
Colorectal cancer has been on the rise in China in recent years. Because early symptoms are neglected, some patients have already entered the advanced stages by the time they are admitted to the hospital. This increases the risk of death. According to the TNM staging criteria for colorectal cancer, tumors invading the serosal layer of the intestinal wall are considered stage T4. According to previous studies, although the short-term effect of radical surgery for T4 patients is ideal, the long-term effect is poor, and the recurrence and metastasis rates are high 17 .
In this study, the postoperative cancer progression rate in patients with stage IV colorectal cancer was as high as 77.9%. Our study compared four ML algorithms using real-world data and found that the DecisionTree, GradientBoosting and LightGBM algorithms can better predict the postoperative cancer progression of patients with stage IV colorectal cancer, in both the training and testing groups. Furthermore, it was found that the five most influential covariates were chemotherapy, age, LogCEA, CEA and anesthesia time. These variables are correlated with progression. For example, there is a significant positive correlation between chemotherapy and cancer progression. Anesthesia time is also weakly positively correlated with cancer progression.
Postoperative chemoradiotherapy is a standard adjuvant therapy for patients with T3-4 and/or lymph node-positive rectal cancer. Long-term postoperative radiotherapy can reduce local recurrence by 50% to 60%, compared to surgery alone 18 . Simultaneous addition of fluoropyrimidine chemotherapy and radiotherapy can further reduce systemic metastasis and local recurrence 19 . At present, the most controversial issue is whether postoperative radiotherapy is necessary for those with low risk of local recurrence, as indicated by the postsurgical pathology. An example of this would be patients with upper rectal cancer or who are staged as T1-2N1 or T3N0. Retrospective studies in a single institution have shown that some T3N0 patients may not require postoperative radiotherapy 20 . Furthermore, among patients with advanced cancer, early palliative care may optimize patient selection for chemotherapy, reducing the use of high-intensity therapy by focusing on quality of life in accordance with patients' performance, preferences and care goals 21 . Additionally, no clear linear pattern between adjuvant chemotherapy and better adjusted relative survival in colon cancer was observed 22 . These results did not indicate that radiotherapy and chemotherapy will benefit patients with stage IV colorectal cancer after surgery. This may only reflect that surgery can be applied to patients at later stages.
In recent years, the incidence and mortality of colorectal cancer have risen, and the age of onset has become younger 23 . Our study also showed that patients in the cancer progression group were younger than those in the non-progression group, and that age still carried a large weight in predicting cancer progression.
Serum CEA is an acidic glycoprotein with human embryo antigen specificity. It is an important marker of digestive tract tumors. Serum tumor markers are common in tumor diagnosis. Many studies [24][25][26] have evaluated the role of CEA, CA19-9 and CA50 in the diagnosis, prognosis and recurrence monitoring of colorectal cancer. Similarly, this study also showed that LogCEA is an important factor in the progression of stage IV colorectal cancer patients after surgery.
Surgical injury and anesthesia can cause a bodily stress response, affecting the immune response and causing reversible changes in immune function. This study found that anesthesia time carries an important weight in predicting cancer progression. This may be related to anesthesia-induced changes in immune function among cancer patients during the perioperative period.
A follow-up study conducted by Bonjer et al. 27 showed that the 3-year disease-free survival rates for patients after laparoscopic surgery (LS) and open surgery (OS) were 74.8% and 70.8%, respectively. The results obtained by COREAN 28 were 79.2% and 72.5%, respectively. There was no significant difference between LS and OS in local recurrence, disease-free survival, or overall survival after surgery for rectal cancer (RC). In this study, however, laparoscopic surgery was found to promote tumor progression in patients with stage IV colorectal cancer. This may be related to the use of CO2 pneumoperitoneum in laparoscopic surgery, which affects patients' immune function and thereby increases the risk of tumor metastasis and recurrence, influencing prognosis.
The incidence of colorectal cancer ranks third among the most common malignancies among men and second among women. It is the fourth leading cause of cancer-related mortality worldwide 23 . In this study, sex was also found to be a factor in the progression of postoperative patients with stage IV colorectal cancer.
The anatomical features of portal venous return make the liver the most common site of distant metastasis in colorectal cancer. Hepatic metastases were found in 20% of patients at the time they were diagnosed with colorectal cancer. This makes the disease difficult to treat, and the prognosis is usually bleak. This is similar to the findings of the present study.
This retrospective, observational study has several limitations. Firstly, patients were not randomized, no comparison was made between ML prediction and conventional statistical prediction, and clinical care was not standardized. Therefore, the effects of selection bias and unmeasured confounding variables could not be excluded. Secondly, due to limitations in data acquisition, data on total anesthesia requirements, perioperative analgesia and intraoperative chemotherapy for each patient (such as high-temperature intraperitoneal chemotherapy) were unavailable. Thirdly, different parameter settings for each ML algorithm may have produced different results.

Conclusion
GradientBoosting and LightGBM predict the postoperative cancer progression of patients with stage IV colorectal cancer more accurately than the other two ML algorithms. This suggests that ensemble algorithms are more effective than basic algorithms. The five most influential covariates in cancer progression after surgery for patients with stage IV colorectal cancer are chemotherapy, age, LogCEA, CEA and anesthesia time. Anesthesia time has a weak positive correlation with cancer progression. Additional multicenter clinical studies are needed in the future.