Introduction

Gynecologic malignancies account for approximately 12% of all new cancer cases and 15% of all female cancer survivors1. Gynecologic malignancies arise primarily at five anatomic sites: the cervix, ovary, uterus, vagina, and vulva2. Cervical, uterine, and ovarian cancers accounted for 5.0%, 5.9%, and 2.8%, respectively, of all malignancies among women worldwide in 20123. In the United States, approximately 84,000 new cases of gynecologic malignancy are diagnosed annually, resulting in about 28,000 deaths4. Standard management often consists of surgery (e.g., debulking surgery, hysterectomy, and bilateral salpingo-oophorectomy) with neoadjuvant chemotherapy4,5. Hysterectomy is the most common surgical procedure in gynecology worldwide6. Despite surgery, gynecologic patients often have residual disease, defined as cancer cells remaining after treatment7,8,9.

The presence of residual disease is often attributable to local inflammation and trauma from the surgical procedure, which can cause residual cancer cells to shed into the circulation and accelerate micrometastatic growth10,11. This was observed as early as the end of the twentieth century, when studies found that cancer patients treated with resection had lower survival rates than those managed expectantly12,13. Because surgery is performed so commonly in gynecologic oncology, these patients are at particular risk of adverse surgical outcomes, especially residual disease14,15,16,17. Residual disease in gynecologic cancer survivors is common and requires timely intervention to improve survival outcomes18,19,20,21.

Healthcare providers currently predict a patient’s risk of residual disease using clinicopathologic and molecular prognostic factors22,23,24,25,26,27. However, identifying individuals at risk for residual disease with current methods is difficult, and the prognosis of recurrent gynecologic malignancy is poor28,29,30,31. Better, more clinically applicable predictive models of residual disease risk could improve patient outcomes, chiefly by identifying patients who could benefit from early intervention and potentially adjuvant therapy32,33,34,35. Existing prognostic aids are specimen and procedure based and are often specific to a particular type of malignancy22,36,37,38,39. Furthermore, existing prognostic aids, such as diagnostic radiology, may be less accessible in low resource settings40,41. As such, there is a need for an automated, machine learning approach that can be used alongside conventional clinical data following surgery.

Machine learning (ML) is a field of artificial intelligence in which algorithms learn associations from existing data to build statistical models with predictive power over a given dependent variable. Model development begins with preprocessing the data to handle blanks (or NULL values) and to organize it numerically in a form models can accept. The dataset is then split into a “training” set, to which statistical equations are fit in order to develop the predictive model, and a “testing” set, where the developed model’s predictions of the outcome variable are compared against the true values. Machine learning models have begun to show considerable promise in healthcare42,43,44, including models built on the American College of Surgeons National Surgical Quality Improvement Program (ACS NSQIP), models predicting mortality among other end-points, and models aimed at predicting residual malignancy following cytoreduction38,45,46,47. However, few studies have developed machine learning models to predict the presence of residual cancer using health data from postoperative hysterectomy patients.
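As a minimal, hypothetical sketch of this workflow (the file name, outcome column name, and classifier choice here are illustrative, not the study’s actual pipeline):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("cohort.csv")        # hypothetical preprocessed dataset
X = df.drop(columns=["outcome"])      # predictor features
y = df["outcome"]                     # binary dependent variable

# Hold out a testing set; the model is fit only on the training portion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))    # accuracy on unseen testing data
```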

We, therefore, aimed to develop and validate a multivariate machine learning model to predict a given patient’s risk of having postoperative residual malignancy following hysterectomy using easily accessible clinical and laboratory parameters.

Results

Patient characteristics

A total of 3656 patients who underwent a hysterectomy for malignancy were extracted from the ACS NSQIP procedure-targeted database over the 15-year period of 2005–2019. For the purposes of this study, the training cohort consisted of 2925 patients (constituting 80% of the dataset) and the testing cohort consisted of 731 patients (20% of the dataset). A flowchart of the patient selection process based on our inclusion criteria is included in Fig. 1.

Figure 1

Flow chart showing the hysterectomy patient cohort selection, model training, and performance evaluation processes. A total of 3656 patients were used for model development and were randomly divided into an 80% training set (2925 patients) and a 20% testing set (731 patients). k-fold cross-validation and grid searching for hyperparameters were conducted in the training set, and model performance was evaluated based on area under receiver operating characteristic curves and accuracy rates.

Study population characteristics

Of the 3656 patients analyzed, 684 (19%) were identified as definitively having residual cancer. Only definite “yes” and “no” classifications were retained so that the model was developed on the most accurate and applicable data. Summary tables with descriptive statistics by residual disease status for each feature in the cohort were developed (Tables 1, 2, 3).

Table 1 Clinical history/surgical information summary table.
Table 2 Cancer related variables summary table.
Table 3 Clinical complication outcome variables summary table.

Model variable importance

Five machine learning models were created, based on Random Forest, eXtreme Gradient Boosting (XGBoost), Logistic Regression (LR), Support Vector Machine (SVM), and K-Nearest Neighbor (KNN) algorithms. The logistic regression, random forest, and XGBoost models were the three highest performing models, and 35 statistically significant clinical parameters were included within them. The algorithm and methodology used to obtain the model variable importance plots have been previously cited in the literature48,49,50. Across the top three models, the top postoperative predictors of residual disease were the presence of malignancy located on the diaphragm, disease on the bowel mesentery, disease on the bowel serosa, and disease within the adjacent pelvis prior to surgical debulking. Within the XGBoost model specifically, the top postoperative predictors of residual disease were the presence of malignancy on the diaphragm, disease on the bowel mesentery, and disease on the bowel serosa (Fig. 2). A more comprehensive chart of ranked variable importances for the full XGBoost, logistic regression, and random forest models can be found in Supplemental Figs. S1–S3. Supplemental Fig. S4 shows the variables ranked as having little or no importance to the XGBoost model. The variables with the largest odds ratios were the presence of a bladder fistula (OR 3.05), presence of a urethral fistula (OR 3.04), low cervical cancer staging, and presence of gross abdominal disease on the diaphragm (OR 3.37).
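As a hedged sketch of how such ranked importances could be extracted (assuming the xgboost scikit-learn wrapper and a training split `X_train`, `y_train` from the earlier preprocessing; this is not the authors’ exact code):

```python
import pandas as pd
from xgboost import XGBClassifier

# `X_train` and `y_train` are assumed to come from the 80% training split.
xgb_model = XGBClassifier(eval_metric="logloss")
xgb_model.fit(X_train, y_train)

# feature_importances_ yields one weight per column; sorting ranks predictors.
importances = pd.Series(
    xgb_model.feature_importances_, index=X_train.columns
).sort_values(ascending=False)
print(importances.head(10))  # e.g., disease-site indicator variables
```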

Figure 2

Analysis of the importance of each variable in the XGBoost machine learning model. The histogram shows the relative importance of all 35 clinical features in the XGBoost model, quantified by assigning each variable a weight between 0 and 100.

Model performance

The Extreme Gradient Boosting model had an AUC of 0.90 (95% CI 0.87–0.93), with an accuracy of 87.3%. The Random Forest model had an AUC of 0.90 (95% CI 0.87–0.93), with an accuracy rate of 87.3%. The Logistic Regression model had an AUC of 0.90 (95% CI 0.87–0.93) with an accuracy rate of 87.0%. The K-nearest-neighbors model had an AUC of 0.70 (95% CI 0.65–0.76), with an accuracy of 80.8%. The support vector machine model had an AUC of 0.59 (95% CI 0.53–0.65), with an accuracy of 80.4% (Table 4).

Table 4 Comparative chart displaying the accuracy score, area under the receiver operating characteristic curve, F1 score, and Matthews Correlation Coefficient (MCC) for each individual machine learning model.

The XGBoost, Random Forest, and Logistic Regression models all had comparable AUC and accuracy metrics, outperforming the SVM and KNN models (Fig. 3). The accuracy rates of these top three models exceed the current rate of residual disease diagnosis by healthcare providers.
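A minimal sketch of how the metrics in Table 4 could be computed for any one fitted model, assuming `clf` is a fitted classifier and `X_test`, `y_test` are the held-out 20% (names illustrative):

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]  # predicted probability of residual disease

print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
print("MCC:     ", matthews_corrcoef(y_test, y_pred))
print("AUC:     ", roc_auc_score(y_test, y_prob))
```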

Figure 3

Evaluation of the machine learning models’ predictive abilities: receiver operating characteristic curves of all models plotted, along with area under the curve for each model listed in the legend.

Methods

Data was extracted from the ACS NSQIP procedure-targeted database from the time period of January 2005 to December 2019. Patients who underwent a hysterectomy for a known malignancy were included within the extracted dataset. The ACS NSQIP database is a national surgical registry used to track risk-adjusted outcomes after surgical procedures from any medical specialty. Prospective variables are obtained and audited by trained clinical reviewers. The American College of Surgeons National Surgical Quality Improvement Program and the hospitals participating in the ACS NSQIP are the source of the data used herein; they have not verified and are not responsible for the statistical validity of the data analysis or the conclusions derived by the authors.

The inclusion criteria were: all patients in the ACS NSQIP procedure-targeted database who underwent a hysterectomy between the 2005 and 2019 calendar years, with at least 75% of their clinical data present and no missing values for the retained clinical features. The exclusion criteria were: any non-hysterectomy patient, any hysterectomy patient with over 25% of values missing, and any hysterectomy patient with missing values for the retained clinical features.

Outcome

The primary outcome was the presence of gross residual disease following a hysterectomy procedure for malignancy. Within the ACS NSQIP dataset, this variable is coded as “Yes”, “No”, or “NULL” in cases where it was either not recorded or not possible to identify. The ACS NSQIP clinical support team defines gross residual disease as any portion of the metastatic tumor remaining after the surgical procedure.

Patients carrying blank/NULL values in the primary outcome column (gross residual disease) were removed during preprocessing to eliminate uncertainty and inaccuracy from training.

The study’s primary aim was to construct comparable models with improved parameters, yielding a risk predictor for residual cancer after a hysterectomy procedure. Each predictor (clinical and laboratory variables) was examined for its odds ratio with a 95% confidence interval (CI).

Machine learning models

All variables for the entire cohort of hysterectomy patients were converted into numeric form. Continuous numeric variables were left as is, and binary “yes”/“no” responses were encoded as 1 and 0, respectively.
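A minimal pandas sketch of this encoding step, assuming a hypothetical DataFrame `df` whose binary columns hold “Yes”/“No” strings:

```python
# Identify columns whose non-missing values are only "Yes"/"No".
binary_cols = [
    c for c in df.columns
    if set(df[c].dropna().unique()) <= {"Yes", "No"}
]
for col in binary_cols:
    df[col] = df[col].map({"Yes": 1, "No": 0})  # continuous columns are left as is
```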

The initial development cohort comprised 190,488 patients with 44 clinical variables. Any patient with missing data on the presence or absence of gross residual disease was excluded. Columns with over 25% of values missing were also dropped to reduce inaccuracy. No imputation was used in the development or validation cohorts, to avoid introducing erroneous bias. This left 4682 patients for further analysis.
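In pandas, this filtering could look like the following sketch (“residual_disease” is a hypothetical column name for the gross residual disease field):

```python
# Keep only patients with a definitive outcome recorded.
df = df.dropna(subset=["residual_disease"])

# Drop any column with more than 25% of its values missing; no imputation.
df = df.loc[:, df.isna().mean() <= 0.25]

# Exclude remaining patients with any missing values in the retained columns.
df = df.dropna()
```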

Multicollinearity was assessed with a heatmap correlation matrix, and variables with high variance inflation factors (VIFs) were omitted to preserve the statistical integrity of the model’s input variables. Variables with a VIF greater than 3 were dropped from the model one at a time, starting with the highest VIF; after each removal, the VIFs of the remaining features were recomputed, and the process was repeated until every remaining variable had a VIF below 3.
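A sketch of this iterative elimination, here using statsmodels’ `variance_inflation_factor` (the library choice and function structure are our illustration, not necessarily the authors’ implementation):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def prune_by_vif(X: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """Iteratively drop the feature with the highest VIF until all fall below threshold."""
    X = X.copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vifs.max() < threshold:
            return X                          # all remaining VIFs are below 3
        X = X.drop(columns=[vifs.idxmax()])   # drop the worst offender, recompute
```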

This left 3656 patients for the final analysis. The dataset was split 80–20% into training and testing sets; the 80% training set was used to fit the logistic regression, random forest, and extreme gradient boosting models, with grid search performed for hyperparameter tuning. The outcome variable, gross residual disease, was separated from the feature data frame prior to training so it could not leak into the predictors and skew the predictive potential.

The other 20% was used to test the accuracy of each model. Grid search was used to find the estimator with optimal hyperparameter values: during the search, fivefold cross validation was performed on estimators with different hyperparameter values, and the estimator with the largest mean cross-validated score was selected as the optimal model. Hyperparameters are properties of a model that can be tuned to control its learning process, though searching over too many of them lengthens execution time. In the past decade, methods have been developed to rank hyperparameters by importance, typically by how much a metric such as accuracy or AUC improves across multiple datasets. For each classification model, hyperparameters were chosen that ranked highly for that model across multiple datasets, as indicated in the literature51,52.
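A hedged sketch of this search using scikit-learn’s `GridSearchCV`, shown for a random forest (the grid values and scoring metric here are illustrative assumptions, not the study’s reported grid):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300, 500],   # assumed example values
    "max_depth": [None, 5, 10],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                # fivefold cross-validation within the training set
    scoring="roc_auc",   # illustrative choice of cross-validated score
)
search.fit(X_train, y_train)
best_model = search.best_estimator_  # estimator with the best mean CV score
```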

The cohort was split at the patient level so that no training data could appear in the testing set. All remaining variables were included in the model to maximize predictive potential without introducing background noise.

To mitigate bias, the data were checked for high multicollinearity (intercorrelation between any two variables) to identify features whose removal should be considered because they could degrade the model’s prediction accuracy. Features with high multicollinearity were omitted to minimize bias in the model’s predictions. After generating a correlation heatmap and computing the variance inflation factor for each variable, no remaining features showed high multicollinearity. We also performed cross validation on our models, which reduced bias and variance and helped prevent the models from overfitting the data.

Statistical analysis

Descriptive statistical analysis was conducted on the data according to each patient’s gross residual disease status, as recorded in the NSQIP. Initial analysis consisted of an independent one-way analysis of variance (ANOVA) for every continuous numeric variable in the model, a chi-squared test for every categorical variable, and a Fisher exact test for binary variables, partitioned between patients who did and did not have a diagnosis of gross residual disease.

The ML models were constructed from the training cohort and assessed on the validation cohort, held independent from model development, by calculating the area under the curve (AUC) of each model’s receiver operating characteristic (ROC). The ROC plots the true positive rate against the false positive rate; the AUC was used because it describes the model’s classification ability independently of any decision threshold. A 95% CI for each model’s AUC was obtained through bootstrapping.
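One way such a bootstrap could be implemented is sketched below; the number of resamples (1000) and the random seed are our assumptions, as they are not reported above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.asarray(y_test)    # held-out labels
y_score = np.asarray(y_prob)   # model-predicted probabilities

rng = np.random.default_rng(0)
aucs = []
for _ in range(1000):                                # 1000 resamples (assumed)
    idx = rng.integers(0, len(y_true), len(y_true))  # sample with replacement
    if len(np.unique(y_true[idx])) < 2:
        continue                                     # AUC undefined with one class
    aucs.append(roc_auc_score(y_true[idx], y_score[idx]))

lo, hi = np.percentile(aucs, [2.5, 97.5])
print(f"AUC 95% CI: {lo:.2f}-{hi:.2f}")
```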

All analyses were conducted in Python version 3.8.8 (https://www.python.org/downloads/release/python-388/) (Python Software Foundation)54 using the scikit-learn version 0.24.1 package (https://scikit-learn.org/stable/about.html)53, and in R 4.1.0 (https://www.R-project.org/)55.

Discussion

This machine learning cohort study demonstrated the feasibility of applying machine learning models on a large, heterogeneous population of hysterectomy patients in order to forecast the presence of gross residual disease postoperatively.

In the setting of tumor excision surgeries, there are two possibilities: cases where surgeons have definitively identified or ruled out the postoperative presence of gross residual disease via visual inspection and/or pathologic analysis, and cases where medical uncertainty remains as to whether any disease is left in the patient. The latter constitutes a serious clinical problem: physicians must decide how to proceed with postoperative patient management, weighing the benefit of treating possible residual disease with adjuvant chemotherapy against the harm that therapy can do to the patient’s health.

Our dataset was obtained from the ACS NSQIP. Previous studies have described the significance of clinical features in the ACS NSQIP for predicting surgical outcomes of gynecologic procedures56. Here, we use machine learning models to automate this process. Machine learning models were prioritized over deep learning models in our study because of their faster run times, lower computational requirements, and ready interpretability (for example, through model variable importance plots that highlight key clinical features pertinent to a given outcome variable), all of which are vital for implementation in low resource settings. Deep learning models require more computational power, take longer to run, are less interpretable, and are better suited to more complex prediction tasks where an organized data frame is not available.

Our machine learning models were trained on definitively diagnosed cases of residual disease versus no residual disease, but they can be generalized to cancer patients whose residual disease status is unclear, particularly those in low resource settings, to give the surgeon direction for the patient’s postoperative clinical management. This study can serve as the basis for prospective trials of prophylactic chemotherapy for non-clinically evident residual disease.

Predicting residual disease after hysterectomy would improve treatment planning. Given the poor prognosis of recurrent gynecological cancers, there is a strong need for tools to identify gynecologic cancer patients at risk for residual disease following surgical procedures. Patients at high risk could be monitored more closely or moved directly to additional chemotherapy and radiation therapy. Machine learning can be used successfully for disease diagnosis and prediction57.

Previous studies have attempted to develop models predicting the risk of residual disease following surgery. In 2018, Horowitz et al. published a predictive model for microscopic residual disease following complete cytoreduction in patients with advanced epithelial ovarian cancer. While this study identified many variables predictive of residual disease at cytoreduction, the area under the receiver operating characteristic curve was 0.73, putting the predictive ability of the model into question38. In a more recent study, Kumar et al. reported computed tomography prediction models for residual disease at primary debulking surgery for advanced ovarian cancer. The model predicting gross residual disease had the highest predictive value, with its c-index reaching 0.76250.

In our work, we present machine learning risk models (LR, KNN, SVM, RF, and XGBoost) that combine clinical and operative parameters to identify patients at increased risk of residual disease following hysterectomy. The top three performing models (XGBoost, Logistic Regression, and Random Forest) had statistically similar ROC curves and accuracy rates. Our models were trained and validated on 3656 patients and showed consistent calibration across the database. The cohort was representative of hysterectomy patients across the United States58. Though our end goal differed, our models were competitive with results published in the literature for other machine learning-based studies59,60,61.

Our study approach had several strengths. Given the nature of the data collected, such an approach could also be applied to other cancers treated with surgery. In 2012, an estimated 8.2 million cancer deaths and 14.1 million new cancer cases occurred worldwide62. Accurately predicting residual disease across different cancers could lead to considerable reductions in healthcare costs while also improving long-term survival for cancer patients. Additionally, a prognostic approach based on clinical and operative parameters would also be accessible in low resource settings. This analysis could be implemented in other countries with large healthcare databases, such as Japan, without requiring additional data collection63. Furthermore, we included a detailed calibration assessment, which suggests our model would be well calibrated in other databases.

Our proposed approach had important limitations. First, because our models do not impute values, only definitively classified cases of residual cancer were counted. Patients whose residual cancer status was uncertain could not be used for model development, as surgeons were not able to definitively stage them. The models may therefore be biased toward clearly defined cases of gross residual cancer and may not perform as well for patients whose gross residual cancer status is hard to discern. However, with clinical validation, training on larger sample sizes can hopefully extend application to clinically ambiguous patients as well. Furthermore, greater consistency and fewer missing input values would improve the models’ discrimination. Second, these machine learning models were trained on the ACS NSQIP database and, despite thorough feature selection and hyperparameter optimization, may be fit to the nuances of the NSQIP data specifically. To overcome this limitation and increase generalizability, these models should be tested in other oncology settings with a mixture of diversified data sources. Doing so may help capture other significant parameters and, with a richer data source, achieve more competitive performance. Finally, though we can interpret the models’ decisions and variable splitting to identify patients at higher risk, the models capture only correlations in the data, not causal pathways.

The main potential safety issue in using AI systems to analyze patient data is a breach of patient privacy. To avoid this, all features used to develop models should be fully deidentified. In our research, we mitigated this risk by training our models solely on deidentified data, so no model can attribute given clinical features to an original patient, as identifying data was never shown to the models. Furthermore, because machine learning models encode aggregate statistical relationships rather than raw records, recovering the original patient data from them is not straightforward. To further reduce the risk that anyone could probe the models’ statistical equations to infer aggregate attributes of the original data, firewalls and secure deployment services can be used to restrict who can view or analyze the models.

Our machine learning models were trained on definitively diagnosed cases where the presence or absence of gross residual disease was known; they can be extrapolated to the vast majority of non-clinically evident cases of gross residual disease, where clinical uncertainty exists, to guide adjuvant therapy and/or postoperative follow-up. This will be most clinically useful at the end of index operations where surgical teams believe they have removed all cancer but have missed residual disease. In these settings, our machine learning models can predict the possibility of residual disease and risk stratify patients to alter their postoperative management. Our research serves as the basis for prospective studies of patients with non-clinically evident remaining cancer who are believed to have no residual disease but receive a high risk score from our model.

Our findings suggest that machine learning methods, specifically Logistic Regression, Random Forest, and Extreme Gradient Boosting models, have strong classification ability and hold potential for clinical application to guide patient management, improve patient outcomes, and modulate treatment regimens, particularly in low resource settings where primarily clinical and operative variables are available for analysis.

This model can be integrated in two ways depending on the clinical care setting. In developed settings, it can be deployed publicly as a software-as-a-service cloud platform that healthcare facilities integrate directly into their EHRs for dynamic prediction from available EHR data. The model would then generate a personalized risk score for the patient’s likelihood of residual disease, prompting healthcare providers to initiate earlier follow-up care and adjuvant therapy. In low-resource settings that lack EHRs but have widespread mobile devices, the model could be deployed as a mobile app in which healthcare providers manually enter the necessary clinical features to receive each patient’s risk score, indicating further therapy or closer follow-up.

Conclusion

Existing residual disease prognostic methods are time intensive, require pathology specimens, and are often restricted to modeling only one particular type of cancer. Current prognostic aids require expensive tools and are largely inaccessible in low resource settings. Our findings can streamline postoperative clinical diagnosis and serve as a novel lens for using commonly collected operative parameters to predict residual disease with machine learning.