Studying the impact of marital status on diagnosis and survival prediction in pancreatic ductal carcinoma using machine learning methods

Pancreatic cancer is a commonly occurring malignant tumor, with pancreatic ductal carcinoma (PDAC) accounting for approximately 95% of cases. Given its poor prognosis, identifying prognostic factors of pancreatic ductal carcinoma can provide physicians with a reliable theoretical foundation when predicting patient survival. This study aimed to analyze the impact of marital status on survival outcomes of PDAC patients using propensity score matching and machine learning. The goal was to develop a prognosis prediction model specific to married patients with PDAC. We extracted a total of 206,968 patient records of pancreatic cancer from the SEER database. To ensure the baseline characteristics of married and unmarried individuals were balanced, we used 1:1 propensity score matching. We then conducted Kaplan–Meier analysis and Cox proportional-hazards regression to examine the impact of marital status on PDAC survival before and after matching. Additionally, we developed machine learning models to predict 5-year cancer-specific survival (CSS) and overall survival (OS) specifically for married patients with PDAC. In total, 24,044 PDAC patients were included in this study. After 1:1 propensity score matching, 8,043 married patients and 8,043 unmarried patients were successfully enrolled. Multivariate analysis and the Kaplan–Meier curves demonstrated that unmarried individuals had a poorer survival rate than their married counterparts. Among the algorithms tested, the random forest performed the best, with AUCs of 0.734 for 5-year CSS and 0.795 for 5-year OS. This study found a significant association between marital status and survival in PDAC patients. Married patients had the best prognosis, while widowed patients had the worst. The random forest is a reliable model for predicting survival in married patients with PDAC.


Data source and patient selection
We used SEER*Stat software (version 8.4.0.1) to gather data submitted to SEER until November 2021. To identify patients with a primary pancreatic site, we applied the International Classification of Diseases for Oncology (ICD-O-3) topographical codes C25.0-C25.3 and C25.7-C25.9. We included patients with ICD-O-3 histology/behavior codes 8140/3 (adenocarcinoma) or 8500/3 (infiltrating duct adenocarcinoma). Patients whose tumors originated in the pancreatic islets of Langerhans (C25.4) were excluded. We also excluded patients with missing, unknown, or undifferentiated data on marital status, 6th AJCC stage and T/N/M stage, race, tumor differentiation, treatment information, or cause of death, as well as those with an unknown survival time or a survival time of less than 1 month.
After applying these selection criteria, we identified a total of 24,044 pancreatic ductal adenocarcinoma (PDAC) patients for this investigation. Patients were then divided into two groups according to marital status: married and unmarried. The screening procedure is shown as a flow diagram in Fig. 1.
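The selection rules above can be sketched as a simple record filter. This is only an illustration under stated assumptions: the function name, argument names, and the single `has_complete_covariates` flag (standing in for the checks on marital status, stage, race, grade, treatment, and cause of death) are hypothetical and are not SEER*Stat field names.

```python
# Site and histology codes are taken from the text; everything else is illustrative.
ELIGIBLE_SITES = {"C25.0", "C25.1", "C25.2", "C25.3", "C25.7", "C25.8", "C25.9"}
ELIGIBLE_HISTOLOGY = {"8140/3", "8500/3"}


def is_eligible(site, histology, survival_months, has_complete_covariates):
    """Apply the stated inclusion and exclusion rules to one record."""
    if site not in ELIGIBLE_SITES:
        # This also excludes C25.4 (islets of Langerhans).
        return False
    if histology not in ELIGIBLE_HISTOLOGY:
        return False
    if survival_months is None or survival_months < 1:
        # Unknown survival time or less than 1 month of survival.
        return False
    # Hypothetical flag: True only if no required covariate is missing or unknown.
    return bool(has_complete_covariates)
```

Applying such a filter record by record would reproduce the logic of the flow diagram, although the actual screening was performed in SEER*Stat.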

Variable classification
Our analysis incorporated the following factors from the database: sex, age at diagnosis, marital status, race, grade, TNM stage (6th), and primary site surgery. Age was dichotomized into two groups: < 50 years and ≥ 50 years. Participants were classified as married or unmarried based on their recorded status at the time of diagnosis. The unmarried group comprised those who were divorced/separated, single, or widowed.
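As a minimal sketch of the recoding described above, the two grouping rules can be written as small functions. The status strings and function names are assumptions made for illustration; they are not the exact labels stored in SEER.

```python
# Assumed lowercase labels for the statuses that make up the unmarried group.
UNMARRIED_STATUSES = {"divorced", "separated", "single", "widowed"}


def age_group(age_years):
    """Dichotomize age at diagnosis: < 50 years versus >= 50 years."""
    return "<50" if age_years < 50 else ">=50"


def marital_group(status):
    """Collapse the status recorded at diagnosis into married or unmarried."""
    normalized = status.strip().lower()
    if normalized == "married":
        return "married"
    if normalized in UNMARRIED_STATUSES:
        return "unmarried"
    # Missing or unknown statuses were excluded from the study, so they are rejected here.
    raise ValueError(f"unrecognized marital status: {status!r}")
```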

Outcome measurement
We defined overall survival (OS) as the interval from the date of diagnosis to the date of death or, for patients still alive, the last recorded follow-up. Similarly, cancer-specific survival (CSS) was defined as the interval from the date of diagnosis to the date of death attributable solely to PDAC.
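The two endpoints differ only in which deaths count as events. The sketch below encodes that distinction for a single patient. It assumes, as is common practice but not stated in the text, that deaths from other causes are treated as censored for CSS; the type and function names are illustrative.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    months: int     # follow-up time from diagnosis
    os_event: int   # 1 = death from any cause, 0 = censored
    css_event: int  # 1 = death attributed to PDAC, 0 = censored


def code_outcome(months, dead, died_of_pdac):
    """Encode OS and CSS event indicators for one patient.

    Assumption: a death from another cause is censored for CSS.
    """
    os_event = 1 if dead else 0
    css_event = 1 if (dead and died_of_pdac) else 0
    return Outcome(months, os_event, css_event)
```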

Statistical analysis
To minimize potential confounding between married and unmarried patients, we used sex, age, race, grade, TNM stage (6th), and primary site surgery as covariates for 1:1 propensity score matching (nearest-neighbor method with a stringent caliper of 0.001), using the R package MatchIt. We used the chi-square test to assess differences in categorical variables and estimated OS and CSS with Kaplan-Meier survival curves. Survival was compared between groups using log-rank tests. To investigate possible prognostic factors and estimate hazard ratios, we employed both univariate and multivariable Cox proportional-hazards regression models.
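To illustrate the idea behind 1:1 nearest-neighbor matching with a caliper, the following sketch pairs each married patient with the closest unused unmarried patient, provided their propensity scores are close enough. It is a simplified greedy version and not MatchIt's implementation: the caliper is applied to the absolute score difference (MatchIt can express calipers in standard-deviation units), the processing order is an arbitrary choice, and the scores are assumed to have been estimated beforehand.

```python
def match_one_to_one(treated, control, caliper=0.001):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, control: dicts mapping patient id -> propensity score.
    Returns (treated_id, control_id) pairs whose scores differ by no
    more than `caliper` (absolute difference, an assumption of this sketch).
    """
    available = dict(control)
    pairs = []
    # Process treated units from highest to lowest score (arbitrary ordering).
    for t_id, t_score in sorted(treated.items(), key=lambda kv: -kv[1]):
        if not available:
            break
        # Closest remaining control by propensity score.
        c_id = min(available, key=lambda cid: abs(available[cid] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control is used at most once
    return pairs
```

With a caliper this tight, any patient without a sufficiently close counterpart is dropped, which is why the matched cohort is much smaller than the original one.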
To establish a machine learning model, patients in the married group were randomly partitioned into a training set and a test set at an 8:2 ratio. Within the training set, we developed k-nearest neighbor, artificial neural network, naïve Bayes, and random forest models to predict the 5-year CSS and OS of married patients with PDAC. K-nearest neighbor (KNN) is a non-parametric algorithm that classifies or predicts outcomes based on the majority class or average of the 'k' closest data points in the feature space. An artificial neural network (ANN) is composed of interconnected nodes (neurons) organized in layers and learns to make predictions by adjusting the weights of its connections during training. Naïve Bayes is a probabilistic algorithm that applies Bayes' theorem, assuming independence between features, to calculate the probability of a particular class given the observed data. Random forest (RF) is an ensemble learning method that constructs multiple decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
All statistical analyses were carried out using R software (version 4.1.3) and SPSS software (version 25), with statistical significance set at two-sided P < 0.05.
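The random 8:2 partition and the majority-vote rule of k-nearest neighbor can be sketched in a few lines. This is a toy illustration using only the standard library, not the pipeline used in the study; the function names, the fixed seed, and the squared Euclidean distance are assumptions.

```python
import random
from collections import Counter


def split_train_test(records, test_fraction=0.2, seed=0):
    """Randomly partition records into (train, test), 8:2 by default."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points closest to query.

    train: list of (feature_vector, label) pairs.
    """
    def sq_dist(a, b):
        # Squared Euclidean distance (assumed metric for this sketch).
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = sorted(train, key=lambda item: sq_dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

A random forest aggregates in a similar way, except that the votes come from many decision trees rather than from neighboring data points.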

Pathological features and baseline characteristics
The SEER database provided data on a total of 206,968 patients with pancreatic cancer for potential inclusion in our study. Ultimately, 24,044 individuals were deemed suitable after screening, as shown in Fig. 1. Of the eligible patients, 15,024 (62.49%) were classified as married and 9,020 (37.51%) as unmarried. Additional details regarding pathological features are given in Table 1. Significant differences were noted between the married and unmarried cohorts with regard to sex, race, TNM stage, and surgery status (all P ≤ 0.001; Table 1).

The primary comparison assessed the impact of marital status on OS and CSS
In univariate Cox regression analysis, seven variables were significantly associated with mortality in PDAC for both OS and CSS: sex, age, race, grade, TNM stage, primary site surgery, and marital status (P < 0.05; Table 2). In multivariate Cox regression analysis, marital status, as well as sex, race, grade, TNM stage, and surgery status, emerged as independent prognostic factors that significantly influenced OS and CSS in patients with PDAC (P < 0.001; Table 3).

The secondary comparison assessed the impact of marital status on both OS and CSS
To control for potential confounding variables such as age, sex, and race between the married and unmarried groups, we employed 1:1 propensity score matching. After matching, 8,043 married patients and an equal number of unmarried patients (16,086 individuals in total) were successfully enrolled. The baseline characteristics were well balanced between the two groups (Table 4; Fig. 2), and no significant differences were observed (P > 0.05).
The findings indicate that, with the exception of race, all baseline characteristics were significant predictors of both OS and CSS (Table 5). In the univariate analysis after propensity-score matching, being unmarried (with reference to married) remained a statistically significant predictive risk factor of death (OS: HR = 0.870, 95% CI = 0.842-0.898, P < 0.001; CSS: HR = 0.882, 95% CI = 0.853-0.912, P < 0.001). In further multivariate analysis, all variables retained independent significance in predicting OS/CSS with the exception of sex. Moreover, unmarried status (with reference to married) exhibited a noteworthy negative influence on survival outcomes (OS: HR = 0.834, 95% CI = 0.808-0.861, P < 0.001; CSS: HR = 0.845, 95% CI = 0.817-0.873, P < 0.001; Table 5). Patients diagnosed before age 50, those with stage I cancer, those with well-differentiated tumors, and those who had undergone surgery had better OS and CSS than their respective reference groups (Table 5).
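A reported hazard ratio and its 95% confidence interval are enough to check the accompanying P value. The sketch below recovers the standard error of log(HR) from the interval width and computes a Wald z statistic, assuming the interval was built as log(HR) ± 1.96 × SE under a normal approximation. The function name is illustrative.

```python
import math


def wald_from_ci(hr, lower, upper, z_crit=1.959964):
    """Recover SE of log(HR), the Wald z statistic, and the two-sided p-value.

    Assumes a symmetric 95% CI on the log scale (normal approximation).
    """
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(hr) / se
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p
```

For the reported univariate OS estimate (HR = 0.870, 95% CI = 0.842-0.898), this gives a standard error of roughly 0.016 and a z statistic beyond 8 in absolute value, which is consistent with P < 0.001.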
The Kaplan-Meier curves presented in Fig. 3 indicate that unmarried individuals have a significantly lower survival rate than married individuals (P < 0.001). To further investigate the prognosis of different unmarried statuses, we grouped unmarried patients into separated/divorced, single, and widowed subgroups. As shown in Fig. 4, OS and CSS differed significantly among the marital status subgroups (P < 0.001).
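For readers unfamiliar with how such curves are built, the Kaplan-Meier product-limit estimate can be sketched as follows. This is a minimal illustration in plain Python rather than the R routines used for the analysis, and the toy data in the usage note are invented.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times: follow-up durations; events: 1 = death observed, 0 = censored.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at t
        if deaths:
            # Product-limit step: multiply by the conditional survival at t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve
```

For example, with follow-up times `[2, 3, 3, 5, 8]` and event indicators `[1, 1, 0, 1, 0]`, the estimated survival drops to about 0.8, 0.6, and 0.3 at months 2, 3, and 5. Comparing two such curves between groups is what the log-rank test formalizes.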

Machine learning-based outcome prediction in married patients
To explore the factors that influence the survival of married patients with PDAC, we used age, sex, race, tumor differentiation, TNM stage, and surgery status as input parameters for developing machine learning prediction models of the 5-year CSS and 5-year OS. The performance metrics of the four models are presented in Table 6. Among the machine learning models, the random forest model exhibited superior discrimination performance. For predicting the 5-year CSS, the random forest model achieved an AUROC of 0.734, accuracy of 0.592, recall of 0.552, specificity of 0.806, precision of 0.939, and F1 score of 0.695. For the 5-year OS, the AUROC, accuracy, recall, specificity, precision, and F1 score were 0.795, 0.572, 0.536, 0.940, 0.989, and 0.695, respectively. Artificial neural network, naïve Bayes, and k-nearest neighbor followed with AUROCs of 0.788, 0.771, and 0.708, respectively. Receiver operating characteristic (ROC) curves and AUROCs of the four models are displayed in Fig. 6. Using GridSearch, the hyperparameters of the optimal random forest model were: N_estimators = 100, Max_depth = 10, Min_samples_leaf = 2, Min_samples_split = 4, Max_features = auto (Table S1).
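The threshold-based metrics reported above all derive from a 2 × 2 confusion matrix. The sketch below shows the standard definitions and, as a consistency check, recomputes the F1 score from the reported precision and recall. The function names are illustrative, and the sketch assumes every denominator is nonzero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (nonzero denominators assumed)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": f1_from(precision, recall),
    }


def f1_from(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

Calling `f1_from(0.939, 0.552)` for the 5-year CSS and `f1_from(0.989, 0.536)` for the 5-year OS both round to 0.695, matching the reported F1 scores.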
The calibration curves demonstrated excellent agreement between predictions and observations (Fig. 7). For predicting the 5-year CSS, the k-nearest neighbor, artificial neural network, naïve Bayes, and random forest models gave Brier scores of 0.125, 0.118, 0.134, and 0.118, respectively. Similarly, for the 5-year OS, the same models gave Brier scores of 0.080, 0.073, 0.106, and 0.072, respectively, as outlined in Table 7.
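The Brier score used here is the mean squared difference between the predicted probability and the observed binary outcome, so lower values indicate better calibrated and sharper predictions. A minimal sketch of the computation, with an illustrative function name:

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probability and observed 0/1 outcome."""
    if len(probabilities) != len(outcomes) or not outcomes:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(outcomes)
```

As a point of reference, a model that always predicts 0.5 scores 0.25 regardless of the outcomes, and a perfect model scores 0.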
In this study, the clinical effectiveness of the four predictive models was assessed using decision curves and clinical impact curves. The DCA curve (Fig. 8) indicated that the random forest model had a greater net benefit than the "treat none" or "treat all" schemes across a threshold probability range of 0.6 to 1.0. Further, the random forest model exhibited superior clinical impact compared with the other models. Notably, when the threshold probability was set above 75% (Fig. 9), the number of positive cases predicted by the models (i.e., those at high risk) closely matched the number of true-positive cases (i.e., those who actually had high-risk outcomes). Considering all four evaluation metrics, it can be concluded that the random forest algorithm performed the best for prediction purposes and could offer more precise and systematic treatment guidance and support to married patients with PDAC.
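Decision curve analysis compares strategies by their net benefit at a given threshold probability: true positives are credited, and false positives are penalized by the odds of the threshold. The sketch below uses the standard net-benefit formula; the function names and the toy data in the usage note are assumptions for illustration.

```python
def net_benefit(probabilities, outcomes, threshold):
    """Net benefit of acting on patients whose predicted risk >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(outcomes)
    flagged = [(p, y) for p, y in zip(probabilities, outcomes) if p >= threshold]
    tp = sum(1 for _, y in flagged if y == 1)
    fp = sum(1 for _, y in flagged if y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(outcomes, threshold):
    """Net benefit of the 'treat all' strategy; 'treat none' is always 0."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)
```

For instance, with predicted risks `[0.9, 0.8, 0.7, 0.3, 0.2]`, outcomes `[1, 1, 0, 0, 1]`, and a threshold of 0.6, the model's net benefit is about 0.1 while "treat all" yields about 0, so the model would be preferred at that threshold.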

Discussion
Marital status has been shown to be associated with survival in chronic diseases such as cancer, with married individuals having a longer life expectancy and better quality of life in various diseases. For instance, Cheng Xu et al. used a matching method to show that, among patients with NPC from 1973 to 2012, married patients had better 5-year CSS/OS than unmarried patients 16. Gino Inverso et al. also observed a significant protective impact of marital status on metastatic oral cancer and laryngeal cancer 17. However, while studies using the SEER database have confirmed that marital status is a prognostic factor for pancreatic cancer, the available studies have failed to exclude confounding factors 9,18,19. Previous studies have found that sex, age, stage, race 20, and surgery 21 were associated with the survival of PDAC patients. Therefore, to improve comparability between married and unmarried patients, we conducted 1:1 propensity score matching on eligible PDAC patients screened from the SEER database, yielding relatively reliable results based on well-matched datasets. Together with its larger sample size, our study may therefore provide more robust results than previous studies. Male sex, age over 50 years, higher TNM stage, worse tumor differentiation, and no surgical treatment were determined to be risk factors for prognosis. Married patients had better survival outcomes, whereas their unmarried counterparts had significantly poorer OS/CSS.
The current study found that marital status plays a significant role in the survival of PDAC patients, and we suspect that several possible reasons exist. First, marriage provides positive social support. It has been found that widowed, divorced, and separated individuals lack legal relationships and partner support and help during diagnosis and treatment, and are hence at a higher risk of psychological distress 22. Similarly, relative to married patients, unmarried patients are more likely to experience prolonged negative emotional states due to the absence of social support and partner companionship. This could lead to physiological dysfunction resulting from long-term exposure to glucocorticoids and catecholamines, which negatively affects the tumor microenvironment, promotes tumor growth and migration, and stimulates angiogenesis, thus affecting the prognosis 23. A healthy marital status plays an essential role in establishing a good psychological state, reducing negative emotions such as anxiety and depression, and improving survival rate 10,24. Secondly, stable marital relationships are typically associated with higher economic status, and family members such as spouses and children may provide financial and spiritual support for long-term treatment 25. In other words, a stable marital status can improve patient compliance with the treatment regimen 26. In addition, married people with a good economic base are more likely to purchase health insurance and can receive some Medicaid at the time of diagnosis 27. Furthermore, studies have found that patients with private health insurance are more often diagnosed at early stages of cancer and have longer survival times and a better prognosis, whereas patients without private health insurance are usually diagnosed at an advanced stage and have a poor prognosis 28. Thirdly, married individuals typically adopt healthier lifestyles, with better diets, more exercise, and less substance abuse, contributing to better health outcomes. It has been shown that habits such as smoking and alcoholism are risk factors for the development of pancreatic ductal carcinoma, and unmarried people are more likely to have these habits 26.
Recently, machine learning has been widely used in the medical field 29. Some researchers have developed prognostic prediction models for pancreatic cancer using machine learning methods, because machine learning algorithms are more accurate than traditional statistical methods in predicting survival outcome in the fifth year 30. In this study, the k-nearest neighbor, artificial neural network, naïve Bayes, and random forest algorithms were used to predict the 5-year CSS and OS for married patients with PDAC. The random forest algorithm outperformed the other models in predicting 5-year CSS/OS. In particular, its high AUROC indicates good discriminative performance, that is, the model could better distinguish between patients who survived and those who died. Furthermore, because random forest has good generalization capability, it is less prone to overfitting. Moreover, the random forest model stands out in prognostic prediction tasks due to its predictive accuracy, robustness to noisy data, interpretability through feature importance analysis, capacity to handle non-linearity, generalization to unseen data, stability, and ease of implementation. Our study presents the first machine learning-based model that predicts survival in married patients with PDAC. It demonstrates excellent performance and provides doctors with an easily accessible and more accurate survival prediction tool for these patients, which may better guide clinical practice. It must be acknowledged that rapid advances in machine learning, particularly in deep learning, ensemble methods, and reinforcement learning, have led to models with increased predictive power. Therefore, we believe that our prediction model could be improved as machine learning develops and could then provide more accurate predictions.
of pancreatic cancer 31. The lack of accurate screening before data matching may result in biased conclusions. In particular, diabetes confers a 3.05-fold increased risk of PDAC onset in diabetic individuals compared with nondiabetic individuals 32. Secondly, the SEER database lacks data on patients' quality of life, such as socioeconomic level and living environment, so these factors could not be included in our analysis. Thirdly, marital status in this study was recorded only at diagnosis, and no dynamic follow-up surveys assessed changes in marital status during PDAC treatment. Because marital status during later treatment is unknown, this may introduce information bias into the clinical outcomes. Fourthly, we classified PDAC patients living with partners but not legally married as single, which may underestimate survival among this group, whose outcomes may be better than those of other unmarried or single patients. Although the proportion of such patients is limited, they may affect the conclusions of this study. Finally, the generalizability of this study may be limited to the population under investigation, as cultural variations, disparities in living standards, and economic differences between countries may influence the applicability of these findings to patients in other regions.

Conclusion
Our study provides evidence that marital status is an independent prognostic factor for PDAC. Future studies should investigate the mechanisms behind this association and the impact of marital quality and therapy on cancer outcomes. We established machine learning models to predict the survival of married patients with PDAC, with the RF model performing best.

Figure 1. The flow-process diagram for selecting patients based on inclusion and exclusion criteria.

Figure 2. Propensity score matching for married and unmarried groups.

Figure 3. Kaplan-Meier survival curves of PDAC patients between married and unmarried groups. (A) Overall survival. (B) Cancer-specific survival.

Figure 5. The impact of marital status on CSS and OS in the secondary comparison. Circles represent the aHRs with the 95% CIs indicated by horizontal bars.

Figure 7. Calibration curves for testing the stability of four prediction models. The logistic calibration curve is shown in solid blue, and the statistics are displayed in the top left corner of each graph.

Figure 8. Decision curve analysis of eight prediction models.

Table 1. Baseline characteristics of patients with PDAC based on marital status.

Table 2. Univariate analysis to assess the impact of marital status on OS/CSS in PDAC.

Table 3. Multivariate analysis to assess the impact of marital status on OS/CSS in PDAC.

Table 4. Baseline characteristics of patients with PDAC based on marital status after propensity-score matching.

Table 5. Univariate and multivariate analysis of the impact of marital status on survival outcomes in PDAC.

Table 7. Calibration tests of four machine learning models for predicting 5-year CSS and 5-year OS.