Introduction

Esophageal cancer (EC) ranks among the primary causes of cancer-related mortality worldwide, placing seventh in incidence and sixth in mortality. Estimates suggest that in 2020, approximately 1 in every 18 cancer-related deaths was attributed to EC1. The two predominant histological subtypes of EC are esophageal squamous cell carcinoma (ESCC) and esophageal adenocarcinoma (EAC). ESCC prevails predominantly among Asian populations, with alcohol consumption and smoking identified as significant risk factors, while EAC is more prevalent among Caucasians and closely associated with obesity and smoking2,3. Substantial disparities in the incidence and mortality rates of EC are observed across different geographic regions. According to GLOBOCAN 2020 estimates, Asia accounts for the largest share of global EC incidence and mortality, at 79.7% and 79.8%, respectively. Additionally, the mortality-to-incidence ratio (MIR) for EC in Asia is 0.89, surpassing that of Europe (0.82) and North America (0.83)4. A spectrum of lifestyle and environmental factors contributes to EC risk or protection. Studies have identified factors such as rubber production, pickled vegetable consumption, hot beverage consumption, and exposure to tetrachloroethylene as etiological factors for EC5,6, with certain Asian populations facing heightened exposure risks due to dietary habits and occupational attributes.
Despite notable advancements in modern radiotherapy techniques and updated chemotherapy agents, the efficacy of EC treatment remains modest, with low survival rates and high rates of local recurrence persisting among EC patients7. This is particularly true in Asian populations, where the predominant subtype is ESCC and the 5-year overall survival (OS) rate falls below 30%8. Consequently, there is an urgent need to identify prognostic factors that support risk stratification and clinical management for Asian EC populations, which are characterized by elevated incidence and mortality rates.

The tumor, lymph node, metastasis (TNM) staging system has garnered widespread recognition and utilization in prognostic analysis of cancer. However, with advancements in treatment modalities and multimodal approaches, prognostic disparities persist even among patients with identical TNM staging9. In recent years, nomograms have emerged as a more efficient predictive tool for forecasting the prognosis of various cancer types, serving as a complementary approach to the TNM staging system. Additionally, some researchers argue for the consideration of other significant risk factors such as age, tumor location, tumor size, and treatment modalities in predicting cancer survival rates10,11,12. Various prediction models related to cancer have been established utilizing the Surveillance, Epidemiology, and End Results (SEER) database. For instance, Qian et al. developed a nomogram to predict the prognosis of EAC13, while Fan et al. compared the prognostic disparities between endoscopic and surgical treatments in T1b-stage EC and formulated a nomogram for survival prediction14. Similarly, Tang et al. devised a nomogram to predict cancer-specific survival in patients with initial distant metastases in EC15. However, to our knowledge, there has been no relevant study focusing on the Asian EC population. Hence, we intend to develop a nomogram for predicting the prognosis of Asian EC patients based on the SEER database.

Methods

Patient selection

The SEER database (http://seer.cancer.gov/seerstat/) contains extensive cancer case data spanning from 1973 to the present day, encompassing populations from various regions of the United States. These data comprise individual information about cancer patients, including demographics (such as age, gender, and race), pathological details (such as cancer type, differentiation, and tumor size), treatment information (such as radiation therapy, surgery, and chemotherapy), and patient survival status. Necessary personal and organizational information is provided on the SEER database’s official website, and upon completion of an online registration form, approval for access is granted along with corresponding access credentials. Data retrieval is performed online using SEER*Stat software version 8.4.0.1. In the SEER database, the smallest unit for the “survival months” variable is one month, meaning that all patients with a survival time of less than one month are recorded as having a survival time of “0 month.” Some of these patients include cases of intraoperative or in-hospital death. To ensure the accuracy and reliability of the data analysis and to avoid the influence of extreme data on the overall statistical results, these cases were excluded.

Inclusion criteria encompass (1) EC with clear pathological diagnosis; (2) primary tumor; (3) initial diagnosis occurring between 2004 and 2015. Exclusion criteria include (1) demographic characteristics, including unknown age, race, and marital status; (2) clinical pathological features, including unknown tumor size, histological type, tumor grade, and tumor staging; (3) treatment information, including unknown surgery, radiation therapy, and chemotherapy; (4) survival time less than 1 month. The specific process can be seen in Supplementary Fig. 1.

Variable selection and transformation

This study collected demographic characteristics of patients, including age, sex, race, and marital status; clinical pathological features such as histological type, primary site of tumor, tumor size, grade, and TNM staging; treatment information including surgery, radiation therapy, and chemotherapy; and follow-up data including survival status and survival time. The primary observation endpoint of the study was OS, defined as the time from the beginning of the study until death from any cause. To ensure reliable follow-up data, patients included in the study were diagnosed between 2004 and 2015, with the last follow-up conducted in November 2020, ensuring a minimum follow-up period of five years for patients who remained alive. Regarding tumor site, “Cervical esophagus” and “Upper third of esophagus” were categorized as upper segment, “Middle third of esophagus” as middle segment, and “Lower third of esophagus” and “Abdominal esophagus” as lower segment; “Overlapping lesion of esophagus,” “Thoracic esophagus,” and “Esophagus, NOS” (patients with unspecified tumor site) were categorized as others. For histological types, International Classification of Diseases for Oncology, third edition (ICD-O-3) codes 8140–8145, 8210, 8211, 8255, 8260, 8261, 8263, 8310, 8480, 8481, 8490, and 8574 were defined as EAC; codes 8050–8052, 8070–8076, and 8083–8084 were defined as ESCC; and the remaining codes were defined as other.
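The recoding rules described above can be written down compactly. The following is a minimal sketch in Python (the study itself used R); the function names and the assumption that site labels arrive as the exact strings quoted in the text are illustrative, not taken from the study's code.

```python
# Illustrative recoding of primary site and ICD-O-3 histology codes.
# Function and constant names are hypothetical; label strings follow the text.

SITE_TO_SEGMENT = {
    "Cervical esophagus": "Upper",
    "Upper third of esophagus": "Upper",
    "Middle third of esophagus": "Middle",
    "Lower third of esophagus": "Lower",
    "Abdominal esophagus": "Lower",
}

# ICD-O-3 histology codes listed in the text (range() excludes its upper bound).
EAC_CODES = set(range(8140, 8146)) | {
    8210, 8211, 8255, 8260, 8261, 8263, 8310, 8480, 8481, 8490, 8574,
}
ESCC_CODES = set(range(8050, 8053)) | set(range(8070, 8077)) | {8083, 8084}


def recode_site(site_label):
    """Map a primary-site label to Upper/Middle/Lower; anything else is Others."""
    return SITE_TO_SEGMENT.get(site_label, "Others")


def recode_histology(icdo3_code):
    """Map an ICD-O-3 histology code to EAC, ESCC, or Other."""
    if icdo3_code in EAC_CODES:
        return "EAC"
    if icdo3_code in ESCC_CODES:
        return "ESCC"
    return "Other"
```

Under these rules, `recode_site("Thoracic esophagus")` falls into the "Others" category and `recode_histology(8070)` is classified as ESCC.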

Additionally, as age and tumor size were discrete variables in this study, for further analysis, X-tile (Yale University, version 3.6.1) was used to determine the optimal cutoff values for age and tumor size in prognostic analysis. Using these optimal cutoff values as thresholds, patients were divided into groups with the most significant differences in survival.
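The idea of choosing a cutoff that maximizes the survival difference between two groups can be illustrated with a short sketch. X-tile's internal procedure is not described in this text, so the code below is only a simplified stand-in: it scans every candidate threshold and keeps the one with the largest two-group log-rank chi-square statistic. The function names, the `min_group` parameter, and the toy data in the usage note are assumptions, and no correction for multiple testing is applied.

```python
def logrank_chi2(times, events, in_high_group):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    event_times = sorted({t for t, e in zip(times, events) if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Numbers at risk just before time t, overall and in the "high" group.
        n = sum(1 for ti in times if ti >= t)
        n1 = sum(1 for ti, g in zip(times, in_high_group) if ti >= t and g)
        # Deaths occurring exactly at time t, overall and in the "high" group.
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        d1 = sum(1 for ti, e, g in zip(times, events, in_high_group)
                 if ti == t and e and g)
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * n1 * (n - n1) * (n - d) / (n * n * (n - 1))
    if variance == 0:
        return 0.0
    return observed_minus_expected ** 2 / variance


def best_cutoff(values, times, events, min_group=1):
    """Return (cutoff, chi2) maximizing the log-rank statistic for value > cutoff."""
    best = (None, -1.0)
    for cutoff in sorted(set(values))[:-1]:  # largest value would leave one group empty
        high = [v > cutoff for v in values]
        if min(sum(high), len(high) - sum(high)) < min_group:
            continue
        chi2 = logrank_chi2(times, events, high)
        if chi2 > best[1]:
            best = (cutoff, chi2)
    return best
```

As a usage illustration on made-up data, calling `best_cutoff([60, 65, 70, 75, 76, 80, 85, 90], [50, 48, 52, 47, 5, 6, 4, 7], [1] * 8)` selects 75 as the threshold, because patients above that value die much earlier in this toy example.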

Statistical analysis

To analyze the differences in variable distribution between Asian EC patients and those of other ethnicities, the Pearson chi-square test was employed. Asian EC patients were randomly divided into a training cohort and a validation cohort at a ratio of 7:3. The Pearson chi-square test was also used to examine the differences in variable distribution between these two cohorts. In the training cohort, the Least Absolute Shrinkage and Selection Operator (Lasso) regression was utilized for the preliminary selection of variables related to the prognosis of EC patients. A 10-fold cross-validation was conducted to avoid overfitting, and the selected variables were included in a multivariable Cox regression analysis. Variables with a p-value of less than 0.05 in the multivariable analysis were considered independent prognostic factors, which were then used to construct a nomogram. The accuracy of the nomogram was assessed using calibration curves, which compared the predicted OS rates to the actual OS rates. To reduce overfitting bias, Bootstrap resampling was performed 1,000 times. The discriminative ability of the nomogram, which refers to its capability to distinguish between surviving and deceased patients, was evaluated using the Concordance Index (C-index) and the area under the curve (AUC) of the receiver operating characteristic (ROC). The clinical utility of the nomogram was assessed through decision curve analysis (DCA), determining whether the nomogram-assisted decisions improved patient prognosis evaluation.
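To make the notion of discrimination concrete, the following is a minimal sketch of Harrell's C-index in Python. It is a simplified illustration, not the routine from the R packages used in this study: a pair of patients is treated as comparable only when the patient with the shorter time had an observed death, pairs tied in time are ignored, and ties in predicted risk count as half-concordant. The function name is hypothetical.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable patient pairs in which the patient who died
    earlier was assigned the higher predicted risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to predictions no better than chance, and 1.0 to a perfect ranking of patients by risk.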

All statistical analyses were performed using R statistical software (version 4.2.1, http://www.R-project.org). Specifically, the “rio” package was utilized for data output, the “corrplot” and “glmnet” packages for lasso regression, and the “survival,” “forestplot,” and “survminer” packages for univariate and multivariable Cox regression analysis, nomogram construction, and calculation of C-index with corresponding 95% confidence intervals (CI). The “rms” package was utilized for calibration curve construction, while the “riskRegression” package and “survival” package were employed for ROC curve construction. Furthermore, the “ggDCA,” “ggprism,” and “survival” packages were used for DCA. A p-value of less than 0.05 was considered statistically significant.

Results

Patient characteristics

A total of 13,789 EC patients were selected from the SEER database, comprising 678 Asian individuals and 13,111 individuals from other racial groups. Demographic, clinicopathological, and treatment information for all patients is summarized in Table 1. Compared to other racial groups, Asian individuals exhibited a higher proportion of being married (70.06% vs. 59.81%, p < 0.001), a higher percentage of tumors located in the upper segment (9.00% vs. 4.76%, p < 0.001), and middle segment (29.06% vs. 15.32%, p < 0.001), a higher prevalence of tumors larger than 27 mm (80.38% vs. 76.52%, p = 0.023), a higher frequency of histological type ESCC (70.94% vs. 30.04%, p < 0.001), a higher incidence of lymph node metastasis (61.95% vs. 55.36%, p < 0.001), a higher proportion of patients not undergoing surgical treatment (69.32% vs. 57.68%, p < 0.001), and a higher rate of receiving radiotherapy (71.83% vs. 65.91%, p < 0.001). Among Asian EC patients, 51 did not receive any treatment, accounting for approximately 7.52% (51/678). There were no significant differences between Asian EC patients and those of other ethnicities regarding age, gender, differentiation grade, T stage, M stage, and chemotherapy (p > 0.05). During the follow-up period, there were 11,674 deaths in the total cohort. The 1-year, 3-year, and 5-year OS rates were 56.11%, 30.28%, and 22.42%, respectively. Among Asian EC patients, there were 572 deaths, with 1-year, 3-year, and 5-year OS rates of 55.90%, 30.09%, and 20.65%, respectively. In contrast, 11,102 deaths occurred among patients of other races, with 1-year, 3-year, and 5-year OS rates of 56.12%, 29.96%, and 22.51%, respectively. Subsequently, the Asian patient cohort was randomly divided into a training cohort and a validation cohort. There were no statistically significant differences in the distribution of variables between these two cohorts (p > 0.05), as shown in Table 2.

Table 1 Comparison of baseline characteristics between Asian and other ethnic populations (chi-square test).
Table 2 Comparison of baseline characteristics between training cohort and validation cohort (chi-square test). EAC: esophageal adenocarcinoma, ESCC: esophageal squamous cell carcinoma.

Selection of independent prognostic factors

Based on the training cohort, dimensionality reduction was initially performed using Lasso-Cox regression, selecting variables with non-zero coefficients (Fig. 1). These variables included age, gender, marital status, tumor location, tumor size, degree of differentiation, T stage, N stage, M stage, surgery, radiotherapy, and chemotherapy. Subsequently, a multivariate Cox regression model was constructed using these variables (Table 3). The results indicated that age (p = 0.005), sex (p = 0.010), marital status (p = 0.003), tumor size (p = 0.002), M stage (p < 0.001), surgery (p < 0.001), and chemotherapy (p < 0.001) are independent prognostic factors for Asian patients with EC.

Fig. 1

Correlation between coefficient and log lambda (A) and sifting of variables via cross-validation (B) in the LASSO model. LASSO, least absolute shrinkage and selection operator.

Table 3 Multivariate Cox analysis of the variables filtered by Lasso regression (based on the training cohort).

Construction and validation of nomogram

A nomogram was developed based on the independent prognostic factors identified through multivariate Cox regression (Fig. 2). Subsequently, cross-validation of the nomogram was performed using training and validation cohorts derived from random splits. The C-index of the nomogram was 0.700 (95% CI: 0.673–0.727) in the training cohort and 0.709 (95% CI: 0.668–0.750) in the validation cohort. The calibration curves demonstrated a high concordance between the predicted and actual OS rates in both the training and validation cohorts (Fig. 3). The discriminative ability of the nomogram was evaluated using the AUC (Fig. 4). In the training cohort, the AUC for predicting 1-year, 3-year, and 5-year OS rates were 0.770 (95% CI: 0.728–0.881), 0.756 (95% CI: 0.709–0.803), and 0.783 (95% CI: 0.732–0.834), respectively. In the validation cohort, the AUC for predicting 1-year, 3-year, and 5-year OS rates were 0.814 (95% CI: 0.754–0.875), 0.763 (95% CI: 0.697–0.830), and 0.771 (95% CI: 0.705–0.837), respectively. DCA further indicated that the nomogram exhibited good clinical utility in predicting the 1-year, 3-year, and 5-year OS rates (Fig. 5).

Fig. 2

Nomogram for predicting 1-, 3- and 5-yr overall survival of Asian patients with esophageal cancer.

Fig. 3

Calibration curves of the nomogram. (A) Calibration curves of 1-, 3- and 5-yr overall survival in the training cohort; (B) Calibration curves of 1-, 3- and 5-yr overall survival in the validation cohort.

Fig. 4

Receiver operating characteristic curves of the nomogram. (A–C) For 1-, 3- and 5-yr overall survival in the training cohort; (D–F) for 1-, 3- and 5-yr overall survival in the validation cohort.

Fig. 5

Decision curve analysis for survival prediction. (A) For 1-, 3- and 5-yr overall survival in the training cohort; (B) for 1-, 3- and 5-yr overall survival in the validation cohort.

Risk stratification

The points assigned to each variable in the nomogram (Fig. 2) are derived from the regression coefficients corresponding to each categorical variable in the multivariate Cox regression risk model. Higher scores indicate greater risk. Specifically, for age, patients aged over 75 years are assigned 31 points, while those aged 75 years or younger are assigned 0 points. Regarding gender, females are assigned 0 points, whereas males are assigned 58 points. Marital status is scored as 0 for married individuals and 44 for unmarried individuals. Tumor size is scored as 0 points for tumors of 27 mm or less and 76 points for tumors larger than 27 mm. For the M stage, patients classified as M0 receive 0 points, while those classified as M1 receive 49 points. Patients who underwent surgery receive 0 points and those who did not receive 100 points; likewise, patients who received chemotherapy receive 0 points and those who did not receive 64 points. Points based on these independent prognostic factors are assigned to all patients in the Asian patient cohort, and the total score is calculated. The optimal cutoff value for the total score is determined using X-tile software, dividing patients into low-risk (total score ≤ 291) and high-risk (total score > 291) groups. Kaplan-Meier survival curves are plotted (Fig. 6), and a log-rank test reveals a significant difference in OS between the two groups (p < 0.001).
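The scoring and stratification rule described above amounts to a simple additive lookup, sketched below in Python. The point values and the 291-point threshold are those reported in the text; the function names, argument names, and the treatment of a tumor of exactly 27 mm as scoring 0 points are assumptions made for illustration.

```python
# Points per risk factor as reported for the nomogram; key names are hypothetical.
NOMOGRAM_POINTS = {
    "age_over_75": 31,
    "male": 58,
    "unmarried": 44,
    "tumor_over_27mm": 76,
    "m1": 49,
    "no_surgery": 100,
    "no_chemotherapy": 64,
}
HIGH_RISK_CUTOFF = 291  # total score > 291 is classified as high risk


def total_points(age, sex, married, tumor_size_mm, m_stage, surgery, chemotherapy):
    """Sum the nomogram points for one patient."""
    flags = {
        "age_over_75": age > 75,
        "male": sex == "male",
        "unmarried": not married,
        "tumor_over_27mm": tumor_size_mm > 27,  # assumption: exactly 27 mm scores 0
        "m1": m_stage == "M1",
        "no_surgery": not surgery,
        "no_chemotherapy": not chemotherapy,
    }
    return sum(NOMOGRAM_POINTS[name] for name, present in flags.items() if present)


def risk_group(points):
    """Assign the low- or high-risk group from the total score."""
    return "high" if points > HIGH_RISK_CUTOFF else "low"
```

For example, a 78-year-old unmarried man with a 30 mm M0 tumor who received chemotherapy but no surgery would total 31 + 58 + 44 + 76 + 100 = 309 points and fall into the high-risk group, whereas the maximum possible total is 422 points.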

Fig. 6

Kaplan–Meier method estimate of overall survival in the whole cohort. Low risk refers to a total score of ≤ 291, high risk refers to a total score of > 291.

Discussion

This study established a nomogram to predict the 1-year, 3-year, and 5-year OS rates of Asian EC patients using data from the SEER database, and it underwent rigorous validation. In the modeling process, Lasso regression was initially employed to select variables with non-zero coefficients corresponding to the lambda value that minimized the average error during cross-validation. While this approach often yields the highest predictive accuracy, it may also result in the selection of numerous variables, leading to increased model complexity. Subsequently, a multivariate Cox regression model was utilized to further refine the variables and identify independent prognostic factors, including age, gender, marital status, tumor size, M stage, surgery, and chemotherapy. Although the multivariate Cox regression model yielded p-values below 0.05 for the “unknown” tumor grade category and for T4 stage, tumor grade and T stage were not treated as independent prognostic factors, in order to preserve the statistical reliability of the nomogram. The established nomogram demonstrated robust predictive performance and clinical utility in both the training and validation cohorts, with C-index values of at least 0.70 and AUC values exceeding 0.75. Subsequently, risk stratification was performed based on the risk scores, enabling clinicians to assess the prognosis of Asian EC patients and implement appropriate interventions.

The treatment modalities for EC are continuously advancing. Rooted in clinical trials16,17 such as RTOG 85-01 and RTOG 94-05, definitive chemoradiotherapy emerged as a non-surgical treatment method initially recognized and applied in locally advanced EC cases. The CROSS study further solidified the treatment standards for EC over the past decade18. Its follow-up data, released in 2021, demonstrated sustained OS benefits with neoadjuvant chemoradiotherapy (NCRT). With a median follow-up time of 147 months, the 10-year OS rate was higher in the NCRT group compared to the surgery-only group (38% vs. 25%, p = 0.004). While NCRT was associated with a reduced local recurrence rate compared to surgery alone, the rates of distant recurrence were similar between the two groups (27% vs. 28%). In our study, chemotherapy emerged as a protective factor postoperatively for Asian EC patients (HR = 0.589, 95% CI: 0.438–0.792, p < 0.001), while radiotherapy was not a protective factor (HR = 0.801, 95% CI: 0.596–1.077, p = 0.141). Some studies suggest that NCRT, compared to neoadjuvant chemotherapy alone, may improve local-regional control rates and rates of curative surgical resection19,20. However, no significant differences in long-term survival have been observed between these approaches. Additionally, research by Justin Rucker and colleagues found that for patients with postoperative pathological staging of T2-4aN0M0 esophageal adenocarcinoma who have undergone perioperative adjuvant chemotherapy, the addition of concurrent radiotherapy does not significantly impact OS21. The value of radiotherapy remains to be further discussed.
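As a reading aid, a hazard ratio reported with its 95% CI can be converted back into an approximate p-value. The sketch below assumes the interval is a Wald interval that is symmetric on the log scale, which is the usual output of Cox regression software but is not stated explicitly in this text; the function name is hypothetical.

```python
import math


def wald_p_from_hr(hr, ci_low, ci_high):
    """Approximate z statistic and two-sided p-value from an HR and its 95% CI.

    Assumes a Wald interval: log(HR) +/- 1.96 * SE.
    """
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.959964)
    z = math.log(hr) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p


# Values reported above for radiotherapy and chemotherapy.
z_rt, p_rt = wald_p_from_hr(0.801, 0.596, 1.077)
z_ct, p_ct = wald_p_from_hr(0.589, 0.438, 0.792)
```

Under this assumption, the radiotherapy estimate gives a p-value of about 0.14 and the chemotherapy estimate a p-value below 0.001, in line with the values reported above.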

In the nomogram developed within this study, individuals who did not undergo surgery exhibited the highest risk scores. Despite the rapid advancements in therapies such as chemotherapy, radiotherapy, immunotherapy, and cellular treatments in recent years, comprehensive treatment modalities centered around surgery remain the preferred approach. Conventional surgical techniques involving direct excision or ablation of tumors, along with minimally invasive procedures such as laparoscopic surgery, endoscopic resection, and non-resectional methods, continue to serve as fundamental strategies in the management of EC22,23,24. It is noteworthy that the proportion of Asian EC patients undergoing surgery in this study was significantly lower compared to other ethnic groups (30.68% vs. 42.32%, p < 0.001). Paulson et al. found a close association between racial disparities and surgical utilization: after adjusting for all other independent variables, multivariable logistic regression revealed that Caucasian patients were over twice as likely to undergo surgical treatment compared to non-Caucasian patients (p < 0.001)25. Similar findings were corroborated by Steyerberg et al., indicating significant disparities in surgical acceptance among different racial groups26. The underlying reasons for these substantial discrepancies may be attributed to inequalities in socioeconomic status and access to healthcare27,28,29. Future research endeavors should further investigate the extent to which these factors influence surgical decision-making among EC patients.

Advanced age is considered one of the independent prognostic factors for Asian EC patients. Findings from other studies on independent prognostic factors among various Asian subgroups are largely consistent with the results obtained in this study13,15,30,31,32. Generally, elderly patients often present with comorbidities such as hypertension and diabetes, leading to poorer baseline health conditions. Consequently, they exhibit increased rates of in-hospital mortality and postoperative complications following surgery. An inclination towards conservative management is further reinforced by traditional cultural and ideological perceptions among elderly Asian populations, resulting in lower acceptance rates of surgical interventions. Markar et al.’s review study revealed that elderly EC patients undergoing esophagectomy face higher risks, with increased in-hospital mortality (7.83% vs. 4.21%) and higher rates of respiratory (21.77% vs. 19.49%) and cardiovascular (18.7% vs. 13.17%) complications compared to younger patients33. Furthermore, the 5-year OS rate (21.23% vs. 29.01%, p < 0.05) and 5-year disease-free survival rate (34.4% vs. 41.8%, p < 0.05) were lower in the elderly group than in the younger group. Mantziari et al. conducted a more refined analysis, demonstrating that compared to younger patients undergoing esophagectomy, elderly patients exhibited higher rates of respiratory complications (20% vs. 16%), cardiovascular complications (15.6% vs. 7.0%), and postoperative mortality (7.9% vs. 3.4%)34.

Marital status is one of the socio-economic factors, where being married can serve as a surrogate indicator of social support, contributing to improving the survival status of patients with various diseases, including malignant tumors. The study by Krajc et al. demonstrated that married cancer patients, compared to unmarried ones, had a lower likelihood of tumor metastasis, a greater chance of receiving definitive treatment, and a reduced probability of dying from cancer35. Similarly, Du et al. conducted a study based on the SEER database among EC patients, revealing that the risk of death from various causes for singles (HR = 1.14, 95% CI 1.11–1.17, p < 0.001), divorced or separated individuals (HR = 1.16, 95% CI 1.13–1.19, p < 0.001), and widowed individuals (HR = 1.22, 95% CI 1.19–1.26, p < 0.001) was higher than that for married patients36. This is generally consistent with the findings of our study. However, since the focus of this study is on comprehensive prognostic factors among Asian populations, the unmarried population was not further subdivided. Aizer et al. found that married patients were less likely to experience distant tumor metastasis, more likely to undergo curative treatment, and had a lower risk of cancer-related mortality compared to unmarried patients after adjusting for demographic characteristics, staging, and treatment modalities (HR = 0.80, 95% CI: 0.79–0.81, p < 0.001)37. In our study, unmarried patients had a higher risk compared to married patients (HR = 1.409, 95% CI: 1.127–1.761, p = 0.003).

Gender differences are one of the factors affecting the incidence and mortality rates of EC. In Asia, the incidence rate of EC differs by a factor of 2.37 between males and females4. According to the SEER data report, female patients with ESCC exhibit a higher relative survival rate38. The mechanisms underlying these gender differences are not fully understood and seem to be multifactorial, primarily involving hormonal and genomic factors39,40.

A meta-analysis showed that larger tumor size is an independent prognostic factor for poorer OS and disease-free survival (DFS) in EC patients41. However, some studies have overlooked the impact of tumor size on prognosis24,42,43,44,45,46,47. We attribute this discrepancy to our use of X-tile software to establish the optimal cutoff values, which improved the accuracy of grouping.

To date, several studies have constructed nomograms to predict long-term survival in EC patients32,48. However, although close to 80% of EC patients are of Asian descent, with a high incidence rate and short survival time, to the best of our knowledge no prior research has specifically targeted this population. This study, for the first time, analyzed prognostic factors in Asian EC patients based on big data and established a nomogram. For such patients, relying solely on TNM staging provides extremely limited predictive efficacy for individualized survival, highlighting the urgent need for additional methods for auxiliary assessment. This study extensively incorporated demographic, clinicopathological, and treatment data of patients and used X-tile to determine the optimal cutoff values for age and tumor size. It established an accurate visual predictive model, with C-indices of at least 0.70 and AUCs all surpassing 0.75. Calibration curves demonstrated high consistency between the OS rates predicted by the nomogram and the actual OS rates, while DCA indicated good clinical value of the model. Furthermore, the established risk stratification can provide recommendations for clinicians to comprehensively assess the prognosis and individualized treatment of Asian EC patients.

Although this study has established a reliable nomogram, there are still certain limitations. Firstly, the SEER database does not contain specific information on chemotherapy regimens and radiotherapy doses, nor does it record detailed information on other prognostic factors, such as immunotherapy, Performance Status (PS) scores, postoperative complications, nutritional status, targeted drug therapy, genetic molecular markers, etc. Secondly, although this study was based on analysis of a large sample population and obtained good cross-validation results, external data were not used to validate the accuracy of this study. Finally, this study is a retrospective analysis based on a database, with OS as the primary endpoint. Causes of death unrelated to EC (e.g., accidental death or death from external causes) may be partially included, and the data lack randomization. Prospective data are needed to further validate the conclusions.