Exploration and machine learning model development for T2 NSCLC with bronchus infiltration and obstructive pneumonia/atelectasis

In the 8th edition of the American Joint Committee on Cancer (AJCC) staging system for non-small cell lung cancer (NSCLC), tumors exhibiting main bronchial infiltration (MBI) near the carina and those presenting with complete lung obstructive pneumonia/atelectasis (P/ATL) have been reclassified from T3 to T2. Our analysis of the Surveillance, Epidemiology, and End Results (SEER) database from 2007 to 2015, adjusted for other variables via Propensity Score Matching (PSM), showed significantly inferior overall survival (OS) for patients with these conditions. Specifically, patients with P/ATL had a median OS of 12 months compared to 15 months (p < 0.001), whereas MBI patients had a slightly worse prognosis, with a median OS of 22 months versus 23 months (p = 0.037). Both conditions were significantly correlated with lymph node metastasis (all p < 0.001). When different treatment approaches for these particular T2 NSCLC variants were evaluated with adjustment for other factors, surgery emerged as the optimal therapeutic strategy. Among patients who underwent surgery, a higher proportion of the MBI and P/ATL groups than of the non-MBI/(P/ATL) group received preoperative induction therapy or postoperative adjuvant therapy rather than surgery alone (41.3%/54.7% vs. 36.6%). For MBI patients, initial surgery followed by adjuvant treatment, or induction therapy followed by surgery, significantly improved prognosis, a benefit that was not replicated for P/ATL patients. An XGBoost model for 5-year survival prediction and treatment selection yielded Area Under the Curve (AUC) scores of 0.853 for P/ATL and 0.814 for MBI, supporting the model's usefulness for prognostication and treatment allocation in these distinct T2 NSCLC categories.

of the most important prognostic factors for NSCLC patients 12,13. A study found that P/ATL is a risk factor for lymph node metastasis 14. However, the relationship of MBI and P/ATL with lymph node metastasis still needs further exploration.
Furthermore, these two distinct subtypes of stage T2 NSCLC may pose more significant treatment challenges compared to standard T2 tumors. Despite this, there has been a lack of thorough research to determine the optimal treatment approach for these specific T2 NSCLC subtypes. The effectiveness and role of treatment modalities in the context of these two NSCLC subtypes remain to be clarified.
Employing machine learning, a branch of artificial intelligence, for model development enhances the accuracy of predictions with the addition of new data, frequently outperforming logistic regression methods. Its application in predicting survival across various cancers has been noted, and specifically, utilizing machine learning to estimate 5-year OS for T2-stage NSCLC patients under certain conditions significantly improves the precision of prognosis forecasts 15,16. This approach not only refines survival predictions but also facilitates the formulation of recommendations for optimal treatment strategies.

Clinical characteristics of T2 stage NSCLC patients in different groups
Variations in clinical characteristics between the MBI/(P/ATL) and non-MBI/(P/ATL) groups were prominently attributed to the diameter linked to the T2 stage (Table 1). Notable disparities existed in gender distribution, with the MBI/(P/ATL) group demonstrating a higher proportion of males (58.4%/55.3% vs. 53.4%) and a heightened occurrence of Squamous Cell Carcinoma (46.0%/40.8% vs. 32.7%). Significantly, a larger proportion of primary sites in the main bronchus were identified in the MBI/(P/ATL) group (14.1%/7.8% vs. 1.7%), accompanied by a more advanced histologic grading (p < 0.001).
In relation to tumor diameter, the non-MBI/(P/ATL) group had a larger diameter due to the incorporation of cases surpassing 3 cm. In general, profound differences in clinical characteristics were observed between the groups, with the MBI/(P/ATL) group manifesting extensive disparities, especially within the P/ATL subgroup, compared to the non-MBI/(P/ATL) group.

Survival analysis before and after PSM
Through Kaplan-Meier survival analysis, it was discerned that the OS for the MBI (Diameter > 3) group was adversely impacted in comparison to the non-MBI/(P/ATL) group (p = 0.012) (Fig. 1A). Notably, regardless of the diameter size, the OS for the non-MBI/(P/ATL) group was significantly superior to that of the P/ATL group (p < 0.0001) (Fig. 1B).
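As a minimal sketch of the Kaplan-Meier product-limit estimator behind such comparisons, the following pure-Python example computes a survival curve and its median from follow-up times and event indicators. The function names and the toy data are hypothetical illustrations, not the SEER cohort or the software used in the study.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times  : follow-up in months
    events : 1 if death was observed, 0 if the patient was censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)      # every patient leaves the risk set at their own time
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            surv *= 1.0 - d / at_risk   # product-limit update
            curve.append((t, surv))
        at_risk -= exits[t]             # remove deaths and censored patients
    return curve

def median_survival(curve):
    """First time at which S(t) falls to 0.5 or below; None if never reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

# Toy data (hypothetical, for illustration only)
times = [3, 5, 5, 8, 12, 12, 15, 20, 24, 30]
events = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
curve = kaplan_meier(times, events)
print(curve)
print("median OS (months):", median_survival(curve))
```

Comparing two such curves (for example, P/ATL versus non-MBI/(P/ATL)) would additionally require a log-rank test, which is not shown here.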
Given the pronounced heterogeneity in clinical characteristics among the three groups, we adopted the Propensity Score Matching (PSM) method to mitigate the impact of diverse background variables, thereby harmonizing potential prognostic factors between the P/ATL and MBI groups compared to the non-MBI/(P/ATL) group. This approach ensured that the p-values from t-tests or chi-square tests for all clinical characteristics between the respective groups exceeded 0.1, indicating a balanced comparison (Supplementary data 1). Following this adjustment, we analyzed OS and cancer-specific survival (CSS) using the KM method for the P/ATL vs. None groups and the MBI vs. None groups, respectively. Our findings revealed that the P/ATL group exhibited a significantly poorer prognosis than the None group, with p of 0.00015 for OS and 0.00021 for CSS (Fig. 1C,E). Conversely, the MBI group's prognosis was marginally inferior compared to the None group, with p of 0.037 for OS and 0.016 for CSS (Fig. 1D,F).
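The text does not state which matching algorithm was used, so the following is only a sketch of one common choice: greedy 1:1 nearest-neighbour matching without replacement on precomputed propensity scores. The function name, the caliper value and the scores are all hypothetical.

```python
def match_nearest(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    treated, controls : dicts mapping patient id -> propensity score
    caliper           : largest score difference accepted (assumed value)
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(controls)
    pairs = []
    # Process treated patients in a fixed order so the result is reproducible.
    for tid, score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        cid = min(available, key=lambda c: abs(available[c] - score))
        if abs(available[cid] - score) <= caliper:
            pairs.append((tid, cid))
            del available[cid]          # each control is used at most once
    return pairs

# Hypothetical propensity scores for a P/ATL group and a "None" group
patl = {"p1": 0.30, "p2": 0.55, "p3": 0.90}
none = {"c1": 0.28, "c2": 0.33, "c3": 0.52, "c4": 0.60}
pairs = match_nearest(patl, none)
print(pairs)
```

In this toy example the patient with score 0.90 stays unmatched because no control lies within the caliper. After matching, covariate balance would be checked, for instance with the t-tests and chi-square tests described above.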

Multivariate logistic regression analysis for lymph node metastasis
Our findings indicate that at the T2 stage, both the MBI and P/ATL groups demonstrate an elevated risk for lymph node metastasis. To ascertain whether MBI and P/ATL act as independent risk factors for lymph node metastasis, we employed a multifactorial logistic regression analysis. The results showed that individuals in the MBI/(P/ATL) group had a notably higher risk of lymph node metastasis compared to those in the non-MBI/(P/ATL) group. In detail, MBI was found to be an independent risk factor for lymph node metastasis (OR = 1.69, 95% CI 1.55-1.85, p < 0.001), as was P/ATL (OR = 2.10, 95% CI 1.93-2.28, p < 0.001) (Table 2).
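For readers less familiar with how an odds ratio and its Wald confidence interval relate to a logistic-regression coefficient, the sketch below applies OR = exp(beta) and CI = exp(beta ± 1.96·SE). It backs out an approximate coefficient and standard error from the reported MBI values; the function names are hypothetical, and the recovered numbers are approximations derived from rounded figures, not values taken from the study.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and Wald 95% CI from a logistic-regression coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def coef_from_reported(odds_ratio, lower, upper, z=1.96):
    """Approximate beta and its standard error from a reported OR and 95% CI."""
    beta = math.log(odds_ratio)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se

# Reported values for MBI (OR = 1.69, 95% CI 1.55-1.85)
beta, se = coef_from_reported(1.69, 1.55, 1.85)
or_, lo, hi = odds_ratio_ci(beta, se)
print(f"beta={beta:.3f}, se={se:.3f}, OR={or_:.2f} ({lo:.2f}-{hi:.2f})")
```

Because the interval lies entirely above 1, the association is significant at the 5% level, consistent with the reported p < 0.001.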

Evaluation of different treatments in patients with MBI and P/ATL
To evaluate the optimal treatment for NSCLC patients with two specific types of T2 tumors, we integrated seven treatment modalities: None, Radiation Therapy Alone, Chemotherapy Alone, Radiation + Chemotherapy, Surgery Alone, Initial Surgery Followed by Adjuvant Treatment, and Induction Therapy Followed by Surgery. We conducted a multifactorial Cox regression analysis of OS to assess the prognostic impact of these treatments in patients with P/ATL and MBI, respectively, using Surgery Alone as the reference group (Table 3). The results indicated that surgical treatments significantly outperformed both Radiotherapy Alone and Chemotherapy Alone, as well as the combination of Radiotherapy and Chemotherapy, in both subgroups. Specifically, in patients with MBI, Initial Surgery Followed by Adjuvant Treatment (HR = 0.77, 95% CI 0.67-0.90, p = 0.001) and Induction Therapy Followed by Surgery (HR = 0.65, 95% CI 0.48-0.87, p = 0.003) were significantly more effective than Surgery Alone. Conversely, for patients with P/ATL, neither Initial Surgery Followed by Adjuvant Treatment (HR = 1.17, 95% CI 0.99-1.37, p = 0.067) nor Induction Therapy Followed by Surgery (HR = 1.05, 95% CI 0.78-1.40, p = 0.758) showed any advantage over Surgery Alone. Given the limited therapeutic options for patients with distant metastases, we analyzed the KM survival with different therapeutic strategies for patients with P/ATL and MBI at stages N0-1M0 and N2-3M0, respectively. In patients with MBI at the N2-3M0 stage, preoperative Induction Therapy significantly improved prognosis. For the N0-1M0 stage in MBI patients, while there was a clear improvement in median survival with preoperative Induction Therapy, this improvement did not reach statistical significance. Additionally, postoperative Adjuvant Therapy substantially improved outcomes over Surgery Alone for MBI patients across both N0-1M0 and N2-3M0 stages (Fig. 2A,B). Conversely, these treatments did not yield significant benefits for patients with P/ATL (Fig. 2C,D). Moreover, in both subgroups for the N0-1M0 stage, prognosis following Surgery Alone was significantly better than with Chemoradiotherapy, whereas at the N2-3M0 stage, Surgery Alone did not show superiority over Chemoradiotherapy in terms of prognosis (Fig. 2).

Development of predictive models for 5-year OS in P/ATL and MBI patients
Given the potentially notable disparities in clinicopathologic variables and prognoses between the MBI and P/ATL subgroups, we aimed to examine how different factors affect mortality within each subgroup. Accordingly, multifactorial logistic regression was applied to analyze the 5-year OS rate within the MBI and P/ATL subgroups. In the MBI group, sex, histologic type, grade, age, N stage, M stage, site, marital status and treatment type were identified as independent factors associated with 5-year OS. In the P/ATL group, histologic type, grade, age, race, N stage, M stage and treatment type were recognized as independent factors associated with 5-year OS (Supplementary data 2). We incorporated the factors independently correlated with 5-year OS from the MBI and P/ATL groups for prognostic modeling. The patients were randomized into training and test data groups at a 7:3 ratio. Subsequently, the parameters of each model were tuned and the models were trained on the training set to optimize performance. In the validation set, we performed ROC and DCA analyses of MBI and P/ATL groups for all models (Fig. 3A,B). The XGBoost model achieved the highest AUC in both groups, 0.814 for MBI and 0.853 for P/ATL, and the DCA curves further affirmed that the XGBoost model secures a higher net benefit compared to other models across varying threshold ranges (Fig. 3C,D). The specific performance of each model in the test set is shown in Supplementary Data 3. In addition, we performed the DeLong test and found that the XGBoost model significantly outperforms the rest of the models in both MBI and P/ATL (Supplementary Data 4).
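The two evaluation metrics used here can be sketched in a few lines of pure Python. The AUC is computed as the probability that a randomly chosen positive case receives a higher predicted risk than a randomly chosen negative case, and the decision-curve net benefit at threshold pt is TP/n − (FP/n)·pt/(1 − pt). The function names, labels and predicted risks below are hypothetical and serve only to illustrate the calculations.

```python
def auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def net_benefit(y_true, y_score, threshold):
    """Decision-curve net benefit at a given probability threshold (0 < threshold < 1)."""
    n = len(y_true)
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)

# Hypothetical test-set labels (1 = death within 5 years) and predicted risks
y = [1, 1, 1, 0, 1, 0, 0, 0]
p = [0.9, 0.8, 0.6, 0.7, 0.4, 0.3, 0.2, 0.1]
print("AUC:", auc(y, p))
print("Net benefit at 0.5:", net_benefit(y, p, 0.5))
```

A decision curve is obtained by evaluating `net_benefit` over a range of thresholds and plotting the result for each model.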
Consequently, the calibration curves for the XGBoost model in both the MBI and P/ATL groups within the test set were also plotted, revealing commendable predictive performance of the model (Fig. 4A,B).Additionally, we scrutinized the importance scores of the variables in both models (Fig. 4C,D).
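A calibration curve compares predicted probabilities with observed event rates. As a minimal sketch, assuming equal-width probability bins (the binning scheme used in the study is not stated), the points of such a curve could be computed as follows; the function name and data are hypothetical.

```python
def calibration_table(y_true, y_prob, n_bins=5):
    """Return (count, mean predicted probability, observed event rate)
    for each non-empty equal-width probability bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    rows = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            observed = sum(y for y, _ in members) / len(members)
            rows.append((len(members), mean_pred, observed))
    return rows

# Hypothetical outcomes and predicted probabilities
y = [0, 0, 1, 0, 1, 1, 1, 1]
p = [0.05, 0.15, 0.30, 0.35, 0.50, 0.55, 0.85, 0.95]
rows = calibration_table(y, p)
for count, mean_pred, observed in rows:
    print(count, round(mean_pred, 3), observed)
```

For a well-calibrated model, the mean predicted probability and the observed rate in each bin lie close to the diagonal.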

Creating web-based predictive models
To assist researchers and clinicians in utilizing our prognostic model, we developed user-friendly web applications for the stage T2 NSCLC MBI and P/ATL groups, respectively (Fig. 5A,B). The web interface allows users to input the clinical features of a new patient, and the application then predicts survival probability and survival status based on that information. The model can also help clinicians develop appropriate treatment strategies for these subgroups: after entering a patient's other parameters, the user can vary the treatment and observe the change in predicted 5-year survival. For example, for a married male MBI patient aged 65-74 years with grade III, T2N3M0 lung adenocarcinoma located in the upper lobe, the predicted 5-year OS was 19.07% with Chemoradiotherapy, 23.83% with surgery alone, 35.51% with Induction therapy followed by surgery, and 31.28% with Initial surgery followed by adjuvant treatment.

Discussion
Although many studies have examined NSCLC, studies focusing on specific types of T2-stage NSCLC remain very limited. We performed the first comprehensive analysis of T2 stage NSCLC in the MBI and P/ATL subgroups. Previous research, relying solely on the Cox proportional hazards model, has indicated that P/ATL may have an independent prognostic impact on stage T2 NSCLC 9,17. However, when considering the inclusion of whole-lung pneumonia or atelectasis in T2 stage analysis, this effect might become more pronounced. After adjusting for all other factors through PSM, we observed that patients with P/ATL and MBI, having similar tumor diameters, faced a significantly worse prognosis in T2 stage lung cancer compared to those without these specific conditions. This adverse impact was especially marked in patients with P/ATL. Leveraging the largest sample size to date, our study is the first to confirm the independent effect of P/ATL on prognosis using a PSM approach. In addition, through multivariate logistic regression, we found a significant increase in lymph node metastasis in the P/ATL subgroup compared to the other T2 groups, and found more lymph node metastasis in the MBI subgroup as well, which may be clinically helpful in predicting lymph node metastasis in NSCLC patients.
In addition, we compared the treatment options in patients with MBI and P/ATL. We found that surgery remains the treatment of choice for patients with MBI and P/ATL, and that, in patients with MBI, the prognostic impact of preoperative induction therapy and postoperative adjuvant therapy is significant. In P/ATL patients, the proportion of surgical patients was significantly lower, and the proportion of patients receiving simultaneous preoperative induction chemotherapy and postoperative adjuvant therapy was significantly higher than in MBI patients. However, the effects of preoperative induction therapy and postoperative adjuvant therapy were poorer in P/ATL patients, and no significant prognostic improvement was found. Earlier research suggested that the P/ATL group might derive greater benefits from radiotherapy 18. Our study did find that radiotherapy alone had a significantly better prognosis than chemotherapy alone in P/ATL patients with T2N0-1M0 (18 months vs. 9 months), but the role of radiotherapy in higher staged P/ATL patient populations needs to be further elucidated. Moreover, due to the limitations of the SEER database, the impact of further therapies such as targeted and immunotherapies on P/ATL, a group of patients with poorer prognosis, deserves to be further investigated.
In order to accurately predict the prognosis and treatment options for these two subgroups of patients, we developed separate machine learning models tailored to each subtype. XGBoost has consistently demonstrated superior predictive performance in various studies 19-21, and it remained the top performer in our modeling as well. The outcomes indicated that our models achieved superior AUC values relative to preceding prognostic models for NSCLC 22. This underscores the enhanced predictive accuracy our models offer, particularly for these specialized T2 stage NSCLC categories.
Several limitations merit attention when interpreting the results of this model. Firstly, the scope of variables analyzed was limited, mainly due to the constraints of data availability in the SEER database. As a result, some tumor markers and hematological indicators were omitted. Secondly, detailed information pertaining to the treatment regimen, including specifics on immunotherapy and targeted therapies, was absent. Lastly, our model was developed, validated, and tested using retrospective data. Prospective validation studies are essential to confirm our findings before the model is considered for routine application in clinical settings.

Variable selection
Given that the M and N classifications in the SEER database are established at the time of initial diagnosis, our exploration of the association between P/ATL and MBI in lymph node metastasis required a focus on clinical and pathological variables only, such as Size, Marital Status, Primary Site, Sex, Histologic type, Race, Grade, Laterality, and Age, omitting therapeutic variables. However, during the modeling process, all clinical, pathological, and therapeutic variables were included. In this study, the model is constructed using 5-year OS specifically attributed to cancer. We also collected two endpoint variables, cancer-specific survival (CSS) and OS. In this study, the OS is based on a 5-year post-diagnosis timeframe. If a patient dies within these 5 years, their OS indicates 'mortality'. However, if a patient survives beyond the 5 years, or has a survival time less than 5 years solely due to the follow-up period, their OS is considered as 'survival'.
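The outcome rule described above can be written as a small labelling function. This is only a sketch: the function name and arguments are hypothetical, and treating a death at exactly 60 months as falling "within" 5 years is an assumption, since the text does not specify the boundary case.

```python
def five_year_os_label(survival_months, died):
    """Binary 5-year OS outcome following the rule described in the text.

    survival_months : follow-up time from diagnosis, in months
    died            : True if the patient died, False if still alive at last follow-up
    Returns 'mortality' for a death within 60 months (boundary is an assumption),
    otherwise 'survival', including patients followed for less than 5 years.
    """
    if died and survival_months <= 60:
        return "mortality"
    return "survival"

print(five_year_os_label(12, True))    # death at 12 months
print(five_year_os_label(72, True))    # death after the 5-year window
print(five_year_os_label(30, False))   # alive, followed for only 30 months
```

Under this rule, patients with short follow-up are counted as survivors, which mirrors the definition given above.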

Machine learning model formulation
We utilized multifactorial logistic regression analysis to assess variables and identify independent predictors associated with 5-year OS in MBI or P/ATL in NSCLC. The dataset was randomly split into a 70% training group and a 30% testing group in both MBI and P/ATL groups. Five renowned machine learning models: random forest (RF), K-Nearest Neighbor (KNN), XGBoost, logistic regression (LR), decision tree (ID3), and support

Figure 1. Kaplan-Meier analysis of patients with different T2 types of NSCLC. (A,B) Kaplan-Meier analysis of overall survival (OS) in the Pneumonia or Atelectasis (P/ATL) and Main Bronchus Infiltration (MBI) groups versus the groups without P/ATL and MBI, prior to propensity score matching (PSM). (C,D) Kaplan-Meier analysis of OS in the P/ATL and MBI groups versus the non-MBI and P/ATL groups following PSM. (E,F) Kaplan-Meier analysis of cancer-specific survival (CSS) in the P/ATL and MBI groups versus the non-MBI and P/ATL groups after PSM.

Figure 2. Kaplan-Meier analysis comparing the effectiveness of various treatment modalities in patients with Main Bronchus Infiltration (MBI) or Pneumonia/Atelectasis (P/ATL) based on nodal involvement. (A) Overall Survival (OS) associated with different treatment approaches in MBI patients classified as N0-1M0. (B) OS associated with different treatment approaches in MBI patients classified as N2-3M0. (C) OS associated with different treatment approaches in P/ATL patients classified as N0-1M0. (D) OS associated with different treatment approaches in P/ATL patients classified as N2-3M0.

Figure 3. Receiver Operating Characteristic (ROC) curve and Decision Curve Analysis (DCA) of Main Bronchus Infiltration (MBI) and Pneumonia/Atelectasis (P/ATL) groups. (A) ROC curves for each model in the MBI group. (B) ROC curves for each model in the P/ATL group. (C) DCA curves for each model in the MBI group. (D) DCA curves for each model in the P/ATL group.

Figure 4. Calibration curves and feature significance plots of the XGBoost model for Main Bronchus Infiltration (MBI) and Pneumonia/Atelectasis (P/ATL) groups. (A) Calibration curve of the XGBoost model for the MBI group. (B) Calibration curve of the XGBoost model for the P/ATL group. (C) Feature significance plot of the XGBoost model for the MBI group. (D) Feature significance plot of the XGBoost model for the P/ATL group.

Methods

Information source and study framework
The lung being the primary site as established by international norms. The exclusion criteria were as follows: (1) Patients demonstrating visceral pleural infiltration; (2) Patients with undefined clinical features. Figure 6 delineates the flowchart of the study.