Introduction

Utilizing machine learning presents a promising approach to enhance the diagnosis and treatment of various ailments, such as cardiovascular diseases. Specifically, appropriate stratification and patient selection are crucial steps towards administering effective treatment. Uplift modeling1, a common machine learning methodology used in commercial industries to discern individuals with a greater or lesser propensity to purchase a product in response to an intervention, can also facilitate the identification of patients who would benefit most from treatment. This technique is unlike traditional statistical analysis, which usually aims to determine whether a treatment is effective overall or whether the effectiveness of the treatment differs among a small number of pre-defined subgroups. Uplift models are frequently trained using outcome data in the form of a customer’s response to an intervention. Randomized clinical trial data can also be utilized to train uplift models to identify patients who would benefit from a particular treatment2.

Catheter ablation is a safe and efficacious treatment for atrial fibrillation (AF). While pulmonary vein isolation (PVI) is typically performed on AF patients3,4, it is less effective in maintaining sinus rhythm among individuals with persistent AF than among those with paroxysmal AF. Although extensive catheter ablation, such as linear ablation and/or complex fractionated atrial electrogram (CFAE) ablation, has been combined with PVI for individuals with persistent AF5, randomized clinical trials evaluating the efficacy of this combination treatment have failed to produce definitive results, and the efficacy of extensive ablation remains controversial6,7. This may be attributed to the heterogeneity of patients with persistent AF, which highlights the importance of suitable stratification and identification of individuals who do or do not require extensive substrate ablation in conjunction with PVI. Although numerous previous studies have addressed this issue, including our own8,9,10,11, the optimal method for stratification has yet to be established. In this regard, uplift modeling would be an appropriate approach.

Here, we report the usefulness of uplift modeling in identifying patients with persistent AF who require extensive catheter ablation in addition to PVI. Moreover, we describe the features of patients identified by uplift modeling who stand to benefit from extensive ablation.

Methods

Study design

This post-hoc sub-analysis of the EARNEST-PVI trial (registered at ClinicalTrials.gov, https://clinicaltrials.gov/ct2/show/NCT03514693, ClinicalTrials.gov Identifier: NCT03514693)7,9,10,12,13 focused on stratification using uplift modeling. The EARNEST-PVI trial is a prospective, multicenter, randomized, and open-label non-inferiority trial of patients with persistent AF undergoing an initial catheter ablation procedure conducted by the Osaka Cardiovascular Conference Arrhythmia Investigators. Details of the EARNEST-PVI trial are described elsewhere7,12. Briefly, patients with persistent AF undergoing a first-time ablation procedure were enrolled in eight medical centers with extensive experience with catheter ablation for AF. Patients were randomized to receive either PVI only (PVI-alone) or extensive ablation comprising linear and/or CFAE ablation in addition to PVI (PVI-plus). Before catheter ablation, we collected clinical data, including patient history, laboratory data, and transthoracic echocardiography results. Details of the ablation procedures are also described elsewhere7,12. Patients were followed up for 12 months after the ablation procedure. A 12-lead electrocardiogram (ECG) was performed before catheter ablation, at discharge, and at 1, 3, 6, 9, and 12 months, and a 24-h Holter ECG was conducted at 6 and 12 months to detect recurrence of AF. The primary endpoint of the study was the recurrence of AF documented by scheduled or symptom-driven ECG tests during the 12-month follow-up period. All patients provided written informed consent to participate and the study was approved by the ethics committee of each hospital. This study conformed to the ethical guidelines outlined in the Declaration of Helsinki, and was approved by the Institutional Review Boards of all hospitals.
The following institutes approved this study: Cardiovascular Center, Sakurabashi-Watanabe Hospital (study number: 17-6); Osaka University Graduate School of Medicine (14377); Kansai Rosai Hospital (15D059g); Osaka General Medical Center (27-2035); Osaka Police Hospital (548); Osaka Rosai Hospital (28-78); Yao Municipal Hospital (八病H29-5); and Osaka Hospital, Japan Community Healthcare Organization (2016-25).

Uplift modeling

Uplift modeling has been used to predict the difference between class probabilities in a treatment and a control group. This approach enables the discovery of a group of patients for which a treatment is more beneficial2.

A total of 53 variables before catheter ablation (Supplementary information 2, Supplementary Table S1) were considered as primary candidates for uplift modeling after excluding factors with missing data of more than 15%. Missing values in categorical features were imputed with a constant ‘not available’ value and those in continuous features were imputed with the mean value of the feature. This imputation method is the default preprocessing method in ‘PyCaret’ 2.3.10, an open-source, low-code machine learning library in Python (https://pycaret.readthedocs.io/en/stable/index.html). Variables with a correlation coefficient greater than 0.7 or considered clinically highly relevant to each other were removed and replaced with the variable that was considered the most informative. Continuous variables were scaled and translated according to the interquartile range in a normal distribution. Finally, a total of 26 variables were included in the present analysis (Supplementary information 2, Supplementary Table S2).
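The imputation and scaling steps described above can be sketched in plain Python. This is an illustration only, not the PyCaret pipeline used in the study: the function names and the toy values are hypothetical, and reading "scaled and translated according to the interquartile range" as centering on the median and dividing by the interquartile range is an assumption.

```python
from statistics import mean, median, quantiles


def impute_continuous(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_categorical(values, fill="not available"):
    """Replace missing entries (None) with a constant category label."""
    return [fill if v is None else v for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range (assumed reading)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    centre = median(values)
    return [(v - centre) / (q3 - q1) for v in values]


# Hypothetical toy data, not taken from the trial.
creatinine = [0.8, 1.1, None, 0.9, 1.4]
smoking = ["yes", None, "no", "no", "yes"]

creatinine_filled = impute_continuous(creatinine)
smoking_filled = impute_categorical(smoking)
creatinine_scaled = robust_scale(creatinine_filled)
```

After scaling, the median patient sits at 0 and a value one interquartile range above the median sits at 1, which limits the influence of outliers compared with mean and standard deviation scaling.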

Uplift modeling is commonly conducted using a two-model approach or a one-model approach2. Here, we used the one-model approach, which has the advantage that models are easier to interpret: the influence of each variable on the uplift model can easily be evaluated. Xi is defined as a predictor variable and Y \(\in \left\{0, 1\right\}\) as a class variable whose behavior is to be modeled. For the class variable, a value of 1 indicates a positive outcome (success) while a value of 0 indicates a negative outcome (failure). The uplift score is calculated by subtracting the probability of success in the control group (\({P}_{C}\)) from the probability of success in the treatment group (\({P}_{T}\)). In the present study, success was defined as no recurrence of AF during the 1-year follow-up period, and failure was defined as recurrence of AF during the 1-year follow-up period.

$$Uplift\; score = P_{T} (Y = 1|X_{i} ) - P_{C} (Y = 1|X_{i} )$$

The one-model approach uses class variable transformation. The model defines a target variable Z as follows:

$$Z = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {if\;treatment \;group \;and \;success\; in\; procedure} \hfill \\ {1,} \hfill & {if\; control \;group \;and\;failure \;in\; procedure} \hfill \\ {0,} \hfill & {otherwise} \hfill \\ \end{array} } \right.$$

Using a probability of event Z = 1 (\(P\left(Z=1|{X}_{i}\right)\)), the uplift score can be calculated as follows2:

$$P_{T} \left( {Y = 1{|}X_{i} } \right) - P_{C} \left( {Y = 1{|}X_{i} } \right) = 2P\left( {Z = 1{|}X_{i} } \right) - 1$$
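As a minimal illustration of the class variable transformation, the Python sketch below builds Z from toy allocation and outcome labels and checks numerically that 2P(Z = 1) − 1 matches the difference in success rates when the two groups are of equal size. The function names and data are hypothetical. In an actual model, P(Z = 1 | Xi) would come from a fitted classifier rather than from a pooled proportion.

```python
def transform_class(treated, success):
    """Z = 1 for treated successes and control failures, 0 otherwise."""
    return [1 if t == y else 0 for t, y in zip(treated, success)]


def uplift_from_z(p_z):
    """Convert P(Z = 1) into an uplift score via 2P(Z = 1) - 1."""
    return 2 * p_z - 1


# Hypothetical toy data with 1:1 allocation (4 treated, 4 controls).
treated = [1, 1, 1, 1, 0, 0, 0, 0]
success = [1, 1, 1, 0, 1, 0, 0, 0]

z = transform_class(treated, success)
p_z = sum(z) / len(z)

n_treated = sum(treated)
n_control = len(treated) - n_treated
p_t = sum(y for t, y in zip(treated, success) if t) / n_treated
p_c = sum(y for t, y in zip(treated, success) if not t) / n_control

uplift_direct = p_t - p_c
uplift_transformed = uplift_from_z(p_z)
print(uplift_direct, uplift_transformed)  # both 0.5 for this toy data
```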

The transformation method above makes two assumptions. First, procedure allocation is independent of Xi. Second, the probability of being assigned to the treatment group is equal to the probability of being assigned to the control group. These assumptions hold in the present study because patients in the EARNEST-PVI trial were randomly allocated to the treatment group (PVI-plus) or control group (PVI-alone) in a 1:1 ratio.
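Under these two assumptions, the identity follows in a few steps. The derivation below is a sketch; the symbol G for the allocated group (T or C) is notation introduced here for illustration and is not part of the original formulation.

$$P\left( {Z = 1|X_{i} } \right) = P\left( {G = T|X_{i} } \right)P_{T} \left( {Y = 1|X_{i} } \right) + P\left( {G = C|X_{i} } \right)P_{C} \left( {Y = 0|X_{i} } \right)$$

Because allocation is independent of Xi and both groups are equally likely, \(P\left(G=T|{X}_{i}\right)=P\left(G=C|{X}_{i}\right)=1/2\), and \({P}_{C}\left(Y=0|{X}_{i}\right)=1-{P}_{C}\left(Y=1|{X}_{i}\right)\), so that

$$P\left( {Z = 1|X_{i} } \right) = \frac{1}{2}P_{T} \left( {Y = 1|X_{i} } \right) + \frac{1}{2}\left( {1 - P_{C} \left( {Y = 1|X_{i} } \right)} \right)$$

Multiplying both sides by 2 and subtracting 1 gives \(2P\left(Z=1|{X}_{i}\right)-1={P}_{T}\left(Y=1|{X}_{i}\right)-{P}_{C}\left(Y=1|{X}_{i}\right)\).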

We tested the following probabilistic classification models for uplift modeling: logistic regression, random forest, k-nearest neighbor algorithm, quadratic discriminant analysis, naive Bayes, adaptive boosting, decision tree, gradient boosting, linear discriminant analysis, support vector machine with a radial basis function kernel, extra trees classifier, extreme gradient boosting, Gaussian process, light gradient boosting, multi-layer perceptron, and category boosting.

To evaluate each model’s diagnostic performance, we used a Qini curve because we focused on the efficacy of treatment (PVI-plus). The model returns an uplift score for each individual, and the data are sorted in descending order. The Qini curve plots the incremental success, which is calculated by

$$Qini\;curve\left( \varphi \right) = S_{T} \left( \varphi \right) - \frac{{S_{C} \left( \varphi \right)N_{T} \left( \varphi \right)}}{{N_{C} \left( \varphi \right)}}$$

where \({S}_{T}(\varphi )\) and \({S}_{C}(\varphi )\) are the cumulative numbers of successes among individuals with an uplift score ≥ \(\varphi\) in the treatment group and control group, respectively, and \({N}_{T}(\varphi )\) and \({N}_{C}(\varphi )\) are the numbers of individuals with an uplift score ≥ \(\varphi\) in the treatment group and control group, respectively. In a Qini curve plot, the horizontal axis shows the rank of each individual when sorted by uplift score in descending order, not the uplift score itself, and the vertical axis shows cumulative uplift. Finally, the Qini coefficient is calculated by measuring the area between the Qini curve and the diagonal line. The diagonal line represents the incremental success expected if the treatment is randomly allocated. A model with a higher Qini coefficient has higher diagnostic performance. The optimal cut-off value is the uplift score \(\varphi\) of the individual at which the difference between the incremental success on \(Qini\;curve\left(\varphi \right)\) and that on the diagonal line is greatest. After applying each model to the validation cohort of the training dataset, we selected the one with the highest Qini coefficient as the best performing model. Where possible, Gini importance, which is the default setting in the ‘PyCaret’ package, was used to rank the importance of the features in the selected model in the model selection cohort of the training dataset14. In addition, we used the SHapley Additive exPlanations (SHAP) methodology to estimate the impact of each variable on model output in the training, model selection, and test cohorts15. The SHAP approach enables the identification and prioritization of the features that determine the predictions of any machine learning model15.
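The Qini curve, Qini coefficient, and cut-off selection described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the code provided in the supplementary files: the function names and toy data are hypothetical, the control term is taken as 0 while no control patient has yet been ranked, and the area is approximated by the mean gap between the curve and the diagonal.

```python
def qini_curve(scores, treated, success):
    """Return the ranking and the Qini value S_T - S_C * N_T / N_C at each rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    s_t = s_c = n_t = n_c = 0
    curve = []
    for i in order:
        if treated[i]:
            n_t += 1
            s_t += success[i]
        else:
            n_c += 1
            s_c += success[i]
        # Assumption: the control term is 0 until a control patient is ranked.
        control_term = s_c * n_t / n_c if n_c else 0.0
        curve.append(s_t - control_term)
    return order, curve


def qini_summary(scores, treated, success):
    """Return an approximate Qini coefficient and the cut-off uplift score."""
    order, curve = qini_curve(scores, treated, success)
    n = len(curve)
    # Diagonal: straight line from 0 to the final incremental success.
    diagonal = [curve[-1] * (k + 1) / n for k in range(n)]
    gaps = [c - d for c, d in zip(curve, diagonal)]
    coefficient = sum(gaps) / n  # rank axis rescaled to [0, 1]
    best_rank = max(range(n), key=lambda k: gaps[k])
    return coefficient, scores[order[best_rank]]


# Hypothetical toy data: uplift scores, allocation (1 = treatment), outcome (1 = success).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
treated = [1, 0, 1, 1, 0, 0, 1, 0]
success = [1, 0, 1, 1, 0, 1, 0, 1]

order, curve = qini_curve(scores, treated, success)
coefficient, cutoff = qini_summary(scores, treated, success)
print(coefficient, cutoff)
```

In this toy example the gap between the curve and the diagonal peaks at the fourth-ranked patient, so the cut-off is that patient's uplift score (0.6).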

We have included files to perform uplift score calculations in Supplementary information 3 (Supplementary materials).

Dataset

We divided the EARNEST-PVI trial (N = 497) dataset in a 1:1 ratio according to patients’ order of registration. The number of patients in the training and test datasets was 249 and 248, respectively. In addition, we further divided the training dataset in a 1:1 ratio according to patients’ order of registration: one dataset (N = 124) was used to train the models and the other (N = 125) was used to calculate Qini coefficients and determine the optimal uplift score cut-off predicted by the trained models. We subsequently selected the model with the highest Qini coefficient and used the optimal uplift score cut-off to divide the test dataset into two groups. A study flowchart is shown in Fig. 1.
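The nested split can be written out explicitly as follows. The subset sizes are those reported above; representing patients by sequential registration numbers and assigning earlier registrations to the first subset of each split are assumptions made for illustration, and the variable names are hypothetical.

```python
# Hypothetical registration numbers 1..497 in order of enrollment.
registration_order = list(range(1, 498))

# First split: training (N = 249) and test (N = 248) datasets.
training, test = registration_order[:249], registration_order[249:]

# Second split of the training dataset: model fitting (N = 124)
# and model selection / cut-off determination (N = 125).
fitting, selection = training[:124], training[124:]

print(len(fitting), len(selection), len(test))  # 124 125 248
```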

Figure 1

Study flow chart. PVI: pulmonary vein isolation.

Statistical analysis

Statistical analysis was conducted using Python 3.9.12 and R 4.0.5. We performed intention-to-treat analysis in this study. Continuous variables are presented as median with interquartile range (median [25th percentile, 75th percentile]) and categorical data as counts and percentages. Demographic and procedural differences were analyzed using the Mann–Whitney U test for continuous variables and Fisher’s exact test for categorical variables. In Tables 1, 2, 3 and 4, continuous values are shown as median with interquartile range (median [25th percentile, 75th percentile]), and categorical values are shown as number with percentage of positive findings per number of patients studied (N (%)). P-values in Tables 1 and 2 were calculated by comparison between PVI-alone and PVI-plus by uplift score ≥ 0.0124 and uplift score < 0.0124. The P-value in Table 3 was calculated by comparison between uplift score ≥ 0.0124 and uplift score < 0.0124. The P-value in Table 4 was calculated by comparison between PVI-alone and PVI-plus by uplift score ≥ 0.0124 and uplift score < 0.0124. The cumulative event rate was calculated using the Kaplan–Meier method. The hazard ratio (HR), 95% confidence interval (CI), P-value, and P-value for interaction between the uplift score cut-off and treatment were calculated using the Cox proportional hazards model. The Kaplan–Meier method and the Cox proportional hazards model were applied to the test dataset, which was not used in the training and model selection process. The proportional hazards assumption of the treatment strategy for the primary endpoint was confirmed using Schoenfeld residuals (P > 0.05). P-values < 0.05 were used to indicate statistical significance.
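For readers unfamiliar with the Kaplan–Meier method, the sketch below shows the product-limit calculation of a cumulative event rate in plain Python. It is a hand-written illustration, not the routine used for the analyses in this study, and the follow-up times and event indicators are hypothetical.

```python
def kaplan_meier(times, events):
    """Return (time, event-free probability) at each time with at least one event."""
    at_risk = len(times)
    event_free = 1.0
    curve = []
    for t in sorted(set(times)):
        n_events = sum(1 for ti, e in zip(times, events) if ti == t and e)
        n_leaving = sum(1 for ti in times if ti == t)  # events plus censored
        if n_events:
            event_free *= 1 - n_events / at_risk
            curve.append((t, event_free))
        at_risk -= n_leaving
    return curve


# Hypothetical follow-up in months; event = 1 means AF recurrence, 0 means censored.
times = [3, 6, 6, 9, 12, 12]
events = [1, 1, 0, 1, 0, 0]

curve = kaplan_meier(times, events)
cumulative_event_rate = 1 - curve[-1][1]
print(round(cumulative_event_rate, 3))  # 0.556
```

Censored patients leave the risk set without lowering the event-free probability, which is why the estimate differs from the crude proportion of events (3/6).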

Table 1 Patient characteristics and outcomes in the training dataset used to train models.
Table 2 Patient characteristics and outcomes in the training dataset used to plot Qini curves.
Table 3 Patient characteristics in three datasets.
Table 4 Patient characteristics in three datasets for each treatment branch.

Results

Study subjects and feature importance

A total of 512 patients were enrolled between March 2016 and September 2017. After excluding nine patients for protocol violation, five for errors in the electronic data collection system, and one for withdrawal of consent, 497 patients were analyzed in the EARNEST-PVI trial. Patient characteristic and outcome data in the training dataset used to train the models are shown in Table 1, and those used to plot the Qini curve are shown in Table 2. Qini curves of all models are shown in Fig. 2. We selected adaptive boosting to conduct predictions in the test dataset because this model showed the highest Qini coefficient among the 16 algorithms evaluated. Figure 3 shows the Gini importance of the top 13 variables. SHAP values are summarized in Supplementary information 1 (Supplementary Figure S1). Creatinine had the highest impact on prediction in the best model. We summarized patient characteristics of the overall cohort in Table 3 and those stratified by treatment arms in Table 4. The optimal uplift score cut-off according to the Qini curve was 0.0124. We divided the test dataset according to an uplift score of 0.0124 to obtain two groups: uplift score ≥ 0.0124 group (N = 116) and uplift score < 0.0124 group (N = 132) (Table 5). As shown in Table 5, patients with uplift score ≥ 0.0124 were mostly female, had lower frequency of smoking history and sleep apnea syndrome, and had lower hemoglobin and brain natriuretic peptide (BNP) levels than those with uplift score < 0.0124. Patient data based on actual allocation of treatment in the EARNEST-PVI trial are shown in Table 6. 
Supplementary Table S3 shows the combinations of procedures in the training dataset used to train models, Supplementary Table S4 shows those in the training dataset used to plot Qini curves, Supplementary Table S5 shows those in the uplift score ≥ 0.0124 group in the test dataset, and Supplementary Table S6 shows those in the uplift score < 0.0124 group in the test dataset (Supplementary information 2).

Figure 2

Qini curves of all models. The horizontal axis shows the rank of each individual when sorted by uplift score in descending order, not the uplift score itself, and the vertical axis shows cumulative uplift.

Figure 3

Feature importance of the top 13 variables.

Table 5 Patient characteristics in the test dataset.
Table 6 Patient characteristics in the test dataset according to uplift score and allocation of procedure.

Clinical endpoints

Figure 4 shows the results of Kaplan–Meier analysis for the primary endpoint in the test dataset. Among patients with uplift score ≥ 0.0124, the event rate of recurrence of AF was significantly lower in those who received PVI-plus than in those who received PVI-alone (PVI-plus [10/52, 19.2%] vs PVI-alone [26/64, 40.6%], HR 0.40; 95% CI 0.19–0.84; P-value: 0.015). In contrast, among patients with uplift score < 0.0124, no significant difference was observed in the event rate of recurrence of AF between PVI-plus and PVI-alone (PVI-plus [18/72, 25.0%] vs PVI-alone [13/60, 21.7%], HR 1.17; 95% CI 0.57–2.39; P-value: 0.661). There was a significant interaction between uplift score and treatment (P-value for interaction: 0.046) (Fig. 5).
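The crude event rates quoted above can be reproduced directly from the reported counts, as in the short Python check below. It recomputes only the simple proportions and their absolute differences; it does not reproduce the Kaplan–Meier estimates, hazard ratios, or P-values, and the dictionary layout and names are illustrative.

```python
# Reported (recurrences, patients) per arm in the test dataset.
counts = {
    "uplift >= 0.0124": {"PVI-plus": (10, 52), "PVI-alone": (26, 64)},
    "uplift < 0.0124": {"PVI-plus": (18, 72), "PVI-alone": (13, 60)},
}

summary = {}
for group, arms in counts.items():
    plus = arms["PVI-plus"][0] / arms["PVI-plus"][1]
    alone = arms["PVI-alone"][0] / arms["PVI-alone"][1]
    summary[group] = {
        "PVI-plus": plus,
        "PVI-alone": alone,
        "difference": plus - alone,  # negative favours PVI-plus
    }
    print(f"{group}: {plus:.1%} vs {alone:.1%}, difference {plus - alone:+.1%}")
```

The crude absolute difference is about 21 percentage points in favour of PVI-plus in the uplift score ≥ 0.0124 group and about 3 percentage points in the opposite direction in the uplift score < 0.0124 group.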

Figure 4

Kaplan–Meier analysis with a log-rank test of recurrence of atrial fibrillation in patients with uplift score ≥ 0.0124 (left) and uplift score < 0.0124 (right) in the test dataset. PVI: pulmonary vein isolation.

Figure 5

Hazard ratio of the primary endpoint using a Cox proportional hazards model in the test dataset. PVI: pulmonary vein isolation; HR: hazard ratio; 95% CI: 95% confidence interval.

Discussion

Main findings

Our study has revealed the utility of uplift modeling in identifying a subset of patients with persistent AF who would most benefit from extensive ablation. By using an adaptive boosting model on data from the EARNEST-PVI trial, we found that an extensive ablation strategy, such as linear ablation and/or CFAE ablation in addition to PVI, was efficacious in patients with uplift score ≥ 0.0124, but demonstrated similar efficacy to PVI-alone in those with uplift score < 0.0124. These results imply that calculating the uplift score using the 26 variables identified in this study (supplemental material) and stratifying patients based on an uplift score threshold of 0.0124 may be a promising approach for selecting the most appropriate ablation strategy for patients with persistent AF.

Strength of this study

This study presents the initial evidence that uplift modeling via machine learning is a valuable tool for identifying patients with persistent AF who may benefit from extensive catheter ablation or for whom PVI-alone is adequate for rhythm control. Furthermore, it shows that an uplift score of 0.0124 derived from our model is an effective threshold for distinguishing between those who will and will not benefit from extensive catheter ablation. Several randomized controlled trials have been unable to show the superiority of extensive catheter ablation strategies over PVI-alone in persistent AF patients6,16. A meta-analysis on the efficacy of extensive catheter ablation, including CFAE and linear ablation, reported no significant differences in the maintenance of sinus rhythm between PVI with extensive catheter ablation and PVI-alone17. Further, the superiority of PVI-plus over PVI-alone could also not be established in the EARNEST-PVI trial. These inconclusive results might be attributed to the heterogeneity of patients with persistent AF, and suggest the importance of administering the appropriate treatment to the appropriate patients among those with persistent AF. The present study proposes a novel approach that employs uplift modeling to stratify persistent AF patients into those who may benefit from extensive ablation and those who may not require more than PVI-alone. This approach has the potential to reduce unnecessary costs and complications. Further prospective studies are needed to confirm the clinical applicability of this approach and the factors we have identified for determining appropriate catheter ablation strategies.

Stratification of persistent atrial fibrillation

This study is the first report to employ uplift modeling for identifying a particular subgroup of patients with persistent AF who may derive benefits from extensive ablation, such as linear ablation or CFAE ablation, in addition to PVI, to maintain sinus rhythm. Several previous investigations, including some using machine learning models, have reported predictors of AF recurrence after catheter ablation, regardless of extensive ablation, including left atrial size, type of AF, AF duration, and female sex18,19,20. Nevertheless, only a few studies have been aimed at identifying a specific patient group that requires additional extensive ablation among those with persistent AF. We previously reported possible stratification by sex and DR-FLASH score9,10. We observed that PVI-plus presented a lower risk of AF recurrence than PVI-alone in patients with a DR-FLASH score of > 3, suggesting that the DR-FLASH score is a valuable tool for identifying patients with persistent AF who will benefit from PVI-plus10. In the present study, we utilized a machine learning approach to identify patients who may benefit from extensive ablation by testing 16 probabilistic classification models using uplift modeling with 26 clinical factors. Through adaptive boosting, we obtained a hazard ratio of 0.40 (95% CI 0.19–0.84) for PVI-plus compared to PVI-alone in patients with an uplift score ≥ 0.0124, which was lower than that observed in patients with DR-FLASH score > 3 (HR 0.45, 95% CI 0.28–0.72) in our previous study10. Moreover, the clinical factors employed in this study can be non-invasively obtained via medical interview, blood examination, and transthoracic echocardiography. These results suggest that uplift modeling can be a valuable noninvasive tool to improve our ability to identify patients who require extensive ablation in addition to PVI among those with persistent AF.
Nevertheless, further prospective studies may be warranted to compare the DR-FLASH score and uplift modeling as stratification strategies. Additionally, efforts to reduce the number of factors required for uplift modeling while maintaining discriminative power will be pivotal to enhance the practicality of this model in daily clinical practice. The uplift score is more difficult to implement in the clinical setting than the DR-FLASH score because our uplift modeling requires 26 variables and a substantial technical skillset. Nevertheless, we performed uplift modeling because we were motivated to comprehensively analyze the dataset of the EARNEST-PVI trial. The present study on stratification with uplift modeling is data-driven research, whereas the previous study on stratification with the DR-FLASH score is theory-driven research. Although the hazard ratios in the two studies appear almost identical, stratification by uplift modeling was more accurate than that by the DR-FLASH score. It is meaningful to demonstrate the usefulness of uplift modeling for rigorously selecting a catheter ablation strategy.

Characteristics of patients with high uplift score

The clinical factors that contribute to a high uplift score using an adaptive boosting strategy offer insight into the characteristics of patients who would benefit from extensive ablation. Our adaptive boosting approach identified serum creatinine level, left ventricular ejection fraction, hemoglobin level, BNP, C-reactive protein (CRP), left atrial diameter, smoking history, body mass index, history of heart failure and sleep apnea syndrome as the top ten factors with high feature importance (Fig. 3). Although all of these factors have already been reported to be predictors of AF recurrence after catheter ablation21,22, we were unable to determine the exact relationship between these variables and the uplift score due to the non-linear nature of machine learning. We therefore conducted a comparison of patient characteristics between those who had an uplift score ≥ 0.0124 and those who had a score < 0.0124 to assess the effect of each factor on the uplift score (Table 5). The analysis revealed that patients with an uplift score ≥ 0.0124 were predominantly female, had lower frequencies of smoking history and sleep apnea syndrome, and had lower hemoglobin and BNP levels, as well as larger left atrial diameters (Table 5). These observations suggest that these features may contribute to a high uplift score. Furthermore, given that female sex23, lower hemoglobin24 and larger left atrial size25,26 are already known to be associated with arrhythmogenic substrate, which can cause AF recurrence, these findings suggest that patients with arrhythmogenic substrate would benefit from extensive ablation. In contrast, patients with a high uplift score also had lower frequencies of smoking history27 and sleep apnea28, and lower BNP levels29, which are generally considered to be associated with lower risk of recurrence after catheter ablation.
While the true reasons for this discrepancy are unknown, one possible explanation is that machine learning using an adaptive boosting strategy led to the identification of patients who would benefit from extensive ablation based on different criteria from those previously reported, such as the presence of arrhythmogenic substrate. Another possible explanation is that smoking, sleep apnea and high BNP are less strongly associated with arrhythmogenic substrate than factors like female sex, lower hemoglobin and larger atrial size. Given that uplift modeling is used to identify patients who would benefit the most from an intervention, rather than to predict recurrence of an event, these results suggest that extensive ablation may not be effective at all for patients with a smoking history, sleep apnea or high BNP levels. Nevertheless, these findings suggest that the uplift score and machine learning may be useful for identifying a specific population that would benefit from extensive ablation, and that there may exist previously unrecognized criteria or algorithms that could enhance our ability to identify such a population.

Limitations

Several limitations exist in the current study. First, the techniques employed for additional left atrial ablation in the PVI-plus arm of the EARNEST-PVI trial were not pre-specified, resulting in heterogeneity; the trial was initially designed to investigate the non-inferiority of PVI-alone against any extensive catheter ablation for patients with persistent AF. Second, the study was conducted solely in an East Asian population, thus limiting the generalizability of the findings to other ethnic groups. Third, the primary endpoint, recurrence of AF, may have been underestimated since only regular 12-lead ECG and Holter ECG were employed at each visit, while event recorders or implantable devices were not utilized to detect recurrence. Fourth, the uplift score is difficult to implement in the clinical setting because our uplift modeling requires 26 variables and a substantial technical skillset to calculate. Finally, all the included patients underwent their first procedure; therefore, the results are not applicable to redo procedures, which are often required in patients with persistent AF.

Conclusions

We demonstrated that the application of machine learning using uplift modeling can be useful for identifying a specific subgroup of patients with persistent AF who would most benefit from an extensive ablation strategy, comprising linear ablation and/or CFAE ablation in addition to PVI. An uplift score of 0.0124, calculated using our model, may be a useful threshold for stratifying patients with persistent AF who do and do not require extensive ablation in addition to PVI. However, additional prospective investigations are necessary to determine the efficacy of this approach.