Combining clinical and imaging data for predicting functional outcomes after acute ischemic stroke: an automated machine learning approach

This study aimed to develop and validate an automated machine learning (ML) system that predicts 3-month functional outcomes in acute ischemic stroke (AIS) patients by combining clinical and neuroimaging features. Functional outcomes were categorized as unfavorable (modified Rankin Scale ≥ 3) or favorable. A clinical model employing optimal clinical features (Model A), a convolutional neural network model incorporating imaging data (Model B), and an integrated model combining both imaging and clinical features (Model C) were developed and tested to predict unfavorable outcomes. The developed models were compared with each other and with traditional risk-scoring models. The dataset comprised 4147 patients from a multicenter stroke registry, with 1268 (30.6%) experiencing unfavorable outcomes. Age, initial NIHSS, and early neurologic deterioration were identified as the most important clinical features. In the test set, the models achieved areas under the curve (AUC) of 0.757 (95% CI 0.726–0.789) for Model A, 0.725 (95% CI 0.693–0.755) for Model B, and 0.786 (95% CI 0.757–0.814) for Model C. The integrated model outperformed traditional risk-scoring models, with AUC differences of 0.21 (95% CI 0.16–0.25) versus HIAT and 0.15 (95% CI 0.11–0.19) versus THRIVE. In conclusion, the integrated ML system enhanced stroke outcome prediction by combining imaging data and clinical features, outperforming traditional risk-scoring models.

Prognosis related to functional outcomes following a stroke is a major concern for patients and their families. Physicians need to be able to predict functional recovery when establishing long-term treatment plans [1]. Functional outcomes are influenced by whether the infarction occurred in a motor-related eloquent area [2,3]. Thus, utilizing imaging information about the location and extent of brain lesions is crucial for predicting patient prognosis [4,5].
Convolutional neural networks (CNNs), a type of deep neural network, can effectively process spatial information from lesions. CNN models using stroke images are anticipated to provide better prognostic predictions after acute stroke. However, most traditional models for predicting stroke functional prognosis to date have relied on risk score systems that use only non-imaging clinical features or imaging-derived features [6–9]. Recently, several machine learning (ML) based prediction models have been proposed, but the majority did not incorporate lesion imaging information [10].
A multi-modal system that combines multiple types of data, instead of using each data type alone, has been proposed to improve the model's performance by increasing the amount of information [11–13]. Nevertheless, few multi-modal systems have been developed to predict stroke outcomes [10,14].
The limited explainability and usability of existing ML models have hindered their practical application. Deep learning algorithms, such as CNNs, are often considered "black box" models, making them difficult to

Model development
To classify three-month binary functional outcomes, we developed and tested prediction models for performance validation. We randomly allocated 20% of the dataset as a test set, exclusively for evaluation. The remaining 80% served as a training set for hyperparameter determination and training, employing fivefold cross-validation (Fig. S3).
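The partitioning scheme described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the function name `split_and_folds`, the fixed seed, and the absence of stratification by outcome are all assumptions, since the text specifies only a random 20% hold-out and fivefold cross-validation on the remainder.

```python
import random


def split_and_folds(n_samples, test_fraction=0.2, n_folds=5, seed=0):
    """Hold out a random test set, then build k-fold splits on the rest.

    Returns (test_idx, folds), where folds is a list of
    (train_idx, val_idx) pairs drawn only from the non-test samples.
    Hypothetical helper: the paper does not describe its splitting code.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    n_test = round(n_samples * test_fraction)
    test_idx = indices[:n_test]      # used for final evaluation only
    dev_idx = indices[n_test:]       # used for training and tuning

    folds = []
    for k in range(n_folds):
        # Every n_folds-th sample, starting at offset k, forms fold k.
        val_idx = dev_idx[k::n_folds]
        val_set = set(val_idx)
        train_idx = [i for i in dev_idx if i not in val_set]
        folds.append((train_idx, val_idx))
    return test_idx, folds


test_idx, folds = split_and_folds(1000)
print(len(test_idx), [len(val) for _, val in folds])
```

Keeping the test indices out of every fold is the point of the design: hyperparameters are chosen on validation folds, so the test-set estimate is not influenced by tuning.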
We developed the ML model as follows. First, we constructed Model A, a clinical model, using ML algorithms such as logistic regression (LR), random forest (RF), light gradient boosting machine (LGBM) [19], and multi-layer perceptron (MLP) [20], based on selected variables from the clinical dataset. Second, we built Model B, an imaging model, by training a deep three-dimensional DenseNet (CNN model) on imaging data with three channels (DWI, ADC, and ground truth lesion mask) [21]. Third, we developed an integrated ML model (Model C) using prediction probabilities from Model B and selected variables from the clinical dataset as multi-modal inputs. Lastly, we also developed a lesion segmentation model (Model S) for the same imaging data, utilizing a two-dimensional U-Net model with a ResNet152 backbone [22]. Model S was designed to enable a pipeline in which the lesion is segmented automatically and the predicted mask is passed to the classification model when the user supplies only the original DWI and ADC without performing manual segmentation. We evaluated Model B using the predicted mask from Model S instead of the ground truth mask.
All developed models were evaluated for performance using the test set (unseen dataset). The overall flow diagram of ML model development and evaluation is shown in Fig. S4.
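The integration step for Model C, in which the imaging model's predicted probability is combined with the selected clinical variables, can be illustrated with a short sketch. The names `ClinicalFeatures` and `build_model_c_input`, the feature ordering, and the 0/1 coding of early neurologic deterioration are assumptions made for illustration; the text states only that Model B's prediction probabilities and the selected clinical variables serve as the multi-modal inputs.

```python
from typing import List, NamedTuple


class ClinicalFeatures(NamedTuple):
    """Selected clinical inputs (field names are illustrative)."""
    age: float
    initial_nihss: float
    end: int  # early neurologic deterioration: 1 = present, 0 = absent


def build_model_c_input(p_unfavorable_model_b: float,
                        clinical: ClinicalFeatures) -> List[float]:
    """Concatenate Model B's probability with the clinical features.

    The resulting row would be passed to a tabular classifier
    (a random forest in the paper); the classifier is not shown here.
    """
    if not 0.0 <= p_unfavorable_model_b <= 1.0:
        raise ValueError("Model B output must be a probability in [0, 1]")
    if clinical.end not in (0, 1):
        raise ValueError("END must be coded as 0 or 1")
    return [
        p_unfavorable_model_b,
        float(clinical.age),
        float(clinical.initial_nihss),
        float(clinical.end),
    ]


row = build_model_c_input(0.62, ClinicalFeatures(age=71, initial_nihss=9, end=0))
print(row)
```

Feeding a single probability rather than raw image features keeps the integrated model's input small, at the cost of discarding any imaging information Model B did not encode in that one number.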

Performance evaluation
We used the area under the receiver operating characteristic curve (AUC) as the primary performance metric to evaluate the outcome prediction model. Initially, we assessed the performance of the LR, RF, LGBM, and MLP algorithms using the test set. We then adopted one algorithm as the representative ML model. With the representative ML model, we compared the performances of Models A, B, and C. Next, we plotted a calibration plot and calculated the Brier score, which represents the mean squared difference between the predicted probability and the true outcome, to evaluate the models' calibration performances [23,24]. Finally, we compared the performance of Model C on the test set with traditional models (HIAT and THRIVE) [6,7]. We used the intersection over union (IoU) and Dice similarity coefficient (DSC) as performance evaluation metrics for lesion area detection in the segmentation model (Model S) [25,26].
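For readers unfamiliar with these metrics, the following sketch gives plain-Python definitions of the AUC (computed through its rank interpretation: the probability that a randomly chosen unfavorable case scores higher than a randomly chosen favorable one), the Brier score, and the IoU and DSC for binary masks. The function names are illustrative, masks are assumed to be flattened sequences of 0/1 values, and returning 1.0 for two empty masks is a convention chosen here rather than something stated in the text.

```python
def auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Quadratic in sample size, so for illustration only.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def iou_and_dice(pred_mask, true_mask):
    """Intersection over union and Dice coefficient for flat binary masks."""
    inter = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    pred_sum = sum(1 for p in pred_mask if p)
    true_sum = sum(1 for t in true_mask if t)
    union = pred_sum + true_sum - inter
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as perfect agreement
    return inter / union, 2.0 * inter / (pred_sum + true_sum)


y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(auc(y, p), brier_score(y, p))
print(iou_and_dice([1, 1, 0, 0], [1, 0, 1, 0]))
```

For any single pair of masks the two overlap metrics are linked by DSC = 2·IoU / (1 + IoU), so the Dice value is never smaller than the IoU; averages taken over many cases need not satisfy that identity exactly.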

Statistical analysis
Values are presented as mean ± standard deviation or median (interquartile range) for continuous variables, or as number (%) of subjects for categorical variables, as appropriate. We compared the clinical characteristics between two groups using the Chi-square test, Fisher's exact test, Mann-Whitney test, or Student's t-test, depending on the type of variable. To assess the models' performance in discriminating three-month functional outcomes, we plotted receiver operating characteristic (ROC) curves and calculated the area under the curve (AUC) and 95% confidence interval (CI) for each model. We compared differences between AUCs using DeLong's test [27]. Furthermore, we calculated positive predictive value, negative predictive value, and Brier scores as secondary outcome metrics. Where necessary, we also determined sensitivity and specificity at the threshold defined by Youden's index J (J = sensitivity + specificity − 1). We used MedCalc software (version 20.114) to generate the ROC curves and perform DeLong's test, and R version 4.1.2 for the other statistical analyses. We considered a 2-tailed P-value < 0.05 to be statistically significant.
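The threshold selection by Youden's index can be sketched as below. This is an illustrative standard-library version, not the MedCalc procedure used in the study: the function name `youden_threshold`, the rule that a case is predicted unfavorable when its score is at or above the threshold, and the tie-breaking toward the lowest threshold are assumptions.

```python
def youden_threshold(y_true, y_score):
    """Return (J, threshold, sensitivity, specificity) maximizing Youden's J.

    Each distinct score is tried as a cut-off; a case is called positive
    when its score is >= the cut-off. On ties in J, the lowest cut-off wins.
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = sum(1 for y in y_true if y == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative case")

    best = None
    for thr in sorted(set(y_score)):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= thr)
        tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < thr)
        sensitivity = tp / n_pos
        specificity = tn / n_neg
        j = sensitivity + specificity - 1
        if best is None or j > best[0]:
            best = (j, thr, sensitivity, specificity)
    return best


j, thr, sens, spec = youden_threshold([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(j, thr, sens, spec)
```

Youden's J weights sensitivity and specificity equally; a clinical deployment that penalizes missed unfavorable outcomes more heavily than false alarms might reasonably choose a different operating point.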

Results
A total of 4147 patients were included in the dataset, with 1268 (30.6%) experiencing an unfavorable outcome (Fig. S5). The proportion of missing values did not exceed 10% for any of the common dataset variables (Fig. S6), leading us to apply multivariate imputation by chained equations to all variables. The mean age was 68.09 ± 12.6 years, and 1765 (61.3%) were men. Baseline demographics and clinical characteristics based on favorable and unfavorable outcomes of the dataset, with imputation performed, are listed in Table S2.
We calculated feature importance and contribution to the outcome for the 31 imputed common variables (Fig. S7). Age, initial NIHSS, and early neurologic deterioration (END) were consistently identified as the most important variables in random forest feature importance, permutation importance, and SHAP value analyses. The direction of feature contribution shown in the SHAP summary plot was in line with general clinical interpretation (Fig. S8). For most of the remaining features, the relationship between the distribution of variable values and SHAP values was more heterogeneous. Considering the concern of overfitting and the advantage of selecting the minimum features to use as input variables in clinical practice, we ultimately chose age, initial NIHSS, and END as the ML prediction system's clinical input features.
Figure 1 illustrates the overall pipeline of class prediction for the three-month functional outcome using the developed models, with model development details provided in Tables S3-S5.

Performance of the models
Model S achieved an overall mean Dice score of 0.8433 and an IoU of 0.7968. It selectively segmented only the high signal attributed to the infarcted core, excluding high signals that could be mistaken for diffusion restriction, such as susceptibility artifacts (Fig. S9).
Table 1 presents the performance estimates of the ML algorithms and CNN. Performance was similar across the four ML algorithms, with random forest (RF) performing best. As a result, we chose RF as the representative ML algorithm for Models A and C.
Model calibration was assessed to determine how closely the predicted probabilities matched the observed frequency of unfavorable outcomes. The calibration slopes showed minimal difference between the predicted and observed probability of unfavorable outcomes, indicating a good model fit (Fig. S10).
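The data behind a calibration plot can be computed by grouping predictions into probability bins and comparing the mean predicted probability in each bin with the observed event rate. The sketch below assumes equal-width bins and a hypothetical function name `calibration_bins`; the binning scheme actually used for Fig. S10 is not described in the text.

```python
def calibration_bins(y_true, y_prob, n_bins=10):
    """Group predictions into equal-width probability bins.

    Returns a list of (mean_predicted, observed_rate, count) for each
    non-empty bin. Points near the diagonal mean_predicted == observed_rate
    indicate good calibration.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        # A probability of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    summary = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            summary.append((mean_pred, observed, len(members)))
    return summary


for mean_pred, observed, count in calibration_bins(
        [0, 0, 1, 1], [0.1, 0.15, 0.8, 0.85], n_bins=2):
    print(round(mean_pred, 3), observed, count)
```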

Discussion
In this study, we proposed an ML system that automatically segments infarct lesions and classifies 3-month functional outcomes by combining a CNN model using imaging data with a clinical model utilizing optimal clinical features. The system was developed with a multi-center training set of 3,332 patients and its performance was validated using a test set of 822 patients. The segmentation model displayed good performance in infarct core lesion segmentation. The results demonstrated that the integrated model was superior to both the clinical and imaging models. The integrated ML model outperformed traditional risk-scoring models, and the classification results of the imaging model and Grad-CAM suggested that the imaging model accurately detected infarction lesions and might have learned the eloquent brain area.
Concerns regarding overfitting and the need for input convenience in clinical practice prompted us to select features for use as input variables. As a result, age, NIHSS, and END were identified as the most important features (Fig. S7). These are well-known risk factors for stroke functional outcomes and have been consistently reported in previous studies [8,9,28–30]. To assess not only the strength of these selected variables' contributions to the outcome but also their direction of influence, we examined the SHAP values and found that they reflected the anticipated clinical direction of influence (Fig. S8). Furthermore, these features exhibited the same importance and directionality in the generalized linear model. Using logistic regression, the odds ratios (OR) for an unfavorable outcome from age, NIHSS, and END were 1.04 (95% CI 1.03–1.04), 1.24 (95% CI 1.21–1.27), and 6.90 (95% CI 5.32–8.95), respectively. While the effect sizes of initial NIHSS and age might seem negligible at first glance, these are odds ratios per one-unit increase; because each additional unit adds the same increment to the log(odds), the effect compounds multiplicatively across the clinically relevant range of these variables and becomes substantial.
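The compounding of per-unit odds ratios can be made concrete with a short calculation. The sketch assumes the log(odds) is linear in each variable, as in a standard logistic regression, and uses the rounded point estimates reported above; the 10-point NIHSS difference and 20-year age difference are arbitrary illustrative choices, and the helper name is hypothetical.

```python
import math


def odds_ratio_for_change(or_per_unit, delta_units):
    """Odds ratio implied by a change of delta_units in the predictor.

    Under a log-linear model, log(OR) scales with the size of the change,
    so the per-unit odds ratio is raised to the power delta_units.
    """
    return math.exp(math.log(or_per_unit) * delta_units)


# Rounded point estimates from the logistic regression reported in the text.
nihss_10_points = odds_ratio_for_change(1.24, 10)
age_20_years = odds_ratio_for_change(1.04, 20)
print(round(nihss_10_points, 2), round(age_20_years, 2))
```

Under these assumptions, a 10-point higher initial NIHSS corresponds to roughly 8.6-fold higher odds of an unfavorable outcome and a 20-year age difference to roughly 2.2-fold higher odds, which puts the per-unit estimates on a scale comparable to the odds ratio for END.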
In our study, Model S (segmentation model) effectively segmented infarct core lesions. The model can visualize the infarction lesion and calculate the core volume. While the predicted mask from the segmentation model was utilized as an input for the classification model, its contribution to improving prediction performance appeared limited. However, the segmentation model still holds value for several reasons. Firstly, by offering visual information on cerebral lesions, it allows clinicians to gain essential insights into the presence and location of cerebral infarctions. This information can be particularly valuable for medical professionals who are not well-acquainted with cerebral infarction imaging.
Secondly, the segmentation model can serve a role in verifying model reliability. When faced with input images of suboptimal quality or those deemed inappropriate, clinicians have the opportunity to detect such issues early through the segmentation model. Consequently, this gives them a chance to determine the trustworthiness of the classification model's results.
Lastly, considering recent clinical trials (RESCUE-Japan LIMIT, SELECT 2, ANGEL-ASPECT) [31–33], the volume of the cerebral infarction is deemed one of the critical determinants for recanalization therapy. The volume information of the infarct core provided by our segmentation model could support such clinical decision-making processes. One of the interests of our research was to determine if the brain imaging model was capturing not just the volume of the cerebral infarct but also its positioning within functionally eloquent brain areas. The term "brain eloquence" refers to the functional importance of specific brain regions, indicating their central role in neural operations. For instance, damage to crucial areas such as the primary motor cortex or language-associated regions can lead to severe clinical symptoms, whereas an infarct of similar size in other regions might not be as consequential.
To illustrate from our study results, Model B categorized cases 'a' and 'b' as having unfavorable outcomes (as shown in Fig. 4). However, for what are considered less functionally eloquent lesions, cases 'g' and 'h' in Fig. 4 were classified with favorable outcomes, despite having a larger volume than 'a' and 'b'. These findings hint that our imaging model might be taking into account not just the volume but possibly also the location and its functional eloquence.
This consideration of functional eloquence might be especially decisive for smaller infarcts. Referring to Fig. S11, for smaller infarcts, volume on average did not appear to predict clinical outcomes. However, for medium to large infarcts, there appeared to be a linear trend where an increase in volume correlated with a higher likelihood of unfavorable outcomes. This pattern has been corroborated in previous studies, which proposed that medium to large infarcts had a stronger correlation with adverse outcomes, while smaller infarcts did not [34]. The more pronounced influence of volume over location in large infarcts might be attributable to a floor effect associated with outcomes.
In summary, our imaging model suggests that for smaller infarcts, the functional eloquence of the location seems pivotal, whereas, for larger infarcts, volume takes precedence. These insights provide a comprehensive interpretation reflecting the intricate decision criteria of the model.
In this study, the utilization of Grad-CAM as a CNN-specific attribution method carries two primary implications. However, it is vital to note that Grad-CAM does not provide a definitive rationale for predictions. Even if the model focuses on a specific region, that region might not necessarily be the primary basis for classification. Activation intensity might not always align with the significance in class decisions, and sometimes, less activated areas might have a more significant impact on the decision. Furthermore, Grad-CAM inherently does not provide quantitative information. Its main role is to serve as a starting point for inferring the model's classification rationale, with the interpretation largely left to developers and clinicians.
In this context, our heatmap seems to suggest two possible interpretations. Firstly, while it is challenging to explain definitively, the observation that the heatmap increases uniformly across the entire stroke rather than just a part of it suggests the possibility that the model has captured the volume information of the lesion. Previous studies indicating a relationship between the volume of the stroke and outcomes 3 months later support the benefits gained from capturing this volume information. Secondly, the inclusion of non-contributing regions like the ventricles, eyeballs, and air-tissue boundaries might hint at the possibility that these areas were used as landmarks to recognize the relative position of the lesion. Since our study did not spatially align brain images to a specific atlas, using such landmarks to determine the relative position of the lesion might aid in discerning whether that area is related to essential brain regions. Yet, such interpretations remain speculative, echoing the sentiments shared across most CNN-based studies where the onus of deciphering the model's workings is entrusted to developers and clinicians.
This study had several limitations. First, due to the retrospective nature of the investigation, the performance results may be insufficient to establish the clinical utility of the ML model. Well-designed prospective cohort studies are required to provide clear evidence for clinical use. Second, the proposed ML model uses only DWI obtained at one time point during admission. Using image data from additional time points or other imaging modalities might yield performance gains that were not captured here. The use of further advanced imaging or bio-signal data, such as vital signs, Holter ECG, or EEG, as additional features may aid ML models in learning deeper pathophysiologic mechanisms of ischemic strokes. Third, this study's lack of external validation raises concerns about its generalizability. However, this study utilized a large dataset from three stroke centers, and performance was evaluated on a held-out test set that was not used during model development.
In conclusion, this study proposed an ML system that combines a CNN model using imaging data with a clinical model utilizing optimal clinical features to automatically segment infarct lesions and classify 3-month functional outcomes in stroke patients. The integrated model demonstrated superior performance compared to the clinical model, imaging model, and traditional risk-scoring models. The study also explored the use of Grad-CAM to visualize the brain regions on which the CNN model based its classification.

Figure 1. Pipeline of the proposed prediction model for classifying 3-month functional outcome. DWI diffusion-weighted imaging, ADC apparent diffusion coefficient, mRS modified Rankin Scale, NIHSS National Institutes of Health Stroke Scale, END early neurologic deterioration. a Model S: Lesion segmentation model using U-Net with a ResNet152 backbone. b Model A: Model using only clinical features (initial NIHSS, age, END). c Model B: Model using only images (DWI, ADC, and predicted lesion mask). d Model C: Model integrating the probability estimates for unfavorable outcomes from Model B with the clinical features (initial NIHSS, age, and END). e Machine learning algorithms: Logistic regression, random forest, light gradient boosting model, multilayer perceptron.

Figure 2. Receiver operating characteristic curves (left) and probability distribution (right) of binomial prediction models for predicting 3-month functional outcome in the test set. The probability distribution shown is from Model C. A log scale was applied to the x-axis.

Figure 3. Receiver operating characteristic curves of the previous models and the integrated model predicting 3-month functional outcome in the test set.

Table 1. Comparison of the performance of models predicting 3-month functional outcomes on the test set (n = 822). AUROC area under the receiver operating characteristic curve, PPV positive predictive value, NPV negative predictive value, LR logistic regression, RF random forest, LGBM light gradient boosting model, MLP multilayer perceptron, CNN convolutional neural network. a The model using only clinical features. b The model using only images. c The model using both clinical features and images.