Random forest-based prediction of stroke outcome

We investigated the clinical, biochemical and neuroimaging factors associated with the outcome of stroke patients in order to generate a machine learning model that predicts mortality and morbidity 3 months after admission. The dataset consisted of patients with ischemic stroke (IS) and non-traumatic intracerebral hemorrhage (ICH) admitted to the Stroke Unit of a European tertiary hospital and prospectively registered. We identified the main variables for a Random Forest (RF) model that estimates patient mortality/morbidity in the following groups: (1) IS + ICH, (2) IS, and (3) ICH. A total of 6022 patients were included: 4922 (mean age 71.9 ± 13.8 years) with IS and 1100 (mean age 73.3 ± 13.1 years) with ICH. NIHSS at 24 and 48 h and axillary temperature at admission were the most important variables for the evolution of patients at 3 months. The IS + ICH group was the most stable for mortality prediction [area under the receiver operating characteristics curve (AUC) of 0.904 ± 0.025]. The IS group presented similar results, although variability between experiments was slightly higher (AUC of 0.909 ± 0.032). The ICH group was the one in which RF had the most difficulty making adequate predictions (AUC of 0.9837 in the best experiment vs. 0.7104 in the worst). There were no major differences between the IS and IS + ICH groups for morbidity prediction (AUC of 0.738 and 0.755). A Shapiro-Wilk test rejected the null hypothesis that the data follow a normal distribution (W = 0.93546, p-value < 2.2e−16). Because the conditions required for a parametric test did not hold, we performed a paired Wilcoxon test under the null hypothesis that all the groups have the same performance. This null hypothesis was rejected (p-value < 2.2e−16), so there are statistically significant differences between the IS and ICH groups.
In conclusion, the RF machine learning algorithm can be effectively used in stroke patients for long-term outcome prediction of mortality and morbidity.

Forest (RF) in terms of classification performance 11,12. RF has recently been used successfully in a wide range of biomedical applications, such as the automatic detection of pulse during electrocardiogram-based cardiopulmonary resuscitation or breast cancer diagnosis using mammography images [13][14][15][16][17][18].
As to stroke, most studies focus on the use of ML methods to detect ischemic stroke (IS) lesions using neuroimaging data [19][20][21][22][23] and on outcome estimation [24][25][26][27][28]. Only recently, however, has a study evaluated stroke outcome prediction at 3 months also in a group of non-traumatic intracerebral hemorrhage (ICH) patients, using a nationwide disease registry 27. Previous studies concluded that ML techniques can be effective for predicting the long-term functional outcome of IS patients or for predicting symptomatic intracranial hemorrhage following thrombolysis from CT images. However, all works agree on the need to carry out further studies in order to confirm results, incorporate new variables and resolve their limitations or weaknesses.
Taking into account the prevalence of cerebrovascular diseases, accurately predicting stroke evolution is essential to stratify the rehabilitation care that should be administered, especially to patients with the best chance of recovery. The administration of rehabilitative therapies to those who are unlikely to benefit from them is inefficient for the Healthcare System and inconvenient and unproductive for patients. A predictive model that identifies stroke patients at risk of deterioration would make it possible to select and follow up patients for reperfusion treatments, and to increase the control of therapeutic homeostasis, thus addressing the needs of each patient individually. Furthermore, regarding new regenerative cellular or molecular therapies, it is essential to identify the patients most likely to respond to treatment. We hypothesized that models developed with ML techniques based on the demographic, clinical, biochemical and neuroimaging variables obtained in the first 48 h after stroke are accurate predictors of stroke mortality and morbidity at 3 months.

Results
We included 6022 patients in the study; 4922 (81.8%) presented with IS and 1100 (18.2%) with ICH. We excluded 228 patients who died during the first 24 h and 84 with no follow-up at 3 months. The 65 features of the different groups included in the experimental dataset are shown in Tables 1, 2 and 3. Figure 1 shows a flowchart of the patient groups with their functional outcome, divided into morbidity and mortality.
Of the 4922 IS patients valid for this study, 55.2% were male and 44.8% female; the mean age was 71.9 ± 13.8 years. According to the TOAST classification, 1127 patients were classified as atherothrombotic (22.9%), 1786 as cardioembolic (36.3%), 428 as lacunar (8.7%) and 1520 as undetermined (30.9%). Poor functional outcome at 3 months was found in 47.5% of IS patients, thus showing a morbidity of 33.4% and a mortality of 13.2%.
Using filter feature selection, the dataset was reduced to only 7 variables: National Institute of Health Stroke Scale score at admission [NIHSS (0)]; NIHSS score at 24 h [NIHSS (24)]; NIHSS score at 48 h [NIHSS (48)]; axillary temperature at admission [T(0)]; early neurological deterioration [ED]; leukocytes at admission [LEU (0)]; and blood glucose at admission [GLU (0)]. With these variables, RF was much more stable and the variation between experiments was reduced. The most important variables, taking into account the three groups of patients analyzed, were NIHSS (48) and NIHSS (24). In the IS and ICH patient groups, the importance of ED, T(0) and NIHSS (0) should be highlighted. NIHSS (0) was also observed to be more important in patients with ICH than in those with IS when the models do not have data from both types of patients. Its importance, however, appears to be considerably reduced when the model has the complete set. Finally, LEU (0) and GLU (0) are variables that help balance the results for the complete model, reducing the variability seen in the individual IS or ICH models.
The variation in area under the receiver operating characteristics (ROC) curve (AUC) obtained across all the repetitions is detailed in Fig. 2B,C for the three experiments performed. The complete problem with the two types of patients is the most stable, with minimal deviation between experiments [median AUC of 0.904 ± 0.025 and accuracy (ACC) of 0.825 ± 0.030]. On the other hand, the ICH problem is the one in which RF has the most difficulty making adequate predictions: the results vary by more than 20 AUC points between the best experiment (AUC of 0.9837 with ACC of 0.94) and the worst (AUC of 0.7104 with ACC of 0.6122), with values over 100 repetitions of 0.875 ± 0.048 for AUC and 0.8 ± 0.052 for ACC. This prediction is therefore the most complex for the model.
As to the IS problem, RF presented values similar to those of the IS + ICH prediction problem, although variability between experiments is slightly higher (AUC of 0.909 ± 0.032 and ACC of 0.833 ± 0.040). This led us to conclude that there is enough information within the selected variables so that, when RF has enough patients in the dataset, the model predicts very accurately which patients are most likely to die on the basis of the data collected at admission. It was also observed that when there is a greater amount of data and patients are stratified into the three categories, the model is much more stable and results are better. A 2D-heatmap of mortality predictions against NIHSS (48) and NIHSS (24) is detailed in Fig. 3 in order to explain the decision boundary of the model. Note that the misclassified items are highlighted and that the intensity of the colors indicates the certainty of the prediction. We show the IS + ICH group as it was the most stable for mortality prediction (AUC of 0.904 ± 0.025).

Prediction of morbidity.
Figure 4A shows that NIHSS (48) and NIHSS (24) are again the most important variables for the model in the three groups studied in relation to morbidity prediction. However, the variable ED for the IS and total groups, and GLU (0) for ICH patients, appear to provide relevant predictive capacity. NIHSS (0) was identified by the model as a variable with negative effects on classification for the IS patient group, as it worsens prediction. For the ICH patient group, however, NIHSS (0) is a key variable. Figure 4B,C shows that there were no major differences between the IS and IS + ICH groups (AUC of 0.738 and 0.755 and ACC of 0.683 and 0.700), but there were differences with the ICH group (AUC of 0.667 and ACC of 0.618).
AUC with 7 input variables. Figure 5 shows the comparison of ROC curves for the most important variables of the ML model in the IS + ICH, IS and ICH patient groups in relation to mortality and morbidity prediction. In the IS + ICH and IS patient groups, the NIHSS (24) and NIHSS (48) variables obtained the best AUC values for both morbidity [0.677 (CI 95% 0.662-0.692) vs. 0.703 (CI 95% 0.686-0.719); and 0.669 (CI 95% 0.654-0.684) vs. 0.697 (CI 95% 0.680-0.713)] and mortality [0.888 (CI 95% 0.876-0.900) vs. 0.892 (CI 95% 0.878-0.906); and 0.897 (CI 95% 0.885-0.908) vs. 0.899 (CI 95% 0.885-0.913)] prediction.

Discussion
It is difficult but essential to accurately predict functional outcomes after stroke. Outcome prediction plays an important role in long-term decision making, patient treatment, organization of Health Centers, and domestic conditions. Plans could be developed on the basis of a better prediction of the degree of recovery of each patient, with appropriate and individualized rehabilitation measures that take into account the domestic and economic conditions, leading to shared decisions with patients, relatives, and sociomedical centers 27,29.
The conventional approach to the evaluation of stroke outcome data resorts to classical statistical models (logistic regression). Logistic regression models identify and validate predictive variables. Their main advantage is that they can be easily implemented and interpreted 24. ML algorithms have the potential to outperform conventional regression because they are able to capture nonlinearities and complex interactions among multiple predictor variables. They can also handle large-scale multi-institutional data, with the added advantages of easily incorporating newly available data to improve prediction performance and of better handling a large number of predictors 30.
A recent systematic review found that ML was not superior to logistic regression in clinical prediction modeling 31. However, inconsistent conclusions have often been found when comparing the performance of classical models with that of different machine learning algorithms in applied clinical studies. These studies agree that further research is needed to assess the feasibility and acceptance of ML applications in clinical practice 32,33. Different research works have proposed strategies for stroke prediction based on ML algorithms with excellent results, but with great diversity in the variables analyzed (clinical, molecular markers or imaging), the calibration/training protocols performed, and the models implemented (neural networks, tree-based and kernel-based methods). Some limitations of these previous studies are: (1) the small sample size used; (2) the characteristics of the patients evaluated, since most of the studies only evaluate IS subgroups, such as IS patients treated with rtPA or endovascular intervention; (3) the use of demographic, clinical, molecular or neuroimaging variables independently, without relating them to one another; and (4) the small number of variables used in the ML models. Asadi 28.
In our study, we analyzed an ML model of stroke prediction at 3 months using the Hospital's Stroke Registry (BICHUS) on the basis of demographic, clinical, molecular and neuroimaging variables. Mortality and morbidity were evaluated by identifying the main variables for the ML model. Data were studied as a whole (IS + ICH) or as independent subsets. Our ML classifiers exhibited high performance for the mortality outcome, with AUC values between 0.875 and 0.909 in the three groups evaluated. The IS group had the best results (n = 4922). The model indicates that the most relevant variables are NIHSS (48) and NIHSS (24). In addition, the variable NIHSS (0) is also important for the ICH patients (n = 1100). The rest of the variables provide only marginal information, although the importance of T(0) and ED should not be disregarded. For the morbidity outcome, on the other hand, AUC values between 0.667 and 0.755 were found in the three groups evaluated. The model developed indicates that the most relevant variables are NIHSS (48) and NIHSS (24), although ED for the IS group and GLU (0) for the ICH group provide predictive capability. Compared with the ROC curve analysis of the 7 individual input variables, the ML classifier performed well in the three groups, with NIHSS (0), NIHSS (24) or NIHSS (48) as the most influential predictors.
The findings in this report are subject to at least four limitations. First, this was a retrospective, single-center study with a relatively small clinical dataset. The intrinsic need for large training datasets may affect the accuracy of ML models: they could be overfitted to irrelevant clinical predictors, or some predictors may be underestimated, thus increasing random errors. It is important to note that the variables selected in our study had been previously identified by means of a T-test and supervised by expert neurologists. However, in the future, both training and validation procedures will need to include a multicenter dataset and a prospective study to verify the model and the variables obtained. Second, the IS and the ICH patient groups were unbalanced. We consider, however, that the two types of stroke should be studied independently to find both differences and similarities. Third, we used the RF machine learning algorithm, although other models like DNN or deep logistic regression could be used for comparison purposes. Fourth, we used typical clinical variables as inputs for the ML model, and we did not stratify patients into different subgroups, which could improve the results presented. We consider that it would be useful to evaluate the common variables from a clinical point of view, so once again the emphasis is on the importance of NIHSS, axillary temperature and blood glucose.
The major strengths of the present study include the large sample size (6022 patients; 4922 with IS and 1100 with ICH), which enabled detailed study of the combination of different stroke types (IS + ICH, IS, and ICH). Furthermore, to derive a global risk score for stroke, we evaluated and interrelated demographic, clinical, biochemical and neuroimaging variables. Another distinctive feature of this analysis compared with previous studies is that we also included molecular markers associated with inflammation (leukocytes, fibrinogen and C-reactive protein) and with endothelial and atrial dysfunction (microalbuminuria and NT-proBNP).

Conclusions
Machine learning algorithms, particularly Random Forest, can be effectively used in long-term outcome prediction of mortality and morbidity of stroke patients. NIHSS at 24 and 48 h and axillary temperature are the most important variables to consider in the evolution of the patients at 3 months. Future studies could incorporate the use of imaging and genetic information. Furthermore, the robust model developed could be used in other applications and different scopes with similar data, such as traumatic brain injury or dementia (Alzheimer's and Parkinson's disease).

Materials and methods
Patient selection. The dataset used in this research work consisted of patients with IS and ICH admitted to the Stroke Unit of the Hospital Clínico Universitario of Santiago de Compostela (Spain), who were prospectively registered in an approved data bank (BICHUS). All patients were treated by a certified neurologist according to national and international guidelines. Exclusion criteria for this analysis were: (1) patients who died during the first 24 h, and (2) loss of follow-up (personal interview or telephone contact) at 3 months.
The analysis of the data for this study was retrospective, from September 2007 to September 2017. This research was carried out in accordance with the Declaration of Helsinki of the World Medical Association (2008) and approved by the Ethics Committee of Santiago de Compostela (2019/616). All patients or their relatives signed the informed consent for inclusion in the registry and for anonymous use of their personal data for research purposes.
Demographic, clinical, molecular and neuroimaging variables. The registry includes demographic variables, previous medical history and vital signs. Blood samples for hemogram, biochemistry and coagulation tests were obtained and analyzed at the central hospital laboratory. Neurological deficit was evaluated by a certified neurologist using the National Institute of Health Stroke Scale (NIHSS) at admission, and every 24/48 h during hospitalization. The modified Rankin Scale (mRS) was used to evaluate functional outcome at discharge and at 3 months 33,34.
Effective reperfusion of IS patients was defined as ≤ 8 points in the NIHSS during the first 24 h. Early neurological deterioration was defined as an increase of ≥ 4 points in NIHSS within the first 48 h with respect to the baseline NIHSS score. Poor functional outcome was defined as mRS > 2 at 3 months, morbidity as 3 ≤ mRS ≤ 5, and mortality as mRS = 6. Ischemic stroke diagnosis was made using the TOAST criteria 35.
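The outcome definitions above translate directly into labeling rules. The following is a minimal Python sketch of that labeling, not the code used in the study; the function and argument names are hypothetical, and early neurological deterioration is assumed to mean an increase of at least 4 NIHSS points over baseline at any assessment within the first 48 h.

```python
def classify_outcome(mrs_3m):
    """Map a 3-month mRS score (0-6) to the outcome labels defined in the text."""
    if not 0 <= mrs_3m <= 6:
        raise ValueError("mRS must be between 0 and 6")
    return {
        "poor_outcome": mrs_3m > 2,     # poor functional outcome: mRS > 2
        "morbidity": 3 <= mrs_3m <= 5,  # morbidity: 3 <= mRS <= 5
        "mortality": mrs_3m == 6,       # mortality: mRS = 6
    }


def early_deterioration(nihss_baseline, nihss_first_48h):
    """Return True if any NIHSS score in the first 48 h is >= 4 points above baseline.

    Assumption: 'deterioration' is read as an increase relative to baseline.
    """
    return any(score - nihss_baseline >= 4 for score in nihss_first_48h)
```

For example, a patient with mRS = 4 at 3 months counts toward both poor functional outcome and morbidity, but not mortality.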
Computed Tomography (CT) was performed in all patients and Magnetic Resonance Imaging (MRI) in selected patients at admission. A follow-up CT scan after fibrinolysis or thrombectomy was performed in all IS patients at 24 h; further CT scans were performed at 48 h or when neurological deterioration was detected, and between the 4th and 7th day. ICH and perihematomal edema volumes were calculated using the ABC/2 method 36. ICH topography was classified as lobar when it predominantly affected the cortical/subcortical white matter of the cerebral lobes, or as deep when it was limited to the internal capsule, the basal ganglia or the thalamus. All neuroimaging tests were analyzed by a neuro-radiologist supervised by the above certified neurologist.
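As an illustration of the ABC/2 estimate mentioned above, the sketch below assumes the commonly described form of the method: A is the largest hematoma diameter on the axial slice where the lesion is largest, B the diameter perpendicular to A on the same slice, and C the vertical extent (number of slices containing the lesion multiplied by slice thickness). The function name and units are assumptions made for the example, not details taken from the study.

```python
def abc2_volume_ml(a_cm, b_cm, c_cm):
    """Approximate a lesion volume as an ellipsoid: V = A * B * C / 2.

    All three diameters are given in centimeters, so the result is in
    cubic centimeters (equivalent to milliliters).
    """
    if min(a_cm, b_cm, c_cm) < 0:
        raise ValueError("diameters must be non-negative")
    return a_cm * b_cm * c_cm / 2.0
```

With diameters of 4, 3 and 2 cm, the estimate is 12 mL.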
Outcome endpoints. The objective of this research work was to identify the main predictors and use them to generate a machine learning model for the prediction of mortality and morbidity of stroke patients according to their stratification into one of the following groups: (1) IS + ICH, (2) IS, and (3) ICH.
Machine learning. We used the RF algorithm for the prediction of mortality and morbidity of stroke patients. RF is an ensemble learning method, i.e., a strategy that aggregates many predictions to reduce the variance and improve the robustness and precision of outputs [37][38][39]. A remarkable characteristic of RF is that it provides an internal measure of the relative importance of each feature for the prediction. This model generally performs well across problem types and sizes, even when the data are unbalanced or contain missing values 37. To analyze the importance of the variables used by the model, the Gini importance index was calculated. This index measures the decrease in node impurity attributable to each variable when it is selected for a split among the randomly drawn candidate variables. Each time a node splits on a variable, the Gini impurity of the two child nodes is lower than that of the parent. The final importance is not a simple sum of the values obtained in all the trees for each variable but a weighted aggregate.
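The impurity decrease behind the Gini importance can be illustrated for a single split. This is a minimal sketch with hypothetical function names; it shows the per-node quantity only and does not reproduce the aggregation over nodes and trees performed by the RF implementation used in the study.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two child nodes."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left)
        + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children
```

A node with two survivors and two deaths has impurity 0.5; a split that separates them perfectly yields two pure children, so the decrease credited to the splitting variable is 0.5.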
Generally speaking, all ML algorithms have a number of hyperparameters that must be optimized to obtain the best results for the particular problem they are analyzing. We used R 40,41 and the following packages: mlr 42 to calculate the best number of trees (ranging from 500 to 1000); randomForest 43 for our experiments; and ggplot2 44 for the graphics used in data analysis.
Data pre-processing. Balanced classes in classification problems are critical for ML algorithms. When analyzing the problem, we initially obtained AUC values lower than 0.65 in the best of cases using 65 features and four different ML algorithms, which is considered poor predictive performance. This is mainly due to the fact that our dataset contains a much higher percentage of patients who survived than of patients who died, and that we found noise and correlation between features, confounding the predictors. These numbers show an unbalanced problem that needs to be addressed, since a predictive model for patient death is being generated.
There are two main approaches to balancing the data: (1) oversampling the minority class or (2) undersampling the majority class. Although these are very powerful techniques that are able to increase the performance of the classifiers, they must be handled cautiously, more so in medical problems, to prevent overfitting or the loss of generalization capacity in the models when new synthetic samples are included (oversampling) at the learning phase of the algorithms 45. In this work, undersampling methods (random undersampling from the majority class) were assessed for class balancing purposes. To ensure that the undersampling process is fair and that the generalization capability of the models is not biased, we ran 100 repetitions, each with a different random undersampling, of a tenfold cross-validation experiment to observe the behavior and the stability of results. The more stable the results, the better the random removal process: stability means that the remaining samples of the majority class captured the underlying knowledge of the class.
The experimental design developed to analyze the original data included a data preprocessing phase with balancing of the classes, followed by a tenfold cross-validation with 100 repetitions. For each of these repetitions, the position of each patient in the dataset was randomized. The preprocessed data were also randomized to avoid any potential process-related bias.
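The resampling scheme described above (a fresh random undersampling for every repetition, followed by a tenfold cross-validation) can be sketched as follows. The study was carried out in R; this Python version is only an illustration under stated assumptions: the function names are hypothetical, the folds are not stratified by class, and a fixed seed is used for reproducibility.

```python
import random


def random_undersample(labels, rng):
    """Return shuffled indices with every class cut down to the minority-class size."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_min))  # sampling without replacement
    rng.shuffle(kept)  # randomize the position of each patient
    return kept


def kfold_splits(indices, k=10):
    """Yield (train, test) index lists for k folds of roughly equal size."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def repeated_balanced_cv(labels, n_repeats=100, k=10, seed=0):
    """Yield (repetition, train, test); each repetition draws a new undersample."""
    rng = random.Random(seed)
    for rep in range(n_repeats):
        balanced = random_undersample(labels, rng)
        for train, test in kfold_splits(balanced, k):
            yield rep, train, test
```

Fitting a model on each `train` list and scoring it on the matching `test` list gives one performance value per fold, whose spread across repetitions indicates how sensitive the results are to the particular majority-class samples that were kept.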
The problem was broken down into six different but complementary and informative problems: mortality and morbidity prediction with IS, ICH or IS + ICH patients.This approach sought to analyze more exhaustively the differences between the different types of patients when predicting death/poor outcome and to analyze whether the variables with more weight in the prediction were the same in all the cases.
In order to identify which of the 65 variables available are the most informative, we performed feature selection.There are mainly three different approaches for feature selection in machine learning: filter, wrapper and embedded 46 .Filter methods assess the relevance of each feature by looking only at the intrinsic properties of the data (independent of the algorithms).We calculated a feature relevance score (T-test) on the training data, and low-scoring features were removed choosing a manual cut-off point to reduce the variance of the models (see Supplementary Fig. S1 and Supplementary Material).
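A filter of this kind can be sketched by scoring every feature with a two-sample t statistic between the outcome classes and ranking the features by absolute score. The text does not state which variant of the T-test was used; Welch's unequal-variance form is assumed here, and the function names and data layout (rows of numeric features, labels coded 0/1) are hypothetical.

```python
from math import copysign, sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic for two samples (each needs at least two values)."""
    diff = mean(x) - mean(y)
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    if se == 0:
        # No spread in either sample: identical means score 0, otherwise infinite.
        return 0.0 if diff == 0 else copysign(float("inf"), diff)
    return diff / se


def rank_features(rows, labels, names):
    """Return (name, |t|) pairs sorted from most to least discriminative."""
    scores = []
    for j, name in enumerate(names):
        positives = [row[j] for row, y in zip(rows, labels) if y == 1]
        negatives = [row[j] for row, y in zip(rows, labels) if y == 0]
        scores.append((name, abs(welch_t(positives, negatives))))
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

Features below a manually chosen cut-off in this ranking would then be dropped. To avoid information leaking into the evaluation, the scores should be computed on the training data only, as the text specifies.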

Statistical analysis.
For the descriptive study, results were expressed as percentages for categorical variables and as mean (SD) or median (quartiles) for continuous variables, depending on whether their distribution was normal or not. The Kolmogorov-Smirnov test was used for testing the normality of the distribution. To measure the performance of the model we used the area under the receiver operating characteristics (ROC) curve (AUC or AUROC) 47. To train and validate the model we used tenfold cross-validation. AUC results are presented as mean ± SD calculated over the tenfold validation sets. To test whether logistic regression and the ML models obtain similar AUC values, ROC curve analysis was used to compare the 7 input variables selected for the ML experiments in the different patient groups as potential clinical markers of morbidity and mortality at 3 months. The descriptive statistical analysis was conducted in SPSS 25.0 (IBM, Chicago, IL) for Mac.
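For reference, the AUC can be computed without plotting the ROC curve, as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below uses that pairwise definition together with a helper for the mean ± SD summary; it is an illustration with hypothetical function names, not the routine used in the study.

```python
from statistics import mean, stdev


def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


def mean_sd(values):
    """Summarize per-fold results as (mean, sample standard deviation)."""
    return mean(values), stdev(values)
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a single validation fold; a rank-based formulation gives the same value more efficiently on large datasets.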

Figure 3. 2D-heatmap of mortality (EXT) predictions against NIHSS (48) and NIHSS (24). Model results are shown for the IS + ICH group, as it was the most stable for mortality prediction (AUC of 0.904 ± 0.025). Red areas correspond to patients who do not die (0), blue areas correspond to patients who die (1), and misclassified items are highlighted.

Figure 5. Comparison of ROC curves of the 7 variables selected for machine learning experiments for mortality and morbidity prediction at 3 months in the different patient groups evaluated. (A,B) Morbidity and mortality of IS + ICH group. (C,D) Morbidity and mortality of IS group. (E,F) Morbidity and mortality of ICH group.

https://doi.org/10.1038/s41598-021-89434-7

Table 1. Demographic variables of the experimental dataset of patients summarized by group. Columns: IS + ICH (n = 6022), IS (n = 4922), ICH (n = 1100).

Prediction of mortality. Figure 2A shows the most important variables for the model in the IS, ICH and IS + ICH patient groups in relation to mortality prediction. The value shown is, in all cases, the sum of the importance obtained internally by the algorithm for the variable in each of the experiments.

Table 2. Clinical and neuroimaging variables of the experimental dataset of patients summarized by group.

Table 3. Molecular markers and outcome at 3 months of the experimental dataset of patients summarized by group.