Introduction

Stroke is an acute life-threatening disease that mostly occurs in out-of-hospital settings1. Emergency medical services (EMS) personnel, who are the first healthcare providers to respond to patients with a suspected stroke, evaluate the risk of stroke and transport those patients to hospitals. Early initiation of therapeutic approaches, including endovascular therapy for large vessel occlusion (LVO)2,3 is key to improving the clinical outcomes of strokes4. Thus, higher precision of stroke prediction in prehospital settings may contribute to improving the quality of stroke care and clinical outcomes.

Prehospital diagnostic algorithms for strokes, which are also known as stroke scales, have been developed substantially5,6,7. There could be a propensity to place importance on limiting the number of predictive values for simplification rather than enhancing predictive ability in the prediction precision for strokes or LVOs (area under the receiver operating characteristic curve [AUROC], stroke prediction, Cincinnati Prehospital Stroke Scale [CPSS] 0.8135, Japan Urgent Stroke Triage [JUST] Score 0.800–0.886; LVO prediction, 8 prehospital stroke scales 0.72–0.837). Recent advances in statistical approaches using machine learning have significantly improved the precision of predicting algorithms for acute diseases by using multiple predictive factors, including out-of-hospital cardiac arrest, acute coronary syndrome, and sepsis8,9,10. However, few investigations have focused on prehospital stroke/LVO diagnostic algorithms using machine learning.

Therefore, we tested the hypothesis that prehospital stroke diagnostic algorithms using machine learning had a high predictive value. We prospectively enrolled a large cohort of patients with suspected stroke in prehospital settings and analyzed them using five machine learning algorithms.

Results

Baseline characteristics and outcomes

In a training cohort as a derivation cohort, 834 patients had a stroke (training cohort, Table 1). Patients with strokes had significantly increased age, increased past history of heart diseases (atrial fibrillation and hypertension), decreased past history of diabetes mellitus, and decreased past history of neurological diseases compared to non-stroke patients. In terms of vital signs, patients with strokes had decreased heart rates, increased arrhythmia, increased blood pressure, and decreased probability of impaired Glasgow coma scale (GCS) components compared to non-stroke patients. Patients with strokes had decreased dizziness and convulsion and increased upper limb paralysis, hemiparalysis, facial palsy, and dysarthria compared to non-stroke patients. Onset timing (hourly) was earlier in patients with strokes than in those without strokes. Similar differences were observed in the test cohort data as a validation cohort (test cohort, Supplementary Table S2).

Table 1 Baseline characteristics and clinical outcomes in the training cohort.

Prediction of stroke

In the primary analysis of stroke prediction using logistic regression, random forest, support vector machines (SVM), and eXtreme Gradient Boosting (XGBoost), XGBoost had the highest predictive values (AUROC 0.994 [confidence interval; CI 0.991–0.997]) in the training cohort. In the test cohort, the XGBoost model also had the highest predictive value of the five machine learning approaches (AUROC 0.980 [CI 0.962–0.994]) (Table 2 and Fig. 1a, b). The SHapley Additive exPlanation (SHAP) summary plot revealed that the major predictive contributors for stroke were “sudden headache,” “upper limb paralysis,” “convulsion,” “sudden impaired consciousness or headache,” “systolic blood pressure,” “arrhythmia,” “conjugate deviation,” and “diastolic blood pressure” (Fig. 1c).

Table 2 Prehospital stroke prediction using machine learning.
Figure 1
figure 1

Receiver operating characteristic curve and the SHAP value of prehospital stroke prediction. (a) Training cohort (derivation cohort). (b) Test cohort (validation cohort). (c) SHAP value of stroke. AUROC (area under the receiver operating characteristic curve), CI (confidence interval), XGBoost (eXtreme Gradient Boosting), SVM (support vector machine), SHAP (SHapley Additive exPlanation), GCS M (Glasgow coma scale, best motor response), onset hour \(\left( {\cos \left( {2\pi \frac{{hour_{onset} }}{24}} \right)} \right)\).

Prediction of stroke subcategories

We next analyzed the prediction algorithms for stroke subcategories (acute ischemic stroke [AIS] with /without LVO, intracranial hemorrhage [ICH], and subarachnoid hemorrhage [SAH]) using XGBoost, which was the best approach in the primary analysis. The machine learning-based prediction algorithms had high predictive values (test data, AUROC [CI], AIS with LVO 0.898 [0.848–0.939], AIS without LVO 0.882 [0.836–0.923], ICH 0.866 [0.817–0.911], SAH 0.926 [0.874–0.971]) (Table 3 and Supplementary Fig. S2). The SHAP summary plot of AIS with LVO revealed that the major contributors were “GCS V,” “onset hours,” “age,” “arrhythmia,” “hemiparalysis,” “systolic blood pressure,” “Japan Coma Scale (JCS),” and “minimum thermo-hydrological index (THI)” (Supplementary Fig. S2).

Table 3 Prehospital stroke subcategory prediction using XGBoost.

Discussion

In this study of prehospital stroke prediction using machine learning, the algorithm using XGBoost had a high predictive value for strokes and stroke subcategories including LVO. It can be technically feasible that EMS personnel input required data for the predicting algorithm at the scene using tablet PCs and that the EMS and hospital personnel utilize the results of prediction, which may contribute to improving a quality of prehospital care.

Substantial investigations on prehospital predicting scales for strokes have been conducted with insufficient precision5,6,7. In a meta-analysis of the CPSS for stroke prediction (study n = 3, patient n = 1366), the CPSS had a summary AUROC of 0.8135. More recent investigation of the JUST score (patient n = 2236) documented AUROCs of 0.88 and 0.80 for stroke in the training (derivation) and test (validation) cohorts, respectively6. In accordance with these stroke predictions, insufficient predictive values of LVO prediction (AUROC 0.72–0.83) were documented in eight prehospital scales including the Rapid Arterial Occlusion Evaluation (RACE), Los Angeles Motor Scale, Cincinnati Stroke Triage Assessment Tool, Gaze-Face-Arm-Speech-Time, Prehospital Acute Stroke Severity, CPSS, Conveniently-Grasped Field Assessment Stroke Triage, and Face-Arm-Speech-Time plus severe arm or leg motor deficit test7. Thus, there is an unmet need for the improvement of the prediction precision for strokes/LVOs. The machine learning approach is a potential solution to improve precision, as we have demonstrated in this study. We found that the prediction algorithms using XGBoost had a high predictive value for strokes (AUROC 0.980) and stroke subcategories (AUROC 0.866–0.926) including LVOs (AUROC 0.898).

Substantial machine learning studies for diagnosis and prognosis in acute diseases have been documented8,9,10, while machine learning studies for strokes remain insufficient. To the best of our knowledge, investigations on predicting strokes by machine learning using a prehospital dataset have rarely been conducted. The majority of machine learning studies are focused on detecting strokes from computed tomography (CT) images, and machine learning studies focused on prehospital or hospital bedside prediction of strokes are limited11. In this study, we successfully developed prehospital stroke prediction algorithms using a machine learning approach with high precision. We found only one machine learning study that predicted LVOs with prehospital data. The study used 24 variables including prehospital data on 777 adult AIS patients (LVO n = 300) who underwent CT angiography or MR angiography and received reperfusion therapy within 8 h from symptom onset in a single center. They compared artificial neural network (ANN) and conventional stroke prediction scales including the Cincinnati Prehospital Stroke Severity Scale, Field Assessment Stroke for Emergency Destination, and RACE. They found that the ANN had a higher predictive value than the conventional scales (AUROC, ANN 0.823, conventional scales 0.740–0.796)12. In addition, prediction algorithms for LVOs using machine learning were documented by You et al., whose study was not a prehospital study but a hospital study13. Their study included 300 adult stroke patients (LVO n = 130) with 24 variables and compared the predictive values of XGBoost, logistic regression, random forest, and SVM. They found that XGBoost had the highest predictive value for LVOs (AUROC 0.809)13. In agreement with these findings, we also found that XGBoost had a higher predictive value for strokes compared to logistic regression, random forest, and SVM. Thus, XGBoost appears to be a better approach for this prediction.

Not only patients’ baseline factors, but also environmental factors, increased the risk of stroke14. In this study, of the 21 environmental factors screened, the onset timing factors including onset hour, day of the week (Monday), and minimum THI were analyzed. Stroke patients had earlier onset timing hours compared to non-stroke patients in the univariate analysis (Table 1). In addition, the SHAP analysis in the XGBoost for the identified onset hour of a stroke had an impact on the algorithms. Regarding the minimum THI, a high impact was identified in the SHAP analysis for LVOs. Few prehospital stroke scales include environmental factors; however, adding environmental factors may contribute to improving precision of prehospital stroke/stroke subcategory prediction using machine learning.

There were some limitations to this study. First, the study was a multicenter study, but conducted in a single urban region in Japan. Therefore, it remains unclear whether the algorism would have high predictive value in different areas with different backgrounds, and generalization. Second, we excluded pediatric patients, which may be a limiting factor of the present study; however, the frequency of stroke in pediatric patients was not common compared to adult and approximately 1.0 in 100,000 per year15. In addition, the predictive factors for pediatric stroke appears to be different compared to adults15. Further studies targeting pediatric stroke studies may develop different predicting algorithms. Third, we aimed for a better performance rather than limiting the number of predictors. Further development of algorithms with limited number of predictors to keep high predictive value would strengthen the study results and may lead to future implementation. Further studies including wider regions or a limited number of predictors may strengthen the findings for the prediction of strokes by machine learning.

In the conclusions, the prehospital stroke diagnostic algorithm using machine learning had a high predictive value for strokes and their subcategories including LVOs. This machine learning approach could potentially lead to precise stroke recognition in prehospital settings.

Methods

Study setting and patients

The current multicenter observational study was prospectively conducted in an urban area (Chiba city, population 1 million) in Japan, between September 2018 and September 2020. The Chiba City Fire Department covers this entire area with 26 ambulance squads, and has about 53,000 emergency dispatches per year. We included consecutive adult patients (≥ 20 years of age) with suspected stroke by EMS personnel who were subsequently transported to hospitals. There were 12 hospitals which can receive stroke patients in this entire area; all the 12 hospitals participated the study. We excluded pediatric patients, hypoglycemia proved by measurement of blood glucose, traumatic brain injury, drug abuse, undiagnosed patients who have not received diagnostic investigation based such as CT or magnetic resonance imaging (MRI). Of the 1778 patients screened, 332 who had missing diagnostic data or multiple entries were excluded and 1446 were analyzed (Supplementary Fig. S1).

The Chiba University Hospital Certified Clinical Research Review Board approved this study (No.2733) and waived the need for written informed consent, in conformity with the Ethical Guidelines for Medical and Health Research Involving Human Subjects in Japan. We posted information about this study in each ambulance. We promptly excluded the collected data when a patient or family indicated that they did not wish to participate in this study.

Data collection and definition

Data on variables for stroke prediction in the prehospital setting, including symptoms, physiological data, and medical history, were collected (Supplementary Table S1). The variables cover each of the scoring components of the National Institutes of Health Stroke Scale16. Since onset timing (daily, especially Monday, and hourly) and meteorological conditions potentially altered the risk of stroke14,17, onset time and weather data were added to the analysis.

Stroke is defined based on the National Institute of Neurdological Disorders and Stroke III18. Stroke is further subclassified into AIS with LVO, AIS without LVO, ICH, and SAH. LVO is defined as acute occlusion of the internal carotid artery, M1 or M2 portion of the middle cerebral artery, and the basilar artery. Board-certified neurologists in registered hospitals made the diagnoses based on examinations including CT, MRI, CT angiography, and MR angiography.

Missing values

We used domain knowledge to impute missing values first. The mutually imputed pairs or groups of features based on the knowledge were as follows: (i) conjugate deviation and visual field defects, (ii) dysarthria and facial paralysis, (iii) aphasia, best verbal response of the GCS, the JCS, and consciousness-related features.

For the remaining missing values of highly correlated variables (correlation coefficient > 0.7), multivariate imputation was applied between the variables using a regressor model, in which each feature with missing values is modeled as a function of other features. The mutually imputed pairs or groups of features were as follows: (1) systolic and diastolic blood pressure, (2) left and right pupil sizes, (3) left and right pupillary light reflex, (4) the GCS, the JCS, and consciousness-related features, and (5) paralysis-related features. Other missing values of the numerical features were imputed with each median value. For any other categorical attributes, the missing values were replaced with a new subcategory “Unknown”.

Note that XGBoost can handle missing values, unlike the other methods. Therefore, we trained two XGBoost models, one with the data before and the other after applying imputation, and confirmed that the imputation did not cause a decrease in performance (paired sampled t-test’s P-value = 0.268 for tenfold cross validation scores). For the purpose of comparison between the other models (logistic regression, random forest, SVM) and XGBoost, we used the imputed data for all analyses in this study.

Feature selection

Among the 59 features collected by EMS personnel and onset time variables, 51 features were selected after primitive cleaning of the feature candidates with a large fraction (over 40%) of missing values (2 features) or low variance (5% cutoff, 6 features).

For the meteorological features, we calculated Pearson’s chi-square values for all pairs between the stroke category and the 21 meteorological features (Supplementary Methods) and selected the features with P-values < 0.05 as feature candidates. These selected meteorological features were further reduced in number by a forward stepwise selection method, in which only features that improved the performance of the model were selected by repeatedly including the features one by one. Finally, only one meteorological feature, the ‘minimum THI on the onset day’ was selected.

Because we aimed to improve performance, rather than simply limit the number of predictors, we included 52 features for all analyses in this study.

Statistical analysis

Of the 1446 patients analyzed, 1156 (80%) were randomly included in a training cohort as a derivation cohort, and 290 (20%) were in a test cohort as a validation cohort. First, we developed the binary classification models for stroke classification as a primary outcome based on five common machine learning algorithms: logistic regression, random forest, SVM with linear or radius basis function kernels, and XGBoost. Then, as a secondary outcome, we built a multi-class classification model to predict each stroke category of AIS with/without LVO, ICH, and SAH. Based on the primary analysis that found XGBoost to be superior, we chose XGBoost for the secondary outcome analysis.

The parameters of the five machine learning models were selected by using the grid search method, in which we further split the training cohort into five folds, trained each of the five sets of data, and selected the parameters of the model that performed the best. It should be noted that the data are imbalanced. More weight must be set toward the minor classes in any model when the loss functions are calculated so that the major and minor classes are fairly evaluated.

The performance of the models was measured in terms of the AUROC as a superior metric, as well as sensitivity, specificity, and the F1 score. We used the SHAP algorithm19 of the XGBoost model to interpret the contributions of each feature to the predictive model. In the algorithm, the SHAP value was computed by a difference in model output resulting from the inclusion of a feature in the algorithm, providing information on each feature’s impact on the output. In the SHAP summary plots, every violin plot is composed of all data points of each feature with a higher value being redder, and a lower value being bluer. The violin plots are aligned with the SHAP value along the x-axis. Thus, a redder/bluer violin plot on the right side (i.e., higher positive SHAP value) suggests that the higher/lower the values of the feature are, the more the model predicts towards positive/negative impact.

Data are expressed as medians (interquartile ranges) for continuous values and absolute numbers and percentages for categorical values. Two-tailed P-values < 0.05 were considered significant. Analyses were performed using Python 3.7.6 packages (Scikit-learn 0.23.2, XGBoost 1.1.1, Pandas 1.1.5, and NumPy 1.19.2) to construct the machine learning models. The Python packages including Scikit-learn, XGBoost, Pandas, Numpy, and Matplotlib, and the SHAP package are all open-source packages. Permission to use these packages is granted, free of charge, to any person (Python License: https://docs.python.org/3/license.html, SHAP: https://github.com/slundberg/shap/blob/master/LICENSE). All figures in this study were drawn using the visualization package in Python, Matplotlib (3.3.4)20,21.