Introduction

Stroke is a leading cause of disability and death worldwide1,2,3. Effective treatment is time-sensitive, and an optimal outcome is more likely when treatment is administered within the first four and a half hours from stroke onset4,5. The gateway to ambulance transport and hospital admittance is through prehospital telehealth services, including emergency medical call centres, nurse advice call lines, and out-of-hours health services. In the prehospital setting, the use of mobile stroke units has made it possible to deliver advanced treatment faster6,7. As the mobile stroke unit is only dispatched to patients with a suspected stroke, the impact of the mobile stroke unit is directly influenced by accurate call-taker recognition of stroke6,7. Call-takers who can rapidly and accurately recognise stroke are therefore crucial in facilitating prompt care in both prehospital and in-hospital settings.

Despite initiatives to improve stroke recognition8,9, approximately half of all patients with stroke do not receive the correct triage for their condition from call-takers10,11,12. Most initiatives aim to improve stroke recognition by call-takers by introducing more specific assessment tools8,9 or by providing specialised training13. Recent advances in machine learning technology might be applied to improve stroke recognition without requiring changes to the triaging approach, and machine learning-aided identification of stroke has been suggested as a means of improving mobile stroke unit effectiveness7. Real-time feedback from a machine learning model can improve the recognition of out-of-hospital cardiac arrest14,15. Therefore, this study aimed to develop a machine learning framework and assess its potential for improving prehospital stroke recognition during medical helpline calls.

In this study, we use call recordings and registry data from the Copenhagen Emergency Medical Services (CEMS) and the Danish Stroke Registry (DanStroke) from 2015 to 2020. We obtain call recordings from two call lines: the 1-1-2 emergency line and the medical helpline 1813 (MH-1813). We then fit a machine learning framework to classify medical helpline calls as stroke or non-stroke. Calls are first transcribed using an automatic speech recognition model and then categorised by a text classification model trained as an ensemble of five individual models. We compare the performance of the model with that of call-takers using MH-1813 data from 2021.

Results

Population characteristics

Calls to the MH-1813 were divided into training, validation, and test subsets, and calls to the emergency line 1-1-2 were used only as supplementary training data (Table 1). Calls from the test year (2021) that were not associated with a diagnostic category code, which we used to evaluate call-taker performance, were separated from our primary test set but were still included to assess potential bias in this group of calls (2021 w/o category, Table 1). The 1-1-2 training data differed from the MH-1813 data regarding age, male/female ratio, and stroke prevalence (Table 1). We therefore performed an ablation study in which 1-1-2 data were not used for training to assess whether this difference negatively impacted model performance. The training, validation, and test subsets of the MH-1813 data had similar characteristics, whereas the 2021 data without diagnostic categories differed in age and sex.

Table 1 Population characteristics for each data subset.

Main results

The classification model outperformed the call-takers (Table 2), with significant differences in all metrics (p < 0.0001, paired approximate permutation test). Excluding the 1-1-2 call line training data significantly degraded the model’s performance (p < 0.0001, paired approximate permutation test), despite the domain mismatch with the MH-1813 call line test data. The performance on the 2021 calls without a diagnostic category was significantly worse than that of the test set regarding F1-score, sensitivity, false positive rate (FPR), and false omission rate (FOR) (p < 0.0001, independent approximate permutation test). The difference in positive predictive value (PPV) was not significant (p = 0.298, independent approximate permutation test).

Table 2 Overall performance on MH-1813 test data, performance without 1-1-2 training data, and performance on data from 2021 without diagnostic categories, as well as performance on MH-1813 based on demographic subgroups (age/sex) [mean (95% CI)].

The receiver operating characteristic (ROC) curve (Fig. 1, left) illustrates the potential to increase sensitivity while maintaining an FPR lower than or equal to that of the call-takers. Similarly, the PPV-sensitivity curve (Fig. 1, right) demonstrates that sensitivity can be improved while retaining a PPV higher than that of the call-takers. The framework can thus be tuned to a sensitivity of around 73% while still achieving a higher PPV than the call-takers (Fig. 1, right). The ensemble model outperformed the individual models regardless of the threshold, except for one model, which exhibited slightly better sensitivity at FPRs exceeding 1.5%. The confusion matrices (Fig. 2) illustrate the performance differences in absolute numbers, with the model exhibiting more true positives and fewer false positives than the call-takers.

Fig. 1: Receiver operating characteristic (ROC) curve and PPV-sensitivity curve.

Left is the ROC curve, and right is the PPV-sensitivity curve (precision-recall curve). Models 1–5 are the individual models that make up the ensemble model.

Fig. 2: Prediction confusion matrices.

Confusion matrices of predictions for call-takers and the model on the test set. Numbers for the model are given as the rounded mean over eleven runs.

Sex and age

The model and call-takers exhibited significantly higher PPV and F1-score in men than in women (p < 0.0001, independent approximate permutation test) (Table 2). The model significantly outperformed the call-takers on all metrics for each sex (p < 0.0001, paired approximate permutation test).

The model performed significantly better in the 65+ group than in the 18–64 year group regarding sensitivity, PPV, and F1-score (p < 0.0001, independent approximate permutation test). Similarly, the call-takers performed significantly better in the 65+ group than in the 18–64 group regarding PPV and F1-score (p < 0.0001, independent approximate permutation test). Finally, the model significantly outperformed the call-takers on all metrics in both age groups (p < 0.0001, paired approximate permutation test).

Model explainability

We performed an occlusion analysis to evaluate the importance of individual words for both positive and negative classifier predictions (Table 3). Among the words with a positive rank score, several are synonymous with stroke, such as ‘blood clot’, ‘haemorrhagic stroke’, and ‘stroke’. Ambulances are rarely dispatched because the MH-1813 is not intended for emergencies; a word like ‘ambulance’ may therefore be a strong indicator of call-taker recognition, which the model has learned to mimic. Additionally, most of the remaining words can be linked to stroke-related symptoms, such as ‘double vision’, ‘difficulties speaking’, and ‘hangs’. In particular, words describing the side of the body where symptoms occur ranked high (such as ‘left’, ‘right’, and ‘side’). Finally, some words were related to the sudden onset of symptoms (including ‘suddenly’ and ‘minutes’).

Table 3 English translation of words with the largest positive and negative ranking score in calls predicted as stroke and non-stroke, respectively.

Among the words with a negative rank score, most were strong indicators of specific conditions, symptoms, or body parts that are unrelated to stroke (such as ‘tetanus’, ‘pregnant’, ‘swollen’, ‘fever’, and ‘the knee’). Another group of words described aspects of treatment that are unlikely to be addressed in a stroke call, including ‘prescription’, ‘bandage’, and ‘OTC’. Finally, a small group of words described institutions that are not commonly involved in stroke treatment (such as ‘psychiatric’, ‘the emergency room’, and ‘the police’).

Discussion

Our results showed that a machine learning framework could substantially improve stroke recognition in medical helpline calls compared to solely relying on human call-takers. This improvement was observed across all performance metrics and for basic patient demographics (age and sex). Our occlusion analysis revealed that the model relied on the relevant predictive features associated with call-taker triaging, patient symptoms, and treatment.

This study does not imply that a machine learning model can replace medical call-takers. The effectiveness of the model relies entirely on the conversation between the call-taker and the caller and on the call-taker’s ability to skillfully triage the patient. Instead, the model should be used as a supportive tool for call-takers in the decision-making process, contributing to higher recognition of patients with stroke and potentially boosting call-takers’ confidence in their decisions. A similar machine learning model designed to predict cardiac arrest was tested in a randomised controlled trial (RCT) at CEMS15. The results highlighted the necessity of incorporating input from call-takers. The machine learning model for cardiac arrest has subsequently been implemented in daily practice at CEMS, in a setup similar to the one presented in our study. However, the implementation of our framework requires further investigation. The relative performance gap between call-takers and the model was larger in our study than in the cardiac arrest study15, which may affect the results of a potential RCT.

To support future work and discussions beyond the scope of this study, the supplementary material includes the results of a simulation of a live implementation in which call-takers are assumed to follow a set of fixed rules based on the output of the machine learning framework (Supplementary Table 9). For instance, in one simulation, call-takers are assumed to change any stroke negative to a positive whenever the model predicts a positive. While the results of the simulation are encouraging, it is important to stress that it is not practically feasible to use a fixed rule set to overrule the call-taker. These results should only be seen as a preliminary indicator of a potential RCT. In practice, a nuanced set of guidelines should be developed over several iterations of implementation and testing.
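As an illustration, the override rule described above amounts to a logical OR over the two sets of triage decisions. The following is a minimal sketch under that assumption; the array names are hypothetical, and the actual simulation rules are those given in Supplementary Table 9.

```python
import numpy as np

def simulate_override_rule(calltaker_positive: np.ndarray, model_positive: np.ndarray) -> np.ndarray:
    """Hypothetical fixed rule: a call-taker stroke negative is changed to a
    positive whenever the model predicts a positive (logical OR of decisions)."""
    return calltaker_positive | model_positive

# Example: the combined triage is positive if either source flags stroke.
calltaker = np.array([False, True, False, False])
model = np.array([True, True, False, True])
print(simulate_override_rule(calltaker, model))  # [ True  True False  True]
```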

The performance gap between the model and call-takers could be explained by the rarity of stroke calls to MH-1813 (0.250% of all calls in 2021), which might affect call-taker awareness of stroke as a possible cause of certain symptoms. Additionally, certain stroke symptoms are so rare that some call-takers may never encounter them, increasing the risk of false negatives. The model was trained on more calls than any single call-taker would handle in a lifetime, enabling it to recognise even rare descriptors of stroke. The model is specifically trained to recognise strokes and exclusively learns from actual stroke descriptions, unlike call-takers, who are trained with generalised teaching materials to triage many different conditions. Therefore, call-takers may not have received specific training for patients with stroke and may never have encountered them.

The model performed significantly better on men than on women. This could be attributed to several factors. First, the model may have learned to mimic call-takers with the same bias. Second, women may experience different and more challenging-to-identify symptoms than men16,17. Third, a higher prevalence of male patients with stroke was observed in the training data. Despite these potential sources of bias, the model exhibited less bias than call-takers did. That is, the relative performance improvements were higher for women than for men. This bias could be further reduced using advanced data augmentation and balanced data when training a machine learning model. However, such measures may degrade overall performance.

The improved sensitivity and PPV in the 65+ years group may be explained by a higher prior probability of stroke for older patients and stronger evidence from the patient’s medical history. The relatively high FOR and FPR for the 65+ group are likely a result of the much higher prevalence of stroke cases compared to the 18–64-year-olds (0.85% vs. 0.07%). We did not have data to estimate potential bias related to race, ethnicity, language, accent, or dialect. Previous studies on speech recognition for call centres have indeed found that non-native speakers had a higher rate of transcription errors18. Since our model was trained on a representative (and therefore unbalanced) sample, we expect it to behave similarly. Future research should address these shortcomings, for example, by utilising self-supervised learning on massive amounts of diverse, unlabelled data covering multiple languages, accents, and dialects.

Due to European data regulations (GDPR), it was not possible to manually transcribe MH-1813 calls to train a new speech recognition model, so we had to rely on an existing solution. This also meant that we could not evaluate the word error rate (WER) of the model. Instead, we used the downstream performance of the text classification model when trained in combination with different speech recognition models to choose the best option. Since the focus of this study is the ability to correctly recognise stroke, and not the performance of the speech recognition model alone, this approach is better suited to our aim. Indeed, the WER can be misleading when choosing a speech recognition model for a specific task. For instance, one model might fail to predict redundant minimal response words (e.g., “uh” and “uhm”) and make small inflection errors (e.g., “clot” instead of “clots”), resulting in a relatively high WER, while another model might only fail to predict rare, specialised words that are highly indicative of stroke (e.g., “haemorrhage” and “thrombolysis”), resulting in a relatively low WER.

Although we believe that the proposed machine learning framework can be further improved, several alternatives have already been explored in the preliminary experimental phase. The speech recognition model we used was trained on 1-1-2 calls for a previous project14 and was therefore specialised to a domain very similar to that of MH-1813. We also tested an open-source, multilingual model from OpenAI called Whisper19, but found that performance degraded slightly compared to the model trained on 1-1-2 calls. We hypothesise that this is due to Whisper’s inability to handle the specific noise conditions and to recognise words from a specialised medical vocabulary.

For text classification, we used an ensemble of multi-layer perceptrons (MLPs). We also tested convolutional, recurrent, and self-attention (i.e., Transformer) architectures. However, this did not improve performance. In addition, we tested a pre-trained self-supervised model. Although many of these models are freely available to the public, they are primarily trained on English data. Relatively few options exist for the Danish language, none of which are specialised in the medical domain. We used a monolingual Danish BERT model, which has previously been shown to outperform a multilingual alternative from Google for Danish named-entity recognition20. However, this also did not result in a significant performance improvement. We hypothesise that the number of ground truth stroke positives was too small for these advanced models to learn more complex patterns than the MLP ensemble. In addition, a self-supervised model would likely benefit from being pre-trained on speech or text data from the target domain. Although training such large-scale foundation models has the potential to improve the classification model further, it is beyond the scope of this study. Thus, we chose the simpler MLP ensemble. Reviews of self-supervised learning for speech and text are cited in the references21,22. Notably, it is not uncommon for small, simple models to match or outperform large, pre-trained models in text classification tasks23.

This study has some limitations. First, the mapping of call recordings to electronic records was incomplete due to technical limitations in the computer-aided dispatch (CAD) registry, which limited the number of calls available to us. Of note, there was no obvious pattern of bias related to the unmapped calls, and we included all calls with matching audio files, regardless of dispatcher performance. The results could potentially be improved if more calls were available for analysis. Second, calls without a call-taker-indicated diagnostic category were not included in the validation and test data because the call-taker’s performance could not be evaluated. Moreover, in exploratory analyses, the model performed worse on these calls, which might be attributed to differences in population characteristics (Table 1). Finally, the ground truth stroke labelling relied on the patient-reported time of onset being exact; however, estimating the accuracy of the timestamps in DanStroke was impossible.

In conclusion, using the largest collection of audio calls from patients with stroke to date, we developed a machine-learning framework that significantly outperformed human call-takers in stroke recognition in medical helpline calls. The framework can assist human call-takers during medical helpline calls. Ideally, this would enable a higher recognition of patients with stroke in the prehospital setting, benefiting both patient outcomes and health service resource allocation.

Methods

Data sources

Copenhagen emergency medical services (CEMS)

The CEMS is responsible for providing prehospital telehealth services in the Capital Region of Denmark, with a catchment population of 1.9 million24. CEMS operates two call lines. The first is the 1-1-2 emergency line, similar to 9-1-1 in the United States and intended for acute conditions. The second is the medical helpline 1813 (MH-1813, pronounced ‘18-13’), intended for non-life-threatening conditions that cannot wait until a general practitioner is available25.

Call-takers for both lines, who are nurses, paramedics, or physicians, can dispatch ambulances. The condition suspected by the call-taker is categorised based on a predefined diagnostic index and stored in an electronic record using a CAD system. The CAD records are associated with the patient’s Danish civil registration number (CPR number)26. The CPR number is a unique identifier assigned to all Danish residents; it is used for interactions with health services and registries, enabling cross-referencing of the data sources used in this study. The call audio is recorded by a telephone system and stored separately from the CAD records.

Danish Stroke Registry (DanStroke)

All patients with a final diagnosis of stroke or transient ischaemic attack admitted to a Danish hospital within 5 days of symptom onset are recorded in the Danish Stroke Registry27, also known as DanStroke. Each record includes the patient-reported time of onset, stroke type (haemorrhagic, ischaemic, or transient ischaemic attack), and the patient’s CPR number. The diagnosis is made according to the national guidelines28, which include cerebral imaging and a full diagnostic workup by neurologists. The validity of the Danish Stroke Registry has been shown to be high29, and the number of stroke mimics in our dataset is therefore minimised.

Inclusion and ethics

The Danish Data Protection Agency (P-2021-475) approved this study. Danish law did not require approval from the Scientific Ethics Committee because the data were registry-based. CEMS approved the transcription of all calls made to 1-1-2 and MH-1813. All electronic records were anonymised before analysis, and the researchers did not inspect the calls manually.

Study scope

Stroke prevalence in calls made to the MH-1813 is lower than that in calls made to 1-1-2. Because MH-1813 is intended for low-acuity incidents, patients with stroke who call it may exhibit different symptoms and lower symptom severity, leading to reduced recognition. In addition, MH-1813 call-takers dispatch high-priority transport less frequently, which may affect optimal treatment timing. Therefore, we focused on MH-1813 in this study.

Stroke dataset

Cross-referencing data sources

From the CAD medical records, we included all 1-1-2 and MH-1813 calls from 2015 to 2021 from patients older than 18 that could be matched to a corresponding audio file. The CAD records were matched with the telephone call recordings based on the call start, call duration, and call-taker identity. Due to data incompleteness and the way the audio data are stored at CEMS, 2,730,199 contacts could not be matched to their corresponding audio files; 2,361,178 contacts were successfully matched. We found no obvious pattern distinguishing matched from unmatched calls, and we included all calls with a matching audio file. Next, a call was regarded as a ground truth stroke positive when the CPR number in the CAD record matched that of a DanStroke record and the patient-reported time of onset was close to the call start time. We allowed a window of 72 hours before and 24 hours after the call start to account for uncertainty in recording stroke onset time. We excluded calls involving subarachnoid haemorrhage cases. Finally, we considered a call to be a call-taker stroke positive when the call-taker selected the stroke diagnostic category during the call and dispatched an ambulance with the appropriate level of response30. To ensure that the effect of the machine learning framework was not overestimated, we excluded calls for which no diagnostic category had been registered from the test set. We still reported the population characteristics and model performance of this group of calls to assess potential bias introduced by excluding them. A data-flow diagram is included in the supplementary material (Supplementary Fig. 1). The resulting dataset is the largest dataset of audio files from stroke calls collected to date.
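The time-window criterion for ground truth labelling can be expressed as a simple check. The sketch below assumes each call carries a start timestamp and each matched DanStroke record carries a patient-reported onset timestamp; the field names, and our reading of the window as onset falling up to 72 hours before or 24 hours after the call start, are assumptions.

```python
from datetime import datetime, timedelta

# Window around the call start within which a DanStroke onset time counts as a match.
WINDOW_BEFORE = timedelta(hours=72)
WINDOW_AFTER = timedelta(hours=24)

def is_ground_truth_stroke(call_start: datetime, onset_times: list[datetime]) -> bool:
    """Return True if any patient-reported onset time for the matched CPR number
    falls within the allowed window around the call start."""
    return any(
        call_start - WINDOW_BEFORE <= onset <= call_start + WINDOW_AFTER
        for onset in onset_times
    )
```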

Dataset splitting

We reserved all the MH-1813 calls from 2021 for testing. We used stratified sampling to divide the MH-1813 calls from 2015 to 2020 into validation and training subsets. The training subset was further split into five folds, which were used for ensemble training. The calls were stratified based on the ground truth stroke label and the presence of a diagnostic category. Calls without diagnostic categories were only included in the training set. The 1-1-2 calls were used only for training; however, calls from 2021 were discarded to avoid temporal overlap with the test period.
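A stratified split of this kind could be implemented with scikit-learn as sketched below. The labels are synthetic, and combining the two criteria into a single joint stratum is an assumption about how the stratification was performed.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-ins for the two stratification variables.
rng = np.random.default_rng(0)
n_calls = 1000
stroke_label = rng.integers(0, 2, n_calls)   # ground truth stroke label
has_category = rng.integers(0, 2, n_calls)   # diagnostic category registered?
strata = stroke_label * 2 + has_category     # joint stratum preserves both distributions

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = [fold_idx for _, fold_idx in skf.split(np.zeros((n_calls, 1)), strata)]
print([len(f) for f in folds])               # five folds of roughly equal size
```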

Machine learning pipeline

We employed a two-step machine learning pipeline. First, a call was transcribed using the speech recognition model. Second, the transcript was used as input for the text classification model. The final output score was used to classify whether the call concerned a stroke. The pipeline is illustrated in the supplementary material (Supplementary Fig. 2).
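Conceptually, the pipeline reduces to the following skeleton, where the two callables stand in for the trained speech recognition and text classification models (the function names are hypothetical):

```python
from typing import Callable, Tuple
import numpy as np

def classify_call(
    audio: np.ndarray,
    transcribe: Callable[[np.ndarray], str],
    classify_text: Callable[[str], float],
    threshold: float = 0.5,
) -> Tuple[float, bool]:
    """Step 1: transcribe the call; step 2: score the transcript for stroke."""
    transcript = transcribe(audio)
    score = classify_text(transcript)  # output score between zero and one
    return score, score >= threshold
```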

Speech recognition

The call recordings from the CEMS were stored as 8-bit linear pulse-code modulated audio, sampled at 8 kHz. Each call was converted into a log-Mel spectrogram before being fed into the speech recognition model. The log-Mel spectrogram is a commonly used input representation for speech-processing tasks that facilitates the identification of linguistic content in audio signals. We used a speech recognition model with a neural network architecture31 consisting of two-dimensional convolutional layers32 and blocks of bidirectional long short-term memory layers33. The output is a sequence of probability distributions over characters of the Danish alphabet, which were converted into a human-readable transcript using a greedy decoder34.
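The sketch below illustrates the two front-end steps, log-Mel feature extraction and greedy decoding, using torchaudio; the spectrogram parameters and the character-index convention are assumptions, not the settings of the deployed model.

```python
import torch
import torchaudio

# 8 kHz audio -> log-Mel spectrogram (parameter values are illustrative).
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=8000, n_fft=400, hop_length=160, n_mels=64
)

def log_mel(waveform: torch.Tensor) -> torch.Tensor:
    return torch.log(mel(waveform) + 1e-6)

def greedy_ctc_decode(log_probs: torch.Tensor, alphabet: str, blank: int = 0) -> str:
    """Greedy decoding of per-frame character distributions: take the argmax per
    frame, collapse repeated symbols, and drop the blank symbol."""
    best = log_probs.argmax(dim=-1).tolist()   # (time,) character indices
    chars, previous = [], blank
    for idx in best:
        if idx != previous and idx != blank:
            chars.append(alphabet[idx - 1])    # assumes index 0 is the blank
        previous = idx
    return "".join(chars)
```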

Text classification

As input for the classification model, each transcript was transformed into a fixed-size bag-of-words vector, which encoded the occurrence of word and character n-grams from a fixed vocabulary. The feature selection procedure is detailed in the Supplementary Methods. The model was constructed as an ensemble35 of five identical, independently trained models, each consisting of a stack of neural network layers commonly referred to as a multi-layer perceptron36. The final layer has a single scalar output and applies a sigmoid nonlinearity to produce an output score between zero and one.
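A minimal sketch of this representation and classifier is given below. The n-gram ranges, vocabulary sizes, and layer widths are placeholders rather than the tuned values from the grid search.

```python
import torch
from torch import nn
from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words features over word and character n-grams (illustrative settings).
word_vectoriser = CountVectorizer(analyzer="word", ngram_range=(1, 2), max_features=20000)
char_vectoriser = CountVectorizer(analyzer="char_wb", ngram_range=(3, 5), max_features=20000)

class StrokeMLP(nn.Module):
    """Multi-layer perceptron with a single sigmoid output, as in each ensemble member."""

    def __init__(self, n_features: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                       # single scalar logit
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(x)).squeeze(-1)   # score in (0, 1)
```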

Threshold calibration and ensembling

For each model in the ensemble, we selected the prediction threshold as the harmonic mean of the two thresholds that ensure sensitivity and PPV equal to those of the call-takers. This simplifies the comparison by ensuring a trade-off between sensitivity and PPV, similar to that of call-takers.
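The calibration can be sketched as follows, using scikit-learn curve utilities; the handling of ties and edge cases is simplified, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve

def calibrate_threshold(y_true, scores, calltaker_sensitivity, calltaker_ppv):
    """Harmonic mean of (a) the threshold whose sensitivity matches the call-takers
    and (b) the threshold whose PPV matches the call-takers."""
    _, tpr, roc_thresholds = roc_curve(y_true, scores)
    t_sens = roc_thresholds[np.argmax(tpr >= calltaker_sensitivity)]

    precision, _, pr_thresholds = precision_recall_curve(y_true, scores)
    # precision has one more element than the threshold array.
    t_ppv = pr_thresholds[np.argmax(precision[:-1] >= calltaker_ppv)]

    return 2 * t_sens * t_ppv / (t_sens + t_ppv)  # harmonic mean of the two thresholds
```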

As the threshold differed for each model in the ensemble, computing the ensemble output score as the average output score of the individual models would not be meaningful. Instead, we first subtracted the threshold from the output score in logit space (before sigmoid nonlinearity) for each model to obtain the same threshold (0.5). Subsequently, we defined the ensemble output score as the average of the centred output scores. The exact equations are provided in the supplementary material [Supplementary Equations (1) and (2)].
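A sketch of this centring and averaging is shown below. The exact formulation is given by the supplementary equations; this version averages the centred scores in logit space before applying the sigmoid, which is an assumption.

```python
import numpy as np

def logit(p: np.ndarray) -> np.ndarray:
    return np.log(p) - np.log1p(-p)

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def ensemble_score(scores: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """Centre each model's output at its own threshold in logit space so that 0.5
    becomes the common decision threshold, then average across the ensemble.
    `scores` has shape (n_models, n_calls); `thresholds` has shape (n_models,)."""
    centred = logit(scores) - logit(thresholds)[:, None]
    return sigmoid(centred.mean(axis=0))
```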

Model training

The speech recognition model was trained on 3,811 manually transcribed random calls (173 h) from the CEMS as part of a previous project14. These calls exclusively originated from 1-1-2 between 2015 and 2018, ensuring no overlap with the test data used for the text classification model. The model was trained using a connectionist temporal classification objective34.

We trained the five models of the text classification ensemble using binary cross-entropy after transcribing all calls in the dataset with the speech recognition model. One training fold was used for early stopping based on the F1-score, whereas the remaining four folds and the 1-1-2 data were used for training. Thus, each model in the ensemble was trained and validated on different data. We ran a grid search with 96 different hyperparameter configurations and selected the ensemble model with the best F1-score on the validation set.
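The training loop for a single ensemble member might look like the sketch below; the optimiser, learning rate, and patience are assumptions rather than values from the grid search.

```python
import copy
import torch
from torch import nn
from sklearn.metrics import f1_score

def train_one_member(model, train_loader, val_loader, epochs=50, patience=5, lr=1e-3):
    """Train one ensemble member with binary cross-entropy and early stopping on
    the validation F1-score (generic sketch; the model outputs probabilities)."""
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    best_f1, best_state, stale = 0.0, None, 0
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            optimiser.zero_grad()
            loss_fn(model(x), y.float()).backward()
            optimiser.step()
        model.eval()
        with torch.no_grad():
            preds = torch.cat([(model(x) >= 0.5).long() for x, _ in val_loader])
            labels = torch.cat([y.long() for _, y in val_loader])
        f1 = f1_score(labels.numpy(), preds.numpy())
        if f1 > best_f1:
            best_f1, best_state, stale = f1, copy.deepcopy(model.state_dict()), 0
        else:
            stale += 1
            if stale >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return model, best_f1
```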

Model explainability

We performed an occlusion analysis to better understand the predictions of the text classification model. This involved removing all instances of a given word from the input transcript and evaluating the impact on the model output. The word was removed before vectorisation, such that all word and character n-grams associated with the word were discarded. Specifically, let z(n,d) be the logit output of model n in the ensemble for transcript d, and let z(n,d,w) be the corresponding logit output when the word w is occluded. For transcript d, we computed the word impact score i(d,w) as the mean difference between the logits before and after occlusion:

$${i}^{(d,w)}=\frac{1}{N}\sum\limits_{n=1}^{N}\left({z}^{(n,d)}-{z}^{(n,d,w)}\right)$$
(1)

We used the logit output to compute the impact score because the difference in sigmoid-normalised outputs is biased towards zero for values close to 0 or 1. To select words for inspection, we computed a ranking score, r(w), as the sum of the signed squares of the impact scores:

$${r}^{(w)}=\sum\limits_{d=1}^{D}\operatorname{sgn}\left({i}^{(d,w)}\right){\left({i}^{(d,w)}\right)}^{2}$$
(2)

where sgn(·) represents the sign function. Squaring i(d, w) favours rare features with a high impact over common features with a low impact.
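The two equations can be implemented directly, as in the sketch below, where `models` are callables returning the pre-sigmoid logit for a vectorised transcript and `vectorise` maps a transcript string to a feature vector (both names are placeholders); batching and caching are omitted for clarity.

```python
import numpy as np
from collections import defaultdict

def occlusion_ranking(models, transcripts, vectorise):
    """Word impact scores i(d, w) (Eq. 1) accumulated into ranking scores r(w) (Eq. 2)."""
    ranking = defaultdict(float)
    for doc in transcripts:
        tokens = doc.split()
        base = np.mean([m(vectorise(doc)) for m in models])         # mean logit, full transcript
        for word in set(tokens):
            occluded = " ".join(t for t in tokens if t != word)     # remove all instances of the word
            occ = np.mean([m(vectorise(occluded)) for m in models])
            impact = base - occ                                     # Eq. (1)
            ranking[word] += np.sign(impact) * impact ** 2          # Eq. (2)
    return dict(ranking)
```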

Statistical analysis

We report the F1-score, sensitivity, PPV, FOR (equal to 1−negative predictive value), and FPR (equal to 1−specificity). Due to the imbalanced nature of the dataset, the negative predictive value and specificity were >99% in all cases. We report FOR and FPR instead because such large values exhibit low relative variance, which obfuscates comparisons. Finally, we report the prediction confusion matrices, the ROC curve, and the PPV-sensitivity curve, commonly known as the precision-recall curve. All results are reported with up to three significant digits.
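For reference, the reported metrics follow directly from confusion-matrix counts, as in this small helper (a sketch, not code from the study):

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Metrics reported in this study, computed from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                      # recall
    ppv = tp / (tp + fp)                              # precision
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    fpr = fp / (fp + tn)                              # 1 - specificity
    false_omission = fn / (fn + tn)                   # FOR, 1 - negative predictive value
    return {"F1": f1, "sensitivity": sensitivity, "PPV": ppv, "FPR": fpr, "FOR": false_omission}
```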

We present the results with and without 1-1-2 training data, subgroup analyses based on age (18–64/65+) and sex (male/female), and call-taker performance. We also report the model performance on calls without a diagnostic category from the test year 2021 to assess potential data bias. We tested our results for statistical significance using approximate permutation tests. We used one-sided paired approximate permutation tests for model-to-model and model-to-call-taker comparisons made on the same subset. For comparisons across different subsets (e.g., male vs. female), we used one-sided independent approximate permutation tests. We computed 95% confidence intervals (CIs) using bootstrapping37,38. In our assessment, we accounted for random variation associated with model training by basing the means, tests, and CIs on the predictions of 11 randomly initialised training runs. Statistical significance was defined as a p value of <0.05.
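A generic paired approximate permutation test of the kind described above can be sketched as follows; per-call predictions are randomly swapped between the two systems under the null hypothesis of no difference. The metric callable and variable names are placeholders.

```python
import numpy as np

def paired_permutation_test(metric, preds_a, preds_b, y_true, n_permutations=10000, seed=0):
    """One-sided paired approximate permutation test for metric(A) > metric(B)."""
    rng = np.random.default_rng(seed)
    observed = metric(y_true, preds_a) - metric(y_true, preds_b)
    count = 0
    for _ in range(n_permutations):
        swap = rng.random(len(y_true)) < 0.5           # randomly swap the paired predictions
        a = np.where(swap, preds_b, preds_a)
        b = np.where(swap, preds_a, preds_b)
        if metric(y_true, a) - metric(y_true, b) >= observed:
            count += 1
    return (count + 1) / (n_permutations + 1)

# Example: p = paired_permutation_test(f1_score, model_preds, calltaker_preds, labels)
```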

We used the model with the median F1-score out of the 11 runs for the occlusion analysis. We listed the 30 words with the highest positive ranking scores for calls classified as stroke and the 30 words with the highest negative ranking scores for calls classified as non-stroke.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.