Introduction

Stroke is a leading cause of disability and death worldwide1,2,3. Effective treatment is time-sensitive, and an optimal outcome is more likely when treatment is administered within the first four and a half hours from stroke onset4,5. The gateway to ambulance transport and hospital admittance is through prehospital telehealth services, including emergency medical call centres, nurse advice call lines, and out-of-hours health services. In the prehospital setting, the use of mobile stroke units has made it possible to deliver advanced treatment faster6,7. As the mobile stroke unit is only dispatched to patients with a suspected stroke, the impact of the mobile stroke unit is directly influenced by accurate call-taker recognition of stroke6,7. Call-takers who can rapidly and accurately recognise stroke are therefore crucial in facilitating prompt care in both prehospital and in-hospital settings.

Despite initiatives to improve stroke recognition8,9, approximately half of all patients with stroke do not receive the correct triage for their condition from call-takers10,11,12. Most initiatives aim to improve stroke recognition by call-takers by introducing more specific assessment tools8,9 or by providing specialised training13. Recent advances in machine learning technology might be applied to improve stroke recognition without requiring changes to the triaging approach, and machine learning-aided identification of stroke has been suggested as a means of improving mobile stroke unit effectiveness7. Real-time feedback from a machine learning model can improve the recognition of out-of-hospital cardiac arrest14,15. Therefore, this study aimed to develop a machine learning framework and assess its potential for improving prehospital stroke recognition during medical helpline calls.

In this study, we use call recordings and registry data from the Copenhagen Emergency Medical Services (CEMS) and the Danish Stroke Registry (DanStroke) from 2015 to 2020. We obtain call recordings from two call lines: the 1-1-2 emergency line and the medical helpline 1813 (MH-1813). We then fit a machine learning framework to classify medical helpline calls as stroke or non-stroke. Calls are first transcribed using an automatic speech recognition model and then categorised by a text classification model trained as an ensemble of five individual models. We compare the performance of the model with that of call-takers using MH-1813 data from 2021.

Results

Population characteristics

Calls to the MH-1813 were divided into training, validation, and test subsets, and calls to the emergency line 1-1-2 were used only as supplementary training data (Table 1). Calls from the test year (2021) that were not associated with a diagnostic category code, which we used to evaluate call-taker performance, were separated from our primary test set but were still included to assess potential bias in this group of calls (2021 w/o category, Table 1). The 1-1-2 training data differed from the MH-1813 data regarding age, male/female ratio, and stroke prevalence (Table 1). We therefore performed an ablation study in which 1-1-2 data were not used for training to assess whether this difference negatively impacted model performance. The training, validation, and test subsets of the MH-1813 data had similar characteristics, whereas the 2021 data without diagnostic categories differed in age and sex.

Table 1 Population characteristics for each data subset.

Main results

The classification model outperformed the call-takers (Table 2), with significant differences in all metrics (p < 0.0001, paired approximate permutation test). Excluding the 1-1-2 call line training data significantly degraded the model’s performance (p < 0.0001, paired approximate permutation test), despite the domain mismatch with the MH-1813 call line test data. The performance on the 2021 calls without a diagnostic category was significantly worse than that of the test set regarding F1-score, sensitivity, false positive rate (FPR), and false omission rate (FOR) (p < 0.0001, independent approximate permutation test). The difference in positive predictive value (PPV) was not significant (p = 0.298, independent approximate permutation test).

Table 2 Overall performance on MH-1813 test data, performance without 1-1-2 training data, and performance on data from 2021 without diagnostic categories, as well as performance on MH-1813 based on demographic subgroups (age/sex) [mean (95% CI)].

The receiver operating characteristic (ROC) curve (Fig. 1, left) illustrates the potential to increase sensitivity while maintaining an FPR lower than or equal to that of the call-takers. Similarly, the PPV-sensitivity curve (Fig. 1, right) demonstrates that sensitivity can be improved while retaining a PPV higher than that of the call-takers. The framework can thus be tuned to a sensitivity of around 73% while still achieving a higher PPV than the call-takers (Fig. 1, right). The ensemble model outperformed the individual models regardless of the threshold, except for one model, which exhibited slightly better sensitivity at FPRs exceeding 1.5%. The confusion matrices (Fig. 2) illustrate the performance differences in absolute numbers, with the model exhibiting more true positives and fewer false positives than the call-takers.

Fig. 1: Receiver operating characteristic (ROC) curve and PPV-sensitivity curve.

Left is the ROC curve, and right is the PPV-sensitivity curve (precision-recall curve). Models 1–5 are the individual models that make up the ensemble model.

Fig. 2: Prediction confusion matrices.

Confusion matrices of predictions for call-takers and the model on the test set. Numbers for the model are given as the rounded mean over eleven runs.

Sex and age

The model and call-takers exhibited significantly higher PPV and F1-score in men than in women (p < 0.0001, independent approximate permutation test) (Table 2). The model significantly outperformed the call-takers on all metrics for each sex (p < 0.0001, paired approximate permutation test).

The model performed significantly better in the 65+ group than in the 18–64 year group regarding sensitivity, PPV, and F1-score (p < 0.0001, independent approximate permutation test). Similarly, the call-takers performed significantly better in the 65+ group than in the 18–64 group regarding PPV and F1-score (p < 0.0001, independent approximate permutation test). Finally, the model significantly outperformed the call-takers on all metrics in both age groups (p < 0.0001, paired approximate permutation test).

Model explainability

We performed an occlusion analysis to evaluate the importance of individual words for both positive and negative classifier predictions (Table 3). Among the words with a positive rank score, several are synonymous with stroke, such as ‘blood clot’, ‘haemorrhagic stroke’, and ‘stroke’. Ambulances are rarely dispatched because the MH-1813 is not intended for emergencies; a word like ‘ambulance’ may therefore be a strong indicator of call-taker recognition, which the model has learned to mimic. Additionally, most of the remaining words can be linked to stroke-related symptoms, such as ‘double vision’, ‘difficulties speaking’, and ‘hangs’. In particular, words describing the side of the body where symptoms occur ranked high (such as ‘left’, ‘right’, and ‘side’). Finally, some words were related to the sudden onset of symptoms (including ‘suddenly’ and ‘minutes’).

Table 3 English translation of words with the largest positive and negative ranking score in calls predicted as stroke and non-stroke, respectively.

Among the words with a negative rank score, most were strong indicators of specific conditions, symptoms, or body parts that are unrelated to stroke (such as ‘tetanus’, ‘pregnant’, ‘swollen’, ‘fever’, and ‘the knee’). Another group of words described aspects of treatment that are unlikely to be addressed in a stroke call, including ‘prescription’, ‘bandage’, and ‘OTC’. Finally, a small group of words described institutions that are not commonly involved in stroke treatment (such as ‘psychiatric’, ‘the emergency room’, and ‘the police’).

Discussion

Our results showed that a machine learning framework could substantially improve stroke recognition in medical helpline calls compared to solely relying on human call-takers. This improvement was observed across all performance metrics and for basic patient demographics (age and sex). Our occlusion analysis revealed that the model relied on the relevant predictive features associated with call-taker triaging, patient symptoms, and treatment.

This study does not imply that a machine learning model can replace medical call-takers. The effectiveness of the model relies entirely on the conversation between the call-taker and the caller and on the call-taker’s ability to skillfully triage the patient. Instead, the model should be used as a supportive tool for call-takers in the decision-making process, contributing to higher recognition of patients with stroke and potentially boosting call-takers’ confidence in their decisions. A similar machine learning model designed to predict cardiac arrest was tested in a randomised controlled trial (RCT) at CEMS15. The results highlighted the necessity of incorporating input from call-takers. The machine learning model for cardiac arrest has subsequently been implemented in daily practice at CEMS, in a setup similar to the one presented in our study. However, the implementation of our framework requires further investigation. The relative performance gap between call-takers and the model was larger in our study than in the cardiac arrest study15, which may affect the results of a potential RCT.

To support future work and discussions beyond the scope of this study, the supplementary material includes the results of a simulation of a live implementation in which call-takers are assumed to follow a set of fixed rules based on the output of the machine learning framework (Supplementary Table 9). For instance, in one simulation, call-takers are assumed to change any stroke negative to a positive whenever the model predicts a positive. While the results of the simulation are encouraging, it is important to stress that it is not practically feasible to use a fixed rule set to overrule the call-taker. These results should only be seen as a preliminary indicator of a potential RCT. In practice, a nuanced set of guidelines should be developed over several iterations of implementation and testing.
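As an illustration, the override rule described above amounts to a logical OR over the two sets of triage decisions. The following is a minimal sketch under that assumption; the array names are hypothetical, and the actual simulation rules are those given in Supplementary Table 9.

```python
import numpy as np

def simulate_override_rule(calltaker_positive: np.ndarray, model_positive: np.ndarray) -> np.ndarray:
    """Hypothetical fixed rule: a call-taker stroke negative is changed to a
    positive whenever the model predicts a positive (logical OR of decisions)."""
    return calltaker_positive | model_positive

# Example: the combined triage is positive if either source flags stroke.
calltaker = np.array([False, True, False, False])
model = np.array([True, True, False, True])
print(simulate_override_rule(calltaker, model))  # [ True  True False  True]
```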

The performance gap between the model and call-takers could be explained by the rarity of stroke calls to MH-1813 (0.250% of all calls in 2021), which might affect call-taker awareness of stroke as a possible cause of certain symptoms. Additionally, certain stroke symptoms are so rare that some call-takers may never encounter them, increasing the risk of false negatives. The model was trained on more calls than any single call-taker would handle in a lifetime, enabling it to recognise even rare descriptors of stroke. The model is specifically trained to recognise strokes and exclusively learns from actual stroke descriptions, unlike call-takers, who are trained with generalised teaching materials to triage many different conditions. Therefore, call-takers may not have received specific training for patients with stroke and may never have encountered them.

The model performed significantly better on men than on women. This could be attributed to several factors. First, the model may have learned to mimic call-takers with the same bias. Second, women may experience different and more challenging-to-identify symptoms than men16,17. Third, a higher prevalence of male patients with stroke was observed in the training data. Despite these potential sources of bias, the model exhibited less bias than call-takers did. That is, the relative performance improvements were higher for women than for men. This bias could be further reduced using advanced data augmentation and balanced data when training a machine learning model. However, such measures may degrade overall performance.

The improved sensitivity and PPV in the 65+ years group may be explained by a higher prior probability of stroke for older patients and stronger evidence from the patient’s medical history. The relatively high FOR and FPR for the 65+ group are likely a result of the much higher prevalence of stroke cases compared to the 18–64-year-olds (0.85% vs. 0.07%). We did not have data to estimate potential bias related to race, ethnicity, language, accent, or dialect. Previous studies on speech recognition for call centres have indeed found that non-native speakers had a higher rate of transcription errors18. Since our model was trained on a representative (and therefore unbalanced) sample, we expect it to behave similarly. Future research should address these shortcomings, for example, by utilising self-supervised learning on massive amounts of diverse, unlabelled data covering multiple languages, accents, and dialects.

Due to European data regulations (GDPR), it was not possible to manually transcribe MH-1813 calls to train a new speech recognition model, so we had to rely on an existing solution. This also meant that we could not evaluate the word error rate (WER) of the model. Instead, we used the downstream performance of the text classification model when trained in combination with different speech recognition models to choose the best option. Since the focus of this study is the ability to correctly recognise stroke, and not the performance of the speech recognition model alone, this approach is better suited to our aim. Indeed, the WER can be misleading when choosing a speech recognition model for a specific task. For instance, one model might fail to predict redundant minimal response words (e.g., “uh” and “uhm”) and make small inflection errors (e.g., “clot” instead of “clots”), resulting in a relatively high WER, while another model might only fail to predict rare, specialised words that are highly indicative of stroke (e.g., “haemorrhage” and “thrombolysis”), resulting in a relatively low WER.

Although we believe that the proposed machine learning framework can be further improved, several alternatives have already been explored in the preliminary experimental phase. The speech recognition model we used was trained on 1-1-2 calls for a previous project14 and was therefore specialised to a domain very similar to that of MH-1813. We also tested an open-source, multilingual model from OpenAI called Whisper19, but found that performance degraded slightly compared to the model trained on 1-1-2 calls. We hypothesise that this is due to Whisper’s inability to handle the specific noise conditions and to recognise words from a specialised medical vocabulary.

For text classification, we used an ensemble of multi-layer perceptrons (MLPs). We also tested convolutional, recurrent, and self-attention (i.e., Transformer) architectures. However, this did not improve performance. In addition, we tested a pre-trained self-supervised model. Although many of these models are freely available to the public, they are primarily trained on English data. Relatively few options exist for the Danish language, none of which are specialised in the medical domain. We used a monolingual Danish BERT model, which has previously been shown to outperform a multilingual alternative from Google for Danish named-entity recognition20. However, this also did not result in a significant performance improvement. We hypothesise that the number of ground truth stroke positives was too small for these advanced models to learn more complex patterns than the MLP ensemble. In addition, a self-supervised model would likely benefit from being pre-trained on speech or text data from the target domain. Although training such large-scale foundation models has the potential to improve the classification model further, it is beyond the scope of this study. Thus, we chose the simpler MLP ensemble. Reviews of self-supervised learning for speech and text are cited in the references21,22. Notably, it is not uncommon for small, simple models to match or outperform large, pre-trained models in text classification tasks23.

This study has some limitations. First, the mapping of call recordings to electronic records was incomplete due to technical limitations in the computer-aided dispatch (CAD) registry, which limited the number of calls available to us. Of note, there was no obvious pattern of bias related to the unmapped calls, and we included all calls with matching audio files, regardless of dispatcher performance. The results could potentially be improved if more calls were available for analysis. Second, calls without a call-taker-indicated diagnostic category were not included in the validation and test data because the call-taker’s performance could not be evaluated. Moreover, in exploratory analyses, the model performed worse on these calls, which might be attributed to differences in population characteristics (Table 1). Finally, the ground truth stroke labelling relied on the patient-reported time of onset being exact; however, estimating the accuracy of the timestamps in DanStroke was impossible.

In conclusion, using the largest collection of audio calls from patients with stroke to date, we developed a machine-learning framework that significantly outperformed human call-takers in stroke recognition in medical helpline calls. The framework can assist human call-takers during medical helpline calls. Ideally, this would enable a higher recognition of patients with stroke in the prehospital setting, benefiting both patient outcomes and health service resource allocation.

Methods

Data sources

Copenhagen emergency medical services (CEMS)

The CEMS is responsible for providing prehospital telehealth services in the Capital Region of Denmark, with a catchment population of 1.9 million24. CEMS operates two call lines. The first is the 1-1-2 emergency line, similar to 9-1-1 in the United States and intended for acute conditions. The second is the medical helpline 1813 (MH-1813, pronounced ‘18-13’), intended for non-life-threatening conditions that cannot wait until a general practitioner is available25.

Call-takers for both lines, who are nurses, paramedics, or physicians, can dispatch ambulances. The condition suspected by the call-taker is categorised based on a predefined diagnostic index and stored in an electronic record using a CAD system. The CAD records are associated with the patient’s Danish civil registration number (CPR number)26. The CPR number is a unique identifier assigned to all Danish residents; it is used for interactions with health services and registries, enabling cross-referencing of the data sources used in this study. The call audio is recorded by a telephone system and stored separately from the CAD records.

Danish Stroke Registry (DanStroke)

All patients with a final diagnosis of stroke or transient ischaemic attack admitted to a Danish hospital within 5 days of symptom onset are recorded in the Danish Stroke Registry27, also known as DanStroke. Each record includes the patient-reported time of onset, stroke type (haemorrhagic, ischaemic, or transient ischaemic attack), and the patient’s CPR number. The diagnosis is made according to the national guidelines28, which include cerebral imaging and a full diagnostic workup by neurologists. The validity of the Danish Stroke Registry has been shown to be high29, and the number of stroke mimics in our dataset is therefore minimised.

Inclusion and ethics

The Danish Data Protection Agency (P-2021-475) approved this study. Danish law did not require approval from the Scientific Ethics Committee because the data were registry-based. CEMS approved the transcription of all calls made to 1-1-2 and MH-1813. All electronic records were anonymised before analysis, and the researchers did not inspect the calls manually.

Study scope

Stroke prevalence in calls made to the MH-1813 is lower than that in calls made to 1-1-2. Because MH-1813 is intended for low-acuity incidents, patients with stroke who call it may exhibit different symptoms and lower symptom severity, leading to reduced recognition. In addition, MH-1813 call-takers dispatch high-priority transport less frequently, which may affect optimal treatment timing. Therefore, we focused on MH-1813 in this study.

Stroke dataset

Cross-referencing data sources

From the CAD medical records, we included all 1-1-2 and MH-1813 calls from 2015 to 2021 from patients older than 18 that could be matched to a corresponding audio file. The CAD records were matched with the telephone call recordings based on the call start, call duration, and call-taker identity. Due to data incompleteness and the way the audio data are stored at CEMS, 2,730,199 contacts could not be matched to their corresponding audio files; 2,361,178 contacts were successfully matched. We found no obvious pattern distinguishing matched from unmatched calls, and we included all calls with a matching audio file. Next, a call was regarded as a ground truth stroke positive when the CPR number in the CAD record matched that of a DanStroke record and the patient-reported time of onset was close to the call start time. We allowed a window of 72 hours before and 24 hours after the call start to account for uncertainty in recording stroke onset time. We excluded calls involving subarachnoid haemorrhage cases. Finally, we considered a call to be a call-taker stroke positive when the call-taker selected the stroke diagnostic category during the call and dispatched an ambulance with the appropriate level of response30. To ensure that the effect of the machine learning framework was not overestimated, we excluded calls for which no diagnostic category had been registered from the test set. We still reported the population characteristics and model performance of this group of calls to assess potential bias introduced by excluding them. A data-flow diagram is included in the supplementary material (Supplementary Fig. 1). The resulting dataset is the largest dataset of audio files from stroke calls collected to date.
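The time-window criterion for ground truth labelling can be expressed as a simple check. The sketch below assumes each call carries a start timestamp and each matched DanStroke record carries a patient-reported onset timestamp; the field names, and our reading of the window as onset falling up to 72 hours before or 24 hours after the call start, are assumptions.

```python
from datetime import datetime, timedelta

# Window around the call start within which a DanStroke onset time counts as a match.
WINDOW_BEFORE = timedelta(hours=72)
WINDOW_AFTER = timedelta(hours=24)

def is_ground_truth_stroke(call_start: datetime, onset_times: list[datetime]) -> bool:
    """Return True if any patient-reported onset time for the matched CPR number
    falls within the allowed window around the call start."""
    return any(
        call_start - WINDOW_BEFORE <= onset <= call_start + WINDOW_AFTER
        for onset in onset_times
    )
```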

Dataset splitting

We reserved all the MH-1813 calls from 2021 for testing. We used stratified sampling to divide the MH-1813 calls from 2015 to 2020 into validation and training subsets. The training subset was further split into five folds, which were used for ensemble training. The calls were stratified based on the ground truth stroke label and the presence of a diagnostic category. Calls without diagnostic categories were only included in the training set. The 1-1-2 calls were used only for training; however, calls from 2021 were discarded to avoid temporal overlap with the test period.
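A stratified split of this kind could be implemented with scikit-learn as sketched below. The labels are synthetic, and combining the two criteria into a single joint stratum is an assumption about how the stratification was performed.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-ins for the two stratification variables.
rng = np.random.default_rng(0)
n_calls = 1000
stroke_label = rng.integers(0, 2, n_calls)   # ground truth stroke label
has_category = rng.integers(0, 2, n_calls)   # diagnostic category registered?
strata = stroke_label * 2 + has_category     # joint stratum preserves both distributions

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = [fold_idx for _, fold_idx in skf.split(np.zeros((n_calls, 1)), strata)]
print([len(f) for f in folds])               # five folds of roughly equal size
```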

Machine learning pipeline

We employed a two-step machine learning pipeline. First, a call was transcribed using the speech recognition model. Second, the transcript was used as input for the text classification model. The final output score was used to classify whether the call concerned a stroke. The pipeline is illustrated in the supplementary material (Supplementary Fig. 2).
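Conceptually, the pipeline reduces to the following skeleton, where the two callables stand in for the trained speech recognition and text classification models (the function names are hypothetical):

```python
from typing import Callable, Tuple
import numpy as np

def classify_call(
    audio: np.ndarray,
    transcribe: Callable[[np.ndarray], str],
    classify_text: Callable[[str], float],
    threshold: float = 0.5,
) -> Tuple[float, bool]:
    """Step 1: transcribe the call; step 2: score the transcript for stroke."""
    transcript = transcribe(audio)
    score = classify_text(transcript)  # output score between zero and one
    return score, score >= threshold
```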

Speech recognition

The call recordings from the CEMS were stored as 8-bit linear pulse-code modulated audio, sampled at 8 kHz. Each call was converted into a log-Mel spectrogram before being fed into the speech recognition model. The log-Mel spectrogram is a commonly used input representation for speech-processing tasks that facilitates the identification of linguistic content in audio signals. We used a speech recognition model with a neural network architecture31 consisting of two-dimensional convolutional layers32 and blocks of bidirectional long short-term memory layers33. The output is a sequence of probability distributions over characters of the Danish alphabet, which were converted into a human-readable transcript using a greedy decoder34.
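The sketch below illustrates the two front-end steps, log-Mel feature extraction and greedy decoding, using torchaudio; the spectrogram parameters and the character-index convention are assumptions, not the settings of the deployed model.

```python
import torch
import torchaudio

# 8 kHz audio -> log-Mel spectrogram (parameter values are illustrative).
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=8000, n_fft=400, hop_length=160, n_mels=64
)

def log_mel(waveform: torch.Tensor) -> torch.Tensor:
    return torch.log(mel(waveform) + 1e-6)

def greedy_ctc_decode(log_probs: torch.Tensor, alphabet: str, blank: int = 0) -> str:
    """Greedy decoding of per-frame character distributions: take the argmax per
    frame, collapse repeated symbols, and drop the blank symbol."""
    best = log_probs.argmax(dim=-1).tolist()   # (time,) character indices
    chars, previous = [], blank
    for idx in best:
        if idx != previous and idx != blank:
            chars.append(alphabet[idx - 1])    # assumes index 0 is the blank
        previous = idx
    return "".join(chars)
```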

Text classification

As input for the classification model, each transcript was transformed into a fixed-size bag-of-words vector, which encoded the occurrence of word and character n-grams from a fixed vocabulary. The feature selection procedure is detailed in the Supplementary Methods. The model was constructed as an ensemble35 of five identical, independently trained models, each consisting of a stack of neural network layers commonly referred to as a multi-layer perceptron36. The final layer has a single scalar output and applies a sigmoid nonlinearity to produce an output score between zero and one.
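A minimal sketch of this representation and classifier is given below. The n-gram ranges, vocabulary sizes, and layer widths are placeholders rather than the tuned values from the grid search.

```python
import torch
from torch import nn
from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words features over word and character n-grams (illustrative settings).
word_vectoriser = CountVectorizer(analyzer="word", ngram_range=(1, 2), max_features=20000)
char_vectoriser = CountVectorizer(analyzer="char_wb", ngram_range=(3, 5), max_features=20000)

class StrokeMLP(nn.Module):
    """Multi-layer perceptron with a single sigmoid output, as in each ensemble member."""

    def __init__(self, n_features: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                       # single scalar logit
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(x)).squeeze(-1)   # score in (0, 1)
```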

Threshold calibration and ensembling

For each model in the ensemble, we selected the prediction threshold as the harmonic mean of the two thresholds that ensure sensitivity and PPV equal to those of the call-takers. This simplifies the comparison by ensuring a trade-off between sensitivity and PPV, similar to that of call-takers.
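The calibration can be sketched as follows, using scikit-learn curve utilities; the handling of ties and edge cases is simplified, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve

def calibrate_threshold(y_true, scores, calltaker_sensitivity, calltaker_ppv):
    """Harmonic mean of (a) the threshold whose sensitivity matches the call-takers
    and (b) the threshold whose PPV matches the call-takers."""
    _, tpr, roc_thresholds = roc_curve(y_true, scores)
    t_sens = roc_thresholds[np.argmax(tpr >= calltaker_sensitivity)]

    precision, _, pr_thresholds = precision_recall_curve(y_true, scores)
    # precision has one more element than the threshold array.
    t_ppv = pr_thresholds[np.argmax(precision[:-1] >= calltaker_ppv)]

    return 2 * t_sens * t_ppv / (t_sens + t_ppv)  # harmonic mean of the two thresholds
```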

As the threshold differed for each model in the ensemble, computing the ensemble output score as the average output score of the individual models would not be meaningful. Instead, we first subtracted the threshold from the output score in logit space (before sigmoid nonlinearity) for each model to obtain the same threshold (0.5). Subsequently, we defined the ensemble output score as the average of the centred output scores. The exact equations are provided in the supplementary material [Supplementary Equations (1) and (2)].
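A sketch of this centring and averaging is shown below. The exact formulation is given by the supplementary equations; this version averages the centred scores in logit space before applying the sigmoid, which is an assumption.

```python
import numpy as np

def logit(p: np.ndarray) -> np.ndarray:
    return np.log(p) - np.log1p(-p)

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def ensemble_score(scores: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """Centre each model's output at its own threshold in logit space so that 0.5
    becomes the common decision threshold, then average across the ensemble.
    `scores` has shape (n_models, n_calls); `thresholds` has shape (n_models,)."""
    centred = logit(scores) - logit(thresholds)[:, None]
    return sigmoid(centred.mean(axis=0))
```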

Model training

The speech recognition model was trained on 3,811 manually transcribed random calls (173 h) from the CEMS as part of a previous project14. These calls exclusively originated from 1-1-2 between 2015 and 2018, ensuring no overlap with the test data used for the text classification model. The model was trained using a connectionist temporal classification objective34.

We trained the five models of the text classification ensemble using binary cross-entropy after transcribing all calls in the dataset with the speech recognition model. One training fold was used for early stopping based on the F1-score, whereas the remaining four folds and the 1-1-2 data were used for training. Thus, each model in the ensemble was trained and validated on different data. We ran a grid search with 96 different hyperparameter configurations and selected the ensemble model with the best F1-score on the validation set.
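The training loop for a single ensemble member might look like the sketch below; the optimiser, learning rate, and patience are assumptions rather than values from the grid search.

```python
import copy
import torch
from torch import nn
from sklearn.metrics import f1_score

def train_one_member(model, train_loader, val_loader, epochs=50, patience=5, lr=1e-3):
    """Train one ensemble member with binary cross-entropy and early stopping on
    the validation F1-score (generic sketch; the model outputs probabilities)."""
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    best_f1, best_state, stale = 0.0, None, 0
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            optimiser.zero_grad()
            loss_fn(model(x), y.float()).backward()
            optimiser.step()
        model.eval()
        with torch.no_grad():
            preds = torch.cat([(model(x) >= 0.5).long() for x, _ in val_loader])
            labels = torch.cat([y.long() for _, y in val_loader])
        f1 = f1_score(labels.numpy(), preds.numpy())
        if f1 > best_f1:
            best_f1, best_state, stale = f1, copy.deepcopy(model.state_dict()), 0
        else:
            stale += 1
            if stale >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return model, best_f1
```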

Model explainability

We performed an occlusion analysis to better understand the predictions of the text classification model. This involved removing all instances of a given word from the input transcript and evaluating the impact on the model output. The word was removed before vectorisation, such that all word and character n-grams associated with the word were discarded. Specifically, let z(n,d) be the logit output of model n in the ensemble for transcript d, and let z(n,d,w) be the corresponding logit output when the word w is occluded. For transcript d, we computed the word impact score i(d,w) as the mean difference between the logits before and after occlusion:

$${i}^{(d,w)}=\frac{1}{N}\sum\limits_{n=1}^{N}\left({z}^{(n,d)}-{z}^{(n,d,w)}\right)$$
(1)

We used the logit output to compute the impact score because the difference in sigmoid-normalised outputs is biased towards zero for values close to 0 or 1. To select words for inspection, we computed a ranking score, r(w), as the sum of the signed squares of the impact scores:

$${r}^{(w)}=\sum\limits_{d=1}^{D}\operatorname{sgn}\left({i}^{(d,w)}\right){\left({i}^{(d,w)}\right)}^{2}$$
(2)

where sgn(·) represents the sign function. Squaring i(d, w) favours rare features with a high impact over common features with a low impact.
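The two equations can be implemented directly, as in the sketch below, where `models` are callables returning the pre-sigmoid logit for a vectorised transcript and `vectorise` maps a transcript string to a feature vector (both names are placeholders); batching and caching are omitted for clarity.

```python
import numpy as np
from collections import defaultdict

def occlusion_ranking(models, transcripts, vectorise):
    """Word impact scores i(d, w) (Eq. 1) accumulated into ranking scores r(w) (Eq. 2)."""
    ranking = defaultdict(float)
    for doc in transcripts:
        tokens = doc.split()
        base = np.mean([m(vectorise(doc)) for m in models])         # mean logit, full transcript
        for word in set(tokens):
            occluded = " ".join(t for t in tokens if t != word)     # remove all instances of the word
            occ = np.mean([m(vectorise(occluded)) for m in models])
            impact = base - occ                                     # Eq. (1)
            ranking[word] += np.sign(impact) * impact ** 2          # Eq. (2)
    return dict(ranking)
```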

Statistical analysis

We report the F1-score, sensitivity, PPV, FOR (equal to 1−negative predictive value), and FPR (equal to 1−specificity). Due to the imbalanced nature of the dataset, the negative predictive value and specificity were >99% in all cases. We report FOR and FPR instead because such large values exhibit low relative variance, which obfuscates comparisons. Finally, we report the prediction confusion matrices, the ROC curve, and the PPV-sensitivity curve, commonly known as the precision-recall curve. All results are reported with up to three significant digits.
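For reference, the reported metrics follow directly from confusion-matrix counts, as in this small helper (a sketch, not code from the study):

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Metrics reported in this study, computed from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                      # recall
    ppv = tp / (tp + fp)                              # precision
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    fpr = fp / (fp + tn)                              # 1 - specificity
    false_omission = fn / (fn + tn)                   # FOR, 1 - negative predictive value
    return {"F1": f1, "sensitivity": sensitivity, "PPV": ppv, "FPR": fpr, "FOR": false_omission}
```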

We present the results with and without 1-1-2 training data, subgroup analyses based on age (18–64/65+) and sex (male/female), and call-taker performance. We also report the model performance on calls without a diagnostic category from the test year 2021 to assess potential data bias. We tested our results for statistical significance using approximate permutation tests. We used one-sided paired approximate permutation tests for model-to-model and model-to-call-taker comparisons made on the same subset. For comparisons across different subsets (e.g., male vs. female), we used one-sided independent approximate permutation tests. We computed 95% confidence intervals (CIs) using bootstrapping37,38. In our assessment, we accounted for random variation associated with model training by basing the means, tests, and CIs on the predictions of 11 randomly initialised training runs. Statistical significance was defined as a p value of <0.05.
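A generic paired approximate permutation test of the kind described above can be sketched as follows; per-call predictions are randomly swapped between the two systems under the null hypothesis of no difference. The metric callable and variable names are placeholders.

```python
import numpy as np

def paired_permutation_test(metric, preds_a, preds_b, y_true, n_permutations=10000, seed=0):
    """One-sided paired approximate permutation test for metric(A) > metric(B)."""
    rng = np.random.default_rng(seed)
    observed = metric(y_true, preds_a) - metric(y_true, preds_b)
    count = 0
    for _ in range(n_permutations):
        swap = rng.random(len(y_true)) < 0.5           # randomly swap the paired predictions
        a = np.where(swap, preds_b, preds_a)
        b = np.where(swap, preds_a, preds_b)
        if metric(y_true, a) - metric(y_true, b) >= observed:
            count += 1
    return (count + 1) / (n_permutations + 1)

# Example: p = paired_permutation_test(f1_score, model_preds, calltaker_preds, labels)
```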

We used the model with the median F1-score out of the 11 runs for the occlusion analysis. We listed the 30 words with the highest positive ranking scores for calls classified as stroke and the 30 words with the highest negative ranking scores for calls classified as non-stroke.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.