Main

The coronavirus disease 2019 (COVID-19) pandemic has been estimated by the World Health Organization (WHO) to have caused 14.9 million excess deaths over the 2020–2021 period1. An accepted public health control measure for emerging infectious diseases is the isolation of infected individuals2. As COVID-19 transmission occurs in both symptomatic and asymptomatic cases3, especially prior to nationwide vaccination deployment, a scalable and accurate test for the infection is crucial to avoid general population quarantine.

This has sparked an intense interest in AI-based classifiers that use respiratory audio data to classify severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection status (which we here refer to as COVID-19 status) via a digital screening tool that anyone with a smartphone or computer can use4,5,6,7,8,9,10,11,12,13,14,15,16,17,18. In our review, as of July 2022, we found 93 published papers that reported evidence for the potential of audio-based COVID-19 classification. Of these 93 papers, 75 report an area under the curve (AUC) (or F1) of over 0.75, whereas 44 report a performance of above 0.90. Extended Data Table 1 summarizes nine highly cited datasets and their corresponding classification performance.

Despite these encouraging results, concerns remain that the prediction models may not be transferable to real-world settings11,15,18,19,20,21. In some cases, data quality may be lowered by, for example, sampling biases, lack of verification of participants’ COVID-19 status, a long delay between infection and audio recording, or small numbers of individuals who are SARS-CoV-2 reverse transcription polymerase chain reaction (PCR)-positive (COVID+)21. Akin to findings in AI radiographic COVID-19 detection22, concerns centre around whether the learnt audio features are unique audio biomarkers caused by COVID-19 in the infected individual, or are due to other confounding signals.

Here we analyse the largest PCR-validated dataset collected so far in the field of audio-based COVID-19 screening (ABCS). We design and specify an analysis plan in advance to investigate whether using audio-based classifiers can improve the accuracy of COVID-19 screening over using self-reported symptoms.

Our contribution is as follows:

  • – We collect a respiratory acoustic dataset of 67,842 individuals with linked PCR test outcomes, including 23,514 who tested positive for COVID-19. This is, to the best of our knowledge, the largest PCR-validated dataset collected of its kind so far23.

  • – We fit a range of AI classifiers and observe strong COVID-19 predictive performance (receiver operating characteristic area under the curve (ROC–AUC) = 0.85), as has been reported in past studies, for example refs. 4,5,6,7,8,9,10,11,12,13,14,15,16,17,18; however, when controlling for measured confounders by matching, only a small amount of residual predictive variation remains (ROC–AUC = 0.62), some of which we attribute to unmeasured confounders.

  • – We find the COVID-19 predictive performance and practical utility of audio-based AI classifiers—as applied in simulated realistic settings—to be no better than classification on the basis of self-reported symptoms; we replicate this finding by fitting our classifiers in an external dataset.

  • – These results suggest that audio-based classifiers learn to predict COVID-19 via self-reported symptoms and potentially other confounders. Study recruitment on the basis of self-screened symptoms seems to be an important driver of this effect.

  • – We provide best-practice recommendations on how to address this problem in future studies.

  • – Our dataset and code-base are publicly available to enable reproducibility of results and to encourage further research into respiratory audio analysis and bias mitigation in high-dimensional, over-parameterized settings23.

Our work is timely in highlighting the need for careful construction of machine learning evaluation procedures, aimed at yielding representative performance metrics. The important lessons from this case study on the effects of confounding extend across many applications in AI—where biases are often hard to spot and difficult to control for.

Results

Study design

This study invited volunteers from the Real-time Assessment of Community Transmission (REACT) programme and the National Health Service (NHS) Test-and-Trace (T+T) service to participate between March 2021 and March 2022 on an opt-in basis. Volunteers were directed to the ‘Speak up and help beat coronavirus’ web page24, where they were instructed to provide audio recordings of four respiratory audio modalities. Demographic and health metadata, along with a validated PCR test result, were transferred from existing T+T/REACT records. Further audio-specific metadata were produced from the audio files after collection. The final dataset comprised 23,514 COVID+ and 44,328 SARS-CoV-2 PCR-negative (COVID−) individuals. Figure 1 summarizes the dataset (a more detailed description of which is provided in Methods) and a full presentation can be found in the accompanying dataset paper23.

Fig. 1: Demographic statistics of collected dataset.
figure 1

a,b, Geographical locations of COVID positive (a) and negative (b) PCR-confirmed participants. Colour bar units are individual participant count. c, \(\frac{100\times \#\,\text{negative participants at location}}{\#\,\text{negative participants in total}}-\frac{100\times \#\,\text{positive participants at location}}{\#\,\text{positive participants in total}}\). Colour bar units are the difference in percentage points. d, Cumulative count of the number of participants partaking in the study. e,f, The 21 most common combinations of symptoms for COVID positive (e) and negative (f) participants, ordered along the x-axis by the total number of participants displaying that particular combination of symptoms. Symptoms are ordered along the y-axis according to the total number of participants displaying at least that symptom at the time of recording. g, Schematic detailing the two recruitment sources for the study and the filtration steps applied to yield the final dataset. h, Dataset splits in participant numbers.

Defining the acoustic target for COVID-19 screening

If a practically effective acoustic signal were to exist in SARS-CoV-2-infected individuals’ respiratory sounds, we propose that it would have the following properties:

  • P1: Caused by COVID-19. COVID-19 is well known to cause symptoms (such as a new continuous cough) that can be readily self-screened by individuals in the general population. The acoustic target would likewise be linked causally to COVID-19 and would therefore be more likely to generalize to other contexts and populations than non-causal associations.

  • P2: Not self-identifiable. The acoustic target would not directly represent self-identifiable symptoms that can be self-identified effectively by individuals in the general population. This is because: (1) it is more straightforward to measure self-identifiable symptoms directly using a questionnaire, rather than measuring them indirectly via audio; and (2) as we explain below, self-identifiable symptoms can affect enrolment and may therefore be strongly non-causally associated with COVID-19 in enrolled subpopulations.

  • P3: Enables high-utility COVID-19 screening. For an audio-based classifier to perform strongly in practical settings, it should possess high sensitivity and specificity, corresponding to an acoustic signal that would be detectable in high and low proportions of individuals who are COVID+ and COVID−, respectively. We formalize the mathematical relationship linking expected utility, sensitivity and specificity in equation (1) (see Methods).

Characterizing and controlling recruitment bias

In audio-based COVID-19 classification, results can be highly sensitive to the characteristics of the enrolled population. Our study’s recruitment protocol is subject to enrolment bias because the vast majority of individuals in pillar 2 of the UK government’s NHS T+T programme25 were PCR tested as a direct consequence of reporting symptoms (see Methods). Figure 1e,f display our participants’ symptom profiles, stratified by COVID-19 infection status. Figure 2a presents the joint distribution of COVID-19 status and binary symptoms status as ‘symptoms-based enrolment’, in contrast to Fig. 2b, which presents ‘general population enrolment’ on the basis of random sampling from a general population with 2% COVID+ prevalence. Note that the dependence between binary symptoms status and COVID-19 is stronger under symptoms-based enrolment (population correlation coefficient ρ = 0.66) than general population enrolment (ρ = 0.15).
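As a concrete illustration of how these dependence measures are obtained from a two-by-two table of binary symptoms status against COVID-19 status, the minimal Python sketch below computes the correlation coefficient ρ (the phi coefficient) and the mutual information reported in Fig. 2; the counts used are hypothetical placeholders, not the study data.

```python
import numpy as np

def phi_and_mi(table):
    """Correlation (phi) and mutual information for a 2x2 contingency table,
    where table[i, j] counts individuals with symptoms status i (0/1) and
    COVID-19 status j (0/1)."""
    p = table / table.sum()                      # joint distribution
    ps, pc = p.sum(axis=1), p.sum(axis=0)        # marginals
    # For two binary variables, the Pearson correlation reduces to the phi coefficient.
    phi = (p[1, 1] * p[0, 0] - p[1, 0] * p[0, 1]) / np.sqrt(ps[0] * ps[1] * pc[0] * pc[1])
    # Mutual information, in bits.
    mi = sum(p[i, j] * np.log2(p[i, j] / (ps[i] * pc[j]))
             for i in (0, 1) for j in (0, 1) if p[i, j] > 0)
    return phi, mi

# Hypothetical counts (rows: asymptomatic/symptomatic, columns: COVID-/COVID+),
# illustrating a symptoms-based enrolment table; not the study data.
print(phi_and_mi(np.array([[2000.0, 400.0], [600.0, 3000.0]])))
```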

Fig. 2: Illustrative tables relating symptoms status with COVID-19 status.
figure 2

a, Symptoms-based enrolment, where individuals who are COVID+ are preferentially recruited on the basis of symptoms (percentages are calculated from the entire sample of individuals recruited into this study). b, General population enrolment on the basis of random sampling from an illustrative general population with a COVID-19 prevalence of 2%, where symptomatic individuals make up 20% and 65% of COVID− and COVID+ subpopulations, respectively. c, Matched enrolment, where the number of individuals who are COVID− and COVID+ is the same for each particular symptoms profile within the symptomatic and asymptomatic subgroups (percentages shown are for the matched test set in the current study). For each type of enrolment, the diagnostic accuracies of the resulting symptoms-only COVID-19 classifier are shown below the table: ρ, mutual information (MI), sensitivity, specificity and AUC.

We will discuss three simplified recruitment processes to illustrate the effects of different types of enrolment protocol. These are illustrated in Fig. 3 using the probabilistic framework of directed and undirected graphical models (a good introduction to which can be found in chapters 10 and 19 of ref. 26). As defined above, our goal is to train a classifier capable of predicting COVID-19 via its latent acoustic signature. We explain below how this requires the classifier predictions to be conditionally dependent on the latent COVID signature given self-reported symptoms, denoted by the red edge in Fig. 3ciii under a matched recruitment protocol; however, the corresponding edges are missing under the other recruitment protocols in Fig. 3aiii and Fig. 3biii.

Fig. 3: Study enrolment influences capability of classifiers to learn a latent COVID acoustic signature.
figure 3

a–c, Our goal is to train a classifier whose predictions are conditionally dependent on the latent COVID acoustic signature given self-reported symptoms, as denoted by the red edge in ciii under matched enrolment. This edge is not present in aiii or biii because the classifier’s predictive ability is mediated by self-reported symptoms under symptoms-based and general population enrolment. Rows a–c present three different enrolment protocols, whereas columns (i)–(iii) show different types of graph; in the undirected graphs, shaded nodes are observed variables, while the edge thickness is used to illustrate the approximate strength of conditional dependence. a, Symptoms-based enrolment enforces a supervised sampling regime in which individuals who are COVID+ are preferentially recruited on the basis of symptoms shown (for example, Fig. 2a). b, Randomized enrolment performs a random sampling of individuals from the general population. c, Matched enrolment balances the number of COVID+ and COVID− individuals that share each particular symptoms profile; (i) Bayesian knowledge graphs displaying a simplified causal model; (ii) undirected conditional independence graphs implied by the directed graphs in (i) when we condition on enrolment (\(e_i=1\)); (iii) undirected conditional independence graphs, as in (ii), but now showing the trained acoustic classifier \(\hat{h}({x}_{i})\) (trained to predict \(y_i\) based on input \(x_i\)) in lieu of the acoustic recording data \(x_i\).

First consider our simplified causal model of symptoms-based recruitment (Fig. 3ai). Enrolment is jointly influenced by COVID-19 status, self-reported symptoms and factors such as age and gender (Extended Data Fig. 1 shows a detailed Bayesian knowledge graph of the recruitment process). Collecting data only from enrolled individuals is, in effect, conditioning on \(e_i=1\) at the enrolment node in Fig. 3ai. As the enrolment node has directed edges incoming from both COVID-19 status and self-reported symptoms (that is, it is a collider node), conditioning on it induces a non-causal dependence between its parent nodes (in addition to the causal dependence of symptoms on COVID-19 status). Figure 3aii displays the moralized undirected graph implied by Fig. 3ai, conditional on enrolment, with the strong COVID-19-to-symptoms dependence represented by a thick line illustratively labelled ρ = 0.66 with reference to Fig. 2a. By contrast, Fig. 3bi is conditional on random enrolment and does not introduce any additional non-causal association between COVID-19 status and self-reported symptoms.

If a study’s enrolment bias is unaddressed and shared across both training and held-out test sets, a classifier that seems to perform well may not generalize to other datasets20,21. This is due to two effects: first, the classifier may learn to predict using confounding variables that are not causally related to COVID-19 but are associated due to their influence on enrolment (for example, gender, age or symptoms unrelated to COVID-19); second, even symptoms that are truly causally related to COVID-19, such as a new continuous cough, may exhibit inflated association with COVID-19 in the enrolled cohort due to their influence on enrolment (illustrated by the thick edges labelled ρ = 0.66 in Fig. 3aii,aiii).

As well as leading to poor generalizability, audible characteristics that are non-causally but strongly associated with COVID-19 can obscure any COVID-19 acoustic signature that may exist. This is illustrated in Fig. 3aiii, where the association between classifier prediction and SARS-CoV-2 status is mediated by symptoms instead of via a targeted latent COVID acoustic signature (that is, there are no edges corresponding to the red edge seen in Fig. 3ciii). Even in the case of randomized enrolment from the general population, a classifier may learn to predict SARS-CoV-2 status via self-reported symptoms, as opposed to via a latent COVID-19 acoustic signature, as illustrated in Fig. 3biii (again, a lack of red edge indicates that classifier predictions are conditionally independent of latent acoustic signature given self-reported symptoms).

Here, our goal is to build a classifier whose association with COVID-19 is mediated by an acoustic signature with the three properties defined above. We use the established epidemiological methodology known as matching27, whereby study enrolment balances the number of COVID+ and COVID− participants having each combination of potentially audible measured confounding variables. This has the effect of inducing independence between COVID-19 and these confounders in the matched population, as shown in Fig. 3cii. The classifier is then constrained to predict COVID-19 status either via the latent COVID-19 acoustic signature (via the red edge in Fig. 3ciii), or via unmeasured confounders.

Primary analyses

Pre-specified analysis plan

We designed and fixed a pre-specified analysis plan to increase the replicability of conclusions28. As part of this advance planning, we detailed the analyses to be conducted and generated the test/validate/train data splits through subsampling of the full dataset. The design of these splits is detailed in the Methods, with sample sizes listed in Fig. 1h.

Audio-based COVID-19 prediction performance

Table 1 presents our study’s COVID-19 prediction performance across nine train/validate/test splits, four modalities and three models: Self-Supervised Audio Spectrogram Transformer (SSAST), Bayesian neural networks (BNNs) and an openSMILE–support vector machine (SVM). The SSAST and BNN classifiers consistently outperform the baseline SVM, and the best prediction is achieved with the sentence modality. Reported results are for the SSAST performance on the sentence modality, unless stated otherwise. Under the randomized data split, the SSAST classifier achieves a high COVID-19 predictive accuracy of ROC–AUC = 0.846 [0.838–0.854]. We hypothesize that this strong predictive accuracy is mainly attributable to enrolment on the basis of self-reported symptoms, and explore this further in confirmatory analyses below.

Table 1 Results detailing the performance of the SVM, SSAST and BNN models on the nine evaluation tasks for each of the four audio modalities: sentence, three coughs, cough and exhalation

When we control for enrolment bias by matching on age, gender and self-reported symptoms, predictive accuracy drops to a consistently low level of ROC–AUC = 0.619 [0.594–0.644] in the matched test set, and ROC–AUC = 0.621 [0.605–0.637] in the longitudinal matched test set—that is, a temporally out-of-distribution test set consisting of only submissions after 29 November 2021 (both trained on the standard training set). When training instead on our matched training set, we see a minor improvement in the matched test set (ROC–AUC = 0.635 [0.610–0.660]), and, by contrast, a slight decrease in prediction accuracy in the longitudinal matched test set (ROC–AUC = 0.604 [0.588–0.620]). Figure 4 illustrates these different experimental settings and the corresponding classification performance. A cluster analysis is also performed on the SSAST learnt representations (detailed in Supplementary Note 2), visually demonstrating the effect of decoupling measured confounders and COVID-19 status. To explore whether classifier performance might be higher in some matched groups than in others, we calculated ROC–AUC within matched strata (Extended Data Fig. 2), observing the estimates and confidence intervals to be consistent with a homogeneously low predictive score of ROC–AUC = 0.62 across strata.

Fig. 4: Schematic demonstrating the importance of ascertainment bias adjustment when reporting COVID-19 detection performance from patient respiratory audio.
figure 4

Human figures represent study participants and their corresponding COVID-19 infection status, with the different colours portraying different demographic or symptomatic features. When participants are randomly split into training and test sets, the randomized split models perform well at COVID-19 detection, achieving AUCs in excess of 0.8; however, matched test set performance is seen to drop to an estimated AUC of between 0.60 and 0.65, with an AUC of 0.5 representing random classification. Inflated classification performance is also seen in engineered out-of-distribution test sets such as the designed test set, in which a select set of demographic groups appears solely in the testing set, and the longitudinal test set, in which there is no overlap in the time of submission between train and test instances. The 95% confidence intervals calculated via the normal approximation method are shown, along with the corresponding n numbers of the train and test sets. Figure 4 created with Biorender.com.

Confirmatory analyses and validation

The additional predictive value of ABCS

Audio-based classifiers can be useful in practice if they deliver improved performance relative to classifiers that are based on self-identifiable symptoms. Moreover, it is beneficial to assess the performance of ABCS classifiers in test sets reflecting the application of the testing protocol in a real-life setting. Here we generate a general population test set, through balanced subsampling without replacement from our combined standard and longitudinal test sets, to capture the age/gender/symptoms/COVID-19 profile of the general population during the pandemic. Specifically, the proportion of symptomatic individuals is set to 65% in the COVID+ subgroup29, compared with a setting of 10%, 20% or 30% symptomatic individuals in the COVID− subgroup; the age distribution is constrained to be the same in both COVID+ and COVID− subgroups; and males/females are balanced in a 1:1 ratio in each COVID+/COVID− subgroup. We benchmark the COVID-19 predictive performance of the audio-based SSAST classifier against the performance attainable through random forest (RF) classifiers trained on self-identifiable symptoms and demographic data (a ‘symptoms’ RF classifier). In the benchmarking we also include an RF classifier that takes as inputs the audio-based SSAST probabilistic outputs alongside self-identifiable symptoms and demographic data (a ‘symptoms + audio’ RF classifier). Training for all three classifiers is performed in our standard training set. The resulting ROC curves are shown in Fig. 5a–c. Focusing on the general population with 20% of COVID− individuals symptomatic in Fig. 5b, the combined symptoms + audio RF classifier (ROC–AUC = 0.787 [0.772–0.801], 95% DeLong CI) offers a significant (P = 9.7 × 10⁻¹¹, DeLong test) but small increase in predictive accuracy over the symptoms RF classifier (ROC–AUC = 0.757 [0.743–0.771], 95% DeLong CI), which in turn yields a less significant increase in ROC–AUC compared with the audio-only classifier (P = 0.0033) (ROC–AUC = 0.733 [0.717–0.748], 95% DeLong CI).
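The following sketch outlines this style of balanced subsampling; the data-frame column names (covid, gender, symptomatic) and the sample sizes are illustrative assumptions rather than the study code, and matching of the age distribution across groups is omitted for brevity.

```python
import pandas as pd

def simulate_general_population(df, n_per_group, pct_symp_neg, seed=0):
    """Subsample, without replacement, a test set in which 65% of COVID+ and
    `pct_symp_neg` of COVID- participants are symptomatic, with genders
    balanced 1:1 within each COVID group."""
    parts = []
    for covid, pct_symp in [(1, 0.65), (0, pct_symp_neg)]:
        for gender in ("female", "male"):
            sub = df[(df.covid == covid) & (df.gender == gender)]
            n = n_per_group // 2                 # 1:1 gender ratio per group
            n_symp = round(n * pct_symp)
            parts.append(sub[sub.symptomatic == 1].sample(n_symp, random_state=seed))
            parts.append(sub[sub.symptomatic == 0].sample(n - n_symp, random_state=seed))
    return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle rows

# e.g. test_20 = simulate_general_population(test_pool, n_per_group=2000, pct_symp_neg=0.20)
```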

Fig. 5: Comparison of sensitivity, specificity, and utility across audio-based and symptoms-based classifiers, as applied in simulated general populations.
figure 5

a–c, The percentage of COVID− individuals who are symptomatic in the general population varies between 10% and 30% across the three columns of panels (labelled at the top). Comparison of ROC curves between the audio, symptoms, and symptoms + audio classifiers; curves show sensitivity as a function of specificity, with error bars denoting 95% confidence intervals for sensitivity, where confidence intervals are calculated using the pROC::ci.se() R function and are based on a sample size of 2,000 stratified bootstrap replicates. The legends show the curve colour for each classifier alongside ROC–AUC estimates and 95% DeLong confidence intervals. d–f, Comparison of maximum expected utility across classifiers. Four different utility functions are included, as detailed in the top-right legend (utility function parameters Rt, ε and δ are defined in the ‘Results’ section; δ = 0 in this figure). Curves are coloured to indicate audio, symptoms, or symptoms + audio classifiers, as detailed in the top-left legend.

We replicate these findings using an external dataset18, observing qualitatively similar results with a symptoms classifier (ROC–AUC = 0.79 [0.71–0.87]) outperforming an SSAST audio-based classifier (ROC–AUC = 0.68 [0.59–0.77]) in a general population test set simulated from the external test set18, in which 20% of the individuals who are COVID− are symptomatic. We observe similar results when comparing a symptoms classifier with our SSAST and Han and colleagues’ convolutional neural network (CNN) on the external test set directly18: symptoms classifier ROC–AUC = 0.81 [0.76–0.86]; SSAST audio-based classifier ROC–AUC = 0.68 [0.62–0.74]; CNN audio-based classifier18 ROC–AUC = 0.66 [0.60–0.71] (see also Extended Data Fig. 3). The reported results for our SSAST and Han and colleagues’ CNN are for the ‘cough’ modality; however, we see similar results for both ‘breathing’ and ‘voice’.

Translating prediction accuracy into utility

To characterize the practical benefit of ABCS in any particular testing setting, we can specify the utility \({u}_{\hat{y},y}\) of predicting \(\hat{y}\in \{0,1\}\) for a random individual, in the targeted testing population, whose true COVID status is \(y\in \{0,1\}\), and calculate the per-test expected utility (EU) as

$$\begin{array}{ll}{{{\rm{EU}}}}&\equiv {\mathbb{E}}[{{{\rm{utility}}}}| {{{\boldsymbol{u}}}},\pi ,{{{\rm{sensitivity}}}},{{{\rm{specificity}}}}]\\&=\pi \left[({u}_{1,1}-{u}_{0,1})\times {{{\rm{sensitivity}}}}+{u}_{0,1}\right]\\& \hspace{.4cm} + (1-\pi )\left[({u}_{0,0}-{u}_{1,0})\times {{{\rm{specificity}}}}+{u}_{1,0}\right],\end{array}$$
(1)

where π is the COVID-19 prevalence in the tested population (equation (1) is derived in the Methods). The EU is increasing in both sensitivity and specificity, with their relative weights depending on prevalence π and utility u. Note that it is not only π and u that are context-dependent: the sensitivity and specificity of any particular COVID-19 classifier depend on the characteristics of the targeted testing population, as illustrated by the effects of variation in the proportion of COVID− individuals that are symptomatic across Fig. 5a–c.

Consider the following illustrative utility function, measured in units of the number of infections prevented:

$$\begin{array}{ll}{u}_{1,1}={R}_{t}-\varepsilon & [\text{true positive result for a } {{{\rm{COVID}}}}^{+} \text{ individual: } {R}_{t} \text{ infections prevented on average}]\\ {u}_{1,0}=-\varepsilon & [\text{false positive for a } {{{\rm{COVID}}}}^{-} \text{ individual: } -\varepsilon \text{ is the negative impact of self-isolation}]\\ {u}_{0,0}=0 & [\text{true negative for a } {{{\rm{COVID}}}}^{-} \text{ individual}]\\ {u}_{0,1}=-\delta & [\text{false negative for a } {{{\rm{COVID}}}}^{+} \text{ individual, causing } \delta \text{ additional infections on average}]\end{array}$$

There are three specified parameters in the above: (i) the number of cases prevented by intervention on a single individual is specified as the effective reproduction number, Rt ≥ 0, that is, the average number of infections that person would cause under no intervention, assuming that all individuals with a positive result follow self-isolation guidance and cause no transmission; (ii) ε ≥ 0 measures the cost of intervention (for example, the negative impact on health or education resulting from self-isolation); and (iii) δ ≥ 0 is the expected number of additional infections caused by a false-negative result (for example, due to reduced caution and increased social mixing following a negative result).

Figure 5d–f shows maximum EU, as a function of prevalence, for settings of Rt ∈ {1, 1.5} and ε ∈ {0.02, 0.2} and with δ = 0 (corresponding results for δ = 0.25 are shown in Extended Data Fig. 4). The maximization is performed point-wise with respect to sensitivity and specificity across the corresponding ROC curves in Fig. 5a–c. The utility of all classifiers decreases as the percentage of symptomatic COVID− individuals increases from 10% to 30% across Fig. 5d–f, the intuition being that it becomes more difficult to distinguish COVID− individuals from the 65% symptomatic COVID+ population. When we compare the symptoms + audio RF classifier with the symptoms RF classifier, neither is generally optimal, with each classifier showing greater EU than the other for some values of (π, Rt, ε).
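The maximization underlying Fig. 5d–f can be reproduced point-wise from any ROC curve. The sketch below implements equation (1) with the illustrative utility function above; the ROC points shown are placeholders rather than the study's fitted curves.

```python
def expected_utility(sens, spec, prevalence, r_t, eps, delta=0.0):
    """Per-test expected utility, equation (1), with the illustrative utilities
    u11 = R_t - eps, u10 = -eps, u00 = 0 and u01 = -delta."""
    u11, u10, u00, u01 = r_t - eps, -eps, 0.0, -delta
    return (prevalence * ((u11 - u01) * sens + u01)
            + (1 - prevalence) * ((u00 - u10) * spec + u10))

def max_expected_utility(roc_points, **kwargs):
    """Point-wise maximum of EU over (sensitivity, specificity) pairs on an ROC curve."""
    return max(expected_utility(sens, spec, **kwargs) for sens, spec in roc_points)

# Placeholder ROC curve and a 2% prevalence population with R_t = 1.5 and eps = 0.02.
roc = [(0.0, 1.0), (0.6, 0.9), (0.8, 0.7), (1.0, 0.0)]
print(max_expected_utility(roc, prevalence=0.02, r_t=1.5, eps=0.02))
```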

Exploratory approaches to identify the influence of unmeasured confounders

We explore whether the residual COVID-19 prediction performance of ROC–AUC = 0.62 in the matched test set is truly attributable to the targeted acoustic signature, or whether it stems from unmeasured confounders such as the audible recording environment, or unreported symptoms. We describe the two complementary exploratory methods in greater detail in Supplementary Note 1.

Method 1 investigates how much of the residual predictive variation persists when we map all matched test set samples to the first k principal components of the COVID− samples. We train a classifier on COVID-19 detection in this k-dimensional space and hypothesize that, for values of k below a threshold τ, correct classification is due to confounding in the signal. The value of τ is determined by running a calibration experiment and is set to 14 for the sentence modality. By removing these correctly classified cases to form a curated matched test set, we see a drop in SSAST performance to UAR = 0.51 (the results from this experiment can be found in Extended Data Fig. 5).
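A schematic outline of Method 1 is given below; the logistic-regression probe, feature matrices and variable names are stand-ins under assumed inputs, not the exact procedure of Supplementary Note 1.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def curate_matched_test_set(X, y, k=14):
    """Project all matched test samples onto the first k principal components of
    the COVID- samples, classify COVID-19 status in that low-dimensional space,
    and drop the correctly classified samples (attributed to confounding)."""
    pca = PCA(n_components=k).fit(X[y == 0])          # fit on COVID- samples only
    z = pca.transform(X)
    probe = LogisticRegression(max_iter=1000).fit(z, y)
    confounded = probe.predict(z) == y                # correctly classified cases
    return X[~confounded], y[~confounded]             # curated matched test set
```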

Method 2 examines how much residual predictive variation persists when, using a robust distance metric in openSMILE space, we map each COVID+ participant to their nearest COVID− neighbour. Predictive variation that persists in the space spanned by individuals who are COVID− is then attributed to unmeasured confounders. After the COVID+-to-COVID− nearest-neighbour mapping, the SVM matched test set ROC–AUC drops from 0.60 to 0.55. We interpret this persistent component of predictive variation after the mapping to COVID− individuals (that is, ROC–AUC drops only to 0.55 as opposed to 0.50) as pointing to some degree of unmeasured confounding contributing to the score of ROC–AUC = 0.60 in the matched test set.
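Method 2 can be sketched as a nearest-neighbour substitution in openSMILE feature space; Manhattan distance is used below purely as a placeholder for the robust metric detailed in Supplementary Note 1.

```python
from sklearn.neighbors import NearestNeighbors

def map_positives_to_negatives(X_pos, X_neg):
    """Replace each COVID+ openSMILE feature vector with that of its nearest
    COVID- neighbour; the SVM is then re-evaluated on the mapped features, and
    ROC-AUC remaining above 0.5 points to unmeasured confounding."""
    nn = NearestNeighbors(n_neighbors=1, metric="manhattan").fit(X_neg)
    _, idx = nn.kneighbors(X_pos)
    return X_neg[idx[:, 0]]
```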

Discussion

COVID-19 is well known to be causally related to particular self-identifiable symptoms such as a new continuous cough. This has allowed such symptoms to be used by governments during the pandemic as a basis for population intervention to control disease spread (for example, as a triage tool for individuals) via self-screening and without recourse to audio recording. It is therefore desirable to develop audio-based classifiers that can augment and complement the information provided by self-identifiable COVID-19-specific symptoms, that is, to learn clinically valuable latent acoustic signatures caused by COVID-19.

Problematically, enrolment on the basis of symptoms has the potential not only to artificially inflate the association between COVID-19 and its particular symptoms, but also to introduce association between COVID-19 and symptoms that are not COVID-19 specific. Furthermore, enrolment on the basis of other characteristics such as gender and age may also introduce non-causal COVID-19–gender or COVID-19–age associations in the enrolled subpopulation, possibly interacting with symptoms.

Under such recruitment bias, classifiers trained to predict COVID-19 in enrolled subpopulations may learn to predict self-identifiable COVID-19-specific symptoms, thereby providing no further utility beyond a classifier trained directly on those self-screened symptoms. It is worth noting that there are exceptions to this statement, including situations in which passive monitoring is necessary or when individuals have reasons to provide incorrect information. More concerningly, the classifier may learn to predict age/gender/non-COVID-19-specific symptoms as proxies for COVID-19 in the enrolled subpopulation, in which case its performance will not generalize to subpopulations unaffected by the same recruitment bias.

Han et al.18 examine several aspects of recruitment bias in ABCS, simulating the effects of biases introduced by age, gender, language, and by the same individuals appearing in both the train and test sets. Although their dataset is approximately balanced with respect to age and gender (across COVID+ and COVID− subgroups), it is imbalanced with respect to self-reported symptoms (84% of their COVID+ subgroup is symptomatic, compared with 49% of the COVID− subgroup; ref. 18, Fig. 1e). Such imbalance with respect to symptoms is also present in our study (prior to matching) and other studies for which data, including symptoms, are available (see Extended Data Fig. 6).

We make the following recommendations with the aim of clarifying the effects of recruitment bias and mediation by self-identifiable symptoms in future studies:

  1. Collect and disseminate metadata. Repositories of audio samples should include details of the study recruitment criteria and relevant metadata (for example, gender, age, symptoms, location, time since COVID-19 test) so that data can be filtered for quality and for relevance to hypothesis, and bias from measured confounders can be characterized and controlled if necessary.

  2. Characterize and control recruitment bias. Analyse data using methods that acknowledge and control for the effects of recruitment bias. We approached this by matching on measured confounders in our test and/or training sets.

  3. Design studies with bias control in mind. Matching leads to reduced sample size when performed post-recruitment, so it can be beneficial to design observational studies that recruit participants to maximize the potential for matching on measured confounders in the enrolled cohort.

  4. Focus on the added predictive value of classifiers. Quantify the additional predictive value offered by classifiers compared with standard methods.

  5. Assess classifier performance in targeted settings. Measures of classifiers’ predictive accuracy, such as ROC–AUC, sensitivity and specificity, vary depending upon the characteristics of the targeted population (for example, according to prevalence and the proportion of COVID+ and COVID− individuals that are symptomatic). Where possible, apply the classifier in a test set that reflects the appropriate application setting, for example, by subsampling a test set that is representative of the general population, as we do here.

  6. Examine classifiers’ expected utility. We can specify utilities for each testing outcome, that is, quantify the average benefit accrued from a true positive, the different benefit of a true negative, and similarly the costs attached to false positives and false negatives. The expected utility then provides a highly context-specific score for quantifying and comparing classifiers’ performance.

  7. Out-of-study replication. Replication studies could be performed in randomly sampled cohorts, or in pilot studies in real-world prediction settings with domain-specific utility functions. There are extra challenges when performing out-of-study replication; in particular, the audio-capture protocols might differ. Replication would be facilitated if standardized protocols for audio data gathering were collaboratively developed.

We conclude by outlining some limitations of our study, dataset and findings. There are potentially subtle unmeasured confounders across our recruitment channels REACT and T+T. For example, PCR testing in T+T usually occurs in the days after self-screening of symptoms, whereas PCR tests in REACT are more likely to occur on a date approximately pre-determined by study researchers, and therefore to be independent of participants’ symptoms. We attempted to control for such unmeasured confounders by including recruitment channel as one of the matched variables in the test set. Despite matching on measured confounders, some residual predictive variation persists (ROC–AUC = 0.62). Our exploratory approaches for characterizing this residual predictive variation (Methods and Supplementary Note 1) suggest that some of this residual performance may be due to unmeasured confounders, but these results are inconclusive. Our results are suggestive of little utility of ABCS in practice relative to symptoms-based screening (Fig. 5). The development of more sophisticated methods for training audio-based models—such as utilizing voice activity detection, fusing multiple models’ predictions, or new augmentation methods—in the presence of symptoms data and recruitment bias is a worthwhile and active area of research which, alongside careful design and replication of studies, will eventually provide full clarity on the potential of ABCS as a tool to protect public health.

Methods

Dataset and study design

This section contains an overview of how the dataset was collected, its characteristics and its underlying study design. More in-depth descriptions are provided in two accompanying papers: Budd and co-workers23 report a detailed description of the full dataset, whereas Pigoli et al.30 present the rationale for and full details of the statistical design of our study.

Recruitment sources

Our main sources of recruitment were the REACT study and the NHS T+T system. REACT is a prevalence survey of SARS-CoV-2 that is based on repeated cross-sectional samples from a representative subpopulation defined via (stratified) random sampling from England’s NHS patient register31. The NHS T+T service was a key part of the UK government’s COVID-19 recovery strategy for England. It ensured that anyone developing COVID-19 symptoms could be swab tested, followed by the tracing of recent close contacts of any individuals testing positive for SARS-CoV-2 (ref. 25).

Criteria for enrolment

Enrolment for both the REACT and NHS T+T recruitment channels was performed on an opt-in basis. Individuals participating in the REACT study were presented with the option to volunteer for this study. For the NHS T+T recruitment channel, individuals receiving a PCR test from the NHS T+T pillar 2 scheme were invited to take part in research (pillar 1 tests refer to all swab tests performed in Public Health England laboratories and NHS hospitals for those with a clinical need, and health and care workers, whereas pillar 2 comprises swab testing for the wider population25). The guidance provided to potential participants was that they should be at least 18 years old, have taken a recent swab test (initially no more than 48 h before, changing to 72 h on 14 May 2021), agree to our data privacy statement and have their PCR barcode identifier available, which was then internally validated.

Audio recordings

Participants were directed to the ‘Speak up and help beat coronavirus’ web page24. Here, after agreeing to the privacy statement and completing the survey questions, participants were asked to record four audio clips. The first involved the participant reading out the sentence: ‘I love nothing more than an afternoon cream tea’, which was designed to contain a range of different vowel and nasal sounds. This was followed by three successive sharp exhalations, taking the form of a ‘ha’ sound. The final two recordings involved the participant performing volitional/forced coughs, once, and then three times in succession. Recordings were saved in .wav format. Smartphones, tablets, laptops and desktops were all permitted. The audio recording protocol was homogenized across platforms to reduce the risk of bias due to device types.

Demographic and clinical/health metadata

Existing metadata such as age, gender, ethnicity and location were transferred from linked T+T/REACT records. Participants were not asked to repeat this information, to avoid survey fatigue. An additional set of attributes—hypothesized to offer the most utility for evaluating the possibility of COVID-19 detection from audio—was collected in the digital survey. This was in line with General Data Protection Regulation requirements that only the personal data necessary to the task should be collected and processed. This set included the symptoms currently being displayed (the full set of which is detailed in Fig. 1e,f), and long-term respiratory conditions such as asthma. The participants’ first language was also collected to control for different dialects/accents, and to complement location and ethnicity. Finally, the test centre at which the PCR was conducted was recorded. This enabled the removal of submissions when cases were linked to faulty test centre results. A full set of the dataset attributes can be found in Budd and colleagues23.

Final dataset

The final dataset is downstream of a quality control filter (see Fig. 1g), in which a total of 5,157 records were removed, each with one or more of the following characteristics: (1) missing response data (missing a PCR test); (2) missing predictor data (any missing audio files or missing demographic/symptoms metadata); (3) audio submission delays exceeding ten days post test result; (4) self-inconsistent symptoms data; (5) a PCR testing laboratory under investigation for unreliable results; (6) a participant age of under 18; and (7) sensitive personal information detected in the audio signal (see Fig. 3d of ref. 23). Pigoli et al.30 present these implemented filters in full, and the rationale behind each one. The final collected dataset, after data filtration, comprised 23,514 COVID+ and 44,328 COVID− individuals recruited between March 2021 and March 2022. Note that the sample size here differs from that in our accompanying papers: Budd et al.23 reported numbers before the data quality filter was applied, whereas our statistical study design considerations, detailed in a work by Pigoli and colleagues30, focused on data from the restricted date range spanning March to November 2021. We note that the step-like profile of the COVID− count is due to the six REACT rounds, in which a higher proportion of COVID− participants were recruited than in the T+T channel. As detailed in the geo-plots in Fig. 1a,b, the dataset achieves good coverage across England, with some areas yielding more recruited individuals than others. We observe no major correlation between geographical location and COVID-19 status (Fig. 1c), with Cornwall displaying the highest level of COVID-19 imbalance, at a 0.8 percentage point difference between the proportions of COVID+ and COVID− cases.

Data splits

In our pre-specified analysis plan, we defined three training sets and five test sets to define a range of analyses in which we investigate, characterize and control for the effects of enrolment bias in our data:

  • Randomized train and test sets. Participant-disjoint train and test sets were randomly created from the whole dataset, similar to the methods used in previous works.

  • Standard train and test set. Designed to be a challenging, out-of-distribution evaluation procedure. Carefully selected attributes such as geographical location, ethnicity and first language are held out for the test set. The standard test set was also engineered to over-represent sparse combinations of categories, such as older COVID+ participants30. The samples included in this split exclusively consist of recordings made prior to 29 November 2021.

  • Matched train and test sets. The numbers of COVID− and COVID+ participants are balanced within each of several key strata. Each stratum is defined by a unique combination of measured confounders, including binned age, gender and a number of binary symptoms (for example, cough, sore throat, shortness of breath; see ‘Matching methodology’ below for a full description). The samples included in this split exclusively consist of recordings made prior to 29 November 2021.

  • Longitudinal test set. To examine how classifiers generalized out-of-sample over time, the longitudinal test set was constructed only from participants joining the study after 29 November 2021.

  • Matched longitudinal test set. Within the longitudinal test set, the numbers of COVID− and COVID+ participants are balanced within each of several key strata, as in the matched test set above.

The sample sizes for each of these splits are detailed in Fig. 1h.

Machine learning models

Three separate models were implemented for the task of COVID-19 detection from audio, each representing an independent machine learning pipeline. These three models collectively span the machine learning research space thoroughly—ranging from the established baseline to the current state of the art in audio classification technologies—and are visually represented in Extended Data Fig. 7. We also fitted an RF classifier to predict COVID-19 status from self-reported symptoms and demographic data. The outcome used to train and test each of the prediction models was a participant’s SARS-CoV-2 PCR test result. Each model’s inputs and predictors, and the details on how they are handled, can be found below. Wherever applicable, we have reported our study’s findings in accordance with TRIPOD statement guidelines32. The following measures were used to assess model performance: ROC–AUC, area under the precision–recall curve (PR–AUC), and UAR (also known as balanced accuracy). Confidence intervals for ROC–AUC, PR–AUC and UAR are based on the normal approximation method33, unless otherwise stated to be calculated by the DeLong method34.
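For reference, a minimal sketch of an ROC–AUC point estimate with a normal-approximation interval is shown below; the Hanley–McNeil standard error is used here as one common choice and is not necessarily the exact formulation of ref. 33.

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics import roc_auc_score

def auc_with_normal_ci(y_true, y_score, level=0.95):
    """ROC-AUC with a Hanley-McNeil normal-approximation confidence interval."""
    y_true = np.asarray(y_true)
    auc = roc_auc_score(y_true, y_score)
    n_pos, n_neg = int((y_true == 1).sum()), int((y_true == 0).sum())
    q1, q2 = auc / (2 - auc), 2 * auc**2 / (1 + auc)
    se = np.sqrt((auc * (1 - auc) + (n_pos - 1) * (q1 - auc**2)
                  + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg))
    z = norm.ppf(0.5 + level / 2)
    return auc, (auc - z * se, auc + z * se)
```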

openSMILE–SVM

We defaulted to the widely used openSMILE–SVM approach35 for our baseline model. Here, 6,373 handcrafted features (the ComParE 2016 set)—including the zero-crossing rate and shimmer, which have been shown to represent human paralinguistics well—are extracted from the raw audio form. These features are then concatenated to form a 6,373-dimensional vector, \(f_{\mathrm{openSMILE}}({{{\bf{w}}}})\to {{{\bf{v}}}}\), where the raw waveform, \({{{\bf{w}}}}\in {{\mathbb{R}}}^{n}\) (n = clip duration in seconds × sample rate), is transformed to \({{{\bf{v}}}}\in {{\mathbb{R}}}^{6,373}\); v is then normalized prior to training and inference. A linear SVM is fitted to this space and tasked with binary classification. We select the optimal SVM configuration on the basis of the validation set before retraining on the combined train–validation set.
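A minimal sketch of this baseline using the opensmile Python package and scikit-learn is shown below; the SVM cost parameter and the placeholder training paths/labels are assumptions rather than the study's chosen configuration.

```python
import numpy as np
import opensmile
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# ComParE 2016 functionals: the 6,373 handcrafted features per recording.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def extract_features(wav_paths):
    return np.vstack([smile.process_file(p).to_numpy() for p in wav_paths])

# Normalize, then fit a linear SVM; C would be selected on the validation set
# before retraining on the combined train-validation set.
svm = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10_000))
# svm.fit(extract_features(train_paths), train_labels)   # placeholders
```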

ResNet-50 BNN

Bayesian neural networks provide estimates of uncertainty, alongside strong supervised classification performance, which is desirable for real-world use cases, especially those involving clinical use. Bayesian neural networks are naturally suited to Bayesian decision theory, which benefits decision-making applications with different costs on error types (for example, assigning unequal weighting to errors in different COVID-19 outcome classifications)36,37. We thus supply a ResNet-50 (ref. 38) BNN model. The base ResNet-50 model showed initial strong promise for ABCS5, further motivating its inclusion in this comparison. We obtain estimates of uncertainty through Monte-Carlo Dropout, which provides approximate Bayesian inference over the posterior, as in ref. 39. We opt to use the pre-trained model for a warm start to the weight approximations, and allow full retraining of layers.

The features used to create an intermediate representation, as input to the convolutional layers, are Mel filterbank features with default configuration from the VGGish GitHub (ref. 40): \({{{{\bf{X}}}}}_{i}\in {{\mathbb{R}}}^{96\times 64}\), 64 log-mel spectrogram coefficients using 96 feature frames of 10 ms duration, taken from a resampled signal at 16 kHz. Each input signal was divided into these two-dimensional windows, such that a 2,880 ms clip would produce three training examples with the label assigned to each clip (COVID+ or COVID−). Incomplete frames at edges were discarded. As with the openSMILE–SVM, silence was not removed. For evaluation, the mean prediction over feature windows was taken per audio recording, to produce a single decision per participant. To make use of the available uncertainty metrics, Supplementary Note 3 details an uncertainty analysis over all audio modalities for a range of train–test partitions.
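The front-end windowing can be sketched with torchaudio as below; the 25 ms window and 10 ms hop follow the VGGish defaults40, and the mono (1 × n samples) waveform shape is an assumption of this sketch.

```python
import torch
import torchaudio

def logmel_windows(waveform, sample_rate):
    """Split a mono recording into non-overlapping 96 x 64 log-mel windows
    (64 mel bins, 25 ms frames, 10 ms hop at 16 kHz); incomplete edge frames
    are discarded and each window inherits the recording's COVID-19 label."""
    if sample_rate != 16_000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
    melspec = torchaudio.transforms.MelSpectrogram(
        sample_rate=16_000, n_fft=400, hop_length=160, n_mels=64)(waveform)
    logmel = torch.log(melspec + 1e-6).squeeze(0).T     # (frames, 64)
    n_windows = logmel.shape[0] // 96
    return logmel[: n_windows * 96].reshape(n_windows, 96, 64)

# At evaluation time, the BNN's predictions over a recording's windows are
# averaged to give a single per-participant decision.
```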

SSAST

In recent years, transformers41 have started to perform well in high-dimensional settings such as audio42,43. This is particularly the case when models are first trained in a self-supervised manner on unlabelled audio data. We adopt the SSAST44, which is on a par with the current state of the art for audio event classification. Raw audio is first resampled to 16 kHz and normalized before being transformed into Mel filter banks. Strided convolutional neural layers are used to project the Mel filter bank to a series of patch-level representations. During self-supervised pretraining, random patches are masked before all of the patches are passed to a transformer encoder. The model is trained to jointly reconstruct the masked audio and to classify the order in which the masked patches occur. The transformer is made up of 12 multihead attention blocks. The model is trained end to end, with gradients being passed all of the way back to the convolutional feature extractors. The model is pre-trained on a combined set of AudioSet-2M (ref. 45) and Librispeech46, representing over two million audio clips, for a total of ten epochs. The model is then fine-tuned in a supervised manner on the task of COVID-19 detection from audio. Silent sections of audio recordings are removed before the audio is resampled to 16 kHz and normalized. Clips are cut/zero-padded to a fixed length of 5.12 s, which corresponds approximately to the mean length of the audio clips. For cases in which the signal length exceeds 5.12 s (after silence is removed), the first 5.12 s are taken. At training time, the signal is augmented through applying SpecAugment47 along with the addition of Gaussian noise. The output representations are mean pooled before being fed through a linear projection head. No layers are frozen and again the model is trained end to end. The model is fine-tuned for a total of 20 epochs. The model is evaluated on the validation set at the end of each epoch and its weights are saved. At the end of training, the best performing model over all epochs is chosen.
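A sketch of the fine-tuning input pipeline is given below; the mel-bin count, masking parameters and noise scale are illustrative assumptions, and the silence-removal step is omitted for brevity.

```python
import torch
import torchaudio

TARGET_LEN = int(5.12 * 16_000)            # fixed 5.12 s input at 16 kHz

def prepare_clip(waveform, sample_rate, training=False):
    """Resample to 16 kHz, normalize, cut/zero-pad to 5.12 s, compute mel filter
    banks and, at training time, apply SpecAugment-style masking plus Gaussian noise."""
    wav = torchaudio.functional.resample(waveform, sample_rate, 16_000)
    wav = (wav - wav.mean()) / (wav.std() + 1e-8)
    wav = wav[..., :TARGET_LEN]
    wav = torch.nn.functional.pad(wav, (0, TARGET_LEN - wav.shape[-1]))
    fbank = torchaudio.compliance.kaldi.fbank(wav, num_mel_bins=128,
                                              sample_frequency=16_000)   # (time, mel)
    if training:
        spec = fbank.T.unsqueeze(0)                                      # (1, mel, time)
        spec = torchaudio.transforms.FrequencyMasking(freq_mask_param=24)(spec)
        spec = torchaudio.transforms.TimeMasking(time_mask_param=96)(spec)
        fbank = spec.squeeze(0).T + 0.01 * torch.randn_like(fbank)
    return fbank
```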

Random forest classifier

To predict SARS-CoV-2 infection status from self-reported symptoms and demographic data, we applied an RF classifier with default settings (having self-reported symptoms and demographic data as inputs). In our dataset, predictor variables for the symptoms RF classifier comprised: cough; sore throat; asthma; shortness of breath; runny/blocked nose; a new continuous cough; chronic obstructive pulmonary disease (COPD) or emphysema; another respiratory condition; age; gender; smoker status; and ethnicity. In Han and colleagues’ dataset18, predictor variables for the symptoms RF classifier comprised: tightness of chest; dry cough; wet cough; runny/blocked nose; chills; smell/taste loss; muscle ache; headache; sore throat; short breath; dizziness; fever; age; gender; smoker status; language; and location. Prior to training, categorical attributes were one-hot encoded. No hyperparameter tuning was performed, and models were trained on the combined standard train and validation sets. For the hybrid symptoms + audio RF classifier, the predicted COVID+ probability output by an audio-trained SSAST is appended as an additional input variable to the self-reported symptoms and demographic variables listed above.
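A sketch of the symptoms (and symptoms + audio) RF classifier is shown below; the column names are illustrative stand-ins for the predictor variables listed above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

SYMPTOM_COLS = ["cough", "sore_throat", "asthma", "shortness_of_breath",
                "runny_blocked_nose", "new_continuous_cough", "copd_emphysema",
                "other_respiratory_condition"]
CATEGORICAL_COLS = ["age_bin", "gender", "smoker_status", "ethnicity"]

def fit_symptoms_rf(df, labels, audio_probs=None):
    """Default-settings RF on one-hot-encoded symptoms/demographics; passing the
    SSAST predicted COVID+ probability via `audio_probs` gives the hybrid
    symptoms + audio classifier."""
    X = pd.get_dummies(df[SYMPTOM_COLS + CATEGORICAL_COLS], columns=CATEGORICAL_COLS)
    if audio_probs is not None:
        X = X.assign(ssast_covid_prob=audio_probs)
    return RandomForestClassifier(random_state=0).fit(X, labels)
```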

Matching methodology

The matched test set was constructed by exactly balancing the numbers of COVID+ and COVID− individuals in each stratum where, to be in the same stratum, individuals must be matched on all of (recruitment channel) × (10-year-wide age bins) × (gender) × (all six binary symptom covariates). The six binary symptoms matched on in the matched test set were: cough; sore throat; asthma; shortness of breath; runny/blocked nose; and ‘at least one symptom’.

Our matching algorithm proceeds as follows. First, each participant is mapped to exactly one stratum. Second, the following matching procedure is applied separately in each stratum: in stratum s (of a total of S strata) let \(n_{s,+}\) and \(n_{s,-}\) denote the number of COVID+ and COVID− individuals, respectively, and let \({{{{\mathscr{A}}}}}_{s,+}\) and \({{{{\mathscr{A}}}}}_{s,-}\) be the corresponding sets of individuals. Use \({{{{\mathscr{M}}}}}_{s,+}\) and \({{{{\mathscr{M}}}}}_{s,-}\) to denote random samples without replacement of size \(\min \{{n}_{s,+},{n}_{s,-}\}\) from \({{{{\mathscr{A}}}}}_{s,+}\) and \({{{{\mathscr{A}}}}}_{s,-}\), respectively. Finally, we combine matched individuals across all strata into the matched dataset \({{{\mathscr{M}}}}\), defined as:

$${{{\mathscr{M}}}}:= {\cup }_{s = 1}^{S}({{{{\mathscr{M}}}}}_{s,+}\cup {{{{\mathscr{M}}}}}_{s,-}).$$

The resulting matched test set comprised 907 participants who were COVID positive and 907 who were COVID negative. The matched training set was constructed similarly to the matched test set, though with slightly different strata, so as to increase the available sample size. For the matched training set, individuals were matched on all of: (10-year-wide age bins) × (gender) × (all seven binary covariates). The seven binary covariates used for the matched training set were: cough; sore throat; asthma; shortness of breath; runny/blocked nose; COPD or emphysema; and smoker status. The resulting matched training set comprised 2,599 participants who were COVID positive and 2,599 who were COVID negative.
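The matching procedure above can be sketched as follows; the data-frame column names are illustrative, and each stratum corresponds to a unique combination of the matching variables.

```python
import pandas as pd

def exact_match(df, match_cols, label_col="covid", seed=0):
    """Within each stratum (a unique combination of `match_cols`), randomly
    subsample the larger of the COVID+/COVID- groups without replacement so that
    both contain min(n+, n-) individuals, then pool the strata."""
    matched = []
    for _, stratum in df.groupby(match_cols):
        pos, neg = stratum[stratum[label_col] == 1], stratum[stratum[label_col] == 0]
        n = min(len(pos), len(neg))
        if n > 0:
            matched.append(pos.sample(n, random_state=seed))
            matched.append(neg.sample(n, random_state=seed))
    return pd.concat(matched)

# e.g. matched_test = exact_match(test_pool, ["recruitment_channel", "age_bin",
#                                             "gender", *SIX_SYMPTOM_COLS])
```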

Quantifying the expected utility of a testing protocol

We consider the action of applying a particular testing protocol to an individual randomly selected from a population. The four possible outcomes \({O}_{\hat{y},y}\) are

$${O}_{\hat{y},y}:= [\,{{\mbox{Predict COVID-19 status as}}}\,\,\hat{y}]\,{{{\rm{AND}}}}\,[\,{{\mbox{True COVID-19 status is}}}\,\,y]$$
(2)

for predicted COVID-19 status \(\hat{y}\in \{0,1\}\) and true COVID-19 status \(y\in \{0,1\}\). We denote the probability of outcome \({O}_{\hat{y},y}\) by

$${p}_{\hat{y},y}:= {\mathbb{P}}({O}_{\hat{y},y})$$
(3)

and use \({u}_{\hat{y},y}\) to denote the combined utility of the consequences of outcome \({O}_{\hat{y},y}\). For a particular population prevalence proportion, π, the \({p}_{\hat{y},y}\) are subject to the constraints

$${p}_{0,1}+{p}_{1,1}=\uppi$$
(4)
$${p}_{0,0}+{p}_{1,0}=1-\uppi ,$$
(5)

leading to the following relationships, valid for π ∈ (0, 1), involving the sensitivity and specificity of the testing protocol:

$${{{\rm{sensitivity}}}}\equiv \frac{{p}_{1,1}}{{p}_{1,1}+{p}_{0,1}}=\frac{{p}_{1,1}}{\uppi }$$
(6)
$${{{\rm{specificity}}}}\equiv \frac{{p}_{0,0}}{{p}_{0,0}+{p}_{1,0}}=\frac{{p}_{0,0}}{1-\uppi }.$$
(7)

The expected utility is:

$${{{\rm{EU}}}}=\mathop{\sum}\limits_{\hat{y}\in \{0,1\}}\mathop{\sum}\limits_{y\in \{0,1\}}{u}_{\hat{y},y}{p}_{\hat{y},y}$$
(8)
$$={u}_{1,1}{p}_{1,1}+{u}_{0,1}(\uppi -{p}_{1,1})+{u}_{0,0}{p}_{0,0}+{u}_{1,0}(1-\uppi -{p}_{0,0})$$
(9)
$$\begin{array}{l}=\uppi [({u}_{1,1}-{u}_{0,1})\times {{{\rm{sensitivity}}}}+{u}_{0,1}]\\+(1-\uppi )[({u}_{0,0}-{u}_{1,0})\times {{{\rm{specificity}}}}+{u}_{1,0}],\end{array}$$
(10)

where equations (4) and (5) are substituted into equation (8) to obtain equation (9), and equations (6) and (7) are substituted into equation (9) to obtain equation (10).

Demonstration code

To provide researchers with easy access to running the code, we have created a demonstration notebook in which the participant is invited to record their own ‘sentence’, ‘cough’, ‘three cough’ or ‘exhalation’ sounds and evaluate our COVID-19 detection machine learning models on them. The model outputs a COVID-19 prediction, along with some explainable AI analysis, for example, enabling the user to listen back to the parts of the signal to which the model allocated the most attention. In the demonstration, we make clear that this is not a clinical diagnostic test for COVID-19, but is instead for research purposes and does not provide any medical recommendation, nor should any action be taken following its use. The demonstration file is detailed on the main repository page and can be accessed at https://colab.research.google.com/drive/1Hdy2H6lrfEocUBfz3LoC5EDJrJr2GXpu?usp=sharing.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.