Main

The coronavirus disease 2019 (COVID-19) pandemic has been estimated by the World Health Organization (WHO) to have caused 14.9 million excess deaths over the 2020–2021 period1. An accepted public health control measure for emerging infectious diseases is the isolation of infected individuals2. As COVID-19 transmission occurs in both symptomatic and asymptomatic cases3, especially prior to nationwide vaccination deployment, a scalable and accurate test for the infection is crucial to avoid general population quarantine.

This has sparked an intense interest in AI-based classifiers that use respiratory audio data to classify severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection status (which we here refer to as COVID-19 status) via a digital screening tool that anyone with a smartphone or computer can use4,5,6,7,8,9,10,11,12,13,14,15,16,17,18. In our review, as of July 2022, we found 93 published papers that reported evidence for the potential of audio-based COVID-19 classification. Of these 93 papers, 75 report an area under the curve (AUC) (or F1) of over 0.75, whereas 44 report a performance of above 0.90. Extended Data Table 1 summarizes nine highly cited datasets and their corresponding classification performance.

Despite these encouraging results, concerns remain that the prediction models may not be transferable to real-world settings11,15,18,19,20,21. In some cases, data quality may be lowered by, for example, sampling biases, lack of verification of participants’ COVID-19 status, a long delay between infection and audio recording, or small numbers of individuals who are SARS-CoV-2 reverse transcription polymerase chain reaction (PCR)-positive (COVID+)21. Akin to findings in AI radiographic COVID-19 detection22, concerns centre around whether the learnt audio features are unique audio biomarkers caused by COVID-19 in the infected individual, or are due to other confounding signals.

Here we analyse the largest PCR-validated dataset collected so far in the field of audio-based COVID-19 screening (ABCS). We design and specify an analysis plan in advance to investigate whether using audio-based classifiers can improve the accuracy of COVID-19 screening over using self-reported symptoms.

Our contribution is as follows:

  • – We collect a respiratory acoustic dataset of 67,842 individuals with linked PCR test outcomes, including 23,514 who tested positive for COVID-19. This is, to the best of our knowledge, the largest PCR-validated dataset collected of its kind so far23.

  • – We fit a range of AI classifiers and observe strong COVID-19 predictive performance (receiver operating characteristic area under the curve (ROC–AUC) = 0.85), as has been reported in past studies, for example refs. 4,5,6,7,8,9,10,11,12,13,14,15,16,17,18; however, when controlling for measured confounders by matching, only a small amount of residual predictive variation remains (ROC–AUC = 0.62), some of which we attribute to unmeasured confounders.

  • – We find the COVID-19 predictive performance and practical utility of audio-based AI classifiers—as applied in simulated realistic settings—to be no better than classification on the basis of self-reported symptoms; we replicate this finding by fitting our classifiers in an external dataset.

  • – These results suggest that audio-based classifiers learn to predict COVID-19 via self-reported symptoms and potentially other confounders. Study recruitment on the basis of self-screened symptoms seems to be an important driver of this effect.

  • – We provide best-practice recommendations on how to address this problem in future studies.

  • – Our dataset and code-base are publicly available to enable reproducibility of results and to encourage further research into respiratory audio analysis and bias mitigation in high-dimensional, over-parameterized settings23.

Our work is timely in highlighting the need for careful construction of machine learning evaluation procedures, aimed at yielding representative performance metrics. The important lessons from this case study on the effects of confounding extend across many applications in AI—where biases are often hard to spot and difficult to control for.

Results

Study design

This study invited volunteers from the Real-time Assessment of Community Transmission (REACT) programme and the National Health Service (NHS) Test-and-Trace (T+T) service to participate between March 2021 and March 2022 on an opt-in basis. Volunteers were directed to the ‘Speak up and help beat coronavirus’ web page24, where they were instructed to provide audio recordings of four respiratory audio modalities. Demographic and health metadata, along with a validated PCR test result, were transferred from existing T+T/REACT records. Further audio-specific metadata were produced from the audio files after collection. The final dataset comprised 23,514 COVID+ and 44,328 SARS-CoV-2 PCR-negative (COVID−) individuals. Figure 1 summarizes the dataset (a more detailed description of which is provided in Methods) and a full presentation can be found in the accompanying dataset paper23.

Fig. 1: Demographic statistics of collected dataset.
figure 1

a,b, Geographical locations of COVID positive (a) and negative (b) PCR-confirmed participants. Colour bar units are individual participant count. c, \(\frac{100\times \#\,\text{negative participants at location}}{\#\,\text{negative participants in total}}-\frac{100\times \#\,\text{positive participants at location}}{\#\,\text{positive participants in total}}\). Colour bar units are the difference in percentage points. d, Cumulative count of the number of participants partaking in the study. e,f, The 21 most common combinations of symptoms for COVID positive (e) and negative (f) participants, ordered along the x-axis by the total number of participants displaying that particular combination of symptoms. Symptoms are ordered along the y-axis according to the total number of participants displaying at least that symptom at the time of recording. g, Schematic detailing the two recruitment sources for the study and the filtration steps applied to yield the final dataset. h, Dataset splits in participant numbers.

Defining the acoustic target for COVID-19 screening

If a practically effective acoustic signal were to exist in SARS-CoV-2-infected individuals’ respiratory sounds, we propose that it would have the following properties:

  • P1: Caused by COVID-19. COVID-19 is well known to cause symptoms (such as a new continuous cough) that can be readily self-screened by individuals in the general population. The acoustic target would likewise be linked causally to COVID-19 and would therefore be more likely to generalize to other contexts and populations than non-causal associations.

  • P2: Not self-identifiable. The acoustic target would not directly represent self-identifiable symptoms that can be self-identified effectively by individuals in the general population. This is because: (1) it is more straightforward to measure self-identifiable symptoms directly using a questionnaire, rather than measuring them indirectly via audio; and (2) as we explain below, self-identifiable symptoms can affect enrolment and may therefore be strongly non-causally associated with COVID-19 in enrolled subpopulations.

  • P3: Enables high-utility COVID-19 screening. For an audio-based classifier to perform strongly in practical settings, it should possess high sensitivity and specificity, corresponding to an acoustic signal that would be detectable in high and low proportions of individuals who are COVID+ and COVID−, respectively. We formalize the mathematical relationship linking expected utility, sensitivity and specificity in equation (1) (see Methods).

Characterizing and controlling recruitment bias

In audio-based COVID-19 classification, results can be highly sensitive to the characteristics of the enrolled population. Our study’s recruitment protocol is subject to enrolment bias because the vast majority of individuals in pillar 2 of the UK government’s NHS T+T programme25 were PCR tested as a direct consequence of reporting symptoms (see Methods). Figure 1e,f display our participants’ symptom profiles, stratified by COVID-19 infection status. Figure 2a presents the joint distribution of COVID-19 status and binary symptoms status as ‘symptoms-based enrolment’, in contrast to Fig. 2b, which presents ‘general population enrolment’ on the basis of random sampling from a general population with 2% COVID+ prevalence. Note that the dependence between binary symptoms status and COVID-19 is stronger under symptoms-based enrolment (population correlation coefficient ρ = 0.66) than general population enrolment (ρ = 0.15).
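As a concrete illustration of how these dependence measures are obtained from a two-by-two table of binary symptoms status against COVID-19 status, the minimal Python sketch below computes the correlation coefficient ρ (the phi coefficient) and the mutual information reported in Fig. 2; the counts used are hypothetical placeholders, not the study data.

```python
import numpy as np

def phi_and_mi(table):
    """Correlation (phi) and mutual information for a 2x2 contingency table,
    where table[i, j] counts individuals with symptoms status i (0/1) and
    COVID-19 status j (0/1)."""
    p = table / table.sum()                      # joint distribution
    ps, pc = p.sum(axis=1), p.sum(axis=0)        # marginals
    # For two binary variables, the Pearson correlation reduces to the phi coefficient.
    phi = (p[1, 1] * p[0, 0] - p[1, 0] * p[0, 1]) / np.sqrt(ps[0] * ps[1] * pc[0] * pc[1])
    # Mutual information, in bits.
    mi = sum(p[i, j] * np.log2(p[i, j] / (ps[i] * pc[j]))
             for i in (0, 1) for j in (0, 1) if p[i, j] > 0)
    return phi, mi

# Hypothetical counts (rows: asymptomatic/symptomatic, columns: COVID-/COVID+),
# illustrating a symptoms-based enrolment table; not the study data.
print(phi_and_mi(np.array([[2000.0, 400.0], [600.0, 3000.0]])))
```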

Fig. 2: Illustrative tables relating symptoms status with COVID-19 status.
figure 2

a, Symptoms-based enrolment, where individuals who are COVID+ are preferentially recruited on the basis of symptoms (percentages are calculated from the entire sample of individuals recruited into this study). b, General population enrolment on the basis of random sampling from an illustrative general population with a COVID-19 prevalence of 2%, where symptomatic individuals make up 20% and 65% of COVID− and COVID+ subpopulations, respectively. c, Matched enrolment, where the number of individuals who are COVID− and COVID+ is the same for each particular symptoms profile within the symptomatic and asymptomatic subgroups (percentages shown are for the matched test set in the current study). For each type of enrolment, the diagnostic accuracies of the resulting symptoms-only COVID-19 classifier are shown below the table: ρ, mutual information (MI), sensitivity, specificity and AUC.

We will discuss three simplified recruitment processes to illustrate the effects of different types of enrolment protocol. These are illustrated in Fig. 3 using the probabilistic framework of directed and undirected graphical models (a good introduction to which can be found in chapters 10 and 19 of ref. 26). As defined above, our goal is to train a classifier capable of predicting COVID-19 via its latent acoustic signature. We explain below how this requires the classifier predictions to be conditionally dependent on the latent COVID signature given self-reported symptoms, denoted by the red edge in Fig. 3ciii under a matched recruitment protocol; however, the corresponding edges are missing under the other recruitment protocols in Fig. 3aiii and Fig. 3biii.

Fig. 3: Study enrolment influences capability of classifiers to learn a latent COVID acoustic signature.
figure 3

a–c, Our goal is to train a classifier whose predictions are conditionally dependent on the latent COVID acoustic signature given self-reported symptoms, as denoted by the red edge in ciii under matched enrolment. This edge is not present in aiii or biii because the classifier’s predictive ability is mediated by self-reported symptoms under symptoms-based and general population enrolment. Rows a–c present three different enrolment protocols, whereas columns (i)–(iii) show different types of graph; in the undirected graphs, shaded nodes are observed variables, while the edge thickness is used to illustrate the approximate strength of conditional dependence. a, Symptoms-based enrolment enforces a supervised sampling regime in which individuals who are COVID+ are preferentially recruited on the basis of symptoms shown (for example, Fig. 2a). b, Randomized enrolment performs a random sampling of individuals from the general population. c, Matched enrolment balances the number of COVID+ and COVID− individuals that share each particular symptoms profile; (i) Bayesian knowledge graphs displaying a simplified causal model; (ii) undirected conditional independence graphs implied by the directed graphs in (i) when we condition on enrolment (\(e_i=1\)); (iii) undirected conditional independence graphs, as in (ii), but now showing the trained acoustic classifier \(\hat{h}({x}_{i})\) (trained to predict \(y_i\) based on input \(x_i\)) in lieu of the acoustic recording data \(x_i\).

First consider our simplified causal model of symptoms-based recruitment (Fig. 3ai). Enrolment is jointly influenced by COVID-19 status, self-reported symptoms and factors such as age and gender (Extended Data Fig. 1 shows a detailed Bayesian knowledge graph of the recruitment process). Collecting data only from enrolled individuals is, in effect, conditioning on \(e_i=1\) at the enrolment node in Fig. 3ai. As the enrolment node has directed edges incoming from both COVID-19 status and self-reported symptoms (that is, it is a collider node), conditioning on it induces a non-causal dependence between its parent nodes (in addition to the causal dependence of symptoms on COVID-19 status). Figure 3aii displays the moralized undirected graph implied by Fig. 3ai, conditional on enrolment, with the strong COVID-19-to-symptoms dependence represented by a thick line illustratively labelled ρ = 0.66 with reference to Fig. 2a. By contrast, Fig. 3bi is conditional on random enrolment and does not introduce any additional non-causal association between COVID-19 status and self-reported symptoms.

If a study’s enrolment bias is unaddressed and shared across both training and held-out test sets, a classifier that seems to perform well may not generalize to other datasets20,21. This is due to two effects: first, the classifier may learn to predict using confounding variables that are not causally related to COVID-19 but are associated due to their influence on enrolment (for example, gender, age or symptoms unrelated to COVID-19); second, even symptoms that are truly causally related to COVID-19, such as a new continuous cough, may exhibit inflated association with COVID-19 in the enrolled cohort due to their influence on enrolment (illustrated by the thick edges labelled ρ = 0.66 in Fig. 3aii,aiii).

As well as leading to poor generalizability, audible characteristics that are non-causally but strongly associated with COVID-19 can obscure any COVID-19 acoustic signature that may exist. This is illustrated in Fig. 3aiii, where the association between classifier prediction and SARS-CoV-2 status is mediated by symptoms instead of via a targeted latent COVID acoustic signature (that is, there are no edges corresponding to the red edge seen in Fig. 3ciii). Even in the case of randomized enrolment from the general population, a classifier may learn to predict SARS-CoV-2 status via self-reported symptoms, as opposed to via a latent COVID-19 acoustic signature, as illustrated in Fig. 3biii (again, a lack of red edge indicates that classifier predictions are conditionally independent of latent acoustic signature given self-reported symptoms).

Here, our goal is to build a classifier whose association with COVID-19 is mediated by an acoustic signature with the three properties defined above. We use the established epidemiological methodology known as matching27, whereby study enrolment balances the number of COVID+ and COVID− participants having each combination of potentially audible measured confounding variables. This has the effect of inducing independence between COVID-19 and these confounders in the matched population, as shown in Fig. 3cii. The classifier is then constrained to predict COVID-19 status either via the latent COVID-19 acoustic signature (via the red edge in Fig. 3ciii), or via unmeasured confounders.

Primary analyses

Pre-specified analysis plan

We designed and fixed a pre-specified analysis plan to increase the replicability of conclusions28. As part of this advance planning, we detailed the analyses to be conducted and generated the test/validate/train data splits through subsampling of the full dataset. The design of these splits is detailed in the Methods, with sample sizes listed in Fig. 1h.

Audio-based COVID-19 prediction performance

Table 1 presents our study’s COVID-19 prediction performance across nine train/validate/test splits, four modalities and three models: Self-Supervised Audio Spectrogram Transformer (SSAST), Bayesian neural networks (BNNs) and an openSMILE–support vector machine (SVM). The SSAST and BNN classifiers consistently outperform the baseline SVM, and the best prediction is achieved with the sentence modality. Reported results are for the SSAST performance on the sentence modality, unless stated otherwise. Under the randomized data split, the SSAST classifier achieves a high COVID-19 predictive accuracy of ROC–AUC = 0.846 [0.838–0.854]. We hypothesize that this strong predictive accuracy is mainly attributable to enrolment on the basis of self-reported symptoms, and explore this further in confirmatory analyses below.

Table 1 Results detailing the performance of the SVM, SSAST and BNN models on the nine evaluation tasks for each of the four audio modalities: sentence, three coughs, cough and exhalation

When we control for enrolment bias by matching on age, gender and self-reported symptoms, predictive accuracy drops to a consistently low level of ROC–AUC = 0.619 [0.594–0.644] in the matched test set, and ROC–AUC = 0.621 [0.605–0.637] in the longitudinal matched test set—that is, a temporally out-of-distribution test set consisting of only submissions after 29 November 2021 (both trained on the standard training set). When training instead on our matched training set, we see a minor improvement in the matched test set (ROC–AUC = 0.635 [0.610–0.660]), and, by contrast, a slight decrease in prediction accuracy in the longitudinal matched test set (ROC–AUC = 0.604 [0.588–0.620]). Figure 4 illustrates these different experimental settings and the corresponding classification performance. A cluster analysis is also performed on the SSAST learnt representations (detailed in Supplementary Note 2), visually demonstrating the effect of decoupling measured confounders and COVID-19 status. To explore whether classifier performance might be higher in some matched groups than in others, we calculated ROC–AUC within matched strata (Extended Data Fig. 2), observing the estimates and confidence intervals to be consistent with a homogeneously low predictive score of ROC–AUC = 0.62 across strata.

Fig. 4: Schematic demonstrating the importance of ascertainment bias adjustment when reporting COVID-19 detection performance from patient respiratory audio.
figure 4

Human figures represent study participants and their corresponding COVID-19 infection status, with the different colours portraying different demographic or symptomatic features. When participants are randomly split into training and test sets, the randomized split models perform well at COVID-19 detection, achieving AUCs in excess of 0.8; however, matched test set performance is seen to drop to an estimated AUC of between 0.60 and 0.65, with an AUC of 0.5 representing random classification. Inflated classification performance is also seen in engineered out-of-distribution test sets such as the designed test set, in which a select set of demographic groups appears solely in the testing set, and the longitudinal test set, in which there is no overlap in the time of submission between train and test instances. The 95% confidence intervals calculated via the normal approximation method are shown, along with the corresponding n numbers of the train and test sets. Figure 4 created with Biorender.com.

Confirmatory analyses and validation

The additional predictive value of ABCS

Audio-based classifiers can be useful in practice if they deliver improved performance relative to classifiers that are based on self-identifiable symptoms. Moreover, it is beneficial to assess the performance of ABCS classifiers in test sets reflecting the application of the testing protocol in a real-life setting. Here we generate a general population test set, through balanced subsampling without replacement from our combined standard and longitudinal test sets, to capture the age/gender/symptoms/COVID-19 profile of the general population during the pandemic. Specifically, the proportion of symptomatic individuals is set to 65% in the COVID+ subgroup29, compared with a setting of 10%, 20% or 30% symptomatic individuals in the COVID− subgroup; the age distribution is constrained to be the same in both COVID+ and COVID− subgroups; and males/females are balanced in a 1:1 ratio in each COVID+/COVID− subgroup. We benchmark the COVID-19 predictive performance of the audio-based SSAST classifier against the performance attainable through random forest (RF) classifiers trained on self-identifiable symptoms and demographic data (a ‘symptoms’ RF classifier). In the benchmarking we also include an RF classifier that takes as inputs the audio-based SSAST probabilistic outputs alongside self-identifiable symptoms and demographic data (a ‘symptoms + audio’ RF classifier). Training for all three classifiers is performed in our standard training set. The resulting ROC curves are shown in Fig. 5a–c. Focusing on the general population with 20% of COVID− individuals symptomatic in Fig. 5b, the combined symptoms + audio RF classifier (ROC–AUC = 0.787 [0.772–0.801], 95% DeLong CI) offers a significant (P = 9.7 × 10⁻¹¹, DeLong test) but small increase in predictive accuracy over the symptoms RF classifier (ROC–AUC = 0.757 [0.743–0.771], 95% DeLong CI), which in turn yields a less significant increase in ROC–AUC compared with the audio-only classifier (P = 0.0033) (ROC–AUC = 0.733 [0.717–0.748], 95% DeLong CI).
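The following sketch outlines this style of balanced subsampling; the data-frame column names (covid, gender, symptomatic) and the sample sizes are illustrative assumptions rather than the study code, and matching of the age distribution across groups is omitted for brevity.

```python
import pandas as pd

def simulate_general_population(df, n_per_group, pct_symp_neg, seed=0):
    """Subsample, without replacement, a test set in which 65% of COVID+ and
    `pct_symp_neg` of COVID- participants are symptomatic, with genders
    balanced 1:1 within each COVID group."""
    parts = []
    for covid, pct_symp in [(1, 0.65), (0, pct_symp_neg)]:
        for gender in ("female", "male"):
            sub = df[(df.covid == covid) & (df.gender == gender)]
            n = n_per_group // 2                 # 1:1 gender ratio per group
            n_symp = round(n * pct_symp)
            parts.append(sub[sub.symptomatic == 1].sample(n_symp, random_state=seed))
            parts.append(sub[sub.symptomatic == 0].sample(n - n_symp, random_state=seed))
    return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle rows

# e.g. test_20 = simulate_general_population(test_pool, n_per_group=2000, pct_symp_neg=0.20)
```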

Fig. 5: Comparison of sensitivity, specificity, and utility across audio-based and symptoms-based classifiers, as applied in simulated general populations.
figure 5

a–c, The percentage of COVID− individuals who are symptomatic in the general population varies between 10% and 30% across the three columns of panels (labelled at the top). Comparison of ROC curves between the audio, symptoms, and symptoms + audio classifiers; curves show sensitivity as a function of specificity, with error bars denoting 95% confidence intervals for sensitivity, where confidence intervals are calculated using the pROC::ci.se() R function and are based on a sample size of 2,000 stratified bootstrap replicates. The legends show the curve colour for each classifier alongside ROC–AUC estimates and 95% DeLong confidence intervals. d–f, Comparison of maximum expected utility across classifiers. Four different utility functions are included, as detailed in the top-right legend (utility function parameters Rt, ε and δ are defined in the ‘Results’ section; δ = 0 in this figure). Curves are coloured to indicate audio, symptoms, or symptoms + audio classifiers, as detailed in the top-left legend.

We replicate these findings using an external dataset18, observing qualitatively similar results with a symptoms classifier (ROC–AUC = 0.79 [0.71–0.87]) outperforming an SSAST audio-based classifier (ROC–AUC = 0.68 [0.59–0.77]) in a general population test set simulated from the external test set18, in which 20% of the individuals who are COVID− are symptomatic. We observe similar results when comparing a symptoms classifier with our SSAST and Han and colleagues’ convolutional neural network (CNN) on the external test set directly18: symptoms classifier ROC–AUC = 0.81 [0.76–0.86]; SSAST audio-based classifier ROC–AUC = 0.68 [0.62–0.74]; CNN audio-based classifier18 ROC–AUC = 0.66 [0.60–0.71] (see also Extended Data Fig. 3). The reported results for our SSAST and Han and colleagues’ CNN are for the ‘cough’ modality; however, we see similar results for both ‘breathing’ and ‘voice’.

Translating prediction accuracy into utility

To characterize the practical benefit of ABCS in any particular testing setting, we can specify the utility \({u}_{\hat{y},y}\) of predicting \(\hat{y}\in \{0,1\}\) for a random individual, in the targeted testing population, whose true COVID status is \(y\in \{0,1\}\), and calculate the per-test expected utility (EU) as

$$\begin{array}{ll}{{{\rm{EU}}}}&\equiv {\mathbb{E}}[{{{\rm{utility}}}}| {{{\boldsymbol{u}}}},\pi ,{{{\rm{sensitivity}}}},{{{\rm{specificity}}}}]\\&=\pi \left[({u}_{1,1}-{u}_{0,1})\times {{{\rm{sensitivity}}}}+{u}_{0,1}\right]\\& \hspace{.4cm} + (1-\pi )\left[({u}_{0,0}-{u}_{1,0})\times {{{\rm{specificity}}}}+{u}_{1,0}\right],\end{array}$$
(1)

where π is the COVID-19 prevalence in the tested population (equation (1) is derived in the Methods). The EU is increasing in both sensitivity and specificity, with their relative weights depending on prevalence π and utility u. Note that it is not only π and u that are context-dependent: the sensitivity and specificity of any particular COVID-19 classifier depend on the characteristics of the targeted testing population, as illustrated by the effects of variation in the proportion of COVID− individuals that are symptomatic across Fig. 5a–c.

Consider the following illustrative utility function, measured in units of the number of infections prevented:

$$\begin{array}{ll}{u}_{1,1}={R}_{t}-\varepsilon & [\text{true positive result for a } {{{\rm{COVID}}}}^{+} \text{ individual: } {R}_{t} \text{ infections prevented on average}]\\ {u}_{1,0}=-\varepsilon & [\text{false positive for a } {{{\rm{COVID}}}}^{-} \text{ individual: } -\varepsilon \text{ is the negative impact of self-isolation}]\\ {u}_{0,0}=0 & [\text{true negative for a } {{{\rm{COVID}}}}^{-} \text{ individual}]\\ {u}_{0,1}=-\delta & [\text{false negative for a } {{{\rm{COVID}}}}^{+} \text{ individual, causing } \delta \text{ additional infections on average}]\end{array}$$

There are three specified parameters in the above: (i) the number of cases prevented by intervention on a single individual is specified as the effective reproduction number, Rt ≥ 0, that is, the average number of infections that person would cause under no intervention, assuming that all individuals with a positive result follow self-isolation guidance and cause no transmission; (ii) ε ≥ 0 measures the cost of intervention (for example, the negative impact on health or education resulting from self-isolation); and (iii) δ ≥ 0 is the expected number of additional infections caused by a false-negative result (for example, due to reduced caution and increased social mixing following a negative result).

Figure 5d–f shows maximum EU, as a function of prevalence, for settings of Rt ∈ {1, 1.5} and ε ∈ {0.02, 0.2} and with δ = 0 (corresponding results for δ = 0.25 are shown in Extended Data Fig. 4). The maximization is performed point-wise with respect to sensitivity and specificity across the corresponding ROC curves in Fig. 5a–c. The utility of all classifiers decreases as the percentage of symptomatic COVID− individuals increases from 10% to 30% across Fig. 5d–f, the intuition being that it becomes more difficult to distinguish COVID− individuals from the 65% symptomatic COVID+ population. When we compare the symptoms + audio RF classifier with the symptoms RF classifier, neither is generally optimal, with each classifier showing greater EU than the other for some values of (π, Rt, ε).
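The maximization underlying Fig. 5d–f can be reproduced point-wise from any ROC curve. The sketch below implements equation (1) with the illustrative utility function above; the ROC points shown are placeholders rather than the study's fitted curves.

```python
def expected_utility(sens, spec, prevalence, r_t, eps, delta=0.0):
    """Per-test expected utility, equation (1), with the illustrative utilities
    u11 = R_t - eps, u10 = -eps, u00 = 0 and u01 = -delta."""
    u11, u10, u00, u01 = r_t - eps, -eps, 0.0, -delta
    return (prevalence * ((u11 - u01) * sens + u01)
            + (1 - prevalence) * ((u00 - u10) * spec + u10))

def max_expected_utility(roc_points, **kwargs):
    """Point-wise maximum of EU over (sensitivity, specificity) pairs on an ROC curve."""
    return max(expected_utility(sens, spec, **kwargs) for sens, spec in roc_points)

# Placeholder ROC curve and a 2% prevalence population with R_t = 1.5 and eps = 0.02.
roc = [(0.0, 1.0), (0.6, 0.9), (0.8, 0.7), (1.0, 0.0)]
print(max_expected_utility(roc, prevalence=0.02, r_t=1.5, eps=0.02))
```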

Exploratory approaches to identify the influence of unmeasured confounders

We explore whether the residual COVID-19 prediction performance of ROC–AUC = 0.62 in the matched test set is truly attributable to the targeted acoustic signature, or whether it stems from unmeasured confounders such as the audible recording environment, or unreported symptoms. We describe the two complementary exploratory methods in greater detail in Supplementary Note 1.

Method 1 investigates how much of the residual predictive variation persists when we map all matched test set samples to the first k principal components of the COVID− samples. We train a classifier on COVID-19 detection in this k-dimensional space and hypothesize that, for values of k below a threshold τ, correct classification is due to confounding in the signal. The value of τ is determined by running a calibration experiment and is set to 14 for the sentence modality. By removing these correctly classified cases to form a curated matched test set, we see a drop in SSAST performance to UAR = 0.51 (the results from this experiment can be found in Extended Data Fig. 5).
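A schematic outline of Method 1 is given below; the logistic-regression probe, feature matrices and variable names are stand-ins under assumed inputs, not the exact procedure of Supplementary Note 1.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def curate_matched_test_set(X, y, k=14):
    """Project all matched test samples onto the first k principal components of
    the COVID- samples, classify COVID-19 status in that low-dimensional space,
    and drop the correctly classified samples (attributed to confounding)."""
    pca = PCA(n_components=k).fit(X[y == 0])          # fit on COVID- samples only
    z = pca.transform(X)
    probe = LogisticRegression(max_iter=1000).fit(z, y)
    confounded = probe.predict(z) == y                # correctly classified cases
    return X[~confounded], y[~confounded]             # curated matched test set
```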

Method 2 examines how much residual predictive variation persists when, using a robust distance metric in openSMILE space, we map each COVID+ participant to their nearest COVID− neighbour. Predictive variation that persists in the space spanned by individuals who are COVID− is then attributed to unmeasured confounders. After the COVID+-to-COVID− nearest-neighbour mapping, the SVM matched test set ROC–AUC drops from 0.60 to 0.55. We interpret this persistent component of predictive variation after the mapping to COVID− individuals (that is, ROC–AUC drops only to 0.55 as opposed to 0.50) as pointing to some degree of unmeasured confounding contributing to the score of ROC–AUC = 0.60 in the matched test set.
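Method 2 can be sketched as a nearest-neighbour substitution in openSMILE feature space; Manhattan distance is used below purely as a placeholder for the robust metric detailed in Supplementary Note 1.

```python
from sklearn.neighbors import NearestNeighbors

def map_positives_to_negatives(X_pos, X_neg):
    """Replace each COVID+ openSMILE feature vector with that of its nearest
    COVID- neighbour; the SVM is then re-evaluated on the mapped features, and
    ROC-AUC remaining above 0.5 points to unmeasured confounding."""
    nn = NearestNeighbors(n_neighbors=1, metric="manhattan").fit(X_neg)
    _, idx = nn.kneighbors(X_pos)
    return X_neg[idx[:, 0]]
```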

Discussion

COVID-19 is well known to be causally related to particular self-identifiable symptoms such as a new continuous cough. This has allowed such symptoms to be used by governments during the pandemic as a basis for population intervention to control disease spread (for example, as a triage tool for individuals) via self-screening and without recourse to audio recording. It is therefore desirable to develop audio-based classifiers that can augment and complement the information provided by self-identifiable COVID-19-specific symptoms, that is, to learn clinically valuable latent acoustic signatures caused by COVID-19.

Problematically, enrolment on the basis of symptoms has the potential not only to artificially inflate the association between COVID-19 and its particular symptoms, but also to introduce association between COVID-19 and symptoms that are not COVID-19 specific. Furthermore, enrolment on the basis of other characteristics such as gender and age may also introduce non-causal COVID-19–gender or COVID-19–age associations in the enrolled subpopulation, possibly interacting with symptoms.

Under such recruitment bias, classifiers trained to predict COVID-19 in enrolled subpopulations may learn to predict self-identifiable COVID-19-specific symptoms, thereby providing no further utility beyond a classifier trained directly on those self-screened symptoms. It is worth noting that there are exceptions to this statement, including situations in which passive monitoring is necessary or when individuals have reasons to provide incorrect information. More concerningly, the classifier may learn to predict age/gender/non-COVID-19-specific symptoms as proxies for COVID-19 in the enrolled subpopulation, in which case its performance will not generalize to subpopulations unaffected by the same recruitment bias.

Han et al.18 examine several aspects of recruitment bias in ABCS, simulating the effects of biases introduced by age, gender, language, and by the same individuals appearing in both the train and test sets. Although their dataset is approximately balanced with respect to age and gender (across COVID+ and COVID− subgroups), it is imbalanced with respect to self-reported symptoms (84% of their COVID+ subgroup is symptomatic, compared with 49% of the COVID− subgroup; ref. 18, Fig. 1e). Such imbalance with respect to symptoms is also present in our study (prior to matching) and other studies for which data, including symptoms, are available (see Extended Data Fig. 6).

We make the following recommendations with the aim of clarifying the effects of recruitment bias and mediation by self-identifiable symptoms in future studies:

  1. Collect and disseminate metadata. Repositories of audio samples should include details of the study recruitment criteria and relevant metadata (for example, gender, age, symptoms, location, time since COVID-19 test) so that data can be filtered for quality and for relevance to hypothesis, and bias from measured confounders can be characterized and controlled if necessary.

  2. Characterize and control recruitment bias. Analyse data using methods that acknowledge and control for the effects of recruitment bias. We approached this by matching on measured confounders in our test and/or training sets.

  3. Design studies with bias control in mind. Matching leads to reduced sample size when performed post-recruitment, so it can be beneficial to design observational studies that recruit participants to maximize the potential for matching on measured confounders in the enrolled cohort.

  4. Focus on the added predictive value of classifiers. Quantify the additional predictive value offered by classifiers compared with standard methods.

  5. Assess classifier performance in targeted settings. Measures of classifiers’ predictive accuracy, such as ROC–AUC, sensitivity and specificity, vary depending upon the characteristics of the targeted population (for example, according to prevalence and the proportion of COVID+ and COVID− individuals that are symptomatic). Where possible, apply the classifier in a test set that reflects the appropriate application setting, for example, by subsampling a test set that is representative of the general population, as we do here.

  6. Examine classifiers’ expected utility. We can specify utilities for each testing outcome, that is, quantify the average benefit accrued from a true positive, the different benefit of a true negative, and similarly the costs attached to false positives and false negatives. The expected utility then provides a highly context-specific score for quantifying and comparing classifiers’ performance.

  7. Out-of-study replication. Replication studies could be performed in randomly sampled cohorts, or in pilot studies in real-world prediction settings with domain-specific utility functions. There are extra challenges when performing out-of-study replication; in particular, the audio-capture protocols might differ. Replication would be facilitated if standardized protocols for audio data gathering were collaboratively developed.

We conclude by outlining some limitations of our study, dataset and findings. There are potentially subtle unmeasured confounders across our recruitment channels REACT and T+T. For example, PCR testing in T+T usually occurs in the days after self-screening of symptoms, whereas PCR tests in REACT are more likely to occur on a date approximately pre-determined by study researchers, and therefore to be independent of participants’ symptoms. We attempted to control for such unmeasured confounders by including recruitment channel as one of the matched variables in the test set. Despite matching on measured confounders, some residual predictive variation persists (ROC–AUC = 0.62). Our exploratory approaches for characterizing this residual predictive variation (Methods and Supplementary Note 1) suggest that some of this residual performance may be due to unmeasured confounders, but these results are inconclusive. Our results are suggestive of little utility of ABCS in practice relative to symptoms-based screening (Fig. 5). The development of more sophisticated methods for training audio-based models—such as utilizing voice activity detection, fusing multiple models’ predictions, or new augmentation methods—in the presence of symptoms data and recruitment bias is a worthwhile and active area of research which, alongside careful design and replication of studies, will eventually provide full clarity on the potential of ABCS as a tool to protect public health.

Methods

Dataset and study design

This section contains an overview of how the dataset was collected, its characteristics and its underlying study design. More in-depth descriptions are provided in two accompanying papers: Budd and co-workers23 report a detailed description of the full dataset, whereas Pigoli et al.30 present the rationale for and full details of the statistical design of our study.

Recruitment sources

Our main sources of recruitment were the REACT study and the NHS T+T system. REACT is a prevalence survey of SARS-CoV-2 that is based on repeated cross-sectional samples from a representative subpopulation defined via (stratified) random sampling from England’s NHS patient register31. The NHS T+T service was a key part of the UK government’s COVID-19 recovery strategy for England. It ensured that anyone developing COVID-19 symptoms could be swab tested, followed by the tracing of recent close contacts of any individuals testing positive for SARS-CoV-2 (ref. 25).

Criteria for enrolment

Enrolment for both the REACT and NHS T+T recruitment channels was performed on an opt-in basis. Individuals participating in the REACT study were presented with the option to volunteer for this study. For the NHS T+T recruitment channel, individuals receiving a PCR test from the NHS T+T pillar 2 scheme were invited to take part in research (pillar 1 tests refer to all swab tests performed in Public Health England laboratories and NHS hospitals for those with a clinical need, and health and care workers, whereas pillar 2 comprises swab testing for the wider population25). The guidance provided to potential participants was that they should be at least 18 years old, have taken a recent swab test (initially no more than 48 h before, changing to 72 h on 14 May 2021), agree to our data privacy statement and have their PCR barcode identifier available, which was then internally validated.

Audio recordings

Participants were directed to the ‘Speak up and help beat coronavirus’ web page24. Here, after agreeing to the privacy statement and completing the survey questions, participants were asked to record four audio clips. The first involved the participant reading out the sentence: ‘I love nothing more than an afternoon cream tea’, which was designed to contain a range of different vowel and nasal sounds. This was followed by three successive sharp exhalations, taking the form of a ‘ha’ sound. The final two recordings involved the participant performing volitional/forced coughs, once, and then three times in succession. Recordings were saved in .wav format. Smartphones, tablets, laptops and desktops were all permitted. The audio recording protocol was homogenized across platforms to reduce the risk of bias due to device types.

Demographic and clinical/health metadata

Existing metadata such as age, gender, ethnicity and location were transferred from linked T+T/REACT records. Participants were not asked to repeat this information, to avoid survey fatigue. An additional set of attributes—hypothesized to offer the most utility for evaluating the possibility of COVID-19 detection from audio—was collected in the digital survey. This was in line with General Data Protection Regulation requirements that only the personal data necessary to the task should be collected and processed. This set included the symptoms currently being displayed (the full set of which is detailed in Fig. 1e,f), and long-term respiratory conditions such as asthma. The participants’ first language was also collected to control for different dialects/accents, and to complement location and ethnicity. Finally, the test centre at which the PCR was conducted was recorded. This enabled the removal of submissions when cases were linked to faulty test centre results. A full set of the dataset attributes can be found in Budd and colleagues23.

Final dataset

The final dataset is downstream of a quality control filter (see Fig. 1g), in which a total of 5,157 records were removed, each with one or more of the following characteristics: (1) missing response data (missing a PCR test); (2) missing predictor data (any missing audio files or missing demographic/symptoms metadata); (3) audio submission delays exceeding ten days post test result; (4) self-inconsistent symptoms data; (5) a PCR testing laboratory under investigation for unreliable results; (6) a participant age of under 18; and (7) sensitive personal information detected in the audio signal (see Fig. 3d of ref. 23). Pigoli et al.30 present these implemented filters in full, and the rationale behind each one. The final collected dataset, after data filtration, comprised 23,514 COVID+ and 44,328 COVID− individuals recruited between March 2021 and March 2022. Note that the sample size here differs from that in our accompanying papers: Budd et al.23 reported numbers before the data quality filter was applied, whereas our statistical study design considerations, detailed in a work by Pigoli and colleagues30, focused on data from the restricted date range spanning March to November 2021. We note that the step-like profile of the COVID− count is due to the six REACT rounds, in which a higher proportion of COVID− participants were recruited than in the T+T channel. As detailed in the geo-plots in Fig. 1a,b, the dataset achieves good coverage across England, with some areas yielding more recruited individuals than others. We observe no major correlation between geographical location and COVID-19 status (Fig. 1c), with Cornwall displaying the highest level of COVID-19 imbalance, at a 0.8 percentage point difference between the proportions of COVID+ and COVID− cases.

Data splits

In our pre-specified analysis plan, we defined three training sets and five test sets to define a range of analyses in which we investigate, characterize and control for the effects of enrolment bias in our data:

  • Randomized train and test sets. Participant-disjoint train and test sets were randomly created from the whole dataset, similar to the methods used in previous works.

  • Standard train and test set. Designed to be a challenging, out-of-distribution evaluation procedure. Carefully selected attributes such as geographical location, ethnicity and first language are held out for the test set. The standard test set was also engineered to over-represent sparse combinations of categories, such as older COVID+ participants30. The samples included in this split exclusively consist of recordings made prior to 29 November 2021.

  • Matched train and test sets. The numbers of COVID− and COVID+ participants are balanced within each of several key strata. Each stratum is defined by a unique combination of measured confounders, including binned age, gender and a number of binary symptoms (for example, cough, sore throat, shortness of breath; see ‘Matching methodology’ below for a full description). The samples included in this split exclusively consist of recordings made prior to 29 November 2021.

  • Longitudinal test set. To examine how classifiers generalized out-of-sample over time, the longitudinal test set was constructed only from participants joining the study after 29 November 2021.

  • Matched longitudinal test set. Within the longitudinal test set, the numbers of COVID− and COVID+ participants are balanced within each of several key strata, as in the matched test set above.

The sample sizes for each of these splits are detailed in Fig. 1h.

Machine learning models

Three separate models were implemented for the task of COVID-19 detection from audio, each representing an independent machine learning pipeline. These three models collectively span the machine learning research space thoroughly—ranging from the established baseline to the current state of the art in audio classification technologies—and are visually represented in Extended Data Fig. 7. We also fitted an RF classifier to predict COVID-19 status from self-reported symptoms and demographic data. The outcome used to train and test each of the prediction models was a participant’s SARS-CoV-2 PCR test result. Each model’s inputs and predictors, and the details on how they are handled, can be found below. Wherever applicable, we have reported our study’s findings in accordance with TRIPOD statement guidelines32. The following measures were used to assess model performance: ROC–AUC, area under the precision–recall curve (PR–AUC), and UAR (also known as balanced accuracy). Confidence intervals for ROC–AUC, PR–AUC and UAR are based on the normal approximation method33, unless otherwise stated to be calculated by the DeLong method34.
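For reference, a minimal sketch of an ROC–AUC point estimate with a normal-approximation interval is shown below; the Hanley–McNeil standard error is used here as one common choice and is not necessarily the exact formulation of ref. 33.

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics import roc_auc_score

def auc_with_normal_ci(y_true, y_score, level=0.95):
    """ROC-AUC with a Hanley-McNeil normal-approximation confidence interval."""
    y_true = np.asarray(y_true)
    auc = roc_auc_score(y_true, y_score)
    n_pos, n_neg = int((y_true == 1).sum()), int((y_true == 0).sum())
    q1, q2 = auc / (2 - auc), 2 * auc**2 / (1 + auc)
    se = np.sqrt((auc * (1 - auc) + (n_pos - 1) * (q1 - auc**2)
                  + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg))
    z = norm.ppf(0.5 + level / 2)
    return auc, (auc - z * se, auc + z * se)
```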

openSMILE–SVM

We defaulted to the widely used openSMILE–SVM approach35 for our baseline model. Here, 6,373 handcrafted features (the ComParE 2016 set)—including the zero-crossing rate and shimmer, which have been shown to represent human paralinguistics well—are extracted from the raw audio form. These features are then concatenated to form a 6,373-dimensional vector, \(f_{\mathrm{openSMILE}}({{{\bf{w}}}})\to {{{\bf{v}}}}\), where the raw waveform, \({{{\bf{w}}}}\in {{\mathbb{R}}}^{n}\) (n = clip duration in seconds × sample rate), is transformed to \({{{\bf{v}}}}\in {{\mathbb{R}}}^{6,373}\); v is then normalized prior to training and inference. A linear SVM is fitted to this space and tasked with binary classification. We select the optimal SVM configuration on the basis of the validation set before retraining on the combined train–validation set.
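A minimal sketch of this baseline using the opensmile Python package and scikit-learn is shown below; the SVM cost parameter and the placeholder training paths/labels are assumptions rather than the study's chosen configuration.

```python
import numpy as np
import opensmile
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# ComParE 2016 functionals: the 6,373 handcrafted features per recording.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def extract_features(wav_paths):
    return np.vstack([smile.process_file(p).to_numpy() for p in wav_paths])

# Normalize, then fit a linear SVM; C would be selected on the validation set
# before retraining on the combined train-validation set.
svm = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10_000))
# svm.fit(extract_features(train_paths), train_labels)   # placeholders
```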

ResNet-50 BNN

Bayesian neural networks provide estimates of uncertainty, alongside strong supervised classification performance, which is desirable for real-world use cases, especially those involving clinical use. Bayesian neural networks are naturally suited to Bayesian decision theory, which benefits decision-making applications with different costs on error types (for example, assigning unequal weighting to errors in different COVID-19 outcome classifications)36,37. We thus supply a ResNet-50 (ref. 38) BNN model. The base ResNet-50 model showed initial strong promise for ABCS5, further motivating its inclusion in this comparison. We obtain estimates of uncertainty through Monte-Carlo Dropout, which provides approximate Bayesian inference over the posterior, as in ref. 39. We opt to use the pre-trained model for a warm start to the weight approximations, and allow full retraining of layers.

The features used to create an intermediate representation, as input to the convolutional layers, are Mel filterbank features with default configuration from the VGGish GitHub (ref. 40): \({{{{\bf{X}}}}}_{i}\in {{\mathbb{R}}}^{96\times 64}\), 64 log-mel spectrogram coefficients using 96 feature frames of 10 ms duration, taken from a resampled signal at 16 kHz. Each input signal was divided into these two-dimensional windows, such that a 2,880 ms clip would produce three training examples with the label assigned to each clip (COVID+ or COVID−). Incomplete frames at edges were discarded. As with the openSMILE–SVM, silence was not removed. For evaluation, the mean prediction over feature windows was taken per audio recording, to produce a single decision per participant. To make use of the available uncertainty metrics, Supplementary Note 3 details an uncertainty analysis over all audio modalities for a range of train–test partitions.
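The front-end windowing can be sketched with torchaudio as below; the 25 ms window and 10 ms hop follow the VGGish defaults40, and the mono (1 × n samples) waveform shape is an assumption of this sketch.

```python
import torch
import torchaudio

def logmel_windows(waveform, sample_rate):
    """Split a mono recording into non-overlapping 96 x 64 log-mel windows
    (64 mel bins, 25 ms frames, 10 ms hop at 16 kHz); incomplete edge frames
    are discarded and each window inherits the recording's COVID-19 label."""
    if sample_rate != 16_000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
    melspec = torchaudio.transforms.MelSpectrogram(
        sample_rate=16_000, n_fft=400, hop_length=160, n_mels=64)(waveform)
    logmel = torch.log(melspec + 1e-6).squeeze(0).T     # (frames, 64)
    n_windows = logmel.shape[0] // 96
    return logmel[: n_windows * 96].reshape(n_windows, 96, 64)

# At evaluation time, the BNN's predictions over a recording's windows are
# averaged to give a single per-participant decision.
```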

SSAST

In recent years, transformers41 have started to perform well in high-dimensional settings such as audio42,43. This is particularly the case when models are first trained in a self-supervised manner on unlabelled audio data. We adopt the SSAST44, which is on a par with the current state of the art for audio event classification. Raw audio is first resampled to 16 kHz and normalized before being transformed into Mel filter banks. Strided convolutional neural layers are used to project the Mel filter bank to a series of patch-level representations. During self-supervised pretraining, random patches are masked before all of the patches are passed to a transformer encoder. The model is trained to jointly reconstruct the masked audio and to classify the order in which the masked patches occur. The transformer is made up of 12 multihead attention blocks. The model is trained end to end, with gradients being passed all of the way back to the convolutional feature extractors. The model is pre-trained on a combined set of AudioSet-2M (ref. 45) and Librispeech46, representing over two million audio clips, for a total of ten epochs. The model is then fine-tuned in a supervised manner on the task of COVID-19 detection from audio. Silent sections of audio recordings are removed before the audio is resampled to 16 kHz and normalized. Clips are cut/zero-padded to a fixed length of 5.12 s, which corresponds approximately to the mean length of the audio clips. For cases in which the signal length exceeds 5.12 s (after silence is removed), the first 5.12 s are taken. At training time, the signal is augmented through applying SpecAugment47 along with the addition of Gaussian noise. The output representations are mean pooled before being fed through a linear projection head. No layers are frozen and again the model is trained end to end. The model is fine-tuned for a total of 20 epochs. The model is evaluated on the validation set at the end of each epoch and its weights are saved. At the end of training, the best performing model over all epochs is chosen.
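A sketch of the fine-tuning input pipeline is given below; the mel-bin count, masking parameters and noise scale are illustrative assumptions, and the silence-removal step is omitted for brevity.

```python
import torch
import torchaudio

TARGET_LEN = int(5.12 * 16_000)            # fixed 5.12 s input at 16 kHz

def prepare_clip(waveform, sample_rate, training=False):
    """Resample to 16 kHz, normalize, cut/zero-pad to 5.12 s, compute mel filter
    banks and, at training time, apply SpecAugment-style masking plus Gaussian noise."""
    wav = torchaudio.functional.resample(waveform, sample_rate, 16_000)
    wav = (wav - wav.mean()) / (wav.std() + 1e-8)
    wav = wav[..., :TARGET_LEN]
    wav = torch.nn.functional.pad(wav, (0, TARGET_LEN - wav.shape[-1]))
    fbank = torchaudio.compliance.kaldi.fbank(wav, num_mel_bins=128,
                                              sample_frequency=16_000)   # (time, mel)
    if training:
        spec = fbank.T.unsqueeze(0)                                      # (1, mel, time)
        spec = torchaudio.transforms.FrequencyMasking(freq_mask_param=24)(spec)
        spec = torchaudio.transforms.TimeMasking(time_mask_param=96)(spec)
        fbank = spec.squeeze(0).T + 0.01 * torch.randn_like(fbank)
    return fbank
```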

Random forest classifier

To predict SARS-CoV-2 infection status from self-reported symptoms and demographic data, we applied an RF classifier with default settings (having self-reported symptoms and demographic data as inputs). In our dataset, predictor variables for the symptoms RF classifier comprised: cough; sore throat; asthma; shortness of breath; runny/blocked nose; a new continuous cough; chronic obstructive pulmonary disease (COPD) or emphysema; another respiratory condition; age; gender; smoker status; and ethnicity. In Han and colleagues’ dataset18, predictor variables for the symptoms RF classifier comprised: tightness of chest; dry cough; wet cough; runny/blocked nose; chills; smell/taste loss; muscle ache; headache; sore throat; short breath; dizziness; fever; age; gender; smoker status; language; and location. Prior to training, categorical attributes were one-hot encoded. No hyperparameter tuning was performed, and models were trained on the combined standard train and validation sets. For the hybrid symptoms + audio RF classifier, the predicted COVID+ probability output by an audio-trained SSAST is appended as an additional input variable to the self-reported symptoms and demographic variables listed above.
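A sketch of the symptoms (and symptoms + audio) RF classifier is shown below; the column names are illustrative stand-ins for the predictor variables listed above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

SYMPTOM_COLS = ["cough", "sore_throat", "asthma", "shortness_of_breath",
                "runny_blocked_nose", "new_continuous_cough", "copd_emphysema",
                "other_respiratory_condition"]
CATEGORICAL_COLS = ["age_bin", "gender", "smoker_status", "ethnicity"]

def fit_symptoms_rf(df, labels, audio_probs=None):
    """Default-settings RF on one-hot-encoded symptoms/demographics; passing the
    SSAST predicted COVID+ probability via `audio_probs` gives the hybrid
    symptoms + audio classifier."""
    X = pd.get_dummies(df[SYMPTOM_COLS + CATEGORICAL_COLS], columns=CATEGORICAL_COLS)
    if audio_probs is not None:
        X = X.assign(ssast_covid_prob=audio_probs)
    return RandomForestClassifier(random_state=0).fit(X, labels)
```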

Matching methodology

The matched test set was constructed by exactly balancing the numbers of COVID+ and COVID− individuals in each stratum where, to be in the same stratum, individuals must be matched on all of (recruitment channel) × (10-year-wide age bins) × (gender) × (all six binary symptom covariates). The six binary symptoms matched on in the matched test set were: cough; sore throat; asthma; shortness of breath; runny/blocked nose; and ‘at least one symptom’.

Our matching algorithm proceeds as follows. First, each participant is mapped to exactly one stratum. Second, the following matching procedure is applied separately in each stratum: in stratum s (of a total of S strata) let \(n_{s,+}\) and \(n_{s,-}\) denote the number of COVID+ and COVID− individuals, respectively, and let \({{{{\mathscr{A}}}}}_{s,+}\) and \({{{{\mathscr{A}}}}}_{s,-}\) be the corresponding sets of individuals. Use \({{{{\mathscr{M}}}}}_{s,+}\) and \({{{{\mathscr{M}}}}}_{s,-}\) to denote random samples without replacement of size \(\min \{{n}_{s,+},{n}_{s,-}\}\) from \({{{{\mathscr{A}}}}}_{s,+}\) and \({{{{\mathscr{A}}}}}_{s,-}\), respectively. Finally, we combine matched individuals across all strata into the matched dataset \({{{\mathscr{M}}}}\), defined as:

$${{{\mathscr{M}}}}:= {\cup }_{s = 1}^{S}({{{{\mathscr{M}}}}}_{s,+}\cup {{{{\mathscr{M}}}}}_{s,-}).$$

The resulting matched test set comprised 907 participants who were COVID positive and 907 who were COVID negative. The matched training set was constructed similarly to the matched test set, though with slightly different strata, so as to increase the available sample size. For the matched training set, individuals were matched on all of: (10-year-wide age bins) × (gender) × (all seven binary covariates). The seven binary covariates used for the matched training set were: cough; sore throat; asthma; shortness of breath; runny/blocked nose; COPD or emphysema; and smoker status. The resulting matched training set comprised 2,599 participants who were COVID positive and 2,599 who were COVID negative.
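The matching procedure above can be sketched as follows; the data-frame column names are illustrative, and each stratum corresponds to a unique combination of the matching variables.

```python
import pandas as pd

def exact_match(df, match_cols, label_col="covid", seed=0):
    """Within each stratum (a unique combination of `match_cols`), randomly
    subsample the larger of the COVID+/COVID- groups without replacement so that
    both contain min(n+, n-) individuals, then pool the strata."""
    matched = []
    for _, stratum in df.groupby(match_cols):
        pos, neg = stratum[stratum[label_col] == 1], stratum[stratum[label_col] == 0]
        n = min(len(pos), len(neg))
        if n > 0:
            matched.append(pos.sample(n, random_state=seed))
            matched.append(neg.sample(n, random_state=seed))
    return pd.concat(matched)

# e.g. matched_test = exact_match(test_pool, ["recruitment_channel", "age_bin",
#                                             "gender", *SIX_SYMPTOM_COLS])
```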

Quantifying the expected utility of a testing protocol

We consider the action of applying a particular testing protocol to an individual randomly selected from a population. The four possible outcomes \({O}_{\hat{y},y}\) are

$${O}_{\hat{y},y}:= [\,{{\mbox{Predict COVID-19 status as}}}\,\,\hat{y}]\,{{{\rm{AND}}}}\,[\,{{\mbox{True COVID-19 status is}}}\,\,y]$$
(2)

for predicted COVID-19 status \(\hat{y}\in \{0,1\}\) and true COVID-19 status \(y\in \{0,1\}\). We denote the probability of outcome \({O}_{\hat{y},y}\) by

$${p}_{\hat{y},y}:= {\mathbb{P}}({O}_{\hat{y},y})$$
(3)

and use \({u}_{\hat{y},y}\) to denote the combined utility of the consequences of outcome \({O}_{\hat{y},y}\). For a particular population prevalence proportion, π, the \({p}_{\hat{y},y}\) are subject to the constraints

$${p}_{0,1}+{p}_{1,1}=\uppi$$
(4)
$${p}_{0,0}+{p}_{1,0}=1-\uppi ,$$
(5)

leading to the following relationships, valid for π ∈ (0, 1), involving the sensitivity and specificity of the testing protocol:

$${{{\rm{sensitivity}}}}\equiv \frac{{p}_{1,1}}{{p}_{1,1}+{p}_{0,1}}=\frac{{p}_{1,1}}{\uppi }$$
(6)
$${{{\rm{specificity}}}}\equiv \frac{{p}_{0,0}}{{p}_{0,0}+{p}_{1,0}}=\frac{{p}_{0,0}}{1-\uppi }.$$
(7)

The expected utility is:

$${{{\rm{EU}}}}=\mathop{\sum}\limits_{\hat{y}\in \{0,1\}}\mathop{\sum}\limits_{y\in \{0,1\}}{u}_{\hat{y},y}{p}_{\hat{y},y}$$
(8)
$$={u}_{1,1}{p}_{1,1}+{u}_{0,1}(\uppi -{p}_{1,1})+{u}_{0,0}{p}_{0,0}+{u}_{1,0}(1-\uppi -{p}_{0,0})$$
(9)
$$\begin{array}{l}=\uppi [({u}_{1,1}-{u}_{0,1})\times {{{\rm{sensitivity}}}}+{u}_{0,1}]\\+(1-\uppi )[({u}_{0,0}-{u}_{1,0})\times {{{\rm{specificity}}}}+{u}_{1,0}],\end{array}$$
(10)

where equations (4) and (5) are substituted into equation (8) to obtain equation (9), and equations (6) and (7) are substituted into equation (9) to obtain equation (10).

Demonstration code

To provide researchers with easy access to running the code, we have created a demonstration notebook in which the participant is invited to record their own ‘sentence’, ‘cough’, ‘three cough’ or ‘exhalation’ sounds and evaluate our COVID-19 detection machine learning models on them. The model outputs a COVID-19 prediction, along with some explainable AI analysis, for example, enabling the user to listen back to the parts of the signal to which the model allocated the most attention. In the demonstration, we make clear that this is not a clinical diagnostic test for COVID-19, but is instead for research purposes and does not provide any medical recommendation, nor should any action be taken following its use. The demonstration file is detailed on the main repository page and can be accessed at https://colab.research.google.com/drive/1Hdy2H6lrfEocUBfz3LoC5EDJrJr2GXpu?usp=sharing.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.