Sounds of COVID-19: exploring realistic performance of audio-based digital testing

To identify Coronavirus disease (COVID-19) cases efficiently, affordably, and at scale, recent work has shown how audio (including cough, breathing and voice) based approaches can be used for testing. However, there is a lack of exploration of how biases and methodological decisions impact these tools’ performance in practice. In this paper, we explore the realistic performance of audio-based digital testing of COVID-19. To investigate this, we collected a large crowdsourced respiratory audio dataset through a mobile app, alongside symptoms and COVID-19 test results. Within the collected dataset, we selected 5240 samples from 2478 English-speaking participants and split them into participant-independent sets for model development and validation. In addition to controlling the language, we also balanced demographics for model training to avoid potential acoustic bias. We used these audio samples to construct an audio-based COVID-19 prediction model. The unbiased model took features extracted from breathing, coughs and voice signals as predictors and yielded an AUC-ROC of 0.71 (95% CI: 0.65–0.77). We further explored several scenarios with different types of unbalanced data distributions to demonstrate how biases and participant splits affect the performance. With these different, but less appropriate, evaluation strategies, the performance could be overestimated, reaching an AUC up to 0.90 (95% CI: 0.85–0.95) in some circumstances. We found that an unrealistic experimental setting can result in misleading, sometimes over-optimistic, performance. Instead, we reported complete and reliable results on crowd-sourced data, which would allow medical professionals and policy makers to accurately assess the value of this technology and facilitate its deployment.


Introduction
Since its outbreak in early December 2019, over 169 million cases of the novel coronavirus disease have been reported, including 3.5 million deaths. Researchers and scientists have made considerable strides in developing treatments and vaccines for COVID-19, and effective and easily accessible tests have been key to tracing infected people quickly. Currently, the most commonly used and first-line diagnostic tool for COVID-19 is the reverse transcription polymerase chain reaction (RT-PCR) assay, which detects the presence of viral ribonucleic acid (RNA) in swab samples [1,2]. RT-PCR tests are highly sensitive in laboratory testing (over 95% diagnostic sensitivity and specificity); however, commercial kits have been found to perform differently, with sensitivity ranging from 75% to 100% and, in the worst case, reaching as low as 38% [3,4,5]. Moreover, the sample analysis process is involved, time-consuming, and limited to approved laboratories with highly trained staff, which limits testing capacity and fails to meet the rapid increase in demand. It is crucial that the pandemic response overcomes these challenges so that testing can be carried out promptly and on a massive scale. This requires fast, affordable, sustainable and effective testing methods, which can be repeated over time by individuals to track progression. This would help contain the current spread, suppress resurgence and minimise health risks.
Within this context, in the past year researchers have developed and published multiple models for COVID-19 prediction using audio [6,7,8,9,10,11]. In particular, advances in machine learning have demonstrated the potential of automated auscultation of respiratory sounds and brought about new possibilities for fully automated COVID-19 screening [12,13,14,15,16,17,18,19]. For instance, a systematic review by Wynants et al. reports that the AUC-ROC (Area Under the Receiver Operating Characteristic curve) of over 75 existing COVID-19 prediction models [20] ranges from 0.70 to 0.99.
There is, however, a lack of studies exploring the biases and model evaluation processes which affect (potentially even positively but unrealistically) these performance results. Such issues include:
• Potential underlying data biases or study limitations not reported sufficiently, where models were developed and evaluated with limited data which might not be representative of the target population (e.g., 19 subjects in [19], 51 subjects in [21], and 88 subjects in [15]).
• Methodological flaws (e.g. the same users appearing during model development and validation [22]) which would be unrealistic in a practical clinical setting, resulting in an artificial performance boost.
• Lack of systematic comparison with other respiratory diseases like asthma and bronchitis, with models only distinguishing COVID-19 from healthy controls [23].
Due to these issues, many researchers have raised concerns about the feasibility and effectiveness of such models if deployed in real settings [20,24,25].
In this work, we investigate the limits of audio-based COVID-19 testing with the aim of creating the foundation of realistically applicable audio tools. The aim of this study is twofold. First, based on a large crowdsourced dataset, to investigate the performance of an audio-based testing method when working with, to the best of our knowledge, unbiased data and a methodological design based on realistic assumptions (e.g. independent user split). Second, to explore the impact of biases and the design pipeline on performance.
For this purpose, we first gathered crowd-sourced respiratory sound data from the general population via smartphones. We carefully prepared data for model development and validation by selecting representative audio samples from self-declared COVID-19 positive or negative participants. Subsequently, we developed a deep learning model on a portion of the data and then validated its predictive performance on an independent population. In particular, we adhered to the TRIPOD reporting guideline [26], aiming to report in a complete, transparent and usable manner. Our discussion explores the biases and how machine learning model hyperparameters could be tuned, depending on the use of the tool (e.g. on symptomatic or asymptomatic populations) and public health needs.

Results
Dataset preparation and statistics. For data gathering purposes, an app was developed and released in April 2020 to crowd-source participants' demographics, medical history, symptoms, COVID-19 test results and audio recordings: three voluntary cough sounds, three to five breathing sounds, and three speech recordings in which the user was asked to read a specific sentence. The self-reported COVID-19 test results included receiving a positive/negative test result, or not being tested before the recording. More details can be found in Methods in Appendix. Our app is a multi-language tool, but in this study we focus only on audio samples from English-speaking participants (77.7% of the overall participants) to avoid language-related bias. Audio quality checks were conducted to filter out incomplete or noisy samples. Finally, 2,478 participants (514 positive and 1,964 negative participants) with 5,240 samples were included for experiments, as shown in Fig. 1a (more detailed data selection criteria can be found in Methods). Demographic statistics for the experimental data are presented in Fig. 1b-d: 56% of participants in the selected data were male, the majority were aged 20-49, and half had never smoked. In addition, as shown in Fig. 1e, 84% of the participants who tested positive reported symptoms like fever or cough, while the others did not report any symptoms at the recording time. 51% of the negative participants reported no symptoms, while 49% had symptoms such as dry/wet cough, fever, dizziness, etc.
From the 2,478 English-speaking participants, we prepared a training & validation set consisting of 800 participants with balanced COVID-19 status and other demographics to optimise the parameters of our deep learning model (yellow box in Fig. 1a); the rest of the data were used for evaluation and form the testing set pool (green box in Fig. 1a). To inspect the performance in different realistic deployment scenarios, we first held out a representative testing set with balanced and demographically controlled participants, containing 100 positive and 100 negative participants. Furthermore, we randomly selected positive and negative participants from the testing pool to form new groups with various prevalence levels, medical histories and smoking statuses, to holistically validate our model. Apart from the controlled training and testing data, to simulate the impact of unrealistic experimental settings and bias, we also prepared training and testing sets with improper data splitting and various biases. Details can be found in Methods. Accuracy is presented by ROC-AUC (Area Under the Receiver Operating Characteristic curve), sensitivity and specificity.
COVID-19 detection performance. On the demographic-representative testing set with 200 participants (see age and gender distribution in Appendix Table 1), our deep learning model with three sound types yielded a ROC-AUC of 0.71 (0.65-0.77) (Fig. 2a), with sensitivity 0.65 (0.58-0.72) and specificity 0.69 (0.62-0.76) (Fig. 2b). The combination of three sound types outperformed any single modality: a ROC-AUC of 0.66 (0.60-0.71) on cough, 0.62 (0.56-0.68) on breathing, and 0.61 (0.55-0.67) on voice was achieved. Moreover, breathing yielded the highest sensitivity of 0.64 (0.56-0.71), but cough showed the highest specificity of 0.66 (0.58-0.73). This indicates that all modalities are informative, and their combination leads to the optimal performance. We further tested the performance of the model on different demographic subgroups of this testing set (Fig. 2c), which shows similar results across different age and gender distributions: ROC-AUCs were all above 0.65, and sensitivity and specificity were similar for each group. Accuracy on the over-60 subgroup is slightly higher, but we suspect that the increased performance might be a result of the limited number of participants in this group. We also inspected how symptoms impact the model performance by dividing the 200 participants in the testing set into asymptomatic and symptomatic subgroups. As presented in Fig. 2c, for both subgroups our model yielded ROC-AUCs above 0.66. Yet, from the comparison we can also observe that our model performs better at distinguishing asymptomatic negative participants (specificity = 0.85 (0.77-0.92)) and symptomatic positive participants (sensitivity = 0.67 (0.59-0.74)). For the more challenging cases, i.e. predicting symptomatic negative and asymptomatic positive cases, we achieved a lower accuracy: a sensitivity of 0.50 (0.25-0.76) for asymptomatic participants and a specificity of 0.56 (0.45-0.66) for symptomatic participants. A potential explanation could be that asymptomatic positive participants might not manifest changes in audio characteristics, and are therefore difficult to detect. Further discussion about the real implications and applications can be found in Discussion.
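The three metrics reported above can be illustrated with a minimal sketch. It assumes a list of binary labels (1 for positive) and predicted probabilities; the function names, the toy values and the convention that a score above the threshold counts as a positive prediction are illustrative assumptions, not the code used in the study.

```python
def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity when scores above `threshold` are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s > threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s <= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s <= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s > threshold)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical values, not study data).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_sens, toy_spec = sensitivity_specificity(toy_labels, toy_scores)
print(f"AUC={toy_auc:.3f} sensitivity={toy_sens:.3f} specificity={toy_spec:.3f}")
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a testing set of a few hundred participants.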
Model performance on various health and smoking status. One of the most important concerns to address is whether these audio models for COVID-19 testing might be confusing COVID-19 with other illnesses or respiratory pathologies. To investigate this topic further, we split the participants of the testing pool into several non-overlapping groups. Results are presented in Fig. 3.
The first controlled criterion is medical history. We selected participants who reported having asthma or HBP (high blood pressure), and those who reported no medical history. We compared the model performance and found that all metrics reach a comparable level of accuracy on average: the specificity was 0.62 (0.55-0.68) on the asthma group, 0.69 (0.62-0.76) on the HBP group, 0.65 (0.62-0.76) on the no-medical-history group (Fig. 3a), and 0.69 (0.62-0.76) on a mix of participants (Fig. 2b). Predicted COVID-19 probabilities of the negative participants based on our model are compared in Fig. 3b. A Kruskal-Wallis H test [28] on the three negative groups' probabilities yielded a p-value of 0.62 (>0.05), indicating no significant difference between the groups' predicted probabilities. This supports the assumption that medical history does not confuse our model. It is worth noting that the lower sensitivity for the asthma and HBP groups might be caused by the very limited number of testing samples, leading to relatively large performance fluctuations.
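As an illustration of the test used above, the following sketch computes the Kruskal-Wallis H statistic for exactly three groups using only the standard library. With three groups the statistic is compared against a chi-square distribution with two degrees of freedom, whose survival function is exp(-H/2). The function names and the toy probabilities are assumptions made for this example; in practice a statistics package such as SciPy (`scipy.stats.kruskal`) would normally be used.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_three_groups(a, b, c):
    """Kruskal-Wallis H (tie-corrected) and p-value for three independent groups."""
    groups = [a, b, c]
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)

    rank_term = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        rank_term += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * rank_term - 3 * (n + 1)

    # Standard correction for tied observations.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    h /= correction

    # Chi-square survival function with 2 degrees of freedom (valid for 3 groups only).
    p_value = math.exp(-h / 2)
    return h, p_value


# Hypothetical predicted probabilities for three groups of negative participants.
asthma = [0.31, 0.44, 0.52, 0.38, 0.47]
hbp = [0.35, 0.41, 0.49, 0.29, 0.55]
no_history = [0.33, 0.46, 0.51, 0.36, 0.42]
h_stat, p_val = kruskal_wallis_three_groups(asthma, hbp, no_history)
print(f"H={h_stat:.3f}, p={p_val:.3f}")
```

A p-value above 0.05 means the test finds no significant difference between the groups; it does not by itself prove that the distributions are identical.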
The second controlled criterion is the reported smoking status. The variance of the performance across groups was marginal (Fig. 3a): specificity for participants who had never smoked was 0.66 (0.63-0.68), for those who had quit smoking was 0.67 (0.62-0.71), and for current smokers was 0.63 (0.58-0.68). Similar to medical history, predicted probabilities for these three groups are presented in Fig. 3b, with a p-value of 0.51 (>0.05) from the Kruskal-Wallis H test. Sensitivity for smokers was slightly lower, at 0.47 (0.31-0.66), which might be explained by the fact that five of the 22 COVID-19-positive smokers were asymptomatic (23% in this group against 16% in Fig. 1e). As our model is better at predicting symptomatic COVID-19 correctly, this may explain the slight drop in the overall sensitivity for this group.
Model performance with unrealistic evaluation and biases. To show how bias and unrealistic experiment design impact the model performance, we purposefully introduced various biases that previous works might have had, generating another four training and testing sets in an attempt to artificially inflate the results. The artificially created biases are as follows: 1) Using sample-level random splits (random-splits for short) instead of participant-independent splits (user-splits for short) for training and testing. 2) Introducing gender bias into the data by selecting 85% of the negative participants as female. 3) Bringing age bias into the negative group. There are two biased groups: selecting all negative participants from those aged over 39 (Group 1) and from those aged under 39 (Group 2). 4) Replacing some English-speaking participants with Italian-speaking participants and making the proportion of Italian-speaking participants relatively higher in the positive group. Details of the data used for comparison can be found in Methods and Appendix Fig. 4-5. We trained the model without changing the network structure, and if not specifically mentioned, the results are based on the combination of three sound types (breathing, cough and voice). A detailed comparison can be found in Appendix Tables 2-5.
From Fig. 4a, random-splits yielded a higher accuracy than user-splits, with a sensitivity of 0.84 (0.75-0.92) and a specificity of 0.78 (0.68-0.87). The performance gains come from the overlapping participants whose data were seen during training, since personal sound traits are easy for the model to memorise. However, this is less realistic, as in real-world scenarios the model should ideally adapt well to an unseen new population. This may also support our hypothesis that some previous works reported optimistic performance by using this random-split protocol. Demographic bias in either age or gender also appears to lead to biased results. The overall ROC-AUC might be boosted, as shown in Appendix Tables 3-4, but a large difference between sensitivity and specificity can be observed in some subgroups. For instance, a sensitivity of 0.23 (0.14-0.33) but a specificity of 0.93 (0.90-0.97) were obtained on the biased (Female) group, as shown in Fig. 4b, because positive females were under-represented in the training set and this model tends to treat female participants as negative.
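The difference between the user-split and random-split protocols can be sketched as follows. The sketch assumes that each sample is a `(participant_id, sample_index)` pair; the function names and the toy cohort are hypothetical and only illustrate why a sample-level shuffle leaks participants across the two sets.

```python
import random


def user_split(samples, test_fraction, seed=0):
    """Split by participant: all samples of a participant land on one side only."""
    participants = sorted({pid for pid, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(participants)
    n_test = max(1, round(len(participants) * test_fraction))
    test_ids = set(participants[:n_test])
    train = [s for s in samples if s[0] not in test_ids]
    test = [s for s in samples if s[0] in test_ids]
    return train, test


def random_split(samples, test_fraction, seed=0):
    """Split by sample: a participant's recordings may appear on both sides."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def overlapping_participants(train, test):
    """Participants who contribute samples to both the training and testing set."""
    return {pid for pid, _ in train} & {pid for pid, _ in test}


# Toy cohort: 100 hypothetical participants with one to three samples each.
rng = random.Random(42)
samples = [(f"p{i:03d}", k) for i in range(100) for k in range(rng.randint(1, 3))]

u_train, u_test = user_split(samples, 0.2)
r_train, r_test = random_split(samples, 0.2)
print("user-split overlap:  ", len(overlapping_participants(u_train, u_test)))
print("random-split overlap:", len(overlapping_participants(r_train, r_test)))
```

With the user-split the overlap is empty by construction, whereas the random-split typically leaves some participants on both sides, which lets a model benefit from memorised personal sound traits.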
Similar results can be observed in the age-biased groups (see Fig. 4c). When the negative participants in the training set were all aged over 39, the model yielded higher specificity than sensitivity on the over-60 participants in the testing set. Conversely, the model trained on data biased towards negative participants aged under 39 yielded higher specificity on the younger group (see Fig. 4c). For the language bias, i.e. when positive participants were over-represented among Italian speakers in training, sensitivity is as low as 0.25 (0.15-0.36) in the English subgroup and specificity is close to 0 in the Italian subgroup (Fig. 4d); this bias particularly impacted the voice modality (see Fig. 4f) and slightly influenced cough (see Fig. 4e). Yet, our model (the controlled model in Fig. 4) shows consistent sensitivity and specificity across all subgroups, presenting a realistic value for model application.

Discussion
Comparison with other studies. For digital technologies to penetrate clinical practice, it is pivotal that studies become more explainable and that the models are resilient to the noise, variability and bias present in real data.
Demographic Bias: In our study design and data selection we concerned ourselves with potential confounding factors and tried to rule out selection bias, as these may lead to unrealistically inflated results. Specifically, we split positive and negative samples into three partitions for model training, optimisation, and testing, while adjusting the data partitioning and maintaining similar distributions of age and gender across different data splits to control for potential confounding variables (see Appendix Table 1). This is different from some prior studies in the literature, in which the selection of the data is unclear and lacking a cohort diagram [14,15]. More importantly, we further performed experimental analysis to explore the effect of demographic bias on the model.
Language Bias: With the potential of COVID-19 digital testing to be applicable worldwide, it is important to explore the effect of language bias on different audio-based data (such as cough, breathing and voice). To disentangle the possible confounding effect of language, we restricted our analysis to English-speaking data, which gives the most realistic perspective of the capabilities of audio-based diagnostics for COVID-19. In addition, similar to demographic bias, we carried out experiments to test the effect of language bias when the model was trained with unbalanced multi-language data.
User Independence: Moreover, in some prior studies, cross-validation was applied for performance validation: this is generally done when the data is scarce and user samples become very important. Data from the same participant might be used for both model training and validation [22,29]: while this might be considered acceptable in testing theoretical machine learning techniques, if a user appears in both training and testing sets, such models typically do not generalise well, making them poorly suited to a realistic setting. With the luxury of a large dataset, we could choose to perform user-independent validation, where the data of participants used for model validation are excluded from, and thus unseen during, model training. We are confident that this is a more realistic approach, which could inform future in-the-wild audio-based screening.
Limitations of our study. Several limitations to our work should be acknowledged. COVID-19 is known to often manifest as respiratory symptoms, which are also common for other relatively widespread diseases, as well as among the smoking population. Therefore, we conducted an in-depth analysis to establish whether our model could be influenced by other respiratory pathologies. Specifically, we evaluated the ability of our model to correctly identify a COVID-19 infection in participants who also indicated asthma or high blood pressure in their medical history (as these are reasonably large cohorts in our data collection), compared to a cohort who indicated no other medical conditions. We also tested the model on participants with a variety of reported smoking statuses (e.g., few to many cigarettes per day). However, we note that we have not had the opportunity to test against a wider variety of specific respiratory infections, such as influenza or rhinovirus, since they were not prevalent when our data was collected and a reliable ground truth for them is difficult to obtain. It is possible, however, that the participants who reported a cough and had a negative COVID-19 test result were indeed suffering from some respiratory condition at the time of the sample collection.
Also, as our models did not fully control for all potential confounding factors such as race, and our data included far fewer elderly participants, future studies should investigate these biases. In addition, though in the present study the language was well controlled (all English), it is as yet unclear whether and how different types of accents would affect the model, and we lack the information needed to study this.
Our data is crowdsourced: we rely on the trustworthiness of the responses from individuals, especially with respect to their COVID-19 testing status. The scale of the data helps in amortising the noise generated by the crowdsourcing process while, at the same time, showing the robustness of the approach to uncontrolled conditions. Our data, while aiming to match the cohort to the target population as much as possible, lacks clinical validation. Thus, additional external validation should be performed to assess the generalisation of the prediction model before it is applied in clinical practice.
Potential implications for practice. The model's trainable parameters are optimised based on a default threshold of 0.5 on the final softmax output layer (see Fig. 5): this value is used to classify COVID-19 positive and negative predictions. While our model could be used on the general population for COVID-19 digital testing, we explore different application contexts in which this threshold, as a hyper-parameter, can be adjusted for a more optimal outcome: we report the ROC curve and sensitivity/specificity under different decision thresholds for the asymptomatic and symptomatic groups (participants who did not and did declare symptoms) in Appendix Fig. 1 and Fig. 2, respectively. Specifically, when applying the model to screen the asymptomatic population for risk of exposure (Appendix Fig. 1b), a lower threshold can be used to obtain a higher Youden Index (defined as Sensitivity + Specificity − 1) and a higher sensitivity than with the threshold of 0.5, so that potential COVID-19 infections are exhaustively covered and false positives can be filtered out by further clinical testing. Yet, if the targeted group is symptomatic (Appendix Fig. 2b), a higher specificity can be achieved by slightly increasing the threshold to maximise the Youden Index, which limits the false positives. In this study, as we have limited samples for validation in our dataset, we only demonstrate the performance on the test set data under different threshold settings as a proof-of-concept. For clinical use, further investigation is required on how to adjust and calibrate this threshold to meet different testing criteria.
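The threshold adjustment discussed above can be sketched as a sweep over candidate thresholds that records sensitivity, specificity and the Youden Index for each. The function names, the candidate grid and the toy scores are assumptions for illustration only; the study's own threshold analysis is reported in the Appendix figures.

```python
def youden_sweep(labels, scores, thresholds):
    """Return (threshold, sensitivity, specificity, Youden Index) for each threshold."""
    rows = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s > t)
        fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s <= t)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s <= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s > t)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        rows.append((t, sensitivity, specificity, sensitivity + specificity - 1))
    return rows


def best_by_youden(rows):
    """Pick the row with the highest Youden Index (first one wins on ties)."""
    return max(rows, key=lambda row: row[3])


# Toy scores (hypothetical, not study data): positives tend to score low,
# loosely mimicking a group that is harder to detect at the default threshold.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.8, 0.6, 0.45, 0.35, 0.55, 0.4, 0.3, 0.1]
thresholds = [i / 10 for i in range(1, 10)]

rows = youden_sweep(labels, scores, thresholds)
for t, sens, spec, j in rows:
    print(f"threshold={t:.1f} sensitivity={sens:.2f} specificity={spec:.2f} J={j:.2f}")
best = best_by_youden(rows)
```

In this toy example the Youden Index peaks below 0.5, which corresponds to the screening scenario in which sensitivity is favoured. As noted in the text, a threshold chosen on test data is only a proof-of-concept and would need separate calibration data in practice.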
Finally, audio-based predictive models could be combined with signatures from other biological signals such as heart rate [30], as well as self-reported symptoms [6], for improved accuracy; however, this would require crowdsourcing additional data from the participants.
Conclusion. In this work, we have developed and validated a deep learning method for detecting COVID-19 solely by analysing human sounds via mobile or web applications. In particular, the crowdsourced data have been collected and processed to make the results reliable, by controlling potential confounding factors in COVID-19 positive and negative cases. We analysed the presented model's predictive performance on detecting COVID-19 infection, which may bring insights into the adoption of digital health technologies in the COVID-19 era. Moreover, we analysed the risks of modelling with various biased data, which led to overestimated performance. This demonstrated that biased data or modelling should be avoided to rigorously validate the digital testing tool for clinical efficacy.

Methods
Data collection and preparation. Our data were crowdsourced via a data gathering framework released in April 2020, in multiple languages and for multiple platforms (a webpage, an Android app, and an iOS app). Collected data consist of participants' age, gender, medical history, current symptoms, and three audio recordings: three voluntary cough sounds, three to five inhalation-exhalation sounds, and the participant reading a standard sentence from the screen three times. Participants were asked whether they had been tested for COVID-19, and an optional geo-location sample was collected.
The mobile apps also prompted the participant to input symptoms and sounds every two days. No identifiable information was collected. As of 26th April 2021, a total of 36,364 participants contributed 75,201 samples to our project.
We used samples with self-reported COVID-19 test results for experiments, taking the reported result as ground truth. Hence, 61,615 samples without reported test results were excluded. A further 110 samples with COVID-19 test results declared to have been obtained 2 weeks before the recordings were made were also discarded, because of the delay between the COVID-19 test and the audio recording. Our data was sourced in multiple languages (English, Italian, Spanish, Portuguese, etc.) and the number of samples in each language varied. To avoid language bias, for the main results of this paper we used English audio samples only, with 8,102 non-English samples excluded. Lastly, we manually checked the quality of each recording, deleting in total another 134 samples that were incomplete (recordings shorter than 2 seconds), silent, or distorted with poor audio quality. As a result, 5,240 samples from 2,478 participants were used for the majority of the experiments.
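The exclusion steps above can be checked with simple arithmetic. The sketch below uses only the sample counts stated in the text; the variable names and the short reason labels are illustrative.

```python
# Sample counts as reported in the text.
total_samples = 75_201
exclusions = [
    ("no reported test result", 61_615),
    ("test obtained 2 weeks before the recording", 110),
    ("non-English", 8_102),
    ("failed manual quality check", 134),
]

remaining = total_samples
for reason, count in exclusions:
    remaining -= count
    print(f"after excluding {count:>6,} ({reason}): {remaining:,} samples")

# The remainder should match the number of samples used in the experiments.
assert remaining == 5_240
```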
Data used and experiment design for bias evaluation. In addition to the above-mentioned data, we also prepared four datasets with known biases to validate the impact of confounding factors on audio-based COVID-19 testing, by selecting from all eligible samples with COVID-19 test results. To be more specific, the following strategies were followed to generate the data:
• Splitting: Our balanced training set and testing set contained 1000 participants (800 participants for training&validation, 200 for testing) and 1486 samples (1162 samples for training&validation, and 329 for testing), with 1.5 samples per participant on average. Instead of splitting training and testing by participants, for this comparison group, we randomly shuffled all samples and split them into training and testing according to the original ratio (1162:329).
• Gender bias: To simulate the scenario where the COVID-19 positive rate is significantly different in different gender groups, which raises the concern that the model is detecting gender instead of COVID-19, we manually selected 500 positive and 500 negative participants from the total 2,478 participants (blue box in Fig. 1) with gender distribution bias. Specifically, 56% of the positive group are male and the remaining 44% are female, while in the negative group, females account for 85% and males for 15%. Age demographics are kept balanced and the total number of participants is unchanged, as shown in Appendix Fig. 4.
• Age bias: With the same approach, for negative participants, we also purposefully selected 1) those aged over 39, and 2) those aged under 39, to simulate scenarios in which participants were not drawn from the whole population. The revised distribution can be found in Appendix Fig. 3.
• Language bias: Rather than using only English speakers, to investigate the effect of language, we replaced some English-speaking participants with Italian-speaking participants. Specifically, we used more positively-tested Italian speakers than negative ones. As a result, the positive group mainly consists of Italian speakers, introducing the bias that participants who speak Italian are more likely to be COVID-19 positive. The detailed percentages can be found in Appendix Fig. 5.
Data processing. For data pre-processing, all the collected audio recordings were resampled to 16 kHz and converted to mono channel. Then, these audio recordings were cropped by removing the silence periods at the beginning and the end of the recording, after which each sample was normalised.
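The pre-processing steps can be sketched in plain Python as follows. This is a simplified illustration, not the study's pipeline: it assumes the audio is already decoded at 16 kHz as a list of per-channel sample tuples (resampling is omitted), that silence is detected with a fixed amplitude threshold, and that normalisation means scaling to unit peak amplitude. All function names and the threshold value are hypothetical.

```python
def to_mono(frames):
    """Average the channels of each frame, e.g. (left, right) -> one value."""
    return [sum(frame) / len(frame) for frame in frames]


def trim_silence(signal, silence_threshold=0.01):
    """Drop leading and trailing samples whose magnitude is below the threshold."""
    loud = [i for i, x in enumerate(signal) if abs(x) > silence_threshold]
    if not loud:
        return []  # the whole recording is silent
    return signal[loud[0]:loud[-1] + 1]


def peak_normalise(signal):
    """Scale the signal so that its largest magnitude is 1.0."""
    peak = max((abs(x) for x in signal), default=0.0)
    if peak == 0.0:
        return list(signal)
    return [x / peak for x in signal]


def preprocess(frames):
    """Mono conversion, silence trimming, then normalisation (resampling omitted)."""
    return peak_normalise(trim_silence(to_mono(frames)))


# Toy stereo recording with silence at both ends.
stereo = [(0.0, 0.0), (0.0, 0.0), (0.2, 0.4), (-0.5, -0.7), (0.0, 0.0)]
processed = preprocess(stereo)
print(processed)
```

Silence inside the recording is kept; only the beginning and the end are cropped, matching the description above.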
Model architecture. We implemented a Convolutional Neural Network (CNN) based model for COVID-19 classification, as shown in Appendix Fig. 5. The network receives one sample with three audio recordings as input, including breathing, cough, and voice from one participant. A spectrogram is computed for each of the recordings and is fed into a VGGish subnetwork. VGGish is a state-of-the-art pre-trained CNN, which we use to transfer the knowledge learnt from an external massive general-audio dataset [31]. Each VGGish block transforms the input spectrogram into a latent feature vector, then the features from the three sound types are concatenated and finally fed into a binary classifier. This model design allows the three sound types to be analysed jointly. Specifically, the model is composed of three parts (Fig. 5):
(1) Input Layers: The audio recording is first chunked into non-overlapping segments of 0.96 seconds. A log-mel spectrogram is computed for each segment with a window size of 25 ms, a window hop of 10 ms, and a periodic Hanning window. 64 Mel bins are adopted for the Mel spectrogram, covering the frequency range from 125 Hz to 7500 Hz. A small offset is used to convert the mel-spectrogram into log-scale, resulting in a log-mel spectrogram with a size of 64 × 96 per chunk.
(2) Feature Extraction Layers: The main component of the model is VGGish, a CNN-based network with cascaded convolutional layers, max-pooling, and fully connected layers. This network transforms each input spectrogram frame into a 128-dimensional feature vector. Then, an average pooling layer is employed to aggregate all frames within one audio recording into one fixed-length latent feature vector. The size of the CNN kernels and the number of hidden states of fully connected layers are kept consistent with the original work [31].
(3) Prediction Layers: The resulting latent feature vectors for the three modalities are concatenated and fed into the binary classifier, which consists of two dense layers (with 96 and 2 hidden units, respectively) with non-linear ReLU and Softmax activation functions, respectively. The output of the model is a continuous score within the range of 0 to 1 (i.e., the probability of the participant being infected with COVID-19), which can then be categorised into a binary score (0: negative, and 1: positive) with a threshold of 0.5.
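The aggregation and prediction layers can be illustrated with a small forward-pass sketch. It takes the 128-dimensional per-frame embeddings as given (the VGGish feature extractor itself is not reproduced here) and uses randomly initialised weights, so the output is meaningless except for its shape and range. The layer sizes follow the text; the function names, weight initialisation and toy inputs are assumptions made for this example.

```python
import math
import random

EMBED_DIM = 128   # per-frame embedding size stated in the text
HIDDEN_UNITS = 96  # size of the first dense layer stated in the text


def average_pool(frames):
    """Aggregate a variable number of frame embeddings into one fixed-length vector."""
    n = len(frames)
    return [sum(frame[d] for frame in frames) / n for d in range(len(frames[0]))]


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def predict(breathing, cough, voice, params, threshold=0.5):
    """Return (probability of COVID-19, binary label) for one sample."""
    features = average_pool(breathing) + average_pool(cough) + average_pool(voice)
    hidden = relu(dense(features, params["w1"], params["b1"]))
    probabilities = softmax(dense(hidden, params["w2"], params["b2"]))
    p_positive = probabilities[1]  # index 1 is taken to be the positive class
    return p_positive, int(p_positive > threshold)


# Hypothetical random weights and embeddings, for shape checking only.
rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def random_frames(n_frames):
    return [[rng.random() for _ in range(EMBED_DIM)] for _ in range(n_frames)]


params = {
    "w1": random_matrix(HIDDEN_UNITS, 3 * EMBED_DIM),
    "b1": [0.0] * HIDDEN_UNITS,
    "w2": random_matrix(2, HIDDEN_UNITS),
    "b2": [0.0, 0.0],
}
probability, label = predict(random_frames(5), random_frames(3), random_frames(8), params)
print(f"p(COVID-19)={probability:.3f}, label={label}")
```

Average pooling is what allows recordings of different lengths (here 5, 3 and 8 frames) to produce a fixed 384-dimensional input to the classifier.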
To improve the robustness and generalisation ability of our deep learning model, the following techniques were employed:
• Transfer learning: Our collected data is relatively small compared to the number of parameters in the proposed deep neural network. In light of this, we harness transfer learning to improve the representation ability. Specifically, the VGGish layers are initialised from a pre-trained model, which was designed for an audio classification task.
• Differential learning rate: Both the VGGish and dense layers are jointly updated using our audio data. However, we used a small learning rate for the parameter updates of the VGGish part of the network, and a learning rate 10 times larger for the dense layers. Specifically, the learning rate was set to 1e-6 for VGGish and 1e-5 for the final dense layers.
• Experiment design. We performed participant-independent splitting, which means that samples from a participant included in the training set for parameter estimation are not used for model evaluation. We randomly selected 80% of all positive participants for model learning and sampled the same number of negative participants while maintaining similar demographic distributions in these two groups (see Appendix Table 1). This can minimise the bias in data collected in a crowd-sourcing manner. 10% of participants were held out for hyper-parameter search, such as the size of the dense layers. Once a final trained model was obtained, its performance was evaluated on data with different demographics, prevalence levels, and health conditions. Measures of performance include the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), sensitivity, and specificity. For all the metrics, we calculated 95% confidence intervals (95% CIs), using bootstrap resampling with 1000 bootstrap samples and replacement [32].
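The confidence intervals described above can be sketched with a bootstrap over the test set. The text specifies 1000 bootstrap samples drawn with replacement; taking the 2.5th and 97.5th percentiles of the resampled metric (the percentile method) is an assumption of this sketch, as are the function names and the synthetic scores.

```python
import random


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def bootstrap_ci(labels, scores, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `metric`."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        resampled_labels = [labels[i] for i in idx]
        if len(set(resampled_labels)) < 2:
            continue  # the metric is undefined when only one class is drawn
        stats.append(metric(resampled_labels, [scores[i] for i in idx]))
    stats.sort()
    low_index = max(0, round(n_boot * alpha / 2))
    high_index = min(n_boot - 1, round(n_boot * (1 - alpha / 2)) - 1)
    return stats[low_index], stats[high_index]


# Synthetic scores for 30 positive and 30 negative participants (not study data).
rng = random.Random(1)
labels = [1] * 30 + [0] * 30
scores = ([rng.gauss(0.6, 0.2) for _ in range(30)]
          + [rng.gauss(0.4, 0.2) for _ in range(30)])

point = roc_auc(labels, scores)
low, high = bootstrap_ci(labels, scores, roc_auc)
print(f"AUC = {point:.2f} (95% CI: {low:.2f}-{high:.2f})")
```

The same `bootstrap_ci` function can be reused for sensitivity or specificity by passing a different metric function with the same `(labels, scores)` signature.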

Data Availability
The data is sensitive, as voice sounds can be deanonymised. Anonymised data will be made available for academic research upon request to the corresponding author. Institutions will need to sign a data transfer agreement with the University of Cambridge to obtain the data. A copy of the data will be transferred to the institution requesting the data. We already have the data transfer agreement in place.

Ethics
The study was approved by the ethics committee of the Department of Computer Science at the University of Cambridge, with ID #722. Our app displays a consent screen, where we ask for the user's permission to participate in the study by using the app. Note also that the legal basis for processing any personal data collected for this work is to perform a task in the public interest, namely academic research. More information is available at https://covid-19-sounds.org/en/privacy.html.

Fig. 1 | Data flow diagram and demographic statistics. a) Data cleaning and selection, with 514 COVID-19 positive and 1,964 COVID-19 negative participants eventually included in the experiments. b-e) Demographic statistics for the included 2,478 participants, with blue representing the positive class and orange the negative class: b) Gender distribution, with about 56% male, 43% female and 1% preferring not to say in both classes. c) Age distribution, with more than 87% of positive and negative participants aged 20-59. d) Smoking history, with more than half of the participants never having smoked and the remaining participants having smoked before or smoking currently. e) Symptom distribution: 16% of the positive group showed no symptoms, while 49% of the negative group reported at least one symptom.

Fig. 3 | Performance comparison under different conditions. a) Performance for different prevalence levels, for participants with asthma/HBP or without any medical history, and for never-, ex- and now-smoking participants. # denotes the number of positive/negative participants. b) Predicted probabilities of COVID-19 infection for the negative participants, with values above 0.5 indicating a false positive. No significant difference was observed across subgroups.

Fig. 4 | Performance comparison. Sensitivity (blue) and specificity (pink) are presented, with solid lines showing the 95% CIs. Unless otherwise stated, the results are based on the combination of the three sound types. a, User-independent splits vs. sample-level random splits: (Seen) denotes the performance on samples from participants whose other samples were used for training; otherwise the performance is denoted by (Unseen). b, Controlled demographics vs. gender bias: (Female) denotes the female subgroup. c, Controlled demographics vs. two types of age bias: all negative participants in the training set aged over 39 or under 39. (Aged60-) and (Ages16-39) denote the older and the younger subgroups. d, e, and f, Model for English speakers vs. model for biased English and Italian speakers: (En) and (It) denote two subgroups from the testing set.

Fig. 5 | Overview architecture of the deep learning model. A convolutional neural network using cough, breathing, and voice sounds as input to predict COVID-19 as a binary outcome. This model can also be applied iteratively to longitudinal data to assess clinical progression over time. (VGGish is a neural network pre-trained on the Audioset dataset, Pooling is an aggregation operator, Dense is a fully connected neural network layer, Dropout is a randomised operation that reduces overfitting, ReLU is a rectified linear unit activation, Softmax is the normalised exponential function, which generalises the logistic function.)

Two-phase training: To use the data more efficiently, we first trained the model on the training set and identified the best hyper-parameters based on the average of sensitivity and specificity at the 15th epoch on the validation set. We then merged the training and validation sets to fine-tune the model until the training performance stopped changing. Parameters of the deep learning model were updated iteratively by gradient back-propagation of the binary cross-entropy loss on the training set. The training batch size was 1. The whole framework was implemented in Python 3.6 and TensorFlow 1.15. Model training was done on an Nvidia Quadro RTX 8000 GPU.
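To make the parameter update concrete, the sketch below shows a binary cross-entropy loss and a single gradient step that applies the two learning rates stated for the VGGish and dense layers (1e-6 and 1e-5). The text does not name the optimiser, so plain gradient descent is an assumption here, as are the function names and the grouping of parameters into a dictionary.

```python
import math

# Learning rates reported in the text for the two parts of the network.
LEARNING_RATES = {"vggish": 1e-6, "dense": 1e-5}

def binary_cross_entropy(p_positive, label, eps=1e-12):
    """Loss for one sample; `label` is 0 (negative) or 1 (positive)."""
    p = min(max(p_positive, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def gradient_step(params, grads, learning_rates=LEARNING_RATES):
    """One plain gradient-descent update (assumed optimiser) with a
    separate learning rate for each named parameter group."""
    return {
        group: [w - learning_rates[group] * g for w, g in zip(values, grads[group])]
        for group, values in params.items()
    }

# Toy usage: identical gradients move the dense weights 10 times further.
params = {"vggish": [1.0, -0.5], "dense": [1.0, -0.5]}
grads = {"vggish": [1.0, 1.0], "dense": [1.0, 1.0]}
updated = gradient_step(params, grads)
loss = binary_cross_entropy(0.5, 1)
```

With a batch size of 1, each such step would use the gradient from a single training sample.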

Fig. 1 | Model performance for asymptomatic screening. a, Receiver-operating characteristic curve for the binary classification task of diagnosing COVID-19. b, Sensitivity and specificity with 95% CIs under different thresholds. The default value is 0.50.

Fig. 2 | Model performance for symptomatic diagnosis. a, Receiver-operating characteristic curve for the binary classification task of diagnosing COVID-19. b, Sensitivity and specificity with 95% CIs under different thresholds. The default value is 0.50.

Table 1 | Demographics distribution of the training and balanced testing set. A small proportion (under 5%) of the participants in each set preferred not to answer about their age or gender.