An ensemble learning approach to digital corona virus preliminary screening from cough sounds

This work develops a robust classifier for a COVID-19 pre-screening model from crowdsourced cough sound data. The crowdsourced cough recordings contain a variable number of coughs, with some input sound files more informative than others. Accurate detection of COVID-19 from the sound datasets requires overcoming two main challenges: (i) the variable number of coughs in each recording and (ii) the low number of COVID-positive cases compared to healthy coughs in the data. We use two open datasets of crowdsourced cough recordings and segment each cough recording into non-overlapping coughs. The segmentation enriches the original data without oversampling by splitting the original cough sound files into non-overlapping segments. Splitting the sound files enables us to increase the samples of the minority class (COVID-19) without the change in the feature distribution of the COVID-19 samples that results from applying oversampling techniques. Each cough sound segment is transformed into six image representations for further analyses. We conduct extensive experiments with shallow machine learning, Convolutional Neural Network (CNN), and pre-trained CNN models. The results of our models were compared to other recently published papers that apply machine learning to cough sound data for COVID-19 detection. Our method demonstrated a high performance using an ensemble model on the testing dataset with area under receiver operating characteristics curve = 0.77, precision = 0.80, recall = 0.71, F1 measure = 0.75, and Kappa = 0.53. The results show an improvement in the prediction accuracy of our COVID-19 pre-screening model compared to the other models.

www.nature.com/scientificreports/ To achieve better automation in voice/cough feature extraction, a large-scale crowdsourced dataset of respiratory sounds was collected to aid the detection of COVID-19 9 . The authors used cough and breathing sounds to identify COVID-19 distinguished from sounds from asthma patients or healthy people. The librosa module was used as the primary audio processing library, while VGGish was used to automatically extract audio features in addition to the various handcrafted features 33 . The handcrafted and VGGish extracted features were utilized in shallow machine learning algorithms (i.e., logistic regression and support vector machine). The results showed that this model could differentiate between cough and breathing sounds of COVID-19 patients and healthy users or patients with asthma (AUC = 0.8).
There are several challenges and limitations associated with the previous studies. The main challenge is data availability and quality. Even though some datasets are publicly available, they naturally contain few COVID-positive samples compared to negative samples. Moreover, the nature of crowdsourced data does not guarantee noise-free recordings. The crowdsourced cough sounds may include prolonged silence periods or significant background noise, making it challenging for any machine learning model to identify valuable patterns related to COVID-19. Previous studies have used an overlapped sliding window approach to segment the cough sound files and, consequently, enrich the data of limited COVID-positive samples. The overlapped sliding window size may significantly impact the machine learning model results, as it may accumulate sound information unrelated to the cough (silence) if the window size is relatively long. If the window size is small, the machine learning model may learn repetitive patterns that might not necessarily correlate with COVID-19. The previous studies based their analysis on either the MFCC or the spectrogram of the sound files and did not explore other features or representations of the cough sound files. Moreover, the lack of fully automated feature extraction limits the ability of machine learning models to learn from diverse features that may identify COVID-19. In this work, we utilize a crowdsourced cough dataset with diverse length, pacing, number of coughs, and stochastic background noise from publicly available data 4,11 and segment the cough sound recordings into individual non-overlapped segments to enrich the COVID-positive records. For the first time, we process each recorded cough to generate multiple representations and extract automated features per record. We then employ the generated feature library to develop and examine several shallow and deep learning models.
The high-performance models are selected and further aggregated into an ensemble of classifiers to produce a robust classifier to identify COVID-19 from cough recordings. We used the kappa statistic to incorporate high-rank classifiers without favouring any of the classes 34 .

Results and discussion
The presented work identifies COVID-19 from cough sound recordings. The main challenge faced in this work is how to utilize a crowdsourced cough dataset with diverse length, pacing, number of coughs, and stochastic background noise from publicly available COVID-19 cough sounds. We provide a practical solution that segments the cough sound recordings into individual non-overlapped segments to enrich the COVID-19 positive records. We process each recorded cough to generate multiple representations and extract automated features per record. We then employ the generated feature library to develop and examine several shallow and deep learning models.
The high-performance models are selected and aggregated into an ensemble of classifiers to produce a robust classifier and identify COVID-19 patients from their cough recordings. In addition, we used the kappa statistic to incorporate high-rank classifiers without favouring any of the classes. Finally, we show the significance of the proposed classification method by comparing it to recent related works. The proposed method outperforms other, more complicated methods.
The methods developed so far segmented the cough sound recordings into overlapped segments of unjustified length and padded the resulting segments as needed. This type of segmentation introduced undesired frequencies and led to misleading classification results. Our method was deployed into a Web App to identify COVID-19 patients from cough sounds, which signals the work's potential practical significance.
There is a legitimate need for the proposed predictive models based on shallow and deep learning, wherein these models use non-medical secondary data to identify health-related conditions such as COVID-19. These predictive models can be used in large-scale real-world settings. The results on real-world datasets are promising and motivate further investigations into secondary data analysis for identification of other health-related conditions.
Here we illustrate the shallow and deep learning experimentation results on the target cough sound data extracted from crowdsourced recordings. The goal is to identify COVID-19 patients from just one cough. Table 1 shows the accuracy (average ± standard deviation) of seven different classifiers trained on each of the six representations extracted from each cough sound segment. We also use the raw data directly as input images to train these classifiers. The results show that the shallow learning models cannot explain much of the data variance. The Random Forest (RF) classifier trained with the spectrogram shows the highest accuracy of 0.67, followed by the logistic regression classifier trained with the spectrogram (0.66). Tables 1, 2 and 3 highlight the top three features per classifier by accuracy, sensitivity, specificity, precision, and negative predictive value. The results also show that the essential representations of the cough sounds are the spectrogram, power spectrum, MFCC, and MelSpectrum. This ranking is based on how many times a specific representation appears in the list of top-three highest-accuracy features per classifier. This is mainly due to the non-overlapping window used to perform the cough sound segmentation in this study. The Chroma, RAW, and Tonal representations have no significant impact in detecting COVID-19 from cough sounds. Since the other studies have not presented multiple features as we did in this study 3,9,31, there is no comparative information presentable in this regard. As most of the classification results are close to random chance on average across all features of the classifiers, we do not proceed with shallow learning models in the final ensemble. Tables 4 and 5 show the experimentation results with the three deep learning models: a CNN trained from scratch, the original Vgg16 model, and the Vgg16 model with data augmentation.
The deep learning models showed a better performance compared to the shallow learning models. It is noted that the essential features that produce the highest accuracy and AUC are the same as the list discovered by the shallow learning models. The kappa statistic of the top four features exceeds 0.2, suggesting at least a fair agreement between the observed accuracy from data and the accuracy due to the classifier decision function. This comparison justifies composing an ensemble from all the features and classifiers where the kappa statistic is more than 0.2. Here, we only compose four classifier models to obtain a more accurate classifier ensemble. For comparison, an ensemble is also created from all the features regardless of the associated kappa values. The last rows in Tables 2 and 3 represent the performance of three deep learning models following their training using all features. The high variation in the entire feature images creates a very diverse pattern that could not be captured well enough using deep learning models (maximum AUC = 0.63). Tables 6, 7 and 8 show the classification performances of the classifiers for training and testing the three deep learning models. The three deep learning models were trained for 100 epochs, and the average accuracy and standard deviation were recorded per feature. A CNN model was designed from scratch and trained on the power spectrum feature to train the other two deep learning models. The results show the highest average accuracy of 0.84 for the CNN model, followed by the accuracy of 0.8 for the Mel spectrum, 0.77 for the spectrogram, and 0.68 for MFCC. Chroma, Tonal, and the Raw data did not show an improved performance compared to the other features, consistent with the results of shallow classifiers.
Although the standard deviation of all models appeared to be relatively small, overfitting is observed for all classifiers, marked by the significant difference between the average accuracy for training and testing. The overfitting is mainly due to a relatively large number of weights and hyperparameters (compared to the input training image size) that must be estimated during training. Early stopping during the training phase is an effective method to combat overfitting. However, the stochastic gradient descent algorithm used for training a CNN model may get stuck in a local minimum when one uses 'early stopping' as a stopping criterion to terminate the training process. Another method to create a more robust classifier with resistance to overfitting is to promote independent classifier models (with different features) or aggregate them using majority voting. Tables 9, 10, and 11 present the result of ensembling the top 4 classifiers with kappa ≥ 0.2. The last row of these tables shows the performance of the ensemble models resulting from all the classifiers and all features. The CNN models trained from scratch showed the highest performance compared to other models (Precision = 0.8, Recall = 0.71, F1 = 0.75, AUC = 0.77, and kappa = 0.53). Table 12 provides a comparison of our results with results reported in previous studies. Previous works manipulated the classifier threshold to achieve specific sensitivity and specificity of interest 3,31. However, we set the threshold of all classifiers at 0.5 to eliminate the bias to a specific class (COVID-19 versus non-COVID-19). Our results are closest to the study 24, where the authors used the log-Mel spectrogram from cough sounds to train a ResNet18 CNN model and manipulated the model threshold toward producing a sensitivity of 0.9. The study of Laguarta et al. 31 used four ResNet50 CNN pre-trained models trained on muscular degradation and vocal cords, where the threshold manipulation was done on MFCC features to achieve AUC = 0.97.
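The selection and voting steps described above can be illustrated with a minimal sketch. It assumes that each classifier outputs a positive-class probability, that classifiers are ranked by their kappa value, and that a tie among an even number of voters is resolved as negative; the tie rule, the function names, and the example kappa values are hypothetical and are not taken from the study.

```python
def select_by_kappa(kappas, min_kappa=0.2, top_k=4):
    """Return up to top_k classifier names with kappa >= min_kappa, best first."""
    eligible = [(name, k) for name, k in kappas.items() if k >= min_kappa]
    eligible.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in eligible[:top_k]]


def ensemble_predict(probabilities, selected, threshold=0.5):
    """Majority vote over the selected classifiers for one cough segment.

    Each classifier votes positive (1) when its probability reaches the
    fixed 0.5 threshold. Assumption of this sketch: a tie counts as negative.
    """
    votes = [1 if probabilities[name] >= threshold else 0 for name in selected]
    return 1 if 2 * sum(votes) > len(votes) else 0


# Hypothetical kappa values, for illustration only (not results of the study).
example_kappas = {
    "power_spectrum": 0.45, "mel_spectrum": 0.40, "spectrogram": 0.35,
    "mfcc": 0.25, "chroma": 0.05, "tonal": 0.02, "raw": -0.01,
}
selected = select_by_kappa(example_kappas)

# Hypothetical per-classifier probabilities for a single cough segment.
example_probs = {
    "power_spectrum": 0.9, "mel_spectrum": 0.7, "spectrogram": 0.6,
    "mfcc": 0.2, "chroma": 0.1, "tonal": 0.8, "raw": 0.5,
}
prediction = ensemble_predict(example_probs, selected)
```

Keeping the threshold fixed at 0.5 for every voter mirrors the choice made in this work to avoid biasing the ensemble toward either class.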

Conclusion and future work
This work contributes to the crucial project of developing a purely digital COVID-19 diagnostic test by applying machine learning methods to analyze cough recordings. We developed a new technique to enrich crowdsourced cough sound samples by splitting/isolating the cough sound into non-overlapping coughs and extracting six different representations from each cough sound. It is assumed that there is a negligible information loss or frequency distortion due to the segmentation 35 (dynamic behaviour of the cough sound such as start-stop sequence or pauses). Several shallow (traditional) and deep machine learning models were trained to detect COVID-19 status (either positive or negative), using the kappa statistic (≥ 0.2) to select candidate classifiers and create an ensemble model that identifies COVID-19 status with better accuracy compared to individual models. Because there is a high degree of overlap between the class features, we did not reach an accuracy above 90%. However, this unbiased classification threshold ensures the minimal dependency of the predictive model on the type and pattern of classifiers. Future work can emphasize learning the similarity and difference among class labels and avoid or minimize excessive false positive (waste of resources) or false negative (untreated COVID-19 patient) results. The design and deployment of a mobile and Web app to longitudinally collect and analyze cough sounds can further support informing subjects about the algorithm's performance for their COVID-19 pre-screening. One of the recent developments in computational neuroscience is the utilization of the spiking neural network (SNN) 36,37, a new neural network model based on a discrete-event (spike) representation over time, rather than the continuous-value representation used in the convolutional neural network.
SNN showed considerable success in discrete event detection such as tinnitus 37 (i.e., a medical condition that causes ringing in the ears at uneven time intervals with variable intensity). Therefore, we propose to utilize the SNN model to identify COVID-19 vs non-COVID-19 directly from the coughing sound. Furthermore, utilizing the SNN model would help us prevent any information loss (due to quantization error) when segmenting the sound files into non-overlapping segments and further converting each segment into different visual representations (i.e., images).

Materials and methods
COVID-19 data pre-processing. Other studies used a sliding window (2 to 6 s) 4,11,38,39 to extract information from coughing and breathing sounds. The sliding window technique is sufficient if the dataset is noise-free. The noise may include a prolonged pause period and background noise. The sliding window may capture the dynamics of the sound signal. For instance, the sliding window technique can capture the number of coughs per 'unit time' and the time between two consecutive coughs. The dynamics of the cough sound signal may positively impact the successful detection of COVID-19 cases. However, the width of the sliding window may differ based on the quality of the cough sound. When the sliding window is relatively small, the dynamics of the cough sound may not be correctly captured, which causes misleading results. The longer the sliding window length, the less the dynamics of the cough sound are captured. The ensemble of machine learning models implemented in this study uses crowdsourced cough recordings to identify COVID-19. We randomly and manually verified 30% of the cough sound files in both datasets as a safeguard. Our verification test agrees with the ones done in previous studies 4,11. Table 13 shows the datasets used in our study. The dataset contains cough sounds for 1502 participants, of whom 114 participants are SARS-CoV-2 positive. It is noted that the combined total duration of cough sounds from COVID-positive participants is about 20 min and 4 s, which is considerably shorter than the combined total duration of cough sounds from the population of controls (4 h, 30 min, and 15 s). This highly imbalanced data motivates the segmentation of the positive cough sounds into non-overlapped segments (each segment contains only one coughing sound) to enrich the minority class (COVID-19 positives). After segmentation, the total number of sound samples that are COVID-positive is 638.

Experimental setup.
Our main goal is to learn from multiple representations of crowdsourced cough sounds to identify COVID-19 patients. More specifically, we aim to extract and integrate multiple information signals from a single cough sound to identify COVID-positive versus negative patients with adequate accuracy in a classifier without bias toward a specific population. While ensemble learning is a standard method to integrate multiple information signals 40 (either learning and pooling different classifiers on the same dataset or using bagging or boosting methods for ensemble learning), we focus on investigating different extracted features of a single cough sound to enhance the identification of COVID-19 status without oversampling or sliding window techniques. The research hypothesis necessitates the following requirements for a successful solution. The first requirement is to enrich the original data without oversampling by splitting the original cough sound files into non-overlapping segments. Splitting the sound files allows us to increase the sample size of the minority class (COVID-19) without the change in the feature distribution that results from applying oversampling techniques. The second requirement is to use an ensemble of classifiers that act independently on each extracted information signal and utilize the value of 0.5 as a threshold to decide on the input feature classes (COVID-19 positive vs. negative). This yields classifiers and classifier ensembles that do not favour one class over the other. The third requirement is to implement a robust inclusion/exclusion criterion to include or exclude a classifier in an ensemble.
This study utilizes several classification evaluation metrics, including AUC, accuracy (ACC), precision, recall, harmonic mean (F1), and Kappa statistic. The Kappa statistic is used as a reliability measure 34 (the inclusion/exclusion criterion) of each classifier to include it into an ensemble for producing a more robust classifier. The range of the Kappa statistic is [−1, 1]. It is interpreted as follows: values ≤ 0 imply no agreement (i.e., the observed classification results are due to random chance and not to the expected results of a classifier decision function), 0.01-0.20 as none to a slight agreement, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1.00 as almost perfect agreement (i.e., the observed classification results are in near-complete agreement with the expected accuracy due to the classifier decision function).
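As an illustration of the inclusion/exclusion criterion, the following sketch computes Cohen's kappa for binary labels and maps it to the interpretation bands listed above. It is a plain-Python illustration under the usual definition of kappa, (observed agreement − chance agreement) / (1 − chance agreement); the function names and the toy labels are hypothetical, and returning 0.0 when chance agreement equals 1 is a convention of this sketch only.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa between true and predicted labels of equal length."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    # Observed agreement: fraction of samples where prediction equals truth.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal label frequencies, summed.
    expected = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    if expected == 1.0:
        return 0.0  # kappa is undefined here; 0.0 is this sketch's convention
    return (observed - expected) / (1.0 - expected)


def interpret_kappa(kappa):
    """Map a kappa value to the agreement bands used in the text."""
    if kappa <= 0:
        return "no agreement"
    if kappa <= 0.20:
        return "none to slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Toy labels (hypothetical): 1 = COVID-positive, 0 = negative.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 1]
toy_kappa = cohen_kappa(toy_true, toy_pred)  # observed 0.75, chance 0.5
```

Under these bands, the ensemble kappa of 0.53 reported in this work falls in the moderate range.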
Analysis workflow. Figure 1a shows the analytical pipeline used in this study for pre-processing, feature extraction, and ensemble learning of COVID-19 relevant cough sounds. The pipeline starts with reading the cough audio files and segmenting them into individual non-overlapping sound files. The segmentation is conducted using the audio activity detection module to process audio files (Auditok) 41. This module is used as a universal tool for sound data tokenization, functioning based on finding where an acoustic activity occurs in an audio stream followed by isolating the equivalent slice of the audio signal. Figure 1b shows an example of an original cough recording sound signal, and Fig. 1c shows the corresponding isolated non-overlapped signals. Following the audio splitting step, each isolated cough sound enters the measuring module to generate six different independent frequency measures (representations of the same cough sound). Each measure is converted into a reasonable resolution image (432 × 288 pixels) for further analysis. The Mel frequency scale is a standard audio signal representation offering a rough human frequency perception model 33. The six measures for each isolated segment are Mel spectrum, power spectrum, spectrogram, chroma, tonal, and MFCC, all based on the Mel frequency scale. Figure 2 shows an example of raw cough sound data with its associated images of the measures.
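The idea of isolating acoustic activity into non-overlapping segments can be sketched with a simple frame-energy detector. This is a simplified stand-in written for illustration: it is not Auditok's algorithm or API, and the function name, the frame length, the energy threshold, and the silence and duration limits are all assumed values.

```python
def split_on_silence(samples, sample_rate, frame_ms=10, energy_threshold=0.01,
                     max_silence_ms=50, min_event_ms=30):
    """Return non-overlapping (start, end) sample indices of active regions.

    A frame is active when its mean squared amplitude reaches the threshold.
    An event ends once more than max_silence_ms of silence follows it.
    All parameter defaults are assumptions of this sketch.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    max_silent_frames = max_silence_ms // frame_ms
    min_event_frames = max(1, min_event_ms // frame_ms)
    n_frames = len(samples) // frame_len
    segments = []
    start = None      # index of the first frame of the current event
    silent_run = 0    # consecutive silent frames seen inside the event

    def close_event(end_frame):
        # Keep the event only if it is long enough to be a plausible cough.
        if end_frame - start >= min_event_frames:
            segments.append((start * frame_len, end_frame * frame_len))

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy >= energy_threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run > max_silent_frames:
                close_event(i - silent_run + 1)  # exclude trailing silence
                start = None
                silent_run = 0
    if start is not None:
        close_event(n_frames - silent_run)
    return segments


# Synthetic recording at 1 kHz: silence, a burst, silence, a burst, silence.
toy_signal = ([0.0] * 200 + [0.5, -0.5] * 50 + [0.0] * 300
              + [0.5] * 150 + [0.0] * 250)
toy_segments = split_on_silence(toy_signal, sample_rate=1000)
```

Because each event is closed before the next one can open, the returned segments never overlap, which is the property the enrichment step relies on.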
Inspired by the VGGish model 42 for feature extraction in audio signals, we extract features from these images using the Vgg16 architecture and subject them to several shallow and deep learning models. After segmenting all the positive and negative cough sound files for all participants in both datasets used in this study, we reached a total of 638 COVID-positive and 8248 negative cough sounds. We used all the 638 positive cough sounds while randomly selecting 638 negative cough sounds to create a balanced dataset (1276 cough sound samples) for training and testing purposes. The data was divided into 80% for training (1020 images for each measure) and 20% for testing all the machine learning classifiers used in this study (256 images for each measure).
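The balancing and splitting arithmetic can be reproduced with a short sketch. It assumes random undersampling of the negative class followed by a per-class (stratified) 80/20 split, which yields the 1020 training and 256 testing samples quoted above; the stratification, the seed, the function name, and the placeholder sample identifiers are assumptions of this illustration.

```python
import random


def balanced_split(positives, negatives, train_fraction=0.8, seed=0):
    """Undersample negatives to match positives, then split each class.

    Returns (train, test) lists of (sample, label) pairs, with label 1 for
    COVID-positive and 0 for negative. Splitting per class keeps the test
    set balanced; this is an assumption of the sketch.
    """
    rng = random.Random(seed)
    sampled_negatives = rng.sample(negatives, len(positives))
    train, test = [], []
    for label, group in ((1, list(positives)), (0, sampled_negatives)):
        rng.shuffle(group)
        n_train = int(len(group) * train_fraction)
        train += [(sample, label) for sample in group[:n_train]]
        test += [(sample, label) for sample in group[n_train:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Placeholder identifiers standing in for the segmented cough images.
toy_positives = [f"pos_{i}" for i in range(638)]
toy_negatives = [f"neg_{i}" for i in range(8248)]
train_set, test_set = balanced_split(toy_positives, toy_negatives)
```

With 638 samples per class, each class contributes 510 training and 128 testing samples, giving 1020 and 256 in total.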
We experiment with several traditional (shallow) machine learning models, including Naïve Bayes, logistic regression, k-nearest neighbours, random forest, stochastic gradient descent, extreme gradient boosting, and support vector machine. Figure 3 shows the overall analytical pipeline for training and testing our models. The training features are extracted using the pre-trained vgg19 model. The pre-trained model produces a feature vector of 25,088 values per input image. Principal component analysis (PCA) was employed to reduce the dimension of the input features and a standard scaler to normalize them, and a set of seven classifiers was then trained. Furthermore, we experiment with three different CNN models, where one model is trained from scratch, and the other two are based on the vgg16 pre-trained model.
Figure 3. … We measure and record the training evaluation results to choose the best classifiers. (b) Testing pipeline: Once the training process is completed, we score the testing data against each trained pipeline. The trained pipeline is composed of the best PCA, standard scaler, and classifier parameters. The testing data has 256 images per representation with equal labels for both COVID-positive and negative cases. We measure and record the testing evaluation results to estimate the generalization error of each pipeline.