Development and clinical validation of Swaasa AI platform for screening and prioritization of pulmonary TB

Acoustic signal analysis has been employed in various medical devices. However, very few studies have used cough sound analysis to screen for presumptive pulmonary tuberculosis (PTB). The main objective of this cross-sectional validation study was to develop and validate the Swaasa AI platform to screen and prioritize at-risk patients for PTB based on the signature cough sound as well as symptomatic information provided by the subjects. Voluntary cough sound data were collected at Andhra Medical College, India. An algorithm based on a multimodal convolutional neural network architecture and a feedforward artificial neural network (tabular features) was built and validated on a total of 567 subjects, comprising 278 positive and 289 negative PTB cases. The outputs of these two models were combined to detect the likely presence (positive cases) of PTB. In the clinical validation phase, the AI model was found to be 86.82% accurate in detecting the likely presence of PTB, with 90.36% sensitivity and 84.67% specificity. Pilot testing of the model was conducted at a peripheral health care centre, RHC Simhachalam, India, on 65 presumptive PTB cases. Of the 20 subjects the model flagged as likely PTB, 15 were confirmed PTB positive, giving a positive predictive value of 75%. The validation results obtained from the model are encouraging. The platform has the potential to fulfil the unmet need for a cost-effective PTB screening method. It works remotely, presents results instantaneously, and does not require a highly trained operator. Therefore, it could be implemented in inaccessible, resource-poor parts of the world.

www.nature.com/scientificreports/

It has also been postulated that the glottis behaves differently under different pathological conditions, which makes it possible to distinguish coughs originating from different underlying conditions such as asthma, bronchitis, and pertussis (whooping cough) 11. Since coughing is a dominant symptom of PTB, there are reports suggesting that the cough sound of an individual with pulmonary TB has unique characteristic features that distinguish the diseased condition from the normal scenario 12-14. Still, much research is needed to fully explore and decode the information contained in the cough sound before it can be used as an indicator of the underlying disease. The recent application of artificial intelligence (AI) and advances in ubiquitous computing for respiratory disease prediction have created a promising trend and many future possibilities in the medical domain 15-17. Machine learning (ML) and deep learning (DL) algorithms that exploit cough signatures are emerging rapidly 18. Cough analysis approaches are primarily subjective and are affected by the limitations of human perception. Audiometric analysis of cough (as a digital signal) provides essential information about the characteristics of cough sounds in different respiratory pathological conditions. Several studies have been conducted in the past to collect and analyse cough sound data for PTB pre-screening and triaging using mobile devices. However, there are gaps in how subjects were selected and how cough data were collected, and a lack of proper technical and clinical validation needed to scale up these tools for mass screening of PTB subjects 8,9,13,19,20.
Our study provides a holistic approach by developing, validating, and testing the "Swaasa AI platform" to screen and prioritize potential PTB cases. It is a SaMD (Software as a Medical Device) that evaluates respiratory health using a 10-s cough sound recording, serving as a quick point-of-care tool. It prioritizes at-risk patients for molecular testing when used as a screening and triaging tool. In contrast to the majority of previous reports, which used crowdsourced cough sound databases to train their models, we collected data from 567 unique subjects in a proper clinical setting for both the model derivation and validation phases. Hence, our data contain cough recordings from many unique subjects, which supports building a robust model. Unlike others, we trained two parallel models, i.e., a convolutional neural network (CNN) model on Mel spectrograms and a feedforward artificial neural network (FFANN, or tabular) model on primary as well as secondary features, and merged their final layers to build a combined logic. In the validation phase, the AI model was found to be 86.82% accurate in detecting the likely presence of PTB, with 90.36% sensitivity and 84.67% specificity. Therefore, it satisfies the specificity (70%) and sensitivity (90%) criteria set by the World Health Organisation (WHO) for a community-based mass TB screening test 14. The results obtained by the model are very promising, with scope to make it scalable for quick, cost-effective, and non-invasive screening of PTB cases. A large-scale study will further help us improve the accuracy of the platform and make it more reliable for screening genetically diverse subjects under different environmental conditions.

Materials and methods
Sample size estimation. To calculate an adequate sample size for our study, we used a simple formula that required us to select appropriate values for several assumptions. The formula used was $n = Z^2 P(1 - P)/d^2$, where n is the sample size, Z is the statistic corresponding to the level of confidence, P is the expected prevalence, and d is the precision corresponding to the effect size 21. The values used in the formula were Z = 1.96, P = 0.0075, and d = 0.01. The sample size (n) obtained using this formula was 285. We recruited 567 subjects (almost double) to account for potential attrition or missing data during the study; in the event, data from all 567 participants were used. We validated the model to see whether it could detect PTB with 90% sensitivity, assuming a 1% error for a 95% confidence interval (CI) and a prevalence of 0.75%, since the highest prevalence of PTB in India is 0.747% (747 per 100,000 population) 22. Among the 567 subjects, 50.9% were classified as controls. The control group consisted of both healthy individuals and those who displayed respiratory disease symptoms but tested negative for PTB via CB-NAAT. These respiratory conditions included asthma, chronic obstructive pulmonary disease (COPD), interstitial lung disease (ILD), and pneumonia. The number of TB records was calculated based on disease prevalence. To avoid potential bias in the model, it was trained using an equal number of TB and non-TB records.
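As a quick check of the arithmetic, the sample size formula above can be evaluated directly. The sketch below is illustrative only; the function name `sample_size` is an assumption, and the inputs are the values stated in the text. With these inputs the expression evaluates to just under 286, in line with the reported estimate of 285.

```python
def sample_size(z: float, p: float, d: float) -> float:
    """Return n = Z^2 * P * (1 - P) / d^2 without rounding."""
    return (z ** 2) * p * (1.0 - p) / (d ** 2)


# Values stated in the text: Z = 1.96, P = 0.0075, d = 0.01
n = sample_size(z=1.96, p=0.0075, d=0.01)
print(f"estimated sample size: {n:.2f}")
```

Whether the result is rounded down or up is a convention; the text reports the truncated value.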
Data collection. The cough data were collected at Andhra Medical College (AMC), Visakhapatnam, India, as part of the clinical study "Swaasa Artificial Intelligence Platform for detecting the likely presence of pulmonary tuberculosis". The study was registered with the Clinical Trials Registry-India (CTRI/2021/09/036609) on 17th September 2021. The methods were performed in accordance with relevant guidelines and regulations and were approved by the AMC Institutional Ethics Committee (IEC). Written informed consent was obtained from all enrolled subjects. After informed consent was obtained, the patient's demographic details and vitals were collected. The patients were also interviewed as per Part I of the St. George's Respiratory Questionnaire (SGRQ) 23, which primarily covers the symptoms they had experienced within the past few months or year. This was followed by cough sound collection by trained health care personnel using a smartphone (Android or iPhone). To ensure the highest quality data for analysis, several factors were taken into consideration before recording. The person being recorded was instructed to sit comfortably in a quiet place, hold the recording device (which included smartphones and tablets from various manufacturers) 4-8 inches away from their mouth, and maintain a 90° angle with their face. They were also instructed to take a deep breath and cough 2-3 times until the recording stopped, which lasted for 10 s. However, since the data came from varying environments and a variety of devices, it was important to control as many potential variables as possible. Therefore, noise filtering was applied using a noise reduction algorithm. For smartphones with two or more built-in microphones, this algorithm calculated the ratio of the power of the signals observed at two microphones and then calculated the spectral gain function from this power level ratio using the sigmoid function. The result was a denoised audio recording.
For smartphones with a single built-in microphone, noise filtering was not applied during the recording. Instead, a noise removal technique was applied during pre-processing. This involved subtracting the noise audio clip (which contained background noise such as electronic noise, multiple people talking, and fan sound) from the signal audio clip (which contained the cough). The noise removal technique isolated the signal using the Fast Fourier Transform, removing the background noise and resulting in a cleaner recording. Valid coughs were detected using a cough/non-cough classifier, which screened the dataset for coughs with high background noise. If a recording did not contain the minimum required number of valid coughs, a message appeared on the mobile screen instructing the person to give another recording following the instructions. Overall, these processes standardized the dataset, making it suitable for analysis.
During the audio recording process, we implemented several safety measures to prevent the transmission of disease. All subjects were required to wear a surgical mask while providing the audio recordings, in order to limit the spread of germs through droplets during coughing. After each recording, the phone used for recording was cleaned using one of three methods: a disinfectant wipe; a damp microfiber cloth dipped in soapy water, if the phone was waterproof; or a mobile sanitizer. These measures helped to maintain a clean and safe environment during the data collection process.
Following the cough sample collection, patients underwent CB-NAAT and chest X-ray (CXR, P/A view) for diagnosis of PTB. The data distribution across gender and age groups is presented in Fig. 1. To be included, a patient had to (a) be aged ≥ 18 years and (b) display symptoms suggestive of PTB (presumptive PTB). Patients who were (a) aged < 18 years or (b) on ventilator support were excluded from the study. COVID precautionary and infection control measures were followed strictly.

Model development and training.
In Phase 1 (derivation phase) of the study, we aimed to develop and train a machine learning-based model for the detection of pulmonary tuberculosis (PTB) using cough sounds. The objective of this phase was to quantify the technical as well as analytical performance of the device by establishing a unique cough signature for PTB. A total of 195 PTB-positive cases were recruited for the derivation phase, and audio cough recordings were collected. In addition, 152 non-PTB subjects were included to train the model to distinguish PTB from normal healthy subjects as well as from other respiratory diseases.
Event extraction was carried out on the collected audio cough records using the moving window signal standard deviation technique 24. A cough/non-cough classifier was used to separate the events into actual coughs and non-coughs such as silence, speech, fan sounds, vehicle sounds (e.g., horns), and noise. A total of 3102 cough events were extracted at this step. Features were extracted from both the time and frequency domains of each cough event. The important time-domain features were zero-crossing rate (ZCR) and energy. The frequency-domain features used for data analysis were MFCC, spectral centroid, spectral bandwidth, and spectral roll-off 25. The features were extracted for each frame within the cough signal. Each frame was typically about 20 ms in duration. The duration of a cough event can vary between 200 and 700 ms.
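To make the two time-domain features concrete, the following minimal sketch computes ZCR and short-time energy on non-overlapping 20 ms frames. It is an illustration under stated assumptions rather than the study's pipeline: the function name `frame_features`, the non-overlapping framing, the 8 kHz sample rate, and the synthetic sine "event" are all hypothetical choices.

```python
import math


def frame_features(samples, sample_rate, frame_ms=20.0):
    """Return a list of (zcr, energy) tuples, one per non-overlapping frame."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000.0))
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Zero-crossing rate: fraction of adjacent sample pairs whose signs differ
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        zcr = crossings / (frame_len - 1) if frame_len > 1 else 0.0
        # Short-time energy: mean of the squared amplitudes in the frame
        energy = sum(x * x for x in frame) / frame_len
        features.append((zcr, energy))
    return features


# Synthetic 400 ms "event": a 200 Hz sine at 8 kHz (illustrative, not study data)
sr = 8000
tone = [math.sin(2 * math.pi * 200 * n / sr) for n in range(sr * 400 // 1000)]
feats = frame_features(tone, sr)
print(len(feats), feats[0])
```

A 400 ms event yields 20 frames here, which sits inside the 200 to 700 ms range of cough durations mentioned above.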
A total of 209 features were extracted: age, gender, 120 Mel frequency cepstral coefficient features (40 MFCC, 40 first derivatives of MFCC, 40 second-order derivatives of MFCC), 9 spectral features (spectral centroid, spectral roll-off, spectral bandwidth, dominant frequency, spectral skewness, spectral kurtosis, spectral crest, spectral spread, and spectral entropy), 33 chroma features (11 chroma, 11 first derivatives of chroma, 11 second derivatives of chroma), 18 contrast features (6 contrast, 6 first derivatives of contrast, 6 second derivatives of contrast), 15 tonnetz features (5 tonnetz, 5 first derivatives of tonnetz, 5 second derivatives of tonnetz), 3 zero-crossing rate features (ZCR, first derivative of ZCR, second derivative of ZCR), 3 energy features (energy, first derivative of energy, second derivative of energy), 3 skewness features (skewness, first derivative of skewness, second derivative of skewness), and 3 kurtosis features (kurtosis, first derivative of kurtosis, second derivative of kurtosis). On these features, we performed correlation analysis and recursive feature elimination (RFE) to rank the features according to their importance. Correlation-based feature selection was used to reduce the feature count from 209 to 170; highly correlated features were removed to prevent overfitting and improve the performance of the model. The primary features comprise all 170 retained features. The secondary features included age (categorized), gender, symptoms, cough type (dry/wet), and cough duration. The cough type is derived from the primary features, and the cough duration is derived from the audio signal.
The CNN model was trained on the Mel spectrograms of cough sounds, whereas both secondary and primary features were used to train the FFANN model. The CNN model used in the study was based on transfer learning, using ResNet-34 with ImageNet weights for training on spectrograms. The FFANN was used to process the tabular data.
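The feature bookkeeping and the idea behind correlation-based filtering can be illustrated with a short sketch. The group sizes are taken from the text; everything else is an assumption made for illustration: the helper names `pearson` and `drop_correlated`, the greedy keep-first strategy, the 0.9 threshold, and the toy columns. The source does not state the threshold or the exact procedure used.

```python
import math

# Group sizes as listed in the text; they should add up to 209.
group_sizes = {
    "age_gender": 2, "mfcc": 120, "spectral": 9, "chroma": 33, "contrast": 18,
    "tonnetz": 15, "zcr": 3, "energy": 3, "skewness": 3, "kurtosis": 3,
}
total_features = sum(group_sizes.values())


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def drop_correlated(columns, threshold=0.9):
    """Greedy filter: keep a feature only if its |r| with every kept feature
    is at most `threshold` (hypothetical value, not from the study)."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy columns (made-up numbers): the second is an exact multiple of the first.
toy = {
    "mfcc_1": [0.1, 0.4, 0.35, 0.8, 0.6],
    "mfcc_1_scaled": [0.2, 0.8, 0.7, 1.6, 1.2],
    "zcr": [0.9, 0.1, 0.5, 0.3, 0.7],
}
selected = drop_correlated(toy)
print(total_features, selected)
```

A greedy filter like this depends on column order; ranking features first (for example with RFE, as the text mentions) is one way to decide which member of a correlated pair survives.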
The FFANN consisted of two hidden layers, with 400 and 300 neurons, respectively. Each layer was followed by batch normalization. The number of layers, the number of neurons in each layer, and the activation functions were selected using Bayesian optimization. The last fully connected layers of both models were removed, and a new fully connected layer (the merged layer) was added to predict the final output. The merged layer consists of activation layers (Fig. 2). This approach of merging the last layers of the two models was named the combined logic. When the model is uncertain whether PTB is likely or unlikely, it returns an inconclusive output, as shown in the block diagram in Fig. 3, where "PTB likely" indicates a TB-positive condition and "PTB unlikely" indicates a TB-negative condition.
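The three-way output described above (PTB likely, PTB unlikely, inconclusive) can be pictured as a decision on the merged layer's score with an uncertainty band. The sketch below is purely illustrative: the source does not say how uncertainty is measured, so the function name `classify_score`, the use of a single probability-like score, and the 0.4/0.6 band are all hypothetical.

```python
def classify_score(score, low=0.4, high=0.6):
    """Map a merged-model score in [0, 1] to one of three labels.

    `low` and `high` are hypothetical thresholds, not values from the study.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score >= high:
        return "PTB likely"
    if score <= low:
        return "PTB unlikely"
    return "inconclusive"


labels = [classify_score(s) for s in (0.9, 0.5, 0.1)]
print(labels)
```

Widening the band trades more inconclusive results (and repeat recordings) for fewer confident errors.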
Overall, the primary and secondary features were used to train the feedforward neural network (tabular) model, Mel spectrograms were used to train the CNN, and the combined logic approach was used to merge the outputs of the two models. Correlation-based feature selection was also used to improve the performance of the model.
During our study, we split the initial dataset into 80% training and 20% testing sets to assess the performance of the machine learning model. In addition, we used k-fold (k = 10) cross-validation to divide the training dataset into k subsets, which allowed us to obtain a more robust estimate of the model's performance.
Clinical validation of the model. After training, the model was tested in Phase 2, i.e., the clinical validation phase. A total of 220 presumptive PTB cases were recruited and screened using the model. The results were compared with diagnosis based on the sputum CB-NAAT test or radiological diagnosis, which are considered classical gold standard diagnostic methods. A total of 98% of the subjects underwent sputum CB-NAAT testing for the diagnosis of TB. For the remaining 2% of subjects, whose CB-NAAT results were unclear, a repeat test was performed. If the results remained inconclusive after the second test, fresh sputum samples were collected. Additional tests such as acid-fast bacilli (AFB) staining and chest X-ray (CXR) were performed to confirm the TB status of the patients. A consolidated test summary sheet was generated, containing the results obtained from the classical gold standard diagnostic methods along with the model's output. The two sets of results were then compared by a statistician.
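The 80/20 hold-out split and 10-fold cross-validation used during model training can be sketched in a few lines. This is a minimal illustration under assumptions, not the study's code: the helper names `train_test_split` and `k_fold`, the fixed seed, and the placeholder record IDs are hypothetical.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle `items` and return (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold(items, k=10):
    """Yield (train, validation) pairs; each item is validated exactly once."""
    items = list(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation


record_ids = list(range(100))  # placeholder IDs, not study data
train_ids, test_ids = train_test_split(record_ids)
folds = list(k_fold(train_ids, k=10))
print(len(train_ids), len(test_ids), len(folds))
```

When several coughs come from the same subject, splitting by subject rather than by individual cough keeps one person's recordings from appearing on both sides of a split; the text does not state which unit was used.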
External validation of the model. In Phase 3 (pilot phase), the model was externally validated to evaluate its effectiveness as a screening tool for PTB detection prior to clinical diagnosis. The sample for this phase consisted of 65 individuals who were identified as presumptive PTB cases and recruited from a peripheral healthcare centre, RHC Simhachalam. The effectiveness of the model was measured by calculating the ratio of patients truly diagnosed as PTB positive via standard lab-based diagnostic techniques to all those predicted to be PTB positive by the AI-based model. The diagnostic performance of the model was evaluated using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy. A data analysis strategy similar to that used in Phase 2 was employed: the AI-based model was compared with classical gold standard diagnostic methods, such as sputum CB-NAAT testing or radiological diagnosis, and the results were analysed using statistical methods.
LIME representation. In the local interpretable model-agnostic explanations (LIME) representation 26, the green parts show where the model reacted positively for a particular class and the red parts highlight where it reacted negatively. LIME explains a prediction by presenting textual or visual artefacts that provide a qualitative understanding of the relationship between the instance's components (e.g., words in text, patches in an image) and the model's prediction.

Statistical analysis. The comprehensive evaluation of the model performance on the test set includes accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and ROC. To measure the variability around these parameters, we computed 95% confidence intervals using the Clopper-Pearson method 27. To better understand the performance of the model in screening PTB subjects, we also calculated the confusion matrix on the entire test set.
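For readers unfamiliar with the Clopper-Pearson (exact binomial) interval, the sketch below computes it with the standard library only, by bisecting on the binomial tail probabilities. It is an illustrative implementation, not the software used in the study; the helper names are assumptions. The example counts (75 of 83) are inferred from the reported 90.36% sensitivity among the 83 PTB-positive subjects of the validation phase and are not quoted from the source.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, target, iters=100):
    """Bisection on [0, 1] for an increasing function f with f(p) == target."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for k successes in n trials."""
    # Lower bound: P(X >= k | p) = alpha / 2
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: P(X <= k | p) = alpha / 2, i.e. P(X > k | p) = 1 - alpha / 2
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


# Inferred example: 75 of 83 PTB-positive subjects detected (about 90.36%)
ci_low, ci_high = clopper_pearson(75, 83)
print(f"sensitivity 95% CI: {ci_low:.3f} to {ci_high:.3f}")
```

The interval is asymmetric around the point estimate, which is expected for proportions near 0 or 1.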

Results
Patient population in model derivation phase. Cough sound data were collected from 195 PTB-positive subjects and 152 PTB-negative subjects in the derivation phase. Among the 195 PTB-positive subjects, 65% were male and 35% were female, with ages ranging from 18 years to 64 and above. TB was confirmed in these subjects by standard diagnostic methods. In this phase, multiple data points were collected from the subjects. Each data point was called a record. A total of 597 cough records were collected from the 195 patients. The data were annotated with the disease condition PTB, i.e., PTB likely as "yes". For PTB unlikely, data representing other respiratory disease conditions were added from pre-existing labelled datasets (collected as part of earlier studies) in appropriate proportions 25.
Table 1 lists the mean values of the features extracted from individual frames, where we have considered normal data as well as data for respiratory diseases other than PTB from our previous validation study conducted at Apollo Hospitals, Hyderabad 25.
LIME data comparison. Spectral content is the distribution of the audio signal over frequency with respect to time. High spectral content indicates that the energy of the cough bout remains the same throughout the signal, whereas low spectral content corresponds to conditions where the energy of the cough decreases with time. We observed that conditions like COPD and ILD carried very low spectral frequencies compared to asthma, which has a medium spectral frequency. On the other hand, we detected very high spectral content for diseases involving mucus accumulation in the airways and fluid accumulation in the parenchymal region, such as PTB. Features like high spectral content make the PTB cough unique and differentiate it from other respiratory diseases. Thorough feature analysis of the cough sounds showed that cough sounds can distinguish diseases. Cough duration and frequency distribution vary with the pathological condition of the respiratory system 9,28.
Table 2 lists a few examples of cough signatures, cough spectrograms, and the related LIME maps for different respiratory diseases, including PTB. It is evident from the LIME maps that the frequency distribution of the coughs is unique for each disease. Specifically, both asthma and ILD have negative reactions in the high-frequency regions. TB has a positive reaction in both the high-frequency and the low-frequency regions. The normal cough signature is widely spread and, unlike the diseased conditions, does not have a strong patch around a given region. Similarly, in the first column of the table, the variation of cough amplitude from bout to bout differs between diseases. As amplitudes vary, energy also varies from bout to bout.
From the feature analysis we conclude that PTB related cough has a unique signature, and it is captured by the features extracted from the cough, which can be identified by a machine learning model.
Performance of combined logic model. Initially, the training data, i.e., 3102 coughs extracted from 597 records collected from 195 subjects, were internally divided into training and validation sets as required to build and optimize the model using the k-fold cross-validation technique. The performance of our model was evaluated using k-fold cross-validation, with k set to 10. The metric used for evaluation was the area under the receiver operating characteristic curve (AUC-ROC), which provides a measure of the model's ability to distinguish between positive and negative samples. The obtained AUC score was 0.98, indicating that our model is highly effective in making accurate predictions. Figure 4 shows the representative ROC curve of the best-performing fold among the 10 cross-validation folds. In the machine learning model, attributes such as the learning function and activation function were fixed during training. Hence, the dataset was divided into subsets and the model was trained with each subset in order to validate it. Further, the data collected in the derivation phase were divided into 80% training and 20% test sets. When the test data were run through the classifier, we obtained the four outcomes listed in Table 3: 102 true positives (TP), 20 false negatives (FN), 22 false positives (FP), and 128 true negatives (TN), corresponding to 85% accuracy, 84% sensitivity, and 85% specificity.
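The reported percentages follow directly from the four confusion-matrix counts. The short sketch below recomputes them from the counts given above (TP = 102, FN = 20, FP = 22, TN = 128); the function name `diagnostic_metrics` is an assumption, and the pilot-phase line uses the 20 flagged and 15 confirmed subjects reported for the pilot.

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Standard screening metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Derivation-phase test set counts as stated in the text
m = diagnostic_metrics(tp=102, fn=20, fp=22, tn=128)
for name, value in m.items():
    print(f"{name}: {value:.1%}")

# Pilot phase: 20 subjects flagged by the model, 15 of them confirmed PTB positive
pilot_ppv = 15 / 20
```

Rounded to whole percentages, these reproduce the 85% accuracy, 84% sensitivity, and 85% specificity quoted above, and the 75% PPV of the pilot phase.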
A total of 220 subjects participated in the validation phase, of whom 83 were found to be PTB positive and 137 PTB negative by standard diagnostic methods. Only one cough record was collected from each subject in this phase. We achieved an AUC (area under the ROC curve) of 0.94 (Fig. 5). The confusion matrix for the validation phase is shown in Table 4, where the rows represent the actual labels and the columns represent the predicted labels. For the validation phase, we achieved an accuracy of 86.82%, with 90.36% sensitivity and 84.67% specificity (Table 5).
Model output in the pilot phase. Pilot testing was done on a total of 65 subjects. Patients approaching the testing centre with symptoms of cough suggestive of possible pulmonary tuberculosis were assessed for eligibility. Each patient's demographic details and vitals were collected, and the patient was interviewed as per the SGRQ. This was followed by cough sound collection by trained health care personnel.
Among 65 subjects, the model was able to identify 20 subjects as having a likely presence of TB. Out of these 20 subjects, 15 truly turned out to be TB positive with a Positive predictive value (PPV) of 75%. The confusion matrix for pilot testing phase is listed in Table 6. The model obtained a high AUC score of 0.90. Figure 6 shows the ROC curve of the model's ability to distinguish between positive and negative samples.
Compared with the existing classical methods, screening of PTB patients by the model saved a significant amount of time. Additionally, it does not require trained professionals; the testing can be done by a community healthcare worker. The worker did not require any specific hardware or other consumables. A smartphone with an internet connection is enough to conduct the test.

Discussion
Several studies have been conducted in the past to use the information contained in the cough sound to detect and predict different disease outcomes such as asthma, pneumonia, COPD, bronchitis, and lung cancer 10,29-31. With the rise in COVID-19 cases, there has been a tremendous boost in the use of ML/DL frameworks to determine the presence of SARS-CoV-2 infection via cough sample analysis. This is because cough is one of the most prominent symptoms of diseases that primarily affect the respiratory system. Numerous studies have shown that cough analysis can accurately predict COVID-19 32,33. However, only a handful of clinical trials emphasise the association of cough with underlying pulmonary TB 12-14,20. Most [...] 13,14,20. We conducted the pilot screening on a comparatively large cohort, whereas previous studies were performed on a smaller scale. A pilot study conducted in Peru focused on analysing cough sounds to provide a foundation for larger-scale studies of coughing rates over time in TB patients undergoing treatment 20. A similar cough sound analysis study was undertaken in South Africa to automatically classify coughing sounds, which could be a viable low-cost and low-complexity screening method for PTB 13.
The approach of the current study differs from previously published work in the amount of data collected to build and train the model. Compared to the maximum AUC of 0.94 achieved in a similar study that used only 23 features and a smaller dataset, we used 170 features while training the model and achieved an AUC of 0.84 on a dataset comprising PTB and non-PTB, where non-PTB includes other important diseases like asthma, COPD, COVID-19, and pneumonia as well as healthy subjects 14. Having a greater number of latent features helps in distinguishing the signature better. Our model achieved an accuracy of 86.82%, with 90.36% sensitivity and 84.67% specificity, in the clinical validation phase. We conducted the pilot testing in a real primary care setting to test the accuracy of the tool. When deployed as a screening and triaging tool prior to molecular testing, the model proved statistically effective in prioritizing at-risk patients for confirmatory testing. In the pilot phase as well, the model achieved a positive predictive value of 75% in a clinical setup at a tertiary care hospital.
Considering the performance of the present diagnostic tests for PTB, our model's technical and clinical validation results are quite encouraging, given that the device is primarily intended to be used as a screening tool that helps prioritize and fast-track patients for the standard reference tests that confirm a diagnosis of PTB.
During our study, we observed that on average 10 to 12 patients are diagnosed every month with extensive PTB with severe parenchymal damage, respiratory failure, and poor lung function at a remote community health centre in India. Most of these patients belong to tribal areas. The delay in diagnosis is mainly due to lack of awareness and social inhibitions about reaching a doctor or a peripheral health worker. We believe that this model will help reduce the gap in accessibility for the populations that need it most.

Data availability
Due to the nature of this research, participants of this study did not agree for their data to be shared publicly. However, the detailed analysis can be shared by N.R.S upon reasonable request.