Automatic stridor detection using small training set via patch-wise few-shot learning for diagnosis of multiple system atrophy

Stridor is a rare but important non-motor symptom that can support the diagnosis and prediction of worse prognosis in multiple system atrophy. Recording sounds generated during sleep by video-polysomnography is recommended for detecting stridor, but the analysis is labor intensive and time consuming. A method for automatic stridor detection should be developed using technologies such as artificial intelligence (AI) or machine learning. However, the rarity of stridor hinders the collection of sufficient data from diverse patients. Therefore, an AI method with high diagnostic performance should be devised to address this limitation. We propose an AI method for detecting patients with stridor by combining audio splitting and reintegration with few-shot learning for diagnosis. We used video-polysomnography data from patients with stridor (19 patients with multiple system atrophy) and without stridor (28 patients with parkinsonism and 18 patients with sleep disorders). To the best of our knowledge, this is the first study to propose a method for stridor detection and attempt the validation of few-shot learning to process medical audio signals. Even with a small training set, a substantial improvement was achieved for stridor detection, confirming the clinical utility of our method compared with similar developments. The proposed method achieved a detection accuracy above 96% using data from only eight patients with stridor for training. Performance improvements of 4%–13% were achieved compared with a state-of-the-art AI baseline. Moreover, our method determined whether a patient had stridor and performed real-time localization of the corresponding audio patches, thus providing physicians with support for interpreting and efficiently employing the results of this method.


Scientific Reports | (2023) 13:10899 | https://doi.org/10.1038/s41598-023-37620-0
shaped audio signal with neither formants nor harmonics 5 . However, identifying features and analyzing audio data are labor intensive and time consuming. To date, deep learning has not been used for binary classification of snoring and stridor. Considering the rarity of MSA and the importance of stridor, an automatic method to detect stridor related to MSA should be developed using the few available samples. Recent advances in few-shot learning have enabled the development of AI models from only a few training samples, but few-shot learning has not been applied to audio processing in the medical field [6][7][8][9] . Few-shot learning allows newly available training data to be integrated during inference, thereby improving the diagnostic performance compared with other AI methods (Fig. 2). This learning strategy is particularly suited to classification applications with scarce training data, such as stridor data. The main contributions of this study are summarized as follows:
• We introduce a method to automatically diagnose stridor.
• We combine audio splitting and reintegration (SR) with few-shot learning in our method called patch-wise few-shot learning for sound detection (PFL-SD). This is the first method to incorporate these techniques into medical diagnosis based on audio signals, and we demonstrate its validity.
• Compared with existing AI methods, PFL-SD improves the diagnostic performance for stridor even with scarce training data (achieving a stridor detection accuracy above 95%) while identifying and localizing suspected stridor patches in the audio recordings of a patient (Fig. 3). Hence, physicians may better interpret the results of the method with low inspection effort.

Materials and methods
Ethical approval. All
Participants. Audio recordings from 65 participants were included in this study. Among the participants, 19 MSA patients had stridor and 46 patients had snoring without stridor. The demographics and clinical char-
Overview of proposed PFL-SD and existing AI audio-based diagnosis methods. In the conventional method (baseline), $\theta^*$ is correctly learned from the training data, but the network does not directly use training data for (post-training) diagnosis. In the proposed method, our network improves the performance by using training data even during diagnosis (post-training), that is, the distance between the training and inference samples is determined during classification. Hence, the proposed method improves its diagnostic performance with few training samples and correctly identifies stridor patches in an audio recording. The red box indicates that the baseline method does not directly use training data for (post-training) diagnosis. The green box indicates that the proposed method uses training data in the inference process (the training data are not evaluated additionally).
Then, SR is applied to the ROI to produce multiple audio patches for few-shot learning to diagnose stridor. By synthesizing the diagnosis results for all the patches, a diagnosis is inferred. The diagnosis results for each patch allow visualization to interpret the results by identifying the patches suspected of stridor or snoring.
Data preprocessing. We performed preprocessing of the audio recordings to extract regions of interest (i.e., regions in which stridor or snoring was pronounced) by using binary thresholds based on the audio volume. Subsequently, a new waveform was obtained by conserving only the waveform values above the threshold. Audio preprocessing is illustrated in the first stage of Fig. 3. We calculated the sound level (in decibels) of the entire waveform and set the threshold to 30% of this value. Preprocessing removed the patient's silence and environmental noise, which accounted for more than one-third of the original audio recording.
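The thresholding step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `extract_roi`, the use of the peak absolute amplitude as the reference level, and sample-wise masking are all hypothetical choices, since the text only states that the threshold was set to 30% of the sound level of the entire waveform.

```python
def extract_roi(waveform, ratio=0.3):
    """Keep only samples whose magnitude exceeds a fraction of the peak level.

    Assumption: the reference level is the peak absolute amplitude of the
    whole recording. The paper states only that the threshold is 30% of the
    overall sound level, so this reference is illustrative.
    """
    if not waveform:
        return []
    peak = max(abs(sample) for sample in waveform)
    threshold = ratio * peak
    # Conserve values above the threshold; silence and low-level noise are dropped.
    return [sample for sample in waveform if abs(sample) > threshold]


# Toy waveform: two loud samples surrounded by low-level values.
toy = [0.0, 0.1, 1.0, -0.9, 0.2]
roi = extract_roi(toy)
```

With the toy waveform, only the two loud samples survive, which mirrors how silence and environmental noise are removed before splitting.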
Data splitting for network training and testing. We considered the data of the 65 patients as pairs $(x_i, y_i)$ and drew from them the training set $D_{tr} = D^0_{tr} \cup D^1_{tr}$, where $D^0_{tr}$ and $D^1_{tr}$ denote the training sets for class 0 (snoring) and 1 (stridor), respectively, $x_i$ is the audio file of patient $i$ after preprocessing, and $y_i$ denotes the class label (if $y_i = 1$, patient $i$ has suspected stridor; if $y_i = 0$, patient $i$ belongs to the normal group with snoring but without stenosis symptoms). The network was trained using the training data, and the performance of the trained network was evaluated using the remaining data. $K$ training data points for each class were randomly selected ($K = 4, 6, 8$ in this study), and details of data splitting are listed in Table 2. For training and evaluation, Monte Carlo cross-validation 10 was applied to obtain the mean and standard deviation over $M$ trials ($M = 10$ in this study).
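The splitting protocol above can be sketched as follows. This is an illustrative sketch only: the helper names `monte_carlo_splits` and `summarize`, the fixed seed, and the use of the sample standard deviation are assumptions, as the text does not specify them.

```python
import random
import statistics


def monte_carlo_splits(labels, k_per_class, n_trials=10, seed=0):
    """Yield (train_idx, test_idx), drawing k_per_class patients per class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for _ in range(n_trials):
        train = []
        for members in by_class.values():
            # Random selection without replacement inside each class.
            train.extend(rng.sample(members, k_per_class))
        chosen = set(train)
        test = [i for i in range(len(labels)) if i not in chosen]
        yield sorted(train), test


def summarize(scores):
    """Mean and sample standard deviation over the trials (assumed convention)."""
    return statistics.mean(scores), statistics.stdev(scores)


# 46 snoring patients (class 0) and 19 stridor patients (class 1), K = 8.
labels = [0] * 46 + [1] * 19
splits = list(monte_carlo_splits(labels, k_per_class=8, n_trials=10))
```

Each trial trains on 16 patients and evaluates on the remaining 49; the per-trial scores would then be passed to `summarize`.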
Network training and testing. Owing to the nature of medical data, datasets are often insufficient for training AI models. To overcome this limitation, we devised a few-shot-learning-based patch segmentation [12][13][14] for audio classification (stridor detection in this study). A comparison of the proposed method and the baseline is illustrated in Fig. 4. To introduce the proposed method, we describe the baseline and our proposal below.
Baseline AI method. A general AI method for audio classification receives the complete preprocessed audio signal, $x_i$, as the input and learns to perform binary classification against the ground-truth label $y_i \in \{0, 1\}$ (training phase). After training, the network receives a complete preprocessed test audio signal, $x_{te}$, as the input and provides the binary classification prediction, $\hat{y} \in \{0, 1\}$ (inference). Training and inference are outlined in Fig. 4a. This approach, implemented with CNN14 15 , was considered the baseline in this study.
Training. Network $f_\theta$ takes input $x_i$ and provides a two-dimensional (2D) probability vector $f_\theta(x_i) \in \mathbb{R}^2$ as its output. The baseline training aims to minimize the following loss function (i.e., to determine the network parameter $\theta^*$ that minimizes the loss):
$$\theta^* = \arg\min_{\theta} \sum_{(x_i, y_i) \in D_{tr}} L_{cls}\big(f_\theta(x_i), y_i\big), \tag{1}$$
where $L_{cls}$ is a conventional classification loss (binary cross-entropy loss in this study). Through training, the network output probability vector, $f_\theta(x_i) \in \mathbb{R}^2$, learns to approach a one-hot vector whose entry at the index of the ground-truth label, $y_i \in \{0, 1\}$, is 1.
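Inference. For inference after training, the index of the largest value of the network output probability vector, $f_{\theta^*}(x_{te}) \in \mathbb{R}^2$, for the test audio signal, $x_{te}$, is taken as the predicted class $\hat{y}_{\theta^*}(x_{te}) \in \{0, 1\}$:
$$\hat{y}_{\theta^*}(x_{te}) = \arg\max_{c \in \{0, 1\}} \big[f_{\theta^*}(x_{te})\big]_c. \tag{2}$$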
Proposed AI method: PFL-SD. The proposed method is intended to achieve high diagnostic performance, even with few training samples, by applying patch-wise few-shot learning. Before describing training and inference, we explain the patch-wise audio splitting procedure. The preprocessed waveform audio signal $x$ is divided into $P$ sequential patches (i.e., the splitting process in SR):
$$x = \big(x^1, x^2, \dots, x^P\big).$$
Because the preprocessed audio signal $x$ has a length of at least 150 s, we split the signal from 0 s to 150 s into $P = 30$ patches of 5 s. Hence, we can detect stridor in each patch. In addition, when the diagnosis result is obtained by combining the patch results, patch-wise audio splitting improves the diagnostic performance owing to the increase in data diversity.
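The splitting process can be sketched as follows. The function name `split_into_patches` and the toy sampling rate are hypothetical and serve only to illustrate how a 150 s segment yields $P = 30$ patches of 5 s; samples beyond the first 150 s are ignored in this sketch.

```python
def split_into_patches(signal, sample_rate, n_patches=30, patch_seconds=5):
    """Split the first n_patches * patch_seconds of a signal into equal patches."""
    patch_len = int(sample_rate * patch_seconds)
    needed = patch_len * n_patches
    if len(signal) < needed:
        raise ValueError("signal is shorter than the required segment")
    # Patch p covers samples [p * patch_len, (p + 1) * patch_len).
    return [signal[p * patch_len:(p + 1) * patch_len] for p in range(n_patches)]


# Toy signal: 160 s sampled at 10 Hz, so the last 10 s are left unused.
toy_signal = list(range(160 * 10))
patches = split_into_patches(toy_signal, sample_rate=10)
```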
We perform training that uses each patch as a network input and generates the diagnosis results based on few-shot learning. For the set of $K$ training samples per class, half of the set (i.e., $\{1, 2, \dots, \bar{K}\}$ with $\bar{K} := K/2$) is defined as the support set, and the other half (i.e., $\{\bar{K} + 1, \dots, K\}$) is defined as the query set. The network is trained to minimize the average distance between the feature maps of the samples in the query and support sets. Specifically, the feature maps of the support set for class $c$ can be expressed as matrix $S_c$, which is the collection of feature maps of dimension $R \cdot L$ for the $P$ patches and $\bar{K}$ objects in the support set of class $c$, with dimension $\bar{K}P \times R \cdot L$:
$$S_c = \big[ z_\theta(x^p_{i,c}) \big]_{i = 1, \dots, \bar{K};\ p = 1, \dots, P} \in \mathbb{R}^{\bar{K}P \times R \cdot L}, \tag{4}$$
$$S_{c,p} = \big[ z_\theta(x^p_{i,c}) \big]_{i = 1, \dots, \bar{K}} \in \mathbb{R}^{\bar{K} \times R \cdot L}, \tag{5}$$
where matrix $S_c$ in Eq. (4) is the set of feature maps generated by taking the individual patches of the samples belonging to the support set of class $c$ as inputs, with one feature map per row, and $S_{c,p}$ in Eq. (5) is the subset of $S_c$ that includes only the feature maps of patch $p$. Here, $z_\theta(x) \in \mathbb{R}^{R \cdot L}$ is the feature map of target network $f_\theta(x) = g_\theta(z_\theta(x))$, with $R$ being the spatial resolution (i.e., height times width) and $L$ being the number of channels at an arbitrarily configurable layer in the network. We consider the feature map before the fully connected layers in the target network. In addition, $g_\theta$ is the set of the remaining layers, and $\theta$ is the network parameter for learning.
Given patch $p$, $x^p$, extracted from sample $x$ in the query set, the distance in the feature domain between patch $x^p$ and support set $S_c$ for class $c$ can be calculated as the solution $v_\theta(x^p; S_c)$ of a ridge regression on the feature map 16 (Eq. (6)). The constant $\gamma$ and a second constant control the scale and rank of the distance, respectively, and we set them according to the original setup 6 . A smaller value (distance) increases the probability that the target query patch, $x^p$, belongs to class $c$, as depicted in Fig. 4. Accordingly, whether target query patch $x^p$ belongs to class 0 (snoring) or 1 (stridor) can be expressed as a probability vector, $p_\theta(x^p; D_{tr})$, through distance comparison with the support set for each class (Eq. (7)).
Figure 4. Diagram of proposed PFL-SD compared with conventional method. (a) The existing AI method (baseline) receives the patient's entire audio recording, $x_{te}$, as the network input and provides the diagnosis result as probability vector $f_{\theta^*}(x_{te})$. (b) The proposed PFL-SD splits the audio recording, $x_{te}$, into $P$ patches, performs diagnosis based on few-shot learning per patch, and performs patient-level diagnosis by merging the diagnosis results (Eq. (10)). The diagnosis of individual patches is performed through distance (Eq. (6)) comparison between the target sample and the prototype representation of the training data support set per class 11 . As training data can be integrated as additional information even during inference, high diagnostic performance can be achieved even with few training samples available.
Training. The detailed training of the proposed PFL-SD is shown in Fig. 4b. By applying a conventional classification loss to the class prediction probability vector, $p_\theta(x^p; D_{tr})$, based on the distance between samples, the network is trained to obtain a one-hot vector that refers to ground-truth class $y_{i,c}$ (i.e., $c$) for each input query patch $x^p_{i,c}$:
$$\theta^* = \arg\min_{\theta} \sum_{c} \sum_{i} \sum_{p=1}^{P} L_{cls}\big(p_\theta(x^p_{i,c}; D_{tr}), y_{i,c}\big), \tag{8}$$
where the sums run over both classes, the query samples $i$ of each class, and the $P$ patches.
Inference. From the index corresponding to its maximum value, the learned probability vector, $p_{\theta^*}(x^p; D_{tr})$, provides a prediction of the class to which input patch $x^p$ belongs. Given the $P$ patches per patient, we integrate the $P$ diagnosis results (i.e., reintegration in SR) to obtain a patient-level diagnosis. As a straightforward approach, the prediction $\hat{y}_{\theta^*}(x_{te}; D_{tr}) \in \{0, 1\}$ is derived from the average probability vector $p_{\theta^*}(x_{te}; D_{tr})$ of the $P$ class prediction probability vectors acquired from the $P$ patches:
$$\hat{y}_{\theta^*}(x_{te}; D_{tr}) = \arg\max_{c \in \{0, 1\}} \big[p_{\theta^*}(x_{te}; D_{tr})\big]_c, \tag{9}$$
where
$$p_{\theta^*}(x_{te}; D_{tr}) = \frac{1}{P} \sum_{p=1}^{P} p_{\theta^*}(x^p_{te}; D_{tr}). \tag{10}$$
Figure 4b illustrates the final diagnosis and abnormal patch visualization of PFL-SD.
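The patch-level classification by distance comparison can be sketched as follows. The sketch assumes the closed-form ridge-regression reconstruction used in the feature map reconstruction approach of Wertheimer et al. 6 : the query features are reconstructed from the support matrix of each class, and the squared reconstruction error serves as the distance. The exact forms of Eqs. (6) and (7), the function names `reconstruction_distance` and `patch_probabilities`, the regularization weight `lam`, and the placement of `gamma` as a softmax scale are assumptions made for illustration, not the authors' definitions.

```python
import numpy as np


def reconstruction_distance(query, support, lam=0.1):
    """Mean squared error of reconstructing query rows from support rows.

    query:   (n_q, d) feature rows of one query patch
    support: (n_s, d) feature rows of the support set of one class
    Ridge regression: W = Q S^T (S S^T + lam * I)^(-1); reconstruction = W S.
    """
    gram = support @ support.T + lam * np.eye(support.shape[0])
    weights = np.linalg.solve(gram, support @ query.T).T  # shape (n_q, n_s)
    reconstruction = weights @ support
    return float(np.mean(np.sum((query - reconstruction) ** 2, axis=1)))


def patch_probabilities(query, supports, gamma=1.0, lam=0.1):
    """Softmax over negative scaled distances; index c is the class label."""
    distances = np.array(
        [reconstruction_distance(query, s, lam) for s in supports]
    )
    logits = -gamma * distances
    logits = logits - logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


# Synthetic features (hypothetical): class 0 around 0, class 1 around 3.
rng = np.random.default_rng(0)
snoring_support = rng.normal(0.0, 1.0, size=(4, 16))
stridor_support = rng.normal(3.0, 1.0, size=(4, 16))
query = stridor_support[:1] + 0.01 * rng.normal(size=(1, 16))
probs = patch_probabilities(query, [snoring_support, stridor_support])
```

Because the synthetic query lies close to the stridor support set, its reconstruction error for class 1 is much smaller, and the softmax assigns it to class 1. The smaller the distance to a class, the larger its probability, as stated in the text.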
Comparison between proposed method and baseline. The proposed method differs from the existing baseline in two main ways.
• When designing the class probability prediction vector, the baseline ($f_\theta(x_{te})$ in Eq. (2)) only considers whether target sample $x_{te}$ belongs to the corresponding class but not the correlation with other samples. Unlike the baseline, the proposed method can achieve a higher diagnostic performance because it considers information from other samples in addition to $x_{te}$. Hence, different from $f_\theta(x_{te})$ in Eq. (2) of the baseline, the class probability prediction vector $p_{\theta^*}(x_{te}; D_{tr})$ in Eq. (9) of the proposed method uses training samples $D_{tr}$ (i.e., support set $S_c$ of class $c$ in $D_{tr}$) during inference. Externally using training samples for inference prevents the overfitting caused by a small training set in the conventional method (i.e., Eqs. (1) and (2)). This is because the conventional method uses training samples exclusively to obtain the trained parameters. In this study, we applied few-shot learning to stridor detection and demonstrated its effectiveness.
• We improved the performance of few-shot learning by redesigning training and inference to be applied to each of the $P$ audio patches and then integrating the inference results into a patient-level diagnosis (Eq. (10)). In addition to the clinical implications of accurate stridor detection, the proposed patch-wise few-shot learning is an innovative approach.
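The reintegration of patch results into a patient-level diagnosis can be sketched as follows. The function name `patient_level_diagnosis` and the toy probabilities are hypothetical; the sketch averages the per-patch probability vectors, takes the arg max, and also lists the patches whose own prediction is stridor, which is the information used for visualization.

```python
def patient_level_diagnosis(patch_probs):
    """Average per-patch [p_snoring, p_stridor] vectors and take the arg max."""
    n_patches = len(patch_probs)
    mean = [sum(p[c] for p in patch_probs) / n_patches for c in range(2)]
    # On an exact tie, class 0 (snoring) is returned because it comes first.
    label = max(range(2), key=lambda c: mean[c])
    # Patches whose own arg max is class 1 (stridor) can be highlighted for review.
    flagged = [i for i, p in enumerate(patch_probs) if p[1] > p[0]]
    return label, mean, flagged


# Three toy patches: one snoring-like and two stridor-like.
toy_probs = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
label, mean, flagged = patient_level_diagnosis(toy_probs)
```

In this toy case the mean vector is approximately [0.4, 0.6], so the patient-level label is 1 (stridor), and patches 1 and 2 are flagged.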

Experimental settings and implementation details.
For the audio recordings of sounds (length between 724 s and 20,638 s) generated by each patient during sleep, we performed preprocessing as illustrated in stage 1 of Fig. 3. For each preprocessed audio signal, a segment of 150 s was extracted. This segment length is commonly used for the input in various classification networks. The proposed method split this segment into P patches (i.e., the patches were obtained at the patient level), received the individual patches as inputs, and aggregated the P results for the final diagnosis. We set P to 30, and thus each patch had a length of 150/30 = 5 s. We converted every signal into a Log-Mel spectrogram and used the resulting 2D representation as the input, a strategy commonly adopted instead of feeding the raw sound signal 15,[17][18][19] . For existing AI methods, the Log-Mel spectrogram 2D representation of each 150 s signal was obtained and used as input. For the proposed AI method, P Log-Mel spectrogram 2D representations of the 5 s patches were obtained from each 150 s signal and used as inputs.
The baseline was trained as follows. We used CNN14 15 as the backbone. This network consisted of six convolution blocks, with each block comprising two convolutional layers with a kernel size of 2 × 2 . After the last block, global average pooling 20 was applied to extract the 2D feature map of each channel. We used the same implementation reported by Song et al. 21 . For training, we set the minibatch size to 8 and the number of epochs to 500, and we used the binary cross-entropy loss with an initial learning rate of 0.01 and adaptive moment estimation (Adam) 22 for optimization. We also applied transfer learning 23 to set the initial network parameters to those pre-trained on the AudioSet dataset 24 . We set the audio sampling rate to 22,050 Hz, window size to 2048, hop size to 512, and window type to Hann 15 .
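As a worked example of these settings, the time resolution of the spectrogram input can be estimated as follows. The sketch assumes the centered-frame convention used by common short-time Fourier transform implementations, in which the frame count is 1 + floor(samples / hop); the helper name `n_frames` is hypothetical, and applying the same parameters to the 5 s patches is an assumption, because the text lists them for the baseline.

```python
def n_frames(duration_s, sample_rate=22050, hop=512):
    """Frame count of a centered (padded) STFT: 1 + floor(n_samples / hop)."""
    n_samples = int(duration_s * sample_rate)
    return 1 + n_samples // hop


frames_segment = n_frames(150)  # full 150 s segment
frames_patch = n_frames(5)      # one 5 s patch
print(frames_segment, frames_patch)  # prints "6460 216"
```

Under these assumptions, a 150 s segment spans 6460 frames, whereas a 5 s patch spans 216 frames, so each patch input is much shorter along the time axis than the baseline input.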
The proposed method was trained as follows. We used ResNet12 6-9 as the backbone. It consisted of four residual blocks, with each residual block comprising three convolutional layers with a kernel size of 3 × 3 , and a 2 × 2 max-pooling layer was applied after the first three blocks. We used the same implementation for ResNet12 reported by Wertheimer et al. 6 . The proposed PFL-SD was trained with a minibatch size of 8, number of epochs of 500, binary cross-entropy loss with an initial learning rate of 0.01, and stochastic gradient descent 25 for optimization. We also applied transfer learning to use the pre-trained parameters from the mini-ImageNet dataset 26

Results
Evaluation measures of classification performance. To evaluate the stridor detection performance of the proposed PFL-SD (i.e., binary classification of snoring or stridor), we used the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, accuracy, sensitivity, specificity, precision, and F1 score. In the ROC curve, we selected the decision threshold as 0.5, which is the most commonly used, and calculated the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) rates based on that threshold. Then, the accuracy, sensitivity, specificity, precision, and F1 score were calculated as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \text{Sensitivity} = \frac{TP}{TP + FN}, \quad \text{Specificity} = \frac{TN}{TN + FP},$$
$$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}.$$
Performance of existing and proposed diagnosis methods. The proposed PFL-SD comprises audio SR and diagnosis based on few-shot learning. In addition, the method has versions 1 and 2 for few-shot learning without and with SR, respectively. The diagnostic performance of the proposed and existing methods was compared. Note that existing methods lack SR and few-shot learning. We evaluated seven methods with different backbones: ResNet18 27 , ResNet50 27 , DenseNet201 28 , MobileNetV2 29 , VGG16 30 , YAMNet 31 , and CNN14 15 . Monte Carlo cross-validation with 10 trials was applied in network training and testing to evaluate the diagnostic performance, randomly selecting the training samples from the dataset and using the remaining samples as the evaluation dataset in each trial. Four, six, and eight training samples were considered per class. The existing and proposed methods used the same Log-Mel spectrogram as input for a fair comparison. The experimental results from the performance evaluation are listed in Table 3. We confirmed that the best-performing existing method was the one based on CNN14, which was thus designated as the baseline. Version 1 of the proposed method outperformed the baseline, while version 2 outperformed version 1. These results appeared consistently for different numbers of training samples. 
Therefore, the proposed method (version 2) achieved superior stridor diagnostic performance compared with existing methods. In addition, the validity of combining SR and few-shot learning was verified because version 2 outperformed version 1 that lacked SR.
To further compare the performance of the proposed and existing methods, we investigated various performance indicators in addition to accuracy. The results are presented in Fig. 5 (AUC) and Table 5 (AUC with 95% CI). Table 4 shows the accuracy, sensitivity, and specificity of the methods. In most cases, the proposed technique showed a higher sensitivity and specificity than the baseline. In addition, the performance improvement was confirmed by the F1 score, indicating that the superiority of the proposed method was not an artifact of class imbalance. Figure 5 and Table 5 show the receiver operating characteristic curves and AUC (with 95% CI) of the proposed method and baseline, respectively. The proposed method consistently improved the performance for all the decision thresholds compared with the baseline, indicating its lack of bias regardless of the threshold. Table 6 shows the confusion matrices from which the results for Tables 3 and 4 were derived. The matrices supported the reliability of the results.
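The threshold-based measures discussed above can be computed from a confusion matrix as in the following sketch. The function name `classification_metrics` and the counts in the example are hypothetical and are not taken from Table 6; the sketch also omits guards against empty classes, which would cause a division by zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # recall on the stridor class
    specificity = tn / (tn + fp)  # recall on the snoring class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for a test set of 11 stridor and 38 snoring patients.
metrics = classification_metrics(tp=10, tn=36, fp=2, fn=1)
```

With an imbalanced test set such as this one, accuracy is dominated by the larger snoring class, which is why sensitivity, specificity, and the F1 score are reported alongside it.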

Visualization of diagnosis results using proposed method. The proposed PFL-SD classifies an audio signal as a stridor or snoring case while allowing the patches in the input audio signal to be visualized to evaluate the diagnosis result. Figure 6 illustrates the four main processes of the proposed PFL-SD in sequence, providing the visualization results and verifying the visualization capability. The proposed method extracts audio patches during preprocessing and performs separate diagnoses. These results are merged to provide patient-level diagnosis, and the individual diagnosis results can be displayed on the original audio source to visualize the patches containing classification information of stridor or snoring. This visualization strategy may allow physicians to focus on patches containing stridor information (purple patches in Fig. 6) for accurate diagnosis, likely shortening the time required for stridor confirmation.

Discussion
We developed, for the first time, an AI method for binary classification to detect stridor, which is an important non-motor symptom in MSA. Considering the rarity of MSA and stridor, developing an AI method with high diagnostic performance is challenging, but we obtained high performance by applying few-shot learning. Even with few training samples, the proposed method achieved a detection accuracy above 95%. In addition to stridor detection, the proposed method could locate the audio patches showing stridor in real time, thus providing physicians with additional assistance for interpreting and employing the diagnosis results.
MSA is a rare neurodegenerative disease with a prevalence of 3.4-4.9 cases per 100,000 persons 3 . The prevalence of stridor in MSA varies from 12% to 42% depending on the disease stage 32 . Stridor is an important non-motor symptom of MSA that facilitates both diagnosis and prognosis. Stridor is a supportive non-motor symptom of clinically established MSA and a distinctive feature of MSA mimicking 1 . In addition, early presence of stridor is an independent predictor of shorter survival 4 . Continuous positive airway pressure or tracheostomy is recommended for managing stridor, with the latter possibly improving patient survival 3 . Therefore, the timely detection and management of stridor are crucial for patients with MSA, but there is no gold standard for stridor diagnosis. Moreover, nighttime monitoring is required because most patients show stridor during sleep, and they may be unaware of its occurrence. Although nighttime monitoring allows the detection of abnormal breathing, stridor detection should be confirmed on VPSG because stridor is often confused with ordinary snoring, which is also common in MSA. Consequently, stridor detection is labor intensive and time consuming for a physician, who should manually review acoustic data recorded from VPSG. Therefore, an automatic method to detect stridor and differentiate it from ordinary snoring should be developed. The proposed automatic AI method for stridor detection can achieve a high accuracy of 96.1%.  33 reported the recording of breathing sounds using a smartphone to support diagnosis. They found that recording breathing sounds may help physicians by enabling early stridor detection and MSA diagnosis. In addition, 5.2% of MSA patients have exhibited stridor 4 , which can be used for screening or detecting prodromal symptoms of MSA. 
Although a very low positive rate is expected considering the low prevalence of MSA and stridor, automatic stridor detection using our method may have a low cost and negligible labor burden. To date, no method for automatic stridor diagnosis has been developed, rendering the physician's examination indispensable. Thus, the patient with MSA should visit a hospital for suspected stridor, and VPSG is conducted. The physician then manually reviews the audio recordings and determines the presence of stridor. Consequently, this approach is only applied when the patient visits the hospital after MSA diagnosis and when stridor is suspected. This hinders detection when an individual is not diagnosed with MSA or has no suspicion of stridor, likely missing timely diagnosis and proper early treatment. We have shown the feasibility of developing a method for automatic and accurate stridor diagnosis. Hence, stridor can be diagnosed even outside hospital settings, and an individual can visit the hospital and undergo a detailed examination if stridor is diagnosed. As a result, deterioration caused by this neurodegenerative disease may be mitigated at an early stage.
Medical audio classification using AI and few training samples has previously been demonstrated 21,[34][35][36][37][38] . However, our method is the first one to use few training samples for accurate classification of snoring and stridor using AI. We achieve high diagnostic performance by applying few-shot learning to binary classification using audio recordings in the medical field. Existing AI solutions 21,[34][35][36][37][38] have performed audio classification without considering correlations between samples during inference, like in the baseline scheme illustrated in Fig. 2, which simply outputs a probability vector for the classes of an input sample while neglecting the similarity between samples. Conventional classification without sample correlation tends to perform poorly when few training samples are available 39 . Given scarce training data, overfitting on the decision boundary of the classifier can occur. Externalizing the decision boundary rather than learning it internally through few-shot learning allows inferring the class of a sample by calculating the distance between its location and the class location obtained from the training samples. Hence, the learning efficiency and inference performance can be enhanced with only few training samples. The performance improvement is due to the information on the input test sample (conventional method) being used along with information from all the training samples for classification. Accordingly, we apply few-shot learning for the first time to stridor detection to support MSA diagnosis (Fig. 2) and demonstrate improved diagnostic performance over other AI methods using few training samples.
Although our proposed method has shown promising results, it is a prototype for binary classification of snoring and stridor from small data using few-shot learning, and several limitations should be addressed in future work. First, the number of stridor patients in our study was limited. We used 65 patients (19 with stridor and 46 without stridor) as the dataset, and although we tried to collect snoring and stridor patients at similar rates, only 29% of the 65 patients were stridor patients owing to the rarity of MSA and stridor 3 . Therefore, the proportion of snoring patients (71%) was higher than that of stridor patients (29%) (i.e., the ratio of snoring to stridor was about 7 to 3), resulting in slightly higher specificity (although we could have obtained high sensitivity by adjusting the threshold, we kept the default threshold of 0.5). However, since the proportion of snoring patients in the actual diagnosis (i.e., breath sound test through polysomnography) of MSA patients will be more than about 70% 40 , we expect that models with high specificity will increase the diagnostic success rate of MSA patients (i.e., with snoring and stridor, which are common sleep breathing problems). Second, since the purpose of our study was to develop a tool to distinguish between snoring and stridor, we could not evaluate other types of audio recordings (i.e., data other than snoring and stridor). Future studies should investigate the applicability of our proposed method to problems such as the classification of stridor and other abnormal breathing sounds (e.g., wheeze and crackle) by collecting various audio recording data. If a large amount of data is collected, a method suited to classifying a large dataset (e.g., contrastive learning [41][42][43][44] ) should additionally be considered.
Although our study has some limitations, we apply audio SR to few-shot learning for the first time. SR splits a sleep test audio recording into multiple patches and classifies them. The classification across patches is then merged to obtain patient-level diagnosis, as illustrated in Fig. 3. Accordingly, through SR, we obtain multiple diagnosis results from a patient and achieve superior performance owing to the audio patch diversity. In addition, SR improves diagnostic performance and helps physicians interpret the AI results by providing diagnosis results for patches with suspected stridor in the audio source. The patches containing stridor information can be visualized, as reported in the Results section.

Conclusion
We implemented automatic stridor detection using few-shot learning and patch splitting for audio processing in the medical field using an AI method. The proposed PFL-SD showed high-performance stridor detection, an important MSA indicator, even with fewer training samples than those required for conventional AI methods. The proposed PFL-SD merged diagnosis results from multiple patches extracted from an audio signal. The obtained patient-level result improved the diagnostic performance, and the patch-level results enabled the visualization of stridor-suspected patches for confirmation by a physician. A patient shows short stridor periods in an audio recording from a sleep test despite the diagnosis being positive. Until now, a physician had to analyze the entire audio recording to discover periods with suspected stridor, resulting in a costly and burdensome evaluation. The proposed method may allow physicians to confirm stridor by simply listening to the patches with positive stridor diagnosis, considerably accelerating the diagnosis confirmation. Although the proposed PFL-SD was only validated for stridor diagnosis on sleep test data, we expect to extend the method to various sound applications in the medical field, especially those with challenging data collection, to demonstrate its superiority and clinical utility in future work.

Supplementary Information
Please download Supplementary Files S3 and S4 and play them in a video player (the legends for Supplementary Files S3 and S4 are shown in Supplementary Figure S5). Supplementary Files S3 and S4 were created by Ju Hwan Lee using the OpenCV (version 3.4.2) library 45 in Python (version 3.7.9).

Data availability
The main data supporting the results of this study are reported within the paper. The raw datasets from Samsung Medical Center are protected to preserve patient privacy, but they can be made available upon reasonable request if approval is obtained from the corresponding Institutional Review Board. For the request, please contact Jin Whan Cho at jinwhan.cho@samsung.com.