Introduction

Obstructive sleep apnea syndrome (OSAS) is a highly prevalent sleep disorder. Globally, nearly 1 billion adults aged 30 to 69 years are estimated to have mild to severe OSA [1]. OSAS is a known risk factor for hypertension and other cardiovascular diseases, and it is also associated with reduced quality of life and cognitive disorders [2,3,4]. Active management and treatment are therefore required. Nonetheless, because awareness of the condition is low, patients often do not realize that they have OSAS, or even that they have symptoms of it [5].

Since the severity of OSAS is estimated using the apnea-hypopnea index (AHI), polysomnography (PSG) is considered the traditional gold standard for diagnosing OSAS [6,7]. However, PSG requires an overnight stay in a sleep laboratory with dedicated personnel and equipment, which limits its efficiency. PSG also requires various skin-contact sensors, which may disturb the subject's sleep. Other diagnostic methods, such as the home sleep apnea test [8] and cardiopulmonary monitoring [9,10], have been tried, but they still require at least one overnight recording and dedicated testing equipment. As the number of patients with suspected OSAS increases, so does the need for a simpler method that offsets the shortcomings of existing sleep tests.

As artificial intelligence grows rapidly and spreads throughout modern society, applications of related technologies have emerged in diverse fields. Machine learning, a core branch of artificial intelligence, excels at recognizing and classifying complex patterns in massive data. This makes it well suited to healthcare data, which are complex, heavy, and voluminous [11], and machine learning techniques are increasingly applied in the medical and healthcare fields [12]. Because the variables affecting the morbidity of OSAS and their correlations are complex, machine learning techniques are likely to be appropriate for building prediction models.

Since OSAS is a disease with complex and diverse contributing factors, many studies have sought to phenotype it. Clustering, an unsupervised branch of machine learning, is widely used for phenotyping OSAS [13,14,15] because it is suitable for multidimensional data without labels. Building on this, the present study attempts to obtain better classification performance by performing clustering before classification.

This study aims to present models that predict the severity of OSAS without PSG, using a range of supervised and unsupervised machine learning algorithms. Since the data are high-dimensional, we attempt to reduce computational complexity and increase performance by applying feature selection and clustering before classification [16]. The experiments use a variety of methods, from standard machine learning techniques to methods suggested by medical studies. Accuracy is calculated by comparison with the AHI measured by actual PSG, and we use it to compare the utility of the models across OSAS severity levels.

Methods

Data acquisition and ethics declarations

The data were collected from patients who visited the sleep clinic of Samsung Medical Center between 2014 and 2021. The data include personal information such as gender, age, height, and weight, as well as physical measurements (abdominal circumference, neck circumference, hip circumference, etc.) and results of self-report questionnaires (Epworth Sleepiness Scale (ESS), Insomnia Severity Index (ISI), etc.). PSG was performed with an Embla N7000 (Medcare-Embla, Reykjavik, Iceland), and the results from the machine's automated scoring system were used to determine OSAS. AHI was measured as the number of episodes of apnea and hypopnea per hour. PSG features were also collected. The workflow of the predictive models is shown in Fig. 1.
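As an illustration of how AHI and the severity classes relate, the following minimal Python sketch computes AHI as events per hour and maps it to four classes using the cut-offs 5, 15, and 30 that appear later in this paper. The function names, the "normal" label, and the use of sleep hours as the denominator are assumptions made for the example, not details taken from the scoring system.

```python
def apnea_hypopnea_index(n_apneas: int, n_hypopneas: int, sleep_hours: float) -> float:
    """Return the number of apnea and hypopnea episodes per hour."""
    if sleep_hours <= 0:
        raise ValueError("sleep_hours must be positive")
    return (n_apneas + n_hypopneas) / sleep_hours


def severity_class(ahi: float) -> str:
    """Map an AHI value to one of four classes with cut-offs at 5, 15 and 30."""
    if ahi < 5:
        return "normal"  # label chosen for this sketch
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"


# Hypothetical recording: 40 apneas and 80 hypopneas over 6 hours
example_ahi = apnea_hypopnea_index(n_apneas=40, n_hypopneas=80, sleep_hours=6.0)
print(example_ahi, severity_class(example_ahi))
```

The binary tasks used in the Results (AHI \(\ge \) 5, 15, or 30) correspond to thresholding the same value at a single cut-off.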

The open-source programming language Python (version 3.9.9; Python Software Foundation, Delaware, USA) was used for all processes of the study. The SciPy package [17] (version 1.8.1) was mainly used for statistical analysis, and the scikit-learn library [18] (version 1.1.2) was mainly used to develop the predictive models. The study protocol was approved by the institutional review board of Samsung Medical Center (IRB no. 2022-07-003), and the entire study was performed in accordance with the ethical standards of the Declaration of Helsinki. The institutional review board of Samsung Medical Center approved a waiver of informed consent because this is a retrospective study that involves only anonymous patient data.

Data pre-processing

The processed data consist of 4014 samples described by 33 numerical or categorical features. The main characteristics of the dataset are shown in Table 1. The OSAS severity of the dataset was classified into 4 classes corresponding to the severity levels defined by the American Academy of Sleep Medicine Task Force [19]. For classification, 20% of the dataset was held out as test data, and each classifier was trained with 5-fold cross-validation on the training data. Numerical input features were tested for normality with the Kolmogorov-Smirnov test; normally distributed features were then analyzed with Student's t-test and the remaining features with the Mann-Whitney U test. Categorical features were analyzed with the chi-square test. A p-value of less than 0.05 was considered significant.
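The test-selection logic described above can be sketched with SciPy as follows. This is a minimal example under stated assumptions: the data are synthetic, the helper names are hypothetical, the normality check compares each sample with a normal distribution that has the sample's own mean and standard deviation, and the Mann-Whitney U test is run two-sided.

```python
import numpy as np
from scipy import stats


def looks_normal(values, alpha=0.05):
    """Kolmogorov-Smirnov check against a normal with the sample's mean and SD."""
    values = np.asarray(values, dtype=float)
    result = stats.kstest(values, "norm", args=(values.mean(), values.std(ddof=1)))
    return result.pvalue >= alpha


def compare_numeric(group_a, group_b, alpha=0.05):
    """Use Student's t-test if both groups look normal, else Mann-Whitney U."""
    if looks_normal(group_a, alpha) and looks_normal(group_b, alpha):
        return "t-test", stats.ttest_ind(group_a, group_b).pvalue
    result = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
    return "mann-whitney", result.pvalue


def compare_categorical(contingency_table):
    """Chi-square test of independence on a table of counts."""
    _, p_value, _, _ = stats.chi2_contingency(contingency_table)
    return p_value


# Synthetic, clearly non-normal samples (not study data)
rng = np.random.default_rng(0)
skewed_a = rng.exponential(1.0, size=500)
skewed_b = rng.exponential(3.0, size=500)
test_name, p_numeric = compare_numeric(skewed_a, skewed_b)

# Synthetic 2x2 table of counts (not study data)
p_categorical = compare_categorical([[30, 10], [10, 30]])
print(test_name, p_numeric < 0.05, p_categorical < 0.05)
```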

Figure 1 The workflow of the predictive models.

Table 1 The report of statistical analysis. Data are reported as median [interquartile range] or number (percentage). BMI: Body Mass Index, ESS: Epworth Sleepiness Scale, ISI: Insomnia Severity Index, K-BDI-II: Korean-Beck Depression Inventory-II, PSQI: Pittsburgh Sleep Quality Index, SSS: Stanford Sleepiness Scale.

Clustering

A combination of mutual information (MI) and recursive feature elimination (RFE) [20] on LightGBM was applied as the feature selection method for clustering. MI is a metric that indicates the interdependence between two variables, and RFE is a feature selection method that starts with all input features and removes the less important features one by one as training is repeated. In the feature selection process, MI was first computed to filter out less informative variables, with the filtering threshold set to the mean of the MI scores. RFE was then applied to determine the final number of features for clustering.
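The two-stage selection can be sketched with scikit-learn as below. The sketch makes several assumptions: the data are synthetic, scikit-learn's GradientBoostingClassifier stands in for LightGBM so that the example depends on scikit-learn only, and features scoring at or above the mean MI are kept.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV, mutual_info_classif

# Synthetic stand-in for the clinical features and a binary AHI cut-off label
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, n_redundant=2, random_state=0
)

# Stage 1: MI filter, keeping features whose score reaches the mean MI score
mi_scores = mutual_info_classif(X, y, random_state=0)
keep = np.flatnonzero(mi_scores >= mi_scores.mean())
X_filtered = X[:, keep]

# Stage 2: RFE with 5-fold cross-validation to choose the number of features
estimator = GradientBoostingClassifier(n_estimators=30, random_state=0)  # LightGBM stand-in
selector = RFECV(estimator, step=1, cv=5, scoring="accuracy")
selector.fit(X_filtered, y)

# Map the RFE mask back to the original column indices
selected = keep[selector.support_]
print("kept after MI:", keep.tolist())
print("kept after RFE:", selected.tolist())
```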

The clustering algorithms used were hierarchical agglomerative clustering, K-means, bisecting K-means, and the Gaussian mixture model. Algorithms that determine the number of clusters automatically all produced a large number of clusters, which did not fit the purpose of clustering in this study. Therefore, only algorithms for which the number of clusters is assigned manually were used.

Hierarchical clustering is a common clustering approach that builds nested clusters by successively merging or splitting them. Agglomerative clustering is the bottom-up variant of hierarchical clustering: each point starts as its own cluster, and similar clusters are successively merged.

K-means is the most popular clustering algorithm [21] and is known for its simplicity. To find K clusters, K points are first selected as the initial centroids. Every point is then assigned to its nearest centroid, and the centroid of each cluster is recomputed. These steps are repeated until the centroids no longer change. Bisecting K-means is a variant of the K-means algorithm [22]. It uses basic K-means to split a cluster into 2 sub-clusters (the bisecting step), repeats this step, and keeps the split that produces the clustering with the highest overall similarity.
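The K-means steps above can be written out directly. The following NumPy sketch is illustrative only: the function name, the random initialization, and the handling of empty clusters are choices made for the example, and the models in this study were built with scikit-learn rather than with this code.

```python
import numpy as np


def k_means(points, k, n_iter=100, seed=0):
    """Basic K-means: assign points to the nearest centroid, then update centroids."""
    rng = np.random.default_rng(seed)
    # Select k distinct points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each centroid; an empty cluster keeps its previous centroid
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids remain unchanged
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Two well-separated synthetic groups of 50 points each
rng = np.random.default_rng(1)
data = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 2)),
    rng.normal(10.0, 1.0, size=(50, 2)),
])
labels, centroids = k_means(data, k=2)
print(np.bincount(labels))
```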

A Gaussian mixture model (GMM) is a probabilistic model which assumes that the data in every subgroup follow a Gaussian distribution [23].

Feature engineering

Feature engineering combined methods proposed in medical research with techniques widely used in machine learning. Weighted ESS and a formula for predicting AHI were used as the medical approaches, and body proportion features were added by processing the body measurement data in the dataset.

Weighted ESS assigns a different weight to each ESS question. A recent study has shown that weighted ESS predicts the severity of OSAS better than the standard ESS [24]. Since our dataset includes the response to each ESS item, weighted ESS could be applied.

The predictive formula for AHI used in this work is \(\textrm{AHI}_{\textrm{pred}} = \textrm{NC} \times 0.84 + \textrm{EDS} \times 7.78 + \textrm{BMI} \times 0.91 - [8.2 \times \textrm{gender constant} \ (1 \textrm{ or } 2) + 37]\) [25]. We modified the constants using the SciPy package to optimize the formula for our dataset. Since the dataset contains two measurements of neck circumference (NC), taken in sitting and lying positions, the formula was optimized for each measurement separately. In addition, three different criteria were used for determining excessive daytime sleepiness (EDS): the criteria for weighted ESS, the criteria from the American Academy of Sleep Medicine Task Force, and the criteria from the study that proposed the predictive formula.
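For illustration, the formula and one possible way of re-estimating its constants are sketched below. The text states only that SciPy was used, so the choice of scipy.optimize.curve_fit is an assumption, as are the function name, the coding of EDS as 0 or 1, and the synthetic data with hypothetical "true" constants.

```python
import numpy as np
from scipy.optimize import curve_fit


def ahi_pred(features, a=0.84, b=7.78, c=0.91, d=8.2, e=37.0):
    """AHI prediction formula; the default constants are the published ones."""
    nc, eds, bmi, gender_constant = features
    return nc * a + eds * b + bmi * c - (d * gender_constant + e)


# Published constants applied to one hypothetical subject
published_example = ahi_pred((40.0, 1.0, 30.0, 1.0))
print(round(published_example, 2))

# Synthetic cohort generated from hypothetical constants (not study data)
rng = np.random.default_rng(0)
n = 500
features = np.vstack([
    rng.normal(38.0, 3.0, n),             # neck circumference
    rng.integers(0, 2, n).astype(float),  # EDS coded 0 or 1 (assumption)
    rng.normal(27.0, 4.0, n),             # BMI
    rng.integers(1, 3, n).astype(float),  # gender constant, 1 or 2
])
true_constants = (0.9, 6.5, 1.1, 7.0, 40.0)  # hypothetical values
target = ahi_pred(features, *true_constants) + rng.normal(0.0, 0.1, n)

# Re-estimate the five constants, starting from the published values
fitted, _ = curve_fit(ahi_pred, features, target, p0=[0.84, 7.78, 0.91, 8.2, 37.0])
print(np.round(fitted, 2))
```

Because the formula is linear in its constants, an ordinary least-squares fit would give the same estimates; curve_fit is used here only because it accepts the formula as written.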

Predictive models

Gradient boosting-based models and random forest are considered among the most effective machine learning models for dealing with large amounts of complex data, and they have been shown to be both accurate and efficient [26,27]. Therefore, in this work, we used random forest and three gradient boosting-based models, XGBoost, LightGBM, and CatBoost, to enhance classification performance efficiently.

Random forest is a classifier consisting of a combination of decision trees built on random sub-samples of the dataset [28]. Since the classifier is composed of decorrelated decision trees, it is resistant to noise and over-fitting.

XGBoost is a gradient boosting-based decision tree ensemble designed to be highly efficient and scalable [29]. Since the model performs parallel computation automatically, it is faster than the general gradient boosting framework. XGBoost also lowers the risk of over-fitting by applying different regularization penalties.

LightGBM is a gradient boosting framework designed to be fast and highly efficient [30]. When the data are large and high-dimensional, traditional gradient boosting-based models must scan all data instances for each feature to estimate the information gain of every possible split point, which is excessively time-consuming and inefficient. LightGBM addresses this problem with Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), which reduce the number of samples and the number of features in the dataset, respectively.

CatBoost is a gradient boosting algorithm on decision trees that introduces two techniques: a new method for processing categorical features, and ordered boosting, a permutation-driven alternative to the classic gradient boosting algorithm [31]. Both techniques were created to counter the prediction shift caused by target leakage, which is present in other implementations of gradient boosting.

Hyperparameter optimization is among the most cumbersome parts of a machine learning project, and diverse optimization techniques are used to simplify it. In this work, we selected Bayesian optimization, one of the most commonly used methods for hyperparameter tuning. The hyperparameters to be optimized were selected considering the characteristics of both the dataset and the classifier. The selected hyperparameters of each model were optimized with a Bayesian optimization-based technique using Optuna [32].

Results

Clustering results

Figure 2 Mutual information (MI) scores for all input features. Each threshold was set as the mean of the mutual information score.

Figure 3 Visualised 5-fold cross-validation results of recursive feature elimination (RFE).

Various feature scaling methods were applied to the numerical features of the dataset, and MI-LightGBM-RFE was used for feature selection. First, MI scores for each AHI cut-off value were computed for all input features to filter out less informative variables. The computed MI scores are shown in Fig. 2. Less important features were then eliminated with the LightGBM-RFE method, and the number of features was determined by 5-fold cross-validation. The cross-validation result of LightGBM-RFE is shown in Fig. 3. For mild OSAS (AHI \(\ge \) 5), hip circumference, head circumference, age, neck circumference (sitting position), weight, BMI, and abdominal circumference were selected as clustering features. For moderate OSAS (AHI \(\ge \) 15), age, abdominal circumference, PSQI total score, BMI, weight, hip circumference, SSS total score, head circumference, and height were selected. For severe OSAS (AHI \(\ge \) 30), sex, hours of sleep, abdominal circumference, weight, hip circumference, SSS total score, head circumference, and height were selected.

All of the selected clustering algorithms were applied to the datasets of scaled and selected features. The clustering results that gave the best classification accuracy on the test dataset were selected for the final prediction models. Hierarchical agglomerative clustering recorded the best classification accuracy for mild OSAS (AHI \(\ge \) 5), GMM exhibited the highest classification accuracy for moderate OSAS (AHI \(\ge \) 15), and K-means showed the best performance for severe OSAS (AHI \(\ge \) 30). The number of clusters was determined using the elbow method based on the silhouette score, and it was 2 for all AHI cut-off values.
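A minimal sketch of choosing the number of clusters from the silhouette score is given below. It simplifies the procedure by taking the candidate with the highest score rather than reading an elbow from a plot, uses synthetic two-group data, and omits bisecting K-means for brevity; these choices are assumptions made for the example.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic data with two well-separated groups (not study data)
X, _ = make_blobs(n_samples=300, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=0)
X = StandardScaler().fit_transform(X)


def cluster_labels(name, k):
    """Fit one clustering algorithm with a manually assigned number of clusters."""
    if name == "agglomerative":
        return AgglomerativeClustering(n_clusters=k).fit_predict(X)
    if name == "kmeans":
        return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return GaussianMixture(n_components=k, random_state=0).fit_predict(X)


# Silhouette score for each algorithm and each candidate number of clusters
scores = {
    name: {k: silhouette_score(X, cluster_labels(name, k)) for k in range(2, 7)}
    for name in ("agglomerative", "kmeans", "gmm")
}

# Simplification: take the candidate with the highest silhouette score
best_k = {name: max(by_k, key=by_k.get) for name, by_k in scores.items()}
print(best_k)
```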

Classification results by machine learning models and feature engineering methods

Figure 4 Comparisons of classification accuracy by machine learning classification algorithms.

In the classification accuracy analysis, CatBoost performed best for mild OSAS with 87.52%. LightGBM performed best for moderate and severe OSAS, achieving 86.01% and 91.11%, respectively. Figure 4 shows the classification accuracy of each classification algorithm. Overall, LightGBM performed best, leading in two of the three severity classes. Random forest showed the lowest performance in all severity classes, with significant differences from the other machine learning models.

Figure 5 Comparisons of classification accuracy by feature engineering methods. The accuracy of the best performing feature engineering methods and the accuracy of those without the applied feature engineering methods were compared. APNLB: AHI prediction computed using NC in a lying position with EDS criteria from the work of Bouloukaki et al., APNLG: AHI prediction computed using NC in a lying position with general EDS criteria, BMR: Body measurement ratio, WESS: Weighted ESS.

We applied diverse feature engineering methods to the dataset, and each of them was trained and evaluated. For mild OSAS, both AHI prediction with neck circumference in a lying position alone and the same method combined with the body measurement ratio gave the best accuracy of 87.48%. For moderate OSAS, both weighted ESS alone and weighted ESS combined with the body measurement ratio gave the best accuracy of 84.41%. For severe OSAS, the best-performing feature engineering methods were similar to those for mild OSAS, and the best accuracy was 88.13%. Figure 5 shows the classification accuracy for each feature engineering method.

Classification results by approaches building prediction models

Table 2 The report of classification metrics of predictive models by approaches. Data are reported as mean (standard deviation) and [score range]. * \(p<0.05\) was statistically significant. ** Accuracies of the results were statistically tested and the classification results without clustering were used as the baseline for the statistical test (Mann-Whitney U test).

Figure 6 Comparisons of receiver operating characteristic (ROC) curves based on approach to building predictive models. Best records were used for plotting. WOC: Without clustering, CO: Clustering only, CF: Clustering with feature engineering, CFH: Clustering with feature engineering and hyper-parameter tuning.

The prediction results with clustering were superior to those without clustering. The classification metrics are reported in Table 2. Statistical significance was tested using the Mann-Whitney U test (significance level 0.05). Compared with no clustering, the improvement from clustering was statistically significant for the mild and moderate OSAS classifications, but not for the severe OSAS classification.
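The comparison against the no-clustering baseline can be sketched with SciPy as follows. The accuracy values below are randomly generated placeholders, not results of this study, and the two-sided alternative is an assumption.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Placeholder accuracies from repeated runs (synthetic, NOT the study's results)
rng = np.random.default_rng(0)
accuracy_without_clustering = rng.normal(0.80, 0.01, size=30)
accuracy_with_clustering = rng.normal(0.85, 0.01, size=30)

# Mann-Whitney U test with the no-clustering runs as the baseline
statistic, p_value = mannwhitneyu(
    accuracy_with_clustering, accuracy_without_clustering, alternative="two-sided"
)
significant = p_value < 0.05
print(statistic, p_value, significant)
```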

In terms of classification accuracy, clustering with feature engineering and hyperparameter tuning performed best for moderate and severe OSAS predictions, with 87.84% and 91.06%, respectively. For mild OSAS, however, clustering with feature engineering alone showed the highest accuracy, 88.16%.

ROC curves for each OSAS severity class and each approach to building the predictive models are shown in Fig. 6. Consistent with the accuracy analysis, the best AUC values for moderate and severe OSAS were observed when predicting after clustering with feature engineering and hyperparameter tuning. For mild OSAS, clustering with feature engineering alone was the best.
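For readers unfamiliar with the metric, the sketch below shows how a ROC curve and its AUC can be computed with scikit-learn. The data are synthetic and a logistic regression stands in for the tuned classifiers, so the numbers it prints are unrelated to the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc as area_under_curve
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary task (not study data)
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stand-in classifier; any model that outputs class probabilities would work
classifier = LogisticRegression(max_iter=1000).fit(X_train, y_train)
positive_scores = classifier.predict_proba(X_test)[:, 1]

# ROC curve points and the area under the curve
fpr, tpr, thresholds = roc_curve(y_test, positive_scores)
auc_value = roc_auc_score(y_test, positive_scores)
print(round(auc_value, 3))
```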

Discussion

In this study, predictive models for the severity of OSAS were developed by applying various machine learning methodologies, and the applicability of the models was tested and analyzed according to severity. Using MI-LightGBM-RFE, we identified the important clustering features for each AHI cut-off value. We also found that, based on classification accuracy, hierarchical agglomerative clustering, GMM, and K-means clustering are effective for predicting mild, moderate, and severe OSAS, respectively. Of the three severity levels, LightGBM performed best for moderate and severe OSAS, but not for mild OSAS. In particular, it performed well in the moderate OSAS classification, with a fairly large accuracy difference from the other algorithms. While LightGBM is the strongest algorithm overall, CatBoost performs best for mild OSAS. Our models achieved classification accuracy above 87% at all three AHI thresholds.

The gold standard for diagnosing OSAS is PSG. However, PSG has the disadvantages of being laborious, time-consuming, and expensive. Many studies have therefore sought to develop methods for screening OSAS without PSG, and machine learning techniques have been widely applied [33,34,35,36,37]. In recent years, research on the South Korean population has also been actively conducted. However, these studies were limited in that they were conducted on small populations and focused only on supervised learning [38,39].

To the best of our knowledge, this work has the best performance among studies predicting OSAS severity in a South Korean population using machine learning techniques. Compared to previous studies, this study is significant not only in its results but also in its process. We suggested a new methodology that uses both supervised and unsupervised learning algorithms to predict the severity of OSAS. Moreover, among studies that predict OSAS severity with machine learning algorithms, our experiment covers the largest South Korean population so far.

Despite the appreciable prediction performance, this study has several limitations. Since the data were collected from only one sleep clinic, the results are difficult to generalize to the populations of other sleep centers. In addition, the provided data contained a considerable number of missing values because this is a retrospective study.

OSAS is a major worldwide public health concern with an increasing prevalence. Therefore, there is a need for OSAS severity prediction models that can be used in clinical settings. Our work supports the potential of machine learning for OSAS severity prediction, and also suggests that such prediction models may be useful for prioritizing which patients are assigned to PSG.

Conclusion

In this study, we predicted the severity of OSAS from simple information only, such as gender, age, body measurements, and questionnaire responses, using diverse machine learning techniques. Compared to the usual application of supervised learning alone, the approach that combines supervised and unsupervised learning showed significantly better performance in OSAS severity prediction. The results of this work demonstrate the applicability of machine learning methods to OSAS screening. Due to the retrospective nature of the study, a considerable amount of data was unavailable for reasons such as missing values, and the data were collected from a single institution, which may introduce bias. Future work could use data from a larger population at multiple institutions to improve upon this study. In conclusion, the predictive model presented in this study provides an accurate estimate of the OSAS severity class, which is important evidence that OSAS can be effectively screened without time-consuming and labor-intensive tests.