Wearable monitoring of sleep-disordered breathing: estimation of the apnea–hypopnea index using wrist-worn reflective photoplethysmography

A large part of the worldwide population suffers from obstructive sleep apnea (OSA), a disorder impairing the restorative function of sleep and constituting a risk factor for several cardiovascular pathologies. The standard diagnostic metric to define OSA is the apnea–hypopnea index (AHI), typically obtained by manually annotating polysomnographic recordings. However, this clinical procedure cannot be employed for screening and for long-term monitoring of OSA due to its obtrusiveness and cost. Here, we propose an automatic unobtrusive AHI estimation method fully based on wrist-worn reflective photoplethysmography (rPPG), employing a deep learning model that exploits cardiorespiratory and sleep information extracted from the rPPG signal, trained with 250 recordings. We tested our method on an independent set of 188 heterogeneously disordered clinical recordings and found that it estimates the AHI in good agreement with the gold standard polysomnography reference (correlation = 0.61, estimation error = 3±10 events/h). The estimated AHI was shown to reliably assess OSA severity (weighted Cohen’s kappa = 0.51) and to screen for OSA (ROC–AUC = 0.84/0.86/0.85 for mild/moderate/severe OSA). These findings suggest that wrist-worn rPPG measurements, which can be implemented in wearables such as smartwatches, have the potential to complement standard OSA diagnostic techniques by allowing unobtrusive sleep and respiratory monitoring.


Features
Table S1. Comparison of the RE-epoch detection performance on the hold-out set without the exclusion of low rPPG quality recordings, for the complete and reduced feature sets. The performance is computed over all epochs contributing to the AHI (i.e. excluding Wake epochs and epochs with more than 80% undefined features).

Deep learning models training
The general architecture of a deep learning model for supervised learning consists of a series of stacked layers with layer-dependent characteristics (hyper-parameters), which connect the input features with the desired output, i.e. the epoch-by-epoch probability of the positive class. We explored several possible combinations of blocks of layers and hyper-parameters, summarized in Table S2. The work of Radha et al. 68 inspired us to include long short-term memory (LSTM) and dense layers in our blocks. The choice of using convolution was based on the capability of this type of layer to extract meaningful information from multidimensional time series with a lower risk of over-fitting than other types of layers, such as recurrent and dense layers. We performed this exploration from the simplest to the most complex structures and empirically adjusted it based on the AHI estimation results obtained on the training and validation sets, i.e. blindly from the hold-out test set. For example, only a few LSTM-based models were explored because we noticed that they usually performed worse than other architectures during our initial explorations. For each model, we set the maximum number of training iterations to 5000 and used a batch size of 32 overnight recordings. The loss function was a weighted binary cross-entropy and the optimizer a root mean square propagation (RMSprop) optimizer with a learning rate of 0.0001. Training stopped when the validation set loss did not improve by at least 0.0005 for 300 consecutive training iterations, and we selected the model with the lowest validation loss as the final model. The models were implemented in a Python 3.6 environment using the Keras functions of the TensorFlow library (version 1.14).
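The early-stopping rule above can be sketched as follows. This is a minimal illustration of the stopping logic (not the authors' actual Keras callback; the function name and loop are ours), applied to a sequence of per-iteration validation losses:

```python
def train_with_early_stopping(val_losses, min_delta=0.0005, patience=300):
    """Simulate the stopping rule described in the text: halt when the
    validation loss has not improved by at least `min_delta` for
    `patience` consecutive iterations, and return the (index, loss)
    of the best model seen so far."""
    best_loss = float("inf")
    best_iter = -1
    stale = 0
    for i, loss in enumerate(val_losses):
        if loss <= best_loss - min_delta:
            # Improvement of at least min_delta: keep this model.
            best_loss, best_iter, stale = loss, i, 0
        else:
            stale += 1
            if stale >= patience:
                break  # early stop: no meaningful improvement
    return best_iter, best_loss
```

In Keras this behavior corresponds to an `EarlyStopping` callback with `min_delta` and `patience` arguments, combined with restoring the best weights.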
The amount of RE-epochs is lower than the amount of non-RE-epochs. To compensate for the class imbalance, we calculated the loss weight of each epoch as a function of N_RE-epochs, N_adj. RE-epochs and N_tot. epochs, which denote, respectively, the number of RE-epochs, the number of non-RE-epochs adjacent to an RE-epoch, and the total number of epochs. In this manner, we also forced the model to pay attention to the change in class from one epoch to the next. Epochs with more than 80% of undefined features were considered unreliable, and they were assigned a loss weight of zero.
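The intent of this weighting can be sketched as below. The paper's exact weighting expression is not reproduced here, so as a stand-in this sketch uses simple inverse class frequency (an assumption on our part), up-weighting RE-epochs together with their adjacent non-RE-epochs and zeroing out unreliable epochs:

```python
def epoch_loss_weights(labels, unreliable):
    """Per-epoch loss weights for RE-epoch detection.

    `labels` is a sequence of 0/1 epoch labels (1 = RE-epoch) and
    `unreliable` flags epochs with >80% undefined features.  The weighting
    here is illustrative inverse class frequency, not the paper's formula:
    non-RE-epochs adjacent to an RE-epoch are treated like the rare class,
    so the model attends to class changes between consecutive epochs.
    """
    n_tot = len(labels)
    n_re = sum(labels)
    # Non-RE-epochs with an RE-epoch as immediate neighbor.
    adjacent = [
        labels[i] == 0 and (
            (i > 0 and labels[i - 1] == 1) or
            (i + 1 < n_tot and labels[i + 1] == 1)
        )
        for i in range(n_tot)
    ]
    n_up = n_re + sum(adjacent)  # epochs treated as the "rare" side
    weights = []
    for lab, adj, bad in zip(labels, adjacent, unreliable):
        if bad:
            weights.append(0.0)  # unreliable: ignored by the loss
        elif lab == 1 or adj:
            weights.append(n_tot / max(n_up, 1))  # up-weight rare side
        else:
            weights.append(n_tot / max(n_tot - n_up, 1))
    return weights
```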
The output of each model, i.e. the positive class probability, was thresholded to define a positive class detection. The choice of the probability threshold was performed on the training set based on Cohen's kappa between the reference and estimated OSA severity 27.
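The threshold selection step can be sketched as a simple sweep. For illustration only we assume each detected RE-epoch counts as one respiratory event when forming the AHI, and we use the conventional AHI severity cut-offs (<5 none, 5–15 mild, 15–30 moderate, ≥30 severe); the function names and data layout are ours:

```python
def severity(ahi):
    # Conventional AHI cut-offs in events/h:
    # <5 none, 5-15 mild, 15-30 moderate, >=30 severe.
    return 0 if ahi < 5 else 1 if ahi < 15 else 2 if ahi < 30 else 3

def cohens_kappa(a, b, n_classes=4):
    """Unweighted Cohen's kappa between two label sequences."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    pe = sum((sum(x == k for x in a) / n) *              # chance agreement
             (sum(y == k for y in b) / n)
             for k in range(n_classes))
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)

def pick_threshold(probs_per_rec, hours_per_rec, ref_severity, candidates):
    """Pick the probability threshold maximizing Cohen's kappa between
    reference and estimated OSA severity over a set of recordings."""
    best_t, best_k = None, float("-inf")
    for t in candidates:
        est = [severity(sum(p >= t for p in probs) / h)
               for probs, h in zip(probs_per_rec, hours_per_rec)]
        k = cohens_kappa(ref_severity, est)
        if k > best_k:
            best_t, best_k = t, k
    return best_t, best_k
```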

Deep learning model selection
After training over a thousand different deep learning models to perform RE-epoch detection, we selected the model that performed best in AHI estimation on the validation set. More specifically:
1. We selected the 20 models with unique combinations of blocks, considered independently of their hyper-parameters, that had the highest Cohen's kappa 88 between the reference and estimated OSA severity;
2. These selected models were retrained eight times by re-randomizing the training and validation sets. In this way, we estimated their resilience to changes in training and validation data;
3. We selected the model architecture with the best median Cohen's kappa of OSA severity;
4. For the selected model architecture, we selected the model trained on the training and validation set split with the highest Spearman's correlation between the reference and estimated AHI.
We decided to select the model based on the AHI estimation performance instead of the RE-epoch detection performance to avoid possible bias introduced by participants with few respiratory events (i.e. cases where the positive class is absent or extremely unbalanced). Even though the OSA severity is an ordinal quantity, we opted for standard Cohen's kappa instead of its weighted version 66 to have a stricter evaluation of each misclassification. The last selection step employed Spearman's correlation, as opposed to Cohen's kappa, because it can account for the spread within each OSA severity class.
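Spearman's correlation is rank-based, so it rewards a model that orders participants correctly by AHI even within a severity class. A minimal pure-Python version (our own helper, equivalent to `scipy.stats.spearmanr` for this use) is:

```python
def _ranks(x):
    # Average ranks with tie handling (1-based, as in standard rank tests).
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1                      # extend the run of tied values
        avg = (i + j) / 2 + 1           # average of the tied positions
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Any monotone relationship between reference and estimated AHI yields a coefficient of 1, regardless of how the values are spread within a severity class.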

Performance of the selected model architecture
The twenty unique architectures selected in step 1 of the model selection showed more than fair agreement between the reference and estimated OSA severity on the validation set (minimum Cohen's kappa of 0.37 amongst the 20 selected). Most of these architectures consisted of stacked convolutional blocks interleaved with blocks to reduce over-fitting (Gaussian noise layers). The selected model architecture (step 3 of the model selection) had a median [IQR] OSA severity Cohen's kappa on the re-randomized validation sets of 0.36 [0.26–0.38].

Influence of rPPG quality on performance
The AHI estimation performance increased after the exclusion of low-quality recordings. Therefore, we tested whether further tightening the rPPG quality requirements would increase performance. Since the initial low-quality definition was based on multiple parameters derived from the training set, we maintained a similar definition for the tighter quality requirements. We calculated the parameters reported in Table 3 also for the 20th, 30th, 40th and 50th percentiles of the quality of the training set recordings (Table S3).
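The percentile step of this tightening procedure can be sketched as below. The quality scores are toy values of our own invention, and only the percentile computation is shown (the paper derives multiple parameters per percentile):

```python
def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]
    (same convention as NumPy's default method)."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100        # fractional position in the sorted list
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

# Hypothetical per-recording quality scores of a training set; the
# thresholds become progressively stricter as the percentile increases.
train_quality = [0.20, 0.35, 0.50, 0.60, 0.70, 0.78, 0.85, 0.90, 0.95, 0.99]
thresholds = {q: percentile(train_quality, q) for q in (20, 30, 40, 50)}
```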
Increasing the rPPG quality requirements generated an overall increase in performance for RE-epoch detection (Table S4) and AHI estimation (Table S5), at the cost of a decreased number of available recordings.
Table S5. AHI estimation performance and the number of remaining recordings as the rPPG quality requirements increase. The best performance values (before rounding) are in bold.