Evaluating reliability in wearable devices for sleep staging

Birrer, Vera; Elgendi, Mohamed; Lambercy, Olivier; Menon, Carlo

doi:10.1038/s41746-024-01016-9


Review Article
Open access
Published: 18 March 2024

npj Digital Medicine volume 7, Article number: 74 (2024)

Abstract

Sleep is crucial for physical and mental health, but traditional sleep quality assessment methods have limitations. This scoping review analyzes 35 articles from the past decade, evaluating 62 wearable setups with varying sensors, algorithms, and features. Our analysis indicates a trend towards combining accelerometer and photoplethysmography (PPG) data for out-of-lab sleep staging. Devices using only accelerometer data are effective for sleep/wake detection but fall short in identifying multiple sleep stages, unlike those incorporating PPG signals. To enhance the reliability of sleep staging wearables, we propose five recommendations: (1) Algorithm validation with equity, diversity, and inclusion considerations, (2) Comparative performance analysis of commercial algorithms across multiple sleep stages, (3) Exploration of feature impacts on algorithm accuracy, (4) Consistent reporting of performance metrics for objective reliability assessment, and (5) Encouragement of open-source classifier and data availability. Implementing these recommendations can improve the accuracy and reliability of sleep staging algorithms in wearables, solidifying their value in research and clinical settings.

Introduction

Sleep, encompassing approximately one-third of our lifespan, is a fundamental aspect of our daily activities and plays a crucial role in maintaining our health, work performance, and overall well-being¹. Extensive research has consistently demonstrated the detrimental impact of poor sleep quality on various health conditions, including cardiovascular diseases², diabetes³, hypertension⁴, depression⁵, immune-related diseases⁶, and cancer mortality risk⁷. As an increasing number of individuals recognize the significance of sleep quality in leading a healthy lifestyle, both sleep-related research and industries have witnessed substantial growth^8,9.

Polysomnography (PSG) currently serves as the gold standard for sleep assessment, involving a comprehensive measurement of various physiological changes during sleep¹⁰. This method requires the placement of multiple sensors to monitor brain activity, heart activity, eye movements, muscle activity, blood oxygen levels, breathing patterns, body movements, snoring, and other noises. However, the complex setup and high cost associated with PSG discourage regular testing, thereby limiting its utility for accurate sleep monitoring. Patients undergoing PSG must endure the placement of numerous sensors on their bodies, intricate wiring systems, and bulky electronic devices for data transmission and storage. Additionally, PSG recordings primarily take place within specialized sleep laboratories, which are often inhospitable to natural sleep patterns¹⁰. Consequently, many patients experience difficulties falling asleep and do not exhibit natural sleep behavior due to the elaborate setup.

While many wearable-based algorithms focus on distinguishing between sleep and wakefulness, a comprehensive evaluation of sleep architecture and specific sleep stages is essential for proper diagnosis and treatment of sleep disorders¹¹. Sleep staging provides valuable insights into the quality, characteristics, and transitions of sleep stages, enabling a more thorough understanding of sleep patterns and facilitating tailored interventions¹².

Recent articles have summarized the use of commercially available devices for sleep monitoring, yet there is a notable gap in addressing the development of algorithms for sleep staging and the associated challenges. In response to this gap, this review aims to provide a comprehensive overview of recent advancements in wearable sensors and portable electronics, particularly focusing on innovations that enhance the comfort and usability of sleep monitoring devices by eliminating the need for adhesive, conductive gels, or cable connections. We also offer essential recommendations to guide future developments in algorithm design for wearables, targeting the accurate and reliable assessment of sleep parameters. This work is essential in improving the diagnosis and management of sleep disorders, ultimately contributing to better overall sleep health and well-being^13,14,15.

Results

Publications

This scoping review identified 35 articles that together evaluated 62 wearable device setups, some of which appeared in several articles, as shown in Fig. 1. The search retrieved 88 articles from PubMed, 41 from Embase, and 9 from IEEE Xplore. While screening these articles, an additional 14 relevant articles were identified. During screening, 22 duplicates and six inaccessible or incompatible articles were removed, leaving a total of 124 articles for evaluation. Fifty articles were excluded because they either did not discuss wearables or did not assess them, and another 14 articles were excluded because they did not evaluate the sleep metrics of the wearables. Additionally, 4 review articles and 5 theoretical articles were removed. Finally, 16 articles that included no epoch-by-epoch evaluation were removed, resulting in 35 articles that were deemed suitable for in-depth analysis. Five of these were analyzed in more depth to extract the details of the sleep staging algorithms and the features used. The trend in wearable technology is shifting toward multi-sensor devices, which incorporate not only accelerometers but also PPG, temperature, or other types of sensors. Specifically, this review included 62 wearable setups, of which 28 exclusively utilized accelerometers and 32 incorporated multiple different sensors. For two devices^16,17 it was not clearly stated which sensor input(s) were used to assess sleep.

**Fig. 1: Search workflow depicting the identification, screening, eligibility, and inclusion of articles in the review.**

Characteristics of participants

Sleep stages exhibit significant variation both between males and females and across different age groups.¹⁸ Most of the studies included a relatively balanced number of male and female participants; only Fedorin et al.¹⁹ did not state the gender distribution. Eight studies focused on children and adolescents^{17,20,21,22,23,24,25,26}, and five studies targeted young adults^{27,28,29,30,31}, defined here as articles reporting an average age below 25 or specifically stating that they investigated young adults. Only two articles^32,33 examined the performance of wearable devices in an older population, defined as an average age over 50. One further article also had an average age above 50 but reported a large variance in age³⁴. The remaining 22 studies covered mainly individuals between 25 and 50 years. Finally, Fedorin et al.¹⁹ did not state the age of their participants.

Inclusion of participants with sleep disorders and/or comorbidities

Medical conditions like insomnia, sleep disorders, or neurological disorders can also affect sleep staging.³⁵ The majority (25) of the included articles recorded data from healthy participants only. Four articles included healthy participants as well as participants with some kind of sleep disorder^25,32,36,37. Three studies focused exclusively on participants with sleep disorders^20,34,38. One article included only participants with unipolar major depressive disorder³⁹, while another involved only participants with dermatitis⁴⁰. Finally, one article included only participants who had obstructive sleep apnea (OSA), had neurological disorders, and/or used medications that are known to affect sleep³³. These findings are summarized in Fig. 2.

**Fig. 2: Distribution of included participants based on health status per article.**

Types of devices and reference systems

Of the devices examined in this review, 28 (d = 28, ‘d’ is the number of devices) relied solely on accelerometer data for sleep analysis. However, there has been an increasing trend in recent years towards utilizing both accelerometer and PPG data for evaluating sleep, which is reflected in the inclusion of 28 such devices in this review, as seen in Fig. 3. Further, two devices^41,42 included in this review incorporated data from three sensors: accelerometer, PPG, and temperature. An additional two devices^40,43 combined accelerometer input with other sensors, such as ambient light, bio-impedance, or skin temperature, but did not include a PPG sensor. Lastly, for two devices^16,17 the specific sensor input utilized for sleep analysis was not reported.

**Fig. 3: Type of sensors used to perform the sleep analysis per device.**

On average, sleep/wake classification accuracies were reported to be 87.2% based on 53 assessed devices. There was no significant difference in accuracies between devices using only accelerometer data (86.7%, d = 28) and devices using both PPG and accelerometer data (87.8%, d = 22), as determined by a t-test (significance threshold p < 0.05). All reported accuracies ranged from 79% to 96%, except for Kanady et al.’s study²⁸, which reported lower values of 54% and 64%. This difference can be attributed to their 24-hour measurement, which had a higher wake-to-sleep ratio compared to overnight measurements in other studies. Therefore, these accuracies reflect the generally poor performance of sleep classifiers in detecting wake. The average accuracy for 3-stage classification (wake vs. NREM vs. REM) was 69.7% (d = 3), and for 4-stage classification (wake vs. light vs. deep vs. REM), it was 65.2% (d = 9). More detailed information is in Table 1.

**Table 1: Overview of all articles included in this review.**

Data were collected at sleep laboratories (n = 23, ‘n’ is the number of articles), at home (n = 9), or in quasi-/semi-laboratories (n = 2). One study included recordings from both participants’ homes and a sleep laboratory⁴¹. Most of the articles (n = 32) used PSG as a reference system to validate the results of the wearables, as can be seen in Fig. 4. However, three studies utilized an EEG system^31,44,45 as a reference: two used a single-channel EEG device^44,45 and one used the Dreem 2³¹ mobile EEG device.

**Fig. 4: Ground truth methods used for evaluating wearables per article.**

Sleep staging epoch lengths

According to the guidelines for sleep staging, the PSG data are analysed in 30-s segments, called epochs, and these are then classified into the sleep stages⁴⁶. About two thirds (d = 41) of the 62 wearable setups in the reviewed articles provided epochs of 30 s, which can be directly compared to the epochs of the PSG data. A quarter (d = 17) of the wearable setups had access to 60-s epochs. One article⁴³ employed a device that only provided access to 2-min segmented data. Furthermore, for two devices in one study⁴⁷ the sleep stages in epochs of 5 min were reported. For one device⁴⁸ the epoch length was not stated. The distribution of epoch lengths used can be seen in Fig. 5.

**Fig. 5: Reported length of epochs used to evaluate the performance of wearables per wearable setup.**

A challenge is comparing device epochs whose length differs from that of the reference epochs, being half as long or two, four, or ten times as long. A commonly used method for 60-s epochs is to fuse pairs of 30-s PSG epochs into 60-s epochs: if one or both epochs are classified as wake, the fused epoch is scored as wake, and if both are classified as sleep, it is scored as sleep^{17,20,24,27,28,32,34}. Another commonly used method is to split the long epochs into 30-s segments and assign each segment the same value as the long epoch^31,39,47. Roberts et al.⁴⁸ took the timestamp of the beginning of each staged epoch and used the classification of the reference epoch with the nearest start timestamp; no conversion between 30 s and 60 s occurred. Devine et al.⁴³ assigned sleep and wake the values 1 and 0, respectively, averaged the values over four epochs, and then rounded to the nearest integer to obtain 2-min epochs. Chinoy et al.¹⁶ scored the PSG data in both 30-s and 60-s epochs to be able to compare them to devices with 30-s epochs and devices with 60-s epochs. Stucky et al.⁴⁹ used PSG data scored in 20-s epochs and compared them to 30-s device epochs: for each PSG interval, they took the dominating device stage in that interval, and if two stages were equally represented, the first one was chosen.
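The three binary alignment schemes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, trailing epochs that do not fill a complete group are dropped, and a mean of exactly 0.5 is rounded up to sleep, since the reviewed articles do not specify these details.

```python
from typing import List

WAKE, SLEEP = 0, 1  # binary coding as in the averaging scheme (wake = 0, sleep = 1)


def fuse_to_60s(psg_30s: List[int]) -> List[int]:
    """Fuse pairs of 30-s reference epochs into 60-s epochs.

    A fused epoch is wake if one or both 30-s epochs are wake, else sleep.
    Assumption: an unpaired trailing epoch is dropped.
    """
    fused = []
    for i in range(0, len(psg_30s) - 1, 2):
        pair = psg_30s[i:i + 2]
        fused.append(WAKE if WAKE in pair else SLEEP)
    return fused


def split_long_epochs(device_epochs: List[int], epoch_s: int) -> List[int]:
    """Split device epochs of epoch_s seconds into 30-s epochs with the same label."""
    if epoch_s <= 0 or epoch_s % 30 != 0:
        raise ValueError("epoch length must be a positive multiple of 30 s")
    factor = epoch_s // 30
    return [label for label in device_epochs for _ in range(factor)]


def average_to_2min(psg_30s: List[int]) -> List[int]:
    """Average four 30-s epochs and round to obtain 2-min epochs.

    Assumption: a mean of exactly 0.5 is scored as sleep.
    """
    out = []
    for i in range(0, len(psg_30s) - 3, 4):
        mean = sum(psg_30s[i:i + 4]) / 4
        out.append(SLEEP if mean >= 0.5 else WAKE)
    return out
```

Note that fusing is biased toward wake (any wake in the pair wins), whereas averaging is closer to a majority vote, so the two schemes can yield different agreement figures for the same recording.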

When authors were able to work with 30-s epochs (or raw data) of commercially available devices, the devices often had to be provided by the company or the authors were employed by the company^{21,22,26,29,30,33,41,42,47}.

Algorithms for sleep staging

The majority of the articles^{16,17,19,20,22,24,25,26,27,28,29,30,31,32,36,37,38,39,42,43,45,47,48,49,50,51,52,53} included in this review reported their findings based on proprietary algorithms used by wearable device companies, with many not disclosing the specific features employed in their sleep staging algorithms, as can be seen from Fig. 6. For sleep detection using only accelerometer data (actigraphy), well-established algorithms are most often used, including the Cole-Kripke algorithm⁵⁴, the University of California, San Diego (UCSD) scoring algorithm⁵⁵, and the Sadeh algorithm⁵⁶. In general, they calculate weighted sums of activity levels in one-minute intervals, including levels from preceding and succeeding minutes⁵⁷. For devices that also use PPG data, five articles^{19,41,48,50,51} describe their own machine-learning-based sleep staging algorithms in detail; these are reviewed in the following sections. Furthermore, Mahadevan et al.⁴⁰ described a possible algorithm for wake/sleep detection using accelerometer data, skin temperature, and an ambient light sensor, but no PPG data.
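The weighted-sum principle behind these actigraphy algorithms can be illustrated with a short sketch. The weights, scale factor, and threshold below are hypothetical placeholders chosen for readability; they are not the published coefficients of the Cole-Kripke, UCSD, or Sadeh algorithms, and the handling of minutes outside the recording is an assumption.

```python
from typing import List, Sequence


def score_sleep_wake(
    counts: Sequence[float],
    weights: Sequence[float],
    n_before: int,
    scale: float,
    threshold: float = 1.0,
) -> List[int]:
    """Score each one-minute epoch as sleep (1) or wake (0).

    weights covers n_before preceding minutes, the current minute, and the
    remaining succeeding minutes, in that order. All values are hypothetical.
    """
    if not 0 <= n_before < len(weights):
        raise ValueError("n_before must index a position inside weights")
    labels = []
    for t in range(len(counts)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t + k - n_before
            # Assumption: minutes outside the recording contribute nothing.
            if 0 <= idx < len(counts):
                total += w * counts[idx]
        labels.append(1 if scale * total < threshold else 0)
    return labels


# Example: a single burst of movement in an otherwise still recording.
activity = [0, 0, 200, 0, 0]
print(score_sleep_wake(activity, weights=[0.25, 1.0, 0.25], n_before=1, scale=0.01))
```

Because neighbouring minutes enter the sum, a large movement can also push adjacent quiet minutes toward wake when the weights are large enough, which is the intended smoothing effect of this family of algorithms.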

**Fig. 6: Reported algorithms for devices using PPG sensors.**

The evaluated classifiers for sleep staging with wearable devices include linear discriminant classifier, quadratic discriminant classifier, random forest classifier, support vector machine, neural nets, logistic regression, k-nearest neighbor, and gradient boosting machine^{19,41,48,50,51}. The overall best accuracy for sleep/wake classification was 96%, achieved with the light gradient boosting machine⁴¹. The best accuracy for 3-stage sleep staging was 85%¹⁹ with the linear discriminant classifier. The overall highest accuracy for 4-stage sleep staging was 79%⁴¹ with the light gradient boosting machine. Notably, both Beattie et al.⁵⁰ and Walch et al.⁵¹ state in their articles that the choice of the classifier was not as impactful as the selection of the input features.

Data processing and feature selection

In some studies^19,41,50, the data underwent pre-processing before feature extraction for classifier training. This included peak detection in the PPG signal to estimate the RR intervals of the ECG⁵⁰, or detrending, denoising, and filtering of all raw data¹⁹. Altini and Kinnunen⁴¹ applied a 5th order Butterworth filter (3–11 Hz) to the accelerometer data and performed temperature artifact rejection by masking values outside of 31–40 degrees. They applied a real-time moving average filter to the PPG data and removed intervals more than 16 bpm away from the 7-point median of their immediate neighbors. Additionally, they required the existence of five consecutive windows.
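Two of the artifact-rejection rules described above (masking temperature values outside 31–40 degrees and removing heart rate values more than 16 bpm away from a 7-point median) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function names are hypothetical, the median window is assumed to be centred on the sample and to include it, and the window is assumed to shrink at the edges of the recording.

```python
from statistics import median
from typing import List, Optional, Sequence


def mask_temperature(
    temps: Sequence[float], low: float = 31.0, high: float = 40.0
) -> List[Optional[float]]:
    """Replace temperature samples outside [low, high] with None."""
    return [t if low <= t <= high else None for t in temps]


def reject_hr_outliers(
    hr_bpm: Sequence[float], max_dev: float = 16.0, window: int = 7
) -> List[Optional[float]]:
    """Replace heart rate values far from the local median with None.

    Assumptions: the window is centred on the sample, includes the sample
    itself, and is truncated at the start and end of the series.
    """
    half = window // 2
    cleaned: List[Optional[float]] = []
    for i, value in enumerate(hr_bpm):
        lo = max(0, i - half)
        hi = min(len(hr_bpm), i + half + 1)
        local_median = median(hr_bpm[lo:hi])
        cleaned.append(value if abs(value - local_median) <= max_dev else None)
    return cleaned
```

Masked samples are kept as None rather than dropped so that the series stays aligned with the epoch timeline; how gaps are subsequently handled is left open here.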

Beattie et al.⁵⁰ used accelerometer features including an integration of the accelerometer signal in 30-s epochs, the magnitude (maximum and minimum of each axis), and the time since the last movement and until the next significant movement. Walch et al.⁵¹ described the feature they extracted from the accelerometer as the activity count from the raw data, which should be similar to the features used by actigraphy (described and evaluated by te Lindert et al.⁵⁸). Altini and Kinnunen⁴¹ included the trimmed mean, maximum, and interquartile range of each axis in 30-s windows. Furthermore, the mean amplitude deviation and the difference in arm angle were evaluated in 5-s epochs and then aggregated to 30-s epochs. Finally, Fedorin et al.¹⁹ also utilized features derived from accelerometer data, but their specific features were not explicitly stated.

The included features derived from the PPG measurements varied greatly from article to article. Beattie et al.⁵⁰ extracted heart rate (HR) from the PPG signal and used several heart rate variability (HRV) features in their sleep staging classifier, including high frequency (HF), low frequency (LF), and very low frequency (VLF) power, root mean square of successive differences (RMSSD), percentage of adjacent RR intervals differing by more than 50 ms (pNN50), delta RR, mean HR, 90th percentile HR, and 10th percentile HR. They also included breathing rate features such as HF power (0.15–0.4 Hz), LF power (0.04–0.15 Hz), and VLF power (0.015–0.04 Hz). Altini and Kinnunen⁴¹ used several HRV features in their sleep staging classifier, including HR, RMSSD, standard deviation of normal-to-normal intervals (SDNN), pNN50, LF power (0.04–0.15 Hz), and HF power (0.15–0.4 Hz), frequency peak in LF and HF, total power, normalized power, breathing rate, and the mean and coefficient of variation of the zero-crossing interval. On the other hand, Walch et al.⁵¹ used the bpm values for every second and the standard deviation of the windows around the scored epoch. Finally, Fedorin et al.¹⁹ included the HRV and the RR in their time and frequency domains and in nonlinear time sequence processing. They also used some PPG shape features, although these were not specified.
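The time-domain HRV features named above have standard definitions that fit in a few lines. The sketch below computes mean HR, SDNN, RMSSD, and pNN50 from a list of RR intervals in milliseconds; the function name is hypothetical, SDNN is computed here as the sample standard deviation (an assumption, since the reviewed articles do not state which estimator they used), and frequency-domain features are omitted because they require resampling and spectral estimation.

```python
import math
from statistics import mean, stdev
from typing import Dict, Sequence


def hrv_time_domain(rr_ms: Sequence[float]) -> Dict[str, float]:
    """Compute basic time-domain HRV features from RR intervals in ms."""
    if len(rr_ms) < 3:
        raise ValueError("need at least three RR intervals")
    # Differences between successive RR intervals.
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return {
        "mean_hr_bpm": 60000.0 / mean(rr_ms),
        "sdnn_ms": stdev(rr_ms),  # assumption: sample standard deviation
        "rmssd_ms": math.sqrt(mean([d * d for d in diffs])),
        "pnn50_pct": 100.0 * sum(abs(d) > 50 for d in diffs) / len(diffs),
    }


features = hrv_time_domain([800, 850, 790, 860, 800])
print(features)
```

In a staging pipeline such features would typically be computed per window around each scored epoch, after artifact rejection of the RR series.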

In their classifier, Walch et al.⁵¹ incorporated a feature termed “clock proxy,” which is a cosine wave derived from an individual’s circadian clock that was estimated using data from the previous night’s sleep with the wearable. Fedorin et al.¹⁹ included statistical information regarding sleep stages as features, such as a sleep stage transition probability matrix and the probability of each sleep stage occurring per hour after falling asleep. Altini and Kinnunen⁴¹ included features derived from a negative temperature coefficient sensor, including mean, minimum, maximum, and standard deviation, as well as a sensor-independent circadian factor. The circadian factor is composed of a cosine wave representing circadian drive, a decay representing the decay of homeostatic sleep pressure, and a linear function representing the elapsed time since the beginning of sleep.
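A sensor-independent circadian factor of the kind described above (a cosine term, a decay term, and a linear elapsed-time term) could be sketched as below. The functional forms and every parameter value are assumptions made for illustration: the reviewed articles name the three components but this sketch's exponential decay, 24-hour period, phase offset, and time constant are hypothetical.

```python
import math
from typing import Tuple


def circadian_features(
    elapsed_min: float,
    total_min: float,
    period_min: float = 24 * 60,   # assumed 24-hour circadian period
    peak_offset_min: float = 0.0,  # hypothetical phase of the cosine peak
    tau_min: float = 240.0,        # hypothetical decay time constant
) -> Tuple[float, float, float]:
    """Return (cosine, decay, linear) features for one epoch.

    cosine: circadian drive, modelled as a cosine wave.
    decay:  homeostatic sleep pressure, modelled here as an exponential decay.
    linear: elapsed fraction of the night since the beginning of sleep.
    """
    if total_min <= 0:
        raise ValueError("total_min must be positive")
    cosine = math.cos(2 * math.pi * (elapsed_min - peak_offset_min) / period_min)
    decay = math.exp(-elapsed_min / tau_min)
    linear = elapsed_min / total_min
    return cosine, decay, linear


# One feature triple per 30-s epoch of an 8-hour night.
night = [circadian_features(i * 0.5, 480.0) for i in range(960)]
```

Such features carry no sensor information at all; they encode only the prior that, for example, deep sleep is more likely early in the night and REM sleep later.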

Altini and Kinnunen⁴¹ normalized most of the features per night, excluding some acceleration features, and then used them as input for the models. Beattie et al.⁵⁰ applied a set of rules after sleep staging to penalize unlikely physiological patterns.
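Per-night normalization can be illustrated with a z-score computed within each night, which removes differences in baseline between nights and participants. The choice of a z-score is an assumption made for this sketch; the reviewed article reports that features were normalized per night without this sketch claiming to reproduce its exact method.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def normalize_per_night(values: Sequence[float]) -> List[float]:
    """Z-score one feature using the mean and spread of a single night."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no within-night information.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Two nights with different resting heart rate baselines map to the same scale.
night_a = normalize_per_night([50, 60, 70])
night_b = normalize_per_night([60, 70, 80])
```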

Sleep staging without full raw data access

In the study by Roberts et al.⁴⁸, already processed data provided by Apple and Oura were used to distinguish between wake and sleep, without the full raw data access available to the previously described classifiers. The Apple Watch Series 2 provided raw accelerometer data but only provided access to bpm estimates for the heart rate, sampled at approximately 0.2 Hz. For the Oura Ring, the researchers used motion counts provided every 30 s and RR intervals from the PPG sensor. They employed a gradient boosting classifier and achieved accuracy and sensitivity comparable to the proprietary sleep staging algorithm used by Oura. At the time of this study, Apple did not yet have its own sleep classifier. The model trained on the data obtained from these devices achieved higher accuracy for the Apple Watch than for the Oura Ring. The researchers suspected that the difference in accuracy and specificity could be attributed to the different types of data available from the devices. Additionally, the algorithm developed in this study was suitable for real-time applications.

Influence of different features on classifier performance

The reported specificities for sleep/wake detection range from 41% to 60.2% (accuracies 90%/92.6%)⁴⁸ for the algorithms using already processed data and 65% (sensitivity fixed at 90%)⁵¹ up to 80.74% (accuracy 98.15%)⁴¹. Walch et al.⁵¹ stated that for wake/sleep classification, the motion features are a good predictor, and the addition of the circadian features increases the accuracy more than the addition of the heart rate features. Altini and Kinnunen⁴¹ also used motion as the baseline accuracy and added features, reporting that the addition of temperature and HRV increased the accuracy by about the same amount, while the circadian features, added last, only increased the F1 score. Roberts et al.⁴⁸ found that the specificity could be increased by around 20–35% when the wake epochs are oversampled, at the cost of 8–12% of accuracy.
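Oversampling of wake epochs, as mentioned above, can be sketched as random duplication of the minority class until both classes are equally frequent. This is one common variant and an assumption here, since the review does not describe the exact oversampling procedure that was used; the function name and the balancing target are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def oversample_wake(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    wake_label: int = 0,
    seed: int = 0,
) -> Tuple[List[Sequence[float]], List[int]]:
    """Duplicate randomly chosen wake epochs until wake and sleep are balanced."""
    rng = random.Random(seed)
    wake_idx = [i for i, y in enumerate(labels) if y == wake_label]
    sleep_idx = [i for i, y in enumerate(labels) if y != wake_label]
    if not wake_idx or len(wake_idx) >= len(sleep_idx):
        return list(features), list(labels)  # nothing to balance
    extra = [rng.choice(wake_idx) for _ in range(len(sleep_idx) - len(wake_idx))]
    idx = list(range(len(labels))) + extra
    return [features[i] for i in idx], [labels[i] for i in idx]
```

Oversampling should be applied to the training data only; evaluating on oversampled data would distort the reported accuracy and specificity.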

The reported accuracies for three-stage sleep staging were 69%⁵¹ and 85%¹⁹, with Cohen’s kappa values ranging from 0.4 to 0.67 indicating moderate to substantial agreement with the PSG sleep staging. Walch et al.⁵¹ found that motion is the weakest predictor of three-stage sleep staging, indicating that heart rate features are much more important.

For four-stage sleep staging, the reported accuracies were 69%⁵⁰, 77%¹⁹ and 79%⁴¹ and the Cohen’s kappa values were 0.52⁵⁰ and 0.58¹⁹, indicating moderate agreement with the PSG sleep staging. Beattie et al.⁵⁰ stated that the Cohen’s kappa value is the same when using only motion features or only heart-rate-derived features, and that the score doubles when using both feature types. Altini and Kinnunen⁴¹ started with a baseline accuracy using just motion features, resulting in an accuracy of 57%. The addition of temperature features added 4%, while the addition of HRV features increased accuracy by 16%. Finally, the addition of circadian features resulted in an increase in accuracy by 3%.
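Cohen’s kappa, used above alongside accuracy, corrects the observed epoch-by-epoch agreement for the agreement expected by chance given how often each stage occurs. A minimal sketch for any number of stages follows; the stage labels in the example are illustrative and the function name is hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(reference: Sequence[Hashable], predicted: Sequence[Hashable]) -> float:
    """Cohen's kappa between reference (e.g. PSG) and device stage labels."""
    n = len(reference)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    observed = sum(r == p for r, p in zip(reference, predicted)) / n
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    # Chance agreement from the marginal stage frequencies.
    expected = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both label sequences consist of one identical stage
    return (observed - expected) / (1.0 - expected)


psg = ["W", "L", "L", "D", "R", "L"]
device = ["W", "L", "D", "D", "R", "R"]
print(cohens_kappa(psg, device))  # 4 of 6 epochs agree; kappa = 4/7
```

In this example the raw agreement is 67% while kappa is about 0.57, which shows why kappa values look lower than the accuracies reported for the same classifier.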

Discussion

The objective of this review was to assess the current literature on the challenges associated with algorithm development in sleep staging using wearables. To achieve this, we conducted an extensive search to identify previous research in this area. Although many articles discussed wearables and sleep evaluation, most focused on sensing technologies or devices that only use accelerometer data. Despite the growing number of wearables that incorporate multiple sensors for sleep staging, there is a lack of research on algorithms used for sleep staging and the potential benefits of using multi-sensor inputs.

The American Academy of Sleep Medicine (AASM) expressed the need for validation of consumer sleep technologies⁵⁹. However, there is no standardized protocol or set of measures for evaluating wearable devices that do not include EEG sensors. Menghini et al.⁶⁰ proposed a framework to improve validation. Two types of assessment measures are commonly used: total duration of different sleep quality measures (total sleep time, sleep onset latency, wake after sleep onset, and sleep efficiency) and epoch-by-epoch sleep staging comparison (accuracy, sensitivity, and specificity). This review included only articles that report results of an epoch-by-epoch sleep staging comparison.

PSG is considered the gold-standard method for diagnosing sleep disorders. Physiological signals, including EEG, electrooculography (EOG), electromyography (EMG), and electrocardiography (ECG), are measured during PSG to identify sleep stages. Sleep is classified into N1, N2, N3, and REM stages, each with unique physiological patterns, according to the AASM sleep scoring⁴⁶. The N1 and N2 stages are often combined and referred to as light sleep, whereas N3 is considered deep sleep. However, manual sleep staging may not be perfectly consistent across different scorers. The agreement among scorers for sleep staging ranged from 78.9%⁶¹ to 82.6%⁶². Before 2007, the standard for classifying sleep stages was the one developed by Rechtschaffen and Kales⁶³. In this standard, sleep is classified into S1 to S4, REM, and movement time. Generally, S1 and S2 correspond to N1 and N2, S3 and S4 together correspond to N3, and REM stays REM. However, significant differences between the two manuals have been identified⁶⁴, so data scored with different manuals have to be combined with care.
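The stage correspondences described above can be written down as simple lookup tables. This is only the approximate mapping stated in the text; as noted, the two manuals differ in more than naming. The label strings are illustrative, and leaving movement time unscored is an assumption, since no equivalent stage is given for it here.

```python
from typing import Optional

# Rechtschaffen and Kales stages mapped to AASM stages (approximate).
RK_TO_AASM = {"W": "W", "S1": "N1", "S2": "N2", "S3": "N3", "S4": "N3", "REM": "REM"}

# AASM stages collapsed to the four classes used by many wearables.
AASM_TO_FOUR = {"W": "wake", "N1": "light", "N2": "light", "N3": "deep", "REM": "REM"}

# Four classes collapsed to the three-class scheme (wake vs. NREM vs. REM).
FOUR_TO_THREE = {"wake": "wake", "light": "NREM", "deep": "NREM", "REM": "REM"}


def rk_to_four_stage(stage: str) -> Optional[str]:
    """Map an R&K label to wake/light/deep/REM, or None if it has no equivalent.

    Assumption: movement time ("MT") and unknown labels are left unscored.
    """
    aasm = RK_TO_AASM.get(stage)
    return AASM_TO_FOUR[aasm] if aasm is not None else None


hypnogram = ["W", "S1", "S2", "S3", "S4", "REM", "MT"]
print([rk_to_four_stage(s) for s in hypnogram])
```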

Sleep evaluation faces several limitations: PSG, the gold standard measurement device, is bulky and inconvenient, and existing studies using actigraphy, a widely used alternative, have shown limitations in detecting wake episodes and providing more detailed sleep staging. However, Ryser et al.⁶⁵ have recently demonstrated a more reliable approach for correctly classifying wake epochs. New generations of wearables, with multiple sensors for PPG or temperature, aspire to overcome these limitations and provide more detailed sleep staging from unobtrusive devices using more advanced algorithms.

The current review acknowledges certain limitations that should be taken into consideration. Firstly, although a thorough search was conducted across three platforms (IEEE Xplore, PubMed, and Embase), it is important to note that there is a possibility of missing out on relevant articles. Secondly, some of the selected articles did not report accuracy as a primary outcome, but other results like sensitivity, specificity or total durations of sleep and wake. This may impact the overall representation of the findings in the final table, potentially influencing the interpretation of the results. These limitations, though present, do not undermine the value of this review, but rather highlight the importance of future research to report all outcome values and address any potential gaps to enhance our understanding of the topic.

We identified two main evaluation metrics for sleep wearables: total duration of sleep and wake time and epoch-by-epoch sleep classifier evaluation. These metrics are often reported in relation to PSG or EEG measurements and sometimes in combination with actigraphy devices. However, the reported metrics need to be treated with caution due to various sources of error, such as data synchronization issues and variable sleep staging epoch lengths. We decided to focus on articles reporting epoch-by-epoch results as these results contain the most information about the performance of classifiers.

Our in-depth analysis of the algorithms for sleep staging with multiple sensor inputs, especially the addition of PPG features to machine learning models, shows promising results. Feature selection has been shown to be crucial for the development of a sleep staging classifier. In addition to features extracted from the accelerometer and the PPG data, some further features, such as temperature, were used. Additionally, features that were not derived from sensors, such as circadian features and statistical information, were included. A recent study⁶⁶ demonstrated that the breathing rate can be extracted from an accelerometer positioned on the chest. This extracted breathing rate could be used as another feature for sleep staging classifiers.

However, most of the reviewed articles did not provide insight into the algorithms used for sleep staging, as they were proprietary algorithms provided by the manufacturer. This makes it hard to compare the same device in two different studies and may be a cause for differences. Furthermore, access to sleep staging epochs is often limited, and the authors of the articles had to rely on the manufacturer to provide them. Consequently, for many of the in-depth analysis articles, the data were provided by or associated with the manufacturer of the device.

While our primary focus is on wearables, it is essential to recognize that the field of sleep evaluation continues to evolve. Recent research has also evolved beyond traditional wearables, exploring sleep staging from sound analysis^67,68. Although not within the scope of this article, sound-based sleep staging methods, which analyze audio data during sleep, offer a promising avenue for non-intrusive assessment of sleep quality and staging. Future studies might explore combinations between sound-based sleep monitoring and wearable technologies to further enhance the accuracy and comprehensiveness of sleep evaluation.

Further research and standardization of the framework⁶⁰ are necessary to evaluate the benefits of including multiple sensors in wearables for reliable sleep staging. This requires access to epoch-by-epoch data and knowledge of the algorithms used. Moreover, a deeper understanding of the important features measured by wearables should be addressed. The data sets used should put special emphasis on heterogeneous field participants, including varying ages, different ethnicities, and a balanced gender distribution. Further emphasis should be placed on investigating the performance of wearables for sleep disorders and other comorbidities.

After conducting this literature review the following is recommended for future work:

Conduct validation studies to evaluate algorithm performance, particularly when involving diverse participants with sleep disorders (like insomnia or sleep apnea) and comorbidities (like pychiatric disorders). Implementing equity, diversity and inclusion will enhance the generalizability of the findings and allows for a comprehensive assessment of the algorithm’s effectiveness in real-world scenarios. As it can be seen from Fig. 2, most of the studies were conducted with only healthy participants. The sample size of the articles reported in this review range from 6 to 118 participants. Where the average number of participant is 42.6. In order to achieve generalization it is important to have a reasonable large dataset which should contain more than 50 participants. In general we recommend using the article of Bujang and Adnan⁶⁹ to calculate the suitable sample size.
Compare commercially available multi-stage devices across studies to validate their performance. The validation process plays a pivotal role in ensuring the reliability and accuracy of multistage devices in detecting sleep stages, while also providing valuable insights into the performance of diverse algorithms. Through systematic evaluation across multiple studies, researchers can acquire a comprehensive understanding of the strengths, limitations, and areas for improvement of these devices. As it can be seen from the Table 1, only a fraction of all available wearables doing sleep staging have been validated in independent studies to validate their performance.
Conduct investigations to thoroughly explore and understand the significant features measured by wearable sensors, such as accelerometer, PPG, temperature, and other non-sensor-based features. By delving into these features, researchers can gain insights into their respective contributions and potential synergies in assessing sleep quality and stages. Understanding the characteristics, strengths, and limitations of each sensor-based and non-sensor-based feature enables researchers to make informed decisions regarding their inclusion in algorithms and data analysis pipelines. The necessity for more investigation in features arise from the fact that only 20% of all articles reported the used algorithm (Fig. 6) and in total only 5 articles described the used features.
Consistently report sensor specifications (type, resolution, measurement range), validation details (sensor input, epoch length) and performance metrics (accuracy, sensitivity, specificity) for transparency and comparisons⁶⁰. For example, sleep data is typically more abundant than wake data in sleep studies, as individuals spend a significant portion of their time asleep. This data asymmetry could impose bias in the algorithm toward having a higher likelihood of correctly identifying sleep stages but may have more difficulty accurately classifying wakefulness. In the following unbiased metrics should be used to report the performance of a classifier, especially the Matthews correlation coefficient⁷⁰.
Cultivate the open-source availability of classifier code for independent validation and research collaboration. This facilitates rigorous peer review and enables researchers to examine the algorithm’s methodology in depth. It also allows other researchers to reproduce the results, conduct comparative analyses, and build upon existing work.
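
To illustrate the recommendation on unbiased metrics, the following minimal Python sketch computes accuracy, sensitivity, specificity, and the Matthews correlation coefficient (MCC) from a binary sleep/wake confusion matrix. The function name `binary_metrics` and the epoch counts are hypothetical; the counts are chosen only to mimic a recording in which sleep epochs far outnumber wake epochs and are not taken from any reviewed study.

```python
import math


def _ratio(numerator, denominator):
    """Divide safely; return NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(tp, fp, tn, fn):
    """Metrics for sleep (positive class) versus wake (negative class).

    tp: sleep epochs scored as sleep    fn: sleep epochs scored as wake
    tn: wake epochs scored as wake      fp: wake epochs scored as sleep
    """
    total = tp + fp + tn + fn
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, total),
        "sensitivity": _ratio(tp, tp + fn),  # share of sleep epochs detected
        "specificity": _ratio(tn, tn + fp),  # share of wake epochs detected
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical recording (assumption): 900 sleep epochs and 100 wake epochs.
example = binary_metrics(tp=890, fp=70, tn=30, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

In this made-up example, accuracy stays high even though most wake epochs are misclassified as sleep, whereas specificity and the MCC expose the weakness.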

In conclusion, accurate and reliable consumer sleep technology is pivotal for understanding sleep patterns and their impact on health. Our literature review uncovered an increasing trend in utilizing accelerometer and photoplethysmography (PPG) data for sleep assessment, with the integration of PPG features and additional sensors demonstrating enhanced sleep stage classification. To achieve precise sleep stage classification, careful analysis and optimization of data processing, alignment, epoch length, and feature selection are imperative. Collaboration between sleep researchers and device manufacturers is instrumental in refining machine learning models and improving the accuracy of sleep wearables. Further research is required to validate the performance of multi-sensor devices, deepen the understanding of key wearable-based features, and assess their efficacy in sleep disorders and comorbidities. Five recommendations for future work are proposed: (1) validate algorithms with equity, diversity, and inclusion considerations, (2) compare multi-stage device performance, (3) explore the impact of features, (4) report validation details and performance metrics consistently, and (5) promote open-source classifier and data availability. These guidelines could facilitate more precise and reliable sleep assessment, ultimately benefiting individuals’ well-being and advancing the field of sleep research.

Methods

Literature Search and Selection Criteria

We conducted a literature search across IEEE Xplore, PubMed, and Embase, adhering to PRISMA guidelines for systematic reviews⁷¹. The search covered publications from January 2013 to January 2023, focusing on recent developments in sleep assessment using wearable technology. Search terms included ‘sleep’, ‘quality’, ‘efficiency’, ‘assessment’, ‘evaluation’, ‘actigraphy’, ‘accelerometer’, ‘PPG’, ‘photoplethysmogram’, ‘photoplethysmography’, ‘heart rate’, and ‘wearable’. These terms were combined using Boolean operators to capture a broad range of relevant studies. The detailed search terms can be found in the supplemental material (see “Supplementary methods”). The literature review process involved one author (V.B.) conducting the initial search and a second author (M.E.) independently verifying the results.

Inclusion criteria for the review were articles presenting results of wearable devices for sleep evaluation on an epoch-by-epoch basis. Exclusion criteria included duplicate publications, inaccessible articles (lacking full-text availability), studies not relevant to wearable technology, those not assessing sleep metrics or lacking epoch-by-epoch evaluation, as well as review articles and theoretical papers.

Data Analysis and Statistical Approach

For data analysis, we focused on the accuracy of sleep staging classifiers as reported in the selected studies. Despite the potential imbalance in sleep stage datasets (disproportionate representation of sleep versus wake epochs), we chose accuracy because it is widely recognized and readily interpretable in sleep research. The analysis involved compiling the reported accuracies of various devices and algorithms, specifically noting their performance in differentiating between sleep stages such as wake, NREM, REM, light sleep, and deep sleep.
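
As a minimal sketch of what an epoch-by-epoch accuracy represents, the Python snippet below compares a device's stage label with a reference label (for example from polysomnography) for each epoch and reports overall agreement together with the per-stage agreement. The function names and the two short label sequences are hypothetical and serve only as an illustration; the reviewed studies may have computed their accuracies differently.

```python
from collections import Counter


def epoch_accuracy(reference, predicted):
    """Fraction of epochs in which the device label equals the reference label."""
    if len(reference) != len(predicted):
        raise ValueError("Both label sequences must cover the same epochs.")
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)


def per_stage_agreement(reference, predicted):
    """For each reference stage, the fraction of its epochs labelled correctly."""
    totals = Counter(reference)
    hits = Counter(r for r, p in zip(reference, predicted) if r == p)
    return {stage: hits[stage] / totals[stage] for stage in totals}


# Hypothetical labels for eight consecutive epochs (assumption, not study data).
reference_labels = ["wake", "light", "light", "deep", "deep", "rem", "rem", "light"]
device_labels = ["light", "light", "light", "deep", "light", "rem", "light", "light"]

print(epoch_accuracy(reference_labels, device_labels))       # 0.625
print(per_stage_agreement(reference_labels, device_labels))
```

Reporting the per-stage values alongside the overall accuracy shows which stages drive the agreement, which a single accuracy figure can hide.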

A t-test was employed to test for statistically significant differences in classifier accuracies among the reviewed devices and algorithms. This involved calculating mean accuracy values for each device or algorithm and comparing them using the t-test, with the significance level set at p < 0.05. This statistical analysis aimed to identify any significant trends or disparities in the performance of various sleep staging technologies.
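
The comparison described above can be sketched in Python as follows. The text does not state which variant of the t-test was used, so this sketch assumes a two-sided Welch's t-test for two independent groups; the function names and the accuracy values are hypothetical and are not results from the review. The p-value is obtained by numerically integrating the Student t density with Simpson's rule so that only the standard library is needed.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 - 2 * integral of the density from 0 to |t| (Simpson's rule)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1 - 2 * total * h / 3)


def welch_t_test(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom, two-sided p-value)."""
    se_a = variance(a) / len(a)  # squared standard error of each group mean
    se_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1))
    return t, df, two_sided_p(t, df)


# Hypothetical accuracies for two groups of devices (assumption, not study data).
group_a = [0.86, 0.88, 0.84, 0.90, 0.87]
group_b = [0.70, 0.75, 0.68, 0.72, 0.74]

t_stat, dof, p_value = welch_t_test(group_a, group_b)
print(f"t = {t_stat:.2f}, df = {dof:.1f}, significant at 0.05: {p_value < 0.05}")
```

Welch's variant is chosen here because it does not require equal variances in the two groups; a paired or pooled-variance test would need a different formula.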

Data availability

The authors declare that all data supporting the findings of this study are available within this paper.

References

Luyster, F. S., Strollo, P. J., Zee, P. C. & Walsh, J. K. Sleep: a health imperative. Sleep 35, 727–734 (2012).
Figueiro, M. G. & Pedler, D. Cardiovascular disease and lifestyle choices: Spotlight on circadian rhythms and sleep. Prog. Cardiovas. Diseases (2023).
Jung, I. et al. Sleep duration and the risk of type 2 diabetes: a community-based cohort study with a 16-year follow-up. Endocrinol. Metab. 38, 146–155 (2023).
Isayeva, G., Shalimova, A. & Buriakovska, O. The impact of sleep disorders in the formation of hypertension. Arterial Hypertens. 26, 170–179 (2022).
Nutt, D., Wilson, S. & Paterson, L. Sleep disorders as core symptoms of depression. Dialogues in Clinical Neuroscience (2022).
Garbarino, S., Lanteri, P., Bragazzi, N. L., Magnavita, N. & Scoditti, E. Role of sleep deprivation in immune-related disease risk and outcomes. Commun. Biol. 4, 1304 (2021).
Huang, B.-H. et al. Sleep and physical activity in relation to all-cause, cardiovascular disease and cancer mortality risk. Br. J. Sports Med. 56, 718–724 (2022).
Brager, A. J. & Simonelli, G. Current state of sleep-related performance optimization interventions for the e-sports industry. Neurosports 1, 3 (2020).
Worley, S. L. The extraordinary importance of sleep: the detrimental effects of inadequate sleep on health and public safety drive an explosion of sleep research. Pharmacy Ther. 43, 758 (2018).
Rundo, J. V. & Downey III, R. Polysomnography. Handbook Clin. Neurol. 160, 381–392 (2019).
Abad, V. C. & Guilleminault, C. Diagnosis and treatment of sleep disorders: a brief review for clinicians. Dialog. Clin. Neurosci. 5, 371–388 (2003).
Djanian, S., Bruun, A. & Nielsen, T. D. Sleep classification using consumer sleep technologies and ai: A review of the current landscape. Sleep Med. 100, 390–403 (2022).
Baron, K. G. et al. Feeling validated yet? a scoping review of the use of consumer-targeted wearable and mobile technology to measure and improve sleep. Sleep Med. Rev. 40, 151–159 (2018).
Guillodo, E. et al. Clinical applications of mobile health wearable–based sleep monitoring: systematic review. JMIR mHealth and uHealth 8, e10733 (2020).
Kwon, S., Kim, H. & Yeo, W.-H. Recent advances in wearable sensors and portable electronics for sleep monitoring. Iscience 24, 102461 (2021).
Chinoy, E. D. et al. Performance of seven consumer sleep-tracking devices compared with polysomnography. Sleep 44 (2020). https://academic.oup.com/sleep/article/44/5/zsaa291/6055610.
de Zambotti, M. et al. Measures of sleep and cardiac functioning during sleep using a multi-sensory commercially–available wristband in adolescents: wearable technology to measure sleep and cardiac functioning. Physiol. Behav. 158, 143 (2016).
Sridhar, N., Shoeb, A. & Stephens, P. Deep learning for automated sleep staging using instantaneous heart rate. NPJ Dig. Med. 106 (2020).
Fedorin, I., Slyusarenko, K., Lee, W. & Sakhnenko, N. Sleep stages classification in a healthy people based on optical plethysmography and accelerometer signals via wearable devices. Ukraine Conference on Electrical and Computer Engineering 2019 IEEE 1201–1204 (2019).
Toon, E. et al. Comparison of commercial wrist-based and smartphone accelerometers, actigraphy, and PSG in a clinical cohort of children and adolescents. J. Clin. Sleep Med. 12, 343 (2016).
de Zambotti, M., Rosas, L., Colrain, I. M. & Baker, F. C. The sleep of the ring: comparison of the ŌURA sleep tracker against polysomnography. Behav. Sleep Med. 17, 124 (2019).
Pesonen, A. K. & Kuula, L. The validity of a new consumer-targeted wrist device in sleep measurement: an overnight comparison against polysomnography in children and adolescents. J. Clin. Sleep Med. 14, 585 (2018).
Lee, X. K. et al. Validation of a consumer sleep wearable device with actigraphy and polysomnography in adolescents across sleep opportunity manipulations. J. Clin. Sleep Med. 15, 1337 (2019).
Godino, J. G. et al. Performance of a commercial multi-sensor wearable (Fitbit Charge HR) in measuring physical activity and sleep in healthy children. PLoS ONE 15 (2020). https://doi.org/10.1371/JOURNAL.PONE.0237719.
Menghini, L., Yuksel, D., Goldstone, A., Baker, F. C. & de Zambotti, M. Performance of Fitbit Charge 3 against polysomnography in measuring sleep in adolescent boys and girls. Chronobiol. Int. 38, 1010 (2021).
Chee, N. I. et al. Multi-night validation of a sleep tracking ring in adolescents compared with a research actigraph and polysomnography. Nat. Sci. Sleep 13, 177–190 (2021).
Slater, J. A. et al. Assessing sleep using hip and wrist actigraphy. Sleep Biol. Rhythms 13, 172–180 (2015).
Kanady, J. C. et al. Validation of sleep measurement in a multisensor consumer grade wearable device in healthy young adults. J. Clin. Sleep Med. 16, 917 (2020).
Miller, D. J. et al. A validation study of the WHOOP strap against polysomnography to assess sleep. J. Sports Sci. 38, 2631–2636 (2020).
Miller, D. J. et al. A validation study of a commercial wearable device to automatically detect and estimate sleep. Biosensors 11 (2021). https://doi.org/10.3390/BIOS11060185.
Chinoy, E. D., Cuellar, J. A., Jameson, J. T. & Markwald, R. R. Performance of four commercial wearable sleep-tracking devices tested under unrestricted conditions at home in healthy young adults. Nat. Sci. Sleep 14, 493 (2022).
De Zambotti, M., Claudatos, S., Inkelis, S., Colrain, I. M. & Baker, F. C. Evaluation of a consumer fitness-tracking device to assess sleep in adults: evaluation of wearable technology to assess sleep. Chronobiol. Int. 32, 1024 (2015).
Regalia, G. et al. Sleep assessment by means of a wrist actigraphy-based algorithm: agreement with polysomnography in an ambulatory study on older adults. Chronobiol. Int. 38, 400–414 (2020).
Razjouyan, J. et al. Improving sleep quality assessment using wearable sensors by including information from postural/sleep position changes and body acceleration: a comparison of chest-worn sensors, wrist actigraphy, and polysomnography. J. Clin. Sleep Med. 13, 1301 (2017).
Peter-Derex, L. et al. Automatic analysis of single-channel sleep eeg in a large spectrum of sleep disorders. J. Clin. Sleep Med. 17, 393–402 (2021).
Marino, M. et al. Measuring sleep: accuracy, sensitivity, and specificity of wrist actigraphy compared to polysomnography. Sleep 36, 1747 (2013).
Kuo, C. E. et al. Development and evaluation of a wearable device for sleep quality assessment. IEEE Trans. Biomed. Eng. 64, 1547–1557 (2017).
Dong, X. et al. Validation of Fitbit Charge 4 for assessing sleep in Chinese patients with chronic insomnia: A comparison against polysomnography and actigraphy. PLoS ONE 17 (2022). https://doi.org/10.1371/JOURNAL.PONE.0275287.
Cook, J. D., Prairie, M. L. & Plante, D. T. Utility of the Fitbit Flex to evaluate sleep in major depressive disorder: A comparison against polysomnography and wrist-worn actigraphy. J. Affect. Disord. 217, 299–305 (2017).
Mahadevan, N. et al. Development of digital measures for nighttime scratch and sleep using wrist-worn wearable devices. NPJ Dig. Med. 4 (2021). https://doi.org/10.1038/S41746-021-00402-X.
Altini, M. & Kinnunen, H. The promise of sleep: a multi-sensor approach for accurate sleep stage detection using the Oura Ring. Sensors 21 (2021). https://doi.org/10.3390/S21134302.
Ghorbani, S. et al. Multi-night at-home evaluation of improved sleep detection and classification with a memory-enhanced consumer sleep tracker. Nat. Sci. Sleep 14, 645 (2022).
Devine, J. K., Chinoy, E. D., Markwald, R. R., Schwartz, L. P. & Hursh, S. R. Validation of Zulu Watch against polysomnography and actigraphy for on-wrist sleep-wake determination and sleep-depth estimation. Sensors 21, 76 (2020).
Haghayegh, S., Khoshnevis, S., Smolensky, M. H., Diller, K. R. & Castriotta, R. J. Performance comparison of different interpretative algorithms utilized to derive sleep parameters from wrist actigraphy data. Chronobiol. Int. 36, 1752–1760 (2019).
Haghayegh, S., Khoshnevis, S., Smolensky, M. H., Diller, K. R. & Castriotta, R. J. Performance assessment of new-generation Fitbit technology in deriving sleep parameters and stages. Chronobiol. Int. 37, 47–59 (2019).
Berry, R. B. et al. The AASM manual for the scoring of sleep and associated events: rules, terminology and technical specifications Version 2.2. Am. Acad. Sleep Med. (2015) www.aasmnet.org.
Miller, D. J., Sargent, C. & Roach, G. D. A validation of six wearable devices for estimating sleep, heart rate and heart rate variability in healthy adults. Sensors 22 (2022). https://doi.org/10.3390/S22166317.
Roberts, D. M., Schade, M. M., Mathew, G. M., Gartenberg, D. & Buxton, O. M. Detecting sleep using heart rate and motion data from multisensor consumer-grade wearables, relative to wrist actigraphy and polysomnography. Sleep 43, 1–19 (2020).
Stucky, B. et al. Validation of Fitbit Charge 2 sleep and heart rate estimates against polysomnographic measures in shift workers: Naturalistic study. J. Med. Int. Res. 23 (2021). https://doi.org/10.2196/26476.
Beattie, Z. et al. Estimation of sleep stages in a healthy adult population from optical plethysmography and accelerometer signals. Physiol. Measur. 38, 1968 (2017).
Walch, O., Huang, Y., Forger, D. & Goldstein, C. Sleep stage prediction with raw acceleration and photoplethysmography heart rate data derived from a consumer wearable device. Sleep 42 (2019). https://doi.org/10.1093/SLEEP/ZSZ180.
Pigeon, W. R. et al. Validation of the sleep-wake scoring of a new wrist-worn sleep monitoring device. J. Clin. Sleep Med. 14, 1057 (2018).
de Zambotti, M., Goldstone, A., Claudatos, S., Colrain, I. M. & Baker, F. C. A validation study of Fitbit Charge 2™ compared with polysomnography in adults. Chronobiol. Int. 35, 465–476 (2017).
Cole, R. J., Kripke, D. F., Gruen, W., Mullaney, D. J. & Gillin, J. C. Automatic sleep/wake identification from wrist activity. Sleep 15, 461–469 (1992).
Jean-Louis, G., Kripke, D. F., Mason, W. J., Elliott, J. A. & Youngstedt, S. D. Sleep estimation from wrist movement quantified by different actigraphic modalities. J. Neurosci. Methods 105, 185–191 (2001).
Sadeh, A., Sharkey, K. M. & Carskadon, M. A. Activity-based sleep-wake identification: an empirical test of methodological issues. Sleep 17, 201–207 (1994).
Fekedulegn, D. et al. Actigraphy-based assessment of sleep parameters. Ann. Work Exp. Health 64, 350–367 (2020).
Te Lindert, B. H. & Van Someren, E. J. Sleep estimates using microelectromechanical systems (MEMS). Sleep 36, 781–789 (2013).
Khosla, S. et al. Consumer sleep technology: An American Academy of Sleep Medicine position statement. J. Clin. Sleep Med. 14, 877–880 (2018).
Menghini, L., Cellini, N., Goldstone, A., Baker, F. C. & De Zambotti, M. A standardized framework for testing the performance of sleep-tracking technology: step-by-step guidelines and open-source code. Sleep 44 (2021). https://doi.org/10.1093/SLEEP/ZSAA170.
Younes, M., Raneri, J. & Hanly, P. Staging sleep in polysomnograms: analysis of inter-scorer variability. J. Clin. Sleep Med. 12, 885–894 (2016).
Rosenberg, R. S. & Van Hout, S. The American Academy of Sleep Medicine inter-scorer reliability program: sleep stage scoring. J. Clin. Sleep Med. 9, 81–87 (2013).
Rechtschaffen, A. & Kales, A. A manual of standardized terminology, techniques and scoring system for sleep stages of human subjects (U. S. National Institute of Neurological Diseases and Blindness, Neurological Information Network Bethesda, Md, 1968).
Moser, D. et al. Sleep classification according to AASM and Rechtschaffen & Kales: Effects on sleep scoring parameters. Sleep 32, 139 (2009).
Ryser, F., Gassert, R., Werth, E. & Lambercy, O. A novel method to increase specificity of sleep-wake classifiers based on wrist-worn actigraphy. Chronobiol. Int. (2023). https://doi.org/10.1080/07420528.2023.2188096.
Ryser, F., Hanassab, S., Lambercy, O., Werth, E. & Gassert, R. Respiratory analysis during sleep using a chest-worn accelerometer: a machine learning approach. Biomed. Signal Process. Control 78, 104014 (2022).
Hong, J. et al. End-to-end sleep staging using nocturnal sounds from microphone chips for mobile devices. Nat. Sci. Sleep 14, 1187–1201 (2022).
Xue, B. et al. Non-contact sleep stage detection using canonical correlation analysis of respiratory sound. IEEE J. Biomed. Health Inf. 24, 614–625 (2020).
Bujang, M. A. & Adnan, T. H. Requirements for minimum sample size for sensitivity and specificity analysis. J. Clin. Diagnostic Res. (2016). https://doi.org/10.7860/jcdr/2016/18129.8744.
Chicco, D. & Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over f1 score and accuracy in binary classification evaluation. BMC Genomics 21 (2020). https://doi.org/10.1186/s12864-019-6413-7.
Moher, D., Liberati, A., Tetzlaff, J. & Altman, D. G. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. BMJ 339, 332–336 (2009).

Acknowledgements

Open access funding provided by Swiss Federal Institute of Technology Zurich.

Author information

These authors contributed equally: Vera Birrer, Mohamed Elgendi.

Authors and Affiliations

Biomedical and Mobile Health Technology Laboratory, Department of Health Sciences and Technology, ETH Zurich, Zurich, Switzerland
Vera Birrer, Mohamed Elgendi & Carlo Menon
Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland
Vera Birrer
Rehabilitation Engineering Laboratory, Department of Health Sciences and Technology, ETH Zurich, Zurich, Switzerland
Olivier Lambercy

Authors

Vera Birrer
Mohamed Elgendi
Olivier Lambercy
Carlo Menon

Contributions

M.E. designed and led the study. V.B., M.E., O.L., and C.M. conceived the study. The literature search was carried out by two reviewers, V.B. and M.E. Both reviewers collaborated in constructing the protocol and developing the search terms. V.B. conducted the initial literature search, while M.E. independently confirmed the eligibility of articles, performed the screening of included articles, and verified the extracted data. O.L. contributed valuable clinical insights regarding sleep monitoring. M.E. directly supervised the work of V.B. M.E. and V.B. contributed equally to this work and share first authorship. All authors have read and agreed to the published version of the manuscript.

Corresponding authors

Correspondence to Mohamed Elgendi or Carlo Menon.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplemental Material

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Birrer, V., Elgendi, M., Lambercy, O. et al. Evaluating reliability in wearable devices for sleep staging. npj Digit. Med. 7, 74 (2024). https://doi.org/10.1038/s41746-024-01016-9

Received: 21 June 2023
Accepted: 18 January 2024
Published: 18 March 2024
DOI: https://doi.org/10.1038/s41746-024-01016-9