Sleep classification from wrist-worn accelerometer data using random forests

Sundararajan, Kalaivani; Georgievska, Sonja; te Lindert, Bart H. W.; Gehrman, Philip R.; Ramautar, Jennifer; Mazzotti, Diego R.; Sabia, Séverine; Weedon, Michael N.; van Someren, Eus J. W.; Ridder, Lars; Wang, Jian; van Hees, Vincent T.

doi:10.1038/s41598-020-79217-x

Download PDF

Article
Open access
Published: 08 January 2021

Sleep classification from wrist-worn accelerometer data using random forests

Kalaivani Sundararajan¹,
Sonja Georgievska¹,
Bart H. W. te Lindert²,
Philip R. Gehrman³,
Jennifer Ramautar²,
Diego R. Mazzotti⁴,
Séverine Sabia^5,6,
Michael N. Weedon⁷,
Eus J. W. van Someren²,
Lars Ridder¹,
Jian Wang⁸ &
…
Vincent T. van Hees^1,9

Scientific Reports volume 11, Article number: 24 (2021) Cite this article

17k Accesses
54 Citations
11 Altmetric
Metrics details

Subjects

Abstract

Accurate and low-cost sleep measurement tools are needed in both clinical and epidemiological research. To this end, wearable accelerometers are widely used as they are both low in price and provide reasonably accurate estimates of movement. Techniques to classify sleep from the high-resolution accelerometer data primarily rely on heuristic algorithms. In this paper, we explore the potential of detecting sleep using Random forests. Models were trained using data from three different studies where 134 adult participants (70 with sleep disorder and 64 good healthy sleepers) wore an accelerometer on their wrist during a one-night polysomnography recording in the clinic. The Random forests were able to distinguish sleep-wake states with an F1 score of 73.93% on a previously unseen test set of 24 participants. Detecting when the accelerometer is not worn was also successful using machine learning ($\hbox {F1-score} > 93.31\%$), and when combined with our sleep detection models on day-time data provide a sleep estimate that is correlated with self-reported habitual nap behaviour ($\hbox {r}=.60$). These Random forest models have been made open-source to aid further research. In line with literature, sleep stage classification turned out to be difficult using only accelerometer data.

Wearable sensors enable personalized predictions of clinical laboratory measurements

Article 24 May 2021

Machine learning prediction of sleep stages in dairy cows from heart rate and muscle activity measures

Article Open access 25 May 2021

40 years of actigraphy in sleep medicine and current state of the art algorithms

Article Open access 24 March 2023

Introduction

Sleep quality and duration play an important role in human health¹. Accurate methods for sleep assessment are needed to monitor the prevalence of poor sleep, to increase our understanding of the relation between sleep and health, and to design effective treatments for insomnia. Additionally, assessment methods need to have a high user-acceptability to reduce the risk of participant dropouts leading to selection bias.

The gold standard for sleep measurement, polysomnography (PSG), is prohibitively expensive and unfeasible for use in large scale population research. On the other hand, the much more feasible sleep diaries can provide information on time in bed but they are subject to recall bias and might be less relevant to assess time slept during this period. Therefore, wearable accelerometers have been explored since the mid-1990s as a possible alternative for multi-day real life (out of the lab) sleep monitoring.

To cope with memory and battery constraints in the early devices, data was pre-processed inside the device. Further, these devices had in common that they relied on piezo-electric acceleration sensors not sensitive to gravitational acceleration under static conditions. Technological advancements in the mid-2000s led to a new generation of accelerometers, referred to as raw data accelerometry, which was based on Micro Electro-Mechanical-Systems (MEMS) and able to store up to a week of digitised but otherwise unprocessed data in memory to facilitate offline analysis. These modern accelerometers are sensitive to gravitational acceleration under static conditions.

Offline access to raw data enabled revisiting the entire data processing pipeline as better algorithms emerge over time, which is needed to facilitate longitudinal studies sometimes spanning a lifetime. Further, access to raw data increased the ability to standardise analysis across studies to allow more meaningful comparisons. As a result, raw data accelerometry is now widely used by the health research community^2,3,4.

Cole-Kripke⁵, Sadeh⁶, and Oakley⁷ proposed sleep detection algorithms for the accelerometer in the 1990s. Their algorithms had in common that data was pre-processed onboard the device towards a 30-second aggregate, called count. Cole and Sadeh derived counts with a zero-crossing technique, while Oakley derived counts with an amplitude-based technique. Cole, Sadeh, and Oakley, used a 7, 11, and 5 min time window for count-based sleep detection, respectively^5,6.

Borazio et al. proposed the Estimation of Stationary Sleep-segments (ESS) algorithm for raw data accelerometry, which aims to detect segments of idleness quantified as a low standard deviation per second lasting for at least 10 min⁸. Next, van Hees et al.⁹ proposed an algorithm that relied on the estimated orientation angle of the accelerometer, based on the detection of time segments where the estimated angle of the accelerometer relative to gravity does not change beyond 5$^\circ$ for at least 5 min. This approach facilitated easier interpretation compared to the conventional approaches based on the magnitude of acceleration and zero-crossing counts. This heuristic algorithm is now extensively used in the research community^1,10,11,12. More recently Trevenen et al. used machine learning to perform sleep classification. They extracted a variety of features from the acceleration vector magnitude and used these as input for a Hidden Markov Model (HMM) to classify sleep versus wakefulness, as well as to discriminate all four sleep stages¹³. The novel attempt to classify sleep stages from accelerometer-only data resulted in poor classification performance and was not able to accurately detect REM nor discriminate between Non-REM stages. Nonetheless, their conclusion about the potential for sleep stage classification was optimistic. Finally, Barouni and colleagues proposed a heuristic approach for sleep classification from raw data accelerometry, but mainly followed the approach used for traditional count-based accelerometers use kinematically hard to interpret the threshold crossing of the magnitude of acceleration¹⁴. Additionally, it should be noted that Willetts et al. used the term sleep classification in their work but relied on wearable cameras as criterion method. Wearable cameras are not able to distinguish sleep from wakefulness, by which their sleep detection claim is inaccurate¹⁵.

Complementary to sleep stage classification, information on day-time nap behaviour is also of interest. The assessment of nap behaviour is challenged by potential removal of the accelerometers for episodes during the day, since the existing heuristic nonwear detection algorithms¹⁶ was designed to only detect nonwear segments lasting for at least an hour.

Although the heuristic approaches have proven their value, their performance does in principle not improve when more data becomes available. In this paper, we explore the potential of random forests machine learning as a more data-driven approach to improve sleep-wake and wear-nonwear classification. Our approach uses data acquired from 158 participants from three different studies representing a wide age range and including both healthy sleepers and those with sleep disorders. The performance of these machine learning models was assessed by cross-validation using data from 134 participants (64 healthy sleepers and 70 with sleep disorders, age range 20–72)^9,17,18. We then report the performance of our trained models on previously unseen test data from 24 remaining participants (16 healthy sleepers and 8 with sleep disorders). These trained models have been made open-source available to aid further sleep research. When used in combination, both sleep detection and nonwear detection approaches may be useful for daytime nap detection, which we evaluate in 109 separate individuals with accelerometer data collected in real life (out of the lab) where self-reported napping behaviour is available. Furthermore, we investigate the possibility of predicting four sleep stages (rapid eye movement (REM) sleep and Non-REM sleep stages N1, N2, and N3), which is not feasible with the current heuristic approaches. Reliable detection of sleep stages from wearable accelerometer data would advance sleep research as it provides an additional level of sleep description.

Results

The number of samples corresponding to wakefulness and different sleep stages among the assessed 30 second intervals are shown in Table 1.

Table 1 Samples per class in the data (Percentages in parentheses).

Full size table

Sleep–wake classification

For sleep–wake classification, samples labeled as N1, N2, N3, and REM are considered as Sleep samples.

vanHees approach

The vanHees heuristic algorithm described in the “Methods” section is applied to the accelerometer data to obtain a binary classification of wakefulness or sleep. The classification performance of the method in the outer cross-validation and test set are reported in Table 2.

Random forests

The chosen hyperparameters for the trained random forests models for Sleep–Wake classification for each of the five cross-validation folds are given as (trees, max-depth) tuple—(400,Full), (500,Full), (400,Full), (200,Full), (200,Full) where max-depth of Full implies that the decision trees are allowed to grow to any depth till termination criteria are satisfied. The classification performance metrics across the five outer folds are averaged and reported in Table 2 along with test set performance. It can be observed that the random forests approach outperforms the vanHees approach on both outer cross-validation and test data. Specifically, random forests perform better at detecting wakefulness compared to the vanHees approach as seen in the confusion matrices in Fig. 1, though the number of sleep–Wake samples is heavily imbalanced.

The important features for sleep–wake classification averaged across all folds are shown in Supplementary Information Figure 4. For more information on feature definition see METHODS section. It can be observed that statistical measures for Locomotor Inactivity During Sleep (LIDS) and Z-angle are the most important features for sleep-wake classification.

Table 2 Binary sleep–wake and nonwear–wear classification.

Full size table

Healthy versus poor sleepers

Figure 2 shows the plots of F1-score with respect to time spent sleeping for each user based on whether they are poor or healthy sleepers. Red markers denote F1-scores of sleep and green markers denote F1-scores of wakefulness for each user. We obtained the Spearman’s correlation coefficient for the time spent sleeping and F1-scores for healthy and poor sleepers. For healthy sleepers, it was observed that wakefulness F1-scores and time spent sleeping were negatively correlated due to fewer wake samples causing poor wakefulness classification. For poor sleepers, time spent sleeping was positively correlated with sleep F1-scores and negatively correlated with wakefulness F1-scores.

Nonwear classification

The chosen hyperparameters for the trained random forests models for Nonwear classification for each of the five cross-validation folds are given as (trees, max-depth) tuple—(500,15), (100,15), (200,15), (300,20), (500,15). The nonwear classification performance metrics across the five outer folds are averaged and reported in Table 2 along with test set performance. It can be seen that nonwear classification using random forests performs quite well on both the outer cross-validation data and previously unseen test set.

The confusion matrices of nonwear classification are shown in Fig. 1 for the test set. The numbers in the matrices indicate the percentage of samples from the true class that was classified as the predicted class. It can be seen that Wear periods are predicted reliably whereas Nonwear periods tend to be confused with Wear periods by 11%.

The important features for nonwear classification averaged across all folds with random forests are shown in Supplementary Information Figure 7. It can be observed that statistical measures for LIDS and Z-angle are the most important features for nonwear classification.

Sleep stage classification

Various classes in sleep classification can be organized into a hierarchy of classes as described in METHODS. Classifying accelerometer samples according to this hierarchy might help understand the discriminative properties (if any) of data to perform nonwear detection, sleep-wake, and sleep stage classification. Hence, we perform hierarchical classification of samples using the 36 engineered features and random forests.

The hierarchical classification performance metrics across the five outer folds are averaged and reported in Table 3 along with test set performance. Unlike F1-score computation of flat classification, F1-score of hierarchical classification takes all correctly classified ancestor classes of the hierarchy into account. It can be observed that the prediction performance drops as we go down the hierarchy. Levels 3, 4, and 5 which consist of leaf nodes like REM, N1, and N3 show a drastic reduction in performance.

Table 3 Hierarchical classification.

Full size table

The confusion matrices of true classes versus predicted classes are shown in Fig. 3. It can be observed that classes Wear, Sleep, NREM, N1 $+$ N2 and N2 seem to be predicted more frequently than other classes. The low diagonal values of REM, N3, and N1 show that it is difficult to discriminate between NREM & REM, N1 $+$ N2 & deep sleep (N3) and N1 & N2. Further, most samples classified as Sleep seem to be further classified as N2 which shows that N2 dominates the classification despite balancing the data with synthetic samples during training.

Nap detection

Self-reported nap duration per week was 13 min less ($\hbox {t} = -0.36, \hbox {p} = 0.72$) compared with accelerometer-based estimates with a correlation of 0.60 ($\hbox {p} < .00001, \hbox {N}= 109$). A Figure of the corresponding data points can be found in the Supplementary information.

Discussion

Based on our experiments, we infer that machine learning approaches such as random forests applied to accelerometer-only data improves the sleep–wake classification compared to the approaches proposed in 1990s^5,6 and as well as the heuristic algorithm proposed by vanHees⁹. Our machine learning approach also enables nonwear detection at a higher time resolution than the vanHees approach⁹. The combination of these enhancements enables us to estimate daytime napping periods. However, the current findings should be seen as an encouragement of further research around nap detection and not as proof to justify immediate application in sleep research. Sleep stage classification from accelerometer data proved to be more challenging due to the absence of discriminative features in the data.

Our sleep classification approach is most similar to that of Trevenen et al.¹³. Based on an exploratory analysis of our data, we observed that transition between sleep states is rare compared to remaining in the same state for prolonged periods of time. Hence, when using approaches like Hidden Markov Models (HMM) as in Trevenen et al.¹³, the transition probability matrix is heavily diagonal and HMMs do not provide any advantage under such scenarios. Therefore, we trained discriminative models for every time interval based on engineered features. However, HMMs inherently ensure that the predicted sequences adhere to the transition probabilities and hence prevents spurious sleep state predictions. To have a similar effect for preventing spurious predictions, we smoothed the prediction probabilities of the individual random forest models over a 5-minute rolling window.

Further, while Trevenen et al. used data collected from healthy 22-year old individuals, our experiments are based on a more challenging, yet more heterogeneous dataset collected from a wide range of ages and includes participants with sleep disorders.

We explored random forest-based nonwear detection to gain insight into the potential for daytime nap detection. Nonwear detection was found to be acceptably accurate. By feeding the classifier both sleep and nonwear data we offered the classifier a challenging task. If we had trained it using data corresponding to nonwear and a person performing activities the classification task would have been easy but not representative of real-life nonwear detection. Future research is needed to identify generic purpose non-wear detection able to both assist in the distinction of daytime naps and the identification of large episodes of nonwear or sleep.

Sleep stage classification was expected to be challenging at the outset of the study. However, the reason why we still explored it is that even a weak classification could be of value in large scale population studies, e.g. UK Biobank¹⁰, where minor effects become only visible when averaged over a large number of individuals. Sleep stage classification may only be realistic with complementary sensor data, e.g. photoplethysmogram (PPG), which was outside the scope of this study as we focus on solutions for the already widely collected accelerometer-only data.

The positive correlation of 0.60 and the lack of a statistical difference between average self-reported habitual napping duration and estimates from our ensemble of accelerometer-based random forest models are encouraging. However, based on these data alone it is hard to say whether the observed individual differences are explained by the subjective nature of a questionnaire, the discrepancy between the questionnaire that asks about habitual behavior and an accelerometer recording corresponding to nine specific days, or the precision of the random forest models. Therefore, further research is warranted involving a more direct comparison, e.g. with video observation.

Most of the data used in this study was collected with the GENEActiv accelerometer brand. Future studies should consider the potential of model transferability across accelerometer brands. Previous research indicates that data is highly comparable across accelerometer brands¹⁹, but confirmation of these specific outcomes is desired.

Models were trained and tested across three different datasets. The PSG data was scored by a different sleep technician at every site, each site had its own PSG equipment, and participants at each site had different demographics. It could be hypothesized that the models have therefore become more robust against signal artifacts related to these experimental differences.

The present study does not look at detecting the beginning and the end of the night (sleep period time window), which is a different but related challenge we looked at in van Hees et al.²⁰.

Raw data accelerometry faces the same challenges as traditional actigraphy in not being able to capture most physiological processes that underlie sleep. Our present work does not prove, or even attempt to prove, that raw data offers more accurate sleep detection than traditional actigraphy. The main advantage of raw data is that it offers increased scientific transparency and can be re-processed for many purposes beyond sleep research alone.

Our work shows that random forests can help to enhance the sleep classification relative to the currently open-source available method by van Hees et al.⁹, and the Sadeh and Cole-Kripke implementation by Hammad et al. ²¹. Sleep researchers will have to decide whether they prefer a more accurate but less interpretable random forest model or a less accurate model by vanHees, Sadeh, or Cole-Kripke. In an earlier publication, we argue that the vanHees heuristic model is more kinematically interpretable compared with conventional algorithms that rely on zero-crossing counts or magnitude of acceleration⁹. Whether the Sadeh and Cole-Kripke algorithms offer better methodological consistency with historical research is difficult to say since the piezo-electric acceleration sensors as used in the 1990s have been replaced by MEMS-based capacitive sensors in the 2000s that have a wider frequency response. We are not aware of any studies that investigate the comparability of Sadeh or Cole-Kripke algorithm output across these hardware generations.

There are also important clinical implications of these results. The assessment of sleep/wake patterns for the diagnosis of sleep and circadian rhythm disorders often requires polysomnography, which is expensive and labor-intensive. Accelerometry is sometimes used as a less expensive form of assessment but current algorithms are limited in their accuracy, particularly in patients with insomnia. Improved algorithms have the potential to make accelerometry a more clinically-useful assessment tool that would permit the measurement of sleep and wake over extended periods of time. This approach could also be implemented more easily than polysomnography in non-sleep clinic settings. Future studies are warranted to investigate the physiology behind misclassifications in order to better understand how sleep classifier performance may vary across specific sleep disorders.

Methods

The raw accelerometer data was extracted from binary files obtained with different accelerometer brands using R package GGIR²². The raw data was then preprocessed using GGIR algorithms for signal calibration relative to gravitational acceleration²³ and alignment of PSG assessment labels with processed data. Next, we explored random forests machine learning to perform sleep-wake, nonwear-wear, and sleep stage classification. Additionally, we explored the value of sleep-wake and nonwear-wear classification to identify daytime naps.

Random forests

The same random forests approach was used for sleep-wake, nonwear-wear, and sleep stage classification. In our initial exploration of the data, we experimented with deep learning techniques but as the results were not better than the vanHees approach we decided to report them in the Supplementary Information to this paper.

Signal features

In our models, we used 36-dimensional features encompassing twelve different statistical measures, listed in Table 4, applied to three derived signals calculated from the three accelerometer axes, $a_x, a_y,$ and $a_z$:

ENMO : The Euclidean Norm Minus One (ENMO) with negative values rounded to zero in g has been shown to correlate with the magnitude of acceleration and human energy expenditure¹⁶. ENMO is computed as follows:
$$\begin{aligned} ENMO = max(0, \sqrt{a_x^2 + a_y^2 +a_z^2} - 1) \end{aligned}$$
(1)
Z-angle: Z-angle, computed using Eq. 3, corresponds to the angle between the accelerometer axis perpendicular to the skin surface and the horizontal plane. As described in “vanHees approach”, any change (or lack of change) in the z-angle over successive time intervals may be an indicator of posture change.
LIDS: Locomotor Inactivity During Sleep (LIDS)²⁴ involves a non-linear conversion of locomotor activity and has shown to be sensitive to ultradian sleep cycles. The original paper did not make use of raw data accelerometry. In this work, LIDS is computed as follows:
$$\begin{aligned} LIDS = \frac{100}{activity\ count + 1} \end{aligned}$$
(2)
where $activity\ count$ is computed using a 10-minute moving sum over $max(0,ENMO - 0.02)$. LIDS is then smoothed using moving average over a 30-min window.

For each 30 s interval, we computed 36-dimensional features which were then used to train the random forest.

Table 4 Statistical measures applied to derived signals.

Full size table

Imbalanced data

We observed from our data that some labels occur more frequently than others leading to an imbalanced dataset. Such data needs to be handled with care when used with machine learning models since the model might learn to always predict the class with the majority of samples. A typical workaround is to undersample or oversample the training samples belonging to various classes such that the model is trained with roughly equal number of samples from each class. In this paper, we followed oversampling of classes using Synthetic Minority Over-sampling Technique (SMOTE)²⁵. SMOTE generates new samples by interpolation of random samples with their nearest neighbors. In our work, we used the SMOTE implementation in the imbalanced-learn python package²⁶ with a sampling strategy to resample all classes to have roughly equal number of training samples.

Performance metrics

As our data is heavily imbalanced, the classification performance of our experiments was evaluated using F1-score and Average Precision, i.e. area under the Precision-Recall curve. Note that SMOTE was only applied to the training data, this why we still need to account for data imbalance in the performance evaluation. F1-score is the harmonic mean of precision and recall with high F1-scores indicating good classification performance. F1-scores of individual classes are averaged to obtain the overall F1-score, i.e. macro-averaging, to treat all classes equally. Additionally, we report Cohen’s Weighted Kappa coefficient for Sleep-wake classification results to facilitate comparisons with other studies²⁷.

F1-scores are computed using predicted classes chosen with specific thresholds. However, the precision-recall curve gives a better picture of the classification performance since it plots recall vs precision by varying thresholds. Better classification performance is indicated by curves tending towards the top right. The area under the precision-recall curve, i.e. Average Precision, gives a quantifiable measure of performance with Average Precision of 1 indicating best performance.

Training and evaluation

The resampled features were used to train random forests models²⁸. Classification using Random forests works by training multiple decision trees with subsets of the data and averaging the decision tree outputs to address overfitting. The features were normalized to have zero mean and unit standard deviation before training.

We used a nested cross-validation approach, involving: fivefold inner cross-validation to optimise hyper parameters, and a fivefold outer cross-validation to obtain generalisation performance. This means that 5 $\times$ 5 models were trained in the process, out of which five models from the outer cross-validation can be used as an ensemble on new data.

In the inner cross-validation, a randomized hyperparameter search was used to choose the number of trees from (100, 150, 200, 300, 400, 500) and the tree depth from (5, 10, 15, 20, Full). Other random forests parameters were retained as default as specified by the scikit-learn package. Each inner cross-validation fold splits the training data into training and validation data (4:1) where the validation data is used to choose the optimal hyperparameters i.e. number of trees and tree depth, for each fold based on Average Precision. Hence, each inner cross-validation fold will use a different random forests model tuned optimally for its corresponding training data. The outer cross-validation is used to obtain both F1-scores and Average Precision (AP) for generalization performance. For the outer cross-validation, the data is split into training and validation partitions such that participants in both partitions do not overlap. This ensures that the algorithm does not learn any patterns specific to participant behavior. To ensure that the output is not spurious, we smoothed the prediction probabilities of the individual random forest models over a 5-min rolling window before computing performance metrics. Finally, a left out test set with 24 individuals is used to obtain the generalisation performance of the ensemble of the models generated in the outer cross-validation using averaged prediction probabilities. The same ensemble of models is used in “Nap detection”.