Sleep classification from wrist-worn accelerometer data using random forests

Accurate and low-cost sleep measurement tools are needed in both clinical and epidemiological research. To this end, wearable accelerometers are widely used as they are both low in price and provide reasonably accurate estimates of movement. Techniques to classify sleep from the high-resolution accelerometer data primarily rely on heuristic algorithms. In this paper, we explore the potential of detecting sleep using Random forests. Models were trained using data from three different studies where 134 adult participants (70 with sleep disorder and 64 good healthy sleepers) wore an accelerometer on their wrist during a one-night polysomnography recording in the clinic. The Random forests were able to distinguish sleep-wake states with an F1 score of 73.93% on a previously unseen test set of 24 participants. Detecting when the accelerometer is not worn was also successful using machine learning (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hbox {F1-score} > 93.31\%$$\end{document}F1-score>93.31%), and when combined with our sleep detection models on day-time data provide a sleep estimate that is correlated with self-reported habitual nap behaviour (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hbox {r}=.60$$\end{document}r=.60). These Random forest models have been made open-source to aid further research. In line with literature, sleep stage classification turned out to be difficult using only accelerometer data.

Sleep quality and duration play an important role in human health 1 . Accurate methods for sleep assessment are needed to monitor the prevalence of poor sleep, to increase our understanding of the relation between sleep and health, and to design effective treatments for insomnia. Additionally, assessment methods need to have a high user-acceptability to reduce the risk of participant dropouts leading to selection bias.
The gold standard for sleep measurement, polysomnography (PSG), is prohibitively expensive and unfeasible for use in large scale population research. On the other hand, the much more feasible sleep diaries can provide information on time in bed but they are subject to recall bias and might be less relevant to assess time slept during this period. Therefore, wearable accelerometers have been explored since the mid-1990s as a possible alternative for multi-day real life (out of the lab) sleep monitoring.
To cope with memory and battery constraints in the early devices, data was pre-processed inside the device. Further, these devices had in common that they relied on piezo-electric acceleration sensors not sensitive to gravitational acceleration under static conditions. Technological advancements in the mid-2000s led to a new generation of accelerometers, referred to as raw data accelerometry, which was based on Micro Electro-Mechanical-Systems (MEMS) and able to store up to a week of digitised but otherwise unprocessed data in memory to facilitate offline analysis. These modern accelerometers are sensitive to gravitational acceleration under static conditions.
Offline access to raw data enabled revisiting the entire data processing pipeline as better algorithms emerge over time, which is needed to facilitate longitudinal studies sometimes spanning a lifetime. Further, access to raw data increased the ability to standardise analysis across studies to allow more meaningful comparisons. As a result, raw data accelerometry is now widely used by the health research community 2-4 .
Cole-Kripke 5 , Sadeh 6 , and Oakley 7 proposed sleep detection algorithms for the accelerometer in the 1990s. Their algorithms had in common that data was pre-processed onboard the device towards a 30-second aggregate, Borazio et al. proposed the Estimation of Stationary Sleep-segments (ESS) algorithm for raw data accelerometry, which aims to detect segments of idleness quantified as a low standard deviation per second lasting for at least 10 min 8 . Next, van Hees et al. 9 proposed an algorithm that relied on the estimated orientation angle of the accelerometer, based on the detection of time segments where the estimated angle of the accelerometer relative to gravity does not change beyond 5 • for at least 5 min. This approach facilitated easier interpretation compared to the conventional approaches based on the magnitude of acceleration and zero-crossing counts. This heuristic algorithm is now extensively used in the research community 1,10-12 . More recently Trevenen et al. used machine learning to perform sleep classification. They extracted a variety of features from the acceleration vector magnitude and used these as input for a Hidden Markov Model (HMM) to classify sleep versus wakefulness, as well as to discriminate all four sleep stages 13 . The novel attempt to classify sleep stages from accelerometer-only data resulted in poor classification performance and was not able to accurately detect REM nor discriminate between Non-REM stages. Nonetheless, their conclusion about the potential for sleep stage classification was optimistic. Finally, Barouni and colleagues proposed a heuristic approach for sleep classification from raw data accelerometry, but mainly followed the approach used for traditional count-based accelerometers use kinematically hard to interpret the threshold crossing of the magnitude of acceleration 14 . Additionally, it should be noted that Willetts et al. used the term sleep classification in their work but relied on wearable cameras as criterion method. Wearable cameras are not able to distinguish sleep from wakefulness, by which their sleep detection claim is inaccurate 15 .
Complementary to sleep stage classification, information on day-time nap behaviour is also of interest. The assessment of nap behaviour is challenged by potential removal of the accelerometers for episodes during the day, since the existing heuristic nonwear detection algorithms 16 was designed to only detect nonwear segments lasting for at least an hour.
Although the heuristic approaches have proven their value, their performance does in principle not improve when more data becomes available. In this paper, we explore the potential of random forests machine learning as a more data-driven approach to improve sleep-wake and wear-nonwear classification. Our approach uses data acquired from 158 participants from three different studies representing a wide age range and including both healthy sleepers and those with sleep disorders. The performance of these machine learning models was assessed by cross-validation using data from 134 participants (64 healthy sleepers and 70 with sleep disorders, age range 20-72) 9,17,18 . We then report the performance of our trained models on previously unseen test data from 24 remaining participants (16 healthy sleepers and 8 with sleep disorders). These trained models have been made open-source available to aid further sleep research. When used in combination, both sleep detection and nonwear detection approaches may be useful for daytime nap detection, which we evaluate in 109 separate individuals with accelerometer data collected in real life (out of the lab) where self-reported napping behaviour is available. Furthermore, we investigate the possibility of predicting four sleep stages (rapid eye movement (REM) sleep and Non-REM sleep stages N1, N2, and N3), which is not feasible with the current heuristic approaches. Reliable detection of sleep stages from wearable accelerometer data would advance sleep research as it provides an additional level of sleep description.

Results
The number of samples corresponding to wakefulness and different sleep stages among the assessed 30 second intervals are shown in Table 1.
Sleep-wake classification. For sleep-wake classification, samples labeled as N1, N2, N3, and REM are considered as Sleep samples.
vanHees approach. The vanHees heuristic algorithm described in the "Methods" section is applied to the accelerometer data to obtain a binary classification of wakefulness or sleep. The classification performance of the method in the outer cross-validation and test set are reported in Table 2.  ) where max-depth of Full implies that the decision trees are allowed to grow to any depth till termination criteria are satisfied. The classification performance metrics across the five outer folds are averaged and reported in Table 2 along with test set performance. It can be observed that the random forests approach outperforms the vanHees approach on both outer cross-validation and test data. Specifically, random forests perform better at detecting wakefulness compared to the vanHees approach as seen in the confusion matrices in Fig. 1, though the number of sleep-Wake samples is heavily imbalanced. The important features for sleep-wake classification averaged across all folds are shown in Supplementary Information Figure 4. For more information on feature definition see METHODS section. It can be observed that statistical measures for Locomotor Inactivity During Sleep (LIDS) and Z-angle are the most important features for sleep-wake classification.
Healthy versus poor sleepers. Figure 2 shows the plots of F1-score with respect to time spent sleeping for each user based on whether they are poor or healthy sleepers. Red markers denote F1-scores of sleep and green markers denote F1-scores of wakefulness for each user. We obtained the Spearman's correlation coefficient for the time spent sleeping and F1-scores for healthy and poor sleepers. For healthy sleepers, it was observed that wakefulness F1-scores and time spent sleeping were negatively correlated due to fewer wake samples causing poor wakefulness classification. For poor sleepers, time spent sleeping was positively correlated with sleep F1-scores and negatively correlated with wakefulness F1-scores.
Nonwear classification. The chosen hyperparameters for the trained random forests models for Nonwear classification for each of the five cross-validation folds are given as (trees, max-depth) tuple-(500,15), (100,15), (200,15), (300,20), (500,15). The nonwear classification performance metrics across the five outer folds are averaged and reported in Table 2 along with test set performance. It can be seen that nonwear classification using random forests performs quite well on both the outer cross-validation data and previously unseen test set.
The confusion matrices of nonwear classification are shown in Fig. 1 for the test set. The numbers in the matrices indicate the percentage of samples from the true class that was classified as the predicted class. It can be seen that Wear periods are predicted reliably whereas Nonwear periods tend to be confused with Wear periods by 11%.
The important features for nonwear classification averaged across all folds with random forests are shown in Supplementary Information Figure 7. It can be observed that statistical measures for LIDS and Z-angle are the most important features for nonwear classification. Table 2. Binary sleep-wake and nonwear-wear classification. F1 F1-score, AP average precision (mean ± standard deviation).

Classification Approach
Outer Cross-validation Test

F1 (%) AP (%) Kappa F1 (%) AP (%) Kappa
Sleep-wake The hierarchical classification performance metrics across the five outer folds are averaged and reported in Table 3 along with test set performance. Unlike F1-score computation of flat classification, F1-score of hierarchical classification takes all correctly classified ancestor classes of the hierarchy into account. It can be observed that the prediction performance drops as we go down the hierarchy. Levels 3, 4, and 5 which consist of leaf nodes like REM, N1, and N3 show a drastic reduction in performance.
The confusion matrices of true classes versus predicted classes are shown in Fig. 3. It can be observed that classes Wear, Sleep, NREM, N1 + N2 and N2 seem to be predicted more frequently than other classes.

Discussion
Based on our experiments, we infer that machine learning approaches such as random forests applied to accelerometer-only data improves the sleep-wake classification compared to the approaches proposed in 1990s 5,6 and as well as the heuristic algorithm proposed by vanHees 9 . Our machine learning approach also enables nonwear detection at a higher time resolution than the vanHees approach 9 . The combination of these enhancements enables us to estimate daytime napping periods. However, the current findings should be seen as an encouragement   13 . Based on an exploratory analysis of our data, we observed that transition between sleep states is rare compared to remaining in the same state for prolonged periods of time. Hence, when using approaches like Hidden Markov Models (HMM) as in Trevenen et al. 13 , the transition probability matrix is heavily diagonal and HMMs do not provide any advantage under such scenarios. Therefore, we trained discriminative models for every time interval based on engineered features. However, HMMs inherently ensure that the predicted sequences adhere to the transition probabilities and hence prevents spurious sleep state predictions. To have a similar effect for preventing spurious predictions, we smoothed the prediction probabilities of the individual random forest models over a 5-minute rolling window.
Further, while Trevenen et al. used data collected from healthy 22-year old individuals, our experiments are based on a more challenging, yet more heterogeneous dataset collected from a wide range of ages and includes participants with sleep disorders.
We explored random forest-based nonwear detection to gain insight into the potential for daytime nap detection. Nonwear detection was found to be acceptably accurate. By feeding the classifier both sleep and nonwear data we offered the classifier a challenging task. If we had trained it using data corresponding to nonwear and a person performing activities the classification task would have been easy but not representative of real-life nonwear detection. Future research is needed to identify generic purpose non-wear detection able to both assist in the distinction of daytime naps and the identification of large episodes of nonwear or sleep.
Sleep stage classification was expected to be challenging at the outset of the study. However, the reason why we still explored it is that even a weak classification could be of value in large scale population studies, e.g. UK Biobank 10 , where minor effects become only visible when averaged over a large number of individuals. Sleep stage classification may only be realistic with complementary sensor data, e.g. photoplethysmogram (PPG), which was outside the scope of this study as we focus on solutions for the already widely collected accelerometer-only data.
The positive correlation of 0.60 and the lack of a statistical difference between average self-reported habitual napping duration and estimates from our ensemble of accelerometer-based random forest models are encouraging. However, based on these data alone it is hard to say whether the observed individual differences are explained by the subjective nature of a questionnaire, the discrepancy between the questionnaire that asks about habitual behavior and an accelerometer recording corresponding to nine specific days, or the precision of the random forest models. Therefore, further research is warranted involving a more direct comparison, e.g. with video observation.
Most of the data used in this study was collected with the GENEActiv accelerometer brand. Future studies should consider the potential of model transferability across accelerometer brands. Previous research indicates that data is highly comparable across accelerometer brands 19 , but confirmation of these specific outcomes is desired.
Models were trained and tested across three different datasets. The PSG data was scored by a different sleep technician at every site, each site had its own PSG equipment, and participants at each site had different demographics. It could be hypothesized that the models have therefore become more robust against signal artifacts related to these experimental differences.
The present study does not look at detecting the beginning and the end of the night (sleep period time window), which is a different but related challenge we looked at in van Hees et al. 20 .
Raw data accelerometry faces the same challenges as traditional actigraphy in not being able to capture most physiological processes that underlie sleep. Our present work does not prove, or even attempt to prove, that raw data offers more accurate sleep detection than traditional actigraphy. The main advantage of raw data is that it offers increased scientific transparency and can be re-processed for many purposes beyond sleep research alone.   21 . Sleep researchers will have to decide whether they prefer a more accurate but less interpretable random forest model or a less accurate model by vanHees, Sadeh, or Cole-Kripke. In an earlier publication, we argue that the vanHees heuristic model is more kinematically interpretable compared with conventional algorithms that rely on zero-crossing counts or magnitude of acceleration 9 . Whether the Sadeh and Cole-Kripke algorithms offer better methodological consistency with historical research is difficult to say since the piezo-electric acceleration sensors as used in the 1990s have been replaced by MEMS-based capacitive sensors in the 2000s that have a wider frequency response. We are not aware of any studies that investigate the comparability of Sadeh or Cole-Kripke algorithm output across these hardware generations.
There are also important clinical implications of these results. The assessment of sleep/wake patterns for the diagnosis of sleep and circadian rhythm disorders often requires polysomnography, which is expensive and laborintensive. Accelerometry is sometimes used as a less expensive form of assessment but current algorithms are limited in their accuracy, particularly in patients with insomnia. Improved algorithms have the potential to make accelerometry a more clinically-useful assessment tool that would permit the measurement of sleep and wake over extended periods of time. This approach could also be implemented more easily than polysomnography in non-sleep clinic settings. Future studies are warranted to investigate the physiology behind misclassifications in order to better understand how sleep classifier performance may vary across specific sleep disorders.

Methods
The raw accelerometer data was extracted from binary files obtained with different accelerometer brands using R package GGIR 22 . The raw data was then preprocessed using GGIR algorithms for signal calibration relative to gravitational acceleration 23 and alignment of PSG assessment labels with processed data. Next, we explored random forests machine learning to perform sleep-wake, nonwear-wear, and sleep stage classification. Additionally, we explored the value of sleep-wake and nonwear-wear classification to identify daytime naps.
Random forests. The same random forests approach was used for sleep-wake, nonwear-wear, and sleep stage classification. In our initial exploration of the data, we experimented with deep learning techniques but as the results were not better than the vanHees approach we decided to report them in the Supplementary Information to this paper.
Signal features. In our models, we used 36-dimensional features encompassing twelve different statistical measures, listed in Table 4, applied to three derived signals calculated from the three accelerometer axes, a x , a y , and a z : • ENMO : The Euclidean Norm Minus One (ENMO) with negative values rounded to zero in g has been shown to correlate with the magnitude of acceleration and human energy expenditure 16 . ENMO is computed as follows: • Z-angle: Z-angle, computed using Eq. 3, corresponds to the angle between the accelerometer axis perpendicular to the skin surface and the horizontal plane. As described in "vanHees approach", any change (or lack of change) in the z-angle over successive time intervals may be an indicator of posture change. • LIDS: Locomotor Inactivity During Sleep (LIDS) 24 involves a non-linear conversion of locomotor activity and has shown to be sensitive to ultradian sleep cycles. The original paper did not make use of raw data accelerometry. In this work, LIDS is computed as follows: (1) ENMO = max(0, a 2 x + a 2 y + a 2 z − 1)  www.nature.com/scientificreports/ where activity count is computed using a 10-minute moving sum over max(0, ENMO − 0.02) . LIDS is then smoothed using moving average over a 30-min window. For each 30 s interval, we computed 36-dimensional features which were then used to train the random forest.
Imbalanced data. We observed from our data that some labels occur more frequently than others leading to an imbalanced dataset. Such data needs to be handled with care when used with machine learning models since the model might learn to always predict the class with the majority of samples. A typical workaround is to undersample or oversample the training samples belonging to various classes such that the model is trained with roughly equal number of samples from each class. In this paper, we followed oversampling of classes using Synthetic Minority Over-sampling Technique (SMOTE) 25 . SMOTE generates new samples by interpolation of random samples with their nearest neighbors. In our work, we used the SMOTE implementation in the imbalanced-learn python package 26 with a sampling strategy to resample all classes to have roughly equal number of training samples.
Performance metrics. As our data is heavily imbalanced, the classification performance of our experiments was evaluated using F1-score and Average Precision, i.e. area under the Precision-Recall curve. Note that SMOTE was only applied to the training data, this why we still need to account for data imbalance in the performance evaluation. F1-score is the harmonic mean of precision and recall with high F1-scores indicating good classification performance. F1-scores of individual classes are averaged to obtain the overall F1-score, i.e. macroaveraging, to treat all classes equally. Additionally, we report Cohen's Weighted Kappa coefficient for Sleep-wake classification results to facilitate comparisons with other studies 27 .
F1-scores are computed using predicted classes chosen with specific thresholds. However, the precision-recall curve gives a better picture of the classification performance since it plots recall vs precision by varying thresholds. Better classification performance is indicated by curves tending towards the top right. The area under the precision-recall curve, i.e. Average Precision, gives a quantifiable measure of performance with Average Precision of 1 indicating best performance.
Training and evaluation. The resampled features were used to train random forests models 28 . Classification using Random forests works by training multiple decision trees with subsets of the data and averaging the decision tree outputs to address overfitting. The features were normalized to have zero mean and unit standard deviation before training.
We used a nested cross-validation approach, involving: fivefold inner cross-validation to optimise hyper parameters, and a fivefold outer cross-validation to obtain generalisation performance. This means that 5 × 5 models were trained in the process, out of which five models from the outer cross-validation can be used as an ensemble on new data.
In the inner cross-validation, a randomized hyperparameter search was used to choose the number of trees from (100, 150, 200, 300, 400, 500) and the tree depth from (5,10,15,20,Full). Other random forests parameters were retained as default as specified by the scikit-learn package. Each inner cross-validation fold splits the training data into training and validation data (4:1) where the validation data is used to choose the optimal hyperparameters i.e. number of trees and tree depth, for each fold based on Average Precision. Hence, each inner cross-validation fold will use a different random forests model tuned optimally for its corresponding training data. The outer cross-validation is used to obtain both F1-scores and Average Precision (AP) for generalization performance. For the outer cross-validation, the data is split into training and validation partitions such that participants in both partitions do not overlap. This ensures that the algorithm does not learn any patterns specific to participant behavior. To ensure that the output is not spurious, we smoothed the prediction probabilities of the individual random forest models over a 5-min rolling window before computing performance metrics. Finally, a left out test set with 24 individuals is used to obtain the generalisation performance of the ensemble of the models generated in the outer cross-validation using averaged prediction probabilities. The same ensemble of models is used in "Nap detection".
Sleep-wake classification. In order to benchmark the performance of our models for sleep-wake classification, we used the previously published vanHees approach 20 as baseline. In addition, we also used the implementations 21 of Sadeh 6 and Cole-Kripke 5 approaches for comparison. Both these approaches use aggregated actigraphy counts with a zero-crossing technique to perform sleep-wake classification.
vanHees approach. To estimate sleep, van Hees et al. 9 proposed a heuristic algorithm using accelerometer data. This algorithm uses (lower) arm angle relative to the gravitational component estimated from accelerometer data to differentiate between sleep and wakefulness states. The arm angle is estimated as: (2) LIDS = 100 activity count + 1 Nonwear detection. Periods of nonwear less than 30 min will go undetected with available heuristic approaches 14,16 . We investigated whether nonwear periods can be determined at a higher resolution with machine learning (random forests). The ground truth labels for our nonwear classification are defined based on two assumptions: The accelerometer is worn during the PSG recording as prescribed by the study protocol and supervised by the researcher, and the accelerometer is not worn outside the PSG recording, according to the study protocol. Only if the standard deviation in the acceleration signal per 15 min is larger than 13.0 mg (1 mg = 0.00981 m/s 2 ) these 15 min outside the PSG recording are labelled as wear. Here, the threshold of 13.0 mg is borrowed from the Heuristic van Hees approach.
Sleep stage classification. The various stages in sleep classification can be organized into a hierarchy of classes as shown in Fig. 4. These follow the standard neurobiological definitions of sleep. We grouped N1 and N2 because they are more similar than N2 and N3, particularly from an electrophysiological perspective (e.g., EEG and EMG). Classifying accelerometer samples according to this hierarchy might help understand the discriminative properties (if any) of the data to perform nonwear detection, sleep-wake and sleep stage classification. Hence, we perform hierarchical classification of samples using random forests as described in "Random forests". For hierarchical classification, we trained a random forest model for every non-leaf node to classify samples into one of its child nodes. Since the samples belonging to each node are imbalanced, we balanced the training samples for each non-leaf node using SMOTE with the sklea rnhie rarch icalc lassi ficat ion implementation.
Nap detection. We combined the random forest models for sleep-wake and wear-nonwear classification as presented in this paper to distinguish: Nonwear, Sleep, and Wake, and applied these to real-life (out of the lab) the accelerometer data. Total weekly napping time was calculated as the total duration of all classified sleep episodes that last at least 15 minutes and are outside the Sleep Period Time Window. Here, Sleep Period Time Window was guided by the available sleep log. A t-test, Pearson's correlation coefficient and scatter plot are used to inspect the relation. The data from 109 individuals as used are a sub-sample of the Whitehall II Study data 9 www.nature.com/scientificreports/ over-sampled with individuals who report nap behaviour, detailed information on the data and sampling can be found in the Supplementary information.
Ethical approval and informed consent. The studies were approved by the University College London ethics committee (85/0938), NRES Committee North East Sunderland ethics committee (12/NE/0406), University of Pennsylvania ethics committee (819591), and VU University Medical Center Amsterdam, respectively. Methods reported in this manuscript were performed in accordance with relevant guidelines and regulations covered by the aforementioned ethics approval committees. All participants provided informed consent.