Self-supervised learning for human activity recognition using 700,000 person-days of wearable data

Accurate physical activity monitoring is essential to understand the impact of physical activity on one's physical health and overall well-being. However, advances in human activity recognition algorithms have been constrained by the limited availability of large labelled datasets. This study aims to leverage recent advances in self-supervised learning to exploit the large-scale UK Biobank accelerometer dataset, an unlabelled dataset spanning 700,000 person-days, in order to build models with vastly improved generalisability and accuracy. Our resulting models consistently outperform strong baselines across eight benchmark datasets, with an F1 relative improvement of 2.5–130.9% (median 24.4%). More importantly, in contrast to previous reports, our results generalise across external datasets, cohorts, living environments, and sensor devices. Our open-sourced pre-trained models will be valuable in domains with limited labelled data or where good sampling coverage (across devices, populations, and activities) is hard to achieve.


Introduction
Current human activity recognition (HAR) models typically rely on manual feature engineering (Twomey et al., 2018; Haresamudram et al., 2019), partly due to the very limited size of existing labelled datasets. This small-data issue caps the effectiveness of data-hungry deep learning methods. In general, obtaining labelled data is labour intensive, but it is especially so for HAR data because one would need to annotate the corresponding video stream for the ground truth. On the other hand, collecting large-scale unlabelled HAR data is highly feasible, as evidenced by projects such as the UK Biobank (UKB) (Doherty et al., 2017) and NHANES (Belcher et al., 2021). This prompts the use of self-supervised learning (SSL) methods to leverage unlabelled data in a similar manner to language models (Mikolov et al., 2013; Devlin et al., 2018; Radford et al., 2018, 2019) and vision models (Doersch et al., 2015; Zhang et al., 2016; Noroozi and Favaro, 2016; Wei et al., 2018; He et al., 2020; Chen et al., 2020). Recent studies explored the utility of SSL for HAR (Saeed et al., 2019; Tang et al., 2020), but these still relied on small-scale laboratory-style datasets, hence the full potential of SSL for HAR remains unknown.
In this paper, we investigate how learning three simple self-supervised tasks, independently and jointly, could facilitate HAR across a variety of environments using the UKB dataset, which contains terabytes of wearable sensor data collected in the real world. To fully appreciate the benefit of SSL in HAR, we chose tasks that prioritise the temporal dependencies of human motion, namely arrow of time (AoT), permutation, and time warping (TW) (Um et al., 2017). Figure 1 illustrates an example of these transformations for a sequence of tri-axial accelerometer data. We show for the first time that multi-task SSL can train a HAR model that generalises well across seven external datasets that differ in activity classes, devices, populations and recording environments. In contrast to previous works, we provide a more realistic evaluation of the utility of SSL-HAR by factoring in common issues seen in practical use of pre-trained models, such as domain shift and task shift (Quiñonero-Candela et al., 2008). Our publicly available models will enable the research community to build high-performing activity recognition models even in resource-restricted environments.

Figure 1: A ten-second window of tri-axial accelerometer data while a person is ironing, shown in its original form and under three transformations: reversed, permuted, and time-warped.

Related work
Activity recognition from on-body inertial sensors is a fundamental problem in wearable computing. HAR models are usually built following the Activity Recognition Chain (Bulling et al., 2014), which describes the pre-processing and segmentation of raw time series into fixed-length frames, from which features are extracted for classification. Early work in HAR found some success in tree-based approaches with hand-crafted features (Hammerla et al., 2013). More recent works focused on translating advances in deep learning into HAR (Yang et al., 2015; Hammerla et al., 2016; Ronao and Cho, 2016) but were limited by small dataset sizes. Indeed, some studies found that earlier approaches, such as random forests, tended to generalise better than deep learning models in certain small-data regimes (Willetts et al., 2018; Kwon et al., 2020).
Recent studies explored the use of SSL for HAR, such as forecasting (Taghanaki and Etemad, 2020), masked reconstruction (Haresamudram et al., 2020a), contrastive learning (Haresamudram et al., 2020b; Tang et al., 2020), and multi-task SSL (Saeed et al., 2019; Tang et al., 2021). However, these studies were still limited by small laboratory-style datasets (n < 10,000), which confounded their findings. In comparison, we leverage the largest ever collection of wrist-worn raw accelerometer data (n > 6 billion) collected in the real world. Aside from being several orders of magnitude larger, this dataset is vastly more diverse, as it contains hundreds, if not thousands, of natural human activities, a crucial aspect for the input distribution and generalisation. Therefore, we aim to provide a more definitive investigation of the utility of SSL for HAR.

Methods
We used tri-axial accelerometer data from wrist-worn activity trackers, which record acceleration along three orthogonal axes at a high sampling rate (e.g. 100 Hz). The main benefit of wrist-worn activity trackers is their high user compliance, resulting in days, if not weeks, of continuous recordings. Following Bulling et al. (2014), we split the signals into windows of equal duration, effectively treating them as independent inputs to the HAR models. We can then label each window with an activity class. Throughout this study, we linearly resampled all data to 30 Hz and used ten-second windows to compare the downstream benchmarks fairly. The 30 Hz sampling rate was chosen because most human activities have a frequency below 10 Hz; we used a rate higher than the presumed Nyquist rate (20 Hz) to ensure that we did not lose any useful signal. For the subsequent activity recognition benchmarks, we considered seven external labelled datasets that vary in size (600 to 600,000 samples), activity classes (4 to 18 classes), devices used (five different brands), and collection protocol (free-living, scripted, and lab settings). See Table 1 for additional details. Three datasets had license information, and four datasets reported informed consent information (Appx. Table 5). During data cleaning, we removed the classes that were not present in all subjects in the small datasets with fewer than 10 individuals.
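As an illustrative sketch (not the paper's released code), the resampling and windowing described above might look as follows; the linear-interpolation resampler and non-overlapping windows are assumptions based on the description:

```python
import numpy as np

def make_windows(signal, fs_in, fs_out=30, win_sec=10):
    """Linearly resample a tri-axial signal to fs_out Hz and cut it into
    non-overlapping win_sec-second windows.

    signal: array of shape [n_samples, 3] recorded at fs_in Hz.
    Returns an array of shape [n_windows, fs_out * win_sec, 3].
    """
    n_out = int(len(signal) * fs_out / fs_in)
    t_in = np.arange(len(signal)) / fs_in
    t_out = np.arange(n_out) / fs_out
    # per-channel linear interpolation onto the 30 Hz time grid
    resampled = np.stack(
        [np.interp(t_out, t_in, signal[:, c]) for c in range(3)], axis=1
    )
    win_len = fs_out * win_sec
    n_win = len(resampled) // win_len           # drop the trailing remainder
    return resampled[: n_win * win_len].reshape(n_win, win_len, 3)
```

Each resulting window is then treated as an independent input to the HAR model.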
Even though we reused existing datasets, we made our best effort to enumerate the license and consent information for all the included datasets, as our data involved human subjects. We observed that many of the open benchmark datasets we used did not have suitable licensing or consent information, possibly due to a lack of data governance awareness at the time of the study.

Multi-task self-supervised learning
We considered three self-supervised tasks from Saeed et al. (2019), which were first used by Um et al. (2017) as data augmentation techniques.
Arrow of time (AoT) flips the signal along the time axis, effectively playing the signal in reverse.
Permutation breaks the signal into chunks and shuffles them. We set the number of chunks to four and the minimum length of each chunk to 10 time steps.
Time warping (TW) stretches and compresses arbitrary segments of the signal, effectively slowing down and speeding up the signal randomly.
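The three transformations can be sketched as follows. This is a hypothetical implementation based on the descriptions above (four chunks with a minimum length of 10 samples for permutation); the knot-based warp is one common way to implement Um et al.'s time warping, and its parameters here are illustrative:

```python
import numpy as np

def arrow_of_time(x):
    """Reverse the window along the time axis (x has shape [time, channels])."""
    return x[::-1].copy()

def permute(x, n_chunks=4, min_len=10, rng=None):
    """Split the window into n_chunks segments (each >= min_len samples) and shuffle them."""
    rng = rng or np.random.default_rng()
    n = len(x)
    # rejection-sample cut points until every chunk is at least min_len long
    while True:
        cuts = np.sort(rng.choice(np.arange(1, n), size=n_chunks - 1, replace=False))
        bounds = np.concatenate(([0], cuts, [n]))
        if np.all(np.diff(bounds) >= min_len):
            break
    chunks = [x[bounds[i]:bounds[i + 1]] for i in range(n_chunks)]
    rng.shuffle(chunks)
    return np.concatenate(chunks)

def time_warp(x, n_knots=4, sigma=0.2, rng=None):
    """Smoothly stretch and compress the time axis with a random monotonic warp."""
    rng = rng or np.random.default_rng()
    n = len(x)
    knots = np.linspace(0, n - 1, n_knots + 2)
    # local "playback speeds" at each knot, clipped to stay positive
    speeds = np.clip(rng.normal(1.0, sigma, size=n_knots + 2), 0.5, 1.5)
    cum = np.cumsum(np.interp(np.arange(n), knots, speeds))
    warp = (cum - cum[0]) / (cum[-1] - cum[0]) * (n - 1)
    return np.stack(
        [np.interp(warp, np.arange(n), x[:, c]) for c in range(x.shape[1])], axis=1
    )
```

All three functions return a window of the same shape as the input, so transformed and original windows can be batched together.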
Following Saeed et al. (2019), we treated each of the tasks as a binary problem: predicting whether a transformation has been applied. In the multi-task learning (MTL) setting, not all tasks might benefit HAR when trained jointly, so we assessed how different task combinations influence the downstream performance (Section 4.2). We computed the cross-entropy loss for each task and weighted all tasks equally in the loss calculation.
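A minimal sketch of the equal-weighted multi-task loss might look like this, using a sigmoid/binary cross-entropy formulation (equivalent to a two-way softmax for binary tasks); the task names and dict-based interface are illustrative, not the paper's API:

```python
import numpy as np

def multitask_loss(logits, labels):
    """Equal-weighted average of per-task binary cross-entropy losses.

    logits/labels: dicts mapping a task name ('aot', 'permutation', 'tw')
    to arrays of shape [batch]; labels are 0/1 (was the transform applied?).
    """
    def bce(z, y):
        p = 1.0 / (1.0 + np.exp(-z))            # sigmoid
        return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return sum(bce(logits[t], labels[t]) for t in logits) / len(logits)
```

In a deep learning framework the same structure applies: one cross-entropy term per task head, summed with equal weights before the backward pass.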
Weighted sampling Motion data collected in the real world contains large portions of low-movement periods that are less informative (Appx. Figure 6), which is an issue for our SSL tasks as static signals remain virtually unchanged after the transformations. We found it crucial to perform weighted sampling for improved training stability and convergence: during training, we sample the data windows in proportion to their standard deviation so as to give more weight to high-movement periods.
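The weighted sampling scheme can be sketched as below, assuming windows are drawn with probability proportional to their standard deviation (the exact weighting used in the paper may differ in detail):

```python
import numpy as np

def sample_windows(windows, batch_size, rng=None):
    """Sample window indices with probability proportional to their std.

    windows: array of shape [n_windows, time, channels]. Near-static
    (low-movement) windows get a correspondingly low chance of being drawn,
    so the SSL batch is dominated by informative, high-movement periods.
    """
    rng = rng or np.random.default_rng()
    weights = windows.std(axis=(1, 2))
    probs = weights / weights.sum()
    return rng.choice(len(windows), size=batch_size, replace=True, p=probs)
```

A completely static window has zero standard deviation and is therefore never sampled, which matches the motivation above: static signals are virtually unchanged by the pretext transformations.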

Network training
We adapted a ResNet-V2 with 18 layers and 1D convolutions (He et al., 2016) for the main trunk (feature extractor), totalling 10M parameters. The learned feature vector was of size 1024. All the tasks shared the same feature extractor. We then attached a softmax layer for each of the self-supervised tasks. In the downstream evaluation, we added a fully connected (FC) layer of size 512 between the feature extractor and the softmax readout. The network structure was fixed for all the downstream evaluations.
For SSL, we loaded up to four subjects from the UKB at each iteration. For each subject, we first sampled one day out of the week-long recording, from which we again sampled 1500 ten-second windows to make up a training batch. The self-supervised transformations were then applied to the batch.
Since axis orientation differs between device manufacturers, we used random axis swaps and rotations to augment the training data and embed this invariance into our models. For optimisation, we used Adam (Kingma and Ba, 2014) with a learning rate of 1e-3. To account for the large batch size, 1500 × 4 = 6000, we applied linear scaling for the learning rate with 5 epochs of burn-in (Goyal et al., 2017). We distributed the network training over four Tesla V100-SXM2 GPUs with 32 GB of memory each.
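The axis-swap and rotation augmentation might be sketched as follows; sampling the rotation via a QR decomposition of Gaussian noise is one standard recipe and an assumption on our part, not necessarily the authors' implementation:

```python
import numpy as np

def augment_axes(x, rng=None):
    """Randomly permute the three axes and apply a random 3-D rotation.

    x: array of shape [time, 3]. A random orthogonal matrix is built from
    the QR decomposition of Gaussian noise (a standard way to sample
    uniformly from the orthogonal group), so sample norms are preserved.
    """
    rng = rng or np.random.default_rng()
    x = x[:, rng.permutation(3)]               # random axis swap
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))                   # fix signs for a uniform draw
    return x @ q
```

Because the transform is orthogonal, the magnitude of acceleration at each time step is unchanged; only the device orientation the model sees varies, which is exactly the invariance the augmentation is meant to instil.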
It took about 420 GPU-hours to train the MTL model (about 20 epochs). We used an 8:2 train/test split for all the self-supervised experiments. For fine-tuning, we used the same training setup as in pre-training where possible, except for the batch size, which was re-adjusted depending on the size of each dataset.

Evaluation - human activity recognition
To evaluate the downstream (HAR) performance, we used held-one-subject-out cross-validation for the datasets with < 10 subjects. We additionally removed activity classes not performed by all subjects in these small datasets. For datasets with ≥ 10 subjects, we used five-fold subject-wise cross-validation instead. Each cross-validation used a 7:1:2 split ratio for the train/validation/test sets. We used early stopping with a patience of five to avoid over-fitting.
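Subject-wise cross-validation can be sketched as below; the helper name and NumPy-based implementation are illustrative, not the paper's code. The key property is that every subject's windows land entirely in either the train or the test partition, never both:

```python
import numpy as np

def subject_wise_folds(subject_ids, n_folds=5, rng=None):
    """Yield (train_idx, test_idx) arrays where no subject appears in both.

    subject_ids: array of shape [n_windows] giving the subject of each window.
    """
    rng = rng or np.random.default_rng()
    subjects = np.unique(subject_ids)
    rng.shuffle(subjects)
    for fold_subjects in np.array_split(subjects, n_folds):
        test_mask = np.isin(subject_ids, fold_subjects)
        yield np.where(~test_mask)[0], np.where(test_mask)[0]
```

Held-one-subject-out cross-validation is the special case where n_folds equals the number of subjects.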
After the network was trained on the UKB using ∼100,000 participants, we further fine-tuned it on the seven labelled downstream datasets to perform human activity recognition using two approaches: (1) fine-tuning all the layers, and (2) freezing the trunk (feature extractor) and fine-tuning only the final FC layers. We also report the performance of a network of the same architecture fully trained from scratch, and of a strong random forest model with tried-and-tested time-series features, a baseline that has often been neglected in model comparisons (Zhang et al., 2012; Mannini et al., 2013; Ellis et al., 2016; Willetts et al., 2018). See Appx. A.3 for the list of features used.
In addition, we introduced a shared implementation for network training, model evaluation and preprocessing. Differences in experimental setup such as learning rates, regularisation and data augmentation can lead to inconsistent results (Oliver et al., 2018). A unified evaluation framework ensures a fair comparison between different baseline models. This contrasts with previous work, where the lack of a fixed evaluation protocol across the benchmark datasets made it hard to compare model performance with the current state of the art. The results produced in our paper can thus serve as baselines for future HAR research.
Transfer learning Pre-training on a larger labelled dataset and fine-tuning on a smaller one is a common technique in practical applications that has been under-reported as a baseline for SSL. The success of transfer learning, however, depends on how similar the source and target datasets are. Hence, we included experiments using the two largest labelled datasets, Capture-24 and Rowlands, for pre-training, followed by fine-tuning on the other labelled datasets.
The benefits of data volume In the ablation studies, we investigated how the downstream performance differs along two axes: the amount of labelled data and the amount of unlabelled data. Concretely, we gradually increased the number of labelled subjects in both Capture-24 and Rowlands in the downstream evaluation to assess whether our pre-trained model can still do well in a limited-data regime. In terms of unlabelled data, we experimented with pre-training on 100 to 100,000 participants in increments of one order of magnitude. We also varied the amount of unlabelled data per subject from .25 to 1 using 10,000 participants. A data ratio of .25 means that if one day of data per subject was used previously, only six hours of data per subject would now be used for training. Investigating how unlabelled data influences downstream performance will guide how much data one needs in order to obtain an effective SSL model for HAR.

Understanding network representation
Contextualising layer-wise relevance propagation We applied layer-wise relevance propagation (LRP) to visually investigate the signal characteristics relevant for detecting the pretext tasks (Montavon et al., 2019; Creagh et al., 2021). It is inherently difficult to visually interpret attribution heatmaps generated by Explainable AI (XAI) frameworks on time-series signals. To overcome this lack of visual ground truth, we devised a set of simple contextual experiments to evaluate our LRP attribution results. Using the same accelerometer as the UKB, we recorded a participant performing two activities under video observation: (1) a low-intensity scripted activity (hand-shaking) and (2) a high-intensity unscripted activity (playing tennis). We acquired a ground truth (the context) for the accelerometer activity through the time-synced video observations, enabling a better visual interpretation of the sensor-based characteristics attributed as relevant for detecting different pretexts. Holistic interpretations were formed by visualising the raw sensor signal, its analogous time-frequency representation through continuous wavelet transform (CWT) scalograms (Addison et al., 2009), and the time- and pretext-task-localised LRP relevance scores, all with respect to the concurrent video recordings. Details on the contextual LRP (cLRP) framework are described in Appx. A.3.1.

Weighted single task training
When training individual pretext tasks, we found that without weighted sampling, all the tasks had worse convergence behaviour (Figure 2). The performance degradation was most pronounced for AoT and permutation: the test performance for AoT stayed at the random-chance level, and the test performance for permutation dropped roughly 10 percentage points without weighted sampling.

Multi-task self-supervised learning
To investigate how different SSL configurations perform on three downstream datasets, we picked one large (Capture-24), one medium (Rowlands) and one small (Opportunity) dataset for evaluation. We trained the different tasks both individually and jointly using 1000 subjects from the UKB, then fine-tuned the models on the subsequent HAR benchmarks (Table 2).
The differences between SSL combinations on the large datasets (Capture-24 and Rowlands) were smaller than on the smaller dataset (Opportunity). There was no clear best-performing configuration, and thus, for ease of comparison, we chose to use all tasks in pre-training for the remaining experiments. In addition, training more tasks together might yield the most general representation across downstream datasets.

Downstream performance -human activity recognition
Table 3 summarises the F1 and Kappa scores for the seven HAR datasets. The random forest models outperformed the deep learning models trained from scratch on all but the Capture-24 dataset, the largest dataset in our evaluations (Table 1). The performance gap between the random forest and training from scratch was largest on the smaller datasets. Meanwhile, pre-trained models outperformed both the models trained from scratch and the random forest on all seven datasets. Fine-tuning all layers was better than fine-tuning just the fully connected (FC) layers after the convolutional layers.
The most significant improvement from pre-training was seen on the small datasets; conversely, the SSL benefit was more modest for larger datasets. In Capture-24, the F1 improvement was 2.5% when comparing the model with and without SSL pre-training. Nonetheless, with SSL pre-training, the median relative F1 improvement was 18.4% when compared against the same network trained from scratch, and 8.5% when compared against the random forest model.

Transfer learning using labelled pre-training
Even though supervised pre-training boosted the learning outcome substantially compared with training from scratch (Table 3 vs Table 4), it was surprising to see that self-supervised pre-training could outperform supervised pre-training when using Rowlands and Capture-24 as the source data. We suspect that the limited label sets of the source datasets did not contain as much information as what would have been learnt via the SSL pretext tasks.

Ablation studies
Varying labelled data in the downstream We observed that pre-trained models did well regardless of the number of labelled subjects in the two downstream datasets (Figure 3a). However, fully supervised and random forest models were more susceptible to the number of labelled subjects. The performance gain from having more labelled subjects was roughly linear w.r.t. the number of people included, with a greater increase when we had fewer labelled subjects.
Varying unlabelled data in the pre-training We also found that the downstream HAR performance appeared to increase linearly w.r.t. the number of unlabelled subjects on a log scale (Figure 3b, left). This conforms with a previous finding in a semi-supervised setting (Oliver et al., 2018). The SSL performance boost from more unlabelled subjects was most significant on the smallest dataset, Opportunity. Furthermore, with the number of participants fixed at 10,000 in pre-training, the data ratio included per subject did not significantly influence the downstream performance. To our surprise, the downstream performance did not degrade much even when we reduced the per-subject data ratio from 1.0 to .25 (Figure 3b, right). We suspect that because we had a sufficiently large sample size (10,000), having more data per subject no longer helped to learn a better representation. The trade-off between inter-subject and intra-subject variability in self-supervision warrants future investigation. Knowing how to prioritise the number of subjects versus the amount of data per subject would benefit the data curation process, especially in fields like the medical sciences, where it might be easier to get some data from lots of people than lots of data from some people.

Understanding the representation
Cluster analysis We used UMAP (McInnes et al., 2018) with default parameters to obtain low-dimensional projections for visualisation. This was applied to the raw inputs, untrained features, and SSL-derived features without fine-tuning. Results for two of the downstream datasets are shown in Figure 4, and the remaining results can be found in Appx. Figure 7. Across all datasets, we observed that the SSL-derived features were better at clustering similar activities (e.g. walking, stair climbing vs sitting, writing, typing) as well as their intensities (e.g. lying down, sitting, standing vs jogging, sports), exhibiting better intra-class compactness and inter-class separability.
Feature interpretation Next, we visualised two exemplary pretext SSL task predictions in the presence of repetitive low- and high-intensity activities: shaking hands (Figure 5a) and playing tennis (Figure 5b). During tennis playing, a repetitive high-intensity activity, relevance scores tended to highlight the moments around the natural movements of swinging and hitting the tennis ball (Figure 5b, Appx. Figure 9). When performing a repetitive low-intensity activity, for example shaking hands (Figure 5a), LRP appeared to also identify the intensity and natural signal periodicity as indicative of the original activity. In contrast, for augmented signals, our model attributed more relevance during periods of visually unrealistic motion dynamics, such as unnatural fragmentation in activity frequency or synchronisation mismatches between sensor axes. Interestingly, stationary movement periods were not relevant for detecting the pretext tasks, further emphasising the importance of our movement-based weighted sampling approach during training. We found that most tasks tended to identify similar regions as relevant. However, this might be expected given the large shared feature-extractor base across all tasks, with smaller MTL heads for each pretext task.
Finally, we empirically compared the faithfulness of the XAI algorithms investigated, and of various combinations of LRP parameters, using sample-masking experiments on a random subset of 1000 (out-of-sample) subjects from the UKB (Appx. Section A.4.2). Most XAI models consistently demonstrated the ability to identify relevant patterns for discriminating transformed samples from the original raw data when compared against a random model.

Conclusion and future work
Our main findings are: (1) Self-supervised pre-training consistently improved downstream HAR, especially in the small-data regime, reducing the need for labelled data. (2) It is possible to learn representations that generalise well across external datasets, tasks, devices and populations; this contrasts with previous studies where SSL training and evaluation were performed on the same data sources.
(3) Multi-task SSL appears to learn human motion dynamics, intensity levels, and the synchronisation between different axes (Figure 4 and Figure 5). Even before fine-tuning, the pre-trained models already have the capacity to distinguish different human activities.
In the future, we could include other compatible data modalities such as electrocardiogram data.
Having a multi-modal representation would allow us to build foundation models that could be used in critical fields like health research (Bommasani et al., 2021; Spathis et al., 2022). Another potential direction is the analysis of inter-subject and intra-subject variability in sensor data, which could better inform data curation procedures.
Due to a current lack of raw accelerometer datasets in different regions of the world, a limitation of our work is that the pre-training data (UKB dataset) consists mostly of Caucasians from the UK.A natural extension of this work could include datasets from different demographic groups, as they become available in the future, to improve model fairness.
Let R_k denote the relevance of neuron k in layer (l+1), and let R_{j←k} define the share of R_k that is redistributed to neuron j in layer (l). The fundamental concept underpinning LRP is the conservation of relevance per layer, which can be denoted as Σ_j R_{j←k} = R_k and R_j = Σ_k R_{j←k}, so that Σ_j R_j^(l) = Σ_k R_k^(l+1). The LRP algorithm initiates at the model output and iterates over all layers in a backwards pass until the relevance scores R_i for all inputs x_i are computed. Relevance values R_i > 0 signify components x_i that represent the presence of the predicted class, while conversely R_i < 0 contradict the prediction of that class; R_i ≈ 0 indicates inputs x_i that have little or no influence on the model's decision.
It has been demonstrated that the combination of different rules throughout a network yields the most faithful and understandable LRP explanations (Kohlbrenner et al., 2020). As such, we applied a composite rule (LRP-CMP), with LRP-γ applied to the shallower convolutional layers, LRP-ϵ rules (ϵ = {1e-9, 1e-3, 10}) in the middle of the network, and LRP-0 applied to the final linear classification layer.
Visually interpreting SSL attribution: To visually interpret SSL attribution, we compared the raw time-series accelerometry with the analogous SSL-task-transformed data. In the panel plots depicted in Figures 5 and 9, the top rows show the tri-axial accelerometer trace for each channel (a_x, a_y, a_z); the second rows depict the top view of the continuous wavelet transform (CWT) scalogram of ∥a∥, i.e. the absolute value of the CWT as a function of time and frequency. The bottom rows denote the relevance values (R_i) attributed using LRP. Red and hot colours identify input segments where R_i > 0 (contributing to a class prediction), blue and cold hues identify R_i < 0 (contradicting a class prediction), and black represents R_i ≈ 0, i.e. inputs with little or no influence on the model's decision. Square patches over the raw accelerometer trace correspond to the video frames depicted above each panel plot.
Evaluating XAI algorithm faithfulness: To test the faithfulness of an explanation provided by an XAI framework, a sample-masking experiment was performed, comparing several popular XAI models and LRP parameter options. Briefly, the most relevant samples identified by each XAI algorithm were cumulatively masked (imputed with random Gaussian noise), from most relevant to least relevant (Samek et al., 2021). The faster the accuracy of the model decreases with the number of masked samples, the more faithful the explanation method is with respect to the decision of the neural network.
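The sample-masking procedure can be sketched as follows; the helper name, `predict_fn` interface, step count and Gaussian imputation details are illustrative, not the paper's code:

```python
import numpy as np

def masking_curve(x, relevance, predict_fn, steps=10, rng=None):
    """Cumulatively mask the most relevant samples and track the model score.

    x: window of shape [time, channels]; relevance: per-timestep scores from
    an XAI method; predict_fn maps a window to the model's score for the
    target class. Samples are replaced with Gaussian noise from most to
    least relevant; a faster score drop indicates a more faithful explanation.
    """
    rng = rng or np.random.default_rng()
    order = np.argsort(relevance)[::-1]        # most relevant first
    masked = x.copy()
    scores = [predict_fn(masked)]
    for chunk in np.array_split(order, steps):
        masked[chunk] = rng.normal(size=(len(chunk), x.shape[1]))
        scores.append(predict_fn(masked))
    return np.array(scores)
```

Comparing the resulting degradation curves across XAI methods (and against a random masking order) gives the faithfulness ranking described above.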
Permutation tests were performed in random batches containing correctly identified duplicate original and augmented samples (AoT + permutation + TW) from 1000 subjects in the out-of-sample test set of the UKB. The mean degradation in SSL prediction through our permutation experiment is shown in Figure 8. We compared various LRP parameter options (LRP-0, LRP-ϵ, LRP-CMP) as well as some popular off-the-shelf XAI attribution frameworks, such as saliency mapping (Simonyan et al., 2013), Guided Backpropagation (GBP) (Springenberg et al., 2014) and Integrated Gradients (IG) (Sundararajan et al., 2017).

Table 6: Downstream human activity recognition performance (subject-wise F1 (±SD)) when using different self-supervised learning tasks after fine-tuning: Arrow of Time (AoT), Permutation (P), Time warping (TW). The tasks were trained using the same 1,000 UK-Biobank participants.

Figure 2 :
Figure 2: Test accuracy curves for training four self-supervised tasks individually using 1000 subjects from the UK-Biobank, with and without weighted sampling. The patience for early stopping was five.

(a) How different models perform on the downstream tasks when we change the number of labelled subjects: Capture-24 (left) and Rowlands (right).
(b) Left: more unlabelled subjects in SSL leads to better downstream performance. Right: increasing the data ratio per subject does not improve downstream performance when using 10,000 subjects for pre-training.

Figure 3 :
Figure 3: The impact of varying the amount of labelled or unlabelled data in the self-supervised training and fine-tuning stages on subsequent human activity recognition performance. Mean F1 ± SD are plotted.

Figure 4 :
Figure 4: Cluster analysis on raw inputs, untrained features and SSL-pretrained features. We use colour gradients to denote activity intensities. Results suggest that SSL-derived features are better at clustering similar activities (e.g. walking, stair climbing vs. sitting, writing, typing) as well as their intensities (e.g. lying down, sitting, standing vs. jogging, sports).

Figure 6 :
Figure 6: Data collected in free-living environments tend to have more stationary periods than data collected in lab environments.

Table 1 :
Wrist-worn accelerometer datasets used to evaluate the utility of self-supervised learning for human activity recognition tasks

Table 2 :
Downstream human activity recognition performance (subject-wise F1 (±SD)) for different self-supervised task combinations using 1,000 UK-Biobank participants.N is the number of samples.

Table 3 :
Performance for downstream human activity recognition tasks (mean ± SD) using 100,000 participants for pre-training. The relative improvement compares the model trained from scratch with the model fine-tuned using all the layers.

Table 4 :
Performance comparison between supervised pre-training and self-supervised pre-training.

Table 7 :
Test accuracies on the different self-supervised tasks using 1000 subjects.