Introduction

This section covers overview, motivation, novelty, and aim of the study.

Overview

Regular physical activity is one of the most important contributors to our health. Physical activity improves brain health, helps manage weight, reduces the risk of chronic disease (e.g., type II diabetes, metabolic syndrome, cardiovascular disease, high cholesterol, high blood pressure, and some cancers), strengthens bones and muscles, reduces symptoms of mental health conditions (e.g., depression, anxiety), and improves the ability to perform everyday activities, irrespective of age, ability, and ethnicity1,2,3. The World Health Organization (WHO) defines physical activity as any bodily movement produced by energy-consuming skeletal muscles. Physical activity includes recreational sports, movement to and from places, and activity performed as part of an individual's work. Both moderate and vigorous physical activity can improve health. Popular ways to be active include walking, cycling, running, weight training, and active recreation, and physical activity can be practiced at any intensity level or age3. People who do not exercise enough have a 20–30% increased risk of death compared to those who are adequately active3, and more than 80% of young people worldwide are not physically active enough3. WHO recommends that adults aged 18–64 do at least 150–300 min (about 5 h) of moderate physical activity (MPA), at least 75–150 min (about 2.5 h) of vigorous physical activity (VPA), or an equivalent combination of MPA and VPA throughout the week3. One possible way to prevent a decrease in physical activity and an increase in sedentary behavior is to use physical activity monitoring technology4. Monitoring daily physical activity towards the goal of managing a healthy lifestyle has been a challenging task and remains one of the most prevalent research challenges in health informatics; however, such monitoring has been associated with more physical activity and less sedentary behavior4. Various smart devices (e.g., Fitbit, Garmin, smartwatches, SenseWear Mini Armband, MyWellness Key accelerometer, ActiGraph, pedometers, and smartphones with installed applications) are available on the market to monitor and track fitness-related metrics (e.g., steps, VPA, MPA, low physical activity (LPA), sedentary bouts, calories burnt, and distance covered by running or walking) and related vital signs (e.g., heart rate variability, respiratory rate, heart rate). The collected activity data is often available preprocessed (e.g., PMData5, Zenodo activity data6) or raw (e.g., UCI-HAR, WISDM, SHL, MD, HARTH, and AlgoSnap)7. Such data is regarded as very important in the scientific research community. Several researchers have explored the use of sensors available in mobile devices to identify stationary activities for further applications in different scenarios related to ambient assisted living (AAL) and augmented living environments (ALE)7. To date, however, there has been a scarcity of openly accessible datasets capturing physical activity data with the MOX2-5 activity monitoring device, a medical-grade (CE-certified) sensor.

Motivation

According to scientific database searches, many articles report experiments on activity datasets collected with different wearable activity devices; however, most of these datasets are private, so the results are difficult to replicate or extend. Furthermore, the availability of high-quality, diverse, and sufficiently large datasets for training and evaluating algorithms remains a bottleneck in research and development. To address this challenge, we present a proof-of-concept study that utilizes the MOX2-5 activity sensor8 to generate a comprehensive dataset for physical activity monitoring. Synthetic datasets offer a promising solution to the scarcity of real-world data, giving researchers and practitioners access to a wider range of scenarios and activities. This study not only releases the MOX2-5 dataset to the public but also showcases the viability and efficacy of synthetic datasets in enhancing the accessibility of training data for activity recognition models. The MOX2-5 dataset featured in this article offers preprocessed daily physical activity data.

One of the key aspects of our approach is the emphasis on semantic representation. We recognize the need for semantic enrichment of data to exploit the full potential of activity sensor observations. Semantic representation enables data interoperability, knowledge sharing, and advanced analytics. The Semantic Sensor Network (SSN) ontology represents sensor-related information (such as data repositories, processing services, and metadata) and observations, and is therefore valuable in environments where sensor data and observations play an important role. SSN leverages Semantic Web technologies and ontologies to provide a standardized and machine-understandable way to describe, discover, and reason about sensors and sensor data. SSN is an important component of the Internet of Things (IoT) and the broader Semantic Web concept, as it enables more intelligent, contextual, and data-driven applications by improving the understanding, discovery, and use of sensor data in various fields. Integrating real-world ontologies with SSN can be complex and requires careful modeling and alignment with domain-specific standards and requirements. Our study includes a comprehensive comparative analysis between real and synthetic datasets. We evaluate the performance of activity recognition models trained on both data types, considering factors such as accuracy, robustness, and generalizability. The results reveal the utility of synthetic datasets and their potential to accelerate research progress and algorithm development in the field of physical activity monitoring.

Novel contribution

This is an extended version of our previous study9. In this study, we extend the semantic ontology design for annotating sensor observations (e.g., our MOX2-5 physical activity datasets) with well-established SSN-based semantic information, elaborate the data collection process, and make the dataset public together with its synthetic version. SSN is intended to promote semantic interoperability between sensors and data systems: it standardizes the way sensor data is described and comprehended, which facilitates communication and information sharing between different systems and applications. A universal ontology common to all sensors may not offer the same degree of standardized data output. In large-scale sensor networks and IoT applications, SSN can offer a scalable infrastructure for managing and comprehending sensor data. Using a flexible SSN ontology-based knowledge-graph design, we lay the foundation for cross-IoT-domain collaboration and innovative research.

Use case and baseline: we used the MOX2-5 dataset for daily activity-level classification with an MLP model derived from our previous study9, in which we compared the performance of the proposed MLP model with other state-of-the-art classifiers (such as Rocket, MiniRocket, and MiniRocketVoting) and the proposed MLP model performed the best. Here, we explore how synthetic data can augment the training data to increase the performance of this MLP model. We show a direction for predicting daily physical activity levels in the following activity classes: sedentary (0), low (1), active (2), highly active (3), and vigorously active (4) with the MLP classification model. This proof-of-concept study addresses the generation of synthetic datasets based on the baseline MOX2-5 dataset and the semantic annotation of physical activity sensor observations with an SSN-integrated OWL (Web Ontology Language) ontology. To verify structural consistency, we use an ontology reasoner available in Protégé. We use the SPARQL Protocol and RDF Query Language (SPARQL) for precise and efficient data retrieval and manipulation as part of ontology verification10,11. We anticipate that our findings will contribute to the broader discussion on the role of synthetic data in data-scarce domains and the importance of semantic enrichment for meaningful and interoperable data. Additionally, we aim to provide insights into the practical applications of original and synthetic datasets in real-world scenarios, particularly in healthcare, fitness, and sports science. According to our literature search, no similar studies have been found; therefore, the contribution is novel. Furthermore, we make the real and synthetic MOX2-5 datasets public on GitHub, practicing open-access research with the MOX2-5 dataset as a first study.

Aim of the study

In healthcare, finding high-volume lifelogging data is challenging, and due to privacy and ethical issues, most datasets are private. Synthetic data generation techniques, such as GC12, CTGAN13, and TABGAN14,15,16, have been used to generate synthetic data with a focus on large-scale data sharing, experimentation, and analysis without revealing sensitive information. We performed a comparative study with statistical metrics to find the best synthetic data generation method for our real MOX2-5 dataset. Moreover, we generate synthetic data with the best-performing data generation method and contribute it for open access. The MOX2-5 activity dataset and its synthetic version can be beneficial for other researchers for sedentary pattern analysis, posture detection, and step forecasting. To date, no publicly available MOX2-5 activity datasets exist. Thus, the main contributions of this work are summarized as follows.

  1) We design and develop an ontology for the semantification of observable and measurable physical activity sensor data and predictive analysis on the data.

  2) We provide an open dataset containing MOX2-5 activity measurements and provide a baseline analysis of the data.

  3) We provide synthetic data, generated from the real values of the MOX2-5 dataset, and describe empirically the advantages of synthetic data generation in healthcare using well-established generative methods.

  4) We evaluate the quality and the usefulness of the synthetic data.

  5) We capture the risks and challenges in participant recruitment for sensor-based activity data collection.

Methods

This section describes how we conducted our research, including data collection, ontology development, dataset generation, and analysis methods. Figure 1 presents the structure of the study, covering data acquisition, processing, synthetic data generation, and comparative analysis. This study used the Standards for Reporting Implementation Studies (StaRI) checklist (see Supplementary Material-1). Data collection was carried out in accordance with the relevant guidelines and regulations described in the “Ethics approval and consent to participate” section under Declarations. We followed the rules of the General Data Protection Regulation (GDPR).

Figure 1

Workflow of the real and synthetic dataset creation and comparative analysis.

Data collection

Participants and related distributions

Initially, we recruited twenty-five participants (19 men and 6 women) aged between 18 and 64; however, nine participants dropped out during data collection due to medical reasons (e.g., pregnancy), lack of self-motivation, and device incompatibility issues. Therefore, the final data acquisition was performed with sixteen volunteering healthy individuals (12 men and 4 women) from Grimstad, Norway, over a period of 30–45 days (about 1.5 months). We targeted normal-weight and overweight adults (based on BMI standards). The demographic statistics of the final population are described in Table 1.

Table 1 Demographic characteristics of participants (N = 16).

Regarding the gender distribution of the initially recruited participants for sensor-based physical activity data collection, it is important to emphasize that the study's primary focus lies in understanding broad patterns of physical activity across a diverse age range (18–64) rather than specifically examining gender-specific trends. While the initial participant demographics skew toward a higher number of men, it is also crucial to recognize that recruitment dynamics, individual preferences, and availability often influence the composition of study samples. Efforts will be made to actively address the gender balance in future participant recruitment to ensure a more representative dataset. Importantly, the study's overarching objective remains the investigation of physical activity behaviors within the specified age range, and the inclusion of participants of various genders is vital to achieving a comprehensive understanding of these patterns.

While the initial distribution may not perfectly mirror the general population's educational demographics, our recruitment strategy prioritized diversity in age to capture a broad spectrum of physical activity behaviors. Additionally, studies in the field of physical activity have often faced challenges in achieving a perfectly balanced representation across all demographic variables. The observed distribution may reflect the characteristics of individuals who were readily available and willing to participate in the study. Recognizing the importance of inclusivity, we acknowledge the feedback and intend to refine our recruitment strategies in future studies to ensure a more representative sample across various demographic factors, including education level, to enhance the generalizability of our findings.

Device information, value type, and specification

The MOX2-5 (version 5) device used in this study collects accelerometer data and processes it on-device. The MOX2-5 provides the following services over BLE communication, and each service has three types of universally unique identifiers (UUIDs): base, service, and short.

  • Device information

    • Manufacturer information

    • Model number

    • Serial number

    • Hardware revision

    • Firmware revision

    • Software revision

  • Battery

    • Battery level

  • Device control

    • Commands

    • Status response

  • Measurements

    • Request activity data

    • Activity data

The Measurement service is a custom service that reports the calculated algorithm values of the device to a host. The host requests the measured data by sending the “Request Activity Data” command with the correct parameters. Following this request, the device continues to write collected values to the host until the host has acknowledged all write actions and no values are left. The device then sends activity data updates at the requested interval. The values generated by the MOX2-5 algorithm are described in Table 2.

Table 2 Value populated through algorithm in MOX2-5.
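
For illustration only, the host-side request/notify flow described above could be sketched with the cross-platform bleak BLE library as follows. The device address, characteristic UUIDs, and command bytes are placeholders of our own choosing, not the MOX2-5 identifiers documented by Maastricht Instruments.

```python
# Hypothetical sketch of reading the Measurement service over BLE with "bleak".
import asyncio
from bleak import BleakClient

DEVICE_ADDRESS = "AA:BB:CC:DD:EE:FF"                              # placeholder MAC address
REQUEST_ACTIVITY_UUID = "0000fff1-0000-1000-8000-00805f9b34fb"    # placeholder "Request Activity Data" characteristic
ACTIVITY_DATA_UUID = "0000fff2-0000-1000-8000-00805f9b34fb"       # placeholder "Activity Data" characteristic

def on_activity_data(sender, data: bytearray):
    # Each notification would carry one block of calculated algorithm values.
    print("received:", data.hex())

async def main():
    async with BleakClient(DEVICE_ADDRESS) as client:
        # Subscribe to activity-data notifications before requesting data.
        await client.start_notify(ACTIVITY_DATA_UUID, on_activity_data)
        # Send the "Request Activity Data" command with placeholder parameter bytes.
        await client.write_gatt_char(REQUEST_ACTIVITY_UUID, bytes([0x01, 0x00]))
        await asyncio.sleep(30)    # let the device stream its buffered values

asyncio.run(main())
```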

Sampling rate

Physical activity data from the MOX2-5 sensors were collected continuously throughout the day over Bluetooth Low Energy (BLE), a short-range wireless technology standard, at a fixed sampling rate of typically around 1 Hz (1 sample per second), and stored in the comma-separated values (CSV) format. The data were typically sampled and recorded at very short intervals, often in real time or near real time.

Amplitude of the acceleration signal and movement intensity

The relationship between the amplitude of the acceleration signal and movement intensity (IMA) is directly proportional: the higher the amplitude of the acceleration signal, the higher the movement intensity (IMA), and the higher the counts-per-second value. In the context of activity monitoring, the acceleration signal reflects the rate of change in velocity of a device or body part, which correlates with the intensity of physical movement. Inertial Movement Analysis (IMA) quantifies this movement intensity based on the amplitude of the acceleration signal. When the amplitude is higher, it indicates more vigorous and energetic movements, such as running or jumping, resulting in an elevated IMA value. To quantify and measure these movements over time, the concept of “counts per second” is introduced. As the amplitude increases, the device registers a higher count of acceleration events per second, further emphasizing the link between amplitude, movement intensity, and the counts-per-second metric. This relationship is fundamental to interpreting and analyzing data from accelerometers, providing valuable insight into the dynamic and kinetic aspects of physical activities. The correlation between IMA and energy expenditure can be expressed in metabolic equivalent of task (MET) values and is established as:

  • LPA: between 1.5 and 3 METS.

  • MPA: between 3 and 6 METS.

  • VPA: 6.0 or more METS.

For an upper leg MOX2-5 activity device placement, the corresponding IMA thresholds can be represented as:

  • 4.5 < LPA ≤ 11.9 cycles per seconds (cps).

  • 11.9 < MPA ≤ 26.8 cps.

  • VPA > 26.8 cps.

Based on our observations, for each recorded minute the relation between sedentary time and activity (LPA/MPA/VPA) time can be written as:

  • \(\sum \left( \text{sedentary},\ \text{active},\ \text{weight-bearing},\ \text{standing} \right) = 60\ \text{s}\)

Based on the observed data patterns, during sleep the sedentary time per minute goes as high as ≈ 58–60 s, with IMA ≈ 0–20, step count ≈ 0, and activity time = 0.
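
As a minimal illustration of the upper-leg thresholds listed above, the sketch below maps a per-minute average IMA value (counts per second) to an activity level. The function name and the handling of values at or below 4.5 cps are our own choices.

```python
def activity_level_from_ima(ima_cps: float) -> str:
    """Map an average IMA value (cps) to an activity level using the upper-leg thresholds."""
    if ima_cps > 26.8:
        return "VPA"            # vigorous physical activity
    if ima_cps > 11.9:
        return "MPA"            # moderate physical activity
    if ima_cps > 4.5:
        return "LPA"            # low physical activity
    return "sedentary"          # at or below the LPA threshold

# Example: a minute averaging 15 cps is classified as moderate activity.
print(activity_level_from_ima(15.0))   # -> "MPA"
```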

Device wear location

The MOX2-5 activity sensor should be worn and placed in accordance with specific guidelines to ensure accurate data collection. The recommended placement involves securing the sensor to a specific part of the body as specified in Ref.8 (see Fig. 2), such as the thigh, hip, waist, or wrist, depending on the device design and the type of physical activity being monitored. It can also be worn on the chest; however, that requires a separate device version. Additionally, users were advised to wear the MOX2-5 sensor consistently during the designated data collection period to maintain the integrity and reliability of the gathered information. Proper adherence to the specified wearing and placement instructions was essential for obtaining accurate and meaningful insights into an individual's physical activity patterns.

Figure 2

The wearing location of MOX2-5 activity sensor for data collection8.

We provided each participant with a MOX2-5 activity device and the supporting MOX Android application to collect and store daily activity logging data on their Android smartphone in the comma-separated values (CSV) format. The device captured the activity data required for physical activity classification (e.g., physical activity intensity or IMA, LPA, MPA, VPA), daily step forecasting, and posture detection (e.g., sedentary (lying or sitting), standing, weight bearing, step count). With low power consumption, the MOX2-5 BLE device seamlessly measures and transfers high-resolution activity parameters. The MOX2-5 activity monitor used has the following specifications: dimensions of 35 × 35 × 10 mm, ultra-light weight of 11 g, dust- and waterproof rating of IPX8, and a durability of 2 years. The device has a battery life of 7 days or 60 days (about 2 months), with a rechargeable 125 mAh lithium-ion battery. The datasets presented in this paper were captured with the accelerometer sensor, which has the following parameters: sensitivity of 4 mg/LSB, sample rate of 25–100 Hz, and dynamic range of ± 8 g.

Selection of activity sensor for this study case

Regarding the choice of the MOX2-5 medical-grade sensor for physical activity data collection, it is essential to underscore the distinctive advantages offered by this sensor in the context of this study. MOX2-5 sensors provide medical-grade precision in capturing physiological parameters during physical activity, enabling a nuanced analysis of participants' responses. While alternative sensors such as accelerometers, video cameras, and gas chemical sensors are indeed valuable in specific applications, MOX2-5 sensors excel in offering real-time, high-fidelity data on physical activity changes. This level of granularity is crucial for understanding the intricacies of physiological responses during diverse physical activities. The selection of MOX2-5 sensors aligns with the precise objectives of our research, allowing us to contribute in-depth insights into the physiological aspects of physical activity and the potential applications of medical-grade sensor technology in the health monitoring domain.

Data acquisition procedure

MOX2-5 activity monitoring devices are manufactured by Maastricht Instruments, a spin-off company of the Maastricht Hospital. They provided us with activity monitoring devices equipped with chargers and the MOX Android mobile application. The MOX2-5 device is portable, has a unique MAC address, and has a small internal storage able to hold about a week of activity data. Before the MOX can be used, the mobile application must be installed on an Android-compatible smartphone and connected to the MOX2-5 activity device over BLE. After that, the MOX2-5 device can be placed on the preferred wearing location. The MOX2-5 device continuously monitors physical activity based on the accelerometer data, initially stores the collected data in its internal storage, and then, once a connection with the mobile application is established, transfers the activity data to the smartphone for persistence in CSV format. The accelerometer is a tri-axial sensor with the coordinate variables X, Y, and Z. During the measurement, the BLE connection and data must be checked continuously. The MOX2-5 device must be disconnected, removed from the wearing location, and charged when its LED turns orange. The “Download” folder of the smartphone stores the daily activity files in CSV format, holding one activity record per minute. The MOX app provided by Maastricht Instruments is not compatible with Android versions above 9.0 or below 7.0; this created version compatibility issues for certain participants.

Proposed ontology model for semantic representation

An ontology is a formal and unambiguous representation of knowledge or information about a specific domain of interest. It serves as a structured, common vocabulary or framework for describing concepts, entities, their properties, and their relationships within the domain.

Ontology vs. databases

Ontologies are suitable for knowledge representation, semantic search, data integration, reasoning, and applications where capturing the meaning and relationships between data entities is critical. They are commonly used in areas such as the Semantic Web, healthcare (for medical ontologies), and scientific research. An ontology knowledge graph can grow based on the open-world assumption. In contrast, databases are ideal for applications that require efficient data storage, retrieval, and transaction management. They are widely used in business applications, e-commerce, finance, customer relationship management (CRM), and many other areas that require structured data management.

Ontology structure

Representing an ontology using tuples is a simple and intuitive approach in which ordered sets of elements describe the ontology's structure; each tuple represents a fact or relationship within the ontology. Mathematical representations of ontologies involve formal logic and set theory. Ontologies describe concepts, relationships, and axioms that can be represented mathematically using symbolic notation; some common mathematical representations and concepts used in ontology modeling are given in Textbox 1. Ontologies written in the Web Ontology Language (OWL) typically consist of several key components that define structures, classes, individuals, properties (data and object), axioms, restrictions, annotations, logical axioms, and namespaces. These components help formalize knowledge in a machine-readable format. OWL ontologies can become more complex by adding multiple classes, properties, axioms, and imports, allowing formal representation and automated reasoning over complex knowledge structures.

Textbox 1 The mathematical representations and concepts used in ontology modeling.

Proposed ontology

Creating a complete ontology for physical activity sensor observations, integrating it with the SSN ontology, and deploying SPARQL queries against the integrated ontology is a complex task that requires careful design and extensive development. In our designed and developed ontology model, we integrated the concepts with the SSN ontology by aligning the classes and properties of our ontology with SSN's classes and properties. We use an “Observation” class to represent our “PhysicalActivityObservation” and align properties such as “observedBySensor” with SSN's properties for sensor observations. Our ontology model consists of the elements described in Textbox 2.

Textbox 2 The elements in our ontology.
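
A minimal rdflib sketch of how a single observation can be aligned with the SOSA/SSN vocabulary is shown below. The “pa” namespace, the individual names, the example values, and the use of sosa:madeBySensor as the SSN-side counterpart of our “observedBySensor” property are illustrative choices; the published OWL ontology (Supplementary Material-2) remains the authoritative model.

```python
# Illustrative rdflib sketch of one SSN/SOSA-aligned physical activity observation.
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

SOSA = Namespace("http://www.w3.org/ns/sosa/")
PA = Namespace("http://example.org/physical-activity#")   # placeholder namespace

g = Graph()
g.bind("sosa", SOSA)
g.bind("pa", PA)

obs = PA["obs_p01_day01"]                                  # placeholder individual
g.add((obs, RDF.type, PA.PhysicalActivityObservation))     # modeled as a subclass of sosa:Observation
g.add((obs, SOSA.madeBySensor, PA["MOX2_5_sensor_01"]))
g.add((obs, SOSA.resultTime, Literal("2023-05-01T00:00:00", datatype=XSD.dateTime)))
g.add((obs, PA.steps, Literal(5400, datatype=XSD.integer)))
g.add((obs, PA.activityLevel, Literal("active")))

print(g.serialize(format="turtle"))
```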

The proposed ontology supports personalization, and the OWL representation of the same concept has been captured in Supplementary Material-2 for an individual. For the verification of our ontology model, we use MOX2-5 activity sensor’s observation data. However, the proposed ontology can be aligned with other wearable sensors for behavioral monitoring.

Analyzing the complexity of the proposed ontology involves evaluating various aspects of the ontology's structure, content, and reasoning requirements. We achieve this by considering the following:

  • measuring the size of the ontology in terms of the number of classes, individuals, properties, and axioms;

  • analyzing the depth of the class hierarchy;

  • assessing the number of object properties and data properties in the ontology;

  • evaluating the number of axioms in the ontology, including subclass axioms, equivalence axioms, disjoint axioms, and property restrictions;

  • analyzing the use of cardinality constraints (e.g., min, max, some) on properties;

  • considering the use of complex data types and restrictions on data properties;

  • determining the type of reasoning and inference required;

  • identifying the OWL profile (e.g., OWL Full in this case);

  • assessing consistency, modular design patterns, and SPARQL query complexity.

Activity level classification with MLP model

The relevant features obtained from the MOX2-5 activity device are timestamp, IMA, sedentary seconds, weight-bearing seconds, standing seconds, LPA seconds, MPA seconds, VPA seconds, and steps per minute. The “step” and “IMA” features are the most valuable and robust features of the MOX2-5 sensor-based datasets, as the other attributes (except the timestamp) are derived from them (e.g., LPA, MPA, and VPA are derived from IMA as defined in Table 3). IMA has a strong relation with steps, since steps are primarily involved as a measure of activity. In the MOX2-5 sensor, sedentary time refers to the non-activity duration, including leisure and sleep; therefore, one cannot distinguish sleep from simply being inactive.

Table 3 The relation between IMA and activity levels as per MOX2-5 algorithm.

To determine feature importance, we used traditional methods, such as SelectKBest univariate feature selection with the Chi-squared test from the sklearn Python library4,17, and ExtraTreesClassifier4,17 to cross-verify the selected features. The data is non-Gaussian in nature; thus, we used Spearman correlation4,17 analysis to explore the associations between the features, and we removed features with a high correlation coefficient (|r|). Moreover, we used forward and backward filling and averaging methods to handle missing data, and we handled outliers with boxplot analysis. After handling missing data and outliers, we converted individual activity data from per-minute entries to per-day entries. On the resulting dataset, we applied the standard rules defined in Table 4 to generate an activity-level class for a multi-class classification problem. The captured time-series activity data are continuous in nature; however, we converted them into discrete tabular form for this classification problem after removing the “Timestamp” feature. Both the final processed tabular data and its synthetic versions are part of the dataset.

Table 4 “Activity Level” class creation based on standard rules for multi-class classification problem.
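
A minimal sketch of the feature-ranking steps described above is given below, assuming the per-day activity table is loaded into a pandas DataFrame with an "activity_level" target column; the column names are illustrative, not necessarily the names in the released CSV files.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import ExtraTreesClassifier

df = pd.read_csv("MOX2_5_data_labelled.csv")          # per-day, labelled activity data
X = df[["IMA", "sedentary", "weight_bearing", "standing",
        "LPA", "MPA", "VPA", "steps"]]                # assumed feature column names
y = df["activity_level"]                              # assumed target column name

# Univariate ranking with the chi-squared test (requires non-negative features).
chi2_scores = SelectKBest(score_func=chi2, k="all").fit(X, y).scores_

# Cross-check with impurity-based importances from ExtraTreesClassifier.
tree_scores = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y).feature_importances_

# Spearman correlation (the data are non-Gaussian); drop one of any highly correlated pair.
spearman_corr = X.corr(method="spearman")

print(pd.DataFrame({"chi2": chi2_scores, "extra_trees": tree_scores}, index=X.columns))
print(spearman_corr.round(2))
```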

To classify the real activity data and its different combinations with synthetic data, we designed and developed an MLP model inspired by the architecture of a fully connected neural network (FCNN), with fivefold cross-validation and the “ReduceLROnPlateau” method17. The MLP model consisted of six layers (the first five layers with the ReLU activation function and the last layer with the SoftMax activation function). The ReLU function mitigates the vanishing-gradient problem and helps with efficient convergence. We used the categorical cross-entropy loss function when compiling our MLP model, as our dependent class was one-hot encoded. We used the ADAM optimizer as it is time- and memory-efficient. In Keras, the default ADAM configuration is α = 0.001, β1 = 0.9, β2 = 0.999, ε = 1e−08, and decay = 0.0, and we used the same configuration in this experiment. We captured the loss histories to compare training and test losses over multiple epochs. ADAM adjusts the learning rate of each parameter individually, allowing it to cope with various optimization challenges, such as vanishing or exploding gradients. ReLU replaces negative input values with zeros, which adds nonlinearity to the model while enabling faster training and better convergence. Its piecewise linear behavior promotes sparsity in neural activations and enables the network to efficiently learn complex patterns and representations.
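
A sketch of such an MLP in Keras is shown below. Only the layer count, activations, optimizer, loss, callback, and fivefold cross-validation follow the description above; the hidden-layer widths and the number of epochs are our own guesses, not the exact configuration behind Fig. 11.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import StratifiedKFold

n_features, n_classes = 5, 5            # sedentary, LPA, MPA, VPA, steps -> 5 activity levels

def build_mlp():
    model = Sequential([
        Dense(32, activation="relu", input_shape=(n_features,)),   # widths assumed
        Dense(32, activation="relu"),
        Dense(16, activation="relu"),
        Dense(16, activation="relu"),
        Dense(8, activation="relu"),
        Dense(n_classes, activation="softmax"),
    ])
    # Keras default Adam configuration (lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-8, decay=0).
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

def cross_validate(X, y, epochs=100):
    """Fivefold cross-validation over features X and integer labels y (assumed prepared)."""
    histories = []
    for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
        model = build_mlp()
        reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5)
        history = model.fit(X[train_idx], to_categorical(y[train_idx], n_classes),
                            validation_data=(X[test_idx], to_categorical(y[test_idx], n_classes)),
                            epochs=epochs, callbacks=[reduce_lr], verbose=0)
        histories.append(history.history)    # training/validation loss per epoch
    return histories
```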

Synthetic tabular dataset generation

Creating synthetic data is becoming increasingly important due to privacy concerns and data availability. Synthetic data can help to anonymize individuals while preserving the distributional nature of the data which could allow for easier sharing. Further, synthetic data can help to increase the number of samples in a dataset and increase the performance of models. We used the following methods with the tabular real MOX2-5 activity data to generate synthetic data efficiently.

Gaussian Copula (GC)

The Gaussian copula is a statistical modeling technique for data synthesis. A copula allows us to decompose a joint probability distribution into the marginal distributions of its variables and a function that “couples” these marginals together. Copulas are multivariate distributions with embedded dependency information, and Gaussian copulas are multivariate normal distributions with learned dependencies. The high-level steps for synthetic data generation with the GC method are detailed in Textbox 3. We used the Python SDV package18 and the “GaussianMultivariate” method to generate synthetic data with GC.

Textbox 3 Steps for synthetic data generation with GC method.
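
A minimal sketch of the GC synthesis step is given below, using the GaussianMultivariate model from the "copulas" package (installed alongside SDV). The file names come from the dataset description; restricting the fit to numeric columns is a simplification of our own.

```python
import pandas as pd
from copulas.multivariate import GaussianMultivariate

real_df = pd.read_csv("MOX2_5_data_labelled.csv")        # real tabular activity data

model = GaussianMultivariate()
model.fit(real_df.select_dtypes("number"))               # fit marginals + Gaussian copula on numeric columns
synthetic_df = model.sample(len(real_df))                # draw as many synthetic rows as real ones

synthetic_df.to_csv("synthetic_data_GC_labelled.csv", index=False)
```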

CTGAN

Conditional Tabular Generative Adversarial Networks (CTGAN) is a deep learning data synthesis technique. As the name suggests, it is a GAN-based approach. A standard GAN architecture consists of two neural networks: one acts as a generator, which takes some input and generates synthetic data, while a second neural network serves as a discriminator that tries to distinguish actual data from synthetic data. The output of the discriminator is fed back to the generator to help the generator produce better synthetic output. The CTGAN architecture introduces a conditional generator that generates rows conditioned on one of the discrete columns, and the training data are sampled according to the frequency of each category of that discrete column, instead of feeding the generator random training data that may not adequately represent rare categories of highly imbalanced categorical columns. This helps the GAN model explore all possible discrete values evenly (though not necessarily uniformly). CTGAN represents continuous columns with mode-specific normalization. We used the Python SDV package and the “CTGANSynthesizer” module to generate synthetic tabular data with CTGAN.
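
A minimal CTGAN sketch is shown below. Depending on the installed version, the class is exposed as ctgan.CTGAN (newer releases) or as CTGANSynthesizer (older ctgan/SDV releases); the number of epochs and the discrete-column name are illustrative.

```python
import pandas as pd
from ctgan import CTGAN   # named CTGANSynthesizer in older releases

real_df = pd.read_csv("MOX2_5_data_labelled.csv")
discrete_columns = ["activity_level"]                  # assumed name of the categorical target column

model = CTGAN(epochs=300)
model.fit(real_df, discrete_columns)                   # conditional, training-by-sampling GAN
synthetic_df = model.sample(len(real_df))

synthetic_df.to_csv("synthetic_data_CTGAN_labelled.csv", index=False)
```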

TABGAN

Currently, GANs are widely used to generate image data; however, they can also be used to create synthetic tabular data from scratch. A GAN consists of two parts: a generator and a discriminator. The generator is used to generate synthetic data from random noise at the input; the discriminator is used to classify whether a sample is real or synthetic (i.e., generated by the generator). The discriminator's feedback is used to update and optimize both the generator and the discriminator. We followed the steps described in Textbox 4 to design and develop a TABGAN in Keras with TensorFlow as the backend.

Textbox 4 Steps for synthetic data generation with TABGAN method.

Our sequential Generator model had three Dense layers, with the first two layers activated by the “ReLU” activation function and the output layer activated by the “linear” function. We initialized the kernels with “he_uniform”. The dimensions of the output layer matched the dimensions of the dataset. The Discriminator model also consisted of three Dense layers: the first two layers were activated by the “ReLU” activation function, and the output layer was activated by the “Sigmoid” activation function to discriminate real (True or 1) from synthetic (False or 0) data. We compiled the Discriminator model with the “ADAM” optimizer and the “binary_crossentropy” loss function. Moreover, the combined Generator and Discriminator model was compiled with the “ADAM” optimizer and the “binary_crossentropy” loss function.
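
A Keras sketch of this generator/discriminator pair is shown below. The activations, kernel initializer, optimizer, and loss follow the description above; the latent dimension and hidden-layer widths are assumptions of our own.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

n_features = 6            # number of columns in the tabular dataset (assumed)
latent_dim = 16           # size of the random-noise input (assumed)

generator = Sequential([
    Dense(32, activation="relu", kernel_initializer="he_uniform", input_dim=latent_dim),
    Dense(32, activation="relu", kernel_initializer="he_uniform"),
    Dense(n_features, activation="linear"),          # output matches the dataset dimensionality
])

discriminator = Sequential([
    Dense(32, activation="relu", input_dim=n_features),
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid"),                  # real (1) vs synthetic (0)
])
discriminator.compile(optimizer=Adam(), loss="binary_crossentropy", metrics=["accuracy"])

# Combined model: freeze the discriminator while training the generator.
discriminator.trainable = False
gan = Sequential([generator, discriminator])
gan.compile(optimizer=Adam(), loss="binary_crossentropy")
```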

Performance metrics

Classification

The performance of the designed and developed MLP classification model has been evaluated against precision, recall, specificity, accuracy score, F1 score, the classification report, and the confusion matrix4,17,19,20. A confusion matrix is a 2-D table (“actual” vs. “predicted”) whose cells count the “True Positives (TP)”, “False Positives (FP)”, “True Negatives (TN)”, and “False Negatives (FN)”. The equations to calculate the classification metrics are

$$ \text{Accuracy (A)} = \frac{TP + TN}{TP + FP + FN + TN}, \quad 0 \le A \le 1 $$
$$ \text{Precision (P)} = \frac{TP}{TP + FP} $$
$$ \text{Recall (R), Sensitivity (S), or True Positive Rate} = \frac{TP}{TP + FN} $$
$$ \text{Specificity} = \frac{TN}{TN + FP} $$
$$ \text{F1 score (F1)} = \frac{2 \cdot P \cdot R}{P + R}, \quad 0 \le F1 \le 1 $$
$$ \text{Matthews correlation coefficient (MCC)} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \quad -1 \le MCC \le +1 $$

Accuracy tells how close a measured value is to the actual one. Recall (sensitivity) measures the proportion of actual positives that are correctly identified. Precision measures the proportion of predicted positives that are actually positive.
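
These metrics can be computed directly with scikit-learn, as in the short sketch below; the label arrays are placeholders, and specificity can be derived from the confusion matrix.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, confusion_matrix,
                             classification_report)

y_true = [0, 2, 1, 4, 3, 2, 0, 1]          # placeholder activity-level labels
y_pred = [0, 2, 1, 3, 3, 2, 0, 2]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("Recall   :", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("F1 score :", f1_score(y_true, y_pred, average="macro", zero_division=0))
print("MCC      :", matthews_corrcoef(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, zero_division=0))
```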

Synthetic data quality evaluation

We used “Classifier F1-scores and their Jaccard similarities” to evaluate the quality of the generated synthetic data with the “Table_evaluator” Python library21. The Jaccard similarity score is a versatile and widely applicable metric that provides a simple and intuitive measure of similarity between sets; here, it is used to compare sets by measuring the overlap of their elements.

$$ J(A, B) = \frac{|A \cap B|}{|A \cup B|} $$

where J = Jaccard similarity coefficient, and A and B are sets.

$$ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \widehat{Y_i} \right)^2} $$

where \(Y_i\) = actual, \(\widehat{Y_i}\) = predicted, and n = total population.

$$ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| Y_i - \widehat{Y_i} \right| $$

where \(Y_i\) = actual, \(\widehat{Y_i}\) = predicted, and n = total population.

For vector-valued data objects X and Y, the similarity value is the dot product of X and Y divided by the sum of the squared magnitudes of X and Y minus the dot product. The average nearest neighbor is calculated as the observed average distance divided by the expected average distance21.
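
For illustration, the table_evaluator comparison could be invoked roughly as below (API names as in its 1.x releases); the file and column names are assumptions carried over from the dataset description.

```python
import pandas as pd
from table_evaluator import TableEvaluator

real_df = pd.read_csv("MOX2_5_data_labelled.csv")
fake_df = pd.read_csv("synthetic_data_CTGAN_labelled.csv")

evaluator = TableEvaluator(real_df, fake_df, cat_cols=["activity_level"])   # assumed categorical column
evaluator.visual_evaluation()                       # distributions, cumulative sums, correlations
evaluator.evaluate(target_col="activity_level")     # classifier F1-scores and similarity metrics
```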

Moreover, we used the Ordinary Least Squares (OLS)22 using “Statsmodels” to compare the real and the synthetic datasets. The OLS used the following metrics for performance measurement:

$$ R^{2} = 1 - \frac{\text{RSS}}{\text{TSS}}, $$

where R2 = coefficient of determination, RSS = sum of squares of residuals, TSS = total sum of squares

$$ \text{Adjusted } R^{2} = 1 - \frac{\left( 1 - R^{2} \right)\left( N - 1 \right)}{N - p - 1}, $$

where R2 = sample R-squared, N = total sample size, p = number of independent variables

$$ \text{Residual Standard Error (RSE)} = \sqrt{\frac{\sum_{i=1}^{n} \left( Y_i - \widehat{Y_i} \right)^2}{df}} $$

where \({Y}_{i}\) = actual, \(\widehat{{Y}_{i}}\) = predicted, df = degree of freedom.

F-value = larger sample variance/smaller sample variance = \(\frac{S_{1}^{2}}{S_{2}^{2}}\), where \(S_{1}^{2}\) and \(S_{2}^{2}\) are the sample variances.

The main difference between adjusted R-squared and R-squared is that the adjusted value accounts for the number of independent variables tested against the model, whereas R-squared does not. Adjusted R-squared is always less than or equal to R-squared, and a larger R-squared indicates a better model fit. RSE is the standard deviation of the residuals. An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis. It is most often used when comparing statistical models fitted to a dataset to identify the model that best fits the population from which the data were sampled. A comparative analysis helped us to identify the best synthetic tabular data generation method in this context from a data augmentation perspective. Furthermore, it helped to determine whether the samples come from the same distribution or not.
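
A minimal statsmodels sketch of such an OLS comparison is shown below: it regresses a feature's synthetic values on its real values and prints R², adjusted R², the F-statistic, and residual diagnostics. Pairing rows by sorted order and the "steps" column name are illustrative choices, not necessarily the exact procedure behind Fig. 4.

```python
import pandas as pd
import statsmodels.api as sm

real = pd.read_csv("MOX2_5_data_labelled.csv")["steps"].sort_values().to_numpy()
fake = pd.read_csv("synthetic_data_GC_labelled.csv")["steps"].sort_values().to_numpy()
n = min(len(real), len(fake))

X = sm.add_constant(real[:n])                 # intercept + real values as regressor
model = sm.OLS(fake[:n], X).fit()
print(model.summary())                        # reports R-squared, adj. R-squared, F-statistic, etc.
```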

Ethical approval and consent to participate

We received approval from the Regional Committees for Medical and Health Research Ethics (REK) (#53224) to execute the project. We also received ethical approval from the Norwegian Centre for Research Data (NSD)/Norwegian Agency for Shared Services in Education and Research (SIKT) (#797208). For the data collection, signed informed consent was obtained from all participants. Overall, we followed GDPR guidelines for personalized data security and privacy (data governance). Participants had the right to own and view their personal data without tampering.

Results

This section consists of data records, experimental setup, and experimental results. The experimental results elaborate evaluation of the proposed ontology and predictive analysis.

Data records

We collected physical activity data from 16 participants with the MOX2-5 wearable activity device (MOX2_5_data_unlabelled.csv and MOX2_5_data_labelled.csv in Supplementary Material-3). A detailed description of the dataset is provided in Tables 5 and 6.

Table 5 MOX2-5 activity data details for participants (N = 16).
Table 6 Participant characteristics (n = 16).

The total size of the datasets is 42 kilobytes (KB), containing 539 unique measurements. Based on the feature ranking, we selected the best five features for predictive analysis: sedentary, LPA, MPA, VPA, and steps. The class distribution for the predictive analysis is depicted in Fig. 3. We used the same dataset for synthetic data generation with GC, CTGAN, and TABGAN. We term the real data R, the GC-generated synthetic data FGC, the CTGAN-generated synthetic data FC, and the TABGAN-generated synthetic data FT. This results in the data in Supplementary Material-4 (synthetic_data_GC_labelled.csv, synthetic_data_CTGAN_labelled.csv, and synthetic_data_TBGAN_labelled.csv), 88 KB in total, which are used in the experiments in this paper. The class distributions of the FGC, FC, and FT datasets are described in Tables 7, 8 and 9.

Figure 3

The class distribution for the MOX2-5 dataset in the pie-chart.

Table 7 Description of the FGC datasets.
Table 8 Description of the FC datasets.
Table 9 Description of the FT datasets.

All the synthetic datasets have been compared with the real dataset using Ordinary Least Squares or OLS (see Fig. 4), the Jaccard similarity score (see Supplementary Material-5), absolute log means and standard deviations (Figs. 5 and 6), cumulative sums per feature (Figs. 7 and 8), and distributions per feature (Figs. 9 and 10). We found no evidence of more than one class in the FT dataset; therefore, the Jaccard similarity score could not be computed for it. OLS charts play a key role in linear regression analysis by providing visual insights into model fit, residuals, outliers, and compliance with model assumptions.

Figure 4

The comparison between R (1), FGC (2), FC (3), and FT (4) datasets.

Figure 5

The absolute log means and std between the R and FC datasets.

Figure 6

The absolute log means and std between the R and FGC datasets.

Figure 7

The cumulative sums per feature between the R and FC datasets.

Figure 8

The cumulative sums per feature between the R and FGC datasets.

Figure 9

The distribution per feature between the R and FC datasets.

Figure 10

The distribution per feature between the R and FGC datasets.

Metrics such as the F1 score and the Jaccard similarity score can both be used for classification and similarity measurement, but they serve different purposes. The F1 score evaluates the performance of a classifier, and the Jaccard similarity score quantifies the similarity between sets. Although they can be used together to evaluate classifier performance and the similarity between predicted and actual label sets, they are not directly interchangeable.

Statistical measures such as cumulative sums per feature, distributions per feature, and absolute log means and standard deviations serve different purposes when comparing two datasets. Cumulative sums help to track data trends, distributions reveal data characteristics, absolute log means quantify average feature differences, and standard deviations highlight changes in data distributions. Together, they provide valuable insights for data comparison, anomaly detection, and decision-making in a variety of analytical environments.

Experimental setup

We used Python 3.9.15 libraries, such as pandas (v. 1.5.2), NumPy (v. 1.22.4), SciPy (v. 1.7.3), Matplotlib (v. 3.6.2), Seaborn (v. 0.12.0), Plotly (v. 5.11.0), Keras (v. 2.10.0), Statsmodels (v. 0.13.2), SDV (v. 0.17.1), and Graphviz (v. 0.20.1) to process the data and build the AI models. We set up the Python environment on the Windows 10 operating system using the Anaconda distribution and used Jupyter Notebook v. 6.5.2 for development, model analysis, and data visualization. The target system has 16 GB RAM and a 64-bit architecture. As the dataset is small, we performed the overall experiment on the Central Processing Unit (CPU). Moreover, we used complementary open-source tools, such as Protégé (v. 5.x) and Apache Jena, for the design, development, and management of semantic data and ontologies.

Ontology evaluation and querying

Protégé is typically used for the visual development and management of ontologies, making it easier for ontology engineers and domain experts to create and edit them. Once the ontology was created, we used Apache Jena applications for semantic data processing and reasoning. Apache Jena was leveraged for data integration, querying, and reasoning, covering RDF data management, ontology reasoning, SPARQL querying, and integration. Jena helped with in-memory ontology persistence with the triple store database (TDB). The Fuseki server and the ARQ engine helped with remote federated querying and REST-style interaction during SPARQL query processing. Querying subjects and predicates in the ontology and loading the ontology took approximately < 0.3 s. We used the HermiT reasoner from the Protégé editor (v. 5.x) to check the structural consistency of the proposed ontology model, as it performed the best (execution time ≈ 1 s) compared to other reasoners, such as Pellet, RacerPro, and FaCT++. Supplementary Material-6 presents sample SPARQL queries demonstrating that essential information can be retrieved from the proposed ontology as presented in Supplementary Material-2. Our OWL ontology supports the OWL Full specification, a variant of OWL with its own logic and reasoning characteristics. The OWL Document Manager facilitated the creation and management of the OWL ontology, leveraging the expressiveness of predicate logic within OWL.
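
For illustration, a SPARQL query of the kind included in Supplementary Material-6 could be run locally with rdflib as sketched below. The ontology file name, the "pa" prefix, and the property names mirror the placeholder names used in the earlier rdflib sketch and are not the exact IRIs of the published ontology.

```python
from rdflib import Graph

g = Graph()
g.parse("physical_activity_ontology.owl", format="xml")   # placeholder file name (Supplementary Material-2)

query = """
PREFIX sosa: <http://www.w3.org/ns/sosa/>
PREFIX pa:   <http://example.org/physical-activity#>
SELECT ?obs ?sensor ?steps
WHERE {
    ?obs a pa:PhysicalActivityObservation ;
         sosa:madeBySensor ?sensor ;
         pa:steps ?steps .
}
ORDER BY DESC(?steps)
"""

for row in g.query(query):
    print(row.obs, row.sensor, row.steps)
```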

Evaluation outcomes

According to the results, the FT data produced biased values after 50,000 epochs. Therefore, we excluded the FT dataset when determining the MLP classification model's (see Fig. 11) performance in comparison with the real dataset (see Tables 7, 8, 9 and 10). To determine the model's classification efficiency, we trained an MLP classifier using the following data samples: R, FGC, FC, FGC + R, FC + R, and GC + FC + R. For FGC + R, FC + R, and GC + FC + R, we trained the MLP model with synthetic data and performed classification on the real data, following a transfer learning approach. The average classification outcomes of four executions are captured in Tables 7, 8, 9 and 10. In our previous study9, we compared the predictive performance of our designed and developed MLP model with other state-of-the-art time-series classification models, such as Rocket, MiniRocket, and MiniRocketVoting, and our MLP model outperformed the other classifiers on real data. Furthermore, we have extended the study with a comparative predictive analysis on the synthetic datasets. Therefore, in Tables 10, 11, 12 and 13, we capture the results of these classifiers on the different datasets to compare their performance. According to the results in Tables 10, 11, 12 and 13, the synthetic datasets consistently lead to accuracy improvements.

Figure 11

The structure of the designed and developed MLP model for classification with 2405 trainable parameters.

Table 10 MLP classification results on different subsets.
Table 11 Rocket classification results on different subsets.
Table 12 MiniRocket classification results on different subsets.
Table 13 MiniRocketVoting classification results on different subsets.

The Rocket, MiniRocket, and MiniRocketVoting classifiers are all part of the "Rocket" family of algorithms, which are designed for efficient and effective timeseries classification. These algorithms were introduced to address challenges in processing timeseries data, such as high dimensionality and the need for computationally efficient feature extraction. While these classifiers offer advantages, it's essential to note that their performance may vary based on the specific characteristics of the dataset and the requirements of the classification task.

The importance of traditional MLP models compared to other state-of-the-art classifiers depends on the specific problem, data set size, data set type, and available resources. Careful model selection and hyperparameter tuning are crucial to realize their full potential.

Discussion

This section discusses the outcome of the technical validation, the advantages of synthetic data generation in healthcare using well-established methods, such as GC, CTGAN, and TABGAN, and challenges associated with our data collection.

Principal findings

The proposed ontology with the integrated SSN representation enables more detailed modeling and querying of physical activity observations, including activity level, number of steps, sensors, and observation time. The proposed ontology has a simplified structure that not only supports data integration, semantic understanding, and sensory observation in a structured way, but also supports standardization, interoperability, semantic modeling of predictive analysis based on sensory observations, proper reasoning, and easy querying for knowledge retrieval.

Based on the experimental evaluation, the Jaccard similarity reveals that GC produced better synthetic data samples than the CTGAN method, with a close cumulative sum per feature. According to the OLS statistics, the FGC and FC datasets are close to the real MOX2-5 dataset based on individual and cumulative variable evaluation. However, GC achieved a better pairwise correlation accuracy, whereas CTGAN achieved a better classification accuracy (see Table 10). According to Table 10, CTGAN achieved a predictive performance that is better than what we achieved with the real data and the FGC data. Moreover, we have shown that the MLP model improved its classification accuracy with an increasing volume of data, as the additional data helped the MLP understand the data pattern better. The TABGAN method has not been fruitful for the MOX2-5 datasets. In the future, we can extend this study to a scalability analysis of deep learning models and other eHealth applications (e.g., eCoaching).

Modern smartphones are equipped with a variety of sensors, such as accelerometers, gyroscopes, and even barometers. These sensors provide more advanced step-counting functionality compared to traditional pedometers. While pedometers rely on relatively simple mechanisms and thresholds to count steps, smartphones, smartwatches, and smartphone apps use a combination of advanced sensors and complex algorithms to provide more accurate and versatile step-counting capabilities. However, they are not all medically approved (CE-certified) like the MOX2-5. As part of a calibration check, we used our MOX2-5 sensor alongside other devices, such as modern smartphones (e.g., OnePlus 6T, Samsung Galaxy, Nokia), smartwatches (e.g., Samsung Galaxy), and smartphone apps (such as a pedometer app, Racer, and Pacer). We asked six individuals (male: 4, female: 2) to record their step count over 2 km for a duration of seven days. We found that the MOX2-5 recorded 75–100 steps more on average. It seems that the accelerometry algorithm used in the MOX2-5 is very sensitive in its detection thresholds.

Importance of synthetic data generation in healthcare

We have shown a direction for using GC, CTGAN, and TABGAN on top of the real MOX2-5 dataset to perform a comparative analysis and show that the MLP model's efficiency grows with an increasing volume of training data. The synthetic data generation process will be helpful in creating robust methods for classifying activity types. The use of synthetic data may open opportunities for large-scale data sharing, model scalability, model efficiency, quality control, diversity, experimentation, availability, and analysis without revealing sensitive information.

Though the real MOX2-5 dataset used here is small, we have shown a direction for selecting the best data synthesis method to generate synthetic data from real datasets on a large scale. This can be helpful for other research communities, depending on their research focus and needs. We would like to emphasize that such a synthetic dataset can provide unique benefits that may not be achievable with real data alone. The use of a synthetic dataset may have the following advantages.

  • Privacy: Synthetic data can help to address privacy concerns and protect sensitive information. In many cases, it may be difficult or impossible to access or share real data due to privacy regulations or ethical considerations. By using synthetic data, researchers can create realistic and representative datasets without compromising privacy.

  • Adding more data points: The synthetic data can be used to augment existing datasets, providing more data points and a wider range of scenarios to test hypotheses. This can help to increase statistical power and improve the robustness of analyses.

  • Cost Effectiveness: Synthetic data can be used to simulate scenarios that are not currently feasible to observe in real life. For example, it may be difficult or costly to collect data on rare diseases or events, or to study the effects of interventions that cannot be ethically or practically tested on human subjects. Synthetic data can be used to simulate these scenarios and generate valuable insights. Generating synthetic data can also be less expensive than collecting and processing real data, which is especially useful in situations where the cost of obtaining real-world data is prohibitive, such as in large-scale simulation or experimental studies.

  • Diversity: Synthetic data can be used to create a wide variety of scenarios and conditions that may not be observed in real-world data. This is useful when the goal is to test the robustness of a model or algorithm under different conditions.

  • Quality Control: Synthetic data can be used to create high-quality datasets with well-known ground-truth labels. This is useful for benchmarking algorithms and evaluating their performance in a controlled environment.

  • Availability: In some cases, real data may not be available due to legal, ethical, or practical constraints. In these cases, synthetic data can be used as surrogate indicators so that researchers and practitioners can still make progress toward their goals.

Challenges associated with data collection

Recruiting participants for sensor-based activity data collection in Norway, as in any research involving human subjects, comes with its own set of challenges:

  • Privacy Concerns: Norway has strict data protection laws, such as the GDPR, which require researchers to obtain informed consent and ensure the privacy and security of participant data. Addressing these requirements can be time-consuming and complex.

  • Informed Consent: Obtaining informed consent from participants is critical, but explaining the technical aspects of sensor data collection to non-technical participants can be difficult. It is important to ensure that participants understand what data is being collected and how it will be used.

  • Recruitment Channels: Identifying appropriate recruitment channels to reach potential participants can be challenging. It involves working with healthcare facilities, community organizations, and online platforms to find suitable candidates.

  • Sample Representativeness: It is difficult to ensure that the sample of participants is representative of the broader population. Bias may occur if certain groups are more willing or able to participate in sensor-based data collection studies.

  • Technology Literacy: The success of sensor-based data collection depends on the ability of participants to interact with and understand the technology involved. Ensuring that participants have the necessary technological literacy can be challenging, especially for older or less tech-savvy populations.

  • Participant Compliance: Participants must follow instructions to wear or use the sensor at all times. Maintaining participant compliance throughout the study can be challenging, as some may forget to use the devices or feel uncomfortable.

  • Data Quality: Ensuring the quality of the collected data is critical. Technical issues, sensor failure, or incorrect use by participants may cause data inaccuracies.

  • Ethical Considerations: Researchers must consider the ethical implications of sensor-based data collection, especially when the collected data may reveal sensitive information about participants.

  • Recruitment Costs: From purchasing and maintaining sensors to participant incentives, sensor-based data collection research can be costly, and securing adequate funding can be challenging.

  • Cultural and Social Factors: Norway has a diverse population, and various cultural and social factors may influence participants' willingness to take part in a study. It is important to pay attention to these factors and adjust recruitment strategies accordingly.

To address these challenges, it is important to work with local research ethics committees to ensure transparent communication with potential participants and to adopt strategies to make participation as accessible and engaging as possible.

Conclusions

In this work, we present the MOX2-5 dataset, its synthetic version, and some baseline experiments. We elaborated the semantification rules for annotating sensory observations in the SSN ontology for knowledge representation, semantic search, data integration, reasoning, and querying. The choice between SSN and a general ontology hinges on the particular use case and needs; general ontologies, such as RDF or OWL, excel in broader knowledge representation and may be better suited for applications beyond sensor data and IoT domains. Moreover, we explained the real physical activity data collection process with the MOX2-5 activity sensor involving sixteen real participants, along with the associated challenges. Secondly, we used different synthetic data generation methods, such as GC, CTGAN, and TABGAN, to generate synthetic subsets of the real data (FGC, FC, and FT), as the real data volume was small. We then compared the real data (R) with the generated data (FGC, FC, and FT) for individual and cumulative features. We then used all the real data and the subsets (R, FGC, FC, FGC + R, FC + R, and GC + FC + R) for predictive analysis with our designed and developed MLP model. We found that the TABGAN method is not suitable for this real MOX2-5 dataset and that the GC and CTGAN methods are neck and neck; however, the FC dataset produced better accuracy than the other subsets. All the real and synthetic subsets of the dataset and the corresponding experiments are publicly available for study replication and future studies.