Introduction

Mobility has been defined as the ability to move freely and easily, representing an essential component of health and quality of life, being key to physical, mental, and social well-being1. Sudden loss in mobility has been associated with morbidity, falls, dementia, cognitive decline, hospitalizations, mortality and symptoms of chronic disorders2,3,4,5,6. Mobility loss translates to an inability to perform activities of daily living, which has also been defined as mobility disability1,7. According to the World Health Organization (WHO), it is estimated that 1 billion people currently live with mobility disability due to impairments in their respiratory, cardiovascular, musculoskeletal or neurological systems8. This number is expected to rise further due to increased longevity of the worldwide population and prolonged survival in people with chronic diseases leading to motor detriment. This will entail large public health implications and increase an already growing social and economic burden upon healthcare systems.

Efforts to mitigate mobility disability are of increasing priority and have been the focus of several recent clinical intervention trials9,10,11,12. Existing mobility endpoints are based upon patient self-reporting and one-off assessments of physical function, such as the timed up and go test or the 6-minute walk test. While such assessments provide useful insights into an individual’s mobility capacity (how much they can do), they lack ecological validity as they do not reflect an individual’s mobility performance (how much they actually do) during daily life1,7,13,14. This incomplete assessment of mobility limits therapeutic development and clinical management14. Therefore, valid and easy-to-use methods to accurately and reliably assess mobility performance would provide insight into how mobility disability manifests in the real world7.

Digital health technology, such as wearable devices, offers an objective, low-cost and ecologically valid approach to continuously monitoring real-world mobility performance through characterisation of Digital Mobility Outcomes (DMOs)7. A single wearable device can be worn unobtrusively and comfortably on the lower back, attached to a belt or affixed to the skin15,16. Walking speed remains the most widely explored DMO17, where reduced walking speed has been associated with ageing, mortality, neurological and cardiovascular conditions, cognitive decline, and risk of falling2,5,18,19,20,21. Furthermore, walking speed represents a composite measure of walking ability, as it is estimated from the combination of other spatial and temporal DMOs, specifically stepping cadence and stride length. As such, walking speed represents a global measure of mobility that can be interpreted by, and is meaningful to, patients and clinicians alike17.
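As an illustration only, the way walking speed combines a spatial and a temporal DMO can be sketched in Python. The sketch assumes cadence is expressed in steps per minute and that one stride consists of two steps; the function name and the example values are hypothetical and are not taken from the study pipeline.

```python
def walking_speed(stride_length_m: float, cadence_steps_per_min: float) -> float:
    """Walking speed in m/s from stride length (m) and cadence (steps/min).

    Assumption: one stride equals two steps, so strides per second
    equals cadence divided by 120.
    """
    if stride_length_m < 0 or cadence_steps_per_min < 0:
        raise ValueError("stride length and cadence must be non-negative")
    strides_per_second = cadence_steps_per_min / 120.0
    return stride_length_m * strides_per_second


# Made-up example: 1.2 m strides at 100 steps/min
speed = walking_speed(stride_length_m=1.2, cadence_steps_per_min=100.0)
```

Because speed is a product of the two quantities, an error in either stride length or cadence carries directly into the speed estimate.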

A lack of robust technical validation of real-world walking speed measurements has prevented the adoption of walking speed derived from wearable devices as a clinical endpoint for interventional trials7,22,23. Technical validation requires the comparison of DMOs quantified from a wearable device with DMOs quantified by an established reference system, whilst accounting for and acknowledging a wide range of contextual factors24. The majority of algorithms to estimate walking speed have typically been validated based upon healthy adults assessed in simple laboratory tasks in standardized and supervised settings that do not represent the more challenging and variable nature of real-world environments24. Furthermore, studies often validate DMO algorithms in isolation, without considering the complexity of a comprehensive multi-stage pipeline needed to estimate walking speed25,26,27,28. This first requires the identification of walking activity, followed by the quantification of DMOs (e.g., steps, cadence, stride length, walking speed). Interactions between all algorithmic steps in the pipeline will influence final outputs, and errors will accumulate along the pipeline. A robust validation of walking speed therefore requires analysis of the estimate of walking speed at a walking bout (WB) level from the implementation of the full pipeline.

The aim of this study was to provide a comprehensive validation of walking speed estimated from a single inertial measurement unit based wearable device against a multi-sensor reference system integrating pressure insoles (INDIP)29,30,31, to enable a robust validation: (i) in both laboratory and real-world settings, (ii) across different clinical cohorts with a range of mobility disabilities, (iii) across gait tasks of varying complexity and (iv) accounting for confounding factors (WB duration and walking speed). Subsequently, we provide recommendations for the suitability of a wearable device paired with the proposed analytical pipeline as a measure of real-world mobility and suggest a framework for future studies aiming to validate DMOs for real-world monitoring.

Results

Clinical and demographic characteristics of the participants are presented by cohort in Table 1 (mean ranges: age 47–79 years, height 166–176 cm, mass 70.6–83.6 kg). Participants were recruited (n = 108) from the following cohorts: congestive heart failure (CHF), chronic obstructive pulmonary disease (COPD), healthy adult (HA), multiple sclerosis (MS), Parkinson’s disease (PD) and proximal femoral fracture (PFF). Eleven participants (CHF: 4, MS: 3, PD: 1, PFF: 3) were excluded from the laboratory recordings and 26 participants (CHF: 3, HA: 3, MS: 7, PD: 5, PFF: 8) from the real-world recordings due to technical difficulties with either the reference system or the wearable device.

Table 1 Demographic and clinical characteristics of the participants included in the real-world analysis.

An overview of the method of validating walking speed can be viewed in Fig. 1.

Figure 1

Overview of (a) the TVS protocol, (b) the analytical pipeline applied to estimate walking speed from the wearable device data (WD), (c) the approach to validating walking speed estimated from the analytical pipeline.

True Positive Evaluation

In the laboratory assessments, a total of 1365 WBs were detected by the reference system and 1298 WBs by the wearable device. To compare DMOs at the WB level, the analysis included only WBs that were concurrently detected by both systems (true positive analysis). All WBs with a time overlap of more than 80% of their duration were considered true positive (TP), resulting in 692 WBs for analysis (see Methods and Supplementary Figs. 1 and 2 for more details). Based on these true positive WBs, we observed a mean error of 0.01 m/s (mean relative error, MRE = 5.9%) and a mean absolute error (MAE) of 0.10 m/s (mean absolute relative error, MARE = 14.96%) across all cohorts (Fig. 2, left; Table 2). We found that walking speed was estimated with good reliability (ICC = 0.84) by the wearable device, with a slight overestimation compared to the INDIP reference system.
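A minimal Python sketch of such overlap-based matching is given here for illustration. It rests on assumptions that this section does not specify: a pair is accepted only when the shared time exceeds the threshold relative to the duration of both bouts, and each reference bout is matched at most once. All names and example intervals are hypothetical.

```python
from typing import List, Optional, Tuple

Bout = Tuple[float, float]  # (start_s, end_s)


def overlap_s(a: Bout, b: Bout) -> float:
    """Length in seconds of the time interval shared by two bouts."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def match_true_positives(
    device: List[Bout], reference: List[Bout], threshold: float = 0.8
) -> List[Tuple[int, int]]:
    """Return (device_index, reference_index) pairs counted as true positives.

    Assumption: the overlap must exceed `threshold` of the duration of BOTH
    bouts, and each reference bout is used at most once.
    """
    pairs: List[Tuple[int, int]] = []
    used_reference = set()
    for i, d in enumerate(device):
        best: Optional[int] = None
        best_overlap = 0.0
        d_dur = d[1] - d[0]
        for j, r in enumerate(reference):
            r_dur = r[1] - r[0]
            if j in used_reference or d_dur <= 0 or r_dur <= 0:
                continue
            shared = overlap_s(d, r)
            enough = shared > threshold * d_dur and shared > threshold * r_dur
            if enough and shared > best_overlap:
                best, best_overlap = j, shared
        if best is not None:
            used_reference.add(best)
            pairs.append((i, best))
    return pairs


# Made-up bouts: the second device bout covers too little of its reference bout
device_wbs = [(0.0, 10.0), (20.0, 25.0), (40.0, 60.0)]
reference_wbs = [(0.5, 10.5), (20.0, 35.0), (41.0, 59.0)]
matches = match_true_positives(device_wbs, reference_wbs)
```

In this toy example only the first and third device bouts are retained, which shows how a strict overlap rule can discard a large share of the detected bouts.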

Figure 2

Residual plots of walking speed for all true-positive WBs recorded in the laboratory (left) and during the real-world recording (right). The margin plots represent the overall speed and error distributions. The margin plots are further grouped by the performed tests for the laboratory and by the cohort for the real-world recordings. The light blue bars around the Limits of Agreement (LOA) (dashed horizontal lines) represent their bootstrapped confidence intervals. The dashed black line represents the result of a linear regression on all datapoints. The grey area around the regression line represents the bootstrapped 95% confidence intervals.
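For readers unfamiliar with bootstrapped limits of agreement, the idea can be sketched as follows. This is a generic percentile bootstrap under our own assumptions (limits defined as the mean error plus or minus 1.96 standard deviations, resampling of bouts with replacement, 2000 resamples); the resampling scheme actually used for the figure is not described in this section, and the error values are made up.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def limits_of_agreement(errors: Sequence[float]) -> Tuple[float, float]:
    """Lower and upper limits: mean error -/+ 1.96 standard deviations."""
    mean_error = statistics.fmean(errors)
    half_width = 1.96 * statistics.stdev(errors)
    return mean_error - half_width, mean_error + half_width


def percentile_interval(values: List[float], alpha: float) -> Tuple[float, float]:
    """Percentile interval covering the central (1 - alpha) share of values."""
    ordered = sorted(values)
    n = len(ordered)
    low_index = min(n - 1, max(0, int(alpha / 2 * n)))
    high_index = min(n - 1, max(0, int((1 - alpha / 2) * n) - 1))
    return ordered[low_index], ordered[high_index]


def bootstrap_loa_ci(
    errors: Sequence[float], n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[Tuple[float, float], Tuple[float, float]]:
    """Confidence intervals for the lower and the upper limit of agreement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    lowers, uppers = [], []
    for _ in range(n_resamples):
        resample = rng.choices(errors, k=len(errors))
        lower, upper = limits_of_agreement(resample)
        lowers.append(lower)
        uppers.append(upper)
    return percentile_interval(lowers, alpha), percentile_interval(uppers, alpha)


# Made-up per-bout errors (device minus reference, m/s)
example_errors = [0.05, -0.02, 0.10, 0.01, -0.08, 0.12, 0.03, -0.04, 0.07, 0.00]
lower_ci, upper_ci = bootstrap_loa_ci(example_errors, n_resamples=500)
```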

Table 2 Characterization of relative and absolute errors, intraclass correlation coefficient (ICC) and limits of agreement (LoA) for walking speed estimated from the true-positive walking bouts (WBs) from all laboratory tasks combined and the real-world assessment.
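The error measures listed in this caption can be illustrated with a short Python sketch. It assumes that errors are computed as device minus reference (so that positive values indicate overestimation by the wearable device), that relative errors are expressed as a percentage of the reference value, and that limits of agreement are the mean error plus or minus 1.96 standard deviations. The ICC is omitted because its exact form is not stated in this section. Names and values are hypothetical.

```python
import statistics
from typing import Dict, Sequence


def agreement_metrics(device: Sequence[float], reference: Sequence[float]) -> Dict[str, float]:
    """Error metrics for paired walking speed values in m/s."""
    if len(device) != len(reference) or len(device) < 2:
        raise ValueError("need at least two paired values")
    if any(r <= 0 for r in reference):
        raise ValueError("reference speeds must be positive for relative errors")
    # Assumption: error = device - reference
    errors = [d - r for d, r in zip(device, reference)]
    relative = [100.0 * e / r for e, r in zip(errors, reference)]
    mean_error = statistics.fmean(errors)
    spread = 1.96 * statistics.stdev(errors)
    return {
        "ME": mean_error,
        "MAE": statistics.fmean([abs(e) for e in errors]),
        "MRE_percent": statistics.fmean(relative),
        "MARE_percent": statistics.fmean([abs(p) for p in relative]),
        "LoA_lower": mean_error - spread,
        "LoA_upper": mean_error + spread,
    }


# Made-up paired speeds for four walking bouts
metrics = agreement_metrics(
    device=[1.0, 1.2, 0.6, 0.9],
    reference=[1.0, 1.0, 0.5, 1.0],
)
```

Note that the same absolute error of 0.1 m/s yields a relative error of 20% at a reference speed of 0.5 m/s but only 10% at 1.0 m/s, which is why relative errors grow for slow bouts.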

In the 2.5-h real-world assessment, the reference system detected 4409 WBs, while the wearable device identified 4620 WBs. The average sensitivity and specificity for WB detection compared to the reference system were 0.65 and 0.99, respectively (Table 3). Across all detected WBs, 1414 (30% of all WBs) were identified as true positive WBs (i.e., more than 80% overlap with a reference WB). Based on these WBs, we observed a mean error of 0.06 m/s (MRE = 14.48%) and an MAE of 0.11 m/s (MARE = 20.31%) across all cohorts (Fig. 2, right; Table 2). As observed in the laboratory data, results showed good reliability (ICC = 0.77) (Table 4), with an overestimation of walking speed by the wearable device (0.01 m/s).

Table 3 The performance of the WB detection calculated by comparing, sample by sample, the detected walking bout regions by the single wearable device with the detected walking bout regions by the reference system in the real-world recordings.
Table 4 Walking speed ranges and error analysis for the results combined per Laboratory test (laboratory) or over the entire real-world recording.
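A sample-by-sample comparison of detected walking regions, as referred to in the Table 3 caption, could look like the following sketch. The conversion of bouts into boolean masks, the sampling rate and the example intervals are assumptions made for illustration and are not taken from the study.

```python
from typing import List, Sequence, Tuple


def bouts_to_mask(bouts: Sequence[Tuple[float, float]], n_samples: int, fs_hz: float) -> List[bool]:
    """Mark every sample that falls inside any (start_s, end_s) bout."""
    mask = [False] * n_samples
    for start_s, end_s in bouts:
        first = max(0, int(round(start_s * fs_hz)))
        last = min(n_samples, int(round(end_s * fs_hz)))
        for k in range(first, last):
            mask[k] = True
    return mask


def sensitivity_specificity(device_mask: Sequence[bool], reference_mask: Sequence[bool]) -> Tuple[float, float]:
    """Sample-wise sensitivity and specificity of the device against the reference."""
    tp = fp = tn = fn = 0
    for detected, truth in zip(device_mask, reference_mask):
        if truth and detected:
            tp += 1
        elif truth and not detected:
            fn += 1
        elif detected:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


# Made-up 20 s recording sampled at 1 Hz
reference_mask = bouts_to_mask([(2.0, 10.0)], n_samples=20, fs_hz=1.0)
device_mask = bouts_to_mask([(4.0, 10.0), (15.0, 16.0)], n_samples=20, fs_hz=1.0)
sens, spec = sensitivity_specificity(device_mask, reference_mask)
```

Because most of a real-world recording contains no walking, the true negative count dominates, so specificity can be very high even when a sizeable share of walking is missed.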

Combined evaluation

To remove potential bias introduced by focusing only on the true-positive WBs, and to mimic actual use of a wearable device where reference data may not be available, we performed a second evaluation in which we combined all WBs within each laboratory test and within the 2.5-h real-world recording by taking the median of the calculated DMOs (see Methods). These combined values were then compared between the systems. Results from laboratory data showed a mean error over all tests of 0.01 m/s (MRE = 7.47%) and an MAE of 0.12 m/s (MARE = 17.82%) (Fig. 3, left; Table 4). In contrast, in the real-world we observed a higher mean error over all participants of 0.11 m/s (MRE = 24.48%) and an MAE of 0.13 m/s (MARE = 26.47%) (Fig. 3, right; Table 4). For both environments the errors were higher than those estimated from the analysis on true-positive WBs. The biggest effect was seen on the ICC during the real-world recording, which dropped considerably to 0.33 (poor) across all cohorts and showed no (PFF: 0.04) or even negative correlation (MS: −0.15). This is not surprising given the limited number of datapoints included in this type of analysis (one datapoint per participant).
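The combined evaluation can be illustrated with a small Python sketch: each system contributes the median over all of its own detected WBs, so the two lists need not contain the same bouts or even the same number of bouts. The data structure, participant identifiers and speed values below are hypothetical.

```python
import statistics
from typing import Dict, List


def combined_errors(
    device_speeds: Dict[str, List[float]],
    reference_speeds: Dict[str, List[float]],
) -> Dict[str, float]:
    """Per-participant difference between the medians of all WB speeds (m/s).

    Only participants present in both systems are compared; each list is
    expected to contain at least one walking bout.
    """
    errors: Dict[str, float] = {}
    for participant in sorted(device_speeds.keys() & reference_speeds.keys()):
        device_median = statistics.median(device_speeds[participant])
        reference_median = statistics.median(reference_speeds[participant])
        errors[participant] = device_median - reference_median
    return errors


# Made-up walking speeds per detected bout; bout counts differ between systems
device_speeds = {"P01": [1.0, 1.2, 1.1], "P02": [0.7, 0.9]}
reference_speeds = {"P01": [0.9, 1.0, 1.1, 1.3], "P02": [0.6, 0.8, 0.7]}
per_participant = combined_errors(device_speeds, reference_speeds)
```

Each participant is reduced to a single value here, which is why agreement statistics computed on these values rest on very few datapoints per cohort.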

Figure 3

Residual plots for the walking speed combined over all identified WBs. For the laboratory tests the median over all WBs within one motor task is taken (left). For the real-world recording the median over all WBs in the entire real-world assessment is shown (right), where each datapoint represents an individual participant. The margin plots represent the overall speed and error distributions. The margin plots are further grouped by the performed tests for the laboratory and by the cohort for the real-world recordings. The light blue bars around the Limits of Agreement (LOA) (dotted horizontal lines) represent their bootstrapped confidence intervals. The dashed black line represents the result of a linear regression on all datapoints. The grey area around the regression line represents the bootstrapped 95% confidence intervals.

Factors that can influence walking speed validity

Influence of the cohort

The MAE based on the true-positive evaluation differed by < 0.05 m/s between cohorts in both laboratory and real-world settings. In the laboratory, the COPD cohort had the lowest MAE (0.06 m/s) followed by HAs (0.08 m/s) (Table 2), whereas the PFF and CHF cohorts had the largest MAE of 0.12 m/s. In the real-world, HAs presented the lowest MAE (0.09 m/s) followed by the PFF cohort (MAE = 0.11 m/s) (Table 2). Walking speed tended to be overestimated for all cohorts apart from CHF, for which walking speed was underestimated by 0.06 m/s in the laboratory and 0.04 m/s in the real-world.

Influence of WB duration and walking speed

In the analysis based on the true positive WBs, errors decreased for longer WB durations (Fig. 4). MAE across all cohorts for very short WBs (< 10 s) ranged between 0.09 and 0.16 m/s, compared to 0.06–0.11 m/s for long WBs (between 60 and 120 s). However, as WB duration increased, the number of available WBs included in the validation analysis decreased. When looking at the combined approach (across all detected WBs; Fig. 5), the trends from the true-positive analysis were confirmed. The MAE for the very short WBs (< 10 s) was lower than for the short WBs (10–30 s), and the number of very short WBs detected by the wearable device was disproportionately higher compared to the reference system. Overall, about two thirds of all WBs were shorter than 30 s. When removing very short WBs (< 10 s) from the calculation of the mean/median errors, the range in error was marginally smaller for some cohorts than the error observed across all WBs for the true positive analysis (improvement < 0.1 m/s). The median of the absolute difference of the combined analysis increased from 0.1 m/s over all WBs to 0.14 m/s for the WBs longer than 10 s.
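Grouping errors by WB duration can be sketched in Python as below. The bin edges at 10, 30, 60 and 120 s follow the duration ranges mentioned in the text; the open-ended last bin, the handling of values exactly on an edge and the example bouts are our own assumptions.

```python
import bisect
import statistics
from collections import defaultdict
from typing import Dict, Iterable, Tuple

BIN_EDGES_S = [10.0, 30.0, 60.0, 120.0]
BIN_LABELS = ["<10 s", "10-30 s", "30-60 s", "60-120 s", ">120 s"]


def mae_by_duration(
    bouts: Iterable[Tuple[float, float, float]]
) -> Dict[str, Tuple[float, int]]:
    """Map each duration bin to (MAE in m/s, number of bouts).

    Each bout is (duration_s, device_speed, reference_speed). A duration
    exactly on an edge is assigned to the longer bin (assumption).
    """
    grouped = defaultdict(list)
    for duration_s, device_speed, reference_speed in bouts:
        label = BIN_LABELS[bisect.bisect_right(BIN_EDGES_S, duration_s)]
        grouped[label].append(abs(device_speed - reference_speed))
    return {
        label: (statistics.fmean(grouped[label]), len(grouped[label]))
        for label in BIN_LABELS
        if label in grouped
    }


# Made-up true-positive bouts: (duration_s, device m/s, reference m/s)
example_bouts = [(5.0, 0.9, 0.7), (8.0, 0.8, 0.7), (20.0, 1.0, 0.9), (90.0, 1.25, 1.2)]
by_duration = mae_by_duration(example_bouts)
```

Reporting the bout count next to each MAE matters because, as noted above, long bins contain far fewer bouts than short ones.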

Figure 4

The dependency of the absolute walking speed error of all true-positive WBs from the real-world recording on the WB duration reported by the reference system. In the top, WB errors are grouped by various duration bouts. In the bottom the number of bouts within each duration group is visualized.

Figure 5

The walking speed estimations from the real-world recording of the reference system and the wearable device, from all WB within the respective duration bouts. The boxplots show the distribution over all WBs. The bars in the upper plot show the absolute difference between the medians of the distributions (see right y-axis). The bottom plot shows the number of WBs in each duration bout.

In both environments, a clear negative linear relationship between the magnitude of the reference walking speed and the measurement errors was observed (Fig. 2). For the slowest WBs (< 0.6 m/s), we observed the largest absolute errors, increasing to 0.8 m/s in real-world WBs. Walking speed tended to be overestimated for slow WBs and underestimated for fast WBs. This trend can also be observed in the overall speed distribution of the WBs (Supplementary Fig. 2), which shows a larger number of slow walking bouts and a lower median gait velocity for the reference system compared to the wearable device.
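The speed dependency described above can be made concrete with an ordinary least-squares fit of the error against the reference speed. The sketch below uses made-up values chosen to mimic the reported direction of the effect; it is not a reproduction of the regression shown in Fig. 2, and all names are hypothetical.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("reference speeds must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up data: error = device - reference, in m/s
reference_speed = [0.4, 0.8, 1.2, 1.6]
speed_error = [0.2, 0.1, 0.0, -0.1]
slope, intercept = fit_line(reference_speed, speed_error)

# Reference speed at which the fitted error changes sign
zero_error_speed = -intercept / slope
```

A negative slope with a positive intercept corresponds to overestimation of slow bouts and underestimation of fast bouts, with the smallest errors near the speed where the fitted line crosses zero.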

Influence of task complexity

As task complexity increased, so did the MAE. For instance, the most complex laboratory gait task (“simulated daily activities”) presented the highest MAE across all cohorts (0.17 m/s) and the least complex task (the slow straight walking test) presented the lowest MAE (0.08 m/s) (Table 5, Fig. 6). Furthermore, the influence of task complexity was cohort dependent. The largest differences between the simple and complex gait tests were observed for the MS, PD, and PFF cohorts (P1 pipeline). In the real-world, for the same cohorts, differences were observed in the errors estimated between the WBs without turns and WBs with turns. For all cohorts, mean error and MAE from real-world assessments were comparable or slightly lower than from simulated daily activities.

Table 5 Dependency on complexity for a selection of the gait tasks.
Figure 6

The dependency of the absolute walking speed error on the different defined complexity tasks (see text). The results are split by patient cohort. The “All” group represents the statistics over all WBs independent of the cohort.

Discussion

To our knowledge, this study is the most extensive validation of a comprehensive multi-stage analytical pipeline for estimation of walking speed from a single wearable device. Overall, our findings showed good to excellent validation results in the laboratory and moderate to good agreement in the real-world. We demonstrated that the validity of walking speed estimation is slightly impacted by several factors including environment (laboratory vs real-world), clinical cohort, gait task complexity and other confounding factors (number of turns, WB duration, WB speed). Our results have strong implications for future research. Below, we provide our recommendations for future validations and for the use of wearable device-based walking speed, and DMOs more broadly, in daily life.

Overall validation results

Overall, laboratory walking speed demonstrated excellent agreement with the reference system, with the ICCs of the true positive WBs ranging from good (0.79) to excellent (0.91) and MAEs ranging from 0.06 to 0.12 m/s across all cohorts. Within the combined evaluation, the ICC of walking speed was slightly lower (0.72–0.82), indicating that only a small difference was introduced by the true positive evaluation. Previous studies conducted across various HAs and various clinical cohorts in laboratory settings have shown lower or comparable results27,32,33,34. However, in comparison to those studies, the pipelines in this study were validated over a wider variety of more complex gait tasks, challenging the estimation of walking speed as the signals are more variable and less cyclic, in comparison to steady-state and straight path gait.

Estimating walking speed in real-world gait assessment poses challenges due to the complexity and non-standardized nature of environments. This difficulty is supported by previous literature, which has found that real-world assessments present a greater challenge for DMO estimation35,36.

Despite these challenges, we achieved good results: agreement was moderate to good (ICCs within true positive WBs ranging between 0.57 and 0.88) and MAE ranged from 0.09 to 0.13 m/s. With regard to the combined real-world WB analysis, the ICCs were lower than the ICCs from true positive WBs. The MAE remained within usable ranges (≤ 0.18 m/s), but MARE increased up to 44%, primarily due to large relative errors for low gait speeds. In the combined analysis, the median walking speed for each participant was calculated, which may have increased the impact of individual datapoints with larger errors, as there was only one data point per participant. In some instances, we also observed negative ICCs (MS cohort = −0.15), which indicates a very poor correlation. Furthermore, this analysis reduced the range in walking speeds, where a larger number of slow WBs were included, which further increased the estimation error.

The WB detection results further show that, dependent upon the cohort, the detected WBs on average only cover between 57 and 72% of the overall walking present in the data. As the pipeline is tuned to provide high specificity, this relatively low sensitivity is expected and this difference in the underlying data distribution partially explained the increased error values for the combined analysis. This demonstrated the bias introduced in the true-positive analysis. Furthermore, the combined approach aggregates real-world data into singular values, which does not reflect the entire distribution of walking speed37.

Among the few other studies that performed a real-world validation of walking speed, Soltani et al.38 validated an algorithm based on a single wrist-worn sensor against a head-mounted Global Navigation Satellite System device, finding low bias [interquartile range (IQR) = −0.01, 0.00 m/s] and an accuracy expressed by root mean square error [IQR = 0.04, 0.06 m/s]. However, this validation was only performed in 30 HAs (mean age = 37 years)38. Given the promising results we report for estimation of walking speed, and the results previously reported for the individual algorithmic blocks26, we demonstrate that it is possible to use a single wearable device on the lower back for accurate quantification of mobility. However, it must be considered that the performance of algorithms and pipelines is dependent upon a variety of factors that should be taken into consideration during study design, future validation, and data interpretation.

Recommendation for real-world DMO validation

Validation protocol

Across all cohorts we observed larger absolute errors and lower ICCs with walking speed estimated from the real-world in comparison to laboratory assessment, showing the importance of real-world validations to obtain realistic and ecologically valid error estimates of DMOs.

Despite this, our results also show that some real-world challenges can be replicated within laboratory settings, as the errors observed during the simulated daily activities in the laboratory were in fact higher than in the real-world. In these tasks, participants undertook short WBs containing turns, changes of direction and transitions. Scott et al.39 compared the walking speed ranges recorded from the laboratory and 2.5 h protocol that were adopted in the present study and found a diverse profile of walking speed ranges in the laboratory that was representative of the walking speed range observed in the real-world. Future validation studies should strike an adequate balance between challenging tasks (short WBs, turns and transitions) and long uninterrupted walks in the laboratory protocol to properly replicate the expected error ranges from the real-world.

In general, the expected error ranges were dependent upon task complexity. Most condition specific differences were only prevalent in the real-world, which is consistent with previous research reported in HAs and people with MS and PD13,18,40. Our findings motivate the inclusion of complex tasks and simulated daily activities into any future laboratory validation. However, we also recommend inclusion of real-world measurements to capture the true range of gait task complexity performed in daily life as well as a myriad of contextual factors, including the distribution of WB duration and walking speed.

Reference system

Utilization of the INDIP system as a reference during both the laboratory-based and real-world protocol proved to be successful in overcoming limitations in accuracy, battery life and usability, all of which are common restrictions of real-world reference systems previously adopted in the literature (e.g., wearable cameras and global navigation satellite systems (GNSS)38,41). Specifically, the INDIP system has been validated, showing excellent agreement (ICC > 0.95) and very low MAEs (simulated daily activities: ≤ 0.05 m/s) against a stereophotogrammetric system in the same cohorts and laboratory protocol as in the present study31. The INDIP system was designed to enable the detection of gait and calculation of parameters based on as few assumptions as possible, particularly concerning the type of walking and the walking environment. Gait event detection relies on pressure insoles that are expected to work independently of the setting, and spatial parameter estimation is based purely on physics-based integration methods that estimate the 3D trajectory of the foot. The INDIP’s performance was evaluated based on a complex experimental protocol specifically designed for mobility assessment. Experiments included selected cohorts of participants with various conditions affecting gait characteristics, performing a complex battery of motor tests designed to produce a heterogeneous and broad range of gait patterns. Results showed overall good/excellent reliability and high repeatability and accuracy for the DMOs analyzed across populations, walking speeds, and WBs. Therefore, the INDIP system is a valuable candidate to collect reference standard data for the analysis of gait in real-world conditions42,43.
Other existing technologies can be used for obtaining reference data “out-of-the laboratory” (e.g., cameras, markerless systems), but they have intrinsic limitations that make their use inefficient (time consuming data analysis or small volume of data capture), less accurate for stride-by-stride description or not robust to quantify specific gait outcomes (e.g., spatial outcomes).

The INDIP system can be used in both laboratory and real-world settings to enable a concurrent validation of walking speed measurements, as provided in the present study. The recording duration of 2.5 h with the INDIP system enabled recording of a wide range of activities and walking speeds.

Data analysis

We adopted two approaches to analyzing walking speed, (i) only considering WBs that were directly matched between the wearable device and reference system (true positive evaluation) and (ii) considering the median value of walking speed across all available data (combined analysis).

The true positive evaluation allowed comparisons to be performed with high granularity on a WB level, allowing better understanding of the circumstances under which the wearable device performs best. However, for the true positive analysis we observed bias with regards to the overall walking speed ranges (Supplementary Figs. 1 and 2), resulting in a non-negligible impact on the real-world results. Therefore, the combined analysis is required to confirm observed error ranges and differences in the results of the two approaches should be considered and discussed. Furthermore, this type of analysis introduced the true-positive threshold as a parameter that influences the results. While we could not find a relevant effect of the selection of this parameter value on the walking speed error (Supplementary Figs. 1 and 2), this might influence other DMOs. As our results indicate, no single type of analysis can provide a definite and full picture of the error ranges. Given the lack of other established approaches to perform real-world comparisons of DMOs with a high granularity, we suggest our framework as a basis for future DMO validation studies.

Practical recommendations for the use of wearable devices for real-world walking speed measurements

Our results demonstrate that walking speed can be estimated accurately and reliably across a range of environments, cohorts, tasks and contextual factors. Based upon our promising validation results, below we provide our recommendations on the use of wearable devices for real-world walking speed measurement.

Influence of pipeline

For improved understanding of the error, the impact of individual DMOs within the pipeline should be considered. In our case, given the complexity of the respective algorithms, stride length is expected to have a larger contribution to the observed walking speed error compared to cadence (Supplementary Tables 1 and 2)26. This motivates further research in more robust methods for spatial parameter estimation. Furthermore, the wearable device seems to record a larger number of short WBs than the reference, suggesting that longer continuous bouts of walking were split into multiple shorter WBs. Based on specific investigations of such cases, this was often due to limitations of the initial contact and left–right detection. Under challenging conditions (e.g., turns or stairs), these algorithms could not provide reliable stride information, leading to a separation of longer periods of walking into multiple WBs, as no valid stride was detected for multiple seconds. This could be the result of the wearable device being positioned on the lower back, whereas the reference system also comprised foot-worn sensors and was thus more robust in quantifying gait events across longer periods. The full pipeline is implemented separately for each system, so the requirement that the combined estimates of all DMOs meet the criteria of a WB leads to heterogeneity between systems. This motivates further research in the detection of initial contacts and their laterality under challenging real-world conditions.

Walking bout duration

Real-world walking speed encapsulates a rich dataset of mobility that has been undertaken across various WBs which differ in their length, duration, and context. Each WB reflects a different profile of walking in terms of the number of turns, transitions, and periods of straight walking, which influences walking speed measurement. Therefore, it is not surprising that walking speed estimations were influenced by the WB duration, where very short WBs (< 10 s) presented the largest error. Additionally, the wearable device tends to detect a larger number of shorter WBs than the reference system but fewer WBs with intermediate durations (Fig. 5). This suggests that the wearable device tends to fragment gait sequences into smaller segments, possibly attributable to mis-detected initial contacts. We speculate that short WBs predominantly took place within confined indoor spaces such as the home environment. While walking speed captured at this short duration does not reflect steady state gait activities, it could still hold valuable information about balance and functional status (e.g., postural transition, weight-shift, sit-to-stand)44. Algorithms optimized for straight walking in controlled settings had an increased likelihood of higher absolute errors at very short durations. Based on this, we would recommend using a lower cut-off (WBs > 10 s) as a trade-off between the number of removed WBs and retaining the minimum of 401 WBs needed to ensure reliability and validity for real-world gait monitoring in a single cohort39.
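As a simple illustration of this recommendation, the sketch below removes bouts at or below the 10 s cut-off and checks whether the remaining number of bouts reaches the minimum of 401 cited above. Function names and example durations are hypothetical, and the strict inequality at exactly 10 s is our reading of "WBs > 10 s".

```python
from typing import List, Sequence

MIN_DURATION_S = 10.0  # lower cut-off recommended in the text
MIN_BOUT_COUNT = 401   # minimum number of WBs cited in the text for a single cohort


def filter_bouts(durations_s: Sequence[float], min_duration_s: float = MIN_DURATION_S) -> List[float]:
    """Keep only walking bouts strictly longer than the cut-off."""
    return [d for d in durations_s if d > min_duration_s]


def enough_bouts(durations_s: Sequence[float], min_count: int = MIN_BOUT_COUNT) -> bool:
    """Check whether enough bouts remain after filtering."""
    return len(durations_s) >= min_count


# Made-up bout durations in seconds
durations = [4.0, 12.0, 35.0, 9.5, 70.0, 10.0]
kept = filter_bouts(durations)
is_sufficient = enough_bouts(kept)
```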

Moving toward clinical application of wearable devices and walking speed measurement, it is important to consider in which specific real-world context wearable devices can quantify mobility most accurately and reliably. Our findings demonstrated that WBs > 30 s provide the most accurate and reliable measurement. WBs > 30 s can be characterized as medium to long in their duration. Walking speed estimated from medium length WBs (between 30 and 60 s) may reflect activities of daily living, such as intermittent periods of shopping or undertaking other errands in public spaces outside the home18,45. In contrast, longer WBs (> 60 s) typically capture faster walking speeds that are closer to what is already being measured in the laboratory. Thus, walking speed measured in medium length WBs reflects a balance between capturing activities of daily living and sufficient periods of straight walking activity that enable the robust quantification of walking speed. However, for certain patients, walking continuously for 30 s, as this cut-off would require, might already be strenuous. Thus, we would recommend using all WBs > 10 s to strike a balance between capturing a sufficient number of WBs for patients with a variety of condition severities and ensuring that walking speed can still be quantified reliably. Future clinical validation studies with a larger number of participants with severe gait impairments are required to confirm the reported error ranges for specific disease populations and to confirm the influence of WB duration upon the functional insight of mobility provided by walking speed46.

Walking speed

The influence of the speed at which each WB is completed upon the validity of the walking speed is considered a confounding factor of gait analysis47. When exploring the average walking speed across all WBs, we found that walking speed in WBs undertaken at slower speeds (< 0.6 m/s) tended to be overestimated by the wearable device (by ≤ 0.8 m/s), whereas WBs with faster speeds were underestimated; moderate walking speeds provided the highest accuracy (Fig. 2). Longer WBs (> 60 s) were completed at faster walking speeds in comparison to medium length WBs, which were undertaken at moderate speeds. Further, the overall number of slow WBs appears to be smaller for the wearable device. The speed distribution of only the true-positive WBs exhibits a similar shift in distributions but at overall higher speeds (Supplementary Fig. 2). In conjunction with the presented error values, this suggests that slow WBs are detected correctly, but their speed values are overestimated. Notably, these lower speeds were predominantly observed in short WBs. Consequently, this further justifies our recommendation that medium-length WBs provide the right balance between functional relevance and accuracy. The clinical properties of walking speed encapsulated within these WBs will be further investigated in ongoing clinical validation efforts46. Furthermore, measurements with cohorts consisting of predominantly slow walkers will likely result in larger error ranges. This is consistent with previous research that also validated algorithms based upon a single lower back sensor33,48.

The algorithms used in this study were optimized and developed based on independent datasets to avoid bias. We foresee that future algorithms developed on the TVS dataset, and other similar real-world datasets can improve on the speed dependency observed here.

Real-world complexity

Aside from the WB duration, the accuracy of walking speed estimation was also dependent upon the complexity of tasks/activities. The influence of complexity was cohort dependent and had the largest influence upon error for the MS, PD and PFF cohorts (estimated from the P2 pipeline). We would expect those cohorts to experience more gait impairments than the CHF, COPD and HA cohorts (estimated from the P1 pipeline). However, whether the observed effect is caused by specific gait properties of the respective cohorts or by shortcomings in the selected algorithms cannot be concluded based on the performed analysis.

Despite the challenges posed by outdoor environments (changes in terrain, weather, and negotiation of human and vehicle traffic)49,50, outdoor environments capture more prolonged and uninterrupted walks in comparison to indoor environments, such as the household, which represent more confined and cluttered spaces with limited capacity for completing sequences of straight walking. Thus, we would expect error ranges from long uninterrupted outdoor walks to be lower than those from confined indoor environments. Therefore, combining gait parameters with further contextual information might help to take this into account during data interpretation.

Limitations

While this study is one of the largest and most comprehensive validation studies for gait analysis based on wearable devices to date, the analysis of specific subgroup effects would require larger sample sizes. Potential links between the error, the condition severity and other medical comorbidities could not be established. Furthermore, the effect of walking aid use on results has not been assessed in this study. Future studies with more variable condition severity are needed to explore the influence of walking aid usage upon the validity of the analytical pipelines.

The real-world data was limited to 2.5 h for technical reasons. However, we accept that a recording of this length may not be sufficient to capture all the variability and patterns that would be included in multiple days of consecutive assessment. Due to technical issues with the devices, we were unable to assess some participants, which reduced our dataset. Data was also collected during the COVID-19 pandemic, which may have affected participants’ activity. While the analytical pipeline offers several strengths, its combined implementation does have limitations. As previously stated, the analytical pipelines for the wearable device data have a tendency to split longer WBs into multiple individual WBs. Hence, future research should explore whether this is caused by limitations of the initial contact and Left–Right detections, and how specific real-world contexts may influence walking speed performance. We found that error in walking speed estimation was more dependent upon stride length (spatial) estimation (MARE across all cohorts; laboratory = 14.31% and real-world = 20.35%), whereas cadence (temporal) can be estimated with substantially lower errors (MARE across all cohorts; laboratory = 4.1% and real-world = 4.8%) (Supplementary Tables 1 and 2). Therefore, future work aiming to further improve the performance of walking speed estimation should target the optimization of spatial algorithms.

Conclusion

Through extensive real-world and laboratory validation across multiple cohorts, this study represents, to the best of our knowledge, the most accurate estimate of the expected error ranges of a lower-back wearable device for estimation of walking speed. The presented state-of-the-art algorithm pipelines could reliably estimate DMOs across a wide range of scenarios, providing a solid foundation for future studies to establish their clinical meaningfulness46. While complex setups like camera-based motion capture systems in the laboratory and wearable multi-modal sensor systems in real-world scenarios still provide superior performance and might be required for certain types of clinical analysis, we demonstrated the suitability of a single easy-to-use and inexpensive wearable device for movement monitoring across a wide range of clinical indications. This has the potential to make gait-related parameters from long-term real-world recordings ubiquitously available for clinical decision making. Our results showed that various parameters can influence DMO performance and that multi-faceted analysis is crucial for understanding the capabilities of any DMO pipeline. This motivates the capture of additional context information during real-world measurements to focus analysis on signal areas where high reliability can be expected. Furthermore, we identified clear areas where future algorithm pipelines can still improve, and we believe that the captured dataset will be vital for the development of future algorithms specifically targeting the challenges of unsupervised real-world recordings.

Methods

Participants

For the Mobilise-D technical validation study (TVS), participants were recruited from five clinical cohorts (CHF, COPD, MS, PD, and PFF) alongside HA. Participants were recruited at five sites: The Newcastle upon Tyne Hospitals NHS Foundation Trust, UK (Sponsor of the study) and Sheffield Teaching Hospitals NHS Foundation Trust, UK (ethics approval granted by London – Bloomsbury Research Ethics committee, 19/LO/1507); Tel Aviv Sourasky Medical Center, Israel (ethics approval granted by the Helsinki Committee, Tel Aviv Sourasky Medical Center, Tel Aviv, Israel, 0551-19TLV), Robert Bosch Foundation for Medical Research, Germany (ethics approval granted by the ethical committee of the medical faculty of The University of Tübingen, 647/2019BO2), University of Kiel, Germany (ethics approval granted by the ethical committee of the medical faculty of Kiel University, D438/18). Informed consent was provided by all participants to take part in the study and all research was performed in accordance with the Declaration of Helsinki. Inclusion and exclusion criteria are fully described in24.

Protocol

The protocol has been extensively detailed in24. Participants were assessed in the laboratory and during a 2.5-h real-world observation. Mobility data was collected with a wearable device (McRoberts Dynaport MM+, sampling frequency: 100 Hz, triaxial acceleration range: ± 8 g/resolution: 1 mg, triaxial gyroscope range: ± 2000 degrees per second (dps)/resolution: 70 mdps), secured at the lower back with a Velcro belt. Participants were also asked to wear a multisensor INDIP reference system (sampling frequency: 100 Hz)24,30. Specifically, two magneto-IMUs were positioned over the instep and fixed to shoelaces with clips, and a third IMU was attached to the lower back with Velcro. Distance sensors were then positioned asymmetrically with Velcro (one above the left ankle and another 3 cm higher on the right leg). Pressure insoles were selected for each participant’s foot size and inserted into the shoe. The INDIP system has been validated in previous studies across a range of conditions, and also in the TVS cohorts, showing excellent results and reliability in the quantification of mobility outcomes (MAE laboratory ≤ 0.02 m/s, simulated daily activities = 0.03 to 0.05 m/s). A complete overview of the validation results can be found in31. The INDIP and the wearable device were synchronized using their timestamps (± 10 ms). Participants only performed tasks that they felt comfortable and safe to do in both protocols.

Laboratory protocol

Participants were asked to complete seven motor tasks with increasing complexity: Straight walking (slow, normal and fast speed), Timed Up and Go, L-Test, Surface Test, Hallway Test and Simulated Daily Activities. Each task was designed to capture and assess various elements associated with real-world walking including a range of walking speeds, incline/steps, surface, path shape, turns and specific motor tasks to simulate typical real-world transitions24,39.

Real-world protocol

Participants were assessed for up to 2.5 h in the real-world, as they went about their normal activities unsupervised (home/work/community/outdoor). The duration of the observation has been established as a trade-off between experimental, clinical, and technical requirements. To capture the largest possible range of activities during this assessment, participants were guided by the following list of activities: if relevant for their chosen environment, rise from a chair and walk to another room; walk to the kitchen and make a drink; walk up and down a set of stairs (if possible); walk outdoors (if possible, for a minimum of 2 min); if walking outside, walk up and down an inclined path. We did not provide supervision or structure on how these tasks should be completed to the participants24.

Calculation of walking speed

The evaluation of walking speed requires the combination of various algorithmic steps, including the identification of gait sequences and of initial contacts, and the estimation of DMOs, i.e., cadence and stride length. Selection of the top-ranked algorithms to detect gait sequences and to estimate initial contact events, cadence and stride length within identified gait sequences was determined in our previous work26 (Fig. 1). Walking speed was then estimated from the outputs of the best-performing stride length and cadence algorithms using Eq. (1):

$$\text{Walking speed}\,[\mathrm{m/s}] = \frac{\text{cadence}\,[\mathrm{steps/min}]}{2 \times 60} \times \text{stride length}\,[\mathrm{m}] \tag{1}$$
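As a brief illustration of Eq. (1), the following minimal sketch converts cadence and stride length into walking speed. The function name and the example values are hypothetical and are not taken from the study pipelines.

```python
def walking_speed(cadence_steps_per_min: float, stride_length_m: float) -> float:
    """Walking speed in m/s following Eq. (1).

    Cadence is given in steps per minute. Dividing by 60 yields steps per
    second, and dividing by 2 yields strides per second (two steps per stride).
    """
    strides_per_second = cadence_steps_per_min / (2 * 60)
    return strides_per_second * stride_length_m


# Hypothetical example values (not study data).
speed = walking_speed(cadence_steps_per_min=100.0, stride_length_m=1.2)
print(f"{speed:.2f} m/s")
```

Because speed is the product of the two estimates, a relative error in either cadence or stride length carries over proportionally to the walking speed estimate.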

Two independent analytical pipelines (P1 and P2) were identified in this process due to differences in algorithm selection for gait-sequence detection and cadence for the different conditions included in the study26. P1 provides the optimal combination of algorithms selected for HA, COPD, and CHF conditions, and P2 provides the optimal combination for PD, MS, and PFF (Fig. 7).

Figure 7

Overview of the different algorithmic steps of the analytical pipeline with short explanations of the intermediate and final outputs of each of the algorithmic blocks: gait sequence detection (GSD), initial contact detection (ICD), cadence estimation (CAD) and stride length estimation (SL). The algorithm column indicates the algorithms used for the two pipelines P1 (HA, COPD, CHF) and P2 (MS, PD, PFF). Short citations for the algorithms are provided below the figure. For more details see Table 1 in26.

Two additional algorithms were added to both gait analysis pipelines: a turn detection algorithm51 and a customized algorithm to detect the laterality (left or right step) of each IC52. Laterality was used to interpolate the cadence, stride length, and walking speed parameters (provided as per-second values by the algorithms) to stride-level values (stride interpolation).

DMOs were evaluated on a stride level, conforming to consensus-agreed definitions53 for WBs. Accordingly, a WB was defined as a continuous sequence containing at least two consecutive strides of both feet (e.g., R–L–R–L–R–L or L–R–L–R–L–R, where R/L denotes the right/left foot making contact with the ground); consecutive WBs were separated if a break greater than 3 s was identified between them; and, for a stride to be included in a given WB, it had to have a duration between 0.2 and 3.0 s and a stride length > 0.15 m54. WBs compliant with this definition were generated by first filtering the list of identified strides based on the stride-level definition (stride selection) and then grouping the remaining strides into final WBs based on breaks within the stride sequence (walking bout assembly). The same definitions were also used to define the WBs for the reference system. For both systems, final DMOs were calculated as the average value over all strides within a WB.
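The two steps described above (stride selection and walking bout assembly) can be sketched as follows. This is an illustrative sketch rather than the pipeline implementation: the `Stride` container, the inclusive duration bounds, the measurement of a break from the latest stride end to the next stride start, and the reading of "two consecutive strides of both feet" as at least two strides per foot are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Stride:
    """Hypothetical stride record; field names are illustrative only."""
    start_s: float
    end_s: float
    length_m: float
    foot: str        # "L" or "R"
    speed_mps: float


def is_valid(stride: Stride) -> bool:
    """Stride selection: duration 0.2-3.0 s (bounds assumed inclusive), length > 0.15 m."""
    duration = stride.end_s - stride.start_s
    return 0.2 <= duration <= 3.0 and stride.length_m > 0.15


def assemble_walking_bouts(strides, max_break_s=3.0, min_strides_per_foot=2):
    """Group valid strides into WBs; a break longer than max_break_s starts a new WB."""
    valid = sorted((s for s in strides if is_valid(s)), key=lambda s: s.start_s)
    groups = []
    for stride in valid:
        # Assumption: a break is the time from the latest stride end to the next stride start.
        if groups and stride.start_s - max(s.end_s for s in groups[-1]) <= max_break_s:
            groups[-1].append(stride)
        else:
            groups.append([stride])
    # Assumption: a WB needs at least two strides of each foot.
    return [
        g for g in groups
        if all(sum(s.foot == foot for s in g) >= min_strides_per_foot for foot in "LR")
    ]


def bout_speed(bout) -> float:
    """Final DMO per WB: average over all strides within the WB."""
    return mean(s.speed_mps for s in bout)


def make_strides(t0, n, length_m=1.2, speed_mps=1.2):
    """Helper producing n alternating R/L strides of 1 s each (synthetic data)."""
    return [
        Stride(t0 + 0.5 * i, t0 + 0.5 * i + 1.0, length_m, "RL"[i % 2], speed_mps)
        for i in range(n)
    ]


# Six valid strides, then two strides after a long break, then four too-short strides.
strides = make_strides(0.0, 6) + make_strides(10.0, 2) + make_strides(20.0, 4, length_m=0.1)
bouts = assemble_walking_bouts(strides)
print(len(bouts), "walking bout(s) retained")
```

In this synthetic example only the first sequence forms a WB: the second group contains one stride per foot, and the strides in the third group fail the stride length criterion.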

Validation of walking speed

All comparisons between the wearable device and the reference system were performed based on the average walking speed within each identified WB. In addition, comparison results for cadence and stride length can be found in Supplementary Tables 1 and 2. Further, we evaluated the performance of the WB detection on the 2.5-h real-world assessment to provide additional context for the error parameters. For this, we calculated the accuracy, sensitivity, specificity and positive predictive value by comparing the regions of the WBs detected by the wearable device with the WBs reported by the reference system on a sample-by-sample basis, following the same approach used for evaluation of the gait sequence detection (GSD) methods in26. Real-world recordings also pose new challenges during data analysis. WBs detected by the wearable device and the reference system might not match up; thus, direct comparison of individual strides or WBs is not always possible. One straightforward approach is to average DMOs across all WBs before comparison. However, this reduced granularity makes it difficult to fully understand under which circumstances a wearable device works well, and it can “mask” bias or error (e.g., over- or underestimation under specific circumstances) that could only be identified by considering individual WBs. We proposed a new approach for these types of data analysis, by splitting the analysis into a detailed comparison of only WBs that were identified in both systems (True positive WBs) and a traditional analysis of all data combined (Combined WBs).

  1. True Positive Evaluation: Novel method of analysis, which directly compares the performance of the DMOs on only the WBs that were detected in both systems (true positives). This allows for the calculation of traditional comparison metrics (e.g., intraclass correlation and Bland–Altman plots) that require a direct comparison of individual measurement points. WBs were included in the true positive analysis if there was an overlap of more than 80% between the two systems (details about the selection of this threshold can be found in Supplementary Fig. 1). The threshold of 80% was selected as a trade-off to allow us: (i) to consider as much as possible a like-for-like comparison between selected WBs (INDIP vs. wearable device), and at the same time (ii) to include the minimum number of walking bouts to ensure sufficient statistical power for the analyses (i.e., at least 101 walking bouts for each cohort). This target was based upon the number of walking bouts rather than a percentage of total walking bouts, as this allowed us to meet criteria established by statistical experts for robust statistical analysis after sample-size re-evaluation (total walking bouts number > 101 corresponding to ICC > 0.7 and a CI = 0.2).

  2. Combined Evaluation: Traditional method of analysis, where we calculated the median walking speed from all identified WBs within each laboratory task (resulting in one value per gait task per participant) or within the 2.5 h real-world assessment per participant (resulting in one value per participant) and compared these combined values between the systems. This comparison is free of potential biases introduced by the selection of only the true-positive WB and reflects how DMOs will typically be evaluated in a research or clinical setting or when reference data may not be available.
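The two evaluation strategies can be sketched as follows. The text specifies an overlap of more than 80% but not how overlap is computed, so the sketch assumes intersection over union of the two WB time intervals. The tuple layout, the greedy one-to-one matching and all numeric values are likewise hypothetical.

```python
from statistics import median

# Each WB is represented as (start_s, end_s, mean_speed_mps); layout is illustrative.


def overlap_fraction(a, b) -> float:
    """Assumed overlap definition: intersection over union of two time intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return intersection / union if union > 0 else 0.0


def match_true_positives(reference, wearable, threshold=0.8):
    """Pair each reference WB with the best-overlapping unused wearable WB (> threshold)."""
    pairs, used = [], set()
    for ref in reference:
        best_index, best_overlap = None, threshold
        for index, dev in enumerate(wearable):
            if index in used:
                continue
            overlap = overlap_fraction(ref, dev)
            if overlap > best_overlap:
                best_index, best_overlap = index, overlap
        if best_index is not None:
            used.add(best_index)
            pairs.append((ref, wearable[best_index]))
    return pairs


def combined_medians(reference, wearable):
    """Combined evaluation: one median speed per system over all of its WBs."""
    return median(wb[2] for wb in reference), median(wb[2] for wb in wearable)


# Synthetic example: the wearable splits the second reference WB into two shorter WBs.
reference = [(0.0, 20.0, 1.00), (30.0, 40.0, 0.80), (50.0, 90.0, 1.20)]
wearable = [(1.0, 20.0, 0.95), (30.0, 34.0, 0.90), (35.0, 40.0, 0.85), (50.0, 88.0, 1.15)]

pairs = match_true_positives(reference, wearable)
ref_median, dev_median = combined_medians(reference, wearable)
print(len(pairs), "true-positive pairs")
```

In this example the split WB is excluded from the true positive evaluation, because neither fragment overlaps the reference WB by more than 80%, but both fragments still contribute to the combined median of the wearable device.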

Factors that can influence walking speed validity

A range of factors can influence walking speed, and this may impact the algorithm performance and validity of the results. We investigated the influence of the following possible sources of confounding upon walking speed validity: the cohort, the environment (laboratory vs real-world), task complexity, walking speed, walking bout duration, and participant performance. All comparisons (unless otherwise stated) were performed using WBs identified as true positive (true positive evaluation).

Influence of the cohort and environment

We compared errors in the estimation of walking speed between each of the different clinical cohorts included in the study, alongside differences between laboratory and real-world environments.

Influence of gait task complexity

Real-world walking contains complex gait sequences comprising short steps, frequent turns, or obstacle negotiation, during which individuals often multitask. Thus, gait patterns observed in the real world are not comparable with the straight walking tasks undertaken in controlled environments, even if we account for differences in WB duration and walking speed. To assess the effect of gait-task complexity, we compared validation results of walking speed estimated from (i) simulated daily activities (high complexity), (ii) slow straight walking (low complexity), (iii) all straight walking tasks (low complexity), and (iv) all laboratory tests with the validation results of walking speed estimated from real-world walking. We further subdivided real-world walking based on the percentage of a WB that was assessed to be a turn. Based on this, we defined the following levels of gait complexity: (v) “simple” straight gait (< 20% covered by turns) and (vi) “complex” gait (≥ 60% covered by turns).
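The turn-based subdivision of real-world WBs can be sketched as follows. The text gives only the two thresholds, so the computation of the turn percentage (summed duration of non-overlapping turn intervals divided by the WB duration) and all names are assumptions made for illustration.

```python
def turn_fraction(wb, turns) -> float:
    """Fraction of a WB (start_s, end_s) covered by turns.

    Assumes `turns` is a list of non-overlapping (start_s, end_s) intervals.
    """
    start, end = wb
    covered = sum(
        max(0.0, min(end, turn_end) - max(start, turn_start))
        for turn_start, turn_end in turns
    )
    return covered / (end - start)


def complexity_level(fraction: float):
    """Levels from the text: < 20% is 'simple', >= 60% is 'complex'.

    WBs between the two thresholds are not assigned a level in the text,
    so None is returned for them.
    """
    if fraction < 0.20:
        return "simple"
    if fraction >= 0.60:
        return "complex"
    return None


# Synthetic examples (not study data).
straight = complexity_level(turn_fraction((0.0, 20.0), [(2.0, 4.0), (10.0, 11.0)]))
cluttered = complexity_level(turn_fraction((0.0, 10.0), [(1.0, 4.0), (5.0, 9.0)]))
print(straight, cluttered)
```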

Influence of walking speed and walking bout duration

Given the impact of real-world WB durations and speeds44 on the adopted biomechanical strategies55, we analyzed their influence on the validity of the walking speed. For this, we assessed whether the validity of walking speed estimation differed between specific WB duration bins (< 10 s, > 10 s, 10–30 s, 30–60 s, > 60 s and > 120 s). This was first performed for all true positive WBs, comparing their errors across each WB duration threshold, and subsequently repeated for the combined analysis by calculating the median walking speed for each participant within the respective WB duration bin and comparing the median values between the reference system and the wearable device. Together, these analyses permitted the validation of walking speed quantification across different walking strategies.
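Because the listed duration bins overlap (for example, a 45 s WB belongs to both the > 10 s and the 30–60 s bin), each WB can contribute to several bins. The following sketch shows one possible way to organize such a binned error analysis. The assignment of boundary values (half-open intervals), the data layout and the use of the mean absolute error per bin are assumptions made for illustration.

```python
import math
from collections import defaultdict
from statistics import mean

# Bin edges in seconds. Bins overlap by design; treating each bin as the
# half-open interval [lower, upper) is an assumption about boundary values.
DURATION_BINS = {
    "< 10 s": (0.0, 10.0),
    "> 10 s": (10.0, math.inf),
    "10-30 s": (10.0, 30.0),
    "30-60 s": (30.0, 60.0),
    "> 60 s": (60.0, math.inf),
    "> 120 s": (120.0, math.inf),
}


def bins_for(duration_s: float):
    """Return the labels of all bins that contain the given WB duration."""
    return [
        label for label, (lower, upper) in DURATION_BINS.items()
        if lower <= duration_s < upper
    ]


def mean_abs_error_per_bin(matched):
    """matched: iterable of (duration_s, reference_speed_mps, wearable_speed_mps)."""
    errors = defaultdict(list)
    for duration_s, reference_speed, wearable_speed in matched:
        for label in bins_for(duration_s):
            errors[label].append(abs(wearable_speed - reference_speed))
    return {label: mean(values) for label, values in errors.items()}


# Synthetic true-positive WBs (not study data).
matched = [(5.0, 0.6, 0.8), (45.0, 1.0, 0.95), (150.0, 1.3, 1.2)]
per_bin = mean_abs_error_per_bin(matched)
print(sorted(per_bin))
```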

Validation measures

For all types of evaluations (all available WBs/aggregated values or on the respective subgroups), we calculated various statistical/comparison measures to quantify the walking speed estimation error for the sensitivity analysis:

  • Intraclass correlation coefficient (ICC(2,1))56 was calculated to assess the association between the DMOs of the two systems. Based on ICC estimates, values < 0.5, between 0.5 and 0.75, between 0.75 and 0.9, and > 0.90 were deemed to be indicative of poor, moderate, good, and excellent reliability, respectively57.

  • Absolute agreement was assessed by quantifying (i) the accuracy/mean absolute error (MAE), (ii) bias/mean error and (iii) precision/limits of agreement (LoA)58 between walking speed estimates of both systems.

  • Mean relative errors (MRE) and mean absolute relative error (MARE) were estimated as the ratio between the (absolute) errors per WB and the corresponding estimates from the reference system, expressed as a percentage.
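The measures listed above can be sketched for paired WB-level speeds as follows. This is a minimal illustration under stated assumptions, not the analysis code of the study: errors are taken as wearable minus reference, the limits of agreement use the common Bland–Altman convention of bias ± 1.96 standard deviations, and ICC(2,1) is computed with the standard two-way random-effects, absolute-agreement, single-measurement formula. All function names and values are hypothetical.

```python
from statistics import mean, stdev


def agreement_metrics(reference, wearable):
    """Bias, MAE, limits of agreement, MRE and MARE for paired speeds (m/s)."""
    errors = [w - r for r, w in zip(reference, wearable)]        # wearable minus reference
    relative = [100.0 * e / r for e, r in zip(errors, reference)]  # percent of reference
    bias = mean(errors)
    spread = 1.96 * stdev(errors)  # assumed Bland-Altman convention
    return {
        "bias": bias,
        "mae": mean(abs(e) for e in errors),
        "loa": (bias - spread, bias + spread),
        "mre_percent": mean(relative),
        "mare_percent": mean(abs(x) for x in relative),
    }


def icc_2_1(reference, wearable) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measurement."""
    rows = list(zip(reference, wearable))
    n, k = len(rows), 2
    grand = mean(x for row in rows for x in row)
    row_means = [mean(row) for row in rows]
    col_means = [mean(col) for col in zip(*rows)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in rows for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Synthetic paired WB speeds in m/s (not study data).
reference = [1.0, 0.8, 1.2, 0.5]
wearable = [0.9, 0.9, 1.1, 0.6]

metrics = agreement_metrics(reference, wearable)
icc = icc_2_1(reference, wearable)
print(f"MAE = {metrics['mae']:.2f} m/s, MARE = {metrics['mare_percent']:.1f} %")
```

In this synthetic example the over- and underestimations cancel, so the bias and MRE are close to zero while the MAE and MARE remain clearly above zero, which illustrates why both signed and absolute measures are reported.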