Introduction

The quality and accuracy of instrumented gait analysis in level walking depend on the nature of gait dysfunction and the measurement technology used1,2,3,4,5. The influence of rater performance on the reliability of outcomes has been assessed for highly instrumented methods like 3D optoelectronic systems1, as well as for observational ratings6. Inter-rater reliability of gait parameters was shown to be an important metric in a team of raters performing gait analysis7. Test–retest reliability is considered a quality benchmark in the analysis of longitudinal changes in level walking8. However, compared to the number of studies on level walking, few papers have been published on inter-rater and test–retest reliability, which also serve as a standardization basis for the respective gait labs.

To optimize observational gait analysis, some institutions have implemented standardized instrumental three-dimensional gait analysis (3DGA) by means of optoelectronic movement analysis systems to assess level walking. 3DGA is a widely accepted reference standard for assessing gait parameters if applied by a rater following marker placement training9. However, for reasons of cost and space, 3DGA is not yet very common and cannot easily be used in extramural settings. For assessing spatio-temporal parameters, pressure distribution platforms10,11, LED bars, and inductive walkway systems12 are available and were tested for reliability with partly excellent results. To obtain additional 3D kinematic parameters, systems based on inertial measurement units were introduced13. Low-cost and mobile depth-finding camera-equipped game consoles may not accurately obtain lower body kinematic data, but show potential for spatio-temporal assessments14,15. Overall, it was concluded that validation studies on some of these technologies are of limited quality, but reliability was better investigated than concurrent validity, and spatio-temporal gait parameters consistently outperformed planar joint angle data16. To our knowledge, only four commercially available video-based movement analysis systems have been assessed for their accuracy in level walking5.

Considering the advantages of two-dimensional (2D) video-based kinematic analyses with marker tracking, which are inexpensive, commonly available, and highly mobile, the number of studies using this method is astonishingly small. Although a 2D analysis system is only suitable for gait patterns with strides close to the sagittal plane in the walking direction17, there are still various indications for its use in preventive and rehabilitative settings where kinematic and spatio-temporal parameters of level walking need to be assessed in the sagittal plane. For the commonly used system TEMPLO™, which also provides optional integration of analog devices, no data on the reliability and validity of the derived parameters have been published. Therefore, this study was performed to assess (i) the validity of the video-based 2D system TEMPLO™ against a 3DGA reference standard, as well as (ii) the inter-rater reliability for three raters, and (iii) the test–retest reliability for one rater, by means of 3DGA.

Methods

Study design

After giving informed consent, participants were examined for eligibility criteria. Gait examination started with three pre-trials and a calibration routine after the first rater had applied the marker set. Participants completed six valid trials for both the right and the left side. Afterwards, the markers were removed. Before the next rater applied the marker set, a 30-min break was allowed in order to avoid skin irritation. The third rater finally repeated the procedure. To avoid period effects, the order in which raters applied the marker set was randomized by a prepared random sequence list. A wash-out phase with a mean duration of one week was planned before the procedure was repeated by rater three only. All gait assessments were simultaneously captured by the 2D motion analysis system TEMPLO™ (Contemplas, Kempten, Germany) and the 3D motion analysis system VICON™ (Vicon Motion Systems Ltd, Oxford, United Kingdom) (Fig. 1). No changes were made to the trial design or eligibility criteria after the study had commenced.

Figure 1

Flow diagram of the study design.

Participants

Potential participants were invited via the FH Campus Wien-University of Applied Sciences in-house info screens to be screened for eligibility. No monetary or other incentives were offered. Eligibility criteria were (i) age from 18 to 30 years, (ii) measured body mass index (BMI) from 18.5 to 24.99 kg/m², and (iii) no musculoskeletal abnormalities in the lower extremity and/or spine. Data were collected in the movement lab of FH Campus Wien-University of Applied Sciences. Based on comparable publications, we assumed a sample of 22 subjects to be sufficiently powered for this study18. This number includes an anticipated drop-out rate of 10%. Based on their profession, the three physiotherapists acting as raters in this study may be considered advanced in placing markers after a period of training sessions9, as opposed to inexperienced operators and examiners not educated in anatomical palpation. Training sessions were supervised by rater three and an additional experienced gait analyst, acting in the role of the clinical trial monitor. The two training sessions consisted of marker placement, measuring anatomical distances, and participant instruction, each of which was performed by all three raters. In each assessment, the highly experienced monitor and the moderately experienced rater three checked and discussed the quality of the aspects mentioned.

All participants were enrolled by the principal investigator. Due to the nature of the assessments being studied, no arrangements were made related to allocation concealment or participant blinding. The ethical committee of the Medical University of Vienna approved the study protocol (1195/2016), and all participants provided written informed consent before data collection started. The Austrian Federal Office for Safety in Health Care (BASG) approved this clinical evaluation of a medical device (Ref-Nr 9119547) according to EN ISO 14155. The study was conducted in accordance with the approved study protocol, which is in accordance with the European Medicines Agency “ICH Good clinical practice” scientific guideline and the Declaration of Helsinki.

Instrumentation

A 3D optoelectronic system controlled by the software Nexus 2.4 (Vicon Motion Systems Ltd., Oxford, United Kingdom) with reflective skin-markers of 14 mm diameter was used as the reference standard. Fourteen cameras (200 Hz frame rate) were synchronized with two floor-mounted force plates (AMTI, Watertown, United States) sampling at 1000 Hz. Markers were placed according to the VICON Plug-in-Gait Lower Body model (PiG).

The 2D system was driven by the software TEMPLO™ (Contemplas, Kempten, Germany). High-speed cameras (acA640gc-120, Basler AG, Ahrensburg, Germany) with a resolution of 480 × 640 pixels and a frame rate of 100 Hz captured videos of the right and left strides, respectively. Cameras were connected to the computer via a PoE switch (GS108P, Netgear Inc., San José, CA, USA). Parallel to the camera view, spotlights illuminated the reflective markers to improve contrast during data recording (Supplementary Figs. 1, 4, 5). The camera images were calibrated using a one-meter by one-meter calibration object placed normal to the walkway (Supplementary Figs. 2, 3). Camera images captured a view of 3.36 m in length and 1.58 m in height from a distance of 2.5 m, which results in a resolution of 0.53 cm in length and 0.33 cm in height per pixel. This is sufficiently accurate with regard to measurement errors caused by soft tissue artefacts19. The positions of the cameras were optimized for capturing the strides with valid force plate strikes of the leg nearest to the relevant camera. For the segmental model (Supplementary Fig. 8), an additional marker was placed over the apex of the greater trochanter, representing the hip joint. Due to central projection effects on the metric calibration of 2D systems, identification of and data extraction for Heel Strike of the leg contralateral to the camera position were not possible.
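The per-pixel resolution quoted above follows directly from dividing the calibrated field of view by the pixel count along each axis. The short sketch below reproduces that arithmetic from the values in the text; the function name is hypothetical, and the calculation assumes a distortion-free image in which pixels are spread evenly over the field of view.

```python
def cm_per_pixel(field_of_view_m: float, pixels: int) -> float:
    """Size of one pixel in the calibrated plane, in centimetres.

    Assumes pixels are spread evenly over the field of view
    (no lens distortion), which is a simplification.
    """
    return field_of_view_m / pixels * 100.0


# Values from the text: 3.36 m x 1.58 m field of view, 640 x 480 pixels.
horizontal_cm = cm_per_pixel(3.36, 640)
vertical_cm = cm_per_pixel(1.58, 480)

print(f"horizontal: {horizontal_cm:.3f} cm/pixel")
print(f"vertical:   {vertical_cm:.3f} cm/pixel")
```

Both values round to the 0.53 cm and 0.33 cm per pixel reported above.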

Procedure and data processing

In the clinical examination, the physiological range of motion of the spine and lower extremities was tested by an experienced physiotherapist to rule out any musculoskeletal abnormalities. BMI was calculated from measured body weight (M-420, Marsden Weighing Machine Group Ltd., Rotherham, United Kingdom) and height (Seca 213, Seca GmbH & Co KG, Hamburg, Germany). The length of the lower leg was measured from the floor to the medial knee joint space by one rater for all subjects. The following distances were measured by each rater and used for individual data processing: malleolus medialis to spina iliaca anterior superior, right to left spina iliaca anterior superior, and knee and ankle width.

After attaching the skin-markers, a static calibration trial was captured and joint centers were calculated. Participants were then asked to walk from a starting mark at self-selected speed with elbows bent for visibility of the pelvis and hip markers20. The recording trials started when they had become habituated to the arm constraint, their walking pattern appeared observationally stable, and the start mark had been optimized for valid force plate strikes. Six valid trials were recorded and averaged, with one left and one right step fully placed on the corresponding force plate.

Raw 3D data underwent the standard PiG pipeline using a 5th-order Woltring filter (mean squared error value: 20) of trajectories. Heel Strike and Toe-Off were determined using the standard event detection function with a 20 N force threshold. After time normalization in Polygon 4.2 (Vicon Motion Systems Ltd., Oxford, United Kingdom), data curves were exported to a spreadsheet for final parameter extraction.
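Time normalization rescales each gait cycle, regardless of its duration, to a common 0–100% axis so that curves from different trials can be averaged and compared. The text does not describe how Polygon performs this step internally; the sketch below is only one common approach, linear interpolation onto 101 equally spaced points, and the function name and the sample values are hypothetical.

```python
def time_normalize(samples: list[float], n_points: int = 101) -> list[float]:
    """Resample one gait cycle onto n_points equally spaced positions
    (0-100 % of the cycle) using linear interpolation."""
    if len(samples) < 2 or n_points < 2:
        raise ValueError("need at least two samples and two output points")
    last = len(samples) - 1
    normalized = []
    for i in range(n_points):
        pos = i * last / (n_points - 1)   # fractional index into the input
        lo = int(pos)                     # sample at or before pos
        hi = min(lo + 1, last)            # next sample, clamped at the end
        frac = pos - lo
        normalized.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return normalized


# Hypothetical knee angle samples (degrees) over one cycle.
cycle = [0.0, 10.0, 20.0]
curve = time_normalize(cycle)
print(len(curve), curve[0], curve[50], curve[100])
```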

For processing the predefined 2D gait parameters (as listed in Table 2), a segmental model, angle algorithms, and spatio-temporal parameter (STP) algorithms (Supplementary Fig. 916) were developed in Motus 10.1 (Contemplas, Kempten, Germany). Using this template, videos (Supplementary Figs. 6, 7) were imported from TEMPLO™, and marker trajectories were automatically tracked. For detection of the event Heel Strike, customized algorithms using the linear velocity and acceleration of the marker placed on the dorsal heel were used. Toe-Off was identified using parameters derived from the second toe marker coordinates. After running the above-mentioned calculations and applying a 2nd-order Butterworth filter with a 6 Hz cut-off21, data curves were pasted into ProEMG 2.1 (Prophysics AG, Kloten, Switzerland) for time normalization. Final parameter extraction was performed using the datasheet.

Statistical analysis

Metric outcome parameters were first checked for outliers and normal distribution. Outliers, defined as values deviating by more than two standard deviations from the sample’s mean, were an exclusion criterion. Normal distribution was tested with the Shapiro–Wilk test and graphical inspection of Q–Q plots.
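The two-standard-deviation rule can be sketched in a few lines. The example below assumes the sample standard deviation (n − 1 denominator) and that the mean and SD are computed with the candidate outlier still included; the text does not specify either choice. The function name and the numbers are hypothetical and are not study data.

```python
from statistics import mean, stdev


def flag_outliers(values, n_sd=2.0):
    """Return one boolean per value: True if it lies more than n_sd
    sample standard deviations away from the sample mean."""
    m = mean(values)
    sd = stdev(values)  # sample SD (n - 1 denominator), an assumption
    return [abs(v - m) > n_sd * sd for v in values]


# Hypothetical RoM values in degrees; the last one is far from the rest.
rom_knee = [60, 61, 62, 63, 64, 61, 62, 63, 62, 90]
flags = flag_outliers(rom_knee)
print(flags)
```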

Agreement of the 2D with the 3D system (concurrent validity), as well as the test–retest reliability (consistency of rater three's marker application between two measurement days), was expressed by intraclass correlation coefficients (ICC(3,k)), i.e. absolute agreement, average measures, two-way mixed. Furthermore, mean differences were tested by paired-samples t-tests with Cohen's d as standardized effect size and graphically visualized by Bland–Altman plots. Consistency between the three raters applying the marker set (inter-rater reliability) was expressed by ICC(2,k), i.e. absolute agreement, average measures, two-way random. Besides the ICCs, mean differences were tested by repeated-measures ANOVA with eta-squared as standardized effect size.
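For readers who want to see how such a coefficient is obtained, the following minimal sketch computes an absolute-agreement, average-measures ICC from the mean squares of a two-way ANOVA. It is not the SPSS implementation used in the study. To our understanding, the two-way random and two-way mixed models give the same point estimate for this ICC form and differ only in interpretation, so a single function is shown. The function name and the ratings table are illustrative and are not study data.

```python
def icc_absolute_average(ratings):
    """Absolute-agreement, average-measures ICC from a two-way ANOVA.

    ratings: list of rows, one row per participant and one column per
    rater (or system, or session); the table must be complete.
    """
    n = len(ratings)       # participants
    k = len(ratings[0])    # raters / systems / sessions
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete n x k table with n, k >= 2")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Partition the total sum of squares into rows, columns and error.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)


# Illustrative ratings: 6 participants (rows) scored by 4 raters (columns).
toy = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc = icc_absolute_average(toy)
print(round(icc, 2))
```

Because the column (rater) mean square enters the denominator, systematic offsets between raters or systems lower this coefficient, which is what distinguishes absolute agreement from pure consistency.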

ICC values above 0.7 were considered acceptable for validity22, and ICC values above 0.75 were considered acceptable for reliability23. A level of at least 0.9 was considered acceptable if the measure is used for decisions about an individual, rather than about a group of patients or in a clinical trial24. The standard error of measurement (SEM = SD × √(1 − ICC)) was calculated as a further reliability metric, where the precision of the measurement is expressed in the unit of the specific outcome (e.g. °). In this context, the standard deviation of all test scores was derived from the total sum of squares (SS) of the ICC’s ANOVA25,26 as SD = √(SS/(n − 1)). All statistical analyses were carried out with IBM SPSS Statistics version 28 (IBM Corp., Armonk, NY). No additional subgroup or adjusted analyses were conducted. Alpha was set at 0.05; exact p-values are nevertheless reported.
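The two formulas for the standard error of measurement can be written out as a small worked example. The sketch below assumes that n is the number of scores contributing to the total sum of squares, which the text does not state explicitly; the function names and the numbers are hypothetical and are not study data.

```python
import math


def sd_from_total_ss(total_ss: float, n: int) -> float:
    """SD = sqrt(SS / (n - 1)); n is assumed to be the number of scores
    that contribute to the total sum of squares."""
    return math.sqrt(total_ss / (n - 1))


def standard_error_of_measurement(sd: float, icc: float) -> float:
    """SEM = SD * sqrt(1 - ICC), expressed in the unit of the outcome."""
    return sd * math.sqrt(1.0 - icc)


# Hypothetical example: total SS of 160 over 11 scores, ICC of 0.91.
sd = sd_from_total_ss(160.0, 11)               # 4.0 degrees
sem = standard_error_of_measurement(sd, 0.91)  # about 1.2 degrees
print(f"SD = {sd:.2f}, SEM = {sem:.2f}")
```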

Results

Sample characteristics

Age and anthropometric characteristics of the sample are summarized in Table 1. Twenty-one subjects were screened for eligibility, and the 19 who passed the functional check were randomly assigned to a predefined sequence of three raters applying the marker set. All 19 participants completed all study procedures, including the retest where markers were applied by rater three only. One participant was excluded from further analysis due to outcome values differing by more than two standard deviations from the sample’s mean27, resulting in a sample size of 18 participants (10 of whom were women). Technical issues occurred in that some markers were not covered by the 2D camera view. Hence, outcomes related to the right ankle could not be analyzed in the 2D assessment for six participants (n = 12), and outcomes related to the right knee could not be analyzed in the 2D assessment for one participant (n = 17). This was because the right camera, due to its positioning, could not capture the individually placed first step. Follow-up assessments were conducted after a mean wash-out phase of six (min 1; max 18) days.

Table 1 Age and anthropometric characteristics, mean (standard deviation).

Concurrent validity

We found acceptable (> 0.7) to excellent and statistically significant agreement of the 2D system with the 3D reference system in the assessed kinematic and spatio-temporal parameters, except for the parameters RoM hip and ankle with moderate ICC values, as well as toe-off and RoM pelvis with low ICC values. Absolute mean differences ranged from very small (e.g. 0.01 m/s velocity) to fairly high (e.g. 11.56° RoM pelvis). Due to the very low variability, several small differences were statistically significant (Table 2).

Table 2 Concurrent validity of selected outcomes assessed with TEMPLO (2D) against VICON (3D) as reference method, n = 18.

Bland–Altman plots show mean differences including their 95% limits of agreement for selected parameters (Fig. 2).

Figure 2

Bland–Altman plots for selected outcomes assessed with TEMPLO (2D) against VICON (3D) as reference method, n = 18; blue lines indicate mean differences and red lines 95% limits of agreement.
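The quantities drawn in a Bland–Altman plot, i.e. the mean difference and its 95% limits of agreement, can be computed as in the minimal sketch below. It assumes the usual definition of the limits as the mean difference ± 1.96 sample standard deviations of the paired differences; the function name and the stride times are hypothetical and are not study data.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Return (mean difference, lower limit, upper limit) for paired
    measurements, with limits at mean difference +/- 1.96 * SD."""
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need two paired series with at least two values")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample SD of the paired differences
    return bias, bias - spread, bias + spread


# Hypothetical stride times in seconds (2D system vs. 3D reference).
two_d = [1.02, 1.05, 1.00, 1.08]
three_d = [1.00, 1.04, 0.99, 1.05]
bias, lower, upper = bland_altman(two_d, three_d)
print(f"bias = {bias:.3f} s, limits of agreement = {lower:.3f} to {upper:.3f} s")
```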

Reliability

We found acceptable (ICC > 0.75) to excellent and statistically significant consistency between three raters for all spatio-temporal and kinematic parameters when assessed with 3DGA. Consistency that is also acceptable for decisions about an individual (> 0.9) was found for stride time, step length, step time, and RoM hip. For RoM parameters, standard error of measurement ranged from 0.23° to 2.86°. Significant differences across the three raters did not occur for any of the parameters (Table 3).

Table 3 Inter-rater reliability of selected outcomes assessed with VICON (3D), n = 18.

We found acceptable (ICC > 0.75) to excellent and statistically significant consistency between test and retest of one rater for most of the parameters when assessed with 3DGA. Consistency that is also acceptable for decisions about an individual (> 0.9) was found for RoM hip, RoM knee (left leg only), RoM ankle, stride time, and step time (left leg only). Toe-off resulted in ICCs close to the 0.75 threshold. Only values for RoM pelvis (left 0.22; right 0.53) were below the clinically acceptable threshold. For the RoM parameters, SEM ranged from 0.4° to 1.9°. Absolute mean differences ranged from very small (e.g. 0.01 s step time) to fairly small (e.g. 1.51° RoM knee) (Table 4). Mean walking velocity ranged from 1.40 to 1.42 m/s for the left leg, and 1.39 to 1.40 m/s for the right.

Table 4 Test–retest reliability of selected outcomes assessed with VICON (3D), n = 18.

Discussion

This study provides deeper insights into the validity and reliability of a video-based system for assessing basic sagittal gait parameters. Considering the agreement of 2D derived values with the 3D reference system, both statistically significant and clinically relevant deviations were found for the parameters RoM hip and RoM ankle, where the 2D system generated lower RoM values.

Stride time, stride length, velocity, and RoM knee showed at least acceptable agreement and fairly small deviations of the 2D system from the 3D reference. Stride time, for instance, gave a mean system difference of 0.02 s, with limits of agreement between −0.01 and 0.04 s. For this and the other aforementioned parameters, the 2D system achieves an accuracy that is acceptable for many clinical applications. The low toe-off ICC values (left 0.26; right 0.32) might be due to the 2D marker-based calculation versus the 3D method using force-plate thresholds. For RoM pelvis, the agreement with the reference method was only weak, which might be caused by the differing view of the pelvic markers between 2D and 3D tracking. Although an acceptable agreement was achieved, the deviation from the reference method was fairly large for RoM hip and RoM ankle. The latter could additionally be caused by the camera setup, as the right ankle was captured farthest from the center of the camera view. Regarding RoM hip, the differing view, and consequently the differing tracked coordinates, of the pelvic markers in the 2D and 3D technologies might be highly relevant. To minimize lens distortion-related errors in 2D sagittal gait analysis, scan cameras with a resolution above two megapixels and a minimum distance of 3.2 m between the walkway and the camera can be recommended5. A possible variation of the distance from the camera to the sagittal plane of walking should not substantially bias the obtained results, as participants were walking on a narrow walkway of 0.5 m width, which should facilitate fairly straight walking relative to the calibration plane.

One limitation is that the algorithms used for event detection were developed based on existing literature28 and on observed agreement with the video. However, spatio-temporal parameters mainly had an ICC close to one and a standard error of measurement (derived from the ICC) that was not clinically relevant. Through data fusion with spatio-temporal parameter-specific systems, this could be optimized even further29. The levels of agreement found in this study when comparing 2D with 3D are somewhat lower than those of a study in which time parameters were validated for a markerless system2. For knee RoM, Peebles et al.30 found an ICC of 0.94, which is slightly higher than in our study (0.86–0.88). However, a direct comparison between these studies is limited, as the participants in the Peebles study performed treadmill running and absolute values for angles were reported differently.

Inter-rater consistency resulted in ICCs ranging from 0.86 to 0.97 with no significant deviations across the three raters. Compared to a recent study conducted on a treadmill using a webcam recording at 30 Hz, ICC for repeatability was similar for hip and knee RoM31, supporting the findings of our study. Test–retest consistency was predominantly weaker than inter-rater consistency. Although the sample consisted of healthy, young, movement-proficient participants, the wide range of 1 to 18 (mean 6) days between test and retest may have influenced intra-rater outcomes to some extent. However, there was no statistically significant correlation between the actual duration of the wash-out phase and the ICC. In summary, most parameters achieved acceptable ICCs, and absolute mean differences were very small. Yet, ICCs considered acceptable for evaluating repeated scores of an individual (> 0.9) were not achieved for all parameters. For the interpretation of repeated assessments of a single client, clinicians are therefore advised to calculate the minimum detectable change (MDC) based on the values of the standard error of measurement (derived from the ICC) given in Table 4. MDC95 indicates the required change in a repeated measurement that would exceed the test–retest variability of the outcome with a 95% confidence level32 (MDC95 = SEM × 1.96 × √2).
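As a worked illustration of the MDC95 formula, the sketch below converts an SEM into an MDC95 and checks whether an observed change between two assessments exceeds it. The function names and the SEM and change values are hypothetical and are not taken from Table 4.

```python
import math


def mdc95(sem: float) -> float:
    """Minimum detectable change at the 95% level:
    MDC95 = SEM * 1.96 * sqrt(2)."""
    return sem * 1.96 * math.sqrt(2.0)


def exceeds_mdc95(observed_change: float, sem: float) -> bool:
    """True if the absolute change between two assessments is larger
    than the MDC95 derived from the given SEM."""
    return abs(observed_change) > mdc95(sem)


# Hypothetical SEM of 1.2 degrees for a RoM parameter.
threshold = mdc95(1.2)
print(f"MDC95 = {threshold:.2f} degrees")
print(exceeds_mdc95(2.5, 1.2), exceeds_mdc95(-4.0, 1.2))
```

Under these illustrative numbers, a change of 2.5° would still lie within test–retest variability, whereas a change of 4.0° in either direction would exceed it.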

It is assumed that both inter-rater and test–retest inconsistency result from a mixture of the variability of the analysis system itself, the rater-dependent marker application, and the patient’s gait pattern. Gait patterns may also be influenced by the patient’s attempt to hit the force plates, which is recognizable to some extent in most laboratories and can be monitored to a limited extent visually or by assessing potential stride variability. The combination of these components reflects the actual real-life situation, but the extent to which each of these three components contributes to the inconsistency remains unclear and is a subject for future research. Such research would require a design that compares identical movements to eliminate gait variability as an influencing factor. However, the higher test–retest inconsistency (compared to inter-rater inconsistency) observed in most cases is likely due to higher variability in the gait pattern between two measurement days. In our study, gait variability was minimized because participants were mainly exercise-proficient and accommodation trials were performed on the walkway prior to gait assessment. Considering the influence of the technology used and the algorithms for processing parameters, standardized generation of gait data is urgently needed. In particular, methods of kinematic assessment should be stated with a reference regarding their measurement properties. Additionally, a consensus group could provide a list of criterion-proven parameters with a related essential minimum standard for the instrumentation used. Based on such standards, marker-based 2D video gait analysis might be superior to the currently available markerless technologies in the evaluation of gait in overweight and obese persons concerning sagittal hip and knee RoM.

Conclusions

2D gait analysis provides a possibility to accurately assess parameters such as sagittal knee RoM, stride time, stride length, and gait velocity in a healthy population with a generally stable gait pattern. This may be different in specific patient populations. Moreover, the parameters toe-off and pelvic RoM are not correctly captured by the 2D gait system, and the accuracy of the other RoM parameters is limited, so their applicability depends on the accuracy demands. Further, clinicians should keep in mind that 2D gait analysis provides relative angles and not joint-center-based calculations. Inter-rater and test–retest reliability of the 3DGA are generally acceptable, except for the parameter pelvic RoM. Nevertheless, the expertise of the raters in using the system should be taken into account when considering reliability and validity in interpreting findings. Clinical practitioners are advised to use the MDC to interpret whether a change detected in repeated measurements of a single client exceeds the test–retest variability of the outcome with a 95% confidence level.