Impact of measurement timing on reproducibility of testing among haemodialysis patients

Accurate evaluation of physical function in patients undergoing haemodialysis (HD) is crucial when analysing the impact of exercise programs in this population. The aim of this study was to evaluate the reproducibility of several physical function tests depending on the timing of their implementation (before the HD session vs. non-HD days). This was a prospective, non-experimental, descriptive study. Thirty patients on haemodialysis were evaluated twice, 1 week apart. The test session was performed before the haemodialysis session started and the retest was performed on a non-dialysis day. The testing battery included the short physical performance battery, sit-to-stand tests, 6 min walk test, one-leg stand test, timed up and go, and handgrip strength with and without forearm support. Intra-rater reproducibility was determined by intraclass correlation coefficients and agreement was assessed by Bland–Altman analysis. The intraclass correlation coefficients ranged from 0.86 to 0.96, so all tests showed good to very good relative reliability. The mean differences between trials for the sit to stand 10 and 60, timed up and go and all the handgrip tests were close to zero, indicating no systematic differences between trials. A large range of values between trials was observed for the 6 min walk test, gait speed, one-leg stand test and short physical performance battery, indicating a systematic bias for these four tests. In conclusion, the sit to stand 10 and 60, timed up and go and handgrip tests had good to excellent test–retest reliability in measuring physical function on different dialysis days in patients undertaking haemodialysis. The minimal detectable change values are provided for this population. Bias was found for the 6 min walk test, gait speed, short physical performance battery and one-leg stand test when the testing day changed.

Procedure. The study consisted of repeating the same tests on two different occasions, trials 1 and 2, to evaluate reproducibility. All tests were performed by the same experienced nurse. The test session was performed before HD treatment, as described elsewhere in the literature 17,19,20 . Before the first HD session of the week, the participants underwent the short physical performance battery (SPPB), one-leg stance test (OLST), and timed up and go (TUG) tests. Before the second HD session in the same week, the patients performed the sit to stand 10 (STS-10) and sit to stand 60 (STS-60) tests. Finally, the participants undertook the 6 min walk test (6MWT) before their third HD session in the week.
The retest session was performed on a non-HD day by the same nurse. Participants completed the same battery of tests in a single test session.

Definition of the tests. Short physical performance battery (SPPB). The SPPB objectively measures lower extremity function and includes several tests: balance, gait speed, and sit to stand with 5 repetitions (STS-5). This is a commonly used test in patients undertaking HD 17,21 .
One-leg standing test (OLST). It consists of maintaining a one-leg stance for as long as possible, with a maximum of 45 s per leg in three trials 19 .
Timed up-and-go test (TUG). The participants were given verbal instructions to stand up from a standard armchair (using their arms if necessary), walk 3 m as quickly and safely as possible, turn back at a cone set out by the researchers, walk back, and sit down again in the chair. The patients could wear their regular footwear and use a walking aid if needed. A stopwatch was started on the word "go" and stopped when the patient was fully seated again with their back against the backrest. The time taken to complete the test was recorded in three consecutive trials, using the first one to familiarise the patients with the test. The best time from the three trials was analysed 22 .
Sit-to-stand tests (STS). The STS-10 consisted of performing 10 complete movements of sitting down and standing up as fast as possible, with the arms held tightly against the chest. The elapsed time for the STS-10 was recorded. In the STS-60 test, the number of repetitions performed in 60 s was recorded 17,20,23 .
Handgrip (HG) with or without arm support. Two different procedures were compared, with and without arm support. In the HG test without support, the participant was seated in a chair. Participants performed three consecutive 3 s repetitions using an approved Jamar hand dynamometer, with 15 s rest periods between repetitions. The same test was then performed with the arm supported on the surface of a table 24,25 .
The 6-min walk test (6MWT). It consisted of assessing the maximum distance walked during a 6 min period 26 .
Statistics. The normality of the data distribution was assessed using the Kolmogorov-Smirnov test. Normally distributed descriptive data were reported as the mean and standard deviation (SD), and non-normally distributed data were reported as the median and range. We also performed paired comparisons with paired t-tests or Wilcoxon signed rank tests to assess any systematic bias between the trials.
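The paired comparison between trials can be illustrated with a minimal Python sketch. It computes only the paired t statistic and its degrees of freedom; a p-value would then be read from the t distribution with n − 1 degrees of freedom using a statistics package. The function name `paired_t_statistic` and the scores below are hypothetical and are not study data.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(trial_1, trial_2):
    """Return (t, degrees of freedom) for paired scores from two trials."""
    if len(trial_1) != len(trial_2):
        raise ValueError("each participant needs one score per trial")
    # Within-participant differences: trial 2 minus trial 1.
    diffs = [b - a for a, b in zip(trial_1, trial_2)]
    n = len(diffs)
    # Standard error of the mean difference (sample SD, n - 1 denominator).
    se = stdev(diffs) / sqrt(n)
    return mean(diffs) / se, n - 1


# Hypothetical timed up-and-go times in seconds (illustrative only).
before_hd = [8.0, 10.0, 12.0, 9.0]
non_hd_day = [9.0, 10.0, 11.0, 10.0]
t_value, df = paired_t_statistic(before_hd, non_hd_day)
```

A t statistic close to zero, as in this toy example, is what would be expected when there is no systematic difference between trials.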
Bland-Altman plots were used to visually assess the disagreement between the measurements on the two different measurement days. For each participant, the mean of the two scores was plotted against the difference between them (score on the non-dialysis day minus score before HD treatment) to check for possible systematic bias. The Bland-Altman plots displayed the 95% limits of agreement (95% LOA), which give the range within which 95% of future differences between measurement days are expected to lie. The 95% LOA were calculated as the mean difference between the two measurement days ± 1.96 × the SD of the differences.
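As a rough illustration of the limits-of-agreement calculation described above, the following Python sketch computes the bias and the 95% LOA from paired scores. The function name `bland_altman` and the example values are hypothetical; the sign convention (non-dialysis day minus before the HD session) is taken from the figure captions, and the sample SD is assumed.

```python
from statistics import mean, stdev


def bland_altman(before_hd, non_hd, z=1.96):
    """Return per-participant means, differences, bias and limits of agreement."""
    if len(before_hd) != len(non_hd):
        raise ValueError("each participant needs one score per trial")
    # Difference: non-dialysis day minus before the HD session.
    diffs = [b - a for a, b in zip(before_hd, non_hd)]
    # Mean of the two scores, used as the x axis of the plot.
    means = [(a + b) / 2 for a, b in zip(before_hd, non_hd)]
    bias = mean(diffs)
    sd_diff = stdev(diffs)  # sample SD (n - 1 denominator), an assumption
    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "loa_lower": bias - z * sd_diff,
        "loa_upper": bias + z * sd_diff,
    }


# Hypothetical timed up-and-go times in seconds (illustrative only).
trial_1 = [8.0, 10.0, 12.0, 9.0]
trial_2 = [9.0, 10.0, 11.0, 10.0]
result = bland_altman(trial_1, trial_2)
```

Plotting `result["means"]` against `result["diffs"]`, with horizontal lines at the bias and at both limits, would reproduce the usual Bland-Altman layout.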
The intraclass correlation coefficient (ICC; model alpha) and a two-way random-effects model were used to assess relative intra-rater reliability, which was rated 'excellent' (ICC ≥ 0.900), 'good' (≥ 0.750) or 'fair' (0.600 to 0.749) 27 . We assumed that there was no systematic bias between measurements within subjects and that the within-subject SDs were equal for all measurements, since the same rater measured participants 1 week apart. We calculated the absolute reliability, expressed as the standard error of measurement (SEM), and the minimal detectable change at the 90% confidence level (MDC 90 ) for these tests. The SEM and the MDC 90 were calculated using the following formulas 17,23 : SEM = SD × √(1 − r) and MDC 90 = SEM × 1.65 × √2, where r = ICC for the participant group.
The SEM measures absolute reliability and represents the extent to which a variable can fluctuate during the measurement process 28 . To be 90% confident about the range for a measurement, the calculation 1.68 × SEM was used 15,16 . The MDC is defined as the amount of change in a measurement required to conclude that the difference is not attributable to error and is the smallest change that falls outside the expected range of error 16,29,30 . We set the level of significance at p ≤ 0.05 for all our statistical analyses, and the data were managed and analysed using the Statistical Package for the Social Sciences (SPSS) version 20.0 for Windows (IBM Corp., Armonk, NY).
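The reliability indices can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the SPSS procedure used in the study: the ICC is computed as the single-measure, absolute-agreement form of the two-way random-effects model, often written ICC(2,1), the SEM uses the common form SD × √(1 − r), and the choice of SD (here a hypothetical value) is left to the analyst. All function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean


def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    `scores` holds one row per participant and one column per trial.
    """
    n = len(scores)       # participants
    k = len(scores[0])    # trials per participant
    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]
    # Sums of squares from the two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                 # between-participant mean square
    msc = ss_cols / (k - 1)                 # between-trial mean square
    mse = ss_error / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def sem(sd, r):
    """Standard error of measurement: SD x sqrt(1 - r)."""
    return sd * sqrt(1 - r)


def mdc90(sem_value):
    """Minimal detectable change at 90% confidence: SEM x 1.65 x sqrt(2)."""
    return sem_value * 1.65 * sqrt(2)


# Hypothetical STS-10 times in seconds: [before HD, non-HD day] per participant.
sts10 = [[24.0, 25.0], [30.0, 28.5], [19.5, 20.0], [27.0, 27.5], [33.0, 31.0]]
r = icc_2_1(sts10)
sem_value = sem(5.0, r)  # 5.0 s is a hypothetical SD
mdc_value = mdc90(sem_value)
```

Because the MDC 90 is a fixed multiple of the SEM, any change in the ICC or in the SD used propagates directly into the threshold for a real change.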

Results
Thirty participants with a mean age of 66.4 years (SD = 16.3), mean time on HD of 34.4 months (SD = 51.4), and mean Charlson comorbidity index of 8.5 (SD = 2.5) completed this study. The demographic and clinical data statistics for all the participants are shown in Table 1. No adverse events occurred during the testing.
Descriptive statistics of trial 1 (before the HD session) and trial 2 (non-dialysis day) as well as differences, are shown in Table 2.
Overall, the MDC and SEM values were quite large, especially for the 6MWT. Since SEM values can be translated into normal curve probabilities, the values in Table 3 can be applied in practice. Using the STS-10 as an example, it can be expected with a probability of 96% that the value of a repeated test will lie within approximately ± 7.2 s of the original value.
Given that the value of the MDC calculated in the present study is 8., and the value of the test in both trials is around 25 s, these results suggest that a change in individual performance of less than one third of the mean cannot be considered a real change and would be considered measurement error for the STS-10. Confidence intervals were narrow, except for the relatively large confidence intervals obtained for the gait speed test and the STS-10 (Table 3).
Table 1. Demographic, biochemical, haematological, and dialysis adequacy data as well as nutritional parameters for the patient cohort. N = 30. Ca calcium, HD haemodialysis, HDL high-density lipoprotein, i-PTH intact parathyroid hormone, K potassium, Kt/V Daugirdas formula for second-generation logarithmic estimates of single-pool variable volume, LDL low-density lipoprotein, P phosphorus.
Bland-Altman scatterplots were created to estimate disagreement between the two trials. The mean differences for the STS-10, STS-60, TUG and all the handgrip tests were close to zero, indicating no systematic differences between trials. All of these tests, except for the handgrip tests, presented better values on the non-HD day. Figures 1, 2 and 3 show the agreement for the STS-10 (Fig. 1), STS-60 (Fig. 2), and TUG (Fig. 3) between the measurements before the HD session and on a non-dialysis day. For the STS-10 there was a mean difference of 0.9 s between the days (95% LOA −9.9 to 11.6 s). For the STS-60 there was a mean difference of −0.5 repetitions (95% LOA −6.6 to 5.6 repetitions). For the TUG there was a mean difference of 0.2 s (95% LOA −2.3 to 2.8 s). For HG strength with forearm support there was a mean difference of 1.1 kg between the days for the right hand (95% LOA −5.3 to 7.6 kg) and 1.1 kg for the left hand (95% LOA −6.8 to 8.9 kg). For HG strength without forearm support there was a mean difference of 0.7 kg between the days for the right hand (95% LOA −5.1 to 6.6 kg) and 0.6 kg for the left hand (95% LOA −4.7 to 6.0 kg).
All figures show that the differences did not change much as the mean increased, and that the variation of the data remained constant.
A large range of values between trials was observed for the 6MWT, gait speed, OLST and SPPB (Table 2). Thus, the Bland-Altman plots indicated a systematic bias for these four tests (Fig. 6). The mean difference scores between the different days for the same rater differed significantly from exact agreement (p < 0.001).

Discussion
The study attempted to clarify whether physical function tests measured in patients undertaking HD are reproducible when the testing day changes (before the HD session vs. non-dialysis day). The sample size reached the recommended number of 30 31 . Although high ICC values were obtained, the ICC is a ratio index of within- and between-subject variability; therefore, agreement between groups of subjects does not provide information about individual change or error in scores. Additionally, the ICC depends on the sample variability, and thus the ICC should not be used in isolation 32 . The Bland-Altman plots were useful in exposing the relationship between the trials.
The present study shows a high degree of agreement between measurements on different days (HD day before the session vs. non-HD days) and good or excellent ICC results (above 0.86) only for some tests (STS-10, STS-60, TUG and HG tests) demonstrating lack of systematic bias when the measurement day changed. Thus, our results support the use of these tests when there is a change in the timing for assessment.
The scores from our participants were similar to those reported in previous research by our group, with a slight difference only for the handgrip tests (STS-10: 25. Our results suggest that the HG test without arm support is also reliable and has even lower MDC values, which would make it easier to distinguish true changes from measurement variability. The present ICC results concur with those from our previous studies in similar samples (39 participants for the STS-10, STS-60, HG) 17 . Our results show that there was no systematic bias for the STS-10, STS-60, TUG, or HG tests, so these tests can be measured on different days. Nevertheless, this study shows a systematic bias for the SPPB, gait speed, and 6MWT when the timing (before the HD session vs. non-dialysis day) changes. Systematic bias has been explained by a learning effect, whereby the participant repeats the test and improves their results during the retest, albeit to a non-significant degree 34 . A previous intra-rater study also showed no learning effect 19 . Our results do not show this learning effect, since gait speed and 6MWT performance were better before the HD session in trial 1 than in the retest session on a non-HD day (Table 2). Some authors suggest that testing before the HD session may have reduced the effects of fatigue from the previous HD session 33 . Additionally, the high variability of functional results in this cohort is well known 17,20 , so it seems very important to keep the same testing circumstances when testing this cohort.
Hence, the Bland-Altman method showed that the 6MWT, gait speed, OLST and SPPB had substantial bias and a large disproportion in the LOA. This situation, large ICC values but a lack of agreement according to the Bland-Altman method, was also found when establishing the reliability of some motor tests 32 . Gait speed and the 6MWT achieved higher results when tested before the HD session, while balance achieved higher results on non-HD days. Fatigue resulting from administering all the tests in a row on a non-HD day could explain why some tests obtained poorer results on non-HD days; this fatigue should not affect balance. Previous research has tested a battery of three tests on non-HD days 33 . Clinical feasibility does not allow us to test patients on several non-HD days, because these participants already spend many hours in a clinical setting for their treatments and so it would be difficult to convince them to spend extra time there for physical function testing alone. Finally, our results may help to clarify which tests could be measured before the HD session by the same rater, because there is no consensus in this regard and clinical applicability should be considered to extend testing into routine treatment.
Figure 2. Bland-Altman plots showing agreement for the time required to perform the sit-to-stand-to-sit 60 test, obtained before the haemodialysis session and on a non-dialysis day by the same rater. Y axis: difference (non-dialysis − before the haemodialysis session) in seconds. X axis: average, (non-dialysis + before the haemodialysis session)/2, in seconds.
The main strength of this study was that, to the best of our knowledge, this was the first time that the reproducibility of physical function tests in patients undergoing HD has been tested with different test administration timings. Assessment at nephrology units could be difficult to implement because of a lack of human resources and logistics in many clinical settings. Thus, it is important to be flexible regarding the test timing in this cohort, but it is also important to note that these changes affect the reproducibility of several commonly used physical function tests. The main weakness of this work was that the sample size was relatively small. Another limitation is that we did not make two measurements with each timing. Since there was only a 1-week difference between measurements, we believe we may assume that there were no systematic biases between measurements within subjects and that the within-subject SDs were similar for all measurements.
Our results have important implications for the implementation of physical function testing in HD units and indicate that the same assessors should test patients. Future work should be multicentre and include larger samples to confirm these findings, and should also aim to clarify the ideal battery for clinical assessments in this population by assessing other tests, such as lower-limb muscle strength tests.

Conclusion
The STS-10, STS-60, TUG and handgrip tests had good to excellent test-retest reliability in measuring physical function on different dialysis days in patients undertaking HD. The MDC values are provided for this population. Bias was found for the 6MWT, gait speed, SPPB and OLST when the testing day changed. Future studies should be conducted to clarify the ideal battery for routine clinical assessments in this population, including lower-limb muscle strength tests.
Figure 3. Bland-Altman plots showing agreement for the time required to perform the timed up-and-go test, obtained before the haemodialysis session and on a non-dialysis day by the same rater. Y axis: difference (non-dialysis − before the haemodialysis session) in seconds. X axis: average, (non-dialysis + before the haemodialysis session)/2, in seconds.