Introduction

Voluntary control over the arm and hand is an important issue during and after the rehabilitation of persons with a cervical spinal cord injury (C-SCI).1, 2 Patients are more interested in what they actually will be able to do with their hands (ICF activity level3), rather than (for example) the amount of force they can generate (ICF body function level). Moreover, therapists are particularly interested in the way patients perform daily activities. In order to rate the quality of upper limb (arm and hand) performance regarding activities that are performed commonly or daily, the Van Lieshout hand function test for Tetraplegia (VLT) has been developed4, 5 (www.vlt-test.eu). Next to evaluation of arm–hand performance, and given the ordered structure of the test score levels per test item, the VLT may also provide insight to the therapist about the next higher performance level on a test item a specific patient may attain. The clinical version of the VLT consists of 19 items divided into 5 areas of interest, that is, arm ability to transfer the body, arm positioning and stabilizing, hand opening and closing, grasping and releasing, and manipulating (www.vlt-test.eu). Each item has six ordinal response categories ranging from 0 (unable to perform the task) to 5 (maximum performance level). An example of one task item is given in Figure 1.

Figure 1

Example of VLT-SF test item.

A short, less time-consuming 10-item VLT research version (VLT short form or VLT-SF) (www.vlt-test.eu) has been developed and proved to be valid, reliable and responsive.5, 6 The 10 VLT-SF items are listed in Table 1. However, both VLT and VLT-SF generate ordinal total scores while relying on counts of potentially unequal units. Ordinal scores of the item response categories are separated by unknown distances. Moreover, obtaining the same score on different items does not necessarily represent the same amount of the measured variable.7, 8 As the VLT-SF does not fully satisfy the criteria of an objective scientific measure,9 its interpretation and use are somewhat restricted. To correct this flaw, the ordinal scores should be converted into linear interval level scores before quantitative comparisons across patients and across time can be made. Of the measurement models that can be used to estimate linear measures from raw total scores, the Rasch model is the most promising.10 This model prescribes that only the item difficulty, the person’s ability (in terms of the VLT-SF, the arm–hand performance) and the threshold difficulty determine the probabilities of endorsing any category of an item, regardless of other attributes of the subjects such as age and gender. Thresholds are the ability levels required to have an equal probability of endorsing one response rather than the previous one (in terms of the VLT, the test score level; www.scionn.nl). In the case of the VLT-SF, a Rasch analysis10 can be used to mathematically model the response probabilities as a function of a C-SCI person’s ability and VLT-SF test item difficulty. This model can be used to verify the following:

  1. order of response categories (Do successive response categories of each item represent increasing levels of a person’s ability?);

  2. unidimensionality (Does a (sub-)scale or variable solely represent a single property or attribute under investigation?);

  3. reliability of the scale;

  4. scale invariance (Is the difficulty of items stable in different subgroups?).

Table 1 Recoded VLT-SF response scale and example of patient score

The aim of the study is to apply the Rasch model to the VLT-SF to upgrade its ordinal scale towards an interval level scale and to verify if the requirements of an objective measure are satisfied.

Materials and methods

Subjects

Data from a large prospective cohort study in SCI patients (www.scionn.nl) across eight Dutch Rehabilitation centres and a research project at Adelante Rehabilitation Centre called ToCUEST (Task-oriented Client-centred Upper Extremity Skill Training)11 were used. Data stem from patients admitted for initial rehabilitation. Inclusion criteria were: complete or incomplete C-SCI (including lesion at T1), and age between 18 and 70 years. Exclusion criteria were: severe additional neurological, orthopaedic or rheumatologic disease, hampering the upper limb performance, and total inability to perform upper extremity measurements. Three months after the patients were able to sit in a wheelchair for at least 3 consecutive hours, VLT-SF data (of the least affected arm–hand) were collected.

Data analysis

A Rasch analysis, featuring the Partial Credit Model, was performed using RUMM2030 software. The Partial Credit Model was preferred, as it accommodates the use of scales with different response categories (number of response categories and/or content of one category) in which the location of the thresholds (relative to the item difficulty) is not the same from one item to the other. Based on an (experimental) data set provided (in our case the VLT-SF test data set), the Rasch Partial Credit Model (A) mathematically determines the position of each response category of each test item along a (new) linearised interval scale (common to all test items), representing item difficulty, and (B) mathematically determines the position of the thresholds between adjacent response categories per test item, based on threshold position probability. As a mathematical consequence of (A) and (B), each response category of each test item is positioned relative to all other response categories of all test items. Once the observed responses are found to fit the unidimensionality requirement (by comparing the observed scores, that is, the responses given by the patients, with the expected scores; see also the next paragraphs), the Rasch model can be used to determine the level of arm and hand skilled performance of each patient and the difficulty of each item or threshold on a common linear interval scale.12 The Rasch model uses a logistic transformation to convert the ordinal scores into linear measures expressed in ‘logits’ (that is, log-odds units). The logit is defined as the unit that expresses the natural logarithm of the odds of success (that is, the pass-to-fail probability ratio of a subject on an item).13 The logit is, in effect, the unit in which the new measure (=r_VLT-SF) is expressed.
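As an illustration of the model described above, the following minimal Python sketch computes the category probabilities of a single item under the Partial Credit Model and the log-odds (logit) of a probability. The function names and the threshold values are hypothetical and serve only as an example; they are not the calibrated VLT-SF values, and the sketch is not the RUMM2030 implementation.

```python
import math


def pcm_category_probabilities(ability, thresholds):
    """Return P(X = k) for k = 0..m for one item under the Partial Credit Model.

    ability    -- person location in logits
    thresholds -- list of m threshold locations in logits (hypothetical values)
    """
    # Cumulative sums of (ability - threshold); the empty sum for category 0 is 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (ability - tau))
    # Subtract the largest exponent before exponentiating for numerical stability.
    peak = max(exponents)
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def logit(probability):
    """Natural logarithm of the odds of success."""
    return math.log(probability / (1.0 - probability))


# Hypothetical item with three thresholds, that is, four categories (0-3).
example_thresholds = [-1.0, 0.5, 2.0]
example_probs = pcm_category_probabilities(0.5, example_thresholds)
```

In this example the person's ability (0.5 logits) coincides with the second threshold, so categories 1 and 2 are equally likely, which is the defining property of a threshold.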

To evaluate whether the VLT-SF satisfies the requirements of an objective measurement, the following steps were performed:

The order of response categories (and their thresholds) was verified to determine whether the multiple response categories reflect an increasing continuum of arm and hand skilled performance. This means that subjects with a higher level of performance always select a higher response to any given item, and subjects selecting a higher response for a given item indeed present a higher level of performance.14 Categories not discriminated can be collapsed (=combined) with an adjacent category into a single category,15 thus necessitating the Partial Credit Model mentioned above. Two guidelines were used to decide on collapsing/combining of categories.16 Firstly, the collapsing/combining should make sense from a clinical point of view, for which the opinion of four experts was gathered. Secondly, when the collapsing was clinically relevant, the response category frequencies were investigated (that is, number of responses per category and shape of the frequency distribution).

Unidimensionality was assessed by χ2 fit statistics and a principal component analysis on the residuals (that is, the differences between observed and expected responses). After dividing the total sample into three class intervals of patients with increasing ability, the item χ2 fit statistic was computed as the sum of the squared standardized residuals of each of the three class intervals.17 Items with a P-value <0.05 indicate a threat to the fit requirement.13 The principal component analysis on the residuals gives the percentage of variance attributable to the Rasch factor and the first residual factor. Independent t-tests were used to compare the estimates for each subject, which were derived from the highest positive and negative loading items (correlated at 0.3 and above with the component) on the first principal component of the residuals. The scale is considered unidimensional when the percentage of tests outside the range of ±1.96 is <5%.18, 19
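The t-test step described above can be sketched as follows. The sketch assumes that, for each subject, two ability estimates with their standard errors are available (one from the positively loading items and one from the negatively loading items), and that the t value is the difference between the two estimates divided by the square root of the sum of the squared standard errors. The function name and all numbers are hypothetical.

```python
import math


def fraction_of_t_tests_outside(estimates_a, errors_a, estimates_b, errors_b,
                                critical=1.96):
    """Fraction of subjects whose two ability estimates differ significantly."""
    outside = 0
    for a, se_a, b, se_b in zip(estimates_a, errors_a, estimates_b, errors_b):
        # Per-subject t value comparing the two subset-based estimates.
        t_value = (a - b) / math.sqrt(se_a ** 2 + se_b ** 2)
        if abs(t_value) > critical:
            outside += 1
    return outside / len(estimates_a)


# Hypothetical estimates (logits) for four subjects.
positive_set = [1.0, 0.2, -0.5, 2.0]
negative_set = [0.8, 0.1, 1.5, 1.9]
standard_errors = [0.5, 0.5, 0.5, 0.5]

fraction = fraction_of_t_tests_outside(positive_set, standard_errors,
                                       negative_set, standard_errors)
```

Under the criterion stated above, a fraction below 0.05 in a real sample would be taken as support for unidimensionality.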

The reliability of the scale, that is, the internal consistency, was assessed using the Person Separation Index (PSI).9 The PSI should be at least 0.70 for research purposes and at least 0.85 for clinical use.20 The traditional Cronbach alpha coefficient was also used to assess the reliability.
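For readers unfamiliar with these two coefficients, a minimal sketch is given below. It assumes the usual formula for Cronbach's alpha and assumes that the PSI is computed as the proportion of the variance in person estimates that is not attributable to measurement error; the exact computation in RUMM2030 may differ. All data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(score_rows):
    """score_rows: one list of item scores per person (complete data)."""
    n_items = len(score_rows[0])
    item_variances = [variance([row[i] for row in score_rows])
                      for i in range(n_items)]
    total_variance = variance([sum(row) for row in score_rows])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


def person_separation_index(person_estimates, standard_errors):
    """(observed variance - mean error variance) / observed variance."""
    observed = variance(person_estimates)
    error = sum(se ** 2 for se in standard_errors) / len(standard_errors)
    return (observed - error) / observed


# Hypothetical data: three persons, two perfectly consistent items.
alpha = cronbach_alpha([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
# Hypothetical person estimates (logits) with a standard error of 1 logit each.
psi = person_separation_index([-2.0, 0.0, 2.0], [1.0, 1.0, 1.0])
```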

The invariance of item difficulty hierarchy across groups of patients with either a motor complete or incomplete lesion (two subgroups) was tested using Differential Item Functioning (DIF) tests.9 If an item measures the same ability in the same way across groups then, except for random variations, the same success rate should be found, irrespective of the nature of the group. Items that give different success rates for two or more groups, at the same ability level, are said to display DIF. A two-way analysis of variance was computed on the standardized residuals.21 In case of a significant motor lesion completeness main effect, the relative difficulty of the item is different between the patients’ subgroups (uniform DIF). A significant interaction effect between the patients’ subgroups and the class intervals of patients with increasing ability indicates a non-uniform DIF.

Results

Patient characteristics are given in Table 2.

Table 2 Patient characteristics

Verification of the order of response categories

Although Rasch analysis of the original VLT-SF data indicated a good capacity of the scale to discriminate persons’ ability levels (PSI 0.91), 7 out of 10 items demonstrated disordered response categories, that is, the order of response categories within a test item did not match successive increase of response difficulty. Therefore, the original six response categories were collapsed/combined into either three categories (range: 0–2) for four items, or four categories (range: 0–3) for six items. The resulting recoded VLT-SF response scale is shown in Table 1. Note that its maximum recoded (raw) score is 26 (=4 items × 2 and 6 items × 3).
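The recoding step can be sketched in Python as follows. The actual mapping of original to recoded categories is given in Table 1; the two collapsing maps and the assignment of items to them below are hypothetical placeholders, chosen only so that four items have three categories and six items have four.

```python
# Hypothetical collapsing maps (original score 0-5 -> recoded score).
COLLAPSE_TO_THREE = {0: 0, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2}
COLLAPSE_TO_FOUR = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3}

# Hypothetical choice of the four items recoded into three categories.
THREE_CATEGORY_ITEMS = {1, 4, 7, 9}


def recode_item(item_number, original_score):
    """Return the recoded score of one item."""
    if item_number in THREE_CATEGORY_ITEMS:
        return COLLAPSE_TO_THREE[original_score]
    return COLLAPSE_TO_FOUR[original_score]


def recoded_total(original_scores):
    """original_scores: dict mapping item number (1-10) to original score (0-5)."""
    return sum(recode_item(item, score)
               for item, score in original_scores.items())


# A patient with the maximum original score on all 10 items.
best_possible = recoded_total({item: 5 for item in range(1, 11)})
```

With any such mapping, the maximum recoded total equals 4 × 2 + 6 × 3 = 26.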

Verification of unidimensionality requirement

After recoding the response scale, and Rasch transformation of the data into logits (r_VLT-SF), all items showed ordered response categories and contributed to the definition of a unidimensional measure of arm–hand functioning (total and individual item χ2 probabilities >0.05). The average item fit residual was equal to −0.22 (s.d. 1.33), indicating a slight overfit (that is, more proficient persons perform better and less proficient persons perform worse than expected by the model). After the principal component analysis, the percentage of individual t-tests outside the range of ±1.96 (that is, significant at P<0.05) was 4%, which is below the 5% criterion and indicates unidimensionality of the r_VLT-SF.18

Targeting of the r_VLT-SF test relative to the patients’ performance and verification of the scale reliability

The distributions of the persons’ ability values and the items’ thresholds are displayed in Figure 2. The average person’s ability was 1.49 logits (s.d. 2.85), deviating from the average difficulty of the test, which was arbitrarily set at 0.0 logits by the Rasch model. The difficulty of the test items is relatively well targeted on the persons’ ability range, although some gaps are present (for example, around 0 logits) (Figure 2). Persons’ ability values cover a range of 11.1 logits (maximal person’s ability−minimal person’s ability, excluding extreme (floor and ceiling effect) values). A slight floor effect (4% of the sample), as well as a more pronounced ceiling effect (11% of the sample), exist. A PSI of 0.91 and a Cronbach alpha of 0.95 indicate that the r_VLT-SF scale is reliable and allows for the patients to be discriminated.

Figure 2

Distribution of persons’ ability and response category difficulty. ES (Extreme subject, showing either a ceiling effect or a floor effect); n (number). Please note: in the bottom panel, each vertical line represents a threshold between adjacent response categories (difficulty) within each test item. Two threshold locations overlap. (See also Figure 3, middle panel).

The item difficulty hierarchy of each item relative to the other items in the total test difficulty continuum

In Table 3 the item difficulties are sorted, from top to bottom, in order of decreasing difficulty (range: from 1.341 to −1.110 logits). ‘Thumb grip’ was the item requiring the highest ability level to be performed, whereas ‘Arch task’ was the easiest item. Table 3 also reports the standard error associated with each item difficulty, as well as each corresponding χ2 probability.

Table 3 r_VLT-SF calibration for patient with C-SCI

Description of the r_VLT-SF scale

The definition and use of the r_VLT-SF scale are depicted in Figure 3. The top panel shows the distribution of arm–hand performance measures of the patients across the entire r_VLT-SF range. The bottom panel illustrates the s-shaped relationship between the recoded VLT-SF raw total scores (y axis) ranging from 0 to 26, and the arm–hand functioning linear measures expressed in logits (x axis) ranging from −4.78 to 6.32 logits. This relationship is approximately linear for measures between −2 and 2 logits. Outside this central range, however, a unitary progression in total score corresponds to an increasingly large change in the r_VLT-SF measure. In the central range, the change in r_VLT-SF measure corresponding to a unitary increment in total score from, for example, 12 to 13 is equivalent to 0.24 logits. Outside this central range, it increases to 1.44 logits for the same unitary increment in total score from, for example, 25 to 26. This six-fold difference denotes the nonlinearity of the total score. The middle panel demonstrates the expected score on a given item as a function of the r_VLT-SF arm–hand performance measure. By comparing the ability of a given patient to the difficulty of each item, it is possible to determine the expected score of the patient on that item. For example, a patient with a measure of 2 logits would be expected to obtain a score of 2 on the easiest item (item 2 arch task) and a score of 1 on item 5 (strength thumb), a more difficult item (see Figure 3).
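The s-shaped relationship between raw total score and logit measure can be reproduced with a short sketch. It assumes that, for known item thresholds, the measure belonging to a raw total score is the ability at which the model-expected total score equals that raw score, found here by bisection. The thresholds are hypothetical and symmetric around 0 logits; they are not the calibrated values of Table 3, so the step sizes will differ from the 0.24 and 1.44 logits reported above.

```python
import math


def expected_item_score(ability, thresholds):
    """Expected score on one item under the Partial Credit Model."""
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + ability - tau)
    peak = max(exponents)
    weights = [math.exp(e - peak) for e in exponents]
    return sum(k * w for k, w in enumerate(weights)) / sum(weights)


def expected_total_score(ability, items):
    return sum(expected_item_score(ability, t) for t in items)


def ability_for_total_score(score, items, low=-10.0, high=10.0):
    """Bisection; the expected total score increases with ability."""
    for _ in range(60):
        mid = (low + high) / 2
        if expected_total_score(mid, items) < score:
            low = mid
        else:
            high = mid
    return (low + high) / 2


# Hypothetical thresholds: 4 items with 2 thresholds and 6 items with 3,
# giving a maximum raw total score of 26.
ITEMS = [[-1.0, 1.0]] * 4 + [[-1.5, 0.0, 1.5]] * 6

central_step = (ability_for_total_score(14, ITEMS)
                - ability_for_total_score(13, ITEMS))
extreme_step = (ability_for_total_score(25, ITEMS)
                - ability_for_total_score(24, ITEMS))
```

The extreme raw scores (0 and 26) have no finite measure under this approach, which is why the sketch compares the step from 24 to 25 with the step from 13 to 14.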

Figure 3

Distribution of arm–hand skilled performance scores of patients (top panel); expected response categories (and thresholds) within each test item in relation to the recoded VLT-SF in logits (middle panel); relationship between recoded VLT-SF total raw score and r_VLT-SF expressed in logits (bottom panel). n (number); X (hypothetical patient scoring 2 logits on the r_VLT-SF); dotted line (expected scores on each test item); colour coding in middle panel (order of response categories per test item). A full color version of this figure is available at the Spinal Cord journal online.

Verification of the invariance scale requirements

The invariance of r_VLT-SF was validated in our sample across motor complete or incomplete lesions, as neither uniform nor non-uniform DIF was detected (using Bonferroni adjustment). This means that the difficulty hierarchy of the VLT-SF activities was invariant across patients with motor complete or incomplete lesions. Therefore, r_VLT-SF can be used to measure arm–hand skilled performance in patients with C-SCI, regardless of the motor completeness of the lesion.

Discussion

The purpose of the study was to apply the Rasch measurement model to the VLT-SF to verify whether the requirements of an objective measurement were satisfied.

The Rasch analysis of the original VLT-SF indicated that although the overall power of fit was good, there were some shortcomings regarding the ordering of the thresholds and the unidimensionality of the scale. As the number of response categories may be largely responsible for disordered thresholds, it was decided to reduce the number of response categories from six to three or four. The problem that may arise from reducing the response categories is that the scale loses the ability to discriminate between persons. However, the original PSI of 0.91 did not change after the collapsing of response categories, indicating that the scale allows for patients to be discriminated. Another option to obtain ordered response categories and unidimensionality would be to discard items of the original VLT-SF. However, disordered thresholds were present in most items (7 out of 10), and Rasch analysis of the original VLT-SF demonstrated that person–item distribution was well targeted and that the items were locally independent. As discarding items would influence the comprehensiveness with which the VLT-SF scale covers the clinically relevant areas, the latter option was not applied. The reducing and recoding of the response categories from six to three or four led to ordered thresholds and improved the item fit to the model. The same observation was made in a study by Tennant et al.22 investigating the response categories of the Functional Independence Measure. In r_VLT-SF, both total and individual item χ2 probabilities were higher than 0.05, indicating that the scale and all items meet the criteria for unidimensionality. This is confirmed by the results of the principal component analysis on the residuals. The misfit observed in the original VLT-SF scale was probably caused by the inability of clinicians to discriminate the original six response categories. Actually, when two categories are indistinguishable for assessors, there is no threshold between them. However, when assessors try to estimate a hypothetical one, this may introduce bias, which is highlighted by the fit statistics.23

Considering the advantage of recoding the VLT-SF response categories into the r_VLT-SF ones, one might want to use only the reduced response categories for clinical purposes. However, clinicians indicate that, besides serving the purpose of upper extremity assessment, the different response categories of the original version are also a relevant tool to guide them in their clinical reasoning and therapy planning. Therefore, it was decided not to replace the VLT-SF by the r_VLT-SF for daily practice, but to provide a recalculation table (software package) for the therapists (see also www.vlt-test.eu). This will enable the clinicians to transform the original item scores into r_VLT item scores. The latter allow interval level measures to be generated, from which it is possible to derive the expected individual item responses, as presented in Figure 3.

As persons with motor complete and incomplete lesions may have different levels of arm–hand skilled performance, it was necessary to investigate whether the different items are invariant, regardless of the completeness of the lesion. The Rasch analysis has demonstrated that the item difficulty hierarchy of the r_VLT items does not change according to the motor completeness of the lesion. This is in accordance with a previous study showing that the VLT can be used both in persons with motor complete and incomplete lesions.6 However, because of our small sample size, the analysis may have failed to detect DIF. According to Scott et al.,24 a sample size of 200 is required to detect DIF in a scale with more than two items, with adequate power (>80%).

Performing a Rasch analysis on the VLT-SF provides additional clinical value to the instrument. It proved to be a unique high precision instrument (PSI=0.91; Cronbach alpha=0.95) to quantify the quality of arm–hand skill performance in persons with C-SCI, supporting the clinicians in assessment and prediction of outcome. In practice, the conversion of the VLT-SF ordinal total scores into linear measures r_VLT-SF will permit a better quantification of an individual patient’s progress regarding arm–hand skill performance. The hierarchical nature of the Rasch-improved scale can be used to identify a patient’s pattern of improvement, given the current arm–hand skill performance measurement. Furthermore, the r_VLT-SF scale has the ability to detect discrepancies between the observed score on each item and the score expected by the model, given the patient’s overall measure. For instance, it will be possible to identify on which items the person’s performance lags behind expectations. This in turn may lead to a more client-centred approach, in which therapy will be customised to the patient’s specific needs.

The Rasch investigation and calibration of the r_VLT-SF scale were based on data from the assessment performed 3 months after the start of the active rehabilitation. This assessment time was chosen because: (1) more data were available and (2) this moment was considered the most representative in the rehabilitation process, as it avoids major floor or ceiling effects in patients who were either unable to perform any task (at the start of the rehabilitation) or who obtained the maximum score (at the end of the rehabilitation). Minimum and maximum total scores (that is, floor and ceiling scores) are by definition (statistically) discarded from any Rasch procedures, as they do not contribute to the sensitivity of the scaling.9 However, some floor and ceiling effects are present even at 3 months after the start of active rehabilitation (see Figure 2). This may be due to the fact that data of the best hand were used. Gathering data of both hands may be more representative of patients’ actual arm–hand ability. This, however, should be investigated in future research.

As clinicians are generally unfamiliar with the concept of logits, transforming the linear measures expressed in logits into 0–100 scores may improve the clinical application of the VLT-SF scale and the communication amongst its users. Moreover, in order to improve the prediction of arm–hand skilled performance outcome with regard to the extent of the lesion and the time since injury, it is essential to obtain reference values. The accompanying paper by Spooren et al.25 describes the transformation from linear measures expressed in logits into 0–100 total scores and provides reference values of r_VLT-SF.
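To give an impression of what such a transformation could look like, the sketch below applies a simple linear rescaling of logits onto a 0–100 scale. This is an assumption made for illustration only: the transformation actually used is described in the accompanying paper and may differ. The end points are taken from the logit range reported for Figure 3 (−4.78 to 6.32 logits), and the function name is hypothetical.

```python
# Logit range reported for the r_VLT-SF scale (Figure 3).
MIN_LOGIT = -4.78
MAX_LOGIT = 6.32


def logits_to_0_100(measure):
    """Linear rescaling (an assumption, not the published transformation)."""
    return 100.0 * (measure - MIN_LOGIT) / (MAX_LOGIT - MIN_LOGIT)


# The midpoint of the logit range maps to the midpoint of the 0-100 scale.
midpoint_score = logits_to_0_100((MIN_LOGIT + MAX_LOGIT) / 2)
```

Because the rescaling is linear, it preserves the interval properties of the logit scale: equal differences in logits remain equal differences in 0–100 scores.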

Data archiving

There were no data to deposit.