Introduction

The incidence of spinal cord injury (SCI) is estimated to lie between 10 and 83 per million inhabitants per year, and SCI often results in catastrophic dysfunction and disability.1 Progress is gradually being made in the treatment of SCI to limit damage, prevent or treat complications, prolong survival, improve function and enhance recovery. Any new interventions will likely include pharmacological, surgical and rehabilitation approaches and all will require evaluations of their efficacy using appropriate outcome measures.2

SCI is initially diagnosed in terms of the level at which the injury has occurred, which tends to equate with the observed degree of neurological and functional deficit. Nearly half of all spinal cord injuries are functionally incomplete, with some function preserved below the level of the lesion (although there is much variation between groups3). In such cases, the majority of people will likely experience useful recovery (to ASIA grade C or D4), including the ability to walk.5 Rehabilitation interventions and outcomes of SCI have thus tended to particularly focus on functional status.

Since the mid-twentieth century, health status questionnaires and rating forms have been used to assess patients in a range of clinical settings, to document outcomes of care. These have usually been completed by health care staff and primarily represented their perspectives. However, during the past two decades, health care has become more patient-centered, with measures emerging which assess the impact of a wide range of health care interventions, from the patient's perspective.6 Such ‘patient-reported outcome measures (PROMs)’ have gradually been introduced as an important outcome (or ‘end point’) in randomized clinical trials and observational studies.7

For results to be meaningful, it is imperative that any measures used to assess outcomes in any health care context cover domains (for example pain, physical function, perceived independence) that are relatively specific and appropriate to a particular context or study aim. Evidence also needs to have been presented demonstrating that the questionnaire (and any associated scales) has acceptable measurement properties, including: reliability, validity, responsiveness, acceptability and feasibility.7 Another property that overlaps with reliability, validity and responsiveness and which is particularly pertinent to measurement scales, is that of precision. In prospective outcome studies, such as a trial, the responsiveness of an outcome measure, that is, its ability to accurately detect change when it has occurred, is a particularly important aspect.8 These stipulations apply to the SCI context no less than they do for any other condition.

The purpose of this paper is to provide a structured review of instruments that are widely used for the assessment of function or mobility in the context of SCI where they have also received any form of psychometric evaluation in that context. Evidence of their measurement properties is presented and non-scientific practical considerations are also highlighted to further facilitate clinical decisions.

Methods

Search strategy

The following databases were searched for the years 1969–2006: National Library of Medicine (PubMed), Cochrane, CINAHL and AMED. The search was limited to the English language. The term ‘spinal cord injury’ was combined with the terms ‘classification or assessment, Index, Scale, outcomes measure or measurement, functional outcomes, mobility and functional assessment’. Papers were selected by reviewing their titles and abstracts, with additional references identified from the reference lists of selected papers. General search engines were used to access non-peer-reviewed professional and specialist guidelines and workshops on spinal cord injury websites such as the International Campaign for Cures of Spinal Cord Injury Paralysis, Spinal Cord Medicine, American Spinal Injury Association, the National Institute of Neurological Disorders and Stroke (NINDS Spinal Cord Injury), American and Canadian Spinal Research Organization, and the International Spinal Injuries and Rehabilitation Centre (UK).

Inclusion criteria

Reports of any studies evaluating the use of an outcome measure to assess function or mobility in spinal cord injured patients were initially identified. Abstracts of all papers and titles were independently assessed by two reviewers (HA and DS) and agreement confirmed by a third (JD). Full copies of the selected papers were then obtained. Details of a measure were only included in the final review where some evidence of its psychometric properties had been published, which had been evaluated in the context of SCI.

Data extraction

Using criteria for evaluating outcome measures described by Fitzpatrick et al. (1998),7 data were independently extracted by three reviewers [DS, HA, JD]. Evaluation of measures gave particular consideration to the following criteria:

Reliability

Concerned with reproducibility and internal consistency, it assesses the extent to which an instrument is free from random error or the amount of a score that represents the signal rather than the noise. Test-retest reliability is designed to take account of variation over time in stable patients. The results of tests of internal consistency (for example Cronbach's alpha (α)) and test-retest reliability (for example intraclass correlation, Bland–Altman methods) are presented. Reliability estimates of α≥0.7 are needed to claim internal consistency and are recommended for instruments intended for use at the group level.7 Estimates need to be higher (α≥0.9) where instruments are to be applied to individuals.9
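As an illustration of the internal consistency statistic referred to above, the following minimal sketch computes Cronbach's alpha from a respondents-by-items table of scores. The function name and the ratings are hypothetical and are not drawn from any study cited in this review; sample variances (n − 1 denominator) are assumed.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(scores[0])  # number of items in the scale
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    # Sample variance of each item across respondents
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each respondent's total score
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings: 5 patients x 4 items (illustrative only)
ratings = [
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [4, 4, 3, 4],
    [5, 4, 5, 5],
]
alpha = cronbach_alpha(ratings)
print(f"alpha = {alpha:.3f}")
```

With these illustrative ratings the estimate exceeds 0.9, so a scale behaving like this would meet the threshold quoted above for application to individuals as well as to groups.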

Validity

Addresses whether an instrument measures what it is intended to measure.7 The best evidence for validity involves assessing an instrument against a true value for the measure: a ‘gold standard’.10 In the SCI context, the American Spinal Injury Association (ASIA) Standards for Neurological Classification of Spinal Cord Injury is the most widely used and accepted method to evaluate and classify the level and degree of impairment of patients’ SCI.11 This system represents a ‘gold standard’ for assessing neurological (motor and sensory) impairment in SCI, but was not designed to assess functional ability or locomotion and does not therefore represent a ‘gold standard’ for assessing criterion validity of instruments focused on these domains. Thus, any evidence presented for the validity of instruments that involved comparisons with ASIA scores has been cited in this review as evidence of convergent or concurrent, rather than criterion, validity (this is irrespective of the term used in any cited articles).

In the absence of a ‘gold standard’ for direct comparison, evidence for validity can take many forms. The source of instrument items and evidence for content and face validity may be presented, which can include qualitative examination of instrument content. Quantitative evidence derived from factor analysis or principal components that support dimensionality, or internal construct validity, is commonly presented. External construct validation generally includes comparisons with other instruments which may include standard clinical assessments.7 Frequently, this involves demonstrating that a measure is closely correlated with different measures of the same trait (‘convergent validity’), or that a measure correlates little with measures intended to indicate a different trait (‘discriminant validity’).9

More recently, Rasch analysis, a more stringent assessment of underlying scale structure and dimensionality, is increasingly being undertaken.12 Rasch models test how well instruments conform to uni-dimensionality, hierarchy and interval location of items by examining patterns of individuals’ performance on the range of items in a scale and patterns of items’ difficulty or severity.7
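The idea of locating persons and items on a single scale can be sketched with the simplest (dichotomous) form of the Rasch model, in which the probability of succeeding on an item depends only on the difference between a person's ability and the item's difficulty, both expressed in logits. This is an illustrative assumption: the instruments reviewed here mostly use polytomous items, which call for extended Rasch models, and the item names and values below are hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of success on a dichotomous item under the Rasch model.

    Both arguments are in logits on the same scale.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical item difficulties, ordered from easiest to hardest
item_difficulties = {"feeding": -2.0, "transfers": 0.0, "stairs": 2.0}

# Expected success probabilities for a person of average ability (0 logits)
expected = {
    item: rasch_probability(0.0, b) for item, b in item_difficulties.items()
}
for item, p in expected.items():
    print(f"{item}: {p:.2f}")
```

If observed response patterns depart markedly from such expected probabilities (for example, many patients managing stairs but not transfers), the item is said to misfit the model.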

Responsiveness/sensitivity

Concerned with the extent to which an instrument is sensitive to meaningful changes in health status. This property is particularly important for instruments applied in clinical trials.9 Responsiveness preferably needs to be assessed in a prospective study, where change in health status is likely to occur for the majority. Here, effect sizes are commonly employed (other methods include using paired t-test comparisons or the responsiveness statistic8). The effect size is a method of calculating the magnitude of change measured by an instrument in a standardized way, which allows direct comparisons to be made between different instruments and scales.13
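A minimal sketch of two standardized change statistics follows. It assumes one commonly used definition of the effect size (mean change divided by the standard deviation of baseline scores) and, for comparison, the standardized response mean (mean change divided by the standard deviation of the change scores). The text above does not specify a formula, so both definitions, the function names and the scores are assumptions for illustration.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """Mean change divided by the standard deviation of baseline scores."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def standardized_response_mean(baseline, follow_up):
    """Mean change divided by the standard deviation of the change scores."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


# Hypothetical admission and discharge scores for five patients
admission = [10, 20, 30, 40, 50]
discharge = [25, 30, 50, 55, 70]

es = effect_size(admission, discharge)
srm = standardized_response_mean(admission, discharge)
print(f"effect size = {es:.2f}, SRM = {srm:.2f}")
```

Because both statistics are unit-free, they can be compared across instruments with different score ranges, provided the instruments were administered to the same patients over the same interval.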

Results

Instruments

Table 1 lists details of any instruments identified as being widely used for the assessment of function or mobility in the context of SCI which had received any form of psychometric evaluation in that context, together with brief details of published studies contributing to that evidence.

Table 1 Health status instruments used for the assessment of outcomes of care for spinal cord injury, that have been evaluated, in that context, for any of their psychometric properties

A total of eight instruments (plus modified versions) were identified, namely: the Barthel Index of Disability (BI)14 (Modified BI17); the Functional Independence Measure (FIM)20 (Adapted Turkish version29 and shortened version—the Fone FIM27); the Quadriplegia Index of Function (QIF)31 (Short-form QIF32); the Spinal Cord Independence Measure (SCIM)33 (revised SCIM35); the Walking Index for Spinal Cord Injury (WISCI);36 the Needs Assessment Checklist (NAC);40 the Spinal Cord Injury Functional Ambulation Inventory (SCI-FAI)43 and the Short Sickness Impact Profile (SIP68).47

In contrast to other measures cited in Table 1, the BI, MBI, FIM (and adapted versions) and the SIP68 were not originally designed to assess patients with SCI specifically (although the SIP68 developmental study included 5% SCI patients), but were instead designed for application in a range of rehabilitation settings. While the psychometric properties of the BI and MBI have been evaluated within a number of contexts (for example older people, stroke patients), few details of any such evaluations could be found involving SCI patients, apart from within a study involving a Turkish translation of the MBI.19 This latter version had items altered to suit different cultural norms, which also risked altering their meaning from earlier English formats (although psychometric reassessment was formally conducted).

In contrast with the BI and MBI, the original FIM has been assessed in a number of studies with SCI patients48, 21, 22, 23, as have the more recent adapted versions29, 27, 28, 49 of the FIM. The SIP68 has also been evaluated with SCI patients in two studies with relatively large sample sizes.45, 46

Instruments that were designed specifically to assess the function or mobility of SCI patients are: the Quadriplegia Index of Function (QIF)31, 30 and Short-form QIF;32 the Spinal Cord Independence Measure (SCIM)33 and Catz-Itzkovich revised SCIM;35 the Walking Index for Spinal cord injury (WISCI);36 the Needs Assessment Checklist (NAC),40 and the Spinal Cord Injury Functional Ambulation Inventory (SCI-FAI).43

Item generation

While the precise formal method of item generation was rarely specified (exceptionally, originators of the WISCI specify using a modified Delphi technique36), the majority of instruments were devised by health care providers. Only the SIP68 appeared to have involved patients and their carers44 in the initial process. Two instruments (WISCI,36 SCI-FAI43) used blinded ratings of videotaped footage of patients to aid consensus within a research team, after a list of candidate items had been produced. Computerized techniques (regression or principal components analysis) were employed to select/reduce items to produce the Short-form QIF32 and the SIP68.44 The former also involved interviews with patients at this stage.

As well as being chiefly designed by health care providers, the majority of instruments were designed to be used by clinical raters. While this generally did not preclude gaining input from patients and carers, many of the instruments also had quite complex scoring systems requiring raters to undergo training in their use and interpretation. The minority of instruments that could be self-rated were: the FIM20 (particularly, the adapted version25) and shortened Fone-FIM,27 the NAC40 and the SIP68.44

Measurement properties of instruments within the SCI context

Table 2 provides details of the measurement properties of instruments tested in the context of SCI, reported by studies cited in Table 1.

Table 2 Summary of measurement properties of health status instruments cited in table 1, where (if) evaluated in the context of spinal cord injury

The BI, FIM and SCIM appeared to have been used with SCI patients most often, largely reflecting the greater length of time that had elapsed since the first publication. This did not necessarily indicate a greater degree of instrument evaluation having occurred in the SCI context.

Ceiling/floor effects

In many cases (BI, QIF, SCIM, SCI-FAI), little formal evidence had been presented concerning overall score floor or ceiling effects. In all other cases, clustering of extreme values (allowing for no further improvement or deterioration to be measurable on subsequent assessment) commonly occurred in some subscales, for example: cognitive scale: FIM;29, 22 communication scale: FIM;29, 22 and SIP68;46 stairs: MBI;19 mobility: SCIM-III3 and SIP68;46 walking: WISCI.57, 60 An item level analysis of the BI noted the presence of floor or ceiling effects for some items, including feeding and grooming, at admission or discharge.16
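Floor and ceiling effects of the kind described above can be quantified as the proportion of patients scoring at the minimum or maximum of a scale. The sketch below is illustrative only: the function name, the 0 to 10 score range and the scores are hypothetical, and no particular cut-off for an "unacceptable" proportion is taken from the studies reviewed.

```python
def floor_ceiling(scores, min_score, max_score):
    """Return the proportions of scores at the scale minimum and maximum."""
    n = len(scores)
    floor = sum(1 for s in scores if s == min_score) / n
    ceiling = sum(1 for s in scores if s == max_score) / n
    return floor, ceiling


# Hypothetical subscale scores on a 0-10 range for ten patients
subscale_scores = [0, 0, 0, 3, 5, 10, 10, 7, 0, 10]

floor, ceiling = floor_ceiling(subscale_scores, min_score=0, max_score=10)
print(f"floor = {floor:.0%}, ceiling = {ceiling:.0%}")
```

Patients already at the ceiling on admission cannot register improvement on that subscale, which is one reason such clustering weakens responsiveness.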

Reliability

No evidence of the internal consistency of the instrument or subscales could be found for the BI, QIF, WISCI or SCI-FAI. Evidence of adequate internal consistency (Cronbach's α>0.7) was reported for all other measures apart from the FIM locomotion (Cronbach's α=0.4)50 and SIP68 emotional stability (Cronbach's α=0.68)45 subscales. Evidence of optimal internal consistency (Cronbach's α range, 0.80–0.90) was reported for the Turkish MBI,19 Short-form QIF,32 NAC (most subscales)40, 42 and SIP68 (all but one subscale).45

Test-retest reliability (TRT) (assessing the same rater's responses on different occasions) had not been assessed in relation to the QIF, the short-form QIF, the WISCI, or the SCI-FAI. TRT assessment of the NAC had used correlations alone (correlation is inadequate, as it is not a test of agreement), which produced high values (r≥0.9).40 TRT reliability (using Kappa or ICC) was good (K-values 0.61–0.80, or ICC values >0.70) for the MBI,19 SCIM,53 SCI-FAI43 and SIP68.46
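The point that correlation is not a test of agreement can be illustrated with a small sketch comparing Pearson's r with unweighted Cohen's kappa. In the hypothetical data below, every second rating is exactly one category higher than the first, so the correlation is perfect although the two occasions never agree. The function names and ratings are assumptions for illustration; weighted kappa or the ICC would be alternatives for ordinal or continuous scores.

```python
import math
from collections import Counter
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cohens_kappa(first, second):
    """Unweighted Cohen's kappa: chance-corrected agreement on categories."""
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    count_a, count_b = Counter(first), Counter(second)
    # Agreement expected by chance from the marginal category frequencies
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both occasions use a single identical category
    return (observed - expected) / (1 - expected)


# Hypothetical ratings on two occasions; the second is shifted up by one
occasion_1 = [1, 2, 3, 4, 1, 2, 3, 4]
occasion_2 = [2, 3, 4, 5, 2, 3, 4, 5]

r = pearson_r(occasion_1, occasion_2)
kappa = cohens_kappa(occasion_1, occasion_2)
print(f"r = {r:.2f}, kappa = {kappa:.2f}")
```

A systematic shift of this kind leaves r untouched but is exposed by kappa, which is why agreement statistics are preferred for test-retest and inter-rater assessment.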

Evidence for inter-rater (or observer) reliability was not found for the BI, short-form QIF, NAC or SCI-FAI. Assessments of the QIF and FIM (patient versus clinician rating compared in the latter case) used correlations alone, which were moderate to high (r range, 0.55–0.95).31, 21 Using Kappa, ICC or Kendall coefficient of concordance,61 the FIM (Turkish version)29 and SCIM-II produced some moderately low values (<0.5);53 the MBI19 had moderate to good levels of inter-rater reliability (K-values >0.5),19 with very high values (K-values 0.81–1.00 or ICC values >0.90) presented for the SCIM-III.35

Validity

Given the absence of any ‘gold standard’, correlation comparisons between other measures (purporting to measure a broadly similar construct) were taken to denote evidence, or otherwise, of convergent, concurrent or discriminant validity. Results of principal components (factor) analysis, or alternatively Rasch analysis, supporting underlying scale structure were taken as evidence of construct validity.

Evidence of acceptable concurrent validity was found for the FIM,24 and, for some subscales some of the time, for the FIM Turkish version,29 the MBI (Turkish version),19 the short-form QIF32 and the SIP68.45

Evidence of construct validity could appear inconsistent or contradictory in some instances. Thus, Rasch analysis revealed problems concerning disordered thresholds (grooming and stairs items) and model misfit affecting bladder and bowel items in the FIM motor domain29, 23 (bladder and bowel items in the MBI Turkish version were also associated with considerable levels of misfit in Rasch analysis19), which appeared to be accentuated in cross-cultural comparisons;22 whereas evidence from factor analysis produced an unproblematic two-factor solution (together using all 18 items), with each scale further associated with a high level of internal consistency (itself a form of convergent validity).26

Findings from Rasch analysis highlighted some flaws relating to construct validity for the SCIM version III, with ‘walking outdoors’ and ‘stair management’ items within the mobility subscale, and ‘toilet use’ within the respiration/sphincter subscale, exhibiting misfit. The latter subscale also contained some items with disordered category thresholds, for example bowel management.3, 35 Factor analysis on the SIP68 represented data from a heterogeneous (mainly non-SCI) population and was therefore considered largely inappropriate, and no evidence of construct validity (note that absence of evidence is not synonymous with evidence of absence) was found for the BI, the QIF (or short-form QIF), the SCIM (early version), the WISCI, the NAC, or the SCI-FAI.

Evidence of convergent validity (correlations) was presented for most instruments in relation to other instruments purporting to measure something similar. Thus the short-form QIF score correlated with the Upper Extremity Motor Score (r>0.8);32 the SCIM with the FIM (r>0.8),33, 54 the WISCI (r=0.97)38 and with the NAC (r=0.47–0.85).40 Considerable evidence of convergent validity was presented for the WISCI: with the FIM (r0.7),36, 38 the LEMS(r=0.47–0.91),58 the Berg Balance Scale,57 the BI and Rivermead mobility index (both r=0.67),38 and the SCIM (r=0.97).38 The NAC correlated with the WISCI and the Hospital Anxiety and Depression Scale (HADS) (r=0.47–0.85);40 the SCI-FAI with gait scores, walking speed (r0.7) and the Lower Extremity Motor Score (LEMS) (r>0.6)25; and the SIP68 correlated with relevant domains of the SF-36 (r0.57), ADL (somatic autonomy r=0.81),46 and with the BI (r=0.54–0.91).45

Responsiveness

Evidence of good responsiveness was found for the BI and FIM motor scale (both had effect size 0.9, comparing scores between patients’ admission and discharge dates).31 There was other (weaker) evidence suggesting that the FIM was less sensitive than the QIF,15, 24 that the FIM was similar to the BI,15 and that the QIF was better than the BI31 and the FIM.24 The original SCIM and SCIM version III were each found to be superior to the FIM33, 34, 56 and the WISCI was possibly superior to the Locomotor Functional Independence Measure (LFIM) and SCIM57 and had superior sensitivity to walking recovery than the BI, RMI, SCIM, LEMS or FIM.38 Few studies used effect sizes and overall, evidence of responsiveness was generally quite weak. There was no evidence concerning the responsiveness of the MBI, the short-form QIF, the NAC, or the SIP68 in the SCI context.

Discussion

This review focused on instruments that are widely used to assess function or mobility in patients with SCI, which have also received some form of psychometric evaluation in that context, and complements and extends the scope of previous reviews in this area.62, 63, 64, 65 Eight instruments were identified, together with adapted or shortened versions. There were two main findings.

The first finding was that, with the exception of the SIP68, none of the measures identified had involved interviews with any patients at all, at the design stage, for the purpose of item generation. This finding naturally leads to the conclusion that current measures may not represent SCI patients’ perspective, but more likely represent the perspective of clinicians.

Patients increasingly expect to be involved in decisions about their care and to receive accurate information to facilitate their involvement.66 Thus, the use of instruments that represent chiefly the clinician's perspective might be considered inappropriate by some, or only appropriate in certain circumstances or in relation to particular domains. Nonetheless, the extent to which patients are involved at all, even in rating the different instruments, remains generally quite limited.

Variation in the extent to which patients are involved in rating questionnaires could be influenced by a number of factors. For instance clinicians may (not unreasonably) believe that patients’ and clinicians’ ratings of their functioning will differ, but may also assume that clinical observers will provide more objective and accurate scores. Indeed it has been asserted by Itzkovich et al.53 that direct observation of individuals’ functioning is more accurate and less subject to bias than patients’ self-reports because patients may have unrealistic or uninformed expectations, particularly in relation to goal-setting and achievement. Their score ratings for the SCIM have therefore tended to rely entirely on observations rather than subjective reporting. However, results from a small-scale study, by the same authors, found that any differences between patients’ ratings obtained by interview and ratings produced by observers actually appeared insignificant.53 The extent to which patients are involved in rating questionnaires may also relate to the intended purpose of the instrument. Thus, by contrast with the SCIM, the NAC is more concerned with measuring individuals’ rehabilitation success in achieving set goals and patients are invariably involved in rating questionnaire items.41, 67 However, item ratings on the NAC differ in another way from the SCIM (and other measures), in that no distinction is made between someone being able to carry out a task verbally (by asking someone to do it for them) versus carrying out the task themselves. Berry et al.41, 67 defend this, believing that a patient with a higher level of injury should be able to achieve independence, through others, by articulating their own needs. They also argue that, while the patient's perception of their independence might be at the cost of accuracy, their active involvement in the process engenders compliance. 
Others have also noted that perceived control has the strongest association with life satisfaction.49 These arguments appear reasonably compelling where outcome measures are used for the purpose of individual goal-setting, but are problematic in other contexts, for example trials comparing outcomes of different interventions.

Clearly while current measures mainly represent clinicians’ perspectives, this does not mean that all clinicians share the same perspective (many of these issues have been discussed elsewhere62, 63). This nonetheless still leaves the question of whether there is a need for a new self-reported measure to be developed for SCI, that fully represents patients’ perspectives.

Our second main finding was that the quality of evidence for the psychometric properties of instruments reviewed was very variable and occasionally quite poor. Evidence for responsiveness, particularly evidence of instruments’ ability to detect meaningful change, was especially lacking. Evidence of instruments’ psychometric properties also sometimes appeared to be conflicting (for example, different studies’ evidence for construct validity). There are a number of likely reasons for this. In the first instance, it is only since the early to mid 1990s that a well-described psychometric methodology, applicable to clinical situations, has become established for developing and reporting health status questionnaires.68 A number of the instruments reviewed here were produced and assessed prior to the mid-1990s.

A possible reason to explain apparently conflicting findings—particularly regarding construct validity—is that different statistical procedures, such as factor analysis (representing Classical Test Theory) versus Rasch analysis, have fundamentally different requirements. For instance, instrument scales that arise through application of factor analysis are treated as interval scales, when they are generally based on ordinal level item scoring; while the Rasch model—to which a scale is compared in Rasch analysis—is a more stringent test, as it is a statistically proven interval scale.69

While Rasch analysis may be regarded as more stringent than conventional psychometric analysis, the context of its application remains important, as is the case with classical psychometric methods. Thus, another reason why evidence from different studies might differ is that sample size and composition (age, gender, and the range, extent and type of SCI), as well as the context in which studies have occurred, often vary from one study to the next. This is important because the measurement properties of an instrument are not properties of the instrument alone: they pertain to the instrument as applied to the population and context in which it was developed and tested.68 Thus, if a measure is designed and calibrated with one group of patients, its measurement properties may change if applied to a different group of patients, such as those representing different age-groups or different clinical characteristics. The use of a measure in a different context from the original developmental study (for example, a hospital pre-/post-surgical context versus the context of community-based rehabilitation) can also affect the measurement properties.

This leads to the issue of how health and social care providers are to choose between instruments designed for assessing outcomes of interventions for SCI, and whether this review can support particular recommendations.

There will never be a perfect questionnaire or measure of outcome, and efforts to produce one risk a proliferation of imperfect examples from which assessors and trialists must then choose, which is to be discouraged. Choosing the right measure involves identifying the most appropriate measure for the chosen patient group, context and purpose, where evidence exists to show that the questionnaire has exhibited adequate measurement properties pertaining to a similar patient group and context.

Of those instruments reviewed, if a generic measure is considered to be appropriate for a particular purpose, then, of the BI/MBI, FIM and SIP68 generic measures, the SIP68 has the best measurement properties. However, its responsiveness has not been evaluated within the SCI context; only once this has occurred can its use in clinical trials be sanctioned.

Regarding SCI condition-specific, multidimensional measures (that is, measures comprising different dimensions represented by a number of subscales) that aim to cover the full range of SCI, the SCIM and the NAC had comparable measurement properties. These were mostly good, although a few shortcomings concerning some subscales of the SCIM-III, based on Rasch analysis, may indicate the need for further refinements.3 The responsiveness of the NAC has also not been assessed. However, as the NAC and the SCIM each reflect somewhat different (although likely related) constructs, and are each applied in different ways, choosing between these instruments depends crucially on the purpose that any potential user has in mind.

Where highly specific measures are required, for the assessment of mobility/ambulation alone, the WISCI and SCI-FAI are both supported by evidence for acceptable levels of reliability, validity and responsiveness. The only caution concerns their use with low-level quadriplegic subjects with whom ceiling effects are likely.

For the assessment of patients with quadriplegia, the QIF generally has good measurement properties. The short-form QIF has particularly high levels of internal reliability and could therefore be used to assess progress in individual patients. Excepting the Turkish versions of the FIM and MBI, none of the other measures reviewed had sufficiently high reliability to permit this application, and they are therefore only suitable for making group comparisons.70

While conducting this review, we considered whether SCI represents a particularly challenging area for outcomes measurement. For instance, as others have noted,64 SCI is a heterogeneous disorder in terms of level and severity of injury, and it is unsurprising that most measures will exhibit floor or ceiling effects when applied to groups of patients that largely represent one or other extreme end of the spectrum of injury. If ‘broad spectrum’ measures are considered appropriate, then this particular ‘flaw’ may need to be accommodated. In addition, traumatic SCI may be accompanied by other injuries. These have the potential to produce considerable amounts of ‘noise’ where the measurement of change in function is specifically concerned with interventions directed towards the SCI. We suggest that there are no simple means of dealing with these substantial challenges.

Recent developments in psychometric theory offer the opportunity to develop item banks that can be retained on computer. Patients can then complete items online and, on the basis of their responses to certain items, computer-adaptive testing will select the most appropriate items for them to complete thereafter. This method can reduce patient burden as it leads to fewer items being asked.71 Whether such methods offer other improvements in SCI assessment remains to be evaluated.
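The item-selection step of computer-adaptive testing can be sketched as follows. The sketch assumes a calibrated bank of dichotomous Rasch items and selects, from the items not yet administered, the one carrying the most statistical information at the current ability estimate (under the Rasch model, the item whose difficulty lies closest to that estimate). The selection rule, function names, item names and difficulties are all hypothetical; operational systems also re-estimate ability after each response and apply a stopping rule.

```python
import math


def item_information(ability, difficulty):
    """Fisher information of a dichotomous Rasch item at a given ability."""
    p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
    return p * (1.0 - p)


def select_next_item(ability_estimate, item_bank, administered):
    """Return the name of the most informative unadministered item, or None."""
    candidates = {
        name: b for name, b in item_bank.items() if name not in administered
    }
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda name: item_information(ability_estimate, candidates[name]),
    )


# Hypothetical item bank: item name -> difficulty in logits
item_bank = {"transfer_bed": -1.5, "walk_indoors": 0.2, "climb_stairs": 1.8}

first_item = select_next_item(0.0, item_bank, administered=set())
second_item = select_next_item(0.0, item_bank, administered={first_item})
print(first_item, second_item)
```

Because poorly targeted items are skipped, a patient answers fewer questions than with a fixed-length form, which is the reduction in burden described above.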

Generally, the use of condition-specific measures with adequate measurement properties is clearly necessary in the context of SCI, but even this is not sufficient if outcome studies are to produce meaningful data. The use of such measures needs to be complemented by rigorous planning and conduct of data collection, with outcomes data obtained at appropriate points in time, relative to a meaningful date that is defined and operationalized in the same manner for all subjects. Study sample sizes also need to be adequate.

The process of conducting this review revealed that, if applied appropriately, while never perfect, a number of outcome measures that are currently available are likely good enough.