Introduction

The use of generic outcome measures is seen as a common part of the management of spinal cord injury (SCI).1 Although many such measures are available, their use in clinical practice throughout Europe is limited.2 While the demands of clinical management in a hospital setting may indicate a preference for a given instrument, there are several factors that may determine which generic measure, if any, is chosen. For example, within Europe, outcome measures will need to be adapted to a particular language,3 and thus, preference may be given to those outcome measures that already have a local adaptation.

If data are to be pooled to facilitate European-based comparisons of health care, a requirement is for outcome measures to work in a consistent manner across countries to demonstrate cross-cultural validity. A project called ‘European Standardisation of Outcome Measurement in Rehabilitation’ (Pro-ESOR) was established under the Framework IV programme of the European Commission (EC) to examine the internal and cross-cultural validity of commonly used outcome measures. In patients with SCI, Manual Muscle Testing, the Functional Independence Measure (FIM™), and the Ashworth Scale for Spasticity (different modifications) were the most widely used assessments.2 Using the terminology from the International Classification of Functioning, Disability and Health (ICF),4 the first of these is a physiological assessment of impairment; the latter two are outcome scales for activity limitation and impairment, respectively.

This paper is concerned with the cross-cultural validity of the FIM™ in SCI. The FIM™ is a measure of activity limitation that is used across a wide range of conditions and in a variety of situations in rehabilitation. There is an extensive body of literature supporting the reliability, validity, and responsiveness of the FIM™, although the latter may vary with the population being assessed.5 Through the original recommendation of the American Spinal Injury Association (ASIA), the FIM™ has seen widespread use in outcomes research.6, 7, 8, 9 Segal et al10 found high reliability on total FIM™ scores between two spinal cord injury rehabilitation facilities, though this varied in terms of individual item scores. In particular, the social cognitive items showed low reliability, which the authors suggested may be related to the complexity and lack of understanding of these items. Others11 have demonstrated that the social cognitive items cannot be used as a substitute for a comprehensive neuropsychological assessment, and the high percentage of ceiling scores found on these items in a number of studies12 suggests that they are not sensitive enough for this patient group.

The focus of this paper is purely on the cross-cultural validity of the FIM™ by analysing pooled data from a Rasch measurement model perspective. It will be demonstrated that pooling of raw score data from the motor scale across countries is not valid but, after necessary adjustments, comparison between different countries can be made. The validity of the cognitive scale for this diagnostic group is questioned.

Methods

Patients and settings

Nineteen rehabilitation facilities within four different countries contributed anonymous data from patients recently admitted with a diagnosis of SCI. The only data required from the facilities were raw scores on each item of the FIM™ scale at admission, together with the age and gender of the patient.

Functional independence measure

The FIM™ consists of 13 motor and five social cognitive items, assessing self-care, sphincter management, transfer, locomotion, communication, social interaction, and cognition.6 It uses a 7-level scale anchored by extreme ratings of total dependence as 1 and complete independence as 7; the intermediate levels are as follows: 6 modified independence, 5 supervision or setup, 4 minimal contact assistance or the subject expends >75% of the effort, 3 moderate assistance or the subject expends 50–75% of the effort, and 2 maximal assistance or the subject expends 25–50% of the effort. Although developed originally as an 18 item scale, it has been shown that there are two scales, a 13 item motor and a 5 item social-cognitive scale.13 In the present study, the original scales will be referred to as FIM™ motor and FIM™ social cognitive items or scales, respectively. Owing to copyright issues, once changed in any way, these will be referred to as the FIM motor and FIM social cognitive scales.

Rasch analysis

The Rasch model is used as a methodological basis for examining the internal construct validity of a scale, its scaling properties, and cross-cultural validity through fitting data from the scale to the Rasch model. It is a unidimensional measurement model, which assumes that the easier the item the more likely that it will be passed, and the more able the person, the more likely that they will pass an item compared to a less able person.14 In other words, in the dichotomous case, the probability that a person will affirm an item is a logistic function of the difference between the person's ability (θ) and the difficulty of the item (b), and only a function of that difference. From this, the expected pattern of responses to an item set is determined given the estimated θ and b. When the observed response pattern coincides with or does not deviate too much from the expected response pattern, then the items constitute a unidimensional measure. Taken with confirmation of local independence of items, that is, no residual associations in the data after the Rasch trait has been removed, this confirms unidimensionality.15 For cases where items have more than two categories, the model includes an explicit ‘threshold’ parameter (τ),16 where the threshold represents the equal probability point between any two adjacent categories within an item. In the logit form, the model is

$$\ln\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = \theta_n - b_i - \tau_k$$

giving the log-odds of person n affirming category k in item i, where θn is the ability of person n, bi is the difficulty parameter of item i, τk is the difficulty of the kth threshold, Pnik is the probability for person n to answer item i in category k, and Pni(k−1) is the corresponding probability for the adjacent lower category. The units of measurement obtained from the equation are called ‘logits’, which is a contraction of log-odds probability units. These threshold estimates should be correctly ordered if the categories are being assigned in the intended way. Consequently, this can be empirically verified against the model expectation and deviations identified where the categories fail to express an increasing level of the trait (disordered thresholds).
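As an illustration of how the threshold form of the model translates into category probabilities, the following minimal Python sketch computes the probability of each response category for one person and one item. It assumes the adjacent-category formulation, in which the log-odds of category k against category k − 1 equal θ − b − τk. The function name and the parameter values are hypothetical and chosen only for illustration; they are not estimates from the present data, and this is not the routine used by RUMM2020.

```python
import math

def category_probabilities(theta, b, taus):
    """Return P(X = 0), ..., P(X = m) for one person-item pair.

    Assumes ln(P_k / P_(k-1)) = theta - b - tau_k for k = 1..m.
    theta: person ability, b: item difficulty, taus: thresholds (all in logits).
    """
    # Category 0 has log-numerator 0; each higher category adds (theta - b - tau_k).
    log_numerators = [0.0]
    for tau in taus:
        log_numerators.append(log_numerators[-1] + (theta - b - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_numerators)
    weights = [math.exp(value - top) for value in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical values: ability 0.5 logits, item difficulty 0, three ordered thresholds.
probs = category_probabilities(theta=0.5, b=0.0, taus=[-1.0, 0.0, 1.0])
```

When θ − b equals a threshold τk, the two adjacent categories k − 1 and k are equally probable, which is the equal probability point referred to in the text.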

Where disordered thresholds are identified, the discrimination of each category is examined. Choices of the way in which categories are collapsed are based on the category discrimination, or where unclear, on clinical knowledge. Once disordered thresholds are removed, fit of data to the Rasch model is assessed in a number of ways. Item fit statistics indicate how well the items are fitting the model individually. These are given in the form of residual values (the standardised difference between the observed and the expected score for each person), which should be between −2.5 and 2.5, and χ2 statistics, which should show nonsignificant deviation from the model. The χ2 values are calculated based on ability groups (or class intervals) of approximately 50 people, to which the patients are assigned based on their total score. Overall fit of the scale as a whole is also given by standardised fit statistics for persons and items (mean zero, SD of one where the data fit the model perfectly), and a χ2 Item–Trait interaction statistic (calculated by summing all the item χ2 values and degrees of freedom) to determine scale invariance, which once again should indicate nonsignificant deviation from the model.
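The standardised difference between an observed and an expected score can be sketched as follows. This is a minimal Python illustration for a single person-item response, assuming the model category probabilities are already available; the function name and the probabilities are hypothetical. The item fit residual reported by the software aggregates such values across persons and is not reproduced here.

```python
import math

def standardised_residual(observed, probs):
    """Return (observed - expected) / sqrt(variance) for one response.

    probs[k] is the model probability of scoring k on the item (k = 0..m).
    """
    expected = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - expected) ** 2 * p for k, p in enumerate(probs))
    return (observed - expected) / math.sqrt(variance)

# Hypothetical model probabilities for a four-category item.
example_probs = [0.1, 0.2, 0.4, 0.3]
z = standardised_residual(3, example_probs)
```

A response that equals its expected score gives a residual of zero; values far outside the range −2.5 to 2.5 would indicate a response pattern that the model does not anticipate.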

A principal components analysis (PCA) is performed on the person-item residuals to confirm local independence, that is, no pattern in the residuals following the removal of the Rasch factor.

Much of the published work on Rasch analysis in rehabilitation has explored issues of unidimensionality and scaling properties.17, 18, 19 However, Rasch analysis allows for much more than an empirical test for unidimensionality. The basis of the approach to the analysis of cross-cultural validity lies in the plot of the proportion of individuals at the same ability level (grouped into class intervals) who answer a given item correctly (or can perform a particular task). These proportions, except for random variations, should be the same irrespective of the nature of the group for whom the proportions are plotted.20 Items that do not yield the same plot for two or more groups display differential item functioning (DIF) and are violating the requirement of unidimensionality. Consequently it is possible to examine whether or not a scale works in the same way by contrasting the response function for each item across cultures. This process has been described in more detail in another paper from our group, including the procedure for rescoring the responses.21 DIF may manifest itself as a constant difference between countries across the trait (Uniform DIF – the main effect), or as a variable difference, where the response function of the two groups cross over (nonuniform DIF – the interaction effect). Both the country factor and the interaction with the class interval might be significant in some cases, as with any ANOVA's main and interaction effects. Tukey's post hoc tests determine where the statistically significant differences are to be found where there are more than two groups.
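Because DIF is framed here in analysis of variance terms, a simplified sketch may help to fix ideas. The Python code below computes a one-way ANOVA F statistic for an item's residuals grouped by country, which corresponds only to the main (uniform DIF) effect; the analysis described in the text also includes class interval as a second factor and its interaction with country, which this sketch omits. The function name and the residual values are hypothetical.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for lists of residuals, one list per group."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(group) / len(group) for group in groups]
    # Between-group sum of squares: group size times squared deviation of the group mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: squared deviations from each group's own mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical residuals for one item in three countries.
f_stat, df_b, df_w = one_way_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
```

A large F relative to its reference distribution would suggest that the item's residuals differ systematically by country, that is, uniform DIF; the Bonferroni-adjusted significance level would then apply.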

Where some but not all items display DIF, it is possible to make an adjustment to allow items with DIF to vary by country. To do this, an item is replaced by a series of country-specific items (eg Bathing becomes Bathing – Israel, Bathing – Italy, etc.). For each country, only the scores observed in its own country-specific item are retained, while the country-specific items belonging to the other countries are assigned missing values. Subsequent analysis is undertaken on this expanded data set (ie original plus split items). Finally, the tests for local independence are undertaken to confirm unidimensionality of the scale.22
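The splitting step can be sketched in a few lines of Python. The record layout, the function name, and the scores below are hypothetical and serve only to illustrate how one item with DIF becomes a set of country-specific items with structurally missing values.

```python
def split_item_by_country(records, item, countries):
    """Replace `item` with one country-specific item per country.

    records: list of dicts holding a 'country' key and item scores.
    A patient keeps the score only in the item for their own country;
    the other country-specific items are set to None (missing).
    """
    expanded = []
    for record in records:
        new_record = {key: value for key, value in record.items() if key != item}
        for country in countries:
            score = record[item] if record["country"] == country else None
            new_record[f"{item} - {country}"] = score
        expanded.append(new_record)
    return expanded

# Hypothetical patients; Eating is left as a common item, Bathing is split.
patients = [
    {"country": "Israel", "Bathing": 3, "Eating": 6},
    {"country": "Italy", "Bathing": 5, "Eating": 7},
]
expanded = split_item_by_country(patients, "Bathing", ["Israel", "Italy", "UK", "Denmark"])
```

Items without DIF remain common to all countries and so continue to link the country-specific items onto a single metric.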

Many of the fit statistics for Rasch analysis are χ2 based, even those based on the likelihood of the data, capitalising on the fact that −2 log(likelihood) is asymptotically χ2 distributed.23 Given that tests of fit are set against ‘perfect fit’, the evidence appears to be that sample sizes between 50 and 250 data points are appropriate. However, if many patients are at the margins, and thus, the scale is poorly targeted, this also has an influence on sample size.23 Consequently, if a scale is well targeted (ie 40–60% success rates on dichotomous test items), then a sample size of 108 will give 99% confidence of being within ±0.5 logits. If not well targeted (ie <15% or >85% success rate), this rises to 243. For tests for differential item functioning, a sample size of 200 or less has been suggested as adequate.24 For example, 150 cases per country are sufficient to test for DIF where at α of 0.01 a difference of 0.5 logits within the residuals can be detected for any two groups with β of 0.20.25

The Rasch analysis was undertaken with the RUMM2020 software.26 Owing to the number of tests of fit undertaken (eg 13, one for each item in the motor scale), Bonferroni corrections were applied, giving a significance level of P=0.004 for the motor FIM™ and P=0.01 for the social cognitive FIM™.27
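The two adjusted levels are consistent with dividing a nominal significance level by the number of items in each scale. The short Python sketch below reproduces that arithmetic, assuming a nominal level of 0.05; that starting value is an assumption made here for illustration and is not stated in the text, and the function name is hypothetical.

```python
def bonferroni_level(alpha, n_tests):
    """Return the per-test significance level after Bonferroni correction."""
    return alpha / n_tests

# Assumed nominal alpha of 0.05; 13 motor items and 5 social cognitive items.
motor_level = bonferroni_level(0.05, 13)      # about 0.0038, reported as 0.004
cognitive_level = bonferroni_level(0.05, 5)   # 0.01
```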

Results

Participants

A total of 647 patients were recruited, of whom 31% were female (Table 1). Mean age varied between 35 and 57 years across the countries (range 11–93). For pooled data, the common median age of 46 years is used for the whole sample. The number of contributing institutions ranged from 10 in Italy to only one in Denmark. The mean admission FIM™ Motor Score was 43.1, with a significant difference between countries (F=9.732; P<0.001). Post hoc tests showed that Israel and Italy had much lower admission Motor scores (about 10 points) than the UK or Denmark. The mean admission FIM™ Cognitive Score was 32.3, again with a significant difference across countries (F=6.981; P<0.001), but on this occasion the difference was between the UK and some other country.

Table 1 Characteristics of patients recruited into the study

Pooled data and cross-cultural validity

When data from the FIM™ motor scale from the four countries were pooled, 12 of the items displayed disordered thresholds and needed to be rescored. Following this, fit to the model was poor, with seven items showing significant misfit (Table 2). An examination of DIF found little evidence of DIF by age or gender for the motor scale, with just two items, Transfer bed and Walk/Wheelchair, showing nonuniform DIF by age and gender, respectively. In contrast, eight items showed DIF by country, and Tukey's post hoc comparison of these items showed a complex pattern of item-country DIF (Table 3). Consequently, where an item worked differently in different countries, it was split into four separate items, one for each country. This resulted in 37 items (five original items and eight items split for each country). These data were refitted to the Rasch model. The item ‘Transfer tub/shower UK’ was removed from the analysis, as it was an extreme item (ie, all responses were at the extreme ends of the response categories). After rescoring for disordered thresholds, five misfitting items needed to be removed from the scale; these were ‘Dressing upper body-UK’, ‘Bladder management Denmark’, ‘Walk/wheelchair Denmark’, ‘Eating’ and ‘Grooming’ (Table 4). Following this, the items showed good fit at the individual level, though overall item-trait interaction still showed significant deviation from model expectations (χ2=60.056, df=31, P=0.0014). PCA of the residuals indicated that the first factor explained 17% of the variation, and the second factor 15%, indicating little substantive patterning in the residuals and thus supporting the unidimensionality of the scale.

Table 2 Individual motor item fit across all countries using pooled data after rescoring
Table 3 Tukey's post hoc comparison for seven motor items with DIF by country
Table 4 Fit of FIM Motor scale following adjustment for DIF

The FIM cognitive scale was found to fit the model well after rescoring items, both at the individual item level (Table 5) and overall (χ2=5.121, df=5, P=0.390527). No DIF by age, gender or country was observed. PCA of the residuals showed a first factor accounting for 34% of the variation, and the second 28%, again supporting the unidimensionality of the original Rasch factor.

Table 5 Fit of FIM cognitive scale

Discussion

In the present paper, the cross-cultural validity of the motor items of the FIM™ and the possibility of pooling data from different countries in patients with spinal cord injury are evaluated using Rasch analysis. It is demonstrated that the number of categories presently used in the FIM™ with data from routine clinical settings in the participating countries in Europe is not sustainable. Even after adjusting for such problems, items have different levels of difficulty across countries. Thus, the pooling of raw data is not recommended, casting doubts on the cross-cultural validity of the instrument. However, it is demonstrated that by allowing some items to be unique for each country, this limitation can be overcome to a great extent. This allows estimates of ability to be generated for a patient, which can be confidently used to represent the same level of dependence regardless of the country of the patient. Unfortunately, the rescoring and splitting of items across countries is a very complex solution, and not ideal for clinical studies of activity limitation. A similar approach has been used for the motor items of the FIM™ in patients after stroke,28 using eight split and five original common items as in the present study of patients with SCI, although the common items were different. In SCI, four country-specific items had to be deleted to achieve fit to the model, as well as two items common to all countries. In stroke, however, only three country-specific items had to be deleted.28 Deleting items, such as the Eating and Grooming items in SCI, jeopardises the clinical use of the instrument, as essential daily activity items are omitted. Thus, the relevance of the FIM™ for this patient group can be questioned, at least when comparing data from different countries. A comparison of data from different settings within the same country may show similar problems, although this is not addressed in the present report.

At least 30% of the patients were at the ceiling of the social cognitive scale in each country, with 61% at the extreme in Denmark. As a result, the data from approximately 60 patients were available for the Rasch analysis for each country, with consequent attendant problems of parameter estimation and precision of the item difficulty estimate. The large number of extreme patients raises questions about the validity of the social-cognitive items for this patient group. In addition, earlier studies have shown no relationships between results from a comprehensive, predominately motor-free, neuropsychological test battery and the results from the FIM™ social-cognitive items, also attributed to the ceiling effects in the FIM ratings.11

The approach used in this paper is one of identifying problems with cross-cultural validity through the analysis of DIF. DIF is a breach of the assumptions of unidimensionality,20 but it has also been argued that DIF can be evaluated only if the conditions for fit to the Rasch model have been satisfied.29 As it is possible to have data which fit the model, but which also display DIF, we have adopted a pragmatic approach whereby we have considered DIF as one possible contribution to misfit, and thus deliberately refitted data to the model after adjustment for DIF.

Many factors may have contributed to the cross-cultural variability observed in the data. There may be some difference in the reliability of the professionals who rate the FIM™, and in the translation of the FIM™ manual. Different manuals and training in different settings may have had an influence on the psychometric quality of the instrument, although all things being equal, those with formal FIM™ training have been shown to give more reliable ratings.30 As data collection was restricted to scale items, age, and gender, we have not been able to analyse the impact of training or rater reliability. True cultural differences in the ways people dress or bathe may also contribute to the lack of invariance.31 The influence of all these factors and the extent of their interaction are unknown. Therefore, the lack of information about differences between datasets and case mix (eg level of injury, time since injury, complete/incomplete injury, traumatic or nontraumatic cause) is a limitation of the study. However, it must be clearly stated that a requirement for pooling data from different countries is that the scale should be invariant across those countries. If the lack of invariance was attributable to aspects such as case mix, level of training of raters, and other administrative factors, then this suggests that the scale would be unsuitable for such studies. The task of adjusting for so many effects would be onerous, given the example above where there was a loss of clinically relevant information in solving the problem at the crude country level.

In conclusion, cross-cultural data from the FIM™ motor scale for patients with spinal cord injury cannot be pooled in their raw form. In order to make comparisons across countries, it is necessary to accommodate cultural differences in the measurement construct by making complex adjustments using the Rasch model. In measurement terms the scale still remains problematic. The calculation of change scores and use of parametric statistics on the raw score, between and within different countries, are inadvisable. This does not preclude, in any way, the use of the FIM instrument in clinical practice, though the analysis clearly highlights problems with the seven category response function.