Main

Gaucher disease (GD) is an autosomal-recessive inborn error of glycosphingolipid metabolism caused by loss-of-function mutations in the gene encoding lysosomal glucocerebrosidase (GCase; EC 3.2.1.45). When GCase activity is diminished, glucosylceramide accumulates primarily in the lysosomes of macrophages (“Gaucher cells”). GD is broadly subdivided into neuronopathic (types 2 and 3) and non-neuronopathic (type 1) phenotypes. GD type 1 (GD1, OMIM 230800), the most common variant with an estimated prevalence of between 1:40,000 and 1:60,000 (refs. 1–3), is traditionally defined by the absence of primary central nervous system involvement. Worldwide, GD1 constitutes 95% of all cases,4 although the relative number of patients with neuronopathic variants is often substantially greater in Asian, Middle Eastern, and African populations. The common clinical manifestations are hematological cytopenias, hepatomegaly, splenomegaly, and various skeletal disorders. Disease expression is diverse. Some individuals homozygous for mutations in the glucocerebrosidase gene, especially those of Ashkenazi Jewish origin, are asymptomatic or are only mildly affected, at least until late in adulthood.4,5 However, others, including Ashkenazi Jews, have more severe and potentially disabling manifestations including irreversible bone disease with osteolytic lesions, avascular necrosis and marked osteoporosis.5,6 The rate and extent of disease progression are variable. Treatment, most commonly with enzyme replacement therapy, effectively ameliorates many of the manifestations of GD1 (refs. 7–10).

However, because genotype-phenotype correlations are imprecise, it is difficult to predict who will benefit from expensive, potentially life-long therapies and who may suffer if treatment is withheld, interrupted, or otherwise modified. A validated, reproducible, and broadly applicable tool to classify GD1 severity at any stage of the disease would greatly facilitate clinical studies that seek to identify predictive factors for prognosis or define disease management criteria.

Current severity scoring indices for GD, such as the Zimran Severity Score Index (SSI)11 and Hermann score for bone disease,12 were conceived in the pretreatment era. These systems include quantifiable parameters that can change over time or with treatment (organ volumes, serum liver-related tests, subjective pain assessments) but also assign substantial weight to constants (splenectomy, osteonecrosis, imaging abnormalities) and thus may not be sufficiently sensitive to define slowly responding therapeutic changes.13 The Zimran SSI score has not been tested with contemporary methods to assess performance parameters nor has its responsiveness to intrapatient changes in disease manifestations over time been established. As a result, the instrument has not been widely used in routine clinical practice or as an endpoint in clinical trials. More recently, an alternate SSI (GauSSI-I) has been proposed for GD1.14 However, this scoring system has yet to be validated and its routine use may be limited because of its complexity and the technological requirements of some of the component assessments. It may prove useful in limited applications such as single-center clinical trials.

A disease severity scoring system (DS3) expresses an integrated assessment of the burden of disease in a given patient. Ideally, the DS3 could be used to assess patient status, determine endpoints in clinical studies, classify disease subgroups, and compare outcomes among patients with similar levels of disease severity. DS3s use a minimal data set to score the patient in a comprehensive manner using groups of domains (often by organ system) that are populated with nonredundant items. Items should be valid and reliable, use feasible, standardized methods of assessment, and be weighted based on associated morbidity and mortality. A validated DS3 for GD1 is needed to guide clinicians as to when GD-specific therapy should be started, to monitor disease progression and treatment response in individuals, and to compare patient cohorts in clinical studies. Such disease severity scoring systems have already been developed for other chronic diseases such as rheumatoid arthritis.15,16 Here, we describe the development of a DS3 for adult GD1 patients as well as testing of the instrument for validity (content, face, criterion, discriminant, and construct validity), reliability, and feasibility.

MATERIALS AND METHODS

Instrument development

Domain selection

Recognized experts in GD1 were invited to participate and a DS3 Working Group comprised of nine GD1 experts from across the globe, representing multiple medical specialties, was formed; a methodologist-biostatistician experienced in the development of instruments for outcomes measurement was engaged to facilitate construction and validation of the GD1-DS3. A consensus conference was held in December 2006 to initiate development of the instrument. Putative domains were identified using the nominal group technique (NGT) of consensus formation.17

Domains and items

Seventy-four international physicians with experience in GD who were not members of the Working Group were invited by electronic survey to propose quantifiable items for each domain that they considered “most relevant to the assessment and quantitative measurement of the activity/severity of Gaucher disease.” They were also asked to rank the proposed domains in order of importance and to propose additional domains. The Working Group then used NGT to determine whether any domains originally proposed should be eliminated or whether new domains should be included. Proposed items were evaluated in terms of reliability, feasibility, and validity; duplicative or unnecessary items were eliminated. The methods of assessment appropriate for each item were selected by the Working Group using NGT consensus. A preliminary weighting system for items and domains was arrived at by NGT.

Validation and optimization

Content, face, criterion, discriminant, and construct validity were evaluated at two face-to-face conferences attended by 12 international experts in GD with no prior exposure to the DS3 (Table 1).18 Validation conference participation and logistics were arranged to coincide with major US and European Gaucher-related medical congresses. Deidentified patient profiles (n = 20) were suggested by Working Group members and were obtained from the International Collaborative Gaucher Group Gaucher Registry. Each patient profile contained complete clinical and diagnostic information from two sequential clinic visits, an initial and a follow-up visit, and was selected regardless of treatment status (all patients were untreated at the time of the initial visit). No values were imputed. Twenty profiles were chosen to achieve a reasonable assessment of sensitivity and specificity while limiting the time investment involved in scoring. Patient profiles were scored using the Clinical Global Impression (CGI) scale, which is used in a variety of other disease states.19–21 Each profile was scored for each visit for the overall extent of disease severity summarized as the CGI-Severity (CGI-S) score. The degree of change between visits was characterized using a CGI-Improvement score to rate each patient's status at the follow-up visit as “improved,” “not changed,” or “worsened” from the initial visit. Participants were also asked to determine whether the change in the patient's status would trigger a change in treatment or prognosis. NGT was used to achieve consensus on CGI scores and therapy/prognosis change. Participants then individually scored the patient profile for each visit using the DS3. Results from the two individual conferences were not substantially or statistically different and were combined for analyses.

Table 1 Commonly used criteria for evaluating disease scoring systems (ref. 18)

Participants also rated the feasibility and content validity of the DS3 using a validated technique based on a 4-point scale, where 1 represented the worst feasibility/content validity and 4 represented the best feasibility/content validity.22 Additional feasibility and content validity testing was later carried out when a further group of 23 physicians attending a GD workshop reviewed the DS3, scored a sample case, and rated the DS3 for feasibility and content validity using the same scale used at the validation conferences. Results from all 35 ratings were used to generate Feasibility Index and Content Validity Index scores (see Statistical Methods section).

An additional correlation exercise compared CGI-S scores with those of the Zimran SSI.11 The original Working Group scored the 20 GD1 patient profiles for the initial and the follow-up visits using the Zimran SSI. For each patient, Zimran SSI scores were compared with the CGI-S scores obtained at the second validation conference.

Inclusion of items was refined according to statistical testing of reliability and stability. Additionally, scaling of and maximum scores for individual assessments within the DS3 were optimized to maximize the correlation between the total scores of the instrument and the consensus CGI-S scores using Excel 2002 Solver; this tool uses the Generalized Reduced Gradient-2 algorithm for optimizing nonlinear problems. The DS3 was optimized to maximize correlation with the CGI-S under conditions where bone density and infiltration data (the items judged most likely to be missing) were either present or absent.
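The idea of tuning item scaling to maximize correlation with the consensus CGI-S scores can be illustrated with a small sketch. It is not the authors' procedure: the study used Excel Solver with a gradient-based algorithm, whereas this example swaps in a plain exhaustive grid search over a few candidate multipliers, treats the total as a simple weighted sum of items, and uses invented item scores and reference values. The names `weighted_totals`, `best_weights`, and `candidates` are hypothetical.

```python
from itertools import product
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def weighted_totals(item_rows, weights):
    """Total score per profile as a weighted sum of item scores (a simplification)."""
    return [sum(w * v for w, v in zip(weights, row)) for row in item_rows]


def best_weights(item_rows, reference, candidates=(0.5, 1.0, 1.5, 2.0)):
    """Try every combination of candidate multipliers; keep the best correlation."""
    best = None
    for weights in product(candidates, repeat=len(item_rows[0])):
        r = pearson_r(weighted_totals(item_rows, weights), reference)
        if best is None or r > best[1]:
            best = (weights, r)
    return best


# Invented data: 5 profiles x 3 items, with invented consensus severity scores.
example_rows = [[1, 0, 2], [2, 1, 0], [3, 3, 1], [0, 2, 3], [4, 1, 1]]
example_reference = [1.0, 1.5, 2.5, 2.0, 2.0]

tuned_weights, tuned_r = best_weights(example_rows, example_reference)
baseline_r = pearson_r(
    weighted_totals(example_rows, (1.0, 1.0, 1.0)), example_reference
)
print(tuned_weights, round(tuned_r, 3), round(baseline_r, 3))
```

Because unit weights are among the candidates, the tuned correlation can never be lower than the unweighted baseline. With only 20 profiles, any such tuning risks overfitting, which is one reason prospective validation matters.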

Determination of the minimal clinically important difference

Minimal clinically important difference (MCID) represents the amount of change, either increase or decrease, in the instrument score that would usually indicate a change in some aspect of the disease or trigger an adjustment in medical care or prognosis. It is not synonymous with a difference that is statistically significant. We estimated MCID for the DS3 using data from the validation conference. Physicians were asked whether they felt that the observed change from one visit to the next would worsen, not change, or improve the patient's prognosis. Using CGI, the physicians came to a consensus decision regarding these two standards for each of the 20 sample patient profiles, given the change in the DS3. MCID was calculated as the amount of change in the DS3 score at which at least 75% of the physicians agree that there would be a change in prognosis.23 MCID for worsening was the change in the DS3 score at which at least 75% of physicians agreed that the patient's prognosis had worsened, and MCID for improvement was the change in the DS3 score at which at least 75% of physicians agreed that the patient's prognosis had improved.

Statistical methods

The Pearson correlation coefficient (R) was used to determine the correlation between the SSI and the DS3 scores. Each patient profile included data from an initial and a follow-up visit, and each visit was scored using both the SSI and the DS3. To increase the precision of the estimate of R, data from both the initial and follow-up visits were used in the calculation. Interrater reliability of the optimized DS3 instrument was assessed by having two independent raters score 10 of the patient profiles, and the level of agreement was estimated using Cohen's kappa coefficient.
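The two statistics named in this paragraph can be sketched as follows. This is a minimal illustration with invented scores, not the study data. It uses the standard product-moment formula for the correlation and the standard unweighted form of Cohen's kappa. Kappa is defined for categorical ratings, so the example assumes each rater's scores have first been binned into severity categories; the source does not say how this was handled.

```python
from collections import Counter
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters assigning categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Invented scores: initial and follow-up visits are pooled into one list,
# mirroring the pooling described in the text.
ssi_scores = [4, 7, 10, 12, 3, 6, 9, 11]
ds3_scores = [2.0, 4.5, 6.0, 9.5, 1.5, 3.0, 7.0, 8.0]
r = pearson_r(ssi_scores, ds3_scores)
print(round(r, 3), round(r ** 2, 3))  # R and the corresponding R-squared

# Invented categorical ratings from two raters for 10 profiles.
rater_1 = ["mild", "mild", "moderate", "marked", "severe",
           "mild", "moderate", "marked", "marked", "severe"]
rater_2 = ["mild", "mild", "moderate", "marked", "severe",
           "mild", "moderate", "moderate", "marked", "severe"]
print(round(cohens_kappa(rater_1, rater_2), 3))
```

Pooling two visits per patient increases the number of points but the paired observations are not independent, so the resulting estimate should be read with that in mind.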

The Content Validity Index is the proportion of physicians who scored the instrument as either 3 (relevant but needs minor alterations) or 4 (very relevant and succinct). The Feasibility Index was calculated in the same manner as the Content Validity Index, with a score of 3 equal to feasible but needs minor alterations and a score of 4 equal to very feasible. A value of 0.8 or greater indicates that at least 80% of respondents find the instrument acceptable with respect to the quality assessed.
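As a worked illustration of this definition, the index is simply the share of ratings equal to 3 or 4. The sketch below uses invented ratings rather than the study's responses, and the function name `acceptability_index` is hypothetical; the same calculation serves for both the Content Validity Index and the Feasibility Index.

```python
def acceptability_index(ratings):
    """Proportion of ratings that are 3 or 4 on the 4-point scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in (1, 2, 3, 4) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 4")
    acceptable = sum(1 for r in ratings if r >= 3)
    return acceptable / len(ratings)


# Invented ratings from 10 reviewers (not data from the study).
example_ratings = [4, 4, 3, 4, 2, 3, 4, 4, 3, 4]
example_index = acceptability_index(example_ratings)

# The text treats 0.8 or greater as acceptable.
meets_threshold = example_index >= 0.8
print(example_index, meets_threshold)
```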

RESULTS

Instrument development

Domain and item selection

Six domains were initially selected for inclusion in the DS3: bone disease/skeletal, growth/metabolism, hematological, visceral, patient reported, and physician reported. Thirty-six of 74 (49%) respondents from around the world completed the electronic survey. Table 2 shows the relative importance of each domain as ranked by the survey respondents.

Table 2 Importance of each domain, as ranked by 36 Gaucher disease physicians in the initial survey

The three lowest-rated domains (patient reported, physician reported, and growth/metabolism) were eliminated. Patient-reported pain was integrated into the bone domain. The primary impact on the health-related quality of life in patients with GD1 is pain.23 Including the pain component in the bone domain captures this dimension with less risk of redundancy. Fatigue, another patient-reported item, was considered by the Working Group to be a symptom often associated with other disease manifestations (e.g., anemia) and was therefore not incorporated. The growth/metabolism domain was eliminated because it applies primarily to children and adolescents, whereas this instrument is designed solely for adult patients. Finally, the physician-reported domain, including a physician's global assessment, was eliminated because, in rare diseases in which caregivers often see too few patients to develop clinical expertise, the physician's global assessment may not be reliable. All items in the hematologic domain (thrombocytopenia, bleeding, anemia) were retained because each is considered to be a clinically significant, nonredundant manifestation of GD1.

Selection and weighting of item measurements

For each item, the Working Group evaluated all known assessment options and selected methods of measurement based on feasibility and availability of technology, with the intention of containing costs and avoiding additional risks to patients. Preliminary weighting of measurements, arrived at by Working Group consensus, was assigned according to the extent to which the morbidity and mortality associated with each measurement contribute to GD1 severity. Weighting was then optimized as described to obtain the final working model of the GD1-DS3 (Fig. 1).

Fig. 1 Gaucher Disease Type 1 Disease Severity Scoring System instrument.

Working model and scoring method

The GD1-DS3 has three domains: bone, hematological, and visceral. Each domain contains three or more items, each scored individually by the evaluating physician. The domain score is calculated by averaging the scores for all items within the domain. The total GD1-DS3 score is the sum of the three domain scores, with a maximum score of 19 points. A scoring reference guide was developed.
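The scoring rule described above (average the items within each domain, then sum the three domain scores) can be sketched in a few lines. The item scores below are invented for illustration; the actual items and their point scales are defined in the instrument itself (Fig. 1) and are not reproduced here. The function name `gd1_ds3_total` is hypothetical.

```python
from statistics import mean

DOMAINS = ("bone", "hematological", "visceral")


def gd1_ds3_total(item_scores):
    """Return (domain_scores, total) from a dict of domain -> list of item scores.

    Domain score = mean of the item scores in that domain.
    Total score  = sum of the three domain scores.
    """
    missing = [d for d in DOMAINS if not item_scores.get(d)]
    if missing:
        raise ValueError(f"no item scores supplied for: {', '.join(missing)}")
    domain_scores = {d: mean(item_scores[d]) for d in DOMAINS}
    return domain_scores, sum(domain_scores.values())


# Invented item scores for one patient visit (hypothetical values).
example_items = {
    "bone": [2, 4, 0, 3],
    "hematological": [1, 0, 2],
    "visceral": [3, 1, 2],
}
example_domains, example_total = gd1_ds3_total(example_items)
print(example_domains, example_total)
```

Averaging within a domain keeps a domain with many items from dominating the total simply because it has more entries.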

Validation

Feasibility and content validity

The Content Validity Index was 0.96, and the Feasibility Index was 0.95.

Interrater and intrarater reliability

Interrater agreement between two physicians scoring 10 patient profiles with the GD1-DS3 was 0.97 (Cohen's kappa), indicating a high level of concordance between raters. Assessment of intrarater reliability, although initially planned, was not carried out.

Correlation with existing scales (construct and criterion validity)

Figure 2 shows the Pearson correlation of the GD1-DS3 with CGI-S scores from the validation conferences. In general, patients with “mild” disease as assessed by the CGI-S had GD1-DS3 scores <3, “moderate” disease correlated with DS3 scores of 3 to 6, “marked” disease 6 to 9, and “severe” disease >9. Correlation with the CGI-S was R² = 0.89 when both bone density and infiltration data were available. In the absence of these data points, the correlation was R² = 0.77.

Fig. 2 Correlations between the Clinical Global Impression Scale–Severity (CGI-S) and the Gaucher Disease Type 1 Disease Severity Scoring System (GD1-DS3) score for 20 patient profiles, each scored at two different time points by 12 Gaucher disease experts. CGI-S scores of 1–1.5 indicate “mild,” 1.5–2.0 “moderate,” 2.0–2.5 “marked,” and 2.5–3.0 “severe” disease. Boxes show that, for most patients at most time points, CGI-S scores in each half-point range correspond with a 3-point range of GD1-DS3 scores.
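The approximate correspondence between GD1-DS3 totals and CGI-S severity categories reported above can be expressed as a simple lookup. The text gives ranges that share endpoints (3 to 6, 6 to 9), so the sketch assumes that a score on a shared boundary falls into the higher category, while a score of exactly 9 stays "marked" because "severe" is stated as >9. That boundary handling is an assumption, not something the source specifies, and the bands describe a general trend, not a diagnostic rule.

```python
MAX_TOTAL = 19  # maximum total score stated for the instrument


def approximate_severity_band(ds3_total):
    """Map a GD1-DS3 total to the CGI-S category it generally corresponded to."""
    if not 0 <= ds3_total <= MAX_TOTAL:
        raise ValueError(f"total must be between 0 and {MAX_TOTAL}")
    if ds3_total < 3:
        return "mild"
    if ds3_total < 6:       # assumption: 3 is counted as moderate
        return "moderate"
    if ds3_total <= 9:      # assumption: 6 is counted as marked; 9 is not >9
        return "marked"
    return "severe"


for score in (1.5, 3, 5.25, 6, 9, 12.0):
    print(score, approximate_severity_band(score))
```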

Correlation between the CGI-S and the SSI was weak (R² = 0.51). The Pearson correlation between the SSI and the GD1-DS3 was R² = 0.60.

Minimal clinically important difference

The MCID for improvement was a change in GD1-DS3 score of −3.17 points, and the MCID for worsening was a change of +3.86 points. Cases scored as “no change in prognosis” by at least 75% of the physicians had changes in GD1-DS3 score that fell between these two values.
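Applied to a pair of visits, these two thresholds give a three-way reading of the change in score. The sketch below assumes that a change exactly equal to a threshold counts as clinically important; the source does not state how boundary values were treated. The function name and the example scores are hypothetical.

```python
MCID_IMPROVEMENT = -3.17  # change in GD1-DS3 score reported for improvement
MCID_WORSENING = 3.86     # change in GD1-DS3 score reported for worsening


def interpret_change(initial_total, follow_up_total):
    """Classify the change between two GD1-DS3 totals against the MCID values."""
    delta = follow_up_total - initial_total
    if delta <= MCID_IMPROVEMENT:   # assumption: threshold itself is included
        return "improved"
    if delta >= MCID_WORSENING:     # assumption: threshold itself is included
        return "worsened"
    return "no clinically important change"


# Invented visit pairs (initial total, follow-up total).
for initial, follow_up in [(8.0, 4.0), (4.0, 8.0), (5.0, 6.0)]:
    print(initial, follow_up, interpret_change(initial, follow_up))
```

Note that the two thresholds are asymmetric: a smaller decrease than increase is needed before the change is read as clinically important.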

DISCUSSION

Methodological approaches for development of instruments to assess severity and monitor disease progression in any disease are necessarily complex. Rare conditions with heterogeneous clinical presentations such as GD1 pose even greater challenges. This study describes the development and validation of a new disease severity scoring system for GD1 and compares its outcomes with the only widely used GD1 severity score, the Zimran SSI,11 which was developed before disease-specific treatment for GD was available. Initial estimations of validity, reliability, and theoretical feasibility suggest that it is a practicable and reliable instrument for assessing disease severity and progression.

The GD1-DS3 offers an advantage over simply monitoring individual signs and symptoms of GD1. In a disease where varying clinical manifestations may present irregularly and progress at different rates even within the same patient, a weighted global assessment is required to accurately portray individual disease status. Besides providing a conspectus of disease severity and progression, the structure of the GD1-DS3 corresponds well with published therapeutic goals for the treatment of GD1 (ref. 24) and may facilitate monitoring of these goals as part of an integrated management approach to this chronic disease. It improves upon the Zimran SSI by incorporating modern methods for assessing bone marrow infiltration and bone mineral density.

The GD1-DS3 is also potentially useful for reliable interpatient comparisons. In diseases such as GD1 with no well-documented biomarker or singular symptom complex that adequately captures disease severity, clinical trial endpoints are often difficult to define. A validated scoring system with established reliability and sensitivity can facilitate comparison of trial groups and monitoring patient outcomes over time. It may also serve as a quantitative means to facilitate health outcomes research as well as epidemiological investigations.

Although the GD1-DS3 seems to perform well in those parameters tested, additional refinement and continued validation are required. Further evaluation of performance characteristics must be completed. For example, the comparison of the GD1-DS3 with the CGI provided one measure of criterion validity, which seems to be high. Additional evidence of criterion validity will be provided as sensitivity, specificity, and area under the curve are determined. Feasibility, although thought to be high based on the validators' assessments, also needs to be proven in the clinic.

Prospective validation exercises are typically the final test of the validity of an instrument. Such testing is currently being planned for the GD1-DS3. Predictive validity has yet to be determined. Construct validity has been partially demonstrated, as the instrument was shown to correlate very well with the CGI-S. Divergent validity, frequently ignored in validity testing, should also be established; this can be accomplished using the GD1-DS3 to assess patients with diseases with similar symptom profiles, such as rheumatoid arthritis, and comparing the scores with those obtained from patients with GD1.

This DS3 scoring system is unsuitable for use in pediatric patients with GD1. This is a significant limitation because about 26% of the International Collaborative Gaucher Group Registry population are currently younger than 18 years (Genzyme Corporation, data on file) and new patients are increasingly diagnosed in childhood. The authors believe that a separate DS3 will be necessary for pediatric patients, as disease domains that are critical in pediatric GD are not relevant to adults (e.g., those relating to growth and development). In its current form, the tool is also inapplicable to patients with neuronopathic GD. Finally, the GD1-DS3 is subject to confounding by concurrent acquired illnesses.

Although the GD1-DS3 will provide clinicians with an important tool for assessing and monitoring adult patients with GD1, its contextual interpretation is subject to the clinical judgment of the treating physician. The GD1-DS3 may provide a useful tool for evaluating a patient over time and in relation to other patients, but it is not a substitute for careful and thorough listening and observation. GD1-DS3 scores should not be expected alone to dictate treatment decisions such as initiation of treatment or dosage adjustment, particularly because therapy for GD1 is often intended to prevent development of symptoms as much as to treat existing symptoms. Clinical and laboratory assessments other than those incorporated into this instrument, as well as patient-reported outcomes, play a critical role in the effective management of GD1. Individual patients will have unique combinations of GD-related symptoms as well as comorbidities, emotional issues, and socioeconomic concerns that must be evaluated and dealt with on a case-by-case basis.

The GD1-DS3 provides a reliable method of assessing both intra- and interpatient severity of GD1 in adults. It is easy to implement and requires no assessments beyond the normal standard of care for such patients in the contemporary clinical setting. Our results demonstrate that, in its current form, the instrument is highly correlated with the clinical assessment of physicians who are expert in this field. Furthermore, we have shown that it is perceived as feasible and highly acceptable by physicians of varying specialties and nationalities who care for Gaucher patients. Definitive validation will require more extensive retrospective studies and prospective studies in different populations.