Use of item response theory to develop a shortened version of the EORTC QLQ-BR23 scales

Questionnaires should be as short as possible, to minimize response burden, while still capturing the relevant range of problems effectively and reliably. The purpose of our study was to develop a shortened version of the EORTC QLQ-BR23 for use in breast cancer survivors. Our data come from 10794 breast cancer survivors who completed the EORTC QLQ-BR23. Two-thirds of the sample was randomly selected for development, and the remaining third was used for validation. Item response theory methods were applied to shorten the scales, with Samejima's graded response model fitted to the item responses. The shortened scales were evaluated in the validation set by examining the mean difference, the proportion of respondents correctly predicted, the correlation and the weighted kappa between the shortened-form and the original observed scores. The results show that a three-item BRBI, a four-item BRST, a three-item BRBS and a two-item BRAS predict the scores on the original scales with high consistency and offer similar measurement precision, with little or no loss in the ability to detect group differences. Prospective validation in newly diagnosed breast cancer patients and in patients with poor QOL is needed.

The development of item response theory (IRT) has reached a point where testing applications 1,2 , whether in educational [3][4][5] or psychological [6][7][8][9] testing programs or in research, can be performed entirely with IRT methods. Nevertheless, IRT has only recently been applied to health outcomes instruments [10][11][12][13][14] . Previous research indicates that IRT methods have clear advantages over classical test theory 2,15,16 . A crucial distinction between the two is that IRT defines a scale for the latent variable measured by a set of items, and all items are calibrated on that same scale. As a result, IRT makes it straightforward to place two assessments of different lengths on a common scale 17,18 .
The European Organization for Research and Treatment of Cancer (EORTC) Breast Cancer-Specific Quality of Life Questionnaire (QLQ-BR23) is one of the most widely used supplementary questionnaire modules for evaluating quality of life in breast cancer patients 19 . The EORTC QLQ-BR23 consists of 23 items. Breast cancer patients are often too ill and too weak to complete a lengthy questionnaire in the short time available. A brief questionnaire with non-inferior validity and reliability is therefore of great importance for lowering the response burden on patients. The goal of this study was to evaluate the possibilities for shortening the EORTC QLQ-BR23 scales (body image, systemic therapy side effects, breast symptoms, arm symptoms) for use in breast cancer survivors, while retaining the ability to compare results from the shortened scales directly with those from the full scales.

Materials and Methods
Study design and sample. The data come from 10794 breast cancer survivors who took part in a cross-sectional study conducted in 2013 and who were members of the affiliated groups of Cancer Recovery Clubs in 34 cities across China. Written informed consent was obtained from each participant before the investigation started. The study was approved by the Ethics Committee of the Public Health School of Fudan University (protocol number RB # 2013-04-0450). More detailed information on this study is available in a previous publication.
Questionnaire. The EORTC QLQ-BR23 consists of 23 items 21 . Twenty of the items form five multi-item scales, and the remaining three are single-item symptom measures. The sexual function scale has only two items and the estimation procedure did not converge; our study therefore focuses on the body image (BRBI), systemic therapy side effects (BRST), breast symptoms (BRBS) and arm symptoms (BRAS) scales. These consist of four, seven, four and three items, respectively. Each item has four response categories: "Not at All" = 1, "A Little" = 2, "Quite a Bit" = 3, and "Very Much" = 4. Scale scores are constructed by averaging the items within a scale and linearly transforming the average to a 0 to 100 range. The procedure is as follows: 1) Raw score: compute the average of the items that contribute to the scale; 2) Linear transformation: standardize the raw score so that scores range from 0 to 100 (functional scales: S = {1 − (Raw score − 1)/Range} × 100; symptom scales/items: S = {(Raw score − 1)/Range} × 100, where Range is the difference between the maximum and the minimum possible values of the raw score). For missing values, if fewer than half of the items in a scale were answered, the scale score was set to missing; otherwise, the mean of the answered items was used in place of the missing items. For single-item measures, a missing response was left as missing 22 .
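The scoring procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `scale_score` and its arguments are hypothetical, and missing responses are assumed to be coded as `None`.

```python
from typing import Optional, Sequence


def scale_score(
    responses: Sequence[Optional[int]],
    functional: bool = False,
    min_cat: int = 1,
    max_cat: int = 4,
) -> Optional[float]:
    """Return the 0-100 score for one scale, or None if the score is missing."""
    answered = [r for r in responses if r is not None]
    # Fewer than half of the items answered (or none at all): score is missing.
    if not answered or 2 * len(answered) < len(responses):
        return None
    # Averaging only the answered items gives the same raw score as first
    # replacing each missing item with the mean of the answered items.
    raw = sum(answered) / len(answered)
    item_range = max_cat - min_cat  # 3 for the four-category items
    proportion = (raw - min_cat) / item_range
    if functional:
        # Functional scales are reversed so that a higher score is better.
        proportion = 1.0 - proportion
    return proportion * 100.0
```

For example, `scale_score([1, 2, 3, 4])` gives a raw score of 2.5 and a symptom scale score of 50, while a four-item scale with only one answered item returns `None`.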
Statistical methods. IRT-based methods were used to shorten the scales. Because the item responses are polytomous and ordered, with scoring categories ranging from one to four, we used Samejima's graded response model (GRM) 23 to fit the item responses. One of the most important assumptions of IRT analysis is unidimensionality. We used factor analysis to test the unidimensionality of the EORTC QLQ-BR23 scales; the results show that the scales are sufficiently unidimensional for unidimensional IRT analysis. Item parameters were estimated in the STATA software program using the marginal maximum likelihood method. The model supposes that, for a given item $n$, the probability of choosing category $m$ or higher (with $m = 2, 3, \ldots, k_n$) is a logistic function of theta ($\theta$):

$$P(x_{in} \ge m \mid \theta) = \frac{\exp[D a_n (\theta - b_{nm})]}{1 + \exp[D a_n (\theta - b_{nm})]}$$

where $\theta$ represents the latent level of quality of life of the individual (an individual with a better QOL has a higher $\theta$ score); $a_n$ is the slope parameter, which represents the discrimination of the item; $b_{nm}$ is the category threshold parameter, which represents the difficulty of the item and can be interpreted as the $\theta$ value at which a respondent has a 50 percent probability of scoring in category $m$ or higher; and $D$ is the scaling constant specifying the metric of the latent scale, conventionally set to 1.7. Samejima (1969) further defines $P(x_{in} \ge 1) = 1$ and $P(x_{in} \ge k_n + 1) = 0$, so the probability of observing a specific category $m$ for a given $\theta$ is

$$P(x_{in} = m \mid \theta) = P(x_{in} \ge m \mid \theta) - P(x_{in} \ge m + 1 \mid \theta)$$

for all $m = 1, 2, \ldots, k_n$. The item information function (IIF) measures how much information an item provides about the IRT score. Further details on the parameters are given in previous research 23 . The IIFs and the ability to predict scores on the full scales were used to select the items for the shortened scales.
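To make the graded response model concrete, the following NumPy sketch evaluates the category probabilities and the item information function for a single item. It is an illustration under stated assumptions rather than the estimation routine used in the study: the function names and the demonstration parameters (`a_demo`, `b_demo`) are hypothetical, the thresholds are passed as an increasing list whose first entry corresponds to category 2 or higher, and the information is computed with the standard GRM expression, the sum over categories of the squared derivative of the category probability divided by that probability.

```python
import numpy as np


def _cumulative(theta, a, b, D):
    """P(x >= m | theta) for m = 1..k+1, padded with 1 and 0 at the ends."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    b = np.asarray(b, dtype=float)
    z = D * a * (theta[:, None] - b[None, :])
    inner = 1.0 / (1.0 + np.exp(-z))  # same as exp(z) / (1 + exp(z))
    ones = np.ones((theta.size, 1))
    zeros = np.zeros((theta.size, 1))
    return np.hstack([ones, inner, zeros])


def grm_category_probs(theta, a, b, D=1.7):
    """Category probabilities P(x = m | theta), shape (len(theta), k)."""
    cum = _cumulative(theta, a, b, D)
    return cum[:, :-1] - cum[:, 1:]


def grm_item_information(theta, a, b, D=1.7):
    """Item information: sum over categories of (dP_m/dtheta)^2 / P_m."""
    cum = _cumulative(theta, a, b, D)
    # Derivative of each cumulative curve; it is zero for the padded columns.
    dcum = D * a * cum * (1.0 - cum)
    probs = cum[:, :-1] - cum[:, 1:]
    dprobs = dcum[:, :-1] - dcum[:, 1:]
    return np.sum(dprobs ** 2 / probs, axis=1)


# Hypothetical parameters for one four-category item (not taken from Table 3).
a_demo, b_demo = 2.0, [-0.5, 1.0, 2.0]
grid = np.linspace(-2.0, 2.0, 81)
probs_demo = grm_category_probs(grid, a_demo, b_demo)
mean_info_demo = grm_item_information(grid, a_demo, b_demo).mean()
```

Averaging the information over a grid from −2 to 2, as in the last line, is one simple way to summarize an item in a single number; whether this matches the exact summary reported in the study is an assumption.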
Item characteristic curves are the trace lines for each response choice; they plot how the individual items function in relation to the underlying trait, here quality of life. Difficulty and discrimination are two properties of the item characteristic curves. The difficulty parameter describes where the item functions along the ability scale, and the discrimination parameter describes how well the item differentiates between individuals whose abilities lie above the item location and those whose abilities lie below it. Both the slope and the location of the items were considered when removing items. Items were examined by subscale to determine which to remove in developing a shortened version of the EORTC QLQ-BR23. We compared the shortened-scale scores with the full-scale scores by calculating the difference in mean scores, the percentage of respondents whose group was correctly predicted, the Pearson correlation r, and the weighted kappa (κ) measure of agreement between the shortened and full scale scores.
Informed consent. Informed consent was obtained from all individual participants included in the study.

Results
Demographic and clinical characteristics. Of the 10794 participants in the database, two-thirds (7196) were randomly selected as the development set, and the remaining one-third (3598) were used for validation. The sample characteristics are reported in Table 1. Approximately 90 percent of the participants were aged 50 to 70. Disease stage was evaluated with the TNM system, in which T represents the size of the original (primary) tumor and whether it has invaded nearby tissue, N represents the involvement of nearby (regional) lymph nodes, and M represents distant metastasis (spread of cancer from one part of the body to another). More than 70 percent of the participants were in an early stage of the disease (TNM stage 0, 1 or 2), and 23% (development set) and 25% (validation set) were in stage 3 or 4. The most prevalent primary treatment was surgery combined with chemotherapy, followed by surgery combined with chemotherapy and radiotherapy. Slightly more than half of the breast cancer survivors had survived more than 5 years, and 22% had survived over 10 years.

Item content and information by item.
The number of non-missing responses, mean scores, standard deviations (SD) and information by item for the EORTC QLQ-BR23 items are listed in Table 2. Item scores were transformed to a 0-100 scale; the mean scores ranged from 63.26 to 90.62, with SDs ranging from 17.20 to 31.92. The information of each item within the θ range of −2 to 2 is also shown in Table 2. Among the 18 items included in the IRT analysis, the mean information of the body image items ranged from 0.69 to 1.47. All 7 items of the systemic therapy side effects scale had lower information, ranging from 0.24 to 0.55. Only one breast symptoms item had low information (0.46); the other three had higher information, ranging from 0.90 to 1.12. For the arm symptoms scale, the mean information ranged from 0.67 to 0.91.
Item properties. The item parameter estimates from the GRM calibration are shown for each item in Table 3. The slope estimates ranged from 1.14 to 4.44, showing great variability in discrimination among the items. The threshold estimates for each item were in increasing order, with no reversed thresholds. The threshold estimates for endorsing 1 versus ≥2 ($b_1$) ranged from −0.56 to 0.99, those for endorsing 2 versus ≥3 ($b_2$) ranged from 0.67 to 3.18, and those for endorsing 3 versus 4 ($b_3$) ranged from 1.25 to 4.32.
Table 4 displays the results for the shortened scales in each of the four domains. We divided the survivors into four groups according to the quartiles of the short-form scores and of the original scores, respectively. The proportion of respondents correctly predicted was high and similar when compared with the original scale. The mean differences between the shortened-form and the original observed scores were less than 1 for BRST and BRBS and less than 2.5 for BRBI and BRAS. Both the correlation and the weighted kappa were high.
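The comparison between shortened and full scale scores can be sketched as follows. This is a minimal NumPy illustration and not the authors' code: the function names are hypothetical, each form is grouped by its own sample quartiles, and linear disagreement weights are assumed for the weighted kappa because the text does not state the weighting scheme.

```python
import numpy as np


def quartile_groups(scores):
    """Assign each score to one of four groups (0-3) by the sample quartiles."""
    scores = np.asarray(scores, dtype=float)
    cuts = np.quantile(scores, [0.25, 0.5, 0.75])
    # A score equal to a cut point falls in the lower group.
    return np.searchsorted(cuts, scores, side="left")


def weighted_kappa(g1, g2, n_cat=4, weights="linear"):
    """Weighted kappa for two integer group assignments coded 0..n_cat-1."""
    g1, g2 = np.asarray(g1), np.asarray(g2)
    observed = np.zeros((n_cat, n_cat))
    np.add.at(observed, (g1, g2), 1.0)
    observed /= observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    i, j = np.indices((n_cat, n_cat))
    dist = np.abs(i - j) / (n_cat - 1)
    w = dist if weights == "linear" else dist ** 2  # disagreement weights
    return 1.0 - (w * observed).sum() / (w * expected).sum()


def compare_scales(short, full):
    """Mean difference, percent agreement, Pearson r and weighted kappa."""
    short = np.asarray(short, dtype=float)
    full = np.asarray(full, dtype=float)
    g_short, g_full = quartile_groups(short), quartile_groups(full)
    return {
        "mean_difference": float(np.mean(short - full)),
        "percent_agreement": float(np.mean(g_short == g_full) * 100.0),
        "pearson_r": float(np.corrcoef(short, full)[0, 1]),
        "weighted_kappa": float(weighted_kappa(g_short, g_full)),
    }
```

With heavily tied scores, which are likely when many respondents report no symptoms, the quartile groups can be unequal in size, so the grouping rule at the cut points is a design choice worth reporting.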

Discussion
The growth of research on cancer survivors' quality of life, and the great need for well-validated questionnaires suitable for evaluating a construct with more than one dimension, led us to conduct this study to develop a shortened version of the EORTC QLQ-BR23. One of the important assumptions of IRT analysis is unidimensionality 24 , that is, whether the items measure the same latent trait. All the items in the data evidently measure some aspect of quality of life; we therefore analyzed each dimension separately. The sexual function scale has only two items; moreover, more than 80% of participants reported that during the past four weeks they had had no interest in sex and had not been sexually active, and fewer than 2% of participants responded "Quite a Bit" or "Very Much". This pattern might be attributed to shyness and reserve when talking about sexuality, especially among older women in China. Therefore, our study focuses on the BRBI, BRST, BRBS and BRAS scales. Generally, an item is considered of high quality when its standard error is less than 0.2, acceptable but in need of improvement when the standard error is between 0.2 and 0.25, and of poor quality, and a candidate for deletion, when the standard error is more than 0.25 25 . According to the formula I = 1/σ² 26 , a standard error below 0.25 requires a total item information higher than 16 (1/0.25²), and a standard error below 0.2 requires a total higher than 25 (1/0.2²). The EORTC QLQ-BR23 consists of 23 items; therefore, an item with information greater than 0.70 (16/23) was defined as good quality, and an item with information greater than 1.09 (25/23) as excellent. For the body image, breast symptoms and arm symptoms dimensions, items 9, 23 and 19, respectively, were deleted based on the information criterion. However, this also reminded us that this dimension might need to be improved when used in the Chinese population.
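The per-item information cutoffs used above follow directly from I = 1/σ². A short Python sketch of the arithmetic is given below; the function names and the label "below cutoff" are hypothetical, and the cutoffs are kept unrounded, so values very close to 0.70 or 1.09 may be classified differently than with the rounded figures.

```python
def per_item_cutoff(standard_error: float, n_items: int = 23) -> float:
    """Total information implied by a standard error, shared over n_items."""
    total_information = 1.0 / standard_error ** 2
    return total_information / n_items


def classify_item(information: float, n_items: int = 23) -> str:
    """Label an item by comparing its information with the two cutoffs."""
    if information > per_item_cutoff(0.20, n_items):  # 25/23, about 1.09
        return "excellent"
    if information > per_item_cutoff(0.25, n_items):  # 16/23, about 0.70
        return "good"
    return "below cutoff"


good_cutoff = round(per_item_cutoff(0.25), 2)
excellent_cutoff = round(per_item_cutoff(0.20), 2)
```

For instance, item information values of 1.47, 0.90 and 0.46, which all appear in the ranges reported in the Results, would be labeled excellent, good and below cutoff, respectively.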
The information values of the systemic therapy side effects items were all less than 0.70. To maintain the content balance of the scale as a whole, we kept the four items with the highest information.
The BRBI scores predicted with items 10, 11 and 12, the BRBS scores predicted with items 20, 21 and 22, and the BRAS scores predicted with items 17 and 18 were all in close agreement with the original scales. The correlation and weighted kappa coefficient between predicted and original BRBI scores were 0.98 and 0.9, respectively. Using items 10, 11 and 12 may be expected to lead to the same findings and conclusions as using the full BRBI scale. The shortened BRBS and BRAS scales predicted the original scale scores extremely well: the percentages of correctly predicted scale scores were all 100%, and the correlation coefficients were all higher than 0.95.
Unlike classical test theory, IRT calibration yields detailed item-level information that can be examined from many useful perspectives 27 . For example, the test characteristic curves and the summation of item characteristic curves for the entire instrument are especially useful in defining the cutoff value between the shortened and the raw data score, and in estimating the original scale score. IRT has been used by many other researchers to create short versions of existing instruments 14,28-31 . Some previous studies used a strategy similar to the one reported here to shorten the EORTC QLQ-C30 scales [32][33][34] . Our results are consistent with these IRT-based prediction methods and suggest that it is possible to shorten scales while retaining high precision in predicting scores on the original scale.
Based on the present study, we anticipate the use of a shortened version of the EORTC QLQ-BR23 for breast cancer survivors comprising a four-item BRST scale composed of items 2, 3, 6 and 8; a three-item BRBI scale composed of items 10, 11 and 12; a three-item BRBS scale composed of items 20, 21 and 22; and a two-item BRAS scale composed of items 17 and 18. In all, six items were deleted using the IRT-based approach. We hope that some of the single items or scales (e.g., sexual function) can also be removed, so that the questionnaire could be cut by half, which could greatly expand its scope of application in future studies.
A limitation of the study is that the sample was recruited from the Cancer Recovery Clubs, whose members tend to be long-term survivors with a higher quality of life. Further studies are therefore needed to examine the results in newly diagnosed breast cancer patients with poorer quality of life. Notwithstanding this limitation, the study has strengths that should not be overlooked. For instance, the large sample size enhanced the power of the estimation procedures, and the use of IRT methodology to identify a subset of items maximized reliability and maintained adequate precision.

Conclusions
IRT is an effective method for shortening scales while still predicting scores on the full scale with high quality. Prospective validation in newly diagnosed breast cancer patients and in patients with poor QOL is needed in further studies. Given the favorable results for the BRBI, BRST, BRBS and BRAS scales, we expect the shortened version of the EORTC QLQ-BR23 to be of practical value for researchers and clinicians.