Temporal stability of quality of life assessments in cancer patients

Quality of life (QoL) is an important outcome criterion in cancer research and practice. Multiple studies have tested the short-term temporal stability (1 day–2 weeks) of the European Organization for Research and Treatment of Cancer Quality of Life Questionnaire (EORTC QLQ-C30), but its stability over longer periods is largely unknown. The EORTC QLQ-C30 was administered at two time points between 3 and 12 months apart in six samples of cancer patients with varying characteristics (N between 298 and 923). Averaged across the six samples, the coefficients of temporal stability (intra-class correlation coefficients, ICCs) were between 0.31 and 0.59 for the single scales. The 2-item global health/QoL scale showed a mean coefficient of 0.44. When the stability coefficients were calculated separately for males and females and for younger vs. older patients, no systematic gender or age differences were found in the temporal stability of the QoL scales, though the stability was slightly higher in males (vs. females) and in older subgroups (vs. younger subgroups). It is nearly impossible to predict the course a cancer patient’s QoL will take over a period of several months. Repeated measurements are necessary to track QoL developments.


Scientific Reports | (2021) 11:5191 | https://doi.org/10.1038/s41598-021-84681-0 | www.nature.com/scientificreports/
The aims of this paper were (a) to examine the temporal stability of the 15 dimensions of QoL, (b) to test whether global aspects of QoL are more stable than specific aspects, and (c) to examine whether there are age and gender differences in the stability of these variables.

Methods
Samples. The data set consists of six different samples with two measurement points each. These six studies have already been analyzed and published with other objectives. We briefly describe the samples and procedures below and cite the references for further details of the studies; Table 1 gives a summary overview of the six samples. We only analyzed the data from respondents who had complete data sets for both t1 and t2. The sample sizes are therefore sometimes lower than those given in the original publications, since these publications often refer only to t1. While the t1 measurements were performed in the clinic, the t2 questionnaires were completed at home and mailed to the clinic. The format was paper-pencil with the exception of sample 3 (AYAs), where both paper-pencil and online forms were used.
Sample 1: REHA-mixed cancer patients, rehabilitation clinic. Sample 1 was composed of cancer patients who underwent a rehabilitation program to regain physical fitness (n = 923). The most frequent cancer diagnoses were breast cancer (25%), prostate cancer (19%), and cancer of the gastrointestinal tract (18%). Six months after their stay at the rehabilitation clinic (t1) the patients received the t2 questionnaire by mail along with a letter. Further details of the sample have been described elsewhere 19 .
Sample 2: MIXED-mixed cancer patients, hospital. This sample included 897 cancer patients who were treated in a large German university hospital. The most frequent cancer localizations were: prostate (19%), breast (11%), rectum (7%), and cervix (6%). The first measurement point was hospital admission, and the second point was six months later 20 .
Sample 3: AYA-adolescents and young adults. A sample of 514 AYAs (age 15-39 at diagnosis) was included in this study. The most common tumor diagnoses were breast (27%), non-Hodgkin lymphoma (18%), gynecological tumors (9%), testicular tumor (8%), and hematological cancer (7%). Patients were recruited in 16 German acute care hospitals, four rehabilitation centers, and from two cancer registries. Twelve months after the first measurement point the patients were contacted again 21 .
Sample 4: URO-GYN-urological and gynecological cancer patients. This sample was composed of 314 male patients with urologic cancer and 103 female patients with gynecological cancer treated in a German university hospital. In this analysis, we use the data from the first measurement, obtained while the respondents were hospitalized, and the follow-up measurement taken three months later 22,23 .
Sample 5: GYN-gynecologic cancer patients. The participants in this study were 298 patients with gynecological or breast cancer who were recruited in the gynecological clinics of three German hospitals. The first measurement was performed one or two days before hospital discharge, and the second questionnaire was sent to the participants three months thereafter 24 .
Sample 6: BREAST-breast cancer survivors. This sample included 308 women who took part in a routine radiologic after-treatment (breast cancer) examination. The mean time since first diagnosis was 7.6 years. The participants were asked to complete the questionnaires immediately after the radiologic examination (t1), and were sent a letter and the t2 questionnaire by mail three months later 25 .
Ethical approval. The studies were approved by the respective ethics committees. All procedures performed in the studies were in accordance with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.
Informed consent. Informed consent was obtained from all individual participants included in the studies.
symptom scales (including single symptom items and an item reflecting financial difficulties), and a 2-item global health/QoL scale. All items must be answered with one of four given categories (not at all, a little, quite a bit, very much). The EORTC Quality of Life Group proposed a higher-order summary score which is composed of the five functioning scales and eight symptom scales 26 .
Statistical analysis. The temporal stability was calculated with intraclass correlation coefficients (ICCs). Among the different versions of the ICC, we chose the two-way mixed-effects ANOVA model recommended by Qin et al. 27 , which corresponds to model (A,1) in the terminology of McGraw and Wong 28 . ICCs reflect both fluctuations in the ranking of the subjects and mean score changes in the sample. In addition to these ICCs, we calculated Pearson correlations, since several studies report test-retest correlations as Pearson correlations and since these coefficients reflect only how well the (linear) order of the subjects is maintained. The Pearson correlation is the square root of the proportion of the t2 variance that is explained by the t1 scores. To summarize the temporal stability across the six samples, we averaged the correlations after Fisher's z-transformation. The subsamples according to age and QoL were defined on the basis of the median age and the median global health/QoL score of the sample to obtain nearly equal subgroup sizes. Mean score differences between the t1 and the t2 measurements were calculated to help interpret the differences between the ICCs and the Pearson correlations. All analyses were performed with SPSS version 24.
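The difference between the two coefficients can be illustrated with a minimal sketch. It assumes the ICC(A,1) definition of McGraw and Wong (two-way model, absolute agreement, single measurement) computed from the ANOVA mean squares; the function names and the scores are hypothetical, and the sketch is not the SPSS procedure used for the analyses.

```python
import math


def icc_a1(rows):
    """ICC(A,1) for n subjects (rows) measured k times (columns)."""
    n, k = len(rows), len(rows[0])
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(k)]
    # Sums of squares of the two-way ANOVA without replication
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for r in rows for v in r)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-subject mean square
    msc = ss_cols / (k - 1)               # between-measurement mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def pearson(x, y):
    """Pearson correlation of two equally long score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores: every patient improves by exactly 10 points.
t1 = [10, 20, 30, 40, 50]
t2 = [score + 10 for score in t1]

r = pearson(t1, t2)
icc = icc_a1(list(zip(t1, t2)))
print(f"Pearson r = {r:.3f}, ICC(A,1) = {icc:.3f}")
```

In this constructed case the order of the patients is perfectly preserved, so the Pearson correlation is 1, while the uniform mean shift lowers the ICC to 5/6.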

Results
Temporal stability of the scales. Table 2 presents the ICCs and the Pearson correlations (test-retest correlations) for each of the EORTC QLQ-C30 scales and each sample. The highest stability scores (ICCs) were observed among breast cancer survivors. The right column of Table 2 gives the means of the coefficients, averaged across the six samples. They range from 0.31 (nausea/vomiting) to 0.63 (sum score). The mean stability of the 2-item global health/QoL score (ICC = 0.44) was lower than the coefficients of most of the specific scales. When compared with the Pearson correlations, the ICCs were slightly smaller in most cases. Table 3 shows the mean score differences between the t1 and t2 measurements. The mean differences were lowest in the BREAST sample.
Table 4 presents the mean stability coefficients (ICCs). Two samples (GYN and BREAST) included only females, and one sample (AYA) included only adolescents and young adults. We did not calculate age-specific results for the latter sample due to the limited age distribution. Averaged across all samples, the males' scores were slightly more stable than the females' scores, and older patients' scores were slightly more stable than those of younger patients. The results of the individual samples were mixed, however. While in the URO-GYN sample the males' scores were much more stable than the females' scores, the other samples showed only small and unsystematic differences between males and females. Concerning age, in three of the five samples (MIXED, URO-GYN, GYN), the older subgroups reached higher stability coefficients than the younger groups, while the coefficients were nearly equal in the REHA sample, and in the BREAST sample we observed an opposite trend.
In four of the six samples, the stability among the patients with relatively high QoL at t1 was higher than the corresponding scores of the low QoL groups, but in one sample (MIXED) there was an opposite trend, and in one further sample (GYN), the results were mixed, with higher stability of the global health/QoL score and lower stability of the QoL sum score for the high QoL subgroup.
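The subgroup comparison can be sketched as follows. This is a minimal illustration with hypothetical records of the form (global QoL at t1, scale score at t1, scale score at t2); the helper names are invented, the Pearson correlation stands in for the stability coefficient, and sending ties at the median to the lower group is an assumption, since the handling of ties is not described.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation of two equally long score lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def median_split(records, key):
    """Split records at the median of key(record).

    Assumption: values equal to the median go to the lower group.
    """
    cut = statistics.median([key(rec) for rec in records])
    low = [rec for rec in records if key(rec) <= cut]
    high = [rec for rec in records if key(rec) > cut]
    return cut, low, high


def stability(records):
    """Test-retest correlation of the scale score within one subgroup."""
    return pearson([rec[1] for rec in records], [rec[2] for rec in records])


# Hypothetical data: (global QoL at t1, scale score at t1, scale score at t2)
records = [
    (33, 42, 50), (42, 50, 42), (50, 58, 67), (50, 67, 58),
    (67, 67, 75), (75, 83, 75), (83, 75, 92), (92, 92, 83),
]

cut, low, high = median_split(records, key=lambda rec: rec[0])
r_low, r_high = stability(low), stability(high)
print(f"median = {cut}, n_low = {len(low)}, n_high = {len(high)}")
print(f"stability low QoL: {r_low:.2f}, high QoL: {r_high:.2f}")
```

The same split can be applied to age by passing a different key function.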

Discussion
As was to be expected, the coefficients of temporal stability were much smaller than those obtained in studies with time intervals of only a few days between measurement points. In the nine studies featuring time intervals between 1 day and 2 weeks in length 5-13 , the stability coefficients of the 2-item global health/QoL scale were between 0.82 and 0.93, with the exception of one study that examined brain tumor patients. The averaged Pearson correlation found in our study (r = 0.45) for measurement intervals of 3 to 12 months means that only 20% (r² = 0.202) of the variance of the t2 measurement can be explained by the t1 scores. The relatively low stability coefficients in our samples (as compared with those reported above) may have two causes: the longer time intervals and changes in the health situation of the patients from t1 to t2. Among the six studies, the sample of breast cancer survivors showed the highest stability coefficients in all scales, a result which is obviously due to the absence of treatment and the longer period of time that had elapsed since diagnosis (7.6 years) compared with the other samples. While Pearson correlations only indicate changes in the (linear) rank position of the individuals, ICCs are also sensitive to mean score changes. Therefore, we focused on the ICCs. However, the comparison of ICCs and Pearson correlations in Table 2 shows that in most cases the differences between these two types of coefficients are small. Temporal stability as measured with the ICC or with Pearson correlations should not be considered a criterion of the psychometric quality of the instrument. This would only be justified if no systematic changes in the patients' health situation had occurred, which is obviously not the case. Low temporal stability coefficients do not mean that the questionnaire is unreliable.
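The step from the correlation to the explained variance is plain arithmetic and can be checked with a short sketch. The function names are hypothetical. The rounded r = 0.45 gives 0.2025, so the reported 0.202 presumably stems from the unrounded coefficient, which is an assumption on our part. The residual fraction applies to a simple linear prediction of t2 from t1 and ignores small-sample corrections.

```python
import math


def variance_explained(r):
    """Proportion of t2 variance explained by t1 in a simple linear model."""
    return r * r


def residual_sd_fraction(r):
    """Residual SD of the prediction as a fraction of the t2 SD."""
    return math.sqrt(1.0 - r * r)


r = 0.45  # averaged Pearson correlation reported in the text (rounded)
explained = variance_explained(r)
print(f"explained variance:   {explained:.4f}")
print(f"unexplained variance: {1.0 - explained:.4f}")
print(f"residual SD fraction: {residual_sd_fraction(r):.3f}")
```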
The stability coefficients provide clinicians with information about the degree of precision with which patients' future QoL can be predicted on the basis of a baseline measurement. We had hypothesized that global health/QoL would be more stable than the specific facets of QoL. This was not confirmed. For most of the functioning scales, the stability was higher than that of the global health/QoL scale, and when compared with the symptom scales, the stability of the global health/QoL was in the middle range. This means that, even if there are only slight changes of the global assessment of health and QoL on the group level 29 , there are nevertheless remarkable changes in the individual assessments.
The sum score of the EORTC QLQ-C30 showed the highest stability scores. Since calculating sum scores is relatively new in the research on the EORTC QLQ-C30, there are no studies in the literature which have reported the stability of this sum score. The summarizing assessment of QoL was markedly more stable (averaged ICC = 0.63) than the 2-item global health/QoL scale (ICC = 0.44). This underlines the usefulness of this sum score when a generalized assessment of QoL is intended.
To our knowledge, this is the first study that explicitly examined gender and age differences in temporal stability of QoL. Females generally report higher levels of anxiety, emotional lability and neuroticism than males 18 . However, our results show that this does not necessarily mean that females are inconsistent in their judgments. Four of the samples profiled here included males and females, two of which showed higher stability scores for the males, while the other two showed mixed results. Here one has to take into account that most of the female participants were post-menopausal; the individual changes might be more pronounced among younger women. Concerning age, on average older patients gave somewhat more stable responses than younger patients did, a tendency which was observed in three of the five samples which included age comparisons. Taken together, we detected only inconsistent age and gender tendencies in the stability of QoL assessments. Further research is needed to examine whether it is really necessary to consider age and gender peculiarities of the samples when studying the stability of QoL assessments.
This study has several strengths and limitations. One strength is the relatively large sample size, given that longitudinal study samples with more than 300 participants are rare. This gave us the opportunity to divide the samples into subgroups and to address the question of how age and gender affect stability. The comparison of the six samples gives an impression of the generalizability of the results obtained from a single examination. A limitation of this study is the heterogeneity of the six samples in terms of cancer type, treatment protocols, and stages. Nevertheless, even readers who question the justification of averaging the stability values over these heterogeneous studies may still be interested in the results of the six individual examinations. Where low temporal stability was observed, it remains unclear to what degree that instability was due to natural fluctuations or to individual differences in disease processes. The relatively low stability coefficients of some of the one-item symptom scales may be due to low reliability of the measure as well as to real changes. We chose to average (after z-transformation) the correlation coefficients of the studies without regard to their different sample sizes. Another option would have been to weight the samples according to their sample sizes; this, however, would have given different weights to the different settings. Finally, we could not discuss all relevant aspects of age and gender differences in the fluctuations of QoL responses, present and discuss the age and gender differences for each of the 15 scales of the EORTC QLQ-C30, or discuss the peculiarities of the six samples in more detail.
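The two averaging options mentioned above can be compared in a minimal sketch. The sample sizes are those given in the Methods (URO-GYN: 314 + 103 = 417). The six coefficients are hypothetical and serve only as an illustration, and weighting each z value by n - 3 is a common convention that we assume here, not a procedure taken from the study.

```python
import math


def fisher_mean(rs, weights=None):
    """Average correlations via Fisher's z-transformation.

    Without weights every sample counts equally; with weights
    (for example n - 3) larger samples dominate the mean.
    """
    if weights is None:
        weights = [1.0] * len(rs)
    z_bar = sum(w * math.atanh(r) for r, w in zip(rs, weights)) / sum(weights)
    return math.tanh(z_bar)


# Sample sizes of REHA, MIXED, AYA, URO-GYN, GYN, BREAST
sizes = [923, 897, 514, 417, 298, 308]
# Hypothetical per-sample coefficients, for illustration only
rs = [0.40, 0.35, 0.45, 0.42, 0.50, 0.65]

unweighted = fisher_mean(rs)
weighted = fisher_mean(rs, [n - 3 for n in sizes])
print(f"unweighted mean: {unweighted:.3f}")
print(f"weighted mean:   {weighted:.3f}")
```

With these invented values the weighted mean is lower than the unweighted one because the two largest samples carry the lowest coefficients; with other values the direction could reverse.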
Nevertheless, our general conclusion is that measuring QoL in a hospital setting does not provide sufficient information on what QoL the patient may expect some months later. Repeated measurements are necessary to follow individual courses of QoL.