Introduction

Overweight and obesity are pervasive conditions in childhood, with prevalence in developed countries of 24% in boys and 23% in girls [1]. In Australia, in 2018, 25% of children and adolescents aged 2–17 years were in an overweight or obese weight range. This is concerning due to the associations during childhood with elevated cardiometabolic risk factors [2], type 2 diabetes [3], asthma [4], musculoskeletal pain [5] and depression [6]. Beyond clinical outcomes, there is evidence that patient-reported outcome measures (PROMs) such as functional health status or health-related quality of life (HRQoL) are also impacted by weight status in childhood [7,8,9]. HRQoL measures are important aspects of patient-centred care and can inform economic evaluations and funding decisions regarding prevention and treatment [10]. HRQoL measures may be condition-specific or generic, with the advantage that generic measures can be used across a range of childhood diseases or conditions. When used in economic evaluations, HRQoL measures also require a preference-based value set (or utilities) to calculate quality-adjusted life-years (QALYs).

Two paediatric HRQoL measures commonly used in overweight and obesity research are the Paediatric Quality of Life Inventory (PedsQL™), the most widely used generic paediatric HRQoL measure [11], and the Child Health Utility 9 Dimensions (CHU9D) [12], the first ‘preference-based’ measure designed specifically for children and with their involvement. PedsQL and CHU9D are different instruments with different purposes: the former measures general quality of life and the latter measures utilities for economic evaluation. There is consistent evidence that childhood overweight and obesity are associated with impaired HRQoL when measured with the PedsQL [9], which is the most frequently used HRQoL measure in the context of obesity [13]. Similarly, the CHU9D is widely used in cost-utility analyses of obesity prevention, yet the evidence for reduced HRQoL associated with overweight and obesity, using this measure, has been mixed and may depend on age and context [14,15,16]. The underlying pathways between weight and child HRQoL have not been established, but may be due to obesity-related co-morbidities [7], and reduced psychosocial and physical health [7, 8, 17].

Psychometric evaluation is important for establishing that an instrument is ‘fit for purpose’ in the measurement of HRQoL. The psychometric properties of the PedsQL have been established in general child and adolescent populations in the USA [18], the Netherlands [19] and Greece [20], thus validating its measurement properties in population health. Similarly, the feasibility, reliability and validity of the CHU9D have been established in general child populations in Denmark [21], Sweden [22], Australia [23] and China [24]. Many of these studies have also established known group validity with respect to self-reported health or chronic disease, but emphasised the need to establish the psychometric properties of these measures in different clinical populations. Both PedsQL and CHU9D have been used as PROMs in clinical trials of obesity prevention and treatment, yet only two studies [15, 25] have investigated the psychometric properties of these measures in the context of overweight and obesity. These studies in the UK and China found that among children aged 5–6 years, neither PedsQL nor CHU9D discriminated between children with healthy weight and those with overweight or obesity. The importance of re-evaluating the psychometric properties of HRQoL measures in different clinical populations has been noted [12, 21], as has the importance of evaluating responsiveness to change in health over time [26, 27]. A recent review of the psychometric performance of utility-based HRQoL instruments [28] highlighted that good psychometric performance in a general population would not necessarily indicate good performance in specific clinical conditions. Thus, it is important that psychometric assessment is conducted in the population of interest, with age, health condition and context likely to influence the performance of an instrument.
Given the gap in our current knowledge, the aim of the present study was to assess the psychometric properties, including reliability, acceptability, validity and responsiveness of the PedsQL and the CHU9D in the measurement of HRQoL among children and adolescents in different weight status groups. We hypothesised that good psychometric performance would be shown across children in different weight status groups.

Methods

Participants

Data used in this study were from 6544 child participants of the Longitudinal Study of Australian Children (LSAC) [29]. The LSAC is an ongoing, large population survey of children and their families that collects data on child development and wellbeing. It uses a stratified sampling design and is designed to be representative of the Australian child population. The LSAC recruited two cohorts of children in 2004 using clustered sampling methods: 5107 children in the Birth (B) cohort and 4983 children in the Kindergarten (K) cohort [30]. Children and their caregivers were interviewed every 2 years, with the most recent wave of data collection in 2020. In the present study, we used all longitudinal data from both the B and K cohorts in which both PedsQL and CHU9D were included. This covered the ages of 10–17 years and encompassed waves 6, 7 and 8 of the B cohort, in which children were aged 10/11, 12/13 and 14/15 years, and waves 6 and 7 of the K cohort, in which children were aged 14/15 and 16/17 years.

HRQoL measures

PedsQL

Generic health-related quality of life was measured using age-appropriate versions of the PedsQL v4.0 Generic Core Scales [31], with parent proxy report, hereafter referred to as PedsQL. From age 10 to 12 years, the ‘Parent report for Children’ was used and from 13 to 17 years the ‘Parent report for Teens’ was used. The PedsQL consists of 23 questions covering domains of physical, emotional, social and school functioning. Each item is scored on a 5-point scale (0 = never a problem; 1 = almost never a problem; 2 = sometimes a problem; 3 = often a problem; 4 = almost always a problem) and then reverse scored and linearly transformed to a 0–100 scale (0 = 100; 1 = 75; 2 = 50; 3 = 25; 4 = 0), such that the Total Scale Score represents the mean of the transformed scores across all 23 items and ranges from 0 to 100, with higher scores indicating better HRQoL. Summary scores for Physical functioning, Emotional functioning, Social functioning and School functioning can also be calculated from subsets of the 23 items.
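
The scoring rule described above can be sketched in a few lines of Python. This is an illustration only, not the scoring code used in the LSAC: the function name is hypothetical, and the rule that no score is returned when more than half of the items are missing is an assumption drawn from common PedsQL scoring practice rather than from this study.

```python
from statistics import mean


def pedsql_scale_score(raw_items):
    """Score one PedsQL scale from raw 0-4 responses (None = missing).

    Hypothetical helper for illustration; not the LSAC scoring code.
    """
    answered = [r for r in raw_items if r is not None]
    # Assumption: no score is computed when more than half the items are missing.
    if len(answered) * 2 < len(raw_items):
        return None
    # Reverse score and rescale: 0 -> 100, 1 -> 75, 2 -> 50, 3 -> 25, 4 -> 0.
    transformed = [100 - 25 * r for r in answered]
    # The scale score is the mean of the transformed items (range 0-100).
    return mean(transformed)
```

Passing all 23 items gives the Total Scale Score; passing only the items of one domain gives the corresponding summary score.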

CHU9D

The CHU9D is a preference-based health-related quality of life measure, developed with children and validated for a target age of 7–11 years. It has also been validated for use amongst a general population of adolescents, aged 11–17 years [23, 32, 33]. It comprises 9 dimensions: worry, sadness, pain, tiredness, annoyance, school, sleep, daily routine and activities, each of which is scored at 5 levels of difficulty, self-reported by the child. The CHU9D has been valued in several different country contexts, which enables the calculation of utility scores used for estimating QALYs in economic evaluations. In the LSAC data, utilities from the CHU9D were determined using the Australian valuation algorithm developed in adolescents [32] and take a possible range of values from −0.1059 (poorest health) to 1 (perfect health).

Weight status

At each wave of data collection in the LSAC, consenting children had their height and weight measured by trained research assistants. Height was measured with a laser stadiometer and weight was measured with Tanita body fat scales [30]. From height and weight, we calculated BMI-z scores (BMI-z) according to WHO standards [34]. Weight status was determined from BMI-z using the following definitions: healthy and underweight: BMI-z < 1; overweight: BMI-z ≥ 1 and <2; obesity: BMI-z ≥ 2. The proportion of child records in the underweight category (BMI-z < −1) was extremely low (<1%), so they were included with healthy weight. Records with BMI-z values >5 or <−5 (n = 15) were dropped from the analyses as these are considered biologically implausible [35].
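
For readers who want to reproduce the grouping, the cut-offs above translate directly into code. The sketch below is illustrative; the function name and the use of None for excluded records are assumptions, not part of the LSAC data dictionary.

```python
def weight_status(bmi_z):
    """Classify a BMI z-score using the cut-offs stated in the text.

    Hypothetical helper: returns None for missing or implausible values.
    """
    # Values >5 or <-5 are treated as biologically implausible and excluded.
    if bmi_z is None or abs(bmi_z) > 5:
        return None
    if bmi_z >= 2:
        return "obesity"
    if bmi_z >= 1:
        return "overweight"
    # Underweight records are pooled with healthy weight, as in the text.
    return "healthy"
```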

Demographic characteristics

Age (in years), sex (male or female), socioeconomic position (SEP) (High or Low), culturally and linguistically diverse (CALD) status (CALD/not CALD) and Indigenous status (Aboriginal or Torres Strait Islander) were included as controls in the analyses. Individual-level socioeconomic position was measured at each wave using a variable developed by the LSAC study investigators which combined the education level, occupation type and income of the child’s caregivers into a z-score [36]. For simplicity, we categorised this variable into high (SEP z-score ≥ 0) and low SEP (SEP z-score < 0). A language other than English regularly spoken to the child, collected at age 2–3 years for the B cohort and age 4–5 years for the K cohort, was used as a proxy for CALD status.

Psychometric properties/statistical analyses

The analyses of psychometric properties were conducted in accordance with practice guidelines and criteria for psychometric assessment [27, 37, 38]. For all analyses, except those assessing acceptability through missing data, we used observations that were complete for BMI, PedsQL, and CHU9D.

Reliability. The only aspect of reliability we were able to assess with our existing dataset was internal consistency reliability, which is the degree of interrelatedness among items from the same scale [37]. Cronbach’s alpha and item-total correlations were used to assess the interrelation of the relevant individual items of PedsQL with the four summary scores and with the total score, and of the individual items of CHU9D with the total utility score, among children with overweight and obesity. A Cronbach’s alpha value ≥0.7 and item-total correlations ≥0.2 are considered acceptable thresholds for internal consistency reliability [38, 39].
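
To make the two statistics concrete, the sketch below computes Cronbach's alpha and item-total correlations for a small matrix of item responses (one row per respondent). It is a minimal illustration with hypothetical function names, not the analysis code of this study. One assumption is flagged in the code: each item is correlated with the sum of the remaining items (the "corrected" item-total correlation), which may differ from the variant used in the original analysis.

```python
from math import sqrt
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cronbach_alpha(rows):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(rows[0])
    item_variances = sum(variance(col) for col in zip(*rows))
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_variances / total_variance)


def item_total_correlations(rows):
    """Correlate each item with the sum of the other items.

    Assumption: the corrected (item-rest) form is used here.
    """
    result = []
    for j in range(len(rows[0])):
        item = [row[j] for row in rows]
        rest = [sum(row) - row[j] for row in rows]
        result.append(pearson(item, rest))
    return result
```

A scale would meet the thresholds quoted above if `cronbach_alpha(rows) >= 0.7` and every value returned by `item_total_correlations(rows)` is at least 0.2.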

Acceptability measures the quality of the data and is assessed by the completeness of the data and score distributions, including floor and ceiling effects. Acceptability may also include the practicality and feasibility of using a particular instrument among children with overweight and obesity, and may include measures of comprehension or burden of completion. Without access to respondents, we investigated acceptability through the assessment of missing data and the proportion of ceiling and floor values for the PedsQL total scores and CHU9D utility scores [40] across age and weight status. A low and acceptable level of missing data <5% was used as a benchmark [41], and the threshold for the acceptable floor and ceiling values was <10% [40].
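
The acceptability checks described above reduce to three proportions compared against fixed benchmarks. The sketch below shows one way to compute them; the function name, the dictionary keys and the use of None for a missing score are illustrative assumptions. Whether floor and ceiling proportions are taken over observed or all records is not stated in the text, so the sketch uses observed scores only.

```python
def acceptability(scores, floor, ceiling):
    """Summarise missingness and floor/ceiling proportions for one measure.

    scores: list of numeric scores, with None marking a missing score.
    floor, ceiling: lowest and highest attainable scores of the instrument.
    """
    observed = [s for s in scores if s is not None]
    missing = (len(scores) - len(observed)) / len(scores)
    # Assumption: proportions at floor/ceiling are taken over observed scores.
    at_floor = sum(s == floor for s in observed) / len(observed)
    at_ceiling = sum(s == ceiling for s in observed) / len(observed)
    return {
        "missing": missing,
        "floor": at_floor,
        "ceiling": at_ceiling,
        "missing_ok": missing < 0.05,     # benchmark: <5% missing
        "floor_ok": at_floor < 0.10,      # benchmark: <10% at floor
        "ceiling_ok": at_ceiling < 0.10,  # benchmark: <10% at ceiling
    }
```

For the PedsQL total score the attainable range is 0 to 100; for the CHU9D utility under the Australian adolescent algorithm it is −0.1059 to 1.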

Validity was addressed through known groups validity and convergent validity. Known groups validity is the extent to which a HRQoL measure can distinguish groups of children with and without a health condition, or between children with different severity of a condition. We hypothesised that children with higher weight status would have lower HRQoL and investigated known groups validity using generalised estimating equations (GEE) to account for the repeated measures of weight status and HRQoL among the same children, with adjustment for socio-demographic characteristics known to impact on HRQoL [39]. The GEE models included binomial family, log-link function and robust variance estimation. PedsQL Total Scale Scores (transformed to 0–1 scale) and CHU9D utility scores were the response variables; explanatory variables were weight status (healthy, overweight, obesity) and demographic variables, as described above. Interaction terms of weight status and significant demographic variables were included to identify whether these parameters modified the association of HRQoL and weight status. Models were fitted separately for girls and boys and significance levels were set at p < 0.05 for main effects and p < 0.01 for interaction terms (Wald tests). The margins command in Stata was used to predict marginal effects of weight status on reduced HRQoL, and to predict HRQoL by age and weight status, using final models including interaction terms, where significant.
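
Written out, a main-effects model of the kind described above might take the following form. This is a sketch under stated assumptions rather than the exact specification fitted: the covariate coding and subscripts are illustrative, interaction terms are omitted, and the working correlation structure is left unspecified because it is not reported in the text.

```latex
% Y_{it}: HRQoL score on a 0-1 scale for child i at wave t, \mu_{it} = E[Y_{it}]
% OW, OB: indicators for overweight and obesity (reference: healthy weight)
\log \mu_{it} = \beta_0 + \beta_1\,\mathrm{OW}_{it} + \beta_2\,\mathrm{OB}_{it}
              + \beta_3\,\mathrm{age}_{it} + \beta_4\,\mathrm{SEP}_{it}
              + \beta_5\,\mathrm{CALD}_{i} + \beta_6\,\mathrm{Indigenous}_{i},
\qquad
\mathrm{Var}(Y_{it}) \propto \mu_{it}\,(1 - \mu_{it})
```

Under the log link, exp(β2) is the ratio of mean HRQoL for obesity relative to healthy weight with the other covariates held fixed; the differences in marginal predictions reported in the Results are on the original score scale.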

Convergent validity measures the level of agreement between instruments that purport to measure the same construct, and usually uses an existing health measure as a comparator. As PedsQL and CHU9D are well established and accepted measures of the same general construct i.e. HRQoL, we assessed convergent validity by calculating Spearman’s correlations between the CHU9D utility scores and the PedsQL Total Scale Score among children in each weight status group. Correlation coefficients >0.8 are regarded as strong, between 0.61 and 0.8 as good, between 0.41 and 0.6 as moderate and <0.4 as weak convergent validity [28]. We hypothesised that there would be moderate correlation between the two instruments, as they are both measures of HRQoL, but one is child report and the other is parent proxy.
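
As an illustration of the convergent validity calculation, the sketch below computes Spearman's correlation (Pearson's correlation of average ranks, so that ties are handled) and maps the coefficient to the bands quoted above. Function names are hypothetical. Because the quoted bands leave small gaps (for example between 0.4 and 0.41), the cut-points used in `classify` are an assumption.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


def classify(rho):
    """Map a coefficient to the bands in the text (boundaries assumed)."""
    if rho > 0.8:
        return "strong"
    if rho > 0.6:
        return "good"
    if rho > 0.4:
        return "moderate"
    return "weak"
```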

Responsiveness is the ability of a measure to detect change over time when there are known changes in health status [42]. This was examined by whether changes in the PedsQL total score and CHU9D utility score were responsive to changes in weight status between subsequent waves in the LSAC. Children were classified as to whether their weight status stayed the same, improved or deteriorated between consecutive waves of LSAC, according to the three weight status groups: healthy, overweight or obese. Both the B and K cohorts were used in the analysis, providing data on the change in HRQoL scores for individual children over 2-year intervals from mean ages 11–13, 13–15 and 15–17 years. We hypothesised that deterioration in weight status (healthy to overweight; healthy to obese; overweight to obese) would result in a negative HRQoL score change, whilst improvement in weight status (overweight to healthy; obese to overweight; obese to healthy) would result in a positive change in HRQoL scores, and no change in weight status would result in a HRQoL score change close to zero. Standardised response means (SRM) and effect sizes (ES), which take into account the change in HRQoL score in relation to the SD of baseline score, were calculated according to the method outlined in [43].
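
The responsiveness statistics can be illustrated as follows. The sketch uses the conventional definitions (ES = mean change divided by the SD of baseline scores; SRM = mean change divided by the SD of the change scores), which are an assumption here and may differ in detail from the method in [43]. Function and variable names are hypothetical.

```python
from statistics import mean, stdev

# Ordering of weight status groups, used to label the direction of change.
ORDER = {"healthy": 0, "overweight": 1, "obesity": 2}


def weight_change(before, after):
    """Label a transition between consecutive waves as better, same or worse."""
    diff = ORDER[after] - ORDER[before]
    if diff == 0:
        return "same"
    return "worse" if diff > 0 else "better"


def responsiveness(baseline, follow_up):
    """Return (ES, SRM) for paired baseline and follow-up HRQoL scores.

    Assumption: conventional definitions, not necessarily those of [43].
    """
    change = [f - b for b, f in zip(baseline, follow_up)]
    es = mean(change) / stdev(baseline)  # effect size
    srm = mean(change) / stdev(change)   # standardised response mean
    return es, srm
```

In use, children would first be grouped by `weight_change` for each pair of consecutive waves, and `responsiveness` would then be applied within each group and age interval.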

Results

Participants

Descriptive statistics of the analysis population are shown in Table 1. Across all ages and cohort groups, a total of 15,166 records from 6544 children were available for analysis. The distribution of demographic characteristics varied across weight groups, with a higher proportion of boys, children at low SEP, and children from linguistically diverse or Indigenous families having obesity compared with the healthy weight and overweight categories. At all ages and cohorts, mean PedsQL scores decreased with higher weight status. For the CHU9D, mean scores decreased with higher weight status at age 14/15 years in the B and K cohorts and 16/17 years in the K cohort, but not among children aged 10/11 and 12/13 years.

Table 1 Descriptive characteristics of analysis population by weight status and age.

Psychometric properties

Internal consistency

Among children and adolescents with overweight and obesity, internal consistency was strong for the PedsQL total score scale and the individual summary score subscales for physical health and emotional, social and school functioning (Cronbach’s alpha ranged from 0.77 to 0.92 and item-total correlations ranged from 0.40 to 0.77). CHU9D utility scores also showed strong internal consistency (Cronbach’s alpha 0.82 and item-total correlations ranged from 0.40 to 0.62) (supplementary Table 1).

Acceptability of the two measures was high, based on the overall low level of missing PedsQL scores of 1.6–2.0% and missing CHU9D utility of 1.3–1.6% (supplementary Table 2). Examination of missing data across age and weight status groups also indicated an acceptable level of missing values (<5%) for both PedsQL and CHU9D. No floor effect was observed for the PedsQL and floor effects for the CHU9D were <0.1%. There were no ceiling effects for PedsQL, but >10% of children scored at full health (=1) on the CHU9D, which is normal and acceptable for a preference-based measure (i.e. one providing utilities) [44].

Known groups

The PedsQL was able to discriminate between children with overweight and obesity and those with healthy weight (Table 2). After adjustment for demographic factors and compared with healthy weight, the differences in marginal predictions of PedsQL score for boys and girls with obesity were: boys −5.6 (95%CI −6.7, −4.4), p < 0.001; girls −6.7 (95%CI −8.1, −5.4), p < 0.001, and for those with overweight: boys −2.2 (95%CI −3.0, −1.4), p < 0.001; girls −1.3 (95%CI −2.0, −0.5), p = 0.002. The PedsQL also indicated known groups validity for boys and girls from low compared to high SEP (p < 0.001) and from CALD compared to non-CALD households (p < 0.001). All interaction terms investigated were non-significant (p > 0.01), indicating no evidence that age or demographic characteristics modify the relationship between PedsQL score and weight status (see supplementary Table 3).

Table 2 Association of CHU9D utility and PedsQL total score with weight status, using generalised estimating equations with binomial family and log link, and adjustment for age, Indigenous status, cultural diversity and socio-economic position.

Similarly, CHU9D utility scores were lower for boys and girls with obesity compared to those with healthy weight (Table 2). Differences in marginal predictions of CHU9D for obesity compared with healthy weight were: boys −0.02 (95%CI −0.034, −0.006), p < 0.002; girls −0.035 (95%CI −0.054, −0.015), p = 0.001. However, for those with overweight, only among girls was there a statistically significant difference in CHU9D utility score compared to healthy weight: girls −0.014 (95%CI −0.026, −0.003), p = 0.02; boys −0.008 (95%CI −0.018, 0.002), p = 0.146. CHU9D utility scores declined with increasing age for girls (p < 0.001), but did not discriminate between groups of culturally diverse, Indigenous or socioeconomically disadvantaged children. Interaction terms between family demographic factors and weight status were not significant (p > 0.01), indicating similar utility score differences by weight status for Indigenous children and those from low SEP and CALD groups. However, a significant interaction between age and obesity for girls (p = 0.004) indicated that utilities decline with age, and that they decline faster for girls with obesity than for those with healthy weight (supplementary Table 4).

Marginal predictions for final models (including interaction terms where significant) for CHU9D utility and PedsQL total score by age and weight status, depict the age-independent association of weight status and PedsQL score, and the age-dependent association of CHU9D utility for girls and the stronger decline in HRQoL for those affected by obesity (Fig. 1). For example, for girls aged 12 years, the predicted CHU9D utility decrement for obesity was 0.015, but at age 17 was 0.065. A rule of thumb for the minimal clinically meaningful difference of utility scores of 0.03 [45] was exceeded with CHU9D for girls with obesity aged 14 and above, but not for boys or for overweight. For PedsQL, the clinically meaningful difference of 4.5 points [18] was exceeded for obesity compared to healthy weight for girls and boys across all ages, but not for overweight.

Fig. 1: Marginal predictions of PedsQL total score and CHU9D utility score by age and weight status from adjusted models.

Red = obesity; orange = overweight; blue = healthy weight.

Convergent validity

In the tests of convergent validity between PedsQL and CHU9D by weight status and age, the Spearman correlation coefficients ranged from 0.16 to 0.29, which is considered low evidence of convergence between the two instruments (supplementary Table 5).

Responsiveness

The effect sizes of PedsQL were mostly consistent with the hypothesised direction according to change in weight status (Fig. 2), producing a positive ES for ‘better’, close to 0 for ‘same’ and negative ES for ‘worse’ weight status change. CHU9D ES were less consistent with the hypothesis; for example, between 11 and 13 years all CHU9D ES were positive regardless of the direction of change in weight status, and between 15 and 17 years all ES were negative, regardless of actual weight change. Between 13 and 15 years, the pattern was less consistent for both measures, with negative ES for all changes in weight status, including the ‘same’ category and the ‘better’ category. In addition to greater consistency in the hypothesised direction of change, PedsQL ES were larger than for the CHU9D, although all ES were relatively small.

Fig. 2: Responsiveness, measured by effect size of changes in PedsQL total score and CHU9D utility score in response to changes in weight status over childhood and adolescence.

Green = improvement in weight status; purple = no change in weight status; red = worsening weight status.

Discussion

Two paediatric HRQoL instruments have been investigated in the context of their application to child and adolescent weight status, focusing on reliability, acceptability, validity and responsiveness. This is the first time that rigorous psychometric assessment of PedsQL and CHU9D has been carried out for children with overweight and obesity in this age range. Internal consistency reliability was demonstrated for the PedsQL and the CHU9D, and both measures had high acceptability based on the low proportion of floor values. There were no ceiling effects for the PedsQL, but >10% of children scored at full health (=1) on the CHU9D, which is normal and acceptable for a preference-based measure. Based on the low number of missing responses, both PROMs were acceptable for the measurement of HRQoL in children and adolescents with overweight and obesity. We were unable to investigate patient burden with our dataset, but patient burden and comprehension remain important aspects of acceptability to assess in future studies.

The PedsQL demonstrated very good known group validity, with the ability to discriminate between groups of children based on weight status, and between those from lower sociodemographic and culturally diverse households. The CHU9D showed good discrimination between obesity and healthy weight, but was unable to discriminate between weight categories in younger girls and between overweight and healthy weight in boys. This may be because the CHU9D questions are not sensitive enough to pick up on the changes in HRQoL associated with weight-related co-morbidities in younger children and boys.

Known group validity is an important property for the measurement of patient HRQoL in trials of obesity prevention, with intervention effectiveness contingent on the instrument having the ability to detect differences between weight status groups. There is no existing literature on children and adolescents aged 11–17 years for direct comparison with our results, but two previous studies with younger children aged 5–6 years found that PedsQL and CHU9D had poor known group validity with respect to weight status [15, 25]. The differences between these studies and our study are most likely a result of the different age groups studied.

PedsQL had quite consistent quality of life scores associated with overweight and obesity in girls and boys and across all ages studied. The age-related decline of CHU9D utility, particularly among girls with obesity, has been noted before [14] and resulted in quite different CHU9D utility scores among boys and girls living with obesity, particularly in late adolescence. The implications for cost-utility analyses using this measure are that similar weight status among girls and boys would lead to different utilities and different QALYs, and thus poorer cost-effectiveness in boys compared with girls. In addition, the CHU9D did not discriminate between boys with overweight and those with healthy weight, which may limit its application among boys in economic evaluations of prevention and treatment of overweight. These limitations may partly explain the paucity of economic evaluations of childhood obesity prevention, which remains a major challenge [46].

Our investigation of convergent validity found low correlation between the PedsQL total scores and the CHU9D utility score, despite the fact that they are measuring the same construct. The low correlation may be explained by a number of points of difference in the LSAC data: PedsQL was proxy report and the CHU9D was self-report, they have different numbers of items: 23 and 9, respectively, and different scoring systems. The PedsQL total score is a summative average of the reverse scored items of the respondents, whilst for the CHU9D a preference-based value set/algorithm is applied to the respondents’ item scores to calculate the CHU9D utility score.

Responsiveness is an important quality of HRQoL measures in health [42] that has rarely been assessed among children and adolescents [28] and never before in the context of overweight and obesity. We found the PedsQL to be responsive to changes in weight status in that the effect size and standardised response means were consistent with the hypothesised direction, while the CHU9D was less responsive. This suggests that the PedsQL would be an appropriate instrument to use in obesity management and prevention, but there may be some limitations in using the CHU9D due to lower responsiveness. The better validity and responsiveness of the PedsQL over the CHU9D may be due to the larger number of items, or because the questions themselves are more relevant to the impacts of obesity.

Strengths and weaknesses

Strengths of this study include the size and richness of the dataset (n = 15,166 data points), and its longitudinal nature, which allowed us to investigate responsiveness to weight change, which has not previously been evaluated. Overweight and obesity were based on objectively measured height and weight and thus not subject to reporting bias. The psychometric methods used were rigorous and based on established gold standard criteria. There are some weaknesses: as mentioned, we were unable to investigate patient burden and comprehension from our established dataset. In addition, the 2-year interval of data collection meant that test-retest reliability could not be assessed. Another potential limitation is that we were only able to evaluate the parent proxy report of PedsQL, as the child self-report version was not used in the LSAC. Previous studies [47], however, have found very similar HRQoL scores for proxy and child report; nevertheless, it may impact comparison with the CHU9D, which is a child/adolescent self-reported measure.

While this study explores the psychometric properties of PedsQL and CHU9D across children in different weight status groups, defined by BMI-z, future research may consider repeating these analyses using other measures of child growth, such as waist circumference, or using co-morbidities of overweight and obesity such as asthma and depression. Another important area of future research would be to assess psychometric performance of the two measures in further paediatric clinical conditions to ascertain whether the poorer responsiveness and validity of the CHU9D is specific to obesity or a more general feature of this measure.

Conclusion

Overall, both PROMs demonstrated adequate reliability and acceptability for the measurement of HRQoL in children and adolescents with overweight or obesity. However, PedsQL appears to be superior to the CHU9D in terms of its ability to discriminate between children of different weight status and to respond to changes in weight status over time. This represents a dilemma in cost-utility analysis of overweight and obesity prevention and treatment, as the CHU9D is unlikely to be sensitive enough to detect improvements in weight status. Evidence of value for money based on QALYs underpins decision making by health technology assessment agencies in many jurisdictions around the world; thus, the psychometric properties of preference-based HRQoL measures are vitally important.