Introduction

Xerostomia is one of the most frequently reported side effects following radiotherapy of head and neck cancer (HNC) patients and affects patient-reported quality of life1. For the prediction of late xerostomia, Normal Tissue Complication Probability (NTCP) models have been developed with pre-treatment dose-volume parameters and baseline complaints as most important predictors2,3. However, xerostomia NTCP models based on information during treatment are less explored. Since in-treatment parameters contain information on the patient-specific response to treatment, they may resolve some of the unexplained variability that remains for NTCP models that are based on pre-treatment variables only. These in-treatment parameters could therefore be used to improve the prediction of late xerostomia. Adequate prediction supported by in-treatment data may offer new opportunities to guide treatment adaptation aiming at a further reduction of late radiation-induced side effects.

Several studies have investigated changes of the parotid glands during and after treatment in CT images4,5,6,7 and have shown a weak to moderate relationship between parotid gland dose and volume change4,6. However, knowledge of the relationship between parotid gland changes and patient-reported xerostomia is still limited. Therefore, in our previous study, we investigated the association between late patient-reported xerostomia and parotid gland changes quantified in radiomics features, or also called image biomarkers, extracted from CT images before and 6 weeks after treatment. That study showed that the parotid gland surface reduction (∆PG-surface6w-postRT) was strongly associated with the development of xerostomia at 6 and 12 months after radiotherapy8.

However, this post-treatment model does not allow for treatment adaptation, as the total prescribed radiation dose has already been administered. Hence, the next step is to investigate parotid gland changes during treatment.

The aim of the current study was to identify quantitative parotid gland changes during treatment that predict the development of late xerostomia. These parotid gland changes were extracted from pre-treatment and weekly CT-images during radiotherapy, from which delta radiomics features (∆features) were quantified, representing differences in intensity, texture and geometric characteristics of the parotid glands.

Results

Patients

Moderate-to-severe xerostomia was reported by 26 (38%) of all 68 patients included at 12 months after radiotherapy (22 in the training set and 4 in the test set). At 6 months after radiotherapy, the moderate-to-severe xerostomia reporting rate was 46 (52%) out of a total of 88 patients.

∆features selection

For the geometric features, a change of contralateral parotid gland surface (∆PG-surface) was the most frequently selected ∆feature for all weeks predicting Xer12m (see Supplementary Data 4 for frequency plot), except for week 2 where the ∆PG-‘bounding box volume’ frequency was slightly higher. The ∆PG-surface frequency was especially high in week 3 (725 times selected in the 1000 bootstrap samples). In other weeks, the subsequently selected ∆features, ∆‘volume’, ∆‘bounding box volume’ and ∆‘volume times mean intensity’, were highly correlated with ∆PG-surface for all weeks with ρ = 0.79–0.93, ρ = 0.66–0.79 and ρ = 0.82–0.91, respectively.

For the intensity and texture ∆features, no clear selection of ∆features that were most frequently selected for all weeks could be made (see Supplementary Data 4 for frequency plot). Overall, the most frequently selected ∆features on average were: coarseness from the neighbourhood grey tone difference matrix (coarseness), kurtosis and the median intensity (median). The ∆features kurtosis and median were weakly correlated to coarseness (ρ = 0.13, 0.18), but were stronger correlated to each other (ρ = 0.79).

A heatmap of all ∆IBMs and histograms of the most frequently selected ∆IBMs are depicted in Supplementary Data 5 to illustrate the distribution and relation between the ∆IBMs.

∆feature: univariable analysis

The univariable analysis also showed that ∆PG-surface was the most significant ∆feature. This geometric ∆feature was significantly associated with Xer12m at all weeks (p < 0.04) but was most significant in week 3 (p < 0.001). This week showed the largest regression coefficient of all weeks (Supplementary Data 6).

For the intensity and texture ∆features, none of the most frequently selected ∆features were significantly associated with Xer12m in any of the weeks (coarseness: p ≥ 0.06; kurtosis ≥0.08; median: p ≥ 0.07) (Supplementary Data 6).

∆features, dose and toxicity: multivariable analysis

Since the ∆features showed the best performance in week 3, the multivariable analysis was performed with the selected week 3 ∆features only.

In the training set (56 patients), discrimination of the reference ‘pre-treatment’ model (Xerbaseline and PGdose) was good (AUCtrain = 0.83 (AUCinternal.val. = 0.82)), yet the geometric ∆feature model with ∆PG-surfacew3 and Xerbaseline performed better (∆feature model 1: AUCtrain = 0.88 (AUCinternal.val. = 0.84)) in predicting Xer12m (Table 1). Moreover, the addition of ∆PG-surfacew3 to the pre-treatment model (likelihood-ratio test, p = 0.003), significantly improved different aspects of performance (∆feature model 2: AUCtrain = 0.92 (AUCinternal.val. = 0.89); Table 1). Validated in the test set (14 patients), the models showed stable performance (Table 1). Furthermore, also in the test set ∆feature model 2 showed the highest performance (AUCtest = 0.93 and R2 = 0.49). Ultimately, Table 2 gives the model coefficients that were optimized for the entire cohort, with the coefficients corrected for optimism using bootstrapping. The performance measures of the final model are shown in Supplementary Data 7.

Table 1 Performance of NTCP models predicting Xer12m with and without ∆image biomarkers in the training and the test set.
Table 2 Estimated coefficients (uncorrected and corrected for optimism) of pre-treatment and ∆image biomarkers models fitted to the entire dataset.

Acute xerostomia scores at week 3 (Xerw3) significantly improved ∆feature model 2 (Xerbaseline, PGdose and ∆PG-surfacew3) (likelihood-ratio test, p = 0.01), but the improvement in performance was relatively small (AUCtrain = 0.93 (AUCinternal.val. = 0.89)) and decreased when validated in the test set (AUCtest = 0.88). The relationship between ∆PG-surfacew3 and Xerw3 was not significant (p = 0.14). This is also demonstrated in Fig. 1, where patients with a large and small surface reduction at week 3 (median ∆PG-surfacew3 = −2.73) showed a clear differentiation of actual moderate-to-severe xerostomia incidences at 6 or 12 months after treatment, but not for acute time points.

Figure 1
figure 1

Actual moderate-to-severe xerostomia incidence and 95% confident intervals at baseline, weekly during, and 6 weeks (week 12), 6 months, and 12 months after treatment for patients, with parotid gland surface reduction at week 3 (∆PG-Surfacew3) larger (blue) or smaller (yellow) than the median reduction (|median| = 2.73).

No significant relationship was found between ∆PG-surface and Xerbaseline (p = 0.17). Xerw3 was significantly associated with both PGdose (p = 0.04) and Xerbaseline (p = 0.03), probably explaining the marginal prediction improvement of Xer3w to the ∆feature model with PGdose and Xerbaseline.

For the secondary endpoint Xer6m, ∆PG-surfacew3 also added significantly to the pre-treatment model in predicting Xer6m (likelihood-ratio test, p = 0.02). See Supplementary Data 8 for more details. None of the frequently selected intensity or texture ∆features showed any significant improvement either compared to or in addition to the pre-treatment model (likelihood-ratio test, p > 0.27) in predicting Xer12m or Xer6m.

Parotid gland dose and ∆features

The linear relationship of contralateral parotid gland mean dose and ∆PG-surface was significant for all weeks (p < 0.008; Fig. 2). Depicted in Fig. 2, the regression coefficients of this linear relationship effectively increased over time, as did the coefficient of determination, but remained weak.

Figure 2
figure 2

Univariable linear regression of contralateral parotid gland mean dose (PGdose) predicting parotid gland surface reduction (∆PG-surface) for different weeks (lines) and regression characteristics (Table). Correlation increases over time, but remained weak. Data point represent ∆PG-surface values for week 6.

The selected intensity and the texture ∆features, ∆median and ∆coarseness were significantly correlated (p < 0.05) to parotid gland dose for week 2, 3, 5, 6 and 5, 6, respectively (Supplementary Data 9). However, the coefficient of determination was relatively low (R2 = 0.00–0.21). ∆LZLGE was not significant for any week.

Discussion

The current study shows that surface change of the contralateral parotid gland (∆PG-surface) assessed during the course of radiotherapy was strongly associated with the development of late xerostomia (Xer12m and Xer6m). The association of this geometric ∆feature was statistically significant during the entire course of treatment but performed best for changes obtained between treatment planning and week 3. This time point is still clinically relevant, as any treatment adaptations could still influence the patient’s toxicity outcome. ∆PG-surfacew3 did not only show improved predictive performance over PGdose, but it also improved the pre-treatment model performance significantly. The resulting model that was based on Xerbaseline, PGdose and ∆PG-surfacew3 showed excellent performance when predicting Xer12m in both the train (AUC = 0.92), test (AUC = 0.93) and when fitted to the entire cohort (AUC = 0.91). However, these results should be confirmed by external validation in larger patient cohorts.

Castelli et al. showed that parotid gland dose could significantly be reduced with an adaptive radiotherapy approach (ART)9. By re-planning the dose distribution on weekly CTs, an average NTCP reduction of 11% (maximum 30%) was observed. However, weekly re-planning is time consuming. This highlights the potential of the ∆feature NTCP model with ∆PG-surfacew3, since it could select patients during treatment that have a high risk of developing xerostomia. If these high-risk patients could, subsequently, receive less PGdose by re-planning, their risk of xerostomia could be further reduced. Alternatively, the model-based approach has been introduced to select patients for proton therapy. Patients can be selected that have a clinically relevant NTCP-reduction with a proton plan compared to their photon based treatment plan10. Proton therapy has the potential to better conform the dose to the tumour while sparing the surrounding normal tissue, due to the intrinsic properties of protons11. By incorporating patient-specific ∆feature response information in the pre-treatment reference model, patients that do not initially qualify could be reclassified for proton therapy. Accordingly, treatment can be changed from photon to proton therapy, when relevant differences are seen in the new ΔNTCP values.

In a previous study, the geometric radiomics feature differences were calculated between 6 weeks post-treatment and prior to treatment (∆feature6week-postRT)8. The association of ∆feature6week-postRT with Xer12m was investigated in a patient cohort (n = 107) independent of the current cohort. Interestingly, the most stable and predictive post-treatment ∆feature was also the contralateral ∆PG-surface. Similar to the results of the current study, inclusion of ∆PG-surface substantially improved the pre-treatment model. Using the same coefficients of the post-treatment model with ∆PG-surface6w-postRT, Xerbaseline and PGdose in the current cohort, also showed a comparable performance (AUC = 0.89) to that of the model trained in the current cohort (AUC = 0.91). The other way around, using the coefficients of the current model in the previous cohort also resulted in a comparable improvement in performance for the model in the post-treatment cohort (Supplementary Data 10). This suggests that the ∆PG-surfacew3 model would also perform well when externally validated in a cohort where ∆PG-surface is acquired at week 3. In both studies, ∆PG-volume was highly correlated to ∆PG-surface, and also performed well in predicting late xerostomia.

In line with other studies that observed a relationship between PGdose and PG shrinkage4,6,12,13, linear regression in the current study also showed that there was a weak to moderate correlation between ∆PG-surface and PGdose. Interestingly, the correlation between PGdose and ∆PG-surface effectively increased over the time of treatment, illustrated by the increasing values of the regression coefficients and R2 every consecutive week. This suggests that the effect of planning PGdose on ∆PG-surface becomes clearer as more dose is administered. However, such an effect was not seen for the association between ∆PG-surface and Xer12m, since the univariable logistic regression coefficient and the performance of ∆PG-surface increased from week 2 to 3, but decreased for the subsequent weeks. Hence, we concluded that the best moment for predicting Xer12m was during week 3. The explanation may be that most parotid glands shrink when irradiated, as reported in previous studies4,5,6,7, but patients that have a parotid gland that shrinks early in treatment have a higher risk of developing late xerostomia. Therefore, ∆PG-surfaceweek3 could be a marker to differentiate between patients that develop permanent damage of the parotid gland versus those that can recover.

In addition to these observations, ∆PG-surface was not associated with acute xerostomia, although it was strongly associated with the development of late xerostomia. Figure 1 also demonstrated this, as ∆PG-surfacew3 did not show a clear differentiation between the actual incidences of moderate-to-severe xerostomia at week 3 or any of the other acute time points. In contrast, this differentiation can be clearly seen for 6 and 12 months after radiotherapy. Furthermore, acute xerostomia scores at 3 weeks (Xerw3) did significantly add to the model with Xerbaseline, ∆PG-surface and PGdose, although the improvement in performance measures was small. This is probably due to the correlation between Xerw3 and both PGdose and Xerbaseline. Further research needs to be performed on larger datasets in order to investigate whether acute toxicities can contribute to ∆feature models.

Changes in intensity or texture features were not related to the development of xerostomia. In contrast, many of these ∆features were significantly related to PGdose, even though no relationship was seen with the development of xerostomia. Furthermore, detailed investigation of the most frequently selected intensity or texture ∆features showed that these ∆features contained one or two outliers that determined the effect. The influence of outliers indicates the importance of evaluating the selected features before presenting them in a final model. In this study, the analysis of ∆features was used rather than the features directly extracted per week. The results of these absolute weekly features were not significant. In contrast, in a previous study, a pre-treatment CT feature that indicates tissue heterogeneity, was significantly associated with the development of late xerostomia14. It might be that the effect of pre-treatment is too weak to be observed in this relatively small dataset. In addition, using proportional ∆features instead of absolute difference ∆features did not improve the results of this study either.

The limitations of this study are the low numbers of patients included in this analysis and no direct external validation was performed. Nevertheless, we have validated our results by splitting our dataset in a training set, for which variables were selected, and tested them in a small unseen cohort. The models performed well in both the training as the test cohort. Finally the models were fitted to the entire dataset, to make full use of the total number of patients in estimating the most optimal coefficients given all data. Furthermore, only pre-treatment PGdose rather than the accumulated dose over all weekly CT scans was evaluated, since this was outside the scope of the paper. Brouwer et al. showed that accumulated dose calculated on weekly CTs was almost equal to the pre-treatment PGdose15. Using accumulated PGdose could improve the predictive performance of PGdose. Additionally, other modalities, such as positron emission tomography and magnetic resonance imaging could potentially provide better information during treatment on function loss of the PG gland. Future studies using these modalities could improve the quantification and understanding of the development of late xerostomia.

In conclusion, contralateral parotid gland surface area reduction during the course of radiotherapy (ΔPG-surface) was associated with the development of late xerostomia both at 6 and 12 months after radiotherapy. The model consisting of Xerbaseline, parotid gland dose and ∆PG-surface, as assessed at week 3 during treatment (ΔPG-surfacew3), showed the best performance, and substantially improved the pre-treatment model based on parotid gland dose and Xerbaseline only (from AUC of 0.83 to 0.91). This mid-treatment model may be a good candidate to identify patients most at risk of developing late xerostomia and who may benefit from treatment adaptations, but external validation is warranted.

Method

Patients and image acquisition

The study cohort included consecutive HNC patients that were treated with definitive radiotherapy and received weekly CTs between January 2014 and December 2016. The cohort, sorted by treatment start date, was split in a training set (80%; 54 patients) and test set (20%; 14 patients). Radiation plans were adapted where necessary due to anatomical changes causing reduced target coverage. Patients were treated with IMRT or VMAT using a simultaneous integrated boost (SIB) technique, either as a single modality or in combination with concurrent chemotherapy or cetuximab. Plans were optimised to spare the parotid glands and swallowing organs at risk (superior pharyngeal constrictor muscle and supraglottic area) as much as possible without compromising the dose to the target volumes16. The primary tumour and pathologic lymph nodes were generally prescribed 70 Gy (2 Gy per fraction) and the cervical lymph node levels were prescribed an elective radiation dose of 54.25 Gy (1.55 Gy per fraction)17. More detailed descriptions of the radiation protocols used was reported in previous work16. Patient characteristics are listed in Table 3.

Table 3 Patient characteristics of patients that had follow-up information available at 12 and 6 months after treatment.

Patients were excluded if they had salivary gland tumours, underwent prior surgery and/or underwent re-irradiation. An additional requirement was that patient-rated follow-up information at 6 and/or 12 months after radiotherapy was available.

CT scans (Somatom Sensation Open, Siemens, Forchheim, Germany; voxel size: 0.94 × 0.94 × 2.0 mm3; 100–140 kV) were acquired within 2 weeks prior to treatment (CT0) and weekly during treatment (CTw1–6), where CTw1 was generally acquired on the day of the first or second fraction. Patients only received intravenous contrast for CT0. All scans were acquired with a thermoplastic mask in their radiotherapy treatment position.

All patients provided written informed consent before starting therapy that their data could be used within the department’s research program. The Dutch Medical Research Involving Human Subjects Act is not applicable to data collection as part of routine clinical practice and therefore, the hospital ethics committee granted us a waiver from needing ethical approval for the conduct of studies based on these data. All patients received standard clinical care of adaptive radiotherapy.

Endpoints

Patient-rated xerostomia scores were collected prospectively on a routine basis; before, weekly during, and subsequently 6 and 12 months after radiotherapy using the EORTC QLQ-H&N35 questionnaire, as part of the standard follow-up programme (SFP) (NCT02435576)2,18. The primary endpoint of this study was moderate-to-severe patient-rated xerostomia at 12 months after radiotherapy (Xer12m) and the secondary endpoint was moderate-to-severe patient-rated xerostomia at 6 months (Xer6m). This corresponds to the 2 highest scores of the 4-point Likert scale (not, a bit, quite a bit, a lot).

∆features definitions

Parotid glands were delineated on the CT0 and CT1 according to the consensus guidelines of Brouwer et al.19. Delineations were warped from CT1 to the weekly CTs using the deformable image registration tool in the treatment planning system RayStation v5.99 (RaySearch Laboratories, Stockholm, Sweden), the warped structures were carefully checked and manually corrected where necessary.

The radiomics features were extracted from the planning and the weekly CTs with Matlab-based (Mathworks, Natick, MA, USA; version R2014a) in-house developed software. The definitions and formulas were in line with the ‘Image biomarker standardisation initiative’20. The geometric changes (geometric ∆features) were calculated by subtracting the radiomics features of CTw2–6 from those of CT0. This resulted in 15 geometric ∆features per weekly CT, that for example represent volume, surface or compactness changes. Intensity features describe first order and histogram characteristics of CT intensities of a parotid gland (e.g. mean or variance). Textural features describe the intensity heterogeneity and were extracted from the grey level co-occurrence matrix (GLCM)21, grey level run-length matrix (GLRLM)22,23, grey level size-zone matrix (GLSZM)24 and neighbourhood grey tone difference matrix (NGTDM)25. Contrast enhancement was only used for CT0, and not for the weekly CT-scans. Since this can affect the intensity and texture feature values, CTw1 (generally acquired before the 2nd radiation fraction) was considered to be the baseline CT. Hence, intensity and texture changes were quantified by calculating the difference between CTw1 and the CTw2–6. As the intensity and texture features can be influenced by metal artefacts, slices with metal artefacts were deleted and features were calculated on the remaining slices only.

Figure 3 depicts the calculation of the ∆features and CT time points. For a complete list of the 15 geometric, 17 intensity and 66 texture features refer to the Supplementary Datas 13, respectively. Only ∆features of the contralateral parotid gland were reported, as they performed better than those of the ipsilateral parotid gland. The geometric ∆features were analysed separately from intensity and texture ∆features.

Figure 3
figure 3

The difference between features extracted from the weekly CTs (CTw.) and either the planning (geometric features) or week 1 CT (intensity and texture features) resulted in ∆features per week. Generally, scans were taken at the start of consecutive radiation week.

Clinical variables age and gender were investigated, but not reported as they did not show, to have a significant relation with the development of late xerostomia, which is in line with previous analyses2,26,27.

∆feature selection

To identify the most predictive ∆features, ∆feature variable selection was performed each week. Firstly, ∆features values were normalised by taking the difference between each value and the average across patients, and then dividing by the standard deviation. Secondly, a pre-selection that was based on inter-variable correlation was performed to reduce the number of variables. If the (Pearson) correlation of two variables was larger than 0.80, only the ∆feature with the highest univariable association with the endpoint was selected. Thirdly, stepwise forward selection was used to select the most important predictors (likelihood-ratio test: p-value < 0.01)28. The entire variable selection procedure (normalisation, pre-selection and forward selection) was repeated on 1000 bootstrapped samples (i.e. with replacement), according to the TRIPOD guidelines29. The variable selection frequencies were evaluated to identify the most stable predictive variables per week. The Pearson correlation between the selected ∆features was also investigated.

∆feature: univariable analysis

In order to identify the optimum week for predicting Xer12m with ∆features, the univariable associations were investigated for the selected ∆features per week in the entire cohort.

∆feature, dose and toxicity: multivariable analysis

A reference ‘pre-treatment model’ that was based on baseline xerostomia scores (Xerbaseline; none vs. any) and the contralateral PGdose, was fitted to the current train set2. The prediction performance of the ‘pre-treatment model’ was first compared with that of models based on Xerbaseline and the selected ∆features. Subsequently, the addition of the selected ∆features to the ‘pre-treatment model’ was investigated in terms of significance (likelihood-ratio test) and performance.

The resulting multivariable logistic regression models were tested in the unseen data of the test set. Model discrimination was measured with the area under the receiver operating characteristic curve (AUC) and the discrimination slope. Nagelkerke R2 was used as a measure for explained variance. Ultimately, the final model coefficient estimations were performed on the entire cohort. Model calibration was tested for these final models with the Hosmer–Lemeshow test and by repeating the entire variable selection on 1000 bootstrap samples, and by calculating the average of all resulting linear predictor slopes and intercepts. The coefficients were corrected for optimism according to this internal validation procedure.

Since our previous study showed that acute xerostomia scores significantly improved the ∆feature model 6 weeks after treatment8, we also investigated if the addition of acute toxicity as assessed during treatment to the ∆feature-models improved model performance.

The relationships between the resulting ∆feature predictors, Xerbaseline, PGdose and acute xerostomia scores, were additionally explored with univariable logistic regression.

Parotid gland dose and ∆features

The relationship of the mean contralateral PGdose and the ∆features was investigated with linear regression. Model performance was measured with the coefficient of determination (R2), and normality of the residuals of the regression models was checked.