Introduction

Radiation therapy is a cancer treatment that involves delivering a high dose of radiation to a target volume in the body while minimizing exposure to critical organs to prevent complications. Early and late adverse effects are significant factors when planning radiation therapy due to their impact on the patients' health and daily lives post-treatment. In head and neck malignancies in particularly, the risk and severity of late effects are influenced by the total dose received and the organ exposed1. The thyroid gland is highly susceptible to complications because it is radiosensitive and its location anterior to the neck region exposes the organ to radiation beams. This makes hypothyroidism one of the most common side effects of radiation therapy in head and neck tumors that occurs in 15% to 48% of the patients2,3.

Currently, laboratory test results and clinical symptoms are used to diagnose radiation-induced hypothyroidism (RIH). Lertbusayanukul et al.4 validated a prior report of dose factors in hypothyroidism after intensity-modulated radiation treatment (IMRT) in patients with nasopharyngeal carcinoma (NPC), reporting that a thyroid-stimulating hormone (TSH) level of greater than 1.55 μU/ml and a thyroid volume spared from 60 Gy radiation (VS60) of less than 10 cm3 were important predictors. Similarly, Peng et al.5 showed that the pre-treatment thyroid volume and the percentage of thyroid volume exposed to 30–60 Gy radiation (V30,60) can predict RIH with moderate area under the receiver operating characteristic curve (AUCs) of 0.64. Therefore, an effective predictive model is needed to improve treatment planning and reduce the occurrence of RIH.

Medical imaging provides essential information about the pathophysiology of the tumor that is useful for cancer treatment planning, monitoring, and post-therapy evaluation. In addition to qualitative visual inspection of radiological images by clinicians, quantitative numerical data can also be extracted from radiological images via radiomics approaches. Radiomics features describes various metrics of the shape, size, and heterogeneity of the tumor that can complement clinical and dosimetric data when developing predictive machine learning models for clinical applications. Past studies shown the benefits of radiomics information on the prediction of locoregional recurrence, response to treatment, survival, and complications in head and neck cancer patients6. Khadija et al.7 reported that radiomics features from pre-treatment computed tomography (CT) and magnetic resonance (MR) imagings of salivary glands might be able to reflect the functional states of the glands and are predictive of post-radiation xerostomia. Hence, the incorporation of radiomics features of the thyroid gland should improve the ability to predict RIH compared to conventional methods that rely on only clinical and dosimetric data.

In this study, the occurrence of RIH in nasopharyngeal cancer patients within 2 years after treatment were predicted using radiomics features from pre-treatment contrast-enhanced CT images together with clinical and dosimetric data. Conventional models using only clinical and dosimetric data were compared to the combined models that utilized radiomics features to evaluate the benefits of radiomics features on RIH prediction. Multiple machine learning models were explored to assess the reproducibility of the findings. The results can lead to the development of a pre-treatment planning tool that helps clinicians optimize radiation dosage to manage the risk of RIH.

Results

Table 1 summarizes the demographic and clinical characteristics of the 220 nasopharyngeal cancer patients, 106 (48.18%) of which developed RIH within 2 years after radiation therapy. The average age was 48.28 ± 11.71 years, with the majority being male (72.27%). Most patients were of clinical staged 3 (51.82%), N stage 2 (51.82%), and T stages 2 or 1 (33.18% and 31.36%, respectively). Two pre-treatment clinical variables, the level of thyroid-stimulating hormone (TSH) and the thyroid volume, were significantly different between the patients with and without RIH (adjusted Mann–Whitney U test p-values < 0.05). The average levels of TSH were 2.68 ± 7.08 μU/ml in patients with RIH and 1.83 ± 2.30 μU/ml in patients without RIH. The average volumes of the thyroid glands were 13.23 ± 6.43 cm3 in patients with RIH and 15.06 ± 7.07 cm3 in patients without RIH. There was no significant difference in dosimetric variables between the two groups of patients as shown in Table 2.

Table 1 Patient characteristics.
Table 2 Dosimetric parameters.

A total of 1,288 radiomics features were extracted from pre-treatment contrast-enhanced CT images and categorized into four classes: shape, first-order statistics, texture, and filter-based. The robustness of each radiomics features to the variation in regions of interest drawn by different radiation oncologists were evaluated using intraclass correlation (ICC). Filtering based on ICC values greater than 0.5 and 0.75 reduced the number of radiomics features to 838 and 1026, respectively (Supplementary Table 1). Univariate analysis of radiomics features indicated that while the values of several features significantly differ between patients with and without RIH, they are only moderately predictive of RIH (Supplementary Table 2, AUC = 0.64–0.65). Highly predictive radiomics features include the wavelet-HLL_glcm_MaximumProbability, log-sigma-1-0-mm-3D_ngtdm_Coarseness, wavelet-LLH_ngtdm_Strength, and wavelet-LLH_ngtdm_Strength.

Five machine learning models were trained and cross-validated on different combinations of clinical, dosimetric, and radiomics data (Table 3). The areas under the receiver operating characteristic curve (AUC) of the models that incorporated radiomics data were compared to the performance of the conventional models to assess the benefits of radiomics. In all cases, the addition of radiomics data significantly improved the validation AUCs (adjusted signed rank test p-values < 0.05). The combined logistic regression model which incorporated all data types achieved a validation AUC of 0.80 ± 0.06 compared to the AUCs of up to 0.68 ± 0.07 when only clinical and dosimetric data were used. Similarly, the combined random forest model achieved a validation AUC of 0.81 ± 0.06 compared to the AUCs of up to 0.71 ± 0.06 when only clinical and dosimetric data were used. The highest performing models based on support vector machine (SVM) with radial basis kernel, extreme gradient boosting (XGBoost), and adaptive boosting (AdaBoost) achieved slightly lower validation AUCs of 0.77 ± 0.06, 0.77 ± 0.05, and 0.78 ± 0.05, respectively, but still significantly outperformed the model variants that utilized only clinical and dosimetric data.

Table 3 Model performance on the training and validation sets.

The top performing models, namely the combined logistic regression model and the combined random forest model, were than evaluated on a held-out test dataset (Fig. 1 and Table 4). The combined models again outperformed the models variants that utilized only clinical and dosimetric data in almost all metrics. The only exception is sensitivity where the combined models did not achieve the best performances. It should be noted that although the combined logistic regression model and the combined random forest model achieved similar AUCs (0.72 and 0.74, respectively), they have different tradeoffs. While the combined random forest model achieved higher sensitivity in the high specificity range (> 0.70), the opposite is true in the intermediate specificity range (Fig. 1c). The confusion matrices also suggested that the combined logistic regression model tends to produce slightly more false positives than false negatives while the combined random forest model behaves in the opposite manner (Fig. 2). Hence, multiple metrics should be considered when selecting the best model and cutoff value.

Figure 1
figure 1

Performances of the logistic regression and random forest models. (a) Distributions of the areas under the receiver operating characteristic curve (AUCs) for the logistic regression models trained on different input data combination. (b) Similar visualization for the AUCs of the random forest models. (c) The receiver operating characteristic curves of the best combined models on the held-out test dataset.

Table 4 Model performance on the held-out test dataset.
Figure 2
figure 2

Confusion matrices for the best combined models on the held-out test dataset. (a) The combined logistic regression model. (b) The combined random forest model.

During model development, features that did not strongly contribute to the prediction were iteratively removed. Starting from more than 800 input features, the final combined logistic regression model consisted of three clinical features, namely bilateral neck metastasis status, pre-treatment TSH level, and age, one dosimetric feature, namely the percentage of thyroid volume that received at least 40 (TR V40), and 26 radiomics features (Fig. 3). The final combined random forest model contains one clinical feature, namely pre-treatment TSH level, and one dosimetric feature, namely the mean dose to thyroid (TR mean), and 34 radiomics features.

Figure 3
figure 3

Coefficients and importance of features selected by either the best combined logistic regression model (top) or the best combined random forest model (bottom). Red bars indicate clinical and dosimetric features.

Discussion

This study aimed to develop a new predictive model for radiation-induced hypothyroidism (RIH) using the combination of clinical, dosimetric, and pre-treatment CT radiomics data and to evaluate the benefits of radiomics data in this context. The results showed that the combined models which incorporated all three data types significantly outperformed the conventional models that utilized only clinical and dosimetric data. In good agreement with prior studies4,8,9,10, the best combined models assigned high importance to the pre-treatment TSH level, age, and metastasis status which have been implicated in RIH. Here, younger age, positive nodes, and high pre-treatment TSH levels were associated with a higher risk of developing RIH. Among dosimetric features, TR V40 and TR mean selected by the best combined models had been reported as good predictors for RIH10,11. Moreover, the best combined models (validation AUCs of 0.80–0.81 and test AUCs of 0.72–0.74) also compared favorably to recent normal tissue complication probability (NTCP) models12,13 which achieved an AUC of only 0.58 on this cohort. These NTCP models considered only age, TR mean, and thyroid volume as predictors.

The majority of radiomics features selected by the best combined models came from the texture and filter groups, such as log-sigma_ngtdm_contrast, log-sigma_ngtdm_coarseness, and wavelet-HHL_glszm_SmallAreaEmphasis. Texture features describe the spatial distribution of voxel intensity levels in a region of interest as fine, coarse, grainy, or smooth, which may reflect the pathophysiological status of the organs or malignancies. Ishibashi N et al.14 reported that decreased thyroid gland CT intensity might indicate a higher risk of RIH. As subtle changes in the textures of the thyroid may not be discernable by human eyes, radiomics was required to obtain a more precise assessment (Fig. 4).

Figure 4
figure 4

CT images of the thyroid gland with different radiomics feature values. (a) Thyroid gland in a patient who developed RIH. (b) Thyroid gland in a patient without RIH.

Although radiomics has been successfully applied to enhance the prediction of radiation-induced complications in various studies6,7,15, there are also reports that the incorporation of radiomics features did not significantly improve the prediction of RIH some cancer types, such as oropharyngeal cancer16. One possible reason for the contradictory findings could be due to the difference in etiologies, which would influence the association between radiomics features and the clinical status of the tumors. However, our study is not without limitations. First, the retrospective design of the study did not allow for a complete standardization of the CT imaging acquisition protocols. Moreover, RIH requires a long follow-up period which resulted in a relatively small sample size.

In conclusion, while the mechanism of RIH remains unclear, ionizing radiation is expected to damage the thyroid gland, altering the morphology, vessel structure, and immune response of the organ. Our results demonstrated that the combination of information from pre-treatment CT imaging with clinical and dosimetric data significantly improves the prediction of RIH across many performance metrics. We contend that these findings will lead to the development of a better pre-treatment planning tool that helps radiation oncologists optimize dose constraints on the thyroid gland to reduce the risk of hypothyroidism.

Methods

Population and sample

This study is retrospective and has been approved by the Ethic Committee at the Faculty of Medicine of Chulalongkorn University (IRB No. 745/61). The need for written informed consent was waived by the same Ethic Committee at the Faculty of Medicine, Chulalongkorn University. All methods were performed in accordance with relevant guidelines and regulations. A total of 220 patients with nasopharyngeal cancer (NPC) whose hypothyroidism statuses were confirmed within 2 years after radiation therapy treatment by the Division of Radiation Oncology, Department of Radiology, King Chulalongkorn Memorial Hospital during the period 2010–2020 were included. The inclusion criteria were age of at least 18 years, treated with definite RT (IMRT or VMAT) with or without chemotherapy, and received a radiation dose of 70 Gy in 33–35 fractions. The patients must also have a normal baseline thyroid function, and no history of pre-existing thyroid disease, thyroid surgery, or radiation therapy in the neck.

CT image acquisition

All patients underwent a CT simulation before radiation therapy treatment. A 64 detector-row CT simulator was used to acquire CT images (Revolution CT; GE Healthcare, Chicago, IL, USA). Acquisition protocols included a non-contrast phase and a contrast-enhanced phase in helical mode at 120 kV with smart mA by 2.5 mm slice thickness.

Thyroid segmentation

Manual 3D segmentation of the thyroid gland was performed by radiation oncologists. The region of interest (ROI) covering the thyroid gland in contrast-enhanced CT images were drawn using the Eclipse Contouring software (Varian Medical System, Inc: version 15.5). To assess the robustness of each radiomics feature, an intraclass correlation (ICC) test was performed by having three radiologists segment the same set of thirty randomly selected patients. Robust radiomics features should maintain the same values across ROIs delineated by different radiologists for the same CT image. ICC cutoffs of 0.5 and 0.75 were then applied to select robust radiomics features.

Radiomics features extraction

Radiomics features were extracted from contrast-enhanced CT images using the PyRadiomics package (version 3.0) through the 3D slicer software (version 4.11.2)17,18. There were 14 shape-based features, 18 first-order statistics features, 73 texture-based features, and 1183 filter-based features. The bin width parameter was varied between 0.05, 0.1, 0.15, and 0.2.

Clinical and dosimetric data collection

Clinical variables were collected from the hospital information system of the King Chulalongkorn Memorial Hospital. Dosimetric variables were calculated from the dose volume histogram, namely V40, V50, V60, Pit50, Pit55 (Vx: Percentage of thyroid volume that has received at least × Gy radiation, Pitx: Percentage of pituitary volume that has received at least × Gy radiation), VS40, VS50, VS60 (VSx: Percentage of thyroid volume preserved from × Gy of radiation), the mean dose of thyroid and pituitary gland, the maximum dose of thyroid and pituitary gland, and the minimum dose of thyroid and pituitary gland dose, via the treatment planning system.

Model development

The dataset was first divided into training (80%) and testing (20%). Then during model development, the training set was further divided into 5 equal partitions to perform a fivefold cross-validation. The fivefold cross-validation process was repeated 20 times with different random partitioning of the training set. Five machine learning model families, namely regularized logistic regression, random forest, support vector machine (SVM), gradient boosting trees (XGBoost), and adaptive boosting (AdaBoost) were developed through the scikit-learn and xgboost packages19,20. Multiple combinations of clinical data, dosimetric data, and radiomics data were considered as inputs. For each initial set of input features, recursive feature elimination was performed to iteratively remove unimportant features that do not strongly contribute to the prediction. The area under the receiver operating characteristic curve (AUC) was used to rank the performance of the models. Figure 5 summarized data acquisition and model development workflow.

Figure 5
figure 5

Data acquisition and model development workflow.

Statistical analysis

The mean and standard deviation (SD) values were calculated for continuous variables, while the counts and percentages were used to summarize categorical features. The differences in feature values between patients with and without RIH were evaluated using the Mann–Whitney U tests and Chi-square tests. The difference in predictive performance of the models were evaluated using signed rank tests. The Benjamini–Hochberg procedure was performed to control for multiple testing. A p-value cutoff of 0.05 was set to define statistical significance.