Improving the diagnosis of thyroid cancer by machine learning and clinical data

Thyroid cancer is a common endocrine carcinoma that occurs in the thyroid gland. Much effort has been invested in improving its diagnosis, and thyroidectomy remains the primary treatment method. A successful operation without unnecessary side injuries relies on an accurate preoperative diagnosis. Current human assessment of thyroid nodule malignancy is prone to errors and may not guarantee an accurate preoperative diagnosis. This study proposed a machine learning framework to predict thyroid nodule malignancy based on our collected novel clinical dataset. The ten-fold cross-validation, bootstrap analysis, and permutation predictor importance were applied to estimate and interpret the model performance under uncertainty. The comparison between model prediction and expert assessment shows the advantage of our framework over human judgment in predicting thyroid nodule malignancy. Our method is accurate, interpretable, and thus useable as additional evidence in the preoperative diagnosis of thyroid cancer.

Improving the diagnosis of thyroid cancer by machine learning and clinical data Nan Miles Xi 1 , Lin Wang 2 & Chuanjia Yang 3* Thyroid cancer is a common endocrine carcinoma that occurs in the thyroid gland. Much effort has been invested in improving its diagnosis, and thyroidectomy remains the primary treatment method. A successful operation without unnecessary side injuries relies on an accurate preoperative diagnosis. Current human assessment of thyroid nodule malignancy is prone to errors and may not guarantee an accurate preoperative diagnosis. This study proposed a machine learning framework to predict thyroid nodule malignancy based on our collected novel clinical dataset. The ten-fold cross-validation, bootstrap analysis, and permutation predictor importance were applied to estimate and interpret the model performance under uncertainty. The comparison between model prediction and expert assessment shows the advantage of our framework over human judgment in predicting thyroid nodule malignancy. Our method is accurate, interpretable, and thus useable as additional evidence in the preoperative diagnosis of thyroid cancer.
Thyroid cancer is the most frequent endocrine malignancy and represents about 2.5% of all new cancer cases in the United States 1 . According to NIH's Surveillance, Epidemiology, and End Results Program (SEER), the occurrence of thyroid cancer has increased by 5.5% annually from 2005 to 2015 2 . In the United States, the annual incidence rate stood at 14.1 per 100,000 between 2014 and 2018. The annual death rate is 0.5 per 100,000 between 2015 and 2019. The American Cancer Society estimated that there would be 43,800 new cases and 2,230 deaths caused by thyroid cancer in 2022 3 . Among thyroid cancers, 96% originate from follicular cells, and of these, 99% are differentiated thyroid cancer (DTC) 4 . The treatment methods for DTCs, especially papillary thyroid cancer (PTC), mainly include surgery, TSH suppressive therapy with levothyroxine, and radioactive iodine remnant ablation 5 . While individualized treatment depends on the nature of the lesion, surgical operation remains the primary tool at present 6 . One focus of the surgical operation is to distinguish between benign and malignant thyroid nodules. An accurate preoperative diagnosis is conducive to a smooth operation, avoids unnecessary side injuries, and reduces the risk of post-operative recurrence 7 . It also helps the selection of comprehensive post-surgery treatment to extend the survival period. Therefore, it is crucial to make accurate diagnoses and predictions based on thyroid ultrasound, blood tests, and other basic clinical information.
To date, the diagnosis of malignant nodules largely relies on the clinical experience of surgeons and radiologists 8 . In many cases, human judgment is time-consuming and prone to error. Accurate and explainable predictive models are urgently needed to assist medical decisions and reduce labor work. Previous studies have built statistical models to predict the occurrence of malignant nodules based on various datasets 9-12 . Those models mainly utilized descriptive statistics or logistic regression, which ignored the complex, nonlinear relationship among clinical and demographical variables. Other machine learning-based models only provided the point estimation of the model performance without considering the uncertainty in the model prediction 13 . More seriously, no study has compared the diagnostic accuracy between model prediction and expert assessment. The lack of such comparisons makes it difficult to evaluate the advantage of using predictive models to assist the diagnosis of malignant nodules.
In this paper, we proposed a comprehensive machine learning framework to predict nodule malignancy accurately. We collected a novel clinical dataset containing 724 patients with 1232 nodules. We trained six cutting-edge machine learning models on this dataset and estimated their unbiased prediction performance by ten-fold cross-validation. The uncertainty of model performance was further quantified by bootstrap analysis. We identified the important variables through the analysis of normalized permutation predictor importance. Finally, we compared the model performance to expert assessment and demonstrated the advantage of the  Fig. 1. In general, the proposed machine learning models exhibited high prediction accuracy and diverse capacity to identify benign or malignant nodules. The result is consistent under both point estimation and model uncertainty analysis. Many of the identified important variables confirm similar findings in previous studies. The best-performed models outperformed the expert assessment by a large margin on the same dataset, which indicates the benefits of using machine learning models to improve the preoperative diagnosis of nodule malignancy.

Data collection and preprocessing
The present study was approved by the Medical Research Ethics Committee of China Medical University. Methods performed in the study were in accordance with the Declaration of Helsinki and relevant guidelines. The informed consent was obtained from all subjects involved. The dataset used in this study was collected from 724 patients who were admitted to Shengjing Hospital of China Median University between 2010 and 2012. All patients underwent thyroidectomy, and their nodule malignancy, demographic information, ultrasound features, and blood test results were recorded in the dataset. We observed one or multiple nodules located in each patient in three areas, i.e., left lobe, right lobe, and isthmus. If the patient had multiple nodules in one area, we kept the largest one in the dataset. After removing missing values, there are 1232 nodules and 19 variables in the dataset. The descriptive statistics of those nodules and variables are described in Table 1.
The average age of patients is 46.61, with a range of 13-82. There are 200 male-patient-affiliated nodules (16.23%) and 1032 female-patient-affiliated nodules (83.77%). We recorded five thyroid function tests, including free triiodothyronine (FT3), free thyroxine (FT4), thyroid-stimulating hormone (TSH), thyroid peroxidase antibodies (TPO), and thyroglobulin antibodies (TgAb). All thyroids are categorized into even (89.12%) and uneven (10.88%) based on their echogenicity. There are 11 variables that describe the characteristics of nodules obtained from ultrasound: (1) size is defined as the maximum between the length and width of each nodule; (2) location is either the left lobe, right lobe, or isthmus for each nodule; (3) multifocality indicates if there are multiple nodules identified in one location; (4) shape describes the nodule's regularity; (5) margin characterizes if the nodule has a clear or unclear margin; (6) calcification suggests the existence of nodule calcification; (7) echogenicity describes the nodule's ability to bounce echoes in ultrasound; (8) blood flow is defined as normal or enriched for each nodule; (9) nodule's composition is categorized into cystic, mixed, or solid; (10) laterality demonstrates if the nodule has counterparts in other locations of the same patient; (11) malignancy is determined by examining the nodule specimen after thyroidectomy. The workflow diagram of this study. First, the clinical information of 724 patients and their 1232 nodules were collected and preprocessed from medical records. Second, six cutting-edge machine learning models were trained on the dataset. Third, the model prediction performance was measured by five metrics under ten-fold cross-validation. Forth, the model uncertainty and important variables were further analyzed by bootstrap and permutation predictor importance. Finally, the model prediction was compared with expert assessment.

Methods
We utilized gradient boosting machine (GBM) 14 , logistic regression, linear discriminant analysis (LDA) 15 , support vector machine (SVM) with radial or linear kernel 16 , and random forest 17 to train six machine learning models to predict the nodule malignancy based on the dataset described in the last section. The malignancy was treated as the response in the predictive models, and the other 18 variables were predictors. We conducted ten-fold cross-validation to obtain an unbiased estimation of prediction accuracy 18 . First, we randomly split the patients into ten groups. Second, one patient group was selected, and its affiliated nodules were used to generate the test set. Nodules of the other nine patient groups were treated as the training set. Third, we trained the machine learning models on the training set and then predicted the nodule malignancy in the test set. Finally, we performed the same process until every patient group and their affiliated nodules were predicted by the machine learning models. We repeated the ten-fold cross-validation by ten times to reduce the variability introduced by random splitting. Figure 2 shows the pseudocode of model training, predicting, and ten-fold cross-validation. We compared the model prediction with the true nodule malignancy to evaluate the model performance. For each model, we calculated the accuracy, area under the receiver operating characteristic (AUROC), sensitivity, specificity, and precision. The accuracy is the proportion of correct predictions among all nodules in the dataset. The AUROC measures the overall diagnostic ability of a binary predictive model as its discrimination threshold is varied 19 . The sensitivity represents the proportion of malignant nodules that are correctly predicted as malignant. The specificity represents the proportion of benign nodules that are correctly predicted as benign. The precision is defined as the proportion of true malignant nodules among those predicted as malignant. These five measurements together provide a comprehensive summary of the diagnostic capacity of predictive models. We implemented the model training and evaluation by the R programming language 20 . Figure 2. The pseudocode of model training, predicting, and ten-fold cross-validation. The ten-fold crossvalidation was repeated ten times to reduce the variability introduced by random splitting.

Results
Overall model performance. Table 2 summarizes the accuracy, AUROC, sensitivity, specificity, and precision of six machine learning models calculated under ten-fold cross-validation. All the measurements are averaged across ten repetitions described in the Methods section. Among all models, random forest achieves the highest prediction accuracy (0.7931) and AUROC (0.8541), which indicates a solid overall capacity to differentiate between benign and malignant nodules. The GBM model outperforms others in terms of sensitivity (0.8750). The high sensitivity of the GBM model shows its strongness in finding malignant nodules. On the other hand, logistic regression has advantages in specificity (0.6806) and precision (0.8384). Unlike the GBM model, logistic regression is more capable of identifying benign nodules. Also, the high precision of logistic regression means that among its predicted malignant nodules, a large proportion is truly malignant. Overall, machine learning models exhibit mixed performance in predicting nodule malignancy. There is no single model dominating others on all five measurements. We can choose different models according to the specific requirement in diagnosing malignant nodules.
Model uncertainty measurement. The five measurements in Table 2 are point estimations of the model performance. To further understand the uncertainty of the model prediction, we conducted a bootstrap to construct the empirical distributions of the five model performance measurements 18 . In each step of the ten-fold cross-validation, we resampled with replacement from the original training set to generate a bootstrap training set. Then we trained machine learning models on this bootstrap training set and evaluated its prediction performance on the test set. We repeated this resampling 1000 times and followed the previous model training and evaluation process to obtain the empirical distributions of prediction accuracy, AUROC, sensitivity, specificity, and precision. Figure 3 and Table 3 compare those five empirical distributions and their summary statistics. The six machine learning models show a similar asymptotical performance ranking compared with their point estimation in Table 2. The SVM with linear kernel has the highest median prediction accuracy, followed by random forest and SVM with radials kernel. The random forest and GBM outperform other models on AUROC, and the GBM also shows advantages in terms of sensitivity. The logistic regression is the best-performed model measured by precision and specificity. Although the two SVM models perform relatively well in general, they introduce low accuracy in all five measurements, as reflected by the small outliers in Fig. 3. The performance difference among the six models is relatively small on the accuracy and AUROC, the two general accuracy measurements. However, the gaps are more significant if measured by precision, sensitivity, and precision, indicating diverse model behavior in the differentiation of malignant and benign nodules.
Variable importance analysis. We conducted a variable importance analysis to examine the impact of nodule characteristics on the model performance. We utilized the permutation predictor importance to measure the contribution of each variable to the model prediction 17 . The permutation predictor importance of one variable is defined as the decrease of the AUROC when that variable's value is randomly shuffled. Since the random shuffling breaks the relationship between the variables (characteristics of patients and nodules) and response (nodule malignancy), any decrease in AUROC indicates the model's dependency on that shuffled variable. Using permutation predictor importance has three advantages. First, its calculation does not rely on the specific model form. Second, the predictive model only needs to be trained once. Third, the random shuffling can be repeated multiple times to reduce the variability in the calculation.
For each variable, we averaged its permutation predictor importance across all six models. Then we normalized each average by their maximum among all variables to obtain the final normalized permutation predictor importance. Figure 4 shows the top ten variables ranked by their normalized permutation predictor importance. Calcification has the most substantial impact on the prediction of nodule malignancy, followed by laterality, blood flow, and location. The composition and size have similar predictor importance, less than half of the top variable calcification. The shape of nodules is the last-tier important variable, with only 20% importance as the calcification. Other variables have significantly less impact on the model performance.
In addition to identifying the important variables, we further explored how they would impact the prediction of nodule malignancy. Figure 5 shows the percentage of malignant nodules corresponding to each value of the top-six important variables. A large percentage (close to 100%) indicates that the specific value of that variable is an indicator of malignant nodules. A small percentage (close to 0%) indicates that the specific value of the Table 2. The model prediction performance measured by five measurements. Each measurement was calculated under ten-fold cross-validation and then averaged across ten repetitions. The highest values among the six models are underscored.

Model
Accuracy   Comparison between model prediction and expert assessment. One objective of building machine learning models is to assist the diagnosis of malignant thyroid nodules before surgery. To evaluate how well the proposed models fulfilled this objective, we performed a comparative analysis between model prediction and expert assessment. First, we removed the true nodule malignancy from the original dataset to create a blind dataset. Second, we provided the blind dataset to three surgeons specializing in thyroid cancer. The surgeons were then asked to assess the malignancy of all 1232 nodules in the dataset. Two of the three surgeons are from Shengjing Hospital of China Medical University, and one is from China -Japan Union Hospital of Jilin University. All three surgeons own more than ten years of clinical experience. Third, we adopted the majority vote among three surgeons' judgments as the final prediction for each nodule. Finally, we compared the prediction results from the surgeons with the ones from the machine learning models. It is worth noting that the surgeons and models have the same access to the variables in Table 1. Therefore, the comparison is fair. The following text will refer to the three surgeons as the experts. Table 4 compares the five prediction measurements between the expert assessment and random forest. We choose random forest since it has the highest prediction accuracy, AUROC, and the second-highest sensitivity, specificity, and precision (Table 2). Unlike the model prediction, the expert assessment does not provide the probability of benign or malignant. Therefore, the AUROC cannot be used in this comparison. Instead, we use F1 score, the harmonic mean of the precision and sensitivity, to evaluate the overall diagnosis 21 . The F1 score measures the model's overall capacity to identify malignant nodules. We found that the random forest outperformed expert assessment on the accuracy, F1 score, and sensitivity, with leading margins of 11%, 12%, and 24%, respectively. The measurements expert assessment shows advantages are specificity and precision, with leading margins of 15% and 3%, respectively.
The comparison results in Table 4 demonstrate the model's superior predictive performance over the expert assessment. In summary, the random forest is (1) more accurate than the expert in general (higher accuracy);  www.nature.com/scientificreports/ (2) more capable of finding malignant nodules from the dataset (higher sensitivity and F1 score). On the other hand, the experts tend to find benign nodules more efficiently (higher specificity). Even though the expert assessment is slightly more accurate among predicted malignant nodules (higher precision), the higher F1 score of the random forest indicates its overall stronger capacity in identifying malignant nodules.
To understand the predictive behavior of the experts and random forest, we compared their confusion matrices in Table 5. Among the correct predictions (diagonal elements), the random forest found more malignant nodules than the expert (714 vs. 508). In contrast, the experts identified more benign nodules than the random forest (335 vs. 272). This result echoes Table 4, where random forest shows high sensitivity (true malignant rate) and experts show high specificity (true benign rate). Among the wrong predictions (off-diagonal elements), the random forest overestimated more nodules' malignancy than the expert (141 vs. 78). However, the experts underestimated more nodules' malignancy than the random forest (311 vs. 105). This comparison implies an opposite predictive behavior -the experts are more conservative in predicting nodules as malignant, while the random forest is more aggressive in the prediction. The errors caused by aggressive prediction are less than those caused by conservative prediction, making the random forest more accurate than the expert assessment in general.

Discussion
In this study, we utilized machine learning methods to improve the diagnosis of malignant thyroid nodules. We collected a real dataset of 724 patients' demographic and clinical information. The dataset contains 1232 nodules from those patients with their true malignancy confirmed by thyroidectomy. Based on this dataset, we built six machine learning models to predict the malignancy of thyroid nodules. We used five measurements to provide a comprehensive evaluation of the model performance. Although no single model outperforms others among all measurements, the decision-tree-based nonlinear models, i.e., random forest and GBM, exhibit better overall diagnostic accuracy (measured by accuracy and AUROC) and the capacity to identify malignant nodules (measured by sensitivity). Similar model performance is observed in both point estimation and uncertainty measurement (Tables 2 and 3). The linear predictive model, logistic regression, is good at finding benign nodules (measured by specificity). Consequently, random forest and GBM are more suitable for early cancer screening, but at the expense of false malignant diagnosis. Interestingly, the logistic regression made the most correct predictions among predicted malignant nodules, which is shown by the highest precision in point estimation and uncertainty measurement. Thus, the logistic regression can be ensembled with random forest or GMB to improve the model's diagnostic capacity.
Overall, the machine models exhibit satisfactory prediction performance. The average accuracy and AUROC of the six models are 0.78 and 0.85, respectively. In practice, an AUROC greater than 0.8 indicates excellent discrimination between binary outcomes 19 . One encouraging result of our study is the superior model performance over the expert assessment. The best-performed model, random forest, beat the expert assessment by 11% on accuracy, and 12% on F1 score, the two general measurements. One interpretation of better prediction by machine learning models is that they are able to capture the complex nonlinear relationships among different variables. Such relationships are implicitly contained in the dataset and are challenging for humans to identify. The models are also more aggressive in predicting nodules as malignant. As a result, the machine learning model is valuable for diagnosing thyroid cancer.
Our variable importance analysis identified key variables in diagnosing malignant thyroid nodules (Figs. 4 and 5). Those variables are consistent with previous findings in clinical and modeling studies. For example, it has been recognized that the existence of calcification, cystic composition, and large nodule size are strong indicators of malignant nodules [9][10][11][12][13] . Such consensus is confirmed in our analysis. Regarding other important variables, there are some debates about the role of blood flow in the diagnosis of malignant nodules 22 . Our study suggests that the nodules with enriched blood flow are more likely to be malignant. Laterality, one of the important variables, was largely ignored in previous studies. We find that the unilateral nodules have a high possibility of becoming malignant. We suspect the potential reason is that in the ultrasound examination, if nodules are observed in multiple locations (left lobe, right lobe, or isthmus), only those with a high likelihood of being malignant are recorded. Therefore, there tend to be more unilateral malignant nodules in the dataset. The model also recognizes the nodule location as an important variable. However, being located at the isthmus reduces the possibility of Table 5. The confusion matrices of expert assessment and random forest. The expert assessment is the majority vote of three surgeons' judgments. The diagonal elements are the correct predictions. The off-diagonal elements are the wrong predictions. www.nature.com/scientificreports/ malignancy, which contradicts previous literature 23 . More data from larger patient cohorts would be necessary to further investigate the true impact of nodule location. Several topics are worth exploration in future. First, the current dataset contains 724 patients with 1232 nodules. Although the sample size is not small, collecting more data will increase the diversity of patients and nodules. The model trained on a more extensive and diverse dataset would generalize better to new patients when deployed in real-world scenarios. Second, the ultrasound-related variables in our dataset were extracted by sonographers from the original ultrasonography. Such extraction may omit important features only detectible in the raw images. A deep convolutional neural network, the backbone of modern artificial intelligent systems, can be applied to catch those features directly from the ultrasonography and would improve the model performance 24 . Third, we can encompass novel techniques beyond traditional ultrasound and blood tests into the modeling process. For example, contrast-enhanced ultrasound (CEUS) and ultrasound elastography (USE) could potentially enhance diagnosis through better evaluation of nodule perfusion and vascularity 25 . Additionally, single-cell RNA-sequencing (scRNA-seq) can be applied to reveal the transcriptomes of thyroid nodules and identify new prognostic molecular biomarkers [26][27][28][29][30] . The gene expression dynamic associated with nodule malignancy will potentially increase the model performance, similar to the progress made by scRNA-seq in other cancer diagnoses and precision medicine 31,32 . Finally, the machine learning framework used in this study can evaluate the quality of clinical data for thyroid cancer diagnosis 33 . A machine learning model can be trained on datasets collected from different studies. A high-quality dataset is expected to contain enough information for models accurately predict nodule malignancy. Therefore, the model prediction accuracy will serve as a proxy for the data quality of different datasets.

Data availability
The data used in this study is available at Zenodo repository: https:// doi. org/ 10. 5281/ zenodo. 64654 36. The source code that implemented the result in this study is available at GitHub repository: https:// github. com/ xnnba 1984/ Thyro id-Cancer.