Introduction

Thyroid ultrasonography (US) is now regularly performed in clinical practice, and thyroid nodules are exceedingly common on US, with as many as 68% of adults having one, which leads to overdiagnosis and overtreatment1,2. Many guidelines recommend fine-needle aspiration (FNA) based on risk stratification systems that use different US features and even different size thresholds3,4,5,6,7. Current risk stratification systems using US features can be broadly divided into two types: the point-scale Thyroid Imaging Reporting and Data System (TIRADS) suggested by Kwak et al. 8, Park et al. 9 and the American College of Radiology (ACR)3, and the pattern-recognition TIRADS suggested by Horvath et al. 10, the 2015 American Thyroid Association (ATA)7 and the European Thyroid Association (EU)11. Different size criteria have been suggested by the ATA guideline, ACR TIRADS and EU TIRADS3,7,11. Although there are many guidelines for recommending FNA for thyroid nodules on US, a universally accepted system does not presently exist.

Recently, Grani et al. 12 demonstrated that the ACR TIRADS reduced unnecessary FNAs more than other international guidelines with a very low false-negative rate (2.2%, 6/268). The ACR TIRADS suggests a higher size threshold for FNA than other guidelines while still assigning similar malignancy risks to each final assessment category3,7,11, and this higher size threshold is thought to explain the decrease in unnecessary FNAs3. However, physicians may need more time to classify a nodule on US when using the ACR TIRADS because each US feature is weighted differently3. On the other hand, another point-scale risk stratification system, proposed by Kwak et al. (Kwak TIRADS), has been shown to be practical and easily applicable in the assessment of thyroid nodules8,13,14,15,16,17,18,19,20, and can be applied by simply counting the number of suspicious US features without considering the malignancy probability of each US feature. One recent study compared the diagnostic efficiency of the Kwak and ACR TIRADS and found the former to have higher AUC and accuracy19. However, that study did not consider the size threshold for recommending FNA19. We assumed that if the two systems have similar diagnostic performances with the same size threshold for thyroid nodules, radiologists and clinicians can choose the more convenient risk stratification system for daily practice.

To find an effective guideline for recommending FNA for thyroid nodules, we investigated the diagnostic performances and unnecessary FNA rates of several guidelines in their original form, and their modified versions using the size threshold proposed by the ACR TIRADS.

Results

Baseline clinicopathological characteristics

Of 1,384 thyroid nodules, 1,093 (79%) were benign and 291 (21%) were malignant (Fig. 1, Table 1). Surgery was performed for 397 nodules (28.7%), 10 nodules (0.7%) were diagnosed by core needle biopsy and the remaining 977 (70.6%) nodules were diagnosed by cytologic findings from FNA. Among the 397 nodules which underwent surgery, 264 (66.5%, 264/397) were diagnosed as malignant and 133 (33.5%, 133/397) as benign. The malignant nodules comprised 234 papillary thyroid carcinomas (197 conventional, 33 follicular, 2 solid, 1 columnar and 1 oncocytic variant), 21 minimally invasive follicular carcinomas, 5 medullary carcinomas, 3 anaplastic carcinomas and 1 metastatic nasopharyngeal carcinoma. The most frequently excised benign nodules were follicular adenoma (n = 70) followed by adenomatous hyperplasia (n = 59), Hurthle cell adenoma (n = 3), and fibrotic nodule (n = 1). Demographics and US features of the patients and nodules are summarized in Table 1. The mean age was significantly higher in patients with benign nodules (51.1 ± 13.4 years; range, 18–90 years) than in patients with malignant nodules (47 ± 13.7 years; range, 18–85 years) (P < 0.001). Malignant thyroid nodules were significantly smaller than benign nodules (mean diameter 20.3 ± 12.9 mm and 24 ± 12.3 mm, respectively) (P < 0.001). The malignant thyroid nodules had significantly higher rates of solid composition, hypoechogenicity or marked hypoechogenicity, microlobulated or irregular margins, microcalcifications or mixed calcifications, and nonparallel shape than benign nodules (P < 0.001 for all).

Figure 1

Diagram of the study cohort. FNA fine-needle aspiration, US ultrasonography.

Table 1 Demographics of patients and nodules.

Malignancy rates according to categories in the risk stratification systems

Malignancy rates differed significantly across the categories of each risk stratification system (Table 2, P < 0.001 for all). With the ACR and EU TIRADS, the observed malignancy rates fell within the recommended risk ranges for most categories, the exceptions being not suspicious lesions (category 2) of ACR TIRADS and low risk lesions (category 3) of EU TIRADS. With the ATA guideline, all categories except nodules of intermediate suspicion (category 4) were outside the recommended range.

Table 2 Comparison of Malignancy Rates with Several Risk Stratification Systems.

Diagnostic performances of the guidelines

Among the original guidelines we evaluated, the ACR TIRADS had the highest specificity, accuracy, LR and AUC (62.2%, 66%, 2.128 and 0.713, respectively) (P < 0.001 for all, Tables 3 and 4, Figs. 2 and 3), followed by the Kwak guideline (35%, 47.5%, 1.458 and 0.649, respectively), the EU guideline (28.1%, 42.2%, 1.324 and 0.616, respectively) and the ATA guideline (19.9%, 36.4%, 1.231 and 0.592, respectively). Sensitivity was the highest with the ATA guideline (98.6%) and the lowest with the ACR guideline (80.4%; P = 0.011 comparing the ATA and Kwak guidelines, P = 0.001 comparing the ATA and EU guidelines, P < 0.001 for the other guidelines).

Table 3 Diagnostic Performances of the Four Guidelines and their Modified Guidelines.
Table 4 Comparison of Diagnostic Performances of the Four Guidelines and their Modified Guidelines.
Figure 2

Receiver operating characteristic curves of the four guidelines and their modified guidelines. The modified Kwak (mKwak), modified ATA (mATA) and modified EU (mEU) guidelines incorporated the size threshold suggested by the ACR guideline. ACR American College of Radiology3, Kwak Kwak et al.'s study8, ATA American Thyroid Association7, EU European Thyroid Association11.

Figure 3

Diagnostic performances of the four guidelines and their modified guidelines. The modified Kwak (mKwak), modified ATA (mATA) and modified EU (mEU) guidelines incorporated the size threshold suggested by the ACR guideline. ACR American College of Radiology3, Kwak Kwak et al.'s study8, ATA American Thyroid Association7, EU European Thyroid Association11.

When the size threshold of the ACR TIRADS was applied to the original TIRADS, the diagnostic ability increased in terms of specificity, accuracy, LR and AUC for all guidelines (Tables 3 and 4, Figs. 2 and 3). The modified Kwak (mKwak) guideline had a specificity of 64%, accuracy of 68.6%, LR of 2.389 and AUC of 0.75, while the Kwak guideline had a specificity of 35%, accuracy of 47.5%, LR of 1.458 and AUC of 0.649 (P < 0.001 for all). The modified ATA (mATA) guideline had a specificity of 57.2%, accuracy of 63.2%, LR of 1.998 and AUC of 0.714, while the original ATA guideline had a specificity of 19.9%, accuracy of 36.4%, LR of 1.231 and AUC of 0.592 (P < 0.001 for all). The modified EU (mEU) guideline had a specificity of 40.1%, accuracy of 51.4%, LR of 1.565 and AUC of 0.669, while the EU guideline had a specificity of 28.1%, accuracy of 42.2%, LR of 1.324 and AUC of 0.616 (P < 0.001 for all). However, the sensitivities of the modified guidelines were lower than those of their original versions. The sensitivity of the original guidelines was 94.8%, 98.6% and 95.2% for the Kwak, ATA and EU guidelines, respectively, while the modified versions showed a sensitivity of 85.9%, 85.6% and 93.8% for the mKwak, mATA and mEU guidelines, respectively. Among all the original and modified guidelines, the mKwak guideline had the highest specificity, accuracy, LR and AUC (64%, 68.6%, 2.389 and 0.75, respectively) (P = 0.014 for the comparison of specificity with the ACR guideline and P < 0.001 for the others).

The unnecessary FNA rate was the lowest with the mKwak guideline (61.1%, 393/643) followed by the ACR (63.8%, 413/647), mATA (65.3%, 468/717), mEU (70.6%, 655/928), Kwak (72%, 711/987), EU (73.9%, 786/1,063) and ATA guidelines (75.3%, 876/1,163) (Table 5, Fig. 3). In all modified guidelines, the unnecessary FNA rate decreased compared with the original guidelines when the size threshold of the ACR TIRADS was applied.
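The rates above follow directly from the reported counts. As a minimal sketch (the variable and function names are illustrative and not part of the study's analysis, which was performed in SAS), the percentages and their ordering can be reproduced as follows:

```python
# Counts reported in the text for each guideline:
# (benign nodules among FNA-indicated nodules, all FNA-indicated nodules)
fna_counts = {
    "mKwak": (393, 643),
    "ACR": (413, 647),
    "mATA": (468, 717),
    "mEU": (655, 928),
    "Kwak": (711, 987),
    "EU": (786, 1063),
    "ATA": (876, 1163),
}


def unnecessary_fna_rate(benign_indicated, total_indicated):
    """Percentage of FNA-indicated nodules that turned out to be benign."""
    return 100.0 * benign_indicated / total_indicated


rates = {name: unnecessary_fna_rate(*counts) for name, counts in fna_counts.items()}

# Rank guidelines from the lowest to the highest unnecessary FNA rate.
ranked = sorted(rates, key=rates.get)
for name in ranked:
    print(f"{name}: {rates[name]:.1f}%")
```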

Table 5 Unnecessary Fine-needle Aspiration Rates.

Discussion

Currently, many guidelines composed of various TIRADS and size thresholds exist for further work-up such as FNA or follow-up US3,4,7,11. However, there has been no proven universal guideline proposed to reduce unnecessary FNAs and to find as many thyroid cancers as possible. It has also been difficult to compare the risk stratification systems themselves as each uses a different size threshold to recommend FNA although many studies have compared the diagnostic performances and unnecessary FNA rates of these guidelines12,20,21,22,23,24,25. To overcome this problem, we applied the size threshold of the ACR guideline to the Kwak, ATA and EU guidelines by matching the recommended malignancy rates. After applying the ACR TIRADS size threshold in the modified guidelines, diagnostic ability increased in terms of specificity, accuracy, LR and AUC compared with the original guidelines and the unnecessary FNA rates were also lower. The mKwak guideline which incorporated the ACR size threshold showed the best diagnostic results among the original and modified guidelines in terms of specificity, accuracy, LR and AUC.

Recently, many researchers demonstrated that the ACR TIRADS had superior diagnostic performance compared to other guidelines and reduced a larger number of unnecessary FNAs (compared with guidelines from the ATA, EU, American Association of Clinical Endocrinologists/American College of Endocrinology/Associazione Medici Endocrinologi, National Comprehensive Cancer Network, French Society of Endocrinology, Society of Radiology in Ultrasound and Korean Thyroid Association/Korean Society of Thyroid Radiology)12,21,22,23,25. Considering that the ACR incorporates a larger size threshold for FNA despite using similar recommended malignancy risks, the better diagnostic ability of the ACR guideline can be explained by the size criteria for FNA and not the complicated US risk stratification system itself26. In this study, the ACR guideline showed better diagnostic accuracy than the original Kwak guideline, which uses a 10 mm size threshold to recommend US-guided FNA (US-FNA) regardless of the number of suspicious US features. However, the mKwak guideline showed higher diagnostic accuracy than the original ACR guideline after the size threshold of the ACR guideline was applied. When the US risk stratification systems of the ACR and Kwak guidelines are compared, the Kwak guideline is more straightforward and practical to use than the ACR guideline, which assigns different points to individual US features because they carry different weights3,8. Therefore, a combination of the easier US risk stratification system of the Kwak guideline and the size threshold of the ACR guideline can help clinicians in daily practice.

Increasing the size threshold for US-FNA decreased the unnecessary FNA rate in all the guidelines we evaluated, at the cost of lower sensitivity. In our study, the unnecessary FNA rate decreased more than sensitivity did for both the Kwak and EU guidelines. Size modification reduced the unnecessary FNA rate of the Kwak and EU guidelines by 10.9 and 3.3 percentage points, respectively, while reducing sensitivity by 8.9 and 1.4 percentage points, respectively. When the ATA and mATA guidelines were compared, sensitivity decreased by 13 percentage points and the unnecessary FNA rate decreased by 10 percentage points with the mATA guideline. As the only difference between the modified and original guidelines was the size criteria, we can assume that the size threshold proposed by the ACR guideline increased diagnostic accuracy and reduced the unnecessary FNA rates. In one recent study, diagnostic performance and the unnecessary biopsy rate were evaluated with simulations using various nodule size cutoffs applied to the ATA and Korean Thyroid Association/Korean Society of Thyroid Radiology guidelines (KTA/KSThR)22. Among the various simulations, the 15 mm cutoff for intermediate suspicion, 25 mm cutoff for low suspicion and eliminating FNA for nodules of very low suspicion in the ATA guideline showed the highest specificity and accuracy and the lowest unnecessary biopsy rate22. These results suggest that the high specificity and low unnecessary FNA rate of the ACR guideline were due to the larger size cutoff, which is in line with our study results22.
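The trade-off described above can be checked with simple arithmetic on the percentages reported in the Results. The following is a minimal sketch; the dictionary layout and the helper name `drop` are illustrative only, and the differences are absolute differences between rounded published percentages:

```python
# (original guideline, modified guideline) percentages reported in the Results.
reported = {
    "Kwak": {"sensitivity": (94.8, 85.9), "unnecessary_fna": (72.0, 61.1)},
    "ATA": {"sensitivity": (98.6, 85.6), "unnecessary_fna": (75.3, 65.3)},
    "EU": {"sensitivity": (95.2, 93.8), "unnecessary_fna": (73.9, 70.6)},
}


def drop(pair):
    """Absolute decrease from the original to the modified guideline."""
    original, modified = pair
    return round(original - modified, 1)


changes = {
    name: {metric: drop(pair) for metric, pair in metrics.items()}
    for name, metrics in reported.items()
}

for name, change in changes.items():
    print(
        f"{name}: sensitivity -{change['sensitivity']}, "
        f"unnecessary FNA rate -{change['unnecessary_fna']}"
    )
```

For the Kwak and EU guidelines the decrease in the unnecessary FNA rate exceeds the decrease in sensitivity, whereas for the ATA guideline the decrease in sensitivity is the larger of the two.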

There are several limitations to this study. First, 1,244 of the 1,384 thyroid nodules (89.9%) were diagnosed based on cytologic findings alone, which could have resulted in some missed malignancies. We only included the nodules with definitive diagnostic cytopathologic findings (benign or malignant) at US-FNA, core needle biopsy, or surgery. Also, 5.2% (21/396) of the follicular carcinomas were diagnosed after surgery. Thus, a selection bias exists. Second, an experienced radiologist retrospectively re-assigned categories to thyroid nodules according to different risk stratification systems using US features prospectively recorded by 14 radiologists who were familiar with point-scale risk stratification. When US descriptors were recorded in this study, they could not be defined with the exact same definitions used in the other original guidelines, an issue which was not considered during data analysis, and this might have led to differences from the final assessments made in real-time examinations. Reassigning categories previously assigned according to the point-scale system to categories based on the pattern-recognition system might have also affected the results of this study. Third, the 14 radiologists performing the prospective imaging acquisition and analysis had variable levels of experience. Although interobserver variability and consistency are important considerations for choosing appropriate guidelines27,28, our study is reflective of actual clinical practice. Fourth, the relatively high malignancy rate of thyroid nodules in our study is probably because we only included thyroid nodules which underwent FNA, which would naturally lead to a higher number of malignant nodules. Also, our institution is a tertiary referral center, which itself contributes to the high malignancy rate of the study population.

In conclusion, application of the larger US-FNA size threshold of the ACR guideline resulted in increased diagnostic accuracy and decreased unnecessary FNA rates at the expense of decreased sensitivity. The mKwak guideline, which is practical and easy to use, showed higher diagnostic accuracy than the other guidelines, both original and modified. Further longitudinal multicenter studies with larger datasets are needed to choose an accurate and effective risk stratification system for daily practice.

Methods

The institutional review board (IRB) of the Yonsei University College of Medicine approved this retrospective study, and the requirement for informed consent for review of images and medical records was waived. All methods were performed in accordance with the Declaration of Helsinki.

Study cohort

This study was performed from December 2015 to November 2016, during which 2,179 patients underwent US-FNA to diagnose thyroid nodules at our institution, a tertiary referral center. Among them, a total of 1,704 thyroid nodules in 1,602 patients were 10 mm or larger on US. A total of 320 nodules were excluded because of a lack of definitive cytopathologic results after being initially diagnosed as nondiagnostic (n = 176), atypia or follicular lesion of undetermined significance (n = 110), follicular neoplasm or suspicion of follicular neoplasm (n = 27), or suspicion of malignancy (n = 7). Nodules were included if they had definitive diagnostic cytopathologic findings (benign or malignant) at US-FNA, core needle biopsy, or surgery. Finally, 1,384 thyroid nodules in 1,301 patients were included (Fig. 1).

The mean age of the 1,301 patients was 50.2 ± 13.6 years (range 18–90 years). The mean size of the 1,384 thyroid nodules was 23.2 ± 12.6 mm (range 10–100 mm). Of the total patients, 1,062 (81.6%) were women and 239 (18.4%) were men; 77 patients had two nodules and three had three nodules.

US examinations

Thyroid US was performed with a 5–12 MHz linear array transducer (iU22; Philips Medical Systems). US examinations were performed by one of 14 board-certified radiologists (5 faculty members and 9 fellows) with 1–20 years of experience in thyroid imaging. US-FNAs were subsequently performed by the same radiologist who performed the thyroid US examination.

US features of thyroid nodules which underwent US-FNA were prospectively described and recorded in our institutional database at the time of US-FNA by the radiologist who performed the US and US-FNA, according to composition, echogenicity, margin, calcifications, and shape. Composition was classified as solid, predominantly solid, predominantly cystic, spongiform nodule, or cyst. Echogenicity was classified as hyperechogenicity, isoechogenicity, hypoechogenicity, or marked hypoechogenicity. Margin was classified as well-defined, microlobulated, or irregular. Calcification was classified as negative, egg-shell calcification, macrocalcification, microcalcification, or mixed calcification. Shape was classified as parallel or nonparallel. At our institution, US findings of solid composition, hypoechogenicity or marked hypoechogenicity, microlobulated or irregular margins, microcalcifications, and nonparallel shape were considered to be suspicious features for malignancy29.

Data and statistical analysis

Cytopathology results from FNA and surgery were considered the reference standard. One radiologist (J.Y.K) with 17 years of experience in thyroid imaging, blinded to the patients' clinical data and pathological results, retrospectively re-assigned the TIRADS categories of each thyroid nodule using our institutional database, which was made up of data collected by the radiologists who performed the US-FNAs. Ninety thyroid nodules (6.5%, 90/1,384) unspecified according to the ATA guideline, including isoechoic or hyperechoic nodules with suspicious US features7, were regarded as intermediate suspicion30, as the calculated malignancy rates of these nodules were within the range of 10–20%.

Indications for FNA were based on US features and lesion size according to the various guidelines we used in this study3,7,11. A size threshold of 10 mm was used to indicate US-FNA in all thyroid nodules with suspicious US features in the Kwak TIRADS because the Kwak TIRADS recommends US-FNA when thyroid nodules more than 10 mm in size have suspicious US features rather than applying different size thresholds according to the final assessment category8,29. We applied the size criteria of the ACR TIRADS to the Kwak, ATA and EU guidelines according to the similar recommended malignancy risk of each category3,7,8,11, and defined the new guidelines as the mKwak, mATA and mEU guidelines, respectively (Supplementary Table S1 online). The ACR TIRADS3 recommends no FNA for not suspicious thyroid nodules, which have a recommended risk of malignancy of 2%. The same strategy was applied for the very low suspicion category of the ATA guideline7, which has a recommended risk of malignancy of less than 3%. For mildly suspicious thyroid nodules with a recommended malignancy risk of 5% in the ACR TIRADS, FNA was recommended when the nodule was 25 mm or larger3. The same size threshold was applied for nodules of low risk according to the EU guideline11, rather than the present size threshold of 20 mm, because the recommended risk of malignancy was 2–4%. The recommended malignancy risk was 5–20% for moderately suspicious nodules in the ACR TIRADS, and FNA was recommended when the nodule was 15 mm or larger3. A size threshold of 15 mm was applied instead of 10 mm for nodules of intermediate suspicion according to the ATA guideline7, which have a recommended malignancy risk of 10–20%. We also applied the size thresholds proposed by the ACR TIRADS to the Kwak guideline3,8: a 25 mm size threshold for category 4a, 15 mm for category 4b and 10 mm for categories 4c and 5. According to the Kwak TIRADS8, spongiform nodules and isolated macrocalcifications have no suspicious US features, and they were therefore considered category 3.
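The size-based rule for the mKwak guideline can be sketched as a simple lookup. This is an illustration only: the function and constant names are hypothetical, the thresholds are those stated above for categories 4a, 4b, 4c and 5, and treating any other category (for example category 3) as not indicated for FNA is an assumption, since the full criteria are given in Supplementary Table S1 rather than in the text.

```python
# Size thresholds (mm) stated in the text for the modified Kwak (mKwak) guideline.
MKWAK_THRESHOLD_MM = {"4a": 25, "4b": 15, "4c": 10, "5": 10}


def mkwak_fna_indicated(category, size_mm):
    """Return True if US-FNA would be indicated under the sketched mKwak rule.

    Assumption: categories without a threshold in the text (e.g. "3")
    are treated as not indicated for FNA.
    """
    threshold = MKWAK_THRESHOLD_MM.get(category)
    if threshold is None:
        return False
    # "25 mm or larger" style thresholds are inclusive.
    return size_mm >= threshold


# Example: a 20 mm category 4a nodule falls below the 25 mm threshold,
# while a 20 mm category 4b nodule exceeds the 15 mm threshold.
print(mkwak_fna_indicated("4a", 20))
print(mkwak_fna_indicated("4b", 20))
```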

Thyroid nodules were classified as nodules for which US-FNA was indicated and those for which it was not, according to the FNA criteria provided by each guideline3,7,8,11.

To compare the demographics between benign and malignant nodules, the independent two-sample t-test was used to compare continuous data including patient age, and the Chi-square test was used to compare categorical data including patient sex. Since some patients had more than one nodule, generalized estimating equations (GEE) were used to compare both continuous and categorical data between benign and malignant nodules. Malignancy rates according to the final assessment by each system were calculated and compared with GEE. We also evaluated diagnostic performances including sensitivity, specificity, accuracy, negative predictive value (NPV), positive predictive value (PPV), likelihood ratio (LR) and area under the receiver operating characteristic curve (AUC) along with 95% confidence intervals (CI). The sensitivity, specificity, accuracy, NPV, PPV and LR were compared with GEE. The DeLong method was used to compare AUC. The unnecessary biopsy rate for the diagnosis of thyroid cancer was defined as the proportion of benign nodules among the biopsy-required nodules. Statistical analysis was performed with SAS software (version 9.4, SAS Inc.). A two-sided P < 0.05 was considered to indicate statistical significance.
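The point estimates of these measures follow from a 2 × 2 table of FNA indication against the final diagnosis. The sketch below is illustrative and is not the SAS analysis used in the study: the function name is hypothetical, confidence intervals and GEE comparisons are omitted, and LR is computed as the positive likelihood ratio, an assumption that reproduces the reported value. The example reconstructs the mKwak table from counts given in the Results (643 FNA-indicated nodules, of which 393 were benign, among 291 malignant and 1,093 benign nodules overall).

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Point estimates from a 2x2 table (FNA indicated vs. final diagnosis)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        # Assumption: LR is the positive likelihood ratio.
        "lr_positive": sensitivity / (1 - specificity),
        # Benign nodules among the nodules for which FNA was indicated.
        "unnecessary_fna_rate": fp / (tp + fp),
    }


# mKwak guideline, reconstructed from counts reported in the Results.
malignant_total, benign_total = 291, 1093
indicated_total, indicated_benign = 643, 393

fp = indicated_benign
tp = indicated_total - indicated_benign
fn = malignant_total - tp
tn = benign_total - fp

mkwak = diagnostic_metrics(tp, fp, fn, tn)
for name, value in mkwak.items():
    print(f"{name}: {value:.3f}")
```

With these counts the sketch gives a sensitivity of 85.9%, specificity of 64.0%, accuracy of 68.6% and LR of 2.389, matching the values reported for the mKwak guideline.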