Diagnostic performances and unnecessary US-FNA rates of various TIRADS after application of equal size thresholds

We compared the diagnostic performances and unnecessary FNA rates of several guidelines and their modified versions using the size threshold of the ACR TIRADS. Our Institutional Review Board approved this retrospective study and waived the requirement for informed consent, and all methods were performed in accordance with the Declaration of Helsinki. A total of 1,384 thyroid nodules in 1,301 patients with definitive cytopathologic findings were included. US categories were assigned according to each guideline. We applied the size threshold suggested by the ACR TIRADS for FNA to the Kwak, ATA and EU guidelines and defined these modified guidelines as the modified Kwak (mKwak), modified ATA (mATA) and modified EU (mEU) guidelines. Diagnostic performances and unnecessary FNA rates of all guidelines were evaluated. Of 1,384 thyroid nodules, 291 (21%) were malignant. Among the original guidelines, the ACR TIRADS had the highest specificity, accuracy, LR and AUC (62.2%, 66%, 2.128 and 0.713). The mKwak, mATA and mEU guidelines had higher specificity, accuracy, LR and AUC (P < 0.001 for all), and fewer unnecessary FNAs, compared with their original guidelines. Among all original and modified guidelines, the mKwak guideline had the highest specificity, accuracy, LR and AUC (64%, 68.6%, 2.389 and 0.75). The unnecessary FNA rate was the lowest with the mKwak guideline (61.1%). The highest sensitivity was observed with the ATA guideline (98.6%). After the size threshold of the ACR TIRADS was incorporated into the other TIRADS, all modified guidelines showed higher diagnostic accuracy and lower unnecessary FNA rates than their original versions. The mKwak guideline showed the best diagnostic performance.

Scientific Reports | (2020) 10:10632 | https://doi.org/10.1038/s41598-020-67543-z | www.nature.com/scientificreports/

[...] proposed by Kwak et al. (Kwak TIRADS) has been proven to be practical and easily applicable in the assessment of thyroid nodules 8,13-20 , and can be performed by simply counting the number of suspicious US features without considering the malignancy probability of each US feature. One recent study compared the diagnostic efficiency of the Kwak and ACR TIRADS and found the former to have higher AUC and accuracy 19 . However, the study did not consider the size threshold for recommending FNA 19 . We assumed that if the two systems have similar diagnostic performances with the same size threshold for thyroid nodules, radiologists and clinicians can choose the more convenient risk stratification system for daily practice.
To find an effective guideline for recommending FNA for thyroid nodules, we investigated the diagnostic performances and unnecessary FNA rates of several guidelines in their original form, and their modified versions using the size threshold proposed by the ACR TIRADS.
Malignancy rates according to categories in the risk stratification systems. Each risk stratification system had significantly different malignancy rates according to categories ([...] Fig. 3). In all modified guidelines, the unnecessary FNA rate decreased compared with the original guidelines when the size threshold of the ACR TIRADS was applied.

Discussion
Currently, many guidelines composed of various TIRADS and size thresholds exist for further work-up such as FNA or follow-up US 3,4,7,11 . However, no universal guideline has been proven to reduce unnecessary FNAs while finding as many thyroid cancers as possible. It has also been difficult to compare the risk stratification systems themselves, as each uses a different size threshold to recommend FNA, although many studies have compared the diagnostic performances and unnecessary FNA rates of these guidelines 12,20-25 . To overcome this problem, we applied the size threshold of the ACR guideline to the Kwak, ATA and EU guidelines by matching the recommended malignancy rates. After the ACR TIRADS size threshold was applied in the modified guidelines, diagnostic ability increased in terms of specificity, accuracy, LR and AUC compared with the original guidelines, and the unnecessary FNA rates were also lower. The mKwak guideline, which incorporated the ACR size threshold, showed the best diagnostic results among the original and modified guidelines in terms of specificity, accuracy, LR and AUC.
Recently, many researchers have demonstrated that the ACR TIRADS had superior diagnostic performance compared with other guidelines and avoided a larger number of unnecessary FNAs (compared with guidelines from the ATA, EU, American Association of Clinical Endocrinologists/American College of Endocrinology/Associazione Medici Endocrinologi, National Comprehensive Cancer Network, French Society of Endocrinology, Society of Radiology in Ultrasound and Korean Thyroid Association/Korean Society of Thyroid Radiology) 12,21-23,25 . Considering that the ACR incorporates a larger size threshold for FNA despite using similar recommended malignancy risks, the better diagnostic ability of the ACR guideline can be explained by the size criteria for FNA and not by the complicated US risk stratification system itself 26 . In this study, the ACR guideline showed better diagnostic accuracy than the original Kwak guideline, which uses a 10 mm size threshold to recommend US-guided FNA (US-FNA) regardless of the number of suspicious US features. However, the mKwak guideline showed higher diagnostic accuracy than the original ACR guideline after the size threshold of the ACR guideline was applied. When the US risk stratification systems of the ACR and Kwak guidelines are compared, the Kwak guideline is more straightforward and practical to use than the ACR guideline, which uses a point system that assigns different weights to individual US features 3,8 . Therefore, a combination of the easier US risk stratification system of the Kwak guideline and the size threshold of the ACR guideline can help clinicians in daily practice. Increasing the size threshold for US-FNA decreased the unnecessary FNA rate in all the guidelines we evaluated, at the cost of lower sensitivity. In our study, the unnecessary FNA rate decreased more than sensitivity did for both the Kwak and EU guidelines.
Size modification reduced the unnecessary FNA rate of the Kwak and EU guidelines by 10.9% and 3.3%, respectively, while reducing sensitivity by 8.9% and 1.4%, respectively. When the ATA and mATA guidelines were compared, sensitivity decreased by 13% and the unnecessary FNA rate decreased by 10% with the mATA guideline. As the only difference between the modified and original guidelines was the size criteria, we can assume that the size threshold proposed by the ACR guideline increased diagnostic accuracy and reduced the unnecessary FNA rates. In one recent study, diagnostic performance and the unnecessary biopsy rate were evaluated with simulations using various nodule size cutoffs applied to the ATA and Korean Thyroid Association/Korean Society of Thyroid Radiology (KTA/KSThR) guidelines 22 . Among the various simulations, the combination of a 15 mm cutoff for intermediate suspicion, a 25 mm cutoff for low suspicion and elimination of FNA for nodules of very low suspicion in the ATA guideline showed the highest specificity and accuracy and the lowest unnecessary biopsy rate 22 . These results suggest that the high specificity and low unnecessary FNA rate of the ACR guideline were due to the larger size cutoff, which is in line with our study results 22 .
There are several limitations to this study. First, 1,244 of the 1,384 thyroid nodules (89.9%) were diagnosed based on cytologic findings alone, which could have resulted in some missed malignancies. We only included the nodules with definitive diagnostic cytopathologic findings (benign or malignant) at US-FNA, core needle biopsy, or surgery. Also, 5.2% (21/396) of the follicular carcinomas were diagnosed after surgery. Thus, a selection bias exists. Second, an experienced radiologist retrospectively re-assigned categories to thyroid nodules according to different risk stratification systems using US features prospectively recorded by 14 radiologists who were familiar with point-scale risk stratification.
When US descriptors were recorded in this study, they could not be defined with exactly the same definitions used in the other original guidelines, an issue which was not considered during data analysis, and this might have led to differences from the final assessments made in real-time examinations. Reassigning categories previously assigned according to the point-scale system to categories based on the pattern-recognition system might also have affected the results of this study. Third, the 14 radiologists performing the prospective image acquisition and analysis had variable levels of experience. Although interobserver variability and consistency are important considerations for choosing appropriate guidelines 27,28 , our study is reflective of actual clinical practice. Fourth, the relatively high malignancy rate of thyroid nodules in our study is probably because we only included thyroid nodules which underwent FNA, which would naturally lead to a higher proportion of malignant nodules. Also, our institution is a tertiary referral center, which in itself contributes to the high malignancy rate of the study population.
In conclusion, application of the larger US-FNA size threshold of the ACR guideline resulted in increased diagnostic accuracy and decreased unnecessary FNA rates at the expense of decreased sensitivity. The mKwak guideline, which is practical and easy to use, showed higher diagnostic accuracy than the other guidelines, both original and modified. Further longitudinal multicenter studies with larger datasets are needed to choose an accurate and effective risk stratification system for daily practice.

Methods
The institutional review board (IRB) of the Yonsei University College of Medicine approved this retrospective study and waived the requirement for informed consent for review of images and medical records. All methods were performed in accordance with the Declaration of Helsinki.
Study cohort. This study covered the period from December 2015 to November 2016, during which 2,179 patients underwent US-FNA to diagnose thyroid nodules at our institution, a tertiary referral center. Among them, a total of 1,704 thyroid nodules in 1,602 patients were 10 mm or larger on US. A total of 320 nodules were excluded because of a lack of definitive cytopathologic results after being initially diagnosed as nondiagnostic (n = 176), atypia or follicular lesion of undetermined significance (n = 110), follicular neoplasm or suspicion of follicular neoplasm (n = 27), or suspicion of malignancy (n = 7). Nodules were included if they had definitive diagnostic cytopathologic findings (benign or malignant) at US-FNA, core needle biopsy, or surgery. Finally, 1,384 thyroid nodules in 1,301 patients were included (Fig. 1).
The mean age of the 1,301 patients was 50.2 ± 13.6 years (range 18-90 years). The mean size of the 1,384 thyroid nodules was 23.2 ± 12.6 mm (range 10-100 mm). Of the total patients, 1,062 (81.6%) were women and 239 (18.4%) were men. Of the total patients, 77 had two nodules and three had three nodules.
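The cohort counts reported above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis; it assumes that every patient not reported as having two or three nodules contributed exactly one nodule, and all variable names are illustrative.

```python
# Counts taken from the Study cohort paragraphs.
nodules_10mm_or_larger = 1704
excluded = {
    "nondiagnostic": 176,
    "atypia or follicular lesion of undetermined significance": 110,
    "follicular neoplasm or suspicion of follicular neoplasm": 27,
    "suspicion of malignancy": 7,
}
included_nodules = nodules_10mm_or_larger - sum(excluded.values())

# Nodules per patient. Assumption: patients not listed as having
# two or three nodules had exactly one nodule each.
patients = 1301
patients_with_two = 77
patients_with_three = 3
patients_with_one = patients - patients_with_two - patients_with_three
nodules_by_patient = (
    patients_with_one + 2 * patients_with_two + 3 * patients_with_three
)

# Malignancy rate among included nodules (291 malignant, from the abstract).
malignant = 291
malignancy_rate = malignant / included_nodules

print(f"excluded nodules: {sum(excluded.values())}")
print(f"included nodules: {included_nodules}")
print(f"nodules counted per patient: {nodules_by_patient}")
print(f"malignancy rate: {malignancy_rate:.1%}")
```

Both routes to the nodule total should agree with the 1,384 nodules stated in the text.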

US examinations.
Thyroid US was performed with a 5-12 MHz linear array transducer (iU22; Philips Medical Systems). US examinations were performed by one of 14 board-certified radiologists (5 faculty members and 9 fellows) with 1-20 years of experience in thyroid imaging. US-FNAs were subsequently performed by the same radiologist who performed the thyroid US examination.
US features of thyroid nodules which underwent US-FNA were prospectively described and recorded in our institutional database at the time of US-FNA by the radiologist who performed the US and US-FNA, according to composition, echogenicity, margin, calcifications, and shape. Composition was classified as solid, predominantly solid, predominantly cystic, spongiform or cystic; echogenicity as hyperechogenicity, isoechogenicity, hypoechogenicity or marked hypoechogenicity; margin as well-defined, microlobulated or irregular; calcification as negative, egg-shell calcification, macrocalcification, microcalcification or mixed calcification; and shape as parallel or nonparallel. At our institution, US findings of solid composition, hypoechogenicity or marked hypoechogenicity, microlobulated or irregular margins, microcalcifications, and nonparallel shape were considered to be suspicious features for malignancy 29 .
Data and statistical analysis. Cytopathology results from FNA and surgery were considered the reference standard. One radiologist (J.Y.K) with 17 years of experience in thyroid imaging, blinded to the patients' clinical data and pathological results, retrospectively re-assigned the TIRADS categories of each thyroid nodule using our institutional database, which was made up of data collected by the radiologists who performed the US-FNAs. Ninety thyroid nodules (6.5%, 90/1,384) that were unspecified according to the ATA guideline, including isoechoic or hyperechoic nodules with suspicious US features 7 , were regarded as intermediate suspicion, as the calculated malignancy rates of these nodules were within the range of 10-20% 30 .
Indications for FNA were based on US features and lesion size according to the various guidelines we used in this study 3,7,11 . A size threshold of 10 mm was used to indicate US-FNA in all thyroid nodules with suspicious US features in the Kwak TIRADS, because the Kwak TIRADS recommends US-FNA when thyroid nodules more than 10 mm in size have suspicious US features rather than applying different size thresholds according to the final assessment category 8,29 . We applied the size criteria of the ACR TIRADS to the Kwak, ATA and EU guidelines according to the similar recommended malignancy risk of each category 3,7,8,11 , and defined the new guidelines as the mKwak, mATA and mEU guidelines, respectively (Supplementary Table S1). [...] strategy was applied for the very low suspicion category of the ATA guideline, which has a recommended risk of malignancy of less than 3% 7 . For mildly suspicious thyroid nodules with a recommended malignancy risk of 5% in the ACR TIRADS, FNA was recommended when the nodule was 25 mm or larger 3 . The same size threshold was applied for nodules of low risk according to the EU guideline, rather than the present size threshold of 20 mm, because the recommended risk of malignancy was 2-4% 11 . The recommended malignancy risk was 5-20% for moderately suspicious nodules in the ACR TIRADS, and FNA was recommended when the nodule was 15 mm or larger 3 . A size threshold of 15 mm was applied instead of 10 mm for nodules of intermediate suspicion according to the ATA guideline, which has a recommended malignancy risk of 10-20% 7 . We also applied the size thresholds proposed by the ACR TIRADS to the Kwak guideline 3,8 : a 25 mm size threshold for category 4a, 15 mm for category 4b and 10 mm for categories 4c and 5. As spongiform nodules and isolated macrocalcifications have no suspicious US feature according to the Kwak TIRADS, they were considered category 3 8 .
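The mKwak rule described above can be illustrated with a short sketch. This is not the authors' code. The size thresholds (25 mm for 4a, 15 mm for 4b, 10 mm for 4c and 5) come from the text; the mapping from the number of suspicious US features to a Kwak category (0 → 3, 1 → 4a, 2 → 4b, 3 or 4 → 4c, 5 → 5) is an assumption based on the published Kwak TIRADS and is not stated in this section. Treating category 3 as not indicated for FNA, and reading the original Kwak threshold as "10 mm or larger", are also assumptions.

```python
def kwak_category(n_suspicious: int) -> str:
    """Map a count of suspicious US features to a Kwak TIRADS category.

    Assumed mapping (not stated in this section): 0 -> 3, 1 -> 4a,
    2 -> 4b, 3 or 4 -> 4c, 5 -> 5.
    """
    if not 0 <= n_suspicious <= 5:
        raise ValueError("n_suspicious must be between 0 and 5")
    if n_suspicious == 0:
        return "3"
    if n_suspicious == 1:
        return "4a"
    if n_suspicious == 2:
        return "4b"
    if n_suspicious in (3, 4):
        return "4c"
    return "5"


# Size thresholds in mm borrowed from the ACR TIRADS, as stated in the text.
MKWAK_THRESHOLD_MM = {"4a": 25, "4b": 15, "4c": 10, "5": 10}


def fna_indicated_kwak(n_suspicious: int, size_mm: float) -> bool:
    """Original Kwak rule: any suspicious feature and size of at least 10 mm."""
    return n_suspicious >= 1 and size_mm >= 10


def fna_indicated_mkwak(n_suspicious: int, size_mm: float) -> bool:
    """Modified Kwak rule: category-specific ACR size thresholds.

    Category 3 has no threshold here, so FNA is never indicated
    (an assumption of this sketch).
    """
    threshold = MKWAK_THRESHOLD_MM.get(kwak_category(n_suspicious))
    return threshold is not None and size_mm >= threshold


# Hypothetical nodule: one suspicious feature, 20 mm in size.
print(fna_indicated_kwak(1, 20), fna_indicated_mkwak(1, 20))
```

A nodule like the hypothetical one above is where the two rules diverge: the original rule indicates FNA, while the modified rule defers it until the nodule reaches 25 mm.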
Thyroid nodules were classified as nodules for which US-FNA was indicated and those for which it was not, according to the FNA criteria provided by each guideline 3,7,8,11 . To compare the demographics between benign and malignant nodules, the independent two-sample t-test was used to compare continuous data including patient age, and the Chi-square test was used to compare categorical data including patient sex. Since some patients had more than one nodule, generalized estimating equations (GEE) were used to compare both continuous and categorical data between benign and malignant nodules. Malignancy rates according to the final assessment by each system were calculated and compared with GEE. We also evaluated diagnostic performances including sensitivity, specificity, accuracy, negative predictive value (NPV), positive predictive value (PPV), likelihood ratio (LR) and area under the receiver operating characteristic curve (AUC) along with 95% confidence intervals (CI). The sensitivity, specificity, accuracy, NPV, PPV and LR were compared with GEE. The DeLong method was used to compare AUC. The unnecessary biopsy rate for the diagnosis of thyroid cancer was defined as the proportion of benign nodules among the biopsy-required nodules. Statistical analysis was performed with SAS software (version 9.4, SAS Institute Inc.). A two-sided P < 0.05 was considered to indicate statistical significance.
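The point estimates listed above follow directly from a 2 × 2 table of FNA indication against the reference standard. The sketch below is illustrative only: the counts are hypothetical (chosen to sum to 291 malignant and 1,093 benign nodules, not taken from the study results), the LR is assumed to be the positive likelihood ratio, and confidence intervals, GEE comparisons and the DeLong test are not reproduced.

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Point estimates from a 2 x 2 table.

    tp: malignant nodules with FNA indicated
    fp: benign nodules with FNA indicated
    fn: malignant nodules without FNA indication
    tn: benign nodules without FNA indication
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        # Assumption: "LR" is the positive likelihood ratio.
        "lr_positive": sensitivity / (1 - specificity),
        # Benign nodules among nodules for which FNA was indicated.
        "unnecessary_fna_rate": fp / (tp + fp),
    }


# Hypothetical counts for illustration only.
metrics = diagnostic_metrics(tp=250, fp=400, fn=41, tn=693)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With this definition the unnecessary FNA rate equals 1 − PPV, which is why a larger size threshold that removes benign nodules from the FNA-indicated group lowers the rate even when some malignant nodules are also deferred.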