Malignancy risk stratification of thyroid nodules: comparisons of four ultrasound Thyroid Imaging Reporting and Data Systems in surgically resected nodules

This study aimed to compare the efficiency of four ultrasound (US) Thyroid Imaging Reporting and Data Systems (TI-RADS) in malignancy risk stratification of surgically resected thyroid nodules (TNs). The study included 547 benign TNs and 464 malignant TNs. US images of the TNs were retrospectively reviewed and categorized according to the TI-RADSs published by Horvath et al. (TI-RADS H), Park et al. (TI-RADS P), Kwak et al. (TI-RADS K) and Russ et al. (TI-RADS R). The diagnostic performances of the four TI-RADSs were then compared. At multivariate analysis, among the suspicious US features, marked hypoechogenicity was the most significant independent predictor of malignancy (OR: 15.344, 95% CI: 5.313-44.313; P < 0.05). Sensitivity was higher for TI-RADS H, TI-RADS K and TI-RADS R than for TI-RADS P (P < 0.05 for all), whereas the specificity, accuracy and area under the ROC curve (Az) of TI-RADS P were the highest (all P < 0.05). Specificity, accuracy and Az were higher for TI-RADS K than for TI-RADS R (P = 0.003). With its higher sensitivity, TI-RADS K, a simple predictive model, is practical and convenient for the management of TNs in clinical practice. The study indicates that there is good concordance between TI-RADS categories and histopathology.

(BI-RADS) 18 . The latter has been widely used as a standard method to describe mammographic and US features of breast lesions and to correlate them with breast malignancies. In 2011, Kwak et al. 12 developed a risk stratification method for thyroid malignancy according to the number of suspicious US features, including solid composition, hypoechogenicity, marked hypoechogenicity, microlobulated or irregular margins, microcalcifications, and taller-than-wide shape. In the same year, Russ et al. 13 established their TI-RADS classification and proposed an equation for predicting the probability of malignancy in TNs with and without elastography 19 . Nonetheless, these studies [10][11][12][13] share an inherent limitation: they used FNAC as the gold standard. FNAC diagnosis includes a percentage of undetermined lesions (Bethesda categories III, IV and V) whose final results (benign or malignant) are questionable, since surgery is not performed on all of them [20][21][22] . Because of sampling errors, cytological examination cannot replace pathological diagnosis. Given this uncertainty, a validation study against a surgical reference standard is needed to confirm the utility of the four previously published TI-RADS classifications in clinical practice. Therefore, we performed this retrospective study on a surgical series of 1011 TNs, with the aim of comparing the efficiencies of the four TI-RADS classifications in malignancy risk stratification of TNs, which would provide evidence for selecting an appropriate system under specific circumstances.

Materials and Methods
This retrospective study was approved by our institutional review board and the requirement for informed consent from the patients was waived. The study was performed in accordance with relevant regulations.
Patients. From September 2015 to December 2016, a consecutive series of 1140 patients with TNs underwent thyroid US examination and surgery in this referral hospital. The exclusion criteria were as follows: (a) patients with incomplete US information (103 nodules); (b) nodules with undetermined pathological results (26 nodules). For patients with multiple nodules, we selected the nodule most suspicious for malignancy at US. When no nodule was suspicious for malignancy, the largest one was evaluated. Finally, the study group consisted of 1011 pathologically proven nodules in 1011 patients (768 women and 243 men; mean age, 51.0 years ± 13.7; age range, 13-84 years). The diameter of the nodules ranged from 4.0 to 92.0 mm (mean, 18.4 mm ± 13.3).

Conventional US. Conventional US was performed with Siemens S2000 (Siemens Medical Solutions, Mountain View, CA, USA; 5-14 MHz linear transducer), IU22 (Philips Medical Systems, Bothell, WA, USA; 5-12 MHz linear transducer) or Logiq E9 (GE Medical Systems, Milwaukee, WI, USA; 6-15 MHz linear transducer) instruments by three board-certified radiologists with more than 3 years of experience in thyroid US. All the US examinations complied with the same protocol for thyroid scanning. The patient lay in the supine position, with the neck on a high pad. Conventional US images of the thyroid nodule were acquired by carefully scanning the thyroid and adjacent tissues both transversely and longitudinally. The US machine settings, such as gain, focus, depth, time gain compensation, dynamic range, wall filter and color gain, were adjusted until good-quality US images were obtained. Conventional transverse, longitudinal and color Doppler US images were stored for each target nodule and recorded on the internal hard disk for further off-line analysis. The nodule's size was defined as the maximal diameter at US. Images of lymphadenopathy, when present, were also stored.
Image Interpretation. One of two radiologists who were not involved in image capture (with 6 and 13 years of experience in thyroid US, respectively) reviewed the US images and assigned TI-RADS categories independently. The two reviewers were blinded to the patients' medical information, including previous imaging results and histopathological results. They were first asked to read the four TI-RADSs carefully until they understood them, and then assessed the US characteristics as defined by the original authors. The two radiologists then agreed on a baseline consensus lexicon for TI-RADS and US characteristics, including location, composition, echogenicity, echostructure, margin, calcifications, shape, vascularization, halo sign, capsule and cervical lymph node (Fig. 1). Location was categorized as right, left and isthmus.
Composition was classified as solid (completely solid), predominantly solid (cystic portion ≤50%), predominantly cystic (cystic portion >50%) 11,12 and spongiform (aggregation of multiple microcystic components in more than 50% of the nodule) according to the ratio of the cystic portion to the solid portion in the nodule 10,13 . Echogenicity was classified as hyper-, iso- or hypoechogenicity (compared with the normal thyroid gland) or marked hypoechogenicity (lower echogenicity than the adjacent strap muscle) [11][12][13] . Echostructure was categorized according to whether the nodule echo was homogeneous or not. Heterogeneous echotexture was defined as mixed echogenicity due to the aggregation of multiple microcystic components intervening in the solid component 11 . Margin was classified as well circumscribed, microlobulated (presence of many small lobules on the surface of the nodule) or irregular, and infiltrative (poorly defined margin with the adjacent glandular structure) 11 . Calcifications were categorized as microcalcifications (≤1 mm in diameter, visualized with or without acoustic shadows), macrocalcifications (>1 mm in diameter, or rim calcification) 12 , mixed calcification (presence of microcalcifications and macrocalcifications at the same time) 23 , hyperechoic spot (tiny bright reflectors with a clear-cut comet-tail artifact at conventional US) 10,12,13 , and no calcification. Kwak et al. 12 regarded a nodule with both types of calcification as having microcalcification, whereas Park et al. 11 defined microcalcifications as calcifications equal to or less than 0.5 mm in diameter. Shape was categorized as taller than wide (greater in its anteroposterior dimension than in its transverse dimension) or wider than tall [10][11][12][13] .
Vascularization was classified as avascular, hypovascularized (poor blood flow signal), hypervascularized (highly vascularized on color Doppler) or penetrating vessels (no vessels visualized in the interior of the lesion, only afferent vessels that penetrate it) 10 . Halo sign, defined as a hypoechoic rim around a nodule, was classified as absent halo, partial halo or complete fine halo 11 . Capsule was defined as circinate hyperechogenicity around a nodule 10 .
Cervical lymph node was classified as normal or lymphadenopathy, the latter including lymph nodes with a minimal diameter > 6.0 mm or nodes with an absent hyperechoic hilum 10,11 .
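Several of the lexicon definitions above are simple threshold rules, and a short sketch may help make them concrete. The Python functions below are only an illustration under stated assumptions: the function names, the boolean inputs (for example `spongiform` and `has_hyperechoic_hilum`) and the representation of calcifications as a list of diameters are hypothetical and are not part of the study's software. The hyperechoic spot (comet-tail) category is deliberately left out because it cannot be derived from a diameter alone.

```python
from typing import Sequence


def classify_composition(cystic_fraction: float, spongiform: bool = False) -> str:
    """Composition from the cystic fraction of the nodule (0.0 to 1.0).

    `spongiform` is a hypothetical flag set by the reader when microcystic
    components occupy more than 50% of the nodule.
    """
    if not 0.0 <= cystic_fraction <= 1.0:
        raise ValueError("cystic_fraction must lie between 0 and 1")
    if spongiform:
        return "spongiform"
    if cystic_fraction == 0.0:
        return "solid"
    if cystic_fraction <= 0.5:
        return "predominantly solid"
    return "predominantly cystic"


def classify_calcification(
    diameters_mm: Sequence[float],
    rim: bool = False,
    micro_threshold_mm: float = 1.0,  # Park et al. used 0.5 mm instead
) -> str:
    """Calcification type from the diameters of the visible calcifications."""
    has_micro = any(d <= micro_threshold_mm for d in diameters_mm)
    has_macro = rim or any(d > micro_threshold_mm for d in diameters_mm)
    if has_micro and has_macro:
        return "mixed calcification"
    if has_micro:
        return "microcalcification"
    if has_macro:
        return "macrocalcification"
    return "no calcification"


def classify_shape(anteroposterior_mm: float, transverse_mm: float) -> str:
    """Taller than wide when the anteroposterior dimension is the greater one."""
    if anteroposterior_mm > transverse_mm:
        return "taller than wide"
    return "wider than tall"


def is_lymphadenopathy(min_diameter_mm: float, has_hyperechoic_hilum: bool) -> bool:
    """Lymphadenopathy: minimal diameter > 6.0 mm or absent hyperechoic hilum."""
    return min_diameter_mm > 6.0 or not has_hyperechoic_hilum
```

For example, `classify_calcification([0.4, 2.0])` returns "mixed calcification" under the 1 mm threshold used in this study.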
The TI-RADS categories were previously reported by Horvath et al. 10 , Park et al. 11 , Kwak et al. 12 and Russ et al. 13 . The classifications of the different TI-RADS categories are summarized in Table 1.

Statistical analysis.
Statistical analyses were performed with SPSS software for Windows (version 20.0; Chicago, IL, USA) and MedCalc software (version 15.2, Mariakerke, Belgium). The independent two-sample t test was used to compare continuous data, including patient age and nodule size. The chi-square test was used to compare categorical data, including US features and patient sex. With adjustment for all variables, multivariate logistic regression analysis was performed to determine independent predictors of malignancy among the US characteristics that showed statistical significance. Odds ratios (ORs) with corresponding 95% confidence intervals (CIs) were also calculated to determine the relevance of all potential predictors of malignancy. The cut-off value for each TI-RADS, together with the corresponding sensitivity and specificity, was obtained from receiver operating characteristic (ROC) analysis at the maximum Youden index. Positive predictive value (PPV), negative predictive value (NPV) and accuracy were calculated from the 2 × 2 contingency tables of the diagnostic test. ROC curve analysis was performed to assess diagnostic performance. Sensitivity and specificity were compared with the McNemar test. The Z test was applied to compare the areas under the ROC curves (Azs). Statistical significance was determined at a P value of less than 0.05.
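The cut-off selection described above can be sketched in a few lines of Python. This is a minimal illustration and not the SPSS or MedCalc procedure used in the study: the function names and the category counts in the usage example are hypothetical, and the sketch assumes that categories are ordered from lowest to highest risk and that a nodule is called positive when its category is at or above the cut-off. The Youden index is J = sensitivity + specificity - 1.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Metrics:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float
    accuracy: float


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN instead of raising when a margin of the 2 x 2 table is empty.
    return numerator / denominator if denominator else float("nan")


def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> Metrics:
    """Diagnostic metrics from a 2 x 2 contingency table."""
    return Metrics(
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),
        npv=_ratio(tn, tn + fn),
        accuracy=_ratio(tp + tn, tp + fp + fn + tn),
    )


def best_cutoff(counts: List[Tuple[str, int, int]]) -> Tuple[str, float, Metrics]:
    """Pick the category cut-off that maximizes the Youden index.

    `counts` holds (category, n_benign, n_malignant), ordered from the
    lowest-risk to the highest-risk category.
    """
    best = None
    for i, (category, _, _) in enumerate(counts):
        fp = sum(benign for _, benign, _ in counts[i:])
        tp = sum(malignant for _, _, malignant in counts[i:])
        tn = sum(benign for _, benign, _ in counts[:i])
        fn = sum(malignant for _, _, malignant in counts[:i])
        m = metrics_from_counts(tp, fp, fn, tn)
        youden = m.sensitivity + m.specificity - 1.0
        if best is None or youden > best[1]:
            best = (category, youden, m)
    if best is None:
        raise ValueError("counts must not be empty")
    return best


# Hypothetical counts for illustration only; these are not the study data.
example_counts = [("3", 90, 10), ("4a", 60, 40), ("4b", 30, 70), ("5", 20, 180)]
cutoff, youden_index, metrics = best_cutoff(example_counts)
```

With these made-up counts the selected cut-off is category 4b, where sensitivity is 250/300 and specificity is 150/200.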

Results
Of the 1011 TNs included in this study, 547 (54.1%) were diagnosed as benign and the remaining 464 (45.9%) were diagnosed as malignant. Mean age of the patients with nodules diagnosed as malignant was significantly   Table 3).
The malignancy rates differed significantly among the categories of each of the four TI-RADSs (P < 0.001 for all). The malignancy rates of all TI-RADS categories were within the recommended ranges, except for TI-RADS P 2, TI-RADS K 3, TI-RADS R 3 and TI-RADS R 4a (Table 4). The correlation coefficients between category and malignancy rate for the four TI-RADSs were 0.712, 0.731, 0.775 and 0.733, respectively.
The categories were dichotomized into positive and negative findings for FNA using the cut-off values, and the diagnostic performances of the four TI-RADSs are listed in Table 5. Sensitivity and negative predictive value were higher for TI-RADS H, TI-RADS K and TI-RADS R than for TI-RADS P (P < 0.05 for all), whereas there were no statistically significant differences among these three systems (P > 0.05 for all). The specificity, accuracy and Az of TI-RADS P were the highest among the four systems (P < 0.05 for all). Specificity, accuracy and Az were higher for TI-RADS K than for TI-RADS R (P = 0.003). The specificity, accuracy and Az of TI-RADS H and TI-RADS R were lower, with no statistically significant difference between them (P = 0.101). (Tables 5, 6
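Because every nodule was rated with all four systems, the sensitivities and specificities are paired, which is why the McNemar test applies. The sketch below shows an exact (binomial) two-sided McNemar test on the discordant pairs. It is an illustration only: the text does not state whether the exact or the chi-square version was used, and the function names and the example counts are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(calls_a: Sequence[bool], calls_b: Sequence[bool]) -> Tuple[int, int]:
    """Count nodules called positive by only one of two systems.

    To compare sensitivities, pass the calls for malignant nodules only;
    to compare specificities, pass the calls for benign nodules only.
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("both systems must rate the same nodules")
    only_a = sum(1 for a, b in zip(calls_a, calls_b) if a and not b)
    only_b = sum(1 for a, b in zip(calls_a, calls_b) if b and not a)
    return only_a, only_b


def mcnemar_exact_p(only_a: int, only_b: int) -> float:
    """Two-sided exact McNemar P value from the two discordant counts."""
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_a, only_b)
    # Under the null hypothesis each discordant pair falls on either side
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts for illustration; these are not the study data.
p_value = mcnemar_exact_p(2, 10)
```

Only the discordant pairs enter the test; nodules on which both systems agree carry no information about which system is more sensitive.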

Discussion
TI-RADS H 10 was developed in a prospective study as an equation with 10 variables, defining categories 1, 2, 3, 4a, 4b, 5 and 6. Recently, the authors prospectively evaluated the diagnostic accuracy of their TI-RADS and subdivided category 4 into 4a, 4b and 4c 5 . They integrated other factors, including imaging findings, a nodule's changes over time, previous FNAC results, different diffuse pathologies (e.g. Graves' disease, Hashimoto's thyroiditis, De Quervain thyroiditis) and varying clinical situations. These might be useful in the management of different classifications of thyroid nodules. Calcification (macrocalcification or microcalcification) and hypervascularity were significantly associated with malignancy in their study. In the present study, however, macrocalcification and hypervascularity were not identified as risk factors. The malignancy rate of each category was within the recommended range.
Park et al. proposed their TI-RADS 11 in a retrospective study with 12 aspects of TNs, adding size and lymph node abnormality and resulting in 5 categories, T-US 1-5, with increasing risk of malignancy. In the current study, size was also significantly different between benign and malignant nodules. Lymph node abnormality was a risk factor at univariate analysis but not at multivariate analysis. This result was probably attributable to interference from other variables, including microcalcification, microlobulated or irregular margin, and marked hypoechogenicity, all of which were malignancy risk factors. The malignancy risk was 6.3% among category 2 nodules, which was lower than the recommendation (8.0 ~ 23.0%). None of the US features mentioned in category 2 was a risk factor in the present study, which was possibly the cause.
Kwak et al. 12 created a predictive model based on US characteristics in a retrospective study that included 1658 nodules, considering that the risk of malignancy increased with the number of suspicious US features, including solid structure, marked hypoechogenicity, hypoechogenicity, microcalcification, microlobulated or irregular margin, and taller-than-wide shape. Our study agreed with theirs in that solid composition was a predictor of carcinoma. When reviewing the images, we regarded a nodule as positive if it had at least one suspicious US feature. This approach is practical and convenient for the management of TNs in clinical practice. The malignancy rate of each category was within the recommended range.
Table 3. Association between thyroid malignancy and various US features. Note: β, regression coefficient; OR, odds ratio; CI, confidence interval.

Table 4. Comparison of malignancy rates with four TI-RADSs. Column headings: Scoring System and Category; Final Diagnosis*; Recommended. *Data are numbers of patients, with percentages in parentheses.

Russ et al. published their TI-RADS 13 based on 24 US characteristics. Their study was based on a retrospective analysis of 500 FNAC nodules from one observer at a single institution. In 2013, they prospectively evaluated the diagnostic accuracy of their categories on 4550 nodules with and without elastography 19 . Other authors have adopted it and developed their own classification systems 25,26 . The malignancy risk was 2.6% (3/182) among category 3 nodules, which was beyond the recommended malignancy rate (<2.0%). The surgical nature of the series might be responsible for this result. The malignancy risk was 16 [...] which was beyond the recommended malignancy rate (2.0~10.0%). This may be explained by the fact that hypoechogenicity, a US feature of category 4a, was a malignancy risk factor at both univariate and multivariate analysis.
That the nodules in our study were a surgical series might be one of the reasons.
The present study suggests that solid composition, hypoechogenicity, marked hypoechogenicity, homogeneous echotexture, microlobulated or irregular margin, microcalcification, mixed calcification and taller-than-wide shape were independent US features for predicting thyroid malignancy, consistent with other published literature 12,14,16,[27][28][29] . The current study showed higher sensitivity and accuracy than previous studies [10][11][12][13] . The likely reason is that our findings are specific to a surgical cohort with histopathology results, whereas the previous studies focused on TNs evaluated by FNAC. TI-RADS P had higher diagnostic performance than the other three systems, with the highest specificity, which is especially important in the management of TNs. Higher specificity lowers the rate of false-positive findings, and thereby helps avoid overtreatment and reduces the number of unnecessary FNACs 25 . However, TI-RADS P had relatively lower sensitivity. For a tool used to select high-risk nodules for FNAC, high sensitivity is very important in clinical practice. The malignant nodules that were assigned to a benign category by the system of Park et al. had US features including hypoechogenicity with halo sign, macrocalcification or predominant hyperechogenicity. Among these features, the presence or absence of a halo sign showed no significant difference at multivariate analysis, whereas hypoechogenicity was an important US feature in predicting thyroid malignancy. These may be the reasons for its lower sensitivity. Although TI-RADS P stratified nodules into categories, it was not easy to fit every thyroid nodule into the proposed equation when reviewing the US images (e.g. a predominantly solid nodule with halo sign). TI-RADS H, TI-RADS K and TI-RADS R achieved higher sensitivity in identifying nodules with high malignancy risk.
TI-RADS K and TI-RADS R recommended FNAC for thyroid nodules with one or more suspicious US features, which may have contributed to their higher sensitivity. Although Horvath et al. integrated many factors, their pattern-based US classification was difficult for radiologists to use. Therefore, it was not easy to apply in clinical practice 12 .
The specificity of TI-RADS R was lower than that of TI-RADS K (P = 0.003). The specificity, accuracy and Az of TI-RADS H and TI-RADS R were lower, and no statistically significant differences were found between them. Macrocalcification and iso-echogenicity are included in the malignant classifications of TI-RADS H and TI-RADS R, respectively, which may account for their lower specificity. Compared with the other three scoring systems, TI-RADS K was a simple and convenient predictive model based on five US characteristics, whereas the other three approaches involved 10, 12 and 24 aspects of TNs, respectively [10][11][12][13] . With TI-RADS K, a nodule is positive as long as it has at least one suspicious US feature. The malignancy rates of all TI-RADS categories were within the recommended ranges, except for TI-RADS P 2, TI-RADS K 3, TI-RADS R 3 and TI-RADS R 4a. The results indicate that the TI-RADSs are applicable both to the general population with thyroid nodules and to surgical series. The malignancy risks of TI-RADS K 3, TI-RADS R 3 and TI-RADS R 4a in the surgical series were higher than in the general population. The malignancy risk of TI-RADS P 2 in the surgical series was lower than in the general population. Inter-observer agreement was substantial for all four TI-RADSs. Intra-observer agreement was perfect for TI-RADS P, TI-RADS K and TI-RADS R, and substantial for TI-RADS H. To our knowledge, this was the first study to correlate US findings with final histopathology of the surgical specimen in order to compare different TI-RADSs. Consequently, the study's results on the diagnostic capacity of the classifications are not biased by the inherent inaccuracy of FNAC results. In the general population, FNAC diagnosis includes a percentage of undetermined lesions whose final results (benign or malignant) are unknown, since surgery is not performed on all of them.
Furthermore, we collected information on the other non-suspicious nodules present in the surgical series and correlated the pathology findings with nodules classified as benign patterns, which allowed their non-malignant aetiology to be confirmed.
Recently, as TI-RADS classifications have been created, the TI-RADS system has been continuously improved and modified according to new evidence, and in the future it might include contrast-enhanced ultrasound 30,31 , elastosonography findings 31, 32 , PET (positron emission tomography) findings, or other imaging techniques. The TI-RADS system allows clinicians to easily understand the malignancy risk of a thyroid nodule from the US report and to make more appropriate treatment decisions such as follow-up, FNAC or operation.
Our research has several limitations. Firstly, as a surgical series, the study had an overrepresentation of cancers (45.9%) compared with FNAC-based series (i.e. 4.0-5.0%) 1 , which may lead to selection bias. However, at present, histopathology is the only gold standard for the diagnosis of TNs 33 . Secondly, owing to the retrospective design, the use of various US machines and operators may have limited the image interpretation by the radiologists. However, all the US machines in this study were high-end instruments and the images were reviewed by experienced radiologists. In addition, the US images were acquired and stored under the same protocol, which minimized this influence; still, a prospective study design is needed. Finally, this was a single-center experience in a tertiary referral hospital, and multi-center studies with large case series are needed. Further prospective studies are anticipated to verify our results.

Conclusion
In conclusion, all four TI-RADSs provide effective malignancy risk stratification for TNs. With its higher sensitivity, TI-RADS K, a simple predictive model based on five US characteristics, is practical and convenient for the management of TNs in clinical practice. The study also indicates that the TI-RADSs are applicable to surgical series, in addition to the general population.