Development and Validation of Nomograms for Malignancy Prediction in Soft Tissue Tumors Using Magnetic Resonance Imaging Measurements

The objective of this study was to develop, validate, and compare nomograms for malignancy prediction in soft tissue tumors (STTs) using conventional and diffusion-weighted magnetic resonance imaging (MRI) measurements. Between May 2011 and December 2016, 239 MRI examinations from 236 patients with pathologically proven STTs were included retrospectively and assigned randomly to training (n = 100) and validation (n = 139) cohorts. MRI of each lesion was reviewed to assess conventional and diffusion-weighted imaging (DWI) measurements. Multivariate nomograms based on logistic regression analyses were built using conventional measurements with and without DWI measurements. Predictive accuracy was measured using the concordance index (C-index) and calibration plots. Statistical differences between the C-indexes of the two models were analyzed. Models were validated by leave-one-out cross-validation and by using a validation cohort. The mean lesion size, presence of infiltration, edema, and the absence of the split fat sign were significant and independent predictors of malignancy and included in the conventional model. In addition to these measurements, the mean and minimum apparent diffusion coefficient values were included in the DWI model. The DWI model exhibited significantly higher diagnostic performance only in the validation cohort (training cohort, 0.899 vs. 0.886, P = 0.284; validation cohort, 0.791 vs. 0.757, P = 0.020). Calibration plots showed fair agreements between the nomogram predictions and actual observations in both cohorts. In conclusion, nomograms using MRI features as variables can be utilized to predict the malignancy probability in patients with STTs. There was no definite gain in diagnostic accuracy when additional DWI features were used.


Materials and Methods
Patients. The institutional review board approved this retrospective study (Samsung Medical Center, 2017-05-087-001); the requirement for informed consent was waived. From May 2011 to December 2016, 3,573 MRI examinations including DWI were performed for suspected soft tissue and bone tumors; 502 examinations with pathologically proven STTs were included. Exclusion criteria were (a) previous treatment such as excision, chemotherapy, or radiotherapy (n = 151); (b) lipoma or well-differentiated liposarcoma (n = 60); (c) cystic lesions without a solid component (n = 27); (d) poor-quality MRI (e.g., image distortion with susceptibility artifacts, n = 13); and (e) two or more MRI examinations for the same lesion in the same patient (e.g., simple follow-up MRI, n = 12). A total of 239 MRI examinations from 236 patients (mean age, 48.8 years; range, 9-93 years; 126 male [mean age, 50.6 years; range, 9-90 years] and 110 female patients [mean age, 46.8 years; range, 9-93 years]) were included; 40 of the subjects overlapped with those of a previous study 26; that prior article dealt with tumor spatial heterogeneity, whereas this manuscript reports on predictive nomograms for STTs. The numbers of STTs pathologically confirmed by image-guided biopsy, surgical excision, and both were 28, 93, and 118, respectively. Patients were assigned randomly to the training (n = 100) or validation (n = 139) cohort (Fig. 1).
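As a consistency check on the counts reported above, the exclusions and cohort sizes can be summed directly. The symbols below are labels introduced here for illustration and do not appear in the original text.

```latex
\begin{align*}
n_{\text{excluded}} &= 151 + 60 + 27 + 13 + 12 = 263,\\
n_{\text{included}} &= 502 - 263 = 239,\\
n_{\text{training}} + n_{\text{validation}} &= 100 + 139 = 239.
\end{align*}
```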
Axial-plane DWI was performed using a single-shot spin-echo echo-planar sequence. Sensitizing diffusion gradients were applied sequentially in the x, y, and z directions (field-of-view, 160-350 mm; matrix size, 128 × 128 to 256 × 256; repetition time/echo time, 5,000 ms/61-69 ms; fat suppression, chemical shift-selective; slice thickness, 5 mm; echo train length, 59-67; number of averages, 2; b-values, 0, 400, and 800 s/mm²) 24,27,28. The apparent diffusion coefficient (ADC) map was generated using all b-values. Parallel acquisition was performed with the sensitivity-encoding (SENSE) technique, using parallel reduction factors of 1-2 for conventional sequences and 3 for DWI. Qualitative assessments of conventional image measurements were performed by three radiologists (20, 18, and 13 years of experience in musculoskeletal radiology). The following characteristics were analyzed in consensus by the three radiologists, who were blinded to the clinical information and histopathologic results: morphology (infiltration, lobulation), component (fat, fibrosis, necrosis, hemorrhage, septation, target sign), T1 and T2 heterogeneity, perilesional findings (edema, split fat sign, tail sign), and others (deep location involvement, neurovascular bundle invasion, bone invasion) (Supplemental Material).
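The text states only that the ADC map was generated using all b-values. A minimal sketch of how a per-voxel ADC could be estimated from the three b-values is given below, assuming the standard mono-exponential model S(b) = S0 · exp(−b · ADC) and a log-linear least-squares fit. The model choice, the function name `fit_adc`, and the synthetic signal values are assumptions for illustration, not the scanner's actual algorithm.

```python
import math

def fit_adc(b_values, signals):
    """Least-squares fit of ln S(b) = ln S0 - b * ADC; returns ADC in mm^2/s.

    Assumes a mono-exponential decay model (an assumption, not stated in the text).
    """
    logs = [math.log(s) for s in signals]
    n = len(b_values)
    mean_b = sum(b_values) / n
    mean_log = sum(logs) / n
    # Slope of the regression line of ln(S) on b; ADC is its negative.
    sxy = sum((b - mean_b) * (y - mean_log) for b, y in zip(b_values, logs))
    sxx = sum((b - mean_b) ** 2 for b in b_values)
    return -sxy / sxx

# b-values from the protocol (s/mm^2); signals are synthetic, noise-free values.
b_values = [0.0, 400.0, 800.0]
true_adc = 1.3e-3  # mm^2/s, chosen to match one of the cutoffs in the text
signals = [1000.0 * math.exp(-b * true_adc) for b in b_values]

adc = fit_adc(b_values, signals)
print(f"Estimated ADC: {adc * 1e3:.3f} x 10^-3 mm^2/s")
```

With noise-free synthetic signals the fit recovers the ADC used to generate them; with real data, the three points would not lie exactly on a line and the fit would return the least-squares slope.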
Another radiologist (3 years of experience in musculoskeletal MRI), blinded to clinical information and histopathologic results, evaluated quantitative MRI measurements: the mean (size_mean) and maximum (size_max) sizes and the mean (ADC_mean) and minimum (ADC_min) ADC values. The lesion's longitudinal, anteroposterior, and transverse dimensions were measured on MRI; the maximum and mean of the three orthogonal dimensions were recorded. For each lesion, the axial ADC map plane showing the largest tumor section diameter was selected. The most peripheral portion of each lesion was excluded to minimize partial-volume effects. Referring to axial post-contrast fat-suppressed T1-weighted imaging, the region of interest was manually placed on the ADC map to cover as much of the contrast-enhancing area as possible; regions with necrosis, cystic changes, or dense calcification were avoided.
www.nature.com/scientificreports
Statistical analysis. Continuous and categorical variables are summarized as the median (range) and frequency (%), respectively. For comparisons between two independent groups, continuous variables were analyzed with the independent t-test or Mann-Whitney test, and categorical variables with the chi-squared test or Fisher's exact test. For use in clinical practice, age, size_mean, size_max, ADC_mean, and ADC_min were categorized at candidate cutoffs on a 5-year scale for age, a 1-cm scale for size_mean and size_max, and a 0.1 × 10⁻³ mm²/s scale for ADC_mean and ADC_min to estimate a model predicting malignancy; non-significant cutoffs were excluded. Stepwise selection was applied to the training set using a logistic regression model from all combinations of candidate cutoffs. The model's goodness-of-fit was checked with the R² value and the Hosmer-Lemeshow test. Likelihood ratio chi-squared statistics for testing the null model against the model with all predictors and against the selected models are presented.
Variables with P < 0.05 were considered independent predictors of malignancy and used for nomogram modeling. Among the independent predictors, ADC_mean and ADC_min were excluded from Model I (the conventional model) and included in Model II (the DWI model). The nomogram's predictive performance was measured by the concordance index (C-index), which is equivalent to the area under the receiver operating characteristic curve. Models were validated by leave-one-out cross-validation within the training cohort and by using a validation cohort. Then, for each of Models I and II, we selected the model with the smallest difference between the training and validation cohort C-indexes as the most valid model. The nomogram's optimal cutoff for predictive probability was determined by maximizing Youden's index; the sensitivity, specificity, positive predictive value, and negative predictive value were calculated based on the cutoff. The chi-squared test was performed to detect differences in C-indexes between the nomograms for Models I and II. In univariate and multivariate analyses, differences were considered statistically significant at P < 0.05.
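The two performance measures described above can be illustrated with a short sketch. For a binary outcome, the C-index is the fraction of malignant/non-malignant pairs in which the malignant lesion receives the higher predicted probability, and Youden's index is J = sensitivity + specificity − 1. The function names and the toy predictions below are hypothetical; they are not study data, and the study itself used SAS and R.

```python
def c_index(probs, labels):
    """Fraction of (malignant, non-malignant) pairs ranked correctly; ties count 0.5."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    concordant = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))

def youden_cutoff(probs, labels):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    A lesion is called malignant when its predicted probability is >= cutoff.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_cutoff, best_j = None, -1.0
    for c in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if p >= c and y == 1)
        tn = sum(1 for p, y in zip(probs, labels) if p < c and y == 0)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_cutoff, best_j = c, j
    return best_cutoff, best_j

# Toy predicted probabilities and outcomes (1 = malignant); illustration only.
probs = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

print("C-index:", c_index(probs, labels))
print("Youden cutoff and J:", youden_cutoff(probs, labels))
```

The pairwise definition is used here because it makes the equivalence with the area under the receiver operating characteristic curve explicit; it is quadratic in the number of cases, which is acceptable at the cohort sizes reported.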
Nomogram calibrations for both models were assessed for the training and validation cohorts by plotting observed probabilities against nomogram-predicted probabilities of malignancy. Bootstrapping with 1,000 resamples was used to adjust for bias. Statistical analyses were performed using SAS version 9.4 (SAS Institute, Cary, NC, USA) and R version 3.3.2 (R Development Core Team, Vienna, Austria).

Results
Estimation of the prediction model. The results of the univariate analysis for each candidate cutoff point are shown in Table 3. An age > 50 years; size_mean > 3 cm; size_max > 4 cm; presence of infiltration, lobulation, necrosis, hemorrhage, edema, or the tail sign; ADC_mean < 1.3 × 10⁻³ mm²/s; ADC_min < 0.9 × 10⁻³ mm²/s; and absence of the target sign and split fat sign differed significantly between non-malignant and malignant cases. From all combinations of variables with significant candidate cutoffs and other variables including the absence of the target and split fat signs, a stepwise logistic regression analysis for Model I revealed that size_mean > 3 cm, presence of infiltration and edema, and absence of the split fat sign retained independent significance for predicting malignancy. After the addition of ADC_mean < 1.3 × 10⁻³ mm²/s and ADC_min < 0.9 × 10⁻³ mm²/s to these variables, size_mean > 3 cm, presence of infiltration, and absence of the split fat sign retained independent significance (Table 4). R² values were 0.455 and 0.449, and P-values from the Hosmer-Lemeshow test were 0.598 and 0.874 for Models I and II, respectively. Likelihood ratio chi-squared statistics for the model including all predictors, Model I, and Model II were 70.92, 54.46, and 59.51, respectively.
Table 4. Selected variables used to build the models based on the multivariate analysis (columns: Variables; Odds ratio (95% CI); P). CI, confidence interval; ADC, apparent diffusion coefficient. †10⁻³ mm²/s.

Construction of nomograms for predicting malignancy. Independent variables for predicting malignancy were used to construct nomograms for Models I and II. The conventional nomogram (Model I) was formulated using conventional variables only, whereas the DWI nomogram (Model II) was formulated using ADC_mean < 1.3 × 10⁻³ mm²/s and ADC_min < 0.9 × 10⁻³ mm²/s in addition to conventional variables (Fig. 2). After the points for all variables are summed on the total point scale, the probability of malignancy is read by drawing a vertical line at the total score (Figs 3 and 4). In both models, the nomograms showed that the split fat sign contributed most to the probability of malignancy. Other variables showed moderate impacts on the probability of malignancy except for ADC_min in Model II. Calibration plots showed fair agreement between nomogram predictions and actual observations of malignancy in the training and validation cohorts (Fig. 5).
[...] Models I and II, respectively. P-values for the C-index differences between the models were 0.284 and 0.020 in the training and validation cohorts, respectively, with statistical significance only in the validation cohort.
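The nomogram reading described in this section can be sketched in code. In a logistic-regression nomogram, each binary predictor is typically assigned points proportional to its regression coefficient (the largest scaled to 100), and the total corresponds to a linear predictor that the logistic function converts to a probability. The coefficients and intercept below are HYPOTHETICAL placeholders, not the fitted values behind Table 4 or Figs 3 and 4, and the variable names are illustrative labels for the Model I predictors.

```python
import math

# HYPOTHETICAL values for illustration only; NOT the study's fitted coefficients.
INTERCEPT = -3.0
COEFS = {
    "size_mean_gt_3cm": 1.0,
    "infiltration": 1.2,
    "edema": 0.8,
    "split_fat_sign_absent": 2.0,
}

def nomogram_points(coefs):
    """Scale each binary predictor's coefficient so the largest effect scores 100 points."""
    max_beta = max(abs(b) for b in coefs.values())
    return {name: 100.0 * abs(b) / max_beta for name, b in coefs.items()}

def malignancy_probability(findings, coefs=COEFS, intercept=INTERCEPT):
    """Logistic probability from the linear predictor of the findings that are present."""
    lp = intercept + sum(coefs[name] for name, present in findings.items() if present)
    return 1.0 / (1.0 + math.exp(-lp))

points = nomogram_points(COEFS)

# Example lesion: mean size > 3 cm, no infiltration, edema present, no split fat sign.
findings = {
    "size_mean_gt_3cm": True,
    "infiltration": False,
    "edema": True,
    "split_fat_sign_absent": True,
}
total_points = sum(points[name] for name, present in findings.items() if present)
prob = malignancy_probability(findings)

print(f"Total points: {total_points:.0f}")
print(f"Predicted probability of malignancy: {prob:.3f}")
```

The point scale is only a graphical rescaling of the linear predictor, so reading the probability from the total points on a printed nomogram and evaluating the logistic function directly give the same result.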

Discussion
Despite its superior soft tissue contrast and resolution, MRI has shown limited ability to characterize STTs and narrow the differential diagnosis, with previous studies reporting conflicting conclusions [18][19][20]. Berquist et al. 11 reported a sensitivity of 90-96% and specificity of 82-96% for malignancy prediction using traditional imaging features including size, margins, and signal intensity heterogeneity. However, Kransdorf et al. 18 concluded that MRI was incapable of reliably distinguishing between benign and malignant STTs; a specific diagnosis was made in only 24%. Crim et al. 20 reported that MRI had insufficient accuracy in differentiating benign from malignant STTs. Because no single imaging feature was sufficient to distinguish benign from malignant STTs in most previous investigations, we combined individual measurements into nomograms that allow simple diagnosis of STTs in daily clinical practice with high diagnostic performance.
Our study demonstrated that the mean size, presence of infiltration and edema, and absence of the split fat sign were independent predictors in differentiating non-malignant from malignant STTs. Adding DWI measurements including ADC_mean and ADC_min improved diagnostic performance significantly only in the validation cohort. Moulton et al. 25 reported that lesion size, margination, and edema were the best predictors using a stepwise logistic regression analysis; adding any fourth imaging feature did not improve accuracy. Except for the split fat sign, which they did not evaluate, these findings are comparable to ours.
The split fat sign has been described as suggesting a slow-growing tumor originating from the intermuscular space around the neurovascular bundle 13. Although nonspecific, it is a common finding in benign peripheral nerve sheath tumors 13,29. In our study, of 34 STTs (training and validation cohorts) showing the split fat sign, 30 were non-malignant and four were malignant. Because of the high proportions of schwannomas in the non-malignant groups of both the training (23/50, 46%) and validation (26/72, 36%) cohorts, the contribution of the split fat sign to the malignancy probability might be overestimated, which is a study limitation. All four malignant STTs showing the split fat sign had slow-growing characteristics [30][31][32], suggesting that slowly enlarging STTs may demonstrate this sign despite malignancy. Our result is comparable to that of Murphy et al. 13, who stated that the split fat sign might be noted in malignant peripheral nerve sheath tumors. They found that the fat rims of malignant peripheral nerve sheath tumors were more frequently incomplete because of their aggressive and infiltrative growth pattern, a finding that might aid the diagnosis of malignant STTs showing the split fat sign.
Although some authors found that size was not useful in distinguishing benign from malignant STTs 20, size was consistently a statistically significant predictor of malignancy in most studies 1,3,5-8,33-35. However, it is unclear from most reports whether they used the maximum or mean tumor diameter. Our results suggested that the mean size was more significant than the maximal size in predicting malignancy, which corresponds with a report by Harish et al. 3.
Since Rydholm 6 and Myhre-Jensen 7 reported that most malignant STTs are deep whereas only about 1% of all benign STTs are deep, deep location has been regarded as an established risk factor for malignancy 1,2,33. However, some authors recently reported that the depth relative to the fascia is less important as a predictor of malignant potential 8,36, which is consistent with our results. Considering that the reports by Rydholm 6 and Myhre-Jensen 7 were published in the early 1980s, we suppose that these contrasting results might have resulted from advances in diagnostic modalities, including MRI, that have helped detect more deep-seated benign STTs.
We designed and conducted this study hypothesizing that adding DWI measurements could improve accuracy. However, the gains in diagnostic accuracy were marginal, and the difference between the models was statistically significant only in the validation cohort. These results were partially comparable to those reported by Jeon et al. 24. Although they concluded that adding DWI to conventional MRI can improve the diagnostic performance for differentiating malignant from benign STTs, the accuracy was the same for an experienced reader regardless of whether DWI was used. Considering their results together with ours, we suppose that the added value of DWI might be limited for experienced readers.
A strength of our study is that we developed a systematic imaging approach based on predictive models for overall STT differentiation using nomograms, in contrast to previous studies that used subjective methods 11,17,20 or evaluated only specific subtypes of STTs 3,4,9,10,24. Further investigations comparing the diagnostic performance of prediction models with that of the conventional non-quantified approach might be necessary.
Our study has several limitations. First, MRI parameters were variable because of the retrospective nature of the analysis. Second, the validity of our prediction models was imperfect, with a small difference in diagnostic performance between the training and validation cohorts. The C-index values were lower in the validation cohort, possibly owing to heterogeneous and diverse pathology in the two groups. Additionally, because the training and validation cohorts were constructed by randomization, which differs from true temporal validation, the generalizability of our models might be limited. Nonetheless, we randomized patients to minimize the possibility of bias. All MRI examinations were obtained using machines from the same manufacturer, which could also limit generalizability. Third, we excluded lipomas, well-differentiated liposarcomas, and cystic tumors without solid components, which may have resulted in selection bias. Moreover, the high proportion of schwannomas in both cohorts could cause selection bias, as described above. Fourth, the use of consensus reading precluded evaluation of inter-observer variability. Although inter-observer agreement is an important factor in MRI diagnostic accuracy, we sought to increase confidence in each imaging variable by using the consensus of three experienced readers; quantitative measurements were evaluated by only one reader, which is another limitation. Fifth, the use of 0 s/mm² instead of 50 s/mm² for the first b-value might have introduced a perfusion-related contribution to the ADC measurements 27. Sixth, STTs with intermediate malignancy (e.g., 'locally aggressive' and 'rarely metastasizing') were classified as 'non-malignant' together with benign lesions, although they may require specific treatment strategies.