Introduction

Soft tissue sarcomas are rare neoplasms of mesenchymal origin that often require multimodal treatment1; in contrast, benign tumors require less aggressive management. Determining whether a soft tissue tumor (STT) is benign or malignant is therefore the most important step of the treatment algorithm. In clinical practice, magnetic resonance imaging (MRI) is the preferred imaging modality for STT characterization, local staging, and preoperative planning1,2. Generally, the major criteria for diagnosing malignant STTs using conventional MRI include a large size, deep location, and heterogeneous signal intensity2,3,4,5,6,7,8; other criteria have also been suggested2,4,5,9,10,11. Additionally, several characteristic MRI features such as the “target sign”12,13, “split fat sign”10,13,14, and “tail sign”4,15 have been reported to be helpful in diagnosing specific STT subtypes. However, it is challenging to distinguish benign from malignant STTs using only MRI; most STTs have nonspecific imaging findings, and substantial overlap exists between the imaging features of malignant and benign lesions4,5,9,16,17. Consequently, unsatisfactory diagnostic performance of MRI for distinguishing benign from malignant STTs has been reported18,19,20. Diffusion-weighted imaging (DWI), which measures the random motion of water protons and provides a quantitative parameter of water diffusion in tissue, has been reported to be useful in tumor characterization21 and treatment response evaluation22,23 in musculoskeletal imaging. It has also been suggested that DWI can potentially improve diagnostic performance in the differentiation of benign and malignant STTs24.

Most previous MRI studies regarding STT differentiation included small numbers of patients, focused on specific subtypes, or focused on a limited number of imaging findings4,8,9,10,24. Few investigations described a systematic imaging approach for differentiating between benign and malignant STTs8,25.

A nomogram incorporates a variety of factors and is a reliable and pragmatic prediction tool to assess the overall probability of a specific outcome. To the best of our knowledge, no study has used nomograms for STT differentiation. We aimed to build predictive nomograms by combining known clinical and MRI measurements described in the previous literature2,3,4,5,6,7,8,9,10,11,12,14,15 and to validate them using a validation cohort. Moreover, the diagnostic performance of a nomogram based on conventional and DWI measurements together was compared with that of a nomogram based on conventional measurements alone to determine whether diagnostic accuracy increases when using DWI measurements.

Materials and Methods

Patients

The institutional review board approved this retrospective study (Samsung Medical Center, 2017-05-087-001); the requirement for informed consent was waived. From May 2011 to December 2016, 3,573 MRI examinations including DWI were performed for suspected soft tissue and bone tumors; 502 examinations with pathologically proven STTs were initially eligible. Exclusion criteria were (a) previous treatment such as excision, chemotherapy, or radiotherapy (n = 151); (b) lipoma or well-differentiated liposarcoma (n = 60); (c) cystic lesions without a solid component (n = 27); (d) poor-quality MRI (e.g., image distortion with susceptibility artifacts, n = 13); and (e) two or more MRI examinations for the same lesion in the same patient (e.g., simple follow-up MRI, n = 12). A total of 239 MRI examinations from 236 patients (mean age, 48.8 years; range, 9–93 years; 126 male [mean age, 50.6 years; range, 9–90 years] and 110 female patients [mean age, 46.8 years; range, 9–93 years]) were included. Forty of the subjects overlapped with a previous study26; that article dealt with tumor spatial heterogeneity, whereas this manuscript reports on predictive nomograms for STT. The numbers of STTs that were pathologically confirmed by image-guided biopsy, surgical excision, and both were 28, 93, and 118, respectively. Patients were assigned randomly to the training (n = 100) or validation (n = 139) cohort (Fig. 1).

Figure 1

Flow diagram for patient selection. STT, soft tissue tumor.

MRI techniques

All examinations were performed using 3.0-T MRI scanners (Intera Achieva or Ingenia, Philips Medical Systems, Best, The Netherlands). Depending on the lesion’s location, dedicated coils were used with various MRI parameters. Conventional protocols included axial and coronal turbo spin echo (TSE) T1-weighted imaging (repetition time/echo time, 400–520 ms/15–16 ms) and axial and sagittal TSE T2-weighted imaging (2,411–5,366 ms/80–100 ms). Axial and coronal TSE fat-suppressed (FS) T1-weighted imaging (441–561 ms/15–16 ms; fat suppression, chemical shift-selective) was obtained after contrast administration (Gd-DOTA, Dotarem®, Guerbet, Roissy CdG Cedex, France).

Axial-plane DWI was performed using a single-shot spin-echo echo-planar sequence. Sensitizing diffusion gradients were applied sequentially in the x, y, and z directions (field-of-view, 160–350 mm; matrix size, 128 × 128–256 × 256; repetition time/echo time, 5,000 ms/61–69 ms; fat suppression, chemical shift-selective; slice thickness, 5 mm; echo train length, 59–67; number of averages, 2; b-values, 0, 400, and 800 s/mm2)24,27,28. The apparent diffusion coefficient (ADC) map was generated using all b-values. Parallel acquisition was performed with the sensitivity-encoding (SENSE) technique, using a parallel reduction factor of 1–2 for conventional sequences and 3 for DWI.
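The fitting method used to generate the ADC map is not stated. A common approach, assumed here purely for illustration, is the mono-exponential model S(b) = S0 · exp(−b · ADC), fitted by log-linear least squares across all b-values. The sketch below applies that assumed model to one synthetic, noise-free voxel; the function name and signal values are hypothetical.

```python
import math

def fit_adc(b_values, signals):
    """Return ADC (mm^2/s) from a log-linear least-squares fit of
    S(b) = S0 * exp(-b * ADC); b_values are in s/mm^2."""
    logs = [math.log(s) for s in signals]
    n = len(b_values)
    mean_b = sum(b_values) / n
    mean_log = sum(logs) / n
    sxx = sum((b - mean_b) ** 2 for b in b_values)
    sxy = sum((b - mean_b) * (y - mean_log) for b, y in zip(b_values, logs))
    # The slope of ln(S) versus b is -ADC.
    return -sxy / sxx

# Synthetic voxel (illustrative values, not study data).
b_values = [0.0, 400.0, 800.0]
true_adc = 1.3e-3
signals = [1000.0 * math.exp(-b * true_adc) for b in b_values]
adc = fit_adc(b_values, signals)
print(f"ADC = {adc * 1e3:.2f} x 10^-3 mm^2/s")
```

With real, noisy data the three b-values would not lie exactly on a line, and the fitted slope would be a least-squares compromise across them.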

Clinical and imaging measurement analysis

Clinical data (age, sex, and pathological STT results) were obtained by reviewing electronic medical records. Cases were categorized as benign, intermediate, or malignant according to the pathological results; lesions with benign and intermediate biologic behavior were regarded as one category and defined as non-malignant tumors.

Qualitative assessments of conventional image measurements were performed by three radiologists (20, 18, and 13 years of experience in musculoskeletal radiology). The following characteristics were analyzed in consensus by the three radiologists, who were blinded to the clinical information and histopathologic results: morphology (infiltration, lobulation), component (fat, fibrosis, necrosis, hemorrhage, septation, target sign), T1 and T2 heterogeneity, perilesional findings (edema, split fat sign, tail sign), and others (deep location involvement, neurovascular bundle invasion, bone invasion) (Supplemental Material).

Another radiologist (3 years of experience in musculoskeletal MRI) blinded to clinical information and histopathologic results evaluated quantitative MRI measurements: the mean (sizemean) and maximum (sizemax) sizes and mean (ADCmean) and minimum (ADCmin) ADC values. The lesion’s longitudinal, anteroposterior, and transverse dimensions were measured on MRI; the maximums and means of the three orthogonal dimensions were recorded. For each lesion, one axial ADC map plane was selected that showed the largest tumor section diameter. The most peripheral portion of each lesion was excluded to minimize partial-volume effects. Referring to axial post-contrast FS T1-weighted imaging, the region of interest was manually placed on the ADC map maximally within the contrast-enhancing area; regions with necrosis, cystic changes, or dense calcification were avoided.
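The four quantitative measurements reduce to simple summaries. The sketch below is a minimal illustration, assuming that the three orthogonal dimensions are given in centimeters and that the ADC values inside the region of interest are available as a flat list; the function names and numbers are hypothetical, not study data.

```python
def size_metrics(longitudinal_cm, anteroposterior_cm, transverse_cm):
    """Return (mean size, maximum size) of the three orthogonal dimensions."""
    dims = (longitudinal_cm, anteroposterior_cm, transverse_cm)
    return sum(dims) / len(dims), max(dims)

def roi_adc_metrics(roi_values):
    """Return (mean ADC, minimum ADC) over the voxels inside the ROI."""
    return sum(roi_values) / len(roi_values), min(roi_values)

# Illustrative lesion: dimensions in cm, ADC values in units of 10^-3 mm^2/s.
size_mean, size_max = size_metrics(5.0, 4.0, 3.0)
adc_mean, adc_min = roi_adc_metrics([1.2, 1.0, 0.8, 1.4, 1.1])
print(size_mean, size_max, adc_mean, adc_min)
```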

Statistical analysis

Continuous and categorical variables are summarized as the median (range) and frequency (%), respectively. For two independent group comparisons, continuous variables and categorical variables were analyzed with the independent t-test or Mann-Whitney test and the chi-squared test or Fisher’s exact test, respectively. For use in clinical practice, age, sizemean, sizemax, ADCmean, and ADCmin were categorized using a 5-year scale for age, a 1-cm scale for sizemean and sizemax, and a 0.1 × 10−3 mm2/s scale for ADCmean and ADCmin to estimate a model predicting malignancy; non-significant cutoffs were excluded. Stepwise selection was applied to the training set using a logistic regression model from all combinations of candidate cutoffs. The model’s goodness-of-fit was checked with the R2 value and Hosmer-Lemeshow test. Likelihood ratio chi-squared statistics for testing the null model against the model with all predictors and the selected models were presented. Variables with P < 0.05 were considered independent predictors of malignancy and used for nomogram modeling. Among independent predictors, ADCmean and ADCmin were excluded from Model I (the conventional model); they were included in Model II (the DWI model).

The nomogram’s predictive performance was measured by the concordance index (C-index), which for a binary outcome is equivalent to the area under the receiver operating characteristic curve. Models were validated by leave-one-out cross-validation within the training cohort and by using a validation cohort. Then, we selected one model each from Models I and II with the smallest difference between the training and validation cohort C-indexes to determine the most valid model. The nomogram’s optimal cutoff for predictive probability was determined by maximizing Youden’s index; the sensitivity, specificity, positive predictive value, and negative predictive value were calculated based on the cutoff. The chi-squared test was performed to detect differences in C-indexes between nomograms for Models I and II. In univariate and multivariate analyses, differences were statistically significant at P < 0.05.
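The C-index and the Youden-optimal cutoff can both be computed directly from predicted scores and binary outcomes. The following is a minimal sketch on toy data, not the SAS or R routines used in the study: the C-index is taken as the fraction of malignant/non-malignant pairs in which the malignant case has the higher score (ties count one half), and Youden's index is J = sensitivity + specificity − 1. The function names and the eight example cases are hypothetical.

```python
def c_index(scores, labels):
    """Fraction of (malignant, non-malignant) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))

def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximizing J; a case is positive if score >= cutoff."""
    best_cut, best_j = None, -1.0
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < cut and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cut and y == 0)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if j > best_j:  # the lowest cutoff wins when J is tied
            best_cut, best_j = cut, j
    return best_cut, best_j

# Toy data: predicted probabilities and outcomes (1 = malignant).
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
auc = c_index(scores, labels)
cutoff, j = youden_cutoff(scores, labels)
print(auc, cutoff, j)
```

The pairwise loop is quadratic in the number of cases, which is acceptable for cohorts of a few hundred lesions.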

Nomogram calibrations for both models were assessed for the training and validation cohorts by plotting observed probabilities against nomogram-predicted probabilities of malignancy. Bootstrapping with 1,000 resamples was used to adjust for bias. Statistical analyses were performed using SAS version 9.4 (SAS Institute, Cary, NC, USA) and R version 3.3.2 (R Development Core Team, Vienna, Austria).
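The data behind a calibration plot can be illustrated by grouping cases by predicted probability and comparing the mean prediction in each group with the observed malignancy rate. The sketch below assumes equal-width probability bins and omits the bootstrap bias correction; the binning scheme, function name, and toy data are assumptions for illustration.

```python
def calibration_bins(predicted, observed, n_bins=5):
    """Return (mean predicted, observed rate, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predicted, observed):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        obs_rate = sum(y for _, y in members) / len(members)
        rows.append((mean_pred, obs_rate, len(members)))
    return rows

# Toy data: predicted probabilities and outcomes (1 = malignant).
predicted = [0.1, 0.15, 0.3, 0.45, 0.55, 0.7, 0.85, 0.9]
observed = [0, 0, 0, 1, 0, 1, 1, 1]
rows = calibration_bins(predicted, observed, n_bins=2)
for mean_pred, obs_rate, count in rows:
    print(f"predicted {mean_pred:.2f}  observed {obs_rate:.2f}  n={count}")
```

Points for which the two values agree fall on the 45-degree line of a calibration plot.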

Results

Patient demographics

The median age was 51 years (range, 9–87 years) in the training cohort and 49 years (range, 9–93 years) in the validation cohort. Malignant STTs were identified in 50% (50/100 cases) of the training cohort and in 48% (67/139 cases) of the validation cohort. Among the 28 STTs that were pathologically confirmed by image-guided biopsy, 15 were benign. Thirteen of them showed no significant interval change or features suspicious for malignancy during the follow-up period (average, 15 months; range, 2–46 months); the other two cases were lost to follow-up. The pathological results of image-guided biopsy and surgical excision differed in three malignant STTs; false-negative core biopsy results were obtained in two myxoid liposarcomas and one low-grade fibromyxoid sarcoma. Lesions with intermediate biologic behavior, including fibromatosis, inflammatory myofibroblastic tumor, epithelioid hemangioendothelioma, solitary fibrous tumor, myoepithelioma, and angiomatoid fibrous histiocytoma, were categorized in the non-malignant group. Descriptive characteristics for both cohorts are summarized and compared in Table 1. The two cohorts showed similar demographics and MRI features without statistically significant differences. Detailed histopathological diagnoses of the cohorts are summarized in Table 2.

Table 1 Descriptive characteristics of the study population.
Table 2 Details of the histopathological diagnoses in the training and validation cohorts.

Estimation of the prediction model

The results of the univariate analysis for each candidate cutoff point are shown in Table 3. An age > 50 years; sizemean > 3 cm; sizemax > 4 cm; presence of infiltration, lobulation, necrosis, hemorrhage, edema, or the tail sign; an ADCmean < 1.3 × 10−3 mm2/s; ADCmin < 0.9 × 10−3 mm2/s; and absence of the target sign and split fat sign differed significantly between non-malignant and malignant cases. From all combinations of variables with significant candidate cutoffs and other variables including the absence of the target and split fat signs, a stepwise logistic regression analysis for Model I revealed that a sizemean > 3 cm and presence of infiltration, edema, or the split fat sign retained independent significance for predicting malignancy. After the addition of ADCmean < 1.3 × 10−3 mm2/s and ADCmin < 0.9 × 10−3 mm2/s to these variables, sizemean > 3 cm, presence of infiltration, and the split fat sign retained independent significance (Table 4). R2 values were 0.455 and 0.449, and P-values from the Hosmer–Lemeshow test were 0.598 and 0.874 for Models I and II, respectively. Chi-squares from the likelihood ratio for models including all predictors, Model I, and Model II were 70.92, 54.46, and 59.51, respectively.

Table 3 Demographic and MRI characteristics of non-malignant and malignant soft tissue tumors.
Table 4 Selected variables used to build the models based on the multivariate analysis.

Construction of nomograms for predicting malignancy

Independent variables for predicting malignancy were used to construct nomograms for Models I and II. The conventional nomogram (Model I) was formulated using conventional variables only, whereas the DWI nomogram (Model II) was formulated using ADCmean < 1.3 × 10−3 mm2/s and ADCmin < 0.9 × 10−3 mm2/s in addition to conventional variables (Fig. 2). After the points for all variables are summed on the total point scale, the probability of malignancy can be determined by drawing a vertical line at the total score (Figs 3 and 4). In both models, the nomograms showed that the split fat sign contributed most to the probability of malignancy. Other variables showed moderate impacts on the probability of malignancy except for ADCmin in Model II. Calibration plots showed fair agreement between the nomogram predictions and the actual observations of malignancy in the training and validation cohorts (Fig. 5).
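A nomogram of this kind is a graphical form of a logistic regression model: each predictor's coefficient is rescaled to points, with the strongest predictor assigned 100 points, and the total maps to a probability through the logistic function. The sketch below illustrates that mapping. The coefficients and intercept are hypothetical placeholders, because the fitted values are not given in the text, and coding the split fat sign by its absence is an assumption based on the univariate result.

```python
import math

# Hypothetical log-odds coefficients for illustration only; these are NOT
# the study's fitted values.
COEFS = {
    "size_mean_gt_3cm": 1.2,
    "infiltration": 1.5,
    "edema": 1.0,
    "split_fat_sign_absent": 2.5,  # assumed direction: absence favors malignancy
}
INTERCEPT = -3.0  # hypothetical

def points_table(coefs):
    """Rescale coefficients so the largest effect is worth 100 points."""
    scale = 100.0 / max(abs(b) for b in coefs.values())
    return {name: abs(b) * scale for name, b in coefs.items()}

def malignancy_probability(findings, coefs=COEFS, intercept=INTERCEPT):
    """Logistic probability for a dict of binary findings."""
    linear_predictor = intercept + sum(
        b for name, b in coefs.items() if findings.get(name, False)
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))

points = points_table(COEFS)
case = {
    "size_mean_gt_3cm": True,
    "infiltration": True,
    "edema": False,
    "split_fat_sign_absent": True,
}
total_points = sum(points[name] for name, present in case.items() if present)
probability = malignancy_probability(case)
print(f"total points {total_points:.0f}, probability {probability:.2f}")
```

Because the points are a linear rescaling of the coefficients, a cutoff on total points is equivalent to a cutoff on predicted probability.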

Figure 2

Nomograms for predicting the probability of malignancy in soft tissue tumors by using conventional variables only (a: Model I) and by using ADC values in addition to conventional variables (b: Model II). ADC, apparent diffusion coefficient.

Figure 3

A 53-year-old woman with spindle cell lipoma. (a) Axial T1- and (b) T2-weighted images of the left shoulder showing a deep-seated mass with a mean size of 6.4 cm. Scattered areas of high signal intensity on the T1-weighted image are noted, suggesting an intratumoral fat component (arrows). (c) Axial fat-suppressed contrast-enhanced T1-weighted image revealed heterogeneous enhancement. (d) Split fat sign was observed between the tumor and triceps brachii muscle on the sagittal T2-weighted image (arrowheads). (e) The mean and minimum ADC values of the lesion were measured to be 2.60 × 10−3 mm2/s and 1.90 × 10−3 mm2/s, respectively. The probability of malignancy was calculated to be less than 0.1 by both Models I and II.

Figure 4

A 20-year-old man with alveolar rhabdomyosarcoma. (a) Axial T1- and (b) T2-weighted images of the left hand showing a lobulated mass with a mean size of 2.9 cm and peritumoral edema (not shown). Split fat sign was not evident. (c) Heterogeneous enhancement was seen on the axial fat-suppressed contrast-enhanced T1-weighted image. Infiltration along the extensor tendon (arrows) and tail sign (arrowheads) were noted. (d) The mean and minimum ADC values of the lesion were measured to be 0.85 × 10−3 mm2/s and 0.46 × 10−3 mm2/s, respectively. The probability of malignancy was calculated to be between 0.8 and 0.9 by both Models I and II.

Figure 5

Calibration plots of the probability of malignancy in the (a) training and (b) validation cohorts. The nomogram-predicted probability of malignancy is plotted on the x-axis; the actual probability of malignancy is plotted on the y-axis. The 45-degree line through the origin represents the perfect calibration model in which the predicted probabilities are identical to the actual probabilities.

Performance of the two models and validation

In the training cohort, sensitivity, specificity, positive predictive value, negative predictive value, and accuracy for Model I were 0.78, 0.86, 0.85, 0.80, and 0.82, respectively, using a nomogram cutoff of 176; those for Model II were 0.80, 0.86, 0.85, 0.81, and 0.83, respectively, using a nomogram cutoff of 188. When these nomogram cutoffs were applied to the validation cohort, the Model I sensitivity, specificity, positive predictive value, negative predictive value, and accuracy were 0.72, 0.72, 0.71, 0.73, and 0.72, respectively; those for Model II were 0.73, 0.76, 0.74, 0.75, and 0.75, respectively. C-index values for Models I and II were 0.886 (95% confidence interval [CI], 0.821–0.951) and 0.899 (95% CI, 0.841–0.958) in the training cohort, respectively, and 0.757 (95% CI, 0.675–0.839) and 0.791 (95% CI, 0.715–0.867) in the validation cohort, respectively. Cross-validation showed C-index values of 0.853 (95% CI, 0.776–0.930) and 0.878 (95% CI, 0.811–0.946) for Models I and II, respectively. P-values for C-index differences between the models were 0.284 and 0.020 in the training and validation cohorts, respectively; the difference was statistically significant only in the validation cohort.
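These rates follow from a 2 × 2 table of predicted versus actual status. As a worked check, the counts below are back-calculated from the reported Model I training-cohort sensitivity and specificity and the 50 malignant and 50 non-malignant training cases; they are an inference for illustration, not counts reported in the text.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return the standard diagnostic rates from 2 x 2 table counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

# Back-calculated counts (assumption): 39 of 50 malignant cases above the
# cutoff, 43 of 50 non-malignant cases below it.
metrics = diagnostic_metrics(tp=39, fn=11, tn=43, fp=7)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, these counts reproduce all five reported Model I training values (0.78, 0.86, 0.85, 0.80, and 0.82), so the reported figures are internally consistent.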

Discussion

Despite its superior soft tissue contrast and resolution, MRI has shown limited ability in STT characterization and differential diagnosis, with conflicting conclusions reported by previous studies18,19,20. Berquist et al.11 reported a sensitivity of 90–96% and specificity of 82–96% for malignancy prediction using traditional imaging features including size, margins, and signal intensity heterogeneity. However, Kransdorf et al.18 concluded that MRI was incapable of reliably distinguishing between benign and malignant STTs; a specific diagnosis was made in only 24%. Crim et al.20 reported that MRI had insufficient accuracy in differentiating benign from malignant STTs. Considering that no single imaging feature was sufficient to distinguish benign from malignant STTs in most previous investigations, we combined individual measurements and formulated nomograms to allow simple STT diagnosis in daily clinical practice with high diagnostic performance.

Our study demonstrated that the mean size and presence of infiltration, edema, and the split fat sign were independent predictors in differentiating non-malignant from malignant STTs. Adding DWI measurements including ADCmean and ADCmin improved diagnostic performance significantly only in the validation cohort. Moulton et al.25 reported that lesion size, margination, and edema were the best predictors using a stepwise logistic regression analysis; adding any fourth imaging feature did not improve accuracy. Except for the split fat sign, which they did not evaluate, these findings are comparable to ours.

The split fat sign has been described as suggesting a slow-growing tumor originating from the intermuscular space around the neurovascular bundle13. Although nonspecific, it is a common finding in benign peripheral nerve sheath tumors13,29. In our study, of 34 STTs (training and validation cohorts) showing the split fat sign, 30 were non-malignant and four were malignant. Because of the high proportions of schwannomas in the non-malignant groups of both the training (23/50, 46%) and validation (26/72, 36%) cohorts, the contribution of the split fat sign to the malignancy probability might be overestimated, which is a study limitation. All four malignant STTs showing the split fat sign had slow-growing characteristics30,31,32, suggesting that slowly enlarging STTs may demonstrate this sign despite malignancy. Our result is comparable to that of Murphy et al.13, who stated that the split fat sign might be noted in malignant peripheral nerve sheath tumors. They found that the fat rims of malignant peripheral nerve sheath tumors were more frequently incomplete because of their aggressive and infiltrative growth pattern, which might improve the diagnosis of malignant STTs showing the split fat sign.

Although some authors found that size was not useful in distinguishing benign from malignant STTs20, size was consistently a statistically significant predictor of malignancy in most studies1,3,5,6,7,8,33,34,35. However, it is unclear from most reports whether they used the maximum or mean tumor diameter. Our results suggested that the mean size was more significant than the maximal size in predicting malignancy, which corresponds with a report by Harish et al.3.

Since Rydholm6 and Myhre-Jensen7 described that most malignant STTs are deep whereas only about 1% of all benign STTs are deep, deep location has been regarded as an established risk factor for malignancy1,2,33. However, some authors recently reported that the depth relationship to the fascia is less important as a predictor of malignant potential8,36, which is comparable to our results. Considering that the studies by Rydholm6 and Myhre-Jensen7 were reported in the early 1980s, we suppose that these contrasting results might have resulted from advances in diagnostic modalities, including MRI, which have helped to detect more deep-seated benign STTs.

We designed and conducted this study hypothesizing that adding DWI measurements could improve accuracy. However, diagnostic accuracy gains were marginal, and the difference between the models was statistically significant only in the validation cohort. These results were partially comparable to those reported by Jeon et al.24. Although they concluded that adding DWI to conventional MRI can improve the diagnostic performance for the differentiation between malignant and benign STTs, the accuracy was the same for an experienced reader regardless of whether DWI was used. Considering their results together with those of our study, we suppose that the added value of DWI might be limited for experienced readers.

A strength of our study is that we developed a systematic imaging approach based on predictive models for overall STT differentiation using nomograms, in contrast to previous studies, which used subjective methods11,17,20 or evaluated only specific subtypes of STTs3,4,9,10,24. Further investigations to compare the diagnostic performance of prediction models with that of the conventional non-quantified approach might be necessary.

Our study has several limitations. First, MRI parameters were variable because of the retrospective design. Second, the validity of our prediction models was imperfect, with a small difference in diagnostic performance between the training and validation cohorts. The C-index values were lower in the validation cohort, possibly owing to the heterogeneous and diverse pathology in the two groups. Additionally, because the training and validation cohorts were constructed by randomization, which differs from true temporal validation, the generalizability of our models might be limited. Nonetheless, we randomized patients to minimize the possibility of bias. All MRI examinations were obtained using scanners from the same manufacturer, which could also limit generalizability. Third, we excluded lipomas, well-differentiated liposarcomas, and cystic tumors without solid components, which may have resulted in selection bias. Moreover, the high proportion of schwannomas in both cohorts could have caused selection bias, as described above. Fourth, the use of consensus reading precluded evaluation of inter-observer variability. Although inter-observer agreement is a significant variable in MRI diagnostic accuracy, we sought to increase confidence in each imaging variable by using the consensus of three experienced readers; quantitative measurements were evaluated by only one reader, which is another limitation. Fifth, the use of 0 s/mm2 for the first b-value instead of 50 s/mm2 might have introduced a perfusion-related contribution to the ADC measurement27. Sixth, STTs with intermediate malignancy (e.g., ‘locally aggressive’ and ‘rarely metastasizing’) were classified as ‘non-malignant’ tumors together with benign lesions, although they may require specific treatment strategies.

In conclusion, nomograms using MRI features as variables can be utilized to predict the probability of malignancy in patients with STTs. There was no definite gain in diagnostic accuracy for differentiating non-malignant from malignant STTs when additional DWI features were used.