MRI-based volumetric measurements of vestibular schwannomas in patients with neurofibromatosis type 2: comparison of three different software tools

Neurofibromatosis type 2 (NF2) is a neurogenetic disorder with an incidence of about 1:33,000. Its hallmarks are bilateral benign vestibular schwannomas, which can lead to deafness or brainstem compression. Volumetric tumor measurements are essential to assess the efficacy of new therapies. We present a statistical and methodological comparison of three volumetric image analysis tools. We performed volumetric measurements on phantoms with predefined volumes (0.1 to 8.0 ml) and on tumors seen on 32 head MRI scans from eight NF2 patients using BrainLab, ITK-Snap, and OsiriX. The software tools were compared with regard to accuracy and reproducibility of the measurements and the time required for analysis. The mean volume estimated by all three software programs differed significantly from the true volume of the phantoms, but OsiriX and BrainLab gave estimates that were not significantly different from each other. For the actual tumors, the volumes estimated with all three software tools showed a low coefficient of variation, but the mean volume estimates differed among the tools. OsiriX showed the shortest analysis time. Volumetric assessment of MRI images is associated with an intrinsic risk of miscalculation. For precise volumes it is mandatory to use the same volumetric tool for all measurements.

Phantoms. We used a commercially available make-up storage box with identical compartments (side length 2 cm) to create the larger phantoms. Each compartment was filled with contrast agent (gadobenate dimeglumine 0.5 M (MultiHance) in a dilution of 1:20 with NaCl) in volumes ranging from 1 to 8 ml, using a precision pipette (Rainin Pipet-Lite XLS 0.1-10 ml for 5 ml and 8 ml and Rainin Pipet XLS 100 µl-1 ml for 1 ml to 3 ml). We also produced a set of microphantoms ranging from 0.1 to 0.7 ml (Supplementary Table S1) using micro reaction vessels filled with contrast agent.
Patients. Eight patients with a clinical and molecular diagnosis of NF2 were included in this study (three females, five males) with a median mean age of 41 years (range 29-67 years) (Table 1). All patients exhibited bilateral vestibular schwannomas (VS); one of the patients had undergone previous surgery. For this study, a total of 32 MRI scans were available for volumetric evaluation. The selected VS of each patient was measured in all consecutive scans. The eight patients were selected to obtain a range of different tumor sizes from 0.15 to 11 ml (mean 4.5 ml). Follow-up scans were performed about every three months, and the median total follow-up time included in the study was 8.5 months (range 7-26 months) per patient. All patients were on long-term medication with bevacizumab for their vestibular schwannomas (2.5 mg/kg to 7.5 mg/kg every 2 to 3 weeks). Three patients (#4, #5 and #6) discontinued medication during follow-up for four, five and three months, respectively, due to limiting side effects (proteinuria and abnormal estrous cycle). Treatment duration ranged from 10 to 288 months at the last MRI scan, with a median duration of 61.5 months.
MRI. All scans were conducted at the Radiological Practice Altona in Hamburg, Germany, using a Siemens Magnetom Skyra (3.0 T). For this study, we used a T1-weighted, high-resolution, contrast-enhanced acquisition with 192 slices and a slice thickness of 1 mm without inter-slice gaps.
Software. All MRI scans were evaluated volumetrically three times with each software tool in order to determine intra-rater variability. A total of 43 scans (five phantoms, six microphantoms and eight patients with four scans each) were measured with three different software solutions for volumetric assessment of MRI-based cranial imaging. In total, 387 manual segmentations and semi-automated volumetric measurements were part of the investigation. Analysis time was recorded to evaluate the usability of the software (patients #1 to #5). The three software programs chosen for comparison are: (1) OsiriX Lite (v.7.0.3) 32-bit, used on an iMac (3.2 GHz Intel Core i5; main memory: 8 GB, 1867 MHz DDR3; graphics: AMD Radeon R9 M390 2048 MB). OsiriX is a macOS-based image processing software for viewing and analyzing DICOM images and a fully functional PACS workstation, including an extensive range of measurement tools for volumetric assessment. For this post-hoc analysis, which was for research purposes only, we used the free-of-charge Lite version. In this version, which is not approved for clinical diagnostics, the number of image analysis tools is limited, but the volumetric analysis is fully functional and has the same processing time as the clinically accredited version. The clinically accredited version is available through the company's website (https://www.osirix-viewer.com/).
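The scan and measurement totals quoted in this paragraph follow from simple counting. The short Python check below reproduces them; the counts are taken from the text, while the variable names are ours and purely illustrative.

```python
# Counts as stated in the text; variable names are illustrative only.
phantoms = 5
microphantoms = 6
patients = 8
scans_per_patient = 4
software_tools = 3
repeats_per_tool = 3  # each scan was measured three times with each tool

patient_scans = patients * scans_per_patient            # 32 patient MRI scans
total_scans = phantoms + microphantoms + patient_scans  # 43 scans overall
total_measurements = total_scans * software_tools * repeats_per_tool  # 387

print(total_scans, total_measurements)
```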

Statistics. Statistical analyses of volume estimates for phantoms and of estimated tumor volumes in NF2 patients were carried out with R version 3.6.0 (2019-04-26). Comparisons of the actual volumes of the phantoms with the volumes measured with each of the software programs were performed using t-tests, and comparison of the estimates obtained by the software programs with each other was done by analysis of variance or, if the data were not normally distributed, the Kruskal-Wallis test. Comparisons of volumes of patient tumors measured with different software programs were performed using repeated-measures analysis of variance. Linear regression was used to assess the relationship of phantom volume to coefficient of variation.
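As a rough illustration of the comparison between measured and actual phantom volumes, the sketch below computes a t statistic on the differences (measured minus true), which is equivalent to a paired t-test. It is an assumption that the test was set up in this way; the text only states that t-tests were used, and the analysis itself was done in R. The numbers are hypothetical and are not data from the study. A p-value would additionally require the t distribution, which the Python standard library does not provide.

```python
import math
import statistics

def one_sample_t(differences):
    """Return (t statistic, degrees of freedom) for H0: mean difference = 0."""
    n = len(differences)
    mean_diff = statistics.mean(differences)
    sd_diff = statistics.stdev(differences)  # sample SD (n - 1 denominator)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1

# Hypothetical values for illustration only, NOT data from the study.
true_ml = [1.0, 2.0, 3.0, 5.0, 8.0]
measured_ml = [0.9, 1.9, 2.8, 4.8, 7.7]

diffs = [m - t for m, t in zip(measured_ml, true_ml)]
t_stat, df = one_sample_t(diffs)
print(f"t = {t_stat:.2f} with {df} degrees of freedom")
```

A negative t statistic here corresponds to systematic underestimation of the true volume.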

Phantoms.
Phantoms of six different known volumes below 1 ml (microphantoms) and another five different volumes from 1 to 8 ml were measured in triplicate using three semi-automated volumetric programs: OsiriX, ITK-Snap and BrainLab (Fig. 1). The volume estimates of all three software programs differed significantly from the actual volumes of the phantoms. On average, OsiriX and BrainLab underestimated the size of the phantoms, whereas ITK-Snap substantially overestimated it.
The analysis of variance shows that the deviation of the estimated volumes from the actual phantom volumes differs significantly among the software programs (p < 10⁻⁵), but the residuals are not normally distributed, violating a key assumption of the analysis of variance. However, the non-parametric Kruskal-Wallis test confirmed that this deviation differs highly significantly among the software programs (p < 10⁻¹⁰).
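For readers unfamiliar with the rank-based test, the following minimal sketch computes the Kruskal-Wallis H statistic from pooled ranks. It omits the tie correction that statistical packages normally apply, and the three groups of volume differences are hypothetical values chosen for illustration, not measurements from the study.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of samples."""
    pooled = sorted(value for group in groups for value in group)
    n_total = len(pooled)

    # Assign each distinct value the average of the ranks it occupies (ranks start at 1).
    rank_of = {}
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    # Sum of (rank sum squared / group size) over all groups.
    weighted = 0.0
    for group in groups:
        rank_sum = sum(rank_of[v] for v in group)
        weighted += rank_sum ** 2 / len(group)

    return 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)

# Hypothetical differences (estimated minus true volume, ml); NOT study data.
delta_osirix = [-0.10, -0.12, -0.08]
delta_brainlab = [-0.15, -0.11, -0.09]
delta_itksnap = [0.30, 0.35, 0.40]

h_stat = kruskal_wallis_h([delta_osirix, delta_brainlab, delta_itksnap])
print(f"H = {h_stat:.2f}")
```

Under the null hypothesis, H is approximately chi-squared distributed with (number of groups minus 1) degrees of freedom, which is how a p-value would be obtained.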
Post-hoc Tukey analysis and pairwise t-tests were performed on the differences between the estimated and actual phantom volumes, comparing each pair of software programs to determine where the differences lay. Both tests gave the same result: ITK-Snap differs greatly from OsiriX and BrainLab, but the OsiriX and BrainLab estimates do not differ significantly from each other.
In order to determine whether the difference between the estimated volume and the true volume (Δ-volume) was greater for larger phantoms than for smaller phantoms, we fitted a linear regression of the Δ-volume (dependent variable) on the phantom volume (independent variable). With BrainLab, the absolute magnitude of the Δ-volume was significantly greater for larger phantoms than for smaller phantoms, but with ITK-Snap the association was in the opposite direction (larger Δ-volume with smaller phantoms). There was no significant association with OsiriX.
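The regression described here can be sketched with an ordinary least-squares fit. The function below is a generic illustration rather than the R code used in the study, and the example values are hypothetical: they are constructed so that the absolute Δ-volume grows in proportion to phantom volume, giving a positive slope.

```python
def least_squares(x, y):
    """Return (slope, intercept) of the ordinary least-squares line y = a*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical values for illustration only, NOT data from the study.
phantom_ml = [1.0, 2.0, 3.0, 5.0, 8.0]    # independent variable
abs_delta_ml = [0.1, 0.2, 0.3, 0.5, 0.8]  # dependent variable: |estimated - true|

slope, intercept = least_squares(phantom_ml, abs_delta_ml)
print(f"slope = {slope:.3f} ml per ml, intercept = {intercept:.3f} ml")
```

A positive slope corresponds to larger errors for larger phantoms, and a negative slope to the opposite pattern. Whether a slope differs significantly from zero requires a further test on the slope's standard error, which is not shown here.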
Linear regression was also used to assess possible associations between the coefficient of variation and the true volume of the phantoms. No association was found between the coefficient of variation and the phantom volumes for OsiriX or ITK-Snap. Larger coefficients of variation were seen with larger phantom volumes for BrainLab (p < 0.001).
Tumors. The results of the VS volumetry are shown in Fig. 2. The true volumes of the VS tumors measured in the NF2 patients are unknown, but the precision of the semi-automated volume calculation of the software can be assessed by the coefficient of variation of the triplicate volume measurements for each of the five tumors on each exam (Table 2). In most cases, the coefficient of variation is less than 5% for all three software programs, which is considered to indicate good measurement reproducibility. We performed post-hoc analysis to determine which programs differed from the others and whether the exam number had any effect. We did this with pairwise t-tests, adjusting for multiple comparisons: the tumor volumes estimated with each software program differed significantly from those estimated with each of the others at each exam time.
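The coefficient of variation used here is the standard deviation of the triplicate measurements divided by their mean, expressed as a percentage. The sketch below assumes the sample standard deviation (n - 1 denominator); the text does not specify which variant was used, and the triplicate values are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """CV in percent: sample SD divided by the mean, times 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100.0

# Hypothetical triplicate volume estimates (ml) for one tumor on one exam; NOT study data.
triplicate_ml = [4.4, 4.5, 4.6]

cv_percent = coefficient_of_variation(triplicate_ml)
print(f"CV = {cv_percent:.2f}%")
```

With these example values the CV is about 2.2%, which would fall under the 5% threshold mentioned above.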
Measurement speed. ANOVA shows that the mean times required for performing volumetric analysis with these three software programs are significantly different (Fig. 3). Post-hoc Tukey HSD tests show that OsiriX is much faster than either of the other two programs. However, the time measurements are not strictly comparable, as BrainLab runs on a different operating system and different hardware.

Discussion
We compared three different volumetric tools using calibrated phantoms and serial MRIs of patients with neurofibromatosis type 2 and VS. We expected very accurate volume estimates for all of the phantoms because of their bright contrast and smooth, flat edges. However, none of the three software tools tested was able to calculate an accurate volume for the phantoms. In fact, OsiriX and BrainLab underestimated the actual size of the phantoms, while ITK-Snap calculated volumes significantly above the real volume. The demarcation of the boundary is essential for calculating reliable volumes; this may be impeded in very small or insufficiently confined objects (Fig. 4). Both OsiriX and BrainLab provided volume estimates that were close to the true volumes of the 2 ml, 3 ml, 5 ml and 8 ml phantoms, but the volume estimates obtained with ITK-Snap were significantly greater than the true volumes of all of these phantoms. However, BrainLab produced even more accurate results for smaller phantom sizes than for larger ones. This association is remarkable, because it contrasts with the intuitive assumption that the precision of smaller measurements would be worse than that of larger measurements. The relative miscalculation of the 1 ml phantom could be due to a blurrier border of the sharp edges (Figs. 1 and 4). Another factor might be the number of slices that cover one object. For flat objects, the inter-slice volume loss might be relatively greater than for compact or spherical objects.
Comparing the software results with each other, the volumes calculated by BrainLab and OsiriX appear to be at least comparable, as the differences were not significant, while ITK-Snap produced significantly different results.
The patient tumor data set differs from the phantom data set in two ways: (1) we do not know the true volume of the VS, so we cannot judge accuracy by comparison with the ground truth and can only compare the measurements obtained by the software programs with each other; and (2) we have measurements of the tumors at four different times, whereas the phantoms were measured at only one time.
Vestibular schwannomas in patients with NF2 usually progress over time 18. Even in patients in whom bevacizumab treatment produces tumor regression, sustained progression usually resumes if treatment is discontinued 6 and may be seen even if bevacizumab therapy is maintained 8. Accurate and reliable volumetric measurement tools are, therefore, crucial for risk stratification and therapy. However, neither the absolute nor the baseline volume of VS is crucial for the assessment of the natural history or therapy success; rather, variables like time to progression or growth rate are important outcome measures. The shape of the tumor as well as the distribution of the contrast agent may have an important influence on the accuracy and reproducibility of the calculated volume. Image quality, slice thickness and radiological experience are further factors influencing the result of volumetric analysis. Nevertheless, volumetric analysis is far superior to two-dimensional analysis 11,12,15,16 and makes therapeutic decisions more reliable. To date many different software tools with more or less different [...]
Table 2. Patients: tumor volumes were estimated in triplicate in a total of 32 MRI scans with each of three different methods. We calculated the means, standard deviations (SD) and coefficients of variation (CV) of the three volume estimates at each time point for each patient using each method.
[...] greater than those estimated by OsiriX, which were in turn greater than those estimated by BrainLab. This systematic difference is important, as it illustrates that volumetric values estimated with different software platforms cannot be compared reliably and that different mathematical algorithms may produce different results depending on image quality and tumor size. Contrast intensity and the shape of the target volume may also play an important role.
Regarding the operating time, OsiriX seems to be much faster than ITK-Snap or BrainLab (post-hoc Tukey HSD test). Tumor volume calculation is 1-4 min faster per analysis with OsiriX. As the BrainLab analysis is bound to the neurosurgical workstation, the comparison with the other two tools is not valid, because the processor speed and operating system are different.

Conclusion
Volumetric assessment of MRI images is associated with an intrinsic risk of error for all three tested volumetric tools. Moreover, different software tools exhibited systematic differences, meaning that volumetric measurements made using different software cannot be reliably compared.
For precise volumes and comparable results, it is mandatory to use the same volumetric tool for all measurements. However, all tools tested produce results that are clinically useful.

Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.