Reliability of pre-operative diffusion tensor imaging parameter measurements of the cervical spine in patients with cervical spondylotic myelopathy

The present study assessed test–retest and inter-observer reliability of diffusion tensor imaging (DTI) in cervical spondylotic myelopathy (CSM), as well as the agreement among measurement methods. A total of 34 patients (12 men, 22 women; mean age, 58.7 [range 45–79] years) who underwent surgical decompression for CSM, with pre-operative DTI scans available, were retrospectively enrolled. Four observers independently measured fractional anisotropy (FA) values twice, using three different measurement methods. Test–retest and inter-observer reliability were assessed using intraclass correlation coefficients (ICCs). Overall, inter-observer agreement varied according to spinal cord level and the measurement method used, and ranged from poor to excellent (ICC = 0.374–0.821), with relatively less agreement for the sagittal region of interest (ROI) method. The radiology resident and neuro-radiologist group showed excellent test–retest reliability at almost every spinal cord level (ICC = 0.887–0.997), but inter-observer agreement varied from fair to good (ICC = 0.404–0.747). Despite excellent test–retest reliability of the ROI measurements, FA measurements in patients with CSM varied widely in terms of inter-observer reliability. Therefore, DTI parameter data should be interpreted carefully when applied clinically.

Test-retest and inter-observer reliability. ICC values for the four observers from the test-retest reliability assessment are shown in Table 2. Test-retest reliability varied for observers 1 and 2 (ICC = 0.460-0.959); however, observers 3 and 4 showed excellent test-retest reliability at all spinal cord levels for the three measurement methods (ICC = 0.887-0.997), except for observer 4 at the C1/C2 level using the sagittal ROI method (ICC = 0.645).
Based on the test-retest reliability results, the four observers were divided into two groups: the medical student group (observers 1 and 2) and the radiology resident and neuro-radiologist group (observers 3 and 4). ICC values within these two groups for the three measurement methods are shown in Table 3. Inter-observer agreement between observers 1 and 2 varied widely across the different spinal cord levels and measurement methods (ICC = 0.510-0.954), which was similar to the test-retest reliability results.
However, despite the excellent test-retest reliability, the inter-observer agreements of the radiology resident and neuro-radiologist group (observers 3 and 4) also varied widely, with fair-to-good agreement (ICC = 0.404-0.747) for almost every spinal segment and all three measurement methods. In particular, there was poor inter-observer agreement at the C2/C3 (ICC = 0.275 and 0.302 for the first and second measurements, respectively), C5/C6 (ICC = 0.222 and 0.404 for the first and second measurements, respectively), and C7/T1 levels (ICC = 0.084 and 0.157 for the first and second measurements, respectively) when using the sagittal ROI method. There was excellent inter-observer agreement only at the C3/C4 (ICC = 0.792 and 0.800 for the first and second measurements, respectively) and C4/C5 levels (ICC = 0.756 and 0.773 for the first and second measurements, respectively) when using the manual ROI method. The differences between observers 3 and 4 in the calculated mean FA values from C1/C2 through C7/T1 for all subjects are shown in Table 4. Between observers 3 and 4, there were statistically significant differences in mean FA values for all three measurement methods. Among the three methods, the sagittal ROI method yielded a statistically significantly lower value than did the other methods for both observers. There was no statistically significant difference in the mean FA values obtained by the mean versus manual ROI measurements for either observer.

Table 1. Overall ICCs of FA measurements among the four observers. Two elective medical university students (observers 1 and 2), one third-year radiology resident (observer 3), and one neuro-radiologist with 2 years of experience in diffusion tensor imaging (observer 4). FA, fractional anisotropy; ICC, intraclass correlation coefficient; CI, confidence interval; ROI, region of interest.

Discussion
This study is one of the first to evaluate the reliability of DTI in assessing the cervical spinal cord in adult patients with CSM. The aim was to assess the test-retest and inter-observer reliability of FA values at all intervertebral disc levels of the whole cervical spinal cord in patients with CSM. To date, only three studies [17][18][19] have reported assessment of the reliability of ROI placement to quantify DTI measurements in the cervical spinal cord. Two studies were performed on pediatric spinal cords, with or without spinal cord injury. Only one study was performed on the cervical spinal cord in a healthy adult population.
Mulcahey et al. 18 reported good to strong test-retest reliability for diffusivity values at each level of the spinal cord. Likewise, the reliability of FA values for the mid-C4 level and levels at and below C5-C6 was good. Despite only fair repeatability for FA values at several levels, the data suggested that repeatable DTI values can be obtained for children with chronic cervical-level spinal cord injury, with evidence of good to strong reliability for mean diffusivity (MD), axial diffusivity (AD), and radial diffusivity (RD), and fair-to-good reliability for FA. Barakat et al. 17 also reported that inter- and intra-observer agreement between two ROI measurement methods (a freehand ROI and a fixed-size ROI) ranged from moderate (ICC = 0.5) to strong (ICC = 0.84) in the normal pediatric spinal cord, and that FA values showed the highest variability among DTI parameters (ICC = 0.10-0.87).
Brander et al. 19 conducted a study on the reproducibility of ROI measurements using a freehand technique in a healthy adult population. They reported that the intra-observer variation of the measurements for whole-cord FA and for the ADC showed almost perfect agreement when using both ROI- and tractography-based measurements. There was greater variation in measurements of individual columns. Inter-observer agreement varied from moderate to strong for whole-cord FA and for the ADC.
Table 2. Test-retest reliability of the three measurement methods among the four observers. Two elective medical university students (observers 1 and 2), one third-year radiology resident (observer 3), and one neuro-radiologist with 2 years of experience in diffusion tensor imaging (observer 4). FA, fractional anisotropy; ICC, intraclass correlation coefficient; CI, confidence interval; ROI, region of interest.

Table 3. Inter-observer reliability for the three measurement methods within two groups. *The group with two medical students (observers 1 and 2). †The group with a radiology resident and neuro-radiologist (observers 3 and 4). FA, fractional anisotropy; ICC, intraclass correlation coefficient; CI, confidence interval; ROI, region of interest.

In our study, repeated ROI measurements revealed wide variations in the inter-observer reliability, which was in contrast to the findings of previous articles. Although the radiology resident and neuro-radiologist group (observers 3 and 4) showed excellent test-retest reliability, inter-observer reliability showed fair-to-good agreement for almost every spinal segment using the three measurement methods. Furthermore, there were statistically significant differences in the calculated mean FA values obtained when these observers used the three ROI measurement methods. This finding has practical implications because DTI data for the analysis of cervical cord abnormalities can include a basic measurement error bias, which may cause a reliability problem in clinical use. If the placement of the ROI includes more CSF or adjacent structures, the FA value will decrease; Fig. 1 clearly demonstrates this phenomenon. Observer 1 placed more single-voxel ROIs than did observer 3. According to the axial T2-weighted TSE scan, the coverage of the total ROI placement exceeded the true outline of the spinal cord and contained more CSF.
Consequently, calculated FA values were markedly lower for observer 1 than for observer 3. Furthermore, in patients with CSM, a compressed spinal cord and the low spatial resolution of DTI make it difficult to place an exact ROI that includes only the spinal cord. Recently, several reports have indicated that pre-operative FA values correlated well with symptoms or functional status in patients with CSM and could predict surgical outcomes 7,14,15. According to our results, DTI parameters and clinical data analysis should be interpreted with caution. According to our data, the sagittal ROI measurement method showed lower test-retest and inter-observer reliability and statistically different mean FA values as compared with the other methods. Thus, it seems that the sagittal ROI method is not appropriate for measurement in a clinical setting. Particularly at the C2/C3, C5/C6, and C7/T1 levels, there was poor inter-observer agreement when using sagittal ROI measurement. This result is similar to that of Barakat et al. 17. The upper cervical (C2/C3) and lower cervical (C7/T1) levels were located at the edge of the coil that was used for imaging the patients. In these regions, the signal-to-noise ratio (SNR) decreases and the signal drop is marked. Furthermore, the relatively wide CSF space, as compared to other spinal levels, may lead to inclusion of more CSF when placing the ROI. All these factors can contribute to lower inter-observer agreement when using the sagittal ROI method. The low agreement at the C5/C6 level may have been related to the C5/C6 level being the most commonly affected segment, with central canal compromise, in patients with CSM, while the lower cervical levels (C4-C7) were most sensitive to cardiac motion. Therefore, obscured anatomical margins and some cardiac-related artifacts may have biased the placement of ROIs.

However, as a possible solution to these ROI problems, Yokohama et al. 20 reported a more reliable DTI method with better visibility for 3-T MRI, called reduced FOV or zonally oblique multislice (ZOOM) DTI. They concluded that ZOOM DTI provides better visibility with less distortion and high accuracy using a small FOV and a shorter practical scan time compared with conventional DTI. Moreover, using this ZOOM DTI method, Iwasaki et al. 21 reported that pre-surgical FA values are affected by the "aligned fibers effect", in which compressed fibers show higher FA values, and that those values are not suitable as prognostic predictors. Ultimately, accurate parameter measurement is important for the clinical utility of DTI; therefore, technical improvement is necessary to clearly distinguish the boundaries between the spinal cord and CSF by reducing image distortion and improving the spatial resolution. A method such as ZOOM DTI would be a good solution, and the authors also proposed that, to attain faster, high-resolution DTI sequences, combining ZOOM DTI with recently introduced techniques such as turbo spin-echo (TSE)-DWI and multi-band SENSE is desirable.
There are some limitations to our study. First, the resolution of the DTI in this study was low, and this may have increased the inter-observer variation in the reliability assessment. In fact, our institution currently uses a DTI protocol with increased slice thickness and NEX (4 mm and 14, respectively), and reduced FOV technology could also be a solution to increase resolution. With a sufficient increase in resolution, the reliability of the DTI parameter measurements might have been higher. In addition, an appropriate anatomic reference may need to be provided, but only the sagittal ROI measurement method used the guidance of the T2-weighted scan. If T2-based (T2WI or T2*-weighted image) references had been used, the reproducibility of ROI placement would likely have been better. Second, only patients with CSM were enrolled. Compared with a normal healthy population, anatomical changes, such as underlying spondylosis or cord compression, may make it difficult to place the ROI exactly on the spinal cord. However, CSM is the most common form of spinal cord dysfunction 22 and the reliability of FA measurements in this group of patients has clinical importance. Third, although all observers who performed measurements had a consensus training session for ROI placement before the study, their experience in DTI and related measurements was relatively low. Furthermore, the training using the standardized protocol was insufficient, especially for the inexperienced observers, and these problems may have affected the interpretation of the reliability results. The finding that test-retest reliability was high while inter-observer reliability was relatively low may itself indicate that each observer applied the measurement methods differently, depending on his or her understanding of anatomy and MR images, even though consensus training was performed before the measurements.
Therefore, we believe that the true reliability of DTI can be determined if training using a standardized protocol is conducted before the study and experts with sufficient practical experience serve as observers. Fourth, we did not use cardiac gating during the image acquisition, which can diminish flow artifacts from CSF. However, using cardiac gating may increase other motion artifacts caused by respiration or swallowing, because of the lengthened examination time. Fifth, we evaluated only FA values among DTI parameters, and did not investigate the ADC, AD, RD, or MD. Previous studies 17,18 demonstrated relatively low and variable reliability in FA values as compared with other diffusion parameters. Thus, further study evaluating the reliability of other DTI parameters is needed to reveal overall reliability. Finally, an ideal reliability study requires a prospective design with an a priori hypothesis; a prospective, high-resolution DTI study is required to support our findings.
In conclusion, for use as a diagnostic tool, data obtained by measurements on DTI should have high reliability and reproducibility. Despite excellent test-retest reliability of the ROI measurements, FA values in patients with CSM varied widely in terms of inter-observer reliability in our study. Therefore, DTI parameters should be interpreted with caution when applied clinically. Furthermore, education and practical training in DTI methods are imperative to ensure reliable measurements.

Methods
This retrospective study was approved by the Seoul National University Bundang Hospital institutional review board (IRB No: B-1406-256-102). This study was conducted in accordance with relevant guidelines and regulations and the Declaration of Helsinki. The research posed no more than minimal risk to participants and was reviewed through an Expedited Review. Therefore, the requirement for informed consent was waived by the Seoul National University Bundang Hospital institutional review board.

Study subjects.
We retrospectively searched the electronic medical record system and Radiology Department database at our institution between July 2013 and December 2013, for cases meeting the following inclusion criteria: (1) a clinical diagnosis of CSM, (2) availability of pre-operative diffusion tensor MRI scans, and (3) the use of surgical decompression. All patients had neurological signs and symptoms with clear evidence of cervical spinal cord compression due to cervical spondylosis on conventional cervical spine MRI. The evaluation of myelopathy was performed using a modified Japanese Orthopedic Association score. We excluded patients with (1) tumor-, trauma-, or infection-related cord compression, (2) prior surgery, (3) coexisting neurologic disorders, such as acute transverse myelitis or multiple sclerosis, and (4) suboptimal image quality due to severe artifacts. A total of 34 patients (12 men, 22 women; mean age, 58.7 [range 45-79] years) were enrolled in this study. For surgical treatment, anterior cervical discectomy and fusion was performed in 28 cases, laminoplasty was performed in four cases, and posterolateral fusion was performed in two cases.

DTI protocol. All pre-operative DTI was performed within 2 weeks prior to surgery. All MRI examinations of the cervical spinal cord were performed using a 3-T MRI scanner (Achieva, Philips Medical Systems, Best, The Netherlands). No upgrade or other changes were made to the MRI system software in this study. During the image acquisition process, all subjects were placed in the supine position with 16-channel neurovascular coils applied to the cervical region.
Sensitivity-encoding (SENSE) single-shot echo planar imaging (EPI) was used. Used MR protocols are 23

After sending all source images of the DTI to a personal computer, diffusion tensor parameters/fiber tracking were evaluated using the fiber assignment by continuous tracking (FACT) algorithm implemented within the DTI task card software (the Extended MR WorkSpace 2.6, Philips Medical Systems) 6,24. In the axial b0 image, two slices (C1 and C7 levels) were selected. A circular ROI that included the entire spinal cord was placed on each slice, and fiber tracking was performed. Only fibers passing through the ROIs were displayed. The thresholds for tracking termination were 0.2 for FA and 30° for the angle between two contiguous eigenvectors.
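The two termination thresholds stated above can be illustrated with a minimal sketch. This is not the vendor's FACT implementation; the function names are hypothetical, and treating eigenvectors as sign-free (so that opposite vectors count as 0° apart) is an assumption made here for illustration.

```python
import math

FA_THRESHOLD = 0.2          # tracking stops below this FA (value from the text)
ANGLE_THRESHOLD_DEG = 30.0  # tracking stops above this turning angle (value from the text)


def angle_between_deg(v1, v2):
    """Angle in degrees between two principal eigenvectors.

    Assumption: eigenvectors have no intrinsic sign, so the absolute value
    of the dot product is used and the result lies in [0, 90].
    """
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("direction vectors must be non-zero")
    cos_angle = min(1.0, abs(dot) / (norm1 * norm2))  # clamp rounding error
    return math.degrees(math.acos(cos_angle))


def should_terminate(fa, prev_vec, next_vec):
    """Return True if either stopping criterion is met at the next voxel."""
    too_isotropic = fa < FA_THRESHOLD
    too_sharp = angle_between_deg(prev_vec, next_vec) > ANGLE_THRESHOLD_DEG
    return too_isotropic or too_sharp


# Hypothetical step: FA is adequate, but the fiber direction turns by 40 degrees.
turn = math.radians(40)
print(should_terminate(0.45, (0.0, 0.0, 1.0), (0.0, math.sin(turn), math.cos(turn))))
```

A track therefore ends either when it enters low-anisotropy tissue (for example, CSF) or when it would have to bend more sharply than the angular limit allows.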
Image and measurement analysis. A total of four observers (two elective medical university students [observers 1 and 2], one third-year radiology resident [observer 3], and one neuro-radiologist with 3 years of experience in DTI [observer 4]) independently measured FA values, twice, after consensus training. To prevent recall bias, the two measurements were performed at an interval of 1 month. After sending all source DTI images to a personal computer, each observer, who was blinded to the clinical condition of each patient, measured the FA value in the cervical spinal cord at the level of each spine segment. For the FA measurements, ROIs were manually drawn on axial and sagittal color tensor maps along the cervical spinal cord at the level of each cervical intervertebral disc. Spine segments were selected for each disc level from C1/C2 to C7/T1, with reference to a mid-sagittal T1-weighted image.
Three measurement methods were used for placing the ROIs in this study. (1) In the mean ROI method, single-voxel ROIs were selected for each voxel inside the spinal cord on the axial image, with special attention paid to avoiding partial volume effects, magnetic susceptibility effects, and motion artifacts. Average FA values for all voxels inside the spinal cord at each spine segment level were calculated (Fig. 2a,b). (2) In the manual ROI method, each observer manually outlined an ROI up to the outer margin of the spinal cord on an axial FA map, using a freehand technique, which represented approximately one voxel, while being cautious to avoid volume-averaging effects with the cerebrospinal fluid (CSF) (Fig. 2c). (3) In the sagittal ROI method, each ROI was placed manually on the sagittal FA map, similar to the second method (manual ROI) (Fig. 2d). In this method, ROI selection for each spinal level was guided by reconstructed sagittal b0 maps, and axial and sagittal turbo spin-echo (TSE) T2-weighted images.
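The averaging step in the mean ROI method, and the reason all three methods try to exclude CSF, can be sketched in a few lines. The voxel FA values below are hypothetical and chosen only for illustration (cord voxels with relatively high FA, CSF voxels with low FA); `mean_fa` is an illustrative helper, not part of the software used in this study.

```python
def mean_fa(voxel_fa):
    """Average FA over the voxels selected as the ROI at one disc level."""
    if not voxel_fa:
        raise ValueError("ROI must contain at least one voxel")
    return sum(voxel_fa) / len(voxel_fa)


# Hypothetical FA values for illustration only.
cord_voxels = [0.62, 0.58, 0.65, 0.60]  # voxels inside the spinal cord
csf_voxels = [0.12, 0.15]               # voxels in the surrounding CSF

tight_roi = mean_fa(cord_voxels)               # ROI confined to the cord
loose_roi = mean_fa(cord_voxels + csf_voxels)  # ROI that spills into CSF

print(f"cord-only ROI: {tight_roi:.3f}")
print(f"ROI including CSF: {loose_roi:.3f}")
```

Under these assumed values, the ROI that includes CSF voxels yields a lower mean FA than the cord-only ROI, which is the direction of bias expected when an ROI exceeds the cord outline.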
One of the authors (musculoskeletal radiologist with 4 years of experience in spinal DTI analysis) conducted the image and measurement analyses. To assess test-retest and inter-observer reliability, the FA values measured by the four observers using these three methods were compared.

Statistical analysis. Statistical analyses were performed by one author. The test-retest and inter-observer reliability of each FA value obtained by the four observers using three measurement methods were assessed using intraclass correlation coefficients (ICCs) and a two-way random model. Test-retest and inter-observer reliability depend primarily on good training of the observers and good standardization of the task. The ICC value could range from 0 to 1; ICC values of less than 0.40 represented poor agreement, values of 0.40-0.75 represented fair-to-good agreement, and values greater than 0.75 represented excellent agreement.
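As an illustration of the ICC computation and the agreement categories described above, the following is a minimal sketch. The text specifies only "a two-way random model"; the single-measurement, absolute-agreement form, ICC(2,1), is assumed here, and the study's SPSS settings may have differed. The function names and the FA values are hypothetical.

```python
def icc_two_way_random(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    ratings: one row per subject, one column per observer (or per session
    when assessing test-retest reliability).
    """
    n = len(ratings)     # number of subjects
    k = len(ratings[0])  # number of observers or sessions
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                # between-subject mean square
    ms_cols = ss_cols / (k - 1)                # between-observer mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def agreement_label(icc):
    """Categories used in this study: <0.40, 0.40-0.75, >0.75."""
    if icc < 0.40:
        return "poor"
    if icc <= 0.75:
        return "fair-to-good"
    return "excellent"


# Hypothetical FA values: five subjects measured by two observers.
fa = [[0.61, 0.55], [0.48, 0.47], [0.66, 0.58], [0.52, 0.50], [0.70, 0.61]]
icc = icc_two_way_random(fa)
print(f"ICC = {icc:.3f} ({agreement_label(icc)})")
```

Because the absolute-agreement form penalizes a systematic offset between observers, two observers who each rank subjects consistently can still obtain a modest ICC if one tends to measure lower FA values than the other.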
The differences in the mean FA value, averaged per cord level from C1/C2 through C7/T1 of all study subjects, for all three measurement methods among the observers were assessed using the Wilcoxon signed-rank test. Analyses were performed using SPSS (ver. 21.0, SPSS Inc., Chicago, IL, USA) and MedCalc software (version 13.0, MedCalc Software, Mariakerke, Belgium). A P value < 0.05 was considered statistically significant.
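For readers unfamiliar with the paired test named above, a minimal sketch of the Wilcoxon signed-rank test follows. It uses the normal approximation with a tie correction and no continuity correction, which is an assumption; the exact procedure applied by the statistical software in this study is not stated, and the approximation is rough for small samples. The function name and the paired FA values are hypothetical.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Return (W_plus, W_minus, two-sided p) for paired samples x and y.

    Zero differences are dropped, tied absolute differences receive
    average ranks, and p comes from the normal approximation.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (min(w_plus, w_minus) - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w_plus, w_minus, p


# Hypothetical paired mean FA values from two observers.
obs3 = [0.61, 0.58, 0.63, 0.55, 0.60, 0.57, 0.52, 0.64, 0.59, 0.56]
obs4 = [0.52, 0.51, 0.58, 0.50, 0.49, 0.53, 0.47, 0.55, 0.50, 0.52]
w_plus, w_minus, p = wilcoxon_signed_rank(obs3, obs4)
print(f"W+ = {w_plus}, W- = {w_minus}, p = {p:.4f}")
```

The test ranks the absolute paired differences and compares the rank sums of positive and negative differences, so it does not require the FA differences to be normally distributed.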

Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.