Fully automated grey and white matter spinal cord segmentation

Axonal loss in the spinal cord is one of the main contributing factors to irreversible clinical disability in multiple sclerosis (MS). In vivo axonal loss can be assessed indirectly by estimating a reduction in the cervical cross-sectional area (CSA) of the spinal cord over time, which is indicative of spinal cord atrophy, and such a measure may be obtained by means of image segmentation using magnetic resonance imaging (MRI). In this work, we propose a new fully automated spinal cord segmentation technique that incorporates two different multi-atlas segmentation propagation and fusion techniques: The Optimized PatchMatch Label fusion (OPAL) algorithm for localising and approximately segmenting the spinal cord, and the Similarity and Truth Estimation for Propagated Segmentations (STEPS) algorithm for segmenting white and grey matter simultaneously. In a retrospective analysis of MRI data, the proposed method facilitated CSA measurements with accuracy equivalent to the inter-rater variability, with a Dice score (DSC) of 0.967 at C2/C3 level. The segmentation performance for grey matter at C2/C3 level was close to inter-rater variability, reaching an accuracy (DSC) of 0.826 for healthy subjects and 0.835 people with clinically isolated syndrome MS.

Histopathological studies have provided evidence of spinal cord grey matter (GM) and white matter (WM) changes in a large spectrum of neurological conditions 1 . In particular, the loss of axons in the spinal cord is thought to represent one of the main contributing factors to clinical disability. In vivo, axonal loss can be assessed indirectly by estimating the reduction in cervical cord cross-sectional area (CSA) over time (i.e., spinal cord atrophy) from magnetic resonance imaging (MRI) through the use of image segmentation methods; the reduction in CSA over time has been shown to correlate with clinical scores of physical disability [2][3][4] . It must be noted that the measure of cord CSA alone cannot differentiate between the individual rates of GM and WM atrophy, which may have different clinical relevance, and should be best studied independently 5 . However, in pathologies such as multiple sclerosis (MS) where the presence of focal and diffuse lesions may obscure the boundary between tissue-types 6 , tissue-specific segmentation can be challenging, and may potentially lead to biased morphometric estimates. Spinal cord GM area was the strongest correlate of disability in MS using multivariate models that include brain GM and WM volumes, fluid-attenuated inversion recovery lesion load, T1 lesion load, spinal cord CSA, number of spinal cord T2 lesions, age, sex, and disease duration 3,4 .
Over the last two decades, a number of semi-automated segmentation methods have been proposed for the estimation of spinal cord CSA 2,7-15 , and more recently, methods that jointly segment both CSA and GM [16][17][18][19][20] . Recent studies by El Mendili et al. 14,21 , which are based on the threshold-based method (TbM) previously described by Losseff et al. 2 , has shown high reproducibility and accuracy in measuring cord CSA using a semi-automated double threshold-based method (DTbM) that requires minimal manual intervention. As compared to the reference method, DTbM had a Dice Similarity Coefficient (DSC) value of 0.98 for cord CSA at C2 vertebral level in a group of 82 healthy subjects and 30 patients with various neurological conditions such as amyotrophic lateral sclerosis, spinal muscular atrophy and spinal cord injury. While semi-automated methods have improved considerably over the years, reducing intra-and inter-rater variability could have direct implications for future clinical trials of neuroprotection or longitudinal observational studies.
Multi-atlas automated segmentation methods have been used successfully in different parts of the brain. A fully automated spinal cord segmentation method, based on an iterative Non-Local Simultaneous Truth And Performance Level Estimation (iNLS) multi-atlas framework, has also been presented recently 16 . The method demonstrated high segmentation accuracy for the non-pathological spinal cord with a median DSC of 0.81 for CSA segmentation and 0.72 for GM segmentation. More specifically, the iNLS multi-atlas framework uses 3 degrees-of-freedom rigid body transform per template propagation step in order to achieve the reported accuracy.
Recently, a different multi-atlas segmentation propagation and fusion technique called STEPS (Similarity and Truth Estimation for Propagated Segmentations) 22 has demonstrated a statistically significant increase in segmentation accuracy in several key brain structures such as the hippocampus compared with other methods. The algorithm operates in two steps as follows. In the first step, there is the segmentation propagation, where each element of the template library is registered to the target image using an affine registration (12 degrees-of-freedom), followed by a non-rigid registration; in the second step, the best deformed templates are selected based on the locally normalised cross correlation, and are fused into a consensus segmentation. The combination of a local similarity metric for image fusion and the use of non-rigid registration for template propagation makes STEPS a good candidate algorithm for CSA and GM segmentation. However, the affine/non-rigid registration steps require a good initialisation to achieve high accuracy.
Another patch-based multi-altas segmentation and fusion technique is PatchMatch [23][24][25] . While the original PatchMatch algorithm was designed to look for similarities between two 2D images 26 , multiple extensions to include 3D MR images have been proposed since. Of relevance here is the optimized PatchMatch Label fusion (OPAL), which was first used to produce accurate and fast segmentations of the hippocampus 23,25 . This work has been recently extended to multi-modal data (4D) for the purpose of MS lesion detection 24 . PatchMatch methods are normally good at localising and roughly segmenting objects in images due to the computational efficiency of the OPAL random search algorithm, albeit with reduced accuracy.
In this work, we propose a new fully automated spinal cord segmentation technique that incorporates two different multi-atlas segmentation propagation and fusion algorithms: OPAL and STEPS. The OPAL random search is used to localise and roughly segment the spinal cord, and the STEPS algorithm is then used to perform slice-wise tissue segmentation. The proposed method is evaluated retrospectively using a dataset that includes healthy volunteers and people with MS.

Methods
Segmentation methods. The proposed method combines two existing label fusion segmentation techniques: OPAL for detecting the spinal cord and STEPS for accurately segmenting the GM and whole spinal cord. Bearing in mind that the typical slice thickness of spinal cord images is approximately 3-5 mm 27 , the proposed method performs all registrations and segmentations in a 2D slice-wise manner before merging them into a 3D volume. A schematic representation of the proposed pipeline is shown in Fig. 1. Additional details pertaining to each step individually are discussed in subsequent sections.
Step 1: Localisation using OPAL. While the original PatchMatch algorithm was designed to match 2D patches within the same image 26 , the method has been recently extended to 3D 23,25 and 4D data 24 , with the search range extended to an external reference library. Due to the excellent performance in terms of computational time and good segmentation accuracy in some applications, OPAL is used here in its original form to simply localise the spinal cord. This cord localisation step is achieved by providing an external database of spinal cord images and associated manually segmented cords to the OPAL algorithm, all of which are then propagated to the new unseen image. This step has an average computational time of less than 1 sec. While OPAL was found to perform well both for cord and GM segmentations in healthy volunteers, the segmentation performance degraded rapidly in the presence of pathology (e.g. MS lesions), not allowing to use the default application over MS subjects. OPAL is publicly available inside NiftySeg software package (niftyseg.sf.net). Step 2: Segmentation using STEPS. The main characteristic of STEPS is that it introduces a spatially variant image similarity term in a STAPLE framework 22 , enabling the characterisation of both image similarity and human rater performance in a unified manner. The STEPS segmentation process is divided into two stages: segmentation propagation and fusion. Starting from a template library with associated manual segmentations, all the templates (excluding the image under analysis) are first registered to the target image using initially a rigid-only registration and then a non-rigid registration. All registration was done using the NiftyReg software package (niftyreg.sf.net). The normalised cross correlation (NCC) was then estimated between each deformed template and the target image, quantifying the similarity between the two images. The most similar deformed templates according to the NCC were finally fused into a consensus segmentation according to the locally normalised cross correlation (LNCC) between the registered template images and the target image. A consensus probabilistic CSA and GM segmentation is obtained using the STEPS algorithm as implemented in NiftySeg (niftyseg.sf.net). The probabilistic nature of the consensus segmentation implicitly encodes partial volume effect, improving tissue boundary localisation and delineation. Finally, to produce binary segmentations, the probabilistic consensus masks were thresholded at 0.5.
Note that, in order to increase the performance of the fusion step, the centre of mass of the OPAL cord segmentation was used to initialise the rigid registration step between templates and target image. The OPAL cord segmentation was also used to mask non-cord regions from the non-linear registration step, further improving the performance of the template registration step. Algorithm Parameters. The pipeline works in a slice-wise manner (2D). All experiments used the following parameters. OPAL was used with the original parameters 23,25 . The 2D patch size was 5 × 5 voxels and the number of inner iterations was 5. Finally, the numbers of threads and the number of best-matches were both set to 10.
The OPAL mask was then dilated eight times to ensure that there was sufficient boundary information to perform both affine and non-rigid registrations 28,29 .
For STEPS, the parameters used were as suggested by Cardoso et al. 22 . The number of best templates was X = 15, standard deviation of the Gaussian smoothing kernel for LNCC estimation σ = 1.5 and the Markov Random Field (MRF) spatial consistency set to 0.55.
The linear registration of the templates for the STEPS label fusion has been done using a symmetric 6 degree-of-freedom affine registration using a block matching approach 30,31 . The non-rigid alignment has been done using the fast free-form registration algorithm 32 with the LNCC (σ = 1) as a similarity measure. The pyramidal approach was set to use 6 levels, and the grid spacing at the lowest level of the pyramid set to 5 voxels.
The input image was resampled to the template library resolution using a linear interpolation, and the obtained binary masks were resampled back to the original resolution using nearest-neighbour interpolation in order to maintain their categorical nature.

Figure 2. Example of GM and CSA consensus masks obtained from the three raters masks intersection.
Consensus mask for a healthy control (rows 1 and 2) and a RRMS patient (rows 3 and 4). DSC respects the consensus mask is overlayed in each segmentation.
Scientific RepoRts | 6:36151 | DOI: 10.1038/srep36151 Data. We analysed data acquired in 102 subjects participating in a study of spinal cord MR imaging in MS 33 .
There were: 25 healthy controls, 19 people with clinically isolated syndrome (CIS), 20 with primary progressive multiple sclerosis (PPMS), 17 with secondary progressive multiple sclerosis (SPMS), and 21 with relapsing remitting multiple sclerosis (RRMS). We recruited people with different types of MS because they are expected to have a different degree of pathological changes in the spinal cord. MS lesions involving at least two spinal cord WM columns and extending to the GM are more frequent in SPMS than in the early phase of MS 33 . Written informed consent was obtained from all participants and this work was approved by the local research ethics committee.
Three expert raters working independently and with the raw images, first outlined GM manually and semi-automatically outlined the cord CSA using the 'cord finder' 10 option available with JIM v.6 (Xinapse systems, www.xinapse.com/). Based on the previously published methodology by Losseff et al. 2 , a 15 mm section of the high-resolution 3D-FFE volumetric scan (i.e. 3 slices) was extracted, with the middle slice passing through the C2-3 intervertebral disc. 'Cord finder' option runs an active surface model that it needs the manual place of a seed. Raters manually positioned a seed in the centre of the cord in each of the three slices. For each subject, the associated consensus segmentation of the three raters (at least two raters agree) for the cord CSA and GM were estimated using majority voting, see Fig. 2.
The C2-3 cervical cord level was used to estimate the performance of the proposed algorithm as the inter-rater variability was found to be lower in cord areas surrounded by CSF. Nonetheless, the application to other spinal cord segments, such as thoracic or lumbar areas would only require an area specific template database. Losseff    et al. 2 indicates that, within the cord, this location (C2-3 level) yields the most reproducible cross-sectional area values and it is the most convenient region to measure atrophy in the spinal cord. To avoid bias due to lack of consensus between manual segmentations, segmentations with a 3D DSC lower than 0.7 between two of the three expert raters, or images of poor quality (due to noise, motion, etc.) were excluded from this study. A total of 7 subjects (4 PPMS, 2 SPMS and 1 RRMS) were excluded, which resulted in a final group of 95 subjects (25 controls, 19 CIS, 16 PPMS, 15 SPMS and 20 RRMS). Table 1 shows the demographic information of the selected data.
Template library. The template library used in this work consisted of 3D imaging volumes from the 95 subjects described in the Data section. For each subject there were three 2D slices, with the centre slice through the C2-3 level. In order to maximise the size of the library, all the scans were left-right flipped, resulting in a final template library of 570 2D images in total. For each image in the template library, the associated consensus segmentation of the three raters for the cord CSA and GM was used as template (see Fig. 2).
Leave-one-out cross-validation. The proposed fully automated method, which estimates the CSA and segments the GM at the same time, was compared to the consensus segmentation of 3 raters. We also compared the performance of the proposed method to the iNLS multi-atlas framework -a fully automated GM/WM spinal cord segmentation method 16 . iNLS has been used with the parameters presented in Asman et al. 16 . A leave-one-out strategy was used for both iNLS and the proposed method, i.e. the image being segmented as well as its left-right flipped version 15 , were removed from the template library during segmentation. Both iNLS and the proposed method used the same library as described in the Data section.
All binary segmentations obtained using the proposed method were compared to the manual segmentations, the consensus segmentation and the iNLS results. To assess the performance, we used the 3D DSC, the mean  surface distance (MD) and the Hausdorff distance (HD) between the masks 15,34 . Statistical differences in performance were estimated using a two tail unequal variance paired t-test as the error distribution was approximately Gaussian in most experiments. Furthermore, all patients' data (CIS, PPMS, SPMS and RRMS) were split into two independent groups, one with visible lesions at C2/C3 level and one without, in order to assess the impact of the presence of lesions on the segmentation performance.
Test-retest assessment. For assessing the reliability of the presented method we performed a test-retest experiment. Test-retest data were composed of 5 different healthy controls that repeated the same MR imaging protocol 3 times on separate occasions, with a minimum of a week and a maximum of two weeks between measurements. We segmented the images using the same library as described in Data Section and one rater analysed all the data.

Results
Tables 2, 3 and Fig. 3, show the evaluation results for the CSA and GM segmentation at C2/C3 level respectively, with the mean (std) and p-value for the DSC, MSD and HD estimated with respect to the consensus masks. In Tables 2 and 3, significant differences (p < 0.05) have been marked with "*". For the qualitative results, the best and the worst DSC results are shown for each patient group for GM segmentation (see Fig. 4).
We have found that the proposed method provides more accurate results for the CSA segmentation at C2/C3 level when compared to iNLS (DSC and MSD -p < 0.05) for all groups. When comparing the proposed method (see Table 3) with the iNLS, superior results were found for healthy controls, however, lower performance, but not statistically significant, is obtained for GM DSC on SPMS group. There are significant differences for GM segmentation (DSC -p > 0.05) between the methods for CIS, PPMS and RRMS subjects. Overall, the proposed method outperforms the state-of-the-art method iNLS (p < 0.05) for CSA and GM segmentation at C2/C3 level (see last row in Tables 2 and 3), and for GM segmentation the obtained MSD and HD by Rater 1 are not significantly different from proposed method. Table 4 shows the results from splitting the patients' data into two groups depending on the presence of visible lesions at the C2/C3 level. There were significant differences for CSA and GM segmentation between the absence and presence of lesions. Finally, Table 5 presents the reliability assessment results for CSA and GM regions mean and std.
Experiments were performed on a server (

Discussion
We propose the combined use of OPAL and STEPS for the segmentation of the whole spinal cord CSA and GM. Although we have found the output of OPAL was highly dependent on the amount of noise and the presence of lesions (producing highly spurious segmentations in some situations), our results nonetheless demonstrate that OPAL can provide a sufficiently accurate initial segmentation of the spinal cord in order to facilitate a constrained STEPS label fusion process.
Asman et al. 16,36 highlight some caveats of using non-rigid registration for spinal cord imaging. In the proposed method, the OPAL segmentation was used to initialise and constrain, providing a region mask to the slice-wise non-rigid registration performed by STEPS, thus ameliorating most of these concerns. The computational time needed to perform these slice wise non-rigid registrations was also deemed adequate, taking approximately one second per slice and template. Moreover, the use of non-rigid registration reduces the need for a refinement step as proposed in Asman et al. 36 .
The fact that the proposed method gets lower, but not statistically different, performance for GM DSC on SPMS group (see Table 3) maybe due to the low number of subjects and the presence of large lesions. The use of non-rigid registration in the proposed method might be detrimental in subjects with very large lesions, as the non-rigid step introduces too much flexibility and uncertainty when deforming the candidate labels (see columns SPMS and RRMS at Fig. 4). Overall, similar differences were found for the MSD and HD scores. Table 4 shows the effect of the presence or absence of lesions, these differences are specially large and significant for GM (DSC and MSD p < 0.05).  Tables 2 and 3 show that the proposed method is able to segment the CSA with a very high accuracy, and only slightly different to the three rater consensuses. Method gets a DSC of 0.968 that is not a clinically significant difference. The low variability between raters for CSA segmentation is thanks to the semi-automated nature of the 'cord finder' method from JIM. We would like to note that the results presented have been obtained in a three slice subset of the full volume. As spinal cord imaging usually has dozens or hundreds slices per volume, consistency across slices can be exploited in the future to reduce the amount of errors produced by the proposed algorithm. Moreover, the reduced number of slices and voxels per mask, in conjunction with the evaluation metrics, highlight the quality of the obtained results. Covering a larger spinal cord area would hide any small miss-segmented area due to repetitive spinal cord nature.

Results in
On a group specific basis, the CSA segmentation performed very well in all the groups and the presence of lesions has a low impact (see Table 4). Test-retest assessment also confirms these results (see Table 5). It is also important to note that the proposed method has a lower CSA segmentation HD than Rater 1 for all groups, suggesting that the errors introduced by human raters with different levels of expertise or training can be larger than the errors introduced by the proposed method, further justifying the need for advanced automated segmentation techniques for multi-centre studies and trials.
Regarding the GM segmentation, the proposed method provided acceptable results for the groups without lesions (healthy controls and CIS), achieving better MSD and HD than Rater 1. While Rater 1 tended to over-segment the GM when comparing to the consensus mask, the presented method had a tendency to only slightly under-segment the GM when compared to the consensus mask, resulting in lower MSD and HD. On a group specific basis, the GM segmentation performed better in healthy controls and CIS patients than in the other three MS groups, which is likely to be related to the limited number of lesions identified in the CIS cohort 33 . GM test-retest assessment results are also promising (see Table 5).
From Figs 3 and 4, one can observe that the worst results correspond to outliers in terms of Dice score. In these situations, the presence of hyperintensities favour the use of anatomical a priori assumptions by the human raters. In future we will include anatomical priors and uncertainty estimates in the segmentation process so that the segmentation algorithm reverts back to the prior model in very uncertain situations.
The state-of-the-art results obtained as part of the ISMRM 2016 satellite spinal cord challenge further demonstrate that the proposed method is robust to differences in scanner manufacturer and acquisition parameters when compared to 4 independent rater segmentations.
In summary, this paper introduces a fully-automated GM and whole spinal cord segmentation technique based on a collaborative effort between two multi-label fusion techniques. OPAL quickly finds the cord and then STEPS segments GM and CSA. Both healthy and different types of MS subjects were used for performance assessment, and the results have been compared to the consensus of three expert manual segmentations. The accuracy of the proposed method was found to be close, comparable and in some situations better, than a single rater. GM segmentation results obtained in PPMS, SPMS and RRMS subjects showed promising results but further work is necessary to remove the influence of MS lesions.
The proposed method has only been applied to the cervical area (C2-3 level) of the spinal cord because it is the most convenient region to measure atrophy in the spinal cord in MS. Specifically, the obtained results for the CSA segmentation suggests that it may be used in a large multi-centre neuroprotective trials in progressive MS. Moreover, application to other spinal cord segments, such as thoracic or lumbar areas, or other sequences would only require an area/sequence specific template database.  Table 5. Test-retest measurements obtained from C2-3 spinal cord level. Mean (std) for volume, area, coefficient of variation (COV) and Dice score coefficient (DSC).