A novel deep learning based hippocampus subfield segmentation method

The automatic assessment of hippocampus volume is an important tool in the study of several neurodegenerative diseases such as Alzheimer's disease. Specifically, the measurement of hippocampus subfields properties is of great interest since it can show earlier pathological changes in the brain. However, segmentation of these subfields is very difficult due to their complex structure and for the need of high-resolution magnetic resonance images manually labeled. In this work, we present a novel pipeline for automatic hippocampus subfield segmentation based on a deeply supervised convolutional neural network. Results of the proposed method are shown for two available hippocampus subfield delineation protocols. The method has been compared to other state-of-the-art methods showing improved results in terms of accuracy and execution time.

www.nature.com/scientificreports/ such as the Dilated Dense UNET 27 have been proposed. However, one of the major problems of supervised deep learning methods is their hunger for training data to be able to generalize on unseen data. In this paper, we propose a novel deep-learning based segmentation method that takes benefit of a problem specific preprocessing that locates the data in a canonical geometrical and intensity space therefore simplifying the segmentation problem and thus reducing the need for lots of manually labeled data. The proposed method has been validated using two hippocampus subfield segmentation protocols with publically available datasets.

Materials and methods
Training data. In this work, we have used two different datasets including two manual labeling hippocampus subfield segmentation protocols, both with high-resolution (HR) T1w and T2w MR images (see Fig. 1). Details of these datasets are given below: Kulaga-Yoskovitz dataset. This dataset includes 25 subjects from a public repository (http:// www. nitrc. org/ proje cts/ mni-hisub 25) (31 ± 7 years, 12 males, 13 females) with manually segmented labels dividing the HC in three parts (CA1-3, DG-CA4 and Subiculum). The Ethics Committee of the Montreal Neurological Institute and Hospital approved the study and written informed consent was obtained from all participants in accordance with the standards of the Declaration of Helsinki. Participants gave their written informed consent prior to scanning and received a monetary compensation. MR data from each subject consist of an isotropic 3D-MPRAGE T1-weighted (0.6 mm 3 ) and anisotropic 2D T2-weighted TSE images (0.4 × 0.4 × 2 mm 3 ). Images underwent automated correction for intensity non-uniformity, intensity standardization and were linearly registered to the MNI152 space. T1w and T2w images were resampled to a resolution of 0.4 mm 3 . To reduce interpolation artifacts, the T2w data was upsampled using a non-local super-resolution method 28 . For more details about the labeling protocol see the original paper 13 .
Winterburn dataset. This dataset contains 5 subjects with 0.3 × 0.3 × 0.3 mm 3 high resolution T1-weighted and T2-weighted images obtained by 2 × interpolation of 0.6 × 0.6 × 0.6 mm 3 acquisitions and their corresponding manual segmentations. The HR images are publicly available at the CoBrALab website (http:// cobra lab. ca/ atlas es). These MR images were taken from 5 healthy volunteers (2 males, 3 females, aged 29-57). The study was conducted in keeping with the Declaration of Helsinki, was approved by the Centre for Addiction and Mental Health Research Ethics Board, and all subjects provided written, informed consent for data acquisition and sharing. High-resolution T1-weighted images were acquired using the 3D inversion-prepared fast spoiled gradientrecalled echo acquisition (TE/TR = 4.3 ms/9.2 ms, TI = 650 ms, α = 8°, 2-NEX and isotropic resolution of 0.6 mm 3 ). High-resolution T2-weighted images were acquired using the 3D fast spin echo acquisition, FSE-CUBE (TE/TR = 95.3 ms/2500 ms, ETL = 100 ms, 2NEX, and isotropic resolution of 0.6 mm 3 ). Reconstruction filters, ZIPX2 and ZIP512, were also used resulting in a final isotropic 0.3 mm 3 dimension voxels. The hippocampus and each of their subfields were segmented manually by an expert rater including 5 labels (CA1, CA2/3, CA4/  Image preprocessing. The images were preprocessed using the following steps: (1) Denoising using the Spatially Adaptive Non-Local Means Filter 29 , (2) Intensity inhomogeneity correction using the N4 bias field correction 30 , (3) Affine registration to the Montreal Neurological Institute (MNI) space by applying the Advanced Normalization Tools (ANTs) package 31 . This registration was estimated using the T1w MNI152 template (at 0.5 mm 3 resolution) and the T1w images, and applied to both T1w and T2w images (a rigid transformation from T2w to T1w was previously estimated and later concatenated with T1w transformation to perform a single interpolation step when registering both T1w and T2w images). (4) Cropping: To reduce the memory requirements and the computational cost, the images were cropped around HC area, (5) Finally, the cropped images were intensity normalized by subtracting the image mean and dividing by its standard deviation.

Proposed method.
Our proposed method is based on a variant of the well-known UNET architecture 26 .
The proposed UNET has 4 resolution levels (from 0.5 to 4 mm). We used three blocks of BatchNormalization, 3D convolution (kernel size of 3 × 3 × 3 voxels) plus ReLU layers for each resolution level. We also used dropout layers (with 0.5 rate) in the encoding part of the UNET to minimize overfitting problems. The input of the network consists of a tensor with two channels (T1 and T2 images). The first resolution level has 64 filters and the next levels multiply by 2 this number to compensate the loss of spatial resolution. Similarly, the number of filters is reduced by 2 in the ascending path of the encoder at each resolution level. The output is also a tensor of nc channels represent the probabilities of each subfield and the background.
We also used a modified version of deep supervision 32 approach that helps to train very deep networks by producing segmentations at different resolution levels. Deep supervision has been shown to not only counteract the adverse effects of gradient vanishing but also to speed up convergence and produce highly accurate results even with limited data. The main difference of our implementation compared to Dou et al., is that we used upsampled low-resolution outputs also as inputs of the next level of the decoder (concatenated with the upsampled features and the encoder shortcut) to help in the next resolution level (only for 1 and 2 mm resolution levels). The resulting network has 56 layers and 35,085,580 trainable parameters. In Fig. 2 the scheme of the proposed network is shown. We will refer to the deep supervised variant of the UNET as DS-UNET3D.
The loss function plays a major role in the training process and an exhaustive search of the most suitable function for the proposed architecture and problem has to be done. One of the most common loss functions for classification is the categorical cross entropy. However, for segmentation purposes it is common to use the dice loss (DL) 33 as it directly optimizes the segmentation metric most commonly used and it is more robust to the class imbalance problem (1). Recently, a Generalized Dice Loss (GDL) 34 was proposed to deal with the wellknown dependency of the dice index with the size of the labels (2). Inspired by GDL, we propose in this paper the Generalized Jaccard Loss (GDL) (4) which is a variant of Jaccard loss (3) following the same idea to reduce label size dependency: Alternatively, in our proposed GJL loss we did not use the squared volume to normalize but just the volume, i.e. w c = 1/ N i=1 t ci . As the size of our training datasets is small (specially for the Winterburn dataset) we used different approaches of data augmentation. We randomly smooth and sharpen the images to simulate different image quality conditions during the training. We expanded the Winterburn dataset with automatic segmentations of the Kulaga-Yoskovitz dataset using the method HIPS. Note that to generate these segmentations only training data was used as library. Finally, we used also mixup 35 as a data agnostic method for performing data augmentation.
Batch normalization is a highly effective manner to speed up the training process and to improve the results by minimizing the internal covariate shift. However, we realized that when used with small batch sizes it behaves sub-optimally at test time. The reason of this issue is that Batch Normalization layer behaves differently at training and test time. During training, the mean and standard deviation of the activation maps are computed for the whole batch using a moving average estimation to enforce stability during training. However, at test time the network does not processes any batch of data and therefore cannot estimate the mean and standard deviation of the batch, as a result, the network uses the historical mean and standard deviation stored during training. Unfortunately, when using small batch sizes (N = 1 in our case) the stored values do not work very well. Nevertheless, if we run the network in training mode we force the network to use the current mean and standard deviation of the new case and the results are significantly improved. We call this, training time batch normalization (TTBN).

Experiments and results
In this section, the analysis of the different options of the proposed method and their results are presented. To evaluate the segmentation accuracy, we have used the DICE coefficient 36 measured in the linear MNI152 space. All experiments were performed using tensorflow 1.2.0 and keras 2.2.4 using Titan Xp Nvidia GPU with 12 GB RAM. To train the network, we used an Adam optimizer 37 with default parameters during 200 epochs and we test different loss functions with multiscale loss weights (0.1, 0.2 and 0.7) for low, medium and high-resolution outputs respectively (see Fig. 2). A batch size of one was used in all our experiments.
Kulaga-Yoskovitz and Winterburn datasets were preprocessed as described in "Image preprocessing". To increase the size of the training data, we put together left and right crops by left-right flipping the left crops to generate right oriented crops. This yield 50 right crops in Kulaga-Yoskovitz dataset and 10 right crops in Winterburn dataset. Since the size of both datasets is quite small we have used a K-fold cross validation strategy to increase the relevance of our findings. Specifically, we used K = 5 in both datasets. In Kulaga-Yoskovitz dataset this let each fold with 40 training images (5 of them for validation) and 10 test images. In Winterburn dataset each fold had 8 training images (2 of them for validation) and 2 test images.

Analysis of the proposed method.
There are many factors that affect the performance of deep learning methods such as the architecture, the loss function, data augmentation strategies, etc. In this section we will present some experiments that show their effects on the proposed method.
The first option we tested was the loss function. We compared 5 different loss functions using the same exact network initialization. In Table 1 the validation DICE of each loss function is compared for both datasets. We also included the categorical cross entropy (CCE) in the comparison as it is a common loss used in segmentation/ classification. As can be noted, CCE performed worse than dice loss which is a common loss function used in segmentation. Curiously, the GDL failed giving a really low dice compared with the other losses. JL performed similar than dice loss. The proposed GJL was the best performing loss in both datasets and therefore was selected a loss function of the proposed method.
To study the impact of the proposed architecture, we run the proposed network with deep supervision and compared it with the classic UNET. In Table 2 the results of the comparison are shown. As can be seen, the proposed architecture was able to improve the results in both datasets.
, www.nature.com/scientificreports/ It is well-known that in deep learning the amount of training data plays a major role (probably the biggest) in the quality of the network results. Unfortunately, manually labeled cases of hippocampal subfields is a rare resource due to the difficulty of generating such data. Automatic data augmentation has been traditional used to artificially increase the number of training cases. This has been usually done applying random transformations on the available training data (rotation, scale, etc.). In this project, we have used a combination of different methods to augment the number of training cases. In the case of Kulaga-Yuskevitz, we randomly smooth and sharpen the cropped images to generate low and high-quality images to improve generalization capabilities of the network. We also used mixup 35 to linearly combine inputs and outputs (alfa = 0.3). Mixup is a data-agnostic data augmentation method that has been proven beneficial specially when using a small training dataset 38 . In the case of Winterburn dataset, we used the same approach but in addition we increased the training dataset using automatic segmentations of the Kulaga-Yuskevitz dataset with the HIPS method 18 which is a patch-based multiatlas label fusion based method (using as atlases the training cases of each fold). In the Table 3 the results of the proposed method with and without data augmentations are shown. As expected, data augmentation strategies helped to improve the results in both datasets. The improvement in Winterburn dataset was more important given the small size of the training set (N = 6).
A last experiment was performed to evaluate the effect of the TTBN technique. In Table 4 the results of both datasets are shown. As can be seen, TTBN helped in both data sets, but the improvement in the Winterburn dataset was relatively greater.
For the final results of both datasets, we estimated each structure dice, average dice among structures and whole hippocampus dice. In Table 5 the k-fold cross validation results for both datasets are shown. An example result of DeepHIPS for both protocols is shown in Fig. 3.

Standard resolution vs high resolution.
The proposed method uses high resolution MR images but these sequences are not always available either in research or in clinical environments. However, it would be desirable to be able to analyze legacy data. For this reason, we evaluated the proposed method using standard resolution (1 × 1 × 1 mm 3 ) images upsampled to 0.5 × 0.5 × 0.5 mm 3 using B-spline interpolation and a super-  28 . To do it, we reduced the resolution of the HR images by a factor 2 and later we upsampled them using the described methods. Tables 6 and 7 show the results for both datasets. The results confirm that the proposed method can produce competitive results when using standard resolution images. Note that the results using LASR are better than using B-spline interpolation for both datasets and closely resembles those obtained using the original HR images.   Table 7. Kulaga-Yoskovitz dataset mean DICE and standard deviation for each structure segmentation using the B-spline interpolation and LASR to the previously downsampled image to be segmented. Best results in bold. Results using the HR images are also provided for comparison.  39 can further reduce the gap between original HR data results and the upsampled standard resolution images. However, this is beyond the objectives of the work and will be studied in a future research.
Method comparison. The proposed method was compared with state-of-the-art related methods. Specifically, for the Kulaga-Yoskovitz dataset we compared with HIPS method 18 and a recent deep learning-based method named ResDUnet dedicated to hippocampus subfields segmentation 27 . In both cases we used published results in their papers for the comparison. In Table 8 we show the results of the comparison. We included also the inter and intra-rater accuracy for comparison purposes. As can be noticed, the proposed method outperformed previous state-of-the-art methods. It is also worth to note that the proposed method improved the inter-rater accuracy and got very close to the intra-rater accuracy.
For the Winterburn dataset, we compared with HIPS method 18 that represents the state of the art in this dataset. In Table 9, we show the results of the comparison. We included also the intra-rater accuracy for comparison purposes. As can be noticed the proposed method outperformed HIPS method by a large margin and got very close to the intra-rater accuracy.
Regarding to the execution time, the proposed network takes around 1 s to segment a new case. The whole DeepHIPS pipeline (including preprocessing) takes around 2 min while HIPS method takes around 20 min.

Discussion
In this paper, we have presented a new deep learning-based method for HR hippocampus subfield segmentation that we called DeepHIPS. We have validated the proposed method using 2 publically available datasets (Winterburn and Kulaga-Yoskovitz).
Our proposed method first preprocesses the HR T1 and T2 images to improve their quality and to locate them into a standard space (MNI152) to finally crop the region of interest to process. From the architecture point of view, our model is a 3D UNET variant that uses deep supervision and low-resolution feedback to make easier the training process. We found that this variant worked better that the classic UNET.
We have also proposed a novel loss function (GJL) based on the Jaccard similarity index that enables to improve the accuracy of the network borrowing ideas from a modified version of the GDL (i.e. using linear volume weights instead of quadratic). We further improve the results using classical data augmentation techniques such as image mirroring and intensity transformation and more modern ones such as mixup.
Finally, we improved the results of the network at test time by running Batch normalization layers in training mode instead of test mode. We found that when using small batch sizes (N = 1 in our case) batch normalization layers didn't behave properly due to the use of the stored mean and standard deviation during training. Using current sample statistics systematically improved the results in all our experiments despite the simplicity of the approach (especially in the Winterburn dataset with an improvement of nearly the 9%). We called this strategy Training Time Bach Normalization (TTBN).  www.nature.com/scientificreports/ We compared the results of the proposed method with state-of-the-art methods in two datasets. In the Kulaga-Yoskovitz dataset we compared with HIPS method and a recent deep learning-based method named ResDUnet. The proposed method improved the results of both methods for all subfields and got closer to the intra-rater accuracy which can be considered as the upper bound of the method. For the Winterburn dataset, we compared with HIPS method and again, the proposed method improved the results for all subfields and the overall accuracy got very close to the intra-rater accuracy.
We also studied the accuracy of the propose method using standard resolution images (1 mm 2 ) upsampled to HR (0.5 mm 2 ) using superresolution method (LASR). Although the accuracy slightly dropped compared to the HR results we found it still very competitive making possible the use of legacy data.
We are aware that the training libraries of the proposed method are quite small to ensure a good generalization (especially in the case of Winterburn) and our future efforts will be directed to increase the size of these libraries by manually labeling new cases and using semi-supervised approaches to automatically extend the training dataset size.
From an efficiency point of view the proposed method is not only more accurate but also more efficient than previous state of the art (HIPS) reducing by a factor 10 the total execution time.

Conclusion
In this work, we have presented a new method for HR hippocampus subfield segmentation based on a deep learning approach and we have validated it with two publically available datasets (Winterburn and Kulaga-Yoskovitz) showing competitive results in both accuracy and efficiency. We plan to make fully accessible the DeepHIPS pipeline through the new release of our online image analysis service volbrain (http:// volbr ain. upv. es) so researchers around the world can use our pipeline without requiring complex pipeline installations or the use of expensive hardware (GPUs, etc.).