Segmentation of the Proximal Femur from MR Images using Deep Convolutional Neural Networks

Magnetic resonance imaging (MRI) has been proposed as a complementary method to measure bone quality and assess fracture risk. However, manual segmentation of MR images of bone is time-consuming, limiting the use of MRI measurements in clinical practice. The purpose of this paper is to present an automatic proximal femur segmentation method based on deep convolutional neural networks (CNNs). This study had institutional review board approval, and written informed consent was obtained from all subjects. A dataset of volumetric structural MR images of the proximal femur from 86 subjects was manually segmented by an expert. We performed experiments by training two different CNN architectures with varying numbers of initial feature maps, layers, and dilation rates, and tested their segmentation performance against the gold standard of manual segmentation using four-fold cross-validation. Automatic segmentation of the proximal femur using CNNs achieved a high Dice similarity coefficient of 0.95 ± 0.02, with precision = 0.95 ± 0.02 and recall = 0.95 ± 0.03. The high segmentation accuracy provided by CNNs has the potential to help bring the use of structural MRI measurements of bone quality into clinical practice for the management of osteoporosis.


I. INTRODUCTION
Osteoporosis is a public health problem characterized by increased fracture risk secondary to low bone mass and microarchitectural deterioration of bone tissue. Bone mass, or bone mineral content, is currently assessed most commonly via dual-energy x-ray absorptiometry (DXA) [1], [2]. Over the years, cross-sectional imaging methods such as quantitative computed tomography (qCT) [3]-[9] and magnetic resonance imaging (MRI) [10]-[14] have been shown to provide useful additional clinical information beyond DXA, secondary to their ability to image bone in 3-D and provide metrics of bone structure and quality [15].

This work was supported in part by NIH R01 AR066008 and NIH R01 AR070131 and was performed under the rubric of the Center for Advanced Imaging Innovation and Research (CAI2R, www.cai2r.net), an NIBIB Biomedical Technology Resource Center (NIH P41 EB017183). Asterisk indicates corresponding author.

*C. M. Deniz is with the Department of Radiology, New York University School of Medicine, New York, NY 10016 USA, and is affiliated with NYU WIRELESS, New York University Tandon School of Engineering, Brooklyn, NY 11201, USA (e-mail: cemmurat.deniz@nyumc.org).
MRI has been successfully performed in vivo for structural imaging of trabecular bone architecture within the proximal femur [16]-[18]. MRI provides direct detection of trabecular architecture by taking advantage of the MR signal difference between bone marrow and trabecular bone tissue itself. Osteoporosis-related fracture risk assessment using MR images requires image analysis methods to extract information from trabecular bone using structural markers, such as the topology and orientation of trabecular networks [19]-[21], or using finite element (FE) modelling [22]-[24]. Bone quality metrics derived from FE analysis of MR images have been shown to correlate with those from high-resolution qCT imaging, and reveal information about bone quality that is not provided by DXA [18]. These technical developments underline the significance of image analysis tools for determining hip fracture risk.
Initial studies on hip fracture risk assessment focused on specific regions of interest (ROIs), such as the femoral neck, femoral head, and Ward's triangle, for extracting fracture-risk-relevant parameters [18]. Currently, hip fracture risk assessment algorithms have begun to investigate the integrity of the whole proximal femur [25]-[27]. This creates a requirement for segmentation of the whole proximal femur prior to studying bone quality and assessing fracture risk [18], [28]. In general, segmentation of the proximal femur is achieved by manual selection of the periosteal or endosteal borders of bone on MR images by an expert. Given the large number of slices acquired for a single subject during an MRI scan session, manual segmentation of the proximal femur hinders the practical use of MRI-based hip fracture risk assessment. In addition, manual segmentation is subject to inter-rater errors that are difficult to characterize. To overcome these challenges, automatic segmentation of the whole proximal femur is required.
The use of convolutional neural networks (CNNs) has revolutionized image recognition, speech recognition, and natural language processing [29]. Deep CNNs have recently been used successfully in medical research for image segmentation [30]-[34] and computer-aided diagnosis [35]-[38]. In contrast to previous segmentation approaches, which rely on the development of hand-crafted features, deep CNNs learn increasingly complex features from data automatically. In this work, we propose two different CNN architectures for automatic segmentation of the proximal femur and compare their performance.

II. METHODS

A. Convolutional Neural Network
Various convolutional neural network architectures have been used for automatic segmentation of biomedical images [30]-[34]. In this work, two different supervised deep convolutional neural network architectures are used and evaluated for automatic proximal femur segmentation on MR images. An overview of the proposed approach for automatic segmentation of the proximal femur is presented in Fig. 1. The first approach (CNN#1) uses a pyramidal CNN architecture that takes the local region around a voxel as input to predict whether that voxel belongs to trabecular bone or not. In this classical type of CNN, the size of the feature maps decreases as the network gets deeper, which we define as the contracting path (Fig. 2a). The output of CNN#1 is the probability that the center voxel of the input patch is trabecular bone. The second approach (CNN#2) uses the so-called u-net architecture [39], which was built upon a fully convolutional network [40]. In the u-net architecture, the network takes larger images as input and starts with a contracting path, similarly to the first approach. Each pooling operation is followed by two convolutional layers with twice as many feature maps. After the contracting path, the network expands in a way roughly symmetric to the contracting path, with some cropping and copying from the contracting path. This yields a u-shaped architecture (Fig. 2b). The output of CNN#2 is a trabecular bone probability map of the center area of the input image. The size of this center area depends on the number of layers in the contracting/expanding paths.
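The dependence of the CNN#2 output ("center area") size on the number of contracting/expanding levels can be illustrated with simple size arithmetic. The sketch below is our own reconstruction, assuming two unpadded 3x3 convolutions per resolution level before each 2x2 pooling step, and upsampling followed by an unpadded 2x2 convolution in the expanding path; the exact layer ordering in the paper's models may differ:

```python
def valid_output_size(input_size, depth):
    """Spatial size of the u-net output region for a given input size,
    assuming unpadded 3x3 convolutions, 2x2 max-pooling with stride 2,
    and upsampling followed by an unpadded 2x2 convolution."""
    s = input_size
    # Contracting path: two unpadded 3x3 convs (-4), then 2x2 pool (/2).
    for _ in range(depth):
        s -= 4
        if s % 2 != 0:
            raise ValueError("input size incompatible with pooling at this depth")
        s //= 2
    # Bottleneck: two more unpadded 3x3 convolutions.
    s -= 4
    # Expanding path: upsample (x2), unpadded 2x2 conv (-1), two 3x3 convs (-4).
    for _ in range(depth):
        s = 2 * s - 1 - 4
    return s
```

For example, under these assumptions a 108-voxel-wide input with 3 contracting/expanding levels yields a 13-voxel-wide output, which is why only the center area of each input patch is predicted.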
In both CNNs, we use flipping for data augmentation [41], since our dataset contained images from subjects who had been scanned at either the right or the left hip. In CNN#1, 32x32 patches were extracted with an ordered overlap at a stride of 8 during training. The augmented data is split into mini-batches for training. The initialization of the convolution kernel weights is known to be important for achieving convergence. In all experiments, we use the so-called Xavier [42] weight initialization method. The Xavier initializer is designed to keep the scale of the gradients roughly the same in all layers. This prevents exploding or vanishing gradients, enabling effective learning during CNN training. We use unpadded 3x3 convolutions and 2x2 max-pooling operations with stride 2 to gradually decrease the size of the feature maps. In the expanding path of CNN#2, upsampling of the feature maps is followed by an unpadded 2x2 convolution that halves the number of feature maps. For non-linearly transforming the data within each layer of the CNN, the rectified linear unit (ReLU) [43] is used as the activation function; ReLU is defined as max(0, x). In the last layer of the CNN, we use softmax to compute the conditional distribution over the voxel label.
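The patch extraction and flip augmentation for CNN#1 can be sketched as follows (patch size 32 and stride 8 come from the text; the extraction routine itself is our illustration, not the authors' code):

```python
import numpy as np

def extract_patches(image, patch=32, stride=8):
    """Extract ordered, overlapping square patches from a 2D slice,
    as used for CNN#1 training (32x32 patches, stride 8)."""
    H, W = image.shape
    patches = []
    for r in range(0, H - patch + 1, stride):
        for c in range(0, W - patch + 1, stride):
            patches.append(image[r:r + patch, c:c + patch])
    return np.stack(patches)

def augment_flip(patches):
    """Left-right flipping for data augmentation, reflecting that the
    dataset contains both right- and left-hip scans."""
    return np.concatenate([patches, patches[:, :, ::-1]])
```

On a 64x64 slice this yields 5x5 = 25 overlapping patches, doubled to 50 after flip augmentation.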
A challenge of training a CNN is to generalize learning to data unseen during training. When a CNN performs poorly on such test data, this is generally caused by overfitting to the training dataset. There are many approaches to preventing overfitting, such as keeping the size of the network small and regularization techniques such as dropout [44]. We use dropout in all CNNs, and L2 regularization on the convolution kernels in some CNNs, to prevent overfitting during training. During training, dropout removes nodes from the network with probability p; during testing, all nodes are used. This node removal can be viewed as training an ensemble of networks. Networks using dropout take longer to train than the same networks trained without dropout. However, the better generalization obtained from dropout is invaluable. In CNN#1, we use dropout after each hidden layer. In CNN#2, dropout is used between the 3rd and 4th contracting path layers.
The output of the softmax layer of the CNN is used to define a loss function, which training aims to minimize in order to reduce the error between the ground truth and the automatic segmentation. In our implementation, the loss function, cross-entropy loss (CE), is defined as the negative log-probability of the target label (ground truth) from an expert manually-segmented MR image. In medical images, the anatomical structure of interest usually occupies a small portion of the image. This potentially biases the CNN prediction towards the background, which constitutes the larger portion of the image. To overcome this class imbalance problem, we re-weighted the loss function during training. We achieve this by incorporating the number of proximal femur voxels, N_p, and background voxels, N_b, into the loss value such that errors in voxels belonging to trabecular bone are given more importance:

CE = -(1/N) Σ_{i=1}^{N} [ (N_b/N) y_i log(p_i) + (N_p/N) (1 - y_i) log(1 - p_i) ]   (1)

where N = N_p + N_b is the number of voxels, y_i is a binary variable indicating whether trabecular bone is the correct prediction for voxel i, and p_i is the model's predicted probability that voxel i is trabecular bone. We used the Tensorflow [45] software library to implement the CNNs. For minimization of the loss function, we used stochastic gradient descent (SGD) with momentum as the optimization algorithm for CNN#1. For CNN#2, we used adaptive moment estimation (Adam) [46].
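The class-weighted cross-entropy above can be sketched in NumPy. This is a reconstruction: the weighting by class frequency follows the description in the text (bone errors weighted more heavily because bone voxels are rarer), but the paper's exact weighting scheme may differ:

```python
import numpy as np

def weighted_cross_entropy(y, p, eps=1e-7):
    """Class-weighted cross-entropy loss (a reconstruction; the paper's
    exact weighting may differ). y: binary ground-truth labels
    (1 = trabecular bone), p: predicted bone probabilities."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    n_p = y.sum()            # number of proximal femur (bone) voxels
    n_b = y.size - n_p       # number of background voxels
    w_p = n_b / y.size       # the rarer bone class gets the larger weight
    w_b = n_p / y.size
    return -np.mean(w_p * y * np.log(p) + w_b * (1 - y) * np.log(1 - p))
```

With this weighting, misclassifying a bone voxel costs more than misclassifying a background voxel whenever bone occupies the minority of the image, which counteracts the background bias described above.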

B. Inference and Postprocessing
To predict the segmentation of voxels in the border region of the images, we extrapolated the missing content by mirroring the input image during inference. For CNN#1, each voxel has a deterministic patch that can be used to calculate the probability of the center voxel of the patch being trabecular bone. For CNN#2, on the other hand, the probability of a voxel being trabecular bone can be calculated from multiple patches that cover that voxel in their center area. For this reason, during inference with CNN#2 we use multiple patches for each voxel and average the resulting probabilities to calculate the probability of that voxel being trabecular bone. In total, we divided the mirrored image into 49 patches that cover the full mirrored image with an ordered overlap.
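The mirroring and probability-averaging steps can be sketched as follows (the patch bookkeeping and function names are our illustration; only the mirroring and averaging scheme come from the text):

```python
import numpy as np

def mirror_pad(image, margin):
    """Extrapolate missing border content by mirroring, so voxels near
    the image edge can still sit inside the predicted center area.
    Note: reflect-padding requires margin <= size - 1 along each axis."""
    return np.pad(image, margin, mode="reflect")

def average_overlapping(prob_patches, positions, out_shape, patch):
    """Average per-voxel bone probabilities over all patches covering a
    voxel, as in the CNN#2 inference scheme. positions holds the
    top-left corner (row, col) of each patch's predicted region."""
    acc = np.zeros(out_shape)
    cnt = np.zeros(out_shape)
    for prob, (r, c) in zip(prob_patches, positions):
        acc[r:r + patch, c:c + patch] += prob
        cnt[r:r + patch, c:c + patch] += 1
    # Avoid division by zero for voxels covered by no patch.
    return acc / np.maximum(cnt, 1)
```

A grid of overlapping patch predictions can then be reduced to a single probability map before thresholding.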
We performed basic post-processing on the segmentation results from the CNN to remove small clusters of misclassified bone regions. Since trabecular bone forms a 3D connected volume and comprises the largest number of voxels at the output of the CNN, we imposed volumetric constraints by removing clusters with volumes smaller than that of the largest connected component. The label corresponding to the largest connected volume represents the proximal femur. This approach successfully removes the small clusters that were misclassified as proximal femur during inference.
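The post-processing step above amounts to keeping the largest connected component of the binary mask. A minimal, dependency-free sketch using breadth-first search over face-connected neighbours (the authors' implementation may differ, e.g. in the connectivity used):

```python
import numpy as np
from collections import deque

def keep_largest_component(mask):
    """Remove small misclassified clusters by keeping only the largest
    face-connected component of the binary segmentation mask.
    Works for masks of any dimensionality (2D slices or 3D volumes)."""
    mask = np.asarray(mask, dtype=bool)
    labels = np.zeros(mask.shape, dtype=int)
    sizes = {}
    current = 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue  # already assigned to a component
        current += 1
        labels[start] = current
        queue = deque([start])
        size = 0
        while queue:
            voxel = queue.popleft()
            size += 1
            # Visit the 2*ndim face-connected neighbours.
            for axis in range(mask.ndim):
                for step in (-1, 1):
                    nb = list(voxel)
                    nb[axis] += step
                    nb = tuple(nb)
                    if (0 <= nb[axis] < mask.shape[axis]
                            and mask[nb] and not labels[nb]):
                        labels[nb] = current
                        queue.append(nb)
        sizes[current] = size
    if not sizes:
        return np.zeros_like(mask)
    largest = max(sizes, key=sizes.get)
    return labels == largest
```

In practice a library routine such as `scipy.ndimage.label` would do the labeling step; the hand-rolled BFS is shown only to make the volumetric constraint explicit.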
The total time required for inference and postprocessing for the segmentation of data from one subject with 60 coronal slices was approximately 58 minutes and 5 minutes for CNN#1 and CNN#2, respectively.

C. Dataset
This study had institutional review board approval, and written informed consent was obtained from all subjects.

D. Evaluation
We used manual segmentations of the proximal femur from 21 subjects as the ground truth to evaluate the different CNN structures. We defined trabecular bone and background voxels as positive and negative outcomes, respectively. We evaluated the performance of the CNNs using receiver operating characteristic (ROC) and precision-recall curve (PRC) analysis, the Dice similarity coefficient (DSC), sensitivity/recall, and precision. The DSC [47], also known as the F1-score, measures the similarity/overlap between manual and automatic segmentations. The DSC is the most commonly used metric in validating medical volume segmentations [48] and is defined as:

DSC = 2TP / (2TP + FP + FN)   (2)

where TP, FP, and FN are the detected numbers of true positives, false positives, and false negatives, respectively. Sensitivity/recall measures the portion of trabecular bone voxels in the ground truth that are also identified as trabecular bone by the automatic segmentation. Sensitivity/recall is defined as:

Sensitivity = TP / (TP + FN)   (3)

Similarly, specificity measures the portion of background voxels in the ground truth that are also identified as background by the automatic segmentation. Specificity is defined as:

Specificity = TN / (TN + FP)   (4)

where TN is the number of true negatives. Lastly, precision, also known as the positive predictive value (PPV), measures the proportion of voxels identified as trabecular bone by the automatic segmentation that are also trabecular bone in the ground truth. It is defined as:

Precision = TP / (TP + FP)   (5)

ROC curve analysis provides a means of evaluating the performance of automatic segmentation algorithms and selecting a suitable decision threshold. We used the area under the ROC curve (AUC) and under the PRC (AP: average precision) as measures of classifier performance for comparing the different CNNs. The output of a CNN defines the probability of a voxel being trabecular bone. Using PRC analysis, an optimal threshold was selected for each CNN model to distinguish trabecular bone voxels from background when comparing the performance of the CNNs. The optimal operating point for each CNN was selected by choosing the point on the precision-recall curve with the smallest Euclidean distance to the point of maximum precision and recall. Voxels with probabilities higher than the selected threshold are predicted as trabecular bone and the rest as background.
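The evaluation metrics of Eqs. (2)-(5) and the operating-point selection can be computed directly from binary masks; a minimal sketch (function names are ours):

```python
import numpy as np

def segmentation_metrics(truth, pred):
    """DSC, sensitivity/recall, specificity, and precision (Eqs. 2-5)
    from binary ground-truth and predicted masks."""
    truth = np.asarray(truth, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    tp = np.sum(truth & pred)
    fp = np.sum(~truth & pred)
    fn = np.sum(truth & ~pred)
    tn = np.sum(~truth & ~pred)
    return {
        "dsc": 2 * tp / (2 * tp + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

def optimal_operating_point(precisions, recalls, thresholds):
    """Pick the threshold whose (precision, recall) point lies closest
    (in Euclidean distance) to the ideal corner (1, 1) of the PR curve."""
    d = np.hypot(1 - np.asarray(precisions), 1 - np.asarray(recalls))
    return thresholds[int(np.argmin(d))]
```

In practice the PR curve itself would come from sweeping the probability threshold over the CNN output (e.g. with `sklearn.metrics.precision_recall_curve`).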

III. RESULTS
ROC curve and PRC analysis of the modeled CNNs on the whole test dataset is presented in Fig. 3. CNN#2 with 32 initial feature maps and 3 layers in the contracting/expanding paths outperformed the other CNNs, with AUC = 0.9956 and AP = 0.9692. This model achieved the highest accuracy on the segmentation of the proximal femur and was selected for visualizing the segmentation results.
The optimal threshold was applied to the segmentation probability maps to calculate a binary segmentation mask. Postprocessing was applied to this segmentation mask before

IV. DISCUSSION
We present an algorithm for automatic proximal femur segmentation from structural MR images. A purely CNN-based approach was complemented with a basic morphological postprocessing algorithm to remove spurious regions. The automatic segmentation results indicate that the expert knowledge of location specifications and the training/time required for segmentation of the proximal femur may be avoided using CNNs, which brings the use of proximal femur MRI measurements closer to clinical practice, given that manual segmentation of hip MR images can require approximately 1.5-2 hours of effort.
To the best of our knowledge, this is the first description of automatic segmentation of MR images of the proximal femur using a CNN. CNN-based automatic segmentation of MR images has been performed in the brain [49], including for brain tumors [31], microbleeds [50], and skull stripping for brain extraction [51]. CNN-based automatic segmentation has also been described for the pancreas [34] and for knee cartilage [30]. In the future, we expect the number of imaging applications to increase rapidly, especially given that publicly available software libraries such as Tensorflow make it easy to create CNNs and that the algorithms can be executed on commercially available graphics cards and desktop computers.
In our implementation of the segmentation algorithms, we used 2D convolutional kernels, which could be one of the reasons for the misclassified trabecular bone regions. Even though information from consecutive slices is incorporated in the 2D convolutions, the global connectivity of the proximal femur may not be modeled properly using 2D convolutions alone. Although we used post-processing to remove small misclassified regions, another approach would be to use 3D convolutional kernels, which can potentially model the 3D connectivity of the proximal femur better. CNNs with 3D convolutions are, however, computationally more demanding and can suffer more from overfitting due to the increased number of weights to train.
In CNN#2, mirrored images are used during inference for calculating the probability of each voxel being part of the proximal femur. This entails inference on multiple patches covering the image and averaging the probabilities to calculate the output segmentation mask. Mirrored images could also be used during training, which would remove the need for multiple calculations and averaging during inference. However, the increased input size of the network would increase the training time and GPU memory requirements. On the other hand, using mirrored images for modelling would reduce the time required for inference and postprocessing to approximately 6 seconds, making it possible to obtain segmented trabecular bone images immediately after the MR scan.
This study has limitations. First, the training dataset consisted of only 41 subjects. In the future, with larger datasets, we expect the performance of the CNN to improve. Second, even though we implemented multiple CNNs with different numbers of feature maps and layers for CNN#2, hyperparameter optimization [52] of the CNN training parameters was not implemented in the current study. In the future, optimization of the learning rate and dropout rate will be performed. We expect the misclassified trabecular bone regions to disappear as a result, rendering the postprocessing step obsolete.
In conclusion, we have demonstrated the feasibility of automatic segmentation of MR images of the proximal femur using a deep CNN. Combined with a basic morphological postprocessing algorithm, the method has the potential to eliminate manual segmentation and facilitate the workflow for MRI-based assessment of bone quality in the hip related to osteoporosis or other musculoskeletal disorders.

Fig. 1. Overview of the proposed learning algorithm for automatic segmentation of the proximal femur. Training the CNN yields an automatic proximal femur segmentation model that is evaluated on a test dataset. The output of the model is a bone probability map, which is thresholded to obtain the proximal femur segmentation mask.

Fig. 3. Left panel: receiver operating characteristic (ROC) curves of the different CNNs modeled in this work. For CNN#2, the numbers of feature maps and layers in the contracting/expanding paths are given in the legend together with the area under the curve (AUC). Right panel: precision-recall curves of the modeled CNNs. In the legend, the average precision (AP) is given for comparison of the different models.

Fig. 4. Analysis of the trained automatic segmentation models using 21 test subjects.

Fig. 5. 3T MR image of one of the test subjects (a), shown with the ground truth/hand segmentation mask (b) and the segmentation mask (c) achieved by CNN#2 with 32 feature maps and 3 layers. The red arrow indicates a location misclassified by the CNN. Misclassified regions were removed by post-processing using connectivity and proximal femur size prior information (d).
Parameters used in training the CNNs are outlined in Table I. We performed experiments on a server using Intel Xeon E5-2650 2.3GHz CPUs and an NVIDIA 12GB Tesla M40 GPU card. Training each epoch took approximately 50 minutes for CNN#1 and 6 minutes for CNN#2.