Multi-muscle deep learning segmentation to automate the quantification of muscle fat infiltration in cervical spine conditions

Muscle fat infiltration (MFI) has been widely reported across cervical spine disorders. The quantification of MFI requires time-consuming and rater-dependent manual segmentation techniques. A convolutional neural network (CNN) model was trained to segment seven cervical spine muscle groups (left and right muscles segmented separately, 14 muscles total) from Dixon MRI scans (n = 17, 17 scans < 2 weeks post motor vehicle collision (MVC), and 17 scans 12 months post MVC). The CNN MFI measures demonstrated high test reliability and accuracy in an independent testing dataset (n = 18, 9 scans < 2 weeks post MVC, and 9 scans 12 months post MVC). Using the CNN in 84 participants with scans < 2 weeks post MVC (61 females, 23 males, age = 34.2 ± 10.7 years), differences in MFI between the muscle groups and relationships between MFI and sex, age, and body mass index (BMI) were explored. Averaging across all muscles, females had significantly higher MFI than males (p = 0.026). The deep cervical muscles demonstrated significantly greater MFI than the more superficial muscles (p < 0.001), and only MFI within the deep cervical muscles was moderately correlated with age (r > 0.300, p ≤ 0.001). CNNs allow for the accurate, rapid, and quantitative assessment of the composition of the architecturally complex muscles traversing the cervical spine. Acknowledging the wider reports of MFI in cervical spine disorders and the time required to manually segment the individual muscles, this CNN may have diagnostic, prognostic, and predictive value in disorders of the cervical spine.


Results
CNN accuracy and reliability. Training of the CNN segmentation model was completed in 100,000 iterations (Supplementary Material Fig. 1), and the accuracy and reliability of the trained CNN model were evaluated on the independent testing dataset (n = 18). Figures 1 and 2 compare the ground truth (GT) segmentations to the CNN segmentations from a randomly selected testing scan. The CNN accuracy for the primary measure of MFI was high; for all muscle groups, the absolute value of the mean bias in MFI was less than 2.0%, the MFI mean absolute error (MAE) was less than 2.0%, and the MFI root mean squared error (RMSE) was less than 3.0% (Table 2 and Fig. 3). Likewise, the reliability of the CNN MFI measures was excellent with the ICC 2,1 exceeding 0.800 for all muscle groups (Table 3 and Fig. 3) 31. The accuracy and reliability of the secondary measure of muscle volume were generally lower than those of the MFI measurements (Tables 2 and 3 and Fig. 4). While the reliability of the MFSS, SSCap, LS, SCM, and TR muscle volume was good (ICC 2,1 > 0.600), the reliability of the SPCap and LC volumes was fair (ICC 2,1 = 0.407-0.462) and poor (ICC 2,1 = 0.207-0.395), respectively (Table 3 and Fig. 4). The mean Sørensen-Dice index between the CNN and GT was > 0.65 for all muscle groups except the TR (Sørensen-Dice index < 0.50). The volume ratios were greater than one for all muscles, indicating that the CNN model generated segmentations of a larger volume than the GT. See Table 4 for a summary of all segmentation metrics.

Interrater reliability. Another trained independent rater segmented a subset of the dataset (n = 10) to assess the reliability of the manual segmentations between human raters. Interrater reliability for the MFI measures was excellent (ICC 2,1 > 0.800) for most muscles except for the left SSCap, which demonstrated good interrater reliability (ICC 2,1 = 0.742) (Supplementary Table 2 Fig. 4).
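As an illustration of the agreement statistics reported here (bias, 95% limits of agreement, MAE, and RMSE), the following minimal Python sketch computes them for paired CNN and GT measures. The function name `agreement_stats` and the example MFI values are hypothetical and are not taken from the study data; the sketch assumes the bias is defined as CNN minus GT, as stated in the Table 2 caption.

```python
import math
from statistics import mean, stdev


def agreement_stats(cnn, gt):
    """Bland-Altman bias, 95% limits of agreement, MAE, and RMSE for paired measures."""
    if len(cnn) != len(gt) or len(cnn) < 2:
        raise ValueError("need two equally long sequences with at least two pairs")
    diffs = [c - g for c, g in zip(cnn, gt)]  # CNN minus GT, one value per scan
    bias = mean(diffs)                        # mean difference (bias)
    sd = stdev(diffs)                         # sample standard deviation of the differences
    return {
        "bias": bias,
        "loa_lower": bias - 1.96 * sd,        # lower 95% limit of agreement
        "loa_upper": bias + 1.96 * sd,        # upper 95% limit of agreement
        "mae": mean([abs(d) for d in diffs]),
        "rmse": math.sqrt(mean([d * d for d in diffs])),
    }


# Hypothetical MFI values (%) for one muscle group in four scans.
cnn_mfi = [21.0, 18.5, 25.0, 19.5]
gt_mfi = [20.0, 19.5, 24.0, 18.5]
stats = agreement_stats(cnn_mfi, gt_mfi)
print(stats)
```

Because the bias averages signed differences while the MAE and RMSE average their magnitudes, a small bias can coexist with a larger MAE or RMSE, which is why all three are reported together.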
Paired t-tests demonstrated significant differences in MFI between each muscle group, with the deep cervical muscles (MFSS, LC, and SSCap) having greater MFI than the more superficial muscles (Fig. 5A). When averaging across all muscle groups, female participants had significantly higher MFI than male participants (1.8% ± 0.8% higher MFI in females, p = 0.026). Sex differences within each muscle group were assessed with a one-way ANCOVA controlling for age and BMI. Females had significantly higher MFI than males in the SSCap, SPCap, and TR (Supplementary Table 3 and Fig. 5B). Two-tailed partial correlations controlling for sex and BMI demonstrated that age and MFI were moderately correlated (r > 0.300) in the deep cervical muscles (MFSS, LC, SSCap), while age and MFI were only weakly correlated (r < 0.300) in the more superficial muscles. Finally, BMI was only weakly correlated (r < 0.300) with MFI across the muscle groups after controlling for sex and age (Fig. 6).

Discussion
We trained and tested a CNN model for segmentation of seven cervical spine muscle groups (left and right muscles segmented separately) using high-resolution fat-water Dixon MRI in participants within 2 weeks of an MVC-related whiplash injury. We demonstrate the feasibility of developing a CNN model for a complex, multi-muscle segmentation task: 14 muscles total. The trained CNN model was highly efficient, segmenting an image, and all muscles within that image, in less than 30 s compared to 4 to 8 h with manual segmentation. Across all muscle groups, we report high accuracy (mean bias < 2.0%) and high reliability (ICC 2,1 > 0.800) for the CNN MFI measures compared to manual segmentation. Using the CNN measures, we identified higher MFI in females than males, and we also identified significant positive correlations between age (controlling for sex and BMI) and MFI, which were most pronounced (r > 0.300) for the deep cervical muscles: MFSS, LC, and SSCap. Overall, the CNN model permits the automatic extraction of accurate and reliable measures of muscle composition. The use of these measures as secondary markers in clinical trials may lead to an improved understanding of spinal disease, delivering more sensitive and specific measures of spinal pathology, where the slow progression, often masked by age-related changes, poses a major roadblock to measuring therapeutic success. Such work could also advance the management of spinal conditions by encouraging efficiencies and innovation in clinical assessment and therapy development because the measures may detect change earlier than clinical endpoints (more sensitive) and are independent of assessor variability (more objective). Furthermore, these measures may aid in augmented intelligence-driven clinical decision making, allowing the clinician to better risk- and treatment-stratify patients using this information along with other clinical markers and endpoints 14.
In the cervical spine, MFI has been reported to be present in greater magnitude in the deeper muscles surrounding the vertebrae, compared to the more superficial muscles 18,20. Our CNN MFI measurements showed the same pattern, supporting these previous findings. Prior research as well as our current study highlight the difficulty in determining whether elevated MFI is due to a pathological process or is part of normal aging, as age and MFI have been demonstrated to be associated 29,32-34. For this study, the imaging data were from a parent study of individuals with neck pain due to a whiplash injury from an MVC, and thus aging as well as musculoskeletal trauma may have contributed to the MFI expressions in our cohort. We demonstrated higher MFI in females than males, which is consistent with previous findings in the lumbar spine musculature in asymptomatic 33 and symptomatic 30 participants. The effect of sex was present even when controlling for age and BMI. BMI, however, is only a rough estimate of body composition, and more accurate measures of body fat percentage (e.g., skinfold testing or bioimpedance) and physical function could help better understand the relationship between sex and MFI 35. Sex and age are potential confounds when studying MFI. The inclusion of a sufficient number of male and female participants without whiplash symptoms following an MVC or an uninjured pain-free cohort is necessary to determine the nature of MFI expression and whether the magnitude is influenced by disorder severity, age, or sex.
Table 2. CNN MFI and volume accuracy. Bland-Altman plots and Pearson correlations were used to assess the accuracy of the convolutional neural network (CNN) model compared to the ground truth (GT) for muscle fat infiltration (MFI) and muscle volume measures on the testing dataset (n = 18) (Figs. 3 and 4). Mean = mean CNN measure ± 1 standard error. Bias = mean difference between CNN and GT. LA = limits of agreement. MAE = mean absolute error. RMSE = root mean squared error.

In the correlation plots, the dashed black line represents the best fit line, and the linear regression coefficient (β) of the ground truth (GT) on CNN (intercept = 0) is also provided (solid black line), which can be used to correct the CNN measurement bias. In the Bland-Altman plots, the dashed black and gray lines indicate the mean difference (i.e., bias) ± 1.96 × standard deviation (i.e., 95% limits of agreement). See Table 3 for the intraclass correlation coefficients (ICC 2,1).

While the accuracy and reliability of the CNN MFI measures were high for all muscle groups, the accuracy and reliability of the muscle volumes were generally lower, especially for the SPCap and LC, which had fair to poor reliability, respectively. The lower reliability of the muscle volumes was not unexpected, and several important limitations regarding the volume measurements should be noted. First, we chose not to segment the full length of each muscle group, and instead, limited the segmentations to the axial slices that cross the vertebral levels listed in Table 1 for each muscle group. This was to standardize the segmentation process because some muscles become difficult to differentiate close to their attachments and some muscles originated from outside the field of view (e.g., trapezius). Second, the axial slices do not cross most muscles perpendicularly, meaning that the area on an axial slice may not represent a true cross-sectional area of the muscle at that level.
This is a common limitation for imaging studies that report cross-sectional area and is a good argument for reporting full volume instead, if possible 23. While the CNN model was trained only on the segmentations restricted to the pre-determined vertebral levels, the CNN model output was not restricted to these vertebral levels, and additional slices not included in the GT segmentations appear to have been included in the segmentations output by the CNN model. This is demonstrated by the volume ratio being greater than one for all muscle groups and a positive mean bias for muscle volume for all muscles except the TR, indicating that the CNN muscle segmentations produced a systematically greater volume than the GT. Despite the low CNN reliability for the volume measures, the mean bias in the muscle segmentations was less than 5 ml for all muscle groups, which is relatively low.

Despite advancements in MRI technology (e.g., improved image resolution and contrast), manual segmentation of the cervical spine muscles requires a high level of knowledge in cervical spine anatomy and a substantial amount of time (i.e., 4 to 8 h per scan in this study). Due to the high level of expertise and time required, we only trained and tested the CNN model using segmentations from one rater, which were then reviewed by an additional independent rater. This raises concern that the CNN model may be biased to a single rater's interpretation of the anatomy. To mitigate this concern, in a subset of the dataset, we assessed the interrater reliability of manual segmentation with a third human rater, and we demonstrated excellent to good reliability for the MFI and volume measures of all muscle groups besides the TR muscle volume measures, which had poor reliability. Factors contributing to differences in segmentations between raters include the complex three-dimensional muscle and bone architecture as well as challenges in visualizing borders between adjacent muscle groups.
The complexity of the TR muscular architecture likely led to the poor interrater reliability values for the TR muscle volume. The upper TR muscle fibers are primarily vertical near their cranial attachment, but fan out at the lower vertebral levels, becoming almost parallel to the axial slices. Despite receiving similar instruction and training, the two raters may have used slightly different techniques to segment this muscle 23. Caution should be used when interpreting the TR muscle measures as the segmentations likely do not generalize to all raters or participants.
The challenge of differentiating between adjacent muscles drove our previous decision to combine the multifidus and semispinalis cervicis (MFSS) 23. The multifidus and semispinalis cervicis are both deep extensor muscles with similar actions. These muscle groups have many vertebral attachments and multiple layered fiber bundles that make boundaries challenging to differentiate. Similar reasoning was used for grouping the longus colli and longus capitis (LC). In comparison to our previous CNN model for the MFSS, which used two-dimensional convolutional layers, here three-dimensional convolutional layers were employed. The choice of the three-dimensional model was due to the increased complexity of the muscle architecture, particularly of the superficial muscles, where the shape and spatial relationship of the muscles vary substantially along the superior-inferior axis. The three-dimensional model includes this additional spatial information, possibly allowing the model to better learn the complex three-dimensional architecture of the muscles, leading to improved performance.

Table 3. CNN MFI and volume reliability. Intraclass correlation coefficients (ICC 2,1, two-way random, absolute agreement, single measure) were used to assess the reliability of the convolutional neural network (CNN) model compared to the ground truth (GT) for the muscle fat infiltration (MFI) and muscle volume measures on the testing dataset (n = 18). CI = confidence interval. p = F-test with true value of 0.
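The ICC 2,1 used throughout (two-way random, absolute agreement, single measure) can be computed from the mean squares of a two-way ANOVA. The sketch below follows the standard Shrout and Fleiss formulation; the function name `icc_2_1` and the toy ratings are hypothetical, and the study itself reports performing its statistics in SPSS, so this is only an illustration of the coefficient rather than the authors' procedure.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    ratings: list of rows, one per subject (e.g., scan), each holding the k
    measurements of that subject (e.g., [GT value, CNN value]).
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 subjects and 2 measurements per subject")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for subjects (rows), raters/methods (columns), and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-subject mean square
    msc = ss_cols / (k - 1)               # between-method mean square
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Toy data: identical measurements versus a constant offset of one unit.
icc_identical = icc_2_1([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
icc_offset = icc_2_1([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])
print(icc_identical, icc_offset)
```

In the second toy case the two methods are perfectly correlated, yet the coefficient drops because absolute agreement penalizes a systematic offset. This is consistent with the pattern described for the volume measures, where the CNN segmentations were systematically larger than the GT.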
In our previous CNN model, we trained the model using just the water-only images and limited the measure to the MFSS 24. Here, we chose to train on the in-phase and out-of-phase images for two reasons. First, using the corresponding pairs of images generated from the Dixon sequence (e.g., fat-only and water-only images or in-phase and out-of-phase images) provides complementary information, which may have helped better define the muscle boundaries 36. Second, the in-phase and out-of-phase images were used over the water-only and fat-only images due to the presence of the fat-water swapping artifact, which appears frequently in Dixon imaging (≈10% of the fat-water imaging scans) 37,38. The fat-water swapping artifact results from misclassification of the fat and water signal in areas of magnetic field inhomogeneities, resulting in regions where fat-only and water-only images contain water-only and fat-only voxels, respectively. This artifact is not present in the in-phase and out-of-phase magnitude images. The use of both image contrasts does increase the number of features, leading to a larger network size, higher model complexity, and increased computational costs. While we did not test whether the use of both the in-phase and out-of-phase images improves the segmentation performance, it is plausible as the images contain unique tissue contrast. Outputting the in-phase and out-of-phase images is an option on most scanners but may not be the default setting. With three-dimensional models, especially multi-modal models, memory does become an issue, limiting the batch size for training to only three datasets in this study. We are currently exploring the trade-off between the spatial window size and model performance to reduce the memory demands without sacrificing accuracy.
Table 4. CNN segmentation performance metrics. The convolutional neural network (CNN) segmentation performance was further assessed on the testing dataset (n = 18) using the Sørensen-Dice index (Dice), Jaccard index (JI), conformity coefficient (CC), true positive rate (TPR), true negative rate (TNR), positive predictive value (PPV), and volume ratio (VR). Metrics shown = mean ± 1 standard error.

Here we employed the dense V-Net, a fully convolutional neural network model, to perform multi-muscle segmentation. We chose the dense V-Net because this model demonstrated state-of-the-art performance in a multi-organ segmentation task using abdominal CT images 39. U-Net is another commonly used fully convolutional segmentation model with both two-dimensional and three-dimensional architectures and structural similarities to V-Net 40,41. U-Net has demonstrated high performance for lung segmentation from radiographs and bone segmentation from MRI in addition to many other tasks 42,43. In this study, we did not compare different CNN architectures, but the results of a recent segmentation challenge showed similar performance between three-dimensional V-Net and three-dimensional U-Net models for a knee cartilage MRI segmentation task 44. The application of CNNs and deep learning in medical imaging analysis has been a major advancement in the field, leading to significant gains in segmentation performance across multiple medical imaging applications (for a comprehensive review see Hesamian et al. (2019)) 45. New architectures are continuing to be developed, leading to further improvements in segmentation performance over the V-Net and U-Net architectures 46. Examples include recurrent neural networks, such as long short-term memory networks, and ensembles of neural networks 47,48.
Adopting these more advanced networks would likely improve the accuracy and reliability of the cervical spine muscle segmentations reported in this study as well as the resulting MFI measures. We present these findings from the perspective of cervical spine conditions; however, we are actively working to expand this technology to the entire musculoskeletal system. A major barrier in developing the CNN models is the availability of large, diverse annotated datasets for training. The use of images from the same site, sequence, and imaging parameters likely reduces the generalizability of the trained CNN model and is a recognized limitation in this study. We are currently building a global coalition of musculoskeletal clinicians and researchers to pool clinical- and research-based imaging datasets to develop a large multi-site and multi-cultural annotated musculoskeletal imaging database for research purposes. Using this database, we aim to develop models that generalize to images of varying resolution, field-of-view, orientation, and image contrast (multi-modality and multi-scale) to establish normative reference values to inform clinical care on a patient-by-patient basis 49,50. Fortunately, many past examples of open imaging databases for organizing and sharing imaging datasets exist to guide this process (e.g., OpenNeuro) 51. Efficiently generating the annotated datasets with sufficient accuracy remains the greatest hurdle. We are currently exploring ways to employ crowdsourcing strategies and gamify the segmentation task, with the goal of developing a web-based educational platform targeted to health professional trainees to learn musculoskeletal anatomy interactively.
Based on our findings, we now have the technology to automate the segmentation of multiple muscles of the cervical spine, permitting the quantitative comprehensive assessment of cervical spine muscle composition. Our success in the cervical spine, with its architecturally complex anatomy, suggests that effectively extending these methods to other body regions is possible, and we have efforts underway in the lumbar spine, foot, leg, hip, and shoulder using both CT and MRI. The integration of the CNN models into the conventional clinical workflow as a postprocessing step should be straightforward, and in the not-too-distant future, these methods could provide clinicians with quantitative metrics of muscle characteristics extracted from the images obtained in a conventional musculoskeletal imaging series. Finally, these muscle measures would complement the examination and standard imaging findings and may provide increased diagnostic, prognostic, and predictive information to better inform the assessment and management of a wide variety of musculoskeletal and neuromuscular conditions. Relating these findings to clinical examinations across a patient population with varying levels of pain and disability is required before definitive conclusions can be made. This is well underway.

When averaging across all muscle groups, female participants had significantly greater MFI than male participants (1.8% ± 0.8% higher MFI in females, p = 0.026). Females had significantly higher MFI than males in the SSCap, SPCap, and TR (one-way ANCOVA controlling for age and BMI). Estimated marginal means are shown. See Table 2 and Supplementary Table 3 for additional information. Error bars = 1 standard error. *p < 0.05, **p < 0.01, ***p < 0.001.

Figure 6. Relationship between muscle fat infiltration (MFI), age, and body mass index (BMI) across the muscle groups. Partial correlations (Pearson's r) were performed to identify linear relationships between MFI and age or MFI and BMI in 84 participants from the first time point (< 2 weeks following a motor vehicle collision, 61 females, 23

Scientific Reports | (2021) 11:16567 | https://doi.org/10.1038/s41598-021-95972-x

Methods
Participants. MRI datasets from 84 participants (61 females, 23 males, age = 34.2 ± 10.7 years) were obtained from a prospective observational longitudinal study exploring recovery from whiplash (ClinicalTrials.gov Identifier: NCT02157038). Datasets from the first (< 2 weeks post MVC) and fourth (12 months post MVC) study time points were used in the present study. Inclusion criteria were age 18 to 65 years, Quebec Task Force whiplash grades of II to III, and < 1 week post MVC with a primary complaint of neck pain. Exclusion criteria were a history of a previous MVC, spinal fracture, previous spinal surgery, previous diagnosis of cervical or lumbar radiculopathy, history of neurological or metabolic disorders, and contraindications to MRI. The study was approved by Northwestern University's Institutional Review Board. All applicable institutional and governmental regulations concerning the ethical use of human volunteers were followed during the course of this research according to the Declaration of Helsinki, and written informed consent was obtained from every participant. Prior to working with the datasets, identifying personal information was removed.

Each muscle was segmented manually by tracing its borders on consecutive axial slices using methods previously described 23. The muscle groups were segmented with separate labels for the left and right muscles at predetermined cervical levels where each muscle group is consistently present and can be accurately segmented. A mid-sagittal slice was used to identify the axial slices corresponding to each vertebral level (Table 1). The segmentation masks contained the background (label = 0) and each muscle labeled with an integer value (labels = 1-14). A single rater (VB) blinded to any demographic or clinical information segmented the 14 muscles of interest from 52 cervical spine Dixon scans. The segmentations were then reviewed by an additional independent, blinded rater (KW).
The time required to segment a single Dixon scan ranged from 4 to 8 h. These segmentations were used as the ground truth (GT) for training and testing the CNN model. To assess interrater reliability, a third independent rater (RA) segmented a randomly selected subset of the Dixon scans (n = 10). All raters were doctoral level health professionals (physical therapy (RA) and chiropractic (VB and KW)) with extensive training in cervical spine anatomy and musculoskeletal imaging. The raters were permitted to use the fat, water, in-phase, and out-of-phase images to guide the segmentations.

Image acquisition.
Data augmentation. Data augmentation was used to increase the variability in the training dataset. First, the images were split into training and testing datasets. The training dataset consisted of images from 17 participants (14 females, 3 males, age = 33.7 ± 11.4 years) with 17 scans from the first study time point and 17 scans from the fourth study time point (34 training scans total). The testing dataset consisted of images from 18 participants (11 females, 7 males, age = 31.7 ± 9.6 years) with 9 scans from the first time point and 9 scans from the fourth time point (18 testing scans total). Participants in the testing dataset were independent from the participants in the training dataset. From the training dataset, 2,000 augmented images were generated by applying a series of random mirroring (left-right flip), elastic deformation (number of control points = 3, sigma = 10), anisotropic spatial scaling (percentage = ± 2.5), and rotation (left-right axis rotation = ± 2.5°, anterior-posterior axis rotation = ± 2.5°, superior-inferior axis rotation = ± 5.0°). The non-augmented (i.e., raw) training images (n = 34) were used to assess the model performance across the training iterations using the Sørensen-Dice index (i.e., validation dataset). Data augmentation and training and testing the CNN model were performed using NiftyNet (Version 0.6.0), an open-source deep-learning platform built on TensorFlow (Version 1.15) in Python (Version 3.6) and designed specifically for medical imaging analysis 53 .
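To make the augmentation ranges concrete, the sketch below draws random parameter sets within the limits reported above. It only samples parameters and does not apply them to images or reproduce NiftyNet's implementation. The function name `sample_augmentation`, the dictionary keys, the fixed seed, and the 50% mirroring probability are assumptions made for illustration.

```python
import random

# Fixed elastic deformation settings reported in the text.
ELASTIC_CONTROL_POINTS = 3
ELASTIC_SIGMA = 10


def sample_augmentation(rng):
    """Draw one random set of augmentation parameters within the reported ranges."""
    return {
        # Assumption: mirroring is applied with probability 0.5.
        "flip_left_right": rng.random() < 0.5,
        # Anisotropic scaling: an independent percentage per spatial axis.
        "scale_percent": [rng.uniform(-2.5, 2.5) for _ in range(3)],
        # Rotation angles in degrees about each anatomical axis.
        "rotation_deg": {
            "left_right_axis": rng.uniform(-2.5, 2.5),
            "anterior_posterior_axis": rng.uniform(-2.5, 2.5),
            "superior_inferior_axis": rng.uniform(-5.0, 5.0),
        },
        "elastic": {"control_points": ELASTIC_CONTROL_POINTS, "sigma": ELASTIC_SIGMA},
    }


rng = random.Random(0)  # hypothetical seed, fixed only for repeatability
plan = [sample_augmentation(rng) for _ in range(2000)]  # one entry per augmented image
print(plan[0])
```

The rotation and scaling ranges are narrow, so each augmented image remains close to a plausible acquisition.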
Dense V-Net. CNNs are a class of deep neural networks that preserve spatial information in the network architecture and can learn patterns within images. Here we used the dense V-Net, a fully convolutional neural network model designed for segmentation tasks 39. The dense V-Net consists of batch-wise spatial dropout, dense feature stacks, V-Net downsampling and upsampling subnetworks, and dilated convolutions 54-56. Skip connections in the V-Net architecture forward higher resolution information to the final segmentation. To limit bias towards predicting the image background, a loss function based on the Sørensen-Dice index (Dice Hinge) was employed and minimized. The output after soft-max transformation is a set of probabilistic segmentation masks, one for each muscle, with the same dimensions as the input volume.

Training. Each dataset was first resampled to 0.7 mm × 0.7 mm × 3.0 mm resolution and zero padded (90 × 60 × 8 voxels). A three-dimensional dense V-Net model was trained using the in-phase and out-of-phase images of the augmented training dataset (spatial window = 360 × 240 × 32 voxels, learning rate = 0.001, activation function = ReLU, optimizer = Adam, loss function = Dice Hinge, regularization = ℓ2, decay = 0.00001, samples per volume = 3, batch size = 3, window sampling = uniform). Prior to training, histogram standardization and label normalization were performed. The dense V-Net model was initialized with random weights, and training was completed once the Sørensen-Dice index plateaued on the validation dataset.
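The rationale for a Dice-based loss can be illustrated with a plain soft Dice loss on a single class. This is a simplified sketch and not NiftyNet's Dice Hinge implementation; the function name `soft_dice_loss`, the smoothing constant `eps`, and the toy 100-voxel example are assumptions for illustration.

```python
def soft_dice_loss(probs, target, eps=1e-6):
    """Return 1 - soft Dice for one class.

    probs:  predicted foreground probabilities, one per voxel (flat list)
    target: binary ground truth labels, one per voxel (flat list)
    eps:    small constant that avoids division by zero for empty masks
    """
    intersection = sum(p * t for p, t in zip(probs, target))
    denominator = sum(probs) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


# Toy volume: only 2 of 100 voxels belong to the muscle.
target = [1, 1] + [0] * 98
all_background = [0.0] * 100          # predicts background everywhere
perfect = [float(t) for t in target]  # predicts the ground truth exactly

# Voxel-wise accuracy of the all-background prediction.
accuracy_all_background = sum(
    1 for p, t in zip(all_background, target) if round(p) == t
) / len(target)

loss_all_background = soft_dice_loss(all_background, target)
loss_perfect = soft_dice_loss(perfect, target)
print(accuracy_all_background, loss_all_background, loss_perfect)
```

In this toy case, predicting background everywhere is 98% accurate voxel-wise but still incurs a loss close to one, because the Dice term depends only on the overlap with the foreground.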
Performance. Performance of the CNN model was assessed using the Sørensen-Dice index (Dice), Jaccard index (JI), conformity coefficient (CC), true positive rate (TPR), true negative rate (TNR), positive predictive value (PPV), and volume ratio (VR) measures (Supplementary Table 1) 57. Percent muscle fat infiltration (MFI) and muscle volume (ml) were measured using the segmentation masks from the GT and the CNN model for each muscle. MFI was calculated as the mean fat-only signal within a muscle divided by the sum of the mean fat-only signal and the mean water-only signal within a muscle multiplied by 100:

$$\mathrm{MFI}\ (\%) = \frac{\bar{S}_{\mathrm{fat}}}{\bar{S}_{\mathrm{fat}} + \bar{S}_{\mathrm{water}}} \times 100$$

where $\bar{S}_{\mathrm{fat}}$ and $\bar{S}_{\mathrm{water}}$ denote the mean fat-only and mean water-only signal within the muscle segmentation, respectively.

Accuracy and reliability between the GT and the CNN model were assessed using Bland-Altman plots, Pearson correlations, and intraclass correlation coefficients (ICC 2,1). Interrater reliability between the two manual raters was calculated in the same way. For reliability and accuracy, MFI was considered the primary measure, while volume was used as a secondary measure to further assess the CNN segmentations.
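The measures described in this section can be sketched on flattened voxel lists as follows. The overlap metrics use their standard confusion-matrix definitions, which are assumed here to match Supplementary Table 1; the volume ratio is taken as CNN volume divided by GT volume, consistent with the interpretation of values greater than one given in the Results. The function names and toy values are hypothetical.

```python
def mfi_percent(fat, water, mask):
    """Percent MFI: mean fat signal / (mean fat + mean water signal) * 100 within a mask."""
    fat_vals = [f for f, m in zip(fat, mask) if m]
    water_vals = [w for w, m in zip(water, mask) if m]
    if not fat_vals:
        raise ValueError("mask contains no voxels")
    mean_fat = sum(fat_vals) / len(fat_vals)
    mean_water = sum(water_vals) / len(water_vals)
    return 100.0 * mean_fat / (mean_fat + mean_water)


def overlap_metrics(gt, pred):
    """Overlap metrics for two binary masks (assumes both masks overlap and neither fills the volume)."""
    tp = sum(1 for g, p in zip(gt, pred) if g and p)
    fp = sum(1 for g, p in zip(gt, pred) if not g and p)
    fn = sum(1 for g, p in zip(gt, pred) if g and not p)
    tn = sum(1 for g, p in zip(gt, pred) if not g and not p)
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),
        "jaccard": tp / (tp + fp + fn),
        "conformity": 1 - (fp + fn) / tp,    # equivalent to (3 * Dice - 2) / Dice
        "tpr": tp / (tp + fn),
        "tnr": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "volume_ratio": (tp + fp) / (tp + fn),  # predicted volume / GT volume
    }


# Toy example: ten voxels, GT covers four, the prediction covers five.
gt_mask = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
cnn_mask = [0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
metrics = overlap_metrics(gt_mask, cnn_mask)

# Toy fat-only and water-only signals with a two-voxel muscle mask.
mfi = mfi_percent([10, 20, 30, 40], [90, 80, 70, 60], [1, 1, 0, 0])
print(metrics, mfi)
```

MFI is a ratio of mean signals inside the mask rather than a count of voxels, so it depends less directly on the exact placement of the segmentation boundary than volume does. This may help explain why MFI agreement was higher than volume agreement in this study.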
MFI assessment and characterization. Next, we used the trained CNN model to automatically segment the datasets of 84 participants from the first study time point (< 2 weeks following MVC, 61 females, 23 males, age = 34.2 ± 10.7 years) to examine differences in MFI between the muscle groups and then explore the relationship between MFI and sex, age, and BMI. The MFI of each muscle was calculated from the CNN segmentations, and then the left and right MFI measures for each muscle group were averaged to limit the number of comparisons and because we had no a priori hypotheses regarding left-right differences in MFI. A repeated measures ANCOVA with a within-subject variable of the muscle group, a between-subject factor of sex, and covariates of BMI and age was performed. This was followed by two-tailed paired t-tests to assess differences in MFI between the muscle groups, and pair-plots were generated to assess correlations in MFI between each muscle group. To identify sex differences in MFI for each muscle group, a one-way ANCOVA (i.e., multiple linear regression) was performed with a fixed factor of sex and covariates of age and BMI. Two-tailed partial Pearson correlations were performed to identify linear relationships between MFI and age or MFI and BMI while correcting for sex and BMI or sex and age, respectively. As these analyses were exploratory and aimed at further characterizing the MFI measures, no corrections for multiple comparisons were performed. Since the muscle groups were segmented at specific vertebral levels rather than across the entire superior-inferior extent of the muscle, the muscle volume measures obtained from the CNN model were not expected to represent an accurate measure of muscle size, and therefore, a more in-depth analysis of muscle volumes was not performed. Statistical analyses were performed using IBM SPSS Statistics (Version 27.0, Armonk, NY, USA), and an α < 0.05 was considered statistically significant.
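The partial Pearson correlations described above can be illustrated with the standard recursive formula, which removes one covariate at a time. The study reports performing these analyses in SPSS, so the sketch below is only an illustration; the function names and the synthetic values are hypothetical and were constructed so that the raw association between `age` and `mfi` is driven entirely by `sex`.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, covariates):
    """Partial correlation of x and y controlling for a list of covariate sequences."""
    if not covariates:
        return pearson(x, y)
    *rest, z = covariates  # peel off the last covariate, recurse on the others
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic, non-study values for eight hypothetical participants.
sex = [0, 0, 0, 0, 1, 1, 1, 1]      # coded 0/1
bmi = [1, -1, -1, 1, 1, -1, -1, 1]  # centered toy values
age = [1, -1, 1, -1, 3, 1, 3, 1]    # equals 2 * sex plus independent noise
mfi = [1, 1, -1, -1, 3, 3, 1, 1]    # equals 2 * sex plus independent noise

r_raw = pearson(age, mfi)
r_given_sex = partial_corr(age, mfi, [sex])
r_given_sex_bmi = partial_corr(age, mfi, [sex, bmi])
print(r_raw, r_given_sex, r_given_sex_bmi)
```

In this constructed example, the raw correlation of 0.5 vanishes once sex is controlled for, which illustrates why sex and BMI were partialled out before interpreting the relationship between age and MFI.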

Data availability
The de-identified datasets used in this study are available from the corresponding author upon reasonable request. The CNN segmentation model was developed using open-source Python packages (TensorFlow and NiftyNet). We are making the code and model openly available for transparency, replication, reproduction, and further research in more diverse samples. These will be made available on GitHub (https://github.com/kennethaweberii/). We will also release code to segment and compute volume and MFI from a cervical spine Dixon fat-water imaging scan.