Introduction

The iliopsoas muscles, predominantly made up of slow-twitch fibers, are a composite of the psoas major and iliacus muscles; they are anatomically separate in the abdomen and pelvis but merge in the thigh. The iliopsoas is engaged during most day-to-day activities, including posture, walking and running. Together these muscles serve as the chief flexor of the hip and a dynamic stabiliser of the lumbar spine1, with the psoas uniquely having a role in the movement of both the trunk and lower extremities2. Given the key involvement of the iliopsoas muscles in daily activities, there is increasing interest in their potential as a health biomarker. This has most commonly taken the form of a cross-sectional area (CSA) through one (generally the right) or both iliopsoas muscles, with the most common measurement taken through the psoas muscle. This CSA can be used either as an independent measurement or as a ratio to vertebral body size3,4 or in the form of the psoas muscle index, calculated as the psoas major muscle CSA divided by the height squared5. Indeed, psoas CSA has been suggested as a predictor of sarcopenia6, surgical outcome and length of hospital stay post surgery7,8,9, poor prognosis in response to cancer treatment10, morbidity following trauma4, a surrogate marker of whole body lean muscle mass11, cardiovascular fitness12, changes in cardiometabolic risk variables following lifestyle intervention13 and even risk of mortality14,15.

Measurements of the psoas major muscle are most commonly made from CSA of axial MRI or CT images7,12, with most studies relying on manual annotation of a single slice through the abdomen. These images tend to be retrospectively repurposed from clinical scans rather than obtained from a specific acquisition16,17,18. However, the CSA of the psoas muscle varies considerably along its length2; therefore, small differences in measurement position can potentially have a significant effect on its overall measured size. Moreover, there is a lack of consistency within the literature regarding the precise location at which measurement of the psoas CSA should be made, with researchers using a variety of approaches including: the level of the third lumbar vertebra (L3)6,9,10,17,18, L4 (refs. 3,4,14,16), between L4-L5 (refs. 11,13), as well as the level of the umbilicus7,8,19, the precise position of which is known to vary with obesity/ascites. There is further discrepancy between studies regarding whether the measurements should comprise one single10 or both psoas muscles17, with the majority of publications combining the areas of both muscles.

This lack of consistency, together with the relatively low attention given to robustness and reproducibility of its measurement and the reliance on images from retrospective clinical scans, has led many to question its validity as a biomarker20. A more objective proposition may be to measure total psoas muscle volume21,22,23,24,25 from dedicated images. A variety of approaches have been used thus far: inclusion of muscle between L2-L5 (ref. 21), psoas muscle volume from L3 to approximately the level of the iliopectineal arch (end point estimated from images in publications)22,23, from the origin of the psoas at lumbar vertebrae (unspecified) to its insertion in the lesser trochanter24, or with no anatomical information provided at all25. Whilst all of these approaches include substantially more muscle than is included in simple CSA measurements, they are still incomplete volume measurements. Moreover, measuring the entire psoas muscle volume as a single entity is challenging, since even with 3D volumetric scans it is difficult to differentiate between the composite iliacus and psoas muscles once they merge at the level of the inguinal ligament. Therefore, to measure psoas volume as an independent muscle it is necessary to either assign an arbitrary cutoff and exclude a considerable proportion of the psoas muscle (estimated to be approximately 50% in some studies22) or simply include the iliacus muscle and measure the iliopsoas muscle volume in its entirety.

Convolutional Neural Networks (CNNs) have become a strong tool for automated image segmentation, especially architectures such as the U-Net26 for two-dimensional (2D) data or the V-Net27 for three-dimensional (3D) data. These techniques owe their popularity to the modest amount of training data required, robustness and fast execution speed. CNNs have been applied for automated muscle segmentation in computed tomography28,29,30, specifically for 2D segmentation of the psoas major muscle29, as well as MRI31,32.

The increasing use of whole body imaging33 in large cohort studies such as the UK Biobank (UKBB), which plans to acquire MRI scans from the neck to the knee in 100,000 individuals34, requires different approaches to image analysis. Manual image segmentation is time-consuming and infeasible in a cohort as large as the UKBB. However, this dataset provides a unique opportunity to measure iliopsoas muscle volume in a large cross-sectional population. Therefore, development of a robust and reliable automated method is essential. In this paper, we present an automated method to segment iliopsoas muscle volume using a CNN and discuss results arising from 5000 participants from the UKBB imaging cohort, balanced for BMI, age, and gender.

Methods

Data

A total of 5000 subjects were randomly selected for this study from the UKBB imaging cohort, while controlling for BMI, age, and gender. Age was discretised into four groups: 44–53, 54–63, 63–72 and 73–82 years. Eight strata were defined to cover both age and gender. Weights were used to maintain the proportion of subjects within each age group to match that of the larger UKBB population.
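One way to read the weighting described above is as a proportional allocation across the eight age-by-gender strata. The sketch below is only an illustration of that reading: the function name `allocate_strata`, the equal split between genders within each age group, and the age-group shares are all assumptions made here, not values or procedures taken from the study (control for BMI is not modelled).

```python
def allocate_strata(n_total, age_group_share, genders=("female", "male")):
    """Split n_total across age-by-gender strata.

    age_group_share maps an age-group label to its share of the reference
    population (shares should sum to 1). Each age group's allocation is
    divided equally between the genders; subjects left over after rounding
    down go to the strata with the largest fractional remainders.
    """
    exact = {
        (age, g): n_total * share / len(genders)
        for age, share in age_group_share.items()
        for g in genders
    }
    alloc = {k: int(v) for k, v in exact.items()}
    leftover = n_total - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


# Hypothetical age-group shares, for illustration only (not UKBB figures).
shares = {"44-53": 0.20, "54-63": 0.35, "63-72": 0.35, "73-82": 0.10}
allocation = allocate_strata(5000, shares)
```

Subjects would then be drawn at random within each stratum until its allocation is filled.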

Demographics for the study population (Table 1) were balanced for gender (female:male ratio of 49.9:50.1). The average age of the male subjects was \(63.3 \pm 8.4\) years and that of the female subjects was \(63.3 \pm 8.3\) years. The average BMI of the male subjects was \(27.0 \pm 3.9~\text {kg/m}^2\) (range 17.6–50.9 \(\text {kg/m}^2\)) and that of the female subjects was \(26.2 \pm 4.7~\text {kg/m}^2\) (range 16.1–55.2 \(\text {kg/m}^2\)), with the mean for both groups being categorised as overweight. The self-reported ethnicity was predominantly White European (96.76%). As per the whole UKBB population, the sub-cohort in the current study was significantly healthier than the UK general population. The most common ailments were related to arthropathies, with a smaller proportion reporting a variety of neoplasms, ranging from skin melanomas to benign neoplasms (Supplementary Table S1).

Table 1 Demographics of the subjects (\(n=5000\)).

Participant data from the UKBB cohort was obtained as previously described34 through UKBB Access Application number 23889. The UKBB has approval from the North West Multi-Centre Research Ethics Committee (REC reference: 11/NW/0382). All methods were performed in accordance with the relevant guidelines and regulations, and informed consent was obtained from all participants. Researchers may apply to use the UKBB data resource by submitting a health-related research proposal that is in the public interest. More information may be found on the UKBB researchers and resource catalogue pages (https://www.ukbiobank.ac.uk/).

Raw MR images were obtained from the UKBB Abdominal Protocol35 and preprocessed as previously reported36,37. The data were acquired on the same scanner model, a Siemens Aera 1.5 T scanner (Syngo MR D13) (Siemens, Erlangen, Germany), across three sites (Stockport, Newcastle, Reading, UK). The Dixon sequence involved six overlapping series that were acquired using a common set of parameters: TR = 6.67 ms, TE = 2.39/4.77 ms, in-plane voxel size \(2.232 \times 2.232\) mm, FA = \(10^\circ \) and bandwidth = 440 Hz. The first series, over the neck, consisted of 64 slices, slice thickness 3.0 mm and \(224 \times 168\) matrix; series two to four (covering the chest, abdomen and pelvis) were acquired during 17-second expiration breath holds with 44 slices, slice thickness 4.5 mm and \(224 \times 174\) matrix; series five, covering the upper thighs, consisted of 72 slices, slice thickness 3.5 mm and \(224 \times 162\) matrix; series six, covering the lower thighs and knees, consisted of 64 slices, slice thickness 4 mm and \(224 \times 156\) matrix. During preprocessing the data were resampled to voxel size \(2.232 \times 2.232 \times 3.0\) mm.
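Because all data share the resampled voxel size, a segmentation can be converted to a volume by counting voxels. The sketch below shows that arithmetic under the assumption that volume is computed as voxel count times voxel volume; the text does not state the exact procedure, and the function name and the voxel count are hypothetical.

```python
# Resampled voxel size reported in the text, in millimetres.
VOXEL_SIZE_MM = (2.232, 2.232, 3.0)


def voxel_count_to_ml(n_voxels, voxel_size_mm=VOXEL_SIZE_MM):
    """Convert a count of segmented voxels to a volume in millilitres."""
    dx, dy, dz = voxel_size_mm
    voxel_volume_mm3 = dx * dy * dz
    return n_voxels * voxel_volume_mm3 / 1000.0  # 1 ml = 1000 mm^3


volume_ml = voxel_count_to_ml(27000)  # hypothetical voxel count
```

With these dimensions each voxel contributes roughly 0.015 ml, so a muscle of about 400 ml corresponds to roughly 27,000 labelled voxels.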

Manual annotation

A single expert radiographer manually annotated both iliopsoas muscles for 90 subjects using the open-source software MITK38. Each axial slice of the water images was examined, the iliopsoas identified, and the borders of the psoas and iliopsoas manually drawn. On average, manual annotation of both muscles took five to seven hours per subject. The annotated data covered a broad range of age and BMI from male and female UKBB participants. A typical Dixon abdominal dataset, centred on the iliopsoas muscles, is shown in Fig. 1, where the manual iliopsoas muscle annotations are overlaid on the anatomical reference volume in red and also shown as a 3D rendering.

Figure 1

Iliopsoas muscle manual annotations: (a) axial, (b) sagittal, and (c) coronal views, (d–f) the segmentation (red) overlaid on the anatomical reference data, and (g) 3D rendering of the manual segmentation.

Model

We trained a model able to predict both muscles individually. The training data were preprocessed as follows; the cropping step is also required when applying the model to unseen data. Two arrays of size \(96 \times 96 \times 192\) were cropped around the hip landmarks36 to approximate the location of the muscles for segmentation (an example of the cropped regions may be found in Supplementary Fig. S1). After cropping, each volume was normalised such that the signal intensities lie between zero and one, where the 99th percentile was used instead of the maximum to avoid possible spikes in signal intensity. That is, all signal intensities above the 99th percentile were mapped to one. Two sets of 16 training samples were generated for every subject by separating the right and the left muscles and introducing reflections that exploit the symmetry of the structures. Further data augmentation included seven random transformations, in addition to the original data, consisting of translations by up to six voxels in-plane and up to 24 voxels out-of-plane, and random scaling ranging from \(-50\) to \(+50\)% out-of-plane and from \(-25\) to \(+25\)% in-plane. We chose larger factors for out-of-plane transformations to account for the skewed variability in shape and position of the muscles, reflecting the fact that there is more variation in height than width in the population. After data augmentation, 2880 training samples (\(90 \times 2 \times 16\)) were produced from the original 90 manually annotated pairs of iliopsoas muscles.
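The percentile-based normalisation can be illustrated with a minimal sketch on a flattened list of voxel intensities. The percentile definition (linear interpolation between ranks), the assumption that intensities are non-negative so that no minimum is subtracted, and the function names are choices made here for illustration; the text does not specify them.

```python
import math


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def normalise_intensities(voxels, q=99.0):
    """Scale intensities to [0, 1]; values above the q-th percentile map to one."""
    upper = percentile(voxels, q)
    if upper <= 0:
        return [0.0 for _ in voxels]
    return [min(max(v / upper, 0.0), 1.0) for v in voxels]


# Toy flattened volume: 100 ordinary voxels plus one intensity spike.
toy = list(range(100)) + [10_000]
scaled = normalise_intensities(toy)
```

In the toy data the spike of 10,000 is clipped to one instead of compressing every other voxel towards zero, which is the effect the 99th percentile is intended to achieve.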

The model used for 3D iliopsoas muscle segmentation closely follows the architecture of the U-Net26 and the V-Net27, with a contracting path and an expansive path connected by skip connections at each resolution level. These network architectures have been established as the gold standard for image segmentation over the last few years, as they require modest amounts of training data as a consequence of operating at multiple resolution levels while providing excellent results within seconds. Several convolution blocks are used in our model architecture. An initial block (I) contains a \(5 \times 5 \times 5\) convolution with eight filters followed by a \(2 \times 2 \times 2\) convolution with 16 filters and stride two. The down-sampling blocks in the contraction (\(D_{i,m}\)) consist of \(i\) successive \(5 \times 5 \times 5\) convolutions with \(m\) filters followed by a \(2 \times 2 \times 2\) convolution with stride two, used to decrease the resolution. In the expansion, the up-sampling blocks (\(U_{j,n}\)) mirror the ones in the contraction, with transpose convolutions instead of stride two convolutions. The block (L) at the lowest resolution level of the architecture consists of three successive \(5 \times 5 \times 5\) convolutions with 128 filters followed by a \(2 \times 2 \times 2\) transpose convolution of stride two and 64 filters. The final block (F) contains a \(5 \times 5 \times 5\) convolution with 16 filters followed by a single \(1 \times 1 \times 1\) convolution and a final sigmoid activation classification layer. All blocks incorporate skip connections between their input and output, resulting in residual layers. The architecture follows: \(I \rightarrow D_{2,32} \rightarrow D_{3,64} \rightarrow D_{3,128} \rightarrow L \rightarrow U_{3,128} \rightarrow U_{3,64} \rightarrow U_{3,32} \rightarrow F\) with skip connections between blocks at equivalent resolution levels.
Padding and a stride of one are used for the convolutions throughout the network, except where a stride of two is specified for moving between the resolution levels. Other than the final sigmoid activation, scaled exponential linear units (SELU) are used throughout the network. The recently proposed SELU activation function39 has self-normalising properties that remove the need for batch normalisation layers, enabling higher learning rates that lead to more robust and faster training.
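To make the resolution levels concrete, the sketch below follows the spatial size of the feature maps through the block sequence given above. It assumes that padding preserves the spatial size in every stride-one convolution, so only the stride-two convolutions (halving) and the stride-two transpose convolutions (doubling) change it. The function name and the block labels are illustrative, and filter counts are deliberately not modelled.

```python
def trace_resolution(input_shape=(96, 96, 192)):
    """Follow the spatial size of the feature maps through the network.

    Assumes size-preserving padding, so only the stride-two convolutions
    (halving) and stride-two transpose convolutions (doubling) change it.
    """
    # Each block ends by moving down (-1) or up (+1) one resolution level;
    # the final block F stays at full resolution (0).
    blocks = [("I", -1), ("D_2,32", -1), ("D_3,64", -1), ("D_3,128", -1),
              ("L", +1), ("U_3,128", +1), ("U_3,64", +1), ("U_3,32", +1),
              ("F", 0)]
    shape = tuple(input_shape)
    trace = []
    for name, step in blocks:
        block_input = shape
        if step < 0:
            shape = tuple(s // 2 for s in shape)
        elif step > 0:
            shape = tuple(s * 2 for s in shape)
        trace.append((name, block_input, shape))
    return trace


for name, shape_in, shape_out in trace_resolution():
    print(f"{name:8s} {shape_in} -> {shape_out}")
```

Under these assumptions the four down-sampling steps take the \(96 \times 96 \times 192\) input to \(6 \times 6 \times 12\) at block L, and the four up-sampling steps restore the original size before the final block.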

The model was trained by minimising the Dice score coefficient (DSC) loss27 with a batch size of three, using the Adam optimiser and a learning rate of 1e−4, until convergence at 100 epochs. The learning rate was determined through a parameter sweep (1e−1 to 1e−6). We performed all of the CNN development, learning, and predictions using Keras (TensorFlow backend)40 on an NVIDIA Titan V 12 GB GPU. We limited the batch size to three due to GPU memory constraints.
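A Dice-based loss can be written in a few lines. The sketch below uses the soft Dice form with squared terms in the denominator, as in the V-Net paper cited for the loss, and returns one minus the coefficient so that a lower value is better. Whether the authors used exactly this form, and the smoothing constant `eps`, are assumptions; the example operates on plain Python lists rather than Keras tensors.

```python
def soft_dice_loss(probs, target, eps=1e-7):
    """One minus the soft Dice coefficient between predictions and labels.

    probs  : flattened sigmoid outputs in [0, 1]
    target : flattened binary ground-truth labels (0 or 1)
    eps    : small constant (an assumption here) that avoids division by zero
    """
    intersection = sum(p * t for p, t in zip(probs, target))
    denominator = sum(p * p for p in probs) + sum(t * t for t in target)
    dice = (2.0 * intersection + eps) / (denominator + eps)
    return 1.0 - dice


perfect = soft_dice_loss([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0])
disjoint = soft_dice_loss([0.0, 1.0, 0.0, 1.0], [1, 0, 1, 0])
```

A perfect prediction gives a loss close to zero and a completely disjoint one gives a loss close to one, independent of how small the muscle is relative to the cropped volume.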

Validation

A common metric used to evaluate segmentation performance is the DSC, also known as the F1 score. It is defined as twice the size of the intersection of the two labels divided by the total number of elements in both labels. The intersection of the labels corresponds to the True Positive (TP) outcomes, and the total number of elements is the sum of all False Positives (FP), False Negatives (FN) and twice the number of TPs.

$$\begin{aligned} \text {DSC} = \frac{2\,\text {TP}}{\text {FP} + 2\,\text {TP} + \text {FN}} \end{aligned}$$
(1)
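Equation (1) can be evaluated directly from two binary masks. The following sketch does so for flattened masks stored as Python lists; the function name and the convention of returning 1.0 when both masks are empty are choices made here for illustration.

```python
def dice_score(predicted, reference):
    """Dice score coefficient between two flattened binary masks."""
    tp = sum(1 for p, r in zip(predicted, reference) if p and r)
    fp = sum(1 for p, r in zip(predicted, reference) if p and not r)
    fn = sum(1 for p, r in zip(predicted, reference) if not p and r)
    total = fp + 2 * tp + fn
    # Two empty masks agree perfectly; treating that as 1.0 is a convention.
    return 1.0 if total == 0 else 2 * tp / total


predicted = [1, 1, 1, 0, 0, 0]
reference = [0, 1, 1, 1, 0, 0]
score = dice_score(predicted, reference)
```

Here two voxels overlap, one is a false positive and one a false negative, so the score is \(4/6 \approx 0.67\).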

For validation of the model, we performed a six-fold cross-validation experiment, where in a single iteration 75 of the manually annotated images (approximately 83%) were used to train the model and the performance was evaluated on the remaining 15 out-of-sample images (approximately 17%).
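The fold structure described above can be sketched as follows: 90 subjects are shuffled once and divided into six folds of 15, and each fold is held out in turn while the remaining 75 are used for training. The shuffling seed and the function name are assumptions; the text does not say how subjects were assigned to folds.

```python
import random


def k_fold_splits(subject_ids, k=6, seed=0):
    """Yield (train_ids, test_ids) pairs; each subject is held out exactly once."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    for i, test_ids in enumerate(folds):
        train_ids = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train_ids, test_ids


splits = list(k_fold_splits(range(90), k=6))
```

Splitting by subject rather than by augmented sample keeps the left and right muscles of one participant, and all of their augmented copies, on the same side of the train/test boundary.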

Statistical analysis

All summary statistics, hypothesis tests and figures were produced using the R software environment for statistical computing and graphics41. Variables were tested for normality using the Shapiro–Wilk test; the null hypothesis was rejected in all cases. Spearman’s rank correlation coefficient (\(\rho \)) was used to assess monotonic trends between variables. The Wilcoxon rank-sum test was used to compare groups, and the Wilcoxon signed-rank test was used for paired observations. Methods for segmenting the iliopsoas muscle volume were compared using Bland–Altman plots. Given the exploratory nature of the research, p-values \(< 0.05\) were judged to be statistically significant.
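For readers unfamiliar with Spearman's \(\rho \), it is the Pearson correlation computed on ranks, which is why it captures monotonic rather than strictly linear trends. The analysis in this study was carried out in R; the short Python sketch below is only an illustration of the definition, with hypothetical data and function names, and it does not handle constant inputs.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


heights_cm = [158, 163, 170, 175, 182]
volumes_ml = [520, 610, 590, 700, 820]  # hypothetical values
rho = spearman_rho(heights_cm, volumes_ml)
```

In this toy example only one pair of adjacent ranks is swapped, giving \(\rho = 0.9\).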

Results

Validation

A summary of the cross-validation experiment may be found in Supplementary Table S2. The average bias was \(-0.2\)% with upper and lower limits of agreement being 13.3% and \(-13.7\)%, respectively (Fig. 2). The overlap between the CNN-based and manual segmentations for two subjects is also provided in Fig. 2, where the DSCs are 0.85 (left) and 0.90 (right) for (b) and 0.96 for both muscles in (c). With consistent DSCs from the cross-validation experiment showing a robust model performance on both muscles, we trained a final model using all 90 available manual annotations.
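The bias and limits of agreement reported above follow the usual Bland–Altman construction: the mean of the paired differences, plus or minus 1.96 standard deviations. The sketch below assumes that differences are expressed as a percentage of the mean of the two measurements and that the sample standard deviation is used; neither detail is stated in the text, and the volumes are hypothetical.

```python
from statistics import mean, stdev


def bland_altman_percent(cnn_ml, manual_ml):
    """Bias and 95% limits of agreement for percentage differences."""
    diffs = [100.0 * (c - m) / ((c + m) / 2.0) for c, m in zip(cnn_ml, manual_ml)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)  # sample standard deviation
    return bias, bias - half_width, bias + half_width


# Hypothetical volumes (ml) for four muscles, for illustration only.
cnn = [400.0, 310.0, 455.0, 290.0]
manual = [410.0, 300.0, 450.0, 305.0]
bias, lower, upper = bland_altman_percent(cnn, manual)
```

The limits are symmetric about the bias by construction, consistent with the reported values of \(-13.7\)% and 13.3% around \(-0.2\)%.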

Figure 2

Bland–Altman plot (a) of iliopsoas muscle volumes determined with CNN-based and manual segmentations (\(n=90\)), using a six-fold cross-validation experiment. Dotted lines represent the average bias (\(-0.2\)%) and the 95% limits of agreement. Overlays of the CNN-based and manual segmentations for two subjects (b,c), where the manual annotation is red, the CNN segmentation is green and the overlap is yellow.

Figure 3

CNN segmentations of the left and right iliopsoas muscles overlaid in purple (right) and blue (left) from a range of body types and iliopsoas muscle volumes: (a–c) small, (d–f) average, (g–i) large and (j–l) asymmetric. The top row for each subject displays the signal intensities without the segmentation result, the bottom row includes the iliopsoas muscle segmentations.

Example segmentations from our method are provided in Fig. 3, displaying a sample of 12 subjects covering a variety of body sizes and habitus. The first three subjects (a–c) have some of the smallest iliopsoas muscles (\(\text {total volume} \approx 346\) ml), the next three subjects (d–f) have typical iliopsoas muscles (\(\text {total volume} \approx 800\) ml) and the third set of three subjects (g–i) have some of the largest iliopsoas muscles (\(\text {total volume} \approx 1300\) ml). The final three subjects (j–l) have left and right iliopsoas muscles that differ in volume (\(\text {difference in volume} \approx 93\) ml for j and k, \(\text {difference in volume} = 182\) ml for l). The model performs well for all of them, with additional details regarding model validation provided in Supplementary Fig. S2.

Iliopsoas muscle volume

In each gender there was a small (approximately 2%) yet statistically significant asymmetry between left and right iliopsoas muscles (Wilcoxon signed-rank test; male: \(d = -7.3\) ml; female: \(d = -6.5\) ml; both \(p < 10^{-15}\)) (Fig. 4). These differences were not significantly associated with the handedness of the participants. Significantly larger iliopsoas muscle volumes were measured in male compared with female subjects (Table 2).

Figure 4

Difference in volume (ml) between the left and right iliopsoas muscles, separated by gender. Negative values indicate the right iliopsoas muscle is larger than the left.

Table 2 Iliopsoas muscle volumes (\(n = 5000\)).

Relationship between iliopsoas muscle volume and physical characteristics

Significant correlations were observed between the total iliopsoas muscle volume and height in both genders (male: \(\rho = 0.51\); female: \(\rho = 0.54\), both \(p < 10^{-15}\)) (Fig. 5).

Figure 5

Scatterplot of total iliopsoas muscle volume (ml) by height (cm), separated by gender.

To account for the potential confounding effect of height on iliopsoas muscle volume, an iliopsoas muscle index (IMI) was defined

$$\begin{aligned} \text {IMI} = \frac{\text {total iliopsoas muscle volume}}{\text {height}^2}, \end{aligned}$$
(2)

with units \(\text {ml}/\text {m}^2\). Significant correlations were observed between the IMI and BMI in both genders (male: \(\rho = 0.48\); female: \(\rho = 0.47\), both \(p < 10^{-15}\)) (Fig. 6).
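As a worked illustration of Eq. (2), the sketch below computes the IMI from a total volume in millilitres and a height in centimetres, converting the height to metres first. The function name and the example subject are hypothetical.

```python
def iliopsoas_muscle_index(total_volume_ml, height_cm):
    """IMI in ml/m^2: total iliopsoas volume divided by height squared."""
    height_m = height_cm / 100.0
    return total_volume_ml / height_m ** 2


imi = iliopsoas_muscle_index(800.0, 175.0)  # hypothetical subject
```

For a total volume of 800 ml and a height of 1.75 m this gives an IMI of approximately 261 ml/m\(^2\).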

Figure 6

Scatterplot of iliopsoas muscle index (\(\text {ml/m}^2\)) by BMI (\(\text {kg/m}^2\)), separated by gender.

A significant negative correlation was observed between IMI and age in both genders (male: \(\rho = -0.31\), \(p < 10^{-15}\); female: \(\rho = -0.11\), \(p < 10^{-7}\)). However, the relationship could not be easily explained by a simple linear model (Fig. 7). In fact, the decrease in IMI as a function of age accelerates for men, starting in their early 60s, while for women it remains relatively constant.

Figure 7

Scatterplot of iliopsoas muscle index (\(\text {ml/m}^2\)) by age at recruitment (years), separated by gender. The curves are fit to the data using a generalised additive model with cubic splines.

Discussion

There is considerable interest in measuring psoas muscle size, primarily related to its potential as a sarcopenic marker, thereby making it an indirect predictor of conditions influenced by sarcopenia and frailty, including health outcomes such as morbidity and mortality4,6,7,8,9,10,14,15. The complexity in measuring total muscle directly, particularly in a frail population, has necessitated the reliance on easily measured surrogates, and the psoas muscle CSA is increasingly used for this purpose. However, there is little consistency in the field regarding how the psoas muscle is measured, with considerable variation between publications. An automated approach to analysis will reduce the need for manual annotation, allowing more of the muscle to be measured and enabling much larger cohorts to be studied. This is particularly important as large population-based biobanks are becoming more common. In this paper we have described a CNN-based method to automatically extract and quantify iliopsoas muscle volume from MRI scans for 5000 participants from the UKBB. Excellent agreement was obtained between automated measurements and the manual annotation undertaken by a trained radiographer, as demonstrated by the extremely high DSC with testing data.

CNNs have been established as the gold standard in automated image segmentation. The results, which can be produced with a modest amount of manual annotations as training data and smart data augmentation, are highly accurate, fast, and reproducible. Manual annotations become a bottleneck for large-scale population studies when the number of participants reaches many thousands, as with the UKBB. Applying automated methods to vast amounts of data requires a thorough set of quality-control procedures beyond just out-of-sample testing data, which is often used to validate new methods in machine learning studies. Large-scale quality control can be done by steps such as looking at maximum and minimum values, asymmetric values (for symmetric structures such as the iliopsoas muscles), outliers, and overall behaviour of the results.
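Quality-control checks of this kind can be automated so that only a small subset of segmentations needs visual inspection. The sketch below flags empty segmentations, totals that are outliers relative to the cohort (using the median absolute deviation) and large left-right differences. The function name, the thresholds and the example volumes are all illustrative assumptions and are not the procedure or values used in this study.

```python
from statistics import median


def flag_for_review(volumes, k=3.0, max_asymmetry=0.25):
    """Return subject ids whose results warrant a visual check.

    volumes       : dict of subject id -> (left_ml, right_ml)
    k             : outlier cut-off in scaled median absolute deviations
    max_asymmetry : allowed |left - right| relative to the mean of both sides
    Both thresholds are illustrative assumptions, not values from this study.
    """
    totals = {s: left + right for s, (left, right) in volumes.items()}
    centre = median(totals.values())
    mad = median(abs(t - centre) for t in totals.values())
    flagged = set()
    for s, (left, right) in volumes.items():
        if left <= 0 or right <= 0:
            flagged.add(s)  # empty or failed segmentation
            continue
        if mad > 0 and abs(totals[s] - centre) > k * 1.4826 * mad:
            flagged.add(s)  # unusually small or large total volume
        if abs(left - right) / ((left + right) / 2.0) > max_asymmetry:
            flagged.add(s)  # implausible left-right difference
    return flagged


# Hypothetical (left, right) volumes in ml.
example = {
    "a": (400.0, 405.0), "b": (390.0, 398.0), "c": (410.0, 415.0),
    "d": (395.0, 402.0), "e": (405.0, 250.0), "f": (0.0, 380.0),
    "g": (900.0, 910.0),
}
to_review = flag_for_review(example)
```

In this toy cohort the subjects with a strongly asymmetric pair, an empty left segmentation and an implausibly large total are flagged, while the four typical subjects are not. Flagged cases may still be genuine anatomy, so the flags are prompts for inspection, not exclusions.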

The vast majority of previous studies investigating psoas muscle size have relied on CSA measurements, primarily because of data availability and time constraints3,4,6,7,8,9,10,11,13,14,16,17,18,19. Analysis of CSA is considerably less labour intensive than manually measuring tissue volumes. Furthermore, many studies have repurposed clinical CT or MRI scans16,17,18, which typically will not have been acquired in a manner that enables volume measurements. This has led to psoas muscle CSA being measured at a variety of positions relating to lumbar landmarks including L3, L4 and between L4-5, as well as more unreliable soft tissue landmarks such as the umbilicus, with the CSA measurements used alone or relative to lumbar area, height, height squared or total abdominal muscle within the image at the selected level. While lumbar landmarks should provide a relatively consistent CSA in longitudinal studies, comparison between studies and cohorts becomes almost impossible. This is further compounded by studies that have shown considerable variation in psoas CSA along its length2,42, and by regional differences in psoas CSA that have been observed in athletes43 and following exercise training or inactivity44. This appears to suggest that CSA at a fixed position may not accurately reflect changes in the psoas size elsewhere in response to health related processes. It is clear that to overcome these confounding factors, it is essential to measure total psoas volume.

In this study, we have trained a CNN to segment iliopsoas muscles, applied it to 5000 UKBB subjects and measured their total volume. This measurement includes the psoas major and iliacus muscles and, as discussed below, the psoas minor muscle (if present). This reflects the practical difficulties of isolating the entire psoas muscle in images in a consistent and robust manner. The merging of the iliacus and psoas muscles below the inguinal ligament makes their separation not only impractical, but unachievable with standard imaging protocols. Similarly, it is not possible to separate the psoas major and minor muscles under these conditions, even if CSA measurements were to be made. Therefore, a standard operating procedure was required: either to measure a partial psoas volume, selecting an anatomical cut-off before the junction with the iliacus muscle, or to include the iliacus and measure the iliopsoas muscle volume in its entirety. In this study we have opted for the latter, as selecting an arbitrary set point would clearly introduce a significant confounding factor with unforeseeable impact on the subsequent results. Thus, we have measured the entire iliopsoas muscle, and although literature comparisons are limited, as there is a paucity of comparable volumetric studies within the general population, our average reported values for male subjects (\(407.2 \pm 62.7\) ml) were within the range 351.1–579.5 ml reported in a cohort which included male athletes and controls43.

Furthermore, our CNN-based method performs very well, with a small but systematic underestimation of \(-0.2\)% when compared with manual annotations. Incremental improvement of the model is possible using straightforward techniques, such as increasing the number and variety of training data or expanding the breadth of data augmentation45. These are currently under investigation.

We observed a small (approximately 2%) but significant asymmetry in iliopsoas muscle volume, with the right muscle being larger in both male and female subjects. Previous studies have looked at muscle asymmetry in tennis players, and found that the iliopsoas muscle was 13% smaller on the non-dominant compared with the dominant side of the body, whereas in inactive controls the dominant side was 4% larger than the non-dominant43. Similarly, football players have significantly larger psoas CSA on their dominant kicking side46. The best equivalent to this within the UKBB phenotyping data was handedness, which we found not to be related to left-right differences in iliopsoas volume in the current study. An additional factor which may contribute towards iliopsoas asymmetry relates to the presence or absence of the psoas minor muscle, a long slim muscle typically found in front of the psoas major. This muscle can often fail to develop during embryonic growth2, and there can be considerable differences in the incidence of agenesis, which can be unilateral or bilateral, with ethnicity thought to be a factor47. Further work is required to understand whether this contributes to the left-right asymmetry observed in the present study, since it is not possible to resolve this muscle on standard MRI images.

In line with previous studies of psoas CSA, male subjects had significantly larger iliopsoas muscles compared to females6. This is unsurprising since gender differences in both total muscle and regional muscle volumes are well established48,49. Indeed, some studies have suggested using gender-specific cut-offs of either psoas CSA alone or psoas muscle index to identify patients at risk of poorer health outcomes10. Furthermore, some studies have suggested that the magnitude of gender differences in trunk muscle CSA varies depending on where it is measured. This adds weight to the argument that volumetric measurements are perhaps more robust than CSA measures for this comparison50. It has been proposed that the gender differences in psoas volume could in part relate to the impact of height on psoas volume12. Indeed, we found a significant correlation between iliopsoas muscle volume and height, similar to those previously reported49. However, the gender differences observed in our study were still present when correcting for height. Interestingly, it has been reported that the relationship between muscle volume and body weight is curvilinear, since increases in body weight often reflect gain in fat, as well as muscle mass. In the present study we observe a significant correlation between IMI and BMI. This is in agreement with previous studies of psoas CSA, which have also shown a significant correlation with BMI6; indeed, some studies combined both metrics as a prognostic marker17. We also found a significant correlation between IMI and age. It is widely reported that muscle mass declines with age, particularly beyond the fifth decade, a fundamental characteristic of sarcopenia51.
The magnitude of this decline was relatively small, but this may arise from the limited age range within the UKBB data set (44–82 years), compared to other studies that have investigated the impact of age on muscle volume across the entire adult age span (18–88 years), which tend to reveal a more dramatic decline in muscle volume49.

In conclusion, we have developed a robust and reliable model using a CNN to automatically segment iliopsoas muscles and demonstrated the applicability of this methodology in a large cohort, which will enable future population-wide studies of the utility of iliopsoas muscle as a predictor of health outcomes.