Introduction

Magnetic Resonance Imaging (MRI) has emerged as a crucial part of diagnosing pathologies such as osteoarthritis, ligament damage, tumors, and others1,2,3. Within MRI, several sequences can be deployed that exploit intrinsic tissue properties, providing images of varying weightings that effectively visualize tissues such as muscle, ligaments, bone marrow, and others4. In musculoskeletal (MSK) applications, clinical imaging protocols consist mostly of 2D fast spin echo (FSE) acquisitions with T1 or T2 weighting in various acquisition planes, which do well in depicting the structure and morphology of the underlying anatomy5. However, compositional MRI (cMRI) techniques that quantify tissue parameters are gaining more attention as a complement to qualitative imaging.

cMRI techniques like T2 relaxometry can provide maps of T2 values (or another intrinsic MR parameter) across an imaging volume rather than a morphological image. For MSK applications, T2 relaxometry offers sensitivity to water content, collagen content, and collagen fiber orientation in cartilage6, making it sensitive to biochemical changes that can precede morphological changes across several tissues and anatomies7,8. Pre-morphological change sensitivity has been best characterized in the knee, where T2 values are significantly higher across most cartilage compartments in healthy patients who later develop osteoarthritis (OA) compared to controls9,10. Additionally, T2 relaxometry offers quantitative MSK tissue health assessments, correlating with measures of hip cartilage and intervertebral disc (IVD) health11,14,15,16, whereas in conventional clinical imaging, only semiquantitative tissue health assessments are obtainable with expert annotation12,13. All of this makes cMRI a promising potential addition to clinical imaging protocols.

A major challenge facing clinical adoption of cMRI, however, is acquisition time: while mapping sequences like the magnetization-prepared angle-modulated partitioned k-space spoiled gradient echo snapshots (MAPSS) can provide robust MR parameter maps, their acquisition times can exceed 5–6 minutes, making their addition to a clinical scan protocol difficult17. Acquisitions can be accelerated by sampling fewer points in k-space, inducing aliasing artifacts in resulting images that must be removed through subsequent postprocessing. Some proposed approaches to these ends are reconstruction strategies such as parallel imaging (PI), compressed sensing (CS), model-based reconstructions, deep learning (DL), low-rank and sparse modeling methods, and MR Fingerprinting (MRF). Most of these approaches design an algorithm or exploit the redundancy of k-space acquisition across multiple coils to predict the appearance of the fully-sampled reconstructed image.

PI was one of the earliest techniques to accelerate MRI acquisition and has seen clinical adoption. Here, the redundancy of a multiple coil acquisition is leveraged to mitigate aliasing artifacts18,19,20, reducing clinical scan time up to acceleration factor R = 3 for MSK applications21,22. CS23 has also shown promise, where aliased images are iteratively reconstructed by minimizing an objective function, retaining fidelity to acquired k-space and imposing sparsity on the reconstructed image in another domain. CS has attained clinically acceptable MSK image quality through roughly R = 4 (refs. 21,22,24,25), and up to R = 8 in research settings for knee cartilage T1ρ mapping26. Similarly, PI and CS have also been applied sequentially (and simultaneously) for further acceleration27.

For cMRI acceleration, model-based reconstructions have gained traction, integrating the physics of T2/T2* decay and T1 recovery into an objective function iteratively optimized to reconstruct maps, showing promise in brain and lumbar spine T2 mapping28,29,30. More generally, incorporation of the physics of MRI parameter recovery/decay has seen applications not just in model-based approaches, but in various aspects of other methodologies as well31. DL approaches have gained prominence in solving inverse problems such as reconstruction, allowing for cMRI reconstructions at higher R than other methods. Standalone DL approaches have seen promising results in knee MAPSS acceleration, T1 mapping, and T2 mapping sequences32,33,34,35,36. In other methodologies, DL has been integrated with model-based approaches while introducing loss functions to maintain fidelity to acquired k-space, seeing promise up to R = 8 in knee and brain T1 and T2 mapping37,38,39. DL has also been applied to accelerate T2 mapping in MR Fingerprinting, where DL can remove aliasing artifacts from undersampled acquisitions and/or replace time-consuming dictionary lookup steps to predict MR parameter maps, and can exploit spatial correlations within maps to improve reconstructions40,41. Lastly, aside from DL, low-rank and sparse modeling methods have emerged as a means of accelerating acquisitions, where several MRI images acquired at different echo times are decomposed into temporal basis functions and spatial coefficients to model an MRI parameter, showing promise through R = 8 (ref. 42).

These works represent great progress, although avenues for improvement remain. Above all, these methods have optimized reconstructed images for full-volume performance; however, in MSK applications, clinical assessment relies on the inspection of precise anatomic features in specific anatomic regions, and consequently, the reconstruction quality cannot be compromised within these regions. Put differently, given clinical context, strong image quality may be most important in specific regions of an image, leaving room for algorithm optimization. Furthermore, most recent published approaches leverage k-space data in formal reconstruction approaches, but for niche applications such as region of interest (ROI)-focused optimization, such approaches may be outperformed by DL-based post-processing algorithms that denoise and fit undersampled T2-weighted images without using raw k-space. Moreover, performance of standard reconstruction algorithms is typically evaluated using metrics such as structural similarity index (SSIM), normalized root mean square error (NRMSE), and peak signal-to-noise ratio (PSNR), but recent works show these metrics may not provide the best correspondence with radiologist annotations43,44, leading other groups to propose alternate metrics to fill this niche45.

To these ends, this study proposes a recurrent UNet pipeline to postprocess undersampled coil-combined T2-weighted echo images, fitting and predicting T2 maps from accelerated MAPSS acquisitions in the knee, hip and lumbar spine46,47. These algorithms are trained with multi-component, ROI-specific losses that optimize predicted maps for T2 value and textural retention in cartilage and IVDs. In doing so, our approach allows for ROI-specific optimization, facilitating retention of small, crucial clinical features in tissues of interest while building on past applications of weighted loss functions for image processing tasks48.

To summarize, the contributions of this work are as follows:

  • By using a 4-component loss function in network training, we introduce the concept of “ROI-specific optimization” of cMRI accelerated acquisition pipelines.

  • We conduct a thorough ablation study of these 4 loss function components, proving the value of all in retaining textures in predicted maps while retaining high fidelity to ground truth T2 values.

  • Acknowledging that standard evaluation metrics such as SSIM and NRMSE provide suboptimal sensitivity to clinically relevant metrics, we conduct a thorough Gray Level Co-Occurrence Matrix (GLCM)-metric-based analysis of smooth and sharp textural retention in predicted maps, with an eye towards better evaluation of retention of small features crucial to clinical diagnoses49,50.

  • We build on limited literature in hip and lumbar spine cMRI accelerated acquisition schemes by developing and evaluating our pipeline not only in knee cartilage, as several other works have done, but also for hip cartilage and lumbar spine IVD in ultrafast acquisitions.

Methods

MAPSS acquisitions

Retrospective datasets including MAPSS in the knee (n = 244 patients, 446 scans), hip (n = 67 patients, 89 scans), and lumbar spine (n = 21 patients, 24 scans) acquired from clinical 3 T MRI scanners were used. Patients were scanned in accordance with all pertinent guidelines, including approval from the University of California, San Francisco Institutional Review Board (Human Research Protection Program), and informed consent was obtained from all study participants. MAPSS simultaneously acquired multiple T1ρ and T2 weighted images, using T1ρ or T2 preparation followed by 3D RF-spoiled gradient-echo Cartesian acquisition in a segmented radial centric view ordering during a transient state. A fat-selective inversion pulse was applied before either T1ρ (refs. 51,52) or T2 preparation53. Each acquisition included T1ρ-prepared images at four spin-lock times (TSLs) for T1ρ quantification, and three additional T2-prepared images for T2 quantification (TSL = 0 ms images were shared for TE = 0 ms images). In this study, only T2-prepared images at four different TEs and corresponding T2 maps from the MAPSS sequence were used. ky-kz space was acquired within an elliptical coverage (area = 0.7 compared to rectangular ky-kz, not acquiring corner space). Knee images were acquired from patients having ACL injuries, with scans taken at baseline and 3 years post-reconstruction. Hip images were acquired from patients having hip OA. Lumbar spine images were acquired from healthy subjects or patients with low back pain. Table 1 shows acquisition parameters.

Table 1 Knee, hip and lumbar spine datasets and splits.

T2 Fitting and spatial undersampling

Later T2 weighted echo time images for each slice were registered to corresponding TE = 0 ms images using a 3D rigid registration algorithm with a normalized mutual information criterion54. Levenberg–Marquardt fitting of registered T2 weighted images yielded ground truth T2 maps55.
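As an illustration of the fitting step, the following is a minimal sketch of a per-pixel Levenberg–Marquardt fit using SciPy. It assumes a mono-exponential decay model, S(TE) = S0 · exp(−TE/T2), which the text does not state explicitly; the echo times, the initial guess, and the function names are hypothetical and chosen only for the example.

```python
import numpy as np
from scipy.optimize import curve_fit


def mono_exponential(te_ms, s0, t2_ms):
    """Assumed signal model: S(TE) = S0 * exp(-TE / T2)."""
    return s0 * np.exp(-te_ms / t2_ms)


def fit_t2(te_ms, signal, t2_guess_ms=40.0):
    """Fit one pixel's echo train; method="lm" selects Levenberg-Marquardt."""
    p0 = (float(signal[0]), t2_guess_ms)
    (s0, t2_ms), _ = curve_fit(mono_exponential, te_ms, signal, p0=p0, method="lm")
    return s0, t2_ms


# Hypothetical echo times (ms) and a noise-free synthetic pixel with T2 = 35 ms.
te_ms = np.array([0.0, 10.0, 20.0, 40.0])
signal = mono_exponential(te_ms, 1000.0, 35.0)
s0_hat, t2_hat = fit_t2(te_ms, signal)
print(f"fitted T2: {t2_hat:.2f} ms")
```

In practice the fit would be repeated for every pixel of the registered echo images, typically with a background mask to skip low-signal pixels.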

To simulate accelerated acquisition, coil-combined T2 weighted magnitude images after reconstruction (ARC for knee and hip) were Fourier transformed and retrospectively undersampled using a center-weighted Poisson disc pattern, fully sampling a central 5% square in ky-kz (R = 2, 3, 4, 6, 8, 10, 12). Acquisition times associated with ground truth and accelerated MAPSS acquisitions in each body part can be found in Supplementary Table S1. As MAPSS acquires phase-encode lines with elliptical coverage in ky-kz (relative area of 0.7 compared to rectangular coverage), phase encoding lines solely within the sampling ellipse were undersampled. Although working with synthesized k-space data generated from coil-combined magnitude images, retrospective undersampling was done and R reported with respect to elliptical coverage in ky-kz to accurately simulate an actual undersampling pattern and not overstate model performance56. However, for hip acquisitions, reconstructed space outside the y-FOV had already been discarded; thus, simulating acquisitions with application of ‘no phase wrap’ was not possible and undersampling patterns would differ from those implemented on a scanner. T2 weighted images from each echo time were undersampled with a unique pattern. For ky-kz lines not sampled at a given echo time, those ky-kz lines were initialized with the corresponding ky-kz from the image with the temporally closest echo time for which that ky-kz was sampled. Only ky-kz lines not sampled in images acquired at all echo times were zero-filled. k-Space was subsequently inverse Fourier transformed, yielding undersampled, aliased images.
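The undersampling and ky-kz sharing steps described above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: it operates on a single 2D ky-kz plane per echo, it uses a uniform random mask as a stand-in for the center-weighted Poisson disc pattern (so the realized acceleration only approximates R), and it reads "central 5% square" as 5% of the ky-kz area. All function names and echo times are hypothetical.

```python
import numpy as np


def elliptical_support(ny, nz):
    """Boolean mask of the ellipse inscribed in the ky-kz rectangle."""
    y = np.linspace(-1, 1, ny)[:, None]
    z = np.linspace(-1, 1, nz)[None, :]
    return (y ** 2 + z ** 2) <= 1.0


def random_mask(ny, nz, accel, rng, center_frac=0.05):
    """Stand-in sampling mask (NOT Poisson disc): random points inside the
    ellipse plus a fully sampled central block (assumed ~5% of the area)."""
    mask = (rng.random((ny, nz)) < 1.0 / accel) & elliptical_support(ny, nz)
    hy = max(1, int(round(ny * np.sqrt(center_frac) / 2)))
    hz = max(1, int(round(nz * np.sqrt(center_frac) / 2)))
    mask[ny // 2 - hy:ny // 2 + hy, nz // 2 - hz:nz // 2 + hz] = True
    return mask


def share_views(kspace, masks, te_ms):
    """Fill unsampled ky-kz points of each echo from the temporally closest
    echo that sampled them; points sampled at no echo stay zero.

    kspace: (n_echo, ny, nz) complex, masks: (n_echo, ny, nz) bool.
    """
    n_echo = kspace.shape[0]
    filled = np.where(masks, kspace, 0)
    for e in range(n_echo):
        # Other echoes, ordered by temporal distance from echo e.
        order = sorted((j for j in range(n_echo) if j != e),
                       key=lambda j: abs(te_ms[j] - te_ms[e]))
        missing = ~masks[e]
        for j in order:
            take = missing & masks[j]
            filled[e][take] = kspace[j][take]
            missing &= ~masks[j]
    return filled


rng = np.random.default_rng(0)
te_ms = [0.0, 10.0, 20.0, 40.0]              # hypothetical echo times (ms)
images = rng.random((4, 32, 16))             # stand-in magnitude images
kspace = np.fft.fftshift(np.fft.fft2(images, axes=(-2, -1)), axes=(-2, -1))
masks = np.stack([random_mask(32, 16, 4, rng) for _ in te_ms])  # unique per echo
filled = share_views(kspace, masks, te_ms)
aliased = np.abs(np.fft.ifft2(np.fft.ifftshift(filled, axes=(-2, -1)), axes=(-2, -1)))
```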

DL pipeline training

DL architecture

An overview of the data processing and training schemes is shown in Fig. 1, while a detailed diagram depicting our proposed network architecture is in Supplementary Fig. S1 (“Full Model”; 39,808,710 trainable parameters). Magnitude images from data undersampled as specified were fed into a recurrent UNet network. The network contains an initial recurrent portion: aliased images from each T2 echo time have a 5-layer processing stream of 2D 3 × 3 convolutions with stride 1, yielding layers of depth 64, 128, 256, 512, and 1. Residual connections connect input aliased images with processing stream outputs. 2D 3 × 3 convolutions with stride 1 and residual connections transfer information between temporally adjacent corresponding hidden echo time processing layers with weighting parameter λw = 0.2 (ref. 57). This soft-weighted view-sharing of neighboring T2 weighted echo time images facilitated sharing of feature map information between temporally adjacent echo time images, which can augment sharing of ky-kz initializations to improve network image predictions. Outputs of all 4 echo time image processing streams were concatenated and fed to a UNet that predicted T2 maps. 2D 3 × 3 convolutions with stride 2 were used for the encoder, and 2D 4 × 4 transpose convolutions with stride 2 for the decoder. Two additional architecture versions were also trained: one UNet with no recurrent portion (“No RNN”; 35,116,037 trainable parameters) and a second in which all layers apart from inputs to the recurrent portion and UNet had half the depth listed in Supplementary Fig. S1 (“Reduced Parameters”; 9,958,246 trainable parameters).

Figure 1

Proposed pipeline. Experiments in the proposed study entail generating ground truth T2 maps from MAPSS, simulating accelerated acquisition of T2-weighted MAPSS images, and training a network to predict T2 maps from undersampled images. (1) MAPSS contains 7 images, 3 that are T2 weighted, 3 T1ρ weighted, and 1 shared; the T2 and shared image weightings are extracted, registered, and fitted slice-wise to yield ground truth T2 maps. To simulate accelerated acquisition, each volume of coil-combined magnitude T2 weighted images acquired at a given echo time is Fourier transformed, undersampled along the ky–kz plane with a center-weighted Poisson disc pattern, and inverse Fourier transformed to yield a simulated accelerated acquisition of a volume. Finally, undersampled T2 weighted images acquired at all echo times for the same anatomic slice are concatenated and fed to the proposed recurrent UNet architecture, which predicts the T2 map appearance for the slice. Training is done slice-wise with a multi-component loss function that includes a novel ROI-specific L1 loss that optimizes predicted T2 maps in cartilage and IVD ROIs, with other components that improve training stability and encourage retention of textures.

Loss function

Networks were trained with the multi-part loss function shown in Eq. (1):

$$L_{network} = \lambda_{{L_{1} }} L_{{L_{1} }} + \lambda_{{L_{1,\phi } }} L_{{L_{1, \phi } }} + \lambda_{SSIM} L_{SSIM} + \lambda_{Feature} L_{Feature}$$
(1)

in which \(L_{{L_{1} }}\) is a scaled global L1 loss detailed in Eq. (2):

$$L_{{L_{1} }} = \left| {S\left( {T_{2} } \right) - S\left( {\hat{T}_{2} } \right)} \right|$$
(2)

where \(T_{2}\) represents ground truth T2, \(\hat{T}_{2}\) represents predicted T2, and \(S\left( x \right)\) is a translated and scaled sigmoid operator that assigns more weight to higher T2 values. Sharp contrasts and high \(T_{2}\) values can easily be lost in accelerated acquisition schemes, so \(S\left( x \right)\) proved useful through empirical testing in focusing networks to preserve these details. \(S\left( x \right)\) is defined below in Eq. (3):

$$S\left( x \right) = y_{l} + \left( {y_{h} - y_{l} } \right)\left( {1 + exp\left( { - \left( {10/\left( {x_{h} - x_{l} } \right)} \right)\left( {x - \left( {x_{l} + x_{h} } \right)/2 } \right)} \right)} \right)^{ - 1}$$
(3)

where \(x_{l}\) and \(x_{h}\) are the low and high T2 value limits between which the sigmoid operator weighting transitions from \(y_{l}\) to \(y_{h}\). Parameters selected for the knee were as follows: \(x_{l}\) = 0 ms, \(x_{h}\) = 100 ms, \(y_{l}\) = 0.1, \(y_{h}\) = 1.0. In the hip: \(x_{l}\) = 0 ms, \(x_{h}\) = 60 ms, \(y_{l}\) = 0.5, \(y_{h}\) = 1.0. In the lumbar spine: \(x_{l}\) = 0 ms, \(x_{h}\) = 150 ms, \(y_{l}\) = 0.25, \(y_{h}\) = 1.0. A schematic of the operator that results from parameters of all three anatomies can be found as Supplementary Fig. S2.
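For concreteness, Eq. (3) can be written as a short NumPy function. This is a direct transcription of the formula with the three parameter sets quoted in the text; the function and dictionary names are illustrative only.

```python
import numpy as np


def scaled_sigmoid(x, x_l, x_h, y_l, y_h):
    """Translated and scaled sigmoid S(x) of Eq. (3)."""
    slope = 10.0 / (x_h - x_l)
    midpoint = (x_l + x_h) / 2.0
    return y_l + (y_h - y_l) / (1.0 + np.exp(-slope * (x - midpoint)))


# Parameter sets quoted in the text (x in ms).
PARAMS = {
    "knee": dict(x_l=0.0, x_h=100.0, y_l=0.1, y_h=1.0),
    "hip": dict(x_l=0.0, x_h=60.0, y_l=0.5, y_h=1.0),
    "spine": dict(x_l=0.0, x_h=150.0, y_l=0.25, y_h=1.0),
}

t2_ms = np.linspace(0.0, 150.0, 7)
knee_weights = scaled_sigmoid(t2_ms, **PARAMS["knee"])
knee_mid = scaled_sigmoid(50.0, **PARAMS["knee"])  # halfway between y_l and y_h
```

At the midpoint \((x_{l} + x_{h})/2\) the operator returns \((y_{l} + y_{h})/2\), and because of the fixed factor of 10 in the slope it sits within about 1% of the \(y_{l}\) to \(y_{h}\) span at \(x_{l}\) and \(x_{h}\).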

\(L_{{L_{1, \phi } }}\) is the ROI-specific L1 loss, and is described in Eq. (4):

$$L_{{L_{1, \phi } }} = \left| {S\left( {T_{2,\phi } } \right) - S\left( {\hat{T}_{2,\phi } } \right)} \right|$$
(4)

where \(T_{2,\phi }\) were ground truth T2 values in the tissue of interest \(\phi\) (IVD or cartilage), scaled by \(S\left( x \right)\) (Eq. (3)), and \(\hat{T}_{2,\phi }\) is the same for predicted T2. Pixels corresponding to \(\phi\) are obtained from segmentation masks, the generation of which is described in “Training and Segmentation Details”. For both \(L_{{L_{1} }}\) and \(L_{{L_{1, \phi } }}\), L1 norms were used instead of L2 due to reduced sensitivity to outliers, leading to more stable trainings.
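The two L1 terms of Eqs. (2) and (4) can be sketched as below. The sketch assumes that the absolute differences are averaged over pixels (the equations do not state the reduction) and that the ROI term is zero for slices without a segmentation; the knee sigmoid parameters are used as defaults, and the ROI weighting of 100 is a hypothetical value inside the knee search range.

```python
import numpy as np


def scaled_sigmoid(x, x_l=0.0, x_h=100.0, y_l=0.1, y_h=1.0):
    """S(x) of Eq. (3); defaults are the knee parameters from the text."""
    slope = 10.0 / (x_h - x_l)
    return y_l + (y_h - y_l) / (1.0 + np.exp(-slope * (x - (x_l + x_h) / 2.0)))


def global_l1(t2, t2_hat):
    """Eq. (2), averaged over all pixels (assumed reduction)."""
    return float(np.mean(np.abs(scaled_sigmoid(t2) - scaled_sigmoid(t2_hat))))


def roi_l1(t2, t2_hat, roi_mask):
    """Eq. (4), averaged over pixels inside the segmentation mask."""
    if not roi_mask.any():
        return 0.0  # assumed behavior for slices without a segmentation
    diff = scaled_sigmoid(t2[roi_mask]) - scaled_sigmoid(t2_hat[roi_mask])
    return float(np.mean(np.abs(diff)))


t2 = np.full((4, 4), 40.0)
roi = np.zeros((4, 4), dtype=bool)
roi[2:, 2:] = True

outside = t2.copy()
outside[0, 0] = 60.0   # 20 ms error outside the ROI
inside = t2.copy()
inside[3, 3] = 60.0    # same error inside the ROI

lam_l1, lam_roi = 1.0, 100.0  # lam_roi: hypothetical, within the knee range
loss_outside = lam_l1 * global_l1(t2, outside) + lam_roi * roi_l1(t2, outside, roi)
loss_inside = lam_l1 * global_l1(t2, inside) + lam_roi * roi_l1(t2, inside, roi)
```

The same 20 ms error is penalized far more heavily when it falls inside the mask, which is the intended effect of the ROI-specific term.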

\(L_{SSIM}\) is an SSIM loss, described in Eq. (5):

$$L_{SSIM} = 1 - SSIM$$
(5)

where SSIM was the structural similarity index between predicted and target maps.

\(L_{Feature}\) is a feature-based loss function designed to retain sharper textures, calculated as in Eq. (6):

$$L_{Feature} = \left| {VGG_{{T_{2} }} - VGG_{{\hat{T}_{2} }} } \right|$$
(6)

where \(VGG_{{T_{2} }}\) and \(VGG_{{\hat{T}_{2} }}\) were the outputs of the 21st layer of a VGG-19 (ref. 58) network pretrained on ImageNet when fed resized and normalized target and predicted T2 maps, respectively. Maps were resized to 224 × 224 × 1, concatenated with themselves along the channel axis to yield 224 × 224 × 3 inputs, and normalized such that the channels had mean pixel values of 0.485, 0.456 and 0.406, with standard deviations of 0.229, 0.224, and 0.225, respectively.

\(\lambda_{{L_{1} }}\),\(\lambda_{{L_{1,\phi } }}\), \(\lambda_{SSIM}\), \(\lambda_{Feature}\) were loss component weightings. All were positive-valued and optimized through constrained random hyperparameter searches with the following ranges:

  • Knee: \(\lambda_{{L_{1} }}\) = 1,\(\lambda_{{L_{1,\phi } }} = 50 - 150\), \(\lambda_{SSIM} = 0 - 2\), \(\lambda_{Feature} = 0 - 0.5\)

  • Hip: \(\lambda_{{L_{1} }}\) = 1,\(\lambda_{{L_{1,\phi } }} = 0 - 3\), \(\lambda_{SSIM} = 0 - 2\), \(\lambda_{Feature} = 0 - 1\).

  • Spine: \(\lambda_{{L_{1} }}\) = 1,\(\lambda_{{L_{1,\phi } }} = 1 - 10\), \(\lambda_{SSIM} = 10 - 100\), \(\lambda_{Feature} = 5 - 55\).

Training and segmentation details

Scans of all three anatomies were split into training, validation and test sets as shown in Table 1. In the knee, cartilage was segmented manually. In the hip, cartilage was segmented manually for 4 central slices per volume. Segmentation in both was performed by research assistants trained by radiologists with over 20 years of experience. Since the hip dataset had substantially fewer segmented than unsegmented slices, the hip training set was bootstrapped to equalize the number of slices with and without segmentations (1068 bootstrapped slices). Finally, in the lumbar spine, IVDs were segmented with an ensemble of coarse-to-fine context memory (CFCM) networks59. To calculate performance metrics and implement ROI-specific training losses, these segmentation masks were leveraged to identify pixels in tissues of interest (cartilage or IVD).

Signal values were scaled per slice for the middle 95% of pixel values to fall between 0 and 500 for the knee and lumbar spine, and 0 and 100 for the hip; these ranges were optimized empirically. During training, imaging volumes were augmented with random translation (± 10 pixels across phase and frequency directions) and random rotation (± 5 degrees about slice direction). All models were trained with learning rate 0.001 and Adam optimizer on an NVIDIA Titan Xp 12 GB GPU with batch size of 1 so the model would fit on a single GPU. Separate pipelines were trained for all 3 anatomies at R = 2, 3, 4, 6, 8, 10, and 12. For each pipeline, and at each trained R, a constrained random hyperparameter search was done for 15 iterations at 10 epochs per iteration to optimize \(\lambda_{{L_{1} }}\),\(\lambda_{{L_{1,\phi } }}\), \(\lambda_{SSIM}\), and \(\lambda_{Feature}\) for visual fidelity of predicted maps to ground truth. Visual fidelity was assessed in the search using NRMSE (calculated as shown in Eq. (7)) and Pearson’s r in the tissue of interest60.

$$NRMSE = \frac{{\left\Vert {T_{2} - \hat{T}_{2} } \right\Vert_{2,\phi } }}{{\left\Vert {T_{2} } \right\Vert_{2,\phi } }}$$
(7)
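A minimal NumPy sketch of the two search metrics follows, with NRMSE computed as in Eq. (7) over the pixels of the tissue of interest. The helper for the T2 value equivalent (NRMSE multiplied by the mean ground truth T2 in the tissue) mirrors the conversion used later in the ablation study; all function names and the toy values are illustrative.

```python
import numpy as np


def roi_nrmse(t2, t2_hat, roi_mask):
    """Eq. (7): L2 norm of the error over the L2 norm of ground truth, in the ROI."""
    return float(np.linalg.norm((t2 - t2_hat)[roi_mask]) / np.linalg.norm(t2[roi_mask]))


def roi_pearson_r(t2, t2_hat, roi_mask):
    """Pearson's r between ground truth and predicted T2 values in the ROI."""
    return float(np.corrcoef(t2[roi_mask], t2_hat[roi_mask])[0, 1])


def nrmse_t2_equivalent(t2, t2_hat, roi_mask):
    """NRMSE scaled by the mean ground truth T2 in the ROI (ms)."""
    return roi_nrmse(t2, t2_hat, roi_mask) * float(t2[roi_mask].mean())


t2 = np.array([[30.0, 40.0], [50.0, 60.0]])   # toy ground truth (ms)
t2_hat = 1.1 * t2                             # uniform 10% overestimate
roi = np.ones_like(t2, dtype=bool)

nrmse = roi_nrmse(t2, t2_hat, roi)
r = roi_pearson_r(t2, t2_hat, roi)
t2_error_ms = nrmse_t2_equivalent(t2, t2_hat, roi)
```

The toy case shows why both metrics are reported: a uniform 10% overestimate leaves Pearson's r at 1 while NRMSE registers the bias.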

Final pipelines across all anatomies and R were trained using optimized parameter sets until validation loss did not decrease for 10 epochs. Key training details are summarized as part of Table 1.
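The stopping rule can be expressed as a small helper. This sketch assumes "did not decrease for 10 epochs" means 10 consecutive epochs without a new minimum in validation loss; the function name and the toy loss curve are hypothetical.

```python
def stopping_epoch(val_losses, patience=10):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    epochs_since_best = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                return epoch
    return None


# Toy curve: improvement for two epochs, then a plateau.
losses = [1.0, 0.9] + [0.95] * 10
stop_at = stopping_epoch(losses)
```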

Experiments

Loss function ablation study

An ablation study is key to understanding the contributions of loss components. Given optimized loss function weights, every combination of loss components was ablated and corresponding models were retrained until validation loss no longer decreased. “No RNN” and “Reduced Parameters” networks were also trained while maintaining loss function components at optimized values to assess the utility of simpler architectures. NRMSE and Pearson’s correlation coefficient (r) were calculated in tissues of interest across the test set for original and ablated models to determine loss component contributions to performance. Pearson’s r was deemed an appropriate statistical test for this and subsequent experiments, as it is useful in assessing the linear relationship between related pairs of interval data. While no formal statistical test was applied to NRMSE, it nonetheless allows for quantitative assessment of T2 quantification quality and easy comparison with results from other approaches. NRMSE is reported ± 1 standard deviation (s.d.); Pearson’s r was deemed significant in accordance with corresponding P values, α = 0.001, 0.01, and 0.05. NRMSEs within tissues of interest of a given scan were also multiplied by mean T2 values within the tissue of interest of that patient, generating T2 value equivalents of error rates.

To more specifically evaluate the utility of the ROI-specific loss component, two loss function configurations from the ablation study were further analyzed at all R: no ROI-specific loss component (\(\lambda_{{L_{1,\phi } }} = 0; { }\lambda_{{L_{1} }} ,{ }\lambda_{SSIM} ,{ }\lambda_{Feature} \ne 0\)) and no ROI-specific or feature-based components (\(\lambda_{{L_{1,\phi } }} ,\lambda_{Feature} = 0; { }\lambda_{{L_{1} }} ,{ }\lambda_{SSIM} \ne 0\)). These models were intended to represent baselines in which all loss functions were preserved except the ROI-specific component, and a standard reconstruction loss function of pixel and SSIM-based loss components, respectively. Pearson’s r—evaluated in tissues of interest and globally—was calculated to determine the degree and significance of correlation between predicted maps and ground truth, both globally and within tissues of interest, α = 0.001, 0.01, and 0.05.

Evaluation of accelerated acquisition scheme performance

Three versions of our pipeline (full pipeline, “No RNN,” and “Reduced Parameters”) were compared to state-of-the-art CS, DL, and DL/model-based solutions. At each R, MANTIS (54,413,056 trainable parameters) and MANTIS-GAN (54,413,056 [Generator] and 2,763,648 [Discriminator] trainable parameters) pipelines were trained using published network architectures, loss functions and undersampling strategies42,43. Loss function weightings for both were optimized through grid hyperparameter searches yielding the following: (MANTIS) \(\lambda_{data}\) = 0.1, \(\lambda_{cnn}\) = 1; (MANTIS-GAN) \(\lambda_{data}\) = 0.1, \(\lambda_{cnn}\) = 1, \(\lambda_{GAN}\) = 0.01. To apply CS reconstruction, original MAPSS T2-prepared images were Fourier transformed into coil-combined k-space, 1D-inverse Fourier transformed along the readout direction, and individual slices in \(k_{y} - k_{z}\) reconstructed using an \(L_{1}\) wavelet-based algorithm with regularization coefficient 0.001 (ref. 61). CS reconstructed images were registered to the TE = 0 ms echo time image using a 3D rigid registration algorithm with a normalized mutual information criterion and fitted using Levenberg–Marquardt fitting to yield \(T_{2}\) maps. Performance of these approaches and our proposed methods was evaluated through the following:

Comparison of global and ROI-specific performance

To test for completeness of training, performance of our proposed pipelines was compared against state-of-the-art models that did not use ROI-specific components in predicting T2 maps. Pearson’s r (α = 0.001, 0.01, and 0.05) was used to compare model performances and assess strength of correlations to ground truth T2.

Standard reconstruction metrics

Performance was reported in tissues of interest with standard reconstruction metrics: NRMSE (mean ± 1 s.d.) and Pearson’s r (α = 0.001, 0.01, and 0.05). NRMSEs were also converted into T2 value equivalents by tissue compartment as in the ablation study.

T2 value retention

Fidelity of predicted maps to ground truth T2 was also assessed. First, predicted and ground truth T2 values were compared across tissues of interest within the test set (mean ± 1 s.d.), generating violin plots for all three anatomies with overlaid boxplots for T2 value distribution comparison. T2 agreement was also assessed through Bland–Altman analysis.
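The Bland–Altman quantities can be computed as in the sketch below. It assumes the conventional definition (bias as the mean difference, limits of agreement as bias ± 1.96 sample standard deviations of the differences), which the text does not spell out; the per-compartment mean T2 values are made-up illustrative numbers.

```python
import numpy as np


def bland_altman(reference, predicted):
    """Return pairwise means, differences, bias, and 95% limits of agreement."""
    reference = np.asarray(reference, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    diff = predicted - reference
    pair_mean = (predicted + reference) / 2.0
    bias = float(diff.mean())
    sd = float(diff.std(ddof=1))  # sample standard deviation
    return pair_mean, diff, bias, (bias - 1.96 * sd, bias + 1.96 * sd)


# Made-up mean T2 values (ms) per tissue compartment.
ground_truth = [30.0, 35.0, 40.0, 45.0]
predicted = [31.0, 34.0, 42.0, 45.0]
pair_mean, diff, bias, (loa_low, loa_high) = bland_altman(ground_truth, predicted)
```

Plotting `diff` against `pair_mean` with horizontal lines at `bias`, `loa_low`, and `loa_high` gives the usual Bland–Altman plot.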

Texture retention

Gray Level Co-Occurrence Matrix (GLCM)62 metrics were used to assess texture retention within tissues of interest. GLCM contrast and dissimilarity are maximized by large local pixel value changes and thus by sharper textures. GLCM homogeneity is maximized by small local pixel value changes, while GLCM energy and angular second moment (ASM) are maximized by few total pixel values within an image; hence, all three are maximized by smoothness. For each anatomy and R, we calculated these texture metrics at 4 orientations (θ = 0°, 45°, 90° and 135°; d = 1 pixel) and averaged across all orientations. Finally, we calculated intraclass correlation coefficients (ICCs) for all metrics with respect to ground truth (two-way mixed effects, single rater63) and reported 95% ICC confidence intervals (α = 0.001, 0.01, and 0.05). These tests were chosen as appropriate, as they assess both reliability and agreement of associated metrics, and in this use case, individual GLCM metric values themselves are considered the only rater, justifying the ICC test type selected.
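To make the texture metrics concrete, the following NumPy-only sketch builds a symmetric, normalized GLCM at d = 1 pixel for the four orientations and averages the five metrics across them. The metric formulas follow the standard GLCM definitions. The quantization of T2 maps into discrete gray levels (number of levels and T2 ceiling) is an assumption, since the text does not specify it, and the sketch processes a rectangular patch rather than an arbitrary ROI mask.

```python
import numpy as np

# (row, column) steps for d = 1 pixel at 0, 45, 90 and 135 degrees.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}


def quantize(t2_map, levels=32, t2_max=100.0):
    """Map T2 values to integer gray levels (assumed quantization scheme)."""
    scaled = np.clip(np.asarray(t2_map, dtype=float) / t2_max, 0.0, 1.0)
    return np.round(scaled * (levels - 1)).astype(int)


def glcm(img, offset, levels):
    """Symmetric, normalized co-occurrence matrix for one pixel offset."""
    dr, dc = offset
    rows, cols = img.shape
    r0, r1 = max(0, -dr), min(rows, rows - dr)
    c0, c1 = max(0, -dc), min(cols, cols - dc)
    a = img[r0:r1, c0:c1]
    b = img[r0 + dr:r1 + dr, c0 + dc:c1 + dc]
    counts = np.zeros((levels, levels))
    np.add.at(counts, (a.ravel(), b.ravel()), 1)
    counts = counts + counts.T
    return counts / counts.sum()


def texture_metrics(img, levels):
    """Average GLCM metrics over the four orientations."""
    i, j = np.indices((levels, levels))
    totals = dict(contrast=0.0, dissimilarity=0.0, homogeneity=0.0, ASM=0.0, energy=0.0)
    for offset in OFFSETS.values():
        p = glcm(img, offset, levels)
        asm = np.sum(p ** 2)
        totals["contrast"] += np.sum(p * (i - j) ** 2)
        totals["dissimilarity"] += np.sum(p * np.abs(i - j))
        totals["homogeneity"] += np.sum(p / (1.0 + (i - j) ** 2))
        totals["ASM"] += asm
        totals["energy"] += np.sqrt(asm)
    return {name: float(value / len(OFFSETS)) for name, value in totals.items()}


flat = np.zeros((6, 6), dtype=int)                     # perfectly smooth patch
checker = (np.indices((6, 6)).sum(axis=0) % 2)         # sharpest 2-level texture
flat_metrics = texture_metrics(flat, levels=2)
checker_metrics = texture_metrics(checker, levels=2)
```

The two toy patches behave as the text describes: the flat patch maximizes homogeneity, ASM, and energy with zero contrast, while the checkerboard raises contrast and dissimilarity.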

Repeatability study

To assess the robustness of pipelines to different datasets, two additional splits of the knee, hip and spine datasets were made, ensuring no patient was part of multiple validation and/or test datasets and that all scans from a given patient were only in one of training, validation and test for each split (folds 2 and 3 in Supplementary Table S2, where fold 1 is the original split). Additional hyperparameter searches optimized loss function weights on the two new splits. Optimized loss weights and corresponding T2 quantification and texture retention performance for each split are presented at all tested R in the same manner as for the primary split.
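A patient-level split of this kind can be sketched as follows. The sketch covers only the within-split constraint (all scans of a patient land in exactly one of training, validation, and test); keeping validation and test patients disjoint across folds would need additional bookkeeping. The split fractions, seed, and identifiers are hypothetical and do not reflect the actual splits in Table 1 or Supplementary Table S2.

```python
import random


def split_by_patient(scan_to_patient, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign scans to train/val/test so that no patient spans two sets."""
    patients = sorted(set(scan_to_patient.values()))
    random.Random(seed).shuffle(patients)
    n_train = int(round(fractions[0] * len(patients)))
    n_val = int(round(fractions[1] * len(patients)))
    groups = {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),
    }
    return {
        name: sorted(scan for scan, patient in scan_to_patient.items() if patient in members)
        for name, members in groups.items()
    }


# Hypothetical data: 20 patients, two scans each.
scan_to_patient = {f"scan_{p}_{v}": f"patient_{p}" for p in range(20) for v in range(2)}
split = split_by_patient(scan_to_patient)
```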

Raw multicoil data assessment

An in-house pipeline was developed that leveraged GE Orchestra 1.10 and other postprocessing tools to reconstruct coil-combined images from raw k-space data. As a proof of concept, knee MAPSS scans were performed on 3 volunteers, hip scans for 2, and lumbar spine for 2, all using the acquisition parameters listed for the retrospective datasets used for algorithm training, with raw k-space data saved for all. Multicoil k-space data (after ARC for knee and hip) was undersampled with the same center-weighted Poisson disc pattern described earlier, with each coil seeing the same undersampling pattern and ky-kz lines being shared across different T2 weighted echo time k-spaces as previously described. Coil-combined images resulting from undersampled multi-coil data at all tested R were fed through corresponding post-processing pipelines to predict T2 map appearance. A radiologist with 2 years of experience segmented knee cartilage, hip cartilage, and intervertebral discs from these acquisitions, allowing for visualizations of predicted T2 maps and NRMSE calculations in ROIs.

Results

Ablation study results

Voxel-wise performance metrics for ablation study models at R = 8 are shown in Supplementary Table S3, with T2 value NRMSE equivalents in Supplementary Table S4. Within the knee and hip, all loss components were necessary to obtain the optimal combination of high Pearson’s r and low NRMSE in cartilage. For the lumbar spine, while all loss components proved vital in maximizing Pearson’s r and minimizing NRMSE in IVDs, performance improved when the initial recurrent network was omitted. Though quantitative analysis is shown for all three pipeline versions in subsequent experiments, the full model is designated as best for knee and hip, and the no RNN for the spine.

ROI-specific and global assessments of best models, corresponding models trained without an ROI-specific loss (\(\lambda_{{L_{1,\phi } }} = 0\)), and models trained with a generic loss (\(\lambda_{{L_{1,\phi } }} = 0\), \(\lambda_{Feature} = 0\)) are shown in Supplementary Table S5. In the knee and hip, across nearly all R, ROI-specific loss addition leads to improved correlations between predicted and ground truth cartilage T2, with diminished performance globally. In the lumbar spine, which was trained with substantially fewer batches than the knee and hip pipelines, these trends were inconsistent across tested R. Example predictions and ground truth for one slice of a patient in each pipeline are shown in Supplementary Fig. S3, showing that patterns of local T2 value elevations in cartilage and IVDs are better preserved with an ROI-specific loss than in pipelines trained without the loss component.

Visuals of network performance and comparison with state-of-the-art models

Predicted T2 maps are displayed at select R for knee, hip and lumbar spine models in Fig. 2 for our three pipelines and three methods from the literature. In knee, hip, and lumbar spine, T2 quantification performance is strongest with our proposed methods, which maintain low error rates and compare favorably with state-of-the-art methods through R = 10. Optimal architecture performances are further explored in Figs. 3–5. As shown in Fig. 3a, predicted T2 knee maps retained strong fidelity to ground truth within tibiofemoral joint cartilage. Patterns within predicted maps became slightly more diffuse as R increased to 10, as indicated by a slight rise in NRMSE for cartilage in the slice, but visually, T2 values and map patterns are preserved. As seen in Fig. 4a, hip predicted maps preserve T2 values well in femoral and acetabular cartilage through R = 10, although T2 patterns become more diffuse by R = 10. Figure 5a shows T2 map predictions in the lumbar spine. The L4-L5 IVD is shown in more detail, where T2 quantification performance was acceptable at R = 3, moderate at R = 6, and worse at R = 10, as indicated by rising IVD NRMSEs.

Figure 2

Comparison of predicted T2 maps with ROI-specific methodologies to past approaches. (a) Predicted T2 maps in knee cartilage for a representative patient within test set. T2 quantification performance was best in pipelines trained with ROI-specific losses (Full Model, Reduced Parameters, and No RNN), where strong fidelity to T2 values and patterns of local elevations within cartilage were maintained through R = 10, while other tested approaches did a poorer job in predicting T2 values in these maps. (b) Predicted hip cartilage T2 maps showed similar trends, where performance of the full model was especially strong, showing low T2 quantification error and better retaining local T2 elevations through R = 10 than other approaches. (c) Predicted T2 maps in lumbar spine IVDs show higher T2 quantification errors than in hip and knee cartilage, but ROI-specific loss pipelines best preserved map textures and values.

Figure 3

T2 quantification performance of optimal ROI-specific pipeline in knee cartilage. (a) Visual pipeline performance within the knee for a representative patient, with corresponding NRMSEs for cartilage in the predicted T2 map slice. Performance remains strong through R = 10, maintaining T2 patterns in the medial tibiofemoral cartilage, indicating pipeline utility. Predicted maps generated by the network are masked using a cartilage segmentation mask and superimposed on the ground truth, fully sampled TE = 0 ms MAPSS echo time image. (b) Bland–Altman plots for all scans within test set for which multiclass cartilage compartment segmentations were available (n = 16, 6 cartilage compartments for each). Predicted T2 values demonstrate minimal bias and tight limits of agreement across most tested R, with best performance coming from patellofemoral cartilage.

Figure 4

T2 quantification performance of optimal ROI-specific pipeline in hip cartilage. (a) Visual pipeline performance within the hip for a representative patient, with corresponding NRMSEs for cartilage in the predicted T2 map slice. Predicted maps are masked using a cartilage segmentation mask and superimposed on the ground truth, fully sampled TE = 0 ms MAPSS echo time image. For this patient, T2 patterns are maintained through R = 10, although local T2 elevations are more diffusely predicted at higher R. (b) Bland–Altman plots for all scans within test set (n = 15, 2 cartilage compartments for each). Plots demonstrate very limited bias and even tighter limits of agreement from R = 2 through R = 12 than the knee pipeline, showing hip pipeline effectiveness in reproducing T2 values from accelerated MAPSS acquisitions.

Figure 5

T2 quantification performance of optimal ROI-specific pipeline in lumbar spine intervertebral discs. (a) Visual pipeline performance within the lumbar spine IVDs for a representative patient, with corresponding NRMSEs for IVDs in the predicted T2 map slice. Predicted maps are masked using an IVD segmentation mask and superimposed on the ground truth, fully sampled TE = 0 ms MAPSS echo time image. Network performance is best through R = 6, after which local T2 elevations are diffuse and underestimated. (b) Bland–Altman plots for all scans within test set (n = 5, 5 IVDs plotted for each if segmentation of disc available). T2 value predictions reflect some bias and fairly wide limits of agreement, particularly above R = 4. These results indicate progress but also a need for improvement. The smaller lumbar spine dataset and test set are likely responsible for poorer model performance compared to the hip and knee, as is the relatively smaller number of slices in kz, which exacerbates undersampling effects.

ROI and global performance comparisons of our selected pipelines against state-of-the-art approaches are in Supplementary Table S5. Across pipelines trained with relatively large datasets (knee and hip), DL and model-based approaches (MANTIS and MANTIS-GAN) outperformed our proposed pipeline globally, but within cartilage ROIs, our pipeline exhibited stronger Pearson’s r at each tested R. These trends were not as strong in the lumbar spine pipelines, possibly owing to the randomness of training with a smaller dataset. Global and ROI-specific T2 predictions are further visualized in Supplementary Fig. S4, which shows that, globally, predicted T2 values exhibit substantially more visual fidelity to ground truth and lower NRMSE in state-of-the-art models than in our pipeline, but that this trend reverses in cartilage. In the lumbar spine, those trends held at some but not all R, yielding similar conclusions to the Pearson’s r analysis.

Evaluation of T2 quantification performance and comparison with state-of-the-art models

Voxel-wise T2 evaluation fidelity

Pearson’s r and NRMSE across all anatomies and R for our approaches and state-of-the-art methods are in Table 2. T2 value NRMSE equivalents are in Supplementary Table S6. For all anatomies and across nearly all R, T2 quantification performance is strongest in our methods, particularly in the No RNN and full model pipelines, compared to state-of-the-art models.

Table 2 ROI-specific model performance in standard metrics from R = 2 through R = 12.

An exhaustive examination of knee T2 quantification performance, stratified by cartilage compartments, is in Supplementary Tables S7 and S8. For the full model, T2 estimation errors remained under 10% through R = 10 across all cartilage compartments, while Pearson’s r ranged from 0.748 at R = 2 to 0.491 at R = 12, indicating strong correlations64 between predictions and ground truth at R = 2 and moderate correlations through R = 12. For some cartilage compartments and R, performance was stronger in the No RNN pipeline. Interestingly, quantification performance was strongest in patellofemoral joint cartilage, generally exhibiting lower NRMSE and stronger correlations. Our ROI-specific loss pipelines outperformed state-of-the-art models in each cartilage compartment.

Supplementary Tables S9 and S10 show hip T2 quantification performance across cartilage compartments. As in the knee, quantification performance was strong, with error rates across all cartilage under 9% through R = 12 for the no RNN and full model pipelines. While the no RNN pipeline had lower quantification errors, the full model had higher Pearson’s r, which ranged from 0.794 at R = 2 to 0.517 at R = 12, showing strong correlations between predictions and ground truth through R = 3 and moderate correlations through R = 12. T2 quantification performance was slightly stronger in femoral than acetabular cartilage. Our pipelines again outperformed state-of-the-art models in each cartilage compartment.

Supplementary Tables S11 and S12 show lumbar spine T2 quantification performance, which was mixed. Pearson’s r across all discs was high, ranging from 0.884 at R = 2 to 0.643 at R = 12 for the no RNN model, indicating strong correlations with ground truth through R = 8 and moderate correlations through R = 12. That said, IVD error rates were markedly higher across all R than in hip and knee cartilage, ranging from 4.86% to 18.8%. Though there was some volatility, error rates and Pearson’s r generally showed poorest T2 quantification in L1/L2 and L2/L3 discs. Through R = 8, ROI-specific loss pipelines outperformed state-of-the-art models at nearly all disc levels, with stronger Pearson’s r in most IVD levels through R = 12.
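As an illustration of the voxel-wise metrics reported above, the following minimal Python sketch computes NRMSE and Pearson's r restricted to an ROI mask. The function name roi_metrics, the synthetic T2 maps, and the choice of normalizing the error by the L2 norm of the ground truth are assumptions made for illustration; the normalization behind the reported NRMSEs may differ.

```python
import numpy as np

def roi_metrics(pred, truth, mask):
    """Return (NRMSE, Pearson's r) over the voxels where mask is True."""
    p = pred[mask].astype(float)
    t = truth[mask].astype(float)
    # Assumed normalization: L2 norm of the error over L2 norm of the ground truth.
    nrmse = np.linalg.norm(p - t) / np.linalg.norm(t)
    # Pearson's r between predicted and ground truth voxel values.
    r = np.corrcoef(p, t)[0, 1]
    return nrmse, r

# Synthetic stand-ins for a ground truth T2 map (ms), an ROI mask and a prediction.
rng = np.random.default_rng(0)
truth = rng.uniform(20.0, 60.0, size=(32, 32))
mask = np.zeros((32, 32), dtype=bool)
mask[10:20, 8:24] = True
pred = truth + rng.normal(0.0, 2.0, size=truth.shape)

nrmse, r = roi_metrics(pred, truth, mask)
print(f"ROI NRMSE = {100 * nrmse:.2f}%, Pearson's r = {r:.3f}")
```

Restricting both metrics to the mask is what separates ROI-specific from global performance: the same prediction can score well globally and poorly within cartilage, or vice versa.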

T2 value retention on region of interest averages

Bland–Altman plots are provided for the knee, hip and lumbar spine in Figs. 3b, 4b, and 5b. In knee and hip, T2 values are predicted with minimal bias with respect to ground truth. The ± 1.96 s.d. limits of agreement were less than approximately ± 6 ms with mean biases under ± 3 ms through R = 8 for knee cartilage (Fig. 3b). Among cartilage compartments, predictions in trochlear and patellar cartilage showed the least bias, while tibiofemoral cartilage T2 was generally slightly overestimated. In the hip (Fig. 4b), ± 1.96 s.d. limits of agreement were less than approximately ± 5 ms with mean biases under ± 3 ms through R = 12, and T2 quantification performance was similar across femoral and acetabular cartilage. In the lumbar spine (Fig. 5b), limits of agreement were considerably wider than in the hip and knee pipelines, particularly above R = 4. While the line of equality was contained in these limits at all R, spine pipelines generally overestimated T2 values. While at some particular R, a disc level saw poorer T2 quantification than others (e.g., L2/L3 at R = 6), on balance, predicted maps yielded similar bias and error across all discs.
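The bias and ± 1.96 s.d. limits of agreement quoted above follow the standard Bland–Altman construction, which can be sketched in a few lines of Python. The function name bland_altman and the ROI-averaged T2 values below are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

def bland_altman(pred_means, truth_means):
    """Return (bias, lower limit, upper limit) of agreement for paired ROI means."""
    diff = np.asarray(pred_means, float) - np.asarray(truth_means, float)
    bias = diff.mean()
    sd = diff.std(ddof=1)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical ROI-averaged T2 values in ms (not data from this study).
truth_means = [30.0, 32.0, 35.0, 28.0, 40.0]
pred_means = [31.0, 32.0, 36.0, 29.0, 40.0]

bias, lower, upper = bland_altman(pred_means, truth_means)
print(f"bias = {bias:.2f} ms, limits of agreement = [{lower:.2f}, {upper:.2f}] ms")
```

A positive bias corresponds to overestimated T2, and the line of equality is contained in the limits whenever the lower limit is negative and the upper limit is positive.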

Supplementary Fig. S5 shows T2 value distributions in violin and boxplots. Plots reveal minimal bias in hip cartilage predicted T2 maps and slight but limited bias towards overestimating T2 in knee cartilage. In the lumbar spine, more volatility was observed in predicted T2 distributions, likely due to small test set size (n = 5), but at least through R = 6, these deviations had limited magnitude.

Texture retention

ICCs ± 1 s.d. for GLCM metrics are in Table 3 for our best performing pipelines: no RNN and full model. In knee cartilage, ICCs showed significant correlations between predicted and ground truth GLCM metrics at all R for smooth textures and many R for sharp textures, indicating good to excellent reliability in preserving smooth textures (ASM and energy) at all R and moderate reliability in preserving sharper textures at low R (dissimilarity). In hip cartilage, ICCs showed significant correlations across all R in preserving smooth textures, and at low to moderate R for sharper textures. Reliability in smooth texture preservation ranged from good to excellent for all R and moderate for sharper textures at low to medium R. In both knee and hip cartilage, the full pipeline saw substantially higher GLCM ICCs for smooth and sharper textures across nearly all R. Within the lumbar spine, ICCs were significant across nearly all R for smoother textures. While ICCs were reasonably high for some R in contrast metrics, confidence intervals were wide, limiting findings of significant correlations. ICCs showed moderate to excellent reliability in preserving smoother textures, and poor to moderate reliability for sharper textures. For the spine, the No RNN model yielded optimal texture retention.

Table 3 Texture retention analysis in No RNN and Full Model pipelines.
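To make the GLCM metrics named above concrete, the sketch below builds a gray-level co-occurrence matrix for horizontally adjacent voxel pairs inside an ROI and derives ASM, energy, and dissimilarity from it using their standard definitions. The single horizontal offset, the 8 gray levels, the 0 to 100 ms quantization window, and the function name glcm_features are assumptions for illustration, not the settings used in this study.

```python
import numpy as np

def glcm_features(t2_map, mask, levels=8, vmin=0.0, vmax=100.0):
    """ASM, energy and dissimilarity of a symmetric, normalized GLCM in an ROI."""
    # Quantize T2 values (ms) into integer gray levels 0..levels-1.
    scaled = (np.asarray(t2_map, float) - vmin) / (vmax - vmin) * levels
    q = np.clip(scaled, 0, levels - 1).astype(int)
    # Pair each voxel with its right-hand neighbour; keep pairs fully inside the ROI.
    left, right = q[:, :-1], q[:, 1:]
    valid = mask[:, :-1] & mask[:, 1:]
    glcm = np.zeros((levels, levels), dtype=float)
    np.add.at(glcm, (left[valid], right[valid]), 1.0)
    glcm += glcm.T  # count each pair in both directions
    total = glcm.sum()
    if total == 0:
        raise ValueError("ROI contains no horizontally adjacent voxel pairs")
    p = glcm / total
    i, j = np.indices(p.shape)
    asm = float((p ** 2).sum())  # high for smooth, uniform textures
    dissimilarity = float((p * np.abs(i - j)).sum())  # high for sharp transitions
    return {"ASM": asm, "energy": asm ** 0.5, "dissimilarity": dissimilarity}

roi = np.ones((8, 8), dtype=bool)
smooth = np.full((8, 8), 40.0)  # perfectly uniform map
checker = np.where(np.indices((8, 8)).sum(axis=0) % 2 == 0, 10.0, 90.0)  # sharp map
smooth_features = glcm_features(smooth, roi)
checker_features = glcm_features(checker, roi)
print(smooth_features)
print(checker_features)
```

The two toy maps show why ASM and energy are treated as smooth-texture metrics and dissimilarity as a sharp-texture metric: the uniform map maximizes the former and zeroes the latter, while the checkerboard does the opposite.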

Repeatability study

Optimal loss weightings from hyperparameter searches on the two additional splits are in Supplementary Table S13. Results of trainings on additional splits in T2 quantification error, Pearson’s r, and texture metrics are in Supplementary Tables S14, S15, S16. In the knee and hip pipelines, experiments show comparable results across all folds for these metrics. In the lumbar spine, Pearson’s r exhibited similar values across all folds, but in some cases, mean texture metric ICCs and NRMSEs exhibited substantial differences. However, confidence intervals were very wide for ICCs and NRMSEs in the lumbar spine, likely due to limited test set size (n = 5).

Raw multicoil data assessment

Supplementary Fig. S6 shows T2 maps predicted from our proposed pipelines on retrospectively undersampled raw k-space data. In the knee, T2 quantification errors were low through R = 12, with local T2 elevations preserved and little dip in performance compared to corresponding retrospectively undersampled coil-combined knee data. In the hip, T2 quantification errors were low, with local T2 elevations reproduced at most R; while performance at higher R matched expected performance from coil-combined experiments, lower R quantification errors were slightly higher. Performance was more volatile in the lumbar spine, where through R = 4, T2 quantification errors matched expected results and local T2 patterns were generally preserved, but performance degraded substantially above R = 4.

Discussion and conclusions

In this work, we present data-driven pipelines that leverage recurrent UNet architectures and multi-component losses to accelerate MAPSS T2 mapping for anatomies where a subset of tissues is of particular clinical interest. By image processing and standard reconstruction metrics, through R = 10, our knee pipelines retained fidelity to T2 values with tight limits of agreement, preserving smooth textures with good to excellent reliability and sharper ones with moderate reliability for most tested R. While the no RNN pipeline delivered lower NRMSEs and higher Pearson’s r across many cartilage compartments and R than the full model, its texture retention was poorer, making the full model better suited to preserve small, key diagnostic features. In hip cartilage, predicted maps retained T2 fidelity through R = 12 with tight limits of agreement, preserved smooth textures with good to excellent agreement across tested R, and maintained sharper textures at low to moderate R. As with the knee, texture retention was strongest in the full pipeline despite lower no RNN NRMSEs. In IVDs, the no RNN pipeline delivered the best standard reconstruction metric and texture retention performance. Despite maintaining smoother textures with moderate to excellent agreement across tested R and preserving sharper textures at lower R, the IVD pipeline revealed biases and fairly wide limits of agreement in T2 preservation, particularly at R = 6 and higher. When assessed on retrospectively undersampled multicoil raw k-space data, the knee and hip pipelines saw minimal degradation in performance as compared to results from images undersampled via synthetic k-space, whereas the lumbar spine pipeline exhibited similar performance only through R = 4. Furthermore, repeatability studies indicated that, particularly for the hip and knee, performance was stable with respect to datasets. All told, these metrics indicate promise for the knee and hip pipelines in MAPSS T2 mapping acceleration, and progress but room for improvement in IVDs.

Assessments of ROI-specific loss component utility showed its potential for improving predictions in accelerated acquisition schemes. When trained with sufficiently large datasets, as our knee and hip pipelines were, its inclusion saw stronger fidelity to local T2 patterns in cartilage ROIs and reduced T2 quantification errors compared to analogous pipelines trained without the ROI-specific loss component. Compared to state-of-the-art DL pipelines, knee and hip pipelines saw improved Pearson’s r in cartilage ROIs but poorer global Pearson’s r, as expected from the focused training approach. Interestingly, CS approaches exhibit relatively strong NRMSEs while generating relatively smooth predicted T2 maps; this is possibly because in training, DL-based approaches simultaneously removed aliasing artifacts and performed T2 fitting, and could attempt to preserve finer details than a CS approach performing those steps sequentially. While our approaches outperformed state-of-the-art methods at many R and tissue compartments in the lumbar spine, global Pearson’s r indicated this may have been partially due to some models being more completely trained than others. These results may have been different with a larger lumbar spine training set. Nonetheless, the value of ROI-specific loss functions in accelerated acquisition pipelines is clear: with sufficiently large datasets, they can optimize for ROIs and outperform state-of-the-art approaches at high R, as existing approaches are optimized for global and not ROI-specific performance.
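The idea of an ROI-specific loss component can be illustrated with a minimal NumPy sketch in which a global error term is combined with a second term computed only inside the ROI. The L1 form of both terms, the additive weighting, and the names roi_weighted_l1 and roi_weight are assumptions for illustration; this is not the multi-component loss used to train the pipelines described here.

```python
import numpy as np

def roi_weighted_l1(pred, truth, roi_mask, roi_weight=1.0):
    """Global mean absolute error plus a weighted mean absolute error inside the ROI."""
    abs_err = np.abs(np.asarray(pred, float) - np.asarray(truth, float))
    global_term = abs_err.mean()
    # ROI term: the same error measure restricted to the tissue of interest.
    roi_term = abs_err[roi_mask].mean() if roi_mask.any() else 0.0
    return global_term + roi_weight * roi_term

truth = np.zeros((4, 4))
roi = np.zeros((4, 4), dtype=bool)
roi[1:3, 1:3] = True  # 4 ROI voxels out of 16

# Same total error placed inside versus outside the ROI.
pred_in = truth.copy()
pred_in[roi] = 1.0
pred_out = truth.copy()
pred_out[0, :] = 1.0

loss_in = roi_weighted_l1(pred_in, truth, roi, roi_weight=2.0)
loss_out = roi_weighted_l1(pred_out, truth, roi, roi_weight=2.0)
print(loss_in, loss_out)
```

With this form, an error of a given size is penalized more heavily when it falls inside the ROI, which is consistent with the observed trade-off of stronger ROI performance against weaker global performance.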

We can contextualize performance by comparing quantification errors to clinically significant T2 changes. In the knee, T2 increases 13.4% in lateral femoral condyle (LFC) cartilage, 12.3% in medial femoral condyle (MFC) cartilage, and 8.1% in medial tibial condyle (MTC) cartilage among patients with mild OA compared to controls65. Our top-performing knee pipeline saw errors below this benchmark through R = 12 in the LFC and at R = 2 in the MTC. In IVDs, T2 decreases 36.3% in the nucleus pulposus and 24.2% in the annulus fibrosus from healthy to degenerative discs66. Our top-performing pipeline saw quantification errors for each disc below the more stringent 24.2% through R = 12. In the hip, T2 values among healthy patients that progress to OA within 18 months are 7.3% higher in femoral and 5.2% higher in acetabular cartilage compared to controls67. Our top-performing hip pipeline had errors below these benchmarks at all R in femoral cartilage and at R = 2 in acetabular cartilage. Clinical metrics thus depict promise for pipelines in all three anatomies in maintaining sub-clinical-significance quantification errors.
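The benchmark comparison described above reduces to checking, for each tissue, the highest acceleration factor through which the quantification error stays below the clinically significant T2 change. A minimal sketch follows; the benchmark percentages are those quoted in the text, whereas the function name highest_r_below and the error rates in example_errors are hypothetical and do not reproduce results from this study.

```python
# Relative T2 differences (%) quoted in the text for each tissue.
BENCHMARKS_PCT = {
    "knee_LFC": 13.4,
    "knee_MFC": 12.3,
    "knee_MTC": 8.1,
    "hip_femoral": 7.3,
    "hip_acetabular": 5.2,
    "ivd_nucleus_pulposus": 36.3,
    "ivd_annulus_fibrosus": 24.2,
}

def highest_r_below(errors_by_r, benchmark_pct):
    """Largest tested R through which the error (%) stays below the benchmark.

    Returns None if the error already meets or exceeds the benchmark at the
    lowest tested R.
    """
    best = None
    for r in sorted(errors_by_r):
        if errors_by_r[r] >= benchmark_pct:
            break
        best = r
    return best

# Hypothetical error rates (%) per acceleration factor, NOT values from this study.
example_errors = {2: 4.0, 3: 5.5, 4: 6.5, 6: 7.5, 8: 8.5, 10: 9.5, 12: 10.5}

for tissue in ("knee_LFC", "knee_MTC", "hip_acetabular"):
    print(tissue, highest_r_below(example_errors, BENCHMARKS_PCT[tissue]))
```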

Clinical and standard metrics show knee and hip pipeline performances to be particularly promising: the T2 values, map texture preservation, and error rates relative to clinical benchmarks all mark meaningful progress towards reducing cMRI acquisition time for eventual clinical use. That said, while lumbar spine performance was strong by clinical metrics, it lagged the knee and hip by standard reconstruction metrics. One explanation is dataset size: the lumbar spine dataset had substantially fewer scans and imaging slices than the knee and hip. This has a twofold impact: (1) the strength of a model trained from a smaller dataset is inherently limited, and (2) having only 5 test set scans limits statistical power and induces wide standard deviations of metrics, preventing significant conclusions from being reached. The effects of this small dataset size particularly surface in repeatability studies. Furthermore, lumbar spine acquisitions were more susceptible to breathing artifacts and had fewer slices than the hip and knee; undersampling therefore left fewer lumbar spine ky-kz lines sampled compared to the hip and knee, inducing worse initializations and possibly poorer performance. Nonetheless, to our knowledge, this is the first DL application to accelerate lumbar spine cMRI, marking progress that must be furthered with additional data procurement and algorithm development for clinical utility.

The GLCM-based textural retention evaluation demonstrated a framework through which reconstruction performance can be better evaluated than through standard metrics like SSIM, NRMSE, and PSNR. ICCs of GLCM metrics between predicted and ground truth T2 maps allow for intuitive, scaled measurements that can reflect how well a particular texture was preserved: for example, visual inspection of predicted T2 maps in knee and hip cartilage in Figs. 3 and 4 indicates that sharp textures are preserved better by the hip pipeline. This qualitative observation is confirmed by the GLCM Dissimilarity ICCs observed for the full model in the hip and knee pipelines in Table 3 at several tested R. This work could be furthered by extending this analysis to additional GLCM metrics for an even more thorough assessment of textural feature retention. Additional future improvements could also include pre-processing cartilage and IVD tissues prior to GLCM metric calculation to improve stability of these metrics, as other groups have started to do68.

Moreover, by showing results at 7 acceleration factors instead of the 2–3 typical in the literature, we found performance did not always degrade steadily as R increased. Networks therefore may be sensitive not just to general undersampling patterns, but also the specific nature of the pattern. Thus, when future DL reconstruction pipelines are trained, a library of undersampling patterns may be advisable to encourage robustness to sampling patterns69.
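One way such a library of undersampling patterns could be assembled is sketched below: for each acceleration factor and random seed, a ky-kz mask is drawn inside an elliptical footprint with a small fully sampled center. The uniform random sampling, the center size, the matrix dimensions, and the names elliptical_mask and build_mask_library are all assumptions for illustration and do not describe the undersampling patterns used in this study.

```python
import numpy as np

def elliptical_mask(ny, nz, R, center_frac=0.1, seed=0):
    """Random ky-kz mask keeping about 1/R of the points inside an ellipse."""
    rng = np.random.default_rng(seed)
    ky = np.linspace(-1.0, 1.0, ny)[:, None]
    kz = np.linspace(-1.0, 1.0, nz)[None, :]
    radius2 = ky ** 2 + kz ** 2
    ellipse = radius2 <= 1.0  # acquisition footprint
    center = radius2 <= center_frac ** 2  # always-sampled low frequencies
    n_target = int(round(ellipse.sum() / R))
    mask = center.copy()
    candidates = np.flatnonzero(ellipse & ~center)
    n_extra = min(max(n_target - int(center.sum()), 0), candidates.size)
    chosen = rng.choice(candidates, size=n_extra, replace=False)
    mask[np.unravel_index(chosen, mask.shape)] = True
    return mask, ellipse

def build_mask_library(ny, nz, accelerations, seeds):
    """Dictionary of masks keyed by (R, seed)."""
    return {(R, s): elliptical_mask(ny, nz, R, seed=s)[0]
            for R in accelerations for s in seeds}

# Seven acceleration factors spanning R = 2 to R = 12, three patterns per factor.
library = build_mask_library(64, 32, accelerations=(2, 3, 4, 6, 8, 10, 12), seeds=(0, 1, 2))
_, footprint = elliptical_mask(64, 32, 2)
for (R, s), m in sorted(library.items()):
    print(R, s, round(footprint.sum() / m.sum(), 2))
```

Training with several masks per acceleration factor, rather than a single fixed mask, is one plausible route to the robustness discussed above, since the network would no longer see a unique aliasing pattern at each R.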

This study has limitations. First, we used retrospectively undersampled coil-combined magnitude echo time images that, in the knee and hip, had undergone ARC processing in their reconstruction, with 4 edge slices discarded for all data. Due to coil combination and post-processing, the k-space being undersampled would not match the acquisition’s multi-coil k-space. Additionally, while we undersampled the MAPSS acquisition ellipse for each anatomy, the hip acquisitions had ‘no phase wrap’ applied, meaning that tested undersampling patterns would differ from those implemented on the scanner. While our raw k-space experiments show performance degradation was limited compared to coil-combined magnitude image experiments, models would be stronger if trained with a similarly sized multicoil k-space dataset. Second, this network is specific to our sampling patterns and acquisition parameters, and new pipelines would need to be trained should parameters like MAPSS T2 echo times be substantially changed. Finally, the lumbar spine dataset size is rather small, limiting the power of conclusions.

To conclude, this study shows a novel means of training DL pipelines to accelerate cMRI in anatomies where specific tissues are of heightened clinical importance. In knee and hip, pipelines were effective at high R in maintaining textures, keeping fidelity to T2 values, and minimizing T2 quantification errors, whereas in the lumbar spine, the pipeline retained textures reasonably well but was poorer in T2 value fidelity and quantification errors. This reflects progress towards clinically useful pipelines that specialize in MSK T2 mapping. The GLCM-based textural retention analysis elucidates an alternative to standard reconstruction metrics, allowing for intuitive measures of the types of features best preserved by accelerated acquisition schemes, potentially allowing for better quantitative assessment of model performance. Future directions include multicoil k-space training, simultaneous MAPSS T1ρ and T2 acceleration, and temporal undersampling of T2 weighted echo time images.