Towards fully automated segmentation of rat cardiac MRI by leveraging deep learning frameworks

Automated segmentation of human cardiac magnetic resonance datasets has been steadily improving during recent years. Similar applications would be highly useful to improve and speed up the studies of cardiac function in rodents in the preclinical context. However, the transfer of such segmentation methods to the preclinical research is compounded by the limited number of datasets and lower image resolution. In this paper we present a successful application of deep architectures 3D cardiac segmentation for rats in preclinical contexts which to our knowledge has not yet been reported. We developed segmentation models that expand on the standard U-Net architecture and evaluated models separately trained for systole and diastole phases (2MSA) and a single model trained for all phases (1MSA). Furthermore, we calibrated model outputs using a Gaussian process (GP)-based prior to improve phase selection. The resulting models approach human performance in terms of left ventricular segmentation quality and ejection fraction (EF) estimation in both 1MSA and 2MSA settings (Sørensen-Dice score 0.91 ± 0.072 and 0.93 ± 0.032, respectively). 2MSA achieved a mean absolute difference between estimated and reference EF of 3.5 ± 2.5%, while 1MSA resulted in 4.1 ± 3.0%. Applying GPs to 1MSA enabled automating systole and diastole phase selection. Both segmentation approaches (1MSA and 2MSA) were statistically equivalent. Combined with a proposed cardiac phase selection strategy, our work presents an important first step towards a fully automated segmentation pipeline in the context of rat cardiac analysis.


INTRODUCTION
Preclinical assessment of novel cardiovascular therapeutics relies on the use of rodent models. Cardiac magnetic resonance (CMR) imaging is important to this assessment due to its ability to differentiate tissue types with good contrast. Moreover, the non-invasive nature of CMR makes it well suited for a longitudinal follow-up of the same animal during the treatment. 1,2 Changes in volumes of left ventricle (LV) and the ejection fraction (EF) over time are commonly assessed in response to treatment as a proxy estimate of rodent cardiac function.
Currently, the assessment of LV function mainly relies on the manual selection and slice-by-slice segmentation of the ventricle at end-diastole (ED) and end-systole (ES) phases of the cardiac cycle to estimate EF. As the number of animals in a preclinical study can be very large and are followed-up in multiple timepoints, this task can add hours of laborious analysis. Furthermore, the quality of the resulting segmentations depends on experts' experience and often suffers from intra-and interoperator variability. [3][4][5] Thus, fully automated segmentation tools to standardise and speed up the segmentation of rodent CMR datasets are highly desirable for in vivo cardiac efficacy studies.
The quality of semi-and fully automated methods for cardiac segmentation of difficult human CMR datasets has been improving thanks to the uptake of convolutional neural networks (CNN). [6][7][8][9][10] However, there is limited research applied in the preclinical cardiac 3D MRI segmentation sphere in smaller rodent datasets with only two recent examples in mice. [6][7][8] To our knowledge, this work presents the first statistically robust study into automating the estimation of EF in rats and expands on the well-established 3D U-Net 9,11 and its derived architectures (i.e., Attention U-Net, 12 U-Net++ 13,14 and V-Net 15 ).
Supported by our preliminary work and observations from other teams, 12 LV segmentation of the ES achieves consistently lower performance compared to the ED when attempted by the same model, namely, 1-model segmentation approach (1MSA). As this can lead to undesirable downstream effects on EF estimation, we trained two independent models to segment diastole and systole separately, 2model segmentation approach (2MSA). Both 1MSA and 2MSA were benchmarked in terms of segmentation quality, robustness against noise and agreement for EF estimates compared to humanderived gold standards. Our results suggest that the presented 2MSA and 1MSA approaches are equivalent in terms of EF estimation and achieve close to human performance which highlights that both 1MSA and 2MSA methods are valid for use depending on the context. In a similar vein to previous research in the UK Biobank data, 16 the 1MSA approach has been able to segment all cardiac phases captured during the recording of the cardiac cycle. This has allowed us to evaluate the potential of the method for full automation, that is, automated phase selection and EF estimation. Upon probing different metrics derived from LV segmentations along the cardiac cycle (LV volume, LV surface area or mid-slice area), we implemented a GP which identified systole and diastole phases and subsequently estimated EF. This was compared against a conventional fourth-polynomial degree fitting strategy and human selection of systole and diastole. In the light of this evidence, 1MSA emerged as a method with the potential to inspire similar work in other biological systems and organs in the preclinical setting. Table 1 summarises the six segmentation datasets used in this study, consisting of short-axis 3D CMR images (multi-2D acquisition) from sham-operated rats or rats with induced myocardial infarcts (EF ranging from 41% -67 %). Eleven to thirteen short-axis CINE time series (temporal resolution 8 ms)

Data Acquisition
covering the LV and two perpendicular long-axis slices were acquired per animal. We use the term image stack to refer to the collection of 2D slices which make up a 3D CMR image of the rat LV with a resolution of 0.5x0.5x1.

Postprocessing
The results of the segmentation were subjected to thresholding. Subsequently, morphological operations were carried out eroding thin protrusions and closing potential holes within the LV cavity. 18

Implementation Details
Segmentation Approaches Definitions. 1MSA models were trained both on systole and diastole phases and used for segmentation of all phases in the cardiac cycle. 2MSA models were trained separately on systole or diastole phases and used for segmentation of systole or diastole, respectively.

Model Architectures.
A range of encoder-decoder architectures using CNNs have been implemented, namely, U-Net, 9 Attention U-Net, 12 UNet++ 13 and V-Net. 15 3D convolutions were applied and the downsampling steps were performed using a volumetric kernel of 2x2x2 voxels and a stride of 2x2x2 or 1x2x2 voxels (depending on the downsampling dimensions). Additionally, 3D padding was applied at the original image to avoid cropping the skip pathways. The padding was set to 'same' and the dilation rate was set to 1x1x1 voxels for the decoding path. Sigmoid was used as the last activation function.
Batch normalisation was used after each layer and the dimensions of the tensors were passed as (N, D, H, W, C), where N is the number of sequences (mini batch), D is the number of images in a sequence, H is the height of one image in the sequence, W is the width of one image in the sequence and C is the number of channels (set to 1).  The soft Dice loss is formulated as: where T is the true foreground segmentation with voxel values ti and P the predicted probabilistic segmentation for the mask over i image elements pi. The background class probability is 1 -P. ϵ was set to 1. As a trivial extension, the weighted Dice loss function was designed to penalise misclassification at the borders of the region of interest (ROI). This objective function is given by: The weight map w map (t i ) was implemented as described in Ronneberger et al. 9 : equation (3) setting w0 = 2 and σ = 1. wc is the weight map to balance the class frequencies, d1 denotes the distance to the border of the nearest cell and d2 the distance to the border of the second nearest cell.
Data Augmentation. One or more of the following augmentation strategies were implemented at random on the original images: elastic deformations, random shifts (across the x-axis and the y-axis), rotation (-20° to 20°), scaling (contraction or expansion), blurring (Gaussian filter with σ ϵ [0.5, 1.5]) and gamma correction (γ ~ (1,0.1)). The final training dataset was inflated by a factor of ten.
Model Optimisation. To segment the ROI, two approaches were explored: 2MSA and 1MSA. 2MSA consisted of training two separate models for each phase, whilst for 1MSA the model was trained with the whole set of images (regardless of their phase). The hyperparameters for the top-three initial best performing architectures for each approach were fine-tuned on the basis of Sørensen-Dice score (DSC). 21 Furthermore, ensembling strategies (voting and averaging) were evaluated.

Segmentation Approach Benchmarking
Model Selection. DSC distributions and convergence times were used as criteria for aiding the selection of models for 2MSA and 1MSA. EF predicted = EF actual + animal ID + approach + treatment + reader equation (4) where denotes an indicator function. reader and animal ID are random effects and approach and treatment are the model approach and treatment arms fixed effects (see Supplementary Information).
Additionally, contrasts were constructed to perform an equivalence test comparing both approaches with an equivalence margin of 2 % EF. 23 Noise Robustness Analysis. Three test sets were artificially generated based on the images from the Study 5 by introducing Gaussian, Rician 24 and Rayleigh 25 distributed noise with a signal to noise ratio (SNR) of 30. The CNNs were further tested against a mixed noise scenario, where SNR was set to 20 and any of the aforementioned noise distributions was applied to the image at random. DSC was reported, alongside with EF mean absolute differences (MD) both for 1MSA and 2MSA as per equation Agreement Analysis. The main endpoint of this study was to estimate the EF given a cardiac cycle.
Volumes were estimated by calculating the sum of masks for each slice.
MD and Bland-Altman plots 26 were used to analyse agreement between reference and estimated EF for both 1MSA and 2MSA segmentation approaches.

Interoperator agreement
The models were benchmarked against the interoperator variability (Op1 vs Op2) with a Bland-Altman plot for the Study 5 test set to examine method interchangeability.

1MSA Automation Feasibility Study
The availability of eleven to thirteen images stacks ordered along the temporal dimension allowed to represent a full cardiac cycle. Two different algorithms were devised to fit the cardiac cycle volume curve: a fourth-degree polynomial and a GP model. 27 The prior to fit the GP model to the data was a composite kernel consisting of a constant kernel multiplied by the radial-basis function (RBF) kernel (see Supplementary Information).
For automated systole and diastole phase selection, three different metrics were examined. These were the cardiac volume, slice area and surface area. The ground truth was taken as the manually selected diastole and systolic phase masks. MD and Bland-Altman plots were analysed to inform which metric would be best suited for this purpose. Once the systole and diastole phases had been selected, their respective heart volumes were used for EF estimation. The slice area was defined as the 2D slice whose area had the highest variance throughout the cardiac cycle. The surface area was the 2D surface mesh and was calculated using the Lewiner marching cubes algorithm. 28

Software
We implemented the networks in Python v3. 7

Model Optimisation
The Dice loss was found to be the optimal training loss reaching satisfactory scores in the range between 0.89 -0.90. Notably, persisting numerical instability issues prevented progress in studying further loss functions (i.e., hybrid loss functions).
The hyperparameters for both the systole and diastole final models from the 2MSA are shown in Table   5 and Table 6, respectively.    During the training process, our V-Net implementation converged up to two times faster (33 h for diastole 2MSA) than the rest of the CNN encoder-decoder architectures. In contrast, U-Net++ was the slowest network to train (56 h).

Model Selection
For the 2MSA, base models exhibit a mean DSC > 0.90 and right-skewed distributions (see Fig. 1).
Interestingly, ensemble models provide similar statistics. Exploratory attempts carried out by bagging five Attention U-Net models for systole segmentation yielded a mean DSC > 0.92.
Indeed, there is a high degree of overlap in DSC distributions amongst the tested architectures. However, a higher degree of variability can be observed when it comes to systole-trained models. In fact, diastolic segmentations reached better segmentation quality than systolic ones. Thus, better performances and lower convergence times, prompted us to select V-Net as the final segmentation model for systolic phases and Attention U-Net for diastolic phases for the 2MSA approach.
With regards to 1MSA, performance for base models and ensembles was also comparable. U-Net was selected as the final model on grounds of parsimony (see Fig. 2). Over 75 % of the segmented hearts had a DSC > 0.90 regardless of the phase both for 1MSA and 2MSA (see Supplementary Information).

Segmentation Quality Analysis
Representative examples of CNN image stack segmentations from both 1MSA and 2MSA final models are displayed in Fig. 3. From visual inspection of the segmented image stacks, different segmentation methods did not seem to result in any substantial differences in segmentation quality. However, systolic contours were consistently more irregular than diastolic contours. It was observed that basal and apical slices suffered from lower segmentation quality compared to midslices (see Supplementary   Information).

Statistical Assessment of Segmentation Approaches
The final models for both 1MSA and 2MSA were benchmarked to examine which one would be better.
Consequently, a LMM was fit with the data adjusting for operator and treatment arm (see Table 8). The segmentation of sham-operated and myocardial infarction treatment arms was statistically significantly different (p-value < 0.0001) and the operator effect (p-value < 0.0001) too. Diagnostics confirmed that the normality assumption was tenable (see Supplementary Information). Furthermore, a locally estimated scatterplot smoothing (LOESS) fit was used to aid the interpretation of these differences (see Supplementary Information). Regarding ΔEF MDs, no segmentation approach was shown to be superior, given that the absolute differences were in the same order of magnitude (see Table 9).

Noise Robustness Analysis
In terms of segmentation quality, 2MSA benefits from data augmentation, whilst it does not seem to elicit major changes for 1MSA (see Fig. 4). The effect of data augmentation on statistics for both segmentation approaches is reported in Supplementary Information.

Agreement Analysis
Bland-Altman plots are displayed in Fig. 5 for 1MSA vs Operator 1 and 2MSA vs Operator 2 approaches, respectively. Table 10 contains the main statistics extracted from these plots.
Regardless of the segmentation approach, there was a significant negative bias (p-value < 0.05).
Nevertheless, both 1MSA and 2MSA afforded comparable biases and variability. There was no clear trend upon increasing the magnitude of EF, thereby suggesting that the risk of proportional bias could be rejected.   The average discrepancy between the two operators is -5.6 ± 3.8 %. Interoperator agreement in systole and diastole volume estimation is provided in the Supplementary Information.  Bland-Altman plots for agreement analysis for pooled Study 5 and Study 6 using the 1MSA (bias = -3.7 ± 3.5) (a) and 2MSA (bias = -2.9 ± 3.2) (b). The bias is marked with a solid blue line (─) and its 95% CI shaded in blue. The equality line is marked with a solid black line (─). The upper and lower 95% CI are marked with discontinuous red lines (--) and are shaded with their respective 95% CI in red.

1MSA Automation Feasibility Study
The full capabilities of 1MSA could be exploited thanks to the collection of image stacks along the cardiac cycle during the MRI scanning process. Table 11 shows the MD between the EF determined from manually segmented masks and the EF from automatic phase selection for the different selection metrics. Of note, the slice area emerged as the best metric to perform phase selection, as it had the lowest ΔEF MD both for polynomial and GP. In contrast, taking the surface area as the metric for phase selection yields the highest MDs. Agreement both for polynomial and GP using the three selection metrics was analysed qualitatively with Bland-Altman plots (see Supplementary Information) and their statistics are summarised in Table 12.
In all cases, there is a significant bias (p-value < 0.05), given that its 95% CI does not overlap with the equality line. Interestingly, both the volume and the slice area display negative biases, while surface area overestimates the EF. The lowest bias is observed when using the slice area as the phase selection metric and the highest using volume. However, the fitting method used does not substantially impact the bias for any of the metrics.
The difference in absolute terms between the upper and the lower agreement bounds always ranges between 15 and 20. The slice area displays the broader agreement bounds, while the surface area provides the narrowest. No proportional bias is observed and the datapoints are normally distributed. The cardiac cycle graphs were scrutinised by visual inspection to determine which method would be better placed to resolve challenging tasks. GP fitting followed the tendency marked by datapoints, whereas the polynomials adopted a generic form which oftentimes ignored finer experimental trends.
An archetypical example is depicted in Figure 7.

DISCUSSION
U-Net and its derived architectures demonstrated excellent performances for rat LV cardiac segmentation. However, the extension of U-Net with attention gates, 12 deep supervision 13 or residual connections 15 did not to translate into better performances in our data. Thus, U-Nets remain as competitive and capable to successfully extract relevant information even with 2 downsampling dimensions. Therefore, simpler U-Nets render futile architectural fine-tuning and extensive skip connections. Against initial expectations, 17 ensembles have only proved non-inferiority to base models, possibly due to the correlated nature of constituent. This motivated the rationale that model selection should be based on grounds of parsimony, which in its turn reduced convergence speed. Indeed, it was noticed that complex architectures (i.e., U-Net++) required longer training times.
Interestingly, a factor that influenced the quality of the segmentation is the cardiac phase. The lower segmentation quality in systole could be caused by class imbalance and a higher variability of systole phases compared to diastolic ones (see Fig. 8). 34,35 This might also be the reason why basal and apical slices suffered from poorer segmentation quality. Increased interslice coherence could play a role in improving segmentation, as it was seen that using 3 downsampling dimensions did not elicit performance improvements. In effect, strategies like leveraging recurrent neural networks have been tried in similar settings 36 and it could be a potential avenue for future exploration. The hyperparameter tuning showed moderate performance improvement. Nonetheless, a few interesting facts were observed. Renormalisation deterred overfitting more effectively than direct regularisation techniques and this is in line with current knowledge within the imaging community. This claim is also corroborated by the substantial drop in DSC performance experienced whenever renormalisation was not implemented. Indeed, the application of dropout at random could deprive the model of its most critical training parameters and cause a performance erosion. Finally, the alleged superiority of activation functions with self-normalising properties (i.e. GELU and SELU) 37,38 was not observed in our context.
Data augmentation has the potential to improve performance, 7 but this was not the case for 1MSA, which could be due to the lack of capacity to incorporate instance variance. Further ablation studies could help better understanding this phenomenon.
In principle, it was expected that the 2MSA would lead to significantly better segmentations than 1MSA.
However, the benchmarking of both 1MSA and 2MSA revealed that they were virtually equivalent. This is indeed supported by the LMM fit, the equivalence test and MD analyses. Both 1MSA and 2MSA demonstrated to model appropriately EFs in the range between 50 % and 70 %, whilst higher variability was observed outside this region. This likely stems from the natural limitation that the number of available training examples is lower.
The segmentation capability of these models was acceptable when compared to humans, as the bias and limits of agreement in the Bland-Altman plots were satisfactory (detecting an EF change of 5 % translates to clinically meaningful results on cardiac performance). Additionally, further agreement analysis revealed that these models held a promise in remediating inter-and intraoperator variability, as well as dramatically reducing the time of analysis. In line with previous efforts, 39 both 1MSA and 2MSA segmentation approaches had a lower bias than between operators and the limits of agreement were comparable. Certainly, these models while substantially quicker (~12 s with automated segmentation vs. ~20 mins with manual segmentation based on in-house experience). A shortcoming of this study is the limited number of operators available. Taking note of this, the addition of more operators for the assessment of the intraoperator agreement should be explored in the future.
Comparable levels of agreement between 1MSA and 2MSA prompted further investigation into the benefits of the former approach. Its applicability for automated phase selection leveraging the temporal dimension of each image stack along the cardiac cycle was explored. This approach expands on previous research carried out with 2D human images. 40 Amongst the phase selection metrics, the slice area afforded the lowest bias, potentially due to its sensitivity to small changes. The presence of bias in the selection metrics can be attributed to potential computational artifacts, biases in the data or in the annotation process. The surface area yielded the narrowest limits of agreement, which translates into better precision, as it has lower susceptibility to the noise potentially present in some slices. Thus far, no indication of a selection metric providing both an optimal precision and accuracy has been found.
The addition of more rats into the study could help alleviate the bias which the surface area metric suffers from, whilst improving a high variability scenario would be more challenging. Therefore, using the surface area provides more reliable EF estimates than the slice area.
Next, the fitting method which was best suited to model the cardiac cycle was evaluated. Visual inspection of the GP and polynomial fits revealed that the GP ability to learn the distribution of datapoints, furnishes it with the desired flexibility needed to model the cardiac cycle. Additionally, GPs offer the possibility to develop more advanced approaches such as sparse and variational Gaussian Process 41 which can improve further the representation of the cardiac cycle.
The 1MSA provides a seamless phase selection which, coupled with estimation of the EF, allows for a full-automated process. This reduces the segmentation time dramatically and has potential to substantially accelerate pharmacological studies. Whilst 1MSA is beneficial in some contexts, 2MSA can be used in a context where sporadic human intervention is a possibility. It is a simpler approach which can prevent potential errors that might arise from the phase selection process. Thus, the suitability of using either 1MSA or 2MSA would be very much dependant on the specific requirements from the end-user.
Future work will focus around generalising these frameworks for new clinical studies and subsequently improving the predictive capabilities of the existing models by exploring continual learning 42,43 and transfer learning. 44,45 The success of our methodology with a relatively low number of training instances is paves the way for its application to other organs and animals of preclinical interest. In this respect, preliminary work indicates that transfer learning could be applied successfully in the segmentation of mice hearts and we aim to explore this area further. Our work deals with a fundamental issues of image segmentation in preclinical settings and the success of our models trained on a relatively small preclinical dataset highlights the potential of this technique to be applied to automated cardiac segmentation both in preclinical settings.

CONCLUSIONS
We present a successful application of CNNs for the segmentation of rat LV from CMR datasets.
Estimates of derived clinical parameters, such as systolic left-ventricular volume, diastolic leftventricular volume, and EF, demonstrate close to human performance in both 2MSA and 1MSA settings.
Combined with a novel phase selection strategy based on GPs modelled on the changes of the mid-slice are along the segmented cardiac cycle, our work presents an encouraging first step towards a fully automated rat LV segmentation pipeline. In the future, due to their potential to significantly speed up reporting of results to analysis teams, we hope that such tools will be useful when assessing pharmacological interventions designed to improve cardiac function. As such, we believe that the deep learning based automated segmentation pipeline presented here would inspire research into similar assistive technologies in the preclinical image analysis space in the years to come.

. Implementation
These variables are expressed as indicator functions as such:

Linear Mixed Model Diagnostics
The QQ plot corroborates that the mixed linear model residuals are normally distributed, although there is some deviation from the expected normal line towards the tails which could indicate kurtosis. Nonetheless, the residual distribution is good enough, so the assumption of normality is tenable.

LOESS fit
A LOESS fit was used as an exploratory analysis tool to identify trends in manual and automated segmentations (see Fig. S1). Naïve rats have high ejection fractions (> 55 %), while there is a higher degree of variability in the case of myocardial infarction rats whose EFs span a range from 30 % up to 70 %, approximately. The modelled EFs deviate from linearity in the extremes. For Op1, the diagonal switches sides at around the 50 % EF mark. This trend is also identified for Op2, albeit it is less pronounced. No major differences are observed when comparing the fits between the two segmentation approaches. Figure S1. LOESS fit for EF determined using automated segmentation vs. manual segmentation by segmentation approach (left: one-model segmentation approach, right: two-model segmentation approach). Both graphs plot the operators (─ Operator 1, ─ Operator 2) involved in the segmentation and the treatment arm ( myocardial infarction,  naïve)

Gaussian Process
The crux of GP lies in defining a kernel which establishes the prior covariance matrix between two function values x and x' such: The composite kernel consisting of a constant kernel multiplied by the radial-basis function (RBF) kernel is given by: where the constant value for the constant kernel is set to 0.1 with constant value bounds between 0.1 and 10, d( · , · ) is the Euclidean distance, l is a length-scale parameter which is set to 0.5 with bounds 0.1 to 10.  The Hausdorff Distance (HD) from A to B is defined as follows:

Model Selection Statistics
and subsequently, the bidirectional Hausdorff distance is calculated as specified in Equation 2.

HD(A,B) = max(δ H (A,B), δ H (B,A)) Equation S5
Intraclass Correlation Coefficient (ICC) was determined to evaluate agreement between the actual volumes (derived from the true masks) and the predicted volumes. A value of 1 indicates perfect reliability, whereas a value of 0 indicates no agreement.